## Day 29 Lecture 1 Assignment

In this assignment, we will learn about decision trees. We will use the Chicago salary dataset loaded below.

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree


In [None]:
chicago = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Current_Employee_Names__Salaries__and_Position_Titles.csv')

In [None]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32658 entries, 0 to 32657
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Name               32658 non-null  object 
 1   Job Titles         32658 non-null  object 
 2   Department         32658 non-null  object 
 3   Full or Part-Time  32658 non-null  object 
 4   Salary or Hourly   32658 non-null  object 
 5   Typical Hours      7883 non-null   float64
 6   Annual Salary      24775 non-null  object 
 7   Hourly Rate        7883 non-null   object 
dtypes: float64(1), object(7)
memory usage: 2.0+ MB


To simplify this problem, we will limit our model to salaried employees only. Create a new dataset that does not contain any hourly employees.

In [None]:
salary = chicago[chicago['Salary or Hourly'] == 'Salary']

Next, we will look at the count of all values for both job titles and department to ensure that we don't use features that are too sparse in our model.

In [None]:
job_counts = salary['Job Titles'].value_counts()

In [None]:
job_index = job_counts[job_counts > 100].index

In [None]:
chicago = salary[salary['Job Titles'].isin(job_index)]

In [None]:
chicago['Job Titles'].value_counts()

POLICE OFFICER                                   9393
FIREFIGHTER-EMT                                  1424
SERGEANT                                         1118
POLICE OFFICER (ASSIGNED AS DETECTIVE)            845
FIREFIGHTER                                       564
LIEUTENANT-EMT                                    398
LIEUTENANT                                        356
FIREFIGHTER-EMT (RECRUIT)                         319
PARAMEDIC I/C                                     291
FIREFIGHTER/PARAMEDIC                             278
PARAMEDIC                                         252
AVIATION SECURITY OFFICER                         251
POLICE COMMUNICATIONS OPERATOR I                  245
POLICE COMMUNICATIONS OPERATOR II                 227
DETENTION AIDE                                    226
FIRE ENGINEER-EMT                                 226
ASST CORPORATION COUNSEL                          136
SENIOR DATA ENTRY OPERATOR                        135
CAPTAIN-EMT                 

In [None]:
chicago['Department'].value_counts()

POLICE                   12173
FIRE                      4264
OEMC                       476
AVIATION                   299
LAW                        144
PUBLIC LIBRARY             140
FINANCE                    134
CITY COUNCIL               107
HEALTH                      38
TRANSPORTN                  36
BUILDINGS                   29
FAMILY & SUPPORT            26
ADMIN HEARNG                24
BUSINESS AFFAIRS            23
WATER MGMNT                 21
COMMUNITY DEVELOPMENT       16
GENERAL SERVICES            13
STREETS & SAN               13
CITY CLERK                   8
PROCUREMENT                  5
IPRA                         2
HUMAN RELATIONS              1
LICENSE APPL COMM            1
INSPECTOR GEN                1
DISABILITIES                 1
ANIMAL CONTRL                1
DoIT                         1
BOARD OF ETHICS              1
TREASURER                    1
Name: Department, dtype: int64

In [None]:
dep_counts = chicago['Department'].value_counts()
dep_index = dep_counts[dep_counts > 5].index
chicago = chicago[chicago['Department'].isin(dep_index)]
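
The two filtering steps above follow the same pattern: count the values, keep the labels above a threshold, then subset with `isin`. A minimal sketch of that pattern as a reusable helper is shown here. The function name `keep_frequent` and the toy data are hypothetical and are not part of the assignment.

```python
import pandas as pd


def keep_frequent(df, column, min_count):
    """Return the rows of df whose value in `column` occurs more than min_count times."""
    counts = df[column].value_counts()
    frequent = counts[counts > min_count].index
    return df[df[column].isin(frequent)]


# Toy data standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'Department': ['POLICE'] * 4 + ['FIRE'] * 3 + ['LAW'],
    'Annual Salary': [90.0, 91.0, 92.0, 93.0, 80.0, 81.0, 82.0, 70.0],
})

# 'LAW' appears only once, so it is filtered out with min_count=2
filtered = keep_frequent(toy, 'Department', min_count=2)
print(filtered['Department'].value_counts())
```

With the real data, the equivalent calls would presumably be `keep_frequent(salary, 'Job Titles', 100)` followed by `keep_frequent(chicago, 'Department', 5)`.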

Remove all irrelevant columns (Name, Job Titles, Salary or Hourly, Typical Hours, Hourly Rate).

In [None]:
chicago = chicago.drop(columns=['Name', 'Job Titles', 'Salary or Hourly', 'Typical Hours', 'Hourly Rate'])



In [None]:
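# Annual Salary is stored as text such as '$107790.00'; strip the dollar sign first.
# regex=False makes '$' a literal character rather than a regex pattern.
chicago['Annual Salary'] = chicago['Annual Salary'].str.replace('$', '', regex=False)
# Convert the cleaned strings to floats
chicago['Annual Salary'] = pd.to_numeric(chicago['Annual Salary'])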
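
The conversion above can be checked on a small example before running it on the full column. The sketch below uses made-up values. The thousands separator and the missing entry are assumptions for illustration, since the source only shows a leading `$`.

```python
import pandas as pd

# Made-up salary strings; the comma and the missing value are illustrative assumptions
raw = pd.Series(['$107790.00', '$1,234.50', None])

# Remove currency symbols and separators as literal characters
cleaned = (
    raw.str.replace('$', '', regex=False)
       .str.replace(',', '', regex=False)
)

# errors='coerce' turns anything unparseable into NaN instead of raising
amounts = pd.to_numeric(cleaned, errors='coerce')
print(amounts)
```

Passing `errors='coerce'` is a design choice: it keeps the cell from failing on a stray value, at the cost of silently producing `NaN`, so a follow-up `amounts.isna().sum()` check may be worthwhile.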

In [None]:
chicago.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17984 entries, 0 to 32656
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Department         17984 non-null  object 
 1   Full or Part-Time  17984 non-null  object 
 2   Annual Salary      17984 non-null  float64
dtypes: float64(1), object(2)
memory usage: 562.0+ KB


Choose between department and job title, and one-hot encode the variable with the fewest distinct values. Additionally, create dummy variables for full- or part-time status.

In [None]:
chicago = pd.get_dummies(chicago)

In [None]:
chicago.head()

Unnamed: 0,Annual Salary,Department_ADMIN HEARNG,Department_AVIATION,Department_BUILDINGS,Department_BUSINESS AFFAIRS,Department_CITY CLERK,Department_CITY COUNCIL,Department_COMMUNITY DEVELOPMENT,Department_FAMILY & SUPPORT,Department_FINANCE,Department_FIRE,Department_GENERAL SERVICES,Department_HEALTH,Department_LAW,Department_OEMC,Department_POLICE,Department_PUBLIC LIBRARY,Department_STREETS & SAN,Department_TRANSPORTN,Department_WATER MGMNT,Full or Part-Time_F,Full or Part-Time_P
0,107790.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
1,104628.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
2,114324.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
3,76932.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
4,111474.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0


In [None]:
chicago = chicago.drop(columns=['Department_POLICE', 'Full or Part-Time_F'])
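
The cell above drops `Department_POLICE` and `Full or Part-Time_F` by hand, leaving one reference level per variable. As a point of comparison, `pd.get_dummies` also accepts `drop_first=True`, which removes the first level of each variable in sorted order. The sketch below uses toy data. Note that it drops `FIRE` rather than `POLICE`, so it is not a drop-in replacement for the manual choice.

```python
import pandas as pd

# Toy data with the same two categorical columns as the assignment
toy = pd.DataFrame({
    'Department': ['POLICE', 'FIRE', 'LAW', 'POLICE'],
    'Full or Part-Time': ['F', 'F', 'P', 'F'],
})

# drop_first=True removes the first level (in sorted order) of each encoded column
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

Dropping by hand, as the notebook does, lets you pick the most common level as the reference, which is often easier to interpret.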

Check that none of the remaining columns are of object type; convert any that are to numeric.

In [None]:
chicago.dtypes

Annual Salary                       float64
Department_ADMIN HEARNG               uint8
Department_AVIATION                   uint8
Department_BUILDINGS                  uint8
Department_BUSINESS AFFAIRS           uint8
Department_CITY CLERK                 uint8
Department_CITY COUNCIL               uint8
Department_COMMUNITY DEVELOPMENT      uint8
Department_FAMILY & SUPPORT           uint8
Department_FINANCE                    uint8
Department_FIRE                       uint8
Department_GENERAL SERVICES           uint8
Department_HEALTH                     uint8
Department_LAW                        uint8
Department_OEMC                       uint8
Department_PUBLIC LIBRARY             uint8
Department_STREETS & SAN              uint8
Department_TRANSPORTN                 uint8
Department_WATER MGMNT                uint8
Full or Part-Time_P                   uint8
dtype: object
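
Here every column is already numeric, so no conversion is needed. If some columns had been of object type, one way to find and convert them is sketched below. The column `Hypothetical Code` is made up for illustration and does not exist in the dataset.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with one numeric column and one column of numbers stored as strings
toy = pd.DataFrame({
    'Annual Salary': [107790.0, 104628.0],
    'Hypothetical Code': pd.Series(['40', '20'], dtype=object),
})

# Find the object-typed columns and convert each one to a numeric dtype
object_cols = toy.select_dtypes(include='object').columns
for col in object_cols:
    toy[col] = pd.to_numeric(toy[col])

print(toy.dtypes)
```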

Split the data into training and test samples. Use annual salary as the dependent variable. 20% of the data should be assigned to the test sample.

In [None]:
X = chicago.drop(columns=['Annual Salary'])

y = chicago['Annual Salary']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Generate a regression decision tree using `DecisionTreeRegressor` in sklearn. Fit the model on the training set and calculate the score for both train and test.

In [None]:
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [None]:
model.score(X_train, y_train)

0.18841512522726878

In [None]:
model.score(X_test, y_test)

0.21136043788978542
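
For a regressor, `score` returns the coefficient of determination (R²). One plausible reading of the low values here is that the features are only one-hot indicators, so a fully grown tree can do no better than predict the mean salary of each distinct feature combination. The sketch below illustrates that behavior on synthetic data. The group labels, means, and noise level are made up and say nothing about the Chicago data itself.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic target: a group-level mean plus a large amount of noise
rng = np.random.default_rng(0)
groups = rng.choice(['A', 'B', 'C'], size=300)
base = pd.Series(groups).map({'A': 50.0, 'B': 60.0, 'C': 70.0}).to_numpy()
y = base + rng.normal(scale=20.0, size=300)

# One-hot features only, as in the assignment
X = pd.get_dummies(pd.Series(groups, name='group'))
tree = DecisionTreeRegressor(random_state=0).fit(X, y)
pred = tree.predict(X)

# The tree's predictions should coincide with the per-group means of y
group_means = pd.Series(y).groupby(groups).transform('mean').to_numpy()
print(np.allclose(pred, group_means))

# score() is the same quantity as r2_score on the predictions
print(np.isclose(tree.score(X, y), r2_score(y, pred)))
print(tree.score(X, y))
```

If this reading holds for the real data, adding features that vary within a department (job title, for example) would be a natural next experiment.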