## Day 29 Lecture 1 Assignment

In this assignment, we will learn about decision trees. We will use the Chicago salary dataset loaded below.

In [26]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics

In [2]:
chicago = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Current_Employee_Names__Salaries__and_Position_Titles.csv')

In [3]:
chicago.head()

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"ALLISON, PAUL W",LIEUTENANT,FIRE,F,Salary,,$107790.00,
1,"BRUNO, KEVIN D",SERGEANT,POLICE,F,Salary,,$104628.00,
2,"COOPER, JOHN E",LIEUTENANT-EMT,FIRE,F,Salary,,$114324.00,
3,"CRESPO, VILMA I",STAFF ASST,LAW,F,Salary,,$76932.00,
4,"DOLAN, ROBERT J",SERGEANT,POLICE,F,Salary,,$111474.00,


To simplify this problem, we will limit our model to salaried employees only. Create a new dataset that does not contain any hourly employees.

In [4]:
# answer below:
chic = chicago.loc[chicago['Salary or Hourly'] == 'Salary']


Next, we will look at the count of all values for both job titles and department to ensure that we don't use features that are too sparse in our model.

In [5]:
chicago['Job Titles'].value_counts()

POLICE OFFICER                               9393
FIREFIGHTER-EMT                              1424
SERGEANT                                     1118
POOL MOTOR TRUCK DRIVER                       996
POLICE OFFICER (ASSIGNED AS DETECTIVE)        845
                                             ... 
MANAGER-O'HARE MAINTENANCE CONTROL CENTER       1
MECHANICAL ENGINEER V                           1
DEVELOPMENT DIR                                 1
GENERAL FOREMAN OF PAINTERS                     1
MOBILE UNIT OPERATOR                            1
Name: Job Titles, Length: 1095, dtype: int64

In [6]:
# answer below:
chicago['Department'].value_counts()


POLICE                   12973
FIRE                      4800
STREETS & SAN             2194
OEMC                      2044
WATER MGMNT               1878
AVIATION                  1612
TRANSPORTN                1103
GENERAL SERVICES           972
PUBLIC LIBRARY             932
FAMILY & SUPPORT           621
FINANCE                    575
HEALTH                     516
LAW                        405
CITY COUNCIL               400
BUILDINGS                  266
COMMUNITY DEVELOPMENT      214
BUSINESS AFFAIRS           168
BOARD OF ELECTION          112
DoIT                       101
PROCUREMENT                 86
CITY CLERK                  85
MAYOR'S OFFICE              85
CULTURAL AFFAIRS            76
ANIMAL CONTRL               73
HUMAN RESOURCES             68
INSPECTOR GEN               63
IPRA                        56
BUDGET & MGMT               44
ADMIN HEARNG                38
DISABILITIES                29
TREASURER                   24
COPA                        17
HUMAN RE
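
One way to turn counts like these into a sparsity check is to count how many categories fall below a minimum number of rows. The sketch below uses a small made-up series rather than the Chicago data, and `min_count` is a hypothetical cutoff chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for a categorical column; the values are made up for illustration.
titles = pd.Series(
    ["POLICE OFFICER"] * 6
    + ["SERGEANT"] * 3
    + ["DEVELOPMENT DIR", "MOBILE UNIT OPERATOR"],
    name="Job Titles",
)

counts = titles.value_counts()

# min_count is a hypothetical cutoff below which a category is treated as sparse.
min_count = 2
sparse = counts[counts < min_count]

n_categories = titles.nunique()
share_sparse = len(sparse) / n_categories

print(f"{n_categories} categories, {len(sparse)} with fewer than {min_count} rows")
print(f"share of sparse categories: {share_sparse:.2f}")
```

A column where most categories are sparse would produce many dummy columns that are almost entirely zero, which is the situation the comparison above is meant to avoid.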

Choose between department and job title, and use the variable with the smaller number of distinct values for one-hot encoding. Additionally, create dummy variables for full or part-time.

In [7]:
pd.get_dummies(chic['Department'], prefix='Department', drop_first=True)

Unnamed: 0,Department_ANIMAL CONTRL,Department_AVIATION,Department_BOARD OF ELECTION,Department_BOARD OF ETHICS,Department_BUDGET & MGMT,Department_BUILDINGS,Department_BUSINESS AFFAIRS,Department_CITY CLERK,Department_CITY COUNCIL,Department_COMMUNITY DEVELOPMENT,Department_COPA,Department_CULTURAL AFFAIRS,Department_DISABILITIES,Department_DoIT,Department_FAMILY & SUPPORT,Department_FINANCE,Department_FIRE,Department_GENERAL SERVICES,Department_HEALTH,Department_HUMAN RELATIONS,Department_HUMAN RESOURCES,Department_INSPECTOR GEN,Department_IPRA,Department_LAW,Department_LICENSE APPL COMM,Department_MAYOR'S OFFICE,Department_OEMC,Department_POLICE,Department_POLICE BOARD,Department_PROCUREMENT,Department_PUBLIC LIBRARY,Department_STREETS & SAN,Department_TRANSPORTN,Department_TREASURER,Department_WATER MGMNT
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32653,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
32654,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
32655,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
32656,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [8]:
pd.get_dummies(chic['Full or Part-Time'], prefix='FoP', drop_first=True)

Unnamed: 0,FoP_P
0,0
1,0
2,0
3,0
4,0
...,...
32653,0
32654,0
32655,0
32656,0


In [9]:
chic.shape

(24775, 8)

In [10]:
# answer below:
chic = pd.concat([chic.drop(['Department','Full or Part-Time'], axis=1),pd.get_dummies(chic['Department'], prefix='Department', drop_first=True),pd.get_dummies(chic['Full or Part-Time'], prefix='FoP', drop_first=True)],axis=1)
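
To see what `drop_first=True` does in the encoding step, here is a minimal sketch on a made-up four-row frame (the `toy` data is an assumption, not part of the dataset). The first category in sorted order is dropped from each variable and becomes the implicit baseline.

```python
import pandas as pd

# Made-up rows that mimic the two categorical columns being encoded.
toy = pd.DataFrame({
    "Department": ["FIRE", "POLICE", "LAW", "POLICE"],
    "Full or Part-Time": ["F", "F", "P", "F"],
})

# drop_first=True drops the first category in sorted order (FIRE and F here),
# so a row of all zeros means "baseline category".
dept = pd.get_dummies(toy["Department"], prefix="Department", drop_first=True, dtype=int)
fop = pd.get_dummies(toy["Full or Part-Time"], prefix="FoP", drop_first=True, dtype=int)

# Replace the original categorical columns with their dummy columns.
encoded = pd.concat(
    [toy.drop(columns=["Department", "Full or Part-Time"]), dept, fop],
    axis=1,
)
print(encoded)
```

Here the FIRE row is encoded as zeros in both `Department_LAW` and `Department_POLICE`, and `FoP_P` is 1 only for the part-time row.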


Remove all irrelevant columns (Name, Job Titles, Salary or Hourly, Typical Hours, Hourly Rate)

In [11]:
# answer below:
chic = chic.drop(['Name', 'Job Titles','Salary or Hourly', 'Typical Hours', 'Hourly Rate'], axis=1)

In [28]:
chic

Unnamed: 0,Annual Salary,Department_ANIMAL CONTRL,Department_AVIATION,Department_BOARD OF ELECTION,Department_BOARD OF ETHICS,Department_BUDGET & MGMT,Department_BUILDINGS,Department_BUSINESS AFFAIRS,Department_CITY CLERK,Department_CITY COUNCIL,Department_COMMUNITY DEVELOPMENT,Department_COPA,Department_CULTURAL AFFAIRS,Department_DISABILITIES,Department_DoIT,Department_FAMILY & SUPPORT,Department_FINANCE,Department_FIRE,Department_GENERAL SERVICES,Department_HEALTH,Department_HUMAN RELATIONS,Department_HUMAN RESOURCES,Department_INSPECTOR GEN,Department_IPRA,Department_LAW,Department_LICENSE APPL COMM,Department_MAYOR'S OFFICE,Department_OEMC,Department_POLICE,Department_POLICE BOARD,Department_PROCUREMENT,Department_PUBLIC LIBRARY,Department_STREETS & SAN,Department_TRANSPORTN,Department_TREASURER,Department_WATER MGMNT,FoP_P
0,107790.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,104628.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,114324.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,76932.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
4,111474.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32653,90024.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
32654,48078.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
32655,87006.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
32656,93354.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [13]:
#chic = chic.drop('Full or Part-Time',axis=1)

Check whether any of the remaining columns are of object type, and convert any that are to numeric.

In [16]:
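chic['Annual Salary'] = chic['Annual Salary'].str.replace('$', '', regex=False)  # treat '$' as a literal character, not a regex anchor
chic['Annual Salary'] = chic['Annual Salary'].astype(float)  # np.float was removed in NumPy 1.24; the builtin float is equivalent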
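
A more general version of this check could find every non-numeric column first and then convert each one. The sketch below runs on a made-up frame and assumes that the only thing preventing conversion is a leading `$`, as in the salary column; other columns might need different cleaning.

```python
import pandas as pd

# Made-up frame: one string-valued currency column and one numeric column.
toy = pd.DataFrame({
    "Annual Salary": ["$107790.00", "$104628.00", "$76932.00"],
    "FoP_P": [0, 0, 1],
})

# Collect the columns that pandas does not already treat as numeric.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

for col in non_numeric:
    # Assumption: the values are numbers with a literal '$' prefix.
    cleaned = toy[col].str.replace("$", "", regex=False)
    # pd.to_numeric raises on anything it cannot parse, so bad values surface early.
    toy[col] = pd.to_numeric(cleaned)

print(toy.dtypes)
```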


In [17]:
# answer below:
chic.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 24775 entries, 0 to 32657
Data columns (total 37 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Annual Salary                     24775 non-null  float64
 1   Department_ANIMAL CONTRL          24775 non-null  uint8  
 2   Department_AVIATION               24775 non-null  uint8  
 3   Department_BOARD OF ELECTION      24775 non-null  uint8  
 4   Department_BOARD OF ETHICS        24775 non-null  uint8  
 5   Department_BUDGET & MGMT          24775 non-null  uint8  
 6   Department_BUILDINGS              24775 non-null  uint8  
 7   Department_BUSINESS AFFAIRS       24775 non-null  uint8  
 8   Department_CITY CLERK             24775 non-null  uint8  
 9   Department_CITY COUNCIL           24775 non-null  uint8  
 10  Department_COMMUNITY DEVELOPMENT  24775 non-null  uint8  
 11  Department_COPA                   24775 non-null  uint8  
 12  Depa

Split the data into a test and train sample. Use annual salary as the dependent variable. 20% of the data should be assigned to the test sample.

In [21]:
# answer below:
X = chic.drop('Annual Salary',axis=1)
y = chic['Annual Salary']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Generate a regression decision tree using `DecisionTreeRegressor` in sklearn. Fit the model on the training set and calculate the score for both train and test.

In [24]:
# answer below:
model = DecisionTreeRegressor(max_depth=10, min_samples_split=100)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)
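
The cell above fits the tree and predicts, but it does not print the scores the prompt asks for. For a regressor, `score` returns the coefficient of determination (R²). The following is a minimal sketch on synthetic data, not on the Chicago dataset; the features, target, and both `random_state` values are assumptions made so that the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: 500 rows of binary (dummy-like) features and a noisy target.
X = rng.integers(0, 2, size=(500, 3))
y = 50000 + 20000 * X[:, 0] + 10000 * X[:, 1] + rng.normal(0, 1000, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same hyperparameters as the cell above; random_state is added for repeatability.
model = DecisionTreeRegressor(max_depth=10, min_samples_split=100, random_state=0)
model.fit(X_train, y_train)

# score() returns R^2 for regressors.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"train R^2: {train_r2:.3f}")
print(f"test R^2:  {test_r2:.3f}")
```

A train score far above the test score would point to overfitting; limits such as `max_depth` and `min_samples_split` are there to keep that gap small.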

In [29]:
from sklearn.metrics import mean_absolute_error
from statsmodels.tools.eval_measures import mse, rmse

print('Mean Absolute Error: ', mean_absolute_error(y_test,y_pred))
print('Mean Absolute Percentage Error: ', np.mean(np.abs((y_test - y_pred) / y_test)) * 100)
print('Mean Squared Error: ', mse(y_test,y_pred))
print('Root Mean Squared Error: ', rmse(y_test,y_pred))

Mean Absolute Error:  14090.1628597015
Mean Absolute Percentage Error:  1682.2587052958588
Mean Squared Error:  395672472.40601945
Root Mean Squared Error:  19891.517599369323
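
A mean absolute percentage error far above 100% next to an MAE of roughly 14,000 may indicate that a few test rows have very small true salaries, since MAPE divides each error by the true value. That explanation is an assumption and has not been checked against this dataset. The sketch below uses made-up numbers to show the effect, and computes the same four metrics with only NumPy and sklearn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values: three ordinary salaries and one near-zero salary.
y_true = np.array([100000.0, 80000.0, 60000.0, 1.0])
y_pred = np.array([90000.0, 88000.0, 66000.0, 50000.0])


def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((actual - predicted) / actual)) * 100


mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# MAPE with and without the near-zero row.
mape_all = mape(y_true, y_pred)
mape_without_small = mape(y_true[:3], y_pred[:3])

print("MAE: ", mae)
print("RMSE:", rmse)
print("MAPE, all rows:              ", mape_all)
print("MAPE, without near-zero row: ", mape_without_small)
```

In this toy case the three ordinary rows are each off by 10%, yet the single near-zero row dominates the average. If something similar holds in the real data, inspecting `y_test.min()` or reporting MAE and RMSE alongside MAPE would be reasonable next steps.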


