## Day 29 Lecture 1 Assignment

In this assignment, we will learn about decision trees. We will use the Chicago salary dataset loaded below.

In [70]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from pydotplus import graph_from_dot_data
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from IPython.display import Image

In [25]:
chicago = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Current_Employee_Names__Salaries__and_Position_Titles.csv')

In [26]:
chicago.head()

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"ALLISON, PAUL W",LIEUTENANT,FIRE,F,Salary,,$107790.00,
1,"BRUNO, KEVIN D",SERGEANT,POLICE,F,Salary,,$104628.00,
2,"COOPER, JOHN E",LIEUTENANT-EMT,FIRE,F,Salary,,$114324.00,
3,"CRESPO, VILMA I",STAFF ASST,LAW,F,Salary,,$76932.00,
4,"DOLAN, ROBERT J",SERGEANT,POLICE,F,Salary,,$111474.00,


To simplify this problem, we will limit our model to only salary employees. Create a new dataset that does not contain any hourly employees.

In [27]:
# creating new dataset with just salaried employees
sal = chicago.loc[(chicago['Salary or Hourly'] == 'Salary')]
sal.head()

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"ALLISON, PAUL W",LIEUTENANT,FIRE,F,Salary,,$107790.00,
1,"BRUNO, KEVIN D",SERGEANT,POLICE,F,Salary,,$104628.00,
2,"COOPER, JOHN E",LIEUTENANT-EMT,FIRE,F,Salary,,$114324.00,
3,"CRESPO, VILMA I",STAFF ASST,LAW,F,Salary,,$76932.00,
4,"DOLAN, ROBERT J",SERGEANT,POLICE,F,Salary,,$111474.00,


Next, we will look at the count of all values for both job titles and department to ensure that we don't use features that are too sparse in our model.

In [50]:
# looking at values for job titles and dept
# sal[['Department', 'Job Titles']].value_counts()
sal['Department'].nunique()
# sal['Job Titles'].nunique()
sal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24775 entries, 0 to 32657
Data columns (total 43 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Name                              24775 non-null  object 
 1   Job Titles                        24775 non-null  object 
 2   Department                        24775 non-null  object 
 3   Full or Part-Time                 24775 non-null  object 
 4   Salary or Hourly                  24775 non-null  object 
 5   Typical Hours                     0 non-null      float64
 6   Annual Salary                     24775 non-null  object 
 7   Hourly Rate                       0 non-null      object 
 8   department_ANIMAL CONTRL          24775 non-null  uint8  
 9   department_AVIATION               24775 non-null  uint8  
 10  department_BOARD OF ELECTION      24775 non-null  uint8  
 11  department_BOARD OF ETHICS        24775 non-null  uint8  
 12  depa

Choose the between department and job title and use the variable with the smallest number of values to for one hot encoding. Additionally, create dummy variables for full ot part-time.

In [37]:
# answer below:
dummies = pd.get_dummies(sal['Department'], prefix='department', drop_first=True)
dummies.head()

Unnamed: 0,department_ANIMAL CONTRL,department_AVIATION,department_BOARD OF ELECTION,department_BOARD OF ETHICS,department_BUDGET & MGMT,department_BUILDINGS,department_BUSINESS AFFAIRS,department_CITY CLERK,department_CITY COUNCIL,department_COMMUNITY DEVELOPMENT,department_COPA,department_CULTURAL AFFAIRS,department_DISABILITIES,department_DoIT,department_FAMILY & SUPPORT,department_FINANCE,department_FIRE,department_GENERAL SERVICES,department_HEALTH,department_HUMAN RELATIONS,department_HUMAN RESOURCES,department_INSPECTOR GEN,department_IPRA,department_LAW,department_LICENSE APPL COMM,department_MAYOR'S OFFICE,department_OEMC,department_POLICE,department_POLICE BOARD,department_PROCUREMENT,department_PUBLIC LIBRARY,department_STREETS & SAN,department_TRANSPORTN,department_TREASURER,department_WATER MGMNT
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


Remove all irrelevant columns (Name, Job Titles, Salary or Hourly, Typical Hours, Hourly Rate)

In [39]:
# answer below:
time = pd.get_dummies(sal['Full or Part-Time'], prefix='status', drop_first=True)
time.head()

Unnamed: 0,status_P
0,0
1,0
2,0
3,0
4,0


Check that none of the remaining columns are of object type and convert them to numeric if they are of object type.

In [43]:
# answer below:
dummies.info()

In [54]:
sal['Annual Salary'] = sal['Annual Salary'].str.strip('$')
sal['Annual Salary'] = pd.to_numeric(sal['Annual Salary'], errors='raise', downcast='integer')
sal['Annual Salary']

0        107790.0
1        104628.0
2        114324.0
3         76932.0
4        111474.0
           ...   
32653     90024.0
32654     48078.0
32655     87006.0
32656     93354.0
32657    115932.0
Name: Annual Salary, Length: 24775, dtype: float64

Split the data into a test and train sample. Use annual salary as the dependent variable. 20% of the data should be assigned to the test sample.

In [55]:
# answer below:
X = pd.concat([dummies, time], axis=1)
y = sal['Annual Salary']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

Generate a regession decision tree using `DecisionTreeRegressor` in sklearn. Fit the model on the training set and calculate the score for both train and test.

In [64]:
model = DecisionTreeRegressor(max_depth=30, min_samples_leaf=5)
model.fit(X,y)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=30,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=5, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')

In [65]:
y_preds_train = model.predict(X_train)
y_preds_test = model.predict(X_test)

In [66]:
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.1424404950631052
0.15877735249349056


In [67]:
from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(y_train, y_preds_train))
print(mean_absolute_error(y_test, y_preds_test))

13543.586551720326
13150.523676326671


In [72]:
import graphviz

dot_data = model.export_graphviz(reg_tree, out_file=None, 
                     feature_names=X_train.columns,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph