## Day 29 Lecture 1 Assignment

In this assignment, we will learn about decision trees. We will use the Chicago salary dataset loaded below.

In [0]:
%matplotlib inline
import pydotplus as pdp
import graphviz as gv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [0]:
chicago = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/Current_Employee_Names__Salaries__and_Position_Titles.csv')

In [0]:
chicago.head()

Unnamed: 0,Name,Job Titles,Department,Full or Part-Time,Salary or Hourly,Typical Hours,Annual Salary,Hourly Rate
0,"ALLISON, PAUL W",LIEUTENANT,FIRE,F,Salary,,$107790.00,
1,"BRUNO, KEVIN D",SERGEANT,POLICE,F,Salary,,$104628.00,
2,"COOPER, JOHN E",LIEUTENANT-EMT,FIRE,F,Salary,,$114324.00,
3,"CRESPO, VILMA I",STAFF ASST,LAW,F,Salary,,$76932.00,
4,"DOLAN, ROBERT J",SERGEANT,POLICE,F,Salary,,$111474.00,


To simplify this problem, we will limit our model to salaried employees only. Create a new dataset that does not contain any hourly employees.

In [0]:
# answer below:

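# Salaried employees have no hourly rate; copy so later column assignments don't act on a view
chicago_sal = chicago.loc[chicago['Hourly Rate'].isnull()].copy()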

In [0]:
chicago_sal['Hourly Rate'].unique()

array([nan], dtype=object)
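
The filter above relies on `Hourly Rate` being missing for every salaried employee. As a minimal sketch of an alternative, the rows can be selected through the `Salary or Hourly` column directly. The small frame below is made up for illustration and is not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Chicago data (values are invented for illustration)
toy = pd.DataFrame({
    'Salary or Hourly': ['Salary', 'Hourly', 'Salary'],
    'Annual Salary': ['$100000.00', np.nan, '$76932.00'],
    'Hourly Rate': [np.nan, '$19.86', np.nan],
})

# Keep salaried rows only; .copy() avoids SettingWithCopyWarning on later assignments
toy_sal = toy.loc[toy['Salary or Hourly'] == 'Salary'].copy()

# Both filters agree when the hourly rate is missing exactly for salaried rows
same_rows = toy_sal.index.equals(toy.loc[toy['Hourly Rate'].isnull()].index)
print(len(toy_sal), same_rows)
```

Comparing the two filters this way is a cheap sanity check that the missing-value pattern matches the pay type.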

Next, we will count the distinct values of both job titles and department to ensure that we don't use features that are too sparse in our model.

In [0]:
chicago_sal['Department'].nunique()

36

In [0]:
chicago_sal['Job Titles'].nunique()

954
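
`nunique()` only reports how many distinct categories exist. To see how thinly the rows are spread across them, `value_counts()` can help. The sketch below uses a made-up series of titles, and the cutoff of 2 for a "rare" category is a hypothetical threshold chosen only for this example.

```python
import pandas as pd

# Toy job titles (made up); in the assignment this would be chicago_sal['Job Titles']
titles = pd.Series(['SERGEANT', 'SERGEANT', 'LIEUTENANT', 'STAFF ASST', 'SERGEANT'])

counts = titles.value_counts()   # rows per category, most frequent first
n_levels = titles.nunique()      # number of distinct categories
rare = counts[counts < 2]        # categories seen fewer than 2 times (hypothetical threshold)

print(n_levels, counts.iloc[0], sorted(rare.index))
```

A feature with many categories that each cover only a handful of rows gives a tree very little data per split, which is the concern with `Job Titles` here.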

Choose between department and job title, and use the variable with the smaller number of distinct values for one-hot encoding. Additionally, create a dummy variable for full or part-time.

In [0]:
# answer below:
chicago_sal = pd.concat(
    [
        chicago_sal,
        pd.get_dummies(
            chicago_sal['Department'],
        ),
    ],
    axis=1,
)

In [0]:
pd.get_dummies(chicago_sal['Department']).columns

Index(['ADMIN HEARNG', 'ANIMAL CONTRL', 'AVIATION', 'BOARD OF ELECTION',
       'BOARD OF ETHICS', 'BUDGET & MGMT', 'BUILDINGS', 'BUSINESS AFFAIRS',
       'CITY CLERK', 'CITY COUNCIL', 'COMMUNITY DEVELOPMENT', 'COPA',
       'CULTURAL AFFAIRS', 'DISABILITIES', 'DoIT', 'FAMILY & SUPPORT',
       'FINANCE', 'FIRE', 'GENERAL SERVICES', 'HEALTH', 'HUMAN RELATIONS',
       'HUMAN RESOURCES', 'INSPECTOR GEN', 'IPRA', 'LAW', 'LICENSE APPL COMM',
       'MAYOR'S OFFICE', 'OEMC', 'POLICE', 'POLICE BOARD', 'PROCUREMENT',
       'PUBLIC LIBRARY', 'STREETS & SAN', 'TRANSPORTN', 'TREASURER',
       'WATER MGMNT'],
      dtype='object')
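
To make the encoding concrete, here is a small sketch of `pd.get_dummies` on four made-up department values. Passing `dtype=int` is a choice made for this example so the indicators are 0/1 integers regardless of the pandas version.

```python
import pandas as pd

# Made-up department values for illustration
dept = pd.Series(['FIRE', 'POLICE', 'FIRE', 'LAW'], name='Department')

# One indicator column per department, in sorted order
dummies = pd.get_dummies(dept, dtype=int)

# Each row has exactly one indicator set to 1
row_sums = dummies.sum(axis=1)
print(list(dummies.columns), row_sums.tolist())
```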

Remove all irrelevant columns (Name, Job Titles, Salary or Hourly, Typical Hours, Hourly Rate).

In [0]:
# answer below:

# `columns=` already selects the axis, so `axis=1` is not needed
chicago_sal = chicago_sal.drop(columns=['Name', 'Job Titles', 'Salary or Hourly', 'Typical Hours', 'Hourly Rate'])

In [0]:
chicago_sal.head()

Unnamed: 0,Department,Full or Part-Time,Annual Salary,ADMIN HEARNG,ANIMAL CONTRL,AVIATION,BOARD OF ELECTION,BOARD OF ETHICS,BUDGET & MGMT,BUILDINGS,BUSINESS AFFAIRS,CITY CLERK,CITY COUNCIL,COMMUNITY DEVELOPMENT,COPA,CULTURAL AFFAIRS,DISABILITIES,DoIT,FAMILY & SUPPORT,FINANCE,FIRE,GENERAL SERVICES,HEALTH,HUMAN RELATIONS,HUMAN RESOURCES,INSPECTOR GEN,IPRA,LAW,LICENSE APPL COMM,MAYOR'S OFFICE,OEMC,POLICE,POLICE BOARD,PROCUREMENT,PUBLIC LIBRARY,STREETS & SAN,TRANSPORTN,TREASURER,WATER MGMNT
0,FIRE,F,$107790.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,POLICE,F,$104628.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,FIRE,F,$114324.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,LAW,F,$76932.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,POLICE,F,$111474.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


Check whether any of the remaining columns are of object type, and convert those that are to numeric.

In [0]:
# answer below:

chicago_sal.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24775 entries, 0 to 32657
Data columns (total 39 columns):
Department               24775 non-null object
Full or Part-Time        24775 non-null object
Annual Salary            24775 non-null object
ADMIN HEARNG             24775 non-null uint8
ANIMAL CONTRL            24775 non-null uint8
AVIATION                 24775 non-null uint8
BOARD OF ELECTION        24775 non-null uint8
BOARD OF ETHICS          24775 non-null uint8
BUDGET & MGMT            24775 non-null uint8
BUILDINGS                24775 non-null uint8
BUSINESS AFFAIRS         24775 non-null uint8
CITY CLERK               24775 non-null uint8
CITY COUNCIL             24775 non-null uint8
COMMUNITY DEVELOPMENT    24775 non-null uint8
COPA                     24775 non-null uint8
CULTURAL AFFAIRS         24775 non-null uint8
DISABILITIES             24775 non-null uint8
DoIT                     24775 non-null uint8
FAMILY & SUPPORT         24775 non-null uint8
FINANCE       

In [0]:
# Drop Department because it is already represented by the dummy columns
chicago_sal = chicago_sal.drop(columns='Department')

In [0]:
# Encode Full or Part-Time as a binary indicator: 0 = full-time ('F'), 1 = otherwise (part-time)
chicago_sal['Full or Part-Time'] = (chicago_sal['Full or Part-Time'] != 'F').astype(int)

In [0]:
chicago_sal['Full or Part-Time'].value_counts()

0    24770
1        5
Name: Full or Part-Time, dtype: int64

In [0]:
# Strip the leading dollar sign and convert to float so the column is numeric, not object
chicago_sal['Annual Salary'] = chicago_sal['Annual Salary'].str.lstrip('$').astype(float)
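
After the conversions, it is worth confirming that no object columns remain. The following is a minimal sketch on a made-up two-column frame. It uses `pd.to_numeric` as one possible way to parse the currency strings and `select_dtypes` to list whatever is still non-numeric.

```python
import pandas as pd

# Toy frame with a currency string column (values are for illustration)
toy = pd.DataFrame({
    'Annual Salary': ['$107790.00', '$104628.00', '$76932.00'],
    'FIRE': [1, 0, 0],
})

# Strip the leading dollar sign, then parse the remaining text as numbers
toy['Annual Salary'] = pd.to_numeric(toy['Annual Salary'].str.lstrip('$'))

# Any column still listed here would need further conversion
object_cols = toy.select_dtypes(include='object').columns.tolist()
print(toy['Annual Salary'].dtype, object_cols)
```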

Split the data into training and test samples. Use annual salary as the dependent variable. 20% of the data should be assigned to the test sample.

In [0]:
X = chicago_sal.drop('Annual Salary', axis=1)
y = chicago_sal['Annual Salary']

In [0]:
# answer below:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)
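
As a quick check of what `test_size=0.2` and `random_state` do, the sketch below splits ten made-up rows and repeats the call with the same seed. The arrays are placeholders, not the salary data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)   # 10 rows, 2 made-up features
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1337)

# The same seed gives the same split on a second call
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=1337)
print(X_tr.shape, X_te.shape, np.array_equal(X_te, X_te2))
```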

Generate a regression decision tree using `DecisionTreeRegressor` in sklearn. Fit the model on the training set and calculate the score for both train and test.

In [0]:
from sklearn.tree import DecisionTreeRegressor

In [0]:
dtr = DecisionTreeRegressor(random_state=1337)
dtr.fit(X_train, y_train)

DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1337, splitter='best')

In [0]:
dtr_train_score = dtr.score(X_train, y_train)

In [0]:
print(dtr_train_score)

0.14424193355950865


In [0]:
dtr_test_score = dtr.score(X_test, y_test)

In [0]:
print(dtr_test_score)

0.1540197265147052
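
For a regressor, `score` returns the coefficient of determination, R² = 1 − SS_res / SS_tot. The sketch below, on made-up salaries for three departments, suggests one reason the scores above may be low. When the only features are department indicators, a fully grown tree can do no better than predict each department's mean salary, so any variation within a department stays unexplained. The toy numbers and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

# Made-up salaries for three departments
toy = pd.DataFrame({
    'Department': ['FIRE', 'FIRE', 'POLICE', 'POLICE', 'LAW', 'LAW'],
    'Annual Salary': [100.0, 120.0, 90.0, 110.0, 70.0, 80.0],
})
X_toy = pd.get_dummies(toy['Department'], dtype=int)
y_toy = toy['Annual Salary']

tree = DecisionTreeRegressor(random_state=1337).fit(X_toy, y_toy)
pred = tree.predict(X_toy)

# Rows in the same department have identical features, so they share a leaf
# and the prediction is that department's mean salary
group_mean = toy.groupby('Department')['Annual Salary'].transform('mean')

# R^2 computed by hand from the residual and total sums of squares
ss_res = ((y_toy - pred) ** 2).sum()
ss_tot = ((y_toy - y_toy.mean()) ** 2).sum()
r2_manual = 1 - ss_res / ss_tot
print(np.allclose(pred, group_mean), r2_manual)
```

The hand-computed value matches both `tree.score(X_toy, y_toy)` and `r2_score(y_toy, pred)`, which confirms what the reported train and test scores measure.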
