<a href="https://colab.research.google.com/github/timcsmith/MIS536-Public/blob/master/Notebooks/Class08a_decision_tree_defaults.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Class08a - Prediction using Decision Tree (using Default Parameters)

## Introduction and Overview




In this project, we will be using a dataset containing census information from [UCI's Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/census+income).

By using this census data with a decision tree, we will try to predict whether a person's income is above $50K, using the following variables: age, sex, capital-gain, capital-loss, and hours-per-week.
Let's get started!



# Predicting Income with Decision Tree



## Step 1: Install and import necessary packages

In [None]:
# import packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
import numpy as np

## Step 2: Load, clean and prepare data


### 2.1 Read data (income.csv)

In [None]:
income_df = pd.read_csv("https://raw.githubusercontent.com/timcsmith/MIS536-Public/master/Data/income.csv", engine='python', delimiter=", ")

### 2.2 Explore the dataset

In [None]:
# Explore the dataset
# show the first five rows, the column names, summary statistics, and column types
print(income_df.head())
print(income_df.columns)
print(income_df.describe())
print(income_df.info())

   age         workclass  fnlwgt  ... hours-per-week  native-country income
0   39         State-gov   77516  ...             40   United-States  <=50K
1   50  Self-emp-not-inc   83311  ...             13   United-States  <=50K
2   38           Private  215646  ...             40   United-States  <=50K
3   53           Private  234721  ...             40   United-States  <=50K
4   28           Private  338409  ...             40            Cuba  <=50K

[5 rows x 15 columns]
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')
                age        fnlwgt  ...  capital-loss  hours-per-week
count  32561.000000  3.256100e+04  ...  32561.000000    32561.000000
mean      38.581647  1.897784e+05  ...     87.303830       40.437456
std       13.640433  1.055500e+05  ...    402.960219       12.

### 2.3 Clean/transform data (where necessary)

In [None]:
# based on findings from data exploration, we need to clean up column names, as there are some leading whitespace characters
income_df.columns = [s.strip() for s in income_df.columns] 
income_df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [None]:
# clean the dataset: sex is not numeric, so encode Male as 0 and Female as 1.
income_df.sex = income_df.sex.replace("Male", 0, regex=True)
income_df.sex = income_df.sex.replace("Female", 1, regex=True)
income_df.sex

0        0
1        0
2        0
3        0
4        1
        ..
32556    1
32557    0
32558    1
32559    0
32560    1
Name: sex, Length: 32561, dtype: int64

In [None]:
# Transform the target (income) into integers: 0 for <=50K, 1 for >50K. This lets precision_score and recall_score treat 1 as the positive class by default.
income_df.income.unique()
income_df.income = income_df.income.replace("<=50K", 0, regex=True)
income_df.income = income_df.income.replace(">50K", 1, regex=True)


## Step 3: Split data into training and validation sets

In [None]:
# construct datasets for analysis
target = 'income'
predictors = ['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week']
X = income_df[predictors]
y = income_df[target]
print(X)
print(y)

       age  sex  capital-gain  capital-loss  hours-per-week
0       39    0          2174             0              40
1       50    0             0             0              13
2       38    0             0             0              40
3       53    0             0             0              40
4       28    1             0             0              40
...    ...  ...           ...           ...             ...
32556   27    1             0             0              38
32557   40    0             0             0              40
32558   58    1             0             0              40
32559   22    0             0             0              20
32560   52    1         15024             0              40

[32561 rows x 5 columns]
0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    1
32558    0
32559    0
32560    1
Name: income, Length: 32561, dtype: int64


In [None]:
# create the training set (70%) and the validation set (30%)
train_X, valid_X, train_y, valid_y = train_test_split(X,y, test_size=0.3, random_state=1)
print(train_X)
print(valid_X)

       age  sex  capital-gain  capital-loss  hours-per-week
16525   44    0             0             0              60
14551   22    1             0             0              30
518     21    1             0             0              35
22524   46    0             0             0              40
11425   17    0             0             0              20
...    ...  ...           ...           ...             ...
32511   25    1             0             0              40
5192    32    0         15024             0              45
12172   27    0             0             0              40
235     59    0             0             0              40
29733   33    0             0          1902              45

[22792 rows x 5 columns]
       age  sex  capital-gain  capital-loss  hours-per-week
9646    62    1             0             0              66
709     18    0             0             0              25
7385    25    0         27828             0              50
16671   33    
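
To see where the 22792 training rows come from, the split sizes can be checked without the real data. This is a minimal sketch that uses an array of row numbers (`row_ids`, a stand-in introduced here for illustration) with the same `test_size` and `random_state` as above. It also shows that repeating the call with the same seed gives the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 32561               # number of rows reported for income_df
row_ids = np.arange(n_rows)  # stand-in for the real rows

# Same settings as the notebook: 30% held out, fixed seed.
train_ids, valid_ids = train_test_split(row_ids, test_size=0.3, random_state=1)
print(len(train_ids), len(valid_ids))

# Every row lands in exactly one of the two sets.
no_overlap = len(np.intersect1d(train_ids, valid_ids)) == 0
print(no_overlap)

# The same seed reproduces the same partition.
train_again, valid_again = train_test_split(row_ids, test_size=0.3, random_state=1)
same_split = np.array_equal(valid_ids, valid_again)
print(same_split)
```

Fixing `random_state` is what makes the results in this notebook repeatable from run to run.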

## Step 4: Create and train model


You can find details about the DecisionTreeClassifier [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).

### 4.1 Create a decision tree using the default parameters

In [None]:
dtree=DecisionTreeClassifier(random_state=1)

### 4.2 Fit the model to the training data

In [None]:
dtree.fit(train_X, train_y)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=1, splitter='best')
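
With the defaults shown above (`max_depth=None`, `min_samples_leaf=1`, `min_samples_split=2`), the tree keeps splitting until every leaf is pure or cannot be split further. The sketch below illustrates one consequence on synthetic data from `make_classification` (not the census data; all `*_demo` names are hypothetical): the tree fits its own training data perfectly, while accuracy on held-out data is typically lower.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise, for illustration only.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=1,
)
train_X_demo, valid_X_demo, train_y_demo, valid_y_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# Default parameters: no limit on depth or leaf size.
demo_tree = DecisionTreeClassifier(random_state=1)
demo_tree.fit(train_X_demo, train_y_demo)

# Compare accuracy on the data the tree has seen with data it has not.
train_acc = demo_tree.score(train_X_demo, train_y_demo)
valid_acc = demo_tree.score(valid_X_demo, valid_y_demo)

print("Depth:", demo_tree.get_depth(), "Leaves:", demo_tree.get_n_leaves())
print("Training accuracy:", train_acc)
print("Validation accuracy:", valid_acc)
```

A gap between the two accuracies is a sign of overfitting, which is why performance is judged on the validation set rather than the training set.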

### 4.3 Review of the performance of the model on the validation/test data

In [None]:
validation_predictions = dtree.predict(valid_X)

print('Confusion Matrix: ', confusion_matrix(valid_y, validation_predictions))
print('Accuracy score', accuracy_score(valid_y, validation_predictions))
print('Precision score', precision_score(valid_y, validation_predictions))
print('Recall score', recall_score(valid_y, validation_predictions))

Confusion Matrix:  [[7128  422]
 [1289  930]]
Accuracy score 0.8248541304125294
Precision score 0.6878698224852071
Recall score 0.4191077061739522
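
The three scores follow directly from the confusion matrix. scikit-learn puts actual classes in rows and predicted classes in columns, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. As a quick check, the sketch below recomputes the scores by hand from the counts printed above (the variable names are chosen here for illustration).

```python
# Counts from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 7128, 422
fn, tp = 1289, 930

total = tn + fp + fn + tp

# Share of all validation rows that were classified correctly.
accuracy = (tp + tn) / total

# Of the rows predicted as >50K, the share that really are >50K.
precision = tp / (tp + fp)

# Of the rows that really are >50K, the share the model found.
recall = tp / (tp + fn)

# Baseline for comparison: always predicting the majority class (<=50K).
baseline_accuracy = (tn + fp) / total

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Majority-class baseline accuracy:", baseline_accuracy)
```

Read this way, the model identifies fewer than half of the people who actually earn more than $50K (recall), even though its overall accuracy looks high.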


## Step 5: Deploy model

In this notebook we develop a model and test its performance on the validation data. In this exercise (predicting income), there is no model deployment. 

What does "deploying" a model mean? Up to this point, we've fit a model to our training data and then estimated how well it will perform on new data by testing it on the validation data.

In this course, we finish after building the model. In practice, the model is used by an organization/company in some way. Using the model is often referred to as "deploying" the model.

How a model is deployed can vary. It may simply be deployed as a notebook that reads the latest predictor data and uses the developed model to make predictions. The model can also be deployed inside enterprise decision support software that automatically makes predictions on incoming data.
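
As a rough illustration of the notebook-style deployment described above, a fitted model can be serialized and later restored to score new rows. This is a minimal sketch, assuming Python's standard `pickle` module as the storage mechanism and a tiny made-up training set (`train_demo`, `labels_demo`, and `incoming` are hypothetical). A real deployment would save the model trained in Step 4 instead.

```python
import pickle

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

predictors = ['age', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week']

# Tiny hypothetical training set, for illustration only.
train_demo = pd.DataFrame(
    [[25, 0, 0, 0, 40],
     [52, 1, 15024, 0, 40],
     [38, 0, 0, 0, 45],
     [47, 1, 7688, 0, 50]],
    columns=predictors,
)
labels_demo = [0, 1, 0, 1]

model = DecisionTreeClassifier(random_state=1).fit(train_demo, labels_demo)

# "Deploy": serialize the fitted model so another process can use it.
model_bytes = pickle.dumps(model)

# Later, possibly elsewhere: restore the model and score incoming data.
restored = pickle.loads(model_bytes)
incoming = pd.DataFrame(
    [[30, 0, 0, 0, 40],
     [60, 1, 20000, 0, 40]],
    columns=predictors,
)
predictions = restored.predict(incoming)
print(predictions)
```

The incoming data must have the same columns, in the same order and with the same encoding, as the training data. Pickle files should only be loaded from sources you trust.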