<a href="https://colab.research.google.com/github/rammeshulam/89-570/blob/master/Exercise_dt_50k.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

from statistics import mean
from matplotlib import pyplot as plt

%matplotlib inline

# Dataset
The data is taken from the [UCI ML Repository](http://archive.ics.uci.edu/ml/datasets/Adult). Each row describes one US citizen.
The target (the column we would like to predict) is whether income exceeds $50K/yr. The label is in the last column.

## Target: 
*  \>50K, <=50K. 

## Features:
* age: continuous. 
* workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. 
* fnlwgt: continuous. 
* education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. 
* education-num: continuous. 
* marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. 
* occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. 
* relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. 
* race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. 
* sex: Female, Male. 
* capital-gain: continuous. 
* capital-loss: continuous. 
* hours-per-week: continuous. 
* native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

# Load data

In [0]:
cols = ['age',
  'workclass',
  'fnlwgt',
  'education',
  'education-num',
  'marital-status',
  'occupation',
  'relationship',
  'race',
  'sex',
  'capital-gain',
  'capital-loss',
  'hours-per-week',
  'native-country',
  'target']
data_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'
data_raw = pd.read_csv(data_url, names=cols, converters={'target': lambda s: s.strip()})
data_raw.head(3)

In [0]:
data_raw.describe()

## Clean data, fill NA, etc.
This is out of the scope of this assignment. Assume the data is clean.

In [0]:
fig = plt.figure(figsize=(25, 15))
n_cols = 5
# add_subplot expects integer grid sizes, so convert the ceil() result
n_rows = int(np.ceil(data_raw.shape[1] / n_cols))
for i, column in enumerate(data_raw.columns):
    ax = fig.add_subplot(n_rows, n_cols, i + 1)
    ax.set_title(column)
    if data_raw[column].dtype == object:
        # categorical column: bar chart of value counts
        data_raw[column].value_counts().plot(kind="bar", ax=ax)
    else:
        # numerical column: histogram
        data_raw[column].hist(ax=ax)
        plt.xticks(rotation="vertical")
plt.subplots_adjust(hspace=0.7, wspace=0.2)

# Prepare data for training

## Convert target to 0 and 1

In [0]:
print( f"values before conversion:\n{data_raw.target.value_counts()}")

# replace '<=50K' and '>50K' with 0 and 1 respectively
<ADD CODE HERE>

print( f"\nvalues after conversion:\n{data_raw.target.value_counts()}")
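
As a hint, the sketch below shows one possible way to turn two string labels into integers with `Series.map`. The series `toy_target` and the dictionary `label_to_int` are made-up names used only for illustration; they are not part of the assignment data.

```python
import pandas as pd

# Toy labels standing in for a target column (illustrative values only).
toy_target = pd.Series(['<=50K', '>50K', '<=50K', '>50K', '<=50K'])

# Map each label string to an integer class.
label_to_int = {'<=50K': 0, '>50K': 1}
toy_encoded = toy_target.map(label_to_int)

print(toy_encoded.value_counts())
```

Note that `map` yields NaN for any value missing from the dictionary, so it is worth checking the value counts afterwards.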

## Convert categorical features
The decision tree classifier we are about to use cannot work with categorical features. Thus, we must convert them into boolean columns.

In [0]:
data_raw.info()

In [0]:
# identify categorical vs. numerical columns:
categorical_columns = data_raw.select_dtypes(include=object).columns.values
numerical_columns = data_raw.select_dtypes(exclude=object).columns.values

print('categorical_columns:', categorical_columns)
print('numerical_columns:', numerical_columns)

1. Convert the categorical columns into dummy variables. For example, the 'workclass' column, which has the values 'Federal-gov', 'Local-gov', 'Never-worked', ..., should be transformed into multiple boolean columns, one for each value: 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', ... See [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) for reference.
2. Concatenate the numerical columns with the newly created dummy-variable columns. The new dataframe is ready for training.
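
The two steps can be sketched on a tiny made-up dataframe. The frame `toy` and every variable name below are assumptions for illustration only; the leading spaces in the category values imitate the raw file, which is why the resulting column names contain a space after the underscore.

```python
import pandas as pd

# Tiny illustrative frame; the string column is created with object dtype explicitly.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': pd.Series([' State-gov', ' Private', ' Private'], dtype=object),
    'target': [0, 0, 1],
})

toy_categorical = toy.select_dtypes(include=object).columns
toy_numerical = toy.select_dtypes(exclude=object).columns

# Step 1: one indicator column per category value.
toy_dummies = pd.get_dummies(toy[toy_categorical])

# Step 2: put the numerical columns and the indicator columns side by side.
toy_ready = pd.concat([toy[toy_numerical], toy_dummies], axis=1)

print(toy_ready.columns.tolist())
```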

In [0]:
<ADD CODE HERE>
print(data.head(1).T)

Define X as all columns excluding target; define y as the target column.
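
A minimal sketch of separating features from a label column, using a made-up frame (`toy`, `toy_X`, and `toy_y` are illustrative names, not the assignment's variables):

```python
import pandas as pd

# Illustrative frame with two feature columns and a label column.
toy = pd.DataFrame({'age': [39, 50], 'hours-per-week': [40, 13], 'target': [0, 1]})

# drop() returns a new frame without the label; the original frame is unchanged.
toy_X = toy.drop(columns=['target'])
toy_y = toy['target']

print(toy_X.columns.tolist())
```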


In [0]:
X = <ADD CODE HERE>
y = data['target']

# Train a model

## First, let's split the data into train and test sets.
Split the data into 90% train and 10% test sets. For consistency, use random_state=42. See [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) for reference.
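
For orientation, here is how a 90/10 split behaves on a small synthetic array. The arrays and the `toy_` names are assumptions made for this sketch; only `train_test_split`, `test_size`, and `random_state` come from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 synthetic samples with 2 features each, and alternating labels.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0, 1] * 10)

# test_size=0.1 holds out 10% of the rows; random_state fixes the shuffle.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=42)

print(toy_X_train.shape, toy_X_test.shape)
```

The function returns the pieces in the order train features, test features, train labels, test labels.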

In [0]:
#create X_train, X_test, y_train, y_test
<ADD CODE HERE>

1. On the training set, create a decision tree classifier 
2. Use 10-fold cross-validation to calculate the performance of the model.

See [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) and  [cross_val_score](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) for reference.
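
The following sketch shows the general shape of 10-fold cross-validation with a decision tree, run on synthetic data from `make_classification` rather than on the assignment data. The `toy_` names and the choice of `random_state=0` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (not the Adult dataset).
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_clf = DecisionTreeClassifier(random_state=0)

# cv=10 trains and evaluates the classifier on 10 different train/validation folds.
toy_scores = cross_val_score(toy_clf, toy_X, toy_y, cv=10)

print(f"mean accuracy over {len(toy_scores)} folds: {toy_scores.mean():.3f}")
```

`cross_val_score` returns one score per fold; for a classifier the default score is accuracy.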


In [0]:
clf = <ADD CODE HERE>

#print score:
<ADD CODE HERE>

# Tune model hyperparameters
Let's find the optimal tree depth based on the training set data:
1. Loop over depths from 1 to 15.
2. For each depth, create a decision tree with max_depth=depth and calculate the average cross_val_score (use 10-fold cross-validation).
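
One way such a depth sweep might look, sketched on synthetic data with a shorter depth range. The helper name `toy_depth_sweep` and all `toy_` variables are hypothetical and chosen only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def toy_depth_sweep(depths, X, y):
    """Return the mean 10-fold CV accuracy for each max_depth in depths (hypothetical helper)."""
    results = []
    for depth in depths:
        clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
        results.append(float(cross_val_score(clf, X, y, cv=10).mean()))
    return results

# Synthetic data (not the Adult dataset) and a short depth range to keep it fast.
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_depths = range(1, 6)
toy_performance = toy_depth_sweep(toy_depths, toy_X, toy_y)

# Position i in the result list corresponds to toy_depths[i], not to depth i.
toy_best_depth = toy_depths[toy_performance.index(max(toy_performance))]
print(toy_best_depth)
```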

In [0]:
# returns a list of performance for each depth in depths
def check_depths(depths, X, y):
  <ADD CODE HERE>
  
depths = range(1,16)
performance = check_depths(depths, X_train, y_train)
best_p = max(performance)
# list index i corresponds to depths[i], so map the index back to the actual depth
best_pi = depths[performance.index(best_p)]

print(f"max performance ({best_p}) achieved at depth {best_pi}")
plt.plot(depths, performance)

# Report performance
For the final performance report:
1. Create a decision tree classifier with max_depth=best_pi.
2. Use the entire training set to train it.
3. Calculate the accuracy on the test set (use the DecisionTreeClassifier.score() function).
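
The three steps can be illustrated on synthetic data as follows. The depth value 3 is an arbitrary placeholder standing in for a tuned depth, and the `toy_` names are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (not the Adult dataset), split 90/10 as in the exercise.
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.1, random_state=42)

# Placeholder depth; in the exercise this would come from the tuning step.
toy_final = DecisionTreeClassifier(max_depth=3, random_state=0)
toy_final.fit(toy_X_train, toy_y_train)

# score() returns the mean accuracy on the given data.
toy_accuracy = toy_final.score(toy_X_test, toy_y_test)
print(f"test accuracy: {toy_accuracy:.3f}")
```

Evaluating on the held-out test set only once, after tuning, keeps the reported accuracy from being biased by the depth selection.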

In [0]:
<ADD CODE HERE>