## About IPython Notebooks ##

IPython notebooks are interactive coding environments embedded in a webpage, and you will use them throughout this class. Write your code only between the `# BEGIN CODE HERE` and `#END CODE HERE` markers. To run a cell, press "SHIFT"+"ENTER" or click "Run" (the play symbol). Before you turn this problem in, make sure everything runs as expected: first **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart), then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [2]:
NAME = "Antonios Keremidis"
AEM = "9717"

---

# Assignment 2 - Decision Trees #

Welcome to your second assignment. This exercise introduces [scikit-learn](https://scikit-learn.org/stable/), a simple but efficient machine learning library for Python, and gives you a broad understanding of how decision trees work.

After this assignment you will:
- Be able to use the scikit-learn library and train your own model from scratch.
- Be able to train and understand decision trees.

In [3]:
# Always run this cell
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Use this seed for every random_state parameter so that results are reproducible
RANDOM_VARIABLE = 42

## 1. Scikit-Learn and Decision Trees ##

You are going to use the scikit-learn library to train a model for detecting breast cancer using the [Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) (+ [Additional information](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset)) by training a model using [decision trees](https://scikit-learn.org/stable/modules/tree.html).

**1.1** Load the breast cancer dataset using the scikit-learn library and split it into a training set and a test set using the appropriate function. Use 33% of the dataset as the test set. Define X as the attributes and y as the target values. Do not forget to set the random_state parameter to the *RANDOM_VARIABLE* defined above. Use this variable for all the random_state parameters in this assignment.

In [4]:
# BEGIN CODE HERE
dataset = load_breast_cancer()
X, y = dataset.data, dataset.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_VARIABLE)

#END CODE HERE

In [5]:
print("Size of train set:{}".format(len(y_train)))
print("Size of test set:{}".format(len(y_test)))
print("Unique classes:{}".format(len(set(y_test))))

Size of train set:381
Size of test set:188
Unique classes:2


**Expected output**:  

```
Size of train set:381  
Size of test set:188  
Unique classes:2
```



**1.2** Train two DecisionTree classifiers and report the F1 score of each. Use information gain as the split criterion for one classifier and Gini impurity for the other.

In [6]:
# BEGIN CODE HERE
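# random_state makes tie-breaking between equally good splits reproducible
classifier_gini = DecisionTreeClassifier(criterion="gini", random_state=RANDOM_VARIABLE)
classifier_igain = DecisionTreeClassifier(criterion="entropy", random_state=RANDOM_VARIABLE)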

classifier_gini.fit(X_train, y_train)
classifier_igain.fit(X_train, y_train)

prediction_gini = classifier_gini.predict(X_test)
prediction_igain = classifier_igain.predict(X_test)

f_measure_gini = f1_score(y_test, prediction_gini)
f_measure_igain = f1_score(y_test, prediction_igain)

#END CODE HERE

In [7]:
print("F-Measure Gini: {}".format(f_measure_gini))
print("F-Measure Information Gain: {}".format(f_measure_igain))

F-Measure Gini: 0.95
F-Measure Information Gain: 0.967479674796748


**Expected output**:  

```
F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386
```
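To make the two split criteria from 1.2 concrete, here is a minimal sketch that computes Gini impurity, entropy and information gain by hand. The helper names and the toy labels are hypothetical and only illustrate the formulas; scikit-learn's internal implementation is not reproduced here.

```python
from collections import Counter
from math import log2


def gini_impurity(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Hypothetical toy node with two balanced classes, split into two children
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [0, 1, 1, 1]

print("Gini impurity of parent:", gini_impurity(parent))
print("Entropy of parent:", entropy(parent))
print("Information gain of split:", information_gain(parent, left, right))
```

A perfectly balanced binary node has the maximum Gini impurity (0.5) and the maximum entropy (1 bit); both criteria reward splits that move the children toward purity.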

**1.3** Find the maximum depth reached by the tree that used the Gini impurity. Train multiple classifiers by modifying the max_depth within the range from 1 to maximum depth and save the f1 scores to the corresponding list of the *fscores* dictionary (one list for training set and one for test set). Before appending the scores to the corresponding list, multiply them by 100, and round the values to 2 decimals.

In [8]:
# BEGIN CODE HERE
depth = classifier_gini.tree_.max_depth
fscores = {}
fscores['train'] = []
fscores['test'] = []

for i in range(1, depth + 1):
    # Use a separate name so the unrestricted tree from 1.2 is not overwritten
    classifier = DecisionTreeClassifier(criterion="gini", max_depth=i, random_state=RANDOM_VARIABLE)
    classifier.fit(X_train, y_train)

    prediction_test = classifier.predict(X_test)
    prediction_train = classifier.predict(X_train)

    # f1_score expects (y_true, y_pred), in that order
    f_measure_test = f1_score(y_test, prediction_test)
    f_measure_train = f1_score(y_train, prediction_train)

    # Store the scores as percentages rounded to 2 decimals
    fscores['test'].append(np.around(f_measure_test * 100, 2))
    fscores['train'].append(np.around(f_measure_train * 100, 2))
#END CODE HERE

In [9]:
print("Fscores Train: {}".format(fscores['train']))
print("Fscores Test:  {}".format(fscores['test']))


Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 93.28, 96.2, 92.77, 95.0, 93.56]


**Expected output**:  
```
Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]
```

**1.4** Compare the results from the train set with the results from the test set. What do you notice? How are you going to choose the max_depth of your model?

YOUR ANSWER HERE
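One possible way to pick max_depth without consulting the test set is k-fold cross-validation on the training set. The sketch below is an illustration under assumptions, not the required answer: `SEED` stands in for RANDOM_VARIABLE, the depth range 1 to 7 mirrors the seven scores listed in 1.3, and the choice of 5 folds is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # assumption: plays the role of RANDOM_VARIABLE

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=SEED
)

# Mean cross-validated F1 score on the training set for each candidate depth
cv_scores = {}
for depth in range(1, 8):
    tree = DecisionTreeClassifier(criterion="gini", max_depth=depth, random_state=SEED)
    scores = cross_val_score(tree, X_train, y_train, cv=5, scoring="f1")
    cv_scores[depth] = scores.mean()

# Pick the depth with the highest mean validation score
best_depth = max(cv_scores, key=cv_scores.get)
print("Mean CV F1 per depth:", {d: round(s, 4) for d, s in cv_scores.items()})
print("Best max_depth by 5-fold CV:", best_depth)
```

The test set is never touched during selection, so it can still give an unbiased estimate of the chosen model's performance afterwards.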

## 2.0 Pipelines ##

**2.1** In this part of the exercise you are going to build a pipeline from scratch for a classification problem. Load the **income.csv** file and train a DecisionTree model that will predict the *income* variable. This dataset is a modification of the original Adult Income dataset found [here](http://archive.ics.uci.edu/ml/datasets/Adult). Report the f1-score and accuracy score of the test set found in **income_test.csv**. Your pipeline should be able to handle missing values and categorical features (scikit-learn's decision trees do not handle categorical values). You can preprocess the dataset as you like in order to achieve higher scores.  

In [15]:
# BEGIN CODE HERE
from sklearn.preprocessing import StandardScaler

train_set = pd.read_csv('income.csv')
X_train = train_set.drop('income',axis=1)
y_train = train_set['income']

test_set = pd.read_csv('income_test.csv')
X_test = test_set.drop('income',axis=1)
y_test = test_set['income']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

categorical_transformer = Pipeline(steps=[
    # fill_value is only used with strategy='constant', so it is omitted here
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder())
])

numeric_features = ['age', 'fnlwgt', 'capital-gain', 'capital-loss', 'hours-per-week']
categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex']

preprocessor = ColumnTransformer(transformers=[
    ('numeric', numeric_transformer, numeric_features),
    ('categorical', categorical_transformer, categorical_features)
], n_jobs=-1)

#END CODE HERE
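The cell above encodes categories with OrdinalEncoder. As an alternative sketch, the same preprocessing idea can be tried with OneHotEncoder, which is imported at the top of the notebook and can ignore categories that were not seen during fitting. The toy DataFrame, its values and the helper `to_dense` are hypothetical; this is not the real income.csv.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "workclass": pd.Series(["Private", np.nan, "State-gov", "Private"], dtype=object),
})

preprocess = ColumnTransformer(transformers=[
    # Numeric column: replace missing values with the column median
    ("numeric", SimpleImputer(strategy="median"), ["age"]),
    # Categorical column: impute the most frequent value, then one-hot encode
    ("categorical", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown="ignore")),
    ]), ["workclass"]),
])


def to_dense(matrix):
    """Return a dense array whether or not the transformer produced a sparse one."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)


encoded = to_dense(preprocess.fit_transform(toy))
print(encoded)

# A category unseen during fit is encoded as all zeros instead of raising an error
unseen = pd.DataFrame({
    "age": [50.0],
    "workclass": pd.Series(["Never-worked"], dtype=object),
})
encoded_unseen = to_dense(preprocess.transform(unseen))
print(encoded_unseen)
```

One-hot encoding avoids imposing an artificial order on the categories, at the cost of producing more columns than ordinal encoding.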

**2.2** Create and test your pipeline

In [11]:
#Your pipeline
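clf = Pipeline(steps=[
    ('preprocessor', preprocessor),
    # The step is named 'regressor' although it holds a classifier; the name is
    # kept because the grid-search parameter keys refer to it
    ('regressor', DecisionTreeClassifier(criterion='gini', max_depth=10, max_leaf_nodes=15,
                                         min_samples_leaf=8, min_samples_split=6,
                                         splitter='best', random_state=RANDOM_VARIABLE))
])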

clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

In [12]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test, y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test, y_predict,average='weighted'))

Model score Accuracy: 0.844
Model score F1 Weighted: 0.833


**2.3** Perform a good grid search to find the best parameters for your pipeline.

In [13]:
param_grid = {
    'regressor__criterion': ['gini', 'entropy'],
    'regressor__splitter': ['best', 'random'],
    'regressor__max_depth': [6, 8, 10, 12, 14, 16],
    'regressor__min_samples_leaf': [6, 8, 10, 12],
    'regressor__max_leaf_nodes': [14, 16, 18, 20, 22, 24, 26, 28],
    'preprocessor__numeric__imputer__strategy': ['mean', 'median', 'most_frequent']
}

grid_search = GridSearchCV(clf, param_grid, scoring='accuracy', n_jobs=-1, verbose=5)
grid_search.fit(X_train, y_train)
y_predict =  grid_search.predict(X_test)

print("Best params:")
print(grid_search.best_params_)

Fitting 5 folds for each of 2304 candidates, totalling 11520 fits
[CV 1/5] END preprocessor__numeric__imputer__strategy=mean, regressor__criterion=gini, regressor__max_depth=6, regressor__max_leaf_nodes=14, regressor__min_samples_leaf=6, regressor__splitter=best;, score=0.844 total time=   0.1s
[CV 2/5] END preprocessor__numeric__imputer__strategy=mean, regressor__criterion=gini, regressor__max_depth=6, regressor__max_leaf_nodes=14, regressor__min_samples_leaf=6, regressor__splitter=best;, score=0.846 total time=   0.1s
[CV 3/5] END preprocessor__numeric__imputer__strategy=mean, regressor__criterion=gini, regressor__max_depth=6, regressor__max_leaf_nodes=14, regressor__min_samples_leaf=6, regressor__splitter=best;, score=0.851 total time=   0.1s
[CV 4/5] END preprocessor__numeric__imputer__strategy=mean, regressor__criterion=gini, regressor__max_depth=6, regressor__max_leaf_nodes=14, regressor__min_samples_leaf=6, regressor__splitter=best;, score=0.853 total time=   0.1s
...
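The candidate and fit counts reported at the top of the log can be checked by hand: a grid search tries every combination of the listed values, so the number of candidates is the product of the list lengths. A small sketch, repeating the grid values from the cell above:

```python
from math import prod

# Same values as the grid searched in 2.3
param_grid = {
    'regressor__criterion': ['gini', 'entropy'],
    'regressor__splitter': ['best', 'random'],
    'regressor__max_depth': [6, 8, 10, 12, 14, 16],
    'regressor__min_samples_leaf': [6, 8, 10, 12],
    'regressor__max_leaf_nodes': [14, 16, 18, 20, 22, 24, 26, 28],
    'preprocessor__numeric__imputer__strategy': ['mean', 'median', 'most_frequent'],
}

# One candidate per combination of values: 2 * 2 * 6 * 4 * 8 * 3
n_candidates = prod(len(values) for values in param_grid.values())

# Each candidate is fitted once per fold; the log reports 5 folds
n_folds = 5
n_fits = n_candidates * n_folds

print("Candidates:", n_candidates)  # 2304
print("Total fits:", n_fits)        # 11520
```

Because the count grows multiplicatively, adding a single extra value to any list increases the total number of fits noticeably.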

In [14]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test,y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test,y_predict,average='weighted'))

Model score Accuracy: 0.852
Model score F1 Weighted: 0.841


**2.4** Describe the process you followed to achieve the results above. Your description should include, but is not limited to, the following:
- How you handled missing values, and why
- How you handled categorical variables, and why
- Any further preprocessing steps
- How you evaluated your model and how you chose its parameters
- Any additional results and comments on your approach

You should achieve at least 85% accuracy score and 84% f1 score.

YOUR ANSWER HERE

## 3.0 Common Issues ## 

**3.0** Run the following code to define a decision tree model and load the **income** dataset using only the numerical variables. Then, answer the following questions.

In [17]:
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values
# Classifier
classifier = DecisionTreeClassifier(min_samples_leaf=4)
classifier.fit(X_train,y_train)
# Predict on this section's test set; reusing y_predict from section 2 (string
# labels) against the 0/1 y_test raises a TypeError inside accuracy_score
y_predict = classifier.predict(X_test)
# Store the result under a new name so the accuracy_score function is not shadowed
accuracy = accuracy_score(y_test,y_predict)
print("Model score accuracy: %.3f" % accuracy)

**3.1** Evaluate the classifier using at least three evaluation metrics other than accuracy_score and f1 (weighted).

In [None]:
from sklearn.metrics import balanced_accuracy_score, average_precision_score, f1_score
y_predict = classifier.predict(X_test)

# BEGIN CODE HERE
metric1 = ...
metric2 = ...
metric3 = ...
#END CODE HERE

In [None]:
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)
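As an illustration of how metrics other than accuracy behave, the sketch below evaluates a hypothetical set of predictions on a small imbalanced label vector. The labels and predictions are made up for the example and are not results from the income data; the three metrics shown are only one possible choice.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
)

# Hypothetical imbalanced labels: 8 negatives, 2 positives
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Hypothetical predictions: one false positive and one false negative
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

toy_accuracy = accuracy_score(y_true, y_pred)
# Mean of the recall obtained on each class
toy_balanced_accuracy = balanced_accuracy_score(y_true, y_pred)
# Fraction of predicted positives that are truly positive
toy_precision = precision_score(y_true, y_pred)
# Fraction of true positives that were found
toy_recall = recall_score(y_true, y_pred)

print("Accuracy:          %.3f" % toy_accuracy)
print("Balanced accuracy: %.3f" % toy_balanced_accuracy)
print("Precision:         %.3f" % toy_precision)
print("Recall:            %.3f" % toy_recall)
```

In this toy case the accuracy is 0.8 while only half of the positive samples are recovered, which is the kind of gap that per-class metrics make visible.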

**3.2** Do you notice any problems with the classifier? If so, what can you do to fix them?

YOUR ANSWER HERE

**3.3** Implement your solution using the cells below. Report your results and the process you followed. You are recommended to use stratification and grid search. You only need to improve the metrics you calculated above by a small margin, while also reaching an accuracy score higher than 82%!
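A minimal sketch of how stratification and grid search could be combined is shown below. It runs on synthetic data from `make_classification`, not on the income dataset, and the parameter grid, the `class_weight` option, the scoring metric and `SEED` are all assumptions chosen for illustration rather than a prescribed solution.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # assumption: plays the role of RANDOM_VARIABLE

# Synthetic imbalanced data (roughly 75% / 25%), standing in for a real dataset
X, y = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    weights=[0.75, 0.25], random_state=SEED,
)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=SEED
)

# Hypothetical grid; class_weight='balanced' reweights classes inversely to frequency
param_grid = {
    "max_depth": [4, 8, 12],
    "min_samples_leaf": [4, 8, 16],
    "class_weight": [None, "balanced"],
}

# Stratified folds preserve the class proportions inside every validation fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=SEED),
    param_grid,
    scoring="balanced_accuracy",
    cv=cv,
)
search.fit(X_train, y_train)

test_balanced_accuracy = balanced_accuracy_score(y_test, search.predict(X_test))
print("Best params:", search.best_params_)
print("Test balanced accuracy: %.3f" % test_balanced_accuracy)
```

Scoring the search with a metric that accounts for both classes, rather than plain accuracy, steers the selection toward models that do not simply favor the majority class.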

In [None]:
# BEGIN CODE HERE
final_score = ""

#END CODE HERE

In [None]:
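# Call the metric on the current predictions instead of formatting the function object
print("Model score accuracy: %.3f" % accuracy_score(y_test, y_predict))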
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

YOUR ANSWER HERE