## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Make sure you fill in any place that says `# BEGIN CODE HERE #END CODE HERE`. After writing your code, you can run the cell by either pressing "SHIFT"+"ENTER" or by clicking on "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All). 

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [26]:
NAME = "Theodoros Ioannidis"
AEM = "3626"

---

# Assignment 2 - Decision Trees #

Welcome to your second assignment. This exercise introduces you to [scikit-learn](https://scikit-learn.org/stable/), a simple but efficient machine learning library for Python. It also gives you a broad understanding of how decision trees work.

After this assignment you will:
- Be able to use the scikit-learn library and train your own model from scratch.
- Be able to train and understand decision trees.

In [27]:
# Always run this cell
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# USE THIS RANDOM VARIABLE TO PRODUCE THE SAME RESULTS
RANDOM_VARIABLE = 42

## 1. Scikit-Learn and Decision Trees ##

You are going to use the scikit-learn library to train a model for detecting breast cancer using the [Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) (+ [Additional information](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-dataset)) by training a model using [decision trees](https://scikit-learn.org/stable/modules/tree.html).

**1.1** Load the breast cancer dataset using the scikit-learn library and split it into train and test sets using the appropriate function. Use 33% of the dataset as the test set. Define X as the attributes and y as the target values. Do not forget to set the random_state parameter to the *RANDOM_VARIABLE* defined above. Use this variable for all the random_state parameters in this assignment.

In [28]:
# BEGIN CODE HERE
breast_cancer = load_breast_cancer()
X, y = breast_cancer.data, breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=RANDOM_VARIABLE)

#END CODE HERE

In [29]:
print("Size of train set:{}".format(len(y_train)))
print("Size of test set:{}".format(len(y_test)))
print("Unique classes:{}".format(len(set(y_test))))

Size of train set:381
Size of test set:188
Unique classes:2


**Expected output**:  

```
Size of train set:381  
Size of test set:188  
Unique classes:2
```
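
As a quick sanity check on these sizes, the arithmetic behind a 33% split can be reproduced by hand. The sketch below assumes the test share is rounded up and the remainder goes to training, which matches the counts above; `split_sizes` is a hypothetical helper written for illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Hypothetical helper: turn a fractional test_size into sample counts."""
    n_test = math.ceil(test_size * n_samples)  # assumption: test share is rounded up
    n_train = n_samples - n_test               # the remainder is used for training
    return n_train, n_test

# 381 + 188 = 569 samples in total, as reported above
n_train, n_test = split_sizes(381 + 188, 0.33)
print(n_train, n_test)
```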



**1.2** Train two DecisionTree classifiers and report the F1 score. Use information gain for one classifier and Gini impurity for the other.

In [30]:
# BEGIN CODE HERE
classifier_gini = DecisionTreeClassifier(criterion='gini', random_state=RANDOM_VARIABLE)
classifier_igain = DecisionTreeClassifier(criterion='entropy',random_state=RANDOM_VARIABLE)

classifier_gini.fit(X_train, y_train)
classifier_igain.fit(X_train, y_train)

prediction_gini = classifier_gini.predict(X_test)
prediction_igain = classifier_igain.predict(X_test)

f_measure_gini = f1_score(y_test, prediction_gini)
f_measure_igain = f1_score(y_test, prediction_igain)

#END CODE HERE

In [31]:
print("F-Measure Gini: {}".format(f_measure_gini))
print("F-Measure Information Gain: {}".format(f_measure_igain))

F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386


**Expected output**:  

```
F-Measure Gini: 0.9372384937238494
F-Measure Information Gain: 0.9596774193548386
```
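
For reference, the two split criteria compared in 1.2 can be computed by hand on a tiny label list. This is a minimal sketch in plain Python; the function names are illustrative and are not part of scikit-learn, where `criterion='entropy'` is the setting used above for information gain.

```python
import math
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = [0, 0, 1, 1]                             # perfectly mixed node
print(gini_impurity(parent))                      # 0.5
print(entropy(parent))                            # 1.0
print(information_gain(parent, [0, 0], [1, 1]))   # 1.0, the split separates the classes
```

Both measures are zero for a pure node and largest for an evenly mixed one, so the two criteria often, but not always, prefer the same split.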

**1.3** Find the maximum depth reached by the tree that used the Gini impurity. Train multiple classifiers by modifying the max_depth within the range from 1 to maximum depth and save the f1 scores to the corresponding list of the *fscores* dictionary (one list for training set and one for test set). Before appending the scores to the corresponding list, multiply them by 100, and round the values to 2 decimals.

In [32]:
# BEGIN CODE HERE
depth = classifier_gini.get_depth()
fscores = {}
fscores['train'] = []
fscores['test'] =  []

for i in range(1,depth+1):
    tree = DecisionTreeClassifier(max_depth=i, random_state=RANDOM_VARIABLE)
    tree.fit(X_train, y_train)
    fscores['train'].append(round(f1_score(y_train,tree.predict(X_train))*100,2))
    fscores['test'].append(round(f1_score(y_test,tree.predict(X_test))*100,2))
#END CODE HERE

In [33]:
print("Fscores Train: {}".format(fscores['train']))
print("Fscores Test:  {}".format(fscores['test']))


Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]


**Expected output**:  
```
Fscores Train: [94.24, 95.46, 97.65, 99.15, 99.37, 99.58, 100.0]
Fscores Test:  [91.14, 93.97, 96.64, 94.12, 95.4, 95.04, 93.72]
```

**1.4** Compare the results from the train set with the results from the test set. What do you notice? How are you going to choose the max_depth of your model?

YOUR ANSWER HERE

On the training set the F1 score inevitably reaches 100, since this is exactly the data the model was trained on. The score rises quickly at first, but beyond a maximum depth of 4 the improvement slows down noticeably. The test set behaves similarly: the highest score (96.64) occurs at a maximum depth of 3, after which the scores fluctuate without ever reaching that level again. This is a clear case of overfitting, which produces a model that is rigid on new data and quite brittle. A sound way to address it is to try several maximum depths and keep the one that gives the best results after evaluating on several different held-out sets.
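
One way to compare depths on several held-out sets without touching the test set is k-fold cross-validation on the training data. The sketch below is an illustration under stated assumptions rather than the procedure used in this notebook: it assumes 5 folds, F1 as the selection score, and the depth range 1 to 7 implied by the seven scores reported in 1.3.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

RANDOM_VARIABLE = 42  # same seed as the rest of the notebook

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=RANDOM_VARIABLE
)

# Mean 5-fold F1 for each candidate depth, computed on the training data only.
cv_scores = {}
for depth in range(1, 8):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=RANDOM_VARIABLE)
    cv_scores[depth] = cross_val_score(tree, X_train, y_train, cv=5, scoring="f1").mean()

# Keep the depth with the highest mean validation score.
best_depth = max(cv_scores, key=cv_scores.get)
print(best_depth, round(cv_scores[best_depth], 4))
```

The test set is then used once, at the end, to report the score of the selected depth.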

## 2.0 Pipelines ##

**2.1** In this part of the exercise you are going to build a pipeline from scratch for a classification problem. Load the **income.csv** file and train a DecisionTree model that will predict the *income* variable. This dataset is a modification of the original Adult Income dataset found [here](http://archive.ics.uci.edu/ml/datasets/Adult). Report the f1-score and accuracy score of the test set found in **income_test.csv**. Your pipeline should be able to handle missing values and categorical features (scikit-learn's decision trees do not handle categorical values). You can preprocess the dataset as you like in order to achieve higher scores.  

In [34]:
# BEGIN CODE HERE

data = pd.read_csv('income.csv')
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
train_set = data.drop(['income'],axis=1)
y_train = data['income'].values
# any other code you need

data_test = pd.read_csv('income_test.csv')
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
test_set = data_test.drop(['income'],axis=1)
y_test = data_test['income'].values
# any other code you need
# End CODE HERE

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**2.2** Create and test your pipeline

In [35]:
#Your pipeline
numeric_features = ['age', 'fnlwgt', 'education_num', 'capital-gain', 'capital-loss', 'hours-per-week']
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean"))]
)

categorical_features = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex']
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('one_hot_cat', categorical_transformer, categorical_features),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())]
)
clf.fit(train_set, y_train)
y_predict =  clf.predict(test_set)

In [36]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test, y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test, y_predict,average='weighted'))

Model score Accuracy: 0.809
Model score F1 Weighted: 0.810


**2.3** Perform a good grid search to find the best parameters for your pipeline.

In [37]:
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 5, 10],
    "classifier__criterion": ["gini","entropy"],
    "classifier__max_features": [0.25, 0.5, 0.75, None],
    "classifier__min_samples_leaf": [1,10,20,50],
}

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(train_set, y_train)
y_predict =  grid_search.predict(test_set)

print("Best params:")
print(grid_search.best_params_)

Best params:
{'classifier__criterion': 'gini', 'classifier__max_depth': 10, 'classifier__max_features': None, 'classifier__min_samples_leaf': 10, 'preprocessor__num__imputer__strategy': 'mean'}
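
The double-underscore names in the grid address parameters of nested steps: step name, then inner step name, then parameter. The sketch below rebuilds a reduced version of the pipeline (one numeric and one categorical column, chosen only for illustration) and checks that those names resolve through `get_params` and `set_params`.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Reduced stand-in for the pipeline of 2.2; the column lists are illustrative.
numeric_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="mean"))])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, ["age"]),
        ("one_hot_cat", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())]
)

# "preprocessor" -> "num" -> "imputer" -> "strategy"
params = clf.get_params()
print(params["preprocessor__num__imputer__strategy"])
print(params["classifier__max_depth"])

# The same naming scheme is what a grid search uses to set each candidate value.
clf.set_params(classifier__max_depth=5)
```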


In [38]:
print("Model score Accuracy: %.3f" % accuracy_score(y_test,y_predict))
print("Model score F1 Weighted: %.3f" % f1_score(y_test,y_predict,average='weighted'))

Model score Accuracy: 0.856
Model score F1 Weighted: 0.849
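
To get a feel for the cost of this search, the number of candidates can be counted directly from the grid. The sketch below copies the `param_grid` from 2.3 and assumes the same `cv=5`; it only counts combinations and does not train anything.

```python
from itertools import product

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 5, 10],
    "classifier__criterion": ["gini", "entropy"],
    "classifier__max_features": [0.25, 0.5, 0.75, None],
    "classifier__min_samples_leaf": [1, 10, 20, 50],
}

# Every combination of one value per parameter is one candidate model.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)   # 2 * 3 * 2 * 4 * 4
n_fits = n_candidates * 5        # each candidate is fitted once per fold (cv=5)
print(n_candidates, n_fits)      # 192 960
```

Adding one more value to any parameter multiplies the total, so it is usually worth widening the grid only for the parameters that matter most.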


**2.4** Describe the process you followed to achieve the results above. Your description should include, but is not limited to, the following:
- How do you handle missing values and why
- How do you handle categorical variables and why
- Any further preprocessing steps
- How do you evaluate your model and how did you choose its parameters 
- Report any additional results and comments on your approach.

You should achieve at least 85% accuracy score and 84% f1 score.

YOUR ANSWER HERE

I first split the training and test data into X and y (the target variable). I then separated the features into numerical and categorical ones, according to the type of their values. For the numerical features, a missing value in a sample is replaced by the mean of that feature, while for the categorical features unknown categories are ignored, since in practice they correspond to errors in the data or to incomplete information. I then tried several combinations of features and finally settled on using all the features of the problem, since this proved to be the best choice.
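
The two preprocessing choices described above can be seen in isolation on toy data. This is a minimal sketch with made-up values: the ages and the category strings are placeholders and are not taken from income.csv. Note that `handle_unknown="ignore"` concerns categories not seen during fitting; genuinely missing categorical entries would likely need their own step, for example a `SimpleImputer(strategy="most_frequent")`, which is a suggestion and not something the pipeline above does.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Mean imputation: the missing age is replaced by the mean of the observed ones.
ages = np.array([[20.0], [np.nan], [40.0]])
imputer = SimpleImputer(strategy="mean")
ages_filled = imputer.fit_transform(ages)
print(ages_filled.ravel())          # the nan becomes 30.0

# One-hot encoding: a category unseen during fit is encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(np.array([["Private"], ["State-gov"]], dtype=object))
known = encoder.transform(np.array([["Private"]], dtype=object)).toarray()
unseen = encoder.transform(np.array([["Never-worked"]], dtype=object)).toarray()
print(known)                        # [[1. 0.]]
print(unseen)                       # [[0. 0.]]
```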

## 3.0 Common Issues ## 

**3.0** Run the following code to define a decision tree model and load the **income** dataset with only the numerical variables. Then, answer the following questions.

In [39]:
# Load Data
columns = ['age','fnlwgt','education_num','hours-per-week',"capital-loss","capital-gain","income"]
data = pd.read_csv('income.csv',usecols=columns)
data_test = pd.read_csv('income_test.csv',usecols=columns)
# Convert target variable to 0 and 1
data["income"] = data["income"].map({ "<=50K": 0, ">50K": 1 })
data_test["income"] = data_test["income"].map({ "<=50K": 0, ">50K": 1 })
# Create X and y
X_train = data.drop(["income"],axis=1)
y_train = data['income'].values
X_test = data_test.drop(["income"],axis=1)
y_test = data_test['income'].values
# Classifier
classifier = DecisionTreeClassifier(min_samples_leaf=4)
classifier.fit(X_train,y_train)
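# Note: y_predict is not recomputed in this cell, so it still holds the
# grid-search predictions from section 2.3 and does not describe `classifier`.
accuracy = accuracy_score(y_test,y_predict)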
print("Model score accuracy: %.3f" % accuracy)

Model score accuracy: 0.856


**3.1** Evaluate the classifier using at least three evaluation metrics other than accuracy_score and f1 (weighted).

In [40]:
from sklearn.metrics import balanced_accuracy_score, average_precision_score, f1_score
y_predict = classifier.predict(X_test)

# BEGIN CODE HERE
metric1 = balanced_accuracy_score(y_test, y_predict)
metric2 = average_precision_score(y_test, y_predict)
metric3 = f1_score(y_test, y_predict,average='micro')
#END CODE HERE

In [41]:
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Model score Metric 1: 0.687
Model score Metric 2: 0.413
Model score Metric 3: 0.790
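
To see what these metrics capture, they can be computed by hand on a small imbalanced example. The sketch below uses plain Python with illustrative function names (not the scikit-learn implementations) and made-up labels: eight negatives, two positives, and a prediction that finds only one of the positives.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_for(label, y_true, y_pred):
    """Share of the samples of one class that were predicted as that class."""
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / sum(t == label for t in y_true)

def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    labels = sorted(set(y_true))
    return sum(recall_for(label, y_true, y_pred) for label in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """F1 from true/false positives and false negatives pooled over all classes."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for label in labels for t, p in pairs)
    fp = sum(p == label and t != label for label in labels for t, p in pairs)
    fn = sum(t == label and p != label for label in labels for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)

y_true = [0] * 8 + [1] * 2
y_pred = [0] * 9 + [1]      # only one of the two positives is found

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
print(micro_f1(y_true, y_pred))           # 0.9, identical to accuracy
```

Accuracy looks high because the majority class dominates, while balanced accuracy exposes the weak recall on the minority class. For single-label predictions micro-averaged F1 coincides with accuracy, which suggests that Metric 3 above (0.790) is the accuracy of the classifier fitted in 3.0.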


**3.2** Do you notice any problems with the classifier? If so, what can you do to fix them?

YOUR ANSWER HERE

The problem was that both the pipeline and the grid search that produced y_predict used all the features of income.csv, not only the numerical ones as the classifier here does. This could be addressed by fitting the pipeline and the grid search on the numerical features only.

**3.3** Implement your solution using the cells below. Report your results and the process you followed. You are recommended to use stratification and grid search. You only need to improve the metrics you calculated above slightly, and also reach an accuracy score higher than 82%!

In [42]:
# BEGIN CODE HERE
numeric_features = ['age', 'fnlwgt', 'education_num', 'capital-gain', 'capital-loss', 'hours-per-week']
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean"))]
)

param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__max_depth": [2, 5, 10],
    "classifier__criterion": ["gini","entropy"],
    "classifier__max_features": [0.25, 0.5, 0.75, None],
    "classifier__min_samples_leaf": [1,10,20,50],
}
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)
    ]
)
clf = Pipeline(
    steps=[("preprocessor", preprocessor), ("classifier", DecisionTreeClassifier())]
)

grid_search = GridSearchCV(clf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
y_predict =  grid_search.predict(X_test)
accuracy = accuracy_score(y_test, y_predict)
metric1 = balanced_accuracy_score(y_test, y_predict)
metric2 = average_precision_score(y_test, y_predict)
metric3 = f1_score(y_test, y_predict,average='micro')

#END CODE HERE

In [43]:
print("Model score accuracy: %.3f" % accuracy)
print("Model score Metric 1: %.3f" % metric1)
print("Model score Metric 2: %.3f" % metric2)
print("Model score Metric 3: %.3f" % metric3)

Model score accuracy: 0.827
Model score Metric 1: 0.695
Model score Metric 2: 0.470
Model score Metric 3: 0.827


YOUR ANSWER HERE

Above, the necessary changes were made so that only the numerical features are taken into account, as the classifier did. As we can see, this change improved all the metrics used earlier, and the accuracy also improved, exceeding 82%.
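
Regarding the stratification recommended in 3.3: as far as I understand, `GridSearchCV` with an integer `cv` and a classifier already uses stratified folds, so the search above is likely stratified implicitly. The sketch below shows what stratification does on made-up labels (eight negatives, two positives); the fold count and seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy imbalanced target: 80% class 0, 20% class 1.
y = np.array([0] * 8 + [1] * 2)
X = np.zeros((len(y), 1))   # features are irrelevant for the split itself

# Each fold keeps roughly the class ratio of the full data set.
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in skf.split(X, y)]
print(positives_per_fold)   # [1, 1]
```

Without stratification, a fold could end up with no positive samples at all, which would make per-fold scores on the minority class unreliable.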