## About iPython Notebooks ##

iPython Notebooks are interactive coding environments embedded in a webpage. You will be using iPython notebooks in this class. Fill in your code wherever you see the markers `# BEGIN CODE HERE` and `#END CODE HERE`. After writing your code, run the cell by pressing "SHIFT"+"ENTER" or by clicking "Run" (denoted by a play symbol). Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

 **What you need to remember:**

- Run your cells using SHIFT+ENTER (or "Run cell")
- Write code in the designated areas using Python 3 only
- Do not modify the code outside of the designated areas
- In some cases you will also need to explain the results. There will also be designated areas for that. 

Fill in your **NAME** and **AEM** below:

In [1]:
NAME = "Sotirios Loukas Kampylis"
AEM = "3805"

---

# Assignment 3 - Ensemble Methods #

Welcome to your third assignment. This exercise will test your understanding of ensemble methods.

In [2]:
# Always run this cell
import numpy as np
import pandas as pd

# USE THE FOLLOWING RANDOM STATE FOR YOUR CODE
RANDOM_STATE = 42

## Download the Dataset ##
Download the dataset using the following cell or from this [link](https://github.com/sakrifor/public/tree/master/machine_learning_course/EnsembleDataset) and put the files in the same folder as the .ipynb file. 
In this assignment you will work with a dataset derived from [ImageCLEFmed: The Medical Task 2016](https://www.imageclef.org/2016/medical) and its **Compound figure detection** subtask. The goal of this subtask is to identify whether a figure is a compound figure (a single image made up of more than one figure) or not. The training set consists of 4197 examples (figures), each with 4096 features extracted by a deep neural network. The *CLASS* column holds the class of each example: 1 for a compound figure and 0 otherwise. 


In [3]:
import urllib.request
url_train = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/train_set.csv'
filename_train = 'train_set.csv'
urllib.request.urlretrieve(url_train, filename_train)
url_test = 'https://github.com/sakrifor/public/raw/master/machine_learning_course/EnsembleDataset/test_set_noclass.csv'
filename_test = 'test_set_noclass.csv'
urllib.request.urlretrieve(url_test, filename_test)

('test_set_noclass.csv', <http.client.HTTPMessage at 0x23e23678490>)

In [4]:
# Run this cell to load the data
train_set = pd.read_csv("train_set.csv").sample(frac=1).reset_index(drop=True)
train_set.head()
X = train_set.drop(columns=['CLASS'])
y = train_set['CLASS'].values

## 1.0 Testing different ensemble methods ##
In this part of the assignment you are asked to create and test different ensemble methods using the train_set.csv dataset. Use **10-fold cross validation** for your tests and report the average weighted F-measure and the average balanced accuracy of your models. You can use [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html#sklearn.model_selection.cross_validate) and select both metrics to be measured during the evaluation. Alternatively, you can use [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold).
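
As a rough illustration of the KFold alternative mentioned above, the sketch below runs a manual 10-fold loop and averages both metrics. The synthetic data from `make_classification`, the `X_demo`/`y_demo` names, and the small decision tree are assumptions made for the example; they stand in for the assignment data and models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the ImageCLEF dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

kfold = KFold(n_splits=10, shuffle=True, random_state=42)
f1_scores, bal_acc_scores = [], []

for train_idx, test_idx in kfold.split(X_demo):
    # Fit a fresh model on the training folds only
    model = DecisionTreeClassifier(max_depth=5, random_state=42)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    y_pred = model.predict(X_demo[test_idx])
    # Both metrics take the ground truth first, then the predictions
    f1_scores.append(f1_score(y_demo[test_idx], y_pred, average="weighted"))
    bal_acc_scores.append(balanced_accuracy_score(y_demo[test_idx], y_pred))

avg_f1 = float(np.mean(f1_scores))
avg_bal_acc = float(np.mean(bal_acc_scores))
print(f"F1 weighted: {avg_f1:.4f}, balanced accuracy: {avg_bal_acc:.4f}")
```

Note that `cross_validate` with an integer `cv` uses stratified folds for classifiers, so a plain `KFold` loop like this one may give slightly different numbers.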

### !!! Use n_jobs=-1 wherever possible to use all the cores of your machine when running your tests ###

### 1.1 Voting ###
Create a voting classifier which uses three **simple** estimators/classifiers. Test both soft and hard voting and choose the best one. Consider as simple estimators the following:


*   Decision Trees
*   Linear Models
*   Probabilistic Models (Naive Bayes)
*   KNN Models  
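
To make the difference between hard and soft voting concrete, here is a small NumPy sketch. The probability values are invented for illustration, and the 0.5 threshold assumes a binary problem like the one in this assignment; it is not how `VotingClassifier` is implemented internally, only the idea behind it.

```python
import numpy as np

# Hypothetical class-1 probabilities: rows are samples, columns are three classifiers
proba_class1 = np.array([
    [0.90, 0.40, 0.45],
    [0.20, 0.30, 0.10],
    [0.55, 0.60, 0.10],
    [0.45, 0.45, 0.95],
])

# Hard voting: each classifier casts one vote, the majority wins
votes = (proba_class1 >= 0.5).astype(int)
hard_pred = (votes.sum(axis=1) >= 2).astype(int)

# Soft voting: average the probabilities first, then pick the likelier class
soft_pred = (proba_class1.mean(axis=1) >= 0.5).astype(int)

print("hard:", hard_pred.tolist())
print("soft:", soft_pred.tolist())
```

The two schemes disagree whenever one confident classifier outweighs two hesitant ones, which is why it is worth testing both.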

In [5]:
# BEGIN CODE HERE
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_validate

cls1 = LogisticRegression(random_state=RANDOM_STATE) # Classifier #1
cls2 = DecisionTreeClassifier(max_depth=10,min_samples_split=25,random_state=RANDOM_STATE) # Classifier #2
cls3 = KNeighborsClassifier(n_neighbors=20) # Classifier #3
soft_vcls = VotingClassifier(estimators=[('lr',cls1),('dt',cls2),('knn',cls3)],voting='soft',n_jobs=-1) # Voting Classifier
hard_vcls = VotingClassifier(estimators=[('lr',cls1),('dt',cls2),('knn',cls3)],voting='hard',n_jobs=-1) # Voting Classifier

svlcs_scores = cross_validate(soft_vcls,X,y,cv=10,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)
s_avg_fmeasure = np.mean(svlcs_scores['test_f1_weighted']) # The average f-measure
s_avg_accuracy = np.mean(svlcs_scores['test_balanced_accuracy']) # The average accuracy

hvlcs_scores = cross_validate(hard_vcls,X,y,cv=10,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)
h_avg_fmeasure = np.mean(hvlcs_scores['test_f1_weighted']) # The average f-measure
h_avg_accuracy = np.mean(hvlcs_scores['test_balanced_accuracy']) # The average accuracy
#END CODE HERE

In [6]:
print("Classifier:")
print(soft_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(s_avg_fmeasure,4), round(s_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('dt',
                              DecisionTreeClassifier(max_depth=10,
                                                     min_samples_split=25,
                                                     random_state=42)),
                             ('knn', KNeighborsClassifier(n_neighbors=20))],
                 n_jobs=-1, voting='soft')
F1 Weighted-Score: 0.8308 & Balanced Accuracy: 0.822


You should achieve above 82% (Soft Voting Classifier)

In [7]:
print("Classifier:")
print(hard_vcls)
print("F1 Weighted-Score: {} & Balanced Accuracy: {}".format(round(h_avg_fmeasure,4), round(h_avg_accuracy,4)))

Classifier:
VotingClassifier(estimators=[('lr', LogisticRegression(random_state=42)),
                             ('dt',
                              DecisionTreeClassifier(max_depth=10,
                                                     min_samples_split=25,
                                                     random_state=42)),
                             ('knn', KNeighborsClassifier(n_neighbors=20))],
                 n_jobs=-1)
F1 Weighted-Score: 0.8227 & Balanced Accuracy: 0.8139


You should achieve above 80% in both! (Hard Voting Classifier)

### 1.2 Stacking ###
Create a stacking classifier which uses two more complex estimators. Try different simple classifiers (like the ones mentioned before) for the combination of the initial estimators. Report your results in the following cell.

Consider as complex estimators the following:

*   Random Forest
*   SVM
*   Gradient Boosting
*   MLP
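
A minimal stacking sketch, assuming synthetic data and an arbitrary choice of two complex base estimators with a simple linear model on top. The names `X_demo`, `y_demo`, and `stack` and all hyperparameters are illustrative and are not the settings used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the ImageCLEF dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Two complex base estimators whose outputs feed the final estimator
base_estimators = [
    ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
    ("svm", SVC(random_state=42)),
]

# A simple classifier learns how to combine the base predictions
stack = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,  # internal folds used to build the final estimator's training data
)

stack_scores = cross_validate(
    stack, X_demo, y_demo, cv=3, scoring=("f1_weighted", "balanced_accuracy")
)
print(np.mean(stack_scores["test_f1_weighted"]),
      np.mean(stack_scores["test_balanced_accuracy"]))
```

Swapping `final_estimator` for another simple model (for example a shallow decision tree or Naive Bayes) is one way to try the different combinations the task asks for.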




In [8]:
# BEGIN CODE HERE
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,StackingClassifier
from sklearn.neural_network import MLPClassifier

cls1 = RandomForestClassifier(n_estimators=50,max_depth=15,max_features=0.8,n_jobs=-1,random_state=RANDOM_STATE) # Classifier #1
cls2 = GradientBoostingClassifier(n_estimators=50,random_state=RANDOM_STATE) # Classifier #2
cls3 = MLPClassifier(hidden_layer_sizes=(100,100,100,100),random_state=RANDOM_STATE) # Classifier #3 (Optional)
scls = StackingClassifier(estimators=[('rf',cls1),('gb',cls2),('mlp',cls3)],n_jobs=-1) # Stacking Classifier
scores = cross_validate(scls,X,y,cv=10,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)
avg_fmeasure = np.mean(scores['test_f1_weighted']) # The average f-measure
avg_accuracy = np.mean(scores['test_balanced_accuracy']) # The average accuracy
#END CODE HERE

In [9]:
print("Classifier:")
print(scls)
print("F1 Weighted Score: {} & Balanced Accuracy: {}".format(round(avg_fmeasure,4), round(avg_accuracy,4)))

Classifier:
StackingClassifier(estimators=[('rf',
                                RandomForestClassifier(max_depth=15,
                                                       max_features=0.8,
                                                       n_estimators=50,
                                                       n_jobs=-1,
                                                       random_state=42)),
                               ('gb',
                                GradientBoostingClassifier(n_estimators=50,
                                                           random_state=42)),
                               ('mlp',
                                MLPClassifier(hidden_layer_sizes=(100, 100, 100,
                                                                  100),
                                              random_state=42))],
                   n_jobs=-1)
F1 Weighted Score: 0.8508 & Balanced Accuracy: 0.8447


You should achieve above 85% in both

## 2.0 Randomization ##

**2.1** You are asked to create three ensembles of decision trees where each one uses a different method for producing homogeneous ensembles. Compare them with a simple decision tree classifier and report your results in the dictionaries (dict) below using as key the given name of your classifier and as value the f1_weighted/balanced_accuracy score. The dictionaries should contain four different elements.  

In [29]:
# BEGIN CODE HERE
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier,AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score,balanced_accuracy_score

# StratifiedKFold
cv = StratifiedKFold(n_splits=10,shuffle=True,random_state=RANDOM_STATE)
# Bagging Classifier
ens1 = BaggingClassifier(DecisionTreeClassifier(random_state=RANDOM_STATE),n_estimators=50,n_jobs=-1,random_state=RANDOM_STATE)
# Random Forest Classifier
ens2 = RandomForestClassifier(n_estimators=50,n_jobs=-1,random_state=RANDOM_STATE)
# AdaBoost Classifier
ens3 = AdaBoostClassifier(n_estimators=50,random_state=RANDOM_STATE)
# Decision Tree Classifier
tree = DecisionTreeClassifier(random_state=RANDOM_STATE)

score1 = cross_validate(ens1,X,y,cv=cv,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)
score2 = cross_validate(ens2,X,y,cv=cv,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)
score3 = cross_validate(ens3,X,y,cv=cv,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)
score4 = cross_validate(tree,X,y,cv=cv,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)

f_measures = dict({'Simple Decision':score4['test_f1_weighted'].mean(),
                   'Ensemble with bagging classifier':score1['test_f1_weighted'].mean(),
                   'Ensemble with random forest classifier':score2['test_f1_weighted'].mean(),
                   'Ensemble with ada boost classifier':score3['test_f1_weighted'].mean()})
accuracies = dict({'Simple Decision':score4['test_balanced_accuracy'].mean(),
                   'Ensemble with bagging classifier':score1['test_balanced_accuracy'].mean(),
                   'Ensemble with random forest classifier':score2['test_balanced_accuracy'].mean(),
                   'Ensemble with ada boost classifier':score3['test_balanced_accuracy'].mean()})
# Example f_measures = {'Simple Decision': 0.8551, 'Ensemble with random ...': 0.92, ...}

#END CODE HERE

In [30]:
print(ens1)
print(ens2)
print(ens3)
print(tree)
for name,score in f_measures.items():
    print("Classifier:{} -  F1 Weighted:{}".format(name,round(score,4)))
for name,score in accuracies.items():
    print("Classifier:{} -  BalancedAccuracy:{}".format(name,round(score,4)))

BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=42),
                  n_estimators=50, n_jobs=-1, random_state=42)
RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=42)
AdaBoostClassifier(random_state=42)
DecisionTreeClassifier(random_state=42)
Classifier:Simple Decision -  F1 Weighted:0.7026
Classifier:Ensemble with bagging classifier -  F1 Weighted:0.8064
Classifier:Ensemble with random forest classifier -  F1 Weighted:0.8035
Classifier:Ensemble with ada boost classifier -  F1 Weighted:0.7869
Classifier:Simple Decision -  BalancedAccuracy:0.6916
Classifier:Ensemble with bagging classifier -  BalancedAccuracy:0.7941
Classifier:Ensemble with random forest classifier -  BalancedAccuracy:0.79
Classifier:Ensemble with ada boost classifier -  BalancedAccuracy:0.7774


**2.2** Describe your classifiers and your results.

1) Bagging Classifier: Bagging is the process of drawing different samples, with replacement, from the original data. The Bagging Classifier is an ensemble method that trains many classifiers (here, DecisionTreeClassifier), each on a different random sample produced by this procedure. This randomizes how the trees are built. To make a final prediction, it aggregates the predictions of the individual classifiers (voting or averaging).
 Bagging Classifier results:
-Ensemble with bagging classifier -  F1 Weighted:0.8064
-Ensemble with bagging classifier -  BalancedAccuracy:0.7941

2) Random Forest Classifier: Random Forest also relies on the random bagging samples described above. The main difference from the BaggingClassifier lies in how splits are chosen. When the BaggingClassifier selects a split point, it can consider all features and pick the optimal split. Random Forest, in contrast, chooses each split from among m randomly selected features. This produces different sub-models (trees with different splits), and combining their predictions may give better results.
 Random Forest Classifier results:
-Ensemble with random forest classifier -  F1 Weighted:0.8035
-Ensemble with random forest classifier -  BalancedAccuracy:0.79

3) AdaBoost Classifier: The third classifier uses boosting, in which models are trained sequentially and each model tries to correct the errors of the previous one. This is done by assigning a larger weight to the instances that the previous model misclassified. Specifically, we use the AdaBoost classifier with its default parameters, i.e. 50 estimators (decision trees with max_depth=1).
 AdaBoost Classifier results:
-Ensemble with ada boost classifier -  F1 Weighted:0.7869
-Ensemble with ada boost classifier -  BalancedAccuracy:0.7774

4) DecisionTreeClassifier: It is considered already known from earlier material, so it is not explained here.
 Decision Tree Classifier results:
-Simple decision tree -  F1 Weighted:0.7026
-Simple decision tree -  BalancedAccuracy:0.6916
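
The two sources of randomization described in the answers above can be sketched with NumPy alone. The sizes come from the dataset description; taking m as the square root of the number of features is an assumption that mirrors a common Random Forest default, and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_features = 4197, 4096  # sizes quoted in the dataset description

# Bagging: draw a bootstrap sample of row indices, with replacement
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(f"Distinct rows in one bootstrap sample: {unique_fraction:.1%}")

# Random forest: at each split, consider only m randomly chosen features
m = int(np.sqrt(n_features))  # assumed rule: m = sqrt(number of features)
candidate_features = rng.choice(n_features, size=m, replace=False)
print(f"Features considered at one split: {m} of {n_features}")
```

Each bootstrap sample leaves out roughly a third of the rows, so every tree sees a different view of the data; feature subsampling then decorrelates the trees further.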

**2.3** Increasing the number of estimators in a bagging classifier can drastically increase the training time of a classifier. Is there any solution to this problem? Can the same solution be applied to boosting classifiers?

 With 10 estimators, which is the default for the Bagging classifier:
-Training Time: about 34 sec.
 With 100 estimators, however:
-Training Time: about 335 sec.

We see that 100 estimators need roughly 10 times more training time than 10. However, with more computational power (for example, more CPU cores) we could train the individual classifiers in parallel and combine their results at the end, which saves time. This solution cannot be applied to boosting classifiers, because each classifier waits for the results of the previous one and is trained to focus on the points that were misclassified. This sequential process does not allow the individual classifiers to be trained in parallel.
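
A hedged sketch of the parallel-training idea: the same bagging ensemble is fitted once on a single core and once with `n_jobs=-1`. The synthetic data, the `fit_time` helper, and the ensemble size are assumptions made for the example; on data this small the process start-up overhead can outweigh the gain, so the timings are only indicative.

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the ImageCLEF dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=50, random_state=42)

def fit_time(n_jobs):
    """Fit a 20-tree bagging ensemble and return it with the elapsed seconds."""
    model = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=20,
        n_jobs=n_jobs,
        random_state=42,
    )
    start = time.perf_counter()
    model.fit(X_demo, y_demo)
    return model, time.perf_counter() - start

serial_model, serial_seconds = fit_time(1)       # one core
parallel_model, parallel_seconds = fit_time(-1)  # all available cores
print(f"n_jobs=1: {serial_seconds:.2f}s, n_jobs=-1: {parallel_seconds:.2f}s")
```

`AdaBoostClassifier` has no `n_jobs` parameter, which reflects the sequential dependency between its estimators.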

## 3.0 Creating the best classifier ##

**3.1** In this part of the assignment you are asked to train the best possible ensemble! Describe the process you followed to achieve this result. How did you choose your classifier and your parameters, and why? Report the weighted F-measure and balanced accuracy (10-fold cross validation) of your final classifier, along with the results of the other classifiers you tried, in the cell following the code. Can you achieve a balanced accuracy over 83-84%?

In [98]:
# BEGIN CODE HERE
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier,StackingClassifier
from sklearn.neural_network import MLPClassifier

cls1 = RandomForestClassifier(n_estimators=50,max_depth=15,max_features=0.8,n_jobs=-1,random_state=RANDOM_STATE) # Classifier #1
cls2 = GradientBoostingClassifier(n_estimators=50,random_state=RANDOM_STATE) # Classifier #2
cls3 = MLPClassifier(hidden_layer_sizes=(100,100,100,100),random_state=RANDOM_STATE) # Classifier #3
best_cls = StackingClassifier(estimators=[('rf',cls1),('gb',cls2),('mlp',cls3)],n_jobs=-1) # Stacking Classifier
scores = cross_validate(best_cls,X,y,cv=10,scoring=('f1_weighted','balanced_accuracy'),n_jobs=-1)

best_fmeasure = np.mean(scores['test_f1_weighted']) # The average f-measure
best_accuracy = np.mean(scores['test_balanced_accuracy']) # The average accuracy
#END CODE HERE

In [99]:
print("Classifier:")
#print(best_cls)
print("F1 Weighted-Score:{} & Balanced Accuracy:{}".format(best_fmeasure, best_accuracy))

Classifier:
F1 Weighted-Score:0.8546981557772426 & Balanced Accuracy:0.8481343826688681


**3.2** Describe the process you followed to achieve this result. How did you choose your classifier and your parameters, and why? Report the F-measure and accuracy (10-fold cross validation) of your final classifier, along with the results of the other classifiers you tried, in the cell following the code.

Initially, the same ensemble as in 1.2 was used, since it already achieved a score above the 83-84% range (as mentioned in the assignment statement).

The ensemble uses four classifiers: (a) RandomForestClassifier, (b) GradientBoostingClassifier, (c) MLPClassifier, and (d) a StackingClassifier that combines the first three. They were configured with one of the best parameter settings found (after a fair number of experiments) and combined to produce the strongest ensemble (which, according to you, must run in real time and finish at some point). The results it produces are: F1 Weighted-Score:0.8546981557772426 & Balanced Accuracy:0.8481343826688681. However, because the process takes a long time to run to completion, no other results from the preceding experiments were recorded.
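
One possible way to keep a record of every candidate tried during such experiments is to collect the cross-validation scores in a table. The candidates, the synthetic data, and the names `candidates` and `results` below are hypothetical and serve only to show the pattern.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: not the ImageCLEF dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Hypothetical candidates; any estimator could be added here
candidates = {
    "random_forest": RandomForestClassifier(n_estimators=20, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=20, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

rows = []
for name, model in candidates.items():
    scores = cross_validate(
        model, X_demo, y_demo, cv=5, scoring=("f1_weighted", "balanced_accuracy")
    )
    rows.append({
        "classifier": name,
        "f1_weighted": scores["test_f1_weighted"].mean(),
        "balanced_accuracy": scores["test_balanced_accuracy"].mean(),
    })

# Best candidate first
results = (
    pd.DataFrame(rows)
    .sort_values("balanced_accuracy", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

Saving `results` to a CSV file after each run would preserve the numbers even when a single experiment takes a long time.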

**3.3** Create a classifier that is going to be used in production - in a live system. Use the *test_set_noclass.csv* to make predictions. Store the predictions in a list.  

In [100]:
# BEGIN CODE HERE
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,VotingClassifier
from sklearn.linear_model import LogisticRegression

# test_set is loaded after this block, so it must not be referenced here

cls1 = RandomForestClassifier(random_state=RANDOM_STATE)
cls2 = AdaBoostClassifier(n_estimators=50,random_state=RANDOM_STATE)
cls3 = LogisticRegression(random_state=RANDOM_STATE,max_iter=1000)

cls = VotingClassifier(estimators=[('rf',cls1),('ab',cls2),('lg',cls3)],voting='soft',n_jobs=-1)
cls.fit(X,y)
#END CODE HERE
test_set = pd.read_csv("test_set_noclass.csv")
predictions = cls.predict(test_set)

I think the line `test_set = pd.read_csv("test_set_noclass.csv")` should come before `# BEGIN CODE HERE`, although I believe it will run correctly! However, its position was not changed because it lies outside the designated area (# BEGIN CODE HERE/#END CODE HERE)!

#### The following cell will not be executed. The test_set.csv with the classes will be made available after the deadline and this cell is for testing purposes!!! Do not modify it! ###

In [None]:
if False:
  from sklearn.metrics import f1_score, balanced_accuracy_score
  final_test_set = pd.read_csv('test_set.csv')
  ground_truth = final_test_set['CLASS']
  # scikit-learn metrics expect the ground truth first, then the predictions
  print("Balanced Accuracy: {}".format(balanced_accuracy_score(ground_truth, predictions)))
  print("F1 Weighted-Score: {}".format(f1_score(ground_truth, predictions, average='weighted')))

Both scores should be above 85%!
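
scikit-learn metric functions take the ground truth as the first argument and the predictions as the second. The toy labels below are invented to show that the order matters for balanced accuracy and weighted F1, since both depend on the class counts of whatever is treated as the truth.

```python
from sklearn.metrics import balanced_accuracy_score, f1_score

# Invented labels for illustration only
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

# Documented order: (y_true, y_pred)
correct_order = balanced_accuracy_score(y_true, y_pred)
# Swapped order treats the predictions as if they were the truth
swapped_order = balanced_accuracy_score(y_pred, y_true)

f1_correct = f1_score(y_true, y_pred, average="weighted")
f1_swapped = f1_score(y_pred, y_true, average="weighted")

print(f"balanced accuracy: correct={correct_order:.4f}, swapped={swapped_order:.4f}")
print(f"weighted F1:       correct={f1_correct:.4f}, swapped={f1_swapped:.4f}")
```

The difference is small on balanced predictions but can grow when the predicted class distribution drifts away from the true one.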