<a href="https://colab.research.google.com/github/sdgroeve/ML-course-VIB-2020/blob/master/Histone_marks_dt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Histone modifications

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

random_seed = 123
np.random.seed(random_seed)

# 1. Reading the data

In [None]:
import pandas as pd

train = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/ML-course-VIB-2020/master/data_train.csv")
test = pd.read_csv("https://raw.githubusercontent.com/sdgroeve/ML-course-VIB-2020/master/data_test.csv")

In [None]:
train_ids = train.pop("GeneId")
train_labels = train.pop("Label")

In [None]:
test_index_col = test.pop("GeneId")

# 2. Fitting a decision tree model

The scikit-learn `DecisionTreeClassifier` class computes a decision tree predictive model from a dataset. 

To get all the options for learning you can simply type: 

In [None]:
from sklearn.tree import DecisionTreeClassifier
help(DecisionTreeClassifier)

Notice that there are many (hyper)parameters to set. They influence the complexity of the model. An important one is `max_depth`, which limits how deep a decision tree can grow.

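As a small illustration of how `max_depth` caps model complexity, the sketch below fits trees on a synthetic dataset and reports the depth and number of leaves each tree actually reaches. The data comes from `make_classification` and is only a stand-in (an assumption, not the histone marks data); all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the histone marks dataset).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

depth_report = {}
for md in [1, 3, None]:  # None lets the tree grow until its leaves are pure
    demo_tree = DecisionTreeClassifier(max_depth=md, random_state=0).fit(X_demo, y_demo)
    # Record the depth and the number of leaves of the fitted tree.
    depth_report[md] = (demo_tree.get_depth(), demo_tree.get_n_leaves())
    print(md, depth_report[md])
```

A tree of depth `d` has at most `2**d` leaves, so a small `max_depth` forces a coarse partition of the feature space.
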
Let's create a decision tree model with `max_depth=3`:

In [None]:
cls = DecisionTreeClassifier(max_depth=3)

This creates a decision tree model with default values for the other hyperparameters:

In [None]:
cls

Let's create a validation set, fit the model and evaluate.

In [None]:
from sklearn.metrics import log_loss, accuracy_score
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(train,train_labels,
                                                  test_size=.2, random_state=random_seed)

cls.fit(train_X,train_y)

predictions_train = cls.predict(train_X)
predictions_val = cls.predict(val_X)

print("Accuracy: (%f) %f"%(accuracy_score(predictions_train, train_y),accuracy_score(predictions_val, val_y)))

predictions_train_prob = cls.predict_proba(train_X)
predictions_val_prob = cls.predict_proba(val_X)

print("Log-loss: (%f) %f"%(log_loss(train_y,predictions_train_prob[:,-1]),log_loss(val_y,predictions_val_prob[:,-1])))

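To make the log-loss metric used above more concrete, here is a minimal sketch that computes the binary log-loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made-up illustrative values, not results from the histone data.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of class 1 (illustrative only).
y_true = np.array([1, 0, 1, 1, 0])
p_pos = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

# Binary log-loss: the negative mean log-probability assigned to the true label.
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

print(manual, log_loss(y_true, p_pos))
```

Unlike accuracy, log-loss takes the confidence of a prediction into account: a confident wrong prediction is penalized much more heavily than a hesitant one.
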
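The following code writes the fitted decision tree `cls` to a `tree.png` file. It is wrapped in a string literal so that it does not run by default, since it requires the `pydotplus` package and Graphviz: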

In [None]:
"""
from sklearn import tree
from io import StringIO
from IPython.display import Image, display
import pydotplus

out = StringIO()
tree.export_graphviz(cls, out_file=out)
graph=pydotplus.graph_from_dot_data(out.getvalue())
graph.write_png("tree.png")
"""

How do other values for the `max_depth` hyperparameter perform?

In [None]:
import seaborn as sns

result = []
for md in range(1,10):
  cls = DecisionTreeClassifier(max_depth=md)
  cls.fit(train_X,train_y)
  predictions_train_prob = cls.predict_proba(train_X)
  predictions_val_prob = cls.predict_proba(val_X)
  result.append([md,log_loss(train_y,predictions_train_prob[:,-1]),"train"])
  result.append([md,log_loss(val_y,predictions_val_prob[:,-1]),"val"])

toplot = pd.DataFrame(result,columns=["max_depth","log-loss","set"])
sns.lmplot(x="max_depth",y="log-loss",hue="set",data=toplot,fit_reg=False)

In [None]:
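cls = DecisionTreeClassifier(max_depth=14)

# Hold out one fixed validation set so that every model below predicts the
# same samples; only then can the predictions be averaged per sample.
rest_X, val_X, rest_y, val_y = train_test_split(train,train_labels,
                                                test_size=.2, random_state=random_seed)

predictions_list = []
for i in range(10):
  # Each model is trained on a different random 80% of the remaining data.
  train_X, _, train_y, _ = train_test_split(rest_X,rest_y,
                                            test_size=.2, random_state=i)

  cls.fit(train_X,train_y)
  predictions_val = cls.predict(val_X)
  predictions_val_prob = cls.predict_proba(val_X)
  predictions_list.append(list(predictions_val_prob[:,-1]))
  print("%f %f"%(log_loss(val_y,predictions_val_prob[:,-1]),accuracy_score(val_y,predictions_val)))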

In [None]:
tmp = pd.DataFrame(predictions_list)
predictions_avg = tmp.mean(axis=0)
print("Avg. model: %f"%(log_loss(val_y,predictions_avg)))

In [None]:

#pd.plotting.scatter_matrix(tmp.transpose())
sns.pairplot(tmp.transpose(),kind="scatter")

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

scaler_minmax = preprocessing.MinMaxScaler()
scaler_minmax.fit(train)
train_norm = pd.DataFrame(scaler_minmax.transform(train),columns=train.columns)

#cls = DecisionTreeClassifier(max_depth=4)
cls = LogisticRegression(C=0.5)

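# Hold out one fixed validation set so that every model below predicts the
# same samples; only then can the predictions be averaged per sample.
rest_X, val_X, rest_y, val_y = train_test_split(train_norm,train_labels,
                                                test_size=.2, random_state=random_seed)

predictions_list = []
for i in range(10):
  # Each model is trained on a different random 80% of the remaining data.
  train_X, _, train_y, _ = train_test_split(rest_X,rest_y,
                                            test_size=.2, random_state=i)

  cls.fit(train_X,train_y)
  predictions_val = cls.predict(val_X)
  predictions_val_prob = cls.predict_proba(val_X)
  predictions_list.append(list(predictions_val_prob[:,-1]))
  print("%f %f"%(log_loss(val_y,predictions_val_prob[:,-1]),accuracy_score(val_y,predictions_val)))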

In [None]:
tmp = pd.DataFrame(predictions_list)
#pd.plotting.scatter_matrix(tmp.transpose())
sns.pairplot(tmp.transpose(),kind="scatter")

In [None]:
predictions_avg = tmp.mean(axis=0)
print("Avg. model: %f"%(log_loss(val_y,predictions_avg)))

# 5. Hyperparameters

For our first submission we set the hyperparameter `max_depth=3`. Other values might result in a lower log-loss on the test set.

Since we don't have the test set labels, we could only check this on the public leaderboard, which we should not do: tuning against the leaderboard risks overfitting to it.

So, we need to create our own test set (**not seen during training!**) with known class labels.

Scikit-learn offers many options to do this. One of them is the `train_test_split` function:

In [None]:
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(train_norm,train_labels,
                                                  test_size=.2, random_state=random_seed)

#train fold
print(train_X.shape)
print(train_y.shape)
#validation fold
print(val_X.shape)
print(val_y.shape)

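The sketch below illustrates two properties of `train_test_split` that the rest of this section relies on: the training and validation rows never overlap, and the same `random_state` always reproduces the same split. The toy DataFrame and its names are hypothetical and only serve the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, alternating labels.
toy = pd.DataFrame({"feature": range(20)})
toy_labels = pd.Series([0, 1] * 10)

a_X, a_val, a_y, a_val_y = train_test_split(toy, toy_labels, test_size=.2, random_state=1)
b_X, b_val, b_y, b_val_y = train_test_split(toy, toy_labels, test_size=.2, random_state=1)
c_X, c_val, c_y, c_val_y = train_test_split(toy, toy_labels, test_size=.2, random_state=2)

print(len(a_X), len(a_val))                     # sizes of the two folds
print(set(a_X.index) & set(a_val.index))        # rows shared by both folds
print(list(a_val.index) == list(b_val.index))   # same seed, same split?
print(list(a_val.index) == list(c_val.index))   # different seed, same split?
```
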
Fit a decision tree model with `max_depth=14` and default values for the other hyperparameters on the `train_X` data set.

In [None]:
#solution
cls_DT = DecisionTreeClassifier(max_depth=14)
cls_DT.fit(train_X,train_y)

What is the accuracy and log-loss on `train_X`? 

In [None]:
#solution
predictions = cls_DT.predict(train_X)
print("Accuracy: %f"%(accuracy_score(predictions, train_y)))
predictions = cls_DT.predict_proba(train_X)
print("Log-loss: %f"%(log_loss(train_y,predictions[:,1])))

What is the accuracy and log-loss on `val_X`?

In [None]:
#solution
predictions = cls_DT.predict(val_X)
print("Accuracy: %f"%(accuracy_score(predictions, val_y)))
predictions = cls_DT.predict_proba(val_X)
print("Log-loss: %f"%(log_loss(val_y,predictions[:,1])))

What do you see?


The following code evaluates different values for this hyperparameter.

In [None]:
for maxdepth in range(1,20,1):
    cls = DecisionTreeClassifier(max_depth=maxdepth)
    cls.fit(train_X,train_y)
    predictions_train = cls.predict(train_X)
    predictions_val = cls.predict(val_X)
    predictions_train_prob = cls.predict_proba(train_X)[:,1]
    predictions_val_prob = cls.predict_proba(val_X)[:,1]
    print("%i (%f) %f (%f) %f"%(maxdepth,
                                accuracy_score(predictions_train, train_y),accuracy_score(predictions_val, val_y),
                                log_loss(train_y,predictions_train_prob),log_loss(val_y,predictions_val_prob)))

What do you see?

So far we have split the data into one training set and one validation set. We can of course split the data in many different ways (different random seeds), resulting in different training and validation sets.

Let's try 5 different random seeds:

In [None]:
for run in range(5):
  train_X, val_X, train_y, val_y = train_test_split(train_norm,train_labels,
                                                  test_size=.8, random_state=run)
  min_m = 100
  best = None
  for maxdepth in range(1,20,1):
      cls = DecisionTreeClassifier(max_depth=maxdepth)
      cls.fit(train_X,train_y)
      predictions_val_prob = cls.predict_proba(val_X)[:,1]
      m = log_loss(val_y,predictions_val_prob)
      if m < min_m:
        min_m = m
        best = maxdepth
  print("%i %f"%(best,min_m))

What do you see?

The solution is to run several train-validation splits and average the performance.

One popular method is cross-validation, which uses each data point exactly once as a test point.

It works as follows:
<br/>
<br/>
<img src="https://scikit-learn.org/stable/_images/grid_search_cross_validation.png"/>
<br/>
<br/>

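The fold structure can also be inspected directly. The following sketch uses `KFold` on ten toy samples and counts how often each sample lands in a test fold. This is a simplified illustration: for classifiers, an integer `cv` in scikit-learn uses the stratified variant `StratifiedKFold`, and the toy array is an assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples (illustrative only).
X_toy = np.arange(10).reshape(-1, 1)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how many times each sample is used as a test point.
test_counts = np.zeros(len(X_toy), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_toy)):
    test_counts[test_idx] += 1
    print(fold, "train:", train_idx, "test:", test_idx)

print(test_counts)
```
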
It is easy to run this in Scikit-learn:



In [None]:
from sklearn.model_selection import cross_val_predict

for maxdepth in range(1,10,1):
    cls = DecisionTreeClassifier(max_depth=maxdepth)
    predictions = cross_val_predict(cls,train_norm,train_labels,
                                    cv=10,
                                    method="predict_proba")
    print("%i %f"%(maxdepth,log_loss(train_labels,predictions[:,1])))

We can do this more compactly with the `GridSearchCV` class:


In [None]:
from sklearn.model_selection import GridSearchCV

params = {
    'max_depth':range(1,10)
    }

GSCV = GridSearchCV(cls_DT, params,
                    cv=10,
                    scoring="neg_log_loss",
                    verbose=1).fit(train_norm,train_labels)

print(GSCV.best_estimator_)
print(GSCV.best_score_)

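Besides the best estimator, a fitted `GridSearchCV` object keeps the score of every candidate in its `cv_results_` attribute. The sketch below, run on synthetic stand-in data (an assumption, not the histone data), turns these results into a table so that the whole `max_depth` curve can be inspected rather than only the winner.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the histone marks dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"max_depth": range(1, 6)},
                      cv=5, scoring="neg_log_loss").fit(X_demo, y_demo)

# One row per candidate; the scorer returns the negated log-loss, so flip the sign.
cv_table = pd.DataFrame(search.cv_results_)[["param_max_depth", "mean_test_score", "std_test_score"]]
cv_table["mean_log_loss"] = -cv_table["mean_test_score"]

print(cv_table.sort_values("mean_log_loss"))
print(search.best_params_)
```

Note that `best_score_` is reported as a negative number because scikit-learn scorers follow a "higher is better" convention.
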
Play with the hyperparameters in a Template notebook and make some Kaggle submissions.

# 5. Ensemble learning: bagging

We have seen that bias and variance play an important role in Machine Learning. 

Let's first see what bagging can do for our dataset. 

In [None]:
from sklearn.ensemble import BaggingClassifier

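# The base estimator is passed positionally: the keyword was renamed from
# `base_estimator` to `estimator` in scikit-learn 1.2.
cls = BaggingClassifier(DecisionTreeClassifier(max_depth=4),random_state=random_seed)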
                                                            
cls.fit(train_X,train_y)
predictions_train = cls.predict(train_X)
predictions_val = cls.predict(val_X)
print("Accuracy: (%f) %f"%(accuracy_score(predictions_train, train_y),accuracy_score(predictions_val, val_y)))
predictions_train_prob = cls.predict_proba(train_X)
predictions_val_prob = cls.predict_proba(val_X)
print("Log-loss: (%f) %f"%(log_loss(train_y,predictions_train_prob[:,1]),log_loss(val_y,predictions_val_prob[:,1])))

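To show roughly what bagging does under the hood, here is a hand-rolled sketch: each tree is fit on a bootstrap sample (rows drawn with replacement) and the predicted probabilities are averaged. It is a simplified illustration on synthetic data, not the exact `BaggingClassifier` implementation, and all names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the histone marks dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=.2, random_state=0)

rng = np.random.RandomState(0)
member_probs = []
for _ in range(10):
    # Bootstrap sample: draw as many rows as the training set has, with replacement.
    boot = rng.randint(0, len(X_tr), size=len(X_tr))
    member = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr[boot], y_tr[boot])
    member_probs.append(member.predict_proba(X_va)[:, 1])

# The bagged prediction is the per-sample average over all trees.
bagged_prob = np.mean(member_probs, axis=0)

print("mean single-tree log-loss:", np.mean([log_loss(y_va, p) for p in member_probs]))
print("bagged log-loss:          ", log_loss(y_va, bagged_prob))
```
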
With the `RandomForestClassifier`, the variance of the decision trees is reduced further by also selecting the candidate features for each split at random. Let's see how far we get with default hyperparameter values.

In [None]:
from sklearn.ensemble import RandomForestClassifier

cls = RandomForestClassifier(random_state=random_seed)

cls.fit(train_X,train_y)
predictions_train = cls.predict(train_X)
predictions_val = cls.predict(val_X)
print("Accuracy: (%f) %f"%(accuracy_score(predictions_train, train_y),accuracy_score(predictions_val, val_y)))
predictions_train_prob = cls.predict_proba(train_X)
predictions_val_prob = cls.predict_proba(val_X)
print("Log-loss: (%f) %f"%(log_loss(train_y,predictions_train_prob[:,1]),log_loss(val_y,predictions_val_prob[:,1])))

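The sketch below looks inside a fitted random forest on synthetic stand-in data (an assumption, not the histone data). It reads how many features each tree considers per split and checks that the forest's probabilities are the average of the probabilities of its individual trees. `max_features="sqrt"` is set explicitly so that the example does not depend on version-specific defaults.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 16 features (assumption: not the histone marks dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=16, random_state=0)

forest = RandomForestClassifier(n_estimators=25, max_features="sqrt",
                                random_state=0).fit(X_demo, y_demo)

# Number of randomly chosen candidate features evaluated at each split.
per_split = forest.estimators_[0].max_features_
print("features per split:", per_split)

# The forest prediction is the mean of the per-tree class probabilities.
tree_probs = np.mean([t.predict_proba(X_demo) for t in forest.estimators_], axis=0)
same = np.allclose(tree_probs, forest.predict_proba(X_demo))
print("forest == mean of trees:", same)
```
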
# 6. Ensemble learning: boosting

How about the `GradientBoostingClassifier`?

In [None]:
#solution
from sklearn.ensemble import GradientBoostingClassifier

cls = GradientBoostingClassifier(random_state=random_seed,
                                    max_depth=10)
cls.fit(train_X,train_y)
predictions_train = cls.predict(train_X)
predictions_val = cls.predict(val_X)
print("Accuracy: (%f) %f"%(accuracy_score(predictions_train, train_y),accuracy_score(predictions_val, val_y)))
predictions_train_prob = cls.predict_proba(train_X)
predictions_val_prob = cls.predict_proba(val_X)
print("Log-loss: (%f) %f"%(log_loss(train_y,predictions_train_prob[:,1]),log_loss(val_y,predictions_val_prob[:,1])))
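
Because boosting adds trees one after another, the validation log-loss can be tracked after every stage with `staged_predict_proba`. The following sketch does this on synthetic stand-in data (an assumption, not the histone data); the hyperparameter values are illustrative, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: not the histone marks dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=.2, random_state=0)

gbc = GradientBoostingClassifier(n_estimators=50, max_depth=3,
                                 random_state=0).fit(X_tr, y_tr)

# Validation log-loss after 1, 2, ..., n_estimators boosting stages.
stage_losses = [log_loss(y_va, p[:, 1]) for p in gbc.staged_predict_proba(X_va)]

# Stage (1-based) with the lowest validation log-loss.
best_stage = min(range(len(stage_losses)), key=stage_losses.__getitem__) + 1
print("best number of stages:", best_stage)
print("log-loss at best stage:", stage_losses[best_stage - 1])
print("log-loss at last stage:", stage_losses[-1])
```

If the best stage comes well before the last one, the later trees mostly fit noise, which suggests lowering `n_estimators` or the tree depth.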