## Part 1 : Classification and Regression Trees

### Video 1 : Decision Tree for Classification

- <b>Classification Tree</b> : sequence of if-else questions about individual features.
- <b>Objective</b> : infer class labels.
- Able to capture non-linear relationships between features and labels.
- Doesn't require feature scaling.

<img src = "https://miro.medium.com/max/720/1*XMId5sJqPtm8-RIwVVz2tg.png">
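To make the "sequence of if-else questions" concrete, here is a minimal sketch that prints a fitted tree as text rules. It assumes the Iris dataset as a stand-in for the `X`, `y` used in these notes; the names `X_demo`, `y_demo` and `dt_demo` are illustrative only. Each rule compares one feature to a threshold, which is why rescaling the features is not needed.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: Iris stands in for the X, y used elsewhere in these notes
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# A shallow tree keeps the printed rules short
dt_demo = DecisionTreeClassifier(max_depth=2, random_state=1)
dt_demo.fit(X_demo, y_demo)

# export_text renders the fitted tree as nested if-else style rules
rules = export_text(dt_demo, feature_names=iris.feature_names)
print(rules)
```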

In [None]:
from sklearn.tree import DecisionTreeClassifier #import model
from sklearn.model_selection import train_test_split #import train_test_split
from sklearn.metrics import accuracy_score #import metrics

# train_test_split returns the X splits first, then the y splits
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 1,
                                                    stratify = y)
#instantiate dt
dt = DecisionTreeClassifier(max_depth = 2,
                            random_state = 1)

dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

accuracy_score(y_test, y_pred)

- <b>Decision Region</b> : region in the feature space where all instances are assigned to one class label.
- <b>Decision Boundary</b> : surface separating different decision regions

#### Practice 1 : Train your first classification Tree

In [None]:
# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Instantiate a DecisionTreeClassifier 'dt' with a maximum depth of 6
dt = DecisionTreeClassifier(max_depth = 6, random_state=SEED)

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict test set labels
y_pred = dt.predict(X_test)
print(y_pred[0:5])

#### Practice 2 : Evaluate the classification Tree

In [None]:
# Import accuracy_score
from sklearn.metrics import accuracy_score

# Predict test set labels
y_pred = dt.predict(X_test)

# Compute test set accuracy  
acc = accuracy_score(y_test, y_pred)
print("Test set accuracy: {:.2f}".format(acc))

#### Practice 3 : Logistic Regression vs classification tree

In [None]:
# Import LogisticRegression from sklearn.linear_model
from sklearn.linear_model import  LogisticRegression

# Instantiate logreg
logreg = LogisticRegression(random_state=1)

# Fit logreg to the training set
logreg.fit(X_train, y_train)

# Define a list called clfs containing the two classifiers logreg and dt
clfs = [logreg, dt]

# Review the decision regions of the two classifiers
plot_labeled_decision_regions(X_test, y_test, clfs)

### Video 2 : Classification Tree Learning

- <b>Decision Tree</b> : data structure consisting of hierarchy of nodes
- <b>Node</b> : question or prediction

Three Kinds of nodes :
- <b>Root</b> : no parent node, question giving rise to 2 children nodes
- <b>Internal node</b> : 1 parent node, question giving rise to 2 children nodes
- <b>Leaf</b> : 1 parent node, no children nodes, holds a prediction
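The node hierarchy can be sketched as a small data structure. This is only an illustration, not how scikit-learn stores its trees; the `Node` class, `predict_one` function and the example thresholds are all hypothetical.

```python
class Node:
    """Either a question (feature, threshold, two children) or a leaf prediction."""

    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 prediction=None):
        self.feature = feature        # index of the feature being tested
        self.threshold = threshold    # question: x[feature] <= threshold ?
        self.left = left              # child followed when the answer is yes
        self.right = right            # child followed when the answer is no
        self.prediction = prediction  # set only on leaves

    def is_leaf(self):
        return self.prediction is not None


def predict_one(node, x):
    # Walk from the root down, answering one question per node
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.prediction


# Hypothetical tree: a root, one internal node and three leaves
tree = Node(feature=0, threshold=2.5,
            left=Node(prediction="A"),
            right=Node(feature=1, threshold=1.0,
                       left=Node(prediction="B"),
                       right=Node(prediction="C")))

print(predict_one(tree, [1.0, 5.0]))  # A
print(predict_one(tree, [3.0, 0.5]))  # B
print(predict_one(tree, [3.0, 2.0]))  # C
```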



In [None]:
from sklearn.tree import DecisionTreeClassifier #import model
from sklearn.model_selection import train_test_split #import train_test_split
from sklearn.metrics import accuracy_score #import metrics

# train_test_split returns the X splits first, then the y splits
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 1,
                                                    stratify = y)
#instantiate dt
dt = DecisionTreeClassifier(criterion = 'gini',
                            random_state = 1)

dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

accuracy_score(y_test, y_pred)
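The `criterion` argument selects the impurity measure used to score candidate splits. As a rough sketch (plain Python, not scikit-learn's implementation), the Gini index and entropy of a set of labels, and the impurity decrease of a split, can be computed as follows. The helper names and the toy labels are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    # 1 - sum of squared class proportions
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    # -sum of p * log2(p) over the classes present
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def impurity_decrease(parent, left, right, impurity):
    # Parent impurity minus the size-weighted impurity of the two children
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Toy node with two balanced classes and a split that separates them perfectly
parent = ["a", "a", "a", "b", "b", "b"]
left, right = ["a", "a", "a"], ["b", "b", "b"]

print(gini(parent))                                  # 0.5
print(entropy(parent))                               # 1.0
print(impurity_decrease(parent, left, right, gini))  # 0.5
```

A tree grows by choosing, at each node, the feature and threshold with the largest impurity decrease.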

#### Practice 1 : Using entropy as a criterion

In [None]:
# Import DecisionTreeClassifier from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Instantiate dt_entropy, set 'entropy' as the information criterion
dt_entropy = DecisionTreeClassifier(max_depth=8, criterion='entropy', random_state=1)

# Fit dt_entropy to the training set
dt_entropy.fit(X_train, y_train)

#### Practice 2 : Entropy vs Gini Index

In [None]:
# Import accuracy_score from sklearn.metrics
from sklearn.metrics import accuracy_score

# Use dt_entropy to predict test set labels
y_pred= dt_entropy.predict(X_test)

# Evaluate accuracy_entropy
accuracy_entropy = accuracy_score(y_test, y_pred)

# Print accuracy_entropy
print('Accuracy achieved by using entropy: ', accuracy_entropy)

# Print accuracy_gini (assumed to be computed beforehand from a tree trained with criterion='gini')
print('Accuracy achieved by using the gini index: ', accuracy_gini)

### Video 3 : Decision Tree for Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor #import model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE #metrics

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = 3)

dt = DecisionTreeRegressor(max_depth = 4, 
                           min_samples_leaf = 0.1,
                           random_state = 3)

dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

#compute test set MSE
mse_dt = MSE(y_test, y_pred)

#compute test set RMSE
rmse_dt = mse_dt**(1/2)

print(rmse_dt)
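As a worked check of the last two steps, the MSE and RMSE can be computed by hand on a few values. The targets and predictions below are made up for illustration.

```python
from math import sqrt

y_true = [3.0, 5.0, 2.0, 7.0]  # hypothetical targets
y_hat = [2.0, 5.0, 4.0, 7.0]   # hypothetical predictions

# MSE: mean of the squared errors (1 + 0 + 4 + 0) / 4
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)

# RMSE: square root of the MSE, expressed in the same units as the target
rmse = sqrt(mse)

print(mse)  # 1.25
print(rmse)
```

Note that a float `min_samples_leaf` such as `0.1` is read by scikit-learn as a fraction: each leaf must hold at least 10% of the training samples.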

#### Practice 1 : Train your first regression tree

In [None]:
# Import DecisionTreeRegressor from sklearn.tree
from sklearn.tree import DecisionTreeRegressor

# Instantiate dt
dt = DecisionTreeRegressor(max_depth= 8,
             min_samples_leaf=0.13,
            random_state=3)

# Fit dt to the training set
dt.fit(X_train, y_train)

#### Practice 2 : Evaluate the Regression Tree

In [None]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Compute y_pred
y_pred = dt.predict(X_test)

# Compute mse_dt
mse_dt = MSE(y_test, y_pred)

# Compute rmse_dt
rmse_dt = mse_dt**(1/2)

# Print rmse_dt
print("Test set RMSE of dt: {:.2f}".format(rmse_dt))

#### Practice 3 : Linear regression vs regression tree


In [None]:
# Predict test set labels (lr is assumed to be an already fitted linear regression model)
y_pred_lr = lr.predict(X_test)

# Compute mse_lr
mse_lr = MSE(y_test, y_pred_lr)

# Compute rmse_lr
rmse_lr = mse_lr**(1/2)

# Print rmse_lr
print('Linear Regression test set RMSE: {:.2f}'.format(rmse_lr))

# Print rmse_dt
print('Regression Tree test set RMSE: {:.2f}'.format(rmse_dt))

## Part 2 : The Bias-Variance Tradeoff

### Video 1 : Generalization Error

Difficulties in approximating f :

- <b>Overfitting</b> : f hat fits the training set noise.
- <b>Underfitting</b> : f hat is not flexible enough to approximate f.

- High Bias : Underfitting
- High Variance : Overfitting
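The two failure modes can be sketched with regression trees of different flexibility. The example assumes synthetic data (a noisy sine curve) and hypothetical names such as `X_demo` and `results`; it is an illustration, not a benchmark.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error as MSE

# Assumption: synthetic data, a sine curve plus Gaussian noise
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 6, 200).reshape(-1, 1)
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

# Alternate points between train and test so both cover the whole range
X_tr, y_tr = X_demo[::2], y_demo[::2]
X_te, y_te = X_demo[1::2], y_demo[1::2]

results = {}
for name, depth in [("stump", 1), ("unrestricted", None)]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    train_mse = MSE(y_tr, model.predict(X_tr))
    test_mse = MSE(y_te, model.predict(X_te))
    results[name] = (train_mse, test_mse)
    print("{}: train MSE {:.3f}, test MSE {:.3f}".format(name, train_mse, test_mse))
```

The depth-1 tree is too rigid to follow the curve (high bias), while the unrestricted tree memorizes every training point, noise included (high variance), so its training error says little about its error on unseen data.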

### Video 2 : Diagnose bias and Variance Problems

Estimating the generalization error :

Solution :
- split the data into training and test sets
- fit f hat to the training set
- evaluate the error of f hat on the unseen test set

In [None]:
# k fold CV
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

dt = DecisionTreeRegressor(max_depth= 4,
             min_samples_leaf=0.14,
            random_state=123)

MSE_CV = - cross_val_score(dt, X_train, y_train, cv = 10, scoring = 'neg_mean_squared_error', n_jobs = -1)

dt.fit(X_train, y_train)

y_pred_train = dt.predict(X_train)
y_pred_test = dt.predict(X_test)

print(MSE_CV.mean())
print(MSE(y_train, y_pred_train))
print(MSE(y_test, y_pred_test))
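To see roughly what `cv = 10` does with the training set, here is a plain-Python sketch of k-fold index generation. It assumes contiguous, unshuffled folds; the function name `k_fold_indices` is hypothetical and the example uses 5 folds over 10 samples to keep the output short.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds get one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Each sample is used for validation exactly once, and the CV error is the mean of the k validation errors.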

#### Practice 1 : Instantiate the model

In [None]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(max_depth= 4, min_samples_leaf = 0.26, random_state=SEED)

#### Practice 2 : Evaluate the 10-fold CV Error

In [None]:
# Compute the array containing the 10-folds CV MSEs
MSE_CV_scores = - cross_val_score(dt, X_train, y_train, cv= 10, 
                       scoring='neg_mean_squared_error',
                       n_jobs=-1)

# Compute the 10-folds CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

#### Practice 3 : Evaluate the Training Error

In [None]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Compute the training set RMSE
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

### Video 3 : Ensemble Learning

- train different models on the same dataset
- let each model make its predictions
- meta model : aggregates predictions of individual models
- final prediction : more robust and less prone to errors
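The aggregation step can be sketched as a simple majority vote over the individual predictions, which is the idea behind hard voting. The function name and the three prediction lists below are made up for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: one list of predicted labels per model, all of equal length."""
    voted = []
    # zip(*predictions) groups the predictions made for the same sample
    for sample_preds in zip(*predictions):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted


# Hypothetical predictions from three classifiers on four samples
preds_lr = [1, 0, 1, 1]
preds_knn = [1, 1, 0, 1]
preds_dt = [0, 0, 1, 1]

print(majority_vote([preds_lr, preds_knn, preds_dt]))  # [1, 0, 1, 1]
```

Each model is wrong on a different sample, yet the vote recovers the label that two of the three agree on.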

In [None]:
# import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#import models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

SEED = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

lr = LogisticRegression(random_state = SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state = SEED)

#define a list of (name, classifier) tuples
classifiers = [('Logistic Regression', lr),
               ('K Nearest Neighbours', knn),
               ('Classification Tree', dt)]

#fit and evaluate each classifier individually
for clf_name, clf in classifiers:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf_name, accuracy_score(y_test, y_pred))

#combine the classifiers in a voting ensemble
vc = VotingClassifier(estimators = classifiers)
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)

print(accuracy_score(y_test, y_pred))

#### Practice 1 : Define the ensemble


In [None]:
# Set seed for reproducibility
SEED=1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNN(n_neighbors= 27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

#### Practice 2 : Evaluate Individual Classifiers

In [None]:
# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:    
 
    # Fit clf to the training set
    clf.fit(X_train, y_train)    
   
    # Predict y_pred
    y_pred = clf.predict(X_test)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred) 
   
    # Evaluate clf's accuracy on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

#### Practice 3 : Better performance with a Voting Classifier


In [None]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

## Part 3 : Bagging and Random Forest

### Video 1 : Bagging

Bootstrap Aggregation:

- 1 algorithm
- many subsets of the training set

<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c8/Ensemble_Bagging.svg/440px-Ensemble_Bagging.svg.png">

Bagging, also known as bootstrap aggregation, is an ensemble learning method commonly used to reduce variance when the data is noisy. In bagging, each model is trained on a random sample of the training set drawn with replacement, meaning that individual data points can be chosen more than once.
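Sampling with replacement can be sketched in a few lines of plain Python. The function name `bootstrap_sample` and the toy list of row indices are assumptions made for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return rng.choices(data, k=len(data))


rng = random.Random(1)          # fixed seed for reproducibility
training_set = list(range(10))  # hypothetical training-set row indices

samples = []
for i in range(3):
    sample = bootstrap_sample(training_set, rng)
    samples.append(sample)
    # Some rows typically repeat, so others are left out of this sample
    print("sample", i, sorted(sample), "distinct rows:", len(set(sample)))
```

Each bootstrap sample has the same size as the training set, and one model is trained per sample before the predictions are aggregated.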

In [None]:
#import models
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, stratify = y, random_state = SEED)

dt = DecisionTreeClassifier(max_depth = 4, )