#### Generalization Error

learn how to diagnose the problems of overfitting and underfitting, learn about ensembling (several models are aggregated to produce more robust predictions)

in supervised learning you make the assumption that there's a mapping f between features and labels (y=f(x) where f is an unknown function that you want to determine)
in reality, data generation is always accompanied by randomness (noise)
your goal is to find a model, fhat, that best approximates f, fhat can be a logistic regression, a decision tree, a neural network, etc., when training fhat you want to make sure that you discard as much noise as possible
in the end fhat should achieve a low predictive error on unseen datasets

two issues may happen when approximating f:
- overfitting: when fhat fits the noise in the training set, the model overfits the training set and its predictive power on unseen datasets is pretty low, this model would have a low training set error but a high test set error
- underfitting: when fhat is not flexible enough to capture the complex dependency between features and labels, the training set error will be roughly equal to the test set error but both of those errors will be pretty high
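
a minimal sketch of both failure modes, assuming a synthetic dataset (a noisy sine curve that is not part of these notes) and two decision trees whose depths were picked only for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data (assumption): y = f(x) + noise with f = sin
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def train_test_mse(model):
    """fit the model on the training set and return (train MSE, test MSE)"""
    model.fit(X_train, y_train)
    return (mean_squared_error(y_train, model.predict(X_train)),
            mean_squared_error(y_test, model.predict(X_test)))

# a depth-1 tree is too rigid to follow a sine curve (underfitting)
stump_train, stump_test = train_test_mse(DecisionTreeRegressor(max_depth=1, random_state=0))
# an unconstrained tree can memorize every training point, noise included (overfitting)
deep_train, deep_test = train_test_mse(DecisionTreeRegressor(random_state=0))

print('depth 1    train MSE: {:.3f}  test MSE: {:.3f}'.format(stump_train, stump_test))
print('unlimited  train MSE: {:.3f}  test MSE: {:.3f}'.format(deep_train, deep_test))
```

the unconstrained tree should show a training error of zero next to a clearly larger test error, while the depth-1 tree should show two errors that are both high.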

the **generalization error** of a model tells you how well it generalizes to unseen data, in other words, does fhat generalize well on unseen data?
it can be decomposed into 3 terms: generalization error of fhat = bias² + variance + irreducible error
- the irreducible error is the error contribution of noise
- the bias term tells you on average how much fhat and f are different, high bias models lead to underfitting
- the variance term tells you how much fhat is inconsistent over different training sets, a high variance model will have fhat following the training data points so closely that it'll miss the true function f, high variance models lead to overfitting
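
a rough simulation of the bias and variance terms, assuming a made-up true function f, a known noise level, and polynomial fits standing in for fhat (none of these come from the notes), every training set shares the same inputs and only the noise changes:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """the 'true' function (an assumption made for this sketch)"""
    return np.sin(2 * np.pi * x)

x = np.linspace(0, 1, 30)   # fixed inputs shared by every training set
sigma = 0.3                 # standard deviation of the noise
n_sets = 500                # number of simulated training sets

def bias2_and_variance(degree):
    """fit one polynomial per noisy training set, then average over x"""
    preds = np.empty((n_sets, x.size))
    for i in range(n_sets):
        y = f(x) + rng.normal(scale=sigma, size=x.size)  # a fresh noisy training set
        preds[i] = np.polyval(np.polyfit(x, y, degree), x)
    bias2 = np.mean((preds.mean(axis=0) - f(x)) ** 2)    # (average fhat - f)^2
    variance = np.mean(preds.var(axis=0))                # spread of fhat across training sets
    error_vs_f = np.mean((preds - f(x)) ** 2)            # average squared distance to f
    return bias2, variance, error_vs_f

results = {degree: bias2_and_variance(degree) for degree in (1, 7)}
for degree, (bias2, variance, error_vs_f) in results.items():
    print('degree {}: bias^2 = {:.4f}, variance = {:.4f}, bias^2 + variance + noise = {:.4f}'
          .format(degree, bias2, variance, bias2 + variance + sigma ** 2))
```

the straight line (degree 1) should come out with the larger bias term and the degree 7 polynomial with the larger variance term, `error_vs_f` equals bias² + variance, and the irreducible error `sigma ** 2` is added on top when predicting a new noisy label.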

the **model complexity** sets the flexibility of fhat to approximate the true function f, for example, increasing the maximum tree depth increases the complexity of a decision tree, the best model complexity corresponds to the lowest generalization error
- when model complexity increases the variance increases while the bias decreases
- when model complexity decreases the variance decreases and the bias increases 

your goal is to find the model complexity that achieves the lowest generalization error
this error is the sum of 3 terms and the irreducible error is constant, so you need to find a balance between bias and variance because as one increases the other decreases, this is known as the **bias-variance tradeoff**
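
one way to look for that balance is to sweep a complexity parameter and keep the value with the lowest cross-validated error, a sketch assuming synthetic data and `max_depth` as the only knob (the depth range and the 5 folds are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# synthetic data (assumption): noisy sine curve
rng = np.random.default_rng(1)
X = rng.uniform(0, 2 * np.pi, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

depths = range(1, 11)
cv_mse = []
for depth in depths:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=1)
    # 5-fold cv, negated because sklearn reports the negative MSE
    scores = cross_val_score(tree, X, y, cv=5, scoring='neg_mean_squared_error')
    cv_mse.append(-scores.mean())
    print('max_depth={:2d}  CV MSE: {:.3f}'.format(depth, cv_mse[-1]))

# the depth with the lowest estimated generalization error
best_depth = depths[int(np.argmin(cv_mse))]
print('best max_depth:', best_depth)
```

the cv error would typically fall while added depth removes bias and rise again once the extra depth mostly adds variance.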

to visualize this:
- imagine approximating f as aiming at the center of a shooting target, the center is the true function f and each shot is an fhat
- if fhat is low bias and low variance, your shots will be closely clustered around the center
- if fhat is high variance and high bias, your shots will miss the target and be spread all around it
- if fhat is low variance but high bias the shots will be clustered but not on the target
- if fhat is high variance but low bias then the shots will be spread out but around the target

| | low variance (precise) | high variance (not precise) |
|---|---|---|
| low bias (accurate) | clustered on target | spread around target |
| high bias (not accurate) | clustered off target | spread off target |

#### Diagnose Bias and Variance Problems

once you've trained a supervised machine learning model fhat, how do you estimate its generalization error?

this can't be done directly because f is unknown, you usually only have one dataset, and you don't have access to the error term because of noise (noise is unpredictable)

a solution to this is to first split the data into a training set and a test set, the model fhat can then be fit to the training set and its error can be evaluated on the unseen test set, the generalization error of fhat is approximately equal to the test set error of fhat

the test set should be kept untouched until you're confident about fhat's performance, it should only be used to evaluate fhat's final performance or error, evaluating fhat on the training set may produce an optimistic (biased) estimate of the error because fhat has already seen all the training points when it was fit, so to get a reliable estimate of fhat's performance without touching the test set you should use cross validation (cv) on the training set

you can perform cv using k-fold or hold-out 
if you did 10-fold cv you'd get 10 errors and the cv error would be the mean of these errors, you can then compare that error to fhat's training set error:
- if the cv error is greater than the training set error then fhat suffers from high variance, fhat has overfit the training set, to remedy this you can try decreasing fhat's complexity (decrease max_depth, increase min samples per leaf, etc.) or you could gather more data to train fhat with
- if the cross validation error is roughly equal to the training error but much greater than the desired error then fhat suffers from high bias, fhat underfits the training set, to remedy this you can try increasing the model's complexity (increase max depth, decrease min samples per leaf, etc.) or you could gather more relevant features for the problem 
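
the two checks above can be written as a small helper, this is only a rule-of-thumb sketch: `diagnose` and its `tolerance` parameter are hypothetical names, and the 10% default for what counts as "roughly equal" is an arbitrary assumption:

```python
def diagnose(train_error, cv_error, desired_error, tolerance=0.1):
    """
    rough diagnosis from three error values
    tolerance: relative gap below which two errors count as 'roughly equal'
               (hypothetical default, tune it for your problem)
    """
    # relative gap between the cv error and the training error
    gap = (cv_error - train_error) / cv_error if cv_error > 0 else 0.0
    if gap > tolerance:
        # cv error clearly above the training error: overfitting
        return 'high variance'
    if cv_error > desired_error * (1 + tolerance):
        # errors roughly equal but well above the desired error: underfitting
        return 'high bias'
    return 'ok'

print(diagnose(train_error=0.05, cv_error=0.40, desired_error=0.10))   # high variance
print(diagnose(train_error=0.38, cv_error=0.40, desired_error=0.10))   # high bias
print(diagnose(train_error=0.09, cv_error=0.095, desired_error=0.10))  # ok
```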

In [None]:
# kfold cv in sklearn 
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import cross_val_score

# set seed for reproducibility
SEED = 123

# split the dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# instantiate the decision tree regressor 
dt = DecisionTreeRegressor(max_depth=4, 
                           min_samples_leaf=0.14,
                           random_state=SEED)

# compute the array of MSEs obtained by 10-fold cv
# set n_jobs to -1 to exploit all CPU cores in computation
# cross_val_score() has no plain MSE scorer, so use scoring='neg_mean_squared_error'
# this returns a numpy array of the 10 negative mean squared errors achieved on the 10 folds
# multiply the result by -1 to obtain an array of cv MSEs
MSE_CV = - cross_val_score(dt, X_train, y_train, cv=10, scoring='neg_mean_squared_error', n_jobs=-1)

# fit dt to the training set
dt.fit(X_train, y_train)
# predict the labels of the training set
y_predict_train = dt.predict(X_train)
# predict the labels of the test set
y_predict_test = dt.predict(X_test)

# the cv mean squared error is the mean of MSE_CV
print('CV MSE: {:.2f}'.format(MSE_CV.mean()))

# use the function MSE to evaluate the train and test set mean squared errors
# training set mse
print('Train MSE: {:.2f}'.format(MSE(y_train, y_predict_train)))
# testing set mse
print('Test MSE: {:.2f}'.format(MSE(y_test, y_predict_test)))

# the training set error is smaller than the cv error so we can deduce 
# that dt overfits the training set and that it suffers from high variance

#### Ensemble Learning

ensemble learning is a supervised learning technique

CARTs (classification and regression trees) have many advantages such as being simple to understand, being simple to interpret, and being easy to use, their flexibility gives them an ability to describe non-linear dependencies between features and labels, you don't need a lot of feature processing to train a CART because you don't need to standardize or normalize features before feeding them to a CART 

CARTs also have some limitations, for example a classification tree is only able to produce orthogonal decision boundaries, CARTs are very sensitive to small variations in the training set (sometimes a single point removed from the training set can drastically change a CART's learned parameters), CARTs suffer from high variance when they're trained without constraints in which case they may overfit the training set

**ensemble learning** is a solution that takes advantage of the flexibility of CARTs while reducing their tendency to memorize noise, here's how it works:
- different models are trained on the same dataset
- each model makes its own predictions
- a meta-model then aggregates the predictions of the individual models and outputs a final prediction 
- that final prediction will be more robust and less prone to errors than each individual model
- the best results are obtained when the models are skillful but in different ways (if some models make predictions that are way off then the other models should compensate for those errors)
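
a back-of-the-envelope calculation of why aggregation helps, under the strong assumption that the models make their mistakes independently of each other (real models trained on the same data rarely do) and that they're combined by majority vote, `majority_vote_accuracy` is a hypothetical helper:

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """
    probability that a strict majority of n_models is right when each model is
    right with probability p_correct, independently of the others (assumption)
    use an odd n_models so that ties can't happen
    """
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n_models, k) * p_correct ** k * (1 - p_correct) ** (n_models - k)
               for k in range(k_min, n_models + 1))

for n in (1, 3, 5):
    print('{} model(s) at 70% each -> ensemble accuracy {:.3f}'.format(
        n, majority_vote_accuracy(n, 0.7)))
```

with 3 independent models that are each right 70% of the time the majority is right 78.4% of the time, the gain shrinks as the models' errors become more correlated, which is why models that are skillful in different ways matter.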

for a classification problem, ensemble learning may work by using the same training set on different classifiers such as a decision tree, logistic regression, knn, and other models, each classifier then learns its parameters and makes predictions, these predictions are then fed to a meta model which aggregates them and outputs a final prediction

one ensemble technique is the **voting classifier**
- consider a binary classification task
- the ensemble will consist of N classifiers making the predictions P1, P2, ..., PN where each prediction is 0 or 1
- the meta model will output the final prediction by hard voting (majority rule)
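
hard voting can be sketched by hand with numpy, the prediction matrix below is made up for illustration and `hard_vote` is a hypothetical helper, not a scikit-learn function:

```python
import numpy as np

# rows = classifiers, columns = samples, hypothetical 0/1 predictions
predictions = np.array([
    [1, 0, 1, 1, 0],   # classifier 1
    [1, 1, 0, 1, 0],   # classifier 2
    [0, 0, 1, 1, 1],   # classifier 3
])

def hard_vote(predictions):
    """return 1 where more than half of the classifiers predicted 1, else 0"""
    n_classifiers = predictions.shape[0]
    votes_for_one = predictions.sum(axis=0)  # number of classifiers voting 1 per sample
    return (votes_for_one > n_classifiers / 2).astype(int)

final = hard_vote(predictions)
print(final)  # [1 0 1 1 0]
```

with an even number of classifiers a tie is possible, this sketch sends ties to class 0 which is an arbitrary choice.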

In [None]:
# voting classifier using sklearn
# use all the features in the dataset to predict whether a cell is malignant or not 

# import functions to compute accuracy and split data
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# import models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.ensemble import VotingClassifier

# set seed for reproducibility
SEED = 1

# split the dataset into 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# instantiate the individual classifiers
lr = LogisticRegression(random_state=SEED)
knn = KNN()
dt = DecisionTreeClassifier(random_state=SEED)

# define a list of tuples (classifier_name, classifier) pairing each model's name with the model itself
classifiers = [('Logistic Regression', lr), 
               ('K Nearest Neighbors', knn), 
               ('Classification Tree', dt)]

# iterate over the list of tuples to fit each classifier to the training set, evaluate its accuracy on the test set, and print the result
for clf_name, clf in classifiers:
    # fit clf to the training set
    clf.fit(X_train, y_train)
    
    # predict the labels of the test set
    y_pred = clf.predict(X_test)
    
    # evaluate the accuracy of clf on the test set
    print('{:s} : {:.3f}'.format(clf_name, accuracy_score(y_test, y_pred)))
# in this case, logistic regression has the best accuracy with 94.7%

# instantiate a voting classifier (vc) 
vc = VotingClassifier(estimators=classifiers)

# fit vc to the training set and predict the test set labels
vc.fit(X_train, y_train)
y_pred = vc.predict(X_test)

# evaluate the test set accuracy of vc
print('Voting Classifier: {:.3f}'.format(accuracy_score(y_test, y_pred)))
# in this case the accuracy is 95.3% which is higher than what was achieved by any of the individual models in the ensemble