# Complexity, bias and variance

You saw how the complexity of a model $\hat{f}$ influences the bias and variance terms of its generalization error.
Which of the following correctly describes the relationship between the complexity of $\hat{f}$ and its bias and variance terms?

- As the complexity of $\hat{f}$ increases, the bias term decreases while the variance term increases.
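
For reference, the standard decomposition behind this answer can be written as follows, with the model denoted $\hat{f}$. The wording of the terms is a common textbook convention rather than something stated in these notes.

$$
\text{generalization error of } \hat{f} = \text{bias}^2 + \text{variance} + \text{irreducible error}
$$

Raising complexity typically shrinks the first term and grows the second, so the total error is usually lowest at an intermediate complexity.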

# Overfitting and underfitting

In this exercise, you'll visually diagnose whether a model is overfitting or underfitting the training set.

For this purpose, we have trained two different models, A and B, on the auto dataset to predict a car's fuel consumption in miles per gallon (`mpg`) using only its displacement (`displ`) as a feature.

The following figure shows scatterplots of `mpg` versus `displ`, with the training set predictions of models A and B drawn as red lines.

<center><img src="images/02.06.jpg"  style="width: 400px; height: 300px;"/></center>


- B suffers from high bias and underfits the training set.
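
The same diagnosis can be made numerically instead of visually. The sketch below is an illustration only: it uses synthetic data as a stand-in for `displ` and `mpg` (an assumption, not the auto dataset) and two hypothetical trees, one deliberately too simple and one unconstrained.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for (displ, mpg): a noisy decreasing curve (assumption)
rng = np.random.RandomState(1)
x = rng.uniform(50, 450, size=300)
y = 10 + 3000 / x + rng.normal(scale=2.0, size=300)
X = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

def rmse(model, X, y):
    """Root mean squared error of model predictions on (X, y)."""
    return mean_squared_error(y, model.predict(X)) ** 0.5

# A depth-1 tree can only output two values; an unconstrained tree
# keeps splitting until it reproduces the training targets.
models = {
    'shallow': DecisionTreeRegressor(max_depth=1, random_state=1),
    'deep': DecisionTreeRegressor(random_state=1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (rmse(model, X_train, y_train),
                     rmse(model, X_test, y_test))
    print('{:s}: train RMSE {:.2f}, test RMSE {:.2f}'.format(name, *results[name]))
```

A model that underfits shows a large error on both sets, while a model that overfits shows a very small training error and a noticeably larger test error.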

# Instantiate the model

You'll diagnose the bias and variance problems of a regression tree. The regression tree you define in this exercise will be used to predict the mpg consumption of cars from the auto dataset using all available features.

In [3]:
import pandas as pd
mpg = pd.read_csv("dataset/auto.csv")
X = mpg.drop(["origin", "mpg"], axis=1)
y = mpg["mpg"]
# mpg.head()

In [4]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Set SEED for reproducibility
SEED = 1

# Split the data into 70% train and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

# Instantiate a DecisionTreeRegressor dt
dt = DecisionTreeRegressor(min_samples_leaf=0.26, max_depth=4, random_state=SEED)
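
When `min_samples_leaf` is a float, scikit-learn reads it as a fraction of the training samples, so `0.26` forces every leaf to hold at least about 26% of them. The sketch below illustrates the consequence on synthetic data (an assumption, not the auto dataset): such a tree can never have more than three leaves, whatever `max_depth` allows.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, 100 samples, used only to illustrate the constraint
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo[:, 0] + rng.normal(scale=0.1, size=100)

tree = DecisionTreeRegressor(min_samples_leaf=0.26, max_depth=4, random_state=1)
tree.fit(X_demo, y_demo)

# Number of leaves and how many training samples landed in each one
n_leaves = tree.get_n_leaves()
leaf_ids = tree.apply(X_demo)
leaf_sizes = np.bincount(leaf_ids)[np.unique(leaf_ids)]

print('leaves:', n_leaves)
print('samples per leaf:', leaf_sizes)
```

Such a strongly constrained tree is a plausible candidate for underfitting, which is what the following exercises check.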

# Evaluate the 10-fold CV error

You'll evaluate the 10-fold CV root mean squared error (RMSE) achieved by the regression tree `dt` that you instantiated in the previous exercise.


In [5]:
from sklearn.model_selection import cross_val_score

# Compute the array containing the 10-fold CV MSEs
# (the scorer returns negated MSEs, hence the leading minus sign)
MSE_CV_scores = -cross_val_score(dt, X_train, y_train, cv=10,
                                 scoring='neg_mean_squared_error',
                                 n_jobs=-1)

# Compute the 10-fold CV RMSE
RMSE_CV = (MSE_CV_scores.mean())**(1/2)

# Print RMSE_CV
print('CV RMSE: {:.2f}'.format(RMSE_CV))

CV RMSE: 5.14
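
To make the computation less opaque, here is a rough sketch of what the call above does for a regressor: split the data into 10 consecutive folds, fit a fresh copy of the model on 9 of them, and score it on the remaining one. It runs on synthetic data (an assumption), and the names `X_demo`, `y_demo` and `model` are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data used only for illustration
rng = np.random.RandomState(1)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

model = DecisionTreeRegressor(min_samples_leaf=0.26, max_depth=4, random_state=1)

# Manual 10-fold loop: fit on 9 folds, compute the MSE on the held-out fold
fold_mses = []
for train_idx, val_idx in KFold(n_splits=10).split(X_demo):
    fold_model = clone(model).fit(X_demo[train_idx], y_demo[train_idx])
    y_val_pred = fold_model.predict(X_demo[val_idx])
    fold_mses.append(mean_squared_error(y_demo[val_idx], y_val_pred))
fold_mses = np.array(fold_mses)

# The same quantity via cross_val_score (scores come back negated)
sk_mses = -cross_val_score(model, X_demo, y_demo, cv=10,
                           scoring='neg_mean_squared_error')

# Two ways to summarise the folds as a single RMSE
rmse_of_mean = fold_mses.mean() ** 0.5       # what the exercise computes
mean_of_rmses = (fold_mses ** 0.5).mean()    # a common alternative

print('RMSE of mean MSE: {:.3f}'.format(rmse_of_mean))
print('Mean of fold RMSEs: {:.3f}'.format(mean_of_rmses))
```

The exercise takes the square root of the mean MSE. Averaging the per-fold RMSEs instead gives a value that is never larger, so the two summaries should not be mixed when comparing models.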


# Evaluate the training error

You'll now evaluate the training set RMSE achieved by the regression tree `dt` that you instantiated in a previous exercise.

In [6]:
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE

# Fit dt to the training set
dt.fit(X_train, y_train)

# Predict the labels of the training set
y_pred_train = dt.predict(X_train)

# Evaluate the training set RMSE of dt
RMSE_train = (MSE(y_train, y_pred_train))**(1/2)

# Print RMSE_train
print('Train RMSE: {:.2f}'.format(RMSE_train))

Train RMSE: 5.15


# High bias or high variance?

You'll now diagnose whether the regression tree `dt` you trained in the previous exercise suffers from a bias or a variance problem. Does `dt` suffer from a high bias or a high variance problem? Note that `baseline_RMSE = 5.1`.

- `dt` suffers from high bias because `RMSE_CV` ≈ `RMSE_train` and both scores are greater than `baseline_RMSE`.
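
The reasoning can be written down as a small decision rule. The helper `diagnose` and its tolerance `rel_tol` are hypothetical names introduced for illustration; the 5% threshold is an arbitrary assumption, not a value from the exercise.

```python
def diagnose(rmse_cv, rmse_train, baseline_rmse, rel_tol=0.05):
    """Rough bias/variance diagnosis from three RMSE values.

    rel_tol is an assumed tolerance: the CV error must exceed the
    training error by more than this fraction to count as a gap.
    """
    # CV error clearly above training error: the model does not generalize
    if rmse_cv > rmse_train * (1 + rel_tol):
        return 'high variance'
    # No gap, but both errors exceed the baseline: the model is too simple
    if rmse_cv > baseline_rmse and rmse_train > baseline_rmse:
        return 'high bias'
    return 'no clear problem'

# Values reported in the previous exercises
verdict = diagnose(rmse_cv=5.14, rmse_train=5.15, baseline_rmse=5.1)
print(verdict)
```

With the values from this notebook the rule returns `high bias`: the CV and training errors are nearly identical, which rules out a variance problem, and both sit above the baseline.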

# Define the ensemble

In this exercise, you'll instantiate three classifiers to predict whether a patient suffers from a liver disease using all the features present in the dataset.

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Set SEED for reproducibility
SEED = 1

# Instantiate lr
lr = LogisticRegression(random_state=SEED)

# Instantiate knn
knn = KNeighborsClassifier(n_neighbors=27)

# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=0.13, random_state=SEED)

# Define the list classifiers
classifiers = [('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]

# Evaluate individual classifiers

In this exercise, you'll evaluate the performance of the models in the list `classifiers` that we defined in the previous exercise. You'll do so by fitting each classifier on the training set and evaluating its test set accuracy.

In [12]:
df = pd.read_csv("dataset/indian_liver_patient_preprocessed.csv", index_col=0)
X = df.drop("Liver_disease", axis=1)
y = df["Liver_disease"]
# df.head()

In [16]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Iterate over the pre-defined list of classifiers
for clf_name, clf in classifiers:

    # Fit clf to the training set
    clf.fit(X_train, y_train)

    # Predict the labels of the test set
    y_pred = clf.predict(X_test)

    # Calculate the test set accuracy
    accuracy = accuracy_score(y_test, y_pred)

    # Print clf's test set accuracy
    print('{:s} : {:.3f}'.format(clf_name, accuracy))

Logistic Regression : 0.655
K Nearest Neighbours : 0.661
Classification Tree : 0.661



# Better performance with a Voting Classifier

Finally, you'll evaluate the performance of a voting classifier that takes the outputs of the models defined in the list `classifiers` and assigns labels by majority voting.

In [17]:
# Import VotingClassifier from sklearn.ensemble
from sklearn.ensemble import VotingClassifier

# Instantiate a VotingClassifier vc
vc = VotingClassifier(estimators=classifiers)     

# Fit vc to the training set
vc.fit(X_train, y_train)   

# Evaluate the test set predictions
y_pred = vc.predict(X_test)

# Calculate accuracy score
accuracy = accuracy_score(y_test, y_pred)
print('Voting Classifier: {:.3f}'.format(accuracy))

Voting Classifier: 0.655
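
To see what majority voting amounts to, the sketch below recomputes the vote by hand and compares it with the prediction of `VotingClassifier` in its default hard-voting mode. It uses a synthetic binary dataset from `make_classification` (an assumption, not the liver dataset), and the `_demo` and `_tr`/`_te` names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with labels 0 and 1
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

estimators = [
    ('lr', LogisticRegression(random_state=1)),
    ('knn', KNeighborsClassifier(n_neighbors=27)),
    ('dt', DecisionTreeClassifier(min_samples_leaf=0.13, random_state=1)),
]

# Hard voting is the default: each fitted model casts one vote per sample
vc = VotingClassifier(estimators=estimators).fit(X_tr, y_tr)
vc_pred = vc.predict(X_te)

# Recompute the vote by hand: one column of 0/1 predictions per model,
# then predict 1 wherever at least two of the three models say 1
votes = np.column_stack([est.predict(X_te) for est in vc.estimators_])
manual_majority = (votes.sum(axis=1) >= 2).astype(int)

print('identical predictions:', np.array_equal(manual_majority, vc_pred))
```

The vote only changes a prediction where the base models disagree, so three models that mostly agree will score close to their individual accuracies.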


