# Lab1 - Scikit-learn
Author: *Samuel Sofela*

## 1. Introduction

The goal of this lab is to become familiar with the scikit-learn library.

You will practice loading example datasets, perform classification and regression with linear scikit-learn models, and investigate the effects of reducing the number of features (columns in X) and the number of samples (rows in X and y).


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Classification

Using yellowbrick spam - classification  
https://www.scikit-yb.org/en/latest/api/datasets/spam.html

The goal is to investigate `LogisticRegression(max_iter=2000)` and effects of reducing the number of features and number of samples on classification performance.

### 2.1 Implement convenience function

In [19]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def get_classifier_accuracy(model, X, y):
    '''Calculate train and validation accuracy of classifier (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn classifier): Classifier to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training accuracy, validation accuracy
    
    '''
    #TODO: IMPLEMENT FUNCTION BODY
    # split the data into training set and validation set
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    
    # fit the model using training data
    model.fit(X_train, y_train)

    # predict the data on training set
    y_predicted_training = model.predict(X_train)

    # get the training accuracy
    training_accuracy = accuracy_score(y_train, y_predicted_training)

    # predict the data on validation set
    y_predicted_test = model.predict(X_test)
    
    # get the validation accuracy
    test_accuracy = accuracy_score(y_test, y_predicted_test)

    return (training_accuracy, test_accuracy)
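
A quick way to sanity-check the convenience function before loading the real data is to call it on a small synthetic problem. The sketch below is only an illustration: the `make_classification` data and the names `X_demo` and `y_demo` are assumptions and are not part of the lab, and the function is redefined so the block runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def get_classifier_accuracy(model, X, y):
    """Return (training accuracy, validation accuracy) after a fixed split."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=956)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    return train_acc, val_acc


# Synthetic stand-in data (an assumption for illustration, not the spam set).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

train_acc, val_acc = get_classifier_accuracy(
    LogisticRegression(max_iter=2000), X_demo, y_demo
)
print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

Both returned values should be fractions between 0 and 1, with the training accuracy usually at or slightly above the validation accuracy.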
    
    
    

### 2.2 Load data

Using the yellowbrick function `load_spam()`, load the spam data set into feature matrix `X` and target vector `y`.

Print size and type of `X` and `y`.


In [12]:
# TODO: ADD YOUR CODE HERE
from yellowbrick.datasets.loaders import load_spam
X,y = load_spam()
print(X.shape)
print(type(X))
print(y.shape)
print(type(y))
print(int(len(X)*0.01))

(4600, 57)
<class 'pandas.core.frame.DataFrame'>
(4600,)
<class 'pandas.core.series.Series'>
46


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [33]:
# TODO: ADD YOUR CODE HERE
x_small, _, y_small, _ = train_test_split(X, y, train_size=0.01, random_state=174)
print("The size of x_small is: ", x_small.shape, " and the type is: ", type(x_small))
print("The size of y_small is: ", y_small.shape, " and the type is: ", type(y_small))

The size of x_small is:  (46, 57)  and the type is:  <class 'pandas.core.frame.DataFrame'>
The size of y_small is:  (46,)  and the type is:  <class 'pandas.core.series.Series'>
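
The row counts can be checked with a little arithmetic. The sketch below assumes that `train_test_split()` rounds a fractional `train_size` down, rounds a fractional `test_size` up, and falls back to a 25% test split when neither is given; `split_sizes` is a hypothetical helper written for this note, not a scikit-learn function.

```python
from math import ceil, floor


def split_sizes(n_samples, train_size=None, test_size=None):
    """Return (n_train, n_test) under the rounding rules assumed above."""
    if train_size is None and test_size is None:
        test_size = 0.25  # assumed default test fraction
    if train_size is None:
        n_test = ceil(test_size * n_samples)
        n_train = n_samples - n_test
    elif test_size is None:
        n_train = floor(train_size * n_samples)
        n_test = n_samples - n_train
    else:
        n_train = floor(train_size * n_samples)
        n_test = ceil(test_size * n_samples)
    return n_train, n_test


# 1% of the 4600 spam rows, as requested in this section.
n_small, _ = split_sizes(4600, train_size=0.01)

# The convenience function then splits those rows again with default sizes.
inner_train, inner_val = split_sizes(n_small)

print(n_small, inner_train, inner_val)  # 46 34 12
```

Under these assumptions the small model is trained on only 34 rows and validated on 12, so each validation sample is worth about 8 percentage points of accuracy.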


### 2.3 Train and evaluate models

1. Import `LogisticRegression` from sklearn
2. Instantiate model `LogisticRegression(max_iter=2000)`.
3. Create a pandas DataFrame `results` with columns: Data size, training accuracy, validation accuracy
4. Call your convenience function `get_classifier_accuracy()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation accuracy for each call to the `results` DataFrame
6. Print `results`

In [32]:
# TODO: ADD YOUR CODE HERE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=2000)

results_pd = pd.DataFrame(columns = ['Data size', 'Training Accuracy', 'Validation Accuracy'])

train_acc, test_acc = get_classifier_accuracy(logreg, X, y)
results_pd.loc[0]=[X.shape, train_acc,test_acc ]

x_2col = X.iloc[:,:2]
train_acc, test_acc = get_classifier_accuracy(logreg, x_2col, y)
results_pd.loc[1]=[x_2col.shape, train_acc,test_acc ]

train_acc, test_acc = get_classifier_accuracy(logreg, x_small, y_small)
results_pd.loc[2]=[x_small.shape, train_acc,test_acc ]

print(results_pd)

    Data size  Training Accuracy  Validation Accuracy
0  (4600, 57)           0.934493             0.918261
1   (4600, 2)           0.608986             0.613043
2    (46, 57)           0.941176             0.750000


In [42]:
# Difference between training and validation accuracy for each experiment
results_pd["Difference in Accuracy"]= results_pd['Training Accuracy'] - results_pd['Validation Accuracy']
print(results_pd)


    Data size  Training Accuracy  Validation Accuracy  Difference in Accuracy
0  (4600, 57)           0.934493             0.918261                0.016232
1   (4600, 2)           0.608986             0.613043               -0.004058
2    (46, 57)           0.941176             0.750000                0.191176


### 2.4 Questions
1. What is the validation accuracy using all data? What is the difference between training and validation accuracy?
1. How do the validation accuracy and the difference between training and validation accuracy change when only two columns are used? Provide values.
1. How do the validation accuracy and the difference between training and validation accuracy change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*

1. Using all the data, the validation accuracy is 0.918261, and the difference between training and validation accuracy is 0.016232.
2. When only the first two columns are used, the validation accuracy drops from 0.918261 to 0.613043, and the difference between training and validation accuracy shrinks from 0.016232 to -0.004058 (the validation accuracy is now slightly higher than the training accuracy).
3. When only 1% of the rows are used, the validation accuracy drops from 0.918261 to 0.750000, and the difference between training and validation accuracy grows from 0.016232 to 0.191176.
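
It helps to translate the 1% accuracies back into counts. The sketch below assumes the 46 rows were split into 34 training and 12 validation rows (the default 25% validation split); the two accuracies are copied from the results table, and the recovered counts are an inference, not something the notebook printed.

```python
# Assumed split of the 46-row subset: 34 training rows, 12 validation rows.
n_train, n_val = 34, 12

# Accuracies reported in the results table for the (46, 57) experiment.
reported_train, reported_val = 0.941176, 0.750000

# Nearest whole number of correct predictions implied by each accuracy.
train_correct = round(reported_train * n_train)
val_correct = round(reported_val * n_val)

print(train_correct, "of", n_train, "training rows correct")
print(val_correct, "of", n_val, "validation rows correct")

# Accuracy change caused by a single validation row flipping.
step = 1 / n_val
print(f"one validation row = {step:.4f} accuracy")
```

With so few validation rows, a single prediction moves the validation accuracy by roughly 0.083, so the 0.750000 figure is a much noisier estimate than the full-data value.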




## 3. Regression

Using yellowbrick energy - regression  
https://www.scikit-yb.org/en/latest/api/datasets/energy.html

The goal is to investigate `LinearRegression()` and effects of reducing the number of features and number of samples on regression performance.

### 3.1 Implement convenience function

In [59]:
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def get_regressor_mse(model, X, y):
    '''Calculate train and validation mean-squared error (mse) of regressor (model)
        
        Splits feature matrix X and target vector y 
        with sklearn train_test_split() and random_state=956.
        
        model (sklearn regressor): Regressor to train and evaluate
        X (numpy.array or pandas.DataFrame): Feature matrix
        y (numpy.array or pandas.Series): Target vector
        
        returns: training mse, validation mse
    
    '''
    #TODO: IMPLEMENT FUNCTION BODY
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=956)
    model.fit(X_train, y_train)
    y_train_predicted = model.predict(X_train)
    y_test_predicted = model.predict(X_test)
    mse_training = mean_squared_error(y_train, y_train_predicted)
    mse_validation = mean_squared_error(y_test, y_test_predicted)
    return (mse_training, mse_validation)
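
For reference, the mean-squared error is the average of the squared differences between targets and predictions. The sketch below computes it by hand with NumPy on made-up numbers; `mse_by_hand` and the toy values are illustrative assumptions, not part of the lab.

```python
import numpy as np


def mse_by_hand(y_true, y_pred):
    """Average of squared residuals: mean((y_true - y_pred)^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Toy targets and predictions (made up for illustration).
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# Squared residuals are 1, 0 and 4, so the mean is 5/3.
toy_mse = mse_by_hand(y_true, y_pred)
print(toy_mse)
```

Because the residuals are squared, the MSE is in squared target units and a few large errors dominate the value.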
    
    

### 3.2 Load data

Using the yellowbrick function `load_energy()`, load the energy data set into feature matrix `X` and target vector `y`.

Print dimensions and type of `X` and `y`.

In [60]:
# TODO: ADD YOUR CODE HERE
from yellowbrick.datasets.loaders import load_energy
x,y = load_energy()
print("The size of x is: ", x.shape, " and the type is: ", type(x))
print("The size of y is: ", y.shape, " and the type is: ", type(y))


The size of x is:  (768, 8)  and the type is:  <class 'pandas.core.frame.DataFrame'>
The size of y is:  (768,)  and the type is:  <class 'pandas.core.series.Series'>


Using the sklearn function `train_test_split()` prepare a feature matrix `X_small` and target vector `y_small` that contain only **1%** of the rows. Use `random_state=174`.

Print size and type of `X_small` and `y_small`.

In [61]:
# TODO: ADD YOUR CODE HERE
x_small, _, y_small, _ = train_test_split(x, y, train_size=0.01, random_state=174)
print("The size of x_small is: ", x_small.shape, " and the type is: ", type(x_small))
print("The size of y_small is: ", y_small.shape, " and the type is: ", type(y_small))

The size of x_small is:  (7, 8)  and the type is:  <class 'pandas.core.frame.DataFrame'>
The size of y_small is:  (7,)  and the type is:  <class 'pandas.core.series.Series'>


### 3.3 Train and evaluate models

1. Import `LinearRegression` from sklearn
2. Instantiate model `LinearRegression()`.
3. Create a pandas DataFrame `results` with columns: Data size, training MSE, validation MSE
4. Call your convenience function `get_regressor_mse()` using 
    - `X` and `y`
    - Only first two columns of `X` and `y`
    - `X_small` and `y_small`
5. Add the data size, training and validation MSE for each call to the `results` DataFrame
6. Print `results`

In [68]:
# TODO: ADD YOUR CODE HERE
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

results_pd1 = pd.DataFrame(columns = ['Data size', 'Training MSE', 'Validation MSE'])

# All rows, all features
train_mse, val_mse = get_regressor_mse(linreg, x, y)
results_pd1.loc[0] = [x.shape, train_mse, val_mse]

# All rows, first two features only
x_2col = x.iloc[:, :2]
train_mse, val_mse = get_regressor_mse(linreg, x_2col, y)
results_pd1.loc[1] = [x_2col.shape, train_mse, val_mse]

# 1% of the rows, all features
train_mse, val_mse = get_regressor_mse(linreg, x_small, y_small)
results_pd1.loc[2] = [x_small.shape, train_mse, val_mse]

print(results_pd1)

  Data size  Training MSE  Validation MSE
0  (768, 8)  8.012691e+00       10.366349
1  (768, 2)  5.360043e+01       46.410426
2    (7, 8)  2.145702e-29       69.977449


In [69]:
# Difference between training and validation MSE for each experiment
results_pd1["Difference in MSE"]= results_pd1['Training MSE'] - results_pd1['Validation MSE']
print(results_pd1)

  Data size  Training MSE  Validation MSE  Difference in MSE
0  (768, 8)  8.012691e+00       10.366349          -2.353657
1  (768, 2)  5.360043e+01       46.410426           7.190004
2    (7, 8)  2.145702e-29       69.977449         -69.977449


### 3.4 Questions
1. What is the validation MSE using all data? What is the difference between training and validation MSE?
1. How do the validation MSE and the difference between training and validation MSE change when only two columns are used? Provide values.
1. How do the validation MSE and the difference between training and validation MSE change when only 1% of the rows are used? Provide values.

*YOUR ANSWERS HERE*

1. Using all the data, the validation MSE is 10.366349, and the difference (training MSE - validation MSE) is -2.353657.
2. When only the first two columns are used, the validation MSE increases from 10.366349 to 46.410426, and the difference between training and validation MSE changes sign, going from -2.353657 to 7.190004 (the training MSE, 53.60043, is now higher than the validation MSE).
3. When only 1% of the rows are used, the validation MSE increases from 10.366349 to 69.977449, and the difference between training and validation MSE grows in magnitude from -2.353657 to -69.977449 (the training MSE is essentially zero, 2.145702e-29).
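
The sign of the difference is easier to read when the gap is written as validation minus training, so that a positive value means the model does worse on unseen data. The sketch below recomputes the gaps from the values in the results table above; the dictionary layout and the name `gaps` are assumptions made for this illustration.

```python
# (training MSE, validation MSE) copied from the results table above.
reported = {
    "(768, 8)": (8.012691, 10.366349),
    "(768, 2)": (53.60043, 46.410426),
    "(7, 8)": (2.145702e-29, 69.977449),
}

gaps = {}
for size, (train_mse, val_mse) in reported.items():
    # Positive gap: validation error exceeds training error.
    gaps[size] = val_mse - train_mse
    print(f"{size}: validation - training MSE = {gaps[size]:.6f}")
```

With this convention only the two-column model has a negative gap, and the 1% model has by far the largest positive one.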


## 4. Observations/Interpretation

Describe any pattern you see in the results. Relate your findings to what we discussed during lectures. Include data to justify your findings.


*ADD YOUR FINDINGS HERE*
1. Using only 2 columns reduces the model complexity. For logistic regression, this decreased the validation accuracy (from 0.918261 to 0.613043), and the difference between training and validation accuracy shrank (from 0.016232 to -0.004058). Both accuracies are low and close together, which shows that the model is in the high-bias regime and matches the validation curve discussed in class.
 
  Reducing the model complexity in the same way for linear regression increased the validation MSE from 10.366349 to 46.410426, while the difference between training and validation MSE changed from -2.353657 to 7.190004. With reduced complexity, the model predicts the data with more error (higher MSE), as expected. This is also reflected in the high training MSE of 53.60043.

  For both models, training performance got worse with reduced model complexity, which indicates high bias: the models are underfit.

2. Using 1% of the data set reduces the size of the training set. For the logistic regression model, this increased the difference between training and validation accuracy from 0.016232 to 0.191176. The validation accuracy also dropped from 0.918261 to 0.750000. This implies that the model is in the high-variance regime and matches the learning curve described in class.

  Similarly, reducing the number of samples for the linear regression model increased the magnitude of the difference between training and validation MSE from 2.353657 to 69.977449, and the validation MSE increased from 10.366349 to 69.977449. This also indicates the high-variance regime: the training MSE (2.145702e-29) is essentially zero while the validation MSE is large, which shows that the model is overfit.
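
The near-zero training MSE has a simple explanation: with fewer training rows than coefficients, a linear model can pass through every training point exactly. The sketch below reproduces that effect on random data, assuming 5 training rows and 8 features (what a default 25% validation split of 7 rows would leave). NumPy's least-squares solver stands in for `LinearRegression()`, and all data and names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_val, n_features = 5, 50, 8

# Hypothetical ground-truth coefficients for the synthetic targets.
true_coef = rng.normal(size=n_features)


def make_data(n_rows):
    """Random features with a noisy linear target."""
    X = rng.normal(size=(n_rows, n_features))
    y = X @ true_coef + rng.normal(scale=1.0, size=n_rows)
    return X, y


def add_intercept(X):
    """Prepend a column of ones so the fit includes an intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


X_train, y_train = make_data(n_train)
X_val, y_val = make_data(n_val)

# 5 equations, 9 unknowns: the least-squares solution fits the rows exactly.
coef, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)

train_mse = float(np.mean((add_intercept(X_train) @ coef - y_train) ** 2))
val_mse = float(np.mean((add_intercept(X_val) @ coef - y_val) ** 2))

print(f"training MSE:   {train_mse:.3e}")
print(f"validation MSE: {val_mse:.3e}")
```

The training error is at the level of floating-point round-off, while the validation error stays large, which is the same overfitting signature seen in the 1% experiment.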


Overall, the following conclusion can be made:
1. Original model = good fit
2. model using 2 columns = underfit
3. model using 1% data size = overfit

## 5. Reflection
Include a sentence or two about:
- what you liked or disliked,
- what you found interesting, confusing, challenging, or motivating
while working on this assignment.


*ADD YOUR THOUGHTS HERE*

