# Three learning algorithms

**Author: Diana Mateus**



Scikit-learn is a very popular Machine Learning library for Python. In this notebook we will study how to put into practice three simple machine learning models:

*   Linear Regression
*   K-nearest neighbors
*   Naive Bayes 


**GOALS**: 

*   Understanding the purpose of data splitting in Machine Learning
*   Experimenting with the train, validation and test schemes using existing models from the scikit-learn library
*   Evaluating a binary classification or a regression ML approach


## 0. Importing Modules and Data

Run the following lines to load the modules required for this lab.

In [1]:
import numpy as np #scientific computing (in ML it handles and operates on multi-dimensional arrays)
import matplotlib.pyplot as plt #for data visualization
import sklearn #for Machine Learning
import pandas as pd #for reading, writing and processing databases 


### Loading and Splitting Data

Remember that the objective of a supervised Machine Learning method is

*   to learn from examples
*   how to make predictions
*   for unseen data!!! (Generalization)

To train a supervised learning model we need an annotated dataset. The dataset is often a matrix $X$ of dimensions $N \times D$, with $N$ the number of points/samples and $D$ the dimensionality of the vector describing *one* sample.
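
As a concrete picture of this layout, here is a minimal sketch with a hypothetical toy dataset (the names `X_toy` and `y_toy` and the values are made up for illustration, they are not part of the lab data). Each row is one sample, each column is one feature, and there is one annotation per row.

```python
import numpy as np

# Hypothetical toy dataset: N = 4 samples, each described by D = 3 features.
X_toy = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.2, 3.4, 5.4],
    [5.9, 3.0, 5.1],
])
# One annotation (here a class label) per sample, so y has N entries.
y_toy = np.array([0, 0, 1, 1])

N, D = X_toy.shape
print('N (number of samples):', N)
print('D (features per sample):', D)
print('Feature vector of the first sample:', X_toy[0])
```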

We need part of the data to train the model parameters. Moreover, if the model has hyperparameters (non-trainable parameters), their tuning should be done on a different subset of the data. Finally, to verify that the model generalises well, it is important to evaluate its performance on unseen data (used neither for training nor for validation). For the above reasons, it is necessary to split the data matrix into three groups:
*  **Training set** : used to fit the model parameters.
*  **Validation set** : used to set the model hyper-parameters.
*  **Test set** : used only after training and validation have been finished to evaluate the performance of the method.

For real-life problems it is important to keep the use of the test set to a minimum: every time a test result influences a modelling decision, the test score becomes a less reliable estimate of generalization.
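
One possible way to obtain the three groups is to call scikit-learn's `train_test_split` twice: first to set the test set aside, then to divide the remainder into train and validation. The sketch below uses synthetic data, and the 60/20/20 sizes and the variable names (`X_demo`, `X_tr`, `X_val`, `X_te`) are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: 100 samples, 5 features, binary labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = rng.integers(0, 2, size=100)

# Step 1: set aside 20 samples as the test set.
# (test_size accepts an absolute number of samples or a fraction such as 0.2.)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=20, random_state=0)

# Step 2: split the remaining 80 samples into 60 for training and 20 for validation.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=20, random_state=0)

print('train', X_tr.shape, 'val', X_val.shape, 'test', X_te.shape)
```

With this scheme the test samples are never seen while fitting parameters or choosing hyperparameters.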

**What to do**: Run the following sections to load and split two datasets: one for regression and one for classification. Explore and change the code lines to understand the dimensions of each dataset and how to split it.


### Datasets for Regression


In [None]:
from sklearn import datasets 
diabetes = datasets.load_diabetes()

# Explore on your own the dimensions of the dataset and their meaning. 
print('The full data matrix has shape',diabetes.data.shape)

# Keep a single feature: column 2 (the 'bmi' feature); np.newaxis keeps X two-dimensional (N x 1)
X = diabetes.data[:, np.newaxis, 2]

# Splitting the data matrix
X_train = X[:-30]
X_test = X[-30:]
y_train = diabetes.target[:-30]
y_test = diabetes.target[-30:]

# Explore the data
#print(diabetes.DESCR) #Comment/Uncomment to see the dataset description
print('Names of the features', diabetes.feature_names)
print('Shape of the target vector',diabetes.target.shape)
print('X train', X_train.shape)
print('X test',X_test.shape)
print('y train',y_train.shape)
print('y test',y_test.shape)



### Datasets for Classification

In [None]:
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()

label_names = data['target_names']
labels = data['target']
feature_names = data['feature_names']
features = data['data']

print(label_names)
print(labels.shape)
print(feature_names)
print(features.shape)
#print(data.DESCR) #Comment/Uncomment to see or hide the dataset description

### Dataset splitting

The following scikit-learn function splits the dataset into train and test sets (here 40% of the samples go to the test set):

In [None]:
from sklearn.model_selection import train_test_split

X2_train, X2_test, y2_train, y2_test = train_test_split(features,labels,test_size = 0.40, random_state = 42)
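
To check what such a split produces, one can inspect the shapes and the class proportions of each subset. The sketch below is a variation on the call above, not the lab's required setup: it reloads the data under hypothetical names (`Xc`, `yc`) and additionally passes the `stratify` argument of `train_test_split`, which keeps the class proportions approximately equal in both subsets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the classification data under separate (hypothetical) names.
data_demo = load_breast_cancer()
Xc, yc = data_demo['data'], data_demo['target']

# Same split sizes as above, plus stratification on the labels.
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.40, random_state=42, stratify=yc)

print('train shape', Xc_train.shape, '| test shape', Xc_test.shape)
# Labels are 0/1, so the mean is the fraction of samples with label 1.
print('fraction of label 1 overall :', yc.mean())
print('fraction of label 1 in train:', yc_train.mean())
print('fraction of label 1 in test :', yc_test.mean())
```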


## 1. Training an ML model

When relying on the scikit-learn library, training a model is very simple. You need to:
*   Import the model class from scikit-learn
*   Declare a new instance of the model
*   Train the model parameters
*   Make predictions for new data
*   Evaluate the performance

Identify the above steps in the example code.



### Model 1. Linear Regression

In [None]:
#Load and declare a new instance 
from sklearn import linear_model
regr = linear_model.LinearRegression()

In [None]:
#Fit (train) the model 
regr.fit(X_train, y_train)
print('Coefficients: \n', regr.coef_)
print('Intercept: \n', regr.intercept_)

In [None]:
#Make predictions
y_pred = regr.predict(X_test)

In [None]:
#Evaluate the performance
from sklearn.metrics import mean_squared_error, r2_score
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print('Coefficient of determination (R^2): %.2f' % r2_score(y_test, y_pred))
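
For reference, the two regression metrics used above can be written out by hand. The mean squared error is the average of the squared residuals, and the coefficient of determination is $R^2 = 1 - SS_{res}/SS_{tot}$, where $SS_{res}$ is the sum of squared residuals and $SS_{tot}$ is the sum of squared deviations of the targets from their mean. The sketch below applies these formulas to hypothetical toy values (`y_true` and `y_hat` are made up, they are not the diabetes data).

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 10.0])

residuals = y_true - y_hat

# Mean squared error: average squared residual (lower is better).
mse = np.mean(residuals ** 2)

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print('MSE: %.3f' % mse)
print('R^2: %.3f' % r2)
```

An $R^2$ of 1 means perfect predictions, 0 means the model does no better than always predicting the mean of the targets, and negative values are possible for worse models.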

In [None]:
#Visualize 
plt.scatter(X_test, y_test, color='blue')
plt.plot(X_test, y_pred, color='red', linewidth=3)
plt.xlabel('bmi')
plt.ylabel('diabetes progression')
plt.show()

### Model 2. K-Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knnClassifier = KNeighborsClassifier(n_neighbors=2)
knnClassifier.fit(X2_train, y2_train)
y2_pred_knn = knnClassifier.predict(X2_test)
print(y2_pred_knn.shape)

In [None]:
#Compute accuracy on the training set
train_accuracy = knnClassifier.score(X2_train, y2_train)
    
#Compute accuracy on the test set
test_accuracy = knnClassifier.score(X2_test, y2_test) 

print('train accuracy', train_accuracy)
print('test accuracy',test_accuracy)

### Model 3. Naive Bayes Classifier

In [None]:
from sklearn.naive_bayes import GaussianNB
gnbClassifier = GaussianNB()
gnbClassifier.fit(X2_train,y2_train)
y2_pred_gnb = gnbClassifier.predict(X2_test)
print(y2_pred_gnb.shape)

In [None]:
#Compute accuracy on the training set
train_accuracy = gnbClassifier.score(X2_train, y2_train)
    
#Compute accuracy on the test set
test_accuracy = gnbClassifier.score(X2_test, y2_test) 

print('train accuracy', train_accuracy)
print('test accuracy',test_accuracy)

# QUESTIONS



1.   In the regression example, keeping the size of the test set fixed, what is the effect on the (test) performance when we progressively increase the amount of training data? Demonstrate with performance plots and discuss.
2.   In the linear regression example, keeping the size of the test set fixed, what is the effect of using variables other than the 'bmi' feature used during training? Is using more information always better? (Try adding new features, report the results and discuss them.)
3.   In the above examples, we have split each dataset into two subsets. Are two subsets (train and test) enough for the three models (linear regression, KNN, Naive Bayes)? Answer for each model.
4.   Repeat the KNN example, but split the dataset into three subsets (train, val, test). Progressively modify the hyperparameter k. What is the best neighborhood size k? What is the appropriate methodology to find this number?
5.   How do we know if learning was really successful (vs. underfitting or overfitting)?
7. What is the highest performance you can achieve for the two classification methods? Which classification model is better, and why?
8.   Naive Bayes classifiers are probabilistic classifiers. How do we recover the probabilistic information associated with this model? What quantities can we recover?
9.   How is the performance score computed? Create a function that calculates the TP, TN, FP and FN and replace the built-in score function with your own. Include this code in the report.
10.   We saw in the lecture that linear models were used for regression, while KNN and Naive Bayes were used for classification. Can we use linear models for classification? Can we use KNN or Naive Bayes for regression problems?

**BONUS**
Implement your own version of one of the three algorithms and compare its results with those of the scikit-learn implementation.

