<a href="https://colab.research.google.com/github/securitylab-repository/TPS-IA/blob/master/TP2_2021.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting the progression of diabetes using linear regression

The **diabetes** data set described in lecture can be obtained as a single file, [diabetes-data.csv](https://raw.githubusercontent.com/securitylab-repository/TPS-IA/master/diabetes-data.csv), from the course website. We obtained it at https://web.stanford.edu/~hastie/Papers/LARS/diabetes.data. For some background information on the data, see this seminal paper:

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.

Before you start on this notebook, place `diabetes-data.csv` in the same directory as the notebook. We will walk through some of the examples from lecture and give you some problems to solve.


In [None]:
# Standard includes
%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
# Routines for linear regression
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
# Set label size for plots
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14)

This next snippet of code loads in the diabetes data. There are 442 data points, each with 10 predictor variables (which we'll denote `x`) and one response variable (which we'll denote `y`).

Make sure the file `'diabetes-data.csv'` is in the same directory as this notebook.

In [None]:
data = np.genfromtxt('diabetes-data.csv', delimiter=',')
features = ['age', 'sex', 'body mass index', 'blood pressure', 
            'serum1', 'serum2', 'serum3', 'serum4', 'serum5', 'serum6']
x = data[:,0:10] # predictors
y = data[:,10] # response variable

## 1. Predict `y` without using `x`

If we want to predict `y` without knowledge of `x`, what value would we predict? The <font color="magenta">mean</font> value of `y`.

In this case, the mean squared error (MSE) associated with the prediction is simply the variance of `y`.

In [None]:
print("Prediction: ", np.mean(y))
print("Mean squared error: ", np.var(y))
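
The claim that the MSE of the mean prediction equals the variance can be checked numerically. The sketch below uses a synthetic response vector (an assumption, so that it runs without the data file) and a hypothetical helper `constant_mse`. It also compares against the median to illustrate that another constant does no better under squared error.

```python
import numpy as np

# Synthetic stand-in for the response vector (assumption: not the diabetes data)
rng = np.random.default_rng(0)
y_demo = rng.normal(loc=150.0, scale=75.0, size=442)

def constant_mse(values, c):
    """Mean squared error of predicting the constant c for every point."""
    return np.mean((values - c) ** 2)

mse_at_mean = constant_mse(y_demo, np.mean(y_demo))
mse_at_median = constant_mse(y_demo, np.median(y_demo))

print("MSE at the mean:  ", mse_at_mean)
print("Variance:         ", np.var(y_demo))   # population variance (ddof=0)
print("MSE at the median:", mse_at_median)
```

Note that `np.var` uses `ddof=0` by default, which is what makes the two numbers agree exactly.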

## 2. Predict `y` using `x`

To fit a linear regression model, we could directly use the formula we saw in lecture. To make things even easier, this is already implemented in `sklearn.linear_model.LinearRegression()`.
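
As a rough illustration of what "directly using the formula" could look like, the sketch below solves the least-squares problem with NumPy and compares the result with `LinearRegression`. It assumes the lecture formula is the ordinary least-squares solution with an intercept. The data and the names `x_demo`, `y_demo`, and `x_aug` are synthetic and hypothetical.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data (assumption): 3 features, known weights, small noise
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 3))
y_demo = x_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept b is learned as an extra weight
x_aug = np.hstack([x_demo, np.ones((x_demo.shape[0], 1))])

# Least-squares solution of x_aug @ theta ~ y_demo
theta, *_ = np.linalg.lstsq(x_aug, y_demo, rcond=None)
w_closed, b_closed = theta[:-1], theta[-1]

# The same fit through sklearn
regr_demo = linear_model.LinearRegression().fit(x_demo, y_demo)

print("closed form:", w_closed, b_closed)
print("sklearn:    ", regr_demo.coef_, regr_demo.intercept_)
```

`np.linalg.lstsq` is used here instead of explicitly inverting a matrix because it is numerically more robust. Both routes minimize the same squared error.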

The following code takes `x` and `y`, along with the index `f = 3` of a single feature, and fits a linear regressor to `(x[:, [f]], y)`, i.e. to column `f` of `x` and the response `y`. It then plots the data along with the resulting line.

In [None]:
data = np.genfromtxt('diabetes-data.csv', delimiter=',')
features = ['age', 'sex', 'body mass index', 'blood pressure', 
            'serum1', 'serum2', 'serum3', 'serum4', 'serum5', 'serum6']
x = data[:,0:10] # predictors
y = data[:,10] # response variable
f = 3  # index of the feature to use (blood pressure)
regr = linear_model.LinearRegression()
x1 = x[:,[f]]  # list index keeps x1 two-dimensional, as sklearn expects
regr.fit(x1, y)
# Make predictions using the model. A trick to plot the fitted line
y_pred = regr.predict(x1)
# Plot data points as well as predictions
plt.plot(x1, y, 'bo')
plt.plot(x1, y_pred, 'r-', linewidth=3)
plt.xlabel(features[f], fontsize=14)
plt.ylabel('Progression of disease', fontsize=14)
plt.show()
print("w = ", regr.coef_)
print("b = ", regr.intercept_)

print("Score: ", regr.score(x1, y))
print("Mean squared error: ", mean_squared_error(y, y_pred))

> Compare the results with those obtained in question `1.`

> Give code that makes a prediction for the values `84` and `83` of the `blood pressure` feature
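
As a hint on the input format only: `predict` expects a 2-D array of shape `(n_samples, n_features)`, even for a single feature. The sketch below shows this on a toy model (hypothetical data lying on the line `y = 2x + 1`, not the diabetes data).

```python
import numpy as np
from sklearn import linear_model

# Toy one-feature data lying exactly on y = 2x + 1 (hypothetical)
x_toy = np.array([[60.0], [70.0], [80.0], [90.0], [100.0]])
y_toy = 2.0 * x_toy[:, 0] + 1.0

regr_toy = linear_model.LinearRegression().fit(x_toy, y_toy)

# One row per query point, one column per feature
new_values = np.array([[84.0], [83.0]])
predictions = regr_toy.predict(new_values)
print(predictions)
```

Passing a flat array such as `np.array([84.0, 83.0])` would raise an error, because sklearn cannot tell whether it holds two samples or two features.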

> Let's try this with feature #2 (body mass index).

> For you to try: Feature #2 ('body mass index') is the single feature that yields the lowest mean squared error. Which feature is the second best?

> What is the problem with the approach used here to compute the `score` and `MSE`?
  - Propose a solution to this issue using the function below.

In [None]:
def split_data(n_train):
    # Prints a message and returns None when n_train is out of range
    if (n_train < 0) or (n_train > 442):
        print("Invalid number of training points")
        return
    np.random.seed(0)  # fixed seed: the same split on every call
    perm = np.random.permutation(442)
    training_indices = perm[:n_train]
    test_indices = perm[n_train:]
    trainx = x[training_indices,:]
    trainy = y[training_indices]
    testx = x[test_indices,:]
    testy = y[test_indices]
    return trainx, trainy, testx, testy

The function `split_data` partitions the data set into separate training and test sets. It is invoked as follows:

* `trainx, trainy, testx, testy = split_data(n_train)`

Here:
* `n_train` is the desired number of training points
* `trainx` and `trainy` are the training points and response values
* `testx` and `testy` are the test points and response values

The split is done randomly, but the random seed is fixed, and thus the same split is produced if the procedure is called repeatedly with the same `n_train` parameter.
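
One way such a split might be used is sketched below: fit on the training points only, then report the error on both parts. To stay self-contained, the sketch builds synthetic data and redoes the permutation split inline rather than calling `split_data`. The names `x_demo`, `y_demo`, `train_idx`, and `test_idx` are hypothetical.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic data with the same shape as the diabetes set (assumption)
rng = np.random.default_rng(0)
n_points, n_train = 442, 300
x_demo = rng.normal(size=(n_points, 10))
y_demo = x_demo @ rng.normal(size=10) + rng.normal(scale=0.5, size=n_points)

# Same idea as split_data: shuffle the indices once, then cut
perm = rng.permutation(n_points)
train_idx, test_idx = perm[:n_train], perm[n_train:]

# Fit on the training points only
regr_demo = linear_model.LinearRegression()
regr_demo.fit(x_demo[train_idx], y_demo[train_idx])

train_mse = mean_squared_error(y_demo[train_idx], regr_demo.predict(x_demo[train_idx]))
test_mse = mean_squared_error(y_demo[test_idx], regr_demo.predict(x_demo[test_idx]))
print("Training MSE:", train_mse)
print("Test MSE:    ", test_mse)
```

The test MSE is the more honest estimate of how the model behaves on unseen data, since those points played no part in the fit.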


In [None]:
### You can use this space to figure out the second-best feature

> Extend this solution to predict `y` using a specified subset of features from `x`
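
A possible starting point, shown on synthetic data: indexing the columns with a list of feature indices keeps the array two-dimensional, so the same `fit` and `predict` calls work for any subset. The helper `fit_on_subset` and the data are hypothetical, and the MSE here is computed on the data used for fitting.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

def fit_on_subset(x_all, y_all, feature_indices):
    """Fit a linear regressor on the chosen columns.

    Returns the fitted model and its MSE on the same data.
    """
    x_sub = x_all[:, feature_indices]   # list index keeps the array 2-D
    model = linear_model.LinearRegression().fit(x_sub, y_all)
    return model, mean_squared_error(y_all, model.predict(x_sub))

# Synthetic data (assumption): only columns 2 and 8 influence the response
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 10))
y_demo = 3.0 * x_demo[:, 2] - 2.0 * x_demo[:, 8] + rng.normal(scale=0.1, size=200)

model_one, mse_one = fit_on_subset(x_demo, y_demo, [2])
model_two, mse_two = fit_on_subset(x_demo, y_demo, [2, 8])
print("MSE with feature 2 only:   ", mse_one)
print("MSE with features 2 and 8: ", mse_two)
```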


In [None]:
### You can use this space to figure out the extension

> Is the data normalized?
  - If not, normalize it.
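
One common way to check and normalize is sketched below, assuming that "normalization" here means standardizing each column to zero mean and unit variance (the lab may intend a different scheme). The data is synthetic. `StandardScaler` is fitted on the training part only and then applied to both parts.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two synthetic columns on very different scales (assumption)
x_train = np.column_stack([rng.normal(50, 10, 300), rng.normal(0.5, 0.05, 300)])
x_test = np.column_stack([rng.normal(50, 10, 100), rng.normal(0.5, 0.05, 100)])

# Check: per-column mean and standard deviation before scaling
print("column means before:", x_train.mean(axis=0))
print("column stds before: ", x_train.std(axis=0))

# Learn the scaling from the training data, then reuse it on the test data
scaler = StandardScaler().fit(x_train)
x_train_std = scaler.transform(x_train)
x_test_std = scaler.transform(x_test)

print("column means after: ", x_train_std.mean(axis=0))
print("column stds after:  ", x_train_std.std(axis=0))
```

Fitting the scaler on the training data alone avoids leaking information from the test set into the model.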

In [None]:
### You can use this space to figure out the normalization of the data

> Improve your linear regression model by `regularizing` it. For this purpose, use the `sklearn.linear_model.RidgeCV` and/or `sklearn.linear_model.Lasso` estimators.
  - Compare the results with those of the simple linear regression model.
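
The comparison could be organized roughly as follows. This is a sketch on synthetic data with a sparse true weight vector. The candidate `alphas` for `RidgeCV` and the `alpha=0.1` for `Lasso` are arbitrary illustrative choices, not recommended values for the diabetes data.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Synthetic data (assumption): only 3 of the 10 features matter
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 10))
true_w = np.array([5.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0])
y_demo = x_demo @ true_w + rng.normal(scale=1.0, size=200)
x_tr, y_tr, x_te, y_te = x_demo[:150], y_demo[:150], x_demo[150:], y_demo[150:]

models = {
    "linear": linear_model.LinearRegression(),
    "ridge (CV)": linear_model.RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]),
    "lasso": linear_model.Lasso(alpha=0.1),
}

# Fit each model on the training part and score it on the held-out part
test_mse = {}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, model.predict(x_te))
    print(name, "test MSE:", test_mse[name])

print("alpha chosen by RidgeCV:", models["ridge (CV)"].alpha_)
print("lasso coefficients set to zero:", int(np.sum(models["lasso"].coef_ == 0)))
```

`RidgeCV` picks its penalty among the supplied `alphas` by cross-validation on the training data, whereas `Lasso` takes a fixed `alpha` and tends to drive some coefficients exactly to zero.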

In [None]:
### You can use this space to figure out the regularization