
# Linear regression example

This example uses only one feature of the `diabetes` dataset, so that the regression can be illustrated with a two-dimensional plot.

The material is taken from the [scikit-learn `plot_ols` example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py).

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
%matplotlib inline

## Load dataset

In [None]:
# Load the diabetes dataset
diabetes = datasets.load_diabetes()

In [None]:
print(diabetes.DESCR)

In [None]:
diabetes.feature_names

## Extract data from dataset
Note: each `np.newaxis` object in the selection tuple expands the result by one unit-length dimension, inserted at the position where `np.newaxis` appears in the tuple. Here it keeps `diabetes_X` two-dimensional (a single column), which is the shape scikit-learn estimators expect for a feature matrix.
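To make the effect of `np.newaxis` concrete, here is a minimal sketch on a small made-up array (the `data` values are an assumption standing in for `diabetes.data`, not the real dataset).

```python
import numpy as np

# Small stand-in for diabetes.data: 4 samples, 3 features (made-up values)
data = np.arange(12).reshape(4, 3)

# Plain integer indexing drops the feature axis and yields a 1-D array
flat = data[:, 2]
print(flat.shape)      # (4,)

# np.newaxis in the second position inserts a unit-length axis there
column = data[:, np.newaxis, 2]
print(column.shape)    # (4, 1)

# Slicing with a length-one range gives the same 2-D column
same_column = data[:, 2:3]
print(np.array_equal(column, same_column))  # True
```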

In [None]:
# Use only one feature
feature_number = 2 # BMI values
diabetes_X = diabetes.data[:, np.newaxis, feature_number]

In [None]:
diabetes.data

## Split your data into training and test sets

In [None]:
# Play with list slicing
l = [1, 2, 3, 4, 5]

In [None]:
# Pythonic way to extract the 1st and 2nd items: keep everything except the last three
l[:-3]
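The split used in the next cell relies on two complementary negative slices. A minimal sketch with the same toy list shows that `[:-n]` and `[-n:]` together cover the whole sequence without overlap (the names `head` and `tail` are illustrative only).

```python
l = [1, 2, 3, 4, 5]

head = l[:-3]   # everything except the last three items
tail = l[-3:]   # only the last three items

print(head)              # [1, 2]
print(tail)              # [3, 4, 5]
print(head + tail == l)  # True
```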

In [None]:
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
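As an alternative to manual slicing, scikit-learn's `train_test_split` can produce the same kind of split. The sketch below uses made-up arrays `X` and `y` (assumptions, not the diabetes data) and disables shuffling so that the result matches the slicing approach; in practice a shuffled split is often preferable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 50 samples with a single feature
X = np.arange(50).reshape(-1, 1)
y = np.arange(50) * 2.0

# Manual split, as in the cell above: the last 20 samples form the test set
X_train_manual, X_test_manual = X[:-20], X[-20:]

# Same split via train_test_split; shuffle=False keeps the original order
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=20, shuffle=False
)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test_manual))
```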

In [None]:
# Plot the test data
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.show()

## Linear regression

In statistics, linear regression is a linear approach to modelling the relationship between a dependent variable and one or more independent variables.

- [Wikipedia reference](https://en.wikipedia.org/wiki/Linear_regression)
- [scikit reference](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
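With a single independent variable the model is a straight line, `y = coef * x + intercept`. The following sketch fits `LinearRegression` to noise-free synthetic points generated from a known line (the slope 3 and intercept 2 are arbitrary assumptions) to show that the fitted parameters recover it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data on the line y = 3x + 2 (arbitrary choice)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = x[:, np.newaxis]          # scikit-learn expects a 2-D feature matrix
y = 3.0 * x + 2.0

model = LinearRegression().fit(X, y)

# The fitted slope and intercept should be close to 3 and 2
print(model.coef_, model.intercept_)
```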

In [None]:
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

## Make predictions

In [None]:
# Make predictions using the testing set
diabetes_y_pred = regr.predict(diabetes_X_test)

## Evaluation

### Mean squared error
A risk metric corresponding to the expected value of the squared (quadratic) error or loss. For $n$ samples with observed values $Y_i$ and predicted values $\hat{Y}_i$:

$$
\operatorname{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(Y_i - \hat{Y}_i\right)^2
$$

- [scikit reference](https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error)
- [Wikipedia reference](https://en.wikipedia.org/wiki/Mean_squared_error)
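The formula can be checked by hand against `mean_squared_error`. This is a minimal sketch on four made-up values (`y_true` and `y_hat` are assumptions, unrelated to the diabetes data).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up observed values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# Average of the squared differences, as in the formula above
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual)   # 0.375
print(np.isclose(mse_manual, mse_sklearn))
```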

### $R^2$-score
The $R^2$ score (coefficient of determination) indicates how well unseen samples are likely to be predicted by the model. The best possible score is 1.0, and the score can be negative because a model can be arbitrarily worse than a constant predictor. A constant model that always predicts the expected value (mean) of `y`, disregarding the input features, gets an $R^2$ score of 0.0.
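The three cases above can be reproduced with the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below uses made-up values, and `r2_manual` is a hypothetical helper written for this illustration; it is compared against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed values and predictions
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

def r2_manual(y, y_pred):
    """Hypothetical helper: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Reasonable predictions: score close to 1
r2_good = r2_manual(y_true, y_hat)

# Constant model that always predicts the mean of y: score of 0
r2_constant = r2_manual(y_true, np.full_like(y_true, y_true.mean()))

# Deliberately poor predictions (sign flipped): negative score
r2_bad = r2_manual(y_true, -y_true)

print(r2_good, r2_constant, r2_bad)
print(np.isclose(r2_good, r2_score(y_true, y_hat)))
```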


In [None]:
# The coefficients -- Estimated coefficients for the linear regression problem
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination (R^2): 1 is perfect prediction
print('R2 score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))

## Plot results

In [None]:
# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()

The plot shows the fitted straight line: linear regression chooses the line that minimizes the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation.
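To illustrate the least-squares idea, this sketch solves the same minimization directly with `np.linalg.lstsq` on synthetic noisy data and compares the result with `LinearRegression`. The data, the random seed, and the helper `rss` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic noisy points around the line y = 1.5x + 0.5 (arbitrary choice)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
y = 1.5 * x + 0.5 + rng.normal(scale=0.1, size=30)

# Design matrix with a column of ones so the intercept is estimated too
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# The same fit through scikit-learn
model = LinearRegression().fit(x[:, np.newaxis], y)

def rss(a, b):
    """Residual sum of squares of the line y = a*x + b on the data above."""
    return np.sum((y - (a * x + b)) ** 2)

# Both approaches should agree on the slope and intercept
print(np.allclose([slope, intercept], [model.coef_[0], model.intercept_]))

# Moving away from the least-squares solution increases the RSS
print(rss(slope, intercept) <= rss(slope + 0.1, intercept))
```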

## EXERCISE 1

Use a different feature of the `diabetes` dataset. That is, set `feature_number` to an integer index between 0 and 9 (the dataset has 10 features, listed in `diabetes.feature_names`).