# Regression: diabetes dataset

Included with scikit-learn is the diabetes dataset. It contains 10 predictor variables and one target variable.
We will use this to explore various types of regression.

## Load and inspect the data


In [None]:
# the usual imports
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Load the diabetes dataset and have a look at its shape. The data array itself carries no column names, so we will specify them separately.

In [None]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
diabetes.keys()

In [None]:
X = diabetes.data
y = diabetes.target

X.shape, y.shape

To better inspect the data, build a pandas DataFrame from it.

In [None]:
column_names=['age', 'sex', 'bmi', 'map', 'tc', 'ldl', 'hdl', 'tch', 'ltg', 'glu']
# fill in missing code here
df = 
# include the target variable in the DataFrame
df['target'] = 
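
As a hint for the cell above, here is a minimal sketch of the same pattern on a made-up array. The names `toy_X`, `toy_y`, and `toy_df` and all values are illustrative assumptions, not part of the diabetes data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a feature matrix: 4 samples, 3 features (values are made up).
toy_X = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6],
                  [0.7, 0.8, 0.9],
                  [1.0, 1.1, 1.2]])
toy_y = np.array([10.0, 20.0, 30.0, 40.0])

# Column labels are supplied separately, one per array column.
toy_df = pd.DataFrame(toy_X, columns=['a', 'b', 'c'])

# Assigning to a new key appends the target as an extra column.
toy_df['target'] = toy_y

print(toy_df.shape)  # (4, 4)
```

The same two steps should carry over to `X`, `y`, and `column_names`.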

Now, use pandas functionality to display (part of) the rows, summary values, and variable intercorrelations.
What can you say about the predictor variables?
Which predictor variables are most correlated with the target?
How about correlations between the predictors?

In [None]:
# display the first rows
# fill in missing code here
df.

In [None]:
# get statistical summaries per column
# fill in missing code here
df.

In [None]:
# display correlation matrix
# fill in missing code here
df.
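
The three inspection steps above could look roughly like this on a small synthetic DataFrame. The data and the names `toy_df` and `ranking` are assumptions made for illustration only. The last step is one possible way to rank predictors by their correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'x' drives the target, 'noise' is unrelated to it.
rng = np.random.RandomState(0)
x = rng.normal(size=100)
toy_df = pd.DataFrame({
    'x': x,
    'noise': rng.normal(size=100),
    'target': 2.0 * x + rng.normal(scale=0.1, size=100),
})

first_rows = toy_df.head()    # first 5 rows by default
summary = toy_df.describe()   # count, mean, std, min, quartiles, max per column
corr = toy_df.corr()          # pairwise Pearson correlations

# Rank predictors by absolute correlation with the target.
ranking = corr['target'].drop('target').abs().sort_values(ascending=False)
print(ranking)
```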

A nice way to inspect correlation strength is using a heatmap. The seaborn library has a heatmap function:

In [None]:
import seaborn as sns
plt.figure()
coefs = np.corrcoef(df.values.T)
sns.set(style='whitegrid')
hm = sns.heatmap(coefs, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=df.columns, xticklabels=df.columns) 
plt.show()
sns.reset_orig()

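Using pandas' `scatter_matrix`, we can check for nonlinear relationships:

In [None]:
# scatter_matrix lives in pandas.plotting (pandas.tools.plotting has been removed)
from pandas.plotting import scatter_matrix
# 'sex' is a binary variable, so leave it out of the scatter plots
scatter_matrix(df.drop(columns='sex'), diagonal='kde')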

## Split into train and test sets

In [None]:
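# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split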
# fill in missing code here
X_train, X_test, y_train, y_test =
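
For orientation, a minimal sketch of how `train_test_split` behaves on toy arrays. The arrays, `test_size=0.25`, and `random_state=0` are illustrative assumptions, not values prescribed by this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row i of toy_X is [2*i, 2*i + 1], so rows and targets are easy to match up.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.arange(20)

# Features and targets are shuffled together and split in the same way.
Xa, Xb, ya, yb = train_test_split(toy_X, toy_y, test_size=0.25, random_state=0)

print(Xa.shape, Xb.shape)  # (15, 2) (5, 2)
```

Fixing `random_state` makes the split reproducible between runs.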

## Linear Regression

Now, perform linear regression on the training data.
What are the regression coefficients?

In [None]:
from sklearn import linear_model
# fill in missing code here
lreg_model = 



How well does the model perform on the training data? Determine R^2.

In [None]:
'R^2 (train): {}'.format(lreg_model.score(X_train, y_train))

How well does the model perform on the test data? Determine R^2.

In [None]:
# fill in missing code here
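
To make clear what `score` reports, the sketch below fits `LinearRegression` on synthetic data and recomputes R^2 by hand as 1 - SS_res / SS_tot. The data and all variable names are assumptions for illustration. On the diabetes data, the same call with `X_test, y_test` gives the test score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: known coefficients [1.5, -2.0, 0.0] and intercept 4.0, plus noise.
rng = np.random.RandomState(0)
toy_X = rng.normal(size=(200, 3))
toy_y = toy_X.dot(np.array([1.5, -2.0, 0.0])) + 4.0 + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(toy_X, toy_y)
print('coefficients: {}, intercept: {}'.format(model.coef_, model.intercept_))

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
pred = model.predict(toy_X)
ss_res = np.sum((toy_y - pred) ** 2)
ss_tot = np.sum((toy_y - toy_y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# model.score returns the same quantity.
r2_sklearn = model.score(toy_X, toy_y)
print('R^2 manual: {}, R^2 score(): {}'.format(r2_manual, r2_sklearn))
```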

## Ordinary Least Squares using statsmodels

We can use statsmodels to get p-values for the model and the coefficients. How many coefficients are significant at the p=0.05 level?

In [None]:
import statsmodels.api as sm
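# sm.OLS does not fit an intercept by default, so add a constant column explicitly
sm_linear = sm.OLS(y_train, sm.add_constant(X_train))
sm_results = sm_linear.fit()
print(sm_results.summary())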

## Regularization - Lasso coefficient path

We have seen strong positive and negative intercorrelations among the predictors, so collinearity may be hiding the contribution of some of them, and more predictors may turn out to be useful once collinearity is reduced. Let's look at what happens when we introduce regularization. Plot the Lasso coefficient path:

In [None]:
from cycler import cycler
def coefficient_path(model, alphas, X, y):
  # fit the model once per alpha and collect the coefficient vectors
  coefs = []
  for a in alphas:
    model.set_params(alpha=a)
    model.fit(X, y)
    coefs.append(model.coef_)
  plt.figure()
  ax = plt.gca()
  # ten distinct colors, one per predictor
  ax.set_prop_cycle(cycler('color', ['c', 'm', 'y', 'k', 'b', 'r', 'g', 'orange', 'purple', 'brown']))
  ax.plot(alphas, coefs)
  ax.set_xscale('log')
  ax.set_xlim(ax.get_xlim()[::-1])  # reverse axis
  plt.xlabel('alpha')
  plt.ylabel('weights')
  plt.title('Coefficients as a function of the regularization')
  plt.axis('tight')
  plt.show()


In [None]:
alphas = np.logspace(-5, 2, 50)
alphas

In [None]:
# call the coefficient_path function with a Lasso model, the alphas, and the training data:
# fill in missing code here

## Lasso: 2 features

Now perform Lasso Regression to build a model with 2 features. What are the coefficients chosen by Lasso?
    

In [None]:
# fill in missing code here
lasso_model =
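
The number of features Lasso keeps is controlled by `alpha`: the larger the penalty, the more coefficients are driven to exactly zero. A minimal sketch on synthetic data (the data, the alpha values, and the name `n_nonzero` are assumptions for illustration) shows how to count the surviving coefficients. The alpha values that work for the diabetes data will differ and can be read off the coefficient path.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features carry signal,
# with different strengths (3.0 and 1.0).
rng = np.random.RandomState(0)
toy_X = rng.normal(size=(200, 5))
toy_y = 3.0 * toy_X[:, 0] + 1.0 * toy_X[:, 1] + rng.normal(scale=0.1, size=200)

n_nonzero = {}
for alpha in [0.01, 0.5, 2.0]:
    model = Lasso(alpha=alpha).fit(toy_X, toy_y)
    # Count the coefficients that Lasso did not shrink to exactly zero.
    n_nonzero[alpha] = int(np.sum(model.coef_ != 0))
    print('alpha = {}: {} nonzero coefficients'.format(alpha, n_nonzero[alpha]))
```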


What are the p values reported by statsmodels for this 2-coefficient model?

In [None]:
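# columns 2 and 8 are 'bmi' and 'ltg'; add a constant column so that an intercept is fitted
sm_linear = sm.GLS(y_train, sm.add_constant(X_train[:, [2, 8]]))
sm_results = sm_linear.fit()
print(sm_results.summary())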

What are the R^2 values (train and test) for the 2 coefficient Lasso model?

In [None]:
# fill in missing code here

## Lasso: 4 features

Now try Lasso Regression with 4 features. What are the coefficients chosen by Lasso?
    

In [None]:
# fill in missing code here
lasso_model = 

What are the p values reported by statsmodels for this 4-coefficient model?

In [None]:
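# columns 2, 3, 6 and 8 are 'bmi', 'map', 'hdl' and 'ltg'; add a constant column so that an intercept is fitted
sm_linear = sm.GLS(y_train, sm.add_constant(X_train[:, [2, 3, 6, 8]]))
sm_results = sm_linear.fit()
print(sm_results.summary())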

What are the R^2 values (train and test) for the 4 coefficient Lasso model?

In [None]:
# fill in missing code here

## Nonlinear regression: k Nearest Neighbors

Even though the relationships between the predictors and the target variable look fairly linear, let's see whether a nonlinear method delivers better performance.

Perform k nearest neighbors regression with different values of k and compare the R^2 scores on the training and test sets.
Does k nearest neighbors regression perform better than linear regression?

(Note: To determine the best k, you would normally use a validation set or perform cross-validation. Here we compute R^2 on the test set for different values of k only to get a quick overview.)

In [None]:
from sklearn import neighbors
n_neighbors = [3,5,10,20,30,40,50]
weight = 'distance'

for n in n_neighbors:
    print('knn (n = {})'.format(n))
    # fill in missing code here
    # pass in number of neighbors and weight
    knn_model = 
    
    print('R^2 (train): {}, R^2 (test): {}\n'.format(knn_model.score(X_train, y_train), knn_model.score(X_test, y_test)))
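
One thing to keep in mind when reading the training scores: with `weights='distance'`, each training point is its own nearest neighbor at distance zero, so the model reproduces the training targets exactly (as long as there are no duplicate points with different targets). The sketch below illustrates this on synthetic data. The data, `n_neighbors=10`, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Synthetic nonlinear data: a noisy sine curve, split into fit and held-out parts.
rng = np.random.RandomState(0)
toy_X = rng.uniform(-3, 3, size=(150, 1))
toy_y = np.sin(toy_X[:, 0]) + rng.normal(scale=0.3, size=150)
X_fit, y_fit = toy_X[:100], toy_y[:100]
X_new, y_new = toy_X[100:], toy_y[100:]

scores = {}
for weights in ['uniform', 'distance']:
    knn = KNeighborsRegressor(n_neighbors=10, weights=weights).fit(X_fit, y_fit)
    # Store (R^2 on the fitted data, R^2 on the held-out data).
    scores[weights] = (knn.score(X_fit, y_fit), knn.score(X_new, y_new))
    print('{}: R^2 (fit) = {}, R^2 (held out) = {}'.format(weights, *scores[weights]))
```

For distance-weighted KNN, the training R^2 therefore says little about generalization; the held-out score is the one to compare against linear regression.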

What's your conclusion regarding KNN regression?