# Diabetes Data Set

Dataset file: 'diabetes.data'  
Reference link for description of dataset: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

### Preview of the Data Set

Load the data set.

a) Analyse the data set. Print the number of features, feature names, data types of the features, number of data points and the values of the first 10 data points.

In [8]:
import pandas as pd
data = pd.read_csv('diabetes.data', sep='\t')  # tab-separated file; pass the separator by keyword
print(data.head(10))  # first 10 data points
data = data.values  # convert to a NumPy array for slicing

   AGE  SEX   BMI     BP   S1     S2    S3    S4      S5  S6    Y
0   59    2  32.1  101.0  157   93.2  38.0  4.00  4.8598  87  151
1   48    1  21.6   87.0  183  103.2  70.0  3.00  3.8918  69   75
2   72    2  30.5   93.0  156   93.6  41.0  4.00  4.6728  85  141
3   24    1  25.3   84.0  198  131.4  40.0  5.00  4.8903  89  206
4   50    1  23.0  101.0  192  125.4  52.0  4.00  4.2905  80  135
5   23    1  22.6   89.0  139   64.8  61.0  2.00  4.1897  68   97
6   36    2  22.0   90.0  160   99.6  50.0  3.00  3.9512  82  138
7   66    2  26.2  114.0  255  185.0  56.0  4.55  4.2485  92   63
8   60    2  32.1   83.0  179  119.4  42.0  4.00  4.4773  94  110
9   29    1  30.0   85.0  180   93.4  43.0  4.00  5.3845  88  310


### Training and Testing Data Sets

b) Split the data set into training and testing data sets with an 80:20 ratio.

(Hint: What precautions must you take before you split the data set?)

In [9]:
import numpy as np
np.random.seed(0)

# Shuffle the rows before slicing so that any ordering in the file does not bias the split
np.random.shuffle(data)
n_train = round(0.8 * len(data))  # 80% of the rows; 354 for this data set
train = data[:n_train, :]
test = data[n_train:, :]
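
The precaution in the hint is to randomize the row order before slicing. As a sketch of the same idea without shuffling the array in place, the rows can be indexed through a random permutation. The array `data_demo` below is a hypothetical stand-in, and its 442 rows are an assumption about the size of the file; the other names are illustrative too.

```python
import numpy as np

# Hypothetical stand-in for the loaded array: 442 rows, 11 columns (assumed size).
data_demo = np.arange(442 * 11, dtype=float).reshape(442, 11)

rng = np.random.default_rng(0)           # seeded so the split is reproducible
perm = rng.permutation(len(data_demo))   # shuffled row indices
n_train = round(0.8 * len(data_demo))    # 80% of the rows

train_demo = data_demo[perm[:n_train]]   # first 80% of the shuffled indices
test_demo = data_demo[perm[n_train:]]    # remaining 20%
print(train_demo.shape, test_demo.shape)
```

Because the permutation only reorders indices, the original array keeps its row order, which can be convenient if the data is needed again later.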

### Linear Regression

c) Using linear regression, seek a model for the response of interest ($Y$), as a function of the baseline variables such as age, sex, body mass index, etc. Compute the training error and testing error.

In [17]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X_train = train[:, 0:10]
Y_train = train[:, 10]
X_test = test[:, 0:10]
Y_test = test[:, 10]

reg = LinearRegression()
reg.fit(X_train, Y_train)

print("Train Error: ", mean_squared_error(Y_train, reg.predict(X_train)))
print("Test Error: ", mean_squared_error(Y_test, reg.predict(X_test)))

Train Error:  2908.6779784108207
Test Error:  2753.3208215123846
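
The errors reported here are mean squared errors, i.e. the average of the squared differences between the observed and predicted responses. The small sketch below checks this definition against `mean_squared_error`; the three `y_true` values are the first three `Y` entries from the preview, while the `y_pred` values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([151.0, 75.0, 141.0])  # first three Y values from the preview
y_pred = np.array([150.0, 80.0, 140.0])  # hypothetical predictions

# Mean of the squared residuals, computed by hand.
manual_mse = np.mean((y_true - y_pred) ** 2)
library_mse = mean_squared_error(y_true, y_pred)
print(manual_mse, library_mse)
```

Since the MSE is in squared units of $Y$, its square root is often easier to interpret: a test MSE of about 2753 corresponds to a typical error of roughly 52.5 in the units of $Y$.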


### Data Preprocessing

d) Normalize the data set and perform linear regression again. Compute the training error and testing error. Comment.

In [18]:
from sklearn.preprocessing import StandardScaler

scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train)
X_test_scaled = scale.transform(X_test)

reg = LinearRegression()
reg.fit(X_train_scaled, Y_train)

print("Train Error after Normalize: ", mean_squared_error(Y_train, reg.predict(X_train_scaled)))
print("Test Error after Normalize: ", mean_squared_error(Y_test, reg.predict(X_test_scaled)))
print()
print("The train and test errors are unchanged (up to floating-point rounding).")
print("Standardizing subtracts a constant from each feature and divides it by a constant.")
print("Such a rescaling changes the fitted coefficients and intercept but not the predictions.")
print("However, standardized coefficients are on a common scale, which helps in comparing feature importance.")

Train Error after Normalize:  2908.6779784108203
Test Error after Normalize:  2753.320821512383

The train and test errors are unchanged (up to floating-point rounding).
Standardizing subtracts a constant from each feature and divides it by a constant.
Such a rescaling changes the fitted coefficients and intercept but not the predictions.
However, standardized coefficients are on a common scale, which helps in comparing feature importance.
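
The invariance of the predictions can be checked on a toy problem. The sketch below uses synthetic features on very different scales (not the diabetes data; all names and numbers are illustrative) and fits ordinary least squares with an intercept before and after standardization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features with different scales and offsets (illustrative only).
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 0.01]) + np.array([0.0, 200.0, 5.0])
y = X @ np.array([2.0, -0.1, 300.0]) + rng.normal(size=100)

raw = LinearRegression().fit(X, y)
X_std = StandardScaler().fit_transform(X)
std = LinearRegression().fit(X_std, y)

# Predictions agree, while the coefficients are expressed on different scales.
print(np.allclose(raw.predict(X), std.predict(X_std)))
print(raw.coef_)
print(std.coef_)
```

This holds for unregularized least squares with an intercept. With a penalized model such as ridge or lasso, scaling the features generally does change the fit.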


### Feature Reduction

e) Rank the features in order of importance (based on the study in d)). Comment.

In [19]:
# print(reg.coef_)
print("Features in decreasing order of importance, by absolute value of their standardized coefficients")
print("S1")
print("S5")
print("BMI")
print("S2")
print("BP")
print("S4")
print("SEX")
print("S3")
print("S6")
print("AGE")

Features in decreasing order of importance, by absolute value of their standardized coefficients
S1
S5
BMI
S2
BP
S4
SEX
S3
S6
AGE
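
Rather than typing the ranking by hand, it could be derived from the fitted coefficients by sorting on their absolute values. The sketch below shows one way to do this on a noise-free toy problem; the feature names `A`, `B`, `C` and the coefficients 5, -10 and 0.5 are hypothetical and are not taken from the diabetes fit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
names = np.array(["A", "B", "C"])                    # hypothetical feature names
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] - 10.0 * X[:, 1] + 0.5 * X[:, 2]   # noise-free toy response

X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

# Sort by magnitude, largest first; the sign only gives the direction of the effect.
order = np.argsort(-np.abs(model.coef_))
ranked = [str(name) for name in names[order]]
for name, coef in zip(ranked, model.coef_[order]):
    print(f"{name}: {coef:+.3f}")
```

When features are strongly correlated, individual coefficients can be unstable, so a ranking of this kind is best read as a rough guide.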


### Polynomial Regression

f) Repeat the exercise in d) with quadratic features. List the features you would add to the existing data set. Compute the training error and the testing error. Comment.

In [155]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree = 2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

scale = StandardScaler()
X_train_scaled = scale.fit_transform(X_train_poly)
X_test_scaled = scale.transform(X_test_poly)

reg = LinearRegression()
reg.fit(X_train_scaled, Y_train)

print("Train Error: ", mean_squared_error(Y_train, reg.predict(X_train_scaled)))
print("Test Error: ", mean_squared_error(Y_test, reg.predict(X_test_scaled)))
print()
print("The train error decreases when moving from linear to quadratic (polynomial) regression.")
print("The test error increases over the same change.")
print("This is due to overfitting.")
print()
print("We add quadratic features: suppose the initial features were X1, X2.")
print("We then add the extra features X1^2, X2^2 and X1*X2.")
print("In total the features are X1^2, X2^2, X1*X2, X1, X2 and 1 (for the intercept).")
print("For K features and a degree-d polynomial we get (K+d)C(d) features in total.")
print("With K = 10 and d = 2 this gives 12C2, i.e. 66 features.")

Train Error:  2431.5011577282403
Test Error:  2840.661063272829

The train error decreases when moving from linear to quadratic (polynomial) regression.
The test error increases over the same change.
This is due to overfitting.

We add quadratic features: suppose the initial features were X1, X2.
We then add the extra features X1^2, X2^2 and X1*X2.
In total the features are X1^2, X2^2, X1*X2, X1, X2 and 1 (for the intercept).
For K features and a degree-d polynomial we get (K+d)C(d) features in total.
With K = 10 and d = 2 this gives 12C2, i.e. 66 features.
