# Diabetes Data Set

Dataset file: 'diabetes.data'  
Reference link for description of dataset: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

### Preview of the Data Set

Load the data set.

a) Analyse the data set. Print the number of features, feature names, data types of the features, number of data points and the values of the first 10 data points.

In [19]:
import pandas as pd
data = pd.read_csv('diabetes.data','\t')
print(data)
data = data.values

     AGE  SEX   BMI      BP   S1     S2    S3    S4      S5   S6    Y
0     59    2  32.1  101.00  157   93.2  38.0  4.00  4.8598   87  151
1     48    1  21.6   87.00  183  103.2  70.0  3.00  3.8918   69   75
2     72    2  30.5   93.00  156   93.6  41.0  4.00  4.6728   85  141
3     24    1  25.3   84.00  198  131.4  40.0  5.00  4.8903   89  206
4     50    1  23.0  101.00  192  125.4  52.0  4.00  4.2905   80  135
5     23    1  22.6   89.00  139   64.8  61.0  2.00  4.1897   68   97
6     36    2  22.0   90.00  160   99.6  50.0  3.00  3.9512   82  138
7     66    2  26.2  114.00  255  185.0  56.0  4.55  4.2485   92   63
8     60    2  32.1   83.00  179  119.4  42.0  4.00  4.4773   94  110
9     29    1  30.0   85.00  180   93.4  43.0  4.00  5.3845   88  310
10    22    1  18.6   97.00  114   57.6  46.0  2.00  3.9512   83  101
11    56    2  28.0   85.00  184  144.8  32.0  6.00  3.5835   77   69
12    53    1  23.7   92.00  186  109.2  62.0  3.00  4.3041   81  179
13    50    2  26.2 

### Training and Testing Data Sets

b) Split the data set into training and testing data set with a 80:20 ratio.

(Hint: What precautions must you take before you split the data set?)

In [20]:
import numpy as np
np.random.seed(7)
np.random.shuffle(data)
train = data[0:354,:]
test = data[354:,:]

### Linear Regression

c) Using linear regression, seek a model for the response of interest ($Y$), as a function of the baseline variables such as age, sex, body mass index, etc. Compute the training error and testing error.

In [21]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(train[:,0:10],train[:,10])
from sklearn.metrics import mean_squared_error
predtr = reg.predict(train[:,0:10])
predte = reg.predict(test[:,0:10])
train_error = mean_squared_error(train[:,10],predtr)
test_error = mean_squared_error(test[:,10],predte)
print('Training Error:',train_error)
print('Testing Error:',test_error)

Training Error: 2940.815513704523
Testing Error: 2610.401561501811


### Data Preprocessing

d) Normalize the data set and perform linear regression again. Compute the training error and testing error. Comment.

In [32]:
# mean = np.mean(train[:,0:10],axis=0)
# var = np.mean(((train[:,0:10] - mean)**2),axis=0)
# std = np.sqrt(var)
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
train_n = scale.fit_transform(train)
test_n = scale.transform(test)
reg2 = LinearRegression()
reg2.fit(train_n[:,0:10],train_n[:,10])
predtr_norm = reg2.predict(train_n[:,0:10])
predte_norm = reg2.predict(test_n[:,0:10])
train_mse_norm = mean_squared_error(train_n[:,10],predtr_norm)
test_mse_norm = mean_squared_error(test_n[:,10],predte_norm)
print('Training Error Normalized:',train_mse_norm)
print('Testing Error Normalized:',test_mse_norm)

Training Error Normalized: 0.48362342756694393
Testing Error Normalized: 0.4292861434579788


### Feature Reduction

e) Rank the features in order of importance (based on the study in d)). Comment.

In [33]:
print(reg2.coef_)
# therefore going from highest to the lowest importance we get:
# 1st is S1 (highest importance)
# 2nd is S5
# 3rd is BMI
# 4th is S2
# 5th is BP
# 6th is S4
# 7th is SEX
# 8th is S3
# 9th is S6
# 10th is AGE (lowest importance)

[-0.00879172 -0.13748946  0.34396646  0.19004288 -0.44228641  0.21568029
  0.11974461  0.18928457  0.4155079   0.06303564]


### Polynomial Regression

f) Repeat the exercise in d) with quadratic features. List the features you would add to the existing data set. Compute the training error and the testing error. Comment.

In [40]:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree = 2)
trainpoly = poly.fit_transform(train[:,0:10])
testpoly = poly.transform(test[:,0:10])
scale2 = StandardScaler()
trainq = np.zeros((np.shape(trainpoly)[0],67))
trainq[:,0:66] += trainpoly
trainq[:,66]   += train[:,10]
testq = np.zeros((np.shape(testpoly)[0],67))
testq[:,0:66] += testpoly
testq[:,66]  += test[:,10]
trainq = scale2.fit_transform(trainq)
testq = scale2.transform(testq)
regq = LinearRegression()
regq.fit(trainq[:,0:10],trainq[:,10])
predtrq = regq.predict(trainq[:,0:10])
predteq = regq.predict(testq[:,0:10])
msetrq = mean_squared_error(trainq[:,10],predtrq)
mseteq = mean_squared_error(testq[:,10],predteq)
print('Training Error:',msetrq)
print('Testing Error',mseteq)

Training Error: 0.6704713765072232
Testing Error 0.6169251467027934
