This notebook fits a linear regression model to predict a person's diabetes progression from various attributes.


In [1]:
# Import relevant libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

Read diabetes.csv into your Jupyter notebook.

In [2]:
# Read diabetes.csv file using pandas into Jupyter notebook
df = pd.read_csv('diabetes.csv')
# Use head to check if the data was read correctly
df.head()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6,Progression
0,0.038076,0.05068,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646,151.0
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204,75.0
2,0.085299,0.05068,0.044451,-0.00567,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.02593,141.0
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362,206.0
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641,135.0


The diabetes.csv dataset is used to predict a person's progression in the
condition from various attributes about them.
Differentiate between the independent variables and the dependent
variable and assign them to variables X and y.

A: The dependent variable is Progression, since our aim is to predict a person's progression in the condition. The independent variables are all the other attributes: age, sex, bmi, bp, and s1 to s6.

Generate training and test sets comprising 80% and 20% of the data
respectively

In [3]:
# Generate training and test sets comprising 80% and 20% of the data respectively

# Assign independent variables as X
X = df[['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']]
# Assign dependent variable as y
y = df['Progression']

# Initialise a linear regression model
diabetes_model = LinearRegression() 

# Fit a baseline model on the full, unsplit data (a hyperplane of best fit across all ten features)
diabetes_model.fit(X, y)

# Print the intercept and coefficients
print('Intercept: \n', diabetes_model.intercept_)
print('Coefficients: \n', diabetes_model.coef_) 

# Split data as 80% training and 20% test
rseed = 23
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rseed)

# Print training data and test data
print("Training data:", X_train.shape, y_train.shape)
print("Test data:", X_test.shape, y_test.shape)

Intercept: 
 152.13348416289597
Coefficients: 
 [ -10.0098663  -239.81564367  519.84592005  324.3846455  -792.17563855
  476.73902101  101.04326794  177.06323767  751.27369956   67.62669218]
Training data: (353, 10) (353,)
Test data: (89, 10) (89,)
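
The 353/89 split above is not exactly 80/20 because a fractional test_size has to be turned into a whole number of rows; as far as we can tell, train_test_split rounds the test count up and gives the remainder to the training set. A minimal sketch on placeholder data, assuming 442 rows in total (353 + 89, inferred from the shapes printed above; the zero arrays only stand in for the real features and target):

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 442  # assumption: inferred from 353 + 89 in the output above
X_demo = np.zeros((n_samples, 10))  # placeholder features, not the real data
y_demo = np.zeros(n_samples)        # placeholder target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=23
)

# 0.2 * 442 = 88.4, which is rounded up for the test set
expected_test = math.ceil(0.2 * n_samples)
print("Expected test rows:", expected_test)
print("Train:", X_tr.shape, "Test:", X_te.shape)
```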


Use a MinMaxScaler and StandardScaler from sklearn.preprocessing. Fit
these scalers on the train set and use these fit scalers to transform the train
and test sets.

In [4]:
# Fit the min-max scaler and the standard scaler on the training data only
minmax_sc = MinMaxScaler()
standard_sc = StandardScaler()
minmax_sc.fit(X_train)
standard_sc.fit(X_train)

# Transform the train and test data with the fitted min-max scaler
X_train_minmax_sc = minmax_sc.transform(X_train)
X_test_minmax_sc = minmax_sc.transform(X_test)

# Transform the train and test data with the fitted standard scaler
X_train_standard_sc = standard_sc.transform(X_train)
X_test_standard_sc = standard_sc.transform(X_test)
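
To make the "fit on train, transform both" step concrete, here is a rough NumPy sketch of what the two scalers compute with their default settings. The arrays train and test are hypothetical stand-ins for X_train and X_test; the point is that the minimum, maximum, mean and standard deviation all come from the training rows only.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(size=(8, 3))  # hypothetical training features
test = rng.normal(size=(4, 3))   # hypothetical test features

# Min-max scaling: per-column min and max taken from the training set
col_min = train.min(axis=0)
col_max = train.max(axis=0)
train_minmax = (train - col_min) / (col_max - col_min)
test_minmax = (test - col_min) / (col_max - col_min)  # can fall outside [0, 1]

# Standard scaling: per-column training mean and population std (ddof=0)
col_mean = train.mean(axis=0)
col_std = train.std(axis=0)
train_standard = (train - col_mean) / col_std
test_standard = (test - col_mean) / col_std

print("Min-max train range:", train_minmax.min(axis=0), train_minmax.max(axis=0))
print("Standard train mean:", train_standard.mean(axis=0).round(6))
```

Reusing the training statistics on the test set keeps information about the test rows out of the preprocessing step.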

Generate a multiple linear regression model using the training set. Use all
of the independent variables.
Print out the intercept and coefficients of the trained model.

In [5]:
# Fit the model using the min max scaled training data
diabetes_model_minmax = LinearRegression()
diabetes_model_minmax.fit(X_train_minmax_sc, y_train)

# Print intercept and coefficients of trained model
print('Intercept (MinMax): ', diabetes_model_minmax.intercept_)
print('Coefficients (MinMax): ', diabetes_model_minmax.coef_)

# Fit the model using the standard scaled training data
diabetes_model.fit(X_train_standard_sc, y_train)

# Print intercept and coefficients of trained model
print('Intercept (Standard): ', diabetes_model.intercept_)
print('Coefficients (Standard): ', diabetes_model.coef_)

Intercept (MinMax):  0.26313819520225934
Coefficients (MinMax):  [  -4.325416    -25.87489936  130.23755571   79.72225993 -281.78073485
  169.4272368    62.57500688   61.16459841  195.21322347   20.61117695]
Intercept (Standard):  147.78470254957506
Coefficients (Standard):  [ -0.95567861 -12.89955019  24.0062771   15.36227276 -47.12832261
  29.63772795  10.58261516  12.38396033  38.62646575   3.59889203]


Generate predictions for the test set. Compute R-squared for your model on the test set. You can use r2_score from sklearn.metrics to obtain this score.
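
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. A small sketch of that calculation by hand (the function name r_squared and the toy values are illustrative only, not part of the exercise):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy values: SS_res = 1.5, SS_tot = 29.1875
print(round(r_squared([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]), 4))  # 0.9486
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean of y_test.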


In [6]:
# Generate predictions for min max scaled test set
y_pred = diabetes_model_minmax.predict(X_test_minmax_sc)

# Compute R_squared for model based on test data. Use r2_score from sklearn.metrics
R_squared = r2_score(y_test, y_pred)
print('R-squared (Min-Max): ', round(R_squared, 4))

# Generate predictions for standard scaled test set
y_pred = diabetes_model.predict(X_test_standard_sc)

# Compute R_squared for model based on test data. Use r2_score from sklearn.metrics
R_squared = r2_score(y_test, y_pred)
print('R-squared (Standard): ', round(R_squared, 4))

# Alternative: the model's score method also returns R-squared
print("R2 Score (Min Max):", round(diabetes_model_minmax.score(X_test_minmax_sc, y_test), 4))
print("R2 Score (Standard):", round(diabetes_model.score(X_test_standard_sc, y_test), 4))

R-squared (Min-Max):  0.4588
R-squared (Standard):  0.4588
R2 Score (Min Max): 0.4588
R2 Score (Standard): 0.4588
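
The identical scores for both scalers are expected rather than a coincidence: ordinary least squares with an intercept produces the same predictions under any invertible per-feature rescaling, and only the coefficients and intercept change. A minimal sketch on synthetic data, using plain NumPy least squares in place of LinearRegression (the helper ols_predict and all arrays here are illustrative assumptions, not part of the notebook):

```python
import numpy as np

def ols_predict(X_train, y_train, X_test):
    """Fit least squares with an intercept column and predict on X_test."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    beta, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ beta

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(50, 3))  # synthetic training features
X_te = rng.normal(size=(20, 3))  # synthetic test features
y_tr = X_tr @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=50)

# Scaling statistics come from the training set only
lo, hi = X_tr.min(axis=0), X_tr.max(axis=0)
mu, sigma = X_tr.mean(axis=0), X_tr.std(axis=0)

pred_raw = ols_predict(X_tr, y_tr, X_te)
pred_minmax = ols_predict((X_tr - lo) / (hi - lo), y_tr, (X_te - lo) / (hi - lo))
pred_standard = ols_predict((X_tr - mu) / sigma, y_tr, (X_te - mu) / sigma)

print("Min-max matches unscaled:", np.allclose(pred_raw, pred_minmax))
print("Standard matches unscaled:", np.allclose(pred_raw, pred_standard))
```

Scaling matters more for models that are sensitive to feature magnitude, such as regularised regression or distance-based methods; for plain linear regression it mainly changes how the coefficients read.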
