# Assignment 02: Evaluate the Diabetes Dataset

The given dataset lists the glucose level readings of several pregnant women taken either during a survey examination or routine medical care. It specifies if the 2-hour post-load plasma glucose was at least 200 mg/dl. Analyze the dataset to:

- Find the features of the dataset
- Find the response label of the dataset
- Create a model to predict the diabetes outcome
- Use training and testing datasets to train the model
- Check the accuracy of the model

#### 1: Import the dataset

In [1]:
#Import the required libraries
import pandas as pd

In [2]:
#Import the diabetes dataset
diabetes_df = pd.read_csv("../../Data/pima-indians-diabetes.data", header=None)

#### 2: Analyze the dataset

In [3]:
#View the first five observations of the dataset
diabetes_df.head()

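|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |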


#### 3: Find the features of the dataset

In [4]:
#Use the .NAMES file to view and set the features of the dataset
feature_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

In [5]:
#Use the feature names set earlier and fix it as the column headers of the dataset
diabetes_df = pd.read_csv("../../Data/pima-indians-diabetes.data", header=None, names=feature_names)

In [6]:
#Verify if the dataset is updated with the new headers
diabetes_df.head()

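|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |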
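
The effect of `header=None` together with `names=` can be checked without the data file. The following is a minimal sketch that assumes only the five rows displayed above, pasted into an in-memory buffer; `sample_rows` and `sample_df` are illustrative names, not part of the assignment.

```python
import io
import pandas as pd

# First five rows of the dataset as displayed above (comma-separated, no header row).
sample_rows = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

feature_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin',
                 'bmi', 'pedigree', 'age', 'label']

# header=None tells pandas that the first line is data, not column labels;
# names= supplies the labels that would otherwise default to 0..8.
sample_df = pd.read_csv(io.StringIO(sample_rows), header=None, names=feature_names)

print(sample_df.shape)
print(list(sample_df.columns))
```

Without `header=None`, pandas would treat the first data row as the header and the frame would lose one observation.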


In [7]:
#View the number of observations and features of the dataset
diabetes_df.shape

(768, 9)
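
In the rows displayed earlier, `skin` and `insulin` contain zeros. These may stand for missing measurements rather than real readings; that is an assumption, since the notebook does not say how missing values are encoded. A small sketch of how such zeros could be counted, using only the five displayed rows (`sample_df` and `zero_counts` are illustrative names):

```python
import pandas as pd

# The five rows shown by diabetes_df.head(), restricted to the measurement columns.
# When running the notebook, the full diabetes_df could be used instead.
sample_df = pd.DataFrame({
    'glucose': [148, 85, 183, 89, 137],
    'bp': [72, 66, 64, 66, 40],
    'skin': [35, 29, 0, 23, 35],
    'insulin': [0, 0, 0, 94, 168],
    'bmi': [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Count the zero entries in each column. A zero for these measurements is
# implausible, so a high count hints at missing data stored as 0 (assumption).
zero_counts = (sample_df == 0).sum()
print(zero_counts)
```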

#### 4: Find the response of the dataset

In [8]:
#View the pairwise correlations between all columns, including the label
diabetes_df.corr()

|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|---|---|---|---|---|---|---|---|---|
| pregnant | 1.0 | 0.129459 | 0.141282 | -0.081672 | -0.073535 | 0.017683 | -0.033523 | 0.544341 | 0.221898 |
| glucose | 0.129459 | 1.0 | 0.15259 | 0.057328 | 0.331357 | 0.221071 | 0.137337 | 0.263514 | 0.466581 |
| bp | 0.141282 | 0.15259 | 1.0 | 0.207371 | 0.088933 | 0.281805 | 0.041265 | 0.239528 | 0.065068 |
| skin | -0.081672 | 0.057328 | 0.207371 | 1.0 | 0.436783 | 0.392573 | 0.183928 | -0.11397 | 0.074752 |
| insulin | -0.073535 | 0.331357 | 0.088933 | 0.436783 | 1.0 | 0.197859 | 0.185071 | -0.042163 | 0.130548 |
| bmi | 0.017683 | 0.221071 | 0.281805 | 0.392573 | 0.197859 | 1.0 | 0.140647 | 0.036242 | 0.292695 |
| pedigree | -0.033523 | 0.137337 | 0.041265 | 0.183928 | 0.185071 | 0.140647 | 1.0 | 0.033561 | 0.173844 |
| age | 0.544341 | 0.263514 | 0.239528 | -0.11397 | -0.042163 | 0.036242 | 0.033561 | 1.0 | 0.238356 |
| label | 0.221898 | 0.466581 | 0.065068 | 0.074752 | 0.130548 | 0.292695 | 0.173844 | 0.238356 | 1.0 |
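
The `label` row of the correlation matrix is the part most relevant to choosing features. As a minimal sketch, the values from that row can be ranked by absolute correlation; the numbers below are copied from the table above, and `label_corr` and `ranked` are illustrative names. The notebook does not state how its features were chosen, so this ranking is one possible aid rather than the method actually used.

```python
import pandas as pd

# Correlation of each feature with 'label', copied from the table above.
label_corr = pd.Series({
    'pregnant': 0.221898,
    'glucose': 0.466581,
    'bp': 0.065068,
    'skin': 0.074752,
    'insulin': 0.130548,
    'bmi': 0.292695,
    'pedigree': 0.173844,
    'age': 0.238356,
})

# Rank by absolute value so that a strong negative correlation would also rank high.
ranked = label_corr.abs().sort_values(ascending=False)
print(ranked)
```

On the full frame, the same series could be obtained with `diabetes_df.corr()['label'].drop('label')`. Correlation only captures linear, one-feature-at-a-time relationships, so it is a rough guide at best.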


In [9]:
#Select features from the dataset to create the model
selected_features = ['pregnant', 'insulin', 'glucose', 'bmi', 'age']

In [10]:
#Create the feature object
x_feature = diabetes_df[selected_features]

In [11]:
#Create the response object
y_target = diabetes_df.label

In [12]:
#View the feature object
x_feature

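|   | pregnant | insulin | glucose | bmi | age |
|---|---|---|---|---|---|
| 0 | 6 | 0 | 148 | 33.6 | 50 |
| 1 | 1 | 0 | 85 | 26.6 | 31 |
| 2 | 8 | 0 | 183 | 23.3 | 32 |
| 3 | 1 | 94 | 89 | 28.1 | 21 |
| 4 | 0 | 168 | 137 | 43.1 | 33 |
| ... | ... | ... | ... | ... | ... |
| 763 | 10 | 180 | 101 | 32.9 | 63 |
| 764 | 2 | 0 | 122 | 36.8 | 27 |
| 765 | 5 | 112 | 121 | 26.2 | 30 |
| 766 | 1 | 0 | 126 | 30.1 | 47 |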


In [13]:
#View the target object
y_target

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: label, Length: 768, dtype: int64

#### 5: Use training and testing datasets to train the model

In [14]:
#Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_feature, y_target, test_size=.25)
x_train.shape, y_train.shape, x_test.shape, y_test.shape

((576, 5), (576,), (192, 5), (192,))
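
Because `train_test_split` shuffles the rows randomly, the split above, and therefore every number that follows from it, will differ from run to run. A minimal sketch of a reproducible, class-balanced split is shown here on synthetic stand-in data with the same shape as `x_feature` and `y_target`; `x_demo`, `y_demo`, and the seed values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as x_feature / y_target (768 rows, 5 features).
rng = np.random.default_rng(0)
x_demo = pd.DataFrame(rng.normal(size=(768, 5)),
                      columns=['pregnant', 'insulin', 'glucose', 'bmi', 'age'])
y_demo = pd.Series(rng.integers(0, 2, size=768), name='label')

# random_state fixes the shuffle so the split can be repeated exactly;
# stratify keeps the proportion of each class similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo)

print(x_tr.shape, y_tr.shape, x_te.shape, y_te.shape)
```

With 768 rows and `test_size=0.25`, this gives 576 training rows and 192 testing rows, matching the shapes reported above.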

#### 6: Create a model to predict the diabetes outcome

In [15]:
# Create a logistic regression model using the training set
from sklearn.linear_model import LogisticRegression
logReg = LogisticRegression()
classifier = logReg.fit(x_train, y_train)

In [16]:
#Make predictions using the testing set
y_pred = classifier.predict(x_test)
y_pred

array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1], dtype=int64)
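
`LogisticRegression()` fits its coefficients with an iterative solver, and on unscaled features with very different ranges (such as insulin and BMI here) it may stop before converging and emit a warning. One common remedy, sketched below on synthetic data, is to standardize the features in a pipeline and raise `max_iter`. The data from `make_classification`, the names `x_demo`, `y_demo`, and `scaled_logreg`, and the value `max_iter=1000` are all assumptions for illustration, not the notebook's own setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data with 768 rows and 5 features.
x_demo, y_demo = make_classification(n_samples=768, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.25, random_state=0)

# The scaler is fitted on the training data only, then applied to the test data
# automatically whenever the pipeline predicts.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(x_tr, y_tr)

pred = scaled_logreg.predict(x_te)          # hard 0/1 predictions
proba = scaled_logreg.predict_proba(x_te)   # one probability column per class

print(pred[:10])
print(proba[:3].round(3))
```

`predict_proba` is useful when a decision threshold other than the default is wanted, for example to flag more borderline cases as positive.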

#### 7: Check the accuracy of the model

In [17]:
#Evaluate the accuracy of your model
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.7552083333333334

In [18]:
#Print the first 30 actual and predicted responses
import numpy as np
print("Actual:   ", np.array(y_test[:30]))
print("Predicted:", y_pred[:30])

Actual:    [1 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1]
Predicted: [0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 0 0 0]
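
Accuracy alone does not show which kind of mistake the model makes. As a rough sketch, the 30 actual and predicted values printed above can be broken down into true and false positives and negatives with plain NumPy; `actual`, `predicted`, and the count names are illustrative.

```python
import numpy as np

# The first 30 actual and predicted responses, copied from the output above.
actual = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
                   1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1])
predicted = np.array([0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,
                      1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])

# Count each combination of actual and predicted class.
tp = int(np.sum((actual == 1) & (predicted == 1)))  # diabetic, predicted diabetic
tn = int(np.sum((actual == 0) & (predicted == 0)))  # not diabetic, predicted not
fp = int(np.sum((actual == 0) & (predicted == 1)))  # not diabetic, predicted diabetic
fn = int(np.sum((actual == 1) & (predicted == 0)))  # diabetic, predicted not

# Accuracy is the share of predictions that match the actual label.
subset_accuracy = (tp + tn) / len(actual)

print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
print("Accuracy on these 30 rows:", round(subset_accuracy, 4))
```

On these 30 rows the counts are 4 true positives, 18 true negatives, 5 false positives, and 3 false negatives, so 22 of 30 predictions are correct. On the full test set, `sklearn.metrics.confusion_matrix(y_test, y_pred)` would give the same breakdown.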
