# Model Quality and Improvements

## 1. Defining the Question

### a) Data Analysis Question

Can I develop a model that predicts whether a patient will be diagnosed with diabetes?

### b) Metric for Success

The machine learning model should predict whether a patient will be diagnosed with diabetes with an accuracy score greater than 0.85.

### c) Understanding the context 

The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

### d) Experimental Design

1. Data Importation
2. Data Exploration
3. Data Cleaning
4. Data Preparation
5. Data Modeling (Using Decision Trees, Random Forest and Logistic Regression)
6. Model Evaluation
7. Hyperparameter Tuning
8. Findings and Recommendations

### e) Data Relevance

The dataset consists of several medical predictor variables and one target variable, *Outcome*. The predictor variables include the number of pregnancies the patient has had, plasma glucose concentration, blood pressure, skin thickness, insulin level, BMI, diabetes pedigree function and age.

## 2. Reading the Data

In [1]:
# Importing our libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# Load the data below
# --- 
dataset_url = "https://raw.githubusercontent.com/wambasisamuel/DE_Week04_Tuesday/main/diabetes2.csv"
df = pd.read_csv(dataset_url) 

In [3]:
# Checking the first 5 rows of data
df.head()

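|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |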


In [4]:
# Checking the last 5 rows of data
# ---
df.tail(5)

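|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |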


In [5]:
# Checking number of rows and columns
df.shape

(768, 9)

In [6]:
# Checking datatypes
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


Observations:

*   There are 768 observations in the dataset.
*   The dataset has 9 columns: 8 predictor variables and the target, *Outcome*.
*   All the columns are numerical.



## 3. External Data Source Validation

The provided dataset matches the one on Kaggle. It has enough features to help in developing a machine learning model that can predict whether a patient has diabetes.

## 4. Data Preparation

### Data Standardisation

In [7]:
# Standardise column names
# ---
# Convert column names to lowercase 
df.columns = df.columns.str.lower()
df.columns

Index(['pregnancies', 'glucose', 'bloodpressure', 'skinthickness', 'insulin',
       'bmi', 'diabetespedigreefunction', 'age', 'outcome'],
      dtype='object')

### Data Cleaning

#### Missing Data

In [8]:
# Checking missing entries of all the variables
# ---
# 
df.isnull().sum()

pregnancies                 0
glucose                     0
bloodpressure               0
skinthickness               0
insulin                     0
bmi                         0
diabetespedigreefunction    0
age                         0
outcome                     0
dtype: int64

It appears there are no missing values. I will dig deeper to determine whether missing values have been recorded as zeros. I will do this for all columns except the *pregnancies* and *outcome* columns, where zero is a valid value.

In [9]:
# Rows with 0's 
df_zeroes = df[(df['glucose']==0) | (df['bloodpressure']==0) | (df['skinthickness']==0) | (df['insulin']==0) | (df['bmi']==0) | (df['diabetespedigreefunction']==0) | (df['age']==0)]
df_zeroes.shape

(376, 9)
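
The filter above counts rows that contain a zero in any of the checked columns. A per-column breakdown can show which columns are responsible. The snippet below is a minimal sketch of that idea on a small toy dataframe. The values and the names `toy`, `cols_to_check`, `zeros_per_column` and `rows_with_zero` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the diabetes dataframe (illustrative values only)
toy = pd.DataFrame({
    "glucose": [148, 85, 0, 89],
    "bloodpressure": [72, 66, 64, 0],
    "insulin": [0, 0, 0, 94],
    "outcome": [1, 0, 1, 0],
})

# Columns in which a zero is treated as a missing measurement
cols_to_check = ["glucose", "bloodpressure", "insulin"]

# Number of zeros in each checked column
zeros_per_column = (toy[cols_to_check] == 0).sum()
print(zeros_per_column)

# Number of rows with a zero in at least one checked column
rows_with_zero = int((toy[cols_to_check] == 0).any(axis=1).sum())
print(rows_with_zero)
```

On the real data, the same two expressions could be applied to `df` with the full list of measurement columns.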

376 observations contain at least one zero in these columns, which I treat as a missing value. I will replace these zeros with NaN so that they can be imputed.

In [10]:
# Replace 0's with NaN in the columns where zero is not a valid measurement
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
zero_as_missing = ['glucose','bloodpressure','skinthickness','insulin','bmi']
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

# Count of NANs
df.isnull().sum()

pregnancies                   0
glucose                       5
bloodpressure                35
skinthickness               227
insulin                     374
bmi                          11
diabetespedigreefunction      0
age                           0
outcome                       0
dtype: int64

I will impute the missing values in each column with the mean of that column.

In [11]:
columns_to_impute = ['glucose','bloodpressure','skinthickness','insulin','bmi']
for col in columns_to_impute:
  # Assign the result back; in-place fillna on a column selection is deprecated in recent pandas
  df[col] = df[col].fillna(df[col].mean())
df.isnull().sum()

pregnancies                 0
glucose                     0
bloodpressure               0
skinthickness               0
insulin                     0
bmi                         0
diabetespedigreefunction    0
age                         0
outcome                     0
dtype: int64
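
The loop above computes each mean over all 768 rows, before the data is split in the modelling section, so validation rows influence the imputed values. A minimal sketch of one alternative, fitting scikit-learn's `SimpleImputer` on the training rows only, is shown below. The toy dataframe and the names `X_train_imp` and `X_valid_imp` are assumptions for illustration. This is not what the notebook does, and the scores reported later were obtained with the full-dataset means.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy data with NaNs standing in for the real columns (illustrative values only)
toy = pd.DataFrame({
    "glucose": [148.0, 85.0, np.nan, 89.0, 137.0, 116.0, 78.0, np.nan],
    "bmi": [33.6, 26.6, 23.3, np.nan, 43.1, 25.6, 31.0, 35.3],
    "outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

X = toy.drop(columns="outcome")
y = toy["outcome"]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Learn the column means from the training rows only
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)

# Reuse the training means to fill gaps in the validation rows
X_valid_imp = pd.DataFrame(
    imputer.transform(X_valid), columns=X.columns, index=X_valid.index
)

print(X_train_imp.isnull().sum())
print(X_valid_imp.isnull().sum())
```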

#### Duplicate data

In [12]:
# Find the total duplicate records
df.duplicated().sum()

0

There are no duplicate records.

## 5. Data Modelling

### Splitting the dataset

I will split the dataset so that 75% of the data is used for training and 25% for testing (the validation set in the code below).

In [13]:
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345)

# Training and Validation features and targets
features_train = df_train.drop(['outcome'], axis=1)
target_train = df_train['outcome']

features_valid = df_valid.drop(['outcome'], axis=1)
target_valid = df_valid['outcome']
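
The split above is purely random. If the two outcome classes are imbalanced, `train_test_split` also accepts a `stratify` argument that keeps the class proportions equal in both parts. The snippet below is a hedged sketch on a toy dataframe. The data and the names `toy`, `toy_train` and `toy_valid` are assumptions for illustration, and the notebook's reported scores come from the unstratified split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 8 positive and 12 negative outcomes (illustrative values only)
toy = pd.DataFrame({
    "glucose": list(range(100, 120)),
    "outcome": [1] * 8 + [0] * 12,
})

# Stratify on the target so both parts keep the 40/60 class ratio
toy_train, toy_valid = train_test_split(
    toy, test_size=0.25, random_state=12345, stratify=toy["outcome"]
)

print(toy_train.shape, toy_valid.shape)
print(toy_train["outcome"].mean(), toy_valid["outcome"].mean())
```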

### Decision Tree Modelling

In [23]:
dt_model = DecisionTreeClassifier(random_state=12345)

# Train the model
dt_model.fit(features_train, target_train)

# Get model score
dt_model_score = dt_model.score(features_valid, target_valid)
dt_model_score

0.7708333333333334

### Random Forest Modelling

In [19]:
rf_model = RandomForestClassifier(random_state=12345)

# Train the model
rf_model.fit(features_train, target_train)

# Get model score
rf_model_score = rf_model.score(features_valid, target_valid)
rf_model_score

0.796875

### Logistic Regression Modelling

In [21]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear')

# Train the model
lr_model.fit(features_train, target_train)

# Get model score
lr_model_score = lr_model.score(features_valid, target_valid)
lr_model_score

0.796875

## 6. Model Evaluation

* None of the models thus far achieves the required 85% accuracy.
* The random forest and logistic regression models have the highest accuracy, each with 79.69% (0.796875).
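
To make the gap to the target concrete, the accuracy scores can be converted into counts of correct predictions. The sketch below assumes a validation set of 192 rows, which follows from 768 rows and `test_size=0.25`. The variable names are illustrative.

```python
import math

# Accuracy scores reported by the three models above
scores = {
    "decision_tree": 0.7708333333333334,
    "random_forest": 0.796875,
    "logistic_regression": 0.796875,
}

n_rows = 768                   # rows in the dataset (from df.shape)
n_valid = int(n_rows * 0.25)   # size of the validation set
target = 0.85                  # accuracy must be greater than this

# Convert each accuracy into a number of correctly classified rows
correct_counts = {name: round(score * n_valid) for name, score in scores.items()}
for name, correct in correct_counts.items():
    print(f"{name}: {correct}/{n_valid} correct ({scores[name]:.2%})")

# Smallest number of correct predictions that exceeds the target
needed = math.ceil(target * n_valid)
print("needed for target:", needed)
```

Under these assumptions, the best models classify 153 of 192 validation rows correctly, while 164 would be needed to exceed 0.85.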

## 7. Hyperparameter Tuning

### Decision Tree

I will tune the `max_depth` hyperparameter.

In [28]:
for depth in range(1, 10):
    # Fit a tree limited to the current depth and score it on the validation set
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    print('max_depth =', depth, ': ', end='')
    print(model.score(features_valid, target_valid))

max_depth = 1 : 0.7708333333333334
max_depth = 2 : 0.7708333333333334
max_depth = 3 : 0.765625
max_depth = 4 : 0.7395833333333334
max_depth = 5 : 0.7760416666666666
max_depth = 6 : 0.7760416666666666
max_depth = 7 : 0.7239583333333334
max_depth = 8 : 0.7395833333333334
max_depth = 9 : 0.75
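
Rather than reading the best depth off the printed list, the scores can be collected and compared programmatically. The sketch below hard-codes the scores printed above. The names `depth_scores`, `best_score` and `best_depths` are illustrative.

```python
# Validation accuracy for each max_depth, copied from the output above
depth_scores = {
    1: 0.7708333333333334,
    2: 0.7708333333333334,
    3: 0.765625,
    4: 0.7395833333333334,
    5: 0.7760416666666666,
    6: 0.7760416666666666,
    7: 0.7239583333333334,
    8: 0.7395833333333334,
    9: 0.75,
}

# Highest score and every depth that reaches it (ties are kept)
best_score = max(depth_scores.values())
best_depths = [depth for depth, score in depth_scores.items() if score == best_score]

print(best_depths, round(best_score, 4))
```

Depths 5 and 6 tie at about 77.60%, only slightly above the untuned tree (77.08%).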


### Random Forest

I will tune the `n_estimators` hyperparameter.

In [30]:
for estimator in range(1, 20):
    model = RandomForestClassifier(random_state=12345, n_estimators=estimator)
    model.fit(features_train, target_train)
    valid_score = model.score(features_valid, target_valid)
    print('estimators =', estimator, ': ', end='')
    print(valid_score)

estimators = 1 : 0.7083333333333334
estimators = 2 : 0.734375
estimators = 3 : 0.734375
estimators = 4 : 0.7239583333333334
estimators = 5 : 0.7395833333333334
estimators = 6 : 0.7552083333333334
estimators = 7 : 0.7291666666666666
estimators = 8 : 0.75
estimators = 9 : 0.7552083333333334
estimators = 10 : 0.7447916666666666
estimators = 11 : 0.7447916666666666
estimators = 12 : 0.7708333333333334
estimators = 13 : 0.7864583333333334
estimators = 14 : 0.7864583333333334
estimators = 15 : 0.7916666666666666
estimators = 16 : 0.78125
estimators = 17 : 0.8020833333333334
estimators = 18 : 0.7864583333333334
estimators = 19 : 0.7760416666666666
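
The scores above fluctuate from one setting to the next, which is expected when each setting is judged on a single 192-row validation set. One common alternative is a cross-validated grid search, sketched below with scikit-learn's `GridSearchCV`. The synthetic data from `make_classification` and the parameter grid are assumptions for illustration. In the notebook, `features_train` and `target_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 8 features (not the diabetes dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=12345)

# Hypothetical grid; the values are chosen only to keep the example fast
param_grid = {
    "n_estimators": [10, 20, 50],
    "max_depth": [3, 5, None],
}

# 5-fold cross-validation scores each combination on five held-out folds
search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Averaging over folds tends to give a steadier estimate than a single split, at the cost of more model fits.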


## 8. Summary and Recommendations

Below are the findings:

1. Even after hyperparameter tuning, no model achieves 85% accuracy.  
2. The highest accuracy thus far is 80.21% for a random forest classifier having 17 estimators.



