# Forecasting Onset of Diabetes Mellitus

This project focuses on predicting whether or not a patient has diabetes. The data is cleaned, analyzed, and used to develop a predictive model.

## Columns 

Pregnancies: Number of times pregnant <br><br>
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test <br><br>
BloodPressure: Diastolic blood pressure (mm Hg) <br> <br>
SkinThickness: Triceps skin fold thickness (mm) <br><br>
Insulin: 2-Hour serum insulin (mu U/ml) <br><br>
BMI: Body mass index (weight in kg/(height in m)^2)<br><br>
Diabetes Pedigree Function: Diabetes pedigree function<br><br>
Age: Age (years)<br><br>
Outcome: Class variable (0 or 1); 268 of 768 are 1, the others are 0

## Import Tools 

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
%matplotlib inline 
import seaborn as sns
import itertools 
plt.style.use('fivethirtyeight')

## Load Data

In [None]:
"""import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename)) """
df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
df.head()

## Data Cleaning <br>
    
Explore the data to look for any inconsistencies. <br>

Some good procedures when going through data for the first time are:
1. Check number of rows & columns --> df.shape
2. Check for null values 
3. Check data types of each column 
4. Note any imbalances in data, such as one target outcome having significantly more data records than others
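
The four checks above can be gathered into one helper. This is only a sketch: the `first_look` function and the `toy` frame are hypothetical stand-ins and are not part of the notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "Glucose": [148, 85, None, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

def first_look(frame, target):
    """Collect the four first-pass checks in one dictionary."""
    return {
        "shape": frame.shape,                          # 1. rows and columns
        "nulls": frame.isna().sum().to_dict(),         # 2. null count per column
        "dtypes": frame.dtypes.astype(str).to_dict(),  # 3. data type per column
        # 4. share of records per target class
        "balance": frame[target].value_counts(normalize=True).to_dict(),
    }

summary = first_look(toy, "Outcome")
print(summary)
```

Returning a dictionary rather than printing each check keeps the results easy to compare before and after cleaning.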

In [None]:
df = df.rename(columns = {'BloodPressure':'BP', 'DiabetesPedigreeFunction':"DPF"}) #Rename column titles to make them shorter

In [None]:
df.info()

In [None]:
df.describe()

**Notes** <br>

1. Do the minimum values of 0 make sense for all the features?
2. There are no null values in the original data.
3. Every column has a numeric data type (int or float).

The zero values in certain columns (Glucose, BP, SkinThickness, Insulin, BMI) are not physiologically plausible and most likely mark missing measurements. They will be replaced with NaN until further analysis is conducted on the data.

It is important to handle these zero values because they distort the statistics of the data. Once they are replaced by NaN, pandas skips them by default when calculating metrics such as the mean, median, and percentiles.
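
The effect of a placeholder zero on a summary statistic can be seen on a tiny example. The readings below are made up for illustration; the only behaviour relied on is that pandas reductions skip NaN by default (`skipna=True`).

```python
import numpy as np
import pandas as pd

glucose = pd.Series([0, 100, 120, 140])   # made-up readings; 0 marks a missing test
cleaned = glucose.replace(0, np.nan)      # treat the zero as missing

mean_with_zeros = glucose.mean()          # the zero drags the mean down
mean_without = cleaned.mean()             # NaN is skipped by default
valid_count = cleaned.count()             # count() also ignores NaN

print(mean_with_zeros, mean_without, valid_count)
```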

In [None]:
zeroCols = ['Glucose', 'BP', 'SkinThickness', 'Insulin','BMI'] # Columns with incorrect Zero values 
df2 = df.copy() # create a copy of the original dataframe 
df2[zeroCols] = df2[zeroCols].replace(0,np.nan) #Replace 0s with NaNs (the np.NaN alias was removed in NumPy 2.0)
df2.head()

In [None]:
outcomes = df2["Outcome"].value_counts()
print(outcomes)

# 0 = No Diabetes
# 1 = Diabetes 

**Note:** The two outcomes are imbalanced: there are noticeably more records without diabetes (0) than with diabetes (1). This must be taken into consideration when training the model to predict the onset of diabetes. 

**Descriptive Stats Comparison** <br>
Comparison of descriptive stats between the original data and the new data with 0s replaced with NaNs

In [None]:
df.describe() # Original data loaded into a dataframe 

In [None]:
df2.describe() # Data with 0s replaced with NaNs

**Note:** After replacing the 0s, the amount of usable 'SkinThickness' and 'Insulin' data is significantly reduced. Almost 50% of the Insulin values and about 30% of the SkinThickness values are missing.

In [None]:
#df2.info()

In [None]:
null_values = (df2.isna().sum()/len(df2))*100
null_values.drop(labels = ['Pregnancies','DPF','Age','Outcome'], inplace=True)
print("Column Name" + "     " + "% of Null Values\n")
print(null_values)

It is possible that the **presence/lack** of Skin Thickness and Insulin data is related to a person having diabetes. To check, the ratio between the number of non-missing data points for each outcome was computed for every feature. If the ratio is within the same range for all features, there is no sign of such a relationship. However, if the ratio is significantly skewed one way (e.g. records of diabetes patients have significantly more insulin data points collected), it could indicate a relationship between the missing values and a patient having diabetes. 

In [None]:
dp = df2.groupby('Outcome').count() # Grouping the number of data points for both outcomes 
#print(dp)
outcome_0 = dp.loc[0,:] # Number of data points related to Outcome = 0
outcome_1 = dp.loc[1,:] # Number of data points related to Outcome = 1
print("Column Name" + "     " + "Outcome 1 to Outcome 0 Data Points Ratio\n") 
print(outcome_1/outcome_0)

According to the ratios displayed above, the ratio for the number of data points is around 0.5 (±0.04) for every feature. It is therefore reasonable to conclude that the ratios for Skin Thickness and Insulin are within the same range as those of the other features, so the missing values do not appear to be related to the outcome. 

## Simple EDA
Explore the data to get a better understanding of different trends, correlations, patterns, etc. 

In [None]:
hist = df2.hist(figsize = (20,20))
# Disregard the outcome histogram 

**Based on the skewness of each feature, the missing (formerly 0) values will be replaced: the median is used for skewed features and the mean for roughly symmetric ones.**
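
A skewness-based choice between mean and median could be made explicit along the following lines. This is a minimal sketch under stated assumptions: the helper `impute_by_skew` and its 0.5 threshold are hypothetical and are not what the notebook uses, where the choice is made by eye from the histograms.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series, threshold=0.5):
    """Fill NaNs with the median if |skewness| exceeds the threshold, else with the mean.

    The threshold is an illustrative assumption, not a fixed rule.
    """
    use_median = abs(series.skew()) > threshold
    fill_value = series.median() if use_median else series.mean()
    return series.fillna(fill_value), ("median" if use_median else "mean")

symmetric = pd.Series([60.0, 70.0, 80.0, np.nan])       # evenly spread values
skewed = pd.Series([10.0, 12.0, 15.0, 400.0, np.nan])   # long right tail

sym_filled, sym_rule = impute_by_skew(symmetric)
skw_filled, skw_rule = impute_by_skew(skewed)
print(sym_rule, sym_filled.iloc[-1])
print(skw_rule, skw_filled.iloc[-1])
```

The median is less sensitive to the single large value in `skewed`, which is why it is the usual choice for long-tailed features.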

In [None]:
# Replace the missing (formerly 0) values in Glucose, BMI, BP (Blood Pressure), Insulin, and Skin Thickness
# Assigning the result back avoids chained in-place fills, which newer pandas versions warn about

df2['Glucose'] = df2['Glucose'].fillna(df2['Glucose'].median())
df2['BMI'] = df2['BMI'].fillna(df2['BMI'].median())
df2['BP'] = df2['BP'].fillna(df2['BP'].mean())
df2['Insulin'] = df2['Insulin'].fillna(df2['Insulin'].median())
df2['SkinThickness'] = df2['SkinThickness'].fillna(df2['SkinThickness'].mean())


In [None]:
histZR = df2.hist(figsize = (20,20)) # Histogram of data with zeros replaced 

**The zeros have been replaced with mean/median values**  <br>
If the NaN records were removed, half of the records in the dataset would have to be removed as the Insulin data by itself had 48% NaN values. 

## Data Analysis <br>
The data will now be explored more in-depth 

In [None]:
# Heat Map
hmap = sns.heatmap(df2.corr(), cmap = "BrBG", annot=True)
#plt.savefig(r'C:\Users\Shakti\Desktop\Data Science\Projects\Pima-Indians-Diabetes-Project\Data Visualizations\heatmap.jpg')

In [None]:
df2.corr()

**Notes:** <br>
1. Skin Thickness and Insulin are only weakly correlated with the Outcome. 
2. Glucose (0.49) and BMI (0.31) have the highest correlations with the Outcome. 

In [None]:
# Pair Plot 
pplot=sns.pairplot(df2, hue = 'Outcome', palette="husl")
plt.show()
#plt.savefig(r'C:\Users\Shakti\Desktop\Data Science\Projects\Pima-Indians-Diabetes-Project\Data Visualizations\pairplot.jpg')

**Note:** From the pair plot, it is hard to find any features which clearly distinguish between the outcomes. <br>

**Next:** Let's split the data by outcome and compare the descriptive stats of the two groups. 

In [None]:
out0 = df2[df2['Outcome']==0]
out1 = df2[df2['Outcome']==1]

In [None]:
out0.describe()

In [None]:
out1.describe()

At first glance, the data for Outcome 1 has higher descriptive stats values. However, there is also a **high standard deviation**, and the distributions of the features are not all 'normal'. 

## Data Preparation (Machine Learning)

Data needs to be prepared before it can be used in machine learning models. If there is categorical data, it needs to be encoded. Numerical data needs to be scaled. 

Splitting the data into training and testing sets is also very important. It is never good to train your model on some data and then test it on that same data. There are different approaches to help improve generalization in a model, but it is always important to test the model on data it has never seen before.

### Approach <br>

**Imbalanced Data** <br>
Approximately 35% of the data has an outcome of 1, and 65% of the data has an outcome of 0. The imbalance is not extreme, but learning to address such problems is still important when developing machine learning models. In this approach, four models will be trained and compared: **SVM, Logistic Regression, Random Forests, and KNN**. 


**AUROC** (area under the ROC curve) will be used to gauge a model's ability to correctly classify the data. Where the algorithm supports it, the models will also be weighted to handle the imbalanced data as well as possible. For more information, look into ROC curves.
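
AUROC can be computed either from continuous scores (probabilities) or from hard 0/1 predictions, and the two generally differ. The sketch below uses made-up scores to show this; with hard predictions there is only one threshold, so the AUROC reduces to the average of sensitivity and specificity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.3, 0.6, 0.4, 0.7, 0.9])   # made-up predicted probabilities
labels = (scores >= 0.5).astype(int)                 # hard predictions at a 0.5 cut-off

auc_from_scores = roc_auc_score(y_true, scores)   # uses the full ranking of the scores
auc_from_labels = roc_auc_score(y_true, labels)   # a single threshold only

sensitivity = labels[y_true == 1].mean()          # share of positives predicted as 1
specificity = 1 - labels[y_true == 0].mean()      # share of negatives predicted as 0
print(auc_from_scores, auc_from_labels, (sensitivity + specificity) / 2)
```

The AUROC values reported in this notebook are computed from hard predictions (the output of `predict`), so they correspond to the second variant.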

**Train/Validation/Test Split**: The data will be split into three groups. <br>
Training data: data used to train the model <br>
Validation: data used to tune the hyperparameters <br> 
Test: data used for final evaluation of the model <br>
--> Ratio: 70/15/15
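
A 70/15/15 split can be produced with two calls to `train_test_split`. The sketch below runs on synthetic data and adds `stratify`, which keeps the class ratio similar in every subset. Stratification is an optional extra assumed here for illustration; the notebook's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 200 rows, 35% positives (roughly the ratio in this dataset)
X = np.arange(200 * 2).reshape(200, 2)
y = np.array([1] * 70 + [0] * 130)

# First cut: 70% train, 30% held out; stratify keeps the class ratio in both parts
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42)

# Second cut: split the held-out 30% in half -> 15% validation, 15% test
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, train_size=0.5, stratify=y_hold, random_state=42)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```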

### Import Algorithms

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
from sklearn.utils.multiclass import unique_labels
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

### Feature Scaling 
Some algorithms require the features to be **scaled/standardized/normalized**. There are different ways to accomplish this, and the best method depends on the spread of the data. We will use scikit-learn's **Robust Scaler**, which centers each feature on its median and scales it by its interquartile range, making it less sensitive to outliers.

Required: KNN, Logistic Regression, SVM <br>
Not Required: Random Forests 
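
With its default settings, `RobustScaler` computes `(value - median) / IQR` for each column. The sketch below checks this against a manual calculation on a made-up column that contains one outlier.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # made-up column with one outlier

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])
manual = (x - median) / (q3 - q1)          # (value - median) / interquartile range

scaled = RobustScaler().fit_transform(x)   # default quantile_range is (25.0, 75.0)
print(scaled.ravel())
```

The outlier stays large after scaling, but it does not influence how the other values are scaled, because the median and IQR ignore the extremes.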

**Data Prep** <br>
The data will be split into training/validation/test sets

In [None]:
features = (df2.iloc[:,:8]).values # feature values 
target = (df2.loc[:,'Outcome']).values # target values 

In [None]:
# Train - Validation/Test Split --> 70/30 
testSize = 0.3
trainSize = 0.7
validSize = 0.5
rs = 42 # random state 

x_train, x, y_train, y = train_test_split(features,target,train_size = trainSize, random_state=rs)
x_val, x_test, y_val, y_test = train_test_split(x,y,train_size=validSize, random_state = rs)

In [None]:
print(f"# of Training Data:{len(x_train)}\n# of Validation Data: {len(x_val)}\n# of Test Data:{len(x_test)}")

**Data Scaling**

In [None]:
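scaler = RobustScaler()
xTrain_scaled = scaler.fit_transform(x_train) # fit the scaler on the training data only
xVal_scaled = scaler.transform(x_val) # reuse the training median/IQR to avoid leaking validation statistics
xTest_scaled = scaler.transform(x_test) # same transformation for the test data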

### Machine Learning Models

**Evaluation Metrics**<br>
These metrics give us a better understanding of how our model performs. Ideally, the AUC of the ROC curve should be as close to 1 as possible. Sensitivity is the proportion of actual positives that are correctly classified (e.g. the percentage of sick people who are correctly identified as having the condition). Specificity is the proportion of actual negatives that are correctly classified (e.g. the percentage of healthy people who are correctly identified as not having the condition). The ROC curve plots sensitivity against 1 - specificity, and the AUC is the area under that curve. <br><br>
Accuracy alone is not a good enough measure. Suppose there were 150 data points in Class 1 and 10 in Class 2: even if the model misclassifies all of the Class 2 data points, the accuracy would still be 150/160 (93.75%). When the data is imbalanced, it is vital to use other metrics to measure the performance of the model. 
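
The 150-versus-10 example can be reproduced directly. In this sketch the majority class is labeled 0 and the minority class 1 (an arbitrary labeling chosen for illustration), and the "model" always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# 150 points of the majority class (0) and 10 of the minority class (1)
y_true = np.array([0] * 150 + [1] * 10)
y_pred = np.zeros_like(y_true)            # always predict the majority class

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
sensitivity = tp / (tp + fn)              # true positive rate
specificity = tn / (tn + fp)              # true negative rate
print(accuracy, sensitivity, specificity)
```

Accuracy looks excellent while sensitivity is zero, which is exactly the failure that accuracy alone hides.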

In [None]:
def modeleval(yTrue, yPredict, print_metrics,modelname):
    # Area Under ROC Curve
    auc = roc_auc_score(yTrue,yPredict)

    # Confusion Matrix Evaluation 
    cm = confusion_matrix(yTrue,yPredict)

    # True negative, false positive, false negative, true positive
    tn, fp, fn, tp = cm.ravel() 

    # True Positive Rate (Sensitivity)
    tpr = tp/(tp+fn)

    # True Negative Rate (Specificity)
    tnr = tn/(tn+fp)

    # Accuracy 
    acc = accuracy_score(yTrue,yPredict)
    
    # Model Metrics
    mm = {
        'AUC':auc,
        'Confusion Matrix':cm,
        'TN':tn,
        'FP':fp,
        'FN':fn,
        'TP':tp,
        'TPR':tpr,
        'TNR':tnr,
        'Accuracy':acc
    }
    
    if print_metrics:
        print(f"Sensitivity:{mm['TPR']}\n\n\
Specificity:{mm['TNR']}\n\n\
AUC of ROC:{mm['AUC']}\n\n\
Accuracy:{mm['Accuracy']}\n\n")
        
        x = pd.crosstab(yTrue, yPredict, rownames=['True'], colnames=['Predicted'], margins=True)
        print(f"{x}\n")
        plot_confusion_matrix(yTrue, yPredict,classes=np.array(['No Diabetes','Diabetes']),
                      title='Confusion matrix: ' + modelname)
        plt.show()
    
    return mm  

In [None]:
def plotroc(yvt,yvp,modelname): # y_validation_truth & y_validation_prediction
    f, t, thresh = roc_curve(yvt, yvp)
    roc_auc = auc(f, t)
    plt.title('Receiver Operating Characteristic: ' + modelname)
    plt.plot(f, t, 'b', label = 'AUC = %0.2f' % roc_auc)
    plt.legend(loc = 'lower right')
    plt.plot([0, 1], [0, 1],'r--')
    plt.xlim([0, 1])
    plt.ylim([0, 1])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

In [None]:
# Plot Confusion Matrix

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    # Convert counts to row-wise proportions when normalization is requested
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax

# This function was obtained from the Scikit-learn documentation for plotting the confusion matrix

### 1. SVM 

In [None]:
# Initialize SVM Classifier 
svm_ = SVC(kernel='rbf',class_weight = 'balanced',random_state = 1)

#Train the model with the training data 
svm_.fit(xTrain_scaled,y_train)

#Validate the model 
y_valPredict = svm_.predict(xVal_scaled)

**Evaluate the SVM Model**

In [None]:
svmModel = modeleval(y_val,y_valPredict,True,'SVM')

In [None]:
plotroc(y_val,y_valPredict,"SVM")

**Observations:** <br> Using the validation set, the different kernels for SVM were tested (Linear, RBF, Sigmoid, and Poly) to gauge the model performance. It was determined that the RBF (Radial Basis Function) kernel provided the best results based on the metrics displayed above. 
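
A kernel comparison of this kind could look roughly as follows. This is a sketch on synthetic data from `make_classification`, not the notebook's dataset, so the winning kernel here says nothing about the result reported above. The names `kernel_auc` and `best_kernel` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled training/validation sets
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.7, random_state=42)
scaler = RobustScaler().fit(X_tr)                     # fit on training data only
X_tr_s, X_va_s = scaler.transform(X_tr), scaler.transform(X_va)

kernel_auc = {}
for kernel in ["linear", "rbf", "sigmoid", "poly"]:
    clf = SVC(kernel=kernel, class_weight="balanced", random_state=1)
    clf.fit(X_tr_s, y_tr)
    # AUROC from hard predictions on the validation split
    kernel_auc[kernel] = roc_auc_score(y_va, clf.predict(X_va_s))

best_kernel = max(kernel_auc, key=kernel_auc.get)
print(kernel_auc, best_kernel)
```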

### 2. KNN (k-Nearest Neighbour)

In [None]:
knnAUC = [] # AUCROC values 
valScores = [] # Validation accuracy scores 
kvalues = [] # K values

In [None]:
for i in range (1,21):            
    knn = KNeighborsClassifier(i)
    knn.fit(xTrain_scaled,y_train)

    #Predict 
    knnPred = knn.predict(xVal_scaled)
    
    #Evaluation 
    knnModel = modeleval(y_val,knnPred,False,'k-NN')
    
    knnAUC.append(knnModel['AUC'])
    valScores.append(knnModel['Accuracy'])
    kvalues.append(i)
    
knnPerformance = pd.DataFrame({'AUC':knnAUC,'Accuracy':valScores})

In [None]:
# Plot KNN Performance 
figknn = plt.figure(figsize=(8,8))
knnP = plt.subplot(111)
knnP.plot(kvalues,knnAUC, label = 'AUC', marker = 'o')
knnP.plot( kvalues,valScores,label = 'Accuracy', ls = '-')
plt.xlabel('K-Values')
plt.ylabel("AUCROC & Accuracy")
plt.title('K-NN Model Performance')
plt.xticks(np.arange(1,21,1))
knnP.legend()
plt.show()
#figknn.savefig(r'C:\Users\Shakti\Desktop\Data Science\Projects\Pima-Indians-Diabetes-Project\Data Visualizations\knnModelTuning.jpg')

As can be seen from the graph above, **K=11** provides the highest AUCROC and accuracy for the K-NN model, with 
**AUCROC = 71.40%** and **Accuracy = 75.65%**. 

**Rebuild Model** <br>
Rebuild the model with the optimal parameter. --> K=11

In [None]:
knn = KNeighborsClassifier(11)
knn.fit(xTrain_scaled,y_train)

#Predict 
knnPred = knn.predict(xVal_scaled)

In [None]:
#Evaluation 
knnModel = modeleval(y_val,knnPred,True, 'k-NN')

In [None]:
plotroc(y_val,knnPred, "KNN")

### 3. Logistic Regression

In [None]:
regularization = ['l1', 'l2'] # Regularization methods 
cost = [0.001, 0.01, 0.1, 1, 10,100] 
aucLogreg = np.zeros((len(cost),len(regularization)))

In [None]:
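# Fill the AUC grid: one row per C value, one column per penalty.
# The liblinear solver is used because the default solver (lbfgs) does not support the 'l1' penalty.
for col, penalty in enumerate(regularization):
    for row, c in enumerate(cost):
        logreg = LogisticRegression(C = c,class_weight = 'balanced',penalty = penalty,solver = 'liblinear')
        logreg.fit(xTrain_scaled,y_train)
        lrPred=logreg.predict(xVal_scaled)
        lrModel = modeleval(y_val,lrPred,False,'Logistic Regression')
        
        aucLogreg[row,col] = lrModel['AUC']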

In [None]:
# Plot Logistic Regression Performance 
figlr = plt.figure(figsize=(8,8))
lrP = plt.subplot(111)
for i in range(2):
    lrP.plot(cost, aucLogreg[:,i], label = regularization[i], marker = 'o')
#lrP.plot( kvalues,valScores,label = 'Accuracy', ls = '-')
plt.xlabel('C-Values')
plt.ylabel("AUCROC")
plt.title('Logistic Regression Model Performance: Regularization')
plt.xscale('log')
plt.xticks(cost)
lrP.legend()
plt.show()

In [None]:
logreg = LogisticRegression(C=0.01,class_weight = 'balanced',penalty = 'l2')

#Train
logreg.fit(xTrain_scaled,y_train)

In [None]:
#Predict
lrPred=logreg.predict(xVal_scaled)

In [None]:
#Evaluate
lrModel = modeleval(y_val,lrPred,True,'Logistic Regression')

In [None]:
plotroc(y_val,lrPred, "Logistic Regression")

**Regularization** <br>
Regularization is a method used to reduce the risk of overfitting a model. In our model, 'l2' (ridge) regularization is used, as it can be seen that it helps obtain a higher AUROC. C controls the regularization: it is the inverse of the regularization strength (a.k.a. lambda), so smaller values of C mean stronger regularization. <br>

C=0.01 was chosen over C=0.001 (which gives the highest AUROC in the graph) because C=0.01 provides a better trade-off between specificity and sensitivity.
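
The inverse relationship between C and the regularization strength can be observed in the size of the fitted coefficients. The sketch below uses synthetic data and scikit-learn's default L2 penalty; `coef_norms` is an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, random_state=0)  # synthetic data

coef_norms = {}
for C in [0.01, 1, 100]:
    # The default penalty is L2; a small C means a strong penalty on the weights
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    coef_norms[C] = np.linalg.norm(model.coef_)   # overall size of the weights

print(coef_norms)
```

The smallest C yields the smallest coefficient norm, since a strong penalty pulls the weights toward zero.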

### 4. Random Forests

In [None]:
rf = RandomForestClassifier(random_state = 5)

In [None]:
#Train
rf.fit(x_train,y_train)

In [None]:
#Predict
rfPred=rf.predict(x_val)

In [None]:
#Evaluate
rfModel = modeleval(y_val,rfPred,True,'Random Forests')

In [None]:
plotroc(y_val,rfPred, "Random Forests")

**The results obtained so far for the Random Forest model are based on the model's default hyperparameters. The AUCROC is only 0.61 and the accuracy is 67.83%. To improve the model, we will have to optimize the hyperparameters.**

**GridSearchCV** <br>

GridSearchCV is a tool for hyperparameter tuning: it evaluates every combination in a grid of candidate values with cross-validation and picks the best one. It can be applied to different types of models, but in this case we will only apply it to the Random Forest model, since this model has many hyperparameters that can affect its performance. 

In [None]:
# The parameter grid outlines which parameters you want to optimize and the corresponding hyperparameter values you want to test.
# 'auto' was removed from max_features in scikit-learn 1.3; for classifiers it was equivalent to 'sqrt'.
param_grid = { 
    'n_estimators': [10,50,100,200,300,500,600],
    'max_features': ['sqrt', 'log2'],
}

In [None]:
# Initialize the grid search model

rf_gs = GridSearchCV(estimator = RandomForestClassifier(random_state = 5),
                    param_grid = param_grid, cv=5)

#Train the model
rf_gs.fit(x_train,y_train)

In [None]:
#obtain the best parameters determined by the grid search
print(f"The best parameters for the Random Forest model are: \n {rf_gs.best_params_}")

**-->** Rebuild the model with the optimal parameters.

In [None]:
rf_custom = RandomForestClassifier(max_features= 'sqrt',n_estimators=200,random_state = 5) # 'sqrt' replaces 'auto', which was removed in scikit-learn 1.3
rf_optimal = RandomForestClassifier(max_features= 'log2',n_estimators=100,random_state = 5)

In [None]:
#Train
rf_custom.fit(x_train,y_train)
rf_optimal.fit(x_train,y_train)

In [None]:
#Predict
rf_customPred=rf_custom.predict(x_val)
rf_optimalPred=rf_optimal.predict(x_val)

#Evaluate
print("----------------------------------Metrics for the CUSTOM Random Forests Model----------------------------------\n")
rf_customModel = modeleval(y_val,rf_customPred,True,'Random Forests (Custom)')
plotroc(y_val,rf_customPred, "Random Forests (Custom)")
print("----------------------------------Metrics for the OPTIMAL Random Forests Model----------------------------------\n")
rf_optimalModel = modeleval(y_val,rf_optimalPred,True,'Random Forests (Optimal)')
plotroc(y_val,rf_optimalPred, "Random Forests (Optimal)")

The **Custom RF Model** was built using parameters chosen by manual testing, without GridSearch. It provided a better AUCROC compared to the **Optimal RF Model** created using GridSearch. It is important to note that in this case, the sensitivity, specificity, and AUC are primarily being used to assess the model performance. GridSearchCV, by default, ranks parameter combinations by the estimator's cross-validated score (accuracy for classifiers), which can be changed through its `scoring` parameter. In this case, we will use the custom RF model. 
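
If AUROC is the metric that matters, the grid search can be asked to optimize it directly through the `scoring` parameter. The following is a minimal sketch on synthetic data with a reduced grid, chosen only to keep the run short; it is not the search performed above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=8,
                           weights=[0.65, 0.35], random_state=0)

# Reduced grid for speed (an assumption for this sketch)
small_grid = {"n_estimators": [10, 50], "max_features": ["sqrt", "log2"]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=5),
    param_grid=small_grid,
    scoring="roc_auc",   # rank candidates by cross-validated AUROC instead of accuracy
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

With `scoring="roc_auc"`, the search uses predicted probabilities rather than hard labels, so its scores are not directly comparable to the label-based AUROC values reported elsewhere in this notebook.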

## Model Testing

The models will be tested on the **test data** to observe their performance on unseen data. This will help gauge which model performs best on new data. 

In [None]:
theModels = []
theModels.append(('SVM',svm_))
theModels.append(('k-NN', knn))
theModels.append(('Logistic Regression', logreg))
theModels.append(('Random Forests',rf_custom))
#model_names = ['SVM','k-NN','Logistic Regression','Random Forests']
#models = [svm_,knn,logreg,rf_customModel]

In [None]:
#Iterate over the models
theAcc = []
theAUC = []
# xTest_scaled= scaler.fit_transform(x_test)
for name,model in theModels:
    if name == 'Random Forests':
        modelPred = model.predict(x_test)
    else:
        modelPred = model.predict(xTest_scaled)
    
    modelMetrics = modeleval(y_test,modelPred,False,name)
    
    #msg = "{name}: {modelMetrics['Accuracy']}".format(name,)
    msg = name + "\n --> Accuracy: {:.2f} %\n --> AUC: {:.2f}\n".format(modelMetrics['Accuracy']*100,modelMetrics['AUC'])
    print(msg)
    
    theAcc.append(modelMetrics['Accuracy'])
    theAUC.append(modelMetrics['AUC'])

In [None]:
y_pos = np.arange(len(theModels))
mname = []
for i in range(len(theModels)):
    mname.append(theModels[i][0])
    
figModel = plt.figure(figsize=(8,8))
plt.bar(y_pos + 0.00,theAUC,align='center',alpha = 0.65,width = 0.25,label = 'AUCROC')
plt.plot(y_pos,theAcc,color = 'r',label = 'Accuracy')
#plt.bar(y_pos + 0.25,theAUC,align='center',alpha = 0.65,width = 0.25,label = 'AUCROC')
plt.xticks(y_pos,mname)
plt.title('Model Accuracy and AUCROC')
plt.ylabel('Accuracy/AUC')
plt.xlabel('Model')
plt.legend()
plt.show()
#figModel.savefig(r'C:\Users\Shakti\Desktop\Data Science\Projects\Pima-Indians-Diabetes-Project\Data Visualizations\ModelTesting.jpg')

In [None]:
knnTestPred = knn.predict(xTest_scaled)
plot_confusion_matrix(y_test, knnTestPred,classes=np.array(['No Diabetes','Diabetes']),
                      title='Confusion matrix: ' + 'k-NN Model')
#plt.show()

In [None]:
f, t, thresh = roc_curve(y_test, knnTestPred)
roc_auc = auc(f, t)
figKnnTest = plt.figure(figsize=(8,8))
plt.title('Receiver Operating Characteristic: ' + 'k-NN Model')
plt.plot(f, t, 'b', label = 'AUC = %0.2f' % roc_auc)
plt.legend(loc = 'lower right')
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

## Conclusion

From the information above, we can note that in this project, k-NN provided the best accuracy and AUCROC. Random Forests also performed relatively well. It was surprising to see the SVM model perform so poorly after getting an AUC of almost 76% on the validation set. However, it is an important demonstration of how models may perform on unseen data. <br>

### Improvements

This project covered a lot of different concepts which are helpful in building machine learning models, such as data manipulation, data exploration, data scaling/normalization, and model evaluation. However, there are always ways to improve. For example, an ensemble voting method can combine multiple models to determine the final outcome. When working with medical data, it is also important to consider the optimal trade-off between false negatives and false positives. Feature engineering techniques can also help enhance model performance. For instance, it can be noted that for some features, certain ranges of values contained more points relating to diabetic or non-diabetic patients. That information can be used to derive additional features that can help the model make more accurate predictions. 
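
As a rough illustration of the ensemble voting idea, scikit-learn's `VotingClassifier` can combine several of the model types used in this project. The sketch below runs on synthetic data. The choice of soft voting, the three estimators, and their settings are assumptions for illustration, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the diabetes data
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.65, 0.35], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=42)

voter = VotingClassifier(
    estimators=[
        # Scale-sensitive models get their own scaler inside a pipeline
        ("knn", make_pipeline(RobustScaler(), KNeighborsClassifier(n_neighbors=11))),
        ("logreg", make_pipeline(RobustScaler(),
                                 LogisticRegression(C=0.01, class_weight="balanced"))),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=5)),
    ],
    voting="soft",   # average predicted probabilities rather than counting votes
)
voter.fit(X_tr, y_tr)
ensemble_pred = voter.predict(X_te)
ensemble_acc = voter.score(X_te, y_te)
print(ensemble_acc)
```

Wrapping the scaler and the model in one pipeline means each estimator handles its own preprocessing, so the ensemble can be fitted on unscaled features.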