# Early Stage Diabetes Risk Prediction

## Objective:

* Our main goal is to use data science algorithms, tools, and techniques to predict diabetes at an early stage so that preventive measures can be taken in advance. Diabetes is a common disease that can develop early in life or in mid-life. It can also run in families: the risk is higher if either or both parents have a history of diabetes. For our analysis and prediction we use the <a href="https://www.kaggle.com/uciml/pima-indians-diabetes-database">"Pima Indians Diabetes Database"</a> from Kaggle.


## Data Description:

<p>This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, <b>all patients here are females at least 21 years old of Pima Indian heritage.</b></p>
<br>
<br>

<p>The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.</p>
<br>
<br>

## Data Creator Acknowledgements:
<b>Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261--265). IEEE Computer Society Press.</b>

## Our Work Flow:

<ol>
    <li>Data Description</li>
    <li>Basic understanding of the data using head(), tail(), etc.</li>
    <li>Looking For Null/Missing values and Handling them</li>
    <li>Basic EDA Using Data Visualization</li>
    <li>Applying Classification Model(s)</li>    
</ol>

## Data Description

<p>As mentioned above, we use the "Pima Indians Diabetes Database" from Kaggle. A brief overview of the dataset follows:</p>

<p>There are <b>768 observations and 9 columns</b> (8 predictor features and the target) in the dataset, where each row represents a person. Each column is described briefly below:</p>

* Pregnancies (Numerical): The number of times the woman has been pregnant.
* Glucose (Numerical): The individual's glucose level.
* BloodPressure (Numerical): The individual's blood pressure measurement.
* SkinThickness (Numerical): The individual's skin thickness measurement.
* Insulin (Numerical): The individual's insulin level.
* BMI (Numerical): The individual's body mass index.
* DiabetesPedigreeFunction (Numerical): A function that scores the likelihood of diabetes based on family history. It reflects the history of diabetes mellitus in relatives and the genetic relationship of those relatives to the individual.
* Age (Numerical): The individual's age.
* Outcome (Categorical Nominal): Target column that specifies whether an individual has diabetes: 0 means the individual is not diabetic and 1 means the individual is diabetic.

# Basic Understanding Of Data

In [None]:
# Loading necessary libraries 

import pandas as pd  # data analysis
import numpy as np   # numerical computing

import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical data visualization
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [None]:
# loading data
pima_diabetes = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

In [None]:
# printing head and tail of data
print("Head of data:\n\n",pima_diabetes.head())
print("\nTail of data:\n\n",pima_diabetes.tail())

In [None]:
# shape of data
pima_diabetes.shape

In [None]:
# Getting basic overview of data using .info
pima_diabetes.info()

### As we can see, there are 9 columns and 768 rows. There are no empty cells or null values in the data, and we can also see the data type of each variable. 

In [None]:
# Printing summary statistics of data using .describe
pima_diabetes.describe()

### The summary above gives quick statistics for our data. A few observations:

* The average glucose level of all patients is 120.895.
* The average number of pregnancies is approximately 4.
* The average blood pressure is 69.105.
* The average skin thickness is 20.536.
* The average age of all patients is 33.240.
* <b>The most important observation is that some columns have a minimum value of '0', namely Glucose, BloodPressure, SkinThickness, Insulin, and BMI. This is logically impossible: a person's skin thickness cannot be 0, and whether a person is diabetic or not, their blood pressure, glucose level, insulin level, and body mass index cannot be '0' either. This suggests that these zeros are garbage values standing in for missing measurements, which we need to handle.</b>
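To put a number on the zero-value observation, one option is to count the zeros in each affected column. The sketch below is a minimal illustration on a small made-up frame (`toy`), not the real dataset, and the helper name `count_zeros` is our own.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "Insulin": [0, 0, 0, 94],
})

def count_zeros(df, columns):
    """Return the number of zero entries in each of the given columns."""
    return (df[columns] == 0).sum()

zero_counts = count_zeros(toy, ["Glucose", "BloodPressure", "Insulin"])
print(zero_counts)
```

Applied to `pima_diabetes` with the five suspicious columns, the same helper would show how many values need to be imputed in each of them.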

# Handling Missing/Garbage Values

### As we saw above, the basic summary of our data shows no empty cells or null values, but a few columns (Glucose, BloodPressure, SkinThickness, Insulin, and BMI) contain '0' values that are logically incorrect and that we have to handle. 

### All of these columns are numerical, so the most common way to convert the '0' values into plausible values is to impute them with either the mean or the median of the column. Whether to impute or remove such values also depends on the objective of the analysis. For example, if we were only interested in people aged 50 or younger, and most of the zeros belonged to people older than 50, we could simply drop those rows without affecting the analysis. Here, however, we consider the whole dataset, so we have to impute all of them.

### To choose between the mean and the median, we can inspect the distribution of each affected column using a boxplot and a density/histogram plot. These plots show whether the values are skewed and whether they contain outliers. Based on these two criteria we can decide which measure is more suitable for imputing the missing values.
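The choice between mean and median can also be checked numerically. The sketch below uses a made-up right-skewed sample (not values from the dataset) to show how one extreme value pulls the mean away from the median, and how pandas' `skew()` reports the asymmetry.

```python
import pandas as pd

# Made-up right-skewed sample: one large value drags the mean upward
sample = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

mean_value = sample.mean()      # sensitive to the extreme value
median_value = sample.median()  # robust to the extreme value
skewness = sample.skew()        # positive for a right-skewed sample

print(f"mean={mean_value:.2f}, median={median_value:.2f}, skew={skewness:.2f}")
```

When the mean sits well above the median and the skewness is clearly positive, the median is usually the safer value to impute with.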

In [None]:
# plotting histogram/density plot for all columns with missing values

fig, ax2 = plt.subplots(3,2,figsize=(16, 16)) # 3 rows x 2 columns grid of axes

# creating a histogram with a density (KDE) curve for each column
# (sns.histplot is used because sns.distplot is deprecated in recent seaborn releases)

sns.histplot(pima_diabetes['Glucose'],kde=True,stat='density',ax=ax2[0][0])
sns.histplot(pima_diabetes['BloodPressure'],kde=True,stat='density',ax= ax2[0][1])
sns.histplot(pima_diabetes['SkinThickness'],kde=True,stat='density',ax= ax2[1][0])
sns.histplot(pima_diabetes['Insulin'],kde=True,stat='density',ax= ax2[1][1])
sns.histplot(pima_diabetes['BMI'],kde=True,stat='density',ax= ax2[2][0])

# deleting the unused sixth axes: the 3x2 grid has 6 slots but we need only 5
fig.delaxes(ax2[2,1]) 
plt.show()

### From the density plots above we can see that the Glucose, BloodPressure and BMI values are nearly normally distributed, although there are some outliers, whereas SkinThickness and Insulin are right-skewed.

In [None]:
# plotting boxplot of all columns with missing values

fig, ax2 = plt.subplots(3,2,figsize=(16, 16)) # 3 rows x 2 columns grid of axes

# creating a boxplot for each column (x= is passed by keyword, as newer seaborn versions require)

sns.boxplot(x=pima_diabetes['Glucose'],ax=ax2[0][0])
sns.boxplot(x=pima_diabetes['BloodPressure'],ax= ax2[0][1])
sns.boxplot(x=pima_diabetes['SkinThickness'],ax= ax2[1][0])
sns.boxplot(x=pima_diabetes['Insulin'],ax= ax2[1][1])
sns.boxplot(x=pima_diabetes['BMI'],ax= ax2[2][0])

# deleting the unused sixth axes: the 3x2 grid has 6 slots but we need only 5
fig.delaxes(ax2[2,1]) 
plt.show()

### The boxplots show a few outliers in all of the columns with missing values. As in the density plots, Glucose, BloodPressure and BMI are approximately normally distributed, while SkinThickness and Insulin are right-skewed.
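By default, a seaborn boxplot draws as outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies the same rule by hand to a made-up series; the helper name `iqr_outliers` is our own and the values are not from the dataset.

```python
import pandas as pd

def iqr_outliers(values, whis=1.5):
    """Return the entries lying beyond whis * IQR from the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Made-up measurements with one extreme value
toy_values = pd.Series([10, 12, 11, 13, 12, 11, 10, 100])
outliers = iqr_outliers(toy_values)
print(outliers.tolist())
```

Counting the flagged values per column gives a numeric complement to the visual impression from the boxplots.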

In [None]:
#printing all missing columns mean and median value
missing_col = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']

mis_col_mean = pima_diabetes[missing_col].mean()

mis_col_median = pima_diabetes[missing_col].median()

print('Mean Of all Missing Column:\n\n',mis_col_mean)
print('\n\nMedian Of all Missing Column:\n\n',mis_col_median)

### From the result above, for Glucose and Insulin the mean is greater than the median, while for BloodPressure, SkinThickness and BMI the median is greater than the mean. Note that these statistics are computed before the zeros are replaced, so the zeros pull the values down. We also saw some skewness and outliers in the boxplots and density plots above. When the mean is greater than the median, the distribution is typically right-skewed and the median is the more robust measure of central tendency, so we take the median for Glucose and Insulin. For the remaining three columns the difference between mean and median is small, so we use the median for them as well. We will replace each '0' with the median of the matching Outcome group: zeros for diabetic patients are replaced with the diabetic patients' median, and zeros for non-diabetic patients with the non-diabetic patients' median.

In [None]:
#replacing all zeros with NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
pima_diabetes[['Glucose','BloodPressure','SkinThickness',
  'Insulin','BMI']] = pima_diabetes[['Glucose','BloodPressure','SkinThickness','Insulin','BMI']].replace(0,np.nan)

#checking for null values
print('Checking Null Values\n\n', pima_diabetes.isnull().sum())

In [None]:
# Computing median values separately for diabetic and non-diabetic patients
all_median = pima_diabetes.groupby('Outcome')[["Glucose", "BloodPressure", "SkinThickness","Insulin","BMI"]].median()
all_median

In [None]:
# replacing zeros with median values according to diabetic and non-diabetic patient

# For Glucose Column 
pima_diabetes.loc[(pima_diabetes['Outcome'] == 0 ) & (pima_diabetes['Glucose'].isnull()), 'Glucose'] = 107.0
pima_diabetes.loc[(pima_diabetes['Outcome'] == 1 ) & (pima_diabetes['Glucose'].isnull()), 'Glucose'] = 140.0

# For BloodPressure Column
pima_diabetes.loc[(pima_diabetes['Outcome'] == 0 ) & (pima_diabetes['BloodPressure'].isnull()), 'BloodPressure'] = 70.0
pima_diabetes.loc[(pima_diabetes['Outcome'] == 1 ) & (pima_diabetes['BloodPressure'].isnull()), 'BloodPressure'] = 74.5

# For SkinThickness Column
pima_diabetes.loc[(pima_diabetes['Outcome'] == 0 ) & (pima_diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 27.0
pima_diabetes.loc[(pima_diabetes['Outcome'] == 1 ) & (pima_diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 32.0

# For Insulin Column
pima_diabetes.loc[(pima_diabetes['Outcome'] == 0 ) & (pima_diabetes['Insulin'].isnull()), 'Insulin'] = 102.50
pima_diabetes.loc[(pima_diabetes['Outcome'] == 1 ) & (pima_diabetes['Insulin'].isnull()), 'Insulin'] = 169.50

# For BMI Column
pima_diabetes.loc[(pima_diabetes['Outcome'] == 0 ) & (pima_diabetes['BMI'].isnull()), 'BMI'] = 30.10
pima_diabetes.loc[(pima_diabetes['Outcome'] == 1 ) & (pima_diabetes['BMI'].isnull()), 'BMI'] = 34.30
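The hard-coded medians above can also be derived directly from the data, which avoids copying numbers by hand. The following is a minimal sketch on a made-up frame (`toy`), assuming the zeros have already been converted to NaN; it uses `groupby(...).transform("median")` to build a frame of group medians aligned with the original rows.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values); NaN marks entries that were 0 in the raw data
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Glucose": [100.0, np.nan, 110.0, 150.0, 130.0, np.nan],
    "BMI": [28.0, 30.0, np.nan, np.nan, 35.0, 33.0],
})
cols = ["Glucose", "BMI"]

# Median of each column within each Outcome group, aligned row by row
group_medians = toy.groupby("Outcome")[cols].transform("median")

# Fill only the missing cells with the median of the matching group
toy[cols] = toy[cols].fillna(group_medians)
print(toy)
```

The same two lines would work on `pima_diabetes` with the five affected columns, and they stay correct if the underlying data changes.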

In [None]:
#Checking for null values after imputation
print('Checking Null Values\n\n', pima_diabetes.isnull().sum())

# Basic EDA Using Data Visualization

In [None]:
#plotting histograms for Pregnancies, Age and DiabetesPedigreeFunction
pima_diabetes.hist(column = ['Pregnancies','Age','DiabetesPedigreeFunction'],figsize=(16, 16))
plt.show()

### From the histogram of the Age column we can say that most women are between 20 and 50 years old; recall that the data only includes women aged 21 or older. The Pregnancies histogram suggests that most women have had between 0 and 5 pregnancies. Lastly, most values of the DiabetesPedigreeFunction column fall between 0 and 1.5. 

In [None]:
# plotting target variable 'Outcome'

sns.countplot(x="Outcome", data=pima_diabetes)
plt.show()

### From the bar chart above, 500 people are non-diabetic and 268 are diabetic. As percentages, 34.9% are diabetic and 65.1% are non-diabetic. In other words, non-diabetic people are almost twice as numerous as diabetic people. 
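The class shares can be checked with `value_counts(normalize=True)`. The sketch below rebuilds a label series from the two counts reported in the text (500 and 268) rather than loading the dataset.

```python
import pandas as pd

# Rebuild the class counts reported in the text: 500 non-diabetic, 268 diabetic
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# Share of each class as a percentage of all rows
class_share = outcome.value_counts(normalize=True) * 100
print(class_share.round(1))
```

On the real data, `pima_diabetes['Outcome'].value_counts(normalize=True)` gives the same breakdown directly.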

In [None]:
# plotting target variable 'Outcome' and Glucose
mean_glucose_distribution = pima_diabetes.groupby('Outcome')['Glucose'].mean()

# x and y are passed by keyword, as newer seaborn versions require
sns.barplot(x=mean_glucose_distribution.index.values,y=mean_glucose_distribution.values)
plt.xlabel('Outcome')
plt.ylabel('Average Glucose level')
plt.title('Average Glucose level in Diabetic and Non-Diabetic Patients')
plt.show()

### From the plot above we can say that the average glucose level of non-diabetic patients is lower than that of diabetic patients.

In [None]:
# plotting target variable 'Outcome' and BMI 
mean_bmi_distribution = pima_diabetes.groupby('Outcome')['BMI'].mean()

sns.barplot(x=mean_bmi_distribution.index.values,y=mean_bmi_distribution.values)
plt.xlabel('Outcome')
plt.ylabel('Average BMI')
plt.title('Average BMI in Diabetic and Non-Diabetic Patients')
plt.show()

### From the plot above we can say that the average body mass index of non-diabetic patients is lower than that of diabetic patients.

In [None]:
# plotting target variable 'Outcome' and Insulin 
mean_insulin_distribution = pima_diabetes.groupby('Outcome')['Insulin'].mean()

sns.barplot(x=mean_insulin_distribution.index.values,y=mean_insulin_distribution.values)
plt.xlabel('Outcome')
plt.ylabel('Average Insulin')
plt.title('Average Insulin in Diabetic and Non-Diabetic Patients')
plt.show()

### From the plot above we can say that the average insulin level of non-diabetic patients is lower than that of diabetic patients.

In [None]:
# plotting target variable 'Outcome' and BloodPressure 
mean_bloodpressure_distribution = pima_diabetes.groupby('Outcome')['BloodPressure'].mean()

sns.barplot(x=mean_bloodpressure_distribution.index.values,y=mean_bloodpressure_distribution.values)
plt.xlabel('Outcome')
plt.ylabel('Average BloodPressure')
plt.title('Average BloodPressure in Diabetic and Non-Diabetic Patients')
plt.show()

### From the plot above we can say that the average blood pressure of non-diabetic patients is lower than that of diabetic patients, although the difference is small.

In [None]:
# plotting target variable 'Outcome' and SkinThickness 
mean_skinthickness_distribution = pima_diabetes.groupby('Outcome')['SkinThickness'].mean()

sns.barplot(x=mean_skinthickness_distribution.index.values,y=mean_skinthickness_distribution.values)
plt.xlabel('Outcome')
plt.ylabel('Average SkinThickness')
plt.title('Average SkinThickness in Diabetic and Non-Diabetic Patients')
plt.show()

### From the plot above we can say that the average skin thickness of non-diabetic patients is lower than that of diabetic patients.

In [None]:
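# adding new Age Group column using age column
# right=False makes the bins left-closed, [21, 30), [30, 40), ..., so they match the labels;
# the last edge is 82 so that the previous upper bound of 81 is still included

bins = [21, 30, 40, 50, 60, 82]
labels = ['21-29', '30-39', '40-49', '50-59', '60+']
pima_diabetes['Age Group'] = pd.cut(pima_diabetes.Age, bins, labels = labels, right = False)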

print('Checking New Column:\n',pima_diabetes['Age Group'].head())
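How `pd.cut` treats the bin edges decides which group a boundary age such as 30 lands in. The sketch below compares the two settings on a handful of made-up ages; by default (`right=True`) each bin includes its right edge, while `right=False` makes each bin include its left edge instead.

```python
import pandas as pd

ages = pd.Series([21, 29, 30, 39, 40, 60, 81])
bins = [21, 30, 40, 50, 60, 82]
labels = ["21-29", "30-39", "40-49", "50-59", "60+"]

# right=True (the pandas default): intervals are (low, high], so 30 falls in the first bin
right_closed = pd.cut(ages, bins, labels=labels, include_lowest=True)

# right=False: intervals are [low, high), so 30 starts the "30-39" bin
left_closed = pd.cut(ages, bins, labels=labels, right=False)

print(pd.DataFrame({"age": ages, "right=True": right_closed, "right=False": left_closed}))
```

For labels written as '21-29', '30-39' and so on, the left-closed variant is the one whose bins agree with the label text.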

In [None]:
# plotting age-group-wise counts of diabetic and non-diabetic patients

sns.countplot(x="Age Group", hue="Outcome", data=pima_diabetes)
plt.title('Age Group Wise Count Of Diabetic and Non-Diabetic Patients')
plt.show()

### From the plot we can see that the 21-29 age group has the highest number of both diabetic and non-diabetic patients, and the 60+ age group has the fewest. Since non-diabetic patients outnumber diabetic patients in our data overall, we see the same pattern in the 21-29, 30-39 and 60+ age groups. In the 40-49 and 50-59 age groups, however, diabetic patients outnumber non-diabetic patients.

In [None]:
# dropping Age Group column

pima_diabetes = pima_diabetes.drop(['Age Group'],axis=1)

pima_diabetes.head()

In [None]:
# creating correlation plot 
f, ax = plt.subplots(figsize= [5,5])
sns.heatmap(pima_diabetes.corr(), annot=True, fmt=".2f", ax=ax, cmap = "magma" )
ax.set_title("Correlation Matrix", fontsize=20)
# workaround for the top and bottom heatmap rows being cut in half on some matplotlib versions
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

### The correlation matrix plot helps us understand the correlation between our features. From the plot above, the variable most strongly correlated with our target variable Outcome is Glucose, with a correlation coefficient of 0.50. The other variables are less correlated with Outcome.
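Reading the target row off a heatmap can be error-prone, so it can help to list the correlations with the target in sorted order. The sketch below does this on synthetic stand-in data, where the columns `Glucose` and `Noise` are generated for illustration and are not the real measurements.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: "Glucose" drives the label, "Noise" does not
glucose = rng.normal(120, 30, size=200)
noise = rng.normal(0, 1, size=200)
outcome = (glucose > 130).astype(int)
toy = pd.DataFrame({"Glucose": glucose, "Noise": noise, "Outcome": outcome})

# Correlation of every feature with the target, largest first
target_corr = toy.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
print(target_corr)
```

On the real frame, `pima_diabetes.corr()['Outcome']` gives the same column that the heatmap displays.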

# Applying Classification Model(s) For Prediction

<p>Our dataset is small, with only 768 rows, so it will be hard to achieve high accuracy with the models we create. The two classes are also imbalanced: we have 500 non-diabetic patients and 268 diabetic patients. This can bias the model toward the majority class and hurt generalization, meaning that for new, unseen data it may wrongly predict that a patient is non-diabetic when the patient is actually diabetic.</p>

<p>To address this problem we will use tree-based classification models: a Decision Tree and its ensemble variants, Random Forest and Gradient Boosting. The reason for choosing these models is that they solve the problem with a sequence of if-else splits and can consider every feature of the dataset when choosing a split. They also perform well on many real-world problems.</p>

<p>To evaluate model performance we will use the confusion matrix, accuracy, F1 score, precision and recall as classification metrics. After checking the accuracy of all three models, we will fine-tune the one with the highest accuracy using hyperparameter optimization and k-fold cross-validation to obtain a more generalized and reliable accuracy estimate. Finally, we will check which features the model considers most important for prediction.</p>

## Machine Learning Steps:

<ol>
    <li>Importing Required Libraries and splitting data into train and test set.</li>
    <li>Initializing Model.</li>
    <li>Fitting Our Model on train set.</li>
    <li>Evaluating model performance using the confusion matrix, accuracy, F1 score, precision and recall.</li>
    <li>Model tuning using hyperparameter optimization and k-fold cross-validation.</li>
    <li>Feature importance, to learn which features our model relies on most for prediction.</li>
</ol>

In [None]:
# importing train test split library
from sklearn import model_selection
from sklearn.model_selection import train_test_split

# importing models library
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,GradientBoostingClassifier

# importing model evaluation library
from sklearn.metrics import classification_report,confusion_matrix,f1_score,precision_score,recall_score,accuracy_score

# importing model tuning library for cross validation
from sklearn.model_selection import cross_val_score

# loading target feature and response feature into different variable
X = pima_diabetes.drop('Outcome',axis=1) #all columns except the last one
y = pima_diabetes['Outcome']

# Checking new variables
print('Variable X:\n',X.head())
print('\n\n Variable y:\n',y.head())

In [None]:
# Splitting our data into train and test sets with a ratio of 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
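With imbalanced classes, a purely random split can leave the train and test sets with slightly different class ratios. One common option, not used in the split above, is the `stratify` argument of `train_test_split`. The sketch below illustrates it on synthetic labels with the class counts from the text; `X_toy` is a placeholder feature column.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts from the text: 500 zeros, 268 ones
y_toy = np.array([0] * 500 + [1] * 268)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature column

# stratify=y_toy keeps the class ratio (about) equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

print("positive share, train:", round(y_tr.mean(), 3))
print("positive share, test: ", round(y_te.mean(), 3))
```

Switching the notebook's split to a stratified one would change the exact test samples, so the scores reported below would no longer match.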

## Decision Tree

In [None]:
# fitting decision tree model on train data
decision_tree_clf = DecisionTreeClassifier(random_state=123)
decision_tree_clf.fit(X_train,y_train)

# Predicting the test set
y_pred_decision_clf = decision_tree_clf.predict(X_test)
print("Predicted Value:\n",y_pred_decision_clf[0:5])

In [None]:
# printing accuracy score and confusion matrix
cm = confusion_matrix(y_test, y_pred_decision_clf)

df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
sns.heatmap(df_cm, annot = True, fmt ='g')
plt.title('Decision Tree')
print("Train Data Accuracy: %.4f" %accuracy_score(y_train, decision_tree_clf.predict(X_train)))
print("Test Data Accuracy: %.4f" %accuracy_score(y_test, y_pred_decision_clf))

### We can see that the accuracy of our model on the test set is 87.01%, which is good. Taking diabetic (1) as the positive class, the confusion matrix tells us that 71 (TP) + 130 (TN) = 201 samples are correctly classified and 17 (FN) + 13 (FP) = 30 are incorrectly classified, out of the 231 (TP + TN + FP + FN) test samples.
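To read the four cells without guessing, recall that scikit-learn's `confusion_matrix` puts actual classes in rows and predicted classes in columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below unpacks the cells on made-up labels.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = diabetic (positive class), 0 = non-diabetic
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# For binary labels, rows are actual classes and columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The same unpacking applied to `confusion_matrix(y_test, y_pred_decision_clf)` names each cell of the heatmap explicitly.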

In [None]:
#printing precision score, recall score and f1 score

print('\n Precision Score:',precision_score(y_test, y_pred_decision_clf))
print('\n Recall Score:',recall_score(y_test, y_pred_decision_clf))
print('\n f1_score',f1_score(y_test, y_pred_decision_clf))

### Precision is computed as TP / (TP + FP) = 71 / (71 + 13), which gives 84.52%, and recall as TP / (TP + FN) = 71 / (71 + 17), which gives 80.68%. The F1 score is the harmonic mean of precision and recall, which is 82.56%. Precision tells us what percentage of the samples classified as positive were actually positive, and recall tells us what percentage of the actual positives the model identified. Since we focus on precision in this particular problem, these are good scores for both precision and F1.
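The reported scores can be reproduced by hand from the four confusion-matrix counts. scikit-learn's `precision_score` and `recall_score` treat label 1 (diabetic) as the positive class by default, so the reported 84.52% and 80.68% correspond to TP = 71, FP = 13, FN = 17 and TN = 130. A minimal check:

```python
# Counts from the decision tree confusion matrix, with 1 = diabetic as the positive class
tn, fp, fn, tp = 130, 13, 17, 71

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of all predicted positives, how many were correct
recall = tp / (tp + fn)      # of all actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Plugging in the counts for the other models works the same way.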

## Random Forest

In [None]:
# fitting random forest model on train data
rf_clf = RandomForestClassifier(random_state = 123)
rf_clf = rf_clf.fit(X_train,y_train)

# Predicting the test set
y_pred_rf_clf = rf_clf.predict(X_test)
print("Predicted Value:\n",y_pred_rf_clf[0:5])

In [None]:
# printing accuracy score and confusion matrix
cm = confusion_matrix(y_test, y_pred_rf_clf)

df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
sns.heatmap(df_cm, annot = True, fmt ='g')
plt.title('Random Forest')
print("Train Data Accuracy: %.4f" %accuracy_score(y_train, rf_clf.predict(X_train)))
print("Test Data Accuracy: %.4f" %accuracy_score(y_test, y_pred_rf_clf))

### We can see that the test set accuracy has increased compared to the decision tree classifier, to 93.07%, which is good. Taking diabetic (1) as the positive class, the confusion matrix tells us that 79 (TP) + 136 (TN) = 215 samples are correctly classified and 9 (FN) + 7 (FP) = 16 are incorrectly classified, out of the 231 (TP + TN + FP + FN) test samples.

In [None]:
#printing precision score, recall score and f1 score

print('\n Precision Score:',precision_score(y_test, y_pred_rf_clf))
print('\n Recall Score:',recall_score(y_test, y_pred_rf_clf))
print('\n f1_score',f1_score(y_test, y_pred_rf_clf))

### As we can see above, the random forest ensemble gives a precision of 91.86%, a recall of 89.77% and an F1 score of 90.80%. All scores have increased compared to the decision tree model, which is expected because a random forest generalizes better than a single decision tree.

## Gradient Boosting

In [None]:
# fitting gradient boosting model on train data
Gb_clf = GradientBoostingClassifier(random_state = 123)
Gb_clf = Gb_clf.fit(X_train,y_train)

# Predicting the test set
y_pred_Gb_clf = Gb_clf.predict(X_test)
print("Predicted Value:\n",y_pred_Gb_clf[0:5])

In [None]:
# printing accuracy score and confusion matrix
cm = confusion_matrix(y_test, y_pred_Gb_clf)

df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
sns.heatmap(df_cm, annot = True, fmt ='g')
plt.title('Gradient Boosting')
print("Train Data Accuracy: %.4f" %accuracy_score(y_train, Gb_clf.predict(X_train)))
print("Test Data Accuracy: %.4f" %accuracy_score(y_test, y_pred_Gb_clf))

### We can see that the test set accuracy has decreased compared to the random forest classifier, to 91.77%. Taking diabetic (1) as the positive class, the confusion matrix tells us that 77 (TP) + 135 (TN) = 212 samples are correctly classified and 11 (FN) + 8 (FP) = 19 are incorrectly classified, out of the 231 (TP + TN + FP + FN) test samples.

In [None]:
#printing precision score, recall score and f1 score

print('\n Precision Score:',precision_score(y_test, y_pred_Gb_clf))
print('\n Recall Score:',recall_score(y_test, y_pred_Gb_clf))
print('\n f1_score',f1_score(y_test, y_pred_Gb_clf))

### As we can see above, the gradient boosting ensemble gives a precision of 90.59%, a recall of 87.50% and an F1 score of 89.01%, so the scores have decreased compared to the random forest. One possible reason is that our dataset is not large enough; with more data, gradient boosting often gives better accuracy than other models.

# Model Tuning

### From the results above we can see that Random Forest and Gradient Boosting both give an accuracy of more than 90%, but the random forest is more accurate than gradient boosting on this particular problem. So, we will use the random forest model for hyperparameter optimization and k-fold cross-validation. 

In [None]:
# fitting random forest on train data
rf_clf_tuned = RandomForestClassifier(random_state=123,n_estimators=150,
                                      criterion='entropy',max_features=3) # setting hyperparameters by hand
rf_clf_tuned = rf_clf_tuned.fit(X_train,y_train)

# Predicting the test set
y_pred_rf_clf_tuned = rf_clf_tuned.predict(X_test)
print("Predicted Value:\n",y_pred_rf_clf_tuned[0:5])
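The hyperparameters above are set by hand. A systematic alternative, shown here only as a sketch, is a grid search with cross-validation via scikit-learn's `GridSearchCV`. The example runs on synthetic data from `make_classification`, and the candidate values in `param_grid` are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 8 features, like the predictors in this notebook
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=123)

# Small, illustrative grid; the candidate values are assumptions, not tuned choices
param_grid = {
    "n_estimators": [50, 100],
    "criterion": ["gini", "entropy"],
    "max_features": [2, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=123),
    param_grid,
    cv=5,                # 5-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best CV accuracy: %.4f" % search.best_score_)
```

To use this in the notebook, `X_toy` and `y_toy` would be replaced by `X_train` and `y_train`, keeping the test set untouched until the final evaluation.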

In [None]:
# printing accuracy score and confusion matrix
cm = confusion_matrix(y_test, y_pred_rf_clf_tuned)

df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
sns.heatmap(df_cm, annot = True, fmt ='g')
plt.title('Random Forest')

print("Train Data Accuracy: %.4f" %accuracy_score(y_train, rf_clf_tuned.predict(X_train)))
print("Test Data Accuracy: %.4f" %accuracy_score(y_test, y_pred_rf_clf_tuned))

In [None]:
# using k-fold cross validation
accuracies = cross_val_score(estimator = rf_clf_tuned, X= X_train, y = y_train, cv = 10)
print(accuracies)

In [None]:
print("Average of Accuracy:\n",np.mean(accuracies))

### After tuning, we did not gain much accuracy over the untuned random forest, and 10-fold cross-validation gave an average accuracy of 87.34%, which is close to that of the decision tree. We believe the accuracy is not increasing because our data is imbalanced: the model fits the training set very well but struggles to reach comparable accuracy on unseen data.

# Feature Importance

In [None]:
feat_importances = pd.Series(rf_clf_tuned.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')

### The feature importance graph above gives us a better understanding of our model. We can see that our tuned random forest model gives the most importance to the 'Insulin' feature, compared to the other features, when predicting whether a patient is diabetic or not.
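The values in `feature_importances_` are impurity-based and relative: they sum to 1 across features, so each bar is a share of the total. Sorting them makes the ranking easier to read. The sketch below uses synthetic data, and the feature names `f0` to `f4` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the feature names f0..f4 are placeholders
X_arr, y_toy = make_classification(n_samples=200, n_features=5, n_informative=2,
                                   n_redundant=0, random_state=123)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=123).fit(X_toy, y_toy)

# Impurity-based importances, sorted so the ranking is easy to read
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the importances are shares of a total, a high value for one feature says it was used more than the others in this model, not that it is important in an absolute sense.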

# Conclusion

<p>After analyzing our data and applying machine learning algorithms to it for prediction, we learned a few things:</p>

* The analysis showed that diabetic patients have, on average, higher insulin level, blood pressure, glucose level and skin thickness than non-diabetic patients.
* Based on this dataset, the 40-49 and 50-59 age groups contain more diabetic than non-diabetic women.
* Based on our best-fitted model, insulin level could be one of the most important features for deciding whether a patient is diabetic or not.
* With imbalanced data it becomes harder to achieve high accuracy, and the model becomes less reliable.