Model building in Scikit-learn
Let's build the diabetes prediction model.

Here, you are going to predict diabetes using a Random Forest classifier.

Let's first load the required Pima Indians Diabetes dataset using the pandas read_csv() function.

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session


In [None]:
import pandas as pd 
import numpy as np

Dataset Description

Pregnancies: Number of times pregnant

Glucose: Plasma glucose concentration (mg/dL)

Blood Pressure: Diastolic blood pressure (mm Hg)

Skin Thickness: A value used to estimate body fat. Normal triceps skinfold thickness in women is 23 mm. Higher values indicate more body fat, which is associated with a higher chance of diabetes.

Insulin: 2-hour serum insulin (mu U/ml)

BMI: Body mass index (weight in kg / (height in m)^2)

Diabetes Pedigree Function: A score that summarizes the diabetes history in relatives and the genetic relationship of those relatives to the patient. A higher value means the patient is more likely to have diabetes.

Age: Age (years)

Outcome: Class variable (0 or 1), where 0 denotes a patient without diabetes and 1 denotes a patient with diabetes.

In [None]:
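# Read the data and separate the features from the target

df = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")
# The first eight columns are the features; the last column (Outcome) is the target.
# .copy() avoids a SettingWithCopyWarning when X is modified later.
X = df.iloc[:, 0:8].copy()
y = df.iloc[:, -1]
X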

In [None]:
df.describe()

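There are no missing (NaN) values in the data.

However, the minimum value is 0 in Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin and BMI. A zero is not physiologically plausible for the five measurements, so these zeros are treated as missing values and replaced with NaN. (A zero in Pregnancies can be a genuine value; this notebook nevertheless treats it the same way.)

The means and standard deviations vary widely across the fields, so the data should be standardized to bring the features onto a comparable scale. Standardization rescales each feature to mean 0 and standard deviation 1, so most values end up roughly between -3 and 3.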
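As a minimal sketch of what standardization does, the snippet below applies scikit-learn's StandardScaler to a small made-up array. The numbers and the names `toy` and `toy_scaled` are illustrative assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (made up for illustration): two columns on very different scales.
toy = np.array([[85.0, 0.2],
                [120.0, 0.5],
                [155.0, 0.8],
                [190.0, 1.1]])

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

# After scaling, each column has mean 0 and (population) standard deviation 1.
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

In a real workflow the scaler would usually be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into preprocessing.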

In [None]:
#libraries for plotting data
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
corr = df.corr()
# Mask the upper triangle so that each pair of features is shown only once
mask = np.triu(corr)
sns.heatmap(corr, annot=True, fmt='.1g', vmin=-1, vmax=1, center=0, cmap='YlOrRd', mask=mask)

**What do you get from this**

Outcome is most strongly correlated with Glucose and BMI.

Age and Pregnancies are positively correlated, i.e. older patients tend to have had more pregnancies.

SkinThickness is strongly correlated with BMI and Insulin.

Every feature shows at least some correlation with Outcome, either directly or through other features.
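Besides reading the heatmap, the correlations with Outcome can be ranked directly. The sketch below does this on a small made-up DataFrame; `toy_df` and its values are assumptions for illustration only, and on the real data you would use `df` in its place.

```python
import pandas as pd

# Toy data (made up for illustration), using column names from the dataset.
toy_df = pd.DataFrame({
    "Glucose": [90, 100, 150, 160, 95, 170],
    "BMI": [30, 22, 28, 35, 26, 31],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# Take the Outcome column of the correlation matrix, drop the trivial
# self-correlation, and sort from strongest positive to weakest.
ranked = toy_df.corr()["Outcome"].drop("Outcome").sort_values(ascending=False)
print(ranked)
```

Keep in mind that a correlation only measures linear association; it does not show that a feature causes the outcome.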


**Data Preprocessing**

It involves:

Treating missing values. Simply remove them if the values are not critical or do not form a major part of your data, or replace them with the mean or median. The choice depends on the business decision.

Dealing with outliers.

Standardizing your data.
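The notebook does not say how outliers should be handled. One common convention, used here purely as an assumption, is the 1.5 x IQR rule: values far outside the interquartile range are flagged. The series below is made up for illustration.

```python
import pandas as pd

# Toy readings (made up for illustration); 95 is deliberately extreme.
values = pd.Series([10, 12, 11, 13, 12, 95])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Anything more than 1.5 * IQR outside the quartiles is flagged as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Whether flagged values are dropped, capped, or kept is a modelling decision; the rule only identifies candidates.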

In [None]:
feature=X.columns
dfzero=(X[feature]==0).sum()
dfzero

In [None]:
# Treat zeros in these columns as missing values (np.nan; the np.NaN alias was removed in NumPy 2.0)
cols_with_zeros = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
X[cols_with_zeros] = X[cols_with_zeros].replace(0, np.nan)

In [None]:
X.isnull().sum()

Fill the NaN values with the column average. Missing values can be filled with the mean, median or mode, depending on the type of data and the number of values missing.
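To see why the choice between mean and median matters, here is a small sketch on a made-up series with one extreme value. The name `glucose` and its numbers are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy series (made up for illustration) with one missing and one extreme value.
glucose = pd.Series([100.0, 110.0, np.nan, 120.0, 400.0])

# The mean is pulled upwards by the extreme value; the median is not.
filled_mean = glucose.fillna(glucose.mean())
filled_median = glucose.fillna(glucose.median())

print(filled_mean.iloc[2])
print(filled_median.iloc[2])
```

When a column is skewed or contains outliers, the median is usually the more robust fill value.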

In [None]:
# Fill each column's NaN values with that column's mean.
# Assigning the result back avoids chained inplace calls, which newer pandas versions warn about.
for col in ['Glucose', 'BMI', 'Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin']:
    X[col] = X[col].fillna(X[col].mean())

In [None]:
X.describe()  # every treated feature now has a non-zero minimum; the placeholder zeros are gone

In [None]:
X.isnull().sum() #there are no more null values

In [None]:
# Scatter plot of Age against Glucose level
X1=X['Glucose']
Y1=X['Age']

plt.scatter(X1,Y1)
plt.xlabel('Glucose Level')
plt.ylabel('Age')
plt.title(label='Age Vs Glucose Chart')
plt.show()

In [None]:
# Drop Age and SkinThickness from the feature set
X = X.drop(columns=['Age', 'SkinThickness'])


In [None]:
X.head()

SPLIT DATASET INTO TRAINING AND TEST DATASET

In [None]:
from sklearn.model_selection import train_test_split 
Xtrain,Xtest,ytrain,ytest= train_test_split(X,y,test_size=0.2,random_state=0)

In [None]:
print(Xtrain.shape)  # training data has 614 rows and 6 columns

print(Xtest.shape)  # test data has 154 rows and 6 columns
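The split above is purely random. When the classes are imbalanced, one option is to pass `stratify` so that both splits keep the same class ratio. The sketch below uses placeholder arrays with the same number of rows and columns as the feature matrix; `X_demo`, `y_demo` and the class counts are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 768 rows and 6 columns; the class counts are
# illustrative and stand in for the real Outcome column.
X_demo = np.zeros((768, 6))
y_demo = np.array([1] * 268 + [0] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
# The share of positive labels is nearly identical in both splits.
print(y_tr.mean(), y_te.mean())
```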

In [None]:
Xtrain.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier
# Default hyperparameters; pass random_state=<int> to make the results reproducible
RF_Classifier = RandomForestClassifier()
RF_Classifier.fit(Xtrain, ytrain)

In [None]:
ypred_RF= RF_Classifier.predict(Xtest)

In [None]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report,roc_auc_score
cm_rf=confusion_matrix(ytest,ypred_RF)
cm_rf
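In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so for a binary problem the four cells are TN, FP, FN and TP. The sketch below unpacks them on made-up labels and derives precision and recall; `y_true_demo` and `y_pred_demo` are illustrative assumptions, not model output.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (made up for illustration).
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP.
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found

print(cm_demo)
print(precision, recall)
```

For a medical screening task, recall is often watched closely, since a false negative means a patient with diabetes is missed.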

In [None]:
score_rf=accuracy_score(ytest,ypred_RF)
print('Accuracy of the RandomForest model:', score_rf)

In [None]:
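from sklearn.metrics import roc_curve
# Use the predicted probability of the positive class; hard 0/1 predictions
# would give a curve with only a single threshold point.
yproba_RF = RF_Classifier.predict_proba(Xtest)[:, 1]
fpr, tpr, _ = roc_curve(ytest, yproba_RF)
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()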
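The ROC curve can be summarized in a single number, the area under the curve (AUC), using roc_auc_score. The sketch below computes it on made-up scores; `y_true_demo` and `y_score_demo` are illustrative assumptions, and on the real model you would pass `ytest` and the predicted probabilities of the positive class.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (made up for illustration).
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a randomly chosen positive example
# receives a higher score than a randomly chosen negative example.
auc_demo = roc_auc_score(y_true_demo, y_score_demo)
print(auc_demo)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of positives above negatives.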