## Pima Indians Diabetes Prediction (Using Logistic Regression)

This is a classification project in predictive analytics. Here I use a logistic regression model to predict the occurrence of diabetes.

### Importing Libraries, Dataset, and EDA

In [None]:
#importing libraries
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
#importing dataset
diabetes_df=pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

In [None]:
diabetes_df.head()

In [None]:
#Evaluating target variable distribution
diabetes_df['Outcome'].hist()
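The histogram shows the class balance visually; the same information can be read as proportions. A minimal sketch on a made-up `Outcome` series (the values are illustrative, not taken from diabetes.csv) also gives the accuracy a model would reach by always predicting the majority class, which is a useful baseline when reading the accuracy figures later on:

```python
import pandas as pd

# Hypothetical labels for illustration only; NOT taken from diabetes.csv.
toy_outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Outcome")

# Share of each class in the target variable.
proportions = toy_outcome.value_counts(normalize=True)

# Accuracy obtained by always predicting the most frequent class.
majority_baseline = proportions.max()

print(proportions)
print("Majority-class baseline accuracy:", majority_baseline)
```

On the real data, the same two lines can be applied to `diabetes_df['Outcome']`.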

In [None]:
#examining missing values
sns.heatmap(diabetes_df.isnull(),cbar=False)
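One caveat, stated as an assumption about this dataset rather than something verified here: the Pima data is commonly reported to encode missing measurements as 0 in columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI, so `isnull()` alone may show nothing. A minimal sketch on a small made-up frame (the rows and the column selection are illustrative) counts zeros per column:

```python
import pandas as pd

# Hypothetical toy rows for illustration only; NOT taken from diabetes.csv.
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

# Columns where a 0 is assumed to stand in for a missing measurement.
zero_as_missing = ["Glucose", "BloodPressure", "BMI"]

# Count the zeros in each of those columns.
zero_counts = (toy_df[zero_as_missing] == 0).sum()
print(zero_counts)
```

If the counts on the real data are large, the zeros may need to be treated as missing values before modelling.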


In [None]:
#examining data types
diabetes_df.info()


In [None]:
#examining data distribution
diabetes_df.describe()

In [None]:
#examining data intercorrelations
sns.pairplot(diabetes_df,hue='Outcome')

### Logistic Regression

To begin with, I will split the data into a 70% training set and a 30% test set. After that, I will build a logistic regression model on the training set and evaluate it on the 30% test set.

I will use the scikit-learn ML library to split the dataset, build the model, and test the model.

- `from sklearn.model_selection import train_test_split`: to split our data into 70% training and 30% testing datasets
- `from sklearn.linear_model import LogisticRegression`: to build a logistic regression model using the training dataset
- `from sklearn.metrics import roc_auc_score, roc_curve`: to construct the ROC curve and compute the ROC AUC score
- `from sklearn.metrics import classification_report`: to get the classification report (F1, accuracy, precision)
- `import statsmodels.api as sm`: to identify features that have no statistically significant effect on the outcome

In [None]:
#assigning input and target variables
X=diabetes_df.drop('Outcome',axis=1)
y=diabetes_df['Outcome']

In [None]:
X.head()

In [None]:
y.head()

In [None]:
#splitting data 70/30 into training and test datasets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=1)
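When the two classes are unevenly represented, a plain random split can leave the training and test sets with different class proportions. The sketch below is not what the cell above does; it shows the optional `stratify` argument of `train_test_split` on made-up labels (`toy_X` and `toy_y` are hypothetical), which keeps the class ratio the same in both parts:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives and 30 positives.
toy_y = [0] * 70 + [1] * 30
toy_X = [[i] for i in range(100)]

# stratify=toy_y preserves the 70/30 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=1, stratify=toy_y
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

On the notebook's data this would correspond to passing `stratify=y`; whether it changes the results here has not been checked.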

In [None]:
#building logistic model
logmodel=LogisticRegression(solver='liblinear')
logmodel.fit(X_train,y_train)
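For intuition about what the fitted model computes: logistic regression passes a linear combination of the features through the sigmoid function to obtain a probability, and the predicted class follows from comparing that probability with 0.5. A minimal sketch with made-up numbers (the intercept, coefficients, and feature values are hypothetical, not the fitted values of `logmodel`):

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical intercept, coefficients, and feature values, for illustration only.
intercept = -1.0
coefs = [0.5, 0.25]
features = [2.0, 4.0]

# Linear score: intercept + sum of coefficient * feature.
score = intercept + sum(c * x for c, x in zip(coefs, features))

# Probability of the positive class, then the class label at a 0.5 cut-off.
prob = sigmoid(score)
label = int(prob >= 0.5)

print(score, round(prob, 3), label)
```

The fitted values for the real model are available as `logmodel.intercept_` and `logmodel.coef_`.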

In [None]:
#evaluating model performance
y_pred=logmodel.predict(X_test)

In [None]:
y_pred[:20]

In [None]:
#displaying confusion matrix
from sklearn.metrics import confusion_matrix, classification_report
confusion_matrix(y_test,y_pred)
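For binary labels, scikit-learn lays out the confusion matrix with true classes as rows and predicted classes as columns, i.e. `[[TN, FP], [FN, TP]]`. The sketch below rebuilds that layout by hand on made-up labels (`y_true` and `y_hat` are hypothetical) and derives the metrics that appear in the classification report:

```python
# Hypothetical true and predicted labels, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

pairs = list(zip(y_true, y_hat))
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives

# Same layout as sklearn's confusion_matrix for labels [0, 1].
matrix = [[tn, fp], [fn, tp]]

precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that are found
accuracy = (tp + tn) / len(y_true)  # share of all predictions that are correct

print(matrix)
print(precision, recall, accuracy)
```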

In [None]:
#displaying classification report (accuracy, recall, precision)
print(classification_report(y_test,y_pred))

In [None]:
from sklearn.metrics import roc_auc_score,roc_curve
import matplotlib.pyplot as plt

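#ROC AUC score calculation
#note: predict() returns hard 0/1 labels, so this scores a single threshold;
#pass logmodel.predict_proba(X_test)[:,1] instead to get the area under the ROC curve plotted below
logit_roc_auc =roc_auc_score(y_test,logmodel.predict(X_test))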
print("The ROC AUC score is "+str(logit_roc_auc))
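The ROC AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting as one half. Scoring hard 0/1 labels collapses that ranking to a single threshold, so the result can differ from the value obtained from predicted probabilities. The sketch below illustrates this with a hand-written pairwise AUC on made-up data (`pairwise_auc`, `y_true`, and `probs` are hypothetical and are not part of scikit-learn or of this notebook's results):

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.7, 0.4, 0.8, 0.9]

# Hard labels at a 0.5 cut-off, as a predict() call would return.
hard = [int(p >= 0.5) for p in probs]

auc_probs = pairwise_auc(y_true, probs)  # uses the full ranking
auc_hard = pairwise_auc(y_true, hard)    # uses only the 0/1 labels

print(auc_probs, auc_hard)
```

In this toy case the two values differ; how much they differ on the diabetes data would need to be checked by running both variants.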


In [None]:
#displaying ROC curve
fpr,tpr,thresholds = roc_curve(y_test,logmodel.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr,tpr,label="Logistic Regression (area = %0.2f)"%logit_roc_auc)
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.plot([0,1],[0,1],'r--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.show()

In [None]:
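#determining features having no effect on the outcome using statsmodels package
#note: sm.Logit does not add an intercept automatically; use sm.add_constant(X_train) to include one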
import statsmodels.api as sm
logit_model=sm.Logit(y_train,X_train)
logmodel_2 = logit_model.fit()
print(logmodel_2.summary2())

In the table above, the P>|z| column gives the p-value of each coefficient. A p-value greater than 0.05 means that the parameter does not have a statistically significant effect on the outcome at the 5% level.

In the table above, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age have p-values greater than 0.05, so these parameters are not statistically significant in predicting the outcome.

In [None]:
#rebuilding the model excluding non-statistically significant features
X1 = diabetes_df.drop(["Outcome","SkinThickness","Insulin","BMI","DiabetesPedigreeFunction","Age"],axis=1)
X1.head()

In [None]:
y.head()

In [None]:
#assigning training and testing data for new model
X1_train, X1_test, y1_train, y1_test = train_test_split(X1,y,test_size=0.3,random_state=1)


In [None]:
#building new logistic model
logmodel_re=LogisticRegression(solver='liblinear')
logmodel_re.fit(X1_train,y1_train)

In [None]:
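#drawing ROC curve for the new model
#note: predict() returns hard 0/1 labels; use logmodel_re.predict_proba(X1_test)[:,1] for the area under the plotted curve
logit_roc_auc_re =roc_auc_score(y1_test,logmodel_re.predict(X1_test))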
print("The ROC AUC score is "+str(logit_roc_auc_re))
fpr1,tpr1,thresholds1 = roc_curve(y1_test,logmodel_re.predict_proba(X1_test)[:,1])
plt.figure()
plt.plot(fpr1,tpr1,label="Logistic Regression (area = %0.2f)"%logit_roc_auc_re)
plt.xlim([0.0,1.0])
plt.ylim([0.0,1.05])
plt.plot([0,1],[0,1],'r--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Receiver Operating Characteristic")
plt.legend(loc="lower right")
plt.show()


Here, we can see that the ROC AUC scores for both models are the same. In other words, the model's performance did not change significantly when the statistically non-significant parameters were excluded.

This supports our identification of the statistically non-significant parameters.