In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import io
from IPython.display import display, HTML  # IPython.core.display is deprecated for these names
import os

import seaborn as sns
import matplotlib.pyplot as plt


In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df_diabetes = pd.read_csv("../input/pima-indians-diabetes-database/diabetes.csv")

### Let's describe the Dataset

In [None]:
display(df_diabetes.describe())

Several columns, such as Glucose and SkinThickness, have a minimum value of zero.  
Also notice that the 'count' value is the same (768) for every column. This means that there are no NULL values.

However, a zero in columns like Insulin and BMI is not physiologically possible, so these zeros most likely stand in for missing measurements. Hence, the rows that contain them need to be deleted.
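
Before deleting anything, it can help to see how many zeros each column actually holds. The sketch below uses a small made-up frame (`toy`, with invented values) in place of `df_diabetes`, and the choice of `measurement_cols` is an assumption for illustration. It also shows one possible alternative to deletion, marking the zeros as missing, which is not what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_diabetes (values are made up).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 1, 0, 0],  # a zero is a valid count in this column
})

# Columns in which a zero cannot be a real measurement (assumed list).
measurement_cols = ["Glucose", "Insulin", "BMI"]

# Count the zero entries in each measurement column.
zero_counts = (toy[measurement_cols] == 0).sum()
print(zero_counts)

# Alternative to deleting rows: mark the zeros as missing values instead.
marked = toy.copy()
marked[measurement_cols] = marked[measurement_cols].replace(0, np.nan)
print(marked.isna().sum())
```

Running the same count on `df_diabetes` would show which columns drive most of the row loss when all zero rows are removed.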

In [None]:
# Columns in which a zero cannot be a real measurement.
zero_invalid_cols = ["BMI", "Glucose", "BloodPressure", "SkinThickness", "Insulin"]
print('''How many rows would be left if we eliminate all zero values?''')
display(df_diabetes[(df_diabetes[zero_invalid_cols] != 0).all(axis=1)].shape[0])

Let's proceed to remove the rows that contain zeros and retain the remaining 392 rows.

In [None]:
# Keep only the rows with no zero in any of the measurement columns.
zero_invalid_cols = ["BMI", "Glucose", "BloodPressure", "SkinThickness", "Insulin"]
df_diabetes = df_diabetes[(df_diabetes[zero_invalid_cols] != 0).all(axis=1)].copy()

Now describe the dataset again.

In [None]:
display(df_diabetes.describe())

In [None]:
print('''Fetch the distribution of the Outcome variable''')

display(df_diabetes.Outcome.value_counts())

Both values of the target column ('Outcome'), TRUE (1) and FALSE (0), seem to be reasonably well represented.
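
To put a number on how balanced the classes are, the counts can be turned into proportions. This is a minimal sketch on a made-up `outcome` series (the ten values are invented and are not taken from the dataset); the same two calls would work on `df_diabetes.Outcome`.

```python
import pandas as pd

# Hypothetical target column: 7 negatives and 3 positives (made-up values).
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="Outcome")

# Absolute number of rows per class.
counts = outcome.value_counts()

# Share of rows per class (sums to 1).
shares = outcome.value_counts(normalize=True)

# Ratio of the majority class to the minority class.
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares)
print(imbalance_ratio)
```

A ratio close to 1 means the classes are nearly balanced; the larger it gets, the less informative plain accuracy becomes.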

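Let's look at the correlation matrix for this dataset to see how strongly these features are correlated with each other and with the target.

A value above 0 means that two columns are positively correlated (they tend to increase together), and a value below 0 means that they are negatively correlated. The closer the value is to 1 or -1, the stronger the linear relationship; a value near 0 indicates little or no linear relationship.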

In [None]:
df_diabetes.corr()
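
To make the sign and size of a correlation coefficient concrete, here is a small sketch on synthetic data. The frame `toy` and its column names are hypothetical and exist only for illustration; the columns are built so that one is strongly positively related to `x`, one strongly negatively related, and one unrelated.

```python
import numpy as np
import pandas as pd

# Synthetic data with a fixed seed so the sketch is reproducible.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)

toy = pd.DataFrame({
    "x": x,
    "strong_pos": 2 * x + rng.normal(scale=0.5, size=n),  # rises with x
    "strong_neg": -x + rng.normal(scale=0.5, size=n),     # falls as x rises
    "unrelated": rng.normal(size=n),                      # independent of x
})

# Pearson correlation of every column with x, strongest positive first.
corr_with_x = toy.corr()["x"].drop("x").sort_values(ascending=False)
print(corr_with_x)
```

The same pattern, `df_diabetes.corr()["Outcome"].drop("Outcome").sort_values()`, would rank the features by their linear association with the target.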

We see that the coefficient for BMI and Pregnancies is negative. This means that patients with more pregnancies have generally had lower BMI values in this sample.  
Another example is "DiabetesPedigreeFunction" and "BloodPressure": patients with higher BP have generally had a lower DiabetesPedigreeFunction value. Keep in mind that a correlation describes an association, not a cause, and that a coefficient close to zero indicates only a weak relationship.

We can plot "Pair Plots" for every pair of columns to see their correlation visually. Check whether the plots for the pairs above show the inverse relationship; when the coefficient is small, the trend can be hard to see.

In [None]:
display(sns.pairplot(df_diabetes, height=2.5))

## Scale the data and perform Train Test Split!

We will use a common 70:30 split between Train and Test respectively.

In [None]:
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

Separate the target column ('Outcome') from the features: X holds the feature columns and y holds the target.

In [None]:
X = df_diabetes.drop(["Outcome"], axis=1)
y = df_diabetes["Outcome"]

In [None]:
Xs = scale(X)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(Xs, y, test_size=0.3, random_state=123)
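
The cells above scale the full feature matrix before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A commonly used alternative is to split first and fit the scaler on the training rows only. The sketch below shows that order on synthetic data; `X_toy` and `y_toy` are hypothetical stand-ins for `X` and `y`, and `stratify` is an optional extra that keeps the class proportions similar in both parts. It is offered as a variation, not as the method this notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for X and y (random values, fixed seed).
rng = np.random.default_rng(123)
X_toy = rng.normal(loc=50, scale=10, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

# Split first, keeping class proportions similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123, stratify=y_toy
)

# Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)

# Apply the same training statistics to both parts.
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print(X_tr_s.mean(axis=0))  # approximately 0 for every column
print(X_te_s.mean(axis=0))  # close to 0, but not exactly
```

The scaled training columns have mean 0 and unit standard deviation by construction, while the test columns only come close, which is expected.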

Check the shape of the Train and Test sets

In [None]:
display(x_train.shape)
display(x_test.shape)

## Logistic Regression

Logistic Regression is used to predict a categorical variable (here, the binary 'Outcome') from a set of features.
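
As a rough illustration of how a binary logistic regression arrives at a class, the sketch below computes a weighted sum of the features, passes it through the sigmoid function to get a probability, and applies a 0.5 threshold. The `weights`, `bias`, and `features` values are hypothetical and are not taken from the fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and intercept (not from the fitted model).
weights = np.array([0.8, -0.5])
bias = 0.1

# Three made-up observations with two scaled features each.
features = np.array([
    [1.0, 2.0],
    [3.0, -1.0],
    [-2.0, 0.5],
])

# Linear score, then the probability of the positive class.
z = features @ weights + bias
probs = sigmoid(z)

# Predict class 1 when the probability reaches 0.5.
labels = (probs >= 0.5).astype(int)

print(probs)
print(labels)
```

For a fitted scikit-learn model, `predict_proba` returns these probabilities, and the learned weights and bias are stored in `coef_` and `intercept_`.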

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logisticRegr = LogisticRegression()

logisticRegr.fit(x_train, y_train)

Now, let's "predict" the values using our model, and also print out the actual values to compare them informally.

In [None]:
display("These are our predicted values")
display(logisticRegr.predict(x_test))

display("These are the actual values for the above predicted values, in order")
display(np.array(y_test))


Of course, it is too difficult to compare individual values this way.  
Instead, let's use "classification_report" to compare the actual vs predicted values.

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, logisticRegr.predict(x_test)))

Logistic Regression also gives us a "score" value, which is the accuracy expressed as a fraction between 0 and 1 (multiply by 100 to get a percentage).  
It is nothing but the "fraction of correct predictions".

In [None]:
score = logisticRegr.score(x_test, y_test)
print(score)
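
To see that the score really is the fraction of correct predictions, the sketch below computes it by hand on made-up labels and compares it with scikit-learn's `accuracy_score`. The arrays `y_true` and `y_pred` are hypothetical. A confusion matrix is included because it shows which kind of mistake was made, something a single accuracy figure hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted labels (made-up values).
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Accuracy by hand: the share of positions where prediction equals truth.
manual_accuracy = np.mean(y_true == y_pred)

# The same quantity from scikit-learn.
sk_accuracy = accuracy_score(y_true, y_pred)

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

print(manual_accuracy, sk_accuracy)
print(cm)
```

In the matrix, the diagonal holds the correct predictions, and the off-diagonal cells hold the false positives (top right) and false negatives (bottom left).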