# Pima Indians Diabetes Database

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.


The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. The independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Import Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import seaborn as sn
import matplotlib.pyplot as plt

# Load Data

In [None]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# load dataset with our own column names (the file's header row is still read as a data row here)
pima = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv", names=col_names)

In [None]:
pima.head()

In [None]:
# reload the dataset, skipping the file's own header row so it is not read as data
pima = pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv", skiprows=1, names=col_names)

In [None]:
pima.head()

In [None]:
pima.shape

In [None]:
pima.columns

# Selecting Features
All columns except label are used as features; label is the target.

In [None]:
feature_cols = ['pregnant', 'glucose', 'bp','skin', 'insulin', 'bmi', 'pedigree','age']
X = pima[feature_cols] # Features
y = pima.label # Target variable

# Splitting into Train and Test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

In [None]:
display(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
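As a rough check on the shapes printed above, the split sizes can be worked out by hand. This is a minimal sketch, assuming the dataset has 768 rows (consistent with the 192 test samples used in the accuracy calculation later in this notebook) and that a float `test_size` is rounded up to a whole number of rows. The helper `split_sizes` is a hypothetical name used only for illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size (hypothetical helper)."""
    n_test = math.ceil(test_size * n_samples)  # test fraction rounded up
    n_train = n_samples - n_test               # the remainder is used for training
    return n_train, n_test

# Assumed row count of 768; the notebook itself does not show the output of pima.shape
n_train, n_test = split_sizes(768, 0.25)
print(n_train, n_test)
```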

# Train the Model

In [None]:
# instantiate the model (using the default parameters)
lr = LogisticRegression()

# fit the model with data
lr.fit(X_train,y_train)

# Predict with Test Dataset

In [None]:
y_pred=lr.predict(X_test)

In [None]:
y_pred

# Model Evaluation 

## Confusion Matrix

In [None]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
cnf_matrix

In [None]:
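# cross-tabulate actual vs. predicted labels and plot the counts as a heatmap
cnf_matrix = pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'])
sn.heatmap(cnf_matrix, annot=True, fmt='d')  # fmt='d' shows whole counts instead of scientific notation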

Here, you can see the confusion matrix both as an array and as a heatmap.

Diagonal values represent correct predictions, while off-diagonal values are incorrect predictions.

In the output, 115 and 37 are correct predictions, and 25 and 15 are incorrect predictions. scikit-learn places actual classes in rows and predicted classes in columns, so with label 1 (diabetes) as the positive class the cells read:

- TN = True Negatives = 115
- TP = True Positives = 37
- FN = False Negatives = 25 (in Black/Dark Gray Block)
- FP = False Positives = 15 (in Black/Dark Gray Block)

The accuracy follows from these counts:

Accuracy = (TP+TN)/Total = (115+37)/192 = 0.79

The accuracy is therefore about 79% for the test set.
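The counting behind a confusion matrix and the accuracy formula can be sketched in plain Python. The labels below are made up for illustration and are not taken from the Pima test set. The helper `confusion_counts` is a hypothetical name, and label 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for binary labels (hypothetical helper)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn

# Made-up labels for illustration only
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)
accuracy_demo = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
print(tp, tn, fp, fn, accuracy_demo)
```

For binary labels, `metrics.confusion_matrix(y_test, y_pred).ravel()` returns the same four counts in the order TN, FP, FN, TP.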


## Accuracy Score

In [None]:
accuracyScore = metrics.accuracy_score(y_test, y_pred)
print('Accuracy Score : ',accuracyScore)
print('Accuracy In Percentage : ', int(accuracyScore*100), '%')