# About the Pima Indians Diabetes Database 
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

**Variables in the dataset**<br/>
**Pregnancies**- Number of times pregnant<br/>
**Glucose**- Plasma glucose concentration at 2 hours in an oral glucose tolerance test<br/>
**BloodPressure**- Diastolic blood pressure (mm Hg)<br/>
**SkinThickness**- Triceps skin fold thickness (mm)<br/>
**Insulin**- 2-hour serum insulin (mu U/ml)<br/>
**BMI**- Body mass index (weight in kg/(height in m)^2)<br/>
**DiabetesPedigreeFunction**- A function that scores the likelihood of diabetes based on family history<br/>
**Age**- Age of the individual in years. All patients here are females at least 21 years old of Pima Indian heritage.<br/>
**Outcome**- Class variable (0 if non-diabetic, 1 if diabetic)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data=pd.read_csv("/kaggle/input/pima-indians-diabetes-database/diabetes.csv")

**Loading the required libraries**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
import seaborn as sns # statistical visualization

# Preview the first five rows
data.head()

In [None]:
#Finding general information about the dataset
print("Data shape="+ str(data.shape))
data.info()

In [None]:
#Look for any discrepancies
print("Null Values for this dataset")
print(data.isnull().sum())
# No values are missing; now check for duplicated rows
print("Number of duplicated rows = " + str(data.duplicated().sum()))

In [None]:
# Summary statistics for each column
data.describe()

**Looking at the summary statistics, some of the data are not plausible: several variables, such as Glucose and BloodPressure, contain the value 0.<br/>
Handling 0 values in the data**
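
Before deciding how to handle the zeros, it can help to count how many each column contains. The following is a minimal sketch of that check. The helper name `count_zeros` and the toy frame are assumptions made for illustration (the values are made up); on the real data the same call could be applied to `data`.

```python
import pandas as pd

def count_zeros(df, columns):
    """Return the number of exact-zero entries in each listed column."""
    return (df[columns] == 0).sum()

# Toy frame standing in for the real dataset (made-up values, for illustration only)
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

zero_counts = count_zeros(toy, ["Glucose", "BloodPressure", "BMI"])
print(zero_counts)
```

A column with many zeros is a strong hint that 0 was used as a placeholder for a missing measurement rather than as a real value.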

In [None]:
# Columns in which a value of 0 is treated as invalid
remove = ["Glucose", "BMI", "SkinThickness", "BloodPressure", "Age", "Insulin"]

# Work on a copy so the original data stays untouched
new_data = data.copy()
for i in remove:
    new_data = new_data[new_data[i] != 0]

In [None]:
# Number of rows lost when every row containing a 0 is dropped
print("No. of rows removed = " + str(len(data) - len(new_data)))

In [None]:
# Removing this many rows is not advisable, as it might lead to misleading insights.
# Instead, replace 0 with NaN so that these entries are treated as missing.
modifs = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']  # variables to be modified
data[modifs] = data[modifs].replace(0, np.nan)  # np.nan, since the np.NaN alias was removed in NumPy 2.0
# Next, look at the share of missing values per column

In [None]:
missing_values = (data.isnull().sum() / len(data) * 100).round(2)
print(missing_values)

In [None]:
data.hist(bins=25, figsize=(20, 15));

**Some observations from the data:** <br/>


1.   Glucose and BloodPressure are approximately normally distributed
2.   SkinThickness, DiabetesPedigreeFunction and BMI are positively skewed
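
The visual impression of skewness from the histograms can be checked numerically. The sketch below uses pandas' `DataFrame.skew()` on synthetic data, since the real CSV is not loaded here: the column names `symmetric` and `right_skewed` and the random samples are assumptions for illustration only. Values near 0 suggest a roughly symmetric distribution, while clearly positive values indicate a longer right tail.

```python
import numpy as np
import pandas as pd

# Synthetic columns (illustrative only): one roughly normal, one right-skewed
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "symmetric": rng.normal(loc=70, scale=10, size=2000),
    "right_skewed": rng.exponential(scale=1.0, size=2000),
})

# Sample skewness per column; NaN entries are skipped by default
skewness = toy.skew()
print(skewness.round(2))
```

On the notebook's data, `data.skew()` would give the same kind of per-column summary.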



In [None]:
for column in data:
    plt.figure()
    data.boxplot([column])

**Observations from boxplot**


1.  The maximum of Pregnancies is 17, which is a clear outlier.
2.  BloodPressure, SkinThickness and Age have very few outliers
3.  Insulin and DiabetesPedigreeFunction have a large number of outliers
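
The points drawn beyond the whiskers of a boxplot can also be counted directly. The sketch below assumes the common 1.5 × IQR whisker rule (the matplotlib default). The helper name `count_iqr_outliers` and the toy series are hypothetical and only illustrate the idea.

```python
import pandas as pd

def count_iqr_outliers(series, whisker=1.5):
    """Count values lying more than `whisker` * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    # Comparisons with NaN are False, so missing values are never counted
    return int(((series < lower) | (series > upper)).sum())

# Toy series (made-up values) with one unusually large entry
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 17])
n_outliers = count_iqr_outliers(toy)
print(n_outliers)
```

Applied column by column, for example with `data.apply(count_iqr_outliers)`, this gives a number to set beside each boxplot.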

In [None]:
#Bivariate Analysis 
#Visualization
sns.pairplot(data=data)


# NOTE: Neither the Pearson nor the Spearman coefficient can capture non-linear relationships that are not monotonic.

**Thus, if a scatterplot indicates a relationship that cannot be expressed by a linear or monotonic function, 
neither coefficient should be used to determine the strength of the relationship between the variables.<br/>
If a scatterplot visually suggests a "might be monotonic, might be linear" relationship, 
the better choice is Spearman rather than Pearson.**
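
The difference between the Pearson and Spearman coefficients can be illustrated on synthetic data. In the sketch below, the columns `monotonic` and `u_shaped` are made-up examples, not part of the dataset: the first is non-linear but strictly increasing in `x`, the second rises on both sides of its minimum.

```python
import numpy as np
import pandas as pd

x = np.linspace(1, 10, 50)
toy = pd.DataFrame({
    "x": x,
    "monotonic": np.exp(x),       # non-linear, but strictly increasing
    "u_shaped": (x - 5.5) ** 2,   # non-monotonic: falls, then rises
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print("x vs monotonic:", pearson.loc["x", "monotonic"], spearman.loc["x", "monotonic"])
print("x vs u_shaped: ", pearson.loc["x", "u_shaped"], spearman.loc["x", "u_shaped"])
```

For the monotonic column, Spearman reports a perfect rank correlation while Pearson is noticeably lower. For the U-shaped column both coefficients are close to zero, even though the two variables are clearly related.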

In [None]:
corr=data.corr(method="spearman")
corr


In [None]:
# Variables whose pairs show some relationship
factors = ["Age", "Pregnancies", "BMI", "SkinThickness", "Glucose", "Insulin", "Outcome"]

# Correlation matrix for the selected variables (corr() uses Pearson by default)
data[factors].corr()

In [None]:
# Plotting the correlation matrix
sns.heatmap(data[factors].corr(), annot=True, cmap = 'Reds')
plt.show()

In [None]:
sns.pairplot(data=data, vars=factors, hue="Outcome")