# Context of Dataset
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. **The objective of the dataset is to diagnostically predict whether or not a patient has diabetes**, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

**Data Variables**:
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

* Pregnancies: Number of times pregnant
* Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* BloodPressure: Diastolic blood pressure (mm Hg)
* SkinThickness: Triceps skin fold thickness (mm)
* Insulin: 2-Hour serum insulin (mu U/ml)
* BMI: Body mass index (weight in kg/(height in m)^2) 
* DiabetesPedigreeFunction: Diabetes pedigree function
* Age (years)
* Outcome: Class variable (0: Non-Diabetic or 1: Diabetic)

**Description of Predictors/Independent Variables**:

* **DiabetesPedigreeFunction**: A function which scores the likelihood of diabetes based on family history. It summarizes the history of diabetes mellitus in relatives and the genetic relationship of those relatives to the patient.
* **Insulin**: A hormone that regulates blood sugar levels and helps keep blood glucose within a healthy range. Healthy blood glucose levels help reduce the risk of diabetes complications, such as blindness and the loss of limbs, so it is important for people with diabetes to monitor their blood glucose regularly.

# Steps to be Performed
We will work with the dataset to find important insights that could be useful for the business. For that, we'll follow these steps:
* **Data Exploration**--
    * Import useful libraries & load the dataset
    * Check for the summary/Description of data
* **Data Cleaning/Preprocessing**--
    * Treat Missing Values/Anomalies/Bad Data/Duplicate Records/Outliers
* **Data Visualization**--
    * (Univariate, Bivariate & Multivariate Analysis)
* **Data Preparation**--
    * Scaling of Data
    * Encoding Categorical Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%matplotlib inline

In [None]:
file = "/kaggle/input/pima-indians-diabetes-database/diabetes.csv"
df_pima = pd.read_csv(file)
print('Rows: {} \nColumns: {}'.format(df_pima.shape[0],df_pima.shape[1]))
##########################################################################################
# df_pima= pd.read_csv('diabetes.csv') 

In [None]:
df_pima.describe()

In [None]:
print('Null value present in the dataset: ',df_pima.isnull().sum())
print('*************************************************************************')
df_pima.info()
print('*************************************************************************')
print('Duplicate Records: ',df_pima.duplicated().sum())

# Reported findings about the dataset:
"In 2001, 376 of 768 observations in the PID dataset were shown to lack experimental validity[65] because for some attributes, the value of zero was recorded in place of missing experimental observations[66]. It was also shown that if the instances with zero values were removed, performance could be dramatically improved[65]."

**Solution**-
Apart from number of pregnancies and the outcome (obviously), it makes more sense to transform those zeros into missing values (replace(0,np.nan)) and then decide which imputation strategy to use.
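
As a minimal sketch of the zeros-to-missing idea described above: the toy frame below uses made-up values (it is not the real dataset), and median imputation is just one assumed strategy among several possible ones.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; it only mimics the column names of the dataset.
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Pregnancies": [6, 0, 8, 1],  # zeros here are legitimate and stay untouched
})

zero_as_missing = ["Glucose", "BMI"]

# Count the placeholder zeros per column before changing anything.
zero_counts = (toy[zero_as_missing] == 0).sum()

# Mark the zeros as missing, then impute with the median of the observed values.
imputed = toy.copy()
imputed[zero_as_missing] = imputed[zero_as_missing].replace(0, np.nan)
imputed[zero_as_missing] = imputed[zero_as_missing].fillna(
    imputed[zero_as_missing].median()
)

print(zero_counts)
print(imputed)
```

Converting to `NaN` first keeps the placeholder zeros out of the statistic used for imputation, which a direct replace-with-mean would not do.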

In [None]:
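# Columns in which 0 stands for a missing measurement and will be replaced by the mean
col=['Glucose' ,'BloodPressure' ,'SkinThickness', 'Insulin' ,'BMI']
# .copy() creates an independent DataFrame; plain assignment would only alias df_pima
df_pima_copy=df_pima.copy()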

In [None]:
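for i in col:
    # Treat 0 as missing, then impute with the mean of the non-zero values
    non_zero_mean = df_pima_copy[i].replace(0, np.nan).mean()
    df_pima_copy[i] = df_pima_copy[i].replace(0, non_zero_mean)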

In [None]:
df_pima_copy.describe()

# Treating Outliers:

* We will keep the outliers in **Age, BMI, Insulin, DiabetesPedigreeFunction & BloodPressure**, as they are all within a possible range and add information to the analysis.
* We will only **treat outliers in Pregnancies**, as it shows an unnatural range of 12-17.
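
For reference, one common way to flag candidate outliers is the 1.5 × IQR rule that boxplots use. The sketch below runs on made-up pregnancy counts and is only an illustration; it is not the capping rule applied in this notebook.

```python
import pandas as pd

# Made-up pregnancy counts, for illustration only.
pregnancies = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 10, 17], name="Pregnancies")

# Quartiles and interquartile range.
q1 = pregnancies.quantile(0.25)
q3 = pregnancies.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers.
outliers = pregnancies[(pregnancies < lower_fence) | (pregnancies > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```

Whether a flagged value is actually implausible remains a domain judgment, which is why most variables are left untouched here.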


In [None]:
# 25th and 95th percentiles of Pregnancies (the 95th is used as the cap)
p25,p95=np.percentile(df_pima_copy['Pregnancies'],[25,95])
p95

* The 95th percentile of Pregnancies is 10, i.e. 95% of the patients have had 10 or fewer pregnancies. So we cap at 10: values above 10 are treated as outliers and replaced with 10.
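
The capping step can also be sketched with `Series.clip`, shown here on made-up values. The cap of 10 comes from the text above; everything else in the example is illustrative.

```python
import numpy as np
import pandas as pd

# Made-up pregnancy counts, for illustration only.
pregnancies = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 10, 17], name="Pregnancies")

cap = 10  # cap chosen in the text

# clip(upper=...) replaces every value above the cap with the cap itself.
capped = pregnancies.clip(upper=cap)

# The same result written with np.where.
capped_where = np.where(pregnancies > cap, cap, pregnancies)

# Share of rows that were changed by the cap.
share_capped = (pregnancies > cap).mean()
print(capped.tolist(), share_capped)
```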

In [None]:
df_pima_copy['Pregnancies']=np.where(df_pima_copy['Pregnancies']>10,10,df_pima_copy['Pregnancies'])

# UNIVARIATE Analysis:
Various plots & techniques can be used, e.g. Boxplot, Distplot, Histplot, PDFs, CDFs etc.

In [None]:
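def uiv(col):
    # Boxplot of one variable, with one horizontal box per Outcome class
    plt.figure(figsize=[9,4])
    sns.boxplot(x=col,y='Outcome',data=df_pima_copy,orient='h')
    # displot is figure-level, so it draws its own figure
    sns.displot(x=col,data=df_pima_copy,kde=True,hue='Outcome')
    plt.show()
# Outcome is the grouping variable, so it is not plotted against itself
for x in df_pima_copy.columns.drop('Outcome'):
    uiv(x)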
####################################################################################
# Alternative: plot=df_pima_copy.hist(figsize = (20,20)) gives similar distribution insights

**Observations**:
* All the variables are right-skewed except BloodPressure.
* All the variables have outliers except Glucose. The outliers in Pregnancies were treated before performing EDA.
* The dataset is imbalanced: the two Outcome classes (0 and 1) are not equally represented.
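
Observations like these can be checked numerically as well as visually. The following is a small sketch on made-up values (not the real dataset) that computes the class shares with `value_counts(normalize=True)` and the skewness with `DataFrame.skew()`.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Insulin": [80, 90, 100, 110, 600],
    "Outcome": [0, 0, 0, 1, 1],
})

# Share of each class; clearly unequal shares indicate an imbalanced target.
class_share = toy["Outcome"].value_counts(normalize=True)

# Positive skewness means a longer right tail (right-skewed distribution).
skewness = toy.drop(columns="Outcome").skew()

print(class_share)
print(skewness)
```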


# BIVARIATE Analysis: 
* Various plots can be used, e.g. Swarmplot, Boxplot, Regplot, Heatmap, Scatterplot, Pointplot, Stripplot etc.

In [None]:
fig,axes=plt.subplots(nrows=3,ncols=3)
fig.set_size_inches(15,10)
sns.scatterplot(x='Glucose',y='BMI',hue='Outcome',data=df_pima_copy,ax=axes[0][0])
sns.scatterplot(x='Glucose',y='Age',hue='Outcome',data=df_pima_copy,ax=axes[0][1]) 
sns.scatterplot(x='Glucose',y='Pregnancies',hue='Outcome',data=df_pima_copy,ax=axes[0][2])
sns.scatterplot(x='Glucose',y='BloodPressure',hue='Outcome',data=df_pima_copy,ax=axes[1][0])
sns.scatterplot(x='Age',y='BMI',hue='Outcome',data=df_pima_copy,ax=axes[1][1]) #
sns.scatterplot(x='Glucose',y='Insulin',hue='Outcome',data=df_pima_copy,ax=axes[1][2]) 
sns.scatterplot(x='Age',y='Insulin',hue='Outcome',data=df_pima_copy,ax=axes[2][0])
sns.scatterplot(x='BMI',y='Insulin',hue='Outcome',data=df_pima_copy,ax=axes[2][1])
sns.scatterplot(x='BMI',y='DiabetesPedigreeFunction',hue='Outcome',data=df_pima_copy,ax=axes[2][2])
plt.show()

In [None]:
# Compute the correlation matrix once and reuse it
corr = df_pima_copy.corr()
plt.figure(figsize=[12,5])
# Mask the upper triangle so each pair of variables is shown only once
mask = np.triu(np.ones_like(corr,dtype=bool))
sns.heatmap(corr,cmap='Blues',annot=True,mask=mask)
plt.show()

# Multivariate Analysis:
3D Scatter Plots or Pairplot can be used for Multivariate Analysis.

In [None]:
sns.pairplot(df_pima_copy,hue='Outcome',palette='bright')

# Observations drawn after performing EDA:

* Outcome shows a positive correlation with Glucose, Age & BMI. Further, Insulin is highly correlated with Glucose, and BMI with SkinThickness.
* Insulin & DiabetesPedigreeFunction show a negative correlation with Pregnancies.
* **Outcome is more likely to be 0, i.e. women are likely to be non-diabetic**, when:
    * Glucose level is < 140
    * Insulin is between 0-200
    * Pregnancies are up to 2
* **Outcome is more likely to be 1, i.e. women are likely to be diabetic**, when:
    * Insulin is between 400-600
    * Glucose level is > 150
    * BMI > 50
    * BMI is between 30-40 & age is beyond 30-35 years
    * Pregnancies > 7 & BMI > 35
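
Threshold observations such as these, which are read off the plots, can be sanity-checked by comparing the share of diabetic patients on each side of a cut-off. The sketch below uses made-up values, and `diabetic_rate` is a hypothetical helper introduced only for this illustration.

```python
import pandas as pd

# Made-up values for illustration only; they are not taken from the dataset.
toy = pd.DataFrame({
    "Glucose": [95, 110, 130, 120, 145, 160, 170, 185],
    "Outcome": [0, 0, 1, 0, 1, 1, 0, 1],
})

def diabetic_rate(frame, mask):
    """Share of rows with Outcome == 1 among the rows selected by mask."""
    return frame.loc[mask, "Outcome"].mean()

# Compare the two glucose ranges mentioned in the observations.
rate_low = diabetic_rate(toy, toy["Glucose"] < 140)
rate_high = diabetic_rate(toy, toy["Glucose"] > 150)

print(rate_low, rate_high)
```

On the real data, the same comparison could be run with `df_pima_copy` in place of the toy frame.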
 
 


**What Next?**
* We need to scale this data using z-scores (e.g. `scipy.stats.zscore`) or sklearn's `StandardScaler`, and then start applying algorithms!
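
A minimal sketch of z-score scaling in plain pandas, on made-up values. It uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore` and of sklearn's `StandardScaler`; leaving the target column unscaled is an assumption about how the data would be used for modelling.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Glucose": [90.0, 120.0, 150.0],
    "BMI": [25.0, 30.0, 35.0],
    "Outcome": [0, 0, 1],
})

# Scale only the predictors; the target is left as it is.
features = toy.drop(columns="Outcome")

# z = (x - mean) / std, computed column by column.
scaled = (features - features.mean()) / features.std(ddof=0)

print(scaled)
```

After scaling, every predictor has mean 0 and unit standard deviation, so variables measured on larger scales (such as Insulin) no longer dominate distance-based algorithms.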

# THANKS A LOT!!