**Description :**

The objective of this diabetes dataset is to predict whether a patient has diabetes or not. The dataset consists of several medical predictor (independent) variables and one target variable (Outcome). The predictor variables are Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age.


**Columns and their meanings :**

Pregnancies --> Number of times pregnant
Glucose --> Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure --> Diastolic blood pressure (mm Hg)
SkinThickness --> Triceps skin fold thickness (mm)
Insulin --> 2-hour serum insulin (mu U/ml)
BMI --> Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction --> Diabetes pedigree function
Age --> Age (years)
Outcome --> Class variable (0 or 1)
268 of 768 are 1
500 of 768 are 0

**Step 1: Importing libraries (NumPy, pandas, Matplotlib, Seaborn)**

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

**Step 2: Load Dataset**

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')

**Step 3: Exploratory Data Analysis**

**3.1) Understanding Variable**

In [None]:
# Display the total number of rows and columns
df.shape

In [None]:
# Display of the first 5 rows of data
df.head()

In [None]:
# Display of the last 5 rows of data
df.tail()

In [None]:
# Display 5 randomly sampled rows of data
df.sample(5)

In [None]:
# List the type of all columns 
df.dtypes

In [None]:
# Finding out if the dataset contains any null values
df.info()

**Summary of the dataset**

The describe() method shows how the numerical values are spread. We can see the minimum, the mean, the percentiles, and the maximum of each column.

In [None]:
df.describe()

**Observation**

In the above table, the min value of the columns 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', and 'BMI' is zero (0). It is clear that these values can't be zero, so I am going to impute the mean value of each respective column in place of the zeros.

**3.2) Data Cleaning**

In [None]:
# Check the shape before dropping duplicate rows
df.shape

In [None]:
# Count duplicate rows
df.duplicated().sum()


In [None]:
# If there are duplicate rows, drop them
df = df.drop_duplicates()

In [None]:
# Check the shape after dropping duplicate rows
df.shape

The dataset has the same shape before and after dropping duplicates, and the duplicate count is zero (0), which means there are no duplicate rows in the dataset.
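As a small illustration of what the duplicate check does, here is a sketch on a made-up three-row frame (the values are invented for the example, not taken from the dataset). `duplicated()` flags every repeat of an earlier row, so its sum is the number of rows that `drop_duplicates()` would remove.

```python
import pandas as pd

# Made-up rows for illustration only; the third row repeats the first.
toy = pd.DataFrame({
    'Glucose': [148, 85, 148],
    'Age': [50, 31, 50],
})

# duplicated() marks each row that repeats an earlier one (first occurrence is kept).
n_duplicates = toy.duplicated().sum()

# drop_duplicates() removes exactly those flagged rows.
deduplicated = toy.drop_duplicates()

print('Duplicate rows:', n_duplicates)
print('Shape before:', toy.shape, 'after:', deduplicated.shape)
```

When the count is zero, as it is for this dataset, the shape before and after is identical.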

**3.3) Check for NULL values**

In [None]:
# Display the number of null (missing) values in every column of the dataset

df.isnull().sum()

There are no NULL values in the given dataset.

In [None]:
df.columns

**Check the number of zero values in the dataset**

In [None]:
print('No.of zero values in Pregnancies :',df[df['Pregnancies']==0].shape[0])

In [None]:
print('No.of zero values in Glucose :',df[df['Glucose']==0].shape[0])

In [None]:
print('No.of zero values in BloodPressure :',df[df['BloodPressure']==0].shape[0])

In [None]:
print('No.of zero values in SkinThickness :',df[df['SkinThickness']==0].shape[0])

In [None]:
print('No.of zero values in Insulin :',df[df['Insulin']==0].shape[0])

In [None]:
print('No.of zero values in BMI :',df[df['BMI']==0].shape[0])

In [None]:
print('No.of zero values in Age :',df[df['Age']==0].shape[0])
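The per-column checks above can also be written as a single expression. The sketch below uses a made-up toy frame (the values and the column list are assumptions for illustration); in the notebook the same expression could be applied to `df`.

```python
import pandas as pd

# Toy frame with invented values; stands in for the notebook's `df`.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 0],
    'BMI': [33.6, 26.6, 23.3, 28.1],
})

# Columns to inspect (an assumed subset for this example).
cols = ['Glucose', 'BloodPressure', 'BMI']

# (frame == 0) yields a boolean frame; summing counts the True values per column.
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)
```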

**Replace zero values with the mean of that column**

In [None]:
df['Glucose']= df['Glucose'].replace(0,df['Glucose'].mean())
print('No.of zero values in Glucose :',df[df['Glucose']==0].shape[0])

In [None]:
df['BloodPressure']= df['BloodPressure'].replace(0,df['BloodPressure'].mean())
df['SkinThickness']= df['SkinThickness'].replace(0,df['SkinThickness'].mean())
df['Insulin']= df['Insulin'].replace(0,df['Insulin'].mean())
df['BMI']=df['BMI'].replace(0,df['BMI'].mean())
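Note that the means used above are computed while the zeros are still in the column, which pulls each mean downward. One possible variant, shown here only as a sketch on a made-up four-value column, is to mark the zeros as missing first so that the mean is taken over the valid readings only. This is an alternative to the notebook's approach, not what the cells above do.

```python
import numpy as np
import pandas as pd

# Made-up glucose readings for illustration; 0.0 stands for a missing measurement.
toy = pd.DataFrame({'Glucose': [100.0, 0.0, 140.0, 120.0]})

# Mean with the zero included: (100 + 0 + 140 + 120) / 4
mean_with_zeros = toy['Glucose'].mean()

# Mark zeros as NaN; pandas skips NaN when averaging: (100 + 140 + 120) / 3
masked = toy['Glucose'].replace(0, np.nan)
mean_without_zeros = masked.mean()

# Fill the gaps with the mean of the valid readings.
imputed = masked.fillna(mean_without_zeros)

print('Mean including zeros:', mean_with_zeros)
print('Mean excluding zeros:', mean_without_zeros)
print(imputed.tolist())
```

Whether the difference matters depends on how many zeros a column has; it is likely to be largest for the columns with the most zeros.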

In [None]:
df.describe()

In [None]:
df.corr()

**Step 4: Data Visualization**

**4.1) Count Plot**

In [None]:
# Outcome Count Plot
f,ax=plt.subplots(1,2,figsize=(10,5))

mylabels = ['Negative','Positive']
df['Outcome'].value_counts().plot.pie(explode=[0,0.1],autopct='%1.1f%%',ax=ax[0],labels=mylabels ,shadow=True)
ax[0].set_title('Outcome Position')
ax[0].set_ylabel('')

# Pass the column as a keyword: recent seaborn versions no longer accept it positionally
sns.countplot(x='Outcome',data=df,ax=ax[1])
ax[1].set_title('Outcome Position')

N,P=df['Outcome'].value_counts()
print('Outcome Total Values :',df['Outcome'].value_counts().sum())
print('Negative (0)',N)
print('Positive (1)',P)
plt.grid(axis='y')
plt.show()

Out of 768 people in total, 268 are diabetic (positive, 1) and 500 are non-diabetic (negative, 0).
In the Outcome column, 1 represents diabetes positive and 0 represents diabetes negative.
The count plot tells us that the dataset is imbalanced, as the number of patients who don't have diabetes is greater than the number who do.
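To put a number on the imbalance, the sketch below rebuilds only the Outcome column from the counts reported above (268 ones, 500 zeros) and computes the class shares. In the notebook, `df['Outcome'].value_counts(normalize=True)` should give the same shares directly.

```python
import pandas as pd

# Outcome column reconstructed from the reported counts: 268 positives, 500 negatives.
outcome = pd.Series([1] * 268 + [0] * 500, name='Outcome')

# normalize=True returns fractions of the total instead of raw counts.
shares = outcome.value_counts(normalize=True)

# Number of negative cases for every positive case.
imbalance_ratio = (outcome == 0).sum() / (outcome == 1).sum()

print(shares.round(3))
print('Negatives per positive:', round(imbalance_ratio, 2))
```

So roughly 35% of the patients are positive, with a little under two negatives for every positive.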

**4.2) Histograms**

Histograms are one of the most common graphs used to display numeric data.
They show the distribution of the data: whether it is normally distributed or skewed (to the left or right).
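Skew can also be checked numerically alongside the histograms. The sketch below uses two made-up series to show how pandas' `skew()` behaves: a value near zero suggests a symmetric distribution, a positive value a longer right tail. In the notebook, `df.skew()` would report one value per column.

```python
import pandas as pd

# Invented values for illustration only.
symmetric = pd.Series([1, 2, 3, 4, 5])
right_skewed = pd.Series([1, 1, 1, 2, 10])

# Sample skewness: about 0 for symmetric data, > 0 for a long right tail.
skew_symmetric = symmetric.skew()
skew_right = right_skewed.skew()

print('Symmetric series skew:', round(skew_symmetric, 3))
print('Right-skewed series skew:', round(skew_right, 3))
```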

In [None]:
# Histogram of each feature 
df.hist(bins=10,figsize=(10,10))
plt.show()

In [None]:
# Box plot of each column, grouped by Outcome

for i in df.columns:
  sns.boxplot(x='Outcome',y=i,data=df)
  plt.grid(axis='y')
  plt.show()

In [None]:
# Violin Plot

for i in df.columns:
  sns.violinplot(x='Outcome',y=i, data=df)
  plt.grid()
  plt.show()

In [None]:
sns.boxenplot(x='Outcome',y='Glucose',data=df)
plt.show()

In [None]:
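# Distribution of each feature, split by Outcome (sns.distplot is deprecated; histplot with kde=True replaces it)
for i in df.columns.drop('Outcome'):
  sns.FacetGrid(df,hue='Outcome', height=5).map(sns.histplot,i,kde=True).add_legend()
  plt.show()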

In [None]:
# Pair Plot
sns.pairplot(data=df, hue='Outcome')
plt.show()

In [None]:
# Correlation heatmap of the dataset
corrmat = df.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,5))
sns.heatmap(df[top_corr_features].corr(),annot=True,cmap='coolwarm')
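The heatmap can be complemented by ranking the features by the strength of their correlation with Outcome. The sketch below does this on a made-up six-row frame (all values are invented for illustration); in the notebook the same expression could be applied to `df.corr()`.

```python
import pandas as pd

# Invented values for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Glucose': [85, 90, 150, 160, 170, 95],
    'Age': [30, 45, 33, 50, 28, 41],
    'Outcome': [0, 0, 1, 1, 1, 0],
})

# Take the Outcome column of the correlation matrix, drop its self-correlation,
# and sort by absolute value so strong negative correlations also rank high.
corr_with_target = (
    toy.corr()['Outcome']
    .drop('Outcome')
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so a low rank here does not by itself mean a feature is uninformative.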