<p  style="text-align: center;"><font size="12"><b>PIMA INDIANS & DIABETES</b></font></p>
<p  style="text-align: center;"><font size="4"><b>AN EXPLORATORY DATA ANALYSIS</b></font></p>

### ABOUT THE DATASET
The dataset consists of several medical predictor variables and one target variable, **Outcome**. The predictor variables include the number of pregnancies the patient has had, their **BMI**, **insulin level**, **age**, and so on.

### CONTEXT
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. All patients here are females at least 21 years old of Pima Indian heritage.



<h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Table of Contents</h3>

* <a href='#1'>I. LOAD PACKAGES & LIBRARIES</a>

* <a href='#2'>II. DATA OVERVIEW & INITIAL INSIGHTS</a>

* <a href='#3'>III. MISSING VALUES & UNIVARIATE EXPLORATION</a>
    
* <a href='#4'>IV. EXPLORATORY DATA ANALYSIS</a>
    * <a href='#4a'>IVa. Define Plotting Functions</a> 
    * <a href='#4b'>IVb. Bivariate Exploration & New Feature Creation</a> 

# <a id='1'>I. LOAD PACKAGES & LIBRARIES</a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

!pip install seaborn==0.11.0

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as py
import plotly.express as px
import missingno as msno
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.subplots import make_subplots

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# <a id='2'>II. DATA OVERVIEW & INITIAL INSIGHTS</a>

In [None]:
df = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
df.head()

In [None]:
df.columns

In [None]:
df.shape

In [None]:
df.reset_index(inplace=True)
df.rename(columns={'index':'id'}, inplace=True)
df.head()

In [None]:
# Lowercase every column name in one step
df.columns = df.columns.str.lower()

df.rename(columns={'bloodpressure':'blood_pressure','skinthickness':'skin_thickness',
                  'diabetespedigreefunction':'diabetes_pedigree_function'}, inplace=True)

In [None]:
df.describe()

In [None]:
df_healthy = df.loc[df['outcome'] == 0]
df_diabetic = df.loc[df['outcome'] == 1]

# <a id='3'>III. MISSING VALUES & UNIVARIATE EXPLORATION</a>

In [None]:
total = df.isnull().sum().sort_values(ascending=False)
percent = ((df.isnull().sum())*100)/df.isnull().count().sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total','Percent'], sort=False).sort_values('Total', ascending=False)
missing_data.head(40)

According to this dataframe there are no missing values. However, several features contain 0 values that are not physiologically plausible (a BMI of 0, for example), so these zeros most likely stand in for missing measurements. Let's replace 0 with NaN in the features that should never be 0.

In [None]:
# REPLACE 0 VALUES WITH NaN
zero_as_missing = ['glucose','blood_pressure','skin_thickness','insulin','bmi']
df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)
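
To see what this replacement does, it can help to count the zeros per column first and compare against the NaN counts afterwards. The following is a minimal sketch on a small made-up frame: the values and the names `toy`, `zero_counts`, and `nan_counts` are illustrative assumptions, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the renamed dataset columns.
toy = pd.DataFrame({
    'pregnancies': [0, 2, 5],        # 0 is a legitimate value here
    'glucose':     [148, 0, 183],    # 0 stands in for "not recorded"
    'bmi':         [33.6, 26.6, 0.0],
})

# Columns where 0 cannot be a real measurement (assumption for this sketch).
zero_as_missing = ['glucose', 'bmi']

# Count zeros per column before replacing them.
zero_counts = (toy[zero_as_missing] == 0).sum()

# Replace zeros with NaN only in the selected columns.
toy[zero_as_missing] = toy[zero_as_missing].replace(0, np.nan)

# After the replacement, each zero shows up as a missing value.
nan_counts = toy.isnull().sum()
print(zero_counts)
print(nan_counts)
```

Restricting the replacement to a column list keeps legitimate zeros, such as zero pregnancies, untouched.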

In [None]:
# NON-NULL COUNT PER COLUMN
totals = (len(df) - df.isnull().sum()).to_frame('count')
# RECOMPUTE MISSING TOTALS AND PERCENTAGES NOW THAT 0 VALUES ARE NaN
total = df.isnull().sum()
percent = total * 100 / len(df)
missing_data = pd.concat([total, percent], axis=1, keys=['total','percent'], sort=False).sort_values('total', ascending=False)
totals

In [None]:
# VISUALIZE MISSING DATA PERCENTAGES

def missing_plot(dataset, feature):
    # `feature` is a fully populated column (e.g. 'id'); its length gives the row count
    n_rows = len(dataset[feature])
    missing = dataset.isnull().sum()
    totals = (n_rows - missing).to_frame('count')
    # Keep the percentages in the same column order as the bars
    missing_percent = (missing * 100 / n_rows).round(2)
    
    # Bar height = non-null count; bar label = percentage of missing values
    trace = go.Bar(x = totals.index, 
                   y = totals['count'],
                   opacity = 0.8, 
                   text = missing_percent,  
                   textposition = 'auto',
                   marker=dict(color = '#41d9b3', line=dict(color='#000000',width=1.5)))

    layout = dict(title =  "Non-Null Count & Missing Value Percentage")

    fig = dict(data = [trace], layout=layout)
    py.iplot(fig)

In [None]:
missing_plot(df, 'id')

### GET MEAN VALUES FOR EACH FEATURE WITH NAN VALUES

In [None]:
def get_mean(feat):
    temp = df[df[feat].notnull()]
    temp = temp[[feat,'outcome']].groupby(['outcome'])[[feat]].mean().reset_index()
    temp = temp.round(2)
    return temp


In [None]:
def plot_dist(feature, binsize):
    # Split the feature by outcome
    df_healthy = df.loc[df['outcome'] == 0]
    healthy = df_healthy[feature]
    
    df_diabetic = df.loc[df['outcome'] == 1]
    diabetic = df_diabetic[feature]
    
    hist_data = [healthy, diabetic]
    
    group_labels = ['healthy', 'diabetic']
    colors = ['#41d9b3', '#c73062']

    fig = ff.create_distplot(hist_data, group_labels, colors = colors, show_hist = True, bin_size = binsize, curve_type='kde')
    
    fig['layout'].update(title = feature.upper())

    py.iplot(fig, filename = 'Density plot')

### INSULIN

In [None]:
get_mean('insulin')

In [None]:
# REPLACE NAN VALUES WITH MEAN 

df.loc[(df['outcome'] == 0) & (df['insulin'].isnull()), 'insulin'] = 130.29
df.loc[(df['outcome'] == 1) & (df['insulin'].isnull()), 'insulin'] = 206.85
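
The same per-outcome mean imputation can be written without copying the means by hand. The sketch below is one possible alternative, not the approach used in this notebook: it runs on a made-up frame (`toy` and its values are assumptions) and uses `groupby(...).transform('mean')` to broadcast each group's mean back to its rows. Unlike the cells here, it does not round the mean to two decimals.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: one missing insulin reading per outcome group.
toy = pd.DataFrame({
    'outcome': [0, 0, 0, 1, 1, 1],
    'insulin': [100.0, np.nan, 140.0, 200.0, 220.0, np.nan],
})

# Mean of the non-missing values within each outcome group,
# broadcast back so every row carries its own group's mean.
group_mean = toy.groupby('outcome')['insulin'].transform('mean')

# Fill only the missing entries with their group's mean.
toy['insulin'] = toy['insulin'].fillna(group_mean)
print(toy)
```

Because the means are computed from the data at run time, the fill values stay correct if the rows or the zero-handling change.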

In [None]:
plot_dist('insulin', 0)

### GLUCOSE

In [None]:
get_mean('glucose')

In [None]:
df.loc[(df['outcome'] == 0) & (df['glucose'].isnull()), 'glucose'] = 110.64
df.loc[(df['outcome'] == 1) & (df['glucose'].isnull()), 'glucose'] = 142.32

In [None]:
plot_dist('glucose',0)

### BLOOD PRESSURE

In [None]:
get_mean('blood_pressure')

In [None]:
# REPLACE NAN VALUES WITH MEAN 

df.loc[(df['outcome'] == 0) & (df['blood_pressure'].isnull()), 'blood_pressure'] = 70.88
df.loc[(df['outcome'] == 1) & (df['blood_pressure'].isnull()), 'blood_pressure'] = 75.32

In [None]:
plot_dist('blood_pressure', 0)

### SKIN THICKNESS

In [None]:
get_mean('skin_thickness')

In [None]:
df.loc[(df['outcome'] == 0) & (df['skin_thickness'].isnull()), 'skin_thickness'] = 27.24
df.loc[(df['outcome'] == 1) & (df['skin_thickness'].isnull()), 'skin_thickness'] = 33.00

In [None]:
plot_dist('skin_thickness',0)

### BMI

In [None]:
get_mean('bmi')

In [None]:
df.loc[(df['outcome'] == 0) & (df['bmi'].isnull()), 'bmi'] = 30.86
df.loc[(df['outcome'] == 1) & (df['bmi'].isnull()), 'bmi'] = 35.41

In [None]:
plot_dist('bmi',0)

### DIABETES PEDIGREE FUNCTION

In [None]:
get_mean('diabetes_pedigree_function')

In [None]:
df.loc[(df['outcome'] == 0) & (df['diabetes_pedigree_function'].isnull()), 'diabetes_pedigree_function'] = 0.43
df.loc[(df['outcome'] == 1) & (df['diabetes_pedigree_function'].isnull()), 'diabetes_pedigree_function'] = 0.55

In [None]:
plot_dist('diabetes_pedigree_function',0)

In [None]:
missing_plot(df, 'id')

In [None]:
plot_dist('age',0)

In [None]:
outcome_preg = df.groupby(['outcome','pregnancies'])[['id']].count()
outcome_preg.reset_index(inplace=True)
outcome_preg.rename(columns={'id':'count'}, inplace=True)

sns.set_style('darkgrid')
plt.figure(figsize=(15,6))
sns.barplot(x='pregnancies', y='count', hue='outcome', data=outcome_preg, palette='viridis')
plt.title('Diabetes - Pregnancy Outcome Count')


## CORRELATION

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(df.corr(), cbar = True,  square = True, annot=True, cmap= 'YlGnBu')
plt.title('FEATURE VARIABLE CORRELATIONS')


# <a id='4'>IV. EXPLORATORY DATA ANALYSIS</a> 

## <a id='4a'>IVa. DEFINE PLOTTING FUNCTIONS</a>


In [None]:
def plot_features(feat1, feat2):  
    diabetic = df[(df['outcome'] == 1)]
    healthy = df[(df['outcome'] == 0)]
    
    trace0 = go.Scatter(x = diabetic[feat1], 
                        y = diabetic[feat2],
                        name = 'diabetic',
                        mode = 'markers', 
                        marker = dict(color = '#c73062', line = dict(width = 1)))

    trace1 = go.Scatter(x = healthy[feat1], 
                        y = healthy[feat2],
                        name = 'healthy',
                        mode = 'markers',
                        marker = dict(color = '#41d9b3', line = dict(width = 1)))

    layout = dict(title = feat1.upper() + " " + "vs" +" " + feat2.upper(),
                  height = 750, width = 1000,
                  yaxis = dict(title = feat2.upper(), zeroline = False),
                  xaxis = dict(title = feat1.upper(), zeroline = False))

    plots = [trace0, trace1]

    fig = dict(data = plots, layout=layout)
    py.iplot(fig)

In [None]:
def barplot(feature, sub) :
    diabetic = df[(df['outcome'] == 1)]
    healthy = df[(df['outcome'] == 0)]
#     tmp3 = pd.DataFrame(pd.crosstab(df[feature],df['outcome']), )
    
#     tmp3['% diabetic'] = tmp3[1] / (tmp3[1] + tmp3[0]) * 100

    color=['#c73062','#41d9b3']
    trace1 = go.Bar(x=diabetic[feature].value_counts().keys().tolist(),
                    y=diabetic[feature].value_counts().values.tolist(),
                    text=diabetic[feature].value_counts().values.tolist(),
                    textposition = 'auto',
                    name='diabetic',
                    opacity = 0.8, 
                    marker=dict(color='#c73062', line=dict(color='#000000',width=1)))

    
    trace2 = go.Bar(x=healthy[feature].value_counts().keys().tolist(),
                    y=healthy[feature].value_counts().values.tolist(),
                    text=healthy[feature].value_counts().values.tolist(),
                    textposition = 'auto',
                    name='healthy', 
                    opacity = 0.8, 
                    marker=dict(color='#41d9b3', line=dict(color='#000000',width=1)))
    
#     trace3 =  go.Scatter(x=tmp3.index,
#                          y=tmp3['% diabetic'],
#                          yaxis = 'y2', 
#                          name='% diabetic', 
#                          opacity = 0.6, 
#                          marker=dict(color='black', line=dict(color='#000000',width=0.5)))

    layout = dict(title = str(feature)+' '+(sub),
                  xaxis=dict(), 
                  yaxis=dict(title='Count'), 
                  yaxis2=dict(range= [-0, 75], 
                              overlaying= 'y', 
                              anchor= 'x', 
                              side= 'right',
                              zeroline=False,
                              showgrid= False, 
                              title= '% diabetic'))

    fig = go.Figure(data=[trace1, trace2], layout=layout)
    py.iplot(fig)

In [None]:
# Define pie plot to visualize the distribution of a feature within each target class: diabetic or healthy

def pieplot(feature, sub):
    diabetic = df[(df['outcome'] == 1)]
    healthy = df[(df['outcome'] == 0)]
    
    col =['Silver', 'mediumturquoise','#CF5C36','lightblue','magenta', '#FF5D73','#F2D7EE','mediumturquoise']
    
    trace1 = go.Pie(values  = diabetic[feature].value_counts().values.tolist(),
                    labels  = diabetic[feature].value_counts().keys().tolist(),
                    textfont=dict(size=15), opacity = 0.8,
                    hole = 0.5, 
                    hoverinfo = "label+percent+name",
                    domain  = dict(x = [.0,.48]),
                    name    = "Diabetic",
                    marker  = dict(colors = col, line = dict(width = 1.5)))
    
    trace2 = go.Pie(values  = healthy[feature].value_counts().values.tolist(),
                    labels  = healthy[feature].value_counts().keys().tolist(),
                    textfont=dict(size=15), opacity = 0.8,
                    hole = 0.5,
                    hoverinfo = "label+percent+name",
                    marker  = dict(line = dict(width = 1.5)),
                    domain  = dict(x = [.52,1]),
                    name    = "Healthy" )

    layout = go.Layout(dict(title = feature.upper() + " distribution by target: "+(sub),
                            annotations = [ dict(text = "Diabetic : " + str(len(diabetic)),
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .22, y = -0.1),
                                            dict(text = "Healthy : " + str(len(healthy)),
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .8,y = -.1)]))
                                          

    fig  = go.Figure(data = [trace1,trace2],layout = layout)
    py.iplot(fig)

In [None]:
#CREATE A DATAFRAME WITH A COUNT OF EACH OUTCOME
outcome = df.groupby(['outcome'])[['id']].count()
outcome.reset_index(inplace=True)
outcome.rename(columns={'id':'count'}, inplace=True)
outcome.sort_values(by='count', ascending=False, inplace=True)
outcome

#CREATE BARCHART AND PIE CHART FOR OUTCOME VALUES
plt.style.use('fivethirtyeight')

plt.figure(figsize=(15,6))

plt.subplot(1,2,1)
sns.barplot(x='outcome', y='count', data=outcome, palette='viridis')
plt.title('Diabetes Outcome Count')

plt.subplot(1,2,2)
plt.pie(outcome['count'], labels=outcome['outcome'], shadow=True, startangle=90)
plt.title('Diabetes Outcome Percentages')

plt.show()

## <a id='4b'>IVb. BIVARIATE EXPLORATION & NEW FEATURE CREATION</a>

### FEAT1: AGE vs PREGNANCIES

In [None]:
plot_features('pregnancies','age')

In [None]:
df.loc[:,'feat1']=0
df.loc[(df['age']<=30) & (df['pregnancies']<=6),'feat1']=1
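
Each engineered feature in this section follows the same pattern: start at 0 and set 1 where two threshold conditions both hold. As a minimal sketch, the pattern can be collapsed into one step by casting the boolean mask to integers. The helper name `add_flag` and the toy values are hypothetical and only serve to illustrate the idea.

```python
import pandas as pd

def add_flag(frame, name, mask):
    """Store a boolean mask as a 0/1 integer column (hypothetical helper)."""
    frame[name] = mask.astype(int)
    return frame

# Toy frame with made-up values.
toy = pd.DataFrame({
    'age':         [25, 30, 31, 22],
    'pregnancies': [1, 7, 2, 6],
})

# Same rule as feat1: age <= 30 and pregnancies <= 6.
add_flag(toy, 'feat1', (toy['age'] <= 30) & (toy['pregnancies'] <= 6))
print(toy)
```

Rows that fail either condition get 0, which matches the two-step assignment used in the cells of this section.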

In [None]:
barplot('feat1',': AGE <= 30 & PREGNANCIES <= 6')

In [None]:
pieplot('feat1','AGE <= 30 & PREGNANCIES <= 6')
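
The bar and pie charts show counts per class. To read off how well a flag separates the classes, the share of diabetic patients within each flag value can also be tabulated directly. The sketch below uses `pd.crosstab` with `normalize='index'` on made-up data; `toy`, `shares`, and `pct_diabetic` are illustrative names.

```python
import pandas as pd

# Toy frame with made-up values: a 0/1 flag and the 0/1 outcome.
toy = pd.DataFrame({
    'feat1':   [1, 1, 1, 1, 0, 0, 0, 0],
    'outcome': [0, 0, 0, 1, 1, 1, 0, 1],
})

# Rows: flag value; columns: outcome; cells: percentage of each outcome within the row.
shares = pd.crosstab(toy['feat1'], toy['outcome'], normalize='index') * 100

# Column 1 holds the percentage of diabetic patients for each flag value.
pct_diabetic = shares[1]
print(pct_diabetic)
```

A large gap between the two percentages suggests the flag carries information about the outcome.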

### FEAT2: AGE vs BMI

In [None]:
plot_features('bmi','age')

In [None]:
df.loc[:,'feat2']= 0
df.loc[(df['age']<=30) & (df['bmi']<=30),'feat2']=1

In [None]:
barplot('feat2',': AGE <= 30 & BMI <= 30')

In [None]:
pieplot('feat2','AGE <= 30 & BMI <= 30')

### FEAT3: AGE vs SKIN THICKNESS

In [None]:
plot_features('skin_thickness','age')

In [None]:
df.loc[:,'feat3'] = 0
df.loc[(df['age'] <= 30) & (df['skin_thickness'] <= 32), 'feat3'] = 1

In [None]:
barplot('feat3',': AGE <=30 & SKIN THICKNESS <=32')

In [None]:
pieplot('feat3','AGE <=30 & SKIN THICKNESS <=32')

### FEAT4: AGE vs GLUCOSE

In [None]:
plot_features('glucose', 'age')

In [None]:
df.loc[:,'feat4'] = 0
df.loc[(df['age'] <= 30) & (df['glucose'] <= 120), 'feat4'] = 1

In [None]:
barplot('feat4',': AGE <=30 & GLUCOSE <=120')

In [None]:
pieplot('feat4','AGE <= 30 & GLUCOSE <= 120')

### FEAT5: GLUCOSE vs BLOOD PRESSURE

In [None]:
plot_features('glucose','blood_pressure')

In [None]:
df.loc[:,'feat5'] = 0
df.loc[(df['glucose'] <= 100) & (df['blood_pressure'] <= 80), 'feat5'] = 1

In [None]:
barplot('feat5',': GLUCOSE <= 100 & BLOOD PRESSURE <=80')

In [None]:
pieplot('feat5', 'GLUCOSE <= 100 & BLOOD PRESSURE <=80')

### FEAT6: GLUCOSE vs BMI

In [None]:
plot_features('glucose','bmi')

In [None]:
df.loc[:,'feat6'] = 0
df.loc[(df['bmi'] <= 40) & (df['glucose'] <= 100), 'feat6'] = 1

In [None]:
barplot('feat6',': GLUCOSE <= 100 & BMI <= 40')

In [None]:
pieplot('feat6','GLUCOSE <= 100 & BMI <= 40')

### FEAT 7: GLUCOSE vs SKIN THICKNESS

In [None]:
plot_features('glucose','skin_thickness')

In [None]:
df.loc[:,'feat7'] = 0
df.loc[(df['glucose'] <= 120) & (df['skin_thickness'] <= 32), 'feat7'] = 1

In [None]:
barplot('feat7',': GLUCOSE <= 120 & SKIN THICKNESS <= 32')

In [None]:
pieplot('feat7','GLUCOSE <= 120 & SKIN THICKNESS <= 32')

### FEAT 8: GLUCOSE vs INSULIN

In [None]:
plot_features('glucose','insulin')

In [None]:
df.loc[:,'feat8'] = 0
df.loc[(df['insulin'] <= 130) & (df['glucose'] <= 120), 'feat8'] = 1

In [None]:
barplot('feat8',': GLUCOSE <= 120 & INSULIN <= 130')

In [None]:
pieplot('feat8','GLUCOSE <= 120 & INSULIN <= 130')

### FEAT 9: BLOOD PRESSURE vs BMI

In [None]:
plot_features('blood_pressure','bmi')

In [None]:
df.loc[:,'feat9'] = 0
df.loc[(df['bmi'] <= 30) & (df['blood_pressure'] <= 80), 'feat9'] = 1

In [None]:
barplot('feat9',': BMI <= 30 & BLOOD PRESSURE <= 80')

In [None]:
pieplot('feat9','BMI <= 30 & BLOOD PRESSURE <= 80')

### FEAT 10: BLOOD PRESSURE vs SKIN THICKNESS

In [None]:
plot_features('blood_pressure','skin_thickness')

In [None]:
df.loc[:,'feat10'] = 0
df.loc[(df['blood_pressure'] <= 80) & (df['skin_thickness'] <= 28), 'feat10'] = 1

In [None]:
barplot('feat10',': BLOOD PRESSURE <= 80 & SKIN THICKNESS <= 28')

In [None]:
pieplot('feat10','BLOOD PRESSURE <= 80 & SKIN THICKNESS <= 28')

### FEAT 11: SKIN THICKNESS vs INSULIN

In [None]:
plot_features('skin_thickness','insulin')

In [None]:
df.loc[:,'feat11'] = 0
df.loc[(df['skin_thickness'] <= 40) & (df['insulin'] <= 131), 'feat11'] = 1

In [None]:
barplot('feat11',': SKIN THICKNESS <= 40 & INSULIN <= 131')

In [None]:
pieplot('feat11','SKIN THICKNESS <= 40 & INSULIN <= 131')

### FEAT 12: SKIN THICKNESS vs BMI

In [None]:
plot_features('skin_thickness','bmi')

In [None]:
df.loc[:,'feat12'] = 0
df.loc[(df['bmi'] <= 30) & (df['skin_thickness'] <= 28), 'feat12'] = 1

In [None]:
barplot('feat12',': SKIN THICKNESS <= 28 & BMI <= 30')

In [None]:
pieplot('feat12','SKIN THICKNESS <= 28 & BMI <= 30')

### FEAT 13: INSULIN vs BMI

In [None]:
plot_features('insulin','bmi')

In [None]:
df.loc[:,'feat13'] = 0
df.loc[(df['bmi'] <= 40) & (df['insulin'] <= 131), 'feat13'] = 1

In [None]:
barplot('feat13',': BMI <= 40 & INSULIN <= 131')

In [None]:
pieplot('feat13','BMI <= 40 & INSULIN <= 131')

In [None]:
plt.figure(figsize=(18,18))
sns.heatmap(df.corr(), cbar = True,  square = False, annot=True, cmap= 'YlGnBu')
plt.title('FEATURE VARIABLE CORRELATIONS')
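
With the engineered flags included, the heatmap becomes dense. One way to summarize it is to pull out only the `outcome` column of the correlation matrix and sort it by absolute value. This is a minimal sketch on made-up data; the frame `toy` and the name `corr_with_outcome` are assumptions for illustration.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    'outcome': [0, 0, 1, 1, 1, 0],
    'glucose': [90, 100, 150, 160, 170, 95],
    'age':     [25, 50, 30, 45, 28, 33],
})

# Correlation of every column with the target, strongest first (by absolute value).
corr_with_outcome = (
    toy.corr()['outcome']
       .drop('outcome')
       .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_outcome)
```

Sorting by absolute value keeps strongly negative correlations, which the 0/1 flags can produce, near the top of the list.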