# Introduction
>The [Pima](https://en.wikipedia.org/wiki/Pima_people) are a group of **Native Americans** living in Arizona. A genetic predisposition allowed this group to survive for years on a diet poor in carbohydrates. In recent years, a sudden shift from traditional agricultural crops to processed foods, together with a decline in physical activity, has led them to develop **the highest prevalence of type 2 diabetes**, and for this reason they have been the subject of many studies.

## Dataset
The dataset includes data from **768** women with **8** characteristics, in particular:

* Number of times pregnant
* Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* Diastolic blood pressure (mm Hg)
* Triceps skin fold thickness (mm)
* 2-Hour serum insulin (mu U/ml)
* Body mass index (weight in kg/(height in m)^2)
* Diabetes pedigree function
* Age (years)

The last column of the dataset indicates whether the person has been diagnosed with diabetes (1) or not (0).

Let's do Exploratory Data Analysis (EDA): initial investigations of the data to discover hidden patterns and to spot anomalies.

## Imports and configuration

In [1]:
# Import all the libraries needed to load the dataset and visualize it
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Configure the default settings of the plotting libraries
%matplotlib inline
sns.set(style='whitegrid', palette='deep', font='sans-serif', font_scale=1, color_codes=True)

**Comments**
>- **``%matplotlib inline``** sets the backend of matplotlib to the 'inline' backend. With this backend, the output of plotting commands is displayed inline, without needing to call plt.show() every time data is plotted.
>- ``sns.set(...)`` sets a few of Seaborn's aesthetic parameters

## Load the Dataset

In [3]:
# Load the dataset into a Pandas dataframe called pima
pima = pd.read_csv('diabetes.csv')

In [4]:
# Check the head of the dataset
pima.head()

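| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |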


In [5]:
# Check the tail of the dataset
pima.tail()

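| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.34 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |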


**Comments**
>To take a closer look at the data, the pandas library provides the **“.head()”** function, which returns the first five observations, and the **“.tail()”** function, which returns the last five observations of the dataset.

## Inspect the Dataset

In [6]:
# Get the shape and size of the dataset
pima.shape

(768, 9)

**Observation**
>- This dataset contains **768** observations with **8** independent attributes and **1** dependent attribute

In [7]:
# Get more info on it
# 1. Name of the columns
# 2. Find the data type of each column
# 3. Look for any null/missing values
pima.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


**Observation**
>- The dataset comprises **768 rows** and **9 columns**
>- The column names are **Pregnancies**, **Glucose**, **BloodPressure**, **SkinThickness**, **Insulin**, **BMI**, **DiabetesPedigreeFunction**, **Age** and **Outcome**
>- All columns are of type integer (int64), except BMI and DiabetesPedigreeFunction, which are of type float (float64)
>- There are **no null/missing values** (``NaN``) present in the dataset
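
The same checks can be made explicit in code. The sketch below rebuilds the five rows printed by `pima.head()` as a stand-in for the full CSV, then counts nulls and lists the float columns. The names `sample` and `suspect_cols` are illustrative assumptions, not part of the original notebook. The last step is an extra, optional check: `info()` only counts `NaN`, so a placeholder such as 0 in a measurement column would not show up as missing.

```python
import pandas as pd

# First five rows as printed by pima.head(); a stand-in for the full CSV.
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
]
sample = pd.DataFrame(rows, columns=columns)

# Number of NaN entries per column (what info() reports indirectly).
null_counts = sample.isna().sum()

# Columns stored as floats; the rest are integers.
float_cols = sample.select_dtypes(include="float64").columns.tolist()

# Assumption: a zero in these measurement columns is worth a second look.
suspect_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
zero_counts = (sample[suspect_cols] == 0).sum()

print(null_counts)
print(float_cols)
print(zero_counts)
```

On the full dataset the same three expressions can be applied to `pima` directly.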

In [8]:
# Describe the dataset with various summary and statistics
pima.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


**Observations**
>- The **".describe()"** function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding ``NaN`` values
>- ``Central Tendency`` is represented by the mean and the median (50%), while std, min, 25%, 75% and max describe the spread of the values
>- 25% is also known as the **``First Quartile (Q1)``**, 50% as the Second Quartile or **``Median (Q2)``** and 75% as the **``Third Quartile (Q3)``**. The interquartile range is IQR = Q3 - Q1
>#### Pregnancies
>>- The mean (3.85) is greater than the median (3), hence the distribution is slightly right skewed
>>- There is a large difference between the 75th percentile (6) and the max (17). The max also exceeds Q3 + 1.5 × IQR = 6 + 1.5 × 5 = 13.5, hence the column contains at least one high-end outlier
>#### Glucose
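>>- There is a notably large difference between the min (0) and the 25th percentile (99). The min lies below Q1 - 1.5 × IQR = 99 - 1.5 × 41.25 = 37.125, hence the predictor contains at least one low-end outlier
>>- The zero values stretch the lower tail, but the mean (120.89) is slightly above the median (117), so the summary statistics alone do not establish a left skew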
>#### BloodPressure
>>- The BloodPressure distribution seems to be slightly left skewed, as the mean (69.11) is below the median (72)
>>- There is a notably large difference between the min and the 25th percentile and also between the 75th percentile and the max, hence it contains both low-end and high-end outliers. As the min (0) lies below Q1 - 3 × IQR = 62 - 3 × 18 = 8, the min value in this case is called an ``Extreme Outlier``
>#### SkinThickness
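>>- The upper tail is long (a max of 99 against a 75th percentile of 32), which suggests right skew, although the mean (20.54) is below the median (23) because at least a quarter of the values are 0
>>- The max exceeds Q3 + 1.5 × IQR = 32 + 1.5 × 32 = 80, hence it is an outlier. It stays below Q3 + 3 × IQR = 128, so under the usual fence rule it does not count as an Extreme Outlier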
>#### Insulin
>>- It is fairly right skewed, as there is a notable difference between the mean (79.80) and the median (30.5)
>>- The distribution contains extreme high-end outliers: the max (846) is well above Q3 + 3 × IQR = 127.25 + 3 × 127.25 = 509
>#### BMI
>>- As the mean (31.99) and the median (32.0) are almost equal, the distribution is roughly symmetric (which is not the same as uniform)
>>- The gaps between the min and the 25th percentile and between the 75th percentile and the max suggest that the distribution contains outliers on both ends
>#### DiabetesPedigreeFunction
>>- It seems to be a right skewed distribution, as the mean (0.47) is above the median (0.37)
>>- It contains high-end outliers: the max (2.42) is far above both the median and Q3 + 1.5 × IQR = 0.62625 + 1.5 × 0.3825 = 1.2
>#### Age
>>- It seems to be a right skewed distribution, as the mean (33.24) is above the median (29)
>>- It contains high-end outliers: the max (81) is above Q3 + 1.5 × IQR = 41 + 1.5 × 17 = 66.5
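
The outlier reasoning above follows the common 1.5 × IQR and 3 × IQR fence rule, and it can be checked with plain arithmetic on the numbers in the `describe()` table. The sketch below is one way to do that. The helper names `classify` and `fence_flags` and the labels "mild" and "extreme" are illustrative choices, not part of the original notebook. Because it only sees the min and max, it can say whether at least one value falls outside a fence on each side, not how many do.

```python
# Summary statistics copied from the describe() output: (min, Q1, Q3, max).
summary = {
    "Pregnancies": (0.0, 1.0, 6.0, 17.0),
    "Glucose": (0.0, 99.0, 140.25, 199.0),
    "BloodPressure": (0.0, 62.0, 80.0, 122.0),
    "SkinThickness": (0.0, 0.0, 32.0, 99.0),
    "Insulin": (0.0, 0.0, 127.25, 846.0),
    "BMI": (0.0, 27.3, 36.6, 67.1),
    "DiabetesPedigreeFunction": (0.078, 0.24375, 0.62625, 2.42),
    "Age": (21.0, 24.0, 41.0, 81.0),
}


def classify(distance, iqr):
    """Grade how far an extreme value lies beyond its nearest quartile."""
    if distance > 3 * iqr:
        return "extreme"
    if distance > 1.5 * iqr:
        return "mild"
    return "none"


def fence_flags(minimum, q1, q3, maximum):
    """Apply the 1.5 x IQR and 3 x IQR fences to the min and the max."""
    iqr = q3 - q1
    return {
        "iqr": iqr,
        "low": classify(q1 - minimum, iqr),
        "high": classify(maximum - q3, iqr),
    }


flags = {name: fence_flags(*stats) for name, stats in summary.items()}
for name, result in flags.items():
    print(f"{name:<26} IQR={result['iqr']:<8.4g} "
          f"low={result['low']:<8} high={result['high']}")
```

With the full DataFrame available, the quartiles could instead be taken from `pima.quantile([0.25, 0.75])`, and `pima.skew(numeric_only=True)` gives a direct skewness estimate to compare against the mean-versus-median reading.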

## Understanding the target variable

In [9]:
# Find count of unique target variable
len(pima['Outcome'].unique())
# OR
pima['Outcome'].nunique()

2

In [10]:
# What are the different values of the dependent variable
pima['Outcome'].unique()

array([1, 0], dtype=int64)

In [11]:
# Find the value counts for each outcome
pima['Outcome'].value_counts()

0    500
1    268
Name: Outcome, dtype: int64

**Observation**
>- The target (dependent) variable is discrete and categorical in nature
>- There are two unique outcomes in the dataset, which indicate whether the person has been diagnosed with diabetes (1) or not (0)
>- The ratio of diabetic to non-diabetic women in the dataset is almost 1:2 (268 to 500)
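
The class balance can also be expressed as proportions. The sketch below rebuilds a column with the same counts as the output above (500 zeros and 268 ones) rather than reading the CSV. The variable names are illustrative assumptions.

```python
import pandas as pd

# Stand-in for pima['Outcome'], built from the counts reported above.
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

counts = outcome.value_counts()
# normalize=True returns the share of each class instead of raw counts.
proportions = outcome.value_counts(normalize=True)

# Non-diabetic observations per diabetic observation.
ratio = counts.loc[0] / counts.loc[1]

print(proportions)
print(f"non-diabetic : diabetic = {ratio:.2f} : 1")
```

The share of class 1 equals the mean of the Outcome column, so it can be cross-checked against the `describe()` table.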

In [12]:
# Map the outcomes to categorical values (Diabetic(1) or Non-Diabetic(0))
pima['Outcome'] = pima['Outcome'].map({1:'Diabetic', 0:'Non-Diabetic'})
pima.head()

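| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | Diabetic |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | Non-Diabetic |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | Diabetic |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | Non-Diabetic |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | Diabetic |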
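
One thing to keep in mind with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN`. Because the cell above overwrites `Outcome` in place, running it a second time would map the string labels (which are not keys) to `NaN`. The sketch below illustrates this on the five Outcome values shown in the head of the dataset. The variable names are illustrative assumptions.

```python
import pandas as pd

# Outcome values of the first five rows, as shown by pima.head().
outcome = pd.Series([1, 0, 1, 0, 1], name="Outcome")
labels = {1: "Diabetic", 0: "Non-Diabetic"}

# First application: every value is a key, so nothing is lost.
mapped = outcome.map(labels)
print(mapped.tolist())

# Second application on the already mapped column: the strings are not
# keys of the dictionary, so every entry turns into NaN.
remapped = mapped.map(labels)
print(remapped.isna().all())
```

A quick `mapped.notna().all()` after the mapping, or writing the labels to a separate column, is a cheap way to guard against this.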
