# Data Science Case Study - Batch 9


### OBJECTIVE - Predict whether a patient has diabetes based on a set of diagnostic measurements.

### Data Loading and Cleansing 

#### Step 1 - Import the libraries that will be used in the notebook

In [1]:
import numpy as np
import pandas as pd

#### Step 2 - Read the CSV file and see the first five rows of data

In [2]:
df = pd.read_csv("Diabetes.csv",header=None)
df.head()

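| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |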


#### Step 3 - The data has no header row, so we assign column names to the data frame.

 0. Number of times pregnant
 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
 2. Diastolic blood pressure (mm Hg)
 3. Triceps skin fold thickness (mm)
 4. 2-Hour serum insulin (mu U/ml)
 5. Body mass index (weight in kg/(height in m)^2)
 6. Diabetes pedigree function
 7. Age (years)
 8. Class variable (0 or 1)


In [3]:
df.columns = ["NoTimePregnant", "GlucoseConcentration", "DiastolicBloodPressure",
              "TricepsSkinThickness", "2HourSerumInsulin", "BMI", "DPF", "Age", "Class"]
df.head()

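| | NoTimePregnant | GlucoseConcentration | DiastolicBloodPressure | TricepsSkinThickness | 2HourSerumInsulin | BMI | DPF | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |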
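
As an alternative to assigning `df.columns` after loading, the names can be passed to `pd.read_csv` through its `names` parameter. The sketch below is only illustrative: it reads the five rows shown above from an in-memory string instead of `Diabetes.csv`, and `COLUMN_NAMES`, `sample_csv` and `df_named` are hypothetical names that do not appear in the notebook.

```python
import io

import pandas as pd

# Hypothetical constant holding the same names that Step 3 assigns.
COLUMN_NAMES = [
    "NoTimePregnant", "GlucoseConcentration", "DiastolicBloodPressure",
    "TricepsSkinThickness", "2HourSerumInsulin", "BMI", "DPF", "Age", "Class",
]

# Stand-in for Diabetes.csv: the first five rows, without a header line.
sample_csv = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)

# header=None says the file has no header row; names supplies the labels.
df_named = pd.read_csv(sample_csv, header=None, names=COLUMN_NAMES)
print(df_named.head())
```

With the real file, the same call would take `"Diabetes.csv"` in place of `sample_csv`, which loads and labels the data in one step.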


#### Step 4 - Let's see what cleansing is required

In [4]:
df.describe()

| | NoTimePregnant | GlucoseConcentration | DiastolicBloodPressure | TricepsSkinThickness | 2HourSerumInsulin | BMI | DPF | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


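#### The table shows a minimum of 0 for GlucoseConcentration, DiastolicBloodPressure, TricepsSkinThickness, 2HourSerumInsulin and BMI. A value of 0 is not valid for measurements such as glucose concentration or blood pressure, so these zeros most likely stand for missing data. We have to identify such cases and fix them.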
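
One way to identify such cases is to count the zeros in each column. The sketch below is a minimal illustration that uses only the five rows shown by `df.head()` as stand-in data; the names `sample` and `zero_counts` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Stand-in data: the five rows shown by df.head(), limited to the
# measurement columns whose minimum is 0 in df.describe().
sample = pd.DataFrame({
    "GlucoseConcentration": [148, 85, 183, 89, 137],
    "DiastolicBloodPressure": [72, 66, 64, 66, 40],
    "TricepsSkinThickness": [35, 29, 0, 23, 35],
    "2HourSerumInsulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# (sample == 0) is a boolean frame; summing it counts the zeros per column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

On the full data set the same expression would be applied to `df` restricted to these columns, which shows how many rows each column would lose if zeros were treated as missing.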

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
NoTimePregnant            768 non-null int64
GlucoseConcentration      768 non-null int64
DiastolicBloodPressure    768 non-null int64
TricepsSkinThickness      768 non-null int64
2HourSerumInsulin         768 non-null int64
BMI                       768 non-null float64
DPF                       768 non-null float64
Age                       768 non-null int64
Class                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


#### Check if there is any NaN value.

In [12]:
df.isnull().values.any()

False

#### Check for duplicated rows.

In [13]:
df.duplicated().sum()

0

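#### Summary statistics after excluding rows where GlucoseConcentration, TricepsSkinThickness, BMI or DiastolicBloodPressure is 0. Of the 768 rows, 532 remain.

In [27]:
# Columns in which a recorded 0 is treated as an invalid measurement
zero_cols = ["GlucoseConcentration", "TricepsSkinThickness", "BMI", "DiastolicBloodPressure"]
df.loc[~(df[zero_cols] == 0).any(axis=1)].describe()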

| | NoTimePregnant | GlucoseConcentration | DiastolicBloodPressure | TricepsSkinThickness | 2HourSerumInsulin | BMI | DPF | Age | Class |
|---|---|---|---|---|---|---|---|---|---|
| count | 532.0 | 532.0 | 532.0 | 532.0 | 532.0 | 532.0 | 532.0 | 532.0 | 532.0 |
| mean | 3.516917 | 121.030075 | 71.505639 | 29.182331 | 114.988722 | 32.890226 | 0.502966 | 31.614662 | 0.332707 |
| std | 3.312036 | 30.999226 | 12.310253 | 10.523878 | 123.007555 | 6.881109 | 0.344546 | 10.761584 | 0.471626 |
| min | 0.0 | 56.0 | 24.0 | 7.0 | 0.0 | 18.2 | 0.085 | 21.0 | 0.0 |
| 25% | 1.0 | 98.75 | 64.0 | 22.0 | 0.0 | 27.875 | 0.25875 | 23.0 | 0.0 |
| 50% | 2.0 | 115.0 | 72.0 | 29.0 | 91.5 | 32.8 | 0.416 | 28.0 | 0.0 |
| 75% | 5.0 | 141.25 | 80.0 | 36.0 | 165.25 | 36.9 | 0.6585 | 38.0 | 1.0 |
| max | 17.0 | 199.0 | 110.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |
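
Excluding these rows leaves 532 of the 768 records, so 236 rows are dropped. One possible alternative, which this notebook does not use, is to mark the zeros as missing and fill them with the column median. The sketch below assumes that approach and runs on the five rows shown by `df.head()`; `sample`, `ZERO_AS_MISSING` and `cleaned` are hypothetical names, and the column list mirrors the filter in the cell above.

```python
import numpy as np
import pandas as pd

# Stand-in data: the five rows shown by df.head() for the filtered columns.
sample = pd.DataFrame({
    "GlucoseConcentration": [148, 85, 183, 89, 137],
    "DiastolicBloodPressure": [72, 66, 64, 66, 40],
    "TricepsSkinThickness": [35, 29, 0, 23, 35],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a 0 in any of these columns means "not measured".
ZERO_AS_MISSING = list(sample.columns)

cleaned = sample.copy()
# Step 1: turn the placeholder zeros into NaN so they no longer skew statistics.
cleaned[ZERO_AS_MISSING] = cleaned[ZERO_AS_MISSING].replace(0, np.nan)
# Step 2: fill each NaN with the median of the remaining values in its column.
cleaned[ZERO_AS_MISSING] = cleaned[ZERO_AS_MISSING].fillna(
    cleaned[ZERO_AS_MISSING].median()
)
print(cleaned)
```

This keeps every row at the cost of introducing estimated values, so whether it is preferable to dropping rows depends on how the downstream model is evaluated.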
