# Homework 2: Supervised Learning


# Dataset: [Pima Indians Diabetes Dataset](https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes)


# [Data Description:](https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.names)


# Due: 2/8 (beginning of class)


# Format: ipython notebook submission on GitHub

- send link to your repository to Mason and Lema


## Instructions: answer the following questions with data and visuals

- Describe the content of the dataset and its goals
- Describe the features and formulate a hypothesis on which might be relevant in predicting diabetes
- Describe the missing/NULL values. Decide if you should impute or drop them and justify your choice.
- Come up with a benchmark for the minimum performance that an algorithm should have on this dataset
- What's the best performance you can get with kNN? Is kNN a good choice for this dataset?
- What's the best performance you can get with Naive Bayes? Is NB a good choice for this dataset?
- What's the best performance you can get with Logistic Regression? Is LR a good choice for this dataset?
- What's the best performance you can get with Random Forest? Is RF a good choice for this dataset?
- If you could only choose one, which of the classifiers you ran above is best? How do you define best? (hint: could be prediction accuracy, running time, interpretability, etc.)


## Note: you should know by now, but here is the order of importance:

- Your analysis (there is no "right" answer, only good and bad defense of it)
- Your visuals
- Your coding implementation

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [14]:
#Load data

raw_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data',\
                    sep=',',header=None,\
                    names=['preg','glc_concen','blood_pres','skin','insulin','mass','pedigree','age','class'])


In [15]:
#Take a peek at the data structure

raw_data.head()

Unnamed: 0,preg,glc_concen,blood_pres,skin,insulin,mass,pedigree,age,class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [16]:
raw_data.shape

(768, 9)

In [17]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 767
Data columns (total 9 columns):
preg          768 non-null int64
glc_concen    768 non-null int64
blood_pres    768 non-null int64
skin          768 non-null int64
insulin       768 non-null int64
mass          768 non-null float64
pedigree      768 non-null float64
age           768 non-null int64
class         768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 60.0 KB


In [18]:
raw_data.isnull().any()

preg          False
glc_concen    False
blood_pres    False
skin          False
insulin       False
mass          False
pedigree      False
age           False
class         False
dtype: bool

In [19]:
raw_data.describe()

Unnamed: 0,preg,glc_concen,blood_pres,skin,insulin,mass,pedigree,age,class
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


**Describe the content of the dataset and its goals**
==============
- The dataset has 768 observations and 9 variables.
- All variables are numeric: two are floating-point and the rest are integers.
- The 'class' variable is binary; it is the target that the model ultimately tries to predict.
- Since 'class' is binary, its mean (about 0.35) equals the proportion of people who have diabetes in this sample.
  The sample is not representative of the general Pima Indian population, since all observations are of
  females aged 21 or older.
- The dataset contains no missing values per se. However, the minimum values in the summary table above show zeros in 'glc_concen', 'blood_pres', 'skin', 'insulin' and 'mass'. These zeros are most likely the 'missing values' in question, since common sense says these measurements should be greater than zero.
-----------------

The **goal** is to determine which machine learning algorithm best predicts diabetes among
Pima Indian women aged 21 or older, based on the given dataset.

**Describe the features and formulate a hypothesis on which might be relevant in predicting diabetes**

**Describe the missing/NULL values. Decide if you should impute or drop them and justify your choice.**
==============
- The last column, 'class', is the target we intend to predict, which leaves 8 features.
- Based on my medical knowledge, 'plasma glucose concentration', '2-Hour serum insulin' and 'Diabetes pedigree function' are
   likely to be strongly associated with a diabetes diagnosis, while 'Diastolic blood pressure', 'Triceps skin fold thickness'
   and 'Age' might have some influence on it.
- To understand which features are relevant and how much they contribute to the diabetes prediction, we can compute a feature
   correlation matrix.
- The zero values in certain features need to be handled before computing the correlation matrix. We could either
       (1) drop the rows that contain zeros, or
       (2) replace the zeros with some value (e.g. the column mean)
- The maximum value in the 'insulin' column is far above the mean (846 vs. about 80), which may indicate an outlier.
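
The two options listed above can be sketched on a small made-up frame. The values are illustrative rather than taken from the dataset, and the sketch assumes that a zero in these columns stands for a missing measurement:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 0 stands for a missing measurement.
toy = pd.DataFrame({
    'skin':    [35, 29, 0, 23],
    'insulin': [0, 0, 0, 94],
    'mass':    [33.6, 26.6, 23.3, 28.1],
})

# Mark the placeholder zeros as NaN so pandas treats them as missing.
toy_nan = toy.replace(0, np.nan)

# Option 1: drop every row that has at least one missing value.
dropped = toy_nan.dropna(axis=0, how='any')

# Option 2: fill each missing value with the mean of its column.
imputed = toy_nan.fillna(toy_nan.mean())

print(dropped.shape)
print(imputed)
```

Dropping keeps only complete rows, so it can shrink the sample considerably when one column is mostly missing. Mean imputation keeps every row but flattens the variation in the imputed column.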

In [20]:
# Percentage of zero (i.e. missing) values in each affected column
for col in ['glc_concen', 'blood_pres', 'skin', 'insulin', 'mass']:
    zero_share = 1 - float(np.count_nonzero(raw_data[col])) / len(raw_data)
    print("There are %.2f%% missing data in variable '%s'." % (100 * zero_share, col))


There are 0.65% missing data in variable 'glc_concen'.
There are 4.56% missing data in variable 'blood_pres'.
There are 29.56% missing data in variable 'skin'.
There are 48.70% missing data in variable 'insulin'.
There are 1.43% missing data in variable 'mass'.


The share of missing data (zero values) is small for 'glc_concen', 'blood_pres' and 'mass', but large for 'skin' (about 30%) and 'insulin' (about 49%). For simplicity we drop every row that contains a missing value, at the cost of a much smaller sample.

But we need to first replace all zero values with NaN before dropping them.
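
The share of zero placeholders per column can also be computed in one vectorized step. The frame below is a made-up stand-in for `raw_data`, so the numbers are illustrative only:

```python
import pandas as pd

# Toy data (hypothetical values); 0 stands for a missing measurement.
toy = pd.DataFrame({
    'glc_concen': [148, 85, 183, 0],
    'insulin':    [0, 0, 0, 94],
})

# Comparing to 0 gives a boolean frame; the column mean of booleans is the
# fraction of zeros, and multiplying by 100 turns it into a percentage.
zero_pct = (toy == 0).mean() * 100

for col, pct in zero_pct.items():
    print("%.2f%% of '%s' is zero" % (pct, col))
```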

In [37]:
#Take a closer look at column 'insulin' to determine if it is necessary to drop 'outliers'. 

sorted(raw_data['insulin'],reverse=True)

[846,
 744,
 680,
 600,
 579,
 545,
 543,
 540,
 510,
 495,
 495,
 485,
 480,
 480,
 478,
 474,
 465,
 440,
 415,
 402,
 392,
 387,
 375,
 370,
 360,
 342,
 335,
 330,
 328,
 326,
 325,
 325,
 325,
 321,
 318,
 310,
 304,
 300,
 293,
 293,
 291,
 285,
 285,
 284,
 280,
 278,
 277,
 275,
 274,
 272,
 271,
 270,
 265,
 265,
 258,
 255,
 250,
 249,
 245,
 240,
 240,
 237,
 235,
 231,
 231,
 230,
 230,
 228,
 225,
 225,
 220,
 220,
 215,
 215,
 215,
 210,
 210,
 210,
 210,
 210,
 207,
 207,
 205,
 205,
 204,
 200,
 200,
 200,
 200,
 196,
 194,
 194,
 194,
 193,
 192,
 192,
 191,
 190,
 190,
 190,
 190,
 188,
 185,
 185,
 184,
 183,
 182,
 182,
 182,
 180,
 180,
 180,
 180,
 180,
 180,
 180,
 178,
 176,
 176,
 176,
 175,
 175,
 175,
 171,
 170,
 170,
 168,
 168,
 168,
 168,
 167,
 167,
 166,
 165,
 165,
 165,
 165,
 160,
 160,
 160,
 160,
 159,
 158,
 158,
 156,
 156,
 156,
 155,
 155,
 155,
 155,
 152,
 152,
 150,
 150,
 148,
 148,
 146,
 145,
 145,
 145,
 144,
 144,
 142,
 140,
 140,
 140

Given that there are also values such as 744, 680 and 600, the maximum of 846 does not seem to be much of an anomaly, so we can keep it.
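
The judgment above comes from reading the sorted list. As a rough cross-check, one could look only at the largest values and apply a common rule of thumb (Tukey's fence, Q3 + 1.5 × IQR). This rule is not part of the analysis above; the sketch below uses a made-up series, and the result on the real column may differ:

```python
import pandas as pd

# Toy series (hypothetical values standing in for an 'insulin'-like column).
s = pd.Series([14, 76, 94, 125, 168, 190, 230, 846])

# The five largest values, without printing the whole sorted column.
top = s.nlargest(5)

# Tukey's rule of thumb: flag values above Q3 + 1.5 * IQR.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)
flagged = s[s > upper_fence]

print(top.tolist())
print(flagged.tolist())
```

A flagged value is only a candidate for closer inspection. Whether to keep it remains a judgment call, as in the reasoning above.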

In [22]:
# Replace the placeholder zeros with NaN in the columns where zero is not a valid measurement
data = raw_data.copy().replace({'glc_concen': {0: np.nan}, 'blood_pres': {0: np.nan}, 'skin': {0: np.nan},
                                'insulin': {0: np.nan}, 'mass': {0: np.nan}})


data.head()

Unnamed: 0,preg,glc_concen,blood_pres,skin,insulin,mass,pedigree,age,class
0,6,148,72,35.0,,33.6,0.627,50,1
1,1,85,66,29.0,,26.6,0.351,31,0
2,8,183,64,,,23.3,0.672,32,1
3,1,89,66,23.0,94.0,28.1,0.167,21,0
4,0,137,40,35.0,168.0,43.1,2.288,33,1


In [23]:
df=data.copy() 

df.dropna(axis=0,how='any',inplace=True)

df.shape

(392, 9)

Now the sample size has been reduced to 392 observations. 

In [24]:
df.describe()

Unnamed: 0,preg,glc_concen,blood_pres,skin,insulin,mass,pedigree,age,class
count,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0,392.0
mean,3.30102,122.627551,70.663265,29.145408,156.056122,33.086224,0.523046,30.864796,0.331633
std,3.211424,30.860781,12.496092,10.516424,118.84169,7.027659,0.345488,10.200777,0.471401
min,0.0,56.0,24.0,7.0,14.0,18.2,0.085,21.0,0.0
25%,1.0,99.0,62.0,21.0,76.75,28.4,0.26975,23.0,0.0
50%,2.0,119.0,70.0,29.0,125.5,33.2,0.4495,27.0,0.0
75%,5.0,143.0,78.0,37.0,190.0,37.1,0.687,36.0,1.0
max,17.0,198.0,110.0,63.0,846.0,67.1,2.42,81.0,1.0


In [25]:
#run a feature correlation matrix

df.corr()

Unnamed: 0,preg,glc_concen,blood_pres,skin,insulin,mass,pedigree,age,class
preg,1.0,0.198291,0.213355,0.093209,0.078984,-0.025347,0.007562,0.679608,0.256566
glc_concen,0.198291,1.0,0.210027,0.198856,0.581223,0.209516,0.14018,0.343641,0.515703
blood_pres,0.213355,0.210027,1.0,0.232571,0.098512,0.304403,-0.015971,0.300039,0.192673
skin,0.093209,0.198856,0.232571,1.0,0.182199,0.664355,0.160499,0.167761,0.255936
insulin,0.078984,0.581223,0.098512,0.182199,1.0,0.226397,0.135906,0.217082,0.301429
mass,-0.025347,0.209516,0.304403,0.664355,0.226397,1.0,0.158771,0.069814,0.270118
pedigree,0.007562,0.14018,-0.015971,0.160499,0.135906,0.158771,1.0,0.085029,0.20933
age,0.679608,0.343641,0.300039,0.167761,0.217082,0.069814,0.085029,1.0,0.350804
class,0.256566,0.515703,0.192673,0.255936,0.301429,0.270118,0.20933,0.350804,1.0


Based on the correlation matrix, three feature pairs are strongly correlated: preg-age (0.68), skin-mass (0.66) and glc_concen-insulin (0.58).
We will keep this in mind when choosing among algorithms later.
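
Rather than scanning the matrix by eye, the strongly correlated pairs can be pulled out programmatically. The sketch below runs on synthetic data, and the 0.5 cutoff is an assumption chosen for illustration, not a value fixed by the analysis:

```python
import numpy as np
import pandas as pd

# Toy data (synthetic): 'b' is built from 'a', 'c' is independent noise.
rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + 0.1 * rng.normal(size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Hypothetical cutoff for calling a correlation "strong".
threshold = 0.5
strong = pairs[pairs.abs() > threshold]

print(strong)
```

Applied to `df.corr()`, the same steps would list each feature pair whose absolute correlation exceeds the chosen cutoff.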

**Come up with a benchmark for the minimum performance that an algorithm should have on this dataset**

Create 'dummy' models. The DummyClassifier will serve as a baseline to compare the other classifiers against.

DummyRegressor predicts the mean of the target; DummyClassifier, with strategy='most_frequent', predicts the most common class. We'll try to beat them!
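
The majority-class baseline can also be computed by hand. In the cleaned data the mean of 'class' is about 0.33, so always predicting class 0 would be right roughly 67% of the time on the full cleaned set. The sketch below shows the calculation on made-up labels:

```python
import numpy as np

# Toy labels (hypothetical), with one third positives.
y = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1])

# The majority class is the label that occurs most often.
values, counts = np.unique(y, return_counts=True)
majority = values[np.argmax(counts)]

# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = np.mean(y == majority)

print(majority, baseline_accuracy)
```

Any real classifier should beat this number on held-out data. Otherwise it has learned nothing beyond the class balance.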

In [38]:
from sklearn.dummy import DummyClassifier, DummyRegressor

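# Always predict the majority class; the default strategy varies across scikit-learn versions
dc = DummyClassifier(strategy='most_frequent')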
dr = DummyRegressor()

In [39]:
#Data split

pima_labels = df['class']
pima_X = df.drop('class', axis=1)


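# Note: sklearn.cross_validation was removed in scikit-learn 0.20;
# on newer versions, import train_test_split from sklearn.model_selection instead.
from sklearn.cross_validation import train_test_split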
X_train, X_test, y_train, y_test = train_test_split(pima_X, pima_labels, 
                                                    test_size=0.2, random_state=7)
            

**What's the best performance you can get with kNN? Is kNN a good choice for this dataset?**

In [27]:
# 10 cross validation iterations with 20% test / 80% train
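# (in sklearn.model_selection the equivalent is ShuffleSplit(n_splits=10, test_size=0.2, random_state=0))
from sklearn.cross_validation import ShuffleSplit
cv = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2, random_state=0)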
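
To make the idea concrete, here is a rough sketch of what a shuffle-split iterator produces: several independent random train/test partitions of the same rows. The helper `shuffle_split_indices` is hypothetical and simplified; it is not scikit-learn's implementation.

```python
import numpy as np

def shuffle_split_indices(n_samples, n_iter=10, test_size=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs from independent random shuffles.

    A simplified, hypothetical helper; not scikit-learn's implementation.
    """
    rng = np.random.RandomState(seed)
    n_test = int(np.ceil(test_size * n_samples))
    for _ in range(n_iter):
        # Each iteration reshuffles all rows, so test sets may overlap
        # across iterations (unlike k-fold cross validation).
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]

splits = list(shuffle_split_indices(n_samples=50))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```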

In [28]:
# Standardization: bring all of the features onto the same scale

from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# fit the scaler on the training features, then transform them
X_train_std = stdsc.fit_transform(X_train)
# transform the test features using the mean and std learned from the training set
X_test_std = stdsc.transform(X_test)
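
For reference, the arithmetic behind this step can be written out with NumPy. The arrays below are made-up values. The key point is that the mean and standard deviation come from the training set only and are then reused for the test set:

```python
import numpy as np

# Toy feature matrices (hypothetical values): rows are samples.
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

# Learn the per-column mean and standard deviation from the training set only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# Apply the same training statistics to both sets.
X_train_toy_std = (X_train_toy - mean) / std
X_test_toy_std = (X_test_toy - mean) / std

print(X_train_toy_std)
print(X_test_toy_std)
```

Scaling matters most for distance-based methods such as kNN, where a feature with a large numeric range would otherwise dominate the distance.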

In [29]:
#Now all features are on the same scale

pd.DataFrame(X_train_std, columns=X_train.columns).head()

Unnamed: 0,preg,glc_concen,blood_pres,skin,insulin,mass,pedigree,age
0,0.80241,-0.544995,-0.014177,0.294995,-0.733566,-0.30458,-1.113849,0.609127
1,0.492726,-0.741142,-1.281977,-0.081876,-0.603972,0.142798,-0.075459,-0.092185
2,-1.055699,0.141523,1.095149,0.012342,0.536456,-0.31856,-0.017617,-0.69331
3,-0.126644,-0.18539,0.302773,-1.306709,-0.413901,-0.933706,-1.155164,-0.69331
4,-0.436329,-0.316155,-0.172652,-0.647183,-0.508936,0.156779,-0.582259,-0.492935
