# Welcome to your Jupyter IPython Notebook

Jupyter IPython Notebooks allow you to interactively run commands written in Python and inspect their output.  There are also libraries available that allow you to make graphs directly in the notebook.  This is a very convenient way to explore and analyze data.

In this tutorial, you will execute each cell (box) separately.  You can do this by clicking on the cell with the mouse and then pressing Shift+Enter.  This will run the code in the cell, and print any output you've requested.

You can execute cells out-of-order, but it's a good practice to work in order.

You can edit a cell by clicking in it and then editing the text.

If you get an error when you execute a cell, you can simply re-execute that cell after fixing the code; you don't need to re-run all the cells in the notebook from the beginning.

If you want to "comment" out a line of text (so that it will not be interpreted as a line of code and executed when you run the cell), you can do that using the # sign.
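
As a small illustration of commenting (the variable name `total` is made up for this example), the lines starting with # below are skipped when the cell runs:

```python
# this whole line is a comment, so Python ignores it
total = 2 + 3  # a comment can also follow code on the same line
# total = 100  <- commented out, so this assignment never happens
print(total)
```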

All the commands you execute in the notebook will be remembered, in the sense that if you import a library of functions or read some data into a variable in one cell, it will still be available when you execute the next cell.  If you want to start over, you can go to the "Kernel" menu at the top of the notebook and click "Restart".  This will clear everything that you've executed from the computer's memory, but it will leave all of the code you've written just as it is.

This cell was written by selecting "Markdown" from the drop-down menu above (which by default says "Code").  The notebook then understands that a Markdown cell contains comments and not code, and running this cell with Shift+Enter simply formats the text.

## What data will we use?

We will be using the Pima Indians Diabetes Database, a publicly available data set, to build a model that predicts whether a patient will get diabetes based on several measurements of each patient. The dataset is described here:

https://www.kaggle.com/uciml/pima-indians-diabetes-database

Each row represents one patient, and the columns are:

- Pregnancies: Number of times pregnant

- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test

- BloodPressure: Diastolic blood pressure (mm Hg)

- SkinThickness: Triceps skin fold thickness (mm)

- Insulin: 2-Hour serum insulin (mu U/ml)

- BMI: Body mass index (weight in kg/(height in m)^2)

- DiabetesPedigreeFunction: Diabetes pedigree function

- Age: Age (years)

- Outcome: Class variable (0 or 1)

## Read in the data

We will use the pandas library to work with our data.  pandas defines a format for storing the data called a dataframe, as well as many functions for working with the dataframes.  pandas dataframes can also be used with many other useful Python packages.
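
To get a feel for what a dataframe is before reading the real file, here is a minimal sketch that builds a tiny one by hand. The name `toy` and the values in it are made up for illustration; they are not taken from the data file.

```python
import pandas as pd

# a tiny, made-up dataframe with two patients (values are illustrative only)
toy = pd.DataFrame({
    'Glucose': [148, 85],
    'BMI': [33.6, 26.6],
    'Outcome': [1, 0],
})

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # the column names
toy.head()                # in a notebook, the last expression in a cell is displayed
```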

In [None]:
# import the pandas library for working with dataframes
import pandas as pd

In [None]:
# read the csv-formatted data file into a pandas dataframe
df=pd.read_csv('../input/diabetes.csv')
# get shape of data frame
print('Shape (n_rows,n_columns) of dataframe:',df.shape)
# print top 5 rows of data frame
df.head()

## Inspect the data using some pandas commands

### Sub-select one or more columns as a dataframe and show only the top 5 rows


In [None]:
df[['Outcome','Pregnancies','Insulin']].head()

In [None]:
df[['Age']].head()

### Select rows where a condition is true, and find out how many rows are in the resulting dataframe

In [None]:
print(df[df.BMI>30].shape) # the first element is the number of rows, the second element is the number of columns
print('The number of rows where BMI>30 = ',df[df.BMI>30].shape[0]) # the first element is labeled 0, the second element is labeled 1

In [None]:
df[df.BMI<10].head()

In [None]:
df.BMI>30

### Select rows where BMI>30,
### select the columns Outcome, BMI, and Age, 
### and show only the top 5 rows of the resulting dataframe (using the .head() command)

In [None]:
df[df.BMI>30][['Outcome','BMI','Age']].head()

### Select rows where Outcome is 1 and Pregnancies>0,
### select the columns Glucose and BloodPressure, 
### and show only the top 3 rows of the resulting dataframe (using the .head(nrows) command)

In [None]:
df[(df.Outcome==1)&(df.Pregnancies>0)][['Glucose','BloodPressure']].head(3)

### EXERCISE: 

How many patients in the study have Outcome equal to 1 and BloodPressure greater than 70?  

The code below shows one solution: the first element of the resulting shape is the number of matching rows.

In [None]:
df[(df.Outcome==1)&(df.BloodPressure>70)].shape

### Does the data have any missing values?

In [None]:
df.isnull().sum()

In [None]:
df.notnull().sum()

### Get a list of column names

In [None]:
df.columns

### Get the column data types

In [None]:
df.dtypes

### Look at some summary statistics of our data frame

In [None]:
df.describe()

Notice that the "count" of values for each column is the same, and the same as the number of rows in the data frame.  That means that there are no missing (NULL) values.
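
For comparison, here is a minimal sketch of what the "count" row looks like when a value really is missing. The dataframe `toy` is made up for this example and is not part of the diabetes data.

```python
import numpy as np
import pandas as pd

# made-up data: one BMI value is genuinely missing (NaN)
toy = pd.DataFrame({'Glucose': [148, 85, 183], 'BMI': [33.6, np.nan, 23.3]})

counts = toy.describe().loc['count']  # number of non-missing values per column
print(counts)
print('rows in dataframe:', len(toy))

# any column whose count is smaller than the number of rows has missing values
columns_with_missing = list(counts[counts < len(toy)].index)
print('columns with missing values:', columns_with_missing)
```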

### How many of each type of diagnosis are there?

In [None]:
df.Outcome.value_counts()

In [None]:
df.SkinThickness.value_counts()

### What is the mean of "SkinThickness" where the Outcome is 1?

In [None]:
df[df.Outcome==1].SkinThickness.mean()

### EXERCISE: what is the maximum of "BMI" where the outcome is 0? 

In [None]:
# use the function .max()
df[df.Outcome==0].BMI.max()

In [None]:
max(df[df.Outcome==0].BMI)

## Visualize the data: make scatter plots in 2 variables

In [None]:
# get a plotting library
import matplotlib.pyplot as plt
# display the plots inline in the notebook
%matplotlib inline

In [None]:
# plot Glucose vs BloodPressure and color points according to Outcome
plt.figure()
plt.scatter(df[df.Outcome==1].Glucose,df[df.Outcome==1].BloodPressure,label='Diabetes',color='r',s=2)
plt.scatter(df[df.Outcome==0].Glucose,df[df.Outcome==0].BloodPressure,label='No Diabetes',color='b',s=2)
plt.legend()
plt.xlabel('Glucose')
plt.ylabel('BloodPressure')

Notice first that already you can see a trend that higher glucose is associated with diabetes (Outcome=1, red points), while lower glucose is associated with no diabetes (Outcome=0, blue points).

Notice also that there's a set of points with value 0 for Glucose, and another set with 0 for BloodPressure.  This doesn't make sense physically.  It looks like this data was filled with 0 when the value should have been NULL.  Let's check how many zeros appear in each column.

In [None]:
df.columns

In [None]:
c='Pregnancies'
df[df[c]==0][c].count()

In [None]:
for c in df.columns:
    print('For column',c,' there are',df[df[c]==0][c].count(),'zero values.')


For some of these columns, zero makes sense, like for Pregnancies and Outcome.  But for some of the others, like BloodPressure or BMI, zero definitely doesn't make sense.  Let's have a closer look at the data by making a histogram of the value of the data for each column. 
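
As an aside, the zero count for every column can also be computed in one expression instead of a loop. The sketch below uses a made-up dataframe `toy` so that it runs on its own; the same expression should work on `df`.

```python
import pandas as pd

# made-up data for illustration
toy = pd.DataFrame({
    'Pregnancies': [0, 2, 0],
    'BloodPressure': [72, 0, 64],
    'BMI': [33.6, 26.6, 0.0],
})

# (toy == 0) is a dataframe of True/False values;
# .sum() counts the True values in each column
zero_counts = (toy == 0).sum()
print(zero_counts)
```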

In [None]:
for c in df.columns:
    plt.figure()
    plt.hist(df[c],bins=15)
    plt.xlabel(c)
    plt.ylabel('frequency')
    plt.show()

From these histograms it seems that many of the zero values are indeed likely missing data which should have been labeled NULL, and will need to be considered before we train a model to classify the data.  

If we go back two cells to where we printed the number of zeros in each column, we can also see that the Insulin column has 374 zeros (out of 768 rows total), almost 50% of its values.

When we're ready to build a model, we will first drop (delete) the insulin column since so many of the values are missing.  Then we will impute (fill in) the zeros in the columns where zero doesn't make sense.  We will make the choice to use the mean (average) of the non-zero values in each column to impute the values that are zero. 
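
The imputation idea can be sketched on a single made-up column. The helper name `impute_zeros_with_mean` and the dataframe `toy` are hypothetical and used only for this illustration; the code used on the real data appears later in the notebook.

```python
import numpy as np
import pandas as pd

def impute_zeros_with_mean(values):
    """Return a copy of a column in which zeros are replaced by the mean of the non-zero values."""
    mean_nonzero = values[values != 0].mean()
    # np.where(condition, value if true, value if false)
    return pd.Series(np.where(values != 0, values, mean_nonzero), index=values.index)

# made-up blood pressure values; the 0 stands in for a missing measurement
toy = pd.DataFrame({'BloodPressure': [70, 0, 80, 90]})
toy['BloodPressure_imputed'] = impute_zeros_with_mean(toy['BloodPressure'])
print(toy)
```

In this sketch the mean is computed from the same column that is being filled. A common alternative is to compute the mean on the training data only and reuse it for the test data, so that no information from the test set is used during preparation.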


### EXERCISE: inspect the data yourself by making scatter plots of different columns

### EXERCISE: find one feature column that does a pretty good job splitting the data on Outcome (the "target" or "label" column).  At which value would you split that feature column?

In [None]:
# example: plot histograms of Age for Outcome=1 and Outcome=0.
plt.figure()
plt.hist(df[df.Outcome==1]['Age'],bins=15,label='Diabetes',color='r',alpha=0.2)
plt.hist(df[df.Outcome==0]['Age'],bins=15,label='No Diabetes',color='b',alpha=0.2)
plt.xlabel('Age')
plt.ylabel('frequency')
plt.legend()
plt.show()

In [None]:
# choose a feature column, plot the histogram, and decide on a split value


### EXERCISE: how accurate is your classifier using just 1 split?

In [None]:
# example
import numpy as np # numpy provides np.where, which is used below
# create a new column in the data frame with the predicted outcome based on your split (here, Age<30 means outcome=0, otherwise outcome=1)
df['PredictedOutcome']=np.where(df.Age<30,0,1) # np.where(condition, value if true, value if false)
# calculate accuracy
N_correct=df[df.PredictedOutcome==df.Outcome].shape[0]
N_total=df.shape[0]
accuracy=N_correct/N_total
print('number of correct examples =',N_correct)
print('number of examples in total =',N_total)
print('accuracy =',accuracy)

We will discuss different ways of measuring the quality of a classifier in the next section.

In [None]:
# now check the accuracy of your column and split
# create a new column in the data frame with the predicted outcome based on your split
# replace "NONE" with your code
df['PredictedOutcome']=np.where(df.NONE<NONE,0,1) # np.where(condition, value if true, value if false)
# calculate accuracy
N_correct=df[df.PredictedOutcome==df.Outcome].shape[0]
N_total=df.shape[0]
accuracy=N_correct/N_total
print('number of correct examples =',N_correct)
print('number of examples in total =',N_total)
print('accuracy =',accuracy)

### Splitting data into TRAIN/TEST

In [None]:
import numpy as np
import sklearn
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size = 0.3, random_state = 0)

train.describe()

### CODE FOR DROPPING/IMPUTING DATA AFTER TRAIN/TEST SPLIT

In [None]:
# drop returns a new dataframe; reassigning avoids modifying a slice of df in place
train=train.drop('Insulin',axis=1) # axis=1 means drop the column, not the row
test=test.drop('Insulin',axis=1)
# check that Insulin is no longer in the list of columns
train.columns

In [None]:
# numpy provides many useful functions, including allowing us to create new columns in our dataframe based on a condition
import numpy as np

def imputeColumns(dataset):
    # create a list of columns that we will impute with the average non-zero value in each column
    columnsToImpute=['Glucose', 'BloodPressure', 'SkinThickness','BMI']

    for c in columnsToImpute:
        avgOfCol=dataset[dataset[c]>0][c].mean() # mean of the non-zero values in this column
        dataset[c+'_imputed']=np.where(dataset[c]!=0,dataset[c],avgOfCol) # keep non-zero values, replace zeros with the mean

imputeColumns(train)
imputeColumns(test)
# check that we've imputed the 0 values  
train[train.Glucose==0][['Glucose','Glucose_imputed']].head()

### Extracting input features and output feature

In [None]:
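# note: the raw columns are used here; to train on the imputed values, swap in the '_imputed' columns (e.g. 'Glucose_imputed')
X_train = train[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness','BMI', 'DiabetesPedigreeFunction', 'Age']]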
Y_train = train[['Outcome']]
X_test = test[['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'BMI', 'DiabetesPedigreeFunction', 'Age']]
Y_test = test[['Outcome']]

# Building Decision Trees for Classification

We are going to discuss the process of building a decision tree classifier for the diabetes problem. The objective is to predict based on diagnostic measurements whether a patient has diabetes.

We are building a model that is going to make predictions, so we need a way to evaluate the quality of these predictions before we can trust them. Since predictions are, by definition, made for unseen inputs, we cannot evaluate the model on the data that we used to create it. This is why we divided the dataset into two non-intersecting parts: training data that is used for building the model and test data that is used for evaluating the model's predictions.
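
To see that the two parts really do not intersect, here is a minimal sketch on a made-up dataframe. The names `toy`, `toy_train`, and `toy_test` are hypothetical; `train_test_split` is the same scikit-learn function used above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# made-up dataframe with 10 rows
toy = pd.DataFrame({'x': range(10), 'Outcome': [0, 1] * 5})

toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)
print(len(toy_train), 'training rows,', len(toy_test), 'test rows')

# the two parts share no rows: their index labels do not overlap
shared_rows = toy_train.index.intersection(toy_test.index)
print('rows in both parts:', len(shared_rows))
```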

In [None]:
Y_train.describe()

In [None]:
Y_test.describe()

We are ready now to build our first classifier. We use the training data to build our decision tree model. Then we are going to evaluate its score using the test set. 

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create the classifier
decision_tree_classifier = DecisionTreeClassifier(random_state = 0)

# Train the classifier on the training set
decision_tree_classifier.fit(X_train, Y_train)

# Evaluate the classifier on the testing set using classification accuracy
decision_tree_classifier.score(X_test, Y_test)

#### Congratulations! We got around 74% accuracy on our first classifier. Let us now visualize the decision tree that was built.

In [None]:
from sklearn import tree

dot_file = tree.export_graphviz(decision_tree_classifier, out_file='tree.dot', 
                                feature_names = list(X_train),
                                class_names = ['healthy', 'ill']) 


Notice that the decision tree is very deep and complicated. This suggests that the model 
will not generalize well. This phenomenon is called overfitting: the model memorizes
the training data, so it has high accuracy on the training data but performs badly on unseen data.
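
One way to put numbers on "very deep" is to ask a fitted tree for its depth and number of leaves, and to compare its accuracy on the training and test data. The sketch below uses synthetic data from `make_classification` as a stand-in, which is an assumption made so that the example runs on its own; it is not the diabetes data, and the numbers it prints will differ from those in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (NOT the diabetes data), with some label noise
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# a tree with no limits on its growth
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print('depth:', full_tree.get_depth(), ' leaves:', full_tree.get_n_leaves())
train_acc = full_tree.score(X_tr, y_tr)
test_acc = full_tree.score(X_te, y_te)
# a large gap between these two numbers is a sign of overfitting
print('train accuracy: {:.3f}  test accuracy: {:.3f}'.format(train_acc, test_acc))
```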

In [None]:
print("Accuracy on training set: {:.3f}".format(decision_tree_classifier.score(X_train, Y_train)))
print("Accuracy on test set: {:.3f}".format(decision_tree_classifier.score(X_test, Y_test)))

In [None]:
import graphviz
with open("tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

To avoid overfitting, we can attempt to reduce the complexity of the model. This can be done while building the model
(pre-pruning) or after building it (post-pruning). scikit-learn provides built-in parameters to control pre-pruning, such as
limiting the depth of the tree. 

In [None]:
decision_tree_pruned = DecisionTreeClassifier(random_state = 0, max_depth = 4)

decision_tree_pruned.fit(X_train, Y_train)
decision_tree_pruned.score(X_test, Y_test)

In [None]:
print("Accuracy on training set: {:.3f}".format(decision_tree_pruned.score(X_train, Y_train)))
print("Accuracy on test set: {:.3f}".format(decision_tree_pruned.score(X_test, Y_test)))

In [None]:
pre_pruned_dot_file = tree.export_graphviz(decision_tree_pruned, out_file='pruned_tree.dot', 
                                feature_names = list(X_test),
                                class_names = ['healthy', 'ill'])
with open("pruned_tree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

What other parameters can be used for pre-pruning? Experiment with different parameters and check how the results vary.

hint: consult http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
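
As a starting point for this experiment, here is a minimal sketch that tries a few pre-pruning parameters of `DecisionTreeClassifier` in a loop. The parameter values are arbitrary choices for illustration, and the data is synthetic (an assumption so that the example runs on its own); to use it on the diabetes data, replace the synthetic arrays with X_train, Y_train, X_test, and Y_test.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (NOT the diabetes data)
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# each dictionary is one pre-pruning setting to try (values chosen arbitrarily)
settings = [
    {'max_depth': 4},
    {'min_samples_leaf': 20},
    {'min_samples_split': 50},
    {'max_leaf_nodes': 8},
]

results = {}
for params in settings:
    clf = DecisionTreeClassifier(random_state=0, **params).fit(X_tr, y_tr)
    n_leaves = clf.get_n_leaves()
    train_acc = clf.score(X_tr, y_tr)
    test_acc = clf.score(X_te, y_te)
    results[str(params)] = (n_leaves, train_acc, test_acc)
    print(params, 'leaves:', n_leaves,
          'train: {:.3f}'.format(train_acc), 'test: {:.3f}'.format(test_acc))
```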