# GA Data Science (DAT18) - Lab 13 - Solutions
## Pair Programming

### Heart Disease Dataset
ref: [https://archive.ics.uci.edu/ml/datasets/Heart+Disease](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)

#### Features

    Dataset has 76 total attributes - 14 attributes are used:
    1. #3 (age)
    2. #4 (sex)
    3. #9 (cp)
    4. #10 (trestbps)
    5. #12 (chol)
    6. #16 (fbs)
    7. #19 (restecg)
    8. #32 (thalach)
    9. #38 (exang)
    10. #40 (oldpeak)
    11. #41 (slope)
    12. #44 (ca)
    13. #51 (thal)
    14. #58 (num) (the predicted attribute - 0 is healthy and 1,2,3,4 indicate heart disease) 

### Class Exercise: Implement Random Forest

#### Import the dataset into a pandas dataframe:

Note: The CSV file has no header row, so you'll have to add the column labels manually.

In [None]:
import numpy as np
import pandas as pd

from bokeh.plotting import figure,show,output_notebook
output_notebook()

In [None]:

df = pd.read_csv("../data/heart_disease.csv",header=None)
df.columns = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
df.head()

#### Prepare and validate the data:

Investigate the data and check for missing values. We've used .info() before; value_counts() on `ca` and `thal` shows whether those columns contain non-numeric placeholders:

In [None]:
df.info()
print(df['ca'].value_counts())
print(df['thal'].value_counts())

#### Clean the data to ensure it can be used in a random forest algorithm

In [None]:
# coerce every column to numeric; entries that cannot be parsed become NaN
df = df.apply(pd.to_numeric, errors='coerce')
df.dropna(inplace=True)
df.info()
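
The effect of the cleaning step can be seen on a small made-up frame. This is only a sketch: the rows below are illustrative rather than taken from the dataset, and it assumes missing entries are marked with a placeholder string such as `?`. `pd.to_numeric` with `errors='coerce'` turns such strings into `NaN`, after which `dropna` removes the affected rows.

```python
import pandas as pd

# Hypothetical toy data: 'ca' and 'thal' are strings because of the '?' placeholders
toy = pd.DataFrame({
    'age': [63, 67, 41],
    'ca': ['0.0', '?', '2.0'],
    'thal': ['6.0', '3.0', '?'],
})

# Coerce every column to numeric; anything that cannot be parsed becomes NaN
toy_numeric = toy.apply(pd.to_numeric, errors='coerce')

# Drop the rows that contain at least one NaN
toy_clean = toy_numeric.dropna()

print(toy_numeric.isnull().sum())
print(len(toy_clean))
```

Only the first row survives here, because each of the other two rows has one unparseable entry.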

#### Select Features and convert Target to Boolean Class for Heart Disease (i.e., values 1, 2, 3 and 4 all indicate heart disease)

In [None]:
df['num'] = df['num'].replace(to_replace=[2.0, 3.0, 4.0], value=1.0)

features = df.iloc[:, 0:13].values
target = df.num.values
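
As a quick check of the target conversion, the sketch below applies it to a made-up series (the values are illustrative only) and compares it with an alternative that simply tests whether the value is greater than zero. Both approaches should give the same binary labels.

```python
import pandas as pd

# Hypothetical target values covering every class in the original coding
num = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 0.0])

# Approach used in this lab: map 2, 3 and 4 onto 1
replaced = num.replace(to_replace=[2.0, 3.0, 4.0], value=1.0)

# Alternative: anything above 0 counts as heart disease
binary = (num > 0).astype(float)

print(binary.tolist())
print(replaced.equals(binary))
```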

In [None]:
feature_names = df.columns[0:13]
feature_names

#### Build the model and score with cross-validation

In [None]:
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import KFold

def cross_validate(X, y, classifier, k_fold):

    # derive a set of (random) training and testing indices
    k_fold_indices = KFold(n_splits=k_fold,
                           shuffle=True, random_state=0)

    k_score_total = 0
    # for each pair of training and testing slices, fit the classifier and score the results
    for train_slice, test_slice in k_fold_indices.split(X):

        model = classifier(X[train_slice],
                           y[train_slice])

        k_score = model.score(X[test_slice],
                              y[test_slice])

        k_score_total += k_score

    # return the average accuracy
    return k_score_total / k_fold

# pass the bound fit method: calling it trains the forest and returns the fitted estimator
model = RandomForestClassifier(random_state=0).fit
cross_validate(features, target, model, 10)
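
For comparison, scikit-learn ships a helper that runs the same kind of loop. The sketch below assumes a scikit-learn version that provides `sklearn.model_selection`, and it uses synthetic data from `make_classification` as a stand-in for `features` and `target`, so the accuracy it prints says nothing about the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the heart disease features and target (illustrative only)
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Same fold settings as the hand-written loop: 10 shuffled folds, fixed seed
folds = KFold(n_splits=10, shuffle=True, random_state=0)

# One accuracy score per fold; a fresh copy of the classifier is fitted each time
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=folds)

mean_accuracy = scores.mean()
print(mean_accuracy)
```

Writing the loop by hand, as in the exercise, makes the mechanics visible; the helper is shorter once those mechanics are understood.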

#### How important are the various features?

In [None]:
model = RandomForestClassifier(random_state=0).fit(features,target)
model.feature_importances_
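
The raw array is easier to read when each importance is paired with its feature name and sorted. The sketch below does this on synthetic data with placeholder names (`f0` to `f12` are hypothetical, not the dataset's columns); with the real data, `feature_names` would take their place. The importances reported by a fitted forest are normalized, so they sum to 1.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder names and synthetic data (illustrative only)
names = ['f%d' % i for i in range(13)]
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

forest = RandomForestClassifier(random_state=0).fit(X, y)

# Label each importance with its feature name, then rank from most to least important
importances = pd.Series(forest.feature_importances_, index=names)
ranked = importances.sort_values(ascending=False)

print(ranked)
print(importances.sum())
```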

#### Plot Feature importances

In [None]:
from bokeh.plotting import figure, show

p = figure(x_range=list(feature_names),
           title="Random Forest Feature Importance",
           x_axis_label='Heart Disease Features', y_axis_label='Feature Importance',
           width=600, height=600)
# one bar per feature, with height equal to its importance
p.vbar(x=list(feature_names), top=model.feature_importances_, width=0.8)
show(p)

#### Bonus: Repeat the classification with Support Vector Machine

In [None]:
from sklearn.svm import SVC
model = SVC(kernel='linear').fit
cross_validate(features, target, model, 10)

In [None]:
model = SVC(kernel='linear').fit(features,target)

p = figure(x_range=list(feature_names),
           title="Linear SVC Feature Importance",
           x_axis_label='Heart Disease Features', y_axis_label='Feature Importance',
           width=600, height=600)
# coef_ has shape (1, n_features) for a binary target, so flatten it to one value per feature
p.vbar(x=list(feature_names), top=model.coef_.ravel(), width=0.8)
show(p)

In [None]:
model = SVC(kernel='rbf').fit
cross_validate(features, target, model, 10)

Note: `coef_` is only available for a linear kernel. An SVC with a non-linear kernel (such as RBF) does not expose coefficients.
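
The difference can be checked directly. The sketch below fits both kernels on synthetic data (a stand-in for `features` and `target`) and tests whether each fitted model exposes `coef_`.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

linear_svc = SVC(kernel='linear').fit(X, y)
rbf_svc = SVC(kernel='rbf').fit(X, y)

# With a binary target the linear model holds one row of weights, one per feature
print(linear_svc.coef_.shape)

# Accessing coef_ on the RBF model raises AttributeError, so hasattr reports False
has_rbf_coef = hasattr(rbf_svc, 'coef_')
print(has_rbf_coef)
```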