
In [0]:
import pandas as pd                 # data frame
import numpy as np                  # matrix manipulation

### Ensemble Methods Tutorial
Reference: https://www.datacamp.com/community/tutorials/ensemble-learning-python

We will classify whether a cell is cancerous or not. The class labels are:

* 2 for non-cancerous, 4 for cancerous (raw data)
* 0 for non-cancerous, 1 for cancerous (after normalization)

In [0]:
# read in the data
url = 'https://github.com/vt-ai-ml/fall2019-meetings/raw/master/data/cancer.csv'
data = pd.read_csv(url)

data.drop(['Sample code'],axis = 1, inplace = True) # drop the sample code column
data.head()

### Preprocessing

Preprocessing is a crucial step in machine learning. Raw data is often not in a form that a model can learn from directly.

* Impute: fill in values when data is missing
* Normalize: rescale features when they have varying scales
    * especially important when using neural networks
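
The two steps above can be sketched on a tiny array. This is only an illustration with made-up values (the `toy` matrix is not part of the cancer dataset), assuming mean imputation and min-max scaling to [0, 1]:

```python
import numpy as np

# Toy feature matrix (made-up values); np.nan marks a missing entry.
toy = np.array([[1.0, 10.0],
                [3.0, np.nan],
                [5.0, 30.0]])

# Impute: replace each missing value with the mean of its column.
col_means = np.nanmean(toy, axis=0)
imputed = np.where(np.isnan(toy), col_means, toy)

# Normalize: rescale each column to [0, 1] with (x - min) / (max - min).
col_min = imputed.min(axis=0)
col_max = imputed.max(axis=0)
normalized = (imputed - col_min) / (col_max - col_min)

print(imputed)
print(normalized)
```

The same min-max formula is why the raw labels 2 and 4 become 0 and 1 after normalization.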

In [0]:
from sklearn import preprocessing   # scaling data
from sklearn import impute          # imputing missing values

data.replace('?', np.nan, inplace=True)   # mark missing entries ('?') as NaN

# Convert the DataFrame into a float NumPy array so the imputer can process it
values = data.values.astype(float)

# replace missing (NaN) values with the column mean
imputer = impute.SimpleImputer(strategy='mean')
imputedData = imputer.fit_transform(values)

# normalize our data so all values lie in [0, 1]
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
normalizedData = scaler.fit_transform(imputedData)

normalizedData

In [0]:
from sklearn import model_selection    # for cross validation

# Get our input values (X) and output values (Y)
X = normalizedData[:,0:9]
Y = normalizedData[:,9]

# split data into training data & testing data
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, test_size=0.25, random_state=0)

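# 10-fold cross validation; shuffle=True is needed for random_state to have an effect
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=0)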

### Voting Classifier

Each model casts a vote, and the class with the most votes wins (hard voting).
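
As a rough illustration of majority voting, the sketch below combines made-up 0/1 predictions from three hypothetical models. The values and the names `predictions` and `majority` are assumptions for the example, not output from the models trained in this notebook:

```python
import numpy as np

# Hypothetical 0/1 predictions from three models for five samples (made-up values).
predictions = np.array([
    [0, 1, 1, 0, 1],   # model A
    [0, 1, 0, 0, 1],   # model B
    [1, 1, 0, 0, 0],   # model C
])

# Count the votes for class 1 per sample; class 1 wins with a strict majority.
votes_for_one = predictions.sum(axis=0)
majority = (votes_for_one > predictions.shape[0] / 2).astype(int)

print(majority)
```

Using an odd number of models avoids ties in a two-class problem.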

In [0]:
from sklearn import tree            # decision trees
from sklearn import neighbors       # k nearest neighbors
from sklearn import linear_model    # logistic regression
from sklearn import ensemble        # ensemble methods

model1 = tree.DecisionTreeClassifier()
model2 = neighbors.KNeighborsClassifier()
model3 = linear_model.LogisticRegression(solver = 'lbfgs')

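# hard voting: each model predicts a class and the most common class wins
max_vote_model = ensemble.VotingClassifier(
    estimators=[('dt', model1), ('knn', model2), ('lr', model3)],
    voting='hard')
max_vote_model.fit(x_train, y_train)

# mean accuracy across the cross-validation folds
vote_score = model_selection.cross_val_score(max_vote_model, X, Y, cv=kfold).mean()
max_vote_model.predict(x_test)   # predicted labels for the held-out test set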

### Bagging

Train several base models on different random samples of the training data (drawn with replacement), then combine their predictions.
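
A minimal hand-rolled sketch of the idea, assuming decision trees as base models, made-up toy data, and a simple majority vote to combine them. scikit-learn's `BaggingClassifier` handles these steps internally and may aggregate differently, so treat this as an illustration only:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Made-up toy data: two features, label is 1 when their sum is positive.
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy.sum(axis=1) > 0).astype(int)

n_models = 5
models = []
for _ in range(n_models):
    # Bootstrap sample: draw n row indices with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    models.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Aggregate: each tree votes and the majority label wins.
all_preds = np.array([m.predict(X_toy) for m in models])
bagged_pred = (all_preds.sum(axis=0) > n_models / 2).astype(int)

print('training accuracy:', (bagged_pred == y_toy).mean())
```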

In [0]:
# bagging with KNN models
knn = neighbors.KNeighborsClassifier(9)   # k = 9 neighbors; bagging is generally good when k is low
# the base model is passed positionally because its keyword name differs across scikit-learn versions
bag = ensemble.BaggingClassifier(knn, n_estimators=3, random_state=0)

bag_score = model_selection.cross_val_score(bag, X, Y, cv=kfold).mean()

# Random Forests
rf = ensemble.RandomForestClassifier(n_estimators=9, random_state=0)

rf_score = model_selection.cross_val_score(rf, X, Y, cv=kfold).mean()

### Boosting

Train models in sequence: each new model is told to improve predictions for the data the previous models misclassified.
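
One round of this reweighting can be sketched as follows. It uses the classic binary AdaBoost update on made-up labels and predictions; scikit-learn's `AdaBoostClassifier` may differ in its details, so this is only meant to show how misclassified samples gain weight:

```python
import numpy as np

# Made-up labels and one weak learner's predictions for five samples.
y_true = np.array([1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 0, 1])   # the sample at index 1 is misclassified

# Start with uniform sample weights.
weights = np.full(len(y_true), 1 / len(y_true))

missed = y_pred != y_true
error = weights[missed].sum()                 # weighted error rate
alpha = 0.5 * np.log((1 - error) / error)     # this learner's say in the final vote

# Raise the weight of misclassified samples, lower the rest, then renormalize.
weights = weights * np.exp(np.where(missed, alpha, -alpha))
weights = weights / weights.sum()

print(weights)
```

The next model is then trained with these weights, so it pays more attention to the sample that was missed.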

In [0]:
# AdaBoost
boost = ensemble.AdaBoostClassifier(random_state=0)

boost_score = model_selection.cross_val_score(boost, X, Y, cv=kfold).mean()

### Results

Random Forests is a very strong model, and more advanced models build on it. In fact, tree-based models often outperform deep learning models on tabular business data.

In [0]:
print('Max Vote Classifier CV Score:  %f' % vote_score)
print('Bagging Classifier CV Score:   %f' % bag_score)
print('Random Forests CV Score:       %f' % rf_score)
print('AdaBoost Classifier CV Score:  %f' % boost_score)