# BME-230B: Homework 1

Your assignment for homework 1 is to redo the linear regression analysis, but using a classification method from SKLearn.

Goals and Requirements:
1. Select a classification method from [SKLearn](http://scikit-learn.org/)
    1. I'd recommend logistic regression or any forest method, as they are more intuitive. SVM is a much more difficult method to understand.
2. Write a short explanation of the method and how it works (look for explanations in the documentation, on YouTube, or elsewhere online).
3. Try to achieve the highest accuracy / estimator quality.

## Method
Method Selected: Random Forest Classifier

#### Short Description
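A random forest classifier trains many decision trees, each on a bootstrap sample of the training data and with a random subset of features considered at each split, and then combines their votes. In scikit-learn the forest's class probabilities are the mean of the probabilities predicted by the individual trees. The following is a minimal sketch of that averaging step; the synthetic data from `make_classification` is an assumption used for illustration and is not the assignment data set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (an assumption; not the assignment data set)
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree
tree_probas = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)

# Compare with the probabilities reported by the forest itself
forest_probas = forest.predict_proba(X)
print("Forest equals mean of trees:", np.allclose(tree_probas, forest_probas))
```

The predicted label is then the class with the highest averaged probability.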


## Classification
Create training/test splits and train the classifier

## Parse and Split input data

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import randint as spRand
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

# Parse input
df = pd.read_csv("data/breast-cancer-wisconsin.data.csv")

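# Re-encode every column as integer labels; the binary 'class' column becomes 0/1
encoder = preprocessing.LabelEncoder()
for col in df.columns:
    df[col] = encoder.fit_transform(df[col])

# Split the data set into train/test (80/20); no random_state is set, so the split differs between runs
(train, test) = train_test_split(df, test_size=0.2)

# Use every column as a feature except the identifier, the target, and two manually excluded features
features = list(df.keys())
features.remove('id')
features.remove('class')
features.remove('mitoses')
features.remove('clump-thickness')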


In [2]:
x = train[features]
y = train['class']

print(features)

['uniformity-of-cell-size', 'uniformity-of-cell-shape', 'marginal-adhesion', 'single-epithelial-cell-size', 'bare-nuclei', 'bland-chromatin', 'normal-nucleoli']
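One caveat about the re-encoding step above: `LabelEncoder` ranks the distinct values of a column in sorted order, which is harmless for integer columns but scrambles the order if a column is read as strings. The sketch below uses a hypothetical column containing a `?` placeholder; whether the assignment's CSV contains such values is an assumption, and `pd.to_numeric` is shown only as one possible alternative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column read as strings because of a '?' placeholder
raw = pd.Series(["2", "10", "1", "?"], name="example-feature")

# LabelEncoder sorts the distinct strings, so "10" is ranked below "2"
encoded = LabelEncoder().fit_transform(raw)
print("LabelEncoder:", list(encoded))

# pd.to_numeric keeps the numeric order and turns the placeholder into NaN
numeric = pd.to_numeric(raw, errors="coerce")
print("to_numeric:  ", numeric.tolist())
```

Rows with `NaN` would then need to be dropped or imputed before training.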


## Optimize Hyperparameters, Train Best Estimator, Score on Training Data

In [5]:
from sklearn.model_selection import RandomizedSearchCV

# Build a RandomizedSearchCV that randomly samples hyperparameter combinations;
# after fitting, best_estimator_ holds the highest-scoring configuration
def randomCV(features):
    clf = RandomForestClassifier()
    param_dist = {'bootstrap': [False, True],
                  'criterion': ['gini', 'entropy'],
                  'max_depth': [3, None],
                  'max_features': spRand(1, len(features) + 1),
                  'min_samples_leaf': spRand(1, 11),
                  'min_samples_split': spRand(2, 11),
                  'n_estimators': spRand(6, 128),
                  'warm_start': [False, True]}

    n_iter_search = 16
    random_search = RandomizedSearchCV(clf, param_distributions=param_dist, n_iter=n_iter_search)
    return random_search

# Use every feature in the list
random_search = randomCV(features)
random_search.fit(x, y)
# Note: this scores the refit best estimator on the training data itself
fitScore = random_search.score(x, y)
print('Bulk Feature Score', fitScore)

y_pred = random_search.best_estimator_.predict(test[features])

Bulk Feature Score 0.9660107334525939
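The score printed above is computed on the same data the search was fit on, so it tends to be optimistic. A fitted `RandomizedSearchCV` also exposes `best_params_` and `best_score_` (the mean cross-validated score of the best configuration), which can be compared against the training score. This is a minimal sketch on synthetic data, with a reduced search space and a small `n_iter` chosen for speed; none of these settings come from the notebook.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the training split (an assumption)
X, y = make_classification(n_samples=300, n_features=7, n_informative=4,
                           random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(6, 128), "max_depth": [3, None]},
    n_iter=4, cv=3, random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", search.best_score_)
print("Accuracy on the training data:", search.score(X, y))
```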


## Output Test Results

In [344]:
def predOutPut(features, y_pred):
    # Relies on the module-level `test` split and the fitted `random_search`
    test_out = test['class']
    print('Prediction with test data set including:', features)
    print('Accuracy', accuracy_score(test_out, y_pred))
    print(confusion_matrix(test_out, y_pred))
    # For binary labels (0 = benign, 1 = malignant) ravel() yields tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(test_out, y_pred).ravel()
    fpr = float(fp) / (fp + tn)
    fnr = float(fn) / (tp + fn)
    print("False positive rate: (predicting malignant while benign)", fpr)
    print("False negative rate: (predicting benign while malignant)", fnr)
    print('Feature contribution as normalized to 1')
    # Pair each feature name with its importance and sort from highest to lowest
    featRank = list(zip(features, random_search.best_estimator_.feature_importances_))
    featRank.sort(key=lambda x: x[1], reverse=True)
    for feat in featRank:
        print(feat[0], feat[1])


In [341]:
predOutPut(features, y_pred)

Prediction with test data set including: ['uniformity-of-cell-size', 'uniformity-of-cell-shape', 'marginal-adhesion', 'single-epithelial-cell-size', 'bare-nuclei', 'bland-chromatin', 'normal-nucleoli']
Accuracy 0.9285714285714286
[[79  7]
 [ 3 51]]
False positive rate: (predicting malignant while benign) 0.08139534883720931
False negative rate: (predicting benign while malignant) 0.05555555555555555
Feature contribution as normalized to 1
uniformity-of-cell-size 0.2348686816112566
single-epithelial-cell-size 0.18267169456425394
uniformity-of-cell-shape 0.17033922829548756
normal-nucleoli 0.12364416907209613
marginal-adhesion 0.12294417416392453
bare-nuclei 0.08325964310354993
bland-chromatin 0.08227240918943142
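As a worked check, the reported accuracy and error rates can be recomputed directly from the confusion matrix printed above. The sketch assumes scikit-learn's convention that rows are true labels and columns are predicted labels, with 0 as benign and 1 as malignant.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[79, 7],
               [3, 51]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()   # 130 correct out of 140
fpr = fp / (fp + tn)              # 7 of 86 benign samples flagged malignant
fnr = fn / (fn + tp)              # 3 of 54 malignant samples missed

print("Accuracy", accuracy)
print("False positive rate", fpr)
print("False negative rate", fnr)
```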


## Questions
What feature contributes most to the prediction? How can we tell?

The uniformity measurements (cell size, cell shape) consistently provide the highest contribution to this predictive model when all features are used. When low-performing features are removed, their rank is less stable, but still often quite high. The ranking comes from the `feature_importances_` attribute of the fitted classifier in scikit-learn, which is computed from the mean decrease in impurity.
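To illustrate how such a ranking is read off a fitted forest, here is a minimal sketch on synthetic data in which only the first two columns carry signal. The data set and the feature names are hypothetical and chosen so that the expected ranking is easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False keeps the two informative columns first (synthetic data, an assumption)
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)
names = ["informative-0", "informative-1", "noise-0", "noise-1", "noise-2", "noise-3"]

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = forest.feature_importances_

# Sort features from highest to lowest impurity-based importance
ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(name, round(score, 3))
print("Sum of importances:", importances.sum())
```

Because the importances are normalized to sum to 1, each value can be read as a share of the total impurity reduction.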

Explain in your own words the difference between regression and classification methods.

Classification methods produce discrete outputs that are direct predictions of the class label. Regression methods, on the other hand, produce continuous values, which must then be thresholded or binned into classes.
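The difference shows up when a classifier and a regressor are fit to the same 0/1 labels. This is a minimal sketch on synthetic data; the 0.5 threshold is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic two-class data with labels 0 and 1 (an assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

clf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)
reg = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

class_pred = clf.predict(X[:5])              # discrete class labels
reg_pred = reg.predict(X[:5])                # continuous values between 0 and 1
reg_labels = (reg_pred >= 0.5).astype(int)   # extra step: threshold into classes

print("Classifier output:", class_pred)
print("Regressor output: ", reg_pred)
print("Thresholded:      ", reg_labels)
```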

Is it best to use all the features or exclude some? Why do you think?

Manual exclusion of low-performing features seemed to improve the accuracy and the confusion matrix results. Features with no real predictive value add noise that the classifier can fit, which creates opportunities for misclassification.
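One way to test this more systematically is to compare cross-validated accuracy with and without the suspect features, rather than relying on a single train/test split. The sketch below does this on synthetic data where the noise columns are known in advance; the helper `mean_cv_accuracy` is hypothetical, and the outcome on the real data set may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Three informative columns followed by five pure-noise columns (synthetic, an assumption)
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

def mean_cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy using only the given columns."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X[:, columns], y, cv=5).mean()

all_score = mean_cv_accuracy(list(range(8)))
subset_score = mean_cv_accuracy([0, 1, 2])

print("All features:     ", all_score)
print("Informative only: ", subset_score)
```

Averaging over folds reduces the chance that an apparent improvement is an artifact of one particular split.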