# Final Project

## Predict whether a mammogram mass is benign or malignant

We'll be using the "mammographic masses" public dataset from the UCI repository (source: https://archive.ics.uci.edu/ml/datasets/Mammographic+Mass).

This data contains 961 instances of masses detected in mammograms, with the following attributes:


   1. BI-RADS assessment: 1 to 5 (ordinal)  
   2. Age: patient's age in years (integer)
   3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
   4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
   5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
   6. Severity: benign=0 or malignant=1 (binominal)
   
BI-RADS is an assessment of how confident the severity classification is; it is not a "predictive" attribute and so we will discard it. The age, shape, margin, and density attributes are the features that we will build our model with, and "severity" is the classification we will attempt to predict based on those attributes.

Although "shape" and "margin" are nominal data types, which sklearn typically doesn't deal with well, they are close enough to ordinal that we shouldn't just discard them. The "shape" for example is ordered increasingly from round to irregular.

A lot of unnecessary anguish and surgery arises from false positives in mammogram results. If we can build a better way to interpret them through supervised machine learning, it could improve a lot of lives.

## Your assignment

Apply several different supervised machine learning techniques to this data set, and see which one yields the highest accuracy as measured with K-Fold cross validation (K=10). Apply:

* Decision tree
* Random forest
* KNN
* Naive Bayes
* SVM
* Logistic Regression
* And, as a bonus challenge, a neural network using Keras.

The data needs to be cleaned; many rows contain missing data, and there may be erroneous data identifiable as outliers as well.

Remember that some techniques, such as SVM, also require the input data to be normalized first.

Many techniques also have "hyperparameters" that need to be tuned. Once you identify a promising approach, see if you can make it even better by tuning its hyperparameters.

I was able to achieve over 80% accuracy - can you beat that?

Below I've set up an outline of a notebook for this project, with some guidance and hints. If you're up for a real challenge, try doing this project from scratch in a new, clean notebook!


## Let's begin: prepare your data

Start by importing the mammographic_masses.data.txt file into a Pandas dataframe (hint: use read_csv) and take a look at it.

In [1]:
import json
import pandas as pd
import codecs

bet_data = json.load(codecs.open('/Users/weit/study/ML/bsAI/data0.json', 'r', 'utf-8-sig'))

#for bet in bet_data:
#    print(bet)

data = pd.DataFrame(bet_data)
data['legs']



0      [{'selection': {'eventID': None, 'eventTypeID'...
1      [{'selection': {'eventID': '5937002', 'eventTy...
2      [{'selection': {'eventID': '5936655', 'eventTy...
3      [{'selection': {'eventID': None, 'eventTypeID'...
4      [{'selection': {'eventID': None, 'eventTypeID'...
5      [{'selection': {'eventID': '5936956', 'eventTy...
6      [{'selection': {'eventID': '5937443', 'eventTy...
7      [{'selection': {'eventID': '5938855', 'eventTy...
8      [{'selection': {'eventID': '5436113', 'eventTy...
9      [{'selection': {'eventID': None, 'eventTypeID'...
10     [{'selection': {'eventID': None, 'eventTypeID'...
11     [{'selection': {'eventID': '5936655', 'eventTy...
12     [{'selection': {'eventID': '5934347', 'eventTy...
13     [{'selection': {'eventID': '5937443', 'eventTy...
14     [{'selection': {'eventID': '5936633', 'eventTy...
15     [{'selection': {'eventID': '5936443', 'eventTy...
16     [{'selection': {'eventID': '5936443', 'eventTy...
17     [{'selection': {'eventID

Make sure you use the optional parameters in read_csv to convert missing data (indicated by a ?) into NaN, and to add the appropriate column names (BI_RADS, age, shape, margin, density, and severity):
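As a rough sketch of the read_csv call described above, the snippet below parses a few illustrative rows from an in-memory buffer instead of the real file. The sample rows and the variable names `sample` and `masses_data` are assumptions made for this example; with the actual file you would pass its path in place of `sample`.

```python
import io
import pandas as pd

# Illustrative rows in the same comma-separated layout as the data file
# (an assumption for this sketch; the real file has 961 rows)
sample = io.StringIO(
    "5,67,3,5,3,1\n"
    "4,43,1,1,?,1\n"
    "5,58,4,5,3,1\n"
    "4,28,1,1,3,0\n"
    "5,74,1,5,?,1\n"
)

column_names = ["BI_RADS", "age", "shape", "margin", "density", "severity"]

# na_values turns every "?" into NaN; names supplies the missing header row
masses_data = pd.read_csv(sample, na_values=["?"], names=column_names)

print(masses_data.head())
print(masses_data.isnull().sum())
```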

In [2]:
data.count()

betCurrency                  316
betID                        316
betSource                    316
betTimestamp                 316
betType                      316
customer                     316
legs                         316
liabilityAmount              316
platformReferenceCurrency    316
sourceBetType                316
totalLegs                    316
totalStake                   316
dtype: int64

Evaluate whether the data needs cleaning; your model is only as good as the data it's given. Hint: use describe() on the dataframe.

In [3]:
data.loc[data['liabilityAmount'].str.contains("OK")]

| | betCurrency | betID | betSource | betTimestamp | betType | customer | legs | liabilityAmount | platformReferenceCurrency | sourceBetType | totalLegs | totalStake |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 150 | AUD | 1132567200 | {'channel': 'ONLINE', 'locationID': '10.11.95.... | 1542141043000 | SINGLE | {'customerIsElite': 'false', 'stakeFactor': '0... | [{'selection': {'eventID': '5932868', 'eventTy... | OK 126.50 1.0000 | AUD | SGL | 1 | 50.0 |
| 173 | AUD | 1132567205 | {'channel': 'ONLINE', 'locationID': '10.11.91.... | 1542141059000 | SINGLE | {'customerIsElite': 'false', 'stakeFactor': '0... | [{'selection': {'eventID': '5937443', 'eventTy... | OK 1451.60 1.0000 | AUD | SGL | 1 | 760.0 |


There are quite a few missing values in the data set. Before we just drop every row that's missing data, let's make sure we don't bias our data in doing so. Does there appear to be any sort of correlation to what sort of data has missing fields? If there were, we'd have to try and go back and fill that data in.
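One possible way to look for such a pattern is to pull out the incomplete rows and compare them with the full table. The sketch below uses a small made-up dataframe; the values and the names `df` and `rows_with_missing` are illustrative assumptions, not the real data.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the loaded dataframe, with a few NaN entries
df = pd.DataFrame({
    "age": [67, 43, np.nan, 28, 74, 65, 70, 42],
    "shape": [3, 1, 4, 1, 1, np.nan, 4, 1],
    "margin": [5, 1, 5, 1, 5, 1, np.nan, 1],
    "density": [3, np.nan, 3, 3, np.nan, 3, 3, 3],
    "severity": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows that have at least one missing field
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Number of missing values per column
print(df.isnull().sum())

# Compare the class balance of the incomplete rows with the whole table;
# a large gap would hint that dropping those rows could bias the data
print(df["severity"].mean(), rows_with_missing["severity"].mean())
```

If the incomplete rows look like a random sample of the table, dropping them is less likely to skew the result.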

In [4]:
cleaned_data = data[~data['liabilityAmount'].str.contains("OK")]
cleaned_data.count()

betCurrency                  314
betID                        314
betSource                    314
betTimestamp                 314
betType                      314
customer                     314
legs                         314
liabilityAmount              314
platformReferenceCurrency    314
sourceBetType                314
totalLegs                    314
totalStake                   314
dtype: int64

If the missing data seems randomly distributed, go ahead and drop rows with missing data. Hint: use dropna().
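A minimal sketch of the dropna() step, again on a made-up dataframe (the values and the names `df` and `cleaned` are assumptions for illustration). Calling describe() afterwards is a quick sanity check: every column should report the same count.

```python
import numpy as np
import pandas as pd

# Made-up dataframe with two incomplete rows
df = pd.DataFrame({
    "age": [67, 43, np.nan, 28],
    "density": [3, np.nan, 3, 3],
    "severity": [1, 1, 1, 0],
})

# dropna() returns a new dataframe without any row that contains NaN
cleaned = df.dropna()

# After dropping, every column has the same number of values
print(cleaned.describe())
```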

In [5]:
import statsmodels.api as sm
import numpy

numpy.random.seed(1234)

# Placeholder labels: random 0/1 classes, one per cleaned row
all_classes = numpy.random.randint(2, size=314)

all_features = cleaned_data[['totalLegs', 'totalStake', 'liabilityAmount']].values

# The columns hold numbers stored as strings, so cast them to float before fitting
# (assumes every remaining value parses as a number)
X_opt = cleaned_data[['totalLegs', 'totalStake', 'liabilityAmount']].astype(float).values
regressor_OLS = sm.OLS(endog=all_classes, exog=X_opt).fit()
regressor_OLS.summary()

Next you'll need to convert the Pandas dataframes into numpy arrays that can be used by scikit_learn. Create an array that extracts only the feature data we want to work with (age, shape, margin, and density) and another array that contains the classes (severity). You'll also need an array of the feature name labels.
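The extraction described above might look like the following sketch. The dataframe contents are made up for illustration; only the column names come from the text.

```python
import pandas as pd

# Made-up cleaned dataframe using the column names from the text
masses_data = pd.DataFrame({
    "BI_RADS": [5, 5, 4, 5],
    "age": [67, 58, 28, 57],
    "shape": [3, 4, 1, 1],
    "margin": [5, 5, 1, 5],
    "density": [3, 3, 3, 3],
    "severity": [1, 1, 0, 1],
})

feature_names = ["age", "shape", "margin", "density"]

# .values returns the underlying numpy array; BI_RADS is left out on purpose
all_features = masses_data[feature_names].values
all_classes = masses_data["severity"].values

print(all_features.shape, all_classes.shape)
```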

In [6]:
import numpy

all_features = cleaned_data[['totalLegs', 'totalStake', 'liabilityAmount']].values

numpy.random.seed(1234)

all_classes = numpy.random.randint(2, size=314)

print(all_classes)

feature_names = ['totalLegs', 'totalStake', 'liabilityAmount']

all_features

[1 1 0 1 0 0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 1 0 0 1 0 1 0
 0 0 1 1 1 0 1 1 0 1 0 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0 0 1 1 1 0 0 0 1 1 1 1
 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 0 0 1
 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 1
 0 1 0 1 1 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1
 0 0 1 1 0 0 1 1 0 0 1 1 0 1 0 0 1 1 1 0 0 1 1 1 0 0 0 1 1 0 0 1 1 1 0 1 1
 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0 0 1 0 1 0 0 0 0 0 1 1 1 1 1 0 1 1 1 1 0 1 1
 1 1 1 0 1 1 0 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 1 0 0 1 1 1 0 0 1 0 0 1 1
 1 1 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 1]


array([['1', '0.01', '0.01'],
       ['1', '5.00', '11.87'],
       ['1', '5.00', '15.00'],
       ['1', '2.00', '62.00'],
       ['1', '15.00', '32.25'],
       ['1', '0.50', '1.25'],
       ['3', '5.00', '8.47'],
       ['3', '20.00', '238.12'],
       ['1', '10.00', '210.00'],
       ['1', '5.00', '31.50'],
       ['7', '7.00', '1018.62'],
       ['1', '2.00', '9.00'],
       ['1', '45.00', '72.67'],
       ['2', '6.50', '9.09'],
       ['1', '3.00', '24.00'],
       ['1', '6.00', '9.99'],
       ['1', '5.00', '8.33'],
       ['2', '100.00', '841.00'],
       ['1', '2.00', '10.00'],
       ['1', '0.50', '1.18'],
       ['1', '2.00', '4.60'],
       ['1', '2.00', '30.00'],
       ['1', '2.00', '10.75'],
       ['1', '10.00', '17.60'],
       ['1', '10.00', '18.50'],
       ['1', '10.00', '20.70'],
       ['1', '10.00', '18.10'],
       ['1', '10.00', '18.80'],
       ['1', '7.50', '14.32'],
       ['1', '12.50', '22.25'],
       ['3', '5.00', '35.06'],
       ['1', '10.00', '16.60'],

In [7]:
import matplotlib.pyplot as plt
import seaborn as sn

# Cast to float first: corr() skips string columns and would return an empty table
corrData = cleaned_data[['totalLegs', 'totalStake', 'liabilityAmount']].astype(float).corr()
# Hide the cells above the diagonal so each pair of columns is shown once
mask = numpy.triu(numpy.ones_like(corrData, dtype=bool), k=1)
fig, ax = plt.subplots()

fig.set_size_inches(20, 10)

sn.heatmap(corrData, mask=mask, vmax=.8, square=True, annot=True)

Some of our models require the input data to be normalized, so go ahead and normalize the attribute data. Hint: use preprocessing.StandardScaler().
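To make concrete what this normalization does, here is a small sketch on made-up numbers (the `features` array is an assumption for illustration): after StandardScaler, each column has a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
from sklearn import preprocessing

# Made-up feature rows: age, shape, margin, density
features = np.array([
    [67.0, 3.0, 5.0, 3.0],
    [43.0, 1.0, 1.0, 2.0],
    [58.0, 4.0, 5.0, 3.0],
    [28.0, 1.0, 1.0, 1.0],
])

scaler = preprocessing.StandardScaler()
scaled = scaler.fit_transform(features)

# Per-column mean and standard deviation after scaling
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```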

In [8]:
from sklearn import preprocessing

scaler = preprocessing.StandardScaler()
all_features_scaled = scaler.fit_transform(all_features)
all_features_scaled



array([[-6.07254156e-01, -1.83243583e-01, -7.06145697e-02],
       [-6.07254156e-01, -1.57162235e-01, -7.01160070e-02],
       [-6.07254156e-01, -1.57162235e-01, -6.99844301e-02],
       [-6.07254156e-01, -1.72842404e-01, -6.80086758e-02],
       [-6.07254156e-01, -1.04895005e-01, -6.92592862e-02],
       [-6.07254156e-01, -1.80682489e-01, -7.05624434e-02],
       [ 4.98124424e-01, -1.57162235e-01, -7.02589339e-02],
       [ 4.98124424e-01, -7.87613897e-02, -6.06050617e-02],
       [-6.07254156e-01, -1.31028620e-01, -6.17871514e-02],
       [-6.07254156e-01, -1.57162235e-01, -6.92908142e-02],
       [ 2.70888158e+00, -1.46708789e-01, -2.77949280e-02],
       [-6.07254156e-01, -1.72842404e-01, -7.02366541e-02],
       [-6.07254156e-01,  5.19066864e-02, -6.75601375e-02],
       [-5.45648662e-02, -1.49322151e-01, -7.02328707e-02],
       [-6.07254156e-01, -1.67615681e-01, -6.96060942e-02],
       [-6.07254156e-01, -1.51935512e-01, -7.01950371e-02],
       [-6.07254156e-01, -1.57162235e-01

## Decision Trees

Before moving to K-Fold cross validation and random forests, start by creating a single train/test split of our data. Set aside 75% for training, and 25% for testing.

In [9]:
import numpy
from sklearn.model_selection import train_test_split

numpy.random.seed(8626)

(training_inputs, testing_inputs,
 training_classes, testing_classes) = train_test_split(
    all_features_scaled, all_classes, train_size=0.75, test_size=0.25, random_state=1)
print(training_inputs)

print(testing_inputs)

[[ 4.98124424e-01 -7.87613897e-02 -6.75416411e-02]
 [ 4.98124424e-01  3.39376454e-01  7.87793646e-02]
 [-6.07254156e-01 -1.57162235e-01 -6.99423928e-02]
 [-5.45648662e-02  3.39376454e-01 -3.52615980e-02]
 [-6.07254156e-01 -7.87613897e-02 -6.86812730e-02]
 [-6.07254156e-01  2.57730712e-02 -6.71342994e-02]
 [ 1.05081371e+00 -5.26277744e-02 -5.94360037e-02]
 [ 1.60350300e+00 -4.74010514e-02 -6.14239489e-02]
 [-6.07254156e-01 -1.80682489e-01 -7.05519340e-02]
 [-6.07254156e-01 -1.78069127e-01 -7.05195653e-02]
 [ 1.60350300e+00 -1.04895005e-01 -7.06149900e-02]
 [-6.07254156e-01  6.00712606e-01 -5.73732320e-02]
 [-6.07254156e-01 -1.80682489e-01 -7.05624434e-02]
 [-5.45648662e-02 -1.67615681e-01 -7.06149900e-02]
 [ 1.05081371e+00 -1.31028620e-01  7.27184227e-02]
 [ 4.98124424e-01  7.80403016e-02 -6.60879903e-02]
 [-6.07254156e-01  7.80403016e-02 -6.68316306e-02]
 [ 4.98124424e-01 -1.51935512e-01 -7.06149900e-02]
 [-6.07254156e-01 -1.31028620e-01 -6.97322062e-02]
 [ 4.98124424e-01 -1.31028620e-

Now create a DecisionTreeClassifier and fit it to your training data.

In [10]:
from sklearn.tree import DecisionTreeClassifier

classifier = DecisionTreeClassifier(random_state=1)

classifier.fit(training_inputs, training_classes)



DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1,
            splitter='best')

Display the resulting decision tree.

In [11]:
from IPython.display import Image
from io import StringIO  # sklearn.externals.six is gone from recent scikit-learn releases
from sklearn import tree
from pydot import graph_from_dot_data

# Rendering requires the Graphviz "dot" executable to be installed and on the PATH
dot_data = StringIO()
tree.export_graphviz(classifier, out_file=dot_data, feature_names=feature_names)
graph = graph_from_dot_data(dot_data.getvalue())[0]
Image(graph.create_png())

Exception: "dot" not found in path.

Measure the accuracy of the resulting decision tree model using your test data.

In [13]:
predictions = classifier.predict(testing_inputs)
print(testing_classes)
print(predictions)
classifier.score(testing_inputs, testing_classes)

[1 1 1 1 1 0 1 1 0 1 1 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0 1 0 0
 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 0 0 0 0 1 0 1 0 1 1
 0 0 1 1 1]
[1 0 1 0 1 1 0 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 0 1 0 0 1 1 1 0 1 1 0 0
 0 1 1 0 0 1 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 1 1 0 1 1 0 0 1 1
 0 0 1 1 1]


0.5569620253164557

Now instead of a single train/test split, use K-Fold cross validation to get a better measure of your model's accuracy (K=10). Hint: use model_selection.cross_val_score

In [128]:
from sklearn.model_selection import cross_val_score

classifier = DecisionTreeClassifier(random_state=1)

cv_scores = cross_val_score(classifier, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.5633339442815248

Now try a RandomForestClassifier instead. Does it perform better?

In [129]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, random_state=1)

    
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.5601020283479962

## SVM

Next try using svm.SVC with a linear kernel. How does it compare to the decision tree?

In [130]:
from sklearn import svm

C = 1.0
svc = svm.SVC(kernel='linear', C=C)

In [131]:
cv_scores = cross_val_score(svc, all_features_scaled, all_classes, cv=10)

cv_scores.mean()

0.506152248289345

## KNN
How about K-Nearest-Neighbors? Hint: use neighbors.KNeighborsClassifier - it's a lot easier than implementing KNN from scratch like we did earlier in the course. Start with a K of 10. K is an example of a hyperparameter - a parameter on the model itself which may need to be tuned for best results on your particular data set.

In [132]:
from sklearn import neighbors
clf = neighbors.KNeighborsClassifier(n_neighbors=10)
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)
cv_scores.mean()

0.5131140029325513

Choosing K is tricky, so we can't discard KNN until we've tried different values of K. Write a for loop to run KNN with K values ranging from 1 to 50 and see if K makes a substantial difference. Make a note of the best performance you could get out of KNN.

In [133]:
for n in range(1, 51):  # range excludes its end value, so 51 covers K = 1..50
    clf = neighbors.KNeighborsClassifier(n_neighbors=n)
    cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)
    print(n, cv_scores.mean())

1 0.5313599706744867
2 0.4936339198435973
3 0.5163214809384165
4 0.49022482893450636
5 0.5005895650048876
6 0.4962854349951124
7 0.5097751710654936
8 0.49346285434995113
9 0.5189668866080156
10 0.5131140029325513
11 0.5349890029325512
12 0.5253115835777126
13 0.5283296676441838
14 0.5192509775171066
15 0.48709371945259045
16 0.5030272482893451
17 0.49365224828934495
18 0.5127046676441838
19 0.5192570869990224
20 0.5065615835777125
21 0.5387127321603128
22 0.5516220674486803
23 0.5387127321603129
24 0.5418438416422287
25 0.5419385386119258
26 0.5514204545454545
27 0.5194525904203322
28 0.532261119257087
29 0.5291361192570869
30 0.5194586999022481
31 0.5132025904203322
32 0.5193578934506353
33 0.5387066226783969
34 0.5354869257086998
35 0.5131964809384164
36 0.522974706744868
37 0.5293255131964809
38 0.5263960166177908
39 0.5289283968719453
40 0.5262952101661779
41 0.5229624877810363
42 0.5390915200391007
43 0.5200329912023461
44 0.5166116813294234
45 0.5135813782991202
46 0.500778958944

## Naive Bayes

Now try naive_bayes.MultinomialNB. How does its accuracy stack up? Hint: you'll need to use MinMaxScaler to get the features in the range MultinomialNB requires.

In [134]:
from sklearn.naive_bayes import MultinomialNB

scaler = preprocessing.MinMaxScaler()

all_features_minmax = scaler.fit_transform(all_features)

clf = MultinomialNB()

# Use the min-max-scaled features: MultinomialNB rejects negative inputs
cv_scores = cross_val_score(clf, all_features_minmax, all_classes, cv=10)

cv_scores.mean()

## Revisiting SVM

svm.SVC may perform differently with different kernels. The choice of kernel is an example of a "hyperparameter." Try the rbf, sigmoid, and poly kernels and see what the best-performing kernel is. Do we have a new winner?

In [135]:
def svm_score_from_kernel(ker, normalized_all_features, all_classes, cross_val=10):
    svc = svm.SVC(kernel=ker, C=1.0)
    cv_scores = cross_val_score(svc, normalized_all_features, all_classes, cv=cross_val)
    return cv_scores.mean()

In [136]:
svm_score_from_kernel('rbf', all_features_scaled, all_classes, 10)



0.46774499022482896

In [137]:
svm_score_from_kernel('sigmoid', all_features_scaled, all_classes, 10)



0.5647146871945259

In [138]:
svm_score_from_kernel('poly', all_features_scaled, all_classes, 10)



0.5190554740957966
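The kernel runs above can also be expressed as a single grid search. The sketch below is one possible way to do that with scikit-learn's GridSearchCV; it runs on synthetic data from make_classification, and the grid of `C` values is an assumption chosen for illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the scaled features and classes
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X = StandardScaler().fit_transform(X)

# Every kernel/C combination is scored with 10-fold cross validation
param_grid = {"kernel": ["linear", "rbf", "sigmoid", "poly"], "C": [0.1, 1.0, 10.0]}
search = GridSearchCV(SVC(), param_grid, cv=10, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```

Swapping in `all_features_scaled` and `all_classes` for `X` and `y` would apply the same search to the notebook's own arrays.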

## Logistic Regression

We've tried all these fancy techniques, but fundamentally this is just a binary classification problem. Try Logistic Regression, which is a simple way of tackling this sort of thing.

In [139]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
cv_scores = cross_val_score(clf, all_features_scaled, all_classes, cv=10)
cv_scores.mean()



0.5090817448680351

## Neural Networks

As a bonus challenge, let's see if an artificial neural network can do even better. You can use Keras to set up a neural network with 1 binary output neuron and see how it performs. Don't be afraid to run a large number of epochs to train the model if necessary.

In [140]:
from tensorflow.keras.layers import Dense
from tensorflow.keras.models import Sequential

def create_model():
    model = Sequential()
    # One input per feature column, going into a 6-unit layer (more does not seem to help - in fact you can go down to 4)
    model.add(Dense(6, input_dim=all_features_scaled.shape[1], kernel_initializer='normal', activation='relu'))
    # "Deep learning" turns out to be unnecessary - this additional hidden layer doesn't help either.
    #model.add(Dense(4, kernel_initializer='normal', activation='relu'))
    # Output layer with a binary classification (benign or malignant)
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
    # Compile model; adam seemed to work best
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [141]:
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score

estimator = KerasClassifier(build_fn=create_model, epochs=100, verbose=0)

cv_scores = cross_val_score(estimator, all_features_scaled, all_classes, cv=10)
print(cv_scores.mean())

## Do we have a winner?

Which model, and which choice of hyperparameters, performed the best? Feel free to share your results!
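One way to answer that question side by side is to score every candidate with the same 10-fold cross validation and sort the results. The sketch below does this on synthetic data from make_classification; the model settings and the names `models` and `results` are assumptions for illustration. Like the cells above, it scales the whole data set before cross validation, which is a simplification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled features and classes
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
X = StandardScaler().fit_transform(X)

models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=1),
    "KNN (K=10)": KNeighborsClassifier(n_neighbors=10),
    "SVM (linear)": SVC(kernel="linear", C=1.0),
    "logistic regression": LogisticRegression(),
}

# Mean 10-fold cross-validated accuracy per model
results = {name: cross_val_score(model, X, y, cv=10).mean()
           for name, model in models.items()}

# Print the models from best to worst
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```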