## Neural Network Introduction #2

This exercise is adapted from https://www.springboard.com/blog/beginners-guide-neural-network-in-python-scikit-learn-0-18/

We'll use the Breast Cancer data set, which is also built into scikit-learn and is read here from a local CSV copy. It has several features of tumors with a labeled class indicating whether the tumor was malignant or benign. We will try to create a neural network model that can take in these features and predict malignant or benign labels for tumors it has not seen before. Let's start by getting the data!

In [2]:
#from sklearn.datasets import load_breast_cancer
#cancer2 = load_breast_cancer()
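
For comparison, the built-in copy of the data set can be loaded as pandas objects. This is a minimal sketch, assuming a scikit-learn version that supports `as_frame=True`; the names `features` and `target` are illustrative. Note that the built-in version uses different column names (for example `mean radius`) and encodes malignant as 0 and benign as 1, which differs from the CSV used in the rest of this notebook.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays
cancer = load_breast_cancer(as_frame=True)
features = cancer.data    # DataFrame with the numeric feature columns
target = cancer.target    # Series of integer class labels

print(features.shape)
# target_names[i] is the class name for label i
print(list(cancer.target_names))
```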

Check out the dataframe - what are the first few rows of data?

In [14]:
# find out the attributes in the dataset
import pandas as pd

data_path = r"Data/breast.csv"
df_cancer = pd.read_csv(data_path)
df_cancer.head()

| | id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | Unnamed: 32 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | |
| 1 | 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | |
| 2 | 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | |
| 3 | 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | |
| 4 | 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | |


In [15]:
# find out the total instances and number of features
print(f"instances: {df_cancer.shape[0]}, features: {df_cancer.shape[1]}")

instances: 569, features: 33


In [16]:
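# drop the empty trailing column and the id column, which are not tumor features
df_cancer.drop(columns=['Unnamed: 32', 'id'], inplace=True)
# drop any rows that still contain missing values
df_cancer.dropna(inplace=True)
# separate the class labels (M = malignant, B = benign) from the features
Y = df_cancer.pop('diagnosis')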

In [17]:
# use describe to find out more about the data
df_cancer.describe()

| | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | symmetry_mean | fractal_dimension_mean | ... | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | ... | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 | 569.0 |
| mean | 14.127292 | 19.289649 | 91.969033 | 654.889104 | 0.09636 | 0.104341 | 0.088799 | 0.048919 | 0.181162 | 0.062798 | ... | 16.26919 | 25.677223 | 107.261213 | 880.583128 | 0.132369 | 0.254265 | 0.272188 | 0.114606 | 0.290076 | 0.083946 |
| std | 3.524049 | 4.301036 | 24.298981 | 351.914129 | 0.014064 | 0.052813 | 0.07972 | 0.038803 | 0.027414 | 0.00706 | ... | 4.833242 | 6.146258 | 33.602542 | 569.356993 | 0.022832 | 0.157336 | 0.208624 | 0.065732 | 0.061867 | 0.018061 |
| min | 6.981 | 9.71 | 43.79 | 143.5 | 0.05263 | 0.01938 | 0.0 | 0.0 | 0.106 | 0.04996 | ... | 7.93 | 12.02 | 50.41 | 185.2 | 0.07117 | 0.02729 | 0.0 | 0.0 | 0.1565 | 0.05504 |
| 25% | 11.7 | 16.17 | 75.17 | 420.3 | 0.08637 | 0.06492 | 0.02956 | 0.02031 | 0.1619 | 0.0577 | ... | 13.01 | 21.08 | 84.11 | 515.3 | 0.1166 | 0.1472 | 0.1145 | 0.06493 | 0.2504 | 0.07146 |
| 50% | 13.37 | 18.84 | 86.24 | 551.1 | 0.09587 | 0.09263 | 0.06154 | 0.0335 | 0.1792 | 0.06154 | ... | 14.97 | 25.41 | 97.66 | 686.5 | 0.1313 | 0.2119 | 0.2267 | 0.09993 | 0.2822 | 0.08004 |
| 75% | 15.78 | 21.8 | 104.1 | 782.7 | 0.1053 | 0.1304 | 0.1307 | 0.074 | 0.1957 | 0.06612 | ... | 18.79 | 29.72 | 125.4 | 1084.0 | 0.146 | 0.3391 | 0.3829 | 0.1614 | 0.3179 | 0.09208 |
| max | 28.11 | 39.28 | 188.5 | 2501.0 | 0.1634 | 0.3454 | 0.4268 | 0.2012 | 0.304 | 0.09744 | ... | 36.04 | 49.54 | 251.2 | 4254.0 | 0.2226 | 1.058 | 1.252 | 0.291 | 0.6638 | 0.2075 |


**Q:** what can you say about this dataset?

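-> The features are on very different scales: area_mean ranges from 143.5 to 2501, while smoothness_mean stays between about 0.05 and 0.16. Every column has a count of 569, so no values are missing.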

Now, set up the data (x) and labels (y)


In [18]:
data = df_cancer
labels = Y

In [24]:
labels

0      M
1      M
2      M
3      M
4      M
      ..
564    M
565    M
566    M
567    M
568    B
Name: diagnosis, Length: 569, dtype: object

#### Train Test Split
 
Let's split our data into training and testing sets. This is done easily with scikit-learn's train_test_split function from model_selection:

In [19]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)
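
The split above is random, so the numbers change on every run. A minimal sketch of a reproducible, class-balanced variant follows. It assumes the built-in copy of the data set instead of the CSV, and `random_state=42` and `stratify=y` are illustrative additions that the original exercise does not use.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: built-in data set stands in for the CSV used in this notebook
X, y = load_breast_cancer(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class ratio
# (nearly) identical in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())  # share of label 1 in each split
```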

#### Data Preprocessing
 
The neural network may have difficulty converging before the maximum number of iterations allowed if the data is not normalized. The Multi-layer Perceptron is sensitive to feature scaling, so it is highly recommended to scale your data. Note that you must apply the same scaling to the test set for meaningful results. There are many normalization methods; we will use the built-in StandardScaler, which standardizes each feature to zero mean and unit variance.

In [27]:
# Import the StandardScaler and LabelEncoder classes
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
standard_scaler = StandardScaler()
label_encoder = LabelEncoder()

# Fit only to the training data
standard_scaler.fit(x_train)
label_encoder.fit(y_train)



LabelEncoder()

In [28]:
# Now apply the transformations to the data:
x_train_scaled = standard_scaler.transform(x_train)
x_test_scaled = standard_scaler.transform(x_test)
y_train_encoded = label_encoder.transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
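
A quick sanity check of the scaling step can be sketched as follows. It assumes the built-in copy of the data set and an arbitrary `random_state=0`; the variable names mirror the cells above but are otherwise independent of them. After standardization the training columns should have mean close to 0 and standard deviation close to 1, while the test columns are only approximately centered because the scaler was fit on the training data alone.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: built-in data set stands in for the CSV used in this notebook
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit on the training data only, then apply the same transform to both sets
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Training columns: mean ~0 and standard deviation ~1 by construction
train_means_ok = np.allclose(X_train_scaled.mean(axis=0), 0.0, atol=1e-8)
train_stds_ok = np.allclose(X_train_scaled.std(axis=0), 1.0)
print(train_means_ok, train_stds_ok)

# Test columns use the training statistics, so they are only roughly centered
print(X_test_scaled.mean(axis=0)[:3])
```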

#### Training the model
 
Now it is time to train our model. scikit-learn makes this easy by using estimator objects. In this case we will import our estimator (the Multi-Layer Perceptron Classifier model) from scikit-learn's neural_network module.

In [22]:
from sklearn.neural_network import MLPClassifier



Next we create an instance of the model. There are many parameters you can customize here; the most important one is hidden_layer_sizes. For this parameter you pass in a tuple consisting of the number of neurons you want at each layer, where the nth entry in the tuple represents the number of neurons in the nth hidden layer of the MLP model. There are many ways to choose these numbers. One simple choice is 3 layers with the same number of neurons as there are features in our data set, i.e. hidden_layer_sizes=(30, 30, 30). The cell below instead keeps scikit-learn's defaults, which means a single hidden layer of 100 neurons:

In [23]:
# create a multi-layer perceptron classifier and call it mlp
mlp = MLPClassifier()
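
The effect of the hidden_layer_sizes tuple can be illustrated with a small sketch. It assumes the built-in copy of the data set (30 features) and fits on the full data purely to show the resulting layer shapes; `mlp_three_layers`, `max_iter=1000`, and `random_state=0` are illustrative choices, not part of the original exercise.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: built-in data set stands in for the CSV used in this notebook
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Three hidden layers with 30 neurons each (one neuron per input feature)
mlp_three_layers = MLPClassifier(
    hidden_layer_sizes=(30, 30, 30), max_iter=1000, random_state=0
)
mlp_three_layers.fit(X_scaled, y)

# One weight matrix per connection between consecutive layers
layer_shapes = [w.shape for w in mlp_three_layers.coefs_]
print(layer_shapes)
```

Each tuple entry adds one hidden layer, so three entries give four weight matrices: input to hidden 1, hidden 1 to 2, hidden 2 to 3, and hidden 3 to the output.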

Now that the model has been created, we can fit it to the training data. Remember that this data has already been processed and scaled:

In [29]:
mlp.fit(x_train_scaled, y_train_encoded)

MLPClassifier()

**Q:** What do you see in the output? What does it tell you?

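-> fit() returns the fitted MLPClassifier object itself, and the notebook prints its representation. Recent scikit-learn versions show only parameters that differ from the defaults, so MLPClassifier() indicates that the model was trained with all default settings.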

#### Predictions and Evaluation
 
Now that we have a model, it is time to use it to get predictions! We can do this with the predict() method of our fitted model:

In [36]:
predictions = mlp.predict(x_test_scaled)

Now we can use scikit-learn's built-in metrics, such as a classification report and confusion matrix, to evaluate how well our model performed:

In [37]:
from sklearn.metrics import classification_report, confusion_matrix

confusion = confusion_matrix(y_test_encoded, predictions)
report = classification_report(y_test_encoded, predictions)

print(confusion)
print(report)

[[66  1]
 [ 0 47]]
              precision    recall  f1-score   support

           0       1.00      0.99      0.99        67
           1       0.98      1.00      0.99        47

    accuracy                           0.99       114
   macro avg       0.99      0.99      0.99       114
weighted avg       0.99      0.99      0.99       114



**Q:** what conclusion can you make from the confusion matrix?

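-> The MLP model misclassified only 1 of the 114 test samples. LabelEncoder sorts the classes alphabetically, so 0 is benign (B) and 1 is malignant (M): 66 benign and 47 malignant tumors were classified correctly, one benign tumor was predicted as malignant, and no malignant tumor was missed.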
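
The reading of the confusion matrix can be checked with a few lines of arithmetic. This sketch hard-codes the matrix printed above and uses only the row/column convention of scikit-learn's confusion_matrix (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# LabelEncoder orders classes alphabetically: index 0 -> 'B', index 1 -> 'M'
classes = list(LabelEncoder().fit(["M", "B"]).classes_)
print(classes)

# Confusion matrix copied from the output above
confusion = np.array([[66, 1],
                      [0, 47]])

# For a 2x2 matrix, ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion.ravel()

accuracy = (tp + tn) / confusion.sum()
precision = tp / (tp + fp)   # of the predicted malignant, how many were malignant
recall = tp / (tp + fn)      # of the actual malignant, how many were found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```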

#### Weights and biases

The downside of using a Multi-Layer Perceptron model, however, is how difficult it is to interpret the model itself. The weights and biases won't be easily interpretable in relation to which features are important to the model.

To extract the MLP weights and biases after training your model, you use its public attributes coefs_ and intercepts_.

In [38]:
# Print the weight matrices (one per layer) and interpret them
print(mlp.coefs_)



[array([[-0.08211435,  0.10386987,  0.09058834, ..., -0.09380102,
         0.13095185,  0.24710683],
       [ 0.24821863,  0.18512779,  0.08353042, ...,  0.14741224,
         0.17986944,  0.27029771],
       [-0.19997588,  0.18865092,  0.12774989, ..., -0.14479984,
        -0.03567909,  0.00345076],
       ...,
       [ 0.08212577,  0.24357063, -0.03791968, ..., -0.09736177,
         0.07632589, -0.04045178],
       [-0.19158365, -0.23027418,  0.22878812, ...,  0.00611657,
         0.09522059,  0.31393152],
       [ 0.03548574,  0.2536377 ,  0.09812305, ..., -0.01799539,
         0.01685768, -0.28865469]]), array([[ 0.50117954],
       [ 0.29572852],
       [ 0.20757281],
       [-0.2687876 ],
       [-0.12168653],
       [ 0.17218975],
       [-0.16097806],
       [-0.16386283],
       [ 0.15557439],
       [-0.43908778],
       [ 0.30013458],
       [-0.31123639],
       [-0.15308637],
       [ 0.28801845],
       [-0.23852941],
       [ 0.2985864 ],
       [ 0.25296528],
       [-0.

In [39]:
# Print the bias vectors (one per layer) and interpret them
print(mlp.intercepts_)

[array([ 8.52200257e-02, -5.91072729e-02, -2.18608078e-01,  2.17097181e-01,
        2.41527819e-01,  6.92117063e-02,  1.39581337e-01,  1.36560864e-01,
        7.87311411e-02,  1.92662123e-01,  3.45327316e-05,  2.05385356e-01,
        5.37227521e-02,  1.62755143e-01,  2.10265859e-01,  9.69122811e-02,
       -3.39084232e-02, -3.92318652e-02,  8.64154030e-02,  1.36114997e-01,
       -1.27318746e-01,  3.53357836e-01, -3.12014919e-02,  3.91578690e-02,
        5.65066413e-02,  1.09461445e-01,  1.18746287e-01,  1.05001573e-01,
        8.69648721e-02,  3.34636102e-01, -2.15087196e-01,  2.35716055e-02,
        4.94797757e-02,  1.95071047e-01,  2.70156817e-01,  1.85158172e-01,
       -1.53956180e-01,  4.48038385e-02,  1.60866200e-02,  1.05750700e-01,
       -2.48983954e-01, -7.90968684e-02,  7.91537286e-02, -3.30097780e-02,
        2.04664130e-02, -8.07687109e-02,  2.96954990e-01,  9.97711731e-02,
       -4.27824844e-03,  7.39138568e-02,  1.87903198e-01,  1.78802519e-01,
       -2.23148363e-02, 

**Q:** What do you understand from the two values?

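-> On their own, the numbers are hard to interpret. coefs_ is a list of weight matrices, one per layer, where entry [i, j] of a matrix is the weight between neuron i of one layer and neuron j of the next. intercepts_ is the matching list of bias vectors. Individual weights do not map directly to feature importance.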
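
Printing the shapes rather than the raw values makes the structure easier to see. The following is a minimal sketch, assuming the built-in copy of the data set (30 features) and a default MLPClassifier with a single hidden layer of 100 neurons; `random_state=0` and `max_iter=500` are illustrative settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: built-in data set stands in for the CSV used in this notebook
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Default architecture: one hidden layer with 100 neurons
mlp_default = MLPClassifier(random_state=0, max_iter=500).fit(X_scaled, y)

# coefs_[i] has shape (neurons in layer i, neurons in layer i + 1);
# intercepts_[i] holds one bias per neuron in layer i + 1
for i, (weights, biases) in enumerate(
    zip(mlp_default.coefs_, mlp_default.intercepts_)
):
    print(f"layer {i}: weights {weights.shape}, biases {biases.shape}")
```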

#### Additional optional tasks...

Select a few well-known supervised techniques and compare their performance. Use 10-fold cross-validation.

In [40]:
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, classification_report

seed = 1234

def train_and_test_model(x_train, y_train, x_test, y_test, model):
    # 10-fold stratified cross-validation on the training set only
    cv = StratifiedKFold(10, shuffle=True, random_state=seed)
    score = cross_val_score(model, x_train, y_train, cv=cv)
    print(score)  # one accuracy value per fold

    # refit on the full training set and evaluate on the held-out test set
    model.fit(x_train, y_train)
    predictions = model.predict(x_test)
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))

##### Decision trees


In [42]:
dt = DecisionTreeClassifier()
train_and_test_model(x_train_scaled, y_train_encoded, x_test_scaled, y_test_encoded, dt)

[0.84782609 0.91304348 0.95652174 0.93478261 0.93478261 0.91111111
 0.95555556 0.97777778 0.97777778 0.93333333]
[[54 13]
 [ 4 43]]
              precision    recall  f1-score   support

           0       0.93      0.81      0.86        67
           1       0.77      0.91      0.83        47

    accuracy                           0.85       114
   macro avg       0.85      0.86      0.85       114
weighted avg       0.86      0.85      0.85       114



##### Random forest

In [43]:
rf = RandomForestClassifier()
train_and_test_model(x_train_scaled, y_train_encoded, x_test_scaled, y_test_encoded, rf)

[0.91304348 0.91304348 0.95652174 1.         0.97826087 0.97777778
 1.         0.95555556 0.97777778 0.93333333]
[[64  3]
 [ 1 46]]
              precision    recall  f1-score   support

           0       0.98      0.96      0.97        67
           1       0.94      0.98      0.96        47

    accuracy                           0.96       114
   macro avg       0.96      0.97      0.96       114
weighted avg       0.97      0.96      0.97       114



##### Gradient boosting

In [44]:
gbt = GradientBoostingClassifier()
train_and_test_model(x_train_scaled, y_train_encoded, x_test_scaled, y_test_encoded, gbt)

[0.91304348 0.95652174 0.97826087 1.         0.95652174 0.93333333
 1.         0.95555556 0.97777778 0.97777778]
[[64  3]
 [ 2 45]]
              precision    recall  f1-score   support

           0       0.97      0.96      0.96        67
           1       0.94      0.96      0.95        47

    accuracy                           0.96       114
   macro avg       0.95      0.96      0.95       114
weighted avg       0.96      0.96      0.96       114
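
To compare the models on a single number, the per-fold scores can be summarized by their mean and standard deviation. This is a sketch under stated assumptions: it uses the built-in copy of the data set, wraps each model in a pipeline so the scaler is refit inside every fold (a variation on the cells above, which scale once before cross-validation), and the `models` and `results` dictionaries are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Assumption: built-in data set stands in for the CSV used in this notebook
X, y = load_breast_cancer(return_X_y=True)

seed = 1234
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

models = {
    "decision tree": DecisionTreeClassifier(random_state=seed),
    "random forest": RandomForestClassifier(random_state=seed),
    "gradient boosting": GradientBoostingClassifier(random_state=seed),
}

results = {}
for name, model in models.items():
    # The pipeline refits the scaler on each training fold
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=cv)  # accuracy per fold
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Using the same `cv` object for every model means all three are evaluated on identical folds, which keeps the comparison fair.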

