# Classification with Neural Networks

**Neural networks** are a powerful family of machine learning algorithms. A neural network uses one or more **hidden layers**, each containing multiple **hidden units**, to perform **function approximation**. Using multiple hidden units in one or more layers allows neural networks to approximate complex functions. Neural network models capable of approximating complex functions are said to have high **model capacity**. This property allows neural networks to solve complex machine learning problems.

However, because of the large number of hidden units, neural networks have many **weights** or **parameters**. This situation often leads to **over-fitting** of neural network models, which limits generalization. Thus, finding optimal hyperparameters when fitting neural network models is essential for good performance. 

An additional issue with neural networks is **computational complexity**. Many optimization iterations are required. Each optimization iteration requires the update of a large number of parameters.  

## Example: German Credit Dataset

You will try a more complex example using the credit scoring data. You will use the prepared data, to which the following preprocessing has been applied:
1. Cleaning missing values.
2. Aggregating categories of certain categorical variables. 
3. Encoding categorical variables as binary dummy variables.
4. Standardizing numeric variables. 
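
The last two preprocessing steps can be sketched as follows. This is only an illustration on hypothetical toy columns (`purpose` and `loan_amount` are made-up names and values), not the code that produced the prepared credit files.

```python
import numpy as np

# Hypothetical toy columns, not the actual credit data
purpose = np.array(['car', 'furniture', 'car', 'education'])
loan_amount = np.array([1200.0, 5400.0, 800.0, 3000.0])

# Encode the categorical variable as binary dummy variables,
# one column per category
categories = np.unique(purpose)  # sorted unique category names
dummies = (purpose[:, None] == categories[None, :]).astype(float)

# Standardize the numeric variable to zero mean and unit standard deviation
scaled_amount = (loan_amount - loan_amount.mean()) / loan_amount.std()

# Combine into a single numeric feature array
features = np.column_stack([dummies, scaled_amount])
print(categories)
print(features.shape)  # (4, 4)
```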

As a first step, execute the code in the cell below to load the required packages to run the rest of this notebook. 

In [2]:
from sklearn.neural_network import MLPClassifier
from sklearn import preprocessing
#from statsmodels.api import datasets
from sklearn import datasets ## Get dataset from sklearn
import sklearn.model_selection as ms
import sklearn.metrics as sklm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import numpy.random as nr

%matplotlib inline

Execute the code in the cell below to load the features and labels as numpy arrays for the example. 

In [3]:
Features = np.array(pd.read_csv('Credit_Features.csv'))
Labels = np.array(pd.read_csv('Credit_Labels.csv'))
Labels = Labels.reshape(Labels.shape[0],)
print(Features.shape)
print(Labels.shape)

(1000, 35)
(1000,)


Neural network training is known to be problematic when there is significant class imbalance. Unfortunately, the scikit-learn `MLPClassifier` used here has no `class_weight` argument for weighting cases. Some alternatives are:
1. **Impute** new minority cases using a statistical algorithm.
2. **Undersample** the majority cases. For this method, a number of cases equal to the number of minority cases is Bernoulli sampled from the majority cases.
3. **Oversample** the minority cases. For this method, the minority cases are resampled until their number equals the number of majority cases.
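
As a rough sketch of the undersampling alternative, the hypothetical helper below keeps every minority case and draws an equally sized subset of majority cases. Note that it samples without replacement using `numpy.random.Generator.choice`, which is a simple stand-in rather than the Bernoulli sampling described above. The toy arrays only mimic the 700/300 class split of the credit labels.

```python
import numpy as np

def undersample_majority(features, labels, majority_label=0, seed=0):
    """Keep all minority cases and a random, equally sized subset of majority cases."""
    rng = np.random.default_rng(seed)
    majority_idx = np.flatnonzero(labels == majority_label)
    minority_idx = np.flatnonzero(labels != majority_label)
    # Sample majority cases without replacement (an assumption of this sketch)
    kept_majority = rng.choice(majority_idx, size=minority_idx.size, replace=False)
    keep = np.sort(np.concatenate([kept_majority, minority_idx]))
    return features[keep], labels[keep]

# Toy stand-in with a 700/300 imbalance
toy_labels = np.array([0] * 700 + [1] * 300)
toy_features = np.arange(1000, dtype=float).reshape(-1, 1)
X_bal, y_bal = undersample_majority(toy_features, toy_labels)
print(np.unique(y_bal, return_counts=True))
```

The cost of this approach is that most of the majority cases are discarded, which matters for a data set of only 1000 cases.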

The code in the cell below oversamples the minority cases (bad credit customers) by appending one extra copy of each. Execute this code to create a data set with approximately balanced cases (700 versus 600).

In [4]:
temp_Labels = Labels[Labels == 1] 
temp_Features = Features[Labels == 1,:]
print(np.unique(Labels,return_counts=True))
print(temp_Features.shape)
temp_Features = np.concatenate((Features, temp_Features), axis = 0)
temp_Labels = np.concatenate((Labels, temp_Labels), axis = 0) 

print(np.unique(temp_Labels,return_counts=True))
print(temp_Features.shape)
print(temp_Labels.shape)

(array([0, 1]), array([700, 300]))
(300, 35)
(array([0, 1]), array([700, 600]))
(1300, 35)
(1300,)


Nested cross validation is used to estimate the optimal hyperparameters and perform model selection for a neural network model. 3 fold cross validation is used since training neural networks is computationally intensive. Additional folds would give better estimates but at the cost of greater computation time. Execute the code in the cell below to define inside and outside fold objects. 

In [5]:
nr.seed(123)
inside = ms.KFold(n_splits=3, shuffle = True)
nr.seed(321)
outside = ms.KFold(n_splits=3, shuffle = True)
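
To show how the inside and outside folds relate, here is a minimal numpy-only sketch of the nested structure. The helper `kfold_indices` is hypothetical and is not how scikit-learn implements `KFold`. It only illustrates that each inner split is drawn from an outer training subset, so the outer test cases are never seen during the hyperparameter search.

```python
import numpy as np

def kfold_indices(n_cases, n_splits, seed):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold splitting."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_cases), n_splits)
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
        yield train_idx, test_idx

n_cases = 1300  # size of the oversampled data set
n_inner_splits = 0
for outer_train, outer_test in kfold_indices(n_cases, 3, seed=321):
    # The inner folds partition positions within the outer training subset
    for inner_train_pos, inner_valid_pos in kfold_indices(len(outer_train), 3, seed=123):
        inner_train = outer_train[inner_train_pos]
        inner_valid = outer_train[inner_valid_pos]
        # No inner case may come from the outer test fold
        assert np.intersect1d(inner_train, outer_test).size == 0
        assert np.intersect1d(inner_valid, outer_test).size == 0
        n_inner_splits += 1

print(n_inner_splits)  # 9 inner splits: 3 outer folds x 3 inner folds
```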

The code in the cell below estimates the best hyperparameters using 3 fold cross validation. In the interest of computational efficiency, only a small grid of hyperparameter values is searched. There are several points to note here:
1. A grid is defined over up to four hyperparameters. The **alpha** and **early_stopping** entries are commented out in the cell, so only **beta_1** and **beta_2** are searched unless you uncomment them:
  - **alpha** is the l2 regularization hyperparameter,
  - **early_stopping** determines whether training stops once the score on a held-out validation subset of the training data stops improving. Early stopping is a powerful method to prevent over-fitting of machine learning models in general and neural networks in particular,
  - **beta_1** and **beta_2** are hyperparameters that control the adaptive learning rate used by the **Adam** optimizer,
2. The model is fit on the grid, and
3. The best estimated hyperparameters are printed.
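
To illustrate what **beta_1** and **beta_2** control, the sketch below applies the standard Adam update rule to a single scalar weight. It is a textbook-style illustration under stated assumptions, not scikit-learn's internal code. The function name `adam_step` and the toy objective are hypothetical. The default values mirror the `MLPClassifier` settings shown later in this lab.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta_1=0.9, beta_2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * grad        # running mean of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2   # running mean of squared gradients
    m_hat = m / (1.0 - beta_1 ** t)               # bias correction for zero initialization
    v_hat = v / (1.0 - beta_2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize the toy objective f(w) = w**2, whose gradient is 2*w, from w = 1.0
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 201):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, beta_1=0.95, beta_2=0.999)
print(w)
```

A larger **beta_1** averages the gradient over more past steps, and a larger **beta_2** does the same for the squared gradient that scales the step size.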

The full grid of 3 x 3 x 3 x 2 = 54 elements, searched with 3 fold cross validation, requires the model to be trained 162 times. With the **alpha** and **early_stopping** entries commented out, as in the cell below, the grid has 3 x 3 = 9 elements and requires 27 cross validation trainings. Execute this code and examine the result, but expect execution to take some time.
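
The size of a search grid can be checked by enumerating its combinations. The values below are copied from the parameter dictionary in the cell that follows, including the two entries that are commented out there. The helper `count_combinations` is a hypothetical name.

```python
from itertools import product

full_grid = {
    "alpha": [0.0000001, 0.000001, 0.00001],
    "early_stopping": [True, False],
    "beta_1": [0.95, 0.90, 0.80],
    "beta_2": [0.999, 0.9, 0.8],
}
n_folds = 3

def count_combinations(grid):
    """Number of hyperparameter combinations in a grid dictionary."""
    return sum(1 for _ in product(*grid.values()))

n_full = count_combinations(full_grid)
beta_grid = {name: full_grid[name] for name in ("beta_1", "beta_2")}
n_beta = count_combinations(beta_grid)

print(n_full, n_full * n_folds)  # 54 162
print(n_beta, n_beta * n_folds)  # 9 27
```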

Once you have executed the code, answer **Question 2** on the course page.

In [None]:
## Define the dictionary for the grid search and the model object to search on
param_grid = {#"alpha":[0.0000001,0.000001,0.00001], 
              #"early_stopping":[True, False], 
              "beta_1":[0.95,0.90,0.80], 
              "beta_2":[0.999,0.9,0.8]}

## Define the Neural Network model
nn_clf = MLPClassifier(hidden_layer_sizes = (100,100),
                       max_iter=300)

## Perform the grid search over the parameters
nr.seed(3456)
nn_clf = ms.GridSearchCV(estimator = nn_clf, param_grid = param_grid, 
                      cv = inside, # Use the inside folds
                      scoring = 'recall',
                      return_train_score = True)

nr.seed(6677)
nn_clf.fit(temp_Features, temp_Labels)
#print(nn_clf.best_estimator_.alpha)
#print(nn_clf.best_estimator_.early_stopping)
print(nn_clf.best_estimator_.beta_1)
print(nn_clf.best_estimator_.beta_2)

Now, you will run the code in the cell below to perform the outer cross validation of the model. The multiple trainings of this model will take some time. 

In [16]:
nr.seed(498)
cv_estimate = ms.cross_val_score(nn_clf, temp_Features, temp_Labels, 
                                 cv = outside) # Use the outside folds

print('Mean performance metric = %4.3f' % np.mean(cv_estimate))
print('STD of the metric       = %4.3f' % np.std(cv_estimate))
print('Outcomes by cv fold')
for i, x in enumerate(cv_estimate):
    print('Fold %2d    %4.3f' % (i+1, x))

Mean performance metric = 0.877
STD of the metric       = 0.003
Outcomes by cv fold
Fold  1    0.875
Fold  2    0.874
Fold  3    0.881


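Examine these results. Notice that the standard deviation of recall across the folds is more than two orders of magnitude smaller than the mean. This indicates that the performance estimate is stable across folds, but the level of performance on new data is still unclear, since the oversampled copies of minority cases can appear in both the training and validation folds.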
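
The comparison between the spread and the mean can be reproduced from the three fold values printed above. This is only a quick arithmetic check on the rounded values, so the ratio is approximate.

```python
import numpy as np

fold_recall = np.array([0.875, 0.874, 0.881])  # per-fold values printed above
mean_recall = fold_recall.mean()
std_recall = fold_recall.std()  # population standard deviation, the np.std default

print('%4.3f %4.3f' % (mean_recall, std_recall))  # 0.877 0.003
print(mean_recall / std_recall)  # ratio of the mean to the spread
```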

Now, you will build and test a model using the estimated optimal hyperparameters. However, there is a complication. Only the training data subset must have the minority case oversampled, so that the test data remain independent. Execute the code in the cell below to create training and test datasets, with the minority cases oversampled in the training subset only.

In [17]:
## Randomly sample cases to create independent training and test data
nr.seed(1115)
indx = range(Features.shape[0])
indx = ms.train_test_split(indx, test_size = 300)
X_train = Features[indx[0],:]
y_train = np.ravel(Labels[indx[0]])
X_test = Features[indx[1],:]
y_test = np.ravel(Labels[indx[1]])

## Oversample the minority case for the training data
y_temp = y_train[y_train == 1] 
X_temp = X_train[y_train == 1,:]
X_train = np.concatenate((X_train, X_temp), axis = 0)
y_train = np.concatenate((y_train, y_temp), axis = 0) 

The code in the cell below defines a neural network model object using the estimated optimal model hyperparameters and then fits the model to the training data. Execute this code.

In [18]:
nr.seed(1115)
nn_mod = MLPClassifier(hidden_layer_sizes = (100,100), 
                       #alpha = nn_clf.best_estimator_.alpha, 
                       #early_stopping = nn_clf.best_estimator_.early_stopping, 
                       beta_1 = nn_clf.best_estimator_.beta_1, 
                       beta_2 = nn_clf.best_estimator_.beta_2,
                       max_iter = 300)
nn_mod.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.95,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(100, 100), learning_rate='constant',
       learning_rate_init=0.001, max_iter=300, momentum=0.9,
       nesterovs_momentum=True, power_t=0.5, random_state=None,
       shuffle=True, solver='adam', tol=0.0001, validation_fraction=0.1,
       verbose=False, warm_start=False)

As expected, the hyperparameters of the neural network model object reflect those specified.

The code in the cell below scores and prints evaluation metrics for the model, using the test data subset. 

Execute this code, examine the results, and answer **Question 3** on the course page. 

In [19]:
def score_model(probs, threshold):
    return np.array([1 if x > threshold else 0 for x in probs[:,1]])

def print_metrics(labels, probs, threshold):
    scores = score_model(probs, threshold)
    metrics = sklm.precision_recall_fscore_support(labels, scores)
    conf = sklm.confusion_matrix(labels, scores)
    print('                 Confusion matrix')
    print('                 Score positive    Score negative')
    print('Actual positive    %6d' % conf[0,0] + '             %5d' % conf[0,1])
    print('Actual negative    %6d' % conf[1,0] + '             %5d' % conf[1,1])
    print('')
    print('Accuracy        %0.2f' % sklm.accuracy_score(labels, scores))
    print('AUC             %0.2f' % sklm.roc_auc_score(labels, probs[:,1]))
    print('Macro precision %0.2f' % float((float(metrics[0][0]) + float(metrics[0][1]))/2.0))
    print('Macro recall    %0.2f' % float((float(metrics[1][0]) + float(metrics[1][1]))/2.0))
    print(' ')
    print('           Positive      Negative')
    print('Num case   %6d' % metrics[3][0] + '        %6d' % metrics[3][1])
    print('Precision  %6.2f' % metrics[0][0] + '        %6.2f' % metrics[0][1])
    print('Recall     %6.2f' % metrics[1][0] + '        %6.2f' % metrics[1][1])
    print('F1         %6.2f' % metrics[2][0] + '        %6.2f' % metrics[2][1])
    
probabilities = nn_mod.predict_proba(X_test)
print_metrics(y_test, probabilities, 0.5)     

                 Confusion matrix
                 Score positive    Score negative
Actual positive       168                44
Actual negative        41                47

Accuracy        0.72
AUC             0.71
Macro precision 0.66
Macro recall    0.66
 
           Positive      Negative
Num case      212            88
Precision    0.80          0.52
Recall       0.79          0.53
F1           0.80          0.53


The performance of the neural network model is less than ideal. For the negative (bad credit) case, both the recall (0.53) and the precision (0.52) are low: only about half of the bad credit customers are detected, and only about half of the cases scored as bad credit are actually bad. Perhaps the oversampling does not help much in this case. Challenge yourself: try the undersampling method and compare the results!
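
As a check on the report above, the metrics for the bad credit case can be recomputed by hand from the four confusion matrix counts. The sketch assumes the scikit-learn convention that rows are actual labels and columns are predicted labels, in label order 0 then 1, with label 1 being the bad credit case.

```python
# Counts copied from the confusion matrix printed above:
# rows are actual labels, columns are predicted labels, in label order 0, 1
conf = [[168, 44],
        [41, 47]]

total = sum(sum(row) for row in conf)
accuracy = (conf[0][0] + conf[1][1]) / total

# Label 1 is the bad credit case, printed as "Negative" in the report
recall_bad = conf[1][1] / (conf[1][0] + conf[1][1])      # share of actual bad cases found
precision_bad = conf[1][1] / (conf[0][1] + conf[1][1])   # share of predicted bad cases that are bad
f1_bad = 2 * precision_bad * recall_bad / (precision_bad + recall_bad)

print('Accuracy  %0.2f' % accuracy)       # 0.72
print('Precision %0.2f' % precision_bad)  # 0.52
print('Recall    %0.2f' % recall_bad)     # 0.53
print('F1        %0.2f' % f1_bad)         # 0.53
```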

## Summary

In this lab you have accomplished the following:
1. Used neural network models to classify the cases of the iris data. The model with greater capacity achieved significantly better results.
2. Used 3 fold nested cross validation to find estimated optimal hyperparameters for a neural network model to classify credit risk cases. Oversampling of the minority case for the training data was required to deal with the class imbalance. Despite this approach, the results achieved are marginal at best. Perhaps a model with greater capacity would achieve better results, or a different approach to dealing with class imbalance would be more successful.