# Chapter 20 - Neural Networks and Deep Learning

So far, we have only addressed machine learning with classification tasks, i.e. where there is a finite discrete set of target labels, the classes. The other main task in supervised machine learning is regression, where the task is to predict a continuous value. Regression leads to different metrics and algorithms, as well as some new concepts such as loss functions and regularization terms, as it is closer to optimization. 

## Neural Networks

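A regression task is one where the data labels are continuous numbers. In this chapter we will use the diabetes dataset that ships with Scikit-Learn (`load_diabetes`). It contains 442 samples with 10 numeric features each, and the target is a quantitative measure of disease progression.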

### Data preprocessing

We are now going to use the diabetes dataset. We will do two things differently from before: 1) we will **scale** the feature values and 2) we will divide the data into 3 subsets, the **training, validation and testing set**.

**Scaling:** since we don't make assumptions about the data, the raw values may not be suitable for the machine learning algorithms we apply. For example, unscaled values might cause an error, or the optimizer might not converge, so that we don't get a meaningful result.
There are different scaling requirements for different models. Decision trees, for example, usually need no scaling. Many other models, like KNN classifiers, logistic regression or neural networks, are sensitive to scaling. Standardization, as in chapter 8, to a mean of 0 and a standard deviation of 1 works well for many machine learning models. An alternative approach is normalization (min-max scaling), which ensures that all values are between 0 and 1.
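As a small illustration of the two scaling approaches, the sketch below applies standardization and min-max normalization with plain NumPy. The toy array and all variable names are made up for this example. The scaling parameters are computed on the training rows only and then reused for a new sample, which is the usual practice.

```python
import numpy as np

# Toy feature matrix (made-up values): 4 training samples, 2 features
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0],
                    [4.0, 400.0]])
X_new = np.array([[2.5, 250.0]])  # an unseen sample

# Standardization: subtract the training mean, divide by the training std dev
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_new_std = (X_new - mean) / std

# Normalization (min-max scaling): map the training range of each feature to [0, 1]
lo = X_train.min(axis=0)
hi = X_train.max(axis=0)
X_train_norm = (X_train - lo) / (hi - lo)
X_new_norm = (X_new - lo) / (hi - lo)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_train_norm.min(axis=0), X_train_norm.max(axis=0))
```

Note that a new sample can fall outside the range seen during training, in which case its normalized value lies outside the interval from 0 to 1.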

In the code below, we first split the training set from the rest and split the rest into test and validation set. Then we use the Scikit-Learn StandardScaler to standardize the dataset. 

In [7]:
from sklearn.datasets import load_diabetes
dataset = load_diabetes()

seed = 0 # this is used with the train_test_split to avoid random behaviour for this demo
X = dataset.data
y = dataset.target

In [15]:

from sklearn.model_selection import train_test_split
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=.4,random_state=seed)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=.5,random_state=seed)
print(len(y_train),len(y_val),len(y_test))

# most models work better with scaled input data
import numpy as np
from sklearn.preprocessing import StandardScaler
sclr = StandardScaler()
sclr.fit(X_train) # learn mean and std dev on the training data only

X_train_scl = sclr.transform(X_train) # scale all 3 sets with the training statistics:
X_val_scl = sclr.transform(X_val)
X_test_scl = sclr.transform(X_test)
[np.mean(y_val),np.std(y_val)]

265 89 88


[147.22471910112358, 66.47255965222679]

### Feed-Forward Neural Network

We now use a feed-forward neural network, also called a multi-layer perceptron (MLP). It views the feature values of an item as a vector and passes this vector through one or more hidden layers. Each unit in a layer calculates a weighted sum of its inputs and applies a non-linear activation function, and the output layer combines the values of the last hidden layer into the predicted value. The weights are learned from the training set by iteratively minimizing a loss function. The number and size of the hidden layers determine the behaviour. In the simplest case we can choose a single hidden layer, as in the code below, where it has 100 units.

We train the model on our training set and calculate the performance on the validation set, which is the set to use when comparing different models or settings. We also calculate the performance on the test set, which is a more realistic estimate of the performance on unseen data.
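To make the computation inside a feed-forward network concrete, here is a minimal NumPy sketch of the forward pass of a network with one hidden layer. It is an illustration, not the Scikit-Learn implementation. The weights are random rather than trained, the layer sizes are chosen only to mirror the example in this section, and ReLU is assumed as the activation function.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 10, 100  # sizes chosen for illustration
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))  # input -> hidden weights
b1 = np.zeros(n_hidden)                                  # hidden biases
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))           # hidden -> output weights
b2 = np.zeros(1)                                         # output bias

def forward(X):
    """Forward pass of a one-hidden-layer network for regression."""
    hidden = np.maximum(0.0, X @ W1 + b1)  # weighted sum followed by ReLU
    return (hidden @ W2 + b2).ravel()      # linear output, one value per sample

X = rng.normal(size=(5, n_features))  # 5 made-up, already scaled samples
y_pred = forward(X)
print(y_pred.shape)
```

Training consists of adjusting `W1`, `b1`, `W2` and `b2` so that the predictions get closer to the known target values. That step is omitted here.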

In [2]:
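from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# helper assumed here (not defined in this chapter): MSE on training, validation and test set
def trainValTestMse(model):
    sets = [(X_train_scl, y_train), (X_val_scl, y_val), (X_test_scl, y_test)]
    return [mean_squared_error(y, model.predict(X)) for X, y in sets]

# train the model
mlp = MLPRegressor(hidden_layer_sizes=(100,), random_state=0, max_iter=500)
mlp.fit(X_train_scl,y_train)

print("MLP regression: ",trainValTestMse(mlp))
print("MLP regr R^2: ",mlp.score(X_train_scl, y_train))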

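The `score` method of a Scikit-Learn regressor returns the coefficient of determination $R^2$, and the name `trainValTestMse` suggests that the other metric is the mean squared error (MSE). As a sketch of what these two metrics measure, the following code calculates both by hand on a few made-up values. The function names `mse` and `r2` are chosen for this example only.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared difference."""
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up target values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

print(mse(y_true, y_pred), r2(y_true, y_pred))
```

An $R^2$ of 1 means perfect predictions, while a model that always predicts the mean of the targets gets an $R^2$ of 0. The MSE is in squared units of the target, so it should be read relative to the variance of the target values.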

### Recurrent Neural Networks

### Convolutional Neural Networks

### Sequence-to-Sequence Models

In **k-Fold Cross-Validation**, we divide the whole dataset into $k$ subsets (folds) of approximately equal size. Each fold is used once as the test set, while the remaining $k-1$ folds are used as the training set. In this way we get $k$ different samples of the performance. In the extreme case of $k = n$, where $n$ is the size of the dataset, it is called **leave-one-out cross-validation**.
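The splitting scheme can be sketched in a few lines of NumPy. The function name `k_fold_indices` and its parameters are made up for this illustration, and it assumes $k \geq 2$. In practice Scikit-Learn's `KFold` class provides this functionality.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n)        # shuffle the sample indices once
    folds = np.array_split(indices, k)  # k folds of approximately equal size
    for i in range(k):
        test_idx = folds[i]             # fold i is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in k_fold_indices(n=10, k=5):
    print(len(train_idx), len(test_idx))
```

Calling the function with `k` equal to `n` gives leave-one-out cross-validation, where every test set contains exactly one sample.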

When taking random samples, we will get variation in the distribution of classes between the folds.

### Transformers

10-fold Stratified CV, test accuracy, mean:  0.7637254901960785 , std dev: 0.05580585675126154
Leave-one-out, test accuracy, mean:  0.7696629213483146 , std dev: 0.4210485825292524


As we can see, the training set predictions are all correct, as was already clear from the diagram. 



### Summary



### Exercises