# Checkpoint 1: Neural networks and deep learning
---
*Responsible:* Guillermo Hamity (<ghamity@ed.ac.uk>)

In this checkpoint exercise, we will use neural networks to predict the **type** of weather *given* the available ground observations. You will be using observation data from **June 2019** across all UK Met Office weather stations.

### Notes on the Dataset
* You will be using weather observation data from the UK Met Office Datapoint service
* Ground observations are made hourly at weather stations across the length of the UK 
* The data sample covers data from June 2019
* Data collection for each day starts at 6.30pm. All observation data is listed in one-day blocks
* The time value column refers to the number of minutes after midnight 
* `Null` values for some features are expected (e.g. Wind Gust)
* Data import and preparation is already provided 


This week, I am not providing example notebooks like `lecture2.ipynb` and `data-science-tools.ipynb` for Unit 2, though these may still be useful to you. Instead, I am **providing the imports for all of the modules and classes that you should need.** Think of these as LEGO blocks; you have the ones you need but may look up how to "assemble" them.

### Notes on assessment
* Try and calculate the answers to the exercises provided. If you are unable to complete the question, describe which approach you _would_ have taken to solve the problem
* Code must be understandable and reproducible. Before grading the notebook kernel **may** be restarted and re-run, so make sure that your code can run from start to finish without any (unintentional) errors
* If you are unsure on how to proceed please **ask one of the TAs** during the workshop
- Notebooks should be submitted by **10am on Friday 9 October 2021** 
- This CP exercise sheet is divided into **6 sections**, corresponding to parts of the lecture, giving a maximum of **10 marks** in total:

| <p align='left'> Title                         | <p align='left'> Exercise nos. | <p align='left'> Number of marks |
| ------------------------------------- | ----- | --- |
| <p align='left'> 1. Conceptual questions               | <p align='left'>  1–5  | <p align='left'> 3 |
| <p align='left'> 2. Data preprocessing and RandomForest                | <p align='left'>  6–9  | <p align='left'> 2 |
| <p align='left'> 3. Neural networks in `scikit-learn`  | <p align='left'>  10–11 | <p align='left'> 1.5 | 
| <p align='left'> 4. Neural networks in `Keras`         | <p align='left'> 12–13 | <p align='left'> 2 |
| <p align='left'> 5. Regularisation                     | <p align='left'> 14–15 | <p align='left'> 1.5 |
| <p align='left'> **Total** | | <p align='left'> **10** |

- The total number of marks allocated for this CP is 10.
    - 1 additional mark can be given (up to a maximum of 10 marks in total) for the "bonus" exercise on hyperparameter optimisation. If you are pressed for time, focus on the first five sections; those are the core ones.
    - Half marks may be deducted for poor code legibility (i.e. it is very difficult to tell what you are doing), or for badly formatted plots (i.e. no legends, axis labels etc.). The TAs will use their discretion for this, so comment code where applicable and keep relevant information in your plots.

_Note:_ You can suppress double-printing of plots from the `plot` module by either _(a)_ adding a semicolon after the function call (_i.e._ `plot.<method>(...);`), or _(b)_ by capturing the return `pyplot.Figure` object as a variable (_i.e._ `fig = plot.<method>(...)`).

In [None]:
import sys
sys.version

## Preamble

In [None]:
# Standard import(s)
import numpy as np
import pandas as pd
import random as rn
import sklearn
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.inspection import PartialDependenceDisplay
%matplotlib inline

# Suppress unnecessary ConvergenceWarnings and DeprecationWarnings
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.filterwarnings(action='ignore', category=ConvergenceWarning)
warnings.filterwarnings(action='ignore', category=DeprecationWarning)

# Set a random seed variable to make workbook reproducible
seed=5
np.random.seed(seed)
rn.seed(seed)
os.environ['PYTHONHASHSEED']=str(seed)
tf.random.set_seed(seed)

# Switch off multi-threading for TensorFlow (TF2 API; must run before any TensorFlow op is executed)
tf.config.threading.set_intra_op_parallelism_threads(1)
tf.config.threading.set_inter_op_parallelism_threads(1)

In [None]:
# Load in the prepared weather data
obs = pd.read_csv('weather.csv')
obs.head()

In [None]:
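# Check the shape and summary statistics (only the last expression of a cell is displayed, so print the shape)
print(obs.shape)
obs.describe()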

For this exercise we will use **8** input features (provided) and clean the data:

In [None]:
# Define 8 input feature variables, 1 target variable data, and names of the 3 weather types
features = ['Latitude', 'Elevation', 'Temperature', 'Visibility', 'WindSpeed', 'Pressure', 'Humidity', 'WindDirection']
output   = ['Type']
wtype    = ['Clear', 'Cloudy', 'Precip']

Define derived dataset containing only the relevant columns and rows.

In [None]:
# Reduce to feature and type columns
dataset = obs[features + output]

# Drop duplicates and null values 
dataset = dataset.drop_duplicates().dropna()

# Drop unrecorded weather type
dataset = dataset[dataset.Type != 3]

# Check shape 
dataset.shape

## 1. Conceptual questions (3 Marks)
---
This section covers **5** exercises on conceptual understanding of neural networks.

#### 1.1. Which are the most used activation functions and why do we (typically) need non-linear activation functions in neural networks? (0.5 mark)

The most commonly used activation functions are the sigmoid, ReLU, softmax, and tanh.
We usually need non-linear activation functions in neural networks because hidden layers add nothing when the activations are linear: the composition of linear functions is itself a linear function.

Non-linear activations are therefore what make stacking multiple layers of neurons (to create a deep neural network) worthwhile, allowing subsequent layers to build on each other. Two consecutive linear layers have exactly the same expressive power as a single linear layer.
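
The collapse of stacked linear layers can be checked numerically. The sketch below is only an illustration: the layer shapes, the random weights, and all variable names are arbitrary assumptions and are not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with purely linear (identity) activation; shapes are arbitrary
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)
x = rng.normal(size=(10, 4))

# Forward pass through both linear layers
two_layers = (x @ W1 + b1) @ W2 + b2

# Equivalent single linear layer with merged weights and biases
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

# Inserting a ReLU between the layers breaks the equivalence
with_relu = np.maximum(x @ W1 + b1, 0) @ W2 + b2

linear_collapses = np.allclose(two_layers, one_layer)
relu_differs = not np.allclose(with_relu, one_layer)
print(linear_collapses, relu_differs)
```

With the ReLU in place, the two-layer network can no longer be rewritten as a single matrix multiplication, which is what gives depth its extra expressive power.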

#### 1.2. Why do we need deep neural networks and which are the main differences between deep and shallow learning? (0.5 Mark)

With deep learning (deep NNs), performance keeps improving as the amount of data increases. This is because a deep network does not only predict an output Y from an input X; it also learns basic features of the input, builds abstractions from them, and makes predictions based on those abstractions. This abstraction component is not found in shallow learning algorithms, whose performance also improves with more data but only up to a plateau.

With shallow learning we are typically not able to exploit high-dimensional data, whereas with deep learning we can. Shallow learning relies on manual feature engineering, which deep learning does not require. Shallow learning uses simple algorithms to obtain the output, while in deep learning they do not have to be simple.

Other characteristics of deep learning that are not found in shallow learning:
- The network can learn relevant features.
- It can perform feature extraction and deep combination simultaneously.
- It performs feature selection to find the best subset of features.

#### 1.3. Discuss the Bias-variance trade-off and its relation to underfitting and overfitting of a model. Which are the characteristics of an ideal model?  (0.5 mark)

The bias-variance trade-off is a fundamental principle for understanding how well a predictive model generalizes. The basic idea is that a model cannot minimize bias and variance at the same time: reducing one tends to increase the other.

Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
Variance is the variability of the model prediction for a given data point, i.e. how much the prediction changes when the model is trained on a different sample of data.

Underfitting model: The model does not have enough capacity/flexibility. It has high bias, meaning that it performs poorly even on the training data. This model cannot learn relevant structure in the dataset.

Overfitting model: The model has too much capacity/flexibility. In this case the model has high variance: it shows low loss on the training dataset but high loss on the testing dataset. It overfits because it learns random structure in the dataset that does not generalize.

Ideal model: Sufficient capacity/flexibility. The model learns the relevant structure in the dataset and generalizes well.
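
The effect of model capacity can be illustrated with polynomial fits of increasing degree. This is a minimal sketch on made-up data: the sine target, the noise level, the sample sizes, and the helper names `make_data` and `train_test_mse` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of a smooth, non-linear target (illustrative choice)."""
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_tr, y_tr = make_data(20)    # small training set
x_te, y_te = make_data(200)   # larger held-out set

def train_test_mse(degree):
    """Fit a polynomial of the given degree and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    return mse_tr, mse_te

# Low, moderate and high capacity
for degree in (1, 3, 9):
    print(degree, train_test_mse(degree))
```

The training error can only go down as the degree grows, since each polynomial contains the lower-degree ones. The held-out error typically improves at first and then degrades once the fit starts following the noise, which is the overfitting regime described above.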

#### 1.4. Given a neural network with 4 input nodes, 2 layers with 5 nodes each, and 1 output node, what is the total number of free (trainable) parameters in the network? Does it matter which activation function(s) are used?  (0.5 mark)

If this neural network is densely connected, it has 61 trainable parameters: counting weights plus biases for each layer gives $(4 \times 5 + 5) + (5 \times 5 + 5) + (5 \times 1 + 1) = 25 + 30 + 6 = 61$.

The choice of activation function does not change the number of trainable parameters. However, it does affect other things, such as the performance or the capacity of the network we are constructing.
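
The same count can be reproduced for any densely connected architecture with a few lines of Python. The helper `count_dense_parameters` is a hypothetical name introduced here for illustration; it assumes every layer is fully connected and has one bias per node.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    `layer_sizes` lists the number of nodes per layer, inputs first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases of this layer
    return total

# 4 inputs, two hidden layers of 5 nodes, 1 output node
n_params = count_dense_parameters([4, 5, 5, 1])
print(n_params)  # 61
```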

#### 1.5. What are appropriate choice for _(a)_ the number of output nodes and _(b)_ output activation function(s) for each of the following tasks, and why? (0.5 mark)

1. Regression of the $x$, $y$, and $z$ coordinates of a single particle in an arbitrary coordinate system

Activation function: linear (identity) -- Nodes: 3 output nodes (one per coordinate; in an arbitrary coordinate system the values can be negative, which ReLU cannot produce)

2. Regression of particle energy of a single particle

Activation function: ReLU -- Nodes: 1 output node (energy goes from 0 to infinity)

3. Classification of two processes (signal vs. background)

Activation function: Sigmoid -- Nodes: 1 output node giving the signal probability (equivalently, 2 output nodes with softmax)

4. Classification among *N* classes (dog vs. cat vs. fish vs. ...)

Activation function: Softmax -- Nodes: N output nodes (one probability per class)
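
The two classification activations can be written out in a few lines of NumPy to see what they return. This is a sketch with made-up input values; the function names and the example logits are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    """Maps any real number to (0, 1); used for a single-probability output."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Maps a vector of scores to probabilities that sum to 1."""
    z = z - np.max(z, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([[2.0, -1.0, 0.5]])   # illustrative raw outputs for 3 classes
probs = softmax(logits)
print(probs, probs.sum())

# A two-class softmax carries the same information as a single sigmoid
two_class = softmax(np.array([[1.3, 0.0]]))[0, 0]
print(two_class, sigmoid(1.3))
```

The last two printed numbers agree, which is why a binary classifier can use either one sigmoid node or two softmax nodes.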

**1.6. Given some data points and a regression/classification problem, write the appropriate cost function and compare your solution to that from sklearn (0.5 marks)** 

**Regression** 

A good **loss function** for regression is the **Mean Squared Error**. 

For $N$ samples with targets $Y$, our prediction $\bar{Y}$ has an MSE of:


$\mathrm{MSE} = \frac{\sum[(\bar{Y}-Y)^2]}{N}$

In [None]:
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation

In [None]:
## Regression Problem

# 3 Targets for regression 
Y = np.array([0.,1.,0.5])
# 3 Predicted values (at random)
YPred = np.random.rand(3)


In [None]:
#cost function (Mean Square Error):
def mse(YPred,Y):

    N = Y.shape[0]
    mse_res = np.sum((YPred - Y)**2)/N
    
    return mse_res

In [None]:
# Comparing our function to the sklearn MSE (np.isclose avoids a brittle exact float comparison)
prediction = np.random.rand(Y.shape[0])
print("My MSE function is {}".format("Correct" if np.isclose(mse(prediction, Y), metrics.mean_squared_error(Y, prediction)) else "Wrong"))

**Classification**

Log Loss from the lecture notes is appropriate for binary classification, where our prediction is the probability of `label = 1`.
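
For reference, and assuming the lecture-note definition is the standard binary cross-entropy, the log loss in the same notation as the MSE above (with $\bar{Y}_i$ the predicted probability that sample $i$ has `label = 1`) reads:

$\mathrm{LogLoss} = -\frac{1}{N}\sum_{i=1}^{N}\left[Y_i\log\bar{Y}_i + (1-Y_i)\log(1-\bar{Y}_i)\right]$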

In [None]:
#10 Random class labels (0 or 1)
Y = np.random.randint(0,2,10)
# 10 Random Probabilities
YPred = np.random.rand(10)

In [None]:
def logloss(YPred,Y):
    N = Y.shape[0]   
    logloss_res = (-1/N)*np.sum((Y*np.log(YPred) + (1-Y)*np.log(1-YPred)))
    
    return  logloss_res

In [None]:
# Check it matches the sklearn log_loss (compare with a tolerance rather than exact float equality)
np.isclose(logloss(YPred, Y), metrics.log_loss(Y.astype(int), YPred))

## 2. Data preprocessing and RandomForest (2 marks)
---
This section covers **4** exercises on data preparation, feature standardisation, and dataset splitting.

In [None]:
# Relevant import(s) for this section
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.tree import export_graphviz
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
from sklearn import preprocessing # Import preprocessing for String-Int conversion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

---
**_Comment on target format and one-hot encoding:_** By default, the target column (`Type`) contains one integer (0, 1, or 2) for each example, the integer specifying one of three possible types of weather. However, for doing multi-class classification (which this is), we want our neural network to have one output node per class (_i.e._ 3 output nodes in this case), such that the activation of each output node is interpreted as the likelihood for a given sample being of the type in question. Therefore, the target should also be a 3-element vector for each sample; this vector should be all zeros, except for a $1$ at the index corresponding to the type in question. This is called **one-hot encoding**, and a few examples are shown below:

- type = 0 $\to$ one-hot = $[1, 0, 0]$ for 3 classes
- type = 1 $\to$ one-hot = $[0, 1, 0]$ for 3 classes
- type = 2 $\to$ one-hot = $[0, 0, 1]$ for 3 classes

This is the target towards which a neural network classifier is trained: That is, ideally, for an example of type 0, the network will output a large activation ($\approx 1$) on the first output node (interpreted as a large likelihood for the first weather type), and very small activations ($\ll 1$) on the two other output nodes (interpreted as small likelihoods for the two other weather types); and so on.

The same type of one-hot encoding can be performed for any number of target classes $N_{c}$, which just results in $N_{c}$-element target vectors with a single non-zero entry each.

To be user friendly, however, `scikit-learn` allows us to use integer targets for multi-class classification; it does the one-hot encoding for us "under the hood." Similarly, `keras` _can_ also allow us to use integer targets for multi-class classification, provided we use the appropriate loss (`sparse_categorical_crossentropy`). Otherwise (if we use `categorical_crossentropy` loss), it expects one-hot encoded targets. Which approach you choose is up to you, but now you know what goes on.
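
If you do want explicit one-hot targets, the encoding described above takes one line of NumPy. The sketch uses a handful of made-up integer labels; the variable names are illustrative assumptions.

```python
import numpy as np

y_int = np.array([0, 2, 1, 2])        # illustrative integer targets
n_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
y_onehot = np.eye(n_classes)[y_int]
print(y_onehot)

# The integer labels can be recovered with argmax
y_back = y_onehot.argmax(axis=1)
print(y_back)
```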

---

#### 2.1. Prepare the feature and target arrays (0.5 mark)
- Randomly sample **3,500** observations per weather type (**10,500** observations in total) from `dataset` into a new `pandas.DataFrame`; call it `sample`.
- One-hot encode the **wind direction** variable (_i.e._ $N$ to $[1, 0, \ldots, 0]$, $NNE$ to $[0, 1, \ldots, 0]$, _etc._), to allow us to input it to the neural network. There are 16 unique directions, so we need to transform 1 feature into 16 features. The exact order of the encoding (_i.e._ which direction corresponds to which index) doesn't matter. *Hint:*
  - *Either:* Use the scikit-learn `ColumnTransformer` with the `OneHotEncoder` applied to the `WindDirection` column, and let the remainder of the features pass through un-transformed.
  - *Or:* Use the `OneHotEncoder` class directly on the `WindDirection` column (use `sparse=False` in the `OneHotEncoder` constructor, or `sparse_output=False` in scikit-learn 1.2 and later), and then concatenate with a `numpy.array` containing the remaining features.
- Define `numpy.arrays` named `X` and `y` containing the training features (the 7 unmodified ones plus the one-hot encoded wind directions) and target, respectively.
- Argue whether the shapes of `X` and `y` are as expected/as they should be.


In [None]:
# For convenience, build the feature names in the order in which the wind direction
# (16 one-hot columns) and the other 7 features are concatenated
feature_names = list(range(16))+features[:-1]
print(feature_names)

In [None]:
# Randomly sample 3500 observations per weather type (seeded for reproducibility)
sample = dataset.groupby(["Type"]).sample(n=3500, random_state=seed)
sample[features]

In [None]:
# One-hot encode the 'WindDirection' variable (16 unique directions);
# the remaining 7 features pass through unchanged and end up in the last 7 columns
wind_trans = ColumnTransformer([('OneHotW', OneHotEncoder(), ['WindDirection'])], remainder='passthrough')

X = wind_trans.fit_transform(sample[features]) 
y = sample['Type'].values

# Print the shapes of X and y (discussed in the next cell)
print(X.shape)
print(y.shape)

In [None]:

# The arrays have the expected shapes:
# ROWS: 3 x 3500 = 10500 samples from the random sampling; y holds one target value (weather type) per row.
# COLUMNS: 23 in X, i.e. the 7 untransformed features plus the 16 one-hot columns that replace WindDirection.

#### 2.2. Train a Random Forest, evaluate performance, explore features (1.5 mark)

Decision trees work well with a mixture of features (of different scales, and both binary and continuous data), so we will train a random forest to do the job of categorisation.

You are given the train test split (70% training):

In [None]:
#Import random forests and confusion matrix metric
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix,ConfusionMatrixDisplay
from sklearn.model_selection import GridSearchCV

# split dataset into training set and test set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
print(x_train.shape,x_test.shape,y_train.ravel().shape)

1. Train a `RandomForestClassifier` with a `GridSearchCV` over the following input parameters to the model. Split the dataset into only 3 cross-validation folds to make it a little faster (Hint: see the `GridSearchCV` function documentation)
2. Check the overall accuracy on the testing set
3. What is the best set of hyperparameters the scan has found? 

*Hint:* the final random forest that is chosen can be returned with the `best_estimator_` member of the `GridSearchCV` object

In [None]:
# We scan a broad range of parameters to use for the RandomForest
rf_dic={
    "n_estimators":[10,50,200,500],
    "max_features": ["sqrt","log2"],
    "criterion": ["gini"],
    "max_depth": [4,8,30]
    }

In [None]:
# Run GridSearchCV with the data split into 3 cross-validation folds (forest seeded for reproducibility)
grid_search = GridSearchCV(RandomForestClassifier(random_state=seed), rf_dic, n_jobs=7, cv=3)
grid_search.fit(x_train, y_train)



In [None]:
# Overall accuracy 
y_pred = grid_search.predict(x_train)
print("Accuracy Training:",metrics.accuracy_score(y_train, y_pred))
y_pred = grid_search.predict(x_test)
print("Accuracy Testing:",metrics.accuracy_score(y_test, y_pred))

In [None]:
# best_params_ lists only the scanned hyperparameters chosen by the grid search
print("Best set of hyperparameters: ", grid_search.best_params_)

---

**Understanding Classification Accuracy**

4. Use the `confusion_matrix` method on the **test data** to return the confusion matrix normalised over the true labels, i.e. each row should sum to 100%. Use the given colormap to plot the confusion matrix in a heatmap.
    - Define the axis tick names to represent Clear, Cloudy or Precip
    - Use suitable x and y axis labels
    
5. What are the true positive rates for clear, cloudy and precip? 
6. What is the probability that rain is forecast on a sunny day?
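
Normalising over the true labels simply divides each row of the raw count matrix by its row sum. The sketch below uses hypothetical counts (not results from this dataset) to show how the true positive rates and the off-diagonal rates are read off.

```python
import numpy as np

# Hypothetical raw counts: rows = true class, columns = predicted class
# (order: clear, cloudy, precip); these numbers are made up for illustration
counts = np.array([[80, 15,  5],
                   [20, 60, 20],
                   [ 5, 25, 70]])

# Normalise over the true labels so that every row sums to 1
normalised = counts / counts.sum(axis=1, keepdims=True)

# The diagonal holds the true positive rate of each class
true_positive_rates = np.diag(normalised)

# Fraction of truly clear samples that were predicted as precipitation
clear_predicted_precip = normalised[0, 2]

print(normalised)
print(true_positive_rates, clear_predicted_precip)
```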

In [None]:
# Confusion matrix on the test data (y_pred holds the test predictions), normalised over the true labels
test_matrix = metrics.confusion_matrix(y_test, y_pred, normalize='true')

colormap = sns.diverging_palette(220, 10, as_cmap=True)
display_labels = ['clear', 'cloudy', 'precipitation']
fig, ax = plt.subplots(figsize=(10, 10))
# Generate heat map, allow annotations and place floats in map
sns.heatmap(test_matrix, cmap=colormap, annot=True, fmt=".2f", xticklabels=display_labels, yticklabels=display_labels, ax=ax)
ax.set_title("Confusion matrix (test data, normalised over true labels)")
ax.set_ylabel("True weather type")
ax.set_xlabel("Predicted weather type")

plt.show()

The probability that rain is forecast on a sunny (clear) day is only 2%.

---
**Understanding Feature Importance**

There are several ways to understand which **features are important** to the decision tree. The most common is to look at the `feature_importances_` list, which is calculated at training time. This quantifies how much the splits on each feature reduce the impurity of the dataset; the higher the number, the more important the feature. In random forests, where we have 100s of trees, the importance is an aggregate.


*Note:* below the code assumes the random forest CV search is still `grid_search`

In [None]:
# Given plotting example for feature importance
fig, ax = plt.subplots(figsize=(10, 10))
ax.barh(range(23), grid_search.best_estimator_.feature_importances_)
ax.set_yticks(list(range(23)))
ax.set_yticklabels(feature_names)
ax.set_title("Training Feature Importance")
plt.show()

E.g. In the RF I trained, wind direction has little impact on the performance, while Pressure, Visibility and Humidity seem like natural important features.

The problem with `feature_importances_` is that they are **calculated on, and biased towards, the training dataset**, so may not represent the most relevant features for classifying on the **testing dataset**.


We can use `permutation_importance` to get a more accurate representation of the feature importance. 

In [None]:
from sklearn.inspection import permutation_importance

This function will randomly permute (shuffle) one feature at a time, and look at how much the accuracy changes. We can perform this permutation several times (`n_repeats`) and get an average impact on the accuracy, and a std deviation.
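
The idea can be written out by hand in a few lines. The sketch below is not the scikit-learn implementation; it uses a toy dataset and a hypothetical stand-in classifier (`toy_predict`) whose label depends on one feature only, so the expected outcome is easy to reason about.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the label depends only on feature 0; feature 1 is pure noise
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    """Stand-in for a trained classifier (hypothetical): thresholds feature 0."""
    return (X[:, 0] > 0).astype(int)

def manual_permutation_importance(predict, X, y, n_repeats=20):
    """Mean and std of the accuracy drop when one column at a time is shuffled."""
    baseline = np.mean(predict(X) == y)
    means, stds = [], []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # shuffle one column only
            drops.append(baseline - np.mean(predict(X_perm) == y))
        means.append(np.mean(drops))
        stds.append(np.std(drops))
    return np.array(means), np.array(stds)

imp_mean, imp_std = manual_permutation_importance(toy_predict, X_toy, y_toy)
print(imp_mean, imp_std)
```

Shuffling the informative feature costs roughly half of the accuracy, while shuffling the noise feature costs nothing, because the stand-in classifier never looks at it.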

7. Complete the permutation importance function below
    - Use the test dataset
    - Permute each feature 20 times


In [None]:
# Permutation importance on the test dataset, permuting each feature 20 times
result = permutation_importance(grid_search, x_test, y_test, n_repeats=20,
                                random_state=42, n_jobs=2)

8. Make the feature importance plot as above using 
    - `result.importances_mean` as the feature importances
    - `result.importances_std` for the `barh` parameter `xerr=`
    - Comment on how the importances change on the testing dataset

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
ax.barh(range(23), result.importances_mean, xerr=result.importances_std)
ax.set_yticks(list(range(23)))
ax.set_yticklabels(feature_names)
ax.set_xlabel("Mean decrease in accuracy")
ax.set_title("Permutation feature importance")
plt.show()

---
Finally we can look at the impact of individual features on the probability of a **particular class**.

Using `PartialDependenceDisplay` we choose a set of features that we allow to vary within a range, while other features remain fixed. We can look at how the probability estimate changes on average for any one of our targets. 
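
What a partial dependence curve computes can be sketched by hand: fix the scanned feature to one value for every sample, average the predicted probability, and repeat over a grid of values. The stand-in probability function `toy_predict_proba` and the toy data below are assumptions for illustration, not the random forest from this exercise.

```python
import numpy as np

def toy_predict_proba(X):
    """Hypothetical stand-in for a class probability: logistic in feature 0 only."""
    return 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))

def manual_partial_dependence(predict_proba, X, feature, grid):
    """Average predicted probability as one feature is scanned over `grid`."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value                      # same value for every sample
        averages.append(predict_proba(X_mod).mean())   # average over the other features
    return np.array(averages)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
grid = np.linspace(-2, 2, 5)

pd_feature0 = manual_partial_dependence(toy_predict_proba, X_toy, 0, grid)
pd_feature1 = manual_partial_dependence(toy_predict_proba, X_toy, 1, grid)
print(pd_feature0)
print(pd_feature1)
```

The curve for feature 0 rises with the feature value, while the curve for feature 1 is flat, since the stand-in probability does not depend on it.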

9. Complete the `PartialDependenceDisplay.from_estimator` function by:
    - adding your random forest estimator
    - using the first 100 data points of the test dataset as input
    - Use the `Humidity`, `Pressure` and `Visibility` features. These feature values are scanned while the others remain fixed (*Hint:* `features` parameter)
    - Look at the impact on the `Precipitation` class probability (*Hint:* `target` parameter)
10. Comment on the trends shown over the 3 features on the probability it will rain.

In [None]:
sklearn.__version__

In [None]:
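fig, ax = plt.subplots(figsize=(12, 6))
ax.set_title("Partial dependence of the precipitation probability")
# Scan Visibility (19), Pressure (21) and Humidity (22) on the first 100 test points; target=2 is 'Precip'
pdp = PartialDependenceDisplay.from_estimator(grid_search, x_test[:100], features=[19, 21, 22], feature_names=feature_names, target=2, ax=ax)

plt.show()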

The plots show that rain becomes more probable as visibility decreases, as pressure decreases, and as humidity increases.

## 3. Neural networks in `scikit-learn` (1.5 mark)
---
This section covers exercises on constructing and training neural networks using the `scikit-learn` library, as well as evaluating neural network performance. `scikit-learn` provides many very easy-to-use ML algorithms, including neural networks. These are called `MLPClassifier` (MLP = multi-layer perceptron; a historic name for densely connected, feed-forward neural networks) when used for classification, and `MLPRegressor` when used for regression. We will focus on the former for now.

In [None]:
# Relevant import(s) for this section
from sklearn.neural_network import MLPClassifier


#### 3.1. Standardise the relevant features  and split data (0.5 mark)
We need some additional processing of the input features to make them appropriate for the neural network. 

- Use our feature array `X`, and standardize the features.

    _Note:_ You shouldn't standardise the one-hot encoded wind directions; they already have the desired format.
    - Hint:

        - Use the scikit-learn `StandardScaler`
        - Or use the scikit-learn `MinMaxScaler`

- Perform a sanity check to make sure that the resulting features have the expected distributional properties (mean and standard deviation; or minimum and maximum value).
    - The number of columns should match, and depending on the choice of standardisation, the last 7 columns should either have:
      - (Using `StandardScaler`) means = 0 and standard deviations = 1; or
      - (Using `MinMaxScaler`) min = 0, max = 1
      
- Reserve **30%** of data for testing. Check whether the resulting arrays have the expected shapes.
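
Both scalings are simple column-wise transformations, and the sanity check follows directly from their definitions. The sketch below applies them by hand to made-up data; the two columns are hypothetical and only loosely styled after pressure and visibility.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features on very different scales
raw = np.column_stack([rng.normal(1013, 8, 1000),
                       rng.uniform(0, 40000, 1000)])

# Standardisation: subtract the column mean, divide by the column standard deviation
standardised = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Min-max scaling: map each column onto the interval [0, 1]
minmax = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))

print(standardised.mean(axis=0), standardised.std(axis=0))  # approximately 0 and 1
print(minmax.min(axis=0), minmax.max(axis=0))               # 0 and 1
```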

In [None]:
# Standardise the last 7 (continuous) features; the 16 one-hot columns are left untouched
X[:,16:] = StandardScaler().fit_transform(X[:,16:]) 

# Sanity check: every standardised column should have mean 0 and standard deviation 1
print(X[:,16:].mean(axis=0))
print(X[:,16:].std(axis=0))

In [None]:
#split data sets in 0.7 and 0.3
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

print(x_train.shape,
x_test.shape,
y_test.shape,
y_train.shape)


Shapes as expected (total 10500, 70% -> 7350, 30% -> 3150)

#### 3.2. Construct, train, and evaluate a neural network  (1 mark)

- Create an `MLPClassifier` which
    - has **1 hidden layer of 50 neurons** 
    - has **no regularization term**
    - trains for a maximum of **100 epochs** 
    - uses a batch size of **32**
- Fit the classifier using the standard `.fit()` member method.
- Plot the loss function value as a function of number of epochs (0.5 of mark).
  You can access the loss history through the `.loss_curve_` attribute of the `MLPClassifier` instance. 

- Using the testing dataset: 
    - Compute the overall accuracy for the classifier using the `MLPClassifier`'s `.score()` member method for both testing and training datasets.
    - Compute the confusion matrix (normalised in true labels), and plot it 
- Discuss the results

In [None]:
clf = MLPClassifier(hidden_layer_sizes=(50,), batch_size=32, max_iter=100, alpha=0, random_state=seed).fit(x_train, y_train)
plt.plot(clf.loss_curve_)
plt.xlabel("Number of training epochs")
plt.ylabel("Loss")
plt.show()

# Confusion matrix on the testing dataset, normalised over the true labels
ConfusionMatrixDisplay.from_estimator(clf, x_test, y_test, normalize='true', display_labels=['clear', 'cloudy', 'precipitation'])
plt.show()




In [None]:
print("Training accuracy: ", clf.score(x_train, y_train)) # Accuracy on the training dataset
print("Testing accuracy: ", clf.score(x_test, y_test)) # Accuracy on the testing dataset

## 4. Neural networks in `Keras` (2 marks)
---
This section covers exercises on constructing and training neural networks using the `Keras` library. `scikit-learn` is very easy to use, but libraries like `Keras` provide a lot more flexibility, which is why we will be using these extensively in the last two units of the _'Data science tools and machine learning'_ track.

In [None]:
# Relevant import(s) for this section
# Use the public tensorflow.keras API rather than the private tensorflow.python.keras path
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense

#### 4.1. Construct a neural network in `Keras` (1 mark)

- Create a `keras.Model` using the **Keras functional API**. The network should have:
    - An input layer with the same number of nodes as the number of features in `X`.
    - A single, densely connected hidden layer with **50 nodes** equipped with **ReLU activation**.
    - A densely connected output layer with **3 nodes** (the number of types of weather we're classifying) equipped with **softmax activation**.
- Compile the model using the **Adam optimiser**, add `'accuracy'` as metric, and use either:
    - `categorical_crossentropy` loss, if you have one-hot encoded the targets `y`, or
    - `sparse_categorical_crossentropy` loss if you are using integer-valued targets.
- Use the `.summary()` member method to print an overview of the model you have created, explain the output.

In [None]:
# Define network (names chosen to avoid shadowing the built-in `input` and the earlier `output` list)
inputs = Input(shape=(X.shape[1],))  # one input node per feature (23)
x = Dense(50, activation="relu")(inputs)  # hidden layer
outputs = Dense(3, activation="softmax")(x)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()


#### 4.2. Train a `Keras` neural network (1 mark)

- Use the `.fit()` member method to train the network on the **training dataset** for **100 epochs** with a **batch size of 32**. Use **20% of the data for validation** and make sure to have `Keras` **shuffle** the training data between epochs. Save the fit history by doing `history_mdl = .....`
- Print the classification accuracy using the `.evaluate()` member method, for both the training and testing dataset. Comment on the results.
- Plot val_loss and loss functions from the fit history. On the same plot, plot the sklearn curve from the exercise above. Note the sklearn NN does not provide a complementary validation loss history, so only plot the training loss.
- Comment on the results of the overall accuracy compared to the scikit-learn method.

In [None]:
history_mdl = model.fit(x_train, y_train, epochs=100, batch_size=32, shuffle=True, validation_split = 0.2)


In [None]:
# evaluate() returns [loss, accuracy]; index [1] selects the accuracy
print("Training accuracy:", model.evaluate(x_train, y_train)[1])
print("Testing accuracy:", model.evaluate(x_test, y_test)[1])

- Plot val_loss and loss functions from the fit history. On the same plot, plot the sklearn curve from the exercise above. Note the sklearn NN does not provide a complementary validation loss history, so only plot the training loss.

In [None]:
plt.plot(history_mdl.history['loss'], label = "Keras Loss")
plt.plot(history_mdl.history['val_loss'], label = "Keras Validation Loss")
plt.plot(clf.loss_curve_, label = "Sklearn loss")
plt.xlabel("Number of training epochs")
plt.ylabel("Loss")
plt.legend()




## 5. Regularisation (1.5 marks)
---
This section covers **2** exercises on the impact of weight regularisation. Note that $L_{1}$- and $L_{2}$-regularisation may also be applied to the activation of intermediate layers. Also, a similar regularising effect could be achieved using **dropout** regularisation, which you are encouraged to try out, but which we won't study in this CP exercise.
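
Weight regularisation adds a penalty on the size of the weights to the loss. As a sketch of the term that the combined $L_{1}$/$L_{2}$ kernel regulariser contributes for one weight matrix, here is a NumPy version; the function name `l1_l2_penalty` and the example weights are made up for illustration.

```python
import numpy as np

def l1_l2_penalty(weights, l1=0.0, l2=0.0):
    """Penalty added to the loss for one weight matrix:
    l1 * sum(|w|) + l2 * sum(w**2)."""
    return l1 * np.sum(np.abs(weights)) + l2 * np.sum(weights ** 2)

# Illustrative 2x2 weight matrix
W = np.array([[0.5, -1.0],
              [2.0,  0.0]])

penalty_l1 = l1_l2_penalty(W, l1=0.003)   # 0.003 * 3.5
penalty_l2 = l1_l2_penalty(W, l2=0.03)    # 0.03 * 5.25
print(penalty_l1, penalty_l2)
```

The $L_{1}$ term grows linearly with each weight and tends to push small weights to exactly zero, while the $L_{2}$ term grows quadratically and mainly discourages large weights.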

In [None]:
# Relevant import(s) for this section
from tensorflow.keras.regularizers import l1_l2

#### 5.1. Define `Keras` model factory method (0.5 mark)

- Define a python function called `big_model_fn` which takes the following three arguments:
    - `l1`: A float specifying the $L_{1}$ regularisation factor (default value: 0)
    - `l2`: A float specifying the $L_{2}$ regularisation factor (default value: 0)
    - `name`: A string, specifying the name of the model (default value: None)
- Inside the function, you should:
    - Construct a `Keras` model using the functional API, which has:
        - An input layer with the same number of nodes as the number of features in `X`.
        - **Two** densely connected hidden layers with **100 nodes** each, both equipped with **ReLU activation**.
        - Both hidden layers should be subject to kernel regularisation (_i.e._ weight regularisation) with the regularisation factors specified as an input.
        - A densely connected output layer with **3 nodes** (the number of types of weather we're classifying) equipped with **softmax activation**.
        - A name given by the corresponding argument.
    - Compile the model in the same way as in **Exercise 4.1.**
- The function should return the compiled `Keras` model. 

The method will provide a convenient way of constructing and compiling a number of "big"/deep `Keras` models which differ only by their regularisation and name.

In [None]:
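def big_model_fn(l1=0, l2=0, name=None):
    # Input layer: one node per feature in X
    inputs = Input(shape=(X.shape[1],))
    # Two hidden layers with kernel regularisation set by the l1 and l2 arguments
    # (the string "l1_l2" would silently use the default factors instead)
    x_a = Dense(100, activation="relu", kernel_regularizer=l1_l2(l1=l1, l2=l2))(inputs)
    x_b = Dense(100, activation="relu", kernel_regularizer=l1_l2(l1=l1, l2=l2))(x_a)
    outputs = Dense(3, activation="softmax")(x_b)
    model = Model(inputs, outputs, name=name)

    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    return model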

#### 5.2. Train "big" models with and without regularisation (1 mark)

- Construct three "big" models using the factory method:
     - One with default parameters
     - One with `l1=0.003` and  `name='model_L1'`
     - One with `l2=0.03`  and `name='model_L2'`
- Train each one as in **Exercise 4.2.**
- Compare first the loss history of the un-regularised "big" model to that of the small model from **Exercise 4.2** using the `plot.loss()` method.
- Then, compare the loss histories of all three "big" models with that of the small model.
- Plot the loss and val loss of all 4 models. Target these points:
    - Compare the performance of deep vs shallow models on the testing sets
    - Compare the level of overtraining (training vs testing loss)
    - Note: Don't be alarmed if the shallow network performs slightly better than the deeper ones; this is dataset dependent.
- Copy the same plotting code, but this time plot the training and validation accuracy
- Discuss the results.

In [None]:
default = big_model_fn().fit(x_train, y_train, epochs=100, batch_size=32, shuffle=True, validation_split = 0.2)
model_L1 = big_model_fn(l1=0.003, name= "model_L1").fit(x_train, y_train, epochs=100, batch_size=32, shuffle=True, validation_split = 0.2)
model_L2 = big_model_fn(l2= 0.03,name= "model_L2").fit(x_train, y_train, epochs=100, batch_size=32, shuffle=True, validation_split = 0.2)

- Compare first the loss history of the un-regularised "big" model to that of the small model from **Exercise 4.2** using the `plot.loss()` method.

In [None]:
plt.plot(history_mdl.history['loss'], label = "Keras Loss")
plt.plot(default.history['loss'], label = "Default Loss")
plt.xlabel("Number of training epochs")
plt.ylabel("Loss")
plt.legend()


- Then, compare the loss histories of all three "big" models with that of the small model.

In [None]:
plt.plot(history_mdl.history['loss'], label = "Keras Loss")
plt.plot(default.history['loss'], label = "Default Loss")
plt.plot(model_L1.history['loss'], label = "Model_L1 Loss")
plt.plot(model_L2.history['loss'], label = "Model_L2 Loss")
plt.plot(history_mdl.history['val_loss'], label = "Keras val_Loss")
plt.plot(default.history['val_loss'], label = "Default val_Loss")
plt.plot(model_L1.history['val_loss'], label = "Model_L1 val_Loss")
plt.plot(model_L2.history['val_loss'], label = "Model_L2 val_Loss")
plt.xlabel("Number of training epochs")
plt.ylabel("Loss")
plt.legend()

In [None]:
plt.plot(history_mdl.history['accuracy'], label = "Keras training accuracy")
plt.plot(default.history['accuracy'], label = "Default training accuracy")
plt.plot(model_L1.history['accuracy'], label = "Model_L1 training accuracy")
plt.plot(model_L2.history['accuracy'], label = "Model_L2 training accuracy")
plt.plot(history_mdl.history['val_accuracy'], label = "Keras val_accuracy")
plt.plot(default.history['val_accuracy'], label = "Default val_accuracy")
plt.plot(model_L1.history['val_accuracy'], label = "Model_L1 val_accuracy")
plt.plot(model_L2.history['val_accuracy'], label = "Model_L2 val_accuracy")
plt.xlabel("Number of training epochs")
plt.ylabel("Accuracy")
plt.legend()