<a href="https://colab.research.google.com/github/shstreuber/Data-Mining/blob/master/Module7_NeuralNetworks2024.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **1. Introduction to Neural Networks**
Artificial Neural Networks are modeled on the human brain. Human brain cells, called neurons, form a complex, highly interconnected network and send electrical signals to each other to help humans process information. This means that human brain cells are designed for learning.

It also means that Neural Networks are true learning networks in that they **self-optimize** with a set of **learning functions** designed to reduce the error of each processing cycle. This is why Neural Networks work well with large datasets and unstructured data.
<center>
<img src="https://miro.medium.com/v2/resize:fit:828/format:webp/1*SJPacPhP4KDEB1AdhOFy_Q.png" height =400>
</center>

Want to know more about how human and computer neural networks are different and yet oh so similar? Look [here](https://towardsdatascience.com/the-differences-between-artificial-and-biological-neural-networks-a8b46db828b7) (this is the article from which the picture above stems).

At the end of this module, you will be able to:
* Describe how a simple Neural Network works
* Identify the summation and the activation functions
* Take appropriate preprocessing steps for Neural Networks
* Build a simple Neural Network

Let's get started!


## **What is a Neural Network?**
Neural networks are modeled after the human brain. As the name indicates, Neural Networks consist of neurons, where the data processing happens, and dendrites and axons that make up the pathways between neurons.

<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/1200px-Neural_network_example.svg.png" width="200">
</center>
</div>

The data enter the Neural Network through the **INPUT LAYER** (green) and the classification results are found in the **OUTPUT LAYER** (purple). As the data make their way through the network, each value is multiplied by weights. These weights are adjusted by a learning algorithm (the classic example is the perceptron), whose goal is to minimize error values. That happens in the **HIDDEN LAYER** (teal).

Each neuron in the teal hidden layer contains essentially two functions:
1. The **SUMMATION FUNCTION**, which aggregates input data and passes its output on to the
2. The **ACTIVATION FUNCTION**, which applies a previously defined algorithm to the data it has received from the summation function.

Below is a close-up (from [an awesome article on Towards Data Science](https://towardsdatascience.com/inroduction-to-neural-networks-in-python-7e0b422e6c24) that you should read!) of what happens inside one of the teal dots above. The **SUMMATION FUNCTION** is blue and the **ACTIVATION FUNCTION** is purple. Note that the output is red:

<div>
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/neuralnetwork.png" width="600">
</center>
</div>
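
The two functions inside a single neuron can be sketched in a few lines of NumPy. This is only an illustration: the input values, weights, and bias are made up, and ReLU is picked as the activation function simply because it is easy to read.

```python
import numpy as np

def summation(inputs, weights, bias):
    """Summation function: weighted sum of the inputs plus a bias term."""
    return float(np.dot(inputs, weights) + bias)

def activation(z):
    """Activation function: ReLU, chosen here only for illustration."""
    return max(0.0, z)

# Hypothetical values for one neuron with three incoming connections
inputs = np.array([0.5, -1.0, 2.0])
weights = np.array([0.4, 0.3, 0.1])
bias = 0.1

z = summation(inputs, weights, bias)  # 0.5*0.4 + (-1.0)*0.3 + 2.0*0.1 + 0.1
output = activation(z)                # what the neuron passes on to the next layer
print(f"weighted sum = {z:.2f}, neuron output = {output:.2f}")
```

Every neuron in the hidden layer repeats this pattern with its own weights and bias.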

And now take a look at the instructor video explaining this in detail and demonstrating how this works mathematically:




In [7]:
from IPython.display import IFrame  # This is just for me so I can embed videos
IFrame(src="https://www.youtube.com/embed/NyAcHViPlCg", width=560, height=315)

#**0. Preparation and Setup**

As before, we are following the basic classification steps:

1. **Exploratory Data Analysis (EDA)** to see how the data is distributed and to determine what the class attribute in the dataset should be. This will be the attribute you'll predict later on
2. **Preprocess the data** (remove n/a, transform data types as needed, deal with missing data) --> here is where we will need to take a few additional steps to **configure our data for the Neural Network**
3. **Split the data** into a training set and a test set
4. **Build the model** based on the training set
5. **Test the model** on the test set
6. **Determine the quality of the model** with the help of a Confusion Matrix and a Classification Report.

As with our previous problems, we will use the insurance dataset again.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # for visualization
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import warnings
warnings.filterwarnings("ignore")
np.random.seed(42)

#Reading in the data as insurance dataframe
insurance = pd.read_csv("https://raw.githubusercontent.com/shstreuber/Data-Mining/master/data/insurance_with_categories.csv")

#Verifying that we can see the data
insurance.head()

# **1. Exploratory Data Analysis (EDA)**
As before, we have the option to either do this in a code cell just like above, or to import the HTML-based [ydata_profiling](https://docs.profiling.ydata.ai/latest/) package.

##Your Turn
Remember the [pandas_profiling package](https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/) (since renamed to ydata_profiling) that you encountered in the module on kNN and Naive Bayes? Install it below to complete this week's EDA section:

In [None]:
# Visualizing the data with a pairplot because why not?
# Let's investigate what kinds of relationships exist between the variables.

sns.pairplot(insurance,hue="region", height=3, diag_kind="kde")

We can see that the dataset is somewhat structured, with linear relationships between age and charges. However, we don't see a clear distribution for region--and we remember that the results from our Random Forest analysis weren't too convincing at about 38% accuracy. Let's see if a Neural Network gives us a better understanding of how the numeric predictors in the dataset can help us determine the region attribute.

# **2. Preprocessing**
Let's get started!

##**2.1 Reducing the Data**
You have done this before.

##Your Turn
In the space below, build an insurance_nn dataset consisting of age, bmi, children, charges, and--again--region as the class attribute. You will need this in order to progress through this workbook.



##**2.2 Preparing the Data for use with a Neural Network**
In this section, you will see that preparing data to work with a Neural Network requires a bit more preprocessing than what you may be used to from previous algorithms.

### 2.2.1 Encoding
Our Neural Network code will require **numeric labels in the output layer**. It will not work with categorical variables. This is why we will need [LabelEncoder()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to transform the non-numerical labels in 'region' (as long as they are hashable and comparable) to numerical labels.

In [None]:
# LabelEncoder numbers the labels in alphabetical order: northeast -> 0, northwest -> 1, southeast -> 2, southwest -> 3
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
insurance_nn["region"] = labelencoder.fit_transform(insurance_nn["region"])
region = pd.DataFrame({'region': ['southwest', 'southeast', 'northwest', 'northeast']})
insurance_nn.head()
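
If you are unsure which integer each region received, the fitted encoder can tell you. The sketch below uses a small made-up list of regions rather than the insurance data; `classes_` is the attribute in which LabelEncoder stores the labels in sorted order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample of region labels (not the real dataset)
regions = ["southwest", "southeast", "northwest", "northeast", "southeast"]

encoder = LabelEncoder()
codes = encoder.fit_transform(regions)

# classes_ holds the labels in sorted order; each label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

The same check works on the real encoder from the cell above via `labelencoder.classes_`.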

**NICE!**

**UH, WAIT!:** LabelEncoder introduced **A NEW PROBLEM**: It numbers the labels in alphabetical order (northeast=0, northwest=1, southeast=2, southwest=3), so the numbers in 'region' look like ordinal relationships--but northwest(1) is not higher than northeast(0) and southeast(2) is not smaller than southwest(3).

That's why, in our case, LabelEncoder was the **WRONG SOLUTION**, and we need instead a **four-dimensional vector**.

With a four-dimensional vector, our Neural Network can then assign presence (=1) or absence (=0) to each of our four labels. Thus, southwest would be [1,0,0,0], southeast would be [0,1,0,0], northwest would be [0,0,1,0], and northeast [0,0,0,1]. If you google around, you'll find that that's what [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder) does. That's why this encoding process is often also called OneHot Encoding. **BUT**
OneHotEncoder returns an unlabeled array (a sparse matrix by default) instead of a dataframe with readable column names--and that makes it less convenient for our 'region' attribute: Categorical data in words, i.e. strings.
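
To make the four-dimensional vector concrete, here is a minimal hand-written sketch in NumPy. The helper names `one_hot` and `decode` are made up for this illustration, and the label order simply follows the example in the text.

```python
import numpy as np

labels = ["southwest", "southeast", "northwest", "northeast"]  # order used in the text
index = {label: i for i, label in enumerate(labels)}

def one_hot(label):
    """Return a vector with a 1 in the label's position and 0 everywhere else."""
    vector = np.zeros(len(labels), dtype=int)
    vector[index[label]] = 1
    return vector

def decode(vector):
    """Recover the label by finding which position is switched on."""
    return labels[int(np.argmax(vector))]

encoded = one_hot("northwest")
print(encoded.tolist())
print(decode(encoded))
```

Decoding with `argmax` mirrors what happens at the output layer: the label whose position carries the largest value wins.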

###**A BETTER SOLUTION: get_dummies**

To preprocess our data, we will use the Pandas **[get_dummies function](https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html)**.

**What is a dummy?**

A dummy variable is a numeric variable that encodes categorical information. Dummy variables have two possible values: 0 or 1. In a dummy variable:

* 1 encodes the presence of a category
* 0 encodes the absence of a category

For our sex attribute, it would look like this:

<img src = "https://www.sharpsightlabs.com/wp-content/uploads/2022/04/pandas-get-dummies_simple-visual-example.png" height = 300>

**NOTE**: To learn more about get_dummies, check out [this blog post](https://www.sharpsightlabs.com/blog/pandas-get-dummies/), from which the graphic above stems.

Both get_dummies and OneHotEncoder are often referred to by the same term, OneHot Encoding (which really just means indicating presence or absence of an attribute value with a 1 or a 0, respectively). To learn more about the difference between the two, go [here](https://www.linkedin.com/pulse/both-onehotencoder-pdgetdummies-used-convert-data-what-berger-perkins/).

So, let's get started with get_dummies. Before we go any further, though, you will need the insurance_nn dataframe that you have built above.



In [None]:
def create_dummies(df,column_name):  # We build a function that identifies what to do with the dummies
    dummies = pd.get_dummies(df[column_name],prefix=column_name) # We build a variable for the dataframe (df) and the column (column_name)
    df = pd.concat([df,dummies],axis=1) # We concatenate the dataframe with the dummies we have built; these dummies are in columns (axis=1)
    return df
insurance_nn = create_dummies(insurance_nn,"region") # We apply the create_dummies function to the insurance_nn dataframe and the region column

insurance_nn.head()

And here we are! As you can see above, the region attribute--our class--is no longer just one column (although we left 'region' in the dataframe to prove the point--and we will use it again when we define the target for the split); it is now 4 columns, indexed from 0 to 3. These are the four labels in the output layer. Now the Neural Network just needs to check which of the "region indicators" is switched on, and the data tuple can be dropped under that respective label.

But we are not yet done with preprocessing.

### 2.2.2 Standardizing the Data ##
Well, not only do Neural Networks not like string-type labels in the output layer; they also don't like non-standardized input attributes. That's because the Summation and Activation functions treat the values from each input attribute the same. Hence, these values need to fall into the same scale.

For this, we will scale our data with [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html). **Standardization with StandardScaler** is a common preprocessing step that involves transforming your data so that each attribute has a mean of 0 and a standard deviation of 1. This is also known as **z-score normalization**; this notebook loosely calls it **mean normalization**. The exact math is explained [here](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

**Benefits of Using StandardScaler (aka mean normalization)**
- Improves Performance: Many algorithms converge faster and perform better when features are standardized.
- Reduces Bias: Standardization ensures that no single feature dominates the algorithm due to its scale.
- Comparability: Makes features directly comparable, which is especially important for distance-based algorithms.

This helps us to speed up our algorithm (gradient descent) and have a more accurate classifier.
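
The arithmetic behind standardization is short enough to do by hand. The sketch below applies z = (x - mean) / standard deviation to a handful of made-up ages; it uses the population standard deviation (NumPy's default), which is also what StandardScaler uses.

```python
import numpy as np

ages = np.array([19.0, 18.0, 28.0, 33.0, 32.0])  # made-up values, not the real column

mean = ages.mean()
std = ages.std()              # population standard deviation (ddof=0)
scaled = (ages - mean) / std  # subtract the mean, then divide by the spread

print(np.round(scaled, 3))
print(f"mean after scaling: {scaled.mean():.3f}, std after scaling: {scaled.std():.3f}")
```

After the transformation, values below the original mean are negative and values above it are positive, regardless of the original unit.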

In [None]:
# Features before mean normalization
unscaled_features = insurance_nn[['age','bmi','children','charges']]

# Mean Normalization with StandardScaler to have a more reliable classifier
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Calculate mean and standard deviation (fit_) and apply the transformation (_transform)
unscaled_features_array = sc.fit_transform(unscaled_features.values)

# Assign the scaled data to a DataFrame & use the index and columns arguments to keep your original indices and column names:
scaled_features = pd.DataFrame(unscaled_features_array, index=unscaled_features.index, columns=unscaled_features.columns)

scaled_features.head()

**Do you notice something?**

Yes, the region attribute is gone because we selected only the four numeric input attributes for scaling. The class attribute itself stays untouched in insurance_nn.

On to the next step.

#**3. Splitting the data into Training and Test Set**
We have already used the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function from Scikit-Learn, so these steps should look familiar. If you can't remember how it works, here is [a great explanation](https://www.geeksforgeeks.org/how-to-split-the-dataset-with-scikit-learns-train_test_split-function/) on the Geeks for Geeks site.

In [13]:
from sklearn.model_selection import train_test_split
X = scaled_features # those are the scaled input attributes from the StandardScaler
y = insurance_nn['region']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
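
Two optional arguments of train_test_split are worth knowing: `random_state` makes the split repeatable, and `stratify` keeps the class proportions the same in both sets. The sketch below demonstrates both on made-up toy data (the `_toy` names are hypothetical and do not touch the variables above).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 records with 2 features, 5 records for each of 4 classes
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.repeat([0, 1, 2, 3], 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,     # 20% of the records go into the test set
    stratify=y_toy,    # keep the class proportions equal in both sets
    random_state=42,   # same split on every run
)

print(X_tr.shape, X_te.shape)
print(sorted(y_te.tolist()))
```

With four equally sized classes and four test records, stratification places exactly one record of each class in the test set.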

#**4. Determine Neural Network Architecture, then Build and Train the Model**

When we build the Neural Network architecture, we have to configure the following settings:
* the number of nodes in each layer
* the number of layers (if we are using more than one hidden layer)
* the way in which nodes are connected (feedforward, no loops between the units).

The number of nodes in the INPUT and OUTPUT layers is determined by the dimensionality of the problem: We have 4 input units (age, bmi, children, charges) and 4 output units ('southwest', 'southeast', 'northwest', 'northeast'). We will experiment with the hidden layer.
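
The layer sizes also determine how many weights the network has to learn. As a rough sketch, assuming a single hidden layer of 100 nodes (the MLPClassifier default) between our 4 inputs and 4 outputs, the count works out as follows. The helper `count_parameters` is made up for this illustration.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected feedforward network."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # one weight per connection between two layers
        total += n_out         # one bias per node in the receiving layer
    return total

layer_sizes = [4, 100, 4]      # input nodes, hidden nodes, output nodes
n_parameters = count_parameters(layer_sizes)
print(n_parameters)            # 4*100 + 100 + 100*4 + 4 = 904
```

Changing the hidden layer from 100 to 10 nodes shrinks this to 94 parameters, which is one reason why the hidden layer is the place to experiment.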

To build the network, we will use the [Multilayer Perceptron classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html).


In [14]:
from sklearn.neural_network import MLPClassifier

mlp1 = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, activation='relu', solver='adam', random_state=42, verbose=1)

Above, we initialize our MLPClassifier like this:

```
MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, activation='relu', solver='adam', random_state=42, verbose=1)
```

Let's step through the configuration:

* It contains one hidden layer of 100 neurons

* It runs through (= iterates through) all the datapoints up to 500 times; training stops earlier if the loss stops improving (the default maximum is 200 times)

* It uses the [ReLU activation function](https://www.kaggle.com/code/dansbecker/rectified-linear-units-relu-in-deep-learning), f(x) = max(0, x), where all negative values are converted to 0 and all positive values pass through unchanged
<center>
<img src = "https://www.researchgate.net/profile/Anas-Issa/publication/370465617/figure/fig2/AS:11431281155101569@1683057835840/Activation-function-ReLu-ReLu-Rectified-Linear-Activation.png" height = 300>
</center>
* It uses the [Adam (=adaptive moment estimation) optimizer](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/) as its learning function, which is one of the most frequently used learning functions for neural networks.

* When you set random_state to a specific integer (like 42 because, well, it is [the answer to "the Ultimate Question of Life, the Universe, and Everything"](https://news.mit.edu/2019/answer-life-universe-and-everything-sum-three-cubes-mathematics-0910)), you ensure that every time you run your code, you get the same result.
<center>
<img src="https://ih1.redbubble.net/image.1970558774.0297/flat,750x,075,f-pad,750x1000,f8f8f8.jpg" height = 300>
</center>
* Verbose=1 means that we can observe the output as the model runs. We can turn this off with verbose=0
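
Why does random_state matter? A neural network starts from randomly drawn weights, so two runs only match if the random draw is repeatable. The sketch below shows the idea with NumPy's random generator; it is not scikit-learn's internal initialization, and the helper `initial_weights` is made up for this illustration.

```python
import numpy as np

def initial_weights(seed, shape=(4, 3)):
    """Draw a small matrix of random starting weights from a seeded generator."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=shape)

run_a = initial_weights(42)
run_b = initial_weights(42)  # same seed as run_a
run_c = initial_weights(7)   # different seed

same_seed_matches = np.array_equal(run_a, run_b)
different_seed_matches = np.array_equal(run_a, run_c)
print(same_seed_matches, different_seed_matches)
```

Same seed, same starting weights, same training run; a different seed gives a different starting point and usually a slightly different model.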

Now that we have built the model, we will train the model on our training set.

Since this will run in verbose mode, you will be able to see the self-optimization as it happens. Watch for the loss (i.e. the measure of the model's error, which Adam works to minimize) becoming smaller with each iteration. This means that our model is learning from its errors and improving!

In [None]:
# As this runs in verbose mode, note how the loss function steps down with every single iteration!

mlp1.fit(X_train, y_train)
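
The loss values printed in verbose mode are also stored on the fitted model in the `loss_curve_` attribute, so you can inspect them afterwards. The sketch below is self-contained and trains a small network on synthetic data from `make_classification` (not the insurance data); the `_demo` names and the hidden layer size are arbitrary choices for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic 4-class problem with 4 numeric features (stand-in for our real data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)

demo = MLPClassifier(hidden_layer_sizes=(20,), max_iter=200, random_state=42)
demo.fit(X_demo, y_demo)

losses = demo.loss_curve_  # one loss value per completed epoch
print(f"epochs run: {demo.n_iter_}")
print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
```

For the model above, `mlp1.loss_curve_` holds the same kind of list, and `plt.plot(mlp1.loss_curve_)` turns it into a learning curve.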

## **How Can We Tune the Multilayer Perceptron?**

There are many ways to adjust the parameters of (=tune) the Multilayer Perceptron function in order to make it work more smoothly for us. The most salient parameters are:

* hidden_layer_sizes tuple, length = n_layers - 2, default=(100,)
The ith element represents the number of neurons in the ith hidden layer.

* activation{‘identity’, ‘logistic’, ‘tanh’, ‘relu’}, default=’relu’
Activation function for the hidden layer.

  * ‘identity’, no-op activation, useful to implement linear bottleneck, returns f(x) = x

  * ‘logistic’, the logistic sigmoid function, returns f(x) = 1 / (1 + exp(-x)).

  * ‘tanh’, the hyperbolic tan function, returns f(x) = tanh(x).

  * ‘relu’, the rectified linear unit function, returns f(x) = max(0, x)

* solver{‘lbfgs’, ‘sgd’, ‘adam’}, default=’adam’
The solver for weight optimization. We will talk about this more when we talk about Gradient Descent

  * ‘lbfgs’ is an optimizer in the family of quasi-Newton methods.

  * ‘sgd’ refers to stochastic gradient descent.

  * ‘adam’ refers to a stochastic gradient-based optimizer

* batch_size int, default=’auto’
Size of minibatches for stochastic optimizers. When set to “auto”, batch_size=min(200, n_samples)

* max_iter int, default=200
Maximum number of iterations. For stochastic solvers (‘sgd’, ‘adam’), note that this determines the number of epochs (how many times each data point will be used).

* verbose bool, default=False
Whether to print progress messages to stdout.

* validation_fraction float, default=0.1
The proportion of training data to set aside as validation set for early stopping. Must be between 0 and 1. Only used if early_stopping is True
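
The four activation options listed above are easy to compare side by side. This sketch writes them out in NumPy, following the formulas in the list, and applies each one to the same three inputs.

```python
import numpy as np

def identity(x):
    return x

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])

# tanh is available directly in NumPy
for name, fn in [("identity", identity), ("logistic", logistic),
                 ("tanh", np.tanh), ("relu", relu)]:
    print(f"{name:>8}: {np.round(fn(x), 3)}")
```

Note how logistic squeezes every input into the range 0 to 1, tanh into the range -1 to 1, and ReLU simply cuts off everything below 0.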

As you can see, tuning can be really useful.

#**5. Test the Model**
The moment of truth! So far, we have seen that the error rate decreases, which means that our model learns something. But how well does the trained model classify data it has never seen? That's what the **accuracy score** measures: the share of test records whose region is predicted correctly.
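
Accuracy is simply the number of correct predictions divided by the number of predictions. A tiny hand calculation with made-up labels (not our real predictions) shows the idea:

```python
# Made-up true labels and predictions for 8 test records
y_true = [0, 1, 2, 3, 0, 1, 2, 3]
y_hat  = [0, 1, 2, 0, 0, 2, 2, 3]

correct = sum(t == p for t, p in zip(y_true, y_hat))  # count matching pairs
accuracy = correct / len(y_true)

print(f"{correct} of {len(y_true)} correct -> accuracy = {accuracy}")  # 6 of 8 -> 0.75
```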


In [None]:
# Score takes a feature matrix X_test and the expected target values y_test.
# Predictions for X_test are compared with y_test

print (mlp1.score(X_test,y_test))

You are seeing the output above as a decimal number. Translate it into percent, and you'll see what percentage of the test data has been categorized correctly. Now let's see what the predictions look like:

In [None]:
# Now, let's see what the predictions look like
y_pred = mlp1.predict(X_test)
print("Test set predictions: \n {}".format(y_pred))

<img src="https://media.istockphoto.com/id/874505226/vector/contempt-emoticon.jpg?s=612x612&w=0&k=20&c=9ELAryuCY6EPIBlBiMwYf5lz-7QZ_o6WOQPqi5pbxks=" height = 200>
Hm. Does this look good to you?

#**6. Evaluate the Quality**
You know the drill! First, we set up the Confusion Matrix, then the Classification Report. You did this for the modules on k Nearest Neighbor, Naive Bayes, and Random Forest, so you're already a pro at this!
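
As a refresher on how to read a Confusion Matrix, here is a hand-built one in NumPy with made-up labels. It is only a sketch of the tallying logic (rows are actual classes, columns are predicted classes); for the exercise itself, use the scikit-learn functions you already know.

```python
import numpy as np

# Made-up actual and predicted classes for 8 records
actual    = [0, 1, 2, 3, 0, 1, 2, 3]
predicted = [0, 1, 2, 0, 0, 2, 2, 3]

n_classes = 4
matrix = np.zeros((n_classes, n_classes), dtype=int)
for a, p in zip(actual, predicted):
    matrix[a, p] += 1  # row = actual class, column = predicted class

# Recall per class: correct predictions on the diagonal divided by the row total
recall = matrix.diagonal() / matrix.sum(axis=1)

print(matrix)
print(recall)
```

Everything on the diagonal is a correct prediction; everything off the diagonal tells you which class was mistaken for which.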

##Your Turn
Build the Confusion Matrix and the Classification Report below:

In [17]:
# Build the Confusion Matrix here


In [18]:
# Set up the Classification Report here


##Your Turn
What do you think of our model, given this Confusion Matrix and the Classification Report? Is it better or worse than our Random Forest? How does it compare with our k Nearest Neighbor or our Naive Bayes output? Record your answer below.

# **7. Using Gradient Descent**

Gradient Descent is the method that adjusts the weights in the Neural Network: after each Feed Forward pass, Backpropagation computes how the error (aka loss function) changes with each weight, and every weight is then moved a small step in the direction that reduces the loss. The size of that step determines how often the Feed Forward-Backpropagation loop has to run. To learn more about Gradient Descent, read through [this explanation](https://www.kdnuggets.com/2017/04/simple-understand-gradient-descent-algorithm.html) and watch this awesome video below:



In [23]:
from IPython.display import IFrame  # This is just for me so I can embed videos
IFrame(src="https://www.youtube.com/embed/sDv4f4s2SB8", width=560, height=315)
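
The core loop of gradient descent fits in a few lines. The sketch below minimizes a made-up loss with a single weight, loss(w) = (w - 3)^2, whose minimum sits at w = 3. A real network repeats the same update for every weight, with the gradients supplied by backpropagation.

```python
def loss(w):
    """A toy loss function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # arbitrary starting weight
learning_rate = 0.1  # step size

for step in range(50):
    w = w - learning_rate * gradient(w)  # move against the gradient

print(f"w after 50 steps: {w:.4f}, loss: {loss(w):.8f}")
```

Each step moves the weight a little closer to the minimum, and the loss shrinks along the way, just like the loss values in the verbose output.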

Now that you know all about gradient descent, we can re-run the MLP Classifier with the Stochastic Gradient Descent optimizer configured. We will specifically pay attention to the parameters shown.

* Stochastic Gradient Descent (SGD) optimizer/solver: updates the weights in batches so that the loss function gets smaller. As the name says, SGD uses the gradient of the loss function. Use the parameter solver='sgd'.

* learning_rate_init=0.01: This parameter controls the step size in updating the weights; here it stays constant during training. Its value should be neither too large (the model fails to converge) nor too small (training is too slow).

* max_iter=500 : maximum number of epochs (=how many times each data point will be used until solver convergence).
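
The effect of the learning rate can be seen on a toy problem. The sketch below minimizes f(w) = w^2 (minimum at w = 0) with three made-up learning rates; the function `run_gradient_descent` is hypothetical and only serves this illustration.

```python
def run_gradient_descent(learning_rate, steps=100, start=1.0):
    """Minimize f(w) = w**2 (gradient 2*w) and return the final w."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

# Too small, reasonable, and too large
results = {lr: run_gradient_descent(lr) for lr in (0.001, 0.1, 1.1)}

for lr, final_w in results.items():
    print(f"learning rate {lr}: final w = {final_w:.3g}")
```

With 0.001 the weight has barely moved after 100 steps, with 0.1 it has reached the minimum, and with 1.1 every step overshoots further, so the value explodes instead of converging.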

In [25]:
mlp2 = MLPClassifier(hidden_layer_sizes=(10,), solver='sgd', learning_rate_init=0.01, max_iter=500, verbose=1)

## YOUR TURN

1. Fit mlp2 on the training data, then produce a confusion matrix and a classification report for the test data, and see if changing the parameters has made any difference.
2. Build an mlp3 model in which you change the solver and the learning rate  however you want (leave verbose on, though). This will change the Gradient Descent. What difference has this made?

#**7. THE BIG QUESTION: How Do You Know How To Architect the Network?**

Deciding the number of layers and the number of nodes per layer in a neural network involves a combination of domain knowledge, experimentation, and understanding of the problem at hand. We have seen that configuring the INPUT and OUTPUT layers is pretty intuitive. However, determining the number of HIDDEN NODES is not as easy. Here are some guidelines and strategies to help make these decisions:

**1. Understanding the Problem and Data**
 * Complexity of the Problem: Complex problems like image recognition, natural language processing, or playing games typically require deeper networks (more layers) because they need to capture more intricate patterns and hierarchical representations.
 * Size of the Dataset: Larger datasets can support more complex models. If you have a small dataset, a simpler model (fewer layers and nodes) is often more appropriate to avoid overfitting.

**2. Guidelines for the Number of Layers**
 * Input and Output: Start with the input layer (which has as many nodes as there are input features) and the output layer (which has as many nodes as there are output classes or a single node for regression). See above.
 * Single Hidden Layer: For many simple problems, a single hidden layer can be sufficient. The Universal Approximation Theorem states that a single hidden layer with enough nodes can approximate any continuous function, but practical performance may vary.
 * Deep Networks: For more complex tasks, deep networks with multiple hidden layers can be more effective. Common deep learning architectures (like CNNs and RNNs) have predefined structures with many layers.

**3. Guidelines for the Number of Nodes**
 * Input Nodes: The input layer should have one node for each feature in your dataset.
 * Output Nodes: The output layer should have one node per class (for classification problems) or a single node (for regression problems).
 * Hidden Nodes: There is no definitive rule, but a common heuristic is to start with a number of nodes between the size of the input layer and the output layer. Some approaches include:
    * Increasing Size: Start with fewer nodes and increase the number incrementally.
    * Decreasing Size: Use more nodes in the initial layers and decrease the number as you go deeper (e.g., a funnel-like structure).
    * Experimentation: Begin with a small network and increase the number of nodes and layers incrementally, observing the impact on performance.

**4. Practical Steps for Designing Neural Networks**
 * Start Simple: Begin with a simple architecture. For example, a single hidden layer with a number of nodes roughly between the input and output layer sizes.
 * Experiment Incrementally: Gradually increase the number of layers and nodes. Observe how the model's performance on validation data changes.
 * Early Stopping and Regularization: Implement early stopping (i.e. stop training once performance on validation data stops improving) to prevent overfitting. Use regularization techniques like dropout, L2 regularization, and batch normalization, which we will discuss in the next module.
 * Review Architectures of Similar Problems: Look at existing research or common architectures for similar problems to get a sense of what works well.
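
The "start simple, experiment incrementally" advice can be sketched as a small loop. The example below is an assumption-laden illustration: it uses synthetic data from `make_classification` rather than the insurance data, three arbitrarily chosen architectures, and MLPClassifier's built-in `early_stopping` option, and it compares the candidates on a held-out validation set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic 4-class problem with 4 features (stand-in for a real dataset)
X_syn, y_syn = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

candidates = [(4,), (16,), (16, 8)]  # from simple to slightly deeper
scores = {}
for sizes in candidates:
    model = MLPClassifier(
        hidden_layer_sizes=sizes,
        early_stopping=True,       # hold out part of the training data internally
        validation_fraction=0.1,   # share of training data used for that check
        n_iter_no_change=10,       # stop after 10 epochs without improvement
        max_iter=300,
        random_state=42,
    )
    model.fit(X_fit, y_fit)
    scores[sizes] = model.score(X_val, y_val)
    print(sizes, "epochs:", model.n_iter_, "validation accuracy:", round(scores[sizes], 3))
```

If a larger architecture does not beat a smaller one on the validation data, the simpler network is usually the better choice.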

**5. Tools and Resources**
 * Frameworks: Use deep learning frameworks like TensorFlow, Keras, and PyTorch, which provide high-level APIs to experiment with different architectures easily.
 * Community: Engage with communities on platforms like Stack Overflow, GitHub, and specialized forums to get feedback and insights on your architecture.

**WAIT!**
Look at point 2 above! Number of layers? Does this mean that, instead of the single hidden layer (in teal below),
<center>
<img src="https://raw.githubusercontent.com/shstreuber/Data-Mining/master/images/1200px-Neural_network_example.svg.png" width="200">
</center>
we can actually build more layers?

Yes, yes we can. That brings us to Deep Learning, the topic of our next module.

#8. If You Are Stuck
Here are the solutions to the Your Turn problems.

**NOTE** that I have included them in text format only so you can't simply run them; you actually have to type them into the spaces above.



```
# 2.1 Building the insurance_nn dataframe

insurance_nn = pd.DataFrame(insurance, columns = ['age', 'bmi', 'children','charges','region'])
insurance_nn.head()
```





```
# 6. Confusion Matrix and Classification Report

# Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred, labels=mlp1.classes_)
cm_display = ConfusionMatrixDisplay(cm, display_labels=mlp1.classes_).plot()

# Classification Report

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

```





```
# 7. Gradient Descent Exercises

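mlp2 = MLPClassifier(hidden_layer_sizes=(10,), solver='sgd', learning_rate_init=0.01, max_iter=500, verbose=1)
mlp2.fit(X_train, y_train)
y_pred_mlp2 = mlp2.predict(X_test)  # predictions must come from mlp2, not from mlp1
cm_mlp2 = confusion_matrix(y_test, y_pred_mlp2, labels=mlp2.classes_)
cm_display_mlp2 = ConfusionMatrixDisplay(cm_mlp2, display_labels=mlp2.classes_).plot()
print(classification_report(y_test, y_pred_mlp2))

# Replace the three placeholders below with your own choices before running
mlp3 = MLPClassifier(hidden_layer_sizes=a_new_number, solver=pick_another_solver, learning_rate_init=0.01, max_iter=change_the_number_of_iterations, verbose=1)
mlp3.fit(X_train, y_train)
y_pred_mlp3 = mlp3.predict(X_test)  # predictions from mlp3
cm_mlp3 = confusion_matrix(y_test, y_pred_mlp3, labels=mlp3.classes_)
cm_display_mlp3 = ConfusionMatrixDisplay(cm_mlp3, display_labels=mlp3.classes_).plot()
print(classification_report(y_test, y_pred_mlp3))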


```