### Neural Networks and Deep Learning

This chapter introduces the concepts behind neural networks and deep learning.
These are the building blocks of modern NLP techniques such as Transformers and Large Language Models (LLMs).

We start with logistic regression and build up to neural networks. Along the way, we will become familiar with key terms such as layers, nodes, weights, and activation functions.

Modern NLP:

- Works with large data sets
- Uses Transformer-based LLMs (BERT, GPT, LLaMA, T5, BART)
- Covers many applications (traditional NLP applications, text summarization, text generation)

**Neural Networks**:

A neural network is a machine learning model designed to process information in a way that is inspired by neurons in the human brain.
Similar to how biological neurons communicate by passing information, neural networks process data through layers of nodes (neurons).

**Logistic Regression**:

Logistic regression is a classification technique, and a neural network (NN) can be used as a classifier too.
The output of a logistic regression model is in the range from 0 to 1 (probability).
Example: Profitability of a lemonade stand based on temperature.

The higher the temperature, the higher the chance of being profitable: $p = \sigma(mx+b)$, where $p$ is the probability of being profitable and $x$ is the temperature. Here $\sigma$ is a fixed function (the sigmoid), so the question is how to find $m$ and $b$.

Conceptually, the model works in two steps. First, compute a linear function of the temperature ($y=mx+b$). Then, pass the $y$ values through the sigmoid function $\sigma$, which transforms them so that they fall between 0 and 1 and trace an S-shaped curve ($p=\sigma(y)$):

$p = \frac{1}{1+e^{-(mx+b)}}$.

So, first we apply a linear transformation and then a non-linear transformation. These steps are important as they are used in neural networks.
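The two steps above can be sketched in a few lines of NumPy. The values of `m` and `b` below are hypothetical, chosen only so that the curve crosses 0.5 at 80 degrees; they are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for illustration only (not fitted to data)
m = 0.2     # slope
b = -16.0   # intercept

temperatures = np.array([60, 70, 80, 90, 100])

y_linear = m * temperatures + b   # step 1: linear transformation
p = sigmoid(y_linear)             # step 2: non-linear transformation

for t, prob in zip(temperatures, p):
    print(f"{t} degrees -> probability of being profitable: {prob:.3f}")
```

With these assumed values, the probability rises with temperature and always stays between 0 and 1.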

**Logistic Regression and Neural Networks**:

The slope ($m$) is called a weight ($w$), and the intercept ($b$) is called a bias. Together, they are called model parameters.

The sigmoid function is a type of non-linear transformation, or in NN, an **activation function**.

There could be other features affecting lemonade profitability, such as the type of day (weekday or weekend). Then, temperature and day type together form the input (input layer) of an NN.

Then in the middle layer (hidden layer), we apply a non-linear transformation (activation function): $h=\sigma(w_{1}x_{1}+w_{2}x_{2}+b)$.

However, coming up with good features by hand is hard. Thus, we introduce more nodes (neurons) in the hidden layer, and the model gets closer to a true neural network: $h_{2} = \sigma(w_{3}x_{1}+w_{4}x_{2}+b_{2})$.

Now $h$ and $h_{2}$ become new features. These are features discovered by the model. Typically, they are less interpretable than the original inputs.

Then, we will combine them ($h$ and $h_{2}$) to calculate the final probability for a classification problem like this. $p = \sigma(w_{5}h + w_{6} h_{2} + b_{3})$.
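The forward pass through this small network can be traced by hand. The sketch below uses the same symbols as the formulas above; all input and parameter values are hypothetical and serve only to show the order of the computations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs: x1 = temperature rescaled to 0..1, x2 = weekend flag
x1, x2 = 0.9, 1.0

# Hypothetical parameters (in practice these are learned from data)
w1, w2, b = 1.5, 0.8, -1.0      # first hidden node
w3, w4, b2 = -0.7, 1.2, 0.3     # second hidden node
w5, w6, b3 = 2.0, -1.0, -0.5    # output node

# Hidden layer: linear combination followed by the activation function
h = sigmoid(w1 * x1 + w2 * x2 + b)
h2 = sigmoid(w3 * x1 + w4 * x2 + b2)

# Output layer: combine the hidden features into one probability
p = sigmoid(w5 * h + w6 * h2 + b3)

print(f"h = {h:.3f}, h2 = {h2:.3f}, p = {p:.3f}")
```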

The data scientist will decide how many hidden layers, how many neurons, and what activation functions should be used.

Neural networks are a supervised learning technique: the model is trained on historical inputs together with their labels, and the trained model is then used to make predictions for new inputs.

The parameters are determined by minimizing a loss function.
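As an illustration of what a loss function measures, the sketch below computes the log loss (binary cross-entropy), which is the loss that scikit-learn's `MLPClassifier` optimizes. The labels and the two sets of predicted probabilities are made up for this example.

```python
import numpy as np

def log_loss(y_true, p_pred):
    """Average binary cross-entropy between labels and predicted probabilities."""
    return -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

# Made-up labels and predictions for illustration
y_true = np.array([1, 1, 0, 1, 0])
p_good = np.array([0.9, 0.8, 0.2, 0.7, 0.1])   # mostly agrees with the labels
p_bad = np.array([0.4, 0.3, 0.6, 0.5, 0.7])    # mostly disagrees with the labels

loss_good = log_loss(y_true, p_good)
loss_bad = log_loss(y_true, p_bad)

print(f"Loss for good predictions: {loss_good:.3f}")
print(f"Loss for bad predictions:  {loss_bad:.3f}")
```

Training searches for the weights and biases that make this number as small as possible on the training data.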

To create a neural network in Python, we can use **MLPClassifier** (for classification problems) or **MLPRegressor** (for regression problems) from the `sklearn.neural_network` module.

MLP - Multilayer Perceptron (another name for a neural network)

Syntax: MLPClassifier(hidden_layer_sizes=(100,), activation='relu')

Here we create one hidden layer with 100 nodes. Passing (50, 30) instead would create two hidden layers with 50 and 30 nodes.
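To see how `hidden_layer_sizes` shapes the network, the sketch below fits a model with two hidden layers on a small random data set (an assumption made purely for illustration) and inspects the shape of each weight matrix.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Random toy data (assumption): 40 samples, 4 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

model = MLPClassifier(hidden_layer_sizes=(50, 30), max_iter=50, random_state=0)
with warnings.catch_warnings():
    # The short training run may not converge; that is fine for inspecting shapes
    warnings.simplefilter("ignore", ConvergenceWarning)
    model.fit(X_toy, y_toy)

# One weight matrix per pair of consecutive layers
shapes = [w.shape for w in model.coefs_]
print(shapes)           # [(4, 50), (50, 30), (30, 1)]
print(model.n_layers_)  # 4: input, two hidden layers, output
```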

Theoretically, we could set a different activation function for each node. But this is a simple neural network, so the same activation function ("relu") is used for all hidden nodes. For finer control over the activation functions, we can use PyTorch or TensorFlow.

The activation function in the last (output) layer depends on whether we are solving a classification problem or a regression problem, and on what kind of output we expect.
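In scikit-learn the output activation is chosen automatically and can be read from the fitted model's `out_activation_` attribute. The sketch below checks this on random toy data (an assumption for illustration) for a binary classifier, a three-class classifier, and a regressor.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier, MLPRegressor

# Random toy data (assumption): 60 samples, 3 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_binary = (X_toy[:, 0] > 0).astype(int)        # two classes
y_multi = np.arange(60) % 3                     # three classes
y_real = X_toy @ np.array([1.0, -2.0, 0.5])     # continuous target

settings = dict(hidden_layer_sizes=(5,), max_iter=20, random_state=0)
with warnings.catch_warnings():
    # Short runs may not converge; only the architecture matters here
    warnings.simplefilter("ignore", ConvergenceWarning)
    clf_binary = MLPClassifier(**settings).fit(X_toy, y_binary)
    clf_multi = MLPClassifier(**settings).fit(X_toy, y_multi)
    reg = MLPRegressor(**settings).fit(X_toy, y_real)

print(clf_binary.out_activation_)  # logistic
print(clf_multi.out_activation_)   # softmax
print(reg.out_activation_)         # identity
```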

Syntax: MLPClassifier(hidden_layer_sizes=(100,),
                        activation='relu',
                        max_iter=200,
                        random_state=42)

Note: each time we train the model, it starts from random weights and updates them iteratively. Thus, we set the maximum number of iterations (max_iter) and fix the random state (random_state) so that the results are reproducible.
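The effect of `random_state` can be checked directly. In this sketch, the toy data and the helper `first_layer_weights` are hypothetical; the helper trains a small network with a given seed and returns its first weight matrix.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Random toy data (assumption): 40 samples, 4 features, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(40, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

def first_layer_weights(seed):
    """Hypothetical helper: train with the given seed, return the first weight matrix."""
    model = MLPClassifier(hidden_layer_sizes=(5,), max_iter=30, random_state=seed)
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ConvergenceWarning)
        model.fit(X_toy, y_toy)
    return model.coefs_[0]

# Same seed: identical starting weights and identical updates
same_seed = np.allclose(first_layer_weights(42), first_layer_weights(42))
# Different seed: a different random starting point
different_seed = np.allclose(first_layer_weights(42), first_layer_weights(7))

print(f"Same seed gives same weights: {same_seed}")
print(f"Different seed gives same weights: {different_seed}")
```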

We start the Python code with the text classification example discussed in the previous chapter.

In [1]:
### import libraries

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

In [2]:
reviews = pd.read_excel('Chapter3_Popchip_Reviews.xlsx')
reviews.head(3)

|   | Id | UserId | Rating | Priority | Title | Text |
|---|---|---|---|---|---|---|
| 0 | 23689 | A21SYVGVNG8RAS | 5 | Low | Yummy snacks! | Popchips are the bomb!! I use the parmesan ga... |
| 1 | 23690 | AQJYXC0MPRQJL | 5 | Low | Great chip that is different from the rest | I like the puffed nature of this chip that mak... |
| 2 | 23691 | A30NYUHEDLWI0Y | 5 | Low | Great Alternative to Potato Chips | I just love these chips! I was always a big f... |


In [3]:
# import the text preprocessing steps we created in the previous chapter
import Chapter3_maven_text_preprocessing

In [4]:
### Apply the text preprocessing to "Text" in reviews

reviews['Text_Clean'] = Chapter3_maven_text_preprocessing.clean_and_normalize(reviews['Text'])
reviews.head(3)

|   | Id | UserId | Rating | Priority | Title | Text | Text_Clean |
|---|---|---|---|---|---|---|---|
| 0 | 23689 | A21SYVGVNG8RAS | 5 | Low | Yummy snacks! | Popchips are the bomb!! I use the parmesan ga... | popchip bomb use parmesan garlic scoop cotta... |
| 1 | 23690 | AQJYXC0MPRQJL | 5 | Low | Great chip that is different from the rest | I like the puffed nature of this chip that mak... | like puff nature chip make unique chip market ... |
| 2 | 23691 | A30NYUHEDLWI0Y | 5 | Low | Great Alternative to Potato Chips | I just love these chips! I was always a big f... | love chip big fan potato chip not discover p... |


In [5]:
### create a count vectorizer matrix

cv = CountVectorizer(stop_words='english', ngram_range=(1,2), min_df=0.20)
X = cv.fit_transform(reviews['Text_Clean'])

In [6]:
### view the features/inputs X

X_df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())
X_df

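|   | bag | buy | calorie | chip | eat | flavor | good | great | like | love | popchip | potato | potato chip | salt | snack | taste | try |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 4 | 0 | 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 1 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 2 | 1 | 2 | 0 | 1 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 559 | 0 | 0 | 0 | 3 | 3 | 1 | 1 | 5 | 0 | 1 | 1 | 4 | 3 | 0 | 0 | 1 | 0 |
| 560 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 561 | 0 | 0 | 0 | 2 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 562 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |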


In [7]:
### view the target/output y

y = reviews['Priority']
y.head()

0     Low
1     Low
2     Low
3    High
4     Low
Name: Priority, dtype: object

In [8]:
### Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

#### Naive Bayes

In [9]:
from sklearn.naive_bayes import MultinomialNB

In [10]:
### Initialize the Naive Bayes classifier

nb = MultinomialNB()

In [11]:
### Train the model

nb.fit(X_train, y_train)



In [12]:
### Make predictions

y_pred = nb.predict(X_test)
y_pred

array(['Low', 'Low', 'Low', 'Low', 'High', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'High', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'High', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'High', 'Low', 'High', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low'], dtype='<U4')

In [13]:
### Evaluate the model

print(classification_report(y_test, y_pred))
print(f'Accuracy: {round(accuracy_score(y_test, y_pred), 2)}')

              precision    recall  f1-score   support

        High       0.60      0.16      0.25        19
         Low       0.85      0.98      0.91        94

    accuracy                           0.84       113
   macro avg       0.73      0.57      0.58       113
weighted avg       0.81      0.84      0.80       113

Accuracy: 0.84


#### Neural Network

In [14]:
### Import the library

from sklearn.neural_network import MLPClassifier

In [15]:
### Initialize the MLPClassifier

nn = MLPClassifier(hidden_layer_sizes=(100,),
                  activation='relu',
                  max_iter=2000,
                  random_state=42)

In [16]:
### Train the model

nn.fit(X_train, y_train)



In [17]:
### Make predictions

y_pred_nn = nn.predict(X_test)
y_pred_nn

array(['Low', 'Low', 'Low', 'High', 'High', 'High', 'Low', 'High', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'High', 'High',
       'Low', 'Low', 'High', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'High', 'Low', 'High', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'High', 'Low', 'Low',
       'Low', 'Low', 'Low', 'High', 'Low', 'Low', 'Low', 'Low', 'Low',
       'High', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low', 'Low',
       'Low', 'High', 'High', 'Low', 'Low', 'Low', 'High', 'Low', 'Low',
       'Low', 'Low', 'High', 'Low', 'Low', 'Low', 'Low', 'Low', 'High',
       'High', 'Low', 'High', 'Low', 'Low'], dtype='<U4')

In [18]:
### Evaluate the model

print(classification_report(y_test, y_pred_nn))
print(f'Accuracy: {round(accuracy_score(y_test, y_pred_nn), 2)}')

              precision    recall  f1-score   support

        High       0.47      0.47      0.47        19
         Low       0.89      0.89      0.89        94

    accuracy                           0.82       113
   macro avg       0.68      0.68      0.68       113
weighted avg       0.82      0.82      0.82       113

Accuracy: 0.82


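In this case, the accuracy of the neural network (0.82) is slightly lower than that of Naive Bayes (0.84), although its recall on the High class is higher (0.47 vs. 0.16). The neural network is also a much more complex model, which may have led to overfitting.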
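One common way to look for overfitting is to compare accuracy on the training set with accuracy on the test set; a large gap suggests the model has memorized the training data. The sketch below shows the idea on synthetic stand-in data (an assumption: the review data is not reproduced here, so the numbers will differ from those above).

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption): 563 samples, 17 features, imbalanced classes
X_syn, y_syn = make_classification(n_samples=563, n_features=17,
                                   weights=[0.83], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=0.20, random_state=42)

clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000, random_state=42)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    clf.fit(X_tr, y_tr)

# Accuracy on data the model has seen vs. data it has not seen
train_acc = accuracy_score(y_tr, clf.predict(X_tr))
test_acc = accuracy_score(y_te, clf.predict(X_te))

print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
print(f"Gap:            {train_acc - test_acc:.2f}")
```

In the notebook itself, the same comparison would use the existing `nn`, `X_train`, `y_train`, `X_test`, and `y_test` objects.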

#### Neural Network Matrices

In the code below, we are investigating the matrix representation of weights and biases.

In [19]:
### Import libraries

import numpy as np
import pandas as pd
from sklearn.neural_network import MLPClassifier

In [20]:
### Import Data

lemonade_data = pd.read_csv('Chapter4_ lemonade_data.csv')
lemonade_data.head()

|   | temperature | weekend | lemonade_price | blocks_from_park | profitable |
|---|---|---|---|---|---|
| 0 | 90 | 1 | 1.0 | 1 | 1 |
| 1 | 85 | 1 | 1.25 | 2 | 1 |
| 2 | 78 | 0 | 1.5 | 3 | 0 |
| 3 | 95 | 1 | 0.75 | 1 | 1 |
| 4 | 82 | 0 | 1.25 | 4 | 0 |


In [31]:
### Separate inputs and output

X = lemonade_data[['temperature','weekend','lemonade_price','blocks_from_park']]
y = lemonade_data['profitable']

In [33]:
### Fit a neural network

mlp = MLPClassifier(hidden_layer_sizes=(2,),
                   activation='logistic',
                   max_iter=2000,
                   random_state=42)

mlp.fit(X, y) ### Model is trained.



In [34]:
### Access the weights (coefficients) and biases (intercepts)

weights = mlp.coefs_
biases = mlp.intercepts_

In [36]:
### Print out the weight matrices and bias vectors for each layer

for i, (w, b) in enumerate(zip(weights, biases)):
    print(f"Connections {i+1}:")
    print("  Weights:")
    print(w)  # weight matrix
    print("  Biases:")
    print(b)  # bias vector
    print()

Connections 1:
  Weights:
[[-0.06429932  0.07118473]
 [-1.39850612  1.50303929]
 [ 1.20491128 -1.55743248]
 [ 1.08363948 -1.34251138]]
  Biases:
[ 1.7367575  -1.25815555]

Connections 2:
  Weights:
[[-3.11142992]
 [ 1.82913475]]
  Biases:
[0.85036271]



"Connections 1" holds the weights and biases between the input layer and the hidden layer (note that here we have only one hidden layer with two nodes), and "Connections 2" holds those between the hidden layer and the output layer.

In the first weight matrix, each row corresponds to one of the 4 inputs and each column to one hidden node: column 1 contains the weights feeding the 1st node in the hidden layer, and column 2 those feeding the 2nd node. Similarly, the second weight matrix has only 2 rows (one for each hidden node) and 1 column (the single output node).

Note: The weights and biases of a neural network are contained in "weight matrices" and "bias vectors".
(This will help when working with Transformers and LLMs.)
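To confirm that these matrices fully describe the model, we can redo the forward pass by hand with matrix multiplication and compare the result with `predict_proba`. The sketch below uses random stand-in data with 4 inputs (an assumption, so that the block runs without the lemonade file) and the same architecture as above.

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random stand-in data (assumption): 50 samples, 4 inputs, binary target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)

mlp_demo = MLPClassifier(hidden_layer_sizes=(2,), activation='logistic',
                         max_iter=200, random_state=42)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    mlp_demo.fit(X_demo, y_demo)

W1, W2 = mlp_demo.coefs_        # shapes (4, 2) and (2, 1)
b1, b2 = mlp_demo.intercepts_   # shapes (2,) and (1,)

# Manual forward pass: one matrix multiplication per layer
hidden = sigmoid(X_demo @ W1 + b1)            # hidden features, shape (50, 2)
p_manual = sigmoid(hidden @ W2 + b2).ravel()  # output probability, shape (50,)

# Probability of class 1 as computed by scikit-learn
p_sklearn = mlp_demo.predict_proba(X_demo)[:, 1]

matches = np.allclose(p_manual, p_sklearn)
print(f"Manual forward pass matches predict_proba: {matches}")
```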

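Notation for weights: $w_{ij}^{(1)}$ is the weight from input node $i$ to node $j$ in the hidden layer, and the superscript $(1)$ identifies the first weight matrix (input layer to hidden layer). For example, $w_{11}^{(1)}$ is the weight from input node 1 to node 1 in the hidden layer ($h_{1}$).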
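For the lemonade network (4 inputs, 2 hidden nodes), this notation can be written out in matrix form. The layout below is a sketch that assumes rows index the input nodes and columns index the hidden nodes, matching the printed weight matrix; the symbol $b_{j}^{(1)}$ for the bias of hidden node $j$ is introduced here for illustration.

$$
W^{(1)} =
\begin{bmatrix}
w_{11}^{(1)} & w_{12}^{(1)} \\
w_{21}^{(1)} & w_{22}^{(1)} \\
w_{31}^{(1)} & w_{32}^{(1)} \\
w_{41}^{(1)} & w_{42}^{(1)}
\end{bmatrix},
\qquad
h_{j} = \sigma\left(\sum_{i=1}^{4} w_{ij}^{(1)} x_{i} + b_{j}^{(1)}\right), \quad j = 1, 2
$$

Each column of $W^{(1)}$ collects the weights feeding one hidden node, so the whole hidden layer is computed with a single matrix multiplication.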