**CSI 4106 Introduction to Artificial Intelligence** <br/>
*Assignment 3: Neural Networks*

# Identification

Name: Yu-Chen Lee<br/>
Student Number: 300240688

Name: Matsuru Hoshi<br/>
Student Number: 300228879



## 1. Exploratory Analysis

### Loading the dataset

A custom dataset has been created for this assignment. It has been made available on a public GitHub repository:

- [github.com/turcotte/csi4106-f24/tree/main/assignments-data/a3](https://github.com/turcotte/csi4106-f24/tree/main/assignments-data/a3)

Access and read the dataset directly from this GitHub repository in your Jupyter notebook.

You can use this code cell for your import statements and other initializations.

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
import tensorflow as tf
from keras import Sequential, layers, utils


In [4]:
url1 = 'https://raw.githubusercontent.com/turcotte/csi4106-f24/refs/heads/main/assignments-data/a3/cb513_test.csv'
url2 = 'https://raw.githubusercontent.com/turcotte/csi4106-f24/refs/heads/main/assignments-data/a3/cb513_train.csv'
url3 = 'https://raw.githubusercontent.com/turcotte/csi4106-f24/refs/heads/main/assignments-data/a3/cb513_valid.csv'

test_df = pd.read_csv(url1)
train_df = pd.read_csv(url2)
valid_df = pd.read_csv(url3)

### Data Pre-Processing

2. **Shuffling the Rows**:

    - Since examples are generated by sliding a window across each protein sequence, most adjacent examples originate from the same protein and share 20 positions. To mitigate the potential negative impact on model training, the initial step involves shuffling the **rows** of the data matrix.

    In the code, **frac=1** keeps all rows while shuffling, and **reset_index(drop=True)** replaces the shuffled index with a fresh one without keeping the old index as a column.

In [5]:
# frac=1 samples every row (a full shuffle); random_state makes the order reproducible.
test_shuffled_df = test_df.sample(frac=1, random_state=42).reset_index(drop=True)
train_shuffled_df = train_df.sample(frac=1, random_state=42).reset_index(drop=True)
valid_shuffled_df = valid_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Only the last expression of a cell is displayed.
valid_shuffled_df

Unnamed: 0,2,0.05,0,0.4,0.6,0.10,0.1,0.11,0.35,0.12,...,0.346,0.347,0.348,0.349,0.350,0.351,0.352,0.353,0.354,0.355
0,2,0.0556,0.0000,0.0556,0.0000,0.0000,0.0000,0.0000,0.0000,0.0556,...,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.9444,0.0000,0.0
1,1,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,...,0.0000,0.0000,0.0000,0.0000,0.0556,0.1111,0.0000,0.2778,0.0000,0.0
2,0,0.1373,0.0000,0.0196,0.0686,0.0000,0.0196,0.0294,0.0196,0.1176,...,0.0000,0.0000,0.0000,0.0000,0.0000,0.3333,0.0000,0.0000,0.0098,0.0
3,0,0.0476,0.0000,0.0000,0.1429,0.0000,0.0000,0.0000,0.0000,0.0000,...,0.0000,0.0000,0.5238,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0
4,1,0.0000,0.0098,0.0000,0.0000,0.7549,0.0000,0.0000,0.0000,0.0000,...,0.2353,0.0392,0.0294,0.1667,0.1471,0.0294,0.0392,0.0000,0.0098,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7403,0,0.1667,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.3333,0.0000,...,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0
7404,0,0.5000,0.0000,0.0000,0.5000,0.0000,0.0000,0.0000,0.0000,0.0000,...,0.0000,0.0000,0.0000,0.1667,0.0000,0.0000,0.0000,0.0000,0.0000,0.0
7405,1,0.5000,0.0000,0.0000,0.0000,0.0000,0.2500,0.2500,0.0000,0.0000,...,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0000,0.0
7406,0,0.2843,0.0000,0.0098,0.0196,0.0000,0.0490,0.0784,0.0980,0.0098,...,0.0000,0.0196,0.0098,0.0392,0.0196,0.0000,0.0000,0.0000,0.0784,0.0


3. **Scaling of Numerical Features**:

    - Since all 462 features are proportions represented as values between 0 and 1, scaling may not be necessary. In our evaluations, using [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) actually degraded model performance. Within your pipeline, compare the effects of not scaling the data versus applying [MinMaxScaler](https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.MinMaxScaler.html). In the interest of time, a single experiment will suffice. It is important to note that when scaling is applied, a uniform method should be used across all columns, given their homogeneous nature.

    To separate the target column from the DataFrame, we use the **iloc** indexer. The features (X) are all columns from the second onward; the target (y) is the first column (index 0).

In [6]:
X = test_shuffled_df.iloc[:, 1:] # Features
y = test_shuffled_df.iloc[:, 0] # Target, first column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline_with_scale = Pipeline([
    ('scaler', MinMaxScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

noscale_pipeline = Pipeline([
    ('classifier', RandomForestClassifier(random_state=42))
])

pipeline_with_scale.fit(X_train,y_train)
noscale_pipeline.fit(X_train, y_train)

accuracy_no_scaling = accuracy_score(y_test, noscale_pipeline.predict(X_test))
accuracy_with_scaling = accuracy_score(y_test, pipeline_with_scale.predict(X_test))

print(f"Accuracy without scaling: {accuracy_no_scaling}")
print(f"Accuracy with MinMax scaling: {accuracy_with_scaling}")

Accuracy without scaling: 0.67182246133154
Accuracy with MinMax scaling: 0.67182246133154


4. **Isolating the Target and the Data**:

    - In the CSV files, the target and data are combined. To prepare for our machine learning experiments, separate the training data $X$ and the target vector $y$ for each of the three datasets.

In [7]:
X1 = test_shuffled_df.iloc[:, 1:] # Features
y1 = test_shuffled_df.iloc[:, 0] # Target, first column
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=42)

X2 = train_shuffled_df.iloc[:, 1:] # Features
y2 = train_shuffled_df.iloc[:, 0] # Target, first column
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=42)

X3 = valid_shuffled_df.iloc[:, 1:] # Features
y3 = valid_shuffled_df.iloc[:, 0] # Target, first column
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y3, test_size=0.2, random_state=42)

### Model Development & Evaluation

5. **Model Development**:

    - **Dummy Model**: Implement a model utilizing the [DummyClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.dummy.DummyClassifier.html). This model disregards the input data and predicts the majority class. Such a model is sometimes called a straw man model.

    - **Baseline Model**: As a baseline model, select one of the previously studied machine learning algorithms: Decision Trees, K-Nearest Neighbors (KNN), or Logistic Regression. Use the default parameters provided by scikit-learn to train each model as a baseline. Why did you choose this particular classifier? Why do you think it should be appropriate for this specific task?

    - **Neural Network Model**: Utilizing [Keras](https://keras.io) and [TensorFlow](https://www.tensorflow.org), construct a sequential model comprising an input layer, a hidden layer, and an output layer. The input layer should consist of 462 nodes, reflecting the 462 attributes of each example. The hidden layer should include 8 nodes and employ the default activation function. The output layer should contain three nodes, corresponding to the three classes: helix (0), sheet (1), and coil (2). Apply the softmax activation function to the output layer to ensure that the outputs are treated as probabilities, with their sum equaling 1 for each training example.

    We therefore have three models: dummy, baseline, and neural network.

In [21]:
dummy_model = DummyClassifier(strategy='most_frequent')

baseline_model = DecisionTreeClassifier(random_state=42)

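nn_model = Sequential([
    layers.Input(shape=(462,)),            # input layer: 462 features per example
    layers.Dense(8, activation='relu'),    # hidden layer: 8 nodes
    layers.Dense(3, activation='softmax')  # output layer: helix (0), sheet (1), coil (2)
])
# categorical_crossentropy expects one-hot targets (e.g. from utils.to_categorical).
nn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])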

6. **Model Evaluation**:

    - Employ cross-validation to assess the performance of the baseline model. Select a small number of folds to prevent excessive computational demands.

In [15]:
# Stratified 10-fold cross-validation of the baseline model; score holds one accuracy per fold.
score = cross_val_score(baseline_model, X1_train, y1_train, cv=StratifiedKFold(n_splits=10))


**Training neural networks can be time-consuming.** Consequently, their performance is typically assessed once using a validation set. Make sure to not use the test set until the very end of the assignment.

In [19]:
y1_train_categorical = utils.to_categorical(y1_train)
y1_test_categorical = utils.to_categorical(y1_test)
nn_model.fit(X1_train, y1_train_categorical, epochs=10, batch_size=32, verbose=1)

Epoch 1/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 4ms/step - accuracy: 0.5177 - loss: 0.9693
Epoch 2/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7107 - loss: 0.7173
Epoch 3/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7425 - loss: 0.6528
Epoch 4/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7533 - loss: 0.6169
Epoch 5/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7675 - loss: 0.5877
Epoch 6/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7729 - loss: 0.5629
Epoch 7/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7789 - loss: 0.5495
Epoch 8/10
[1m186/186[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step - accuracy: 0.7792 - loss: 0.5295
Epoch 9/10
[1m186/186[0m [32m━━━━━━━━

<keras.src.callbacks.history.History at 0x279a32d50d0>

Assess the models using metrics such as precision, recall, and F1-score.
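
As a minimal sketch of how these metrics could be computed with scikit-learn, the example below uses a small hand-made set of labels and predictions. The arrays `y_true`, `y_pred`, and `probs` are illustrative stand-ins, not results from this notebook; the class names follow the helix (0), sheet (1), coil (2) encoding described above.

```python
import numpy as np
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical labels and predictions (0 = helix, 1 = sheet, 2 = coil).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 0])

# Per-class precision, recall, and F1, plus macro and weighted averages.
print(classification_report(
    y_true, y_pred, target_names=["helix", "sheet", "coil"], zero_division=0
))

# Macro averages give one number per metric, convenient for comparing models.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"macro precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# For a softmax network, the predicted class is the argmax of the probabilities.
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])  # hypothetical model output
nn_pred = probs.argmax(axis=1)
```

The same calls apply to the dummy and baseline models through `model.predict(X)`, and to a Keras model through `model.predict(X).argmax(axis=1)`.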

### Hyperparameter Optimization

7. **Baseline Model:**

    - To ensure a fair comparison for our baseline model, we will examine how varying hyperparameter values affect its performance. This prevents the erroneous conclusion that neural networks inherently perform better, when in fact, appropriate hyperparameter tuning could enhance the baseline model's performance.

    - Focus on the following relevant hyperparameters for each model:

        - [DecisionTreeClassifier](https://scikit-learn.org/dev/modules/generated/sklearn.tree.DecisionTreeClassifier.html): `criterion` and `max_depth`.
  
        - [LogisticRegression](https://scikit-learn.org/1.5/modules/generated/sklearn.linear_model.LogisticRegression.html): `penalty`, `max_iter`, and `tol`.
  
        - [KNeighborsClassifier](https://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html): `n_neighbors` and `weights`.

    - Employ a grid search strategy, or use scikit-learn's built-in [GridSearchCV](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html), to thoroughly evaluate all combinations of hyperparameter values. Cross-validation should be used to assess each combination.

    - Quantify the performance of each hyperparameter configuration using precision, recall, and F1-score as metrics.

    - Analyze the findings and offer insights into which hyperparameter configurations achieved optimal performance for each model.

In [None]:
# Code cell
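
One possible shape for the grid search described above is sketched here for a decision tree. It runs on synthetic data from `make_classification` so that it is self-contained; the names `X_demo`, `y_demo`, the grid values, and the choice of three folds are assumptions made for illustration, not settings taken from this assignment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the training split (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3, random_state=42
)

# Every combination is evaluated: 2 criteria x 4 depths = 8 configurations.
param_grid = {"criterion": ["gini", "entropy"], "max_depth": [3, 5, 10, None]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring=["precision_macro", "recall_macro", "f1_macro"],
    refit="f1_macro",  # the configuration with the best macro F1 is refitted
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best configuration:", search.best_params_)
results = search.cv_results_
for params, p, r, f in zip(
    results["params"],
    results["mean_test_precision_macro"],
    results["mean_test_recall_macro"],
    results["mean_test_f1_macro"],
):
    print(f"{params}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

Macro averaging is used here so that each of the three classes counts equally; a weighted average would be an equally defensible choice if class frequencies differ a lot.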

8. **Neural Network:**

    In our exploration and tuning of neural networks, we focus on the following hyperparameters:

    - **Single hidden layer, varying the number of nodes**. 

        - Start with a single node in the hidden layer. Use a graph to depict the progression of loss and accuracy for both the training and validation sets, with the horizontal axis representing the number of training epochs and the vertical axis showing loss and accuracy. Training this network should be relatively fast, so let's conduct training for 50 epochs. Observing the graph, what do you conclude? Is the network underfitting or overfitting? Why?

        - Repeat the above process using 2 and 4 nodes in the hidden layer. Use the same type of graph to document your observations regarding loss and accuracy.

        - Start with 8 nodes in the hidden layer and progressively double the number of nodes until it surpasses the number of nodes in the input layer. This results in seven experiments and corresponding graphs for the following configurations: 8, 16, 32, 64, 128, 256, and 512 nodes. Document your observations throughout the process.
        
        - Ensure that the **number of training epochs** is adequate for **observing an increase in validation loss**. **Tip**: During model development, start with a small number of epochs, such as 5 or 10. Once the model appears to perform well, test with larger values, like 40 or 80 epochs, which proved reasonable in our tests. Based on your observations, consider conducting further experiments, if needed. How many epochs were ultimately necessary?

In [None]:
# Code cell
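
The node-count experiments could be organized roughly as follows. This is a sketch under stated assumptions: the helper names `build_model` and `plot_history` are hypothetical, the data is random noise standing in for the real training and validation splits, and only 3 epochs are run to keep the example fast (the text above asks for 50). An explicit `Input` layer is used to declare the 462 features.

```python
import matplotlib.pyplot as plt
import numpy as np
from keras import Sequential, layers, utils


def build_model(n_hidden, n_features=462, n_classes=3):
    """One hidden layer with n_hidden nodes, softmax output over the classes."""
    model = Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(n_hidden, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


def plot_history(history, title):
    """Plot training and validation loss and accuracy against the epoch number."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    for ax, key in ((ax_loss, "loss"), (ax_acc, "accuracy")):
        ax.plot(history[key], label="training")
        ax.plot(history["val_" + key], label="validation")
        ax.set_xlabel("epoch")
        ax.set_ylabel(key)
        ax.legend()
    fig.suptitle(title)
    return fig


# Random stand-in data (assumption); replace with the real training/validation sets.
rng = np.random.default_rng(42)
X_tr, X_va = rng.random((200, 462)), rng.random((50, 462))
y_tr = utils.to_categorical(rng.integers(0, 3, 200), num_classes=3)
y_va = utils.to_categorical(rng.integers(0, 3, 50), num_classes=3)

histories = {}
for n_hidden in [1, 2, 4]:
    model = build_model(n_hidden)
    fit = model.fit(X_tr, y_tr, validation_data=(X_va, y_va),
                    epochs=3, batch_size=32, verbose=0)
    histories[n_hidden] = fit.history
    plot_history(fit.history, f"{n_hidden} hidden node(s)")
```

Extending the list to `[8, 16, 32, 64, 128, 256, 512]` covers the doubling experiments. A validation loss that rises while the training loss keeps falling is the usual sign of overfitting; both curves staying high suggests underfitting.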

**Varying the number of layers**

Conduct similar experiments as described above, but this time vary the number of layers from 1 to 4. Document your findings.

How many nodes should each layer contain? Test at least two scenarios. Traditionally, a common strategy involved decreasing the number of nodes from the input layer to the output layer, often by halving, to create a pyramid-like structure. However, recent experience suggests that maintaining a constant number of nodes across all layers can perform equally well. Describe your observations. It is acceptable if both strategies yield similar performance results.

Select one of your models that exemplifies overfitting. In our experiments, we easily constructed a model achieving nearly 100% accuracy on the training data, yet showing no similar improvement on the validation set. Present this neural network along with its accuracy and loss graphs. Explain the reasoning for concluding that the model is overfitting.

In [None]:
# Code cell
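
The two layout strategies mentioned above (halving widths versus constant widths) can be sketched as follows. The helpers `layer_widths` and `build_deep_model`, and the starting widths of 256 and 64, are hypothetical choices for illustration only.

```python
from keras import Sequential, layers


def layer_widths(strategy, n_layers, first_width):
    """Hidden-layer widths for a 'pyramid' (halving) or 'constant' layout."""
    if strategy == "pyramid":
        return [max(first_width // (2 ** i), 1) for i in range(n_layers)]
    if strategy == "constant":
        return [first_width] * n_layers
    raise ValueError(f"unknown strategy: {strategy}")


def build_deep_model(widths, n_features=462, n_classes=3, activation="relu"):
    """Stack one Dense layer per entry in widths, followed by a softmax output."""
    stack = [layers.Input(shape=(n_features,))]
    stack += [layers.Dense(width, activation=activation) for width in widths]
    stack.append(layers.Dense(n_classes, activation="softmax"))
    model = Sequential(stack)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


pyramid_widths = layer_widths("pyramid", 4, 256)   # [256, 128, 64, 32]
constant_widths = layer_widths("constant", 4, 64)  # [64, 64, 64, 64]

pyramid_model = build_deep_model(pyramid_widths)
constant_model = build_deep_model(constant_widths)
print("pyramid parameters :", pyramid_model.count_params())
print("constant parameters:", constant_model.count_params())
```

Comparing parameter counts alongside the learning curves helps separate the effect of depth from the effect of sheer model size, since a model with far more parameters than training examples is more likely to memorize the training set.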

**Activation function**.

Present results for one of the configurations mentioned above by varying the activation function. Test at least `relu` (the default) and `sigmoid`. The choice of the specific model, including the number of layers and nodes, is at your discretion. Document your observations accordingly.

In [None]:
# Code cell
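
To help interpret the comparison, the short NumPy sketch below evaluates both activation functions and their derivatives on a few sample inputs (the input values are arbitrary). In Keras, switching between them only requires changing the `activation` argument of the `Dense` layers.

```python
import numpy as np


def relu(x):
    """Rectified linear unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)


def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + exp(-x)), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print("relu   :", relu(x))
print("sigmoid:", np.round(sigmoid(x), 3))

# Derivatives used during backpropagation.
relu_grad = (x > 0).astype(float)                # 0 for x <= 0, 1 for x > 0
sigmoid_grad = sigmoid(x) * (1.0 - sigmoid(x))   # never exceeds 0.25
print("relu'   :", relu_grad)
print("sigmoid':", np.round(sigmoid_grad, 3))
```

Because the sigmoid derivative is at most 0.25, gradients shrink as they pass through several sigmoid layers, which is one common explanation for slower training in deeper sigmoid networks; whether that shows up on this dataset is something the experiments have to confirm.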

**Regularization** in neural networks is a technique used to prevent overfitting.

One technique involves adding a penalty to the loss function to discourage excessively complex models. Apply an `l2` penalty to some or all layers. Exercise caution, as overly aggressive penalties have been problematic in our experiments. Begin with the default `l2` value of 0.01, then reduce it to 0.001 and 1e-4. Select a specific model from the above experiments and present a case where you successfully reduced overfitting. Include a pair of graphs comparing results with and without regularization. Explain your rationale to conclude that overfitting has been reduced. Do not expect to completely eliminate overfitting. Again, this is a challenging dataset to work with.

In [None]:
# Code cell
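
A minimal sketch of attaching an L2 penalty to the hidden layers is shown below. The helper name `build_l2_model` and the two hidden layers of 64 nodes are assumptions for illustration; the three penalty strengths are the ones suggested above, and `None` gives the unregularized reference model.

```python
from keras import Sequential, layers, regularizers


def build_l2_model(l2_strength, widths=(64, 64), n_features=462, n_classes=3):
    """Dense network whose hidden-layer weights carry an optional L2 penalty."""
    penalty = regularizers.L2(l2_strength) if l2_strength else None
    stack = [layers.Input(shape=(n_features,))]
    stack += [
        layers.Dense(width, activation="relu", kernel_regularizer=penalty)
        for width in widths
    ]
    stack.append(layers.Dense(n_classes, activation="softmax"))
    model = Sequential(stack)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


# One model per penalty strength; None is the baseline without regularization.
l2_models = {strength: build_l2_model(strength) for strength in (None, 0.01, 0.001, 1e-4)}
for strength, model in l2_models.items():
    print(strength, model.count_params())
```

The penalty adds no parameters, so all four models have the same size. Note that the reported training loss includes the penalty term, so loss curves with and without regularization are not directly comparable in absolute value; the gap between training and validation accuracy is often the clearer indicator.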

Dropout layers are a regularization technique in neural networks where a random subset of neurons is temporarily removed during training. This helps prevent overfitting by promoting redundancy and improving the network's ability to generalize to new data. Select a specific model from the above experiments where you have multiple layers and experiment with adding one or a few dropout layers to your network. Experiment with two different rates, say 0.25 and 0.5. Document your observations.

In [None]:
# Code cell
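
The sketch below shows one way to insert dropout layers after each hidden layer, and also illustrates that dropout is only active in training mode. The helper name `build_dropout_model`, the hidden widths, and the all-ones test input are assumptions for illustration.

```python
import numpy as np
from keras import Sequential, layers


def build_dropout_model(rate, widths=(64, 64), n_features=462, n_classes=3):
    """Dense network with a Dropout layer after every hidden layer."""
    stack = [layers.Input(shape=(n_features,))]
    for width in widths:
        stack.append(layers.Dense(width, activation="relu"))
        stack.append(layers.Dropout(rate))
    stack.append(layers.Dense(n_classes, activation="softmax"))
    model = Sequential(stack)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model


dropout_models = {rate: build_dropout_model(rate) for rate in (0.25, 0.5)}

# Dropout in isolation: applied to a vector of ones with rate 0.5.
drop = layers.Dropout(0.5, seed=0)
ones = np.ones((1, 1000), dtype="float32")
out_train = np.asarray(drop(ones, training=True))   # units zeroed, survivors rescaled
out_eval = np.asarray(drop(ones, training=False))   # identity at inference time

print("fraction zeroed in training:", float((out_train == 0).mean()))
print("unchanged at inference:", bool(np.allclose(out_eval, ones)))
```

During training the surviving activations are scaled by 1 / (1 - rate), so the expected sum of the layer's output stays the same; this is why no extra rescaling is needed when the model is evaluated.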

Summarize your experiments using a graphical representation such as Figure 6.15 [on this page](https://egallic.fr/Enseignement/ML/ECB/book/deep-learning.html).

In [None]:
# Code cell

Early stopping is a regularization technique in neural network training wherein the process is halted when validation set performance starts to decline, thus preventing overfitting by avoiding the learning of noise in the training data. From all the experiments conducted thus far, choose **one** configuration (the number of layers, number of nodes, activation function, L2 penalty, and dropout layers) that yielded the best performance. Use a graph of loss and accuracy to determine the optimal number of training iterations for this network. What is the optimal number of epochs for this network configuration and why?

In [None]:
# Code cell
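
Early stopping can be sketched with the Keras `EarlyStopping` callback as below. The data is random noise standing in for the real splits, and the architecture, the patience of 5 epochs, and the cap of 30 epochs are all assumptions chosen to keep the example quick.

```python
import numpy as np
from keras import Sequential, callbacks, layers, utils

# Random stand-in data (assumption); replace with the real training/validation sets.
rng = np.random.default_rng(0)
X_tr, X_va = rng.random((200, 462)), rng.random((60, 462))
y_tr = utils.to_categorical(rng.integers(0, 3, 200), num_classes=3)
y_va = utils.to_categorical(rng.integers(0, 3, 60), num_classes=3)

model = Sequential([
    layers.Input(shape=(462,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# Stop once validation loss has not improved for 5 epochs and keep the best weights.
stopper = callbacks.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)
fit = model.fit(X_tr, y_tr, validation_data=(X_va, y_va),
                epochs=30, batch_size=32, callbacks=[stopper], verbose=0)

val_loss = fit.history["val_loss"]
epochs_run = len(val_loss)
best_epoch = int(np.argmin(val_loss)) + 1  # epochs are numbered from 1
print(f"stopped after {epochs_run} epochs; lowest validation loss at epoch {best_epoch}")
```

The epoch with the lowest validation loss is a reasonable candidate for the optimal number of epochs; plotting `fit.history` alongside it makes the point where the validation curve turns upward visible.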

### Test

9. **Model Comparison**:

    - Evaluate the baseline model on the test set, using the optimal parameter set identified through grid search. Additionally, apply your best-performing neural network configuration to the test set.

    - Quantify the performance of the baseline model (best hyperparameter configuration) and your neural network (best configuration) using precision, recall, and F1-score as metrics. How do these two models compare to the dummy model?

    - Provide recommendations on which model(s) to choose for this task and justify your choices based on the analysis results.

In [None]:
# Code cell
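
A final comparison table could be assembled along the following lines. The sketch uses synthetic data and a decision tree with an arbitrary `max_depth=5` as a stand-in for the tuned baseline; none of the numbers it prints are results for this assignment. For the neural network, predictions would come from `model.predict(X).argmax(axis=1)` and be scored with the same function.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the train/test files (assumption).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=42
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

candidates = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

rows = []
for name, model in candidates.items():
    model.fit(X_fit, y_fit)
    p, r, f, _ = precision_recall_fscore_support(
        y_hold, model.predict(X_hold), average="macro", zero_division=0
    )
    rows.append({"model": name, "precision": p, "recall": r, "f1": f})

comparison = pd.DataFrame(rows).set_index("model")
print(comparison.round(3))
```

With three classes, a majority-class predictor gets a macro recall of exactly 1/3 (perfect recall on one class, zero on the other two), which gives a concrete floor against which the other models can be judged.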

# Resources