<a href="https://colab.research.google.com/github/shstreuber/AI/blob/main/Week7_RNNs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**The Problem with Feed-Forward Neural Networks**
So far, you have used Neural Networks to analyze data and perform classifications based on their analysis--classifications of numbers, images, and the like.

Remember the code below from [one of the very first neural networks you encountered in this class](https://github.com/shstreuber/AI/blob/main/Week1_EasyEquationNN5.ipynb)?

**TO DO**: Execute this code 5 times in a row:

In [None]:
import tensorflow as tf
import numpy as np
from tensorflow import keras

# Sample Data
xs = np.array([-1.0,  0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-1.0,  1.0, 3.0, 5.0, 7.0, 9.0], dtype=float)

# The Basic Model
model = tf.keras.Sequential([tf.keras.Input([1]), keras.layers.Dense(units=1)])
model.compile(optimizer='sgd', loss='mean_squared_error')

# Train the Model with 10 epochs
model.fit(xs, ys, epochs=10)

Compare the outputs of all 5 runs. Does the learning improve from run 1 to run 5? How can we improve the model? Record your thoughts below:

#**The Solution: Recurrent Neural Networks**
Recurrent neural networks (RNNs) are a type of artificial neural network that has powered voice assistants such as Apple’s Siri and Google’s voice search. At a high level, an RNN processes sequences (daily stock prices, sentences, or sensor measurements) one element at a time while retaining a **memory** (called a state) of what has come previously in the sequence.

<img src="https://github.com/shstreuber/Data-Mining/blob/master/images/Structure-of-simple-recurrent-neural-network-RNN-and-unfolded-RNN.png?raw=true">

**How It Works**:
1. **Inputs & Hidden State**: At each step, the RNN receives an input *x* and combines it with information from the previous step (stored in a hidden state *h*).
2. **Recurrence**: This hidden state is passed to the next time step, allowing the network to retain context from earlier inputs.
3. **Output**: The network processes all steps and produces an output, such as predicting the next word in a sentence or forecasting stock prices.
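
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Keras implementation: the function name `rnn_forward`, the weight names `W_xh`, `W_hh`, `b_h`, and the random weights are all assumptions made for this example.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at every step."""
    h = np.zeros(W_hh.shape[0])           # initial hidden state: no memory yet
    states = []
    for x in xs:                          # process one element of the sequence at a time
        # combine the current input x with the previous hidden state h
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)                  # h is passed on to the next time step
    return np.array(states)

# Hypothetical sizes and random (untrained) weights, for illustration only
rng = np.random.default_rng(0)
hidden_size, n_features = 4, 1
W_xh = rng.normal(scale=0.5, size=(hidden_size, n_features))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = np.array([[0.1], [0.2], [0.3], [0.4], [0.5]])  # 5 time steps, 1 feature
hidden_states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(hidden_states.shape)  # (5, 4): one hidden state of size 4 per time step
```

In a trained network, the last hidden state would typically be fed into an output layer (for example, a Dense layer) to produce the prediction.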

##**Limitations**
Simple RNN models usually run into two major issues. Both are related to the gradient, which is the slope of the loss (error) function with respect to the model's weights.

1. [Vanishing Gradient problem](https://aiml.com/what-do-you-mean-by-vanishing-and-exploding-gradient-problem-and-how-are-they-typically-addressed/) occurs when the gradient becomes so small that updating parameters becomes insignificant; eventually the algorithm stops learning:

<img src ="https://camo.githubusercontent.com/50601245fe1be8ab8a254056e6f7cd26ce2eb4c29dfe4adcc10363def93ae380/68747470733a2f2f7777772e6b646e7567676574732e636f6d2f77702d636f6e74656e742f75706c6f6164732f76616e697368696e672d6772616469656e742d70726f626c656d2d31322e706e67" height =200>

Think of it like playing [silent post](https://actiwity.com/details/silentpost): the message gets weaker and more distorted each time it is passed along.

2. [Exploding Gradient problem](https://aiml.com/what-do-you-mean-by-vanishing-and-exploding-gradient-problem-and-how-are-they-typically-addressed/) occurs when the gradient becomes too large, which makes the model unstable.
<img src = "https://aiml.com/wp-content/uploads/2023/11/vanishing-and-exploding-gradient-1.png" height=200>

In this case, larger error gradients accumulate, and the model weights become too large. This issue can cause longer training times and poor model performance. It is less common than the Vanishing Gradient.
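
To get a feel for both problems, here is a deliberately simplified sketch. It assumes that backpropagating through each time step multiplies the gradient by the same constant factor; real networks multiply by matrices that change from step to step, so this is only an intuition aid. The function name `backprop_factor` is made up for this example.

```python
def backprop_factor(per_step_factor, n_steps):
    """Multiply a gradient of 1.0 by the same factor once per time step."""
    gradient = 1.0
    for _ in range(n_steps):
        gradient *= per_step_factor
    return gradient

# 60 time steps, the same window length used later in this notebook
for factor in (0.9, 1.0, 1.1):
    print(f"factor {factor}: gradient after 60 steps = {backprop_factor(factor, 60):.4f}")
```

A factor slightly below 1 shrinks the gradient to roughly 0.002 after 60 steps (vanishing), while a factor slightly above 1 grows it to about 300 (exploding).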

Watch this video to understand how these two concepts operate:

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('qO_NLVjD6zE', width=600, height=400)

Advanced RNN architectures such as LSTM and GRU mitigate the Vanishing Gradient problem. That's what we will see next.

#**MasterCard Stock Price Prediction Using LSTM & GRU**
Below, we will use Kaggle’s MasterCard stock dataset from May-25-2006 to Oct-11-2021 to train an LSTM and a GRU model to forecast the stock price. As before, we will start with an Exploratory Data Analysis and data preprocessing and then build and train each model. We will also evaluate which model works best.

##**0. Importing the Libraries and the Data**

In [None]:
# Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout, GRU, Bidirectional
from tensorflow.keras.optimizers import SGD
from tensorflow.random import set_seed
set_seed(455)
np.random.seed(455)

dataset = pd.read_csv(
    "https://raw.githubusercontent.com/shstreuber/Data-Mining/refs/heads/master/data/Mastercard_stock_history.csv", index_col="Date", parse_dates=["Date"]
).drop(["Dividends", "Stock Splits"], axis=1)
print(dataset.head())

##**1. Exploratory Data Analysis (EDA)**

In [None]:
dataset.describe()

In [None]:
dataset.isna().sum()

In [None]:
# Checking the data distribution
tstart = 2016
tend = 2020

def train_test_plot(dataset, tstart, tend):
    dataset.loc[f"{tstart}":f"{tend}", "High"].plot(figsize=(16, 4), legend=True)
    dataset.loc[f"{tend+1}":, "High"].plot(figsize=(16, 4), legend=True)
    plt.legend([f"Train (Before {tend+1})", f"Test ({tend+1} and beyond)"])
    plt.title("MasterCard stock price")
    plt.show()

train_test_plot(dataset,tstart,tend)

##**2. Preprocessing**


In [None]:
#Setting up Training and Test Sets
def train_test_split(dataset, tstart, tend):
    train = dataset.loc[f"{tstart}":f"{tend}", "High"].values
    test = dataset.loc[f"{tend+1}":, "High"].values
    return train, test
training_set, test_set = train_test_split(dataset, tstart, tend)

In [None]:
# Scaling the inputs to the range [0, 1] with MinMaxScaler--this is min-max normalization, not standardization
sc = MinMaxScaler(feature_range=(0, 1))
training_set = training_set.reshape(-1, 1)
training_set_scaled = sc.fit_transform(training_set)

In [None]:
print("This is the beginning of the Training Set BEFORE scaling \n",training_set)
print("This is the beginning of the Training Set AFTER scaling \n",training_set_scaled)
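
What the scaler does can be reproduced by hand. The sketch below uses made-up prices and two hypothetical helper functions (`min_max_scale`, `min_max_unscale`); it mirrors the default (0, 1) range of `MinMaxScaler` but is not the scikit-learn implementation.

```python
def min_max_scale(values):
    """Map values linearly so that the minimum becomes 0 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values], lo, hi

def min_max_unscale(scaled, lo, hi):
    """Undo the scaling, as sc.inverse_transform() does for the predictions later on."""
    return [s * (hi - lo) + lo for s in scaled]

prices = [100.0, 150.0, 125.0, 200.0]   # made-up prices, not taken from the dataset
scaled, lo, hi = min_max_scale(prices)
print(scaled)                            # [0.0, 0.5, 0.25, 1.0]
print(min_max_unscale(scaled, lo, hi))   # [100.0, 150.0, 125.0, 200.0]
```

Because the scaler is fitted on the training set only, later values that are higher than the training maximum will be scaled to numbers greater than 1.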

In [None]:
# The split_sequence function uses a training dataset and converts it into inputs (X_train) and outputs (y_train).

def split_sequence(sequence, n_steps):
    X, y = list(), list()   # initialize two empty lists called X and y
    for i in range(len(sequence)): # loop through the sequence argument and calculate the end_ix variable by adding the current index i to the n_steps argument
        end_ix = i + n_steps
        if end_ix > len(sequence) - 1: # If end_ix is greater than the length of the sequence minus 1, the loop is broken
            break
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix] # create two variables seq_x and seq_y by slicing the sequence from the current index i to end_ix and selecting the value at end_ix
        X.append(seq_x) # append to the X and y lists
        y.append(seq_y)
    return np.array(X), np.array(y) # return X and y as numpy arrays


n_steps = 60 # initialize with 60
features = 1
# split into samples
X_train, y_train = split_sequence(training_set_scaled, n_steps) # assign the output of calling the split_sequence function with the training_set_scaled argument and n_steps variable

In [None]:
# We are working with a univariate series, so the number of features is one, and we need to reshape the X_train to fit on the LSTM model.
# The X_train has [samples, timesteps], and we will reshape it to [samples, timesteps, features].

# Reshaping X_train for model
X_train = X_train.reshape(X_train.shape[0],X_train.shape[1],features)
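
To see what the sliding window produces, here is a toy version with a window of 3 instead of 60. The helper `make_windows` is a hypothetical pure-Python stand-in that follows the same logic as `split_sequence`; the numbers are made up.

```python
def make_windows(sequence, n_steps):
    """Turn a sequence into (window, next value) pairs."""
    X, y = [], []
    for i in range(len(sequence) - n_steps):
        X.append(sequence[i:i + n_steps])   # n_steps consecutive values as input
        y.append(sequence[i + n_steps])     # the value right after the window as target
    return X, y

X_demo, y_demo = make_windows([10, 20, 30, 40, 50, 60], n_steps=3)
for window, target in zip(X_demo, y_demo):
    print(window, "->", target)
# [10, 20, 30] -> 40
# [20, 30, 40] -> 50
# [30, 40, 50] -> 60
```

A sequence of length 6 with a window of 3 yields 3 samples; in general, you get `len(sequence) - n_steps` samples.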

##**3. The LSTM Model**
The Long Short-Term Memory (LSTM) network is an advanced type of RNN designed to mitigate the vanishing gradient problem (exploding gradients can still occur and are usually handled with gradient clipping). Just like a simple RNN, an LSTM has repeating modules, but their internal structure is different. Think of an LSTM as a memory bank for the network. It has three gates: the Forget Gate, the Input Gate, and the Output Gate.

This is what an LSTM model looks like:

<img src="https://cdn-images-1.medium.com/max/1500/1*Mw4W7FZUbSr4EoriB5GuqQ.jpeg">

In an LSTM, the Forget Gate, Input Gate, and Output Gate work together to control the flow of information through the network. They use point-by-point multiplication and addition to manage how much information to keep, update, or output at each step in a sequence. Here's how each gate works:

1. **Forget Gate**:
The Forget Gate decides which information in the memory (the cell state) should be discarded.
* It looks at the current input (data at the current time step) and the previous output (previous state).
* It produces values between 0 and 1 using a sigmoid function. A value of 0 means "forget everything," and 1 means "keep everything."
* Point-by-point multiplication: The output from the Forget Gate (values between 0 and 1) is multiplied element-wise with the previous cell state, controlling how much of the previous memory should be kept.
2. **Input Gate**:
The Input Gate decides what new information should be added to the memory.
It has two parts:
* A sigmoid activation that controls which values to update.
* A tanh activation that creates new candidate values to add to the cell state.
* Point-by-point multiplication: The output from the sigmoid gate multiplies element-wise with the new candidate values (from the tanh) to decide how much of the new information should be added to the memory.
3. **Output Gate**:
The Output Gate controls what part of the memory should be used to produce the current output.
* It takes the current input and the previous output (previous hidden state) and passes them through a sigmoid function.
* The updated memory (cell state) is passed through a tanh function, which squashes its values to between -1 and 1.
* Point-by-point multiplication: The output from the sigmoid gate is multiplied element-wise with the tanh of the updated cell state to produce the final output.
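
The interplay of the three gates can be written out as one time step in NumPy. This is a sketch of the standard textbook LSTM equations, not the Keras implementation; the function `lstm_step`, the dictionaries `W`, `U`, `b`, and the random weights are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts keyed by 'f', 'i', 'c', 'o'."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate: values in (0, 1)
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate: which values to update
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # new candidate values
    c_t = f * c_prev + i * c_tilde                              # keep part of the old memory, add new
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate
    h_t = o * np.tanh(c_t)                                      # final output of this step
    return h_t, c_t

# Hypothetical sizes and random (untrained) weights, for illustration only
rng = np.random.default_rng(1)
n_units, n_features = 3, 1
W = {g: rng.normal(size=(n_units, n_features)) for g in "fico"}
U = {g: rng.normal(size=(n_units, n_units)) for g in "fico"}
b = {g: np.zeros(n_units) for g in "fico"}

h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in np.array([[0.2], [0.4], [0.6]]):   # three time steps with one feature each
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (3,) (3,)
```

Every `*` in `lstm_step` is the point-by-point (element-wise) multiplication described above, and the `+` in the cell-state update is the point-by-point addition.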

<hr>

Our model will consist of a single hidden LSTM layer and an output layer. You can experiment with the number of units: more units give the model more capacity, but they also increase training time and the risk of overfitting. For this experiment, we will use 125 LSTM units with tanh activation and set the input shape to (n_steps, features).

We will compile the model with an RMSprop optimizer and mean square error as a loss function.

###**3.1 Build the Model**

In [None]:
# The LSTM architecture
from tensorflow.keras import Input # This is to define the input layer

model_lstm = Sequential()
model_lstm.add(Input(shape=(n_steps, features)))  # Explicitly defining the input layer
model_lstm.add(LSTM(units=125, activation="tanh"))
model_lstm.add(Dense(units=1))

# Compiling the model
model_lstm.compile(optimizer="RMSprop", loss="mse", metrics=["mae"])  # MAE instead of accuracy, because this is a regression task

model_lstm.summary()


###**3.2 Train the Model**

In [None]:
model_lstm.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2) # train for 50 epochs with a batch size of 32

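In this case, the loss value is very small, which indicates that the model fits the training data well. Keep in mind that the loss is computed on the scaled prices (between 0 and 1), so it is not expressed in dollars.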

###**3.3 Running the model on the Test Set**
We are going to repeat the preprocessing for the test set: first we scale the inputs with the scaler fitted on the training data, then we split them into samples, reshape them, predict, and finally inverse-transform the predictions back to the original price scale.

In [None]:
# Select the "High" column from the dataset and assign it to the variable dataset_total
dataset_total = dataset.loc[:, "High"]

# Defining the Inputs
# Select the last part of the dataset as inputs for the test set.
# This takes values from dataset_total, starting from an index offset by the length of the test set and n_steps.
inputs = dataset_total[len(dataset_total) - len(test_set) - n_steps:].values

# Reshape inputs to a 2D array with one column to be used in the model
inputs = inputs.reshape(-1, 1)  # Reshapes the data into a 2D array (one column)

# Preprocessing with Scaling
# Apply the scaling transformation to the inputs using the scaler (sc) to normalize the data
inputs = sc.transform(inputs)  # Scale the inputs using the sc.transform() method

# Splitting into samples
# Split the sequence of inputs into samples for X_test (features) and y_test (labels)
X_test, y_test = split_sequence(inputs, n_steps)

# Reshape the input data for the model
# The model expects 3D input (samples, time steps, features).
# Reshape X_test into a 3D array where the number of samples is the first dimension,
# n_steps (sequence length) is the second dimension, and the third dimension is the number of features.
X_test = X_test.reshape(X_test.shape[0], X_test.shape[1], features)

# Predicting Stock Prices
# Use the trained LSTM model to predict the stock prices based on the input sequences in X_test
predicted_stock_price = model_lstm.predict(X_test)

# Inverse scaling transformation
# The model outputs scaled predictions, so we need to inverse transform the predicted values
# using the scaler (sc) to obtain the actual stock price values
predicted_stock_price = sc.inverse_transform(predicted_stock_price)


###**3.4 Plot Predictions vs. Actual Values from the Test Set**
The code below defines two functions: `plot_predictions` and `return_rmse`. The `plot_predictions` function visualizes the real and predicted stock prices over time by plotting the values in a line graph, with the real values in gray and the predicted values in red. The `return_rmse` function calculates the Root Mean Squared Error (RMSE) between the actual and predicted stock prices, providing a measure of prediction accuracy. Finally, the code calls both functions to display the stock price predictions and the RMSE value for the test dataset and predicted stock prices.

In [None]:
def plot_predictions(test, predicted):
    # This function takes in two arguments:
    # 'test' represents the real stock price values (true values),
    # 'predicted' represents the predicted stock price values by the model.

    # Plotting the real stock prices in gray color
    plt.plot(test, color="gray", label="Real")
    # Plotting the predicted stock prices in red color
    plt.plot(predicted, color="red", label="Predicted")

    # Set the title of the plot
    plt.title("MasterCard Stock Price Prediction")

    # Label the x-axis as 'Time' (since the data is over time)
    plt.xlabel("Time")

    # Label the y-axis as 'MasterCard Stock Price' (the variable being predicted)
    plt.ylabel("MasterCard Stock Price")

    # Show the legend to differentiate between real and predicted values
    plt.legend()

    # Display the plot
    plt.show()


def return_rmse(test, predicted):
    # This function calculates and prints the Root Mean Squared Error (RMSE) between the actual and predicted stock prices.

    # Use numpy to compute the RMSE, which gives an indication of the prediction accuracy
    # RMSE is calculated as the square root of the mean squared error (MSE).
    rmse = np.sqrt(mean_squared_error(test, predicted))

    # Print the RMSE value with two decimal places for clarity
    print("The root mean squared error is {:.2f}.".format(rmse))

# Call the plot_predictions function to visualize the real vs predicted stock prices
plot_predictions(test_set, predicted_stock_price)

# Call the return_rmse function to calculate and print the RMSE for the predictions
return_rmse(test_set, predicted_stock_price)
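
To make the RMSE number easier to interpret, here is the same calculation done by hand on three made-up prices. The helper `rmse` is a hypothetical pure-Python equivalent of `np.sqrt(mean_squared_error(...))`.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square the errors, average them, take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

actual = [100.0, 102.0, 104.0]      # made-up real prices
predicted = [101.0, 100.0, 107.0]   # made-up predictions: errors of -1, 2, and -3
print(round(rmse(actual, predicted), 2))  # 2.16
```

The RMSE is expressed in the same unit as the data (here: dollars), and large errors weigh more heavily than small ones because they are squared before averaging.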


##**4. GRU Model**

The GRU architecture is similar to the LSTM (Long Short-Term Memory) model but simpler and faster.
1. **Gates**: GRU has only two main gates that control the flow of information:
- **Update Gate**: Decides how much of the previous information (memory) should be kept and how much should be replaced with new information. It’s like deciding how much of the past you want to remember and how much to forget.
- **Reset Gate**: Decides how much of the previous memory should be discarded when considering the new input. It helps decide what past information should be ignored when making new predictions.
2. **Memory Update**: GRU combines the new input with the memory from previous steps, using the update and reset gates. It keeps important past information and adds new insights from the current data.
3. **Output**: The updated memory is then used to make predictions or output decisions, such as predicting the next value in a sequence.

The GRU model is efficient because it simplifies the process compared to LSTM models, while still being able to capture important patterns in sequential data.

Unlike LSTM, GRU does not have a cell state C(t). It only has a hidden state h(t), and due to the simpler architecture, GRU typically has a lower training time than LSTM models. It takes the input x(t) and the hidden state from the previous time step h(t-1) and outputs the new hidden state h(t).
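
One GRU time step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the Keras implementation: the function `gru_step`, the dictionaries `W`, `U`, `b`, and the random weights are made up, and the roles of `z` and `1 - z` in the last line are swapped in some textbooks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step. W, U, b are dicts keyed by 'z', 'r', 'h'."""
    z = sigmoid(W["z"] @ x_t + U["z"] @ h_prev + b["z"])              # update gate
    r = sigmoid(W["r"] @ x_t + U["r"] @ h_prev + b["r"])              # reset gate
    h_tilde = np.tanh(W["h"] @ x_t + U["h"] @ (r * h_prev) + b["h"])  # candidate state
    return z * h_prev + (1.0 - z) * h_tilde                           # blend old memory and candidate

# Hypothetical sizes and random (untrained) weights, for illustration only
rng = np.random.default_rng(3)
n_units, n_features = 3, 1
W = {g: rng.normal(size=(n_units, n_features)) for g in "zrh"}
U = {g: rng.normal(size=(n_units, n_units)) for g in "zrh"}
b = {g: np.zeros(n_units) for g in "zrh"}

h = np.zeros(n_units)
for x_t in np.array([[0.2], [0.4], [0.6]]):   # three time steps with one feature each
    h = gru_step(x_t, h, W, U, b)
print(h.shape)  # (3,)
```

Compared with an LSTM step, there is no separate cell state and one gate fewer, which means fewer weights to train for the same number of units.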

<img src ="https://cdn-images-1.medium.com/max/1500/1*zFhmhw_SZcX4kUVQH-z2aw.jpeg">

We are going to keep everything the same and just replace the LSTM layer with the GRU layer so we can compare the results. The model structure contains a single GRU layer with 125 units and an output layer.


###**4.1 Build the Model**

In [None]:
from tensorflow.keras import Input

model_gru = Sequential()

# Define the input shape using the Input() layer explicitly
model_gru.add(Input(shape=(n_steps, features)))  # n_steps: number of time steps, features: number of features at each time step

# Add GRU layer
model_gru.add(GRU(units=125, activation="tanh"))

# Add Dense output layer
model_gru.add(Dense(units=1))

# Compile the model
model_gru.compile(optimizer="RMSprop", loss="mse", metrics=["mae"])  # MAE instead of accuracy, because this is a regression task

# Print the model summary
model_gru.summary()


###**4.2 Train the Model**

In [None]:
model_gru.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)

###**4.3 Run the Model on the Test Set**

In [None]:
GRU_predicted_stock_price = model_gru.predict(X_test)
GRU_predicted_stock_price = sc.inverse_transform(GRU_predicted_stock_price)

###**4.4 Plot Predictions vs. Actual Values from the Test Set**

In [None]:
plot_predictions(test_set, GRU_predicted_stock_price)
return_rmse(test_set,GRU_predicted_stock_price)

**Your Task**: Which model performs better? What could be the reason? Record your thoughts below, then read on.

The difference in RMSE (Root Mean Squared Error) between the GRU and LSTM models could be due to several factors:

**1. Model Complexity:**

LSTM has a more complex architecture compared to GRU. Its three gates allow it to better capture long-term dependencies in the data. As a result, it may perform better at modeling sequential data, especially if the data has complex patterns or long-term relationships.
GRU is simpler and has only two gates (update and reset), which can make it faster and easier to train but potentially less effective at capturing long-term dependencies.

**2. Data Characteristics:**

Our data may benefit more from the LSTM’s ability to handle long-term dependencies, especially since it contains complex temporal relationships and trends over a long period. LSTM might be able to capture these patterns better. If the data is simpler or doesn't have many long-term dependencies, GRU may be sufficient, but it might not perform as well as LSTM.

**3. Hyperparameters:**

Differences in model performance can also be influenced by the choice of hyperparameters (e.g., number of units, learning rate, batch size). Even slight differences in how the models are trained can result in significant changes in performance.
The GRU model might need further tuning in terms of the number of units, learning rate, or other training parameters to perform better.

**4. Training and Optimization:**

The optimizer (e.g., RMSprop) and the training process for each model might be influencing the RMSE. Neither architecture is consistently easier to train; depending on the task, one may need more tuning than the other.

**5. Overfitting or Underfitting:**

If either model is overfitting or underfitting the training data, this can lead to a higher RMSE on the test set. You might want to check the validation loss and the generalization ability of the models to determine if this is the case.
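
One way to check for overfitting is to compare the training loss with the validation loss per epoch. In Keras, `model.fit()` returns a History object whose `.history` attribute is a dictionary of per-epoch values. The helper below, `summarize_history`, is a hypothetical function written for this illustration, and the loss values are made up.

```python
def summarize_history(history):
    """Summarize a dict with 'loss' and 'val_loss' lists (one value per epoch)."""
    val_loss = history["val_loss"]
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": best_index + 1,                      # epochs are counted from 1
        "best_val_loss": val_loss[best_index],
        "final_gap": val_loss[-1] - history["loss"][-1],   # a large positive gap hints at overfitting
    }

# Made-up values: training loss keeps falling while validation loss rises again
fake_history = {
    "loss": [0.05, 0.02, 0.01, 0.005],
    "val_loss": [0.06, 0.03, 0.035, 0.05],
}
print(summarize_history(fake_history))
```

To use it on a real run, you could store the return value of the training call, for example `history = model_lstm.fit(...)`, and then call `summarize_history(history.history)`.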

##**5. Transformer Model**
A [Transformer model](https://www.techtarget.com/searchenterpriseai/definition/transformer-model) is a type of machine learning model mainly used for tasks involving sequences, like language translation, text generation, or time series prediction. What makes it different from older models like LSTMs and GRUs is how it handles and processes data. A transformer architecture consists of an encoder and decoder that work together. The attention mechanism lets transformers encode the meaning of words based on the estimated importance of other words or tokens. This enables transformers to process all words or tokens in parallel for faster performance, helping drive the growth of increasingly bigger LLMs.

Using the attention mechanism, the encoder block transforms each word or token into a vector that is further weighted by the other words. For example, in the following two sentences, the meaning of *it* is weighted differently, owing to the change of the word from *filled* to *emptied*:
* He poured the pitcher into the cup and *filled* it.
* He poured the pitcher into the cup and *emptied* it.

The attention mechanism connects *it* to the cup being filled in the first sentence and to the pitcher being emptied in the second sentence.

The decoder essentially reverses the process in the target domain:

<img src = "https://www.techtarget.com/rms/onlineimages/transformer_model_architecture-f.png" width = 300>

The important mechanisms are:
1. **Self-Attention**:
The key idea behind a Transformer is self-attention, which allows the model to look at all parts of the input sequence simultaneously (instead of step-by-step like LSTMs).
This means the model can focus on important parts of the input, regardless of their position in the sequence. For example, in a sentence, the model can directly focus on relevant words, even if they're far apart.
2. **Attention Mechanism**:
The attention mechanism helps the model decide which parts of the input sequence should be given more importance when making predictions. It works like this:
For each word (or time step), the model looks at all other words or time steps in the sequence and decides how much attention to pay to each one. This helps the model capture relationships between words or time steps, even if they're far apart.
3. **Multi-Head Attention**:
Multi-head attention means the model uses several "attention heads," or separate attention mechanisms, to look at different parts of the sequence in parallel. This allows the model to capture various relationships in the data at once.
4. **Feed-Forward Networks**:
After the attention step, the Transformer uses a simple neural network (called a feed-forward network) to process the output further. This network helps the model make more complex decisions based on the attention results.
5. **Positional Encoding**:
Unlike older models like LSTMs, the Transformer doesn't process the data in a sequence. Instead, it processes all parts at once. To keep track of the order of the sequence (e.g., the order of words in a sentence or time steps in a series), positional encoding is added, which tells the model the position of each word or time step in the sequence.
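
Self-attention itself is only a few matrix operations. The sketch below shows single-head scaled dot-product attention in NumPy; the function names, the sizes, and the random projection matrices `W_q`, `W_k`, `W_v` are assumptions for illustration, not the internals of the Keras `MultiHeadAttention` layer.

```python
import numpy as np

def softmax(scores):
    """Turn each row of scores into weights that sum to 1."""
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract the row max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X of shape (steps, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # how strongly each step relates to every other step
    weights = softmax(scores)                  # attention weights, one row per step
    return weights @ V, weights                # weighted mix of the values

# Hypothetical sizes and random (untrained) weights, for illustration only
rng = np.random.default_rng(2)
n_steps_demo, d_model, d_k = 5, 8, 4
X = rng.normal(size=(n_steps_demo, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape, weights.shape)  # (5, 4) (5, 5)
```

Row *i* of `weights` says how much attention step *i* pays to each of the 5 steps. Multi-head attention runs several such computations in parallel with different projection matrices and combines their outputs.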

**How It Works**:
* **Input**: The input sequence (like a sentence or a time series) is passed into the Transformer.
* **Self-Attention**: The model looks at the entire sequence and determines which parts are most important to focus on at each step.
* **Processing**: The model then processes the sequence through multiple layers of attention and feed-forward networks.
* **Output**: After processing the sequence, the model outputs its prediction (like translating a sentence or predicting a future value).
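
Before the input enters the attention layers, position information is usually added to it. A common choice is the sinusoidal positional encoding, sketched below in NumPy. The function name is made up for this illustration, and it assumes that `d_model` is an even number.

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """Return an (n_positions, d_model) matrix of sine/cosine position signals (d_model must be even)."""
    positions = np.arange(n_positions)[:, None]              # 0, 1, 2, ... as a column
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices as a row
    angles = positions / np.power(10000.0, dims / d_model)   # one frequency per pair of dimensions
    encoding = np.zeros((n_positions, d_model))
    encoding[:, 0::2] = np.sin(angles)                       # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles)                       # odd dimensions use cosine
    return encoding

pe = sinusoidal_positional_encoding(n_positions=60, d_model=8)
print(pe.shape)  # (60, 8)
print(pe[0])     # position 0: sine entries are 0, cosine entries are 1
```

Each position receives a unique pattern of values, which is added to the input vectors so that the model can tell time steps apart even though it processes them all at once.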

**Why Transformers Are Powerful:**

1. **Efficiency**: Transformers can process the entire sequence at once, unlike models like LSTMs, which process one step at a time. This makes Transformers faster, especially for long sequences.
2. **Long-Term Dependencies**: The attention mechanism allows Transformers to capture long-range dependencies more easily than LSTMs or GRUs, which can struggle with long sequences.
3. **Scalability**: Transformers can handle large datasets and are often used in very complex tasks, such as large language models like GPT.

In short, Transformers are great at understanding complex patterns in sequences by focusing on important parts of the data, making them ideal for tasks like text processing and time series forecasting.

In our code example, we’ll replace the LSTM/GRU layers with Transformer blocks.

###**5.1 Build the Model**


In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, LayerNormalization, MultiHeadAttention, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

# Define Transformer block (Self-Attention Layer)
def transformer_block(inputs, num_heads, num_units, dropout_rate=0.3):
    # Multi-head self-attention layer
    attention = MultiHeadAttention(num_heads=num_heads, key_dim=num_units)(inputs, inputs)
    attention = Dropout(dropout_rate)(attention)
    attention = LayerNormalization()(attention)

    # Feed-forward network (Fully connected) for processing attention output
    ffn = Dense(num_units * 2, activation='relu')(attention)  # Increased the size of the dense layer
    ffn = Dropout(dropout_rate)(ffn)
    ffn = LayerNormalization()(ffn)

    return ffn

# Input layer
input_layer = Input(shape=(n_steps, features))

# Apply transformer block with modified hyperparameters
x = transformer_block(input_layer, num_heads=16, num_units=128, dropout_rate=0.3)

# Add another Transformer block for deeper learning
x = transformer_block(x, num_heads=16, num_units=128, dropout_rate=0.3)

# Flatten the output for the Dense layer
x = Flatten()(x)

# Dense output layer for regression (predicting stock price)
output_layer = Dense(1)(x)

# Create the model using the functional API
model_transformer = Model(inputs=input_layer, outputs=output_layer)

# Compile the model with Adam optimizer
optimizer = Adam(learning_rate=0.001)  # Added a learning rate setting
model_transformer.compile(optimizer=optimizer, loss='mse', metrics=['mae'])

# Add early stopping to avoid overfitting
early_stopping = EarlyStopping(monitor='val_loss', patience=15, restore_best_weights=True)

# Print the model summary
model_transformer.summary()


**Explanation of the Code**:
1. **Transformer Block**:
The `transformer_block` function defines a single layer of the Transformer architecture. It uses the MultiHeadAttention layer to capture dependencies between different time steps in the input sequence.
The feed-forward network (FFN) processes the output of the attention mechanism, and Dropout and LayerNormalization are applied for regularization and stability.

2. **Model Definition**: The model is built with the Keras functional API.
The input layer accepts sequences of shape (n_steps, features). We stack two Transformer blocks by calling `transformer_block` twice. After processing through the Transformer blocks, the output is flattened with Flatten to turn the sequence into a single fixed-size vector. Finally, a Dense layer is added for regression (or forecasting), with a single output neuron (since we are predicting a continuous value).
3. **Compilation**: The model is compiled with the Adam optimizer (learning rate 0.001) and Mean Squared Error (MSE) loss for regression, with Mean Absolute Error (MAE) as an additional metric. An EarlyStopping callback is also defined so that training can stop once the validation loss has not improved for 15 epochs.


### **5.2 Train the Model**
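We train the model in the same way we trained the LSTM and GRU models, holding out 20% of the training data for validation. We also pass the `early_stopping` callback defined above so that training stops when the validation loss no longer improves.

In [None]:
model_transformer.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2, callbacks=[early_stopping])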

###**5.3 Run the Model on the Test Set**

In [None]:
# Step 1: Generate predictions from the Transformer model
transformer_predicted_stock_price = model_transformer.predict(X_test)

# Step 2: Inverse transform the predicted values to get the actual stock prices
transformer_predicted_stock_price = sc.inverse_transform(transformer_predicted_stock_price)

###**5.4 Plot Predictions vs. Actual Values from the Test Set**

In [None]:
# Step 3: Visualize the real vs predicted stock prices
plot_predictions(test_set, transformer_predicted_stock_price)

# Step 4: Calculate and print the RMSE for the predictions
return_rmse(test_set, transformer_predicted_stock_price)

##**6. Exercises**

###**6.1: RNNs**
Tune the code for the LSTM and the GRU models such that their performance becomes similar or near identical. You can play with any of the following:
1. Hyperparameters
2. Input parameters/ number of features
3. Model layers and architecture
4. Anything else

When you are satisfied, copy the revised code in the field(s) below.

In [None]:
# Copy the tuned LSTM code here

In [None]:
# Copy the tuned GRU code here

###**6.2: Transformer and Query Engineering**
The output of the Transformer model points to problems with the model architecture itself. Use any TWO of the following to improve the model performance and bring the RMSE under 100. NOTE that you must use IDENTICAL queries for both AIs:
* OpenAI's [ChatGPT](https://chat.openai.com/) (GPT)
* Google's [Aistudio](https://aistudio.google.com/prompts/new_chat) (Gemini)
* Meta [AI Assistant](https://www.meta.ai/) (Llama)
* xAI [Grok](https://x.ai/grok) (the Tesla of AIs)
* Perplexity's [AI](https://www.perplexity.ai/) (another GPT)
* Microsoft [Copilot](https://copilot.microsoft.com/) (GPT)


State below which two systems you have used, then paste your query history into the text fields, paste the best code solution into the code field, and in at least two sentences, evaluate which AI has provided the best solution most easily.


The Systems I have used are:

My query history:

In [None]:
# The winning code solution

The best AI for this coding job was ____ because ___ ...