<a href="https://colab.research.google.com/github/UAPH451551/PH451_551_Sp23/blob/main/Exercises/08B_rnns_and_transformers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RNNs, Attention, and Transformers

Due date: 2023-04-07

File name convention: For group 42 and members Richard Stallman and Linus <br>
Torvalds it would be <br>
"08_Exercise8_group42_Stallman_Torvalds.pdf".

Submission via blackboard (UA).

Feel free to answer free text questions in text cells using markdown and <br>
possibly $\LaTeX{}$ if you want to.

**You don't have to understand every line of code here and it is not intended <br> 
for you to try to understand every line of code.   <br>
Big blocks of code are usually meant to just be clicked through.**

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert tf.__version__ >= "2.0"

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)
tf.random.set_seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Loading and Preparing the Dataset

In [None]:
def generate_time_series(batch_size, n_steps):
    freq1, freq2, offsets1, offsets2 = np.random.rand(4, batch_size, 1)
    time = np.linspace(0, 1, n_steps)
    series = 0.5 * np.sin((time - offsets1) * (freq1 * 10 + 10))  #   wave 1
    series += 0.2 * np.sin((time - offsets2) * (freq2 * 20 + 20)) # + wave 2
    series += 0.1 * (np.random.rand(batch_size, n_steps) - 0.5)   # + noise
    return series[..., np.newaxis].astype(np.float32)

In [None]:
np.random.seed(42)

n_steps = 50
series = generate_time_series(10000, n_steps + 1)
X_train, y_train = series[:7000, :n_steps], series[:7000, -1]
X_valid, y_valid = series[7000:9000, :n_steps], series[7000:9000, -1]
X_test, y_test = series[9000:, :n_steps], series[9000:, -1]

In [None]:

X_train.shape, y_train.shape

In [None]:
def plot_series(series, y=None, y_pred=None, x_label="$t$", y_label="$x(t)$", legend=True):
    plt.plot(series, ".-")
    if y is not None:
        plt.plot(n_steps, y, "bo", label="Target")
    if y_pred is not None:
        plt.plot(n_steps, y_pred, "rx", markersize=10, label="Prediction")
    plt.grid(True)
    if x_label:
        plt.xlabel(x_label, fontsize=16)
    if y_label:
        plt.ylabel(y_label, fontsize=16, rotation=0)
    plt.hlines(0, 0, 100, linewidth=1)
    plt.axis([0, n_steps + 1, -1, 1])
    if legend and (y is not None or y_pred is not None):
        plt.legend(fontsize=14, loc="upper left")

fig, axes = plt.subplots(nrows=1, ncols=3, sharey=True, figsize=(12, 4))
for col in range(3):
    plt.sca(axes[col])
    plot_series(X_valid[col, :, 0], y_valid[col, 0],
                y_label=("$x(t)$" if col==0 else None),
                legend=(col == 0))
plt.show()

# Recurrent Neural Networks

# The Simple Recurrent Neural Network (RNN)

The simplest form of **RNN** one can build **takes an output and passes it into the** <br>
**next input**. The idea is that, if your data has some form of **sequential** property <br>
such as being elements of a **time series** or a sentence or even generally related <br>
(not explicitly sequential) values, you should attempt to **inform your model via** <br>
**explicit neural connections** that those values are linked. By passing the output <br>
of one neuron to the next neuron in a sequence, you can explicitly inform your <br>
model of the relationship of an input to its neighboring inputs.
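
The recurrence described above can be sketched in a few lines of NumPy. This is only an illustration, not the Keras implementation: the helper `simple_rnn_forward` and the weight names `w_x`, `w_h` and `b` are hypothetical, and the `tanh` activation is assumed because it is the default of the Keras `SimpleRNN` layer.

```python
import numpy as np

def simple_rnn_forward(x, w_x, w_h, b):
    """Run a one-layer recurrent pass over a single sequence.

    x:   (n_steps, n_inputs) input sequence
    w_x: (n_inputs, n_units) input weights
    w_h: (n_units, n_units)  recurrent weights
    b:   (n_units,)          bias
    Returns the hidden state at every step, shape (n_steps, n_units).
    """
    h = np.zeros(w_h.shape[0])                  # initial state: all zeros
    states = []
    for x_t in x:                               # walk through the sequence in order
        h = np.tanh(x_t @ w_x + h @ w_h + b)    # new state mixes input and previous state
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 1))               # 50 time steps, 1 feature
states = simple_rnn_forward(x_demo,
                            rng.normal(size=(1, 16)) * 0.5,
                            rng.normal(size=(16, 16)) * 0.1,
                            np.zeros(16))
print(states.shape)   # (50, 16)
```

Keeping every row of `states` corresponds to `return_sequences=True`, while keeping only `states[-1]` corresponds to `return_sequences=False`.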

In [None]:
from keras import layers

## Task 1: Simple RNN

Build  a simple RNN using the `SimpleRNN` layer. Your model should be a <br>
Sequential model called `simplernn` of the following form:<br>
1) `SimpleRNN` with 16 units, `return_sequences=True`, and `input_shape=[None,1]`<br>
2) `SimpleRNN` layer with 16 units and `return_sequences=False`<br>
3) Output `Dense` layer with 1 output dimension with a `'linear'` activation.

Compile the model with `'mse'` loss and the `'adam'` optimizer.<br>
Train the model on `X_train` and `y_train`.

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your code goes below

In [None]:
# simplernn = keras.models.Sequential([])
# simplernn.compile()
# history = simplernn.fit()  # pass validation_data=(X_valid, y_valid) so "val_loss" is recorded

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your code goes above this

In [None]:
simplernn.evaluate(X_test, y_test)

In [None]:
def plot_learning_curves(loss, val_loss):
    plt.plot(np.arange(len(loss)) + 0.5, loss, "b.-", label="Training loss")
    plt.plot(np.arange(len(val_loss)) + 1, val_loss, "r.-", label="Validation loss")
    plt.gca().xaxis.set_major_locator(mpl.ticker.MaxNLocator(integer=True))
    plt.axis([1, 10, 0.002, 0.01])
    plt.legend(fontsize=14)
    plt.xlabel("Epochs")
    plt.ylabel("Loss")
    plt.grid(True)

In [None]:
plot_learning_curves(history.history["loss"], history.history["val_loss"])
plt.show()

In [None]:
y_pred = simplernn.predict(X_valid)
plot_series(X_valid[0, :, 0], y_valid[0, 0], y_pred[0, 0])
plt.show()

In [None]:
print(y_pred.shape)

In [None]:
np.random.seed(43) # not 42, as it would give the first series in the train set

new_series = generate_time_series(1, n_steps + 10)
X_new, Y_new = new_series[:, :n_steps], new_series[:, n_steps:]

In [None]:
X = X_new
for step_ahead in range(10):
    y_pred_one = simplernn.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)

y_pred_new = X[:, n_steps:]

In [None]:
y_pred_new.shape

In [None]:
def plot_multiple_forecasts(X, Y, Y_pred):
    n_steps = X.shape[1]
    ahead = Y.shape[1]
    plot_series(X[0, :, 0])
    plt.plot(np.arange(n_steps, n_steps + ahead), Y[0, :, 0], "bo-", label="Actual")
    plt.plot(np.arange(n_steps, n_steps + ahead), Y_pred[0, :, 0], "rx-", label="Forecast", markersize=10)
    plt.axis([0, n_steps + ahead, -1, 1])
    plt.legend(fontsize=14)

In [None]:
plot_multiple_forecasts(X_new, Y_new, y_pred_new)
plt.show()

# The Gated Recurrent Unit (GRU)

GRU models are **actually very new** in the timeline of RNNs, having been first <br>
**proposed in 2014** by Kyunghyun Cho et al. The main feature this adds to the <br>
RNN architecture is that it replaces the simple densely connected unit with a <br>
new unit that adds a **"forget gate"**. You might remember that the **characteristic** <br>
**equation for a standard neuron** is $a_i=\sigma(w_ix_i + b_i)$. For the GRU you add <br>
a term that depends on a new weight, which we can call $u$, applied to the <br>
output of the previous neuron: $a_i=\sigma(w_ix_i + u_ih_{i-1} + b_i)$, where $h_{i-1}$ is the final <br>
output of the previous neuron. It then takes that initial activation and <br>
applies two steps:<br>
1) $\hat h_i=\phi(w_{2i}x_i+u_{2i}(a_i\odot h_{i-1}) + b_{2i})$, where $\phi$ is tanh and $w_{2i}$ is a second <br>
neuron weight, meaning that the simplest **GRU will have 4 weights and 2 biases**.<br>
2) $h_i=(1-a_i)\odot h_{i-1}+a_i\odot \hat h_i$

If we consider what those equations mean, you can probably work out that the <br>
gate activation $a_i$, **our forget gate**, decides how much of the previous neuron's <br>
output is kept. Step **(1)** proposes a candidate output $\hat h_i$, and step **(2)** blends it <br>
with $h_{i-1}$: **if $a_i$ is close to 0**, the final output **will have a large contribution** <br>
**from our previous neuron**, and if $a_i$ is close to 1 it will mostly be the new candidate.

The standard (fully gated) GRU, which is what the Keras `GRU` layer implements, <br>
uses a "reset gate" and an "update gate" instead of a single forget gate. You can <br>
read about them further [here](https://en.wikipedia.org/wiki/Gated_recurrent_unit).
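
The gate equation and the two steps listed above can be written out directly for a single scalar unit. The sketch below follows the simplified unit described in this section, not the fully gated Keras `GRU` layer; the function name `gated_step` and the numerical values are hypothetical and chosen only to make the effect of the gate visible.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(x, h_prev, w, u, b, w2, u2, b2):
    """One step of the simplified gated unit: 4 weights (w, u, w2, u2), 2 biases (b, b2)."""
    a = sigmoid(w * x + u * h_prev + b)                # gate activation, between 0 and 1
    h_hat = np.tanh(w2 * x + u2 * (a * h_prev) + b2)   # step 1: candidate output
    h = (1.0 - a) * h_prev + a * h_hat                 # step 2: blend old output and candidate
    return h, a

# A strongly negative gate bias pushes a toward 0, so h stays close to h_prev.
h_keep, a_keep = gated_step(x=-0.5, h_prev=0.8, w=1.0, u=1.0, b=-10.0, w2=1.0, u2=1.0, b2=0.0)

# A strongly positive gate bias pushes a toward 1, so h follows the candidate instead.
h_new, a_new = gated_step(x=-0.5, h_prev=0.8, w=1.0, u=1.0, b=10.0, w2=1.0, u2=1.0, b2=0.0)

print(a_keep, h_keep)   # gate near 0, output near the previous value 0.8
print(a_new, h_new)     # gate near 1, output near tanh(0.3)
```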

## Task 2: Simple GRU Model

Build an identical model to task 1 but substituting `GRU` layers instead of <br>
`SimpleRNN` layers and instead call the model `simplegru`.

Train the model on `X_train` and `y_train`.

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your code goes below

In [None]:
# simplegru = keras.models.Sequential([])
# simplegru.compile()
# simplegru.fit()

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your code goes above this

In [None]:
simplegru.evaluate(X_test, y_test)

In [None]:
X = X_new
for step_ahead in range(10):
    y_pred_one = simplegru.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)

y_pred_new = X[:, n_steps:]

In [None]:
plot_multiple_forecasts(X_new, Y_new, y_pred_new)
plt.show()

# The Long Short-Term Memory (LSTM)

LSTMs were among the state of the art for language processing tasks for quite <br>
some time. Interestingly though, they were **first proposed in 1995** by Hochreiter <br>
and Schmidhuber (the journal paper followed in 1997), almost **20 years before GRU**. <br>
LSTM models can **outperform GRU models** on some tasks and metrics, but **not** <br>
**without costs**: LSTM cells require **more storage space** and are **slower** than GRU <br>
cells. This is due to the fact that they carry an additional **"long-term memory"** <br>
value, the cell state, alongside the usual hidden state. Whereas a GRU only passes <br>
a single hidden state from one time step to the next, an **LSTM passes both**, and <br>
its gates let the cell state preserve information **across many time steps of a sequence**.
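
To make the extra "long-term memory" value concrete, here is a minimal NumPy sketch of one LSTM time step using the usual textbook equations. The function name `lstm_step`, the stacked weight layout `W`, `U`, `b` and the gate ordering are assumptions made for this example and need not match the internal layout of the Keras `LSTM` layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single sequence.

    W: (n_inputs, 4 * n_units), U: (n_units, 4 * n_units), b: (4 * n_units,)
    The four blocks hold the input gate, forget gate, candidate and output gate.
    """
    z = x_t @ W + h_prev @ U + b
    i, f, g, o = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates, each between 0 and 1
    g = np.tanh(g)                                  # candidate update
    c = f * c_prev + i * g                          # long-term memory (cell state)
    h = o * np.tanh(c)                              # short-term output (hidden state)
    return h, c

n_inputs, n_units = 1, 16
rng = np.random.default_rng(0)
W = rng.normal(size=(n_inputs, 4 * n_units)) * 0.1
U = rng.normal(size=(n_units, 4 * n_units)) * 0.1
b = np.zeros(4 * n_units)

h, c = np.zeros(n_units), np.zeros(n_units)
for x_t in rng.normal(size=(50, n_inputs)):   # 50 time steps of one series
    h, c = lstm_step(x_t, h, c, W, U, b)       # both states are carried forward

print(h.shape, c.shape)           # (16,) (16,)
print(W.size + U.size + b.size)   # 1152 parameters: four blocks of 288
```

The four weight blocks, compared with a single block of 288 parameters for a simple recurrent layer of the same size, are where the extra storage and compute come from.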

In [None]:
simplelstm = keras.models.Sequential([
    layers.LSTM(16, return_sequences=True, input_shape=[None,1]),
    layers.LSTM(16),
    layers.Dense(1, activation='linear'),
])

In [None]:
simplelstm.compile(loss="mse", optimizer="adam")
history = simplelstm.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=20)

In [None]:
simplelstm.evaluate(X_test, y_test)

In [None]:
X = X_new
for step_ahead in range(10):
    y_pred_one = simplelstm.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)

y_pred_new = X[:, n_steps:]

In [None]:
plot_multiple_forecasts(X_new, Y_new, y_pred_new)
plt.show()

## Task 3: Analyzing Model Performance

Question: Compare the behavior of LSTM to GRU and simple RNN. Which region of <br>
future forecasting is most different? Describe characteristics about the <br>
various models that would lead to that difference.

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your answer goes below

Task 3) answer:

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your answer goes above this

# The Luong Attention Mechanism

Attention is an operation built around applying a **dot product** across **1) outputs** <br>
**from the head of a model** which has "encoded" the data by transforming with <br>
weights and biases and **2) hidden states from the tail of a model** <br>
which is attempting to "decode" the data and turn it into your desired output.

That is a long sentence but what it breaks down to is that you're **reweighting** <br>
**data** part way through a model so that the back half of your model knows what's <br>
most important from what the first half did. When Luong, Pham and Manning <br>
published [this paper](https://arxiv.org/abs/1508.04025) in 2015, they achieved <br>
**state of the art results** in machine translation and laid the <br>
foundation for the **Third Wave of AI** driven largely by attention-based <br>
Transformer models.
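
As a rough sketch of the reweighting idea, the NumPy snippet below scores every encoder output against one decoder state with a dot product, turns the scores into weights with a softmax, and averages the encoder outputs with those weights. It illustrates the generic "dot" score only; the names `dot_attention`, `encoder_outputs` and `decoder_state` are hypothetical, and this is not the exact layer graph used in the next cell.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

def dot_attention(encoder_outputs, decoder_state):
    """encoder_outputs: (n_steps, n_units), decoder_state: (n_units,)"""
    scores = encoder_outputs @ decoder_state   # one dot product per time step
    weights = softmax(scores)                  # non-negative, sums to 1
    context = weights @ encoder_outputs        # weighted average of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(50, 16))    # 50 time steps, 16 units
decoder_state = rng.normal(size=16)
context, weights = dot_attention(encoder_outputs, decoder_state)

print(weights.shape, context.shape)   # (50,) (16,)
print(weights.sum())                  # 1.0 up to rounding
```

Time steps whose encoder output points in the same direction as the decoder state receive the largest weights, which is what "knowing what is most important" amounts to here.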

In [None]:
from keras.optimizers import Adam

In [None]:
inputs = layers.Input(shape=X_train.shape[1:])
x, f_h, f_c = layers.LSTM(16, return_state=True, return_sequences=True)(inputs)
attention = layers.Activation('softmax')(f_h)
context = layers.dot([attention, f_h], axes=[1,1])
f_h = layers.add([f_h, context])
y = layers.LSTM(16)(x, initial_state=[f_h, f_c])
outputs = layers.Dense(1, activation='linear')(y)

attlstm = keras.Model(inputs=inputs, outputs=outputs)

## Task 4: Understanding Luong Attention

Here `f_h` is the hidden state output from an encoder and `f_c` is the cell <br>
state from an encoder. Describe, to the best of your ability, what is happening <br>
between the extraction of the hidden state from the encoder and passing the <br>
hidden state as the initial state of the decoder (between lines 2-6).

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your answer goes below

Task 4) answer:

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your answer goes above this

In [None]:
optimizer = Adam(learning_rate=1e-3)
attlstm.compile(loss="mse", optimizer=optimizer)
history1 = attlstm.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=20)

In [None]:
attlstm.evaluate(X_test, y_test)

In [None]:
X = X_new
for step_ahead in range(10):
    y_pred_one = attlstm.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)

y_pred_new = X[:, n_steps:]

In [None]:
plot_multiple_forecasts(X_new, Y_new, y_pred_new)
plt.show()

# The Transformer Model

The transformer architecture was first proposed in a paper called <br>
[Attention is All You Need (Vaswani et al. 2017)](https://arxiv.org/abs/1706.03762).

The characteristic operation of the transformer is the **self-attention** <br>
operation. In essence, it is using **dot product** operations along different axes <br>
of the data to amplify the importance of relevant data. You can think of this <br>
as effectively building **a model that reweights its inputs** to pick out the most <br>
important and relevant inputs. Then it does typical mathematical operations of <br>
**densely connected layers** on those reweighted inputs.

Researchers figured out several years before this that dot-product attention <br>
could be used in combination with other forms of models to improve performance <br>
by reweighting input data but the **Attention is All You Need** paper showed that, <br>
as the name would imply, you can get even **better performance** by eliminating all <br>
other characteristic layers and instead **using only attention and dense layers**.
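
A minimal NumPy sketch of single-head scaled dot-product self-attention may help make the reweighting concrete. The projection matrices `w_q`, `w_k` and `w_v` are random stand-ins for learned weights, and the function name `self_attention` is hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))   # stabilised exponentials
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); queries, keys and values all come from x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # every position scored against every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # reweighted values, attention matrix

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))                  # 50 positions, 8 embedding dimensions
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)

print(out.shape, weights.shape)   # (50, 8) (50, 50)
```

Each output row is a weighted mix of all value rows, so every position can draw on the whole sequence in a single step.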

In [None]:
print(X_train.shape)

## Task 5: Defining model dimensions

Define two variables: <br>
1) `max_len` should be equal to the size of the dimension from X_train that <br>
describes the length of the time series. **Hint: see above plots**<br>
2) `embedding_dim` should be equal to the fourth root (`**(1/4)`) of the total <br>
number of possible data points in the input and output, converted to an integer <br>
because it is later used as a layer and array size

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your code goes below

In [None]:
# max_len =
# embedding_dim =
# print(max_len)
# print(embedding_dim)

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your code goes above this

It's typical with transformers to embed our inputs into a higher dimensional <br>
space. For discrete data such as integers or words, it's typical to use an  <br>
embedding layer from the keras toolkit. However, for continuous data, it's more <br>
common to embed using linear or non-linear transformations.

In [None]:
class LinearEmbedding(layers.Layer):
  def __init__(self, dim, dtype=tf.float32, **kwargs):
    super().__init__(dtype=dtype, **kwargs)
    self.dense = layers.Dense(dim, activation='linear')
  def call(self, inputs):
    inputs = tf.expand_dims(inputs, axis=-1)
    return self.dense(inputs)

Another interesting feature of transformers is that they are not explicitly <br>
aware of the order of the data in a sequence, especially after embedding. <br>
We can therefore inform the model of positions by, for example, adding to each <br>
element a small value that depends on its position in the sequence, computed <br>
from an interpretable form such as periodic functions.
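
The order-blindness can be checked numerically. In the sketch below, a bare self-attention step (no learned projections, to keep it short) is applied to a sequence and to its reversed copy. The helper `attend` and the simple sine offset `pos` are hypothetical and are not the encoding defined in the next cell.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    # Self-attention with queries = keys = values = x
    scores = x @ x.T / np.sqrt(x.shape[-1])
    return softmax_rows(scores) @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))       # 6 positions, 4 embedding dimensions
reverse = np.arange(6)[::-1]      # a fixed reordering: reverse the sequence

# Without position information, reordering the inputs only reorders the outputs.
order_blind = np.allclose(attend(x)[reverse], attend(x[reverse]))

# Hypothetical position signal: one sine value per position, added to every dimension.
pos = 0.5 * np.sin(np.arange(6))[:, np.newaxis]
order_aware = not np.allclose(attend(x + pos)[reverse], attend(x[reverse] + pos))

print(order_blind)   # True
print(order_aware)
```

Once the position signal is added, the same values placed at different positions no longer produce the same outputs, which is the information the model needs to tell a sequence from its shuffled copy.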

In [None]:
class PositionalEncoding(layers.Layer):
    def __init__(self, max_len, max_dims, alpha=1, dtype=tf.float32, **kwargs):
        super().__init__(dtype=dtype, **kwargs)
        if max_dims % 2 == 1: max_dims += 1 # max_dims must be even
        p, i = np.meshgrid(np.arange(max_len), np.arange(max_dims // 2))
        pos_emb = np.empty((1, max_len, max_dims))
        pos_emb[0, :, ::2] = np.sin(p / 10000**(2 * i / max_dims)).T
        pos_emb[0, :, 1::2] = np.cos(p / 10000**(2 * i / max_dims)).T
        self.positional_embedding = tf.constant(pos_emb.astype(self.dtype))*alpha
    def call(self, inputs):
        shape = tf.shape(inputs)
        return inputs + self.positional_embedding[:, :shape[-2], :shape[-1]]

In [None]:
from tensorflow import math, matmul, reshape, shape, transpose, cast, float32

The next two colab cells for simple transformers were derived from [this](https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/)<br>
article by Stefania Cristina on machinelearningmastery.com

In [None]:
#https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/
K = keras.backend

class DotProductAttention(layers.Layer):
    def __init__(self, **kwargs):
        super(DotProductAttention, self).__init__(**kwargs)
 
    def call(self, queries, keys, values, d_k, mask=None):
        # Scoring the queries against the keys after transposing the latter, and scaling
        scores = matmul(queries, keys, transpose_b=True) / math.sqrt(cast(d_k, float32))
 
        # Apply mask to the attention scores
        if mask is not None:
            scores += -1e9 * mask
 
        # Computing the weights by a softmax operation
        weights = K.softmax(scores)
 
        # Computing the attention by a weighted sum of the value vectors
        return matmul(weights, values)

In [None]:
#https://machinelearningmastery.com/how-to-implement-multi-head-attention-from-scratch-in-tensorflow-and-keras/
class MultiHeadAttention(layers.Layer):
    def __init__(self, h, d_k, d_v, d_model, **kwargs):
        super(MultiHeadAttention, self).__init__(**kwargs)
        self.attention = DotProductAttention()  # Scaled dot product attention
        self.heads = h  # Number of attention heads to use
        self.d_k = d_k  # Dimensionality of the linearly projected queries and keys
        self.d_v = d_v  # Dimensionality of the linearly projected values
        self.d_model = d_model  # Dimensionality of the model
        self.W_q = layers.Dense(d_k)  # Learned projection matrix for the queries
        self.W_k = layers.Dense(d_k)  # Learned projection matrix for the keys
        self.W_v = layers.Dense(d_v)  # Learned projection matrix for the values
        self.W_o = layers.Dense(d_model)  # Learned projection matrix for the multi-head output
 
    def reshape_tensor(self, x, heads, flag):
        if flag:
            # Tensor shape after reshaping and transposing: (batch_size, heads, seq_length, -1)
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], heads, -1))
            x = transpose(x, perm=(0, 2, 1, 3))
        else:
            # Reverting the reshaping and transposing operations: (batch_size, seq_length, d_v)
            x = transpose(x, perm=(0, 2, 1, 3))
            x = reshape(x, shape=(shape(x)[0], shape(x)[1], self.d_v))
        return x
 
    def call(self, inputs, mask=None):
        queries = keys = values = inputs
        # Rearrange the queries to be able to compute all heads in parallel
        q_reshaped = self.reshape_tensor(self.W_q(queries), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)
 
        # Rearrange the keys to be able to compute all heads in parallel
        k_reshaped = self.reshape_tensor(self.W_k(keys), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)
 
        # Rearrange the values to be able to compute all heads in parallel
        v_reshaped = self.reshape_tensor(self.W_v(values), self.heads, True)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)
 
        # Compute the multi-head attention output using the reshaped queries, keys and values
        o_reshaped = self.attention(q_reshaped, k_reshaped, v_reshaped, self.d_k, mask=mask)
        # Resulting tensor shape: (batch_size, heads, input_seq_length, -1)
 
        # Rearrange back the output into concatenated form
        output = self.reshape_tensor(o_reshaped, self.heads, False)
        # Resulting tensor shape: (batch_size, input_seq_length, d_v)
 
        # Apply one final linear projection to the output to generate the multi-head attention
        # Resulting tensor shape: (batch_size, input_seq_length, d_model)
        return self.W_o(output)

In [None]:
def transformer_block(inputs, n_heads, max_len, ff_dim, dropout_rate):
    # Multi-head attention
    attention = MultiHeadAttention(h=n_heads, d_k=max_len, 
                                   d_v=max_len, d_model=ff_dim)(inputs)
    attention = layers.Dropout(dropout_rate)(attention)
    add1 = layers.Add()([inputs, attention])
    norm1 = layers.LayerNormalization(epsilon=1e-6)(add1)

    # Feedforward network
    ff = layers.Dense(units=ff_dim, activation="relu")(norm1)
    ff = layers.Dense(units=embedding_dim)(ff)
    ff = layers.Dropout(dropout_rate)(ff)
    add2 = layers.Add()([norm1, ff])
    norm2 = layers.LayerNormalization(epsilon=1e-6)(add2)

    return norm2

In [None]:
n_heads = 1
ff_dim = embedding_dim
dropout_rate = 0.0 # This should be >0 but <1 for more complex data sets
num_blocks = 1

inputs = layers.Input(shape=(max_len,))
embedding_layer = LinearEmbedding(embedding_dim)
encoder_embeddings = embedding_layer(inputs)
positional_encoding = PositionalEncoding(max_len, max_dims=embedding_dim, alpha=.5)
encoder_in = positional_encoding(encoder_embeddings)

blocks = []
for i in range(num_blocks):
    blocks.append(transformer_block(encoder_in, n_heads, max_len, ff_dim, dropout_rate))
transfout = layers.Concatenate(axis=-1)([block for block in blocks])

In [None]:
pooled = layers.GlobalAveragePooling1D()(transfout)

## Task 6: Global Average Pooling

Here we use a `GlobalAveragePooling1D` layer instead of simply adding, <br>
multiplying or flattening. Explain why we use this instead of the previously <br>
mentioned operations. Feel free to use online sources.

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your answer goes below

Task 6) answer:

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your answer goes above this

In [None]:
outputs = layers.Dense(1, activation='linear')(pooled)

**Note: Transformers are often much larger in dimensionality than RNNs, so we may** <br>
**need a different training procedure, such as a learning-rate schedule, for best results.**
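
One common kind of schedule is exponential decay, where the learning rate shrinks by a fixed factor over a fixed number of epochs. The pure-Python sketch below only illustrates the arithmetic; the function name `exponential_decay` and its default values are assumptions for this example, not settings prescribed by this exercise.

```python
def exponential_decay(epoch, initial_lr=1e-3, decay_rate=0.5, decay_epochs=20):
    """Learning rate after `epoch` epochs; it is multiplied by `decay_rate` every `decay_epochs` epochs."""
    return initial_lr * decay_rate ** (epoch / decay_epochs)

# Print the learning rate at a few points during training.
for epoch in (0, 10, 20, 30):
    print(epoch, exponential_decay(epoch))
```

With these defaults the rate starts at 0.001 and has halved after 20 epochs. In Keras, a function of the epoch index like this one can be applied during training through the `keras.callbacks.LearningRateScheduler` callback.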

In [None]:
transformer = keras.Model(inputs=inputs, outputs=outputs)
optimizer = keras.optimizers.Adam(learning_rate=1e-3)
transformer.compile(optimizer=optimizer, loss="mse")

In [None]:
transformer.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=20)

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=5e-4)
transformer.compile(optimizer=optimizer, loss="mse")
transformer.fit(X_train, y_train, validation_data=(X_valid, y_valid), epochs=10)

In [None]:
X = X_new
for step_ahead in range(10):
    y_pred_one = transformer.predict(X[:, step_ahead:])[:, np.newaxis, :]
    X = np.concatenate([X, y_pred_one], axis=1)

y_pred_new = X[:, n_steps:]

In [None]:
plot_multiple_forecasts(X_new, Y_new, y_pred_new)
plt.show()

## ChatGPT

GPT stands for **"generative pre-trained transformer"**. ChatGPT is a version of <br>
a GPT **created by OpenAI** which made waves in late 2022 because of how well it <br>
mimics human conversation and interaction. Below, we'll look at how to access <br>
**ChatGPT in colab using a public API**. It is not possible to house the full model <br>
in a typical colab notebook because GPT-3, the model it descends from, has an <br>
astounding **175 billion model parameters**. For comparison, two fully connected dense <br>
layers with 10,000 neurons each would only have around 100 million parameters.
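
The comparison above is simple arithmetic and can be checked in a couple of lines. The sketch counts only the connection between the two 10,000-neuron layers (weights plus the second layer's biases), which is the assumption behind the "around 100 million" figure.

```python
n = 10_000                      # neurons in each of the two dense layers

weights = n * n                 # one weight per pair of connected neurons
biases = n                      # one bias per neuron in the second layer
dense_params = weights + biases

print(dense_params)             # 100010000
print(175e9 / dense_params)     # about 1750 times more parameters
```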

In March 2023, OpenAI also released GPT-4, which is rumored to have over <br>
**1 TRILLION!** parameters (OpenAI has not published the number). The model <br>
extended the capabilities of previous GPT iterations to include image inputs. <br>
It has so far shown incredible ability: OpenAI reports that it scores around the <br>
**90th percentile on a simulated bar exam**, the final examination to become a <br>
lawyer in the United States.

In [None]:
# The cells below use the pre-1.0 `openai.Completion` interface
!pip install "openai<1.0"

In [None]:
import openai

# Create an account at https://platform.openai.com
# Replace YOUR_API_KEY with your actual API key for the ChatGPT service:
# https://platform.openai.com/account/api-keys
openai.api_key = "YOUR_API_KEY"

In [None]:
response = openai.Completion.create(
    engine="davinci", # The GPT model to query ("davinci" is a GPT-3 base model)
    prompt="What is dot-product attention?", # What you feed into the model
    max_tokens=300, # Max number of tokens (word pieces, not characters) in the output
    n=1, # Number of responses to generate
    stop=None, # Optional sequence(s) at which generation stops early
    temperature=0.0, # Controls the randomness of the output (>=0)
)

message = response.choices[0].text.strip()
print(message)

## Task 7: ChatGPT for Education

Using ChatGPT, generate an interesting piece of educational content about the <br>
current topic of RNNs and/or Transformers. **Hint: better prompt = better response**<br>

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your code goes below

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your code goes above this

## Task 8 (Bonus): Model performance analysis

Analyze the performance of at least 3 models used in this exercise using at <br>
least two different metrics on the test data sets or newly generated data. <br>
Examples of metrics might be single-event prediction error, forecasting error, <br>
or deviation-based metrics.

Further, provide a brief text description (less than 500 words) explaining the <br>
metric used and results.

↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ your code goes below

Task 8) answer:

↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑↑ your code goes above this