
# Generative AI



## Text Generation in Keras

The first step is to download and extract the dataset, which contains IMDB movie reviews.

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2024-07-27 22:40:01--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2024-07-27 22:40:17 (5.17 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [None]:
!tar -xf aclImdb_v1.tar.gz

## Importing the necessary libraries

In [None]:
# `from tensorflow import keras` rebinds the name `keras`, so a separate
# `import keras` beforehand is redundant.
from tensorflow import keras
from tensorflow.keras import layers

In [None]:
import tensorflow as tf

dataset = keras.utils.text_dataset_from_directory(
    "aclImdb",
    label_mode=None,
    batch_size=256,
)
dataset = dataset.map(lambda x: tf.strings.regex_replace(x, "<br />", " "))

Found 100006 files belonging to 1 classes.


## Using a TextVectorization layer to compute the vocabulary
We keep only the first sequence_length words of each review and discard the rest.

In [None]:
from tensorflow.keras.layers import TextVectorization

sequence_length = 100
vocab_size = 15000
text_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
text_vectorization.adapt(dataset)
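
To make the effect of max_tokens and output_sequence_length concrete, here is a minimal pure-Python sketch of integer vectorization. It is not how TextVectorization is implemented; the helper names `build_vocab` and `vectorize` are hypothetical, and the sketch skips the punctuation stripping the real layer applies by default. It only mirrors the layout in which index 0 is padding and index 1 is the out-of-vocabulary token.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    # Index 0 is reserved for padding and index 1 for unknown words,
    # so only max_tokens - 2 slots remain for the most frequent words.
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, length):
    index = {word: i for i, word in enumerate(vocab)}
    # Unknown words map to index 1.
    ids = [index.get(word, 1) for word in text.lower().split()]
    # Cut everything past `length`, then pad short sequences with zeros.
    ids = ids[:length]
    return ids + [0] * (length - len(ids))

toy_vocab = build_vocab(["this movie was great", "this movie was bad"], max_tokens=5)
toy_short = vectorize("this movie", toy_vocab, length=4)
toy_long = vectorize("this movie was truly great indeed", toy_vocab, length=4)
```

With this toy vocabulary, the short review is padded with zeros and the long one is truncated after four tokens, with the rare word mapped to the unknown index.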

## Using a TextVectorization layer to create a language modeling dataset

In [None]:
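def prepare_lm_dataset(text_batch):
  # Convert a batch of raw strings into integer token sequences.
  vectorized_sequences = text_vectorization(text_batch)
  # Inputs: every token except the last one.
  x = vectorized_sequences[:, :-1]
  # Targets: the same sequences shifted left by one token.
  y = vectorized_sequences[:, 1:]
  return x, y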

lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)
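
The input/target shift in prepare_lm_dataset can be checked on a small array. This is a NumPy sketch with made-up token ids (`toy_batch` is an assumption for illustration, not data from the notebook): each target row is the input row moved one step to the left, so position i of the target holds the token that follows position i of the input.

```python
import numpy as np

# Two made-up, zero-padded token sequences of length 5.
toy_batch = np.array([[5, 8, 2, 9, 0],
                      [7, 3, 0, 0, 0]])

toy_x = toy_batch[:, :-1]  # inputs: drop the last token
toy_y = toy_batch[:, 1:]   # targets: drop the first token

for inputs, targets in zip(toy_x, toy_y):
    print(inputs, "->", targets)
```

Both arrays are one token shorter than the original batch, which is why the model is trained on sequences of length sequence_length - 1.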

## Helper class for positional embedding

In [None]:
class PositionalEmbedding(layers.Layer):
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim
        )
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim
        )
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
      length = tf.shape(inputs)[-1]
      positions = tf.range(start=0, limit=length, delta=1)
      embedded_tokens = self.token_embeddings(inputs)
      embedded_positions = self.position_embeddings(positions)
      return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
      return tf.math.not_equal(inputs, 0)

    def get_config(self):
      config = super().get_config()
      config.update({
          "output_dim": self.output_dim,
          "sequence_length": self.sequence_length,
          "input_dim": self.input_dim
      })
      return config

## Helper class for the Transformer encoder

In [None]:
class TransformerEncoder(layers.Layer):
  def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
    super().__init__(**kwargs)
    self.embed_dim = embed_dim
    self.dense_dim = dense_dim
    self.num_heads = num_heads
    self.attention = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim)
    self.dense_proj = keras.Sequential(
        [layers.Dense(dense_dim, activation="relu"),
        layers.Dense(embed_dim),]
    )
    self.layernorm_1 = layers.LayerNormalization()
    self.layernorm_2 = layers.LayerNormalization()

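  def call(self, inputs, mask=None):
    if mask is not None:
      # Expand the padding mask from (batch, seq) to (batch, 1, seq).
      mask = mask[:, tf.newaxis, :]
    # use_causal_mask=True stops each position from attending to later
    # tokens, which would otherwise leak the next-word targets.
    attention_output = self.attention(
        inputs, inputs, attention_mask=mask, use_causal_mask=True
    )
    proj_input = self.layernorm_1(inputs + attention_output)
    proj_output = self.dense_proj(proj_input)
    return self.layernorm_2(proj_input + proj_output)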

  def get_config(self):
    config = super().get_config()
    config.update({
        "embed_dim": self.embed_dim,
        "num_heads": self.num_heads,
        "dense_dim": self.dense_dim
    })
    return config
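
The call method above reshapes the padding mask to (batch, 1, sequence) so that it broadcasts over the query positions. A model that predicts the next word usually also needs a causal mask, so that position i cannot look at tokens after i. The NumPy sketch below shows how the two masks combine; the token ids are made up, and the sketch illustrates the mask shapes only, not the attention computation itself.

```python
import numpy as np

toy_ids = np.array([[5, 8, 2, 0]])              # one sequence, last token is padding
seq_len = toy_ids.shape[-1]

padding_mask = toy_ids != 0                     # shape (batch, seq)
expanded = padding_mask[:, np.newaxis, :]       # shape (batch, 1, seq)

# Lower-triangular matrix: row i is True for columns 0..i only.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Broadcasting gives (batch, seq, seq): entry [b, i, j] says whether
# query position i may attend to key position j.
combined = expanded & causal
print(combined[0].astype(int))
```

In the combined mask, every row ignores the padded last column, and each row additionally ignores the columns to its right.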

## Using different softmax temperatures to see how the model converges

In [None]:
import numpy as np

tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))

def sample_next(predictions, temperature=1.0):
  predictions = np.asarray(predictions).astype("float64")
  predictions = np.log(predictions) / temperature
  exp_preds = np.exp(predictions)
  predictions = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, predictions, 1)
  return np.argmax(probas)

class TextGenerator(keras.callbacks.Callback):
  def __init__(self, prompt, generate_length, model_input_length, temperatures=(1.,), print_freq=1):
    super().__init__()
    self.prompt = prompt
    self.generate_length = generate_length
    self.model_input_length = model_input_length
    self.temperatures = temperatures
    self.print_freq = print_freq
    # Number of tokens in the prompt: index of the first padding (0) token.
    vectorized_prompt = text_vectorization([prompt])[0].numpy()
    self.prompt_length = np.nonzero(vectorized_prompt == 0)[0][0]

  def on_epoch_end(self, epoch, logs=None):
    if (epoch + 1) % self.print_freq != 0:
      return
    for temperature in self.temperatures:
      print("== Generating with temperature", temperature)
      sentence = self.prompt
      for i in range(self.generate_length):
        tokenized_sentence = text_vectorization([sentence])
        predictions = self.model(tokenized_sentence)
        # Read the distribution at the last real token (not at index i)
        # and pass the temperature through to the sampler.
        next_token = sample_next(
            predictions[0, self.prompt_length - 1 + i, :], temperature)
        sampled_token = tokens_index[next_token]
        sentence += " " + sampled_token
      print(sentence)
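
To see what the temperature does before any sampling happens, the reweighting step of sample_next can be applied to a small hand-written distribution. This is a sketch: `reweight` is a hypothetical helper that repeats the log, divide, exponentiate, and normalize steps, and `toy_probs` is made up for illustration.

```python
import numpy as np

def reweight(probabilities, temperature):
    # Same arithmetic as sample_next, without the random draw.
    logits = np.log(np.asarray(probabilities, dtype="float64")) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()

toy_probs = [0.6, 0.3, 0.1]

sharp = reweight(toy_probs, 0.2)      # low temperature: mass moves to the top word
unchanged = reweight(toy_probs, 1.0)  # temperature 1 leaves the distribution as is
flat = reweight(toy_probs, 1.5)       # high temperature: closer to uniform

for name, dist in [("0.2", sharp), ("1.0", unchanged), ("1.5", flat)]:
    print(name, np.round(dist, 3))
```

Low temperatures make generation more repetitive and predictable, while high temperatures make it more varied and more error-prone.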


## Creating the model

In [None]:
# Model definition
embed_dim = 256
dense_dim = 2048
num_heads = 2

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)

In [None]:
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer="rmsprop",
)

In [None]:
prompt = "This movie"
text_gen_callback = TextGenerator(
    prompt,
    generate_length=50,
    model_input_length=sequence_length,
    temperatures=(0.2, 0.5, 1.0, 1.5))

## Training
The model can now be trained. Around 200 epochs might be needed for good results, but only 50 are used here.

In [None]:
model.fit(lm_dataset, epochs=50, callbacks=[text_gen_callback])

Epoch 1/50
This movie movie terrible ive movie ever  this movie it art you laugh                                      
== Generating with temperature 0.5
This movie this this this awful this wrong movie watching unfortunately you  even bad into movies this movie this girl this movie but this thing this movie the people you just matter this porn movies to be high             
== Generating with temperature 1.0
This movie movie line visually  this awful had cameos it                                         
== Generating with temperature 1.5
This movie people this my movie live while graduate the i suspense love this if is it disappointed is its much slow has humour its  vote if you [UNK] 1972 this its just want you like me this frederic writing it dont just funny       
Epoch 2/50
This movie movie watch movie this movie movie movie movie movie movie a movie movie movie this film movie is movie effort movie movie movie movie great really movie movie movie movie movie movie movie movie fe

<keras.src.callbacks.History at 0x7d15e23251b0>

## Conclusion
Because resources were limited and I could only use a GPU for a very short time, the results were not as good as expected. I also ran the model for only 50 epochs, which is another reason the generated text falls short of what the model could produce.