## Downloading the dataset

In [1]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

--2023-12-02 05:03:25--  https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2023-12-02 05:03:39 (6.07 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



## Preparing the dataset

In [5]:
import tensorflow as tf
from tensorflow import keras

In [4]:
dataset = keras.utils.text_dataset_from_directory(directory="aclImdb", label_mode=None, batch_size=256)

Found 100006 files belonging to 1 classes.


In [6]:
dataset = dataset.map(lambda x: tf.strings.regex_replace(x, "<br />", " "))

In [15]:
for d in dataset:
  print(d[0])
  break


tf.Tensor(b"I sat down to watch this film with much trepidation and little hope. I didn't think it would be possible for this film to live up to its subject matter. But it absolutely did, and then some. First, I must say that Jared Harris did an extraordinary job as John Lennon. At times it seemed that Harris was channeling Lennon. The resemblance was often uncanny, and he clearly studied Lennon's mannerisms and vocal inflections. Aiden Quinn was quite good as McCartney, also bearing a striking resemblance to Macca, although he did occasionally trip over his Scouse accent.  This work of fiction was well-written and well-directed. It was pure fantasy, of course, but sometimes I felt like a voyeur peeking through a keyhole at this reunion. The rooftop scene was especially moving, as McCartney told Lennon what he had never heard as a child--that he was worthy and important, and it could never be his fault that he was abandoned by his parents. I also enjoyed the scene in the park where the

## Text to vector conversion

In [16]:
from tensorflow.keras.layers import TextVectorization

In [17]:
sequence_length = 100

# We’ll only consider the top 15,000 most common words—anything else will be treated as the out-of-vocabulary token, "[UNK]".
# We want to return integer word index sequences.
# We’ll work with inputs and targets of length 100 (but since we’ll offset the targets by 1, the model will actually see sequences of length 99).
vocab_size = 15000
text_vectorization = TextVectorization(max_tokens=vocab_size, output_mode="int", output_sequence_length=sequence_length)
text_vectorization.adapt(dataset)

Let’s use the layer to create a language modeling dataset where input samples are vectorized texts, and corresponding targets are the same texts offset by one word.

In [18]:
def prepare_lm_dataset(text_batch):
  vectorized_sequences = text_vectorization(text_batch)
  x = vectorized_sequences[:, :-1] # Create inputs by cutting off the last word of the sequences.
  y = vectorized_sequences[:, 1:]
  return x, y

In [19]:
lm_dataset = dataset.map(prepare_lm_dataset, num_parallel_calls=4)

In [24]:
for l in lm_dataset:
  print("input: ", l[0])
  print("output: ",l[1])
  break

input:  tf.Tensor(
[[ 6355   387 13277 ...     5    42  1986]
 [  323   383   733 ...   699     2  2539]
 [  310     5     2 ...    13     1    50]
 ...
 [   11    17     7 ...   135    31     8]
 [    1   172    35 ...   409   500   989]
 [   22    33   743 ...     0     0     0]], shape=(256, 99), dtype=int64)
output:  tf.Tensor(
[[  387 13277   798 ...    42  1986  7098]
 [  383   733   270 ...     2  2539   793]
 [    5     2   572 ...     1    50    11]
 ...
 [   17     7     4 ...    31     8    31]
 [  172    35     2 ...   500   989    43]
 [   33   743    19 ...     0     0     0]], shape=(256, 99), dtype=int64)


## Transformer-based model
We could train a model that takes as input a sequence of N words and simply predicts word N+1. But this can cause couple of issues so, we’ll use a sequence-to-sequence model: we’ll feed sequences of N words (indexed from 0 to N) into our model, and we’ll predict the sequence offset by one (from 1 to N+1).

![Screenshot from 02-12-23 06:26:46](https://github.com/surajkarki66/MediLeaf_backend/assets/50628520/09b4c824-c116-468e-9808-109e45a6a0d3)


When you’re doing text generation, there is no source sequence: you’re just trying to predict the next tokens in the target sequence given past tokens, which we can do using only the decoder.

In [27]:
import tensorflow as tf
from tensorflow.keras import layers

class PositionalEmbedding(layers.Layer):
    '''A positional encoding is a finite dimensional representation of the location or “position” of items in a sequence.
     Given some sequence A = [a_0, …, a_{n-1}], the positional encoding must be some type of tensor that we can feed to a
     model to tell it where some value a_i is in the sequence A.'''
    def __init__(self, sequence_length, input_dim, output_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_embeddings = layers.Embedding(
            input_dim=input_dim, output_dim=output_dim)
        self.position_embeddings = layers.Embedding(
            input_dim=sequence_length, output_dim=output_dim)
        self.sequence_length = sequence_length
        self.input_dim = input_dim
        self.output_dim = output_dim

    def call(self, inputs):
        length = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=length, delta=1)
        embedded_tokens = self.token_embeddings(inputs)
        embedded_positions = self.position_embeddings(positions)
        return embedded_tokens + embedded_positions

    def compute_mask(self, inputs, mask=None):
        return tf.math.not_equal(inputs, 0)

    def get_config(self):
        config = super(PositionalEmbedding, self).get_config()
        config.update({
            "output_dim": self.output_dim,
            "sequence_length": self.sequence_length,
            "input_dim": self.input_dim,
        })
        return config


class TransformerDecoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention_1 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.attention_2 = layers.MultiHeadAttention(
          num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()
        self.layernorm_3 = layers.LayerNormalization()
        self.supports_masking = True

    def get_config(self):
        config = super(TransformerDecoder, self).get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

    def get_causal_attention_mask(self, inputs):
        input_shape = tf.shape(inputs)
        batch_size, sequence_length = input_shape[0], input_shape[1]
        i = tf.range(sequence_length)[:, tf.newaxis]
        j = tf.range(sequence_length)
        mask = tf.cast(i >= j, dtype="int32")
        mask = tf.reshape(mask, (1, input_shape[1], input_shape[1]))
        mult = tf.concat(
            [tf.expand_dims(batch_size, -1),
             tf.constant([1, 1], dtype=tf.int32)], axis=0)
        return tf.tile(mask, mult)

    def call(self, inputs, encoder_outputs, mask=None):
        causal_mask = self.get_causal_attention_mask(inputs)
        if mask is not None:
            padding_mask = tf.cast(
                mask[:, tf.newaxis, :], dtype="int32")
            padding_mask = tf.minimum(padding_mask, causal_mask)
        else:
            padding_mask = mask
        attention_output_1 = self.attention_1(
            query=inputs,
            value=inputs,
            key=inputs,
            attention_mask=causal_mask)
        attention_output_1 = self.layernorm_1(inputs + attention_output_1)
        attention_output_2 = self.attention_2(
            query=attention_output_1,
            value=encoder_outputs,
            key=encoder_outputs,
            attention_mask=padding_mask,
        )
        attention_output_2 = self.layernorm_2(
            attention_output_1 + attention_output_2)
        proj_output = self.dense_proj(attention_output_2)
        return self.layernorm_3(attention_output_2 + proj_output)

In [28]:
from tensorflow.keras import layers

embed_dim = 256
latent_dim = 2048
num_heads = 2

inputs = keras.Input(shape=(None,), dtype="int64")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(inputs)
x = TransformerDecoder(embed_dim, latent_dim, num_heads)(x, x)
outputs = layers.Dense(vocab_size, activation="softmax")(x)
model = keras.Model(inputs, outputs)
model.compile(loss="sparse_categorical_crossentropy", optimizer="rmsprop")

## A text-generation callback with variable-temperature sampling

We’ll use a callback to generate text using a range of different temperatures after every epoch. This allows you to see how the generated text evolves as the model begins to converge, as well as the impact of temperature in the sampling strategy. To seed text generation, we’ll use the prompt “this movie”: all of our generated texts will start with this.

**The text-generation callback**

In [29]:
import numpy as np

tokens_index = dict(enumerate(text_vectorization.get_vocabulary()))

def sample_next(predictions, temperature=1.0):
    predictions = np.asarray(predictions).astype("float64")
    predictions = np.log(predictions) / temperature
    exp_preds = np.exp(predictions)
    predictions = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, predictions, 1)
    return np.argmax(probas)

class TextGenerator(keras.callbacks.Callback):
    def __init__(self,
                 prompt,
                 generate_length,
                 model_input_length,
                 temperatures=(1.,),
                 print_freq=1):
        self.prompt = prompt
        self.generate_length = generate_length
        self.model_input_length = model_input_length
        self.temperatures = temperatures
        self.print_freq = print_freq
        vectorized_prompt = text_vectorization([prompt])[0].numpy()
        self.prompt_length = np.nonzero(vectorized_prompt == 0)[0][0]

    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % self.print_freq != 0:
            return
        for temperature in self.temperatures:
            print("== Generating with temperature", temperature)
            sentence = self.prompt
            for i in range(self.generate_length):
                tokenized_sentence = text_vectorization([sentence])
                predictions = self.model(tokenized_sentence)
                next_token = sample_next(
                    predictions[0, self.prompt_length - 1 + i, :]
                )
                sampled_token = tokens_index[next_token]
                sentence += " " + sampled_token
            print(sentence)

prompt = "This movie"
text_gen_callback = TextGenerator(
    prompt,
    generate_length=50,
    model_input_length=sequence_length,
    temperatures=(0.2, 0.5, 0.7, 1., 1.5))

Since this model takes lots of time to train i am going to train it only for 25 epochs

In [31]:
model.fit(lm_dataset, epochs=25, callbacks=[text_gen_callback])

Epoch 1/25
This movie from the biggest piece is so good one definitely put off the acting a great speaking of material definitely lacks understanding a plot the cast is that are supposed to be one or a few scenes abound in shock some plausibility the movie takes to the character and the place
== Generating with temperature 0.5
This movie is one of the best of the [UNK] bits its one of the best ive ever seen in a bit of the events 1 for two hype there is simple and not to have my computer graphics is robert emotionless the music can you stand on the better than that
== Generating with temperature 0.7
This movie starts to be a very good movie this is a [UNK] for me they get a church of authenticity poor flight about what was involved in a plane into a new i was a moving it not to see it is the early 1960s i would be yet they werent
== Generating with temperature 1.0
This movie should be [UNK] worst movie i have ever seen about zombie movie i have one had the many this character is said t

<keras.src.callbacks.History at 0x7ca600de1d20>

## Conclusion
As you can see, a low temperature value results in very boring and repetitive text and can sometimes cause the generation process to get stuck in a loop. With higher temperatures, the generated text becomes more interesting, surprising, even creative. With a very high temperature, the local structure starts to break down, and the output looks largely random.