# Load Libraries

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchtext, datasets, math
from tqdm import tqdm

torchtext.disable_torchtext_deprecation_warning()

In [2]:
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# print(device)

# Set device (MPS if available)
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
device

device(type='mps')

In [3]:
SEED = 1234
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# **Task 1: Dataset Acquisition**

#### Instructions
Your first task is to find a suitable text dataset. This is worth **1 point**.

1. **Dataset Selection**  
   - Choose a dataset that is text-rich and appropriate for language modeling.
   - The dataset can be based on any theme, such as:  
     - *Harry Potter*  
     - *Star Wars*  
     - Jokes  
     - Works by Isaac Asimov  
     - Thai Stories  
     - Or any other theme you prefer.  

2. **Dataset Description**  
   - Provide a brief description of your chosen dataset.  

3. **Source Crediting**  
   - Source your dataset from reputable public databases or repositories.  
   - Ensure to give proper credit to the dataset source in your documentation.  

#### Note
The dataset must be suitable for language modeling and rich in textual content.

# 1. Load data  - Harry Potter Text
We will be using the harrypotter text which contains a large corpus of text, perfect for language modeling task. This time we will use the `datasets` library from HuggingFace to load.

### Dataset Description: Harry Potter Text Dataset
Dataset Source: https://huggingface.co/datasets/elricwan/HarryPotter 
 
The selected dataset is the **Harry Potter Text Dataset**, sourced from the [Hugging Face Dataset Repository](https://huggingface.co/datasets/elricwan/HarryPotter). This dataset contains text derived from the *Harry Potter* series, making it a rich resource for language modeling and natural language processing tasks. The dataset is ideal for exploring themes, styles, and linguistic patterns in literary works.

**Source Credit**:  
The dataset was published by [elricwan](https://huggingface.co/elricwan) on Hugging Face, a reputable platform for sharing datasets and machine learning models.

In [4]:
from datasets import load_dataset

#Load the Harry Potter dataset
dataset = load_dataset("elricwan/HarryPotter")

In [5]:
import os
# From the code above a datasets folder has been created with all the dataset
# Now split the dataset to test, train and validation dataset

# Path to corpus dataset folder
dataset_folder = 'datasets/'

# List all text files in the dataset folder
all_files = sorted(os.listdir(dataset_folder))

# Print the files to understand the structure
print("All files:", all_files)

All files: ['01 Harry Potter and the Sorcerers Stone.txt', '02 Harry Potter and the Chamber of Secrets.txt', '03 Harry Potter and the Prisoner of Azkaban.txt', '04 Harry Potter and the Goblet of Fire.txt', '05 Harry Potter and the Order of the Phoenix.txt', '06 Harry Potter and the Half-Blood Prince.txt', '07 Harry Potter and the Deathly Hallows.txt']


In [6]:
# Split the files into train, test, and validation sets
train_files = [all_files[i] for i in range(0, 4)]  # First 4 files for training
test_files = [all_files[i] for i in range(4, 6)]   # Next 2 files for testing
validation_files = [all_files[i] for i in range(6, 7)]  # One file for validation

In [7]:
# Function to read files and return text content
def read_files(file_list):
    text_data = []
    for file_name in file_list:
        with open(os.path.join(dataset_folder, file_name), 'r', encoding='utf-8') as file:
            text_data.append(file.read())
    return text_data

In [8]:
from datasets import Dataset

# Read the files for each split
train_data = read_files(train_files)
test_data = read_files(test_files)
validation_data = read_files(validation_files)

# Create Dataset objects for each split
train_dataset = Dataset.from_dict({"text": train_data})
test_dataset = Dataset.from_dict({"text": test_data})
validation_dataset = Dataset.from_dict({"text": validation_data})

In [9]:
from datasets import DatasetDict
harrypotter_dataset = DatasetDict({
    'train': train_dataset,
    'test': test_dataset,
    'validation': validation_dataset
})

In [10]:
print(harrypotter_dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 4
    })
    test: Dataset({
        features: ['text'],
        num_rows: 2
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 1
    })
})


In [11]:
print(harrypotter_dataset['train'].shape)
print(harrypotter_dataset['test'].shape)
print(harrypotter_dataset['validation'].shape)

(4, 1)
(2, 1)
(1, 1)


# **Task 2: Model Training**
Incorporate the chosen dataset into our existing code framework. Train a
language model that can understand the context and style of the text.  
1) Detail the steps taken to preprocess the text data. (1 points)  
2) Describe the model architecture and the training process. (1 points)

### Preprocessing Steps

**Tokenization:**
- The `get_tokenizer` function from `torchtext` is used to create a basic English tokenizer. This tokenizer breaks text into individual words.  
- A `lambda` function is defined to apply tokenization to each example in the dataset. The resulting tokens are stored in a new field named `tokens`.  
- The `map` function is used to apply this tokenization function to each example in the dataset, and the original text column is removed.  

**Numericalization:**
- The vocabulary (`vocab`) is built from the tokenized training dataset, assigning a unique numerical index to each token that appears at least three times (`min_freq=3`).  
- Two special tokens, `(unknown)` and `(end of sequence)`, are added to the vocabulary at indices `0` and `1`, respectively.  
- The default index for the vocabulary is set to the index of `(unknown)`.  

**get_data Method:**
- This method converts the tokenized dataset into a format suitable for training a language model.  
- For each example in the dataset, the `tokens` are retrieved, and `(end of sequence)` is appended to represent the end of the sequence.  
- The tokens are then numericalized using the vocabulary (`vocab`), and the resulting indices are added to the `data` list.  
- The list of indices is converted into a PyTorch LongTensor (`torch.LongTensor`).  
- The data is reshaped into batches of size `batch_size`, and the function returns the processed data.  

The `get_data` function is used to preprocess the tokenized datasets for training, validation, and testing. The resulting `train_data`, `valid_data`, and `test_data` are batches of numericalized sequences, ready for input to the language model.

### Model Architecture

The `LSTMLanguageModel` is an LSTM-based language model implemented using PyTorch. The LSTM model consists of the following layers:

**Embedding Layer:**  
- The input to this layer is a token, and the output is an embedding vector of dimension `emb_dim`, which in our case is `1024`.  

**LSTM Layer:**  
- This layer takes `emb_dim` (1024) as input and outputs hidden states with a dimension of `hid_dim` for each time step in the sequence.  
- Parameters for this layer are:  
  - Number of layers: `num_layers` (2)  
  - Hidden state dimension: `hid_dim` (1024)  
  - Dropout rate: `dropout_rate` (0.65)  
- Weights for this layer are initialized uniformly within the range `[-init_range_other, init_range_other]`.

**Dropout Layer:**  
- This layer is applied to discard some of the output from the LSTM layer by randomly setting some of the LSTM output values to `0`.  
- The main purpose of this layer is to introduce regularization.  
- The rate at which outputs are set to `0` is determined by `dropout_rate` (0.65).

**Linear (Fully Connected) Layer:**  
- In this layer, the output from the LSTM layer, which has a dimension of `hid_dim` (1024), is passed through, and the score for each word in the vocabulary is computed.  
- Weights are initialized uniformly within the range `[-init_range_other, init_range_other]`, and the bias is set to zero.

The model has a total of 33,607,694 trainable parameters, including the weights and biases in the embedding layer, LSTM layer, linear layer, and other parameters.

### Training Process

**LSTM Model's Method used in Training**

**Hidden State Initialization:**  
- The `init_hidden` method is responsible for initializing the hidden state. This method sets the hidden state and cell state for the LSTM layer to zero.

**Forward Method:**  
- The `forward` method takes a batch of token sequences and an initial hidden state (`hidden`) as input.  
- It then embeds the input tokens using the embedding layer and applies dropout to the embedded sequence.  
- After that, it passes the sequence through the LSTM layer to obtain hidden states.  
- It then applies dropout to the LSTM output and finally feeds the output through a linear layer to obtain predictions for the next tokens.  
- This method returns the predictions as well as the hidden state.

**Detaching Hidden State:**  
- The `detach_hidden` method is used to detach the hidden states from the computation graph to prevent unwanted gradient computation during training.

**Train Method**

**Loading Data:**  
- Data is loaded into the model in batches with a size of `num_batches`. Batches that are not multiples of `seq_len` are discarded to maintain consistent sequence length.

**Training Loop:**  
During each training iteration, the following steps are carried out:
1. Training data is iterated in batches of sequence length.
2. The gradient of the model parameters is set to zero.
3. The hidden state is detached from the computation graph to ensure better accuracy.
4. The `get_batch` method is called to retrieve a batch of input and target sequences.
5. Data is transferred to the specified device (GPU or CPU) for processing.
6. A forward pass is made to obtain the predictions and update the hidden state.
7. The loss of the prediction is calculated given the target.
8. Backpropagation is performed to update the model's parameters.
9. The `clip_grad_norm` method from `torch.nn.utils` is used to prevent the exploding gradient problem.
10. Model parameters are updated.
11. The loss for the epoch is accumulated.

**Validation:**  
- Similarly, the `evaluate` method is used to compute the validation loss.

**Learning Rate Update:**  
- The learning rate is updated using a learning rate.

Lastly, the model with the best validation loss is saved.

# 2. Preprocessing

#### Tokenization
Simply tokenize the given text to tokens

In [12]:
from torchtext.data.utils import get_tokenizer
torchtext.disable_torchtext_deprecation_warning()

tokenizer = get_tokenizer('basic_english')

tokenize_data = lambda example, tokenizer: {'tokens': tokenizer(example['text'])}
tokenized_dataset = harrypotter_dataset.map(tokenize_data, remove_columns=['text'], fn_kwargs={'tokenizer': tokenizer})

Map:   0%|          | 0/4 [00:00<?, ? examples/s]

Map:   0%|          | 0/2 [00:00<?, ? examples/s]

Map:   0%|          | 0/1 [00:00<?, ? examples/s]

In [13]:
print(tokenized_dataset['train'][2]['tokens'])



#### Numericalizing
We will tell torchtext to add any word that has occurred at least three times in the dataset to the vocabulary because otherwise it would be too big.  Also we shall make sure to add `unk` and `eos`.

In [14]:
from torchtext.vocab import build_vocab_from_iterator

vocab = build_vocab_from_iterator(tokenized_dataset['train']['tokens'], min_freq=3)
vocab.insert_token('<unk>', 0)
vocab.insert_token('<eos>', 1)
vocab.set_default_index(vocab['<unk>'])

In [15]:
print(len(vocab))

8206


In [16]:
print(vocab.get_itos()[:10])

['<unk>', '<eos>', ',', '.', 'the', 'and', 'to', '”', 'a', 'of']


Now we can save the vocab for future use in our app.

In [39]:
# Save the vocab
torch.save(vocab, './app/pickle/vocab')

# 3. Prepare the batch loader

#### Prepare data
Given "Chaky loves eating at AIT", and "I really love deep learning", and given batch size = 3, we will get three batches of data "Chaky loves eating at", "AIT `<eos>` I really", "love deep learning `<eos>`".  

In [18]:
def get_data(dataset, vocab, batch_size):
    data = []
    for example in dataset:
        if example['tokens']:
            tokens = example['tokens'].append('<eos>') # append tokens '<eos>' to the end of the 'tokens'
            tokens = [vocab[token] for token in example['tokens']] # transform list of tokens to token's indices on our vocab
            data.extend(tokens)
    data = torch.LongTensor(data)
    num_batches = data.shape[0] // batch_size
    data = data[:num_batches * batch_size]
    data = data.view(batch_size, num_batches) #view vs. reshape (whether data is contiguous)
    return data #[batch size, seq len]

In [19]:
print(tokenized_dataset.keys())

dict_keys(['train', 'test', 'validation'])


In [20]:
batch_size = 128
train_data = get_data(tokenized_dataset['train'], vocab, batch_size)
valid_data = get_data(tokenized_dataset['validation'], vocab, batch_size)
test_data  = get_data(tokenized_dataset['test'],  vocab, batch_size)

In [21]:
train_data.shape

torch.Size([128, 4407])

In [22]:
valid_data.shape

torch.Size([128, 1891])

In [23]:
test_data.shape

torch.Size([128, 4196])

# 4. Modeling

In [24]:
class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim, num_layers, dropout_rate):
        super().__init__()
        self.num_layers = num_layers
        self.hid_dim    = hid_dim
        self.emb_dim    = emb_dim
        
        self.embedding  = nn.Embedding(vocab_size, emb_dim)
        self.lstm       = nn.LSTM(emb_dim, hid_dim, num_layers=num_layers, dropout=dropout_rate, batch_first=True)
        self.dropout    = nn.Dropout(dropout_rate)
        self.fc         = nn.Linear(hid_dim, vocab_size)
        
        self.init_weights()
    
    def init_weights(self):
        init_range_emb = 0.1
        init_range_other = 1/math.sqrt(self.hid_dim)
        self.embedding.weight.data.uniform_(-init_range_emb, init_range_other)
        self.fc.weight.data.uniform_(-init_range_other, init_range_other)
        self.fc.bias.data.zero_()
        for i in range(self.num_layers):
            self.lstm.all_weights[i][0] = torch.FloatTensor(self.emb_dim,
                self.hid_dim).uniform_(-init_range_other, init_range_other) #We
            self.lstm.all_weights[i][1] = torch.FloatTensor(self.hid_dim,   
                self.hid_dim).uniform_(-init_range_other, init_range_other) #Wh
    
    def init_hidden(self, batch_size, device):
        hidden = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        cell   = torch.zeros(self.num_layers, batch_size, self.hid_dim).to(device)
        return hidden, cell
        
    def detach_hidden(self, hidden):
        hidden, cell = hidden
        hidden = hidden.detach() #not to be used for gradient computation
        cell   = cell.detach()
        return hidden, cell
        
    def forward(self, src, hidden):
        #src: [batch_size, seq len]
        embedding = self.dropout(self.embedding(src)) #harry potter is
        #embedding: [batch-size, seq len, emb dim]
        output, hidden = self.lstm(embedding, hidden)
        #ouput: [batch size, seq len, hid dim]
        #hidden: [num_layers * direction, seq len, hid_dim]
        output = self.dropout(output)
        prediction =self.fc(output)
        #prediction: [batch_size, seq_len, vocab_size]
        return prediction, hidden

## 5. Training 

Follows very basic procedure.  One note is that some of the sequences that will be fed to the model may involve parts from different sequences in the original dataset or be a subset of one (depending on the decoding length). For this reason we will reset the hidden state every epoch, this is like assuming that the next batch of sequences is probably always a follow up on the previous in the original dataset.

In [25]:
vocab_size = len(vocab)
emb_dim = 1024                # 400 in the paper
hid_dim = 1024                # 1150 in the paper
num_layers = 2                # 3 in the paper
dropout_rate = 0.65              
lr = 1e-3 

In [26]:
model      = LSTMLanguageModel(vocab_size, emb_dim, hid_dim, num_layers, dropout_rate).to(device)
optimizer  = optim.Adam(model.parameters(), lr=lr)
criterion  = nn.CrossEntropyLoss()
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {num_params:,} trainable parameters')

The model has 33,607,694 trainable parameters


In [27]:
def get_batch(data, seq_len, idx):
    #data #[batch size, bunch of tokens]
    src    = data[:, idx:idx+seq_len]                   
    target = data[:, idx+1:idx+seq_len+1]  #target simply is ahead of src by 1            
    return src, target

In [28]:
def train(model, data, optimizer, criterion, batch_size, seq_len, clip, device):
    
    epoch_loss = 0
    model.train()
    # drop all batches that are not a multiple of seq_len
    # data #[batch size, seq len]
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]  #we need to -1 because we start at 0
    num_batches = data.shape[-1]
    
    #reset the hidden every epoch
    hidden = model.init_hidden(batch_size, device)
    
    for idx in tqdm(range(0, num_batches - 1, seq_len), desc='Training: ',leave=False):
        optimizer.zero_grad()
        
        #hidden does not need to be in the computational graph for efficiency
        hidden = model.detach_hidden(hidden)

        src, target = get_batch(data, seq_len, idx) #src, target: [batch size, seq len]
        src, target = src.to(device), target.to(device)
        batch_size = src.shape[0]
        prediction, hidden = model(src, hidden)               

        #need to reshape because criterion expects pred to be 2d and target to be 1d
        prediction = prediction.reshape(batch_size * seq_len, -1)  #prediction: [batch size * seq len, vocab size]  
        target = target.reshape(-1)
        loss = criterion(prediction, target)
        
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

In [29]:
def evaluate(model, data, criterion, batch_size, seq_len, device):

    epoch_loss = 0
    model.eval()
    num_batches = data.shape[-1]
    data = data[:, :num_batches - (num_batches -1) % seq_len]
    num_batches = data.shape[-1]

    hidden = model.init_hidden(batch_size, device)

    with torch.no_grad():
        for idx in range(0, num_batches - 1, seq_len):
            hidden = model.detach_hidden(hidden)
            src, target = get_batch(data, seq_len, idx)
            src, target = src.to(device), target.to(device)
            batch_size= src.shape[0]

            prediction, hidden = model(src, hidden)
            prediction = prediction.reshape(batch_size * seq_len, -1)
            target = target.reshape(-1)

            loss = criterion(prediction, target)
            epoch_loss += loss.item() * seq_len
    return epoch_loss / num_batches

Here we will be using a `ReduceLROnPlateau` learning scheduler which decreases the learning rate by a factor, if the loss don't improve by a certain epoch.

In [30]:
n_epochs = 50
seq_len  = 50 #<----decoding length
clip    = 0.25

lr_scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=0)

best_valid_loss = float('inf')

for epoch in range(n_epochs):
    train_loss = train(model, train_data, optimizer, criterion, 
                batch_size, seq_len, clip, device)
    valid_loss = evaluate(model, valid_data, criterion, batch_size, 
                seq_len, device)

    lr_scheduler.step(valid_loss)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), './app/pickle/best-val-lstm_lm.pt')

        print(f'\tTrain Perplexity: {math.exp(train_loss):.3f}')
        print(f'\tValid Perplexity: {math.exp(valid_loss):.3f}')

                                                         

	Train Perplexity: 615.274
	Valid Perplexity: 366.025


                                                         

	Train Perplexity: 329.955
	Valid Perplexity: 213.202


                                                         

	Train Perplexity: 212.777
	Valid Perplexity: 160.796


                                                         

	Train Perplexity: 163.101
	Valid Perplexity: 138.377


                                                         

	Train Perplexity: 138.657
	Valid Perplexity: 126.304


                                                         

	Train Perplexity: 122.880
	Valid Perplexity: 117.767


                                                         

	Train Perplexity: 111.627
	Valid Perplexity: 112.488


                                                         

	Train Perplexity: 102.991
	Valid Perplexity: 108.746


                                                         

	Train Perplexity: 96.244
	Valid Perplexity: 106.194


                                                         

	Train Perplexity: 90.384
	Valid Perplexity: 104.173


                                                         

	Train Perplexity: 85.481
	Valid Perplexity: 102.181


                                                         

	Train Perplexity: 80.997
	Valid Perplexity: 101.576


                                                         

	Train Perplexity: 77.141
	Valid Perplexity: 99.851


                                                         

	Train Perplexity: 73.722
	Valid Perplexity: 98.369


                                                         

	Train Perplexity: 70.460
	Valid Perplexity: 98.272


                                                         

	Train Perplexity: 67.672
	Valid Perplexity: 97.592


                                                         

	Train Perplexity: 65.024
	Valid Perplexity: 97.069


                                                         

	Train Perplexity: 59.645
	Valid Perplexity: 95.429


                                                         

	Train Perplexity: 53.483
	Valid Perplexity: 95.417


                                                         

	Train Perplexity: 53.354
	Valid Perplexity: 95.406


                                                         

	Train Perplexity: 53.239
	Valid Perplexity: 95.406


                                                         

	Train Perplexity: 53.286
	Valid Perplexity: 95.405


                                                         

	Train Perplexity: 53.360
	Valid Perplexity: 95.405


                                                         

	Train Perplexity: 53.281
	Valid Perplexity: 95.405


                                                         

	Train Perplexity: 53.248
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.304
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.267
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.338
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.382
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.290
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.382
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.331
	Valid Perplexity: 95.404


                                                         

	Train Perplexity: 53.347
	Valid Perplexity: 95.404


                                                         

# Testing

In [33]:
model.load_state_dict(torch.load('./app/pickle/best-val-lstm_lm.pt',  map_location=device))
test_loss = evaluate(model, test_data, criterion, batch_size, seq_len, device)
print(f'Test Perplexity: {math.exp(test_loss):.3f}')

Test Perplexity: 103.879


## 7. Real-world inference

Here we take the prompt, tokenize, encode and feed it into the model to get the predictions.  We then apply softmax while specifying that we want the output due to the last word in the sequence which represents the prediction for the next word.  We divide the logits by a temperature value to alter the model’s confidence by adjusting the softmax probability distribution.

Once we have the Softmax distribution, we randomly sample it to make our prediction on the next word. If we get <unk> then we give that another try.  Once we get <eos> we stop predicting.
    
We decode the prediction back to strings last lines.

In [34]:
def generate(prompt, max_seq_len, temperature, model, tokenizer, vocab, device, seed=None):
    if seed is not None:
        torch.manual_seed(seed)
    model.eval()
    tokens = tokenizer(prompt)
    indices = [vocab[t] for t in tokens]
    batch_size = 1
    hidden = model.init_hidden(batch_size, device)
    with torch.no_grad():
        for i in range(max_seq_len):
            src = torch.LongTensor([indices]).to(device)
            prediction, hidden = model(src, hidden)
            
            #prediction: [batch size, seq len, vocab size]
            #prediction[:, -1]: [batch size, vocab size] #probability of last vocab
            
            probs = torch.softmax(prediction[:, -1] / temperature, dim=-1)  
            prediction = torch.multinomial(probs, num_samples=1).item()    
            
            while prediction == vocab['<unk>']: #if it is unk, we sample again
                prediction = torch.multinomial(probs, num_samples=1).item()

            if prediction == vocab['<eos>']:    #if it is eos, we stop
                break

            indices.append(prediction) #autoregressive, thus output becomes input

    itos = vocab.get_itos()
    tokens = [itos[i] for i in indices]
    return tokens

In [35]:
prompt = 'Harry Potter is '
max_seq_len = 30
seed = 0

#smaller the temperature, more diverse tokens but comes 
#with a tradeoff of less-make-sense sentence
temperatures = [0.5, 0.7, 0.75, 0.8, 1.0]
for temperature in temperatures:
    generation = generate(prompt, max_seq_len, temperature, model, tokenizer, 
                          vocab, device, seed)
    print(str(temperature)+'\n'+' '.join(generation)+'\n')

0.5
harry potter is all right . ” harry felt a very large package in the third morning . harry heard them , who had been sure to get rid of the dursleys .

0.7
harry potter is all right . ’ harry felt that had been closing in the third room . harry heard them , harry saw their way to his knees , and uncle vernon

0.75
harry potter is all right . ’ harry felt that had been closing in the third room . harry heard them , harry saw their shoes to his knees , and uncle vernon

0.8
harry potter is all right . ’ harry felt that had been closing in the third room . “yes , leave the excited floor , he spoke , and then hermione enjoyed no

1.0
harry potter is all ten point . ” “i don ' t bring their market back , but in time to pretend to make their manners to history , ” hermione continued .



# **Task 3: Text Generation** 
**Web Application Development**

Develop a simple web application that demonstrates the capabilities of your language model. (2 points)

1) The application should include an input box where users can type in a text prompt.  
2) Based on the input, the model should generate and display a continuation of the text. For example, if the input is ”Harry Potter is”, the model might generate ”a wizard in the world of Hogwarts”.  
3) Provide documentation on how the web application interfaces with the language model.

### Web application can be accessed locally:  
To deploy application first download repo from github (https://github.com/sachinmalego/NLP-A2-Language-Model.git).   
Open in VSCode and open terminal.  
In the terminal type "python3 app.py". My local deployment address was "http://127.0.0.1:5000/" however your's might be different.  
Go to browser and enter your local deployment server address to test the application. 

**Video of Working application:**
Link of working video:  
https://drive.google.com/file/d/13deBCHkm2k7d1QU6PkjkckAyNfLkLODG/view?usp=sharing

<img src = "screenshots/Applicationvideo.gif" width=800>

**Screen shots of the working application is attached here with:** 
Link of screenshots:  
https://drive.google.com/drive/folders/1sOBa98Bgl8smWl-Z7F2zFSXLtctovYrE?usp=sharing

<img src = "screenshots/Screenshot1.png" width=800>

<img src = "screenshots/Screenshot2.png" width=800>

<img src = "screenshots/Screenshot3.png" width=800>

<img src = "screenshots/Screenshot4.png" width=800>
