# Programming Assignment 2: Naive Bayes
## Part 1: Language Modelling and Text Generation

#### Name: Abdul Rafay
#### Roll Number: 24100173

### Instructions
*   In this part of the assignment you will be implementing an n-gram model for text-generation.
*   Your code must be in the Python programming language.
*   You are encouraged to use procedural programming and thoroughly comment your code.
*   For Part 1, in addition to standard libraries, i.e. numpy, pandas, regex, matplotlib and scipy, you can use [UrduHack](https://docs.urduhack.com/en/stable/index.html) for tokenization, and [NLTK](https://www.nltk.org/) for training your n-grams. However, no other machine learning toolkits or libraries are allowed.
*   **Carefully read the submission guidelines, plagiarism and late days policy.**

### Submission Guidelines
Submit your code both as a notebook file (.ipynb) and a Python script (.py), as individual files on LMS. Name both files as RollNumber_PA2_PartNum, i.e. this part should be named `2xxxxxxx_PA2_1`. If you don’t know how to save .ipynb as .py see [this](https://i.stack.imgur.com/L1rQH.png). Failing to submit either of them might result in a reduction of marks. All cells **MUST** be run to get credit.

### Plagiarism Policy
The code **MUST** be done independently. Any plagiarism or cheating of work from others or the internet will be immediately referred to the DC. If you are confused about what constitutes plagiarism, it is **YOUR** responsibility to consult with the instructor or the TA in a timely manner. No “after the fact” negotiations will be possible. The only way to guarantee that you do not lose marks is **DO NOT LOOK AT ANYONE ELSE'S CODE NOR DISCUSS IT WITH THEM**.

### Late Days Policy

The deadline for the assignment is final. However, in order to accommodate all the 11th-hour issues, there is a late submission policy, i.e. you can submit your assignment within 3 days after the deadline with a 25% deduction each day.


### Introduction
An n-gram is a contiguous sequence of n words. For example, "Machine" is a unigram, "Machine Learning" is a bigram and "Machine Learning PA2" is a trigram. In language modeling, n-gram models are probabilistic models of text that use word dependencies and context to predict the likelihood of occurrence of an n-gram, i.e. they predict the nth word in an n-gram based on the previous n-1 words:
$$
P(ngram) =  P(word|context) = P(x^{n}|x^{n-1},...,x^{1})
$$
One use of the predictions made by such a model is text generation. In this part you will be training your own n-gram model and using it to generate text after learning from the provided Urdu short stories. 
<br><br>
For additional details on how n-gram models work, you can also consult [Chapter 3](https://web.stanford.edu/~jurafsky/slp3/3.pdf) of the Speech and Language Processing book.
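
As a small illustration of the definition above, the sketch below slides a window of size n over a token list. The helper name `extract_ngrams` and the variable `demo_words` are made up for this example and are not part of the assignment template.

```python
def extract_ngrams(tokens, n):
    """Return every contiguous n-token window of `tokens` as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


demo_words = "Machine Learning PA2".split()
demo_unigrams = extract_ngrams(demo_words, 1)
demo_bigrams = extract_ngrams(demo_words, 2)
demo_trigrams = extract_ngrams(demo_words, 3)
print(demo_unigrams)
print(demo_bigrams)
print(demo_trigrams)
```

A list of N tokens yields N - n + 1 n-grams, which is why each higher order has one fewer n-gram than the order below it.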


### Dataset
You will be using the Urdu short stories by Patras Bukhari given in the folder `Urdu Short Stories` in the PA2 zip file for the purposes of this part of the assignment. This contains 6 stories of varying lengths which will serve as inputs for your n-gram model. 
You're required to implement an n-gram model that uses the given stories to generate Urdu text that mimics the input stories.

Start by importing all required libraries here.

In [1]:
import urduhack
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import os


### 1.1 - Loading and Preprocessing the Dataset

Read in the short story files given and tokenize the text to be preprocessed.

In [2]:
# code here

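dataset_path = 'DataP1'

# store the names of all text files inside the dataset_path folder
# (this also skips the 'Urdu Data.rar' archive kept in the same folder)
files = [name for name in os.listdir(dataset_path) if name.endswith('.txt')]

# the dataset is several Urdu text files; read each one and store it in a list
dataset = []
for file in files:
    # assuming the files are UTF-8 encoded; without an explicit encoding
    # open() falls back to the platform default
    with open(os.path.join(dataset_path, file), 'r', encoding='utf-8') as f:
        dataset.append(f.read())
    print("Read file: ", file)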

Read file:  hostel mein parhna.txt
Read file:  maibal aur main.txt
Read file:  dost k naam.txt
Read file:  cinema ka ishq.txt
Read file:  sawere.txt
Read file:  lahore ka jughrafiya.txt


In [5]:
# normalize the dataset
dataset = [urduhack.normalization.normalize(text) for text in dataset]

In [6]:
# tokenize the data
tokenized_data = []

# enumerate from 1 so the progress message shows which text is being processed
for i, text in enumerate(dataset, start=1):
    print("Tokenizing text: ", i)
    sentences = urduhack.tokenization.sentence_tokenizer(text)
    words = []
    print("Tokenizing sentences")
    for sentence in sentences:
        words.extend(urduhack.tokenization.word_tokenizer(sentence))
    tokenized_data.append(words)

Tokenizing text:  1
Tokenizing sentences
Tokenizing text:  2
Tokenizing sentences
Tokenizing text:  3
Tokenizing sentences
Tokenizing text:  4
Tokenizing sentences
Tokenizing text:  5
Tokenizing sentences
Tokenizing text:  6
Tokenizing sentences


Preprocess the tokenized data. Go through the data and use your own discretion to decide on what kind of pre-processing might be required.
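
One possible pre-processing step is sketched below using only the standard library; it is not the UrduHack routine used in this notebook. It strips every character in a Unicode punctuation category, which covers Urdu marks such as `،` and `۔` as well as ASCII punctuation, and filters empty tokens after cleaning so that punctuation-only tokens do not survive as empty strings. The helper names are hypothetical.

```python
import unicodedata


def strip_punctuation(token):
    """Remove every character whose Unicode category is punctuation (P*)."""
    kept = (ch for ch in token if not unicodedata.category(ch).startswith("P"))
    return "".join(kept).strip()


def clean_tokens(raw_tokens):
    """Strip punctuation from each token, then drop tokens that end up empty."""
    cleaned = (strip_punctuation(tok) for tok in raw_tokens)
    return [tok for tok in cleaned if tok]


sample_tokens = ["ہم", "،", "کالج", "میں۔", "", "(تعلیم)"]
cleaned_sample = clean_tokens(sample_tokens)
print(cleaned_sample)
```

Filtering after cleaning matters because a token that consists only of punctuation becomes an empty string, and an empty token would later be counted as a word and joined into n-grams.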

In [9]:
# sanity check, total number of words in the dataset
total_words = 0
for text in tokenized_data:
    total_words += len(text)
print("Total words in the dataset: ", total_words)

Total words in the dataset:  16028


In [14]:
# create a list of tokens extracted from the dataset
tokens = []
for text in tokenized_data:
    tokens.extend(text)

# create a dictionary of tokens and their frequencies
token_freq = {}
for token in tokens:
    if token in token_freq:
        token_freq[token] += 1
    else:
        token_freq[token] = 1

In [15]:
# save the tokens list to a file (explicit UTF-8 so the Urdu text round-trips on any platform)
with open('tokens.txt', 'w', encoding='utf-8') as f:
    for token in tokens:
        f.write(token + '\n')

In [2]:
# read the tokens from the file
with open('tokens.txt', 'r', encoding='utf-8') as f:
    tokens = f.read().splitlines()

In [7]:
# clean up the tokens
tokens = [urduhack.preprocessing.remove_punctuation(token).strip() for token in tokens if token != '']

In [8]:
len(tokens)

15994

In [9]:
# create a dictionary of tokens and their frequencies
token_freq = {}
for token in tokens:
    if token in token_freq:
        token_freq[token] += 1
    else:
        token_freq[token] = 1

### 1.2 - Creating Unigrams

Start by training a unigram model. For a unigram model, the n-gram probability is approximated by probability of the word in the unigram, as the model assumes independence:

$$
P(word) = \frac{n}{N}
$$

where n = count of the word in the corpus and N = total number of words in the corpus.
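
The formula can be sketched with `collections.Counter` as follows. This is a minimal illustration on a toy token list, not the DataFrame-based approach used later in this notebook, and the function name is made up for the example.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Map each distinct word to count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


toy_tokens = ["a", "b", "a", "c", "a", "b"]
toy_probs = unigram_probabilities(toy_tokens)
print(toy_probs)
```

Because every token is counted exactly once in N, the probabilities over the vocabulary sum to 1.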

Generate a list of unigrams. Print the first 10 unigrams obtained.

In [10]:
unigrams = tokens

print("Total unigrams: ", len(unigrams))
print("First 10 unigrams are: ", unigrams[:10])

Total unigrams:  15994
First 10 unigrams are:  ['ہم', 'نے', 'کالج', 'میں', 'تعلیم', 'تو', 'ضرور', 'پائی', 'اور', 'رفتہ']


Find the probabilities for each unique unigram. 

In [11]:
# create a dataframe for the unigrams and their probabilities
unigram_df = pd.DataFrame(list(token_freq.items()), columns=['Unigram', 'Probability'])
unigram_df['Probability'] = unigram_df['Probability'] / unigram_df['Probability'].sum()

unigram_df.head()

| | Unigram | Probability |
|---|---|---|
| 0 | ہم | 0.006627 |
| 1 | نے | 0.010066 |
| 2 | کالج | 0.001501 |
| 3 | میں | 0.031074 |
| 4 | تعلیم | 0.000688 |


### 1.3 - Creating Bigrams
Now train a bigram model. 

Generate a list of bigrams. Print the first 10 bigrams obtained.

In [12]:
# code here
bigrams = []
for i in range(len(tokens) - 1):
    bigrams.append(tokens[i] + ' ' + tokens[i + 1])

print("Total bigrams: ", len(bigrams))
print("First 10 bigrams are: ", bigrams[:10])

Total bigrams:  15993
First 10 bigrams are:  ['ہم نے', 'نے کالج', 'کالج میں', 'میں تعلیم', 'تعلیم تو', 'تو ضرور', 'ضرور پائی', 'پائی اور', 'اور رفتہ', 'رفتہ رفتہ']


Find the probabilities for each unique bigram. 
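
The assignment text does not spell out the estimate for higher-order models. Assuming the usual maximum-likelihood estimate, with $C(\cdot)$ denoting a count in the corpus (a symbol introduced here), the bigram and trigram probabilities would be:

$$
P(w_2 \mid w_1) = \frac{C(w_1\, w_2)}{C(w_1)}, \qquad
P(w_3 \mid w_1, w_2) = \frac{C(w_1\, w_2\, w_3)}{C(w_1\, w_2)}
$$

For the unigram case the context is empty, so this reduces to $P(word) = \frac{n}{N}$ as in section 1.2.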

In [13]:
# code here
# create a dictionary of bigrams and their frequencies
bigram_freq = {}
for bigram in bigrams:
    if bigram in bigram_freq:
        bigram_freq[bigram] += 1
    else:
        bigram_freq[bigram] = 1

# create a dataframe for the bigrams and their probabilities
bigram_df = pd.DataFrame(list(bigram_freq.items()), columns=['Bigram', 'Probability'])

# calculate the probability of each bigram
bigram_df['Probability'] = bigram_df.apply(lambda row: row['Probability'] / token_freq[row['Bigram'].split()[0]], axis=1)

bigram_df.head()

# save bigram_df to a csv file
bigram_df.to_csv('bigram_df.csv', index=False)

### 1.4 - Creating Trigrams
Lastly train a trigram model.

Generate a list of trigrams. Print the first 10 trigrams obtained.

In [14]:
# code here
trigrams = []
for i in range(len(tokens) - 2):
    trigrams.append(tokens[i] + ' ' + tokens[i + 1] + ' ' + tokens[i + 2])

print("Total trigrams: ", len(trigrams))
print("First 10 trigrams are: ", trigrams[:10])

Total trigrams:  15992
First 10 trigrams are:  ['ہم نے کالج', 'نے کالج میں', 'کالج میں تعلیم', 'میں تعلیم تو', 'تعلیم تو ضرور', 'تو ضرور پائی', 'ضرور پائی اور', 'پائی اور رفتہ', 'اور رفتہ رفتہ', 'رفتہ رفتہ بی']


Find the probabilities for each unique trigram. 

In [17]:
# code here
# create a dictionary of trigrams and their frequencies
trigram_freq = {}
for trigram in trigrams:
    if trigram in trigram_freq:
        trigram_freq[trigram] += 1
    else:
        trigram_freq[trigram] = 1

# create a dataframe for the trigrams and their probabilities
trigram_df = pd.DataFrame(list(trigram_freq.items()), columns=['Trigram', 'Probability'])

# calculate the probability of each trigram
trigram_df['Probability'] = trigram_df.apply(lambda row: row['Probability'] / bigram_freq[' '.join(row['Trigram'].split()[:2])], axis=1)

trigram_df.head()

| | Trigram | Probability |
|---|---|---|
| 0 | ہم نے کالج | 0.028571 |
| 1 | نے کالج میں | 1.0 |
| 2 | کالج میں تعلیم | 0.222222 |
| 3 | میں تعلیم تو | 0.5 |
| 4 | تعلیم تو ضرور | 1.0 |
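
The bigram and trigram cells above follow the same pattern, which could be generalised to any n. The sketch below is one way to do that with plain dictionaries instead of DataFrames; the function name and toy corpus are invented for illustration. It normalises by the number of times a context is followed by some word, so the probabilities for each context sum to 1. That differs very slightly from dividing by the raw count of the context, since the final tokens of the corpus have no continuation.

```python
from collections import Counter


def conditional_ngram_probabilities(tokens, n):
    """Return {n-gram: P(last word | first n-1 words)} estimated from counts."""
    ngram_counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    # how often each (n-1)-word context is followed by any word
    context_counts = Counter()
    for gram, count in ngram_counts.items():
        context_counts[gram[:-1]] += count
    return {
        gram: count / context_counts[gram[:-1]]
        for gram, count in ngram_counts.items()
    }


toy_corpus = "the cat sat on the mat and the cat ate the fish".split()
toy_bigram_probs = conditional_ngram_probabilities(toy_corpus, 2)
toy_trigram_probs = conditional_ngram_probabilities(toy_corpus, 3)
print(toy_bigram_probs[("the", "cat")])
print(toy_trigram_probs[("the", "cat", "sat")])
```

Keying the result by tuples rather than space-joined strings avoids splitting the n-gram again when looking up its context.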


### 1.5 - Generating Text
Generate a paragraph with ten sentences, each containing 9-15 words (pick the length of each sentence randomly within this range), using your language model. Start with trigrams and use the back-off technique (i.e. fall back to the (n-1)-gram model) if no matching n-gram is available for the current context.

For each word prediction, get the top 5 most probable words using the n-gram model and then pick the next word randomly from within these. This is done to avoid excessively repetitive sequences in your generated text.
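
The back-off and top-5 sampling described above can be sketched with dictionary lookups, as an alternative to scanning DataFrames for matching prefixes. Everything here (`build_context_table`, `next_word`, `generate_words`, the toy corpus) is hypothetical and only illustrates the technique under simple assumptions. In particular, the first word is drawn from the top 5 unigrams, which is one possible choice rather than a requirement of the task.

```python
import random
from collections import Counter, defaultdict


def build_context_table(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table


def next_word(history, models, top_k=5, rng=random):
    """Try the highest-order model first and back off to lower orders."""
    for n, table in models:
        if len(history) < n - 1:
            continue  # not enough words yet for this order
        context = tuple(history[len(history) - (n - 1):]) if n > 1 else ()
        followers = table.get(context)
        if followers:
            # keep the top_k most frequent continuations, then pick one at random
            candidates = [word for word, _ in followers.most_common(top_k)]
            return rng.choice(candidates)
    raise ValueError("no model could propose a next word")


def generate_words(models, length, rng=random):
    """Generate `length` words, one at a time, using back-off at every step."""
    history = []
    while len(history) < length:
        history.append(next_word(history, models, rng=rng))
    return history


toy_text = "the cat sat on the mat and the cat ate the fish".split()
# models are ordered from trigram down to unigram, which is the back-off order
toy_models = [(n, build_context_table(toy_text, n)) for n in (3, 2, 1)]
toy_rng = random.Random(0)
toy_length = toy_rng.randint(9, 15)  # random.randint includes both end points
toy_sentence = generate_words(toy_models, toy_length, rng=toy_rng)
print(" ".join(toy_sentence))
```

Ranking continuations by raw count within a context gives the same order as ranking by conditional probability, since every candidate in that context shares the same denominator.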

In [15]:
# generating text
def generate_sentence(unigram_df, bigram_df, trigram_df, n=10, start_word=None):
    # choose the first word: use start_word if given, otherwise sample a unigram
    if start_word is None:
        first_word = unigram_df.sample(n=1)['Unigram'].values[0]
    else:
        first_word = start_word

    sentence = first_word

    # generate the second word from a bigram whose first word is first_word
    # (the trailing space makes the prefix match the whole word, not just its first letters)
    second_word = bigram_df[bigram_df['Bigram'].str.startswith(first_word + ' ')].sample(n=1)['Bigram'].values[0].split()[1]
    sentence += ' ' + second_word

    # generate the rest of the words
    for i in range(n - 2):
        last_two = sentence.split()[-2:]

        # pick the top 5 most probable trigrams that start with the last two words
        top_5_trigrams = trigram_df[trigram_df['Trigram'].str.startswith(' '.join(last_two) + ' ')].sort_values(by='Probability', ascending=False).head(5)

        # if no tri-gram is found, use the bigram model
        if len(top_5_trigrams) == 0:
            top_5_bigrams = bigram_df[bigram_df['Bigram'].str.startswith(last_two[-1] + ' ')].sort_values(by='Probability', ascending=False).head(5)

            # if no bi-gram is found, use the unigram model
            if len(top_5_bigrams) == 0:
                # unigrams have no context, so take the 5 most probable words overall
                top_5_unigrams = unigram_df.sort_values(by='Probability', ascending=False).head(5)
                next_word = top_5_unigrams.sample(n=1)['Unigram'].values[0]
            else:
                next_word = top_5_bigrams.sample(n=1)['Bigram'].values[0].split()[1]
        else:
            next_word = top_5_trigrams.sample(n=1)['Trigram'].values[0].split()[2]

        sentence += ' ' + next_word

    return sentence

In [19]:
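paragraph = ""
# the task asks for ten sentences
for i in range(10):
    # np.random.randint excludes the upper bound, so 16 gives lengths from 9 to 15
    length = np.random.randint(9, 16)
    paragraph += generate_sentence(unigram_df, bigram_df, trigram_df, n=length)
    paragraph += '- '

paragraph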

'اشیاء کا ہےباقی رہے جنگل پہاڑ دریا تو وہاں بھی ایک نہ- قوم کا لکھا ہو سامنے نہایت بے تکلفی سے اپنی- صناعی اور ہنرمندی سے لکھے کہ کاتب قدرت نے بھی یہی مشورہ دیا- سمجھئے کہ لاہور لا ہو رہی ہے اگر اس- موقلم آپ کیون کر رکھ ہے کالج کی تعلیم حاصل کرنے کا- پکتی ہے چنانچہ ابو ثوق سے کہا کہ دل بھی ا نہ تو الہی- عرصے سے کمیٹی کے زیر غور ہے یہ حالت وجاندار اشیاء کا ہےباقی رہے- موقعے زیادہ ملتے رہتے ہیں دوسری قسم جلالی طلباء کی- مت باہم بحث مباحثے رہتے ہم میں سے چند مشہور ہیں قسم اولی جمالی- '

In [21]:
# analyze the generated text
tokenized_paragraph = urduhack.tokenization.word_tokenizer(urduhack.preprocessing.remove_punctuation(paragraph))

# create a dictionary of tokens and their frequencies
# (a separate name keeps the corpus-level token_freq from section 1.1 intact)
paragraph_token_freq = {}
for token in tokenized_paragraph:
    if token in paragraph_token_freq:
        paragraph_token_freq[token] += 1
    else:
        paragraph_token_freq[token] = 1

print("Total tokens: ", len(tokenized_paragraph))

for token, freq in paragraph_token_freq.items():
    print(token, freq)

Total tokens:  63
اشیاء 1
کاہے 1
باقی 1
رہے 1
جنگل 1
پہاڑدریا 1
تو 1
وہاں 1
بھی 2
ایک 1
نہ 1
قوم 1
کالکھاہوسامنے 1
نہایت 1
بے 1
تکلفی 1
سے 4
اپنی 1
صناعی 1
اورہنرمندی 1
لکھے 1
کہ 3
کاتب 1
قدرت 1
نے 1
یہی 1
مشورہ 1
دیا 1
سمجھئے 1
لاہورلاہورہی 1
ہے 1
اگر 1
اس 1
موقلم 1
آپ 1
کیون 1
کر 1
رکھہے 1
کالج 1
کی 1
تعلیم 1
حاصل 1
کرنے 1
کاپکتیہے 1
چنانچہ 1
ابوثوق 1
کہا 1
دل 1
بھیانہ 1
توالہی 1
عرصے 1
کمیٹی 1
کے 1
زیرغورہے 1
یہ 1
حالت 1
و 1


### 1.6 - Discussion and Evaluation

- Analyze the text generated, and mention 3 distinct observations. Also compare it with the input text and how different it is and why might that be.
- Is going up to n=3 enough? What do you think would be a good value of n and why?

Answer here:

1. Most of the tokens in the generated text are unique and only a few are repeated. This is because the model is trained on a small dataset with a small number of tokens (15994 after cleaning).
2. Yes, going up to n=3 is generally enough. However, if the dataset is very large and the number of tokens is also large, then a higher value of n can be used. With more data, longer contexts are seen often enough for the model to learn more about the context of each token and to generate more meaningful text.
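
One way to probe the second answer is to measure how sparse the n-gram counts become as n grows. The sketch below computes the fraction of distinct n-grams that occur exactly once. The function name and toy corpus are invented for illustration, and no figures for the actual stories are claimed here.

```python
from collections import Counter


def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once in `tokens`."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not counts:
        return 0.0
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)


toy_story = "the cat sat on the mat and the cat ate the fish".split()
for order in range(1, 5):
    print(order, singleton_fraction(toy_story, order))
```

When almost every n-gram of a given order is a singleton, the model has only one continuation per context and simply reproduces the training text, which suggests that order is too high for the amount of data available.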