# Programming Assignment 2: Naive Bayes
## Part 1: Language Modelling and Text Generation

#### Name: Umama Nasir Abbasi
#### Roll Number: 23100265

### Instructions
*   In this part of the assignment you will be implementing an n-gram model for text-generation.
*   Your code must be in the Python programming language.
*   You are encouraged to use procedural programming and thoroughly comment your code.
*   For Part 1, in addition to standard libraries, i.e. numpy, pandas, regex, matplotlib and scipy, you can use [UrduHack](https://docs.urduhack.com/en/stable/index.html) for tokenization and [NLTK](https://www.nltk.org/) for training your n-grams. However, no other machine learning toolkits or libraries are allowed.
*   **Carefully read the submission guidelines, plagiarism and late days policy.**

### Submission Guidelines
Submit your code as both a notebook file (.ipynb) and a Python script (.py), as individual files on LMS. Name both files as RollNumber_PA2_PartNum, i.e. this part should be named as `2xxxxxxx_PA4_1`. If you don’t know how to save .ipynb as .py, see [this](https://i.stack.imgur.com/L1rQH.png). Failing to submit either one might result in a reduction of marks. All cells **MUST** be run to get credit.

### Plagiarism Policy
The code **MUST** be done independently. Any plagiarism or cheating of work from others or the internet will be immediately referred to the DC. If you are confused about what constitutes plagiarism, it is **YOUR** responsibility to consult with the instructor or the TA in a timely manner. No “after the fact” negotiations will be possible. The only way to guarantee that you do not lose marks is **DO NOT LOOK AT ANYONE ELSE'S CODE NOR DISCUSS IT WITH THEM**.

### Late Days Policy

The deadline for the assignment is final. However, in order to accommodate all the 11th-hour issues, there is a late submission policy, i.e. you can submit your assignment within 3 days after the deadline with a 25% deduction each day.


### Introduction
An n-gram is a contiguous sequence of n words. For example, "Machine" is a unigram, "Machine Learning" is a bigram and "Machine Learning PA2" is a trigram. In language modeling, n-gram models are probabilistic models of text that use word dependencies and context to predict the likelihood of occurrence of an n-gram, i.e. predicting the nth word in an n-gram based on the previous n-1 words:
$$
P(ngram) =  P(word|context) = P(x^{n}|x^{n-1},...,x^{1})
$$
One use of the predictions made by such a model is text generation. In this part you will be training your own n-gram model and using it to generate text after learning from the provided Urdu short stories. 
<br><br>
For additional details on how n-gram models work, you can also consult [Chapter 3](https://web.stanford.edu/~jurafsky/slp3/3.pdf) of the Speech and Language Processing book as a reference.
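
To see how these word-level predictions combine into the probability of a whole sequence, the chain rule can be applied and each context then truncated to the previous n-1 words (the Markov assumption). The sequence length $m$ is a symbol introduced here for illustration, and contexts near the start of the sequence are simply shorter:

$$
P(x^{1},...,x^{m}) = \prod_{i=1}^{m} P(x^{i}|x^{i-1},...,x^{1}) \approx \prod_{i=1}^{m} P(x^{i}|x^{i-1},...,x^{i-n+1})
$$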


### Dataset
You will be using the Urdu short stories by Patras Bukhari given in the folder `Urdu Short Stories` in the PA2 zip file for the purposes of this part of the assignment. This contains 6 stories of varying lengths which will serve as inputs for your n-gram model. 
You're required to implement an n-gram model that uses the given stories to generate Urdu text that mimics the input stories.

Start by importing all required libraries here.

In [None]:
!pip install urduhack
!pip install nltk


In [None]:
# import all required libraries here
import numpy as np
import pandas as pd
import nltk
import urduhack
import glob

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 1.1 - Loading and Preprocessing the Dataset

Read in the short story files given and tokenize the text to be preprocessed.

In [None]:
# code here
# Collect the paths of all story files
short_story = glob.glob('/content/drive/MyDrive/Fall2022-2023/CS535ML/PA4/DataP1/*.txt')
processed = []

# Read each story as UTF-8 text; the file is closed automatically
for path in short_story:
  with open(path, encoding='utf-8') as content:
    processed.append(content.read())

def getTokens(story):
  # Tokenize by splitting on single spaces
  return story.split(' ')


# Concatenate the tokens of all stories into one list
total_tokens = []
for story in processed:
  total_tokens = total_tokens + getTokens(story)




Preprocess the tokenized data. Go through the data and use your own discretion to decide on what kind of pre-processing might be required.

In [None]:
# code here
for i in range(len(total_tokens)):
  total_tokens[i] = urduhack.normalization.normalize(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.remove_punctuation(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.remove_accents(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.replace_numbers(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.replace_currency_symbols(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.replace_phone_numbers(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.replace_urls(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.replace_emails(total_tokens[i])
  total_tokens[i] = urduhack.preprocessing.normalize_whitespace(total_tokens[i])



total_tokens = np.array(total_tokens)
total_tokens.shape  

(16033,)
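
Two details of space-based tokenization are worth keeping in mind: splitting on a single space leaves newlines attached to tokens, and a token made up only of punctuation becomes an empty string once the punctuation is removed. A possible refinement is sketched below using only the standard library. The names `simple_tokenize` and `clean_tokens` and the `PUNCTUATION` set are illustrative assumptions, not part of urduhack.

```python
import re

# Hypothetical punctuation set for illustration; urduhack's own list may differ.
PUNCTUATION = "۔،؟!؛:\"'()"

def simple_tokenize(text):
    """Split on any run of whitespace (spaces, tabs, newlines)."""
    return re.split(r"\s+", text.strip())

def clean_tokens(tokens):
    """Strip the punctuation above from each token and drop empty results."""
    table = str.maketrans("", "", PUNCTUATION)
    cleaned = (tok.translate(table) for tok in tokens)
    return [tok for tok in cleaned if tok]

sample = "ہم نے کالج میں\nتعلیم پائی ۔"
tokens = clean_tokens(simple_tokenize(sample))
print(tokens)
```

Dropping empty tokens before counting keeps them out of the n-gram tables, where they would otherwise act as ordinary words.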

### 1.2 - Creating Unigrams

Start by training a unigram model. For a unigram model, the n-gram probability is approximated by the probability of the word itself, as the model assumes independence between words:

$$
P(word) = \frac{n}{N}
$$

where n = count of the word in the corpus and N = total number of words in the corpus.
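
As a small worked instance of this formula, the sketch below computes n/N for a toy token list with `collections.Counter`. The function name and the toy data are illustrative assumptions rather than part of the assignment.

```python
from collections import Counter

def unigram_probabilities(tokens):
    """Return {word: n / N} for every distinct word in tokens."""
    counts = Counter(tokens)          # n for each word
    total = sum(counts.values())      # N, the total number of tokens
    return {word: n / total for word, n in counts.items()}

toy_tokens = ["a", "b", "a", "c", "a", "b"]
toy_probs = unigram_probabilities(toy_tokens)
print(toy_probs["a"])  # 3 of the 6 tokens are "a"
```

Because every token is counted exactly once, the probabilities over the vocabulary sum to 1.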

Generate a list of unigrams. Print the first 10 unigrams obtained.

In [None]:
# code here
u_grams = nltk.ngrams(total_tokens,1)
ugram_array = np.array(list(u_grams))
for i in range(10):
  print(ugram_array[i])

['ہم']
['نے']
['کالج']
['میں']
['تعلیم']
['تو']
['ضرور']
['پائی']
['اور']
['رفتہ']


Find the probabilities for each unique unigram. 

In [None]:
# code here
dict_ugram = {}
words, occurance = np.unique(ugram_array, return_counts = True)
p = []
total_occ = occurance.sum()
for i in range(occurance.shape[0]):
  dict_ugram[words[i]] = occurance[i]/total_occ
  p.append(occurance[i]/total_occ)

### 1.3 - Creating Bigrams
Now train a bigram model. 
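
For bigrams, the usual maximum-likelihood estimate conditions on the previous word instead of dividing by the total number of bigrams:

$$
P(w_2|w_1) = \frac{count(w_1, w_2)}{count(w_1)}
$$

A minimal sketch of this estimate follows, assuming `count(w_1)` is taken as the number of bigrams that start with `w_1`, so the probabilities for each context sum to 1. The function name and toy data are illustrative.

```python
from collections import Counter

def bigram_conditional_probabilities(tokens):
    """Return {(w1, w2): count(w1, w2) / count(bigrams starting with w1)}."""
    bigrams = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(bigrams)
    context_counts = Counter(w1 for w1, _ in bigrams)
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}

toy_tokens = ["a", "b", "a", "c", "a", "b"]
toy_bigram_probs = bigram_conditional_probabilities(toy_tokens)
print(toy_bigram_probs[("a", "b")])  # "a" is followed by "b" in 2 of its 3 bigrams
```

Ranking the continuations of a fixed first word gives the same order under this estimate as under the joint relative frequency, since both divide the same counts by a constant.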

Generate a list of bigrams. Print the first 10 bigrams obtained.

In [None]:
# code here
bigram = nltk.bigrams(total_tokens)
bigram = np.array(list(bigram))
for i in range(10):
  print(bigram[i])

['ہم' 'نے']
['نے' 'کالج']
['کالج' 'میں']
['میں' 'تعلیم']
['تعلیم' 'تو']
['تو' 'ضرور']
['ضرور' 'پائی']
['پائی' 'اور']
['اور' 'رفتہ']
['رفتہ' 'رفتہ']


Find the probabilities for each unique bigram. 

In [None]:
# code here
dict_bigram = {}
words_bigram, occurance_bigram = np.unique(bigram, return_counts = True,axis=0)
p_bigram = []
total_occ = occurance_bigram.sum()
for i in range(occurance_bigram.shape[0]):
  w = tuple(words_bigram[i])
  dict_bigram[w] = occurance_bigram[i]/total_occ
  p_bigram.append(occurance_bigram[i]/total_occ)

### 1.4 - Creating Trigrams
Lastly train a trigram model.

Generate a list of trigrams. Print the first 10 trigrams obtained.

In [None]:
# code here
trigram = nltk.ngrams(total_tokens,3)
trigram = np.array(list(trigram))
for i in range(10):
  print(trigram[i])

['ہم' 'نے' 'کالج']
['نے' 'کالج' 'میں']
['کالج' 'میں' 'تعلیم']
['میں' 'تعلیم' 'تو']
['تعلیم' 'تو' 'ضرور']
['تو' 'ضرور' 'پائی']
['ضرور' 'پائی' 'اور']
['پائی' 'اور' 'رفتہ']
['اور' 'رفتہ' 'رفتہ']
['رفتہ' 'رفتہ' 'بیاے']


Find the probabilities for each unique trigram. 

In [None]:
# code here
dict_trigram = {}
words_trigram, occurance_trigram = np.unique(trigram, return_counts = True,axis=0)
p_trigram = []
total_occ = occurance_trigram.sum()
for i in range(occurance_trigram.shape[0]):
  w = tuple(words_trigram[i])
  dict_trigram[w] = occurance_trigram[i]/total_occ
  p_trigram.append(occurance_trigram[i]/total_occ)

### 1.5 - Generating Text
Generate a paragraph with ten sentences, each containing 9-15 words (pick the length of each sentence randomly within this range), using your language model. Start with trigrams and use the back-off technique (i.e. fall back to the (n-1)-gram) if a token is not available.

For each word prediction, get the top 5 most probable words using the n-gram model and then pick the next word randomly from among them. This avoids excessively repetitive sequences in your generated text.
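
The procedure described above can be sketched roughly as follows. This is one possible reading, under stated assumptions: the model backs off only when a context has no recorded continuation at all, it samples uniformly among up to five of the most frequent continuations, and all function names and the toy corpus are illustrative.

```python
import random
from collections import Counter, defaultdict

def build_continuations(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        table[context][tokens[i + n - 1]] += 1
    return table

def next_word(history, tables, top_k=5, rng=random):
    """Try the longest context first, then back off to shorter ones."""
    for n in sorted(tables, reverse=True):
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        if len(context) != n - 1:
            continue  # not enough history yet for this order
        candidates = tables[n].get(context)
        if candidates:
            top = [word for word, _ in candidates.most_common(top_k)]
            return rng.choice(top)
    raise ValueError("no continuation found; is the corpus empty?")

def generate_sentence(tables, min_len=9, max_len=15, rng=random):
    """Generate one sentence whose length is drawn from [min_len, max_len]."""
    length = rng.randint(min_len, max_len)  # both ends inclusive
    words = []
    while len(words) < length:
        words.append(next_word(words, tables, rng=rng))
    return " ".join(words)

toy_tokens = "a b c a b d a b c a c d".split()
tables = {n: build_continuations(toy_tokens, n) for n in (1, 2, 3)}
rng = random.Random(0)  # seeded only to make the sketch repeatable
paragraph = [generate_sentence(tables, rng=rng) for _ in range(10)]
print(paragraph[0])
```

Storing continuations per context avoids scanning every n-gram for each predicted word, and the unigram table (empty context) guarantees that back-off always finds a candidate.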

In [None]:
def get_bigrams(w):
  # Collect every bigram whose first word is w, with its probability
  final = {k: v for k, v in dict_bigram.items() if k[0] == w}
  # Signal back-off with "" when fewer than 5 candidates exist
  if len(final) < 5:
    return ""
  # Keep the 5 most probable bigrams and pick one uniformly at random
  keys = sorted(final, key=final.get, reverse=True)[:5]
  return tuple(keys[np.random.randint(5)])


In [None]:
def get_trigrams(w):
  # Collect every trigram whose first two words match the pair w
  final = {k: v for k, v in dict_trigram.items() if k[0] == w[0] and k[1] == w[1]}
  # Signal back-off with "" when fewer than 5 candidates exist
  if len(final) < 5:
    return ""
  # Keep the 5 most probable trigrams and pick one uniformly at random
  k = sorted(final, key=final.get, reverse=True)[:5]
  return tuple(k[np.random.randint(5)])





In [None]:
final_sentences = []
for i in range(10):
  # Seed each sentence with a word drawn at random from the corpus
  word = np.random.choice(ugram_array.flatten(), 1)
  sent = word[0]
  # Keep extending until the sentence reaches 15 words
  while len(sent.split(" ")) < 15:
    temp = sent.split()
    bi_gram = get_bigrams(temp[-1])
    if bi_gram != "":
      # Extend with the second word of the sampled bigram
      sent = sent + " " + bi_gram[1]
      temp1 = sent.split()
      # tri_gram is a local name so the trigram array from 1.4 is not overwritten
      tri_gram = get_trigrams((temp1[-2], temp1[-1]))
      if tri_gram != "":
        sent = sent + " " + tri_gram[2]
        # Keep using trigrams for as long as they are available
        while tri_gram != "" and len(sent.split(" ")) < 15:
          temp3 = sent.split()
          tri_gram = get_trigrams((temp3[-2], temp3[-1]))
          if tri_gram != "":
            sent = sent + " " + tri_gram[2]
    else:
      # Back off to a random unigram when no bigram is available
      uni = np.random.choice(ugram_array.flatten(), 1)
      sent = sent + " " + uni[0]
  final_sentences.append(sent)
  print(sent)





تھیٹروں تو ضرور حاصل تھا وہ تو اب میں سے آسمان تین اصحاب چنانچہ اس
ہوا ہے اس سے آپ کو ان کو باہر جاتے تاریخ اور اس میں جو
اور ان میں نے اپنے کام آتے یہ ہوا تو ہم اس میں ضرور ایسی
طور وہ یہ کہ فارسی میں ایک آزادی ہیں اس کی باتیں ذہن گی آپ
اس میں سفر بعض ایسے لوگ اجی انجمن تو ضرور مضمر بس صاحب کے متعلق میں
آواز اور آپ کی آزاد کی ایک اور ان بیرون مصوری ذرا تاریک دلآویزیوں کی
سینماسے اگر آپ کے سامنے آبیٹھے کیوں نہ کوئی بات ایسی اچھی فلم کون ہمارے
تمہارے ہاتھ سے کام نہ کوئی ایسی بےمروتی سے پہلے مرزا سے کام لیا اور
بیٹھے لیکن جب آپ کو ان کی تسلی آدھ منٹ بس یہی تو میں بائیں
ایک آدھ کروٹ اس میں سما خیال سے آپ کیونکر لیڈروں تاریخی اور آپ کی


In [None]:
combined = ''
for i in final_sentences:
  combined = combined + " " + i
  

In [None]:
combined

' تھیٹروں تو ضرور حاصل تھا وہ تو اب میں سے آسمان تین اصحاب چنانچہ اس ہوا ہے اس سے آپ کو ان کو باہر جاتے تاریخ اور اس میں جو اور ان میں نے اپنے کام آتے یہ ہوا تو ہم اس میں ضرور ایسی طور وہ یہ کہ فارسی میں ایک آزادی ہیں اس کی باتیں ذہن گی آپ اس میں سفر بعض ایسے لوگ اجی انجمن تو ضرور مضمر بس صاحب کے متعلق میں آواز اور آپ کی آزاد کی ایک اور ان بیرون مصوری ذرا تاریک دلآویزیوں کی سینماسے اگر آپ کے سامنے آبیٹھے کیوں نہ کوئی بات ایسی اچھی فلم کون ہمارے تمہارے ہاتھ سے کام نہ کوئی ایسی بےمروتی سے پہلے مرزا سے کام لیا اور بیٹھے لیکن جب آپ کو ان کی تسلی آدھ منٹ بس یہی تو میں بائیں ایک آدھ کروٹ اس میں سما خیال سے آپ کیونکر لیڈروں تاریخی اور آپ کی'

### 1.6 - Discussion and Evaluation

- Analyze the generated text and mention 3 distinct observations. Also compare it with the input text: how different is it, and why might that be?
- Is going up to n=3 enough? What do you think would be a good value of n and why?

Answer here:

1- Short phrases make sense locally, but the sentences as a whole do not. The generated text also leans heavily on very common function words such as iss, hai and koo.


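2- We can go beyond n=3 for more cohesive sentences; however, with a corpus of this size, higher-order n-grams become sparse and the model may overfit, largely reproducing the training text.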
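
One rough way to probe the second point is to check how often n-grams occur only once as n grows: when most n-grams were seen a single time, the model has little choice but to replay the training text. The sketch below measures this on a toy corpus. The function name and the data are illustrative assumptions, and the same check could be run on the story tokens.

```python
from collections import Counter

def singleton_fraction(tokens, n):
    """Fraction of distinct n-grams that occur exactly once in tokens."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not grams:
        return 0.0
    return sum(1 for c in grams.values() if c == 1) / len(grams)

toy_tokens = "a b c a b d a b c a c d".split()
fractions = {n: singleton_fraction(toy_tokens, n) for n in (1, 2, 3, 4)}
print(fractions)
```

A fraction close to 1 for some n suggests that the corpus is too small to support that order reliably.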