# Language model

## Reading file

In [3]:
!gdown --id 1Hb58rR--Qjwr21cLK6A29ROp6Uy5pqq8

Downloading...
From: https://drive.google.com/uc?id=1Hb58rR--Qjwr21cLK6A29ROp6Uy5pqq8
To: /content/train.txt
100% 9.87M/9.87M [00:00<00:00, 160MB/s]


In [4]:
all_lines = []
with open('train.txt', encoding='utf-8') as file:
  for line in file:
    # strip() removes leading and trailing whitespace, including the newline
    tmp_line = line.strip()
    # Remove basic ASCII punctuation so it does not stick to the words
    tmp_line = tmp_line.replace('?','')
    tmp_line = tmp_line.replace('!','')
    tmp_line = tmp_line.replace('.','')
    tmp_line = tmp_line.replace(',','')
    all_lines.append(tmp_line)

In [5]:
print(all_lines[0])
print(all_lines[1])
print(len(all_lines))

زانک دل یا اوست یا خود اوست دل
عکس هر نقشی نتابد تا ابد
188894


## Unigram language model

This function counts how often each word occurs in the corpus and returns those counts together with the total number of words.

In [8]:
def unigram(corpus):
  word_count = {}
  total_words = 0
  for sentence in corpus:
    for word in sentence.split(' '):
      total_words += 1
      if word in word_count:
        word_count[word] +=1
      else:
        word_count[word] = 1
  return word_count, total_words
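
As a cross-check, the same counts can be produced with `collections.Counter`. This is only a sketch on a made-up toy corpus, not the notebook's `train.txt`; the names `unigram_counter` and `toy_corpus` are hypothetical.

```python
from collections import Counter

def unigram_counter(corpus):
    """Count every token in the corpus, splitting on single spaces like unigram()."""
    word_count = Counter()
    for sentence in corpus:
        word_count.update(sentence.split(' '))
    # The total is the sum of all individual word counts
    total_words = sum(word_count.values())
    return word_count, total_words

# Hypothetical toy corpus standing in for all_lines
toy_corpus = ['a b a', 'b c']
toy_count, toy_total = unigram_counter(toy_corpus)
print(toy_count['a'], toy_total)
```

Because both versions split on a single space, consecutive spaces in a line produce empty-string tokens that are counted like any other word.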

In [10]:
word_count, total_words = unigram(all_lines)

In [11]:
print(word_count['فتح'])

316


In [12]:
print(total_words)

1323474


This function calculates the unigram probability of a word: its count divided by the total number of words.

In [13]:
# Log-probability variant (left disabled): needs `import math` and
#   return math.log(word_count[word] / total_words)
def word_unigram_prob(word):
  # Relative frequency of the word in the corpus
  return word_count[word] / total_words

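This function calculates the unigram probability of a sentence as the product of the unigram probabilities of its words.
Because each probability is very small, the product shrinks quickly as the sentence gets longer; summing the logs of the word probabilities avoids this. The commented-out lines show where that variant would go.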

In [15]:
def sentence_unigram_prob(sentence):
  sentence_prob = 1
  # sentence_prob = 0
  for word in sentence.split(' '):
    sentence_prob *= word_unigram_prob(word)
    # sentence_prob += word_unigram_prob(word)
  return sentence_prob
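
The log-space alternative mentioned above can be sketched as follows. The counts here are made up for illustration (`toy_count`, `toy_total` and `sentence_unigram_logprob` are hypothetical names, not part of the notebook), and the second half shows how a plain product of many small numbers underflows to zero while the log sum stays usable.

```python
import math

# Hypothetical toy counts standing in for word_count / total_words
toy_count = {'a': 2, 'b': 2, 'c': 1}
toy_total = 5

def sentence_unigram_logprob(sentence, counts, total):
    """Sum of log unigram probabilities; this equals the log of their product."""
    return sum(math.log(counts[word] / total) for word in sentence.split(' '))

log_p = sentence_unigram_logprob('a b c', toy_count, toy_total)
direct = (2 / 5) * (2 / 5) * (1 / 5)
print(math.isclose(math.exp(log_p), direct))

# Multiplying 100 probabilities of 1e-5 underflows to 0.0 in floating point,
# while the equivalent log sum is an ordinary negative number.
tiny_product = 1.0
for _ in range(100):
    tiny_product *= 1e-5
tiny_log_sum = 100 * math.log(1e-5)
print(tiny_product, tiny_log_sum)
```

Since the logarithm is monotonic, comparing two sentences by log probability gives the same ranking as comparing them by probability.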

In [16]:
prob_1 = sentence_unigram_prob('عکس هر نقشی نتابد تا ابد')

In [17]:
prob_2 = sentence_unigram_prob('عکس هر محکم نتابد تا ابد')

In [19]:
if prob_1>prob_2:
  print('prob_1')
else:
  print('prob_2')

prob_2


In [20]:
print(word_count['محکم'])
print(word_count['نقشی'])

94
45


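As we can see, the correct sentence is the first one, but the unigram model prefers the second. The two sentences differ in a single word, and محکم (94 occurrences) is more frequent than نقشی (45 occurrences), so the second product is larger. Because a unigram model looks at each word in isolation and ignores the connection between neighbouring words, its answer is wrong.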
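
One way to see this limitation is that a unigram score does not depend on word order at all. The sketch below uses made-up probabilities (`toy_prob` and `toy_sentence_prob` are hypothetical) and exact fractions so that the comparison is not affected by floating-point rounding.

```python
from fractions import Fraction

# Hypothetical unigram probabilities for a three-word vocabulary
toy_prob = {
    'the': Fraction(1, 2),
    'cat': Fraction(3, 10),
    'sat': Fraction(1, 5),
}

def toy_sentence_prob(sentence):
    """Product of per-word probabilities, as in sentence_unigram_prob."""
    prob = Fraction(1)
    for word in sentence.split(' '):
        prob *= toy_prob[word]
    return prob

p_ordered = toy_sentence_prob('the cat sat')
p_shuffled = toy_sentence_prob('sat the cat')
# Multiplication is commutative, so any reordering gets the same score
print(p_ordered == p_shuffled)
```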

## Bigram language model

This function collects, for each word in the corpus, the list of words that immediately precede it in a line.

In [25]:
def bigram(corpus):
  word_bigrams = {}
  for sentence in corpus:
    sentence_words = sentence.split(' ')
    for i in range(1,len(sentence_words)):
      if sentence_words[i] not in word_bigrams:
        word_bigrams[sentence_words[i]] = []
      word_bigrams[sentence_words[i]].append(sentence_words[i-1])
  return word_bigrams

In [26]:
word_bigrams = bigram(all_lines)

In [29]:
print(word_bigrams['محکم'])
print(len(word_bigrams['محکم']))

['دینست', 'خار', 'که', 'و', 'من', 'من', 'ز', 'گشته', 'حصار', 'و', 'مثالش', 'دو', 'حکم', 'حصار', 'که', 'ساعدان', 'حکم', 'شده', 'اندیشه', 'گاو', 'کرد', '', 'آن', 'گل', 'عشق', 'آنچنان', 'ام', 'رای', 'آباد', 'گامی', 'شاهی', 'ضربتی', 'رایش', 'تو', 'زرهی', 'راکاندران', 'چو', 'نعلش', 'ملک', 'غنچه', 'ضربتی', 'و', 'عشق', 'و', 'شدست', 'معانی', 'بنیاد', 'و', 'من', 'نماید', 'تو', 'تو', 'بد', 'گره', 'و', 'آن', 'من', 'بند', 'اوست', 'سروری', 'را', 'گشت', 'نه', 'سم', 'ضربتی', 'توست', 'هفت', 'که', 'تو', 'چنان', 'بندی', 'عمر', 'عهد', 'کجا', 'کار', 'نهاد', 'عمر', 'بنیاد', 'کار', 'و', 'حصاری', 'ترا', 'بیناد', 'ای', 'تو', 'تقدیر', 'گاو', 'گره', 'وعده', 'اینت', 'عهدها']
91


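Calculating $p(w_i \mid w_{i-1}) = p(w_{i-1}, w_i) / p(w_{i-1})$ for each word, estimated from the corpus as $\mathrm{count}(w_{i-1}\, w_i) / \mathrm{count}(w_{i-1})$.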

In [30]:
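def word_bigram(word,prev_word):
  # Words observed immediately before `word` in the corpus
  all_prev_words = word_bigrams[word]
  # Total occurrences of prev_word (the denominator)
  prev_word_count = word_count[prev_word]
  # Number of times prev_word directly precedes word
  pair_count = all_prev_words.count(prev_word)
  return pair_count / prev_word_count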

In [32]:
word_bigram('تو','محکم')

0.010638297872340425
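
The same estimate can be computed from a table of adjacent word pairs instead of scanning a list of previous words on every call. The following is a sketch on a made-up toy corpus; `pair_counts`, `unigram_counts` and `toy_word_bigram` are hypothetical names, and the estimator is the one used above (pair count divided by the total count of the previous word).

```python
from collections import Counter

# Hypothetical toy corpus standing in for all_lines
toy_corpus = ['a b a', 'b a c']

unigram_counts = Counter()
pair_counts = Counter()
for sentence in toy_corpus:
    words = sentence.split(' ')
    unigram_counts.update(words)
    # zip pairs each word with its successor: (previous, current)
    pair_counts.update(zip(words, words[1:]))

def toy_word_bigram(word, prev_word):
    """Estimate p(word | prev_word) as count(prev_word word) / count(prev_word)."""
    return pair_counts[(prev_word, word)] / unigram_counts[prev_word]

print(toy_word_bigram('a', 'b'))
print(toy_word_bigram('c', 'a'))
```

In this toy corpus `b` occurs twice and is followed by `a` both times, so the first estimate is 2/2; `a` occurs three times and is followed by `c` once, so the second is 1/3.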

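Calculating the bigram probability of a sentence: the unigram probability of the first word multiplied by the bigram probability of every following word given the word before it.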

In [38]:
def sentence_bigram(sentence):
  sentence_words = sentence.split(' ')
  prob = word_count[sentence_words[0]]/total_words
  for i in range(1, len(sentence_words)):
    prob *= word_bigram(sentence_words[i],sentence_words[i-1])
  return prob
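
Note that the dictionary lookups in these functions raise a `KeyError` when a word was never seen in the corpus (or never seen in a non-initial position). A possible guard, shown here only as a sketch with made-up data, returns a probability of 0.0 in those cases; `safe_word_bigram`, `toy_bigrams` and `toy_counts` are hypothetical names.

```python
def safe_word_bigram(word, prev_word, word_bigrams, word_count):
    """Like word_bigram, but returns 0.0 for unseen words instead of raising KeyError."""
    prev_total = word_count.get(prev_word, 0)
    if prev_total == 0:
        # prev_word never occurred, so the conditional estimate is undefined; use 0.0
        return 0.0
    # An unseen `word` has no recorded predecessors, hence the empty-list default
    return word_bigrams.get(word, []).count(prev_word) / prev_total

# Hypothetical data in the same shape as word_bigrams / word_count
toy_bigrams = {'b': ['a'], 'a': ['b', 'b'], 'c': ['a']}
toy_counts = {'a': 3, 'b': 2, 'c': 1}

print(safe_word_bigram('a', 'b', toy_bigrams, toy_counts))
print(safe_word_bigram('z', 'a', toy_bigrams, toy_counts))
```

A single zero factor makes the whole sentence probability zero, so this guard only avoids the exception; it does not make unseen word pairs comparable.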

In [39]:
prob_1 = sentence_bigram('عکس هر نقشی نتابد تا ابد')

In [40]:
prob_2 = sentence_bigram('عکس هر محکم نتابد تا ابد')

In [41]:
if prob_1>prob_2:
  print('prob_1')
else:
  print('prob_2')

prob_1
