<h1>Algorithm</h1>

* A bigram language model is trained on the corpus. <br />
* A second bigram language model is trained on the same corpus with each sentence reversed. <br />
* Given a context word, the generator randomly decides at each step whether to add a prefix or a suffix. <br />
* To add a prefix, the word is generated from the language model trained on the reversed sentences. <br />
* To add a suffix, the word is generated from the language model trained on the original sentences. <br />
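
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the NLTK code used in this notebook: the helper names `train_bigram_counts`, `sample_next`, and `grow` and the toy sentences are assumptions made for the example. Unlike the notebook, it counts bigrams within each sentence only, and it stops early (rather than returning nothing) when a context word has no known continuation.

```python
import random
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Map each word to a Counter of the words that directly follow it."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for left, right in zip(sent, sent[1:]):
            counts[left][right] += 1
    return counts


def sample_next(counts, word, rng):
    """Draw a follower of `word` in proportion to its bigram count."""
    followers = counts.get(word)
    if not followers:
        return None  # `word` was never seen with a follower
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def grow(forward, backward, seed, num_words=5, rng=None):
    """Extend `seed` by randomly adding a suffix or a prefix at each step."""
    rng = rng or random.Random()
    text = [seed]
    for _ in range(num_words):
        if rng.random() > 0.5:
            # suffix: continue from the rightmost word with the forward model
            word = sample_next(forward, text[-1], rng)
            if word is None:
                break
            text.append(word)
        else:
            # prefix: continue from the leftmost word with the reversed model
            word = sample_next(backward, text[0], rng)
            if word is None:
                break
            text.insert(0, word)
    return " ".join(text)


# Hypothetical toy data, already tokenized.
toy_sentences = [["the", "dog", "is", "sleeping"], ["the", "coffee", "is", "warm"]]
forward = train_bigram_counts(toy_sentences)
backward = train_bigram_counts([list(reversed(s)) for s in toy_sentences])
print(grow(forward, backward, "is", rng=random.Random(0)))
```

Because a prefix drawn from the reversed model is always a word that preceded the current leftmost word somewhere in the training text, every adjacent pair in the output is a bigram observed in the original word order.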

## Algorithm applied to the corpus

In [1]:
import os
import nltk
import string
import random

from nltk.util import ngrams

In [2]:
FILE_PATH = './corpus.txt'

stopwords = ['', '(', ')', '{', '}', '\\', '--', ':', '-', "'s"]
punc = string.punctuation + "``" + "''" + '"'

In [3]:
with open(FILE_PATH, 'r') as f:
    data = f.read().lower().replace('\n',' ')

In [4]:
len(data)

3954021

In [5]:
type(data)

str

In [6]:
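def tokenized_words(data):
    """Split text into sentences, then into words, dropping stopwords and punctuation."""
    sents = []
    for sent in nltk.sent_tokenize(data):
        # `punc` is a string, so `word not in punc` is a substring test: it drops any
        # token that occurs inside `punc`, such as single punctuation characters
        # and the `` and '' quote tokens.
        words = [word for word in nltk.word_tokenize(sent) if word not in stopwords and word not in punc]
        sents.append(words)

    return sents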

In [7]:
def tokenized_rev_words(data):
    # Same tokenization as tokenized_words, with the word order of each sentence reversed.
    return [list(reversed(words)) for words in tokenized_words(data)]

In [8]:
sents = tokenized_words(data)
rev_sents = tokenized_rev_words(data)

In [9]:
len(sents)

40992

In [10]:
train_corpus = [word for sent in sents for word in sent]
rev_train_corpus = [word for sent in rev_sents for word in sent] 

In [11]:
len(train_corpus)

666665

In [12]:
def ngram_freq_dist(corpus, ngram=1):
    """Return a frequency distribution over unigrams, bigrams, or trigrams of `corpus`."""
    if isinstance(corpus, list) and len(corpus) > 0:
        train_corpus = corpus
    elif isinstance(corpus, str):
        train_corpus = nltk.word_tokenize(corpus)
    else:
        print('Error: corpus must be a non-empty list of tokens or a string')
        return None

    freq_dist = None
    if ngram == 1:
        freq_dist = nltk.FreqDist(train_corpus)  # frequency distribution for unigrams
    elif ngram == 2:
        # conditional frequency distribution: first word -> counts of the following word
        freq_dist = nltk.ConditionalFreqDist(nltk.ngrams(train_corpus, 2))
    elif ngram == 3:
        # condition on the first two words of each trigram and count the third
        trigrams_as_bigrams = [((t[0], t[1]), t[2]) for t in ngrams(train_corpus, 3)]
        freq_dist = nltk.ConditionalFreqDist(trigrams_as_bigrams)
    else:
        print('Supported up to trigrams only')
    return freq_dist

In [13]:
cfd_2gram = ngram_freq_dist(train_corpus, 2)
cfd_2gram_rev = ngram_freq_dist(rev_train_corpus, 2)

cpd_2gram = nltk.ConditionalProbDist(cfd_2gram, nltk.MLEProbDist)
cpd_2gram_rev = nltk.ConditionalProbDist(cfd_2gram_rev, nltk.MLEProbDist)

In [14]:
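def generate_txt_bigram_model_random(cprob_2gram, cprob_2gram_rev, initialword, numwords=15):
    """Grow a sentence around `initialword` by randomly adding suffixes and prefixes."""
    text = initialword
    suf_word = initialword  # rightmost word so far, context for the next suffix
    pre_word = initialword  # leftmost word so far, context for the next prefix
    for _ in range(numwords):
        if random.random() > 0.5:
            # suffix: sample the next word from the forward model
            try:
                suf_word = cprob_2gram[suf_word].generate()
                text = text + ' ' + suf_word
            except Exception:
                print('Cannot generate the sentence')
                return
        else:
            # prefix: sample the preceding word from the reversed model
            try:
                pre_word = cprob_2gram_rev[pre_word].generate()
                text = pre_word + ' ' + text
            except Exception:
                print('Cannot generate the sentence')
                return
    return text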

<h1>Results</h1>

In [21]:
for _ in range(5):
    print(generate_txt_bigram_model_random(cpd_2gram, cpd_2gram_rev, 'trump'))
    print('----------------------------------------')

his grave problem is he believes donald trump foundation of rt berniesanders holds a great to
----------------------------------------
our justice reform a trump believes that donald trump pay to cut social security issues these
----------------------------------------
13th because americans if donald trump got to see her greeting western industrialized nation in the
----------------------------------------
a government makes billions in july calling for president trump from every person than 56 of
----------------------------------------
the drug prices you still beat trump is not class knows how would be too many
----------------------------------------


## Our algorithm applied to a small example

In [16]:
small_corpus = 'the day is quite bright today i am feeling very good \
this is my best feeling i have ever experienced the dog is sleeping peacefully at one corner \
the coffee is warm'

In [17]:
corpus = tokenized_words(small_corpus)[0]
corpus

['the',
 'day',
 'is',
 'quite',
 'bright',
 'today',
 'i',
 'am',
 'feeling',
 'very',
 'good',
 'this',
 'is',
 'my',
 'best',
 'feeling',
 'i',
 'have',
 'ever',
 'experienced',
 'the',
 'dog',
 'is',
 'sleeping',
 'peacefully',
 'at',
 'one',
 'corner',
 'the',
 'coffee',
 'is',
 'warm']

In [18]:
bigrams_freq_dist = ngram_freq_dist(corpus, 2)
dict(bigrams_freq_dist)

{'the': FreqDist({'day': 1, 'dog': 1, 'coffee': 1}),
 'day': FreqDist({'is': 1}),
 'is': FreqDist({'quite': 1, 'my': 1, 'sleeping': 1, 'warm': 1}),
 'quite': FreqDist({'bright': 1}),
 'bright': FreqDist({'today': 1}),
 'today': FreqDist({'i': 1}),
 'i': FreqDist({'am': 1, 'have': 1}),
 'am': FreqDist({'feeling': 1}),
 'feeling': FreqDist({'very': 1, 'i': 1}),
 'very': FreqDist({'good': 1}),
 'good': FreqDist({'this': 1}),
 'this': FreqDist({'is': 1}),
 'my': FreqDist({'best': 1}),
 'best': FreqDist({'feeling': 1}),
 'have': FreqDist({'ever': 1}),
 'ever': FreqDist({'experienced': 1}),
 'experienced': FreqDist({'the': 1}),
 'dog': FreqDist({'is': 1}),
 'sleeping': FreqDist({'peacefully': 1}),
 'peacefully': FreqDist({'at': 1}),
 'at': FreqDist({'one': 1}),
 'one': FreqDist({'corner': 1}),
 'corner': FreqDist({'the': 1}),
 'coffee': FreqDist({'is': 1})}

In [19]:
# reversing the corpus
rev_corpus = tokenized_rev_words(small_corpus)[0]
rev_corpus

['warm',
 'is',
 'coffee',
 'the',
 'corner',
 'one',
 'at',
 'peacefully',
 'sleeping',
 'is',
 'dog',
 'the',
 'experienced',
 'ever',
 'have',
 'i',
 'feeling',
 'best',
 'my',
 'is',
 'this',
 'good',
 'very',
 'feeling',
 'am',
 'i',
 'today',
 'bright',
 'quite',
 'is',
 'day',
 'the']

In [20]:
bigrams_freq_dist_rev = ngram_freq_dist(rev_corpus, 2)
dict(bigrams_freq_dist_rev)

{'warm': FreqDist({'is': 1}),
 'is': FreqDist({'coffee': 1, 'dog': 1, 'this': 1, 'day': 1}),
 'coffee': FreqDist({'the': 1}),
 'the': FreqDist({'corner': 1, 'experienced': 1}),
 'corner': FreqDist({'one': 1}),
 'one': FreqDist({'at': 1}),
 'at': FreqDist({'peacefully': 1}),
 'peacefully': FreqDist({'sleeping': 1}),
 'sleeping': FreqDist({'is': 1}),
 'dog': FreqDist({'the': 1}),
 'experienced': FreqDist({'ever': 1}),
 'ever': FreqDist({'have': 1}),
 'have': FreqDist({'i': 1}),
 'i': FreqDist({'feeling': 1, 'today': 1}),
 'feeling': FreqDist({'best': 1, 'am': 1}),
 'best': FreqDist({'my': 1}),
 'my': FreqDist({'is': 1}),
 'this': FreqDist({'good': 1}),
 'good': FreqDist({'very': 1}),
 'very': FreqDist({'feeling': 1}),
 'am': FreqDist({'i': 1}),
 'today': FreqDist({'bright': 1}),
 'bright': FreqDist({'quite': 1}),
 'quite': FreqDist({'is': 1}),
 'day': FreqDist({'the': 1})}

As we can see, reversing the corpus yields a frequency distribution in exactly the format needed for generating prefixes: each word maps to the counts of the words that precede it in the original text. A prefix can then be generated by converting this frequency distribution into a probability distribution and sampling a word from it.
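
The conversion from counts to probabilities can be illustrated on two entries of the reversed distribution shown above. This is a sketch under assumptions: it uses plain `collections.Counter` objects instead of NLTK's classes, and the names `reversed_counts`, `mle_probabilities`, and `prefix_probs` are hypothetical. The estimate is the maximum-likelihood one, where each count is divided by the total count for its context word.

```python
import random
from collections import Counter

# Reversed-corpus counts for two context words, copied from the output above.
reversed_counts = {
    "is": Counter({"coffee": 1, "dog": 1, "this": 1, "day": 1}),
    "feeling": Counter({"best": 1, "am": 1}),
}


def mle_probabilities(counter):
    """Maximum-likelihood estimate: each count divided by the total count."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}


prefix_probs = {ctx: mle_probabilities(c) for ctx, c in reversed_counts.items()}
print(prefix_probs["is"])
# → {'coffee': 0.25, 'dog': 0.25, 'this': 0.25, 'day': 0.25}

# Sample one prefix for the context word "is" according to those probabilities.
candidates = list(prefix_probs["is"])
weights = [prefix_probs["is"][w] for w in candidates]
prefix = random.choices(candidates, weights=weights, k=1)[0]
print(prefix, "is")
```

With every count equal to 1, each of the four words that precede "is" in the small corpus is equally likely to be chosen as its prefix.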
