# Problem Statement

Predict the next possible word [n-gram approach]

Given a phrase, predict the next word according to the corpus, extending the
phrase incrementally, one word at a time.

Input: the play opens with

Output: ['hermia who']



# Approach

This is a hand-written, probabilistic approach to the problem, based on n-gram counts. More recent approaches tend to use RNNs/LSTMs to build language models, often with character-level modelling.

<img src="l1_1.jpg">
<img src="l1_2.jpg">
<img src="l1_3.jpg">
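For reference, the quantities used in the rest of the notebook can be written as relative frequencies. This is the standard maximum-likelihood n-gram estimate, given here as a sketch; the notation $C(\cdot)$ for corpus counts is introduced only for this note and does not appear elsewhere in the notebook.

```latex
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{\sum_{w} C(w_{i-1}, w)}
\qquad
P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}, w_{i-1}, w_i)}{\sum_{w} C(w_{i-2}, w_{i-1}, w)}
```

The first expression is the bigram case and the second is the trigram case, where the two preceding words form the context.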

# Libraries

In [1]:
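# `libraries` is a local helper module; the cells below rely on it exposing
# `re`, `defaultdict`, `nltk` and `pd` (pandas).
from libraries import *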

Using TensorFlow backend.


# Basic Functions

In [2]:
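def clean_text(x):
    """Lowercase `x`, turn non-alphanumeric runs into spaces and drop one-character tokens."""
    x = re.sub('[^A-Za-z0-9]+', ' ', x)
    x = x.lower()
    x = x.strip(' ')
    # Keep only tokens longer than one character (this also drops words such as "a").
    return ' '.join(w for w in x.split(' ') if len(w) > 1)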



def build_conditional_probabilities(corpus):
    """
    Estimate P(next_word | word) for every word in `corpus` from bigram counts.
    """
    # First pass: for each word, collect the words that follow it in the
    # string. Repeated next words are kept, so we use a list and not a set.
    tokenized_string = corpus.split()
    previous_word = ""
    dictionary = defaultdict(list)

    for current_word in tokenized_string:
        if previous_word != "":
            dictionary[previous_word].append(current_word)
        previous_word = current_word

    # Second pass: replace each list of followers with a dictionary that maps
    # every observed next word to its relative frequency.
    for key in dictionary.keys():
        next_words = dictionary[key]
        unique_words = set(next_words)  # removes duplicates
        nb_words = len(next_words)
        probabilities_given_key = {}
        for unique_word in unique_words:
            probabilities_given_key[unique_word] = float(next_words.count(unique_word)) / nb_words
        dictionary[key] = probabilities_given_key

    return dictionary
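The same bigram estimate can be written more compactly with `collections.Counter`. The following is a minimal sketch, not the notebook's implementation: the function name `bigram_probabilities` and the toy sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def bigram_probabilities(text):
    """Return {previous_word: {next_word: P(next_word | previous_word)}}."""
    tokens = text.split()
    followers = defaultdict(Counter)
    # Pair every token with its successor and count each pair.
    for previous_word, current_word in zip(tokens, tokens[1:]):
        followers[previous_word][current_word] += 1
    # Normalise each row of counts so that it sums to 1.
    return {
        word: {nxt: count / sum(counts.values()) for nxt, count in counts.items()}
        for word, counts in followers.items()
    }


# Toy text (hypothetical, not the notebook's corpus).
toy = "the queen hippolyta and the queen the duke"
probs = bigram_probabilities(toy)
print(probs["queen"])  # {'hippolyta': 0.5, 'the': 0.5}
```

Counting pairs in a single pass avoids the repeated `list.count` calls, which matters on larger corpora. The result is a plain `dict`, so looking up an unseen word raises `KeyError` instead of silently returning an empty entry.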

In [3]:
# Read the corpus with a context manager so the file handle is closed.
with open('data.txt', 'r') as f:
    corpus = f.read()
corpus = ' '.join(corpus.split('\n'))
corpus = clean_text(corpus)
corpus

'the play consists of four interconnecting plots connected by celebration of the wedding of duke theseus of athens and the amazon queen hippolyta which are set simultaneously in the woodland and in the realm of fairyland under the light of the moon the play opens with hermia who is in love with lysander resistant to her father egeus demand that she wed demetrius whom he has arranged for her to marry helena hermia best friend pines unrequitedly for demetrius who broke up with her to be with hermia enraged egeus invokes an ancient athenian law before duke theseus whereby daughter needs to marry suitor chosen by her father or else face death theseus offers her another choice lifelong chastity as nun worshipping the goddess artemis peter quince and his fellow players nick bottom francis flute robin starveling tom snout and snug plan to put on play for the wedding of the duke and the queen the most lamentable comedy and most cruel death of pyramus and thisbe quince reads the names of charac

In [4]:
corpus_words = corpus.split(' ')

# Demonstration On The Function "build_conditional_probabilities"

In [5]:
cp = build_conditional_probabilities(corpus)
sentence = 'play consists of four'

cp['four']

{'interconnecting': 1.0}

In [11]:
cp['queen']

{'hippolyta': 0.5, 'the': 0.5}
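A table of this shape can be turned into a one-word prediction by taking the follower with the highest probability. The sketch below assumes a nested mapping like the one shown above; `most_likely_next` is a hypothetical helper, and the `example` dictionary only copies the two entries printed in the cells above.

```python
def most_likely_next(probabilities, word):
    """Return the highest-probability follower of `word`, or None if unseen."""
    # .get avoids creating an empty entry when `probabilities` is a defaultdict.
    followers = probabilities.get(word)
    if not followers:
        return None
    # On ties, max returns the first key in the dictionary's iteration order.
    return max(followers, key=followers.get)


# Entries copied from the outputs above, for illustration only.
example = {
    "four": {"interconnecting": 1.0},
    "queen": {"hippolyta": 0.5, "the": 0.5},
}
print(most_likely_next(example, "four"))   # interconnecting
print(most_likely_next(example, "romeo"))  # None
```

For `"queen"` the two followers are tied at 0.5, so the choice depends on insertion order. A real predictor would need an explicit tie-breaking rule.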

In [12]:
cfreq_brown_2gram = nltk.ConditionalFreqDist(nltk.bigrams(corpus_words))

In [13]:
# ConditionalFreqDist builds the same bigram table, but stores raw counts instead of probabilities
cfreq_brown_2gram['four']

FreqDist({'interconnecting': 1})

In [14]:
# Build a trigram ConditionalFreqDist: the condition is the pair (w0, w1), the sample is w2
b_trigrams = nltk.trigrams(corpus_words)
condition_pairs = (((w0, w1), w2) for w0, w1, w2 in b_trigrams)
cfd_brown = nltk.ConditionalFreqDist(condition_pairs)
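To make the structure of the trigram table concrete, here is a standard-library sketch that builds an equivalent mapping from a two-word context to counts of the third word. It does not use NLTK. `trigram_counts` is a hypothetical name, and the word list is a short fragment of the corpus shown earlier.

```python
from collections import Counter, defaultdict


def trigram_counts(words):
    """Map each (w0, w1) context to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    # zip over three staggered views of the list yields consecutive triples.
    for w0, w1, w2 in zip(words, words[1:], words[2:]):
        table[(w0, w1)][w2] += 1
    return table


# A short fragment of the corpus, used only for illustration.
words = "the play opens with hermia who is in love with lysander".split()
table = trigram_counts(words)
print(table[("opens", "with")])  # Counter({'hermia': 1})
```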

# Prediction Of Next Words

In [15]:
def predict_next_words(sentence, num_words_to_predict, type_=None):
    """Extend `sentence` one word at a time using the trigram table `cfd_brown`."""
    sentence = clean_text(sentence)  # clean_text already lowercases
    columns = ['sentence', 'predicted_word', 'sentence_plus_predicted_word']
    rows = []
    for _ in range(num_words_to_predict):
        # The last two words of the sentence form the trigram context.
        word_prev = tuple(sentence.split(' ')[-2:])
        # Stop if the context never occurs in the corpus.
        if word_prev not in cfd_brown:
            break
        # Takes the first continuation seen in the corpus, which is not
        # necessarily the most frequent one.
        pred_word = list(cfd_brown[word_prev].keys())[0]
        extended = sentence + ' ' + pred_word
        rows.append(dict(zip(columns, [sentence, pred_word, extended])))
        sentence = extended
    # DataFrame.append was removed in pandas 2.0, so build the frame from the rows.
    df = pd.DataFrame(rows, columns=columns)
    if type_ == 'list':
        return df['sentence_plus_predicted_word'].tolist()
    return df

In [16]:
predict_next_words(sentence='The play opens with',num_words_to_predict=5,type_='list')

['the play opens with hermia',
 'the play opens with hermia who',
 'the play opens with hermia who is',
 'the play opens with hermia who is in',
 'the play opens with hermia who is in love']
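The incremental loop can also be sketched without pandas or NLTK. The version below is an illustrative variant, not the notebook's function: it picks the most frequent continuation for each context and stops when the context was never seen. The names `build_trigram_table` and `extend_sentence` are hypothetical, and the sample text is a short fragment of the corpus.

```python
from collections import Counter, defaultdict


def build_trigram_table(words):
    """Map each (w0, w1) context to a Counter of following words."""
    table = defaultdict(Counter)
    for w0, w1, w2 in zip(words, words[1:], words[2:]):
        table[(w0, w1)][w2] += 1
    return table


def extend_sentence(table, sentence, num_words):
    """Greedily append up to `num_words` words, returning every intermediate sentence."""
    words = sentence.lower().split()
    steps = []
    for _ in range(num_words):
        context = tuple(words[-2:])
        # Membership test does not insert into the defaultdict.
        if context not in table:
            break  # unseen context: nothing to predict
        next_word = table[context].most_common(1)[0][0]
        words.append(next_word)
        steps.append(" ".join(words))
    return steps


# A short fragment of the corpus, used only for illustration.
sample = "the play opens with hermia who is in love with lysander".split()
steps = extend_sentence(build_trigram_table(sample), "The play opens with", 5)
print(steps[-1])  # the play opens with hermia who is in love
```

Because prediction is greedy, the output for a given context is deterministic. Asking for more words than the table can supply returns a shorter list instead of raising an error.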