# Word Prediction Using a Markov Model

This notebook uses a second-order Markov model for next-word prediction: each word is predicted from the two words that precede it. The model is trained on the text file set in `training_data_file` (a file of PubMed abstracts, going by its name) and is then used to extend two-word prompts into full sentences.
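
The second-order assumption and the way the training code estimates probabilities can be written compactly. The notation is introduced here for illustration and does not appear in the notebook: $w_t$ is the word at position $t$, and $C(\cdot)$ counts how often a word sequence occurs in the training text, with `END` counted as a possible last element.

$$
P(w_t \mid w_1, \dots, w_{t-1}) \approx P(w_t \mid w_{t-2}, w_{t-1}),
\qquad
\hat{P}(w_t \mid w_{t-2}, w_{t-1}) = \frac{C(w_{t-2}, w_{t-1}, w_t)}{\sum_{w'} C(w_{t-2}, w_{t-1}, w')}
$$

The estimate on the right is a plain relative frequency with no smoothing, so a word pair that never occurs in the training text has no distribution at all.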

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Preamble
import string
import numpy as np

In [None]:
# Path of the text file containing the training data
training_data_file = '/content/drive/MyDrive/Colab_Notebooks/Merck_NLP_Project/word-prediction/pubmed_abstract.txt'

## Training

### Helper functions

In [None]:
def remove_punctuation(sentence):
    return sentence.translate(str.maketrans('','', string.punctuation))
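
A quick check of what this helper does to numeric text may be useful, since `string.punctuation` includes `.`, `-` and `/`. The sketch below repeats the definition so it runs on its own; the input strings are made-up examples, not lines from the training file.

```python
import string

def remove_punctuation(sentence):
    # Same definition as in the cell above, repeated so this snippet is self-contained
    return sentence.translate(str.maketrans('', '', string.punctuation))

# Decimal points, hyphens and slashes are all deleted, not replaced by spaces
print(remove_punctuation('flow rate of 0.4 ml/min.'))  # flow rate of 04 mlmin
print(remove_punctuation('range of 0.2-100 ng/ml'))    # range of 02100 ngml
```

This is presumably why tokens such as `04 mlmin` and `02100 ngml` show up in the generated sentences at the end of the notebook.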

In [None]:
def add2dict(dictionary, key, value):
    if key not in dictionary:
        dictionary[key] = []
    dictionary[key].append(value)
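
The pattern in `add2dict` (create an empty list on first use, then append) is what `collections.defaultdict(list)` provides out of the box. The sketch below shows that alternative on made-up word pairs; the variable name `successors` is hypothetical and not used elsewhere in the notebook.

```python
from collections import defaultdict

# defaultdict(list) creates the empty list on first access,
# which is what add2dict does by hand
successors = defaultdict(list)
successors[('flow', 'rate')].append('of')
successors[('flow', 'rate')].append('was')
successors[('lower', 'limit')].append('of')

print(dict(successors))
# {('flow', 'rate'): ['of', 'was'], ('lower', 'limit'): ['of']}
```

Either form gives the same mapping from a key to the list of values seen after it; the explicit helper keeps the notebook's dictionaries as plain `dict` objects.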

In [None]:
def list2probabilitydict(given_list):
    probability_dict = {}
    given_list_length = len(given_list)
    for item in given_list:
        probability_dict[item] = probability_dict.get(item, 0) + 1
    for key, value in probability_dict.items():
        probability_dict[key] = value / given_list_length
    return probability_dict
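
As a small worked example of the normalization, a list of observed successors can be turned into relative frequencies. The sketch below uses `collections.Counter` for the counting step; `list_to_probabilities` and the sample list are hypothetical names and data, chosen only to illustrate the same computation as `list2probabilitydict`.

```python
from collections import Counter

def list_to_probabilities(items):
    # Count each distinct item, then divide by the total number of items
    counts = Counter(items)
    total = len(items)
    return {item: count / total for item, count in counts.items()}

followers = ['of', 'was', 'of', 'of']
probabilities = list_to_probabilities(followers)
print(probabilities)  # {'of': 0.75, 'was': 0.25}
```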

In [None]:
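# Filled with raw counts/lists during training, then normalized to probabilities:
# initial_word: first word of a line -> probability
# second_word:  first word -> {second word: probability}
# transitions:  (word, next word) -> {following word or 'END': probability}
initial_word = {}
second_word = {}
transitions = {}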

### Training function

In [None]:
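# Trains a second-order Markov model on the lines of training_data_file.
# The dictionaries are normalized in place, so re-run the cell that resets
# initial_word, second_word and transitions before training again.
def train_markov_model():
    # Read the file inside a with block so the handle is closed afterwards
    with open(training_data_file) as training_file:
        lines = training_file.readlines()
    for line in lines: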
        tokens = remove_punctuation(line.rstrip().lower()).split()
        tokens_length = len(tokens)
        for i in range(tokens_length):
            token = tokens[i]
            if i == 0:
                initial_word[token] = initial_word.get(token, 0) + 1
            else:
                prev_token = tokens[i - 1]
                if i == tokens_length - 1:
                    add2dict(transitions, (prev_token, token), 'END')
                if i == 1:
                    add2dict(second_word, prev_token, token)
                else:
                    prev_prev_token = tokens[i - 2]
                    add2dict(transitions, (prev_prev_token, prev_token), token)
    
    # Normalize the distributions
    initial_word_total = sum(initial_word.values())
    for key, value in initial_word.items():
        initial_word[key] = value / initial_word_total
        
    for prev_word, next_word_list in second_word.items():
        second_word[prev_word] = list2probabilitydict(next_word_list)
        
    for word_pair, next_word_list in transitions.items():
        transitions[word_pair] = list2probabilitydict(next_word_list)
    
    print('Training successful.')
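
To see what the three dictionaries hold after training, the same counting logic can be run on a tiny in-memory corpus. The sketch below is a hypothetical counterpart of `train_markov_model()`, not the notebook's own code: `train_on_lines`, `normalize` and `toy_corpus` are names made up here, and punctuation removal is skipped because the toy lines contain none.

```python
from collections import Counter, defaultdict

def normalize(counter):
    """Turn raw counts into probabilities that sum to 1."""
    total = sum(counter.values())
    return {word: count / total for word, count in counter.items()}

def train_on_lines(lines):
    """Hypothetical in-memory variant of the training loop."""
    initial_counts = Counter()
    second_counts = defaultdict(Counter)
    transition_counts = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        if not tokens:
            continue
        initial_counts[tokens[0]] += 1
        if len(tokens) < 2:
            continue
        second_counts[tokens[0]][tokens[1]] += 1
        # Every consecutive triple (a, b, c) records c as a successor of the pair (a, b)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            transition_counts[(a, b)][c] += 1
        # The last pair of each line is followed by the END marker
        transition_counts[(tokens[-2], tokens[-1])]['END'] += 1
    return (
        normalize(initial_counts),
        {word: normalize(c) for word, c in second_counts.items()},
        {pair: normalize(c) for pair, c in transition_counts.items()},
    )

toy_corpus = ['flow rate of 04 mlmin', 'flow rate was constant']
toy_initial, toy_second, toy_transitions = train_on_lines(toy_corpus)

print(toy_initial)                         # {'flow': 1.0}
print(toy_second)                          # {'flow': {'rate': 1.0}}
print(toy_transitions[('flow', 'rate')])   # {'of': 0.5, 'was': 0.5}
print(toy_transitions[('04', 'mlmin')])    # {'END': 1.0}
```

The pair `('flow', 'rate')` is followed by two different words in the toy corpus, so each gets probability 0.5, while the final pair of a line maps only to `END`.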

In [None]:
train_markov_model()

Training successful.


## Testing

### Test functions

In [None]:
test_word = ['quantitative analysis', 'mass transitions', 'concentration range', 'flow rate', 'accuracy was', 'lower limit']
# Split each two-word prompt into its first and second word
word0 = []
word1 = []
number_of_sentences = len(test_word)
for phrase in test_word:
    words = phrase.split()
    word0.append(words[0])
    word1.append(words[1])
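
The `generate()` function in the next cell calls `sample_word`, which is not defined in any cell shown here. A minimal sketch is given below, assuming it should draw one key from a dictionary that maps words to probabilities, as produced by `list2probabilitydict`. The body is an assumption, not the original definition, and it uses the standard library's `random.choices` for the weighted draw.

```python
import random

def sample_word(probability_dict):
    """Draw one key at random, weighted by its probability (assumed behavior)."""
    words = list(probability_dict.keys())
    weights = list(probability_dict.values())
    # random.choices returns a list of k samples; take the single element
    return random.choices(words, weights=weights, k=1)[0]

# With a single candidate the draw is deterministic
print(sample_word({'END': 1.0}))  # END
```

Because the draw is random, repeated calls to `generate()` can produce different sentences for the same prompt.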

In [None]:
# Generate one sentence per two-word prompt in test_word
def generate():
    for i in range(number_of_sentences):
        # The prompt supplies the first two words
        word_0 = word0[i]
        word_1 = word1[i]
        sentence = [word_0, word_1]
        # Sample subsequent words until END
        while True:
            next_word_probabilities = transitions.get((word_0, word_1))
            if next_word_probabilities is None:
                # The pair never occurred in the training data, so stop here
                break
            word_2 = sample_word(next_word_probabilities)
            if word_2 == 'END':
                break
            sentence.append(word_2)
            word_0 = word_1
            word_1 = word_2
        print(' '.join(sentence))
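
Sampling produces a whole sentence, but for next-word prediction it can also be useful to list the most probable candidates for a given pair. The sketch below is a hypothetical helper, not part of the notebook: `most_likely_next_words` and the probabilities in `toy_transitions` are made up for illustration, and the dictionary only mimics the shape of `transitions` after training.

```python
def most_likely_next_words(transition_probs, word_a, word_b, k=3):
    """Return up to k (word, probability) pairs, most probable first."""
    # An unseen pair has no distribution, so fall back to an empty dict
    candidates = transition_probs.get((word_a, word_b), {})
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    # END marks the end of a sentence and is not a word to suggest
    return [(word, p) for word, p in ranked if word != 'END'][:k]

# Made-up probabilities in the same shape as the trained transitions dictionary
toy_transitions = {('flow', 'rate'): {'of': 0.6, 'was': 0.3, 'END': 0.1}}

print(most_likely_next_words(toy_transitions, 'flow', 'rate', k=2))
# [('of', 0.6), ('was', 0.3)]
print(most_likely_next_words(toy_transitions, 'unseen', 'pair'))
# []
```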

### Testing

In [None]:
generate()

quantitative analysis of plasma samples obtained following the oral administration of vardenafil in human plasma
mass transitions were mz 4893 3122 for vardenafil and the accuracy was within 127 in terms of relative error
concentration range of 02100 ngml with correlation coefficients or 0995
flow rate of 04 mlmin
accuracy was within 127 in terms of relative error
lower limit of quantitation was set at 02 ngml
