# Assignment 1 on Natural Language Processing

## Date : 26th July, 2019

### Instructor : Prof. Sudeshna Sarkar

### Teaching Assistants : Ishani Mondal, Debanjana Kar, Sukannya Purkayastha

The aim of this assignment is to familiarize you with programming in Python and with the language modelling task of natural language processing, using the Python library NLTK. Installation details are given below.



## Installation of NLTK and Anaconda:

To keep everyone on the same setup, the coding environment is `python3`. We suggest installing 
Anaconda3 and creating a separate environment for this assignment. <br> 


The Anaconda3 installer for Windows and Linux is available at https://docs.anaconda.com/anaconda/install/. <br>
To install NLTK, run the following: <br>

`sudo pip3 install nltk` <br>
`python3` <br>
`import nltk` <br>
`nltk.download()` <br>

To install gensim, use the following command: <br>
`conda install -c conda-forge gensim` <br>

<br>

Note: For your convenience, we also provide a hands-on demo IPython notebook that explains the basics of language modelling with NLTK.

## Assignment Tasks

Use the corpus given. Ignore the .concept files and use the .txt files for each disease abstract.

### Task A: In this sub-task, you are expected to carry out the following tasks:

**Tokenize** the corpus into sentences and words (for each of the pos and neg class). **Print the number of sentences and words.** <br>
**Perform case-folding** on the corpus. <br>
**Remove the stopwords** from the corpus and print the number of remaining (non-stop) words.<br>

In [1]:
#Write the code for Task A
import nltk
import re
import os
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))


# A relative directory joined with os.path.join works on both Windows and Linux
directory = 'NCBI_Data'
words = []
sent_tokens = []
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        # The context manager closes each file after reading
        with open(os.path.join(directory, filename)) as f:
            text = f.read()
        for sentence in nltk.sent_tokenize(text):
            tokens = []
            for token in nltk.word_tokenize(sentence):
                # Keep tokens that start with a letter and have at least two characters
                if re.match(r"[a-zA-Z].",token):
                    # Trim trailing non-alphanumeric characters, then case-fold
                    list1 = re.findall(r".*[a-zA-Z0-9]",token)
                    if list1:
                        tokens.append(list1[0].lower())
            words.extend(tokens)
            if tokens:
                sent_tokens.append(tokens)

print("No of sentences -"+str(len(sent_tokens)))
print("No of words -"+str(len(words)))
for word in words:
    if word in stop_words:
        words.remove(word)
print("No of words after removing stopwords -"+str(len(words)))
vocab_len = len(words)

No of sentences -195
No of words -3723
No of words after removing stopwords -2637
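
A caveat on the last count: calling `list.remove` inside a `for` loop over the same list makes the iterator skip the element that follows each removal, so a stop word that directly follows another stop word can survive. The sketch below contrasts that pattern with a filter that builds a new list. The stop-word set and token list are hypothetical toy data, not the NCBI corpus.

```python
# Hypothetical toy data, for illustration only.
toy_stop_words = {"of", "the", "in"}
toy_tokens = ["deficiency", "of", "the", "enzyme", "in", "the", "liver"]

# In-place removal while iterating: the element after each removed item is skipped.
in_place = list(toy_tokens)
for word in in_place:
    if word in toy_stop_words:
        in_place.remove(word)

# Building a new list visits every token exactly once.
filtered = [word for word in toy_tokens if word not in toy_stop_words]

print(in_place)   # still contains "the"
print(filtered)   # contains no stop words
```

With the list comprehension the reported count covers only non-stop words; with the in-place loop it can be an overestimate.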


### Task B: In this sub-task, you are expected to carry out the following tasks:

1. **Create the following language models** on the training corpus: <br>
    i.   Unigram <br>
    ii.  Bigram <br>
    iii. Trigram <br>
    iv.  Fourgram <br>

2. **List the top 5 bigrams, trigrams, four-grams (with and without Add-1 smoothing).**
(Note: Please remove those which contain only articles, prepositions, determiners. For Example: “of the”, “in a”, etc).
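
The two steps above (building n-grams, then discarding those made up only of function words) can be sketched without any library as follows. This is a minimal illustration on hypothetical toy sentences; `make_ngrams` and `has_content_word` are names chosen for this example, and `make_ngrams` is assumed to mirror what `nltk.util.ngrams` returns for a token list.

```python
from collections import Counter

def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return list(zip(*(tokens[i:] for i in range(n))))

def has_content_word(gram, stop_words):
    """True if at least one token of the n-gram is not a stop word."""
    return any(token not in stop_words for token in gram)

# Hypothetical toy sentences and stop-word set, for illustration only.
toy_stop_words = {"of", "the", "in", "a"}
toy_sentences = [
    ["mutations", "in", "the", "gene"],
    ["deficiency", "of", "the", "enzyme"],
    ["mutations", "in", "a", "gene"],
]

# Count bigrams per sentence, skipping those that consist only of stop words.
bigram_counts = Counter(
    gram
    for sentence in toy_sentences
    for gram in make_ngrams(sentence, 2)
    if has_content_word(gram, toy_stop_words)
)
top_bigrams = bigram_counts.most_common(2)
print(top_bigrams)
```

Filtering while counting avoids modifying a list during iteration, and the same two helpers work unchanged for trigrams and fourgrams by passing `n=3` or `n=4`.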

In [2]:
#Write the code for Task B

from nltk.util import ngrams
unigrams=[]
bigrams=[]
trigrams=[]
fourgrams=[]
# Collect n-grams sentence by sentence so that none span a sentence boundary
for sentence in sent_tokens:
    unigrams.extend(ngrams(sentence,1))
    bigrams.extend(ngrams(sentence,2))
    trigrams.extend(ngrams(sentence,3))
    fourgrams.extend(ngrams(sentence,4))

for bigram in bigrams:
    flag = True
    for token in bigram:
        if token not in stop_words:
            flag = False
    if flag:
        bigrams.remove(bigram)

for trigram in trigrams:
    flag = True
    for token in trigram:
        if token not in stop_words:
            flag = False
    if flag:
        trigrams.remove(trigram)
for fourgram in fourgrams:
    flag = True
    for token in fourgram:
        if token not in stop_words:
            flag = False
    if flag:
        fourgrams.remove(fourgram)
gramlist = [bigrams,trigrams,fourgrams]
def get_prob(sentence,model=2):
    """Unsmoothed probability of a sentence under the bigram (2), trigram (3) or fourgram (4) model."""
    prob = 1.0
    # Clean the input the same way as the corpus: keep tokens that start with a
    # letter and have at least two characters, trim trailing punctuation, lowercase
    tokens = []
    for token in nltk.word_tokenize(sentence):
        if re.match(r"[a-zA-Z].",token):
            list1 = re.findall(r".*[a-zA-Z0-9]",token)
            if list1:
                tokens.append(list1[0].lower())
    for gram in ngrams(tokens,model):
        # Reset the counts for every n-gram so they do not accumulate across the sentence
        count1 = count2 = 0
        for ngram in gramlist[model-2]:
            if gram == ngram:
                count1 = count1 + 1
            if gram[:-1] == ngram[:-1]:
                count2 = count2 + 1
        if count2 == 0:
            # Unseen context: return zero instead of dividing by zero
            return 0.0
        prob = prob * count1 / count2
    return prob

In [3]:
#stop words were already loaded through nltk in Task A

#print the top 5 bigrams, trigrams and fourgrams after removing stop-word-only n-grams,
#then the unsmoothed probability of a seen and an unseen bigram
fdist = nltk.FreqDist(bigrams)
print(fdist.most_common(5))
fdist = nltk.FreqDist(trigrams)
print(fdist.most_common(5))
fdist = nltk.FreqDist(fourgrams)
print(fdist.most_common(5))
print(get_prob("mutations in",2))
print(get_prob("at in",2))

[(('mutations', 'in'), 15), (('patients', 'with'), 9), (('detected', 'in'), 8), (('mutation', 'was'), 8), (('deficiency', 'of'), 8)]
[(('mutations', 'in', 'the'), 6), (('deficiency', 'of', 'the'), 6), (('germline', 'mutations', 'in'), 6), (('the', 'rb1', 'gene'), 4), (('component', 'of', 'complement'), 4)]
[(('paternal', 'transmission', 'of', 'congenital_dm'), 3), (('tumour', 'dna', 'from', 'patients'), 2), (('dna', 'from', 'patients', 'with'), 2), (('mutations', 'were', 'detected', 'in'), 2), (('constitutional', 'rb1-gene', 'mutations', 'in'), 2)]
0.5172413793103449
0.0
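
The unsmoothed estimate behind the two probabilities above is count(n-gram) / count(context), where the context is the n-gram without its last word. A minimal sketch with `collections.Counter`, so that each lookup is a dictionary access rather than a scan of the full n-gram list. The bigram list is hypothetical toy data, and returning `0.0` for an unseen context is an assumption made for this example.

```python
from collections import Counter

def mle_probability(gram, gram_counts, context_counts):
    """count(gram) / count(context), or 0.0 when the context was never seen."""
    context_total = context_counts[gram[:-1]]
    if context_total == 0:
        return 0.0
    return gram_counts[gram] / context_total

# Hypothetical toy bigram list, for illustration only.
toy_bigrams = [
    ("mutations", "in"), ("mutations", "in"), ("mutations", "were"),
    ("detected", "in"),
]
gram_counts = Counter(toy_bigrams)
context_counts = Counter(gram[:-1] for gram in toy_bigrams)

p_seen = mle_probability(("mutations", "in"), gram_counts, context_counts)
p_unseen = mle_probability(("at", "in"), gram_counts, context_counts)
print(p_seen)    # 2 of the 3 bigrams starting with "mutations"
print(p_unseen)  # the context "at" never occurs
```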


### With Smoothing

In [4]:
#You are to perform Add-1 smoothing here:
#Probability(unigram) = (count(unigram) + 1) / (Total number of unigrams + Number of unique unigrams)

#write similar code for bigram, trigram and fourgrams
def get_prob_after_smoothing(sentence,model=2):
    """Smoothed probability of a sentence under the bigram (2), trigram (3) or fourgram (4) model."""
    prob = 1.0
    # Clean the input the same way as the corpus
    tokens = []
    for token in nltk.word_tokenize(sentence):
        if re.match(r"[a-zA-Z].",token):
            list1 = re.findall(r".*[a-zA-Z0-9]",token)
            if list1:
                tokens.append(list1[0].lower())
    for gram in ngrams(tokens,model):
        # Reset the counts for every n-gram so they do not accumulate across the sentence
        count1 = count2 = 0
        for ngram in gramlist[model-2]:
            if gram == ngram:
                count1 = count1 + 1
            if gram[:-1] == ngram[:-1]:
                count2 = count2 + 1
        # vocab_len is the number of tokens left after stop-word removal in Task A
        prob = prob * ((count1)+1)/((count2)+(vocab_len**(model-1)))
    return prob
print(get_prob_after_smoothing("mutations in",2))
#non zero probability for unknown combinations
print(get_prob_after_smoothing("at in",2))
#Print top 10 unigram, bigram, trigram, fourgram after smoothing
#Add-1 smoothing adds the same constant to every count, so the ranking by count is unchanged:
#the most frequent bigrams, trigrams and fourgrams remain the same

0.006001500375093774
0.00037764350453172205
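
For comparison, the textbook form of Add-1 (Laplace) smoothing for an n-gram is P(w | context) = (count(context, w) + 1) / (count(context) + V), where V is the number of distinct word types. The cell above instead adds `vocab_len`, which is a token count, so its values differ from this form. A minimal sketch on hypothetical toy bigrams; `add_one_probability` is a name chosen for this example.

```python
from collections import Counter

def add_one_probability(gram, gram_counts, context_counts, vocab_size):
    """Laplace estimate: (count(gram) + 1) / (count(context) + V)."""
    return (gram_counts[gram] + 1) / (context_counts[gram[:-1]] + vocab_size)

# Hypothetical toy bigram list, for illustration only.
toy_bigrams = [
    ("mutations", "in"), ("mutations", "in"), ("mutations", "were"),
    ("detected", "in"),
]
gram_counts = Counter(toy_bigrams)
context_counts = Counter(gram[:-1] for gram in toy_bigrams)
vocabulary = {token for gram in toy_bigrams for token in gram}
vocab_size = len(vocabulary)  # V: number of distinct word types

seen = add_one_probability(("mutations", "in"), gram_counts, context_counts, vocab_size)
unseen = add_one_probability(("at", "in"), gram_counts, context_counts, vocab_size)

# With V in the denominator, the estimates for one context sum to 1 over the vocabulary.
total = sum(
    add_one_probability(("mutations", word), gram_counts, context_counts, vocab_size)
    for word in vocabulary
)
print(seen, unseen, total)
```

Using the number of word types for V is what makes the smoothed estimates a proper distribution for each context, which the final sum checks.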


### Predict the next word using statistical language modelling

Using the above bigram, trigram, and fourgram models that you just experimented with, **predict the next word given the previous n(=2, 3, 4)-grams** for the sentences below.

In [12]:
str1 = 'A new tumor suppressor gene, PTEN/MMAC1, was isolated recently'
str2 = 'The average age of disease'

def clean_tokens(text):
    """Tokenize text and clean it the same way as the corpus."""
    tokens = []
    for token in nltk.word_tokenize(text):
        if re.match(r"[a-zA-Z].",token):
            list1 = re.findall(r".*[a-zA-Z0-9]",token)
            if list1:
                tokens.append(list1[0].lower())
    return tokens

# n-grams sorted from most to least frequent
bigrams_freq = nltk.FreqDist(bigrams).most_common()
trigrams_freq = nltk.FreqDist(trigrams).most_common()
fourgrams_freq = nltk.FreqDist(fourgrams).most_common()
models = [("bigram",bigrams_freq,2),("trigram",trigrams_freq,3),("fourgram",fourgrams_freq,4)]

def print_predictions(text):
    """Print the most frequent continuation of text under each n-gram model."""
    tokens = clean_tokens(text)
    for name, grams_freq, n in models:
        # The context is the last n-1 words of the input
        context = tuple(tokens[-(n-1):])
        for gram, count in grams_freq:
            if gram[:-1] == context:
                print(name+" model prediction: "+text+" "+gram[-1])
                break

print_predictions(str1)
print_predictions(str2)

bigram model prediction: A new tumor suppressor gene, PTEN/MMAC1, was isolated recently at
trigram model prediction: A new tumor suppressor gene, PTEN/MMAC1, was isolated recently at
fourgram model prediction: A new tumor suppressor gene, PTEN/MMAC1, was isolated recently at
bigram model prediction: The average age of disease severity
trigram model prediction: The average age of disease concordance
fourgram model prediction: The average age of disease onset
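
When more than one candidate per context is needed, it can help to index the n-grams by context once and read the top-k continuations from that index. A minimal sketch under that assumption, on hypothetical toy trigrams; `build_context_index` and `predict_next` are names chosen for this example.

```python
from collections import Counter, defaultdict

def build_context_index(grams):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    index = defaultdict(Counter)
    for gram in grams:
        index[gram[:-1]][gram[-1]] += 1
    return index

def predict_next(tokens, index, n, k=2):
    """Return up to k most frequent continuations of the last n-1 tokens (n >= 2)."""
    context = tuple(tokens[-(n - 1):])
    if context not in index:
        return []
    return [word for word, _count in index[context].most_common(k)]

# Hypothetical toy trigram list, for illustration only.
toy_trigrams = [
    ("age", "of", "disease"), ("age", "of", "onset"), ("age", "of", "onset"),
    ("of", "disease", "onset"),
]
index = build_context_index(toy_trigrams)

known = predict_next(["the", "average", "age", "of"], index, n=3)
unknown = predict_next(["unseen", "context"], index, n=3)
print(known)
print(unknown)
```

An empty list for an unseen context makes it easy to fall back to a lower-order model, which is one common way to handle contexts the higher-order model has never observed.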


In [13]:
'''
For str1, you are to predict the next  2 possible word sequences using your trained smoothed models. The answers can be as below:()
    1) 'A new tumor suppressor gene, PTEN/MMAC1, was isolated recently' *genuinely*
    2)  'A new tumor suppressor gene, PTEN/MMAC1, was isolated recently' *yesterday*
For str2, you are to predict the next 2 possible word sequences using your trained smoothed models such as:
    (1) 'The average age of disease', *hinders*
    (2) 'The average age of disease', *past*
The above answers are not solutions but just examples to explain the task.
'''

### Task C: In this task, you are to perform the following tasks:

1. **Train word vectors** on the given corpus. To train the word vectors on your corpus, use the gensim module (https://radimrehurek.com/gensim/models/word2vec.html) with the pre-trained Google word embeddings (GoogleNews-vectors). For multi-word disease mentions, join the words with a ‘_’.  <br>

2. **Construct a t-SNE plot** of the trained word vectors of the disease mentions.

3. **Repeat steps 1 and 2** using the following hyper-parameter settings: <br>
Window size = 5, 10.<br>
Embedding dimension = 50, 100, 200.<br>
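
For step 1, multi-word disease mentions have to become single tokens before the vectors are trained. A minimal sketch of that joining step, assuming the mentions are available as a list of strings and the text is already case-folded; `join_mentions` and the toy mentions are hypothetical and only illustrate the idea.

```python
import re

def join_mentions(text, mentions):
    """Replace each multi-word mention in text with its underscore-joined form."""
    # Longest mentions first, so a shorter mention does not split a longer one.
    for mention in sorted(mentions, key=len, reverse=True):
        parts = mention.split()
        pattern = r"\b" + r"\s+".join(re.escape(part) for part in parts) + r"\b"
        joined = "_".join(parts)
        text = re.sub(pattern, lambda match: joined, text)
    return text

# Hypothetical mentions and sentence, for illustration only.
toy_mentions = ["myotonic dystrophy", "congenital myotonic dystrophy"]
toy_text = "paternal transmission of congenital myotonic dystrophy is rare."

joined_text = join_mentions(toy_text, toy_mentions)
print(joined_text)
```

After this step a whitespace or NLTK tokenizer keeps each mention as one token, so it receives a single vector.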


In [3]:
#Write the code for Task C
import gensim
import os

# Load Google's pre-trained Word2Vec vectors. The loader lives on KeyedVectors;
# the loaded vectors are read-only and cannot be trained further.
pretrained = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

# Train word vectors on the tokenized corpus from Task A. window is a
# constructor argument of Word2Vec, not of train().
model = gensim.models.Word2Vec(sent_tokens, window=5, min_count=1)

### Task D: Predict the next word using neural language modelling

Using LSTM Language modelling, you are expected to **train your own word vectors and predict the next word, given the context**.
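
A next-word model of this kind is typically trained on (context, next word) pairs cut from the encoded corpus with a sliding window, with the target word one-hot encoded for a softmax output. A minimal pure-Python sketch of that data preparation, using hypothetical word ids; `make_windows` and `one_hot` are names chosen for this example.

```python
def make_windows(encoded, context_size):
    """Split a list of word ids into (context, next_word) training pairs."""
    contexts, targets = [], []
    for i in range(context_size, len(encoded)):
        contexts.append(encoded[i - context_size:i])
        targets.append(encoded[i])
    return contexts, targets

def one_hot(index, num_classes):
    """Return a list of num_classes zeros with a one at position index."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector

# Hypothetical word ids, for illustration only.
toy_encoded = [4, 2, 7, 2, 1]
X, y = make_windows(toy_encoded, context_size=2)
print(X)  # each row holds two context ids
print(y)  # the id that follows each context
print(one_hot(2, 4))
```

With a context size of 2, a corpus of N word ids yields N - 2 training pairs.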

In [35]:
#code for Task D

from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding


## Prepare the corpus from the .txt files and store it in a string variable, data_str. It should contain the
## sentences separated by "\n".
import os, glob

data=[]
for file in glob.glob(os.path.join("NCBI_Data","*.txt")):
    # The context manager closes each file after reading
    with open(file) as f:
        content=f.read()
    # Keep the non-empty lines of each abstract
    for line in content.split("\n"):
        if line!="":
            data.append(line)

data_str="\n".join(data)

tokenizer = Tokenizer()
tokenizer.fit_on_texts([data_str])

# Write the code for encoding text to sequences here and store in encoded
encoded = tokenizer.texts_to_sequences([data_str])
encoded = encoded[0]

# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# encode 2 words -> 1 word
sequences = []
for i in range(2, len(encoded)):
	sequence = encoded[i-2:i+1]
	sequences.append(sequence)

print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d' % max_length)
# split into input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

# define model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(LSTM(50))
model.add(Dense(vocab_size, activation='softmax'))
model.summary()
# compile network
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit network
model.fit(X, y, epochs=500, verbose=2) 

Vocabulary Size: 1194
[[558, 2, 49], [2, 49, 12], [49, 12, 3], [12, 3, 1], [3, 1, 327], [1, 327, 328], [327, 328, 10], [328, 10, 3], ...

In [53]:
#generate the sequence 

def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    """Append n_words predicted words to seed_text, using the last max_length words as context."""
    in_text = seed_text
    for _ in range(n_words):
        # Use the last max_length whitespace-separated words as the context
        context = " ".join(in_text.split(" ")[-max_length:])
        sequences = tokenizer.texts_to_sequences([context])
        sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
        sequences = array(sequences)
        # Pick the most probable next word id and map it back to text
        y_prob = model.predict(sequences)
        y_class = y_prob.argmax(axis=-1)
        in_text = in_text + " " + tokenizer.sequences_to_texts([y_class])[0]
    return in_text

# evaluate model
print(generate_seq(model, tokenizer, max_length-1, 'A new tumor suppressor gene, PTEN/MMAC1, was isolated recently', 1))
print(generate_seq(model, tokenizer, max_length-1, 'The average age of disease', 1))


A new tumor suppressor gene, PTEN/MMAC1, was isolated recently at
The average age of disease concordance
