# CS 598 DLH Final Project Experiment 1
by Regan Brown (rnbrown3, group 51)

This is NOT intended as a bonus "Descriptive notebook". This contains all the source code for Experiment 1 of my final project.

The paper I attempted to reproduce is paper 151, "Disease Prediction and Early Intervention System Based on Symptom Similarity Analysis" by Peiying Zhang, Xingzhe Huang, and Maozhen Li. You may access the paper here: https://ieeexplore.ieee.org/document/8924757

There is no code repo for this paper that I could find.

To set up the environment for this code, please install the latest versions of all libraries imported in the code blocks. You will also need to install the necessary CoreNLP packages and start an instance of the Stanford CoreNLP server, following the instructions at https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK

You do not need to download the Microsoft Research Paraphrase (MSRP) corpus used for this experiment separately; the load_dataset function from the Hugging Face datasets library downloads the data for you.

## Preprocessing
The first steps to take are loading in the MSRP dataset, extracting the sentence pairs and scores/labels for each, and initializing the Stanford parser.

In [1]:
#be sure to start up Stanford Parser server following steps here: https://github.com/nltk/nltk/wiki/Stanford-CoreNLP-API-in-NLTK
from nltk.parse import CoreNLPParser
from nltk.tree import Tree, ParentedTree
from datasets import load_dataset
import time #for tracking time to run
import tracemalloc #for tracking memory usage
_START_RUNTIME = time.time()
tracemalloc.start()

#first, load in the MSRP training data
df = load_dataset('glue', 'mrpc', split='train')
labels = df['label']
sentence1 = df['sentence1']
sentence2 = df['sentence2']

#also load in the MSRP test data, we will preprocess this as well
df = load_dataset('glue', 'mrpc', split='test')
labels_test = df['label']
sentence1_test = df['sentence1']
sentence2_test = df['sentence2']

#initialize the Stanford parser
parser = CoreNLPParser(url='http://localhost:9000')

Found cached dataset glue (C:/Users/rbrow/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)
Found cached dataset glue (C:/Users/rbrow/.cache/huggingface/datasets/glue/mrpc/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


Next, I define the subject-predicate-object (SPO) algorithm, which parses each sentence with the Stanford parser and then picks out the words that carry most of the sentence's meaning (i.e. the sentence trunk). My implementation mostly follows the pseudocode in the original paper, but I found that their pseudocode for identifying the object did not return correct results. Instead, I based my object-finding logic on code from GitHub user Hassan Elmadany, found here: https://github.com/HassanElmadany/Extract-SVO/blob/master/Subject_Verb_Object_Extractor.py

In [2]:
#implementation of SPO algorithm as outlined in the paper's pseudocode (Algorithm 1)
#to help make sense of this code, please check the label definitions here: https://stackoverflow.com/questions/1833252/java-stanford-nlp-part-of-speech-labels
def spo(sentence):
    tree = parser.raw_parse(sentence)
    tree = next(tree) #need to pull the Tree out of the iter
    
    subject = ""
    predicate = ""
    obj = ""
    for t in tree[0]:
        if t.label() == 'NP': #identify subject
            for s in t.subtrees():
                for n in s.subtrees():
                    if n.label().startswith("NN"):
                        subject = n[0]
        if t.label() == 'VP': #identify predicate
            for p in t.subtrees():
                for m in p.subtrees():
                    if m.label().startswith("VB"):
                        predicate = m[0]
        if t.label() == 'VP': #identify object (code based on code found here: https://github.com/HassanElmadany/Extract-SVO/blob/master/Subject_Verb_Object_Extractor.py)
            for k in t.subtrees(lambda n: n.label() in ['NP', 'PP', 'ADJP']):
                if k.label() in ['NP', 'PP']:
                    for c in k.subtrees(lambda c: c.label().startswith('NN')):
                        obj = c[0]
                else:
                    for c in k.subtrees(lambda c: c.label().startswith('JJ')):
                        obj = c[0]
    return [subject, predicate, obj]
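
To make the traversal concrete without a running CoreNLP server, here is a minimal sketch that applies the same "last match wins" rule to a hand-written constituency tree. The nested-tuple tree format and the helper names (`subtrees`, `last_leaf`, `spo_from_tree`) are assumptions made for illustration; they are not part of NLTK or of the paper. The sketch handles only NP/PP objects and omits the ADJP branch.

```python
# Toy constituency tree: (label, child, child, ...); a leaf node is (POS, word).
# This format is a stand-in for nltk.tree.Tree, chosen so the sketch needs no server.
TREE = ("S",
        ("NP", ("DT", "The"), ("NN", "company")),
        ("VP", ("VBD", "sold"),
               ("NP", ("DT", "its"), ("NNS", "shares"))))


def subtrees(node):
    """Yield node and all of its descendants in pre-order."""
    yield node
    for child in node[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)


def last_leaf(node, prefix):
    """Return the word of the last leaf under node whose POS tag starts with prefix."""
    word = ""
    for sub in subtrees(node):
        is_leaf = len(sub) == 2 and isinstance(sub[1], str)
        if is_leaf and sub[0].startswith(prefix):
            word = sub[1]
    return word


def spo_from_tree(sentence_node):
    """Extract [subject, predicate, object] from the top-level NP and VP."""
    subject = predicate = obj = ""
    for child in sentence_node[1:]:
        if child[0] == "NP":
            subject = last_leaf(child, "NN")
        elif child[0] == "VP":
            predicate = last_leaf(child, "VB")
            # Object: last noun inside any NP or PP nested in the VP.
            for phrase in subtrees(child):
                if phrase[0] in ("NP", "PP"):
                    obj = last_leaf(phrase, "NN") or obj
    return [subject, predicate, obj]


print(spo_from_tree(TREE))  # ['company', 'sold', 'shares']
```

A sentence whose top level lacks an NP or VP comes back with an empty string in the corresponding slot, which is the case the cleaning step later in this notebook filters out.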

Now that the SPO algorithm is defined, I call it on all the sentences in the sentence pairs and store the results. This process takes up nearly all the runtime, so be prepared to wait around 15 minutes if you run this.

In [3]:
#parse first sentences in sentence pairs, for both train and test sets
sentence1_parsed = []
for s in sentence1:
    parsed = spo(s)
    sentence1_parsed.append(parsed)
sentence1_parsed_test = []
for s in sentence1_test:
    parsed = spo(s)
    sentence1_parsed_test.append(parsed)

In [4]:
#parse second sentences in sentence pairs, for both train and test sets
sentence2_parsed = []
for s in sentence2:
    parsed = spo(s)
    sentence2_parsed.append(parsed)
sentence2_parsed_test = []
for s in sentence2_test:
    parsed = spo(s)
    sentence2_parsed_test.append(parsed)

If we were to look at the results of the parsing we just did, we would see that some of the sentences did not have an identified subject, predicate, and/or object. This means our algorithm did not capture the meaning of that sentence, so comparisons against it can introduce inaccuracy into our model. Below, I clean the data by removing any sentence pair where at least one of the sentences has a missing subject, predicate, or object. I also print data frames containing the cleaned train and test data, respectively, so you can get a sense of what the data looks like.

In [5]:
#sentences where SPO could not parse out any of the subject, predicate, or object are meaningless to us
#Since accurate comparisons cannot be made, remove any pairs affected by this
import pandas as pd
df1 = pd.DataFrame(sentence1_parsed, columns = ["S1", "P1", "O1"])
df2 = pd.DataFrame(sentence2_parsed, columns = ["S2", "P2", "O2"])
df3 = pd.DataFrame(labels, columns = ["label"])
combined_df = df1.join(df2)
combined_df = combined_df.join(df3)
cleaned_df = combined_df[(combined_df.S1 != '') & (combined_df.P1 != '') & (combined_df.O1 != '')]
cleaned_df = cleaned_df[(cleaned_df.S2 != '') & (cleaned_df.P2 != '') & (cleaned_df.O2 != '')]
display(cleaned_df)
#now split these back out into separate lists; still need to process those via Word2Vec
df1 = cleaned_df.iloc[:,:3]
df2 = cleaned_df.iloc[:,3:6]
df3 = cleaned_df.iloc[:,6:]
sentence1_cleaned = df1.values.tolist()
sentence2_cleaned = df2.values.tolist()
labels = df3.values.tolist()

#now do the same thing for test data
df1 = pd.DataFrame(sentence1_parsed_test, columns = ["S1", "P1", "O1"])
df2 = pd.DataFrame(sentence2_parsed_test, columns = ["S2", "P2", "O2"])
df3 = pd.DataFrame(labels_test, columns = ["label"])
combined_df = df1.join(df2)
combined_df = combined_df.join(df3)
cleaned_df = combined_df[(combined_df.S1 != '') & (combined_df.P1 != '') & (combined_df.O1 != '')]
cleaned_df = cleaned_df[(cleaned_df.S2 != '') & (cleaned_df.P2 != '') & (cleaned_df.O2 != '')]
display(cleaned_df)
#now split these back out into separate lists; still need to process those via Word2Vec
df1 = cleaned_df.iloc[:,:3]
df2 = cleaned_df.iloc[:,3:6]
df3 = cleaned_df.iloc[:,6:]
sentence1_cleaned_test = df1.values.tolist()
sentence2_cleaned_test = df2.values.tolist()
labels_test = df3.values.tolist()

| | S1 | P1 | O1 | S2 | P2 | O2 | label |
|---|---|---|---|---|---|---|---|
| 0 | Amrozi | distorting | evidence | Amrozi | distorting | evidence | 1 |
| 1 | Yucaipa | selling | Safeway | Yucaipa | sold | Safeway | 0 |
| 3 | shares | set | high | shares | closing | high | 0 |
| 4 | stock | close | Exchange | shares | jumped | Friday | 1 |
| 5 | year | dropped | period | year | dropped | period | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3661 | Department | contain | infection | spokesperson | following | protocol | 1 |
| 3662 | rules | reach | percent | limit | reaching | percent | 1 |
| 3664 | Martin | serving | Barras | Martin | wounding | Fearon | 0 |
| 3666 | notification | reported | MSNBC | MSNBC.com | reported | Friday | 1 |


| | S1 | P1 | O1 | S2 | P2 | O2 | label |
|---|---|---|---|---|---|---|---|
| 1 | sales | expected | backlash | Co. | prompted | backlash | 1 |
| 3 | storm | hit | Monday | storm | hits | coast | 0 |
| 6 | Quaife | remained | operation | Quaife | was | unprecedented | 0 |
| 8 | aide | allied | Thursday | aide | allied | Thursday | 1 |
| 9 | SPX | was | percent | IXIC | was | percent | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1715 | Yankees | took | pick | Yankees | selected | pick | 1 |
| 1718 | Crews | dump | rain | Crews | use | travel | 0 |
| 1719 | directors | completed | Nvidia | acquisition | close | quarter | 1 |
| 1722 | Hamilton | remained | attack | morning | talked | attack | 0 |
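
The same filtering rule can also be expressed without data frames. The sketch below uses made-up triples and a hypothetical helper, `drop_incomplete_pairs`, and keeps a pair only when all six fields are non-empty. Unlike the pandas version above, it returns the labels as a flat list of integers.

```python
def drop_incomplete_pairs(first, second, labels):
    """Keep only pairs whose two [S, P, O] triples contain no empty string."""
    kept = [
        (a, b, y)
        for a, b, y in zip(first, second, labels)
        if all(a) and all(b)  # an empty string is falsy, so all() rejects it
    ]
    if not kept:
        return [], [], []
    a_kept, b_kept, y_kept = zip(*kept)
    return list(a_kept), list(b_kept), list(y_kept)


# Made-up example data, not taken from MSRP.
s1 = [["cat", "chased", "mouse"], ["", "ran", "home"]]
s2 = [["cat", "hunted", "mouse"], ["dog", "ran", "home"]]
y = [1, 0]
print(drop_incomplete_pairs(s1, s2, y))
# ([['cat', 'chased', 'mouse']], [['cat', 'hunted', 'mouse']], [1])
```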


We are almost done with preprocessing! Next, we need to use Word2Vec to convert the words in the pre-processed sentences into numerical vectors. Since our dataset isn't very large (in the kilobytes), we end up with small Word2Vec models. We also make sure to transpose the vectors, as that is what the paper specifies.

In [6]:
#Word2Vec conversion. Dimensions and procedure match what's in the paper
#Point of ambiguity: we only ever have a single word for a subject, predicate, or object; but paper seems to suggest sometimes
#that there can be multi-word subjects/predicates/objects
import os
import numpy as np
RANDOM_SEED = 23432098
os.environ['PYTHONHASHSEED'] = str(RANDOM_SEED)

import gensim
from gensim.models import Word2Vec
#build a Word2Vec model for train data and one for test data
train_sentences = sentence1_cleaned + sentence2_cleaned
test_sentences = sentence1_cleaned_test + sentence2_cleaned_test
w2v1 = Word2Vec(train_sentences, vector_size=50, workers=1, min_count=1)
w2v1_test = Word2Vec(test_sentences, vector_size=50, workers=1, min_count=1)

#then to get the sentences_final, pull out the .wv for each word in the sentence and transpose it
sentence1_final = []
sentence2_final = []
sentence1_final_test = []
sentence2_final_test = []
for s in sentence1_cleaned:
    words = []
    for w in s:
        mat = w2v1.wv[w]
        words.append(mat.transpose())
    sentence1_final.append(words)
for s in sentence2_cleaned:
    words = []
    for w in s:
        mat = w2v1.wv[w]
        words.append(mat.transpose())
    sentence2_final.append(words)
for s in sentence1_cleaned_test:
    words = []
    for w in s:
        mat = w2v1_test.wv[w]
        words.append(mat.transpose())
    sentence1_final_test.append(words)
for s in sentence2_cleaned_test:
    words = []
    for w in s:
        mat = w2v1_test.wv[w]
        words.append(mat.transpose())
    sentence2_final_test.append(words)
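
One detail worth checking here: looking up a single word in `wv` returns a one-dimensional NumPy array, and transposing a 1-D array leaves it unchanged, so the `transpose()` calls above do not alter the vectors. If the paper intends a column vector (an assumption, since the paper is not explicit on shapes), an explicit reshape would be needed. A small sketch with a stand-in vector:

```python
import numpy as np

vec = np.arange(50, dtype=np.float32)  # stand-in for a 50-dimensional word vector

# A 1-D array has no second axis to swap, so transpose() is a no-op.
print(vec.transpose().shape)                  # (50,)
print(np.array_equal(vec, vec.transpose()))   # True

# An explicit reshape produces a column vector.
column = vec.reshape(-1, 1)
print(column.shape)                           # (50, 1)
```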

## Model Definition and Training
I have opted not to include a pretrained copy of the model because 1) its actual training time is negligible, and the data needed to train it can be parsed in a matter of minutes, and 2) it does not produce helpful results anyway. Below, you can find my model definition. I tried to match its construction to the details in the paper, but some information (the number of hidden nodes, for example) was missing or ambiguous, so I had to guess.

In [7]:
#define the CNN model
#I am using MaxPool2d as opposed to k-max pooling as we know the sentences should always be the same size
#I have left the print statements intact so you can see how the values become smaller/closer to zero
#just uncomment, then run to see
import torch
import torch.nn as nn
import torch.nn.functional as F
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(50, 17, kernel_size=(3,3), padding=1)
        self.conv2 = nn.Conv2d(17, 6, kernel_size=(3,3), padding=1)
        #self.pool1 = nn.MaxPool2d(3) #errors if set to 3 as the original paper uses for its pooling
        self.pool2 = nn.MaxPool2d(1)
        self.fc1 = nn.Linear(18, 1)

    def forward(self, x_prime):
        x_prime = F.relu(self.conv1(x_prime))
        #print("After first conv layer:")
        #print(x_prime)
        x_prime = F.relu(self.conv2(x_prime))
        #print("After second conv layer:")
        #print(x_prime)
        x_prime = self.pool2(x_prime)
        #print("After first pool:")
        #print(x_prime)
        x_prime = self.pool2(x_prime)
        #print("After second pool:")
        #print(x_prime)
        x_prime = x_prime.view(-1, 18)
        #print("View X prime:")
        #print(x_prime)
        x_prime = self.fc1(x_prime)
        #print("After FC layer:")
        #print(x_prime) #show what is being output from the model
        return x_prime
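
The layer sizes can be checked by hand with the standard output-size formula for convolution and pooling, out = floor((in + 2 * padding - kernel) / stride) + 1. The sketch below applies it to the (50, 3, 1) input shape used later in this notebook; `out_size` is a helper introduced here for illustration only. It also suggests why `MaxPool2d(3)` errors out: the last dimension has size 1, so a window of 3 leaves an output size of 0.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length along one dimension for a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


h, w = 3, 1  # spatial dims of one input: 3 words (S, P, O) by 1

# Two 3x3 convolutions with padding=1 preserve the spatial size.
for _ in range(2):
    h, w = out_size(h, 3, padding=1), out_size(w, 3, padding=1)
print(h, w)  # 3 1

# MaxPool2d(1) has kernel 1 and stride 1, so it changes nothing.
print(out_size(h, 1), out_size(w, 1))  # 3 1

# Flattened features per example: 6 channels * 3 * 1, matching nn.Linear(18, 1).
print(6 * h * w)  # 18

# MaxPool2d(3) (stride defaults to the kernel size) on width 1 has no valid output.
print(out_size(w, 3, stride=3))  # 0
```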

Now, we just need to do a little bit more manipulation of our training and test data to create our data loaders. Batch size is 64, following the paper.

In [8]:
#define the data loaders
#to do this, construct the training data by binding together final sentence1 and sentence2 with their target score
df1 = pd.DataFrame(sentence1_final, columns = ["S1", "P1", "O1"])
df2 = pd.DataFrame(sentence2_final, columns = ["S2", "P2", "O2"])
df3 = pd.DataFrame(labels, columns = ["label"])
combined_df = df1.join(df2)
combined_df = combined_df.join(df3)
train_data = combined_df.values.tolist()
train_loader = torch.utils.data.DataLoader(train_data, batch_size=64, shuffle=True)
#repeat for test data
df1 = pd.DataFrame(sentence1_final_test, columns = ["S1", "P1", "O1"])
df2 = pd.DataFrame(sentence2_final_test, columns = ["S2", "P2", "O2"])
df3 = pd.DataFrame(labels_test, columns = ["label"])
combined_df = df1.join(df2)
combined_df = combined_df.join(df3)
test_data = combined_df.values.tolist()
val_loader = torch.utils.data.DataLoader(test_data, batch_size=64, shuffle=False)
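
Each batch from these loaders arrives as separate subject, predicate, and object arrays, which the training and evaluation loops combine into a single (batch, 50, 3, 1) input. Concatenating along the first axis and then reshaping does not keep each example's three vectors together. The sketch below shows one alternative, stacking on a new axis, using random stand-in data and assuming each field is a (batch, 50) array.

```python
import numpy as np

batch, dim = 4, 50
rng = np.random.default_rng(0)
# Stand-ins for one batch of subject, predicate, and object vectors.
subj = rng.normal(size=(batch, dim)).astype(np.float32)
pred = rng.normal(size=(batch, dim)).astype(np.float32)
obj = rng.normal(size=(batch, dim)).astype(np.float32)

# Stack on a new trailing axis: (batch, 50, 3), then add a width axis of 1.
stacked = np.stack([subj, pred, obj], axis=2)[..., np.newaxis]
print(stacked.shape)  # (4, 50, 3, 1)

# Example 0 keeps its own vectors in the three "word" slots.
assert np.array_equal(stacked[0, :, 0, 0], subj[0])
assert np.array_equal(stacked[0, :, 1, 0], pred[0])
assert np.array_equal(stacked[0, :, 2, 0], obj[0])

# For comparison, concatenate-then-reshape mixes values across examples.
mixed = np.reshape(np.concatenate([subj, pred, obj]), (batch, dim, 3, 1))
print(np.array_equal(mixed[0, :, 0, 0], subj[0]))  # False
```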

Below is where I actually train the model. Each sentence in a pair is passed through the CNN separately. Following the methodology in the paper, I then compute the Manhattan distance d between the two outputs and take e^(-d) as the similarity score, which lies between 0 (dissimilar) and 1 (similar). Because MSRP labels are binary, the score is then thresholded at 0.5.

In [9]:
#now train the model
criterion = nn.MSELoss()
model = SimpleCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
n_epochs = 10
from scipy.spatial.distance import cityblock
import math
import random
from tqdm import tqdm
def train_model(model, train_dataloader, n_epoch=n_epochs, optimizer=optimizer, criterion=criterion):
    import torch.optim as optim
    model.train() # prep model for training
    for epoch in range(n_epoch):
        curr_epoch_loss = []
        for s1s, s1p, s1o, s2s, s2p, s2o, target in tqdm(train_dataloader):
            #first, process s1 and s2 through the model
            #ensure the batch size is accurate
            batch = s1s.shape[0]
            s1 = np.concatenate([s1s,s1p,s1o])
            s1 = np.reshape(s1, (batch,50,3,1))
            s1 = torch.tensor(s1) 
            s1_processed = model(s1)
            s2 = np.concatenate([s2s,s2p,s2o])
            s2 = np.reshape(s2,(batch,50,3,1))
            s2 = torch.tensor(s2)
            s2_processed = model(s2)
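            #need to detach to perform manhattan distance calculation, otherwise throws error
            #note: detaching cuts the autograd graph, so loss.backward() below does not reach the model's weights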
            s1_detached = s1_processed.detach()
            s2_detached = s2_processed.detach()
            y_hats = torch.empty(target.shape[0])
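            for i in range(target.shape[0]):
                #note: flatten() below is applied to the whole batch output, so the same distance is computed for every i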
                s1_detached_i = torch.flatten(s1_detached)
                s2_detached_i = torch.flatten(s2_detached)
                #now calculate manhattan distance
                manhattan = cityblock(s1_detached_i, s2_detached_i)
                y_hat = math.e ** (-manhattan)
                #normalize y_hat score to 0 or 1 for MSRP data
                if y_hat >= 0.5:
                    y_hat = 1
                else:
                    y_hat = 0
                y_hats[i] = y_hat
            y_hats = y_hats.requires_grad_()
            target = target.float()
            loss = criterion(y_hats,target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            curr_epoch_loss.append(loss.item())
        print(f"Epoch {epoch}: curr_epoch_loss={np.mean(curr_epoch_loss)}")
    return model
seed = 24
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
model = train_model(model, train_loader)

  return default_collate([torch.as_tensor(b) for b in batch])
100%|██████████| 33/33 [00:00<00:00, 48.74it/s]


Epoch 0: curr_epoch_loss=0.3270833194255829


100%|██████████| 33/33 [00:00<00:00, 54.64it/s]


Epoch 1: curr_epoch_loss=0.3326704502105713


100%|██████████| 33/33 [00:00<00:00, 54.89it/s]


Epoch 2: curr_epoch_loss=0.32149621844291687


100%|██████████| 33/33 [00:00<00:00, 54.09it/s]


Epoch 3: curr_epoch_loss=0.3438447117805481


100%|██████████| 33/33 [00:00<00:00, 52.60it/s]


Epoch 4: curr_epoch_loss=0.3326704502105713


100%|██████████| 33/33 [00:00<00:00, 54.15it/s]


Epoch 5: curr_epoch_loss=0.3438447117805481


100%|██████████| 33/33 [00:00<00:00, 54.20it/s]


Epoch 6: curr_epoch_loss=0.3382575809955597


100%|██████████| 33/33 [00:00<00:00, 54.77it/s]


Epoch 7: curr_epoch_loss=0.32149621844291687


100%|██████████| 33/33 [00:00<00:00, 54.03it/s]


Epoch 8: curr_epoch_loss=0.3326704502105713


100%|██████████| 33/33 [00:00<00:00, 53.37it/s]

Epoch 9: curr_epoch_loss=0.3326704502105713
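
The similarity score described above is exp(-d), where d is the Manhattan (L1) distance between the two sentence representations. A minimal per-pair sketch in plain Python follows; the function names `similarity` and `to_label` are hypothetical. Because exp(-d) >= 0.5 exactly when d <= ln 2 (about 0.693), the 0.5 cutoff amounts to a distance threshold.

```python
import math


def similarity(u, v):
    """exp(-L1 distance) between two equal-length vectors; result lies in (0, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    manhattan = sum(abs(a - b) for a, b in zip(u, v))
    return math.exp(-manhattan)


def to_label(score, cutoff=0.5):
    """Binarize a similarity score for MSRP-style 0/1 labels."""
    return 1 if score >= cutoff else 0


print(similarity([0.2, 0.1], [0.2, 0.1]))  # 1.0 (identical vectors)
print(to_label(similarity([0.0], [0.5])))  # 1, since 0.5 < ln 2
print(to_label(similarity([0.0], [1.0])))  # 0, since 1.0 > ln 2
```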





## Evaluating the Model
The evaluation code is below. For Experiment 1, the authors only reported their attained accuracy and F score. I also included precision and recall in my results table for fun.

In [10]:
#Evaluate the model on the test data
def eval_model(model, dataloader):
    model.eval()
    Y_pred = []
    Y_true = []
    for s1s, s1p, s1o, s2s, s2p, s2o, target in dataloader:
        
        batch = s1s.shape[0]
        s1 = np.concatenate([s1s,s1p,s1o])
        s1 = np.reshape(s1, (batch,50,3,1))
        s1 = torch.tensor(s1)
        s1_processed = model(s1)
        s2 = np.concatenate([s2s,s2p,s2o])
        s2 = np.reshape(s2,(batch,50,3,1))
        s2 = torch.tensor(s2)
        s2_processed = model(s2)
        s1_detached = s1_processed.detach()
        s2_detached = s2_processed.detach()
        y_hats = torch.empty(target.shape[0])
        for i in range(target.shape[0]):
            s1_detached_i = torch.flatten(s1_detached)
            s2_detached_i = torch.flatten(s2_detached)
            #now calculate manhattan distance
            manhattan = cityblock(s1_detached_i, s2_detached_i)
            y_hat = math.e ** (-manhattan)
            #normalize y_hat score to 0 or 1 for MSRP data
            if y_hat >= 0.5:
                y_hat = 1
            else:
                y_hat = 0
            y_hats[i] = y_hat
        Y_pred.append(y_hats)
        Y_true.append(target)
    Y_pred = np.concatenate(Y_pred, axis=0)
    Y_true = np.concatenate(Y_true, axis=0)
    return Y_pred, Y_true
#print metrics
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_pred, y_true = eval_model(model, val_loader)
acc = accuracy_score(y_true, y_pred)
prec, recall, fscore, _ = precision_recall_fscore_support(y_true, y_pred, average='binary')
headers = ["Accuracy", "F Score", "Precision", "Recall"]
stats = [acc, fscore, prec, recall]
print(pd.DataFrame(stats,headers))

                  0
Accuracy   0.668757
F Score    0.801503
Precision  0.668757
Recall     1.000000
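
These four numbers are tied together. A recall of 1.0 with precision equal to accuracy is the pattern produced when every test pair is predicted to be a paraphrase, and the F score follows from F1 = 2PR / (P + R). A quick check using the values printed above, plus a made-up label list to illustrate the all-positive case:

```python
precision = 0.668757  # from the results table
recall = 1.0

f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 6))  # 0.801503

# If every pair is predicted positive, TN = FN = 0, so accuracy equals precision.
y_true = [1, 1, 0, 1, 0]  # made-up labels, not MSRP data
y_pred = [1] * len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy == tp / (tp + fp))  # True
```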


Here is verification of the runtime and memory usage I listed in my report.

In [11]:
print("Total running time = {:.2f} seconds".format(time.time() - _START_RUNTIME))
print("Current and Peak Memory Usage:")
print(tracemalloc.get_traced_memory())
tracemalloc.stop()

Total running time = 935.92 seconds
Current and Peak Memory Usage:
(35357838, 35483841)
