Practical assignment

The Microsoft Research Sentence Completion Challenge (Zweig and Burges, 2011) requires a system to
be able to predict which is the most likely word (from a set of 5 possibilities) to complete a sentence. In
the labs you have evaluated using unigram and bigram models. In this assignment you are expected to
investigate at least 2 extensions or alternative approaches to making predictions. Your solution does
not need to be novel. You might choose to investigate 2 of the following approaches or 1 of the following
approaches and 1 of your own devising.

•Tri-gram (or even quadrigram) models
•Word similarity methods e.g., using Googlenews vectors or WordNet?
•Combining n-gram methods with word similarity methods e.g., distributional smoothing?
•Using a neural language model?

In [46]:
import numpy as np
import matplotlib.pyplot as plt
import nltk
import os,random,math
import sys
import torch
import copy


training_dir="D:/Documents/Computer Science/Year 3/Semester Two/ANLE/lab2resources/lab2resources"+"/sentence-completion/Holmes_Training_Data/"
#testing_dir="D:/Documents/Computer Science/Year 3/Semester Two/ANLE/lab2resources/lab2resources/sentence-completion/testing_data.csv"

We start by calling in needed libraries and setting the directories of data. We next want to read the data and format it correctly.

The files are split between training and testing data from the holmes training data set. We then gather every line in the file within two large string array for training and testing. 

In [2]:
TRAINING_DIR=os.path.dirname(training_dir) #this needs to be the parent directory for the training corpus
def get_training_testing(training_dir=TRAINING_DIR,split=0.5):
    filenames=os.listdir(training_dir)
    n=len(filenames)
    print("There are {} files in the training directory: {}".format(n,training_dir))
    #random.seed(53) #if you want the same random split every time
    random.shuffle(filenames)
    index=int(n*split)
    return(filenames[:index],filenames[index:])

def readWords(files):
    training=[]
    for afile in files: #look through each file
                #print("Processing {}".format(afile))
                try:
                    sent=""
                    with open(os.path.join(training_dir,afile)) as instream: #get each line and preprocess
                        for line in instream:
                            line=line.lower()
                            line=line.rstrip()
                            sent+=line+" " #gather each line in the array
                    
                except UnicodeDecodeError:
                    print("UnicodeDecodeError processing {}: ignoring file".format(afile))
                except PermissionError:
                    print("denied")
    sent=sent.replace("?",".").replace("!",".").replace("  "," ")
    training=sent.split(".") #get array of sentences
    try: training.remove(" ")  #remove spaces
    except ValueError: pass
    try: training.remove(' m') #remove random characters
    except ValueError: pass
    return training
train,test=get_training_testing() #get the data

print(train[0],test[0])

train_words=readWords(train) #get the train sentences
test_words=readWords(test) #get the test sentences
print(train_words[0:10],test_words[0:10])

There are 522 files in the training directory: D:/Documents/Computer Science/Year 3/Semester Two/ANLE/lab2resources/lab2resources/sentence-completion/Holmes_Training_Data
CUBRK10.TXT FLIRT10.TXT
['*******the project gutenberg etext of tales and fantasies******* #18 in our series by robert louis stevenson  copyright laws are changing all over the world, be sure to check the copyright laws for your country before posting these files', '', ' please take a look at the important information in this header', ' we encourage you to keep this file on your own disk, keeping an electronic path open for the next readers', ' do not remove this', '  **welcome to the world of free plain vanilla electronic texts** **etexts readable by both humans and by computers, since 1971** *these etexts prepared by hundreds of volunteers and donations* information on contacting project gutenberg to get etexts, and further information is included below', ' we need your donations', '  tales and fantasies by robert l

Apply preprocessing techniques on the data

In [5]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

def get_word_list(train):
    dict={}
    for sent in train:
        for word in sent.split():
            dict[word]=0
    return list(dict.keys()) #list of unique words return
def convert_to_vector(words):
    oov_token = "<OOV>"
    tokenizer = Tokenizer(num_words=len(words), oov_token=oov_token)
    tokenizer.fit_on_texts(words)
    #word_index = tokenizer.word_index
    sequences = tokenizer.texts_to_sequences(words)
    return list(pad_sequences(sequences, truncating='post', maxlen=4))
def count_freq(train):
    for sent in train:
        pass
words_train=get_word_list(train_words)
x=convert_to_vector(words_train)

def get_label_as_data(labels):
    x=len(labels) #get the size of the array
    arr=np.zeros((x)) #create an array of empty
    lab=[]
    for i in range(x):
        a=np.copy(arr)
        a[i]=1
        lab.append(a)
    return np.array(lab)
INP_SIZE=len(x[0])

x[0]


array([ 0,  0,  0, 12])

Create the first model

A recurrent neural network is a good choice for this sort of task as the architecture makes it well suited to sequencing. This approach is taken with stochastic gradient descsent as stated in the research paper 'Character-Aware Neural Language Models'.

The second approach used a feed-forward network where the vector as input and vector prediction as output. It encountered problems where the gradient optimization would grow infinitely smaller and eventually be an unrecognized value. Sigmoid was the only activation function that prevented this, however would force values between 0 and 1 resulting in incorrect predictions for a population of training data. 

The next model used labeled representation to represent the output.

In [62]:
class R_Network:
    def __init__(self,wordList,inp=4,hid=4,out=4,batch_size=1):
        self.rnn = torch.nn.RNN(inp,hid,out)
        self.h0 = torch.nn.Parameter(torch.randn(inp, 1, hid)) #output hidden
        self.batch_size=batch_size
        self.words=["word" for i in range(batch_size)] #initialize empty to prevent errors
        self.wordList=wordList
        self.inp=inp
        self.words={}
        vecs=convert_to_vector(wordList)
        for i,word in enumerate(wordList): #create dict
            self.words[word]=vecs[i]
    def train(self,words,targets,epochs=500):
        #get each word in index
        words_train=get_word_list(words)
        #vectorize words
        words=convert_to_vector(words_train)
        #get words in relation of neigbours
        #train word to next word
        optimizer = torch.optim.SGD((self.rnn.parameters()), lr=0.20) #, momentum=0.9
        loss_fn = torch.nn.MSELoss()
        for epoch in range(epochs):
            error=0
            for i in range(len(words)):
                X=words[i]
                y=torch.tensor(targets[i]).double()
                y=torch.tensor(y,requires_grad=True)

                pred=self.forward(X)
                
                optimizer.zero_grad()
                loss = loss_fn(pred, y)
                
                loss.backward()
                optimizer.step()  

                error+=sum(np.round(y.data.numpy() - pred.data.numpy()))/4
            if epoch % 100 == 0:
                print(f"Testing network @ epoch {epoch}")
                print("Error:",error/len(words))
                    
    def forward(self,input,n=3):
        #input = torch.randn(self.batch_size, 3, 1) #fake input for debug
        input=torch.tensor(np.array(input[np.newaxis,np.newaxis,:])).reshape(1,1,4).float()
        hn=self.h0
        output=input
        for i in range(n):
            output, hn = self.rnn(output, hn)
       
        assert len(output)<=self.batch_size,"Invalid batchsize output" #validate output
        output=output.detach().numpy() #convert to numpy and reshape
        output=np.sum(output,axis=1)[:,np.newaxis] #get the sum of the values
        return torch.tensor(output[0][0])
    def getAction(self,word):
        vec=np.array(self.words.get(word,np.array([0,0,0,0]))) #give default
        word_vec = torch.round(self.forward(vec)).detach().numpy() #gather vector
        w=list(self.words.keys())[0]
        for key in self.words:
            if self.words[key].all()==word_vec.all(): #loop through
                w=key
        return w
    def get_accuracy(self,words,labels): #get accuracy
        acc=0
        for word,label in zip(words,labels): #zip them together
            act=self.getAction(word) #get action from this word
            if act==label:
                acc+=1
        return acc/len(word) #return 

        
net=R_Network(words_train)
net.forward(x[0])
net.train(words_train[0:10],x[0:10],epochs=100)
print("Accuracy",net.get_accuracy(words_train[0:10],x[0:10]))

  y=torch.tensor(y,requires_grad=True)


Testing network @ epoch 0
Error: 93.05


  if act==label:


Accuracy 0.0


In [53]:
class FNN:
    def __init__(self,words,inp=4,hid=10,out=4): #feed forward neural network
        self.inp = torch.nn.Parameter(torch.randn(inp, hid)) #output hidden
        self.h0 = torch.nn.Parameter(torch.randn(hid, hid)) #output hidden
        self.h1 = torch.nn.Parameter(torch.randn(hid, out)) #output hidden
        self.b1 = torch.nn.Parameter(torch.randn(1, hid)) #output hidden
        self.b2 = torch.nn.Parameter(torch.randn(1, out)) #output hidden
        self.words={}
        vecs=convert_to_vector(words)
        for i,word in enumerate(words): #create dict
            self.words[word]=vecs[i]
    def forward(self,item): #pass through network
        item=item[:,np.newaxis]
        x=torch.tensor(item).float()
        x=(torch.mm(x.T, self.inp)) #torch.sigmoid
        x=torch.sigmoid(torch.mm(x, self.h0))
        x=(torch.mm(x, self.h1))
        return x[0]
    def train(self,words,targets,epochs=400):
        print("Training...")
        #get each word in index
        
        #get words in relation of neigbours
        #train word to next word
        BestInp=None
        Besth0=None
        Besth1=None
        BestAcc=0
        optimizer = torch.optim.SGD((self.inp,self.h0,self.h1), lr=0.30) #, momentum=0.9
        loss_fn = torch.nn.MSELoss()
        for epoch in range(epochs):
            error=0
            correct=0
            for i in range(len(words)):
                X=self.words.get(words[i],np.array([0,0,0,0]))/80
                y=torch.tensor(targets[i]).double()
                y=torch.tensor(y,requires_grad=True)
               
                pred=self.forward(X)
                #X=torch.tensor(torch.tensor(X).double(),requires_grad=True)
                optimizer.zero_grad()
                loss = loss_fn(pred, y)
                
                loss.backward()
                optimizer.step()  
                if np.argmax(pred.data.numpy())==i:
                    correct+=1
                error+=sum(abs(y.data.numpy() - pred.data.numpy()))/4
            if correct/len(words) > BestAcc:
                BestAcc=correct
                BestInp=copy.deepcopy(self.inp)
                Besth0=copy.deepcopy(self.h0)
                Besth1=copy.deepcopy(self.h1)
            if epoch %200 == 0:
                #print(loss,torch.round(pred),y)
                #print(sum((y.data.numpy() - pred.data.numpy()))/4)
                print("Accuracy",correct/len(words))
                print(f"Testing network @ epoch {epoch}")
                print("Error:",error/len(words))
        self.inp=BestInp
        self.h0=Besth0
        self.h1=Besth1
    def getWordThing(self,word):
        a=np.array([0,0,0,0])
        print(self.words.get(word,a))
        return self.words.get(word,a)
    def get_action(self,word):
        vec=self.words.get(word,np.array([0,0,0,0]))/80 #give default
        word_vec = self.forward(vec).data.numpy() #gather vector
        ind=np.argmax(word_vec)
        #print(word,word_vec,ind)
        return list(self.words.keys())[ind]
    def getWord(self,word_vec):
        w="None"
        for key in self.words:
            #print(self.words[key],word_vec)
            if np.array_equal(self.words[key],word_vec): #loop through
                w=key
        return w
    def get_accuracy(self,words,labels): #get accuracy
        acc=0
        for word,label in zip(words,labels): #zip them together
            act=self.get_action(word) #get action from this word
            #print(act)
            if act==word:
                acc+=1
        return acc/len(words) #return accuracy


words_train=get_word_list(train_words)
labels=get_label_as_data(words_train[0:100])
netF=FNN(words_train,hid=len(labels[0])//2,out=len(labels[0]))
print(netF.forward(x[0]))#
#print(words_train[0:4],labels[0:4])
netF.train(words_train[0:100],labels[0:100],epochs=10000)  
print("End Accuracy",netF.get_accuracy(words_train[0:100],labels[0:100]))
netF.getWordThing(words_train[0])


tensor([ -7.0263,   6.8147,  -6.1705,   2.0332,  10.8733,  -5.4604,  -0.3081,
         -7.8788,  -1.7301,  -2.6504,  -2.7632,  -1.7217,  -2.4411,   2.0993,
          5.5797,  -4.9541,   1.3451,   1.4812,  -6.9932,  -2.6261,  -2.1535,
         -1.3742,   0.8028,  -5.3220,  -3.4034,  -5.4714,  -0.6734,   3.8546,
         -9.9450,   8.1569,  -1.9101,   5.6007,   3.3800,   0.7382,   1.2677,
          5.9711,   2.3239,   0.0953,  -0.0145,  -3.4464,   5.4084,   0.9519,
          0.2902,   5.5677,  -8.0937,   9.1140,   3.3787,   5.6431,  -2.8380,
         14.4866,  -2.9227,   2.5693,  -5.4530,  -0.4428,   5.0368,   1.5918,
         -8.9663,  -4.1025,  -9.1527,   1.2496,   4.3289,  -7.9883,  -1.9852,
          7.3676,  -3.7074,   1.9989,  -4.5460,   0.2901,  -1.7207,  -0.5794,
          1.0810,   0.5027,  -3.2028,  -6.0610,  -2.9303,   1.6782,   5.6528,
          8.7015,  -2.8451,  -9.1710,   5.0370,   2.6042, -11.4110,  -1.7010,
          1.1096,  -5.8426,   2.4253,  -3.3925,  -2.5130,   1.00

  y=torch.tensor(y,requires_grad=True)


Accuracy 0.02
Testing network @ epoch 200
Error: 0.5731528520530924
Accuracy 0.02
Testing network @ epoch 400
Error: 0.5472699021241044
Accuracy 0.02
Testing network @ epoch 600
Error: 0.5388533678622006
Accuracy 0.02
Testing network @ epoch 800
Error: 0.5343723129708985
Accuracy 0.03
[ 0  0  0 12]


array([ 0,  0,  0, 12])

In [42]:
class Agent_defineLayers:
    def __init__(self, num_input, layers, num_output):
        assert type(layers)==type([]), "Error with layers, give array of the number of layers"
        self.num_input = num_input  #set input number
        self.num_output = num_output #set ooutput number
        self.hidden=[]
        last=num_input
        self.num_genes=0
        for layer in layers:
            self.hidden.append(layer)
            self.num_genes+=(last * layer)
            last=layer
        self.num_genes +=(self.hidden[-1]*num_output)+num_output
        self.weights = None
        self.hidden_weights=None
        self.bias = None
    def set_genes(self, gene):
        weight_idxs = self.num_input * self.hidden[0] #size of weights to hidden
        current=weight_idxs
        weights_idxs=[current] #start with end of last
        for i in range(len(self.hidden)-1):
            current+=self.hidden[i]*self.hidden[i+1] #calculate next idx for each layer
            weights_idxs.append(current)
        bias_idxs=None
        weights_idxs.append(self.hidden[-1] * self.num_output + weights_idxs[-1]) #add last layer heading to output
        bias_idxs = weights_idxs[-1]+ self.num_output #sizes of biases
        w = gene[0 : weight_idxs].reshape(self.hidden[0], self.num_input)   #merge genes
        ws=[]
        for i in range(len(self.hidden)-2):
            ws.append(gene[weights_idxs[i] : weights_idxs[i+1]].reshape(self.hidden[i+1], self.hidden[i]))
        ws.append(gene[weights_idxs[-2] : weights_idxs[-1]].reshape(self.num_output, self.hidden[-1]))
        b = gene[weights_idxs[-1]: bias_idxs].reshape(self.num_output,) #merge genes

        self.weights = torch.from_numpy(w) #assign weights
        self.hidden_weights=[]
        for w in ws:
            self.hidden_weights.append(torch.from_numpy(w))
        self.bias = torch.from_numpy(b) #assign biases

    def forward(self, x):
        x=x[:,np.newaxis]
        x = torch.from_numpy(x).double()
        x = torch.mm(x.T, self.weights.T) #first layer
        for i in range(len(self.hidden_weights)-1):
            x = torch.mm(x,self.hidden_weights[i].T) #second layer
        return torch.mm(x,self.hidden_weights[-1].T) + self.bias #third layer
        
    def get_action(self, x):
        arr=torch.round(self.forward(x)[0])
        return arr

In [45]:
words={}
vecs=convert_to_vector(words_train)
for i,word in enumerate(words_train): #create dict
    words[word]=vecs[i]

def getWord(word_vec):
        w="None"
        for key in words:
            #print(self.words[key],type(word_vec))
            if np.array_equal(words[key],word_vec): #loop through
                w=key
        return w
def runFitness(agent,words,labels):
    acc=0
    for word,label in zip(words,labels): #zip them together
        act=agent.get_action(word) #get action from this word
        if act==getWord(label):
            acc+=1
    return acc/len(words) #return accuracy

def mutation(gene, mean=0, std=0.5,size=100):
    assert size<len(gene)
    n=random.randint(0,len(gene)-size-1)
    array=np.random.normal(mean,std,size=size)
    gene = gene[n:n+size] + array #mutate the gene via normal 
    # constraint
    gene[gene >4] = 4
    gene[gene < -4] = -4
    return gene

whegBot=Agent_defineLayers(4,[10,10],4) #define the agent

pop_size=150
gene_pop=[]
for i in range(pop_size): #vary from 10 to 20 depending on purpose of robot
    gene=np.random.normal(0, 0.5, (whegBot.num_genes))
    gene_pop.append(gene)#create

acc=[0]
for gen in range(500):
    ind_1 = random.randint(0,len(gene_pop)-1)
    ind_2=0
    if ind_1>0: ind_2 = ind_1-1
    else: ind_2= ind_1+1
    #get both genes
    gene1=gene_pop[ind_1]
    gene2=gene_pop[ind_2]
    #run trials
    whegBot.set_genes(gene1)
    fit1=runFitness(whegBot,x[0:10],x[0:10])
    whegBot.set_genes(gene2)
    fit2=runFitness(whegBot,x[0:10],x[0:10])
    #selection
    if fit1>fit2:
        gene_pop[ind_2]=np.copy(mutation(gene1))
    elif fit2>fit1:
        gene_pop[ind_1]=np.copy(mutation(gene1))
    acc.append(max(max(acc),max([fit1,fit2])))
    print("Generation",gen,"accuracy",max(acc))

print(acc)

Generation 0 accuracy 0
Generation 1 accuracy 0
Generation 2 accuracy 0
Generation 3 accuracy 0
Generation 4 accuracy 0
Generation 5 accuracy 0
Generation 6 accuracy 0
Generation 7 accuracy 0
Generation 8 accuracy 0
Generation 9 accuracy 0
Generation 10 accuracy 0
Generation 11 accuracy 0
Generation 12 accuracy 0
Generation 13 accuracy 0
Generation 14 accuracy 0
Generation 15 accuracy 0
Generation 16 accuracy 0
Generation 17 accuracy 0
Generation 18 accuracy 0
Generation 19 accuracy 0
Generation 20 accuracy 0
Generation 21 accuracy 0
Generation 22 accuracy 0
Generation 23 accuracy 0
Generation 24 accuracy 0
Generation 25 accuracy 0
Generation 26 accuracy 0
Generation 27 accuracy 0
Generation 28 accuracy 0
Generation 29 accuracy 0
Generation 30 accuracy 0
Generation 31 accuracy 0
Generation 32 accuracy 0
Generation 33 accuracy 0
Generation 34 accuracy 0
Generation 35 accuracy 0
Generation 36 accuracy 0
Generation 37 accuracy 0
Generation 38 accuracy 0
Generation 39 accuracy 0
Generation

An attempt to make a feed forward neural network was placed in as there were issues with backpropagating the RNN

Create the second model

Evaluate performance