<a href="https://colab.research.google.com/github/vgbeck/ComputationalLinguistics/blob/main/BeckemanFinalAssignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Explain what kind of model you chose and why:

I chose a recurrent neural network (RNN), specifically an LSTM. I chose it because it allows flexibility in model design: I wanted to be able to vary my input sizes and layers to see the effects, and I tried a variety of parameters until I found the best-performing settings. It is also useful for handling complex data. RNNs and LSTMs are well suited to sequential data such as lyrics, where words and phrases occur in a meaningful order, so they are a good fit for text analysis.

Explain exactly how you are calculating accuracy:

Accuracy is the number of correctly predicted authors divided by the total number of predictions. During training it is computed at the end of every epoch. Because I trained for 5 epochs, there are five training accuracies, and I averaged them for a final score. The testing accuracy is computed only once, because the model is already trained and testing is not split into epochs. The testing accuracy was 7.13% and the overall training accuracy was 47.4% (the average of the per-epoch accuracies printed below). By the final epoch the training accuracy had reached 70.65%, so the gap is even larger than the average suggests. Because the training accuracy is much higher than the testing accuracy, the model is probably overfitting.
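
As a minimal sketch of the calculation described above (not the notebook's actual training loop): per-epoch accuracy is correct predictions divided by total predictions, and the overall training figure is the mean of the per-epoch values. The helper names `accuracy` and `mean_accuracy` and the toy predictions are illustrative assumptions; the five values are the per-epoch training accuracies printed in the run at the end of this notebook.

```python
def accuracy(predictions, targets):
    """Fraction of predictions that match their targets."""
    if not targets:
        raise ValueError("no predictions to score")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def mean_accuracy(epoch_accuracies):
    """Unweighted mean of the per-epoch accuracies."""
    return sum(epoch_accuracies) / len(epoch_accuracies)


# Toy example (hypothetical author indices): 3 of 4 predictions match.
toy_acc = accuracy([0, 5, 20, 7], [0, 5, 20, 9])

# Per-epoch training accuracies as printed by the training run.
epoch_accuracies = [0.1689, 0.3698, 0.5088, 0.6159, 0.7065]
overall_train_acc = mean_accuracy(epoch_accuracies)

print(toy_acc)
print(round(overall_train_acc, 4))
```

An unweighted mean is reasonable here because every epoch scores the same number of training lines.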

Errors this model makes:

Some of the errors this model makes are probably due to overfitting, which is a real risk with this type of data. Each line contains very specific lyrics and words, so the model may fit the training examples too closely and fail to classify testing data that it has not seen before. One of the first patterns I saw was that 80% of the correct predictions were for song targets. This tells me that my model is better at predicting song targets than poem targets. Looking more closely at the incorrect predictions, I see that the predicted author often shares a genre with the target author. It is interesting how many of the incorrect predictions come from styles of music similar to their target: for example, predicting Charlie Puth when the target was Mac Miller. These artists are both pop and might have similar word choices throughout their lyrics. One way to address this might be to partition the data into training, validation, and testing sets, so that overfitting can be detected on held-out data during training. Having more data points would also increase the variety.
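
The partitioning idea above could be sketched roughly as follows. This is only one possible approach, not what the notebook does (it loads separate train and test files): the function name `split_by_author`, the 10% fractions, and the fixed seed are all assumptions. Splitting within each author keeps every author represented in all three sets.

```python
import random
from collections import defaultdict


def split_by_author(lines, val_frac=0.1, test_frac=0.1, seed=0):
    """Split [tokens, author] pairs into train/validation/test sets per author."""
    by_author = defaultdict(list)
    for line in lines:
        by_author[line[1]].append(line)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for author in sorted(by_author):
        group = by_author[author][:]  # copy so the input is not reordered
        rng.shuffle(group)
        n_val = int(len(group) * val_frac)
        n_test = int(len(group) * test_frac)
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])
    return train, val, test


# Toy data (hypothetical): 10 lines for each of two authors.
toy_lines = [[["line", str(i)], author]
             for author in ("MacMiller", "MaryOliver") for i in range(10)]
train_set, val_set, test_set = split_by_author(toy_lines)
print(len(train_set), len(val_set), len(test_set))
```

Accuracy on the validation set could then be checked after each epoch, and training stopped once it stops improving.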



In [None]:
import torch  # PyTorch is a Python library for building neural networks; it performs backpropagation automatically during training.
import torch.nn as nn  # torch.nn is a submodule of torch that provides network layers and the functions that operate on them.


# single-direction RNN (or LSTM) over learned word embeddings
class Emb_RNN(nn.Module):
    def __init__(self, params, use_LSTM=False):
        super(Emb_RNN, self).__init__()
        self.d_embs = params['d_emb']
        self.d_hid =  params['d_hid']
        self.embeddings= nn.Embedding(params['num_wds'], self.d_embs)
        self.use_LSTM = use_LSTM
        # input to recurrent layer, default nonlinearity is tanh
        if use_LSTM:
            self.i2R = nn.LSTMCell(self.d_embs, self.d_hid)
        else:
            self.i2R = nn.RNNCell(self.d_embs, self.d_hid)
        # recurrent to output layer
        self.R2o = nn.Linear(self.d_hid, params['num_authors'])

    def forward(self, wd_indices):
        for j, wd_ix in enumerate(wd_indices):
            embs = self.embeddings(wd_ix)
            embs = torch.unsqueeze(embs, 0)
            if self.use_LSTM:
                if j == 0:
                    hidden, context = self.i2R(embs)
                else:
                    hidden, context = self.i2R(embs, (hidden, context))
            else:
                if j == 0:
                    hidden = self.i2R(embs)
                else:
                    hidden = self.i2R(embs, hidden)
        pred = self.R2o(hidden)
        return pred

In [None]:
import torch
import torch.nn as nn
# from model import Emb_RNN
import numpy as np
import re
import sys
import collections
import os
import random
import json

verbose = False

num_epochs = 1

#BEFORE
d_emb = 128
n_layers = 1
d_hid = 128
lr = 0.0002


# d_emb = 128
# n_layers = 1
# d_hid = 128
# lr = 0.0003

use_LSTM = True
if use_LSTM:
    model_type = 'lstm'
else:
    model_type = 'rnn'

In [None]:
def train(net, lines, params):
    criterion = nn.CrossEntropyLoss() #Don't use ignore index!!!
    # Note: this uses the module-level lr defined above (0.0002), not params['lr']
    optimiser = torch.optim.Adam(net.parameters(), lr=lr)
    if os.path.exists(params['save_path']):
        checkpoint = torch.load(params['save_path'])
        print('Loading checkpoint')
        net.load_state_dict(checkpoint['net_state_dict'])
        optimiser.load_state_dict(checkpoint['optimiser_state_dict'])
        net.train()  # resume in training mode after restoring the checkpoint


    for epoch in range(params['epochs']):
        print("epoch ", epoch)
        ep_loss = 0.
        num_tested = 0
        num_correct = 0
        for counter, i in enumerate(torch.randperm(len(lines))):
        #for counter, i in enumerate(torch.randperm(20)):
            line = torch.LongTensor([wd2ix[wd] for wd in lines[i][0]])
            #here


            pred = net(line)
            pred = pred.contiguous().view(-1, pred.size(-1))
            target = torch.tensor(au2ix[lines[i][1]])
            target = target.contiguous().view(-1)
            target = target.long()
            with torch.no_grad():
                pred_numpy = np.argmax(pred.numpy(), axis=1).tolist()
                target_numpy = target.numpy().tolist()
                num_tested += 1
                if pred_numpy == target_numpy:
                    num_correct += 1
            loss = criterion(pred, target)
            if torch.isnan(loss):
                with torch.no_grad():
                    print(pred, target, lines[i])
                    exit()
            loss.backward()
            optimiser.step()
            optimiser.zero_grad()
            ep_loss += loss.detach()
        print('Epoch', epoch, 'Accuracy', round(num_correct / num_tested, 4), 'Loss', ep_loss)
        print('Saving checkpoint')
        torch.save({'net_state_dict': net.state_dict(),  'optimiser_state_dict': optimiser.state_dict()}, params['save_path'])


In [None]:
def test(net, lines, params):
    criterion = nn.CrossEntropyLoss()
    optimiser = torch.optim.Adam(net.parameters(), lr=lr)
    if os.path.exists(params['save_path']):
        checkpoint = torch.load(params['save_path'])
        print('Loading checkpoint')
        net.load_state_dict(checkpoint['net_state_dict'])
        optimiser.load_state_dict(checkpoint['optimiser_state_dict'])
        net.eval()
    num_tested = 0
    num_correct = 0
    with torch.no_grad():
        current_author = None
        scores = np.zeros(len(authors))
        for line in lines:
            wds = torch.LongTensor([wd2ix[wd] for wd in line[0]])
            pred = net(wds)
            pred = pred.contiguous().view(-1, pred.size(-1))
            target = torch.tensor(au2ix[line[1]])
            target = target.contiguous().view(-1)
            target = target.long()
            pred_numpy = np.argmax(pred.numpy(), axis=1).tolist()
            target_numpy = target.numpy().tolist()
            num_tested += 1
            if pred_numpy == target_numpy:
                num_correct += 1
            author = au2ix[line[1]]
            if author != current_author:
                #We have moved to a new author so we want to collect the scores of the most recent author
                if current_author is not None:
                    #If we are not at the very beginning
                    most_frequent_author = ix2au[str(np.argmax(scores))] #This author was predicted the most of all the lines of the most recent current author
                    print("Current Author:", ix2au[str(current_author)], "Predicted Author:", most_frequent_author) #Does the current author match the most frequently predicted author?
                    print(scores)
                    print(scores[current_author]) #How many hits did the current author actually get?
                    scores = np.zeros(len(authors)) #Reset the scores to zero for each author
                current_author = author #Reset the current author to the most recent author
            scores[pred_numpy[0]] += 1 #Add 1 to the score of the predicted author for the last line seen
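    #The loop only reports an author when the next author begins, so report the final author here
    if current_author is not None:
        print("Current Author:", ix2au[str(current_author)], "Predicted Author:", ix2au[str(np.argmax(scores))])
        print(scores)
        print(scores[current_author])
    print('Test accuracy', round(num_correct / num_tested, 4))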


In [None]:

#net = Emb_RNN()
vocab = []
authors = []
train_lines = []
test_lines = []

with open('trainfile-1.json', 'r') as f0:
    train_data = json.load(f0)
    for pair in train_data:
        line = pair[0].split()
        line = [re.sub(r'[^a-zA-Z\*’\']', '', wd.lower()) for wd in line]
        for wd in line:
            # line is already lowercased and stripped of punctuation above
            if wd not in vocab:
                vocab.append(wd)
        if pair[1] not in authors:
            authors.append(pair[1])
        train_lines.append([line, pair[1]])

with open('testfile-1.json', 'r') as f0:
    test_data = json.load(f0)
    for pair in test_data:
        line = pair[0].split()
        line = [re.sub(r'[^a-zA-Z\*’\']', '', wd.lower()) for wd in line]
        for wd in line:
            # line is already lowercased and stripped of punctuation above
            if wd not in vocab:
                vocab.append(wd)
        if pair[1] not in authors:
            authors.append(pair[1])
        test_lines.append([line, pair[1]])


print('There are', len(vocab), 'words in the vocabulary')
print('There are', len(authors), 'authors')
print(authors)

There are 6212 words in the vocabulary
There are 40 authors
['Beyonce', 'BillyJoel', 'caamp', 'CatStevens', 'Cavetown', 'CharliePuth', 'Coldplay', 'ConanGray', 'DanReynolds', 'DominicFike', 'Drake', 'GratefulDead', 'GregoryAlanIsakov', 'HarryStyles', 'JohnLennon', 'joji', 'KhalilGhibran', 'LadyGaga', 'LandonConrath', 'LucyDacus', 'MacMiller', 'MargaretAtwood', 'MaryOliver', 'MayaAngelou', 'MichaelJackson', 'NikDay', 'NoahKahan', 'NoelGallagher', 'Olivia_Rodrigo', 'Quadeca', 'Queen', 'RickAstley', 'SamSmith', 'SmashMouth', 'solange', 'StevieNicks', 'TaylorSwift', 'TheAllAmericanRejects', 'wallows', 'ZachBryan']


In [None]:
#Map words and authors to integer indices (and back) so the model can use them
wd2ix = {}
ix2wd = {}
for i,wd in enumerate(vocab):
    wd2ix[wd] = i
    ix2wd[str(i)] = wd
#words_as_indices = [torch.LongTensor([wd2ix[wd] for wd in vocab])]


au2ix = {}
ix2au = {}
for i,au in enumerate(authors):
    au2ix[au] = i
    ix2au[str(i)] = au
#authors_as_indices = [torch.LongTensor([au2ix[au] for au in authors])]
print(au2ix.items())

dict_items([('Beyonce', 0), ('BillyJoel', 1), ('caamp', 2), ('CatStevens', 3), ('Cavetown', 4), ('CharliePuth', 5), ('Coldplay', 6), ('ConanGray', 7), ('DanReynolds', 8), ('DominicFike', 9), ('Drake', 10), ('GratefulDead', 11), ('GregoryAlanIsakov', 12), ('HarryStyles', 13), ('JohnLennon', 14), ('joji', 15), ('KhalilGhibran', 16), ('LadyGaga', 17), ('LandonConrath', 18), ('LucyDacus', 19), ('MacMiller', 20), ('MargaretAtwood', 21), ('MaryOliver', 22), ('MayaAngelou', 23), ('MichaelJackson', 24), ('NikDay', 25), ('NoahKahan', 26), ('NoelGallagher', 27), ('Olivia_Rodrigo', 28), ('Quadeca', 29), ('Queen', 30), ('RickAstley', 31), ('SamSmith', 32), ('SmashMouth', 33), ('solange', 34), ('StevieNicks', 35), ('TaylorSwift', 36), ('TheAllAmericanRejects', 37), ('wallows', 38), ('ZachBryan', 39)])


In [None]:
params = {'num_wds': len(vocab), 'num_authors': len(authors), 'd_emb': 128, 'num_layers': 1, 'd_hid': 128, 'lr': 0.0003, 'epochs': 5, 'save_path': 'authors.pth'}

model = Emb_RNN(params, True)


for j in range(1):
    print("calling train")
    train(model, train_lines, params)
    print("calling test")
    test(model, test_lines, params)

calling train
epoch  0
Epoch 0 Accuracy 0.1689 Loss tensor(51213.2734)
Saving checkpoint
epoch  1
Epoch 1 Accuracy 0.3698 Loss tensor(39469.5859)
Saving checkpoint
epoch  2
Epoch 2 Accuracy 0.5088 Loss tensor(30865.8359)
Saving checkpoint
epoch  3
Epoch 3 Accuracy 0.6159 Loss tensor(24145.3301)
Saving checkpoint
epoch  4
Epoch 4 Accuracy 0.7065 Loss tensor(18632.7891)
Saving checkpoint
calling test
Loading checkpoint
Current Author: Beyonce Predicted Author: DominicFike
[ 1.  5.  1.  1.  1.  0.  2.  6.  6. 16.  2.  0.  1.  0.  0.  0.  0.  0.
  0.  0.  1. 12.  0.  0.  2.  2.  6.  1.  0.  2.  1.  0.  0.  0.  0.  4.
  3.  0.  3.  0.]
1.0
Current Author: BillyJoel Predicted Author: MargaretAtwood
[1. 1. 1. 1. 0. 0. 2. 2. 3. 0. 0. 0. 1. 0. 0. 1. 0. 1. 1. 0. 2. 8. 0. 3.
 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0. 0. 2. 3. 0.]
1.0
Current Author: caamp Predicted Author: Beyonce
[2. 0. 2. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 2. 0. 0. 2. 0. 0. 0. 0.
 0. 0. 1. 0. 2. 1. 0. 0. 0. 0. 0. 0. 2. 0. 0. 0.]
