# 2023 COMP 4446 / 5046 Assignment 1

Assingment 1 is an **individual** assessment. Please note the University's [Academic dishonesty and plagiarism policy](https://www.sydney.edu.au/students/academic-dishonesty.html).

Submission Deadline: Friday, March 17th, 2023, 11:59pm

Submit via Canvas:
- Your notebook
- Run all cells before saving the notebook, so we can see your output

In this assignment, we will explore ways to predict the length of a Wikipedia article based on the first 100 tokens in the article. Such a model could be used to explore whether there are systematic biases in the types of articles that get more detail.

If you are working in another language, please make sure to clearly indicate which part of your code is running which section of the assignment and produce output that provides all necessary information. Submit your code, example outputs, and instructions for executing it.

Note: This assignment contains topics that are not covered at the time of release. Each section has information about which lectures and/or labs covered the relevant material. We are releasing it now so you can (1) start working on some parts early, and (2) know what will be in the assignment when you attend the relevant labs and lectures.

# **TODO: Copy and Name this File**
Make a copy of this notebook in your own Google Drive (File -> Save a Copy in Drive) and change the filename, replacing `YOUR-UNIKEY`. For example, for a person with unikey `mcol1997`, the filename should be:

`COMP-4446-5046_Assignment1_mcol1997.ipynb`

# Readme
*If there is something you want to tell the marker about your submission, please mention it here.* 

[write here - optional]

# Data Download [DO NOT MODIFY THIS]

We have already constructed a dataset for you using a recent dump of data from Wikipedia. Both the training and test datasets are provided in the form of csv files (training_data.csv, test_data.csv) and can be downloaded from Google Drive using the code below. Each row of the data contains:

- The length of the article
- The title of the article
- The first 100 tokens of the article

In case you are curious, we constructed this dataset as follows:
1. Downloaded [a recent dump](https://dumps.wikimedia.org/) of English wikipedia.
2. Ran [WikiExtractor](https://github.com/attardi/wikiextractor) to get the contents of the pages.
3. Filtered out very short pages.
4. Ran [SpaCy](https://spacy.io/) with the `en_core_web_lg` model to tokenise the pages (Note, SpaCy's development is led by an alumnus of USyd!).
5. Counted the tokens and saved the relevant data in the format described above.

This code will download the data. **DO NOT MODIFY IT**

In [69]:
## DO NOT MODIFY THIS CODE
# Code to download files into Colaboratory

# Install the PyDrive library
!pip install -U -q PyDrive

# Import libraries for accessing Google Drive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client.
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

# Function to read the file, save it on the machine this colab is running on, and then read it in
import csv
def read_file(file_id, filename):
  downloaded = drive.CreateFile({'id':file_id})
  downloaded.GetContentFile(filename)
  with open(filename) as src:
    reader = csv.reader(src)
    data = [r for r in reader]
  return data

# Calls to get the data
# If you need to access the data directly (e.g., you are running experiments on a local machine), use these links:
# - Training, https://drive.google.com/file/d/1-UGFS8D-qglAX-czU38KaM4jQVCoNe0W/view?usp=share_link
# - Dev, https://drive.google.com/file/d/1RWMEf0mdJMTkWc7dvN0ioks8bjujqZaN/view?usp=share_link
# - Test, https://drive.google.com/file/d/1YVPNzdIFSMmVPeLBP-gf5DOIed3oRFyB/view?usp=share_link
training_data = read_file('1-UGFS8D-qglAX-czU38KaM4jQVCoNe0W', "/content/training_data.csv")
dev_data = read_file('1RWMEf0mdJMTkWc7dvN0ioks8bjujqZaN', "/content/dev_data.csv")
test_data = read_file('1YVPNzdIFSMmVPeLBP-gf5DOIed3oRFyB', "/content/test_data.csv")

print("------------------------------------")
print("Size of training data: {0}".format(len(training_data)))
print("Size of development data: {0}".format(len(dev_data)))
print("Size of test data: {0}".format(len(test_data)))
print("------------------------------------")

print("------------------------------------")
print("Sample Data")
print("LABEL: {0} / SENTENCE: {1}".format(training_data[0][0], training_data[0][1:]))
print("------------------------------------")

# Preview of the data in the csv file, which has three columns: 
# (1) length of article, (2) title of the article, (3) first 100 words in the article
for v in training_data[:10]:
  print("{}\n{}\n{}\n".format(v[0], v[1], v[2][:100] + "..."))

# Store the data in lists and mofidy the length value to be in [0, 1]
training_lengths = [min(1.0, int(r[0])/10000) for r in training_data]
training_text = [r[2] for r in training_data]

dev_lengths = [min(1.0, int(r[0])/10000) for r in dev_data]
dev_text = [r[2] for r in dev_data]

test_lengths = [min(1.0, int(r[0])/10000) for r in test_data]
test_text = [r[2] for r in test_data]

------------------------------------
Size of training data: 9859
Size of development data: 994
Size of test data: 991
------------------------------------
------------------------------------
Sample Data
LABEL: 6453 / SENTENCE: ['Anarchism', 'Anarchism is a political philosophy and movement that is skeptical of all justifications for authority and seeks to abolish the institutions it claims maintain unnecessary coercion and hierarchy , typically including , though not necessarily limited to , governments , nation states , and capitalism . Anarchism advocates for the replacement of the state with stateless societies or other forms of free associations . As a historically left - wing movement , usually placed on the farthest left of the political spectrum , it is usually described alongside communalism and libertarian Marxism as the libertarian wing ( libertarian socialism )']
------------------------------------
6453
Anarchism
Anarchism is a political philosophy and movement that is ske

# 1 - Predicting article length from initial content

This section relates to content from **the week 1 lecture and the week 2 lab**.

In this section, you will implement training and evaluation of a linear model (as seen in the week 2 lab) to predict the length of a wikipedia article from its first 100 words. You will represent the text using a Bag of Words model (as seen in the week 1 lecture).

## 1.1 Word Mapping [2pt]

In the code block below, write code to go through the training data and for any word that occurs at least 10 times:
- Assign it a unique ID (consecutive, starting at 0)
- Place it in a dictionary that maps from the word to the ID

In [70]:

import textwrap
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize, sent_tokenize
import re
import numpy as np
nltk.download('stopwords')
from nltk.corpus import stopwords as sw

sww = sw.words()
def get_token(documents):
  token_doc = []
  for document in documents:
    tokenized_doc = word_tokenize(document)
    token_doc.append(tokenized_doc)
  return token_doc

def get_Frequency(documents):
  DF = {}
  for sentence in documents:
  # get each unique word in the doc - and count the number of occurrences in the document
    for term in np.unique(sentence):
      if term in DF:
        DF[term] += 1
      else:
        DF[term] = 1
  return DF

def generate_map(TF):
  popular_words = {}
  word_id = 0
  # get each tokenised doc
  for word, freq in TF.items():
    if freq >= 10:
      popular_words[word] = word_id
      word_id += 1
  return popular_words

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [71]:
# Train
token_docs = get_token(training_text)
DF = get_Frequency(token_docs)
popular_words = generate_map(DF)

# Dev
token_docs_dev = get_token(dev_text)
DF_dev = get_Frequency(token_docs_dev)
popular_words_dev = generate_map(DF_dev)

# Test
token_docs_test = get_token(test_text)
DF_test = get_Frequency(token_docs_test)
popular_words_test = generate_map(DF_test)

## 1.2 Data to Bag-of-Words Tensors [2pt]

In the code block below, write code to prepare the data in PyTorch tensors.

The text should be converted to a bag of words (ie., a vector the length of the vocabulary in the mapping in the previous step, with counts of the words in the text).

In [72]:
def text_to_bow(token_docs, word_to_id):
    x_data = []
    vocab_size = len(word_to_id)
    for tokens in token_docs:
        bow_vector = [0] * vocab_size
        for token in tokens:
            if token in word_to_id:
                bow_vector[word_to_id[token]] += 1
        x_data.append(bow_vector)
    return x_data

In [73]:
# Your code goes here
import torch

# Train
x_data = text_to_bow(token_docs, popular_words)
x_data = torch.Tensor(x_data)
y_data = torch.tensor(training_lengths).view(-1, 1)

# Dev
x_dev_data = text_to_bow(token_docs_dev, popular_words)
x_dev_data = torch.Tensor(x_dev_data)
y_dev_data = torch.tensor(dev_lengths).view(-1, 1)

# Test
x_test_data = text_to_bow(token_docs_test, popular_words)
x_test_data = torch.Tensor(x_test_data)
y_test_data = torch.tensor(test_lengths).view(-1, 1)

## 1.3 Model Creation [2pt]

Construct a linear model with an SGD optimiser (we recommend a learning rate of `1e-4`) and mean squared error as the loss.

In [74]:
# Your code goes here
import torch.nn as nn
import torch.nn.functional as F

linearRegression =  nn.Linear(len(x_data[1]), 1)
optimizer = torch.optim.SGD(linearRegression.parameters(), lr=1e-4)

# Define the loss function
loss_func = F.mse_loss
# Calculate loss
loss = loss_func(linearRegression(x_data), y_data)

## 1.4 Training [2pt]

Write a loop to train your model for 100 epochs, printing performance on the dev set every 10 epochs.

In [75]:
# Example of timing

def train_model(model, optimizer, epochs, x_data, y_data, x_test, y_test, display):
  no_of_epochs = epochs
  display_interval = 10
  for epoch in range(no_of_epochs):
    predictions = model(x_data)
    loss = loss_func(predictions, y_data)
    loss.backward()
    optimizer.step() #call step() to automatically update the parameters through our defined optimizer, which can be called once after backward()
    optimizer.zero_grad() #reset the gradient as what we did before
    if display:
      if epoch % display_interval == 0 :
          # calculate the loss of the current model
          predictions = model(x_test)
          loss = loss_func(predictions, y_test)
          print("Epoch:", '%04d' % (epoch), "dev loss=", "{:.8f}".format(loss))      


In [76]:
train_model(linearRegression, optimizer, 100, x_data, y_data, x_dev_data, y_dev_data, True)

Epoch: 0000 dev loss= 0.16427095
Epoch: 0010 dev loss= 0.13270222
Epoch: 0020 dev loss= 0.11353304
Epoch: 0030 dev loss= 0.10187028
Epoch: 0040 dev loss= 0.09475305
Epoch: 0050 dev loss= 0.09038949
Epoch: 0060 dev loss= 0.08769500
Epoch: 0070 dev loss= 0.08601298
Epoch: 0080 dev loss= 0.08494583
Epoch: 0090 dev loss= 0.08425264


## 1.5 Measure Accuracy [2pt]

In the code block below, write code to evaluate your model on the test set.

In [77]:
# Your code goes here
def mse(x1, x2):
  diff = x1 - x2
  return torch.sum(diff*diff)/diff.numel()

def evaluate_model(model, x_data, y_data, x_test_data, y_test_data):
  print("=========================================================")
  training_loss = mse(model(x_data), y_data)   
  print("Optimised:", "training loss=", "{:.9f}".format(training_loss.data))
  testing_loss = loss_func(model(x_test_data), y_test_data) 
  print("Testing loss=", "{:.9f}".format(testing_loss.data))
  print("Absolute mean square loss difference:", "{:.9f}".format(abs(
        training_loss.data - testing_loss.data)))

%time evaluate_model(linearRegression, x_data, y_data, x_test_data, y_test_data)

Optimised: training loss= 0.083256312
Testing loss= 0.074223973
Absolute mean square loss difference: 0.009032339
CPU times: user 33.5 ms, sys: 0 ns, total: 33.5 ms
Wall time: 33.8 ms


## 1.6 Analyse the Model [2pt]

In the code block below, write code to identify the 50 words with the highest weights and the 50 words with the lowest weights.

In [78]:
# Extract weights from the linear regression model
weights = linearRegression.weight.detach().numpy().flatten()

# Get indices of the 50 highest and lowest weights
top_50_indices = weights.argsort()[-50:][::-1]
bottom_50_indices = weights.argsort()[:50]

# Map the indices back to the words in the vocabulary
popular_words_items = list(popular_words.items())
top_50_words = [popular_words_items[i] for i in top_50_indices]
bottom_50_words = [popular_words_items[i] for i in bottom_50_indices]

# 2 - Compare Data Storage Methods

This section relates to content from **the week 1 lecture and the week 2 lab**.

Implement a variant of the model with a sparse vector for your input bag of words (See https://pytorch.org/docs/stable/sparse.html for how to switch a vector to be sparse). Use the default sparse vector type (COO).

In [79]:
# Train
x_data_sparse = x_data.to_sparse()
y_data_sparse = torch.tensor(training_lengths).view(-1, 1)

# Dev
x_dev_data_sparse = x_dev_data.to_sparse()
y_dev_data_sparse = torch.tensor(dev_lengths).view(-1, 1)

# Test
x_test_data_sparse = x_test_data.to_sparse()
y_test_data_sparse = torch.tensor(test_lengths).view(-1, 1)

linearRegression_sparse =  nn.Linear(len(x_data_sparse[1]), 1)
optimizer_sparse = torch.optim.SGD(linearRegression_sparse.parameters(), lr=1e-4)

# Define the loss function
loss_func = F.mse_loss
# Calculate loss
loss_sparse = loss_func(linearRegression_sparse(x_data_sparse), y_data_sparse)

## 2.1 Training and Test Speed [2pt]
Compare the time it takes to train and test the new model with the time it takes to train and test the old model.

You can time the execution of a line of code using `%time`.
See [this guide](https://colab.research.google.com/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/01.07-Timing-and-Profiling.ipynb#scrollTo=z1gyaC_PNZUB) for more on timing.

In [80]:
# Define the model
linearRegression_comp =  nn.Linear(len(x_data[1]), 1)
optimizer_comp = torch.optim.SGD(linearRegression.parameters(), lr=1e-4)
loss_func_comp = F.mse_loss
# Calculate loss
loss = loss_func(linearRegression(x_data), y_data)


In [81]:
# Dense model
%time dense_test = train_model(linearRegression_comp, optimizer_comp, 100, x_data, y_data, x_test_data, y_test_data, False)
# Sparse model
%time sparse_test = train_model(linearRegression_comp, optimizer_comp, 100, x_data_sparse, y_data_sparse, x_test_data_sparse, y_test_data_sparse, False)

CPU times: user 5.09 s, sys: 17.8 ms, total: 5.11 s
Wall time: 8.07 s
CPU times: user 3.91 s, sys: 9.38 ms, total: 3.92 s
Wall time: 4.51 s


# 3 - Switch to Word Embeddings

This section relates to content from **the week 2 lecture and the week 3 lab**.

In this section, you will implement a model based on word2vec.

1. Use word2vec to learn embeddings for the words in your data.
2. Represent each input document as the average of the word vectors for the words it contains.
3. Train a linear regression model.

In [82]:
# Your code goes here
from gensim.models import Word2Vec
wv_cbow_model = Word2Vec(sentences=token_docs, size=100, window=5, min_count=10, workers=2, sg=0)
#The size could be tuned

In [83]:
def document_to_mean_vector(doc, model):
    word_vectors = [model.wv[word] for word in doc if word in model.wv]
    if not word_vectors:
        return np.zeros(model.vector_size)
    return np.mean(word_vectors, axis=0)

x_data_w2v = [document_to_mean_vector(doc, wv_cbow_model) for doc in token_docs]
x_data_w2v = torch.tensor(x_data_w2v)

x_dev_data_w2v = [document_to_mean_vector(doc, wv_cbow_model) for doc in token_docs_dev]
x_dev_data_w2v = torch.tensor(x_dev_data_w2v)

x_test_data_w2v = [document_to_mean_vector(doc, wv_cbow_model) for doc in token_docs_test]
x_test_data_w2v = torch.tensor(x_test_data_w2v)


In [84]:
linearRegression_w2v = nn.Linear(x_data_w2v.shape[1], 1)
optimizer_w2v = torch.optim.SGD(linearRegression_w2v.parameters(), lr=1e-4)
loss_func_w2v = F.mse_loss

train_model(linearRegression_w2v, optimizer_w2v, 100, x_data_w2v, y_data, x_dev_data_w2v, y_dev_data, True)

Epoch: 0000 dev loss= 0.33477914
Epoch: 0010 dev loss= 0.32882005
Epoch: 0020 dev loss= 0.32300001
Epoch: 0030 dev loss= 0.31731585
Epoch: 0040 dev loss= 0.31176430
Epoch: 0050 dev loss= 0.30634236
Epoch: 0060 dev loss= 0.30104688
Epoch: 0070 dev loss= 0.29587504
Epoch: 0080 dev loss= 0.29082385
Epoch: 0090 dev loss= 0.28589055


## 3.1 Accuracy [1pt]

Calculate the accuracy of your model.

In [85]:
# Your code goes here
evaluate_model(linearRegression_w2v, x_data_w2v, y_data, x_test_data_w2v, y_test_data)

Optimised: training loss= 0.281960368
Testing loss= 0.266987771
Absolute mean square loss difference: 0.014972597


## 3.2 Speed [1pt]

Calcualte how long it takes your model to be evaluated.

In [86]:
# Your code goes here
# Dense model
%time evaluate_model(linearRegression_w2v, x_data_w2v, y_data, x_test_data_w2v, y_test_data)

Optimised: training loss= 0.281960368
Testing loss= 0.266987771
Absolute mean square loss difference: 0.014972597
CPU times: user 4.42 ms, sys: 0 ns, total: 4.42 ms
Wall time: 4.66 ms


# 4 - Open-Ended Improvement

This section relates to content from **the week 1, 2, and 3 lectures and the week 1, 2, and 3 labs**.

This section is an open-ended opportunity to find ways to make your model more accurate and/or faster (e.g., use WordNet to generalise words, try different word features, other optimisers, etc).

We encourage you to try several ideas to provide scope for comparisons.

If none of your ideas work you can still get full marks for this section. You just need to justify the ideas and discuss why they may not have improved performance.


## 4.1 Ideas and Motivation [1pt]

In **this** box, describe your ideas and why you think they will improve accuracy and/or speed.

*Your answer goes here*
1. Data Pre-processing:  
In Task 1, we didn't perform any significant pre-processing on the data. Implementing a pre-processing approach could potentially improve model performance by reducing noise, such as punctuation, and capturing more meaningful features from the text.

2. Hyperparameter Tuning:  
Employing hyperparameter tuning techniques can help us identify the optimal combination of hyperparameters for our model, leading to improved accuracy and/or faster training times. By fine-tuning these hyperparameters, we can optimize model performance and ensure the most effective use of available resources.

## 4.2 Implementation [2pt]

Implement your ideas

In [87]:

#1. Data Pre-processing:
def preprocess_text(text):
    stop_words = set(sw.words('english'))
    word_tokens = word_tokenize(text)
    lower_tokens = [t.lower() for t in word_tokens] # change all words to lower case.
    filtered_text = [word for word in lower_tokens if word not in stop_words and word.isalnum()] # Remove stop words and punctuation
    return filtered_text

# Preprocess the texts in the datasets
preprocessed_train_token = [preprocess_text(text) for text in training_text]
preprocessed_dev_token = [preprocess_text(text) for text in dev_text]
preprocessed_test_token = [preprocess_text(text) for text in test_text]


Get the word frequency that is more than 10

In [88]:
# Train
preprocessed_DF = get_Frequency(preprocessed_train_token)
preprocessed_popular_words = generate_map(preprocessed_DF)

# Dev
preprocessed_DF_dev = get_Frequency(preprocessed_dev_token)
preprocessed_popular_words_dev = generate_map(preprocessed_DF_dev)

# Test
preprocessed_DF_test = get_Frequency(preprocessed_test_token)
preprocessed_popular_words_test = generate_map(preprocessed_DF_test)

In [89]:
# Train
x_pre_data = text_to_bow(preprocessed_train_token, popular_words)
x_pre_data = torch.Tensor(x_pre_data).to_sparse()
y_data = torch.tensor(training_lengths).view(-1, 1)

# Dev
x_dev_pre_data = text_to_bow(preprocessed_dev_token, popular_words)
x_dev_pre_data = torch.Tensor(x_dev_pre_data).to_sparse()
y_dev_data = torch.tensor(dev_lengths).view(-1, 1)

# Test
x_test_pre_data = text_to_bow(preprocessed_test_token, popular_words)
x_test_pre_data = torch.Tensor(x_test_pre_data).to_sparse()
y_test_data = torch.tensor(test_lengths).view(-1, 1)


The tuning process involves testing multiple learning rates and selecting the one that yields the lowest loss, ensuring that we achieve optimal performance in the future.

In [90]:
def fine_tune(learning_rate, x_data, y_data, x_test, y_test):
  linearRegression =  nn.Linear(len(x_data[1]), 1)
  best_lr = 0
  best_loss = 99999
  for lr in learning_rate:
    temp_lr = lr
    optimizer = torch.optim.SGD(linearRegression.parameters(), lr=temp_lr)
    loss_func = F.mse_loss
    no_of_epochs = 50
    display_interval = 10

    for epoch in range(no_of_epochs):
      predictions = linearRegression(x_data)
      loss = loss_func(predictions, y_data)
      loss.backward()
      optimizer.step() #call step() to automatically update the parameters through our defined optimizer, which can be called once after backward()
      optimizer.zero_grad() #reset the gradient as what we did before     
    training_loss = mse(linearRegression(x_data), y_data)   
    testing_loss = loss_func(linearRegression(x_test_data), y_test_data)
    loss_diff = training_loss.data - testing_loss.data
    if loss_diff < best_loss:
      best_lr = lr
      best_loss = loss_diff
  return best_lr


In [91]:
linearRegression_impro =  nn.Linear(len(x_data[1]), 1)
lr_list = [1e-4, 1e-3, 1e-5]
best_lr = fine_tune(lr_list, x_data, y_data, x_dev_data, y_dev_data)
optimizer_impro = torch.optim.SGD(linearRegression.parameters(), lr=best_lr)
# Define the loss function
loss_func = F.mse_loss

train_model(linearRegression_impro, optimizer_impro, 100, x_pre_data, y_data, x_dev_pre_data, y_dev_data, True)

Epoch: 0000 dev loss= 0.17559277
Epoch: 0010 dev loss= 0.17559277
Epoch: 0020 dev loss= 0.17559277
Epoch: 0030 dev loss= 0.17559277
Epoch: 0040 dev loss= 0.17559277
Epoch: 0050 dev loss= 0.17559277
Epoch: 0060 dev loss= 0.17559277
Epoch: 0070 dev loss= 0.17559277
Epoch: 0080 dev loss= 0.17559277
Epoch: 0090 dev loss= 0.17559277


## 4.3 Evaluation [1pt]

Evaluate the speed and accuracy of the model with your ideas

In [92]:
# Your code goes here
%time improved_test = evaluate_model(linearRegression_impro, x_pre_data, y_data, x_test_pre_data, y_test_data)

Optimised: training loss= 0.176300362
Testing loss= 0.161845624
Absolute mean square loss difference: 0.014454737
CPU times: user 16.9 ms, sys: 1 ms, total: 17.9 ms
Wall time: 15.7 ms


In **this** text box, briefly describe the results. Did your improvement work? Why / Why not?

The improvements implemented in Task 4 did not lead to a consistent increase in performance in terms of evaluation speed and accuracy. The evaluation speed for Task 1, Task 3, and Task 4 were 33.8ms, 4.66ms, and 15.7ms, respectively. While the evaluation speed improved from Task 1 to Task 3, it decreased from Task 3 to Task 4. The accuracy, as measured by the absolute mean square loss, was 0.009032339 for Task 1, 0.014972597 for Task 3, and 0.014454737 for Task 4. The accuracy worsened from Task 1 to Task 3 and remained relatively consistent from Task 3 to Task 4.

The improvements may not have worked as intended due to the choice of hyperparameters or data pre-processing techniques. These changes could have unintentionally introduced noise or failed to capture meaningful patterns in the data. Additionally, the choice of a smaller learning rate may have impacted the model's convergence rate, leading to a decrease in evaluation speed.
