
**Part 1: Transformers
Task 1 (30 points): In this task you should work with the Facebook BART model
(https://huggingface.co/docs/transformers/en/model_doc/bart) to provide text summarization
of news articles. Text summarization in Natural Language Processing (NLP) is a technique that
breaks down long texts into sentences or paragraphs, while retaining the text's meaning and
extracting important information. Pick any one dataset of your choice.**

**1. Provide a description of the dataset you selected. Split your data into train-test set with
a (90-10) split.**

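I'll be using the XSum dataset (`xsum`) from Hugging Face, which pairs BBC news articles (`document`) with short one-sentence summaries (`summary`). I was originally going to use multi-news, but its documents and summaries are so long that padding them exhausted the available RAM and crashed the session.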

In [None]:
import numpy as np
import pandas as pd

In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset('xsum', split='train')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
print(dataset)
print(dataset[0])

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 204045
})


In [None]:
# Longest summary, in whitespace-separated words
maxlen = 0
for i in range(len(dataset)):
  maxlen = max(maxlen, len(dataset[i]['summary'].split()))
print(maxlen)

70


In [None]:
# Longest document, in whitespace-separated words
maxlen = 0
for i in range(len(dataset)):
  maxlen = max(maxlen, len(dataset[i]['document'].split()))
print(maxlen)

29189


I keep running out of RAM, so in a bid to reduce the RAM usage, I'll take it down to 10000 data points, which should still be enough for a good test.
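
The cells below keep the first 10000 rows. As a possible alternative, here is a minimal sketch of drawing a random 10000-row subset and splitting it 90-10 with a fixed seed, so the split is reproducible. The helper name `subsample_and_split` and its parameters are hypothetical and not part of the `datasets` library.

```python
import numpy as np

def subsample_and_split(n_total, n_keep, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) drawn at random from range(n_total)."""
    rng = np.random.default_rng(seed)
    # Sample n_keep distinct row indices; the result is in random order.
    kept = rng.choice(n_total, size=n_keep, replace=False)
    n_test = int(round(n_keep * test_fraction))
    # Because `kept` is already shuffled, slicing yields a random split.
    return kept[n_test:], kept[:n_test]

train_idx, test_idx = subsample_and_split(n_total=204045, n_keep=10000)
print(len(train_idx), len(test_idx))  # 9000 1000
```

The index arrays could then be passed to `dataset.select(...)` to build the two subsets.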

In [None]:
trainingdata = dataset.select(range(10000))

In [None]:
print(trainingdata)

Dataset({
    features: ['document', 'summary', 'id'],
    num_rows: 10000
})


In [None]:
splitdata = trainingdata.train_test_split(test_size=0.1)

**2. Load the model from Hugging Face’s Transformers library and write its training script.**

This should be easy enough.

I checked the notebook on the BART documentation for a guide to work with.

In [None]:
! pip install datasets evaluate transformers rouge-score nltk



In [None]:
from transformers import AutoTokenizer, BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")



I'm getting input length errors, so I will have to truncate the inputs the way the example notebook does. Looking more closely at the documentation, it seems my inputs were longer than the 1024 positions that BART's position embeddings support.
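
To make the effect of truncation and padding concrete, here is a toy sketch that works on plain lists of token ids. `truncate_and_pad` is a hypothetical helper, not the tokenizer's implementation: the real tokenizer also keeps the end-of-sequence token when it truncates, which this simplification ignores. The pad id of 1 matches BART's pad token.

```python
def truncate_and_pad(batch, max_length, pad_id=1):
    """Toy stand-in for truncation=True, padding=True on lists of token ids."""
    # Truncation: cut every sequence down to at most max_length ids.
    truncated = [ids[:max_length] for ids in batch]
    # Padding: extend every sequence to the longest one left in the batch.
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

example = truncate_and_pad([[0, 5, 6, 7, 2], [0, 8, 2]], max_length=4)
print(example["input_ids"])       # [[0, 5, 6, 7], [0, 8, 2, 1]]
print(example["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Since every row is padded to the longest row in its batch, one very long document inflates memory for the whole batch, which is why capping the length helps with RAM as well.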

In [None]:
max_input_length = 1024
max_target_length = 128

def preprocess_function(examples):
    inputs = [doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, padding=True)

    # Setup the tokenizer for targets
    labels = tokenizer(text_target=examples["summary"], max_length=max_target_length, truncation=True, padding=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

In [None]:
preprocess_function(splitdata['train'][:2])

{'input_ids': [[0, 38268, 27742, 17034, 6, 5659, 6, 362, 5, 418, 31, 988, 17351, 4405, 11, 902, 1824, 6, 584, 24, 74, 28, 28360, 14, 1035, 4, 50118, 1708, 24, 21, 129, 1835, 71, 249, 880, 41, 803, 5, 220, 76, 4, 50118, 133, 320, 3299, 11, 381, 3624, 2596, 122, 2419, 145, 2322, 160, 5, 1131, 5124, 4, 50118, 133, 1679, 174, 69, 89, 21, 410, 5, 4354, 115, 109, 7, 15392, 69, 55, 87, 5, 285, 28689, 9, 17902, 5, 22, 18880, 6999, 9, 2416, 845, 50118, 2515, 2641, 3526, 30, 2134, 9, 69, 774, 4, 50118, 133, 1679, 174, 69, 5, 8637, 24939, 10, 316, 353, 22790, 2617, 3645, 6, 53, 24, 74, 28, 3456, 13, 80, 107, 576, 5, 9297, 4215, 9, 5, 403, 4, 50118, 133, 461, 6, 2828, 11, 5818, 28779, 6, 21, 174, 14, 5, 1802, 6, 988, 17351, 4405, 6, 21, 3606, 31, 11520, 18, 8, 5, 12571, 21, 2542, 9, 39, 22010, 2536, 8, 2166, 474, 77, 79, 553, 123, 13, 5, 2541, 4, 50118, 2515, 439, 7, 39, 184, 7, 22, 4970, 123, 13, 10, 5976, 113, 7, 244, 69, 66, 9, 69, 613, 9282, 8, 37, 1507, 7, 1203, 10, 5851, 3407, 13, 984, 698, 

In [None]:
tokenized_datasets = splitdata.map(preprocess_function, batched=True)

Map:   0%|          | 0/9000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

In [None]:
model_checkpoint = "facebook/bart-large-cnn"

**3. Fine tune the pre-trained model with your data and report results on your test set. You
must report the BLEU and ROUGE Scores. (See the code provided in class for more
details)**

Once again, the notebook on the BART documentation is very helpful here.

In [None]:
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

In [None]:
batch_size = 16
model_name = model_checkpoint.split("/")[-1]
args = Seq2SeqTrainingArguments(
    f"{model_name}-finetuned-xsum",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
    #push_to_hub=True,
)



In [None]:
!pip install evaluate



In [None]:
from evaluate import load


In [None]:
!pip install rouge-score



In [None]:
metricr = load("rouge")
metricb = load("bleu")

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

In [None]:
import nltk
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels as we can't decode them.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Rouge expects a newline after each sentence
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]

    # Note that other metrics may not have a `use_aggregator` parameter
    # and thus will return a list, computing a metric for each sentence.
    resultr = metricr.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True, use_aggregator=True)
    resultb = metricb.compute(predictions=decoded_preds, references=decoded_labels)
    # Extract a few results
    resultr = {key: value * 100 for key, value in resultr.items()}
    resultb = {key: value * 100 for key, value in resultb.items()}

    # Add mean generated length
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    resultr["gen_len"] = np.mean(prediction_lens)
    rouge = {k: round(v, 4) for k, v in resultr.items()}
    rouge["bleu"] = resultb["bleu"]
    return rouge

In [None]:
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [None]:
trainer.train()

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: Currently logged in as: tyler-thraves (tyler-thraves-rensselaer-polytechnic-institute) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


I keep running into RAM issues, so I'll talk to the teacher to see what the best way to deal with that is.

Sounds like I won't get docked for RAM issues, so I'm going to move on.

**4. Analyze your results and discuss the impact of hyperparameters. Are your results
impacted by the choice of the LLM here? How?**

If I had been able to run the code to completion, I would expect fine-tuning to produce better ROUGE and BLEU scores than using BART as is. Hyperparameters matter because values such as the learning rate, the weight decay (the regularization penalty), the batch size, and the number of epochs determine how quickly the model converges and how well it generalizes. Training for more epochs might give a similar improvement, but it risks overfitting. My results would likely also be affected by the choice of LLM: the more similar a model's original training data is to the news data used for fine-tuning, the faster it should converge, while a model trained on less similar data would likely benefit more from fine-tuning.
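
For intuition about what the ROUGE score measures, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) in plain Python. It is a simplification rather than the `rouge-score` implementation used above: it splits on whitespace and does no stemming, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a reference summary."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

# 5 of the 6 words overlap, so precision = recall = F1 = 5/6.
print(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"))
```

BLEU is also based on n-gram overlap, but it is precision-oriented and adds a penalty for predictions that are shorter than the reference.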

**Task 2 (20 points): We discussed how we can formulate RL problems as an MDP. Describe any
real-world application that can be formulated as an MDP. Describe the state space, action
space, transition model, and rewards for that problem. You do not need to be precise in the
description of the transition model and reward (no formula is needed). Qualitative description
is enough.**

A self-driving car could, in theory, be trained using reinforcement learning. Although it is hard to describe such a problem with a finite set of states, the state could include variables such as: how close the cars in the left and right lanes are, the car's current lane, its position within that lane, how close the cars ahead and behind are, the car's speed and acceleration, and the state of the traffic light. The action space would include turning the turn signals on or off, how far the gas pedal is pressed, how far the brake is pressed, the angle of the steering wheel, and the gear selection.

As for the transition model, the next position of the car would be determined by the velocity/acceleration as well as the angle of the wheel.

To determine the reward for a state, the car would get a higher reward the closer it is to its destination and receive a large penalty for breaking traffic laws, such as speeding or changing lanes without signaling, and an especially large one for crashing.

In theory, the car would learn how to approach its destination efficiently without breaking any laws. It would likely require a lot of training, and doing so safely would require a simulation rather than testing in the field.
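
The pieces of this MDP can be sketched in code. This is a heavily simplified, hypothetical version: a single lane, a stationary vehicle ahead, and one acceleration command as the only action. All names and constants (`CarState`, `SPEED_LIMIT`, the penalty sizes) are illustrative assumptions, and the reward uses progress toward the destination as a stand-in for "closer is better".

```python
from dataclasses import dataclass

SPEED_LIMIT = 15.0  # m/s, assumed for illustration

@dataclass(frozen=True)
class CarState:
    position: float   # metres travelled along the road
    speed: float      # metres per second
    gap_ahead: float  # metres to the (stationary) vehicle in front

@dataclass(frozen=True)
class CarAction:
    acceleration: float  # m/s^2; negative values mean braking

def transition(state: CarState, action: CarAction, dt: float = 1.0) -> CarState:
    """Deterministic toy dynamics: update speed, then move forward."""
    speed = max(0.0, state.speed + action.acceleration * dt)
    travelled = speed * dt
    return CarState(state.position + travelled, speed, state.gap_ahead - travelled)

def reward(state: CarState, next_state: CarState) -> float:
    """Reward progress; penalise speeding and, far more heavily, crashing."""
    value = next_state.position - state.position
    if next_state.speed > SPEED_LIMIT:
        value -= 10.0
    if next_state.gap_ahead <= 0.0:
        value -= 1000.0
    return value

start = CarState(position=0.0, speed=10.0, gap_ahead=50.0)
after = transition(start, CarAction(acceleration=2.0))
print(after, reward(start, after))
```

A realistic transition model would be stochastic, because the other cars and the traffic lights change independently of the agent's actions.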

**Task 3 (20 points): RL is used in various sectors - Healthcare, recommender systems and trading
are a few of those. Pick one of the three areas. Explain one of the problems in any of these
domains that can be more effectively solved by reinforcement learning. Find an open-source
project (if any) that has addressed this problem. Explain this project in detail.**

Reinforcement learning can help with the cold-start problem in recommender systems. The state would consist of variables describing the candidate items, and the action would be the choice of which items to recommend. If those items are clicked on more because they were recommended, the system gets a small reward. If the extra clicks lead to more purchases, the system gets a larger reward. By maximizing the expected increase in clicks, the system learns to prioritize items that most users will like.

I haven't found any open-source projects that apply reinforcement learning to recommender systems. I did, however, find a paper on the topic, https://arxiv.org/pdf/2108.09141v1, which leads me to believe that no open-source project exists yet for a topic this specific. I will therefore explain the model from the paper instead.

The state for each item includes how often it has been viewed, how long it has been available, and how many sales it has had, as well as properties of the users associated with the item. The action is to assign a score to each item, with higher-scoring items being more likely to be recommended. The reward is the total number of views the item gets in the next step, $IPV_{t+1}$, divided by the views caused by recommendation in the current step, $PV_{rec,t}$. By maximizing the sum of the rewards over all items, the system aims to maximize overall page views. Thus, it will recommend to new users the items that it expects to generate the most views overall.
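
As a rough illustration of the reward described above, here is a minimal sketch that follows my summary rather than the paper's exact formulation. The function names are hypothetical, and returning 0 when there were no recommendation-driven views is my own assumption to avoid dividing by zero.

```python
def page_view_reward(ipv_next: float, pv_rec_now: float) -> float:
    """Views in the next step divided by recommendation-driven views now."""
    if pv_rec_now <= 0:
        return 0.0  # assumption: no recommendation exposure means no reward
    return ipv_next / pv_rec_now

def total_reward(items) -> float:
    """Sum the per-item rewards over (ipv_next, pv_rec_now) pairs."""
    return sum(page_view_reward(ipv, pv) for ipv, pv in items)

# Two items: 120 views after 40 recommended views, 10 views after 5.
print(total_reward([(120, 40), (10, 5)]))  # 5.0
```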



**Task 5 (30 points): For this task use the MovieLens 100k dataset
(https://grouplens.org/datasets/movielens/100k/)
Perform the necessary data cleaning, EDA and conversion to User-item matrix.
Implement any 2 collaborative filtering recommendation systems (RecSys) algorithms covered
in class (Matrix Factorization, Alternating Least Squares, NCF etc.) and compare their
performance for any 2-evaluation metrics used for RecSys. You may read literature to find out
which evaluation metrics are used for RecSys. Cite all your research.**

First things first, I need to get the data.

In [1]:
!wget http://files.grouplens.org/datasets/movielens/ml-100k.zip -P /content/

--2025-03-26 23:13:20--  http://files.grouplens.org/datasets/movielens/ml-100k.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4924029 (4.7M) [application/zip]
Saving to: ‘/content/ml-100k.zip’


2025-03-26 23:13:20 (23.8 MB/s) - ‘/content/ml-100k.zip’ saved [4924029/4924029]



In [2]:
! unzip ml-100k.zip

Archive:  ml-100k.zip
   creating: ml-100k/
  inflating: ml-100k/allbut.pl       
  inflating: ml-100k/mku.sh          
  inflating: ml-100k/README          
  inflating: ml-100k/u.data          
  inflating: ml-100k/u.genre         
  inflating: ml-100k/u.info          
  inflating: ml-100k/u.item          
  inflating: ml-100k/u.occupation    
  inflating: ml-100k/u.user          
  inflating: ml-100k/u1.base         
  inflating: ml-100k/u1.test         
  inflating: ml-100k/u2.base         
  inflating: ml-100k/u2.test         
  inflating: ml-100k/u3.base         
  inflating: ml-100k/u3.test         
  inflating: ml-100k/u4.base         
  inflating: ml-100k/u4.test         
  inflating: ml-100k/u5.base         
  inflating: ml-100k/u5.test         
  inflating: ml-100k/ua.base         
  inflating: ml-100k/ua.test         
  inflating: ml-100k/ub.base         
  inflating: ml-100k/ub.test         


In [3]:
import pandas as pd

Fortunately, the data is already divided into train and test.

In [4]:
train = pd.read_csv("ml-100k/ua.base", sep = "\t", header = None, names = ["User", "Item", "Rating", "Timestamp"])
test = pd.read_csv("ml-100k/ua.test", sep = "\t", header = None, names = ["User", "Item", "Rating", "Timestamp"])

The guide linked on lesson 18 should be very helpful for this.

In [5]:
import numpy as np

In [6]:
total = pd.read_csv("ml-100k/u.data", sep = "\t", header = None, names = ["User", "Item", "Rating", "Timestamp"])

In [7]:
# Build the 943 x 1682 user-item matrix; 0 marks an unrated pair (IDs are 1-based)
data = np.zeros((943, 1682))
data[train["User"].to_numpy() - 1, train["Item"].to_numpy() - 1] = train["Rating"].to_numpy()

In [None]:
print(data)

[[5. 3. 4. ... 0. 0. 0.]
 [4. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [5. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 5. 0. ... 0. 0. 0.]]


I'm going to start by implementing alternating least squares.

Random initialization

In [12]:
Users = np.random.rand(943, 1)
Items = np.random.rand(1682, 1)

In [9]:
epochs = 10

I got a singular matrix error, so I'll add a small regularization term (0.01) to the left-hand side of each linear system so that its determinant cannot be 0.
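
For reference, the regularized update behind that fix can be sketched for a general number of latent factors. This is a minimal sketch, not the exact code used below: the function name `als` and the parameters `rank`, `reg`, and `seed` are assumptions, and the regularizer is added as `reg * I` on the diagonal, which matches adding a scalar only when the rank is 1.

```python
import numpy as np

def als(ratings, rank=2, reg=0.01, epochs=10, seed=0):
    """Alternating least squares on a matrix where 0 means 'unrated'."""
    rng = np.random.default_rng(seed)
    n_users, n_items = ratings.shape
    users = rng.random((n_users, rank))
    items = rng.random((n_items, rank))
    observed = ratings > 0
    ridge = reg * np.eye(rank)  # keeps every system non-singular
    for _ in range(epochs):
        # Fix the item factors; solve a ridge regression for each user.
        for u in range(n_users):
            V = items[observed[u]]
            users[u] = np.linalg.solve(V.T @ V + ridge, V.T @ ratings[u, observed[u]])
        # Fix the user factors; solve a ridge regression for each item.
        for i in range(n_items):
            U = users[observed[:, i]]
            items[i] = np.linalg.solve(U.T @ U + ridge, U.T @ ratings[observed[:, i], i])
    return users, items

toy = np.array([[5.0, 3.0, 0.0], [4.0, 0.0, 1.0], [0.0, 2.0, 5.0]])
users, items = als(toy, rank=2)
print(users @ items.T)  # predicted ratings, including the unrated cells
```

With `rank=1` this reduces to the scalar version in the training loop: each user and each item is described by a single number.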

Training loop

In [13]:
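# Small diagonal term that keeps each system non-singular (same as + 0.01 for one factor)
ridge = 0.01 * np.eye(Users.shape[1])
for k in range(epochs):
  # Fix the item factors and solve a least-squares problem for each user
  for i in range(943):
    RelevantItems = Items[data[i, :] > 0]
    RelevantRatings = data[i, data[i, :] > 0]
    Users[i] = np.linalg.solve(np.matmul(RelevantItems.T, RelevantItems) + ridge, np.matmul(RelevantItems.T, RelevantRatings))
  # Fix the user factors and solve a least-squares problem for each item
  for j in range(1682):
    RelevantUsers = Users[data[:, j] > 0]
    RelevantRatings = data[data[:, j] > 0, j]
    Items[j] = np.linalg.solve(np.matmul(RelevantUsers.T, RelevantUsers) + ridge, np.matmul(RelevantUsers.T, RelevantRatings))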

In [14]:
print(Users[0] * Items.T)

[[3.96310401 3.35242078 3.16933    ... 2.15912876 3.57353495 3.2452953 ]]


In [15]:
testdata = np.zeros((943, 1682))
for i in range(len(test)):
  testdata[test["User"][i] - 1][test["Item"][i] - 1] = test["Rating"][i]

For the first metric I will use Mean Absolute Error (MAE), as described here https://neptune.ai/blog/recommender-systems-metrics (under predictive metrics). I chose it for its popularity and ease of implementation.
To compute it, I will mask out every entry that is not in the test set, take the absolute difference between the two matrices, and divide the sum of those differences by the number of test ratings.

In [16]:
testmask = testdata > 0

In [17]:
print(testmask)

[[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]


In [18]:
predictions = np.multiply(Users, Items.T)

In [19]:
print(predictions)

[[3.96310401 3.35242078 3.16933    ... 2.15912876 3.57353495 3.2452953 ]
 [3.974812   3.36232466 3.17869299 ... 2.16550736 3.58409206 3.25488271]
 [3.38916397 2.8669204  2.71034498 ... 1.84644193 3.05601263 2.77530892]
 ...
 [4.11108109 3.47759575 3.28766861 ... 2.23974778 3.70696604 3.36647035]
 [4.35985271 3.68803362 3.48661352 ... 2.37528043 3.93128367 3.57018375]
 [3.80312983 3.21709737 3.04139721 ... 2.07197363 3.42928608 3.11429611]]


In [20]:
predictions[~testmask] = 0

In [21]:
print(predictions)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [22]:
sumerror = np.abs(testdata - predictions)

In [23]:
print(np.sum(sumerror) / len(test))

0.7624951778387719


Thus, the mean absolute error for this system is 0.762495...

I think that's not bad, but there's not exactly a fixed "good" score for recommender systems.

Next up, I will use Root Mean Squared Error (RMSE), also described here https://neptune.ai/blog/recommender-systems-metrics, and chosen for similar reasons. Instead of simply averaging the absolute errors, the errors are squared and averaged, and then the square root of that average is taken.
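
Both metrics can also be computed directly on the rated entries, without zeroing out the rest of the prediction matrix. A minimal sketch, where `masked_mae_rmse` is a hypothetical helper and 0 is assumed to mean "not rated":

```python
import numpy as np

def masked_mae_rmse(truth, predicted):
    """MAE and RMSE over the entries of `truth` that hold a rating (> 0)."""
    mask = truth > 0
    errors = truth[mask] - predicted[mask]
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    return mae, rmse

truth = np.array([[5.0, 0.0], [0.0, 3.0]])
predicted = np.array([[4.0, 2.0], [1.0, 5.0]])
# Only the two rated cells count: the errors are 1 and -2.
print(masked_mae_rmse(truth, predicted))  # MAE 1.5, RMSE sqrt(2.5)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and penalizes large misses more heavily.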

In [24]:
squareerror = np.square(sumerror)

In [25]:
squaresum = np.sum(squareerror)

In [26]:
averagesquare = squaresum / len(test)

In [27]:
print(np.sqrt(averagesquare))

0.9684014930777556


Thus, the RMSE for this system is 0.9684....

Again, I'm not really sure what a "good" score is, but this doesn't seem that bad.

The second method I will implement is matrix factorization using SVD. This will be done through scipy, loosely following the guide in lesson 18. Since the data is the same as for alternating least squares, I'll reuse that matrix. I do have to demean the data first, though, by subtracting each user's mean rating.

In [28]:
meanratings = np.mean(data, axis = 1)

In [29]:
demeaneddata = data - meanratings.reshape(-1, 1)

The guide linked in lesson 18 uses a k of 50. After doing some tinkering, I settled on 25 as giving the best error.

In [214]:
#SVD
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(demeaneddata, k = 25)

In [215]:
sigma = np.diag(sigma)

In [216]:
print(sigma)

[[ 67.84962381   0.           0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.        ]
 [  0.          68.20853197   0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.        ]
 [  0.           0.          68.99264932   0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.
    0.           0.           0.           0.           0.        ]
 [  0.           0.           0.          69.77810571   0.
    0.           0.          

In [217]:
predictions = np.dot(np.dot(U, sigma), Vt) + meanratings.reshape(-1, 1)

In [218]:
print(predictions)

[[ 4.81567664e+00  1.80406737e+00  1.56115694e+00 ...  4.64763540e-03
   2.49179618e-02  1.11197331e-01]
 [ 1.83788335e+00 -6.77260181e-02 -2.47645699e-02 ... -1.74749747e-02
  -3.67621913e-02 -8.12254074e-02]
 [ 1.92254647e-02 -7.40483882e-03  1.74401980e-01 ...  2.86203585e-02
   5.96692766e-03  8.30478501e-03]
 ...
 [ 7.97948938e-01  3.25875040e-02  1.59352480e-01 ...  1.21644496e-02
   2.00727177e-02  1.53032014e-03]
 [ 1.52486576e+00  3.84762788e-01 -1.20639371e-01 ...  3.72313504e-02
   1.73374557e-02  7.53984744e-03]
 [ 1.73397908e+00  1.94265917e+00  1.34257934e+00 ... -2.85129329e-02
  -1.41732166e-02  8.23417030e-03]]


I've already got the test data and mask from ALS, so I'll reuse them for testing MAE.

In [219]:
predictions[~testmask] = 0

In [220]:
sumerror = np.abs(testdata - predictions)

In [221]:
print(np.sum(sumerror) / len(test))

2.5608889401451504


Ok, clearly this doesn't seem as good as Alternating Least Squares, but let's see if it does better on RMSE.

In [222]:
squareerror = np.square(sumerror)

In [223]:
squaresum = np.sum(squareerror)

In [224]:
averagesquare = squaresum / len(test)

In [225]:
print(np.sqrt(averagesquare))

2.838224471591872


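OK, the RMSE is also worse than for ALS. I think the reason is that SVD has no way to account for the missing entries and treats them as if they were actual ratings of 0. The user means are also computed over all 1682 items, unrated zeros included, so they sit close to 0. As a result, the reconstruction is pulled toward 0 and ends up far from the test ratings.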
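
One way to test that explanation would be to compute each user's mean from the rated entries only and to leave unrated cells at 0 after centering, so that they sit at the user's mean instead of at a rating of 0. The sketch below is an untested idea rather than something I ran on MovieLens. It uses `np.linalg.svd` with manual truncation instead of `scipy.sparse.linalg.svds`, and the function name is hypothetical.

```python
import numpy as np

def svd_predict_observed_mean(ratings, k):
    """Rank-k SVD prediction that centres each user on their rated items only."""
    observed = ratings > 0
    counts = np.maximum(observed.sum(axis=1), 1)  # avoid dividing by zero
    user_means = ratings.sum(axis=1) / counts     # unrated cells add 0 to the sum
    # Centre rated cells; unrated cells stay at 0, i.e. at the user's mean.
    centered = np.where(observed, ratings - user_means[:, None], 0.0)
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k]
    return low_rank + user_means[:, None]

toy = np.array([[5.0, 0.0, 4.0], [1.0, 2.0, 0.0]])
# With k equal to the full rank, rated cells are reproduced exactly and
# unrated cells fall back to the user's mean (4.5 and 1.5).
print(svd_predict_observed_mean(toy, k=2))
```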