# Collaborative filtering project

In this project, the task is to create a paper recommendation system. The system consists of 10,000 scientists and 1,000 papers. Scientists give ratings between 1–5 to the papers that they read. Since not all scientists have read every paper, we only have a limited amount of observations of these ratings. Additionally, each scientist has a wishlist of papers that they would like to read in the future. Your task is to fill in the missing observations using the provided rating and wishlist data, such that we can recommend papers to scientists that we expect them to rate highly.

More specifically, there are three data sources:
 - `train_tbr.csv` containing wishlist data.
 - `train_ratings.csv` containing observed rating data.
 - `sample_submission.csv` containing (scientist, paper) pairs that have to be rated for the evaluation of your method.

The data is available at `/cluster/courses/cil/collaborative_filtering/data` and an environment has been prepared for you at `/cluster/courses/cil/envs/collaborative_filtering`. You can activate the environment in your shell by running:
```bash
conda activate /cluster/courses/cil/envs/collaborative_filtering
```
If you wish to use notebooks on the cluster, you need to set the Environment path to `/cluster/courses/cil/envs/collaborative_filtering/bin` and load the `cuda/12.6` module.

**Evaluation**: Your models are evaluated using the root mean-squared error (RMSE) metric. Your grade is determined by a linear interpolation between the easy (grade 4) and hard (grade 6) baselines.

**Rules**: You are only allowed to use the data provided in `train_tbr.csv` and `train_ratings.csv` to make your predictions of `sample_submission.csv`. You are not allowed to use external data sources. But, you are allowed to use pre-trained models, as long as they are available publicly. Furthermore, no external API calls are allowed, except for downloading the weights of pre-trained models.

**We will verify your code for plagiarism and using solutions from previous years.**

[Link to Kaggle competition](https://www.kaggle.com/competitions/ethz-cil-collaborative-filtering-2025)


In [1]:
from typing import Tuple, Callable

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
import os
from sklearn.model_selection import KFold
import copy

In [2]:

print("CUDA available:", torch.cuda.is_available())
print("GPU name:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "None")


CUDA available: True
GPU name: NVIDIA GeForce RTX 4070


Make sure that results are reproducible by using a seed.

In [3]:
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

Using: cuda


In [5]:
class SVDpp(nn.Module):
    def __init__(self, num_scientists: int = 10000, num_papers: int = 10000, emb_dim: int = 32, s2p: dict = dict(), global_mean: torch.float32 = 3.82, s2w: dict = dict()):
        super().__init__()
        self.emb_dim = emb_dim
        self.s2p = s2p
        self.s2w = s2w

        # embeddings for scientists and papers
        self.scientist_factors = nn.Embedding(num_scientists, emb_dim)
        self.paper_factors = nn.Embedding(num_papers, emb_dim)
        self.scientist_bias = nn.Embedding(num_scientists, 1)
        self.paper_bias = nn.Embedding(num_papers, 1)
        self.implicit_factors = nn.Embedding(num_papers, emb_dim)
        self.implicit_wishlist = nn.Embedding(num_papers, emb_dim)

        # global average rating - TODO: maybe come up with smth better
        self.global_bias = nn.Parameter(torch.tensor([global_mean]), requires_grad=False)

        # init weights - TODO: not tuned rn
        nn.init.normal_(self.scientist_factors.weight, std=0.1)
        nn.init.normal_(self.paper_factors.weight, std=0.1)
        nn.init.normal_(self.implicit_factors.weight, std=0.1)
        nn.init.normal_(self.implicit_wishlist.weight, std=0.1)
        nn.init.constant_(self.scientist_bias.weight, 0.0)
        nn.init.constant_(self.paper_bias.weight, 0.0)

    def forward(self, scientist_ids, paper_ids):
        # latent factors and biases for current batch
        scientist_embeddings = self.scientist_factors(scientist_ids)
        paper_embeddings = self.paper_factors(paper_ids)
        # squeeze to remove extra dim
        scientist_biases = self.scientist_bias(scientist_ids).squeeze()
        paper_biases = self.paper_bias(paper_ids).squeeze()

        papers = [self.s2p.get(k, []) for k in scientist_ids]

        # implicit feedback from rated papers
        implicit_embeds = []
        for sp in papers:
            if len(sp) > 0:
                y_j = self.implicit_factors(torch.tensor(sp, device=scientist_ids.device))
                sum_yj = y_j.sum(dim=0)
                norm_yj = sum_yj / torch.sqrt(torch.tensor(len(sp), dtype=torch.float, device=scientist_ids.device))
            else:
                norm_yj = torch.zeros_like(scientist_embeddings[0])
            implicit_embeds.append(norm_yj)
        y_u = torch.stack(implicit_embeds)


        # implicit feedback from wishlist papers
        wishlist = [self.s2w.get(k, []) for k in scientist_ids]

        implicit_embeds_wl = []
        for w in wishlist:
            if len(w) > 0:
                y_j_wl = self.implicit_wishlist(torch.tensor(w, device=scientist_ids.device))
                sum_yj_wl = y_j_wl.sum(dim=0)
                norm_yj_wl = sum_yj_wl / torch.sqrt(torch.tensor(len(w), dtype=torch.float, device=scientist_ids.device))
            else:
                norm_yj_wl = torch.zeros_like(scientist_embeddings[0])
            implicit_embeds_wl.append(norm_yj_wl)
        y_u_wl = torch.stack(implicit_embeds_wl)

        # dot product for interaction
        interaction = ((scientist_embeddings + y_u + y_u_wl)  * paper_embeddings).sum(dim=1)

        # predict ratings
        predicted_ratings = interaction + scientist_biases + paper_biases + self.global_bias
        return predicted_ratings


In [6]:
def contrastive_loss(pos_scores, neg_scores, margin=1.0):
    """
    Contrastive loss that pushes pos_scores up and neg_scores down.
    """
    # Want pos_scores >> neg_scores, so hinge loss on margin
    loss = F.relu(margin - pos_scores + neg_scores)
    return loss.mean()


## Helper functions

In [7]:
#DATA_DIR = "/cluster/courses/cil/collaborative_filtering/data"
DATA_DIR = r"C:\Users\loris\OneDrive\ETH\Group Project"


def read_data_df() -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Reads in data and splits it into training and validation sets with a 75/25 split."""
    df = pd.read_csv(os.path.join(DATA_DIR, "train_ratings.csv"))
    # Split sid_pid into sid and pid columns
    df[["sid", "pid"]] = df["sid_pid"].str.split("_", expand=True)
    df = df.drop("sid_pid", axis=1)
    df["sid"] = df["sid"].astype(int)
    df["pid"] = df["pid"].astype(int)
    train_df, valid_df = train_test_split(df, test_size=0.25)
    global_mean = torch.tensor(np.mean(train_df.rating.values), dtype=torch.float32)
    scientist2papers = df.groupby("sid")["pid"].apply(list).to_dict()
    return train_df, valid_df, scientist2papers, global_mean

def read_data_df_full() -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Reads in data and splits it into training and validation sets with a 75/25 split."""
    df = pd.read_csv(os.path.join(DATA_DIR, "train_ratings.csv"))
    # Split sid_pid into sid and pid columns
    df[["sid", "pid"]] = df["sid_pid"].str.split("_", expand=True)
    df = df.drop("sid_pid", axis=1)
    df["sid"] = df["sid"].astype(int)
    df["pid"] = df["pid"].astype(int)
    global_mean = torch.tensor(np.mean(df.rating.values), dtype=torch.float32)
    scientist2papers = df.groupby("sid")["pid"].apply(list).to_dict()
    return df, scientist2papers, global_mean

def read_wishlist_dict() -> dict:
    wishlist = pd.read_csv(os.path.join(DATA_DIR, "train_tbr.csv"))
    wishlist["sid"] = wishlist["sid"].astype(int)
    wishlist["pid"] = wishlist["pid"].astype(int)
    scientist2wishlist = wishlist.groupby("sid")["pid"].apply(list).to_dict()
    return scientist2wishlist


def read_data_matrix(df: pd.DataFrame) -> np.ndarray:
    """Returns matrix view of the training data, where columns are scientists (sid) and
    rows are papers (pid)."""

    return df.pivot(index="sid", columns="pid", values="rating").values


def evaluate(valid_df: pd.DataFrame, pred_fn: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> float:
    """
    Inputs:
        valid_df: Validation data, returned from read_data_df for example.
        pred_fn: Function that takes in arrays of sid and pid and outputs their rating predictions.

    Outputs: Validation RMSE
    """

    preds = pred_fn(valid_df["sid"].values, valid_df["pid"].values)
    return root_mean_squared_error(valid_df["rating"].values, preds)


def make_submission(pred_fn: Callable[[np.ndarray, np.ndarray], np.ndarray], filename: os.PathLike):
    """Makes a submission CSV file that can be submitted to kaggle.

    Inputs:
        pred_fn: Function that takes in arrays of sid and pid and outputs a score.
        filename: File to save the submission to.
    """

    df = pd.read_csv(os.path.join(DATA_DIR, "sample_submission.csv"))

    # Get sids and pids
    sid_pid = df["sid_pid"].str.split("_", expand=True)
    sids = sid_pid[0]
    pids = sid_pid[1]
    sids = sids.astype(int).values
    pids = pids.astype(int).values

    df["rating"] = pred_fn(sids, pids)
    df.to_csv(filename, index=False)

def impute_values(mat: np.ndarray) -> np.ndarray:
    return np.nan_to_num(mat, nan=3.0)

def get_dataset(df: pd.DataFrame) -> torch.utils.data.Dataset:
    """Conversion from pandas data frame to torch dataset."""
    sids = torch.from_numpy(df["sid"].to_numpy())
    pids = torch.from_numpy(df["pid"].to_numpy())
    ratings = torch.from_numpy(df["rating"].to_numpy()).float()
    return torch.utils.data.TensorDataset(sids, pids, ratings)


Train best model

In [8]:
train_df, valid_df, scientist2papers, global_mean = read_data_df()
train_mat = read_data_matrix(train_df)
train_mat = impute_values(train_mat)

scientist2wishlist = read_wishlist_dict()


In [9]:
# Define model (10k scientists, 1k papers, 32-dimensional embeddings) and optimizer
#model = EmbeddingDotProductModel(10_000, 1_000, 32).to(device)
#optim = torch.optim.Adam(model.parameters(), lr=1e-3)

emb_dim = 64
lr = 6e-4
wd = 4e-5

model = SVDpp(emb_dim=emb_dim, s2p=scientist2papers, global_mean=global_mean, s2w=scientist2wishlist).to(device)
optim = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=wd)

In [10]:


train_dataset = get_dataset(train_df)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=256, shuffle=True)

valid_dataset = get_dataset(valid_df)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=256, shuffle=False)


In [None]:
NUM_EPOCHS = 20
best_rmse = float("inf")
patience = 2
epochs_no_improve = 0
best_model_state = None

for epoch in range(NUM_EPOCHS):
    # Train model for an epoch
    total_loss = 0.0
    total_data = 0
    model.train()
    for sid, pid, ratings in train_loader:
        sid = sid.to(device)
        pid = pid.to(device)
        ratings = ratings.to(device)

        pred = model(sid, pid)
        rating_loss = F.mse_loss(pred, ratings)

        # Contrastive loss
        neg_pids = torch.randint(0, 1000, pid.shape, device=device)

        user_embeds = model.scientist_factors(sid)
        pos_item_embeds = model.paper_factors(pid)
        neg_item_embeds = model.paper_factors(neg_pids)

        pos_scores = F.cosine_similarity(user_embeds, pos_item_embeds)
        neg_scores = F.cosine_similarity(user_embeds, neg_item_embeds)

        contrast_loss = contrastive_loss(pos_scores, neg_scores)

        
        loss = rating_loss + 0.05 * contrast_loss  # You can tune this weight


        optim.zero_grad()
        loss.backward()
        optim.step()

        total_data += len(sid)
        total_loss += len(sid) * loss.item()

    # Evaluate on validation set
    total_val_mse = 0.0
    total_val_data = 0
    model.eval()
    for sid, pid, ratings in valid_loader:
        sid = sid.to(device)
        pid = pid.to(device)
        ratings = ratings.to(device)

        pred = model(sid, pid).clamp(1, 5)
        mse = F.mse_loss(pred, ratings)

        total_val_data += len(sid)
        total_val_mse += len(sid) * mse.item()

    val_rmse = (total_val_mse / total_val_data) ** 0.5
    train_loss = total_loss / total_data
    print(f"[Epoch {epoch+1}] Train loss={train_loss:.3f}, Valid RMSE={val_rmse:.4f}")

    # Early stopping check
    if val_rmse < best_rmse:
        best_rmse = val_rmse
        best_model_state = copy.deepcopy(model.state_dict())
        epochs_no_improve = 0
    else:
        epochs_no_improve += 1

    if epochs_no_improve >= patience:
        print(f"Stopped early at epoch {epoch+1}. Best RMSE: {best_rmse:.4f}")
        break

# load best model back
model.load_state_dict(best_model_state)

# save model to file
torch.save(model.state_dict(), "svdpp_model.pth")


[Epoch 1] Train loss=0.988, Valid RMSE=0.9367
[Epoch 2] Train loss=0.868, Valid RMSE=0.8961
[Epoch 3] Train loss=0.793, Valid RMSE=0.8753
[Epoch 4] Train loss=0.744, Valid RMSE=0.8659
[Epoch 5] Train loss=0.706, Valid RMSE=0.8603
[Epoch 6] Train loss=0.671, Valid RMSE=0.8569


In [None]:
pred_fn = lambda sids, pids: model(
    torch.from_numpy(sids).to(device),
    torch.from_numpy(pids).to(device),
).clamp(1, 5).cpu().numpy()

# Evaluate on validation data
with torch.no_grad():
    val_score = evaluate(valid_df, pred_fn)

print(f"Validation RMSE: {val_score:.3f}")

In [None]:
with torch.no_grad():
    make_submission(pred_fn, "learned_embedding_submission.csv")