# **XCS330 - PS1**

[![Open In Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/scpd-proed/XCS330-PS1/blob/main/src/Google_Colab_XCS330_PS1.ipynb)

Before opening the notebook through the badge above, you need to allow Google Colab to access your private GitHub repositories. See [this tutorial](https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb#:~:text=Navigate%20to%20http%3A%2F%2Fcolab,to%20read%20the%20private%20files.) for the steps.


In this notebook, we use GitHub and Google services (Drive, Colab Pro) to train our models on a GPU and generate the files required by our grader.

**Note**: to run the experiments successfully on Google Colab, you need at least the Pro subscription, which gives you access to more powerful GPUs, network, and storage, and avoids timeouts when a session is left running!

Please **read carefully** and follow the instructions in the following cells.

For any issues please contact your Course Facilitator!

## Install required tools

In [None]:
!pip install --upgrade timeout_decorator==0.5.0
!pip install --upgrade -f https://download.pytorch.org/whl/torch torch==2.5.1+cu124
!pip install --upgrade rouge-score==0.1.2
!pip install --upgrade huggingface-hub==0.28.0
!pip install --upgrade transformers==4.48.1
!pip install --upgrade datasets==3.2.0
!pip install --upgrade numpy==2.2.2
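
After the installation cell finishes, it can help to confirm that the pinned versions are the ones actually installed. The sketch below is only an illustration: the helper name `version_mismatches` is hypothetical, and the pins are copied from the `pip` commands above.

```python
from importlib import metadata

# Pins copied from the pip commands above.
EXPECTED = {
    "timeout_decorator": "0.5.0",
    "torch": "2.5.1+cu124",
    "rouge-score": "0.1.2",
    "huggingface-hub": "0.28.0",
    "transformers": "4.48.1",
    "datasets": "3.2.0",
    "numpy": "2.2.2",
}


def version_mismatches(expected):
    """Return {package: (expected, installed)} for every pin that is not met.

    `installed` is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Report every package whose installed version differs from its pin.
for name, (wanted, found) in version_mismatches(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {found}")
```

This reads the installed package metadata only. A package that was already imported before the upgrade keeps its old version in memory until the runtime is restarted.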


## Display the runtimes of each cell

In [None]:
!pip install ipython-autotime
%load_ext autotime

## Cloning GitHub XCS330-PS1 private repository

Unfortunately, our Git repositories rely on Git LFS and cannot be cloned properly onto Google Drive. Therefore, before running the experiments, you can either:

*   clone our default XCS330-PS1 repository and manually upload your modified files, so that Google Colab has the latest state of your work
*   duplicate our default XCS330-PS1 repository under your GitHub account and clone the duplicate directly in Google Colab. More details [here](https://docs.github.com/en/repositories/creating-and-managing-repositories/duplicating-a-repository#).

By default, the cells below use the first option, so the default [XCS330-PS1](https://github.com/scpd-proed/XCS330-PS1.git) repository is cloned.


Enter your GitHub username and [GitHub Personal Access Token](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/managing-your-personal-access-tokens) in the fields that will pop up when executing the cell below.

In [None]:
from getpass import getpass
username = input("Enter the GitHub Username: ")
# getpass hides the token instead of echoing it into the cell output
token = getpass("Enter the GitHub Personal Access Token: ")

In [None]:
!git clone https://{username}:{token}@github.com/scpd-proed/XCS330-PS1.git
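
The clone command above places the username and token directly into the URL. GitHub tokens are typically URL-safe already, so the following is a defensive sketch rather than a required step: it percent-encodes both values before building the URL. The helper name `build_clone_url` is hypothetical; the repository address is the one used in the cell above.

```python
from urllib.parse import quote

# Repository address taken from the clone cell above.
REPO = "github.com/scpd-proed/XCS330-PS1.git"


def build_clone_url(username, token, repo=REPO):
    """Build an HTTPS clone URL with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{repo}"


# Example with made-up credentials; characters such as "/" and "@" are escaped.
print(build_clone_url("example-user", "example/token@1"))
```

In the notebook, the result could be stored in a variable, for example `clone_url = build_clone_url(username, token)`, and passed to the shell with `!git clone {clone_url}`.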

## Mounting Google Drive locally

In [None]:
from google.colab import drive
drive.mount('/content/drive')

The generated files and the log files will be stored in your Google Drive account, in the `Stanford/XCS330/PS1` folder.

In [None]:
%mkdir -p "/content/drive/MyDrive/Stanford/XCS330/PS1/run"

## Run PS1 experiments

First, check that all the basic test cases pass, i.e. that the latest state of your work is available on Google Colab.

In [None]:
%cd "/content/XCS330-PS1/src"
# Download and prepare the dataset
!python3 main.py --cache


In [None]:
!python grader.py

## **Important note**

Uncomment and execute the cell below only if you have implemented all the requirements and all the basic test cases pass locally! The cell generates all the required experiment files by running the four experiments in parallel, so before executing it make sure that the GPU has at least 14GB of memory, such as the standard T4 GPU.

**Note**: from a cost perspective this is the best option to follow!
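
One way to check the memory requirement before launching the parallel run is to ask `nvidia-smi` for the total memory of each GPU. This is a minimal sketch, assuming that "14GB" is read as 14 GiB (14 * 1024 MiB); the function names are hypothetical.

```python
import shutil
import subprocess

REQUIRED_MIB = 14 * 1024  # assumption: "14GB" interpreted as 14 GiB


def parse_memory_mib(smi_output):
    """Parse one memory.total value (MiB) per line, skipping non-numeric lines."""
    values = []
    for line in smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def query_gpu_memory_mib():
    """Return total memory per visible GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_memory_mib(result.stdout)


def enough_memory(memory_mib, required_mib=REQUIRED_MIB):
    """True if at least one GPU meets the memory requirement."""
    return any(total >= required_mib for total in memory_mib)


gpus = query_gpu_memory_mib()
if not gpus:
    print("No NVIDIA GPU detected; select a GPU runtime before running the experiments.")
elif enough_memory(gpus):
    print(f"GPU memory looks sufficient: {gpus} MiB")
else:
    print(f"GPU memory may be too small for the parallel run: {gpus} MiB")
```

If the check reports too little memory, running the experiments one at a time is the safer choice.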

In [None]:
# %%bash
# # Experiment 1
# ((time python main.py --factorization_weight 0.99 --regression_weight 0.01 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=True_LF=0.99_LR=0.01 --device gpu) > /content/drive/MyDrive/Stanford/XCS330/PS1/exp_1.log 2>&1 && \
#   cp ./experiment_1.npy /content/drive/MyDrive/Stanford/XCS330/PS1/) &

# # Experiment 2
# ((time python main.py --factorization_weight 0.5 --regression_weight 0.5 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=True_LF=0.5_LR=0.5 --device gpu) > /content/drive/MyDrive/Stanford/XCS330/PS1/exp_2.log 2>&1 && \
#   cp ./experiment_2.npy /content/drive/MyDrive/Stanford/XCS330/PS1/) &

# # Experiment 3
# ((time python main.py --no-shared_embeddings --factorization_weight 0.5 --regression_weight 0.5 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=False_LF=0.5_LR=0.5 --device gpu) > /content/drive/MyDrive/Stanford/XCS330/PS1/exp_3.log 2>&1 && \
#   cp ./experiment_3.npy /content/drive/MyDrive/Stanford/XCS330/PS1/) &

# # Experiment 4
# ((time python main.py --no-shared_embeddings --factorization_weight 0.99 --regression_weight 0.01 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=False_LF=0.99_LR=0.01 --device gpu) > /content/drive/MyDrive/Stanford/XCS330/PS1/exp_4.log 2>&1 && \
#   cp ./experiment_4.npy /content/drive/MyDrive/Stanford/XCS330/PS1/) &
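
The four commands above differ only in whether embeddings are shared and in the two loss weights, and the `--logdir` name encodes the same three settings. The sketch below rebuilds the commands from a small table to make that pattern explicit. It is an illustration only: `build_command` is a hypothetical helper, and the flags and paths are copied from the cell above.

```python
DRIVE_DIR = "/content/drive/MyDrive/Stanford/XCS330/PS1"

# (shared embeddings, factorization weight, regression weight), in experiment order
EXPERIMENTS = [
    (True, 0.99, 0.01),
    (True, 0.5, 0.5),
    (False, 0.5, 0.5),
    (False, 0.99, 0.01),
]


def build_command(shared, lf, lr, drive_dir=DRIVE_DIR):
    """Rebuild the main.py command line for one experiment configuration."""
    logdir = f"{drive_dir}/run/shared={shared}_LF={lf}_LR={lr}"
    parts = ["python", "main.py"]
    if not shared:
        parts.append("--no-shared_embeddings")
    parts += [
        "--factorization_weight", str(lf),
        "--regression_weight", str(lr),
        "--logdir", logdir,
        "--device", "gpu",
    ]
    return " ".join(parts)


for number, (shared, lf, lr) in enumerate(EXPERIMENTS, start=1):
    print(f"Experiment {number}: {build_command(shared, lf, lr)}")
```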

## **Important note**

Uncomment and execute each of the cells below if you want to run the experiments individually!

To keep the costs low, make sure you use the standard available GPU, as there is no real benefit to using a high-end GPU!

**Note**: from a cost perspective this is more expensive than the first option!

In [None]:
# # Experiment 1
# !python main.py --factorization_weight 0.99 --regression_weight 0.01 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=True_LF=0.99_LR=0.01 --device gpu
# !cp ./experiment_1.npy /content/drive/MyDrive/Stanford/XCS330/PS1/

In [None]:
# # Experiment 2
# !python main.py --factorization_weight 0.5 --regression_weight 0.5 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=True_LF=0.5_LR=0.5 --device gpu
# !cp ./experiment_2.npy /content/drive/MyDrive/Stanford/XCS330/PS1/

In [None]:
# # Experiment 3
# !python main.py --no-shared_embeddings --factorization_weight 0.5 --regression_weight 0.5 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=False_LF=0.5_LR=0.5 --device gpu
# !cp ./experiment_3.npy /content/drive/MyDrive/Stanford/XCS330/PS1/

In [None]:
# # Experiment 4
# !python main.py --no-shared_embeddings --factorization_weight 0.99 --regression_weight 0.01 --logdir /content/drive/MyDrive/Stanford/XCS330/PS1/run/shared=False_LF=0.99_LR=0.01 --device gpu
# !cp ./experiment_4.npy /content/drive/MyDrive/Stanford/XCS330/PS1/

## Submission

The experiments generate their files in your Google Drive account, in the `Stanford/XCS330/PS1` folder. Refer to `P12.pdf` for the full list of files you need to download from that Google Drive folder to build your submission!
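
Before downloading, a quick check can confirm that the experiment outputs reached the Drive folder. The sketch below looks only for the four `experiment_N.npy` files that the experiment cells copy there; the handout remains the authoritative list of what to submit, and `missing_files` is a hypothetical helper name.

```python
from pathlib import Path

DRIVE_DIR = Path("/content/drive/MyDrive/Stanford/XCS330/PS1")

# Files copied to Drive by the experiment cells; not the full submission list.
EXPECTED_FILES = [f"experiment_{n}.npy" for n in range(1, 5)]


def missing_files(folder, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in expected if not (folder / name).is_file()]


missing = missing_files(DRIVE_DIR)
if missing:
    print("Missing experiment outputs:", ", ".join(missing))
else:
    print("All four experiment outputs are present.")
```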