# **Modern NLP: Course project Milestone 3**

#### **Team**: Alexander Sternfeld, Silvia Romanato and Antoine Bonnet (`syntax-sorcerers`)

> ### **Project Description**
> 
> **Remember**: In **Milestone 1**, we picked a robust prompting strategy to get accurate answers from ChatGPT, which we used to generate answers for questions from EPFL course content. The generated answers were then rated by human annotators. The data collected (available at `project_reference/interactions_v1.json` and `project_reference/solutions_v1.json`) will be used in Milestone 3 for the supervised fine-tuning of a language model to answer questions from EPFL course content.
>
> However, high-quality assistants such as ChatGPT are trained with more than supervised learning alone. They also use a technique called Reinforcement Learning from Human Feedback (RLHF). RLHF requires the training procedure to have access to a reward model that can evaluate multiple different responses and rank them according to their suitability.
>
> In **Milestone 2**, we trained a reward model, a classifier built on the Transformer-based [RoBERTa](https://arxiv.org/abs/1907.11692) model, on the EPFL and StackOverflow datasets to rate the quality of an answer given a question. This reward model can rank multiple answers to the same question and can now be used to train a **policy model** with RLHF.
>
> In **Milestone 3**, we fine-tune a generative pretrained language model so that it learns to produce better demonstrations when prompted with a question from our courses. We train the model with supervised learning on part of the data collected in the first two milestones. We also use our reward model to evaluate the quality of the text generated by the fine-tuned model.


In [2]:
# To run this notebook, you need to install the following packages:
# !pip install -r requirements.txt

from load_data import *
from finetune import *
from chatbot import *
from gen_script_syntax_sorcerers import *

import json
import os

import numpy as np
import torch

os.environ["NO_DEPRECATION_WARNING"] = "true"

SEED = 0
torch.manual_seed(SEED)
np.random.seed(SEED)

%reload_ext autoreload
%autoreload 2



## **Training the ChatBot**

Our goal is to fine-tune a generative pre-trained language model using supervised learning on the collected data from milestone 1 and 2. This fine-tuning process helps the model learn to generate better responses specific to EPFL course content.

In this notebook, we select the base model, pre-process the labelled fine-tuning data, then run the fine-tuning process. We also evaluate the performance of the fine-tuned model on a validation set.


**Requirements**: 
1. The generative model should be able to generate proper answers given a question. 
2. It does not have to handle multi-turn conversations. The model is only evaluated on single-turn prompts.


#### 1. **Model selection**

We use [Google's Large-sized T5 multilingual model](https://huggingface.co/t5-large) as the base for our chat engine. We select its `large` version (770M parameters) because it provides a good balance between computational resources and performance.

In [2]:
# Load base model from HuggingFace (takes a few minutes the first time)
BASE_MODEL_NAME = 't5-base'
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_NAME, model_max_length=512)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE_MODEL_NAME)
config = AutoConfig.from_pretrained(BASE_MODEL_NAME)

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Compare device.type: a torch.device object never equals the plain string 'cuda'
if device.type == 'cuda':
    print('Using GPU:', torch.cuda.get_device_name(0))
    model.to(device)
else:
    print('Using CPU.')

Using CPU.


#### 2. **Dataset selection**

To fine-tune our chatbot to answer questions on EPFL course content, we combine two high-quality question-answering datasets. The full dataset is available at `data/gen_model/gen_dataset_syntax-sorcerers.json`. We collect our training data from the following sources:

1. **StackOverflow dataset**: The [StackOverflow dataset](https://www.kaggle.com/datasets/stackoverflow/stackoverflow) contains 27M+ answers on a wide variety of forums, from which we select 10 topics: computer science, computer science theory, data science, mathematics, physics, chemistry, engineering, software engineering, mechanics and quantum physics. We only keep answers that were accepted by the original poster and have at least 2 upvotes. This dataset contains 139264 unique questions and their corresponding accepted answers.

2. **EPFL Student Interactions dataset**: This dataset contains student interactions with ChatGPT concerning questions on EPFL course content. From this dataset, we select provided official solutions as well as question-answer pairs that were rated with the highest confidence level (5) by students, since using lower quality answers may introduce noise and lead to incorrect responses. We only select one-shot interactions. This dataset contains 4450 questions with 7412 distinct answers. 
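
The two selection rules above can be sketched in a few lines of plain Python. This is only an illustration: the field names (`is_accepted`, `score`, `confidence`, `n_turns`) and the toy rows are assumptions, not the real dataset schema, and the actual pre-processing in `load_data.py` may differ. We also read "one-shot" as "a single question-answer turn", which is our interpretation.

```python
# Toy records; the field names are hypothetical, not the real dataset schema.
stack_answers = [
    {"question_id": 1, "is_accepted": True, "score": 5},
    {"question_id": 1, "is_accepted": False, "score": 9},
    {"question_id": 2, "is_accepted": True, "score": 1},
]
epfl_answers = [
    {"question_id": 10, "confidence": 5, "n_turns": 1},
    {"question_id": 11, "confidence": 3, "n_turns": 1},
    {"question_id": 12, "confidence": 5, "n_turns": 3},
]

MIN_UPVOTES = 2      # "at least 2 upvotes"
TOP_CONFIDENCE = 5   # highest confidence level given by students


def keep_stack_answer(row):
    """Keep answers accepted by the original poster with enough upvotes."""
    return row["is_accepted"] and row["score"] >= MIN_UPVOTES


def keep_epfl_answer(row):
    """Keep single-turn interactions rated with the highest confidence."""
    return row["confidence"] == TOP_CONFIDENCE and row["n_turns"] == 1


stack_kept = [row for row in stack_answers if keep_stack_answer(row)]
epfl_kept = [row for row in epfl_answers if keep_epfl_answer(row)]
```

Official solutions would be added to the EPFL subset separately, since they carry no student confidence rating.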


**NOTE**: We train the chatbot with QA pairs of the following form: Text: "`Human: What is 2+2?\n\nAssistant: `" --> Label: "`4`". At inference time, we must therefore prepend the "`Human: `" prefix and append the "`\n\nAssistant: `" suffix to the actual prompt before passing it to the trained model.
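
A minimal sketch of this formatting step is shown below. The helper names `format_prompt` and `make_example` are hypothetical (they are not taken from our modules), and stripping surrounding whitespace is an assumption made for the example.

```python
HUMAN_PREFIX = "Human: "
ASSISTANT_SUFFIX = "\n\nAssistant: "


def format_prompt(question: str) -> str:
    """Wrap a raw question in the Human/Assistant template used for training."""
    return f"{HUMAN_PREFIX}{question.strip()}{ASSISTANT_SUFFIX}"


def make_example(question: str, answer: str) -> dict:
    """Build one (text, label) training pair from a question and its answer."""
    return {"text": format_prompt(question), "label": answer.strip()}


example = make_example("What is 2+2?", "4")
```

Using the same helper at training and inference time keeps the two prompt formats from drifting apart.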



In [4]:
# Load the pre-processed data
stack_df = load_stack_data()
EPFL_df = load_EPFL_data()

print("\nNumber of unique questions in StackExchange dataset: {}".format(len(stack_df['question_id'].unique())))
print("Number of unique questions in EPFL dataset: {}".format(len(EPFL_df['question_id'].unique())))
print('Number of QA samples in StackExchange dataset: {}'.format(len(stack_df)))
print('Number of QA samples in EPFL dataset: {}'.format(len(EPFL_df)))
print('Total number of QA samples: {}'.format(len(stack_df) + len(EPFL_df)))

# Save as a combined json file
QA_PATH = os.path.join(DATA_DIR, 'gen_model', f'gen_dataset_{TEAM_NAME}.json')
if os.path.exists(QA_PATH):
    print(f'\nCombined dataset already exists at {QA_PATH}.')
else:
    print(f'\nSaving combined dataset to {QA_PATH}.')
    merged_data = { 
        'EPFL': EPFL_df.to_dict(orient='records'),
        'StackOverflow': stack_df.to_dict(orient='records')
    }
    with open(QA_PATH, 'w') as f:
        json.dump(merged_data, f, indent=4)


Loading pre-processed StackOverflow data.
Loading pre-processed EPFL data.

Number of unique questions in StackExchange dataset: 95041
Number of unique questions in EPFL dataset: 4450
Number of QA samples in StackExchange dataset: 95041
Number of QA samples in EPFL dataset: 7872
Total number of QA samples: 102913

Combined dataset already exists at /Users/abonnet/Desktop/NLPProject/project-m3-syntax-sorcerers/data/gen_model/gen_dataset_syntax-sorcerers.json.


In [5]:
# Load datasets into QADatasets, split train/val/test
EPFL_dataset = load_dataset(EPFL_df, tokenizer, seed=SEED)
stack_dataset = load_dataset(stack_df, tokenizer, seed=SEED)
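
For illustration, a seeded train/val/test split could look like the sketch below. It is not the implementation of `load_dataset`: the 80/10/10 ratios, the function name `split_by_question` and the choice to split by `question_id` are assumptions. Splitting by question is one way to keep several answers to the same EPFL question from landing in different splits.

```python
import random


def split_by_question(samples, seed=0, val_frac=0.1, test_frac=0.1):
    """Split samples into train/val/test so each question stays in one split."""
    question_ids = sorted({s["question_id"] for s in samples})
    random.Random(seed).shuffle(question_ids)  # reproducible for a given seed

    n_test = int(len(question_ids) * test_frac)
    n_val = int(len(question_ids) * val_frac)
    test_ids = set(question_ids[:n_test])
    val_ids = set(question_ids[n_test:n_test + n_val])

    splits = {"train": [], "val": [], "test": []}
    for sample in samples:
        if sample["question_id"] in test_ids:
            splits["test"].append(sample)
        elif sample["question_id"] in val_ids:
            splits["val"].append(sample)
        else:
            splits["train"].append(sample)
    return splits


# Toy data: 50 questions with 2 answers each.
toy_samples = [{"question_id": i // 2, "answer": f"a{i}"} for i in range(100)]
toy_splits = split_by_question(toy_samples, seed=0)
```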

#### 3. **Training the Chatbot**

We now fine-tune the pre-trained T5 language model using supervised learning over two rounds of fine-tuning.

In the first round, the pre-trained model is fine-tuned on the larger and more diverse StackOverflow dataset, which lets it learn general question-answering patterns, syntax, and common knowledge across a wide range of topics. This first round gives the model a strong foundation, which is then specialized to the domain of EPFL course content.

In the second round, the model is fine-tuned on the collected EPFL-specific dataset. The model can then focus on adapting to the specific question-answer patterns, terminology, and context of EPFL course content. This step lets the model specialize in generating accurate and relevant responses to EPFL-specific queries.

During training, we use maximum likelihood estimation (MLE) to train the model. We monitor the current model on the validation set with metrics including perplexity, BLEU and ROUGE scores. After each round of training, we evaluate the model on the test set to measure its final performance.
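
To make the overlap-based metrics concrete, here is a simplified ROUGE-1 F1 score written from scratch. It is a sketch for intuition only: it uses lowercase whitespace tokenization and no stemming, so its values will not exactly match those of a standard ROUGE package, and it is not the metrics function used in our code.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated answer and a reference answer."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: each word counts at most as often as in the reference.
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("the cat", "the cat sat on the mat")
```

In this example the prediction has perfect precision (both words appear in the reference) but low recall (2 of 6 reference words), which is why a short but correct answer can still score poorly.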

We use the cross-entropy loss to fine-tune the T5 language model. It measures the dissimilarity between the model's predicted probability distribution over the vocabulary and the true distribution (the label tokens). Minimizing this loss during training is equivalent to maximizing the likelihood of the reference answers.
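
As a small worked example, the token-level cross-entropy is $-\log p(y_t)$ averaged over the $T$ target tokens, and perplexity is the exponential of that average. The sketch below uses plain Python and a toy 4-word vocabulary; the function names are ours, and in practice the loss is computed by the model's forward pass rather than by hand.

```python
import math


def mean_cross_entropy(step_probs, targets):
    """Average of -log p(target token) over all decoding steps."""
    losses = [-math.log(probs[target]) for probs, target in zip(step_probs, targets)]
    return sum(losses) / len(losses)


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(mean_loss)


# A model that is completely unsure over a 4-word vocabulary at every step.
uniform_probs = [[0.25, 0.25, 0.25, 0.25]] * 3
target_tokens = [0, 1, 2]

loss = mean_cross_entropy(uniform_probs, target_tokens)
ppl = perplexity(loss)
```

With a uniform distribution the loss is $\ln 4$ and the perplexity is 4: the model is as uncertain as if it were choosing among 4 equally likely tokens.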

TODO: 
- Check the metrics function: we might add BERTScore.
- Decide which metric to use for saving the best model. Perplexity (the exponential of the loss) is not always a good criterion.
- Tune hyperparameters: train batch size (8 vs 16 vs 32), learning rate (1e-4, 1e-5, 1e-6), number of training epochs.
- StackOverflow round: use fewer epochs (just learn general QA patterns). The learning rate is still undecided (higher for the StackOverflow round or for the EPFL round?).
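
The hyperparameter search listed above could be enumerated as a simple grid. The values come from the TODO list; the key names `train_batch_size` and `learning_rate` are assumptions and may not match the arguments of `finetune`. The number of epochs is left out because no candidate values are listed.

```python
from itertools import product

TRAIN_BATCH_SIZES = [8, 16, 32]
LEARNING_RATES = [1e-4, 1e-5, 1e-6]

# One configuration per (batch size, learning rate) combination.
grid = [
    {"train_batch_size": batch_size, "learning_rate": lr}
    for batch_size, lr in product(TRAIN_BATCH_SIZES, LEARNING_RATES)
]

for run_id, config_candidate in enumerate(grid):
    print(run_id, config_candidate)
```

Each configuration would then be trained and compared on the validation set, never on the test set.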

In [7]:
# This runs the whole fine-tuning procedure
finetune(seed=7)

Chatbots already fine-tuned. Skipping fine-tuning.


#### 5. **Evaluation**

> - Evaluate the performance of your chat engine using appropriate metrics, such as BLEU, ROUGE, or human evaluation.
> - Continuously analyze and monitor the responses generated by the chat engine to identify areas for improvement. 
> - Collect user feedback and iterate on the model and training process accordingly.
> - Experiment with different training techniques, architectures, and hyperparameters to optimize the chat engine's performance.
> - Remember to iterate and experiment throughout the training process, as finding the optimal approach often requires testing different combinations of models, datasets, and training techniques. Regularly evaluate the performance of your chatbot, seek user feedback, and make adjustments as necessary to ensure its effectiveness as an educational assistant for EPFL course content.


#### 6. **Preparing submission**

In [27]:
# Read the prompts into a DataFrame (the encoding is set when opening the file)
with open(PROMPTS_PATH, 'r', encoding='utf-8') as f:
    prompts = pd.read_json(f)

# Format question and answer
prompts = prompts.replace({np.nan: None})
prompts['question'] = prompts.apply(lambda x: Q_from_solutions(x['question'], x['choices']), axis=1)
prompts['answer'] = prompts.apply(lambda x: A_from_solutions(x['answer'], x['explanation']), axis=1)

# Remove columns choices, explanation
prompts = prompts.drop(columns=['choices', 'explanation'])
prompts.head()
# Generate answers using chatbot



| | guid | question | answer |
|---|---|---|---|
| 0 | 94cdfd24-de94-4072-a5aa-a3a9dc953d2b | `Soit \(f\) une fonction paire (resp. impaire) ...` | `Soit \(f\) une fonction paire, possédant\nun d...` |
| 1 | 3826e1bd-05f7-4f80-a6ad-937af83dc3e4 | `The goal of the 4 following questions is to pr...` | `MapTrCons, ConsAppend, IH, MapTrCons` |
| 2 | f6f7e3a3-b63a-4f13-abbe-a2b3f74c64e5 | `What is the general relation between the entan...` | `Entanglement is sufficient but not necessary f...` |
| 3 | 0b473eff-0fc8-4fb0-9d10-bc49d18e52e3 | `Cet exercice est un bref rappel de maths, il d...` | `1. f(x)=\cos(x) &\Rightarrow F(x)=\sin(x)+C\n...` |
| 4 | 6654c2b0-4dcd-432e-9c8e-26d638afe0ef | `What is the worst case complexity of listing f...` | `['$O(number of direntries in the directory)$']` |
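
The remaining step, generating an answer for each prompt, could look roughly like the sketch below. It is not the code from `gen_script_syntax_sorcerers`: the function name `generate_answer` and the `max_new_tokens=128` budget are assumptions. It relies only on the standard Hugging Face calls `tokenizer(...)`, `model.generate(...)` and `tokenizer.decode(...)`, and expects the fine-tuned `model` and `tokenizer` to be passed in.

```python
def generate_answer(model, tokenizer, question, max_new_tokens=128):
    """Generate one answer for a single-turn question.

    `model` and `tokenizer` are expected to be a Hugging Face seq2seq model
    and its tokenizer; `max_new_tokens` is an arbitrary budget chosen here.
    """
    # Same template as the one used for the training pairs.
    prompt = f"Human: {question}\n\nAssistant: "

    # Tokenize and move the tensors to the device the model lives on.
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    inputs = inputs.to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

It could then be applied row by row, for example with `prompts['question'].apply(lambda q: generate_answer(model, tokenizer, q))`, assuming the `question` column holds the raw question text without the template.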
