<a href="https://colab.research.google.com/github/ymann/sparknotesqa/blob/master/Sparknotesqa.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Running the Sparknotes models

### 1. Paragraph extraction

In [0]:
import os

In [0]:
!git clone https://github.com/ymann/sparknotesqa.git

In [0]:
os.chdir('/content/sparknotesqa')

In [0]:
!pip install -r ./requirements.txt

In [0]:
# If you are using InferSent embeddings (not recommended), run this cell.
# `mkdir -p` keeps the cell safe to re-run if the directories already exist.
!mkdir -p scripts/fastText
!curl -Lo scripts/fastText/crawl-300d-2M.vec.zip https://dl.fbaipublicfiles.com/fasttext/vectors-english/crawl-300d-2M.vec.zip
!unzip scripts/fastText/crawl-300d-2M.vec.zip -d scripts/fastText/
!mkdir -p scripts/encoder
!curl -Lo scripts/encoder/infersent2.pkl https://dl.fbaipublicfiles.com/infersent/infersent2.pkl

In [0]:
# Unzip the dataset before running the paragraph extraction:
!gunzip data/sparknotes_dataset.json.gz
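
Note that `gunzip` deletes the `.gz` archive after decompressing, so re-running the cell above fails once the archive is gone. As an alternative, a minimal sketch using only the Python standard library could decompress the file while keeping the archive. `decompress_gz` is a hypothetical helper name used for illustration; it is not part of the repository.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz(src, dst=None, keep_source=True):
    """Decompress a .gz file; by default keep the archive (like `gunzip -k`)."""
    src = Path(src)
    # Default target: the same path without the trailing ".gz".
    dst = Path(dst) if dst is not None else src.with_suffix("")
    if not dst.exists():  # skip the work when the cell is re-run
        with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
    if not keep_source and src.exists():
        src.unlink()
    return dst


# In the notebook, after os.chdir('/content/sparknotesqa'):
# decompress_gz("data/sparknotes_dataset.json.gz")
```

Because the helper returns early when the target file already exists, the cell can be run repeatedly without error.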

In [0]:
# Now extract the correct paragraphs. In this example, the question
# concatenated with the correct answer is used for training, and the question
# concatenated with all answers is used for testing.

# Input flag options:
# embedding_method: tfidf, bert, infersent, sentence_bert (recommended)
# comparison_method: no_answers, correct_answer, all_answers
# pool_method: best_sentence (recommended), sum, average
# context_size: any int (-1 for full paragraphs; 50 is recommended)
!python scripts/paragraph_extraction.py -embedding_method sentence_bert \
-comparison_method correct_answer -pool_method best_sentence -context_size 50
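
To make the `pool_method` options more concrete, here is a small sketch of how per-sentence similarity scores might be pooled into one score per paragraph. This is only a reading of the flag names (`best_sentence`, `sum`, `average`); the function names and the example scores are hypothetical, and the logic in `scripts/paragraph_extraction.py` may differ.

```python
def pool_scores(sentence_scores, pool_method="best_sentence"):
    """Collapse per-sentence similarity scores into one paragraph score."""
    if not sentence_scores:
        raise ValueError("a paragraph needs at least one sentence score")
    if pool_method == "best_sentence":
        return max(sentence_scores)
    if pool_method == "sum":
        return sum(sentence_scores)
    if pool_method == "average":
        return sum(sentence_scores) / len(sentence_scores)
    raise ValueError(f"unknown pool_method: {pool_method}")


def pick_paragraph(scores_per_paragraph, pool_method="best_sentence"):
    """Return the index of the paragraph with the highest pooled score."""
    pooled = [pool_scores(s, pool_method) for s in scores_per_paragraph]
    return max(range(len(pooled)), key=pooled.__getitem__)


# Hypothetical similarity scores for two paragraphs of three sentences each.
example_scores = [[0.9, 0.1, 0.1], [0.5, 0.5, 0.5]]
print(pick_paragraph(example_scores, "best_sentence"))  # 0
print(pick_paragraph(example_scores, "sum"))            # 1
print(pick_paragraph(example_scores, "average"))        # 1
```

In this toy example, `best_sentence` favors the paragraph containing the single closest sentence, while `sum` and `average` favor the paragraph that is moderately similar throughout.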

In [0]:
# Repeat the extraction with the question concatenated with all answers
# (used for testing):
!python scripts/paragraph_extraction.py -embedding_method sentence_bert \
-comparison_method all_answers -pool_method best_sentence -context_size 50

### 2. Run the fine-tuning model

In [0]:
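# First, split the data into train, val, and test sets.
# Note: -train_data and -val_data both point to the correct_answer file here.
# To evaluate on the all_answers extraction from the previous step, pass that
# CSV to -val_data instead.
!python scripts/splitdata.py \
-train_data data/paragraph_extracted_data/sentence_bert_correct_answer_50.csv \
-val_data data/paragraph_extracted_data/sentence_bert_correct_answer_50.csv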

In [0]:
# Next, clone the modified Hugging Face repository
os.chdir('/content')
!git clone https://github.com/gauravkmr/transformers.git
os.chdir('/content/transformers')

In [0]:
# Run the fine-tuning model
!python examples/run_multiple_choice.py \
--model_type roberta \
--task_name swag \
--model_name_or_path roberta-base \
--do_train \
--do_eval \
--do_lower_case \
--data_dir /content/sparknotesqa/splitdata \
--learning_rate 5e-5 \
--num_train_epochs 3 \
--max_seq_length 80 \
--output_dir models_bert/swag_base \
--per_gpu_eval_batch_size=16 \
--per_gpu_train_batch_size=16 \
--gradient_accumulation_steps 2 \
--overwrite_output_dir
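
The batch-size flags above interact: with `--per_gpu_train_batch_size=16` and `--gradient_accumulation_steps 2`, gradients from two batches of 16 are accumulated before each optimizer update. The sketch below works through that arithmetic. It assumes a single GPU (typical for Colab) and the step-count pattern commonly used in the Hugging Face example scripts; the function names and the example count of 10,000 training examples are hypothetical, not taken from the dataset.

```python
import math


def effective_batch_size(per_gpu_batch_size, gradient_accumulation_steps, n_gpu=1):
    """Number of examples contributing to each optimizer update."""
    return per_gpu_batch_size * max(1, n_gpu) * gradient_accumulation_steps


def optimizer_steps(num_examples, per_gpu_batch_size,
                    gradient_accumulation_steps, num_train_epochs, n_gpu=1):
    """Total optimizer updates over training (assumed pattern, see lead-in)."""
    batches_per_epoch = math.ceil(num_examples / (per_gpu_batch_size * max(1, n_gpu)))
    return batches_per_epoch // gradient_accumulation_steps * num_train_epochs


# Values from the command above: batch size 16, accumulation 2, 3 epochs.
print(effective_batch_size(16, 2))          # 32
# 10_000 is a made-up example count, used only to show the arithmetic.
print(optimizer_steps(10_000, 16, 2, 3))    # 936
```

If the run hits GPU memory limits, halving the per-GPU batch size and doubling the accumulation steps keeps the effective batch size at 32.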