## Data Preparation

**Steps and Goals**:
  * Download selected files from the Colby CS website
  * Convert the PDFs into a pandas DataFrame (`df`), which can be fed directly to the model and used to make queries
    - Note that the files were manually renamed from 'outlinesxx' to names that reflect their contents (e.g. BinTree.pdf) to make the model output easier to read later on
  * Convert `df` into a .json file, which will later be annotated into the BERT Q&A format with the cdQA annotator in order to fine-tune the model

**Output:**
  * CS231_Ying.json file, which is later downloaded to the local computer for annotation
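
The manual renaming step described above could also be scripted. The following is a minimal sketch, not the procedure actually used here: the `RENAMES` mapping and the `rename_outlines` helper are hypothetical, and the pairing of `outlines16.pdf` with `BinTree.pdf` is an assumption based on the topic comments in the download cell.

```python
from pathlib import Path

# Hypothetical mapping from downloaded names to descriptive names.
# The outlines16 -> BinTree pairing is assumed; other entries would be added by hand.
RENAMES = {
    "outlines16.pdf": "BinTree.pdf",
}


def rename_outlines(directory, renames=RENAMES):
    """Rename files in `directory` according to `renames`.

    Files that are missing are skipped. Returns the list of new names
    that were actually created.
    """
    directory = Path(directory)
    renamed = []
    for old_name, new_name in renames.items():
        source = directory / old_name
        if source.exists():
            source.rename(directory / new_name)
            renamed.append(new_name)
    return renamed
```

Calling `rename_outlines('./docs/')` after the download cell would rename whichever of the listed files are present and leave the rest of the directory untouched.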

In [None]:
# install cdqa, which provides pdf_converter()
!pip install cdqa

In [None]:
# pin pandas to an older version; this fixes the issue with importing "is_url"
!pip install pandas==1.1.0

In [None]:
# import everything we need here
import os
import pandas as pd
from ast import literal_eval

from cdqa.utils.converters import pdf_converter
from cdqa.pipeline import QAPipeline
from cdqa.utils.download import download_model



In [None]:
## Getting relevant files from cs.colby.edu with the three chosen topics being:
## Linked List, Trees and Graphs

# Linked List
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines12.pdf
# Iterator for Linked List
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines13.pdf
# Doubly Linked List
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines14.pdf
# Tree and Binary Tree
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines16.pdf
# Complete tree and traversals
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines17.pdf
# Node Based Binary Tree
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines18.pdf
# Graphs
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines26.pdf
# Graph Representation and traversal
!wget --no-check-certificate -P ./docs/ https://cs.colby.edu/courses/S18/cs231/notes/outlines27.pdf

In [None]:
# convert the pdfs and store into a pandas df called df
df = pdf_converter(directory_path='./docs/')
df.head()

Unnamed: 0,title,paragraphs
0,Yale1,"[printf(""%d\n"", count);return 0;}examples/vari..."
1,UCBerkeley3b,"[Enscript Output, 04/11/1419:06:29 129 ..."
2,CMU3,"[Unit09, 1 , 15-121 Introduction to Data Struc..."
3,Yale3,"[Figure 4: A directed graph, 339, 5.13.3 Opera..."
4,CMU2,"[Unit06A, 1 , 15-121 Introduction to Data Stru..."


In [None]:
from cdqa.utils.converters import df2squad
# Converting dataframe to .json format
json_data = df2squad(df=df, squad_version='v1.1', output_dir='docs/', filename='other_schools_data.json')

28it [00:00, 3963.84it/s]
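
Before annotating, it can help to check what the generated file contains. The sketch below assumes the file follows the standard SQuAD v1.1 layout (`data` → `title` / `paragraphs` → `context` / `qas`); the helper names `load_squad` and `summarize_squad` are hypothetical, and the `example` dict is a hand-written stand-in, not real output from this notebook.

```python
import json


def load_squad(path):
    """Load a SQuAD-style .json file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_squad(squad):
    """Return {title: number_of_paragraphs} for a SQuAD-style dict."""
    return {doc["title"]: len(doc["paragraphs"]) for doc in squad["data"]}


# Hand-written stand-in for the generated file (assumed structure).
example = {
    "version": "v1.1",
    "data": [
        {
            "title": "BinTree",
            "paragraphs": [
                {"context": "A binary tree has at most two children per node.", "qas": []},
                {"context": "A leaf is a node with no children.", "qas": []},
            ],
        },
    ],
}

print(summarize_squad(example))
```

On the real file, `summarize_squad(load_squad('docs/other_schools_data.json'))` should list one entry per converted PDF, assuming the layout above holds.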


Once the file is downloaded to the local computer, run the following in a terminal:


```
cd ~/Documents/Thesis/cdQA-annotator/
npm run serve
```

This should run the annotator application at *localhost:8080*.

There, upload the .json file generated above and start annotating by typing out questions and selecting answers from the given paragraph.

**NOTE:** Do not edit the answers in any way. Just select them from the paragraph, even if the formatting looks odd. An answer that no longer matches the paragraph text will cause token errors later on, which are tedious to find and fix manually.

Finally, save the new .json file generated.
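
The mismatches warned about in the note above can be detected automatically before fine-tuning. This is a minimal sketch that assumes the annotated file uses the standard SQuAD v1.1 fields (`answers[].text` and `answers[].answer_start`); `find_mismatched_answers` is a hypothetical helper, not part of cdQA.

```python
def find_mismatched_answers(squad):
    """Return (question_id, answer_text) pairs whose text does not
    appear in the context at the recorded answer_start offset."""
    mismatches = []
    for doc in squad["data"]:
        for para in doc["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                for ans in qa["answers"]:
                    start = ans["answer_start"]
                    text = ans["text"]
                    # The answer must be an exact substring at the given offset.
                    if context[start:start + len(text)] != text:
                        mismatches.append((qa.get("id"), text))
    return mismatches
```

An empty result means every answer is an exact span of its paragraph; any returned pair points to an answer that was edited or whose offset is stale.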

## Fitting the model

In [None]:
# downloading the trained model and storing in models
download_model(model='bert-squad_1.1', dir='./models')


Downloading trained model...


In [None]:
!pip install pytorch-transformers

In [None]:
# cdqa_pipeline = QAPipeline(reader='./models/bert_qa.joblib', max_df=1.0)
cdqa_pipeline = QAPipeline(reader='extended_CS231_bert.joblib', max_df=1.0)
cdqa_pipeline.fit_retriever(df=df)

100%|██████████| 231508/231508 [00:00<00:00, 904845.40B/s]


QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po...size=32,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t

In [None]:
query = 'what is dijkstra algorithm?'
prediction = cdqa_pipeline.predict(query, 2)
prediction

[('if the graph is sparse with e = O(n)',
  'UofBirmingham3',
  '111instructions in each run through the loops. However, if the graph is sparse with e = O(n),then multiple runs of Dijkstra’s algorithm can be made to perform with time complexityO(n2log2 n), and be faster than Floyd’s algorithm.',
  11.499287448582807),
 ('The time complexity here is clearly O(n3), since it involves three nested for loops of O(n)',
  'UofBirmingham3',
  'The time complexity here is clearly O(n3), since it involves three nested for loops of O(n).This is the same complexity as running the O(n2) Dijkstra’s algorithm once for each of the npossible starting vertices. In general, however, Floyd’s algorithm will be faster than Dijkstra’s,even though they are both in the same complexity class, because the former performs fewer',
  11.36019416442258)]

In [None]:
# predict(query, 2) returns a list of (answer, title, paragraph, score) tuples,
# so unpack each prediction instead of indexing the list directly
print('query: {}'.format(query))
for answer, title, paragraph, score in prediction:
    print('answer: {}'.format(answer))
    print('title: {}'.format(title))
    # print('paragraph: {}'.format(paragraph))

query: what is dijkstra algorithm?
answer: the algorithm is straightforward, it does perform n memory accesses for a graph with n vertices
title: outlines27
answer: A sentinel node is a specifically designed node used with linked lists and trees as a traversal path terminator
title: outlines14


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# move the model we just fine-tuned on the new data to the CPU,
# then save it as a .joblib file
import joblib
cdqa_pipeline.to('cpu')
joblib.dump(cdqa_pipeline, './bert_qa_extended_new.joblib')

['./bert_qa_extended_new.joblib']

In [None]:
# copy the model to my personal drive
!cp ./bert_qa_extended_new.joblib /content/drive/MyDrive/Colab\ Notebooks/models

In [None]:
# load the custom model we just saved
# look at architecture
cdqa_pipeline=joblib.load('./models/bert_qa_custom.joblib')
cdqa_pipeline

QAPipeline(reader=BertQA(adam_epsilon=1e-08, bert_model='bert-base-uncased',
                         do_lower_case=True, fp16=False,
                         gradient_accumulation_steps=1, learning_rate=5e-05,
                         local_rank=-1, loss_scale=0, max_answer_length=30,
                         n_best_size=20, no_cuda=False,
                         null_score_diff_threshold=0.0, num_train_epochs=3.0,
                         output_dir=None, predict_batch_size=8, seed=42,
                         server_ip='', server_po..._size=8,
                         verbose_logging=False, version_2_with_negative=False,
                         warmup_proportion=0.1, warmup_steps=0),
           retrieve_by_doc=False,
           retriever=BM25Retriever(b=0.75, floor=None, k1=2.0, lowercase=True,
                                   max_df=1.0, min_df=2, ngram_range=(1, 2),
                                   preprocessor=None, stop_words='english',
                                   t