In [None]:
!pip install git+https://github.com/deepset-ai/haystack.git

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/deepset-ai/haystack.git
  Cloning https://github.com/deepset-ai/haystack.git to /tmp/pip-req-build-8hwexfe0
  Running command git clone -q https://github.com/deepset-ai/haystack.git /tmp/pip-req-build-8hwexfe0
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting mlflow
  Downloading mlflow-2.0.1-py3-none-any.whl (16.5 MB)
     |████████████████████████████████| 16.5 MB 4.2 MB/s 
Collecting huggingface-hub>=0.5.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
     |████████████████████████████████| 182 kB 90.6 MB/s 
Collecting seqeval
  Downloading seqeval-1.2.2.tar.gz (43 kB)
     |████████████████████████████████| 43 kB 2.4 MB/s 
Collecting sentence-transformers>=2.2.0
  Downloading sentence-transformers-2

In [None]:
import logging
import os

logging.basicConfig(format="%(levelname)s - %(name)s -  %(message)s", level=logging.WARNING)
logging.getLogger("haystack").setLevel(logging.INFO)
# BM25Retriever supersedes the older ElasticsearchRetriever alias, so only it is imported
from haystack.nodes import BM25Retriever, FARMReader
from haystack.utils import fetch_archive_from_http, launch_es, convert_files_to_docs, print_answers
from haystack.pipelines import ExtractiveQAPipeline
from haystack.document_stores.elasticsearch import ElasticsearchDocumentStore


import pandas as pd
import certifi
import time
from pprint import pprint

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
%%bash

wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.9.2

In [None]:
%%bash --bg

sudo -u daemon -- elasticsearch-7.9.2/bin/elasticsearch
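
Elasticsearch is started in the background above, so it may not be accepting connections yet when the later cells run. A minimal sketch of a readiness check follows, assuming the server listens on its default port 9200 on localhost; `wait_for_elasticsearch` and its `probe` parameter are hypothetical helpers, not part of Haystack.

```python
import time
import urllib.error
import urllib.request


def wait_for_elasticsearch(url="http://localhost:9200", timeout_s=60.0, interval_s=2.0, probe=None):
    """Poll `url` until it answers or `timeout_s` elapses; return True on success."""

    def http_probe(target):
        # Assumption: an HTTP 200 from the root endpoint means the node is up.
        try:
            with urllib.request.urlopen(target, timeout=5) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or http_probe
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_for_elasticsearch()` in a cell before the document store is created would block until the server responds, or return False after the timeout.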

We initially started fine-tuning with a DistilBERT model. The cell below, however, fine-tunes a RoBERTa base model (deepset/roberta-base-squad2) instead. The reason is explained after the training output.

In [None]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
data_dir = '/content/drive/MyDrive/297 Project/data/'
save_dir = '/content/drive/MyDrive/297 Project/models_roberta/'

reader.train(data_dir=data_dir, train_filename="answers.json", use_gpu=True, n_epochs=1, save_dir=save_dir)

INFO:haystack.telemetry:Haystack sends anonymous usage data to understand the actual usage and steer dev efforts towards features that are most meaningful to users. You can opt-out at anytime by calling disable_telemetry() or by manually setting the environment variable HAYSTACK_TELEMETRY_ENABLED as described for different operating systems on the documentation page. More information at https://docs.haystack.deepset.ai/docs/telemetry
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1


Downloading config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model: * LOADING MODEL: 'deepset/roberta-base-squad2' (Roberta)


Downloading pytorch_model.bin:   0%|          | 0.00/473M [00:00<?, ?B/s]

INFO:haystack.modeling.model.language_model:Auto-detected model language: english
INFO:haystack.modeling.model.language_model:Loaded 'deepset/roberta-base-squad2' (Roberta model) from model hub.


Downloading tokenizer_config.json:   0%|          | 0.00/79.0 [00:00<?, ?B/s]

Downloading vocab.json:   0%|          | 0.00/878k [00:00<?, ?B/s]

Downloading merges.txt:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading special_tokens_map.json:   0%|          | 0.00/772 [00:00<?, ?B/s]

INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.utils:Using devices: CUDA:0 - Number of GPUs: 1
INFO:haystack.modeling.data_handler.data_silo:
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
INFO:haystack.modeling.data_handler.data_silo:LOADING TRAIN DATA
INFO:haystack.modeling.data_handler.data_silo:Loading train set from: /content/drive/MyDrive/297 Project/data/answers.json 
Preprocessing dataset: 100%|██████████| 1/1 [00:00<00:00,  1.15 Dicts/s]
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING DEV DATA
INFO:haystack.modeling.data_handler.data_silo:No dev set is being loaded
INFO:haystack.modeling.data_handler.data_silo:
INFO:haystack.modeling.data_handler.data_silo:LOADING TEST DATA
INFO:haystack.modeling.data_handler.data_silo:No test set is being loaded
I
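
The log above reports that no dev set is loaded, so training runs without any held-out evaluation. A minimal sketch of holding out part of the data follows, assuming answers.json uses the SQuAD layout (`data` → `paragraphs` → `qas`); `split_squad` is a hypothetical helper and the toy stories are made up for illustration.

```python
import random


def split_squad(squad, dev_fraction=0.2, seed=0):
    """Split a SQuAD-style dict into (train, dev) dicts at the paragraph level."""
    paragraphs = [
        (article.get("title", ""), paragraph)
        for article in squad["data"]
        for paragraph in article["paragraphs"]
    ]
    random.Random(seed).shuffle(paragraphs)
    n_dev = max(1, int(len(paragraphs) * dev_fraction))

    def rebuild(items):
        # Each paragraph becomes its own article entry so no structure is lost.
        return {
            "version": squad.get("version", ""),
            "data": [{"title": title, "paragraphs": [paragraph]} for title, paragraph in items],
        }

    return rebuild(paragraphs[n_dev:]), rebuild(paragraphs[:n_dev])


def make_paragraph(i):
    # Made-up toy paragraph with one answerable question.
    context = f"Story {i} ends on day {i}."
    answer = f"day {i}"
    return {
        "context": context,
        "qas": [{
            "id": str(i),
            "question": "When does the story end?",
            "answers": [{"text": answer, "answer_start": context.index(answer)}],
            "is_impossible": False,
        }],
    }


toy = {"version": "v2.0", "data": [{"title": "toy", "paragraphs": [make_paragraph(i) for i in range(10)]}]}
train, dev = split_squad(toy)
print(len(train["data"]), len(dev["data"]))
```

The two dicts could then be written out with `json.dump` and the dev file passed to `reader.train` through its `dev_filename` argument.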

deepset's guide to fine-tuning models for QA pipelines recommends augmenting the training data before distillation. Because of computational limits, we were not able to augment our data with Haystack's augment_squad.py script.

Without augmented data, distilling from a larger BERT teacher made the model worse, not better. We therefore switched to RoBERTa instead of continuing with DistilBERT.

In [None]:
# student = FARMReader(model_name_or_path=save_dir)
# teacher = FARMReader(model_name_or_path="deepset/bert-large-uncased-whole-word-masking-squad2")
# student.distil_prediction_layer_from(teacher, data_dir="/content/drive/MyDrive/297 Project/data", train_filename="answers.json",
#                     learning_rate=3e-5, save_dir = '/content/drive/MyDrive/297 Project/distilled_model', distillation_loss_weight=1.0, temperature=5)
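
The commented-out call above passes `temperature` and `distillation_loss_weight`. The sketch below shows what these two knobs mean in generic, textbook knowledge distillation. It is an assumption that Haystack combines the terms in this way; its actual loss may differ (for example in scaling), and `distillation_loss` here is an illustrative function, not the library's.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, hard_loss,
                      temperature=5.0, distillation_loss_weight=1.0):
    # Soft term: how far the student's softened distribution is from the teacher's.
    soft_loss = kl_divergence(softmax(teacher_logits, temperature),
                              softmax(student_logits, temperature))
    # Assumed combination: a weighted mix of the soft term and the label-based loss.
    return distillation_loss_weight * soft_loss + (1 - distillation_loss_weight) * hard_loss


print(softmax([2.0, 1.0, 0.1], temperature=1.0))
print(softmax([2.0, 1.0, 0.1], temperature=5.0))
```

Under this formulation, a weight of 1.0 means the student learns only from the teacher's outputs and ignores the labels, which leaves a small, unaugmented dataset with little to offer.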

In [None]:
host = os.environ.get("ELASTICSEARCH_HOST", "localhost")

In [None]:
document_store = ElasticsearchDocumentStore(host=host, username="", password="", index="document")

In [None]:
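# latin1 maps every byte to a character, so this read never fails, but curly quotes stored
# as Windows-1252 bytes (e.g. 0x92) come through as control characters, not apostrophes.
files = pd.read_csv('/content/drive/MyDrive/297 Project/data/bedtimestories.csv', encoding='latin1')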
files.columns = ['title', 'content']

In [None]:
files.head()

   title                                                content
0  Noctuary                                             Damian sat with his legs dangling over the edge of the rocking deck of the s...
1  Behind Closed Doors                                  Go brush your teeth, honey. Red and swollen were her eyes as she said it, he...
2  Butterscotch the Brave                               Not so long ago and not so far away there lived a knight: Butterscotch the B...
3  Fantastic Variations on an Old Rhyme - ~A Pastiche~  Fantastic Variations on an Old Rhyme~A Pastiche~There it was again: a faint ...
4  Green Crumbs                                         A pale 17-year-old boy rests his elbows on the glass table-top in front of h...


In [None]:
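raw_dir = '/content/drive/MyDrive/297 Project/data/raw/'
i = 0  # number of stories written so far; also used as the filename prefix
for title, content in files.itertuples(index=False):
    try:
        # Opening fails with FileNotFoundError when the title contains "/" or the
        # target directory is missing; such stories are skipped.
        with open(raw_dir + str(i) + '_' + str(title) + '.txt', 'w') as f:
            f.write(content)
    except FileNotFoundError:
        continue
    i += 1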
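
The loop above silently skips any story whose title cannot be used as a filename. One way to keep those stories is to sanitize the title first. This is a minimal sketch; `safe_filename` is a hypothetical helper, and replacing every character outside letters, digits, underscores, hyphens and spaces is an assumption about what is acceptable here.

```python
import re


def safe_filename(index, title, max_len=80):
    """Build '<index>_<title>.txt' with unsafe characters replaced by underscores."""
    cleaned = re.sub(r"[^\w\- ]+", "_", str(title)).strip(" _")
    return f"{index}_{cleaned[:max_len] or 'untitled'}.txt"


print(safe_filename(3, "Mom/son"))
print(safe_filename(224, "Road trips during the summer?"))
```

With a helper like this, the `except FileNotFoundError` branch would only be reached if the output directory itself were missing.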

In [None]:

doc_dir = "/content/drive/MyDrive/297 Project/data/raw"
docs = convert_files_to_docs(dir_path=doc_dir, split_paragraphs=True)
document_store.write_documents(docs)

INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/219_Night Sun.txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/220_Mom&son's and sun.txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/221_A Silly Little Game.txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/222_Destiny of Chris and Ping..txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/223_a summer kidnapping.txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/224_Road trips during the summer?.txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/225_Febuary 31st.txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/297 Project/data/raw/226_Rest Now, Friend.txt
INFO:haystack.utils.preprocessing:Converting /content/drive/MyDrive/2

[<Document: {'content': "Andy rolled over, pulling the light quilt along with him, then yanking it over his head like a monks cowl.The room was filled with dazzling lemony sunlight though it was only seven or so in the morning.And Saturday! thoughtknowHe peeked an eye out from his blankey-cave and turned curiously back towards the window.As usual, the cheery cacophony of bird voices wafted in through the east facing window.He recognized just about all of them, a hobby of his: the sweet peeping of house wrens, the singsong warbling of colorful goldfinches and the repetitive trilling of magpies.The entire symphony of birds didnt disturb his rest in the least, even the obnoxious jay's raspy,  that punctuated the melodies like an out of tune tuba.He found the morning birdsong to be as meditative as the nights, though the night had the added bonus of crickets and his favorite, owls.The sleep stealing culprit was Ralph, his dog.Hed pulled the entire right side of the curtain aside and was st

In [None]:
retriever = BM25Retriever(document_store=document_store)
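
To give an idea of what the retriever ranks by, here is a small pure-Python sketch of textbook BM25 scoring with the common defaults k1=1.2 and b=0.75. It is not Elasticsearch's implementation (which also analyzes and stems text, and may differ by a constant factor); `bm25_scores`, `retrieve` and the three toy sentences are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25 on whitespace tokens."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


def retrieve(query, documents, top_k=10):
    """Return the top_k (score, document) pairs, best first."""
    ranked = sorted(zip(bm25_scores(query, documents), documents), reverse=True)
    return ranked[:top_k]


toy_docs = [
    "katie had a birthday party in march",
    "the dog barked at the birds all morning",
    "a birthday cake for the village",
]
for score, doc in retrieve("katie birthday", toy_docs, top_k=2):
    print(round(score, 3), doc)
```

Rare query terms get a larger weight than common ones, and longer documents are penalized slightly, which is why a passage that mentions both query words ranks first here.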

In [None]:
pipe = ExtractiveQAPipeline(reader, retriever)

In [None]:
# top_k sets how many candidates the Retriever and the Reader each return.
# A higher Retriever top_k gives the Reader more passages to search,
# which tends to improve answers but makes inference slower.
prediction = pipe.run(query="When is Katie's Birthday?", params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})

Inferencing Samples: 100%|██████████| 3/3 [00:02<00:00,  1.42 Batches/s]


In [None]:
pprint(prediction)

{'answers': [<Answer {'answer': 'march 9th', 'type': 'extractive', 'score': 0.7783768177032471, 'context': "s going to have powers! She said that on my 11th birthday which was on march 9th. Of course, I didn'tbelieve her. I was in shock. So later that night ", 'offsets_in_document': [{'start': 1549, 'end': 1558}], 'offsets_in_context': [{'start': 71, 'end': 80}], 'document_id': '2170d9e6fc62cfeb461997e45e05e38f', 'meta': {'name': '413_5 years.txt'}}>,
             <Answer {'answer': 'Tomorrow', 'type': 'extractive', 'score': 0.595199704170227, 'context': 'he was reading it. We must. The girlwhispered and looked at the sky.***Tomorrow is my birthday and my mother and I will bake a cake for the village to', 'offsets_in_document': [{'start': 1731, 'end': 1739}], 'offsets_in_context': [{'start': 71, 'end': 79}], 'document_id': 'ac755238cb740f5a96cb77c1dbbf4ea7', 'meta': {'name': '320_It\x92s All better in a Fairy Tale.txt'}}>,
             <Answer {'answer': 'march 9th', 'type': 'extractiv
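
The same answer text can appear more than once in the results, as 'march 9th' does above. A minimal sketch of collapsing duplicates and dropping low-confidence answers follows. It works on plain dicts with `answer` and `score` keys; `best_unique_answers`, the 0.5 threshold and the sample scores are illustrative and not taken from the run above.

```python
def best_unique_answers(answers, min_score=0.5):
    """Keep the highest-scoring entry per answer text, ignoring case, best first."""
    best = {}
    for answer in answers:
        key = answer["answer"].strip().lower()
        if answer["score"] < min_score:
            continue
        if key not in best or answer["score"] > best[key]["score"]:
            best[key] = answer
    return sorted(best.values(), key=lambda a: a["score"], reverse=True)


# Made-up sample entries standing in for Haystack Answer objects.
sample = [
    {"answer": "March 9th", "score": 0.78},
    {"answer": "Tomorrow", "score": 0.60},
    {"answer": "march 9th", "score": 0.41},
    {"answer": "next week", "score": 0.12},
]
for entry in best_unique_answers(sample):
    print(entry["answer"], entry["score"])
```

For the real results, the input list could be built as `[{"answer": a.answer, "score": a.score} for a in prediction["answers"]]`.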