# Quick Start

This notebook covers the bare minimum steps needed to run the ELI5 QA system, using the code in `src/qa_utils.py`.

To understand what is going on under the hood, see the [qa_step_by_step.ipynb](qa_step_by_step.ipynb) notebook and the original blog post: https://yjernite.github.io/lfqa.html.

In [3]:
import sys, os
sys.path.append("../src") # where the source code lies

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from datasets import load_dataset

from qa_utils import embed_passages, make_index, QAKnowledgeBase

# Making the Embeddings

In [4]:
# load the dataset
wiki40b_snippets = load_dataset("wiki_snippets", name = "wiki40b_en_100_0")["train"]

Reusing dataset wiki_snippets (/home/rob/.cache/huggingface/datasets/wiki_snippets/wiki40b_en_100_0/1.0.0/d152a0e6a420c02b9b26e7f75f45fb54c818cae1d83e8f164f0b1a13ac7998ae)



In [5]:
# load the RetriBERT model for embedding the wiki40b_snippets passages
qar_tokenizer = AutoTokenizer.from_pretrained('yjernite/retribert-base-uncased')
qar_model = AutoModel.from_pretrained('yjernite/retribert-base-uncased')

Some weights of RetriBertModel were not initialized from the model checkpoint at yjernite/retribert-base-uncased and are newly initialized: ['bert_query.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
# Make the embeddings for the index. Embedding the full dataset takes about 15 hours.

index_filename = "../wiki_embeddings/test_embed.dat" # change this to the full file path where you want to store the index

# Note: if the file already exists, running embed_passages will overwrite it, so check before re-running.
if not os.path.isfile(index_filename): # here as a safety measure
    print("starting to embed")
    embed_passages(wiki40b_snippets, 
                   qar_tokenizer, 
                   qar_model, 
                   n_batches = 2, # REMOVE THIS LINE TO EMBED THE ENTIRE DATA SET!!!!!
                   index_file_name = index_filename,
                   device="cuda:0")
else:
    print("File already exists. Are you sure you want to overwrite it and re-embed?")

starting to embed
embedding 512 passages in 2 separate batches
embedding batch 0 of 2 batches 256 passages in batch at time 0.0013992786407470703
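
The `os.path.isfile` check above can be wrapped in a small helper so that overwriting an existing embedding file always requires an explicit opt-in. The following is a minimal sketch, not part of `qa_utils`: the name `should_embed` and its `overwrite` flag are hypothetical and exist only for illustration.

```python
import os
import tempfile


def should_embed(index_path: str, overwrite: bool = False) -> bool:
    """Return True when it is safe to write the embedding file.

    Hypothetical helper: an existing file blocks the run unless the
    caller passes overwrite=True on purpose.
    """
    if os.path.isfile(index_path) and not overwrite:
        return False
    return True


# Demonstration with a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_embed.dat")
    fresh = should_embed(demo_path)                    # no file yet
    open(demo_path, "wb").close()                      # simulate a finished run
    blocked = should_embed(demo_path)                  # file exists, no opt-in
    forced = should_embed(demo_path, overwrite=True)   # explicit opt-in

print(fresh, blocked, forced)
```

With a guard like this, the call to `embed_passages` would only run when the helper returns `True`, which is useful when a full embedding run takes many hours.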


# Asking a Question

In [7]:
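# Make the index from the saved embeddings. Note that this points at the
# full-dataset embedding file, not the two-batch test file written above.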
full_index_filename = "../wiki_embeddings/wiki_index_embed.dat"
wiki40b_index = make_index(full_index_filename, device="cpu")

Putting index on cpu.
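
For intuition about what a dense index is used for, here is a brute-force sketch of retrieval by inner product: score every passage vector against the query vector and keep the highest scores. This is an illustration only. How `make_index` works internally is not shown in this notebook, and a real index over the full dataset would use an optimized library rather than a Python loop. The function name and the toy vectors are made up for the example.

```python
import heapq


def top_k_by_inner_product(query, passages, k=2):
    """Return the k (score, passage_index) pairs with the largest inner product."""
    scored = [
        (sum(q * p for q, p in zip(query, vec)), idx)
        for idx, vec in enumerate(passages)
    ]
    return heapq.nlargest(k, scored)


# Toy 3-dimensional "embeddings" standing in for real passage vectors.
toy_passages = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [1.0, 0.2, 0.0]

results = top_k_by_inner_product(toy_query, toy_passages, k=2)
best_indices = [idx for _, idx in results]
print(best_indices)
```

The returned indices would then be used to look up the matching rows in the passage dataset.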


In [8]:
# load the BART model that is fine-tuned on the ELI5 task
qa_s2s_tokenizer = AutoTokenizer.from_pretrained('yjernite/bart_eli5')
qa_s2s_model = AutoModelForSeq2SeqLM.from_pretrained('yjernite/bart_eli5')

In [9]:
# Make an instance of the QAKnowledgeBase class
qakb = QAKnowledgeBase(qar_model, qar_tokenizer, qa_s2s_model, qa_s2s_tokenizer, wiki40b_index, wiki40b_snippets)
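
Conceptually, a knowledge base like this retrieves supporting passages for a question and then hands the question plus those passages to the sequence-to-sequence model as a single input string. The sketch below follows the `question: ... context: <P> ...` layout described in the original blog post. Whether `QAKnowledgeBase` builds its input in exactly this way is an assumption, and `build_seq2seq_input` is a hypothetical name.

```python
def build_seq2seq_input(question, passages):
    """Join retrieved passages into one support document and prepend the question.

    Assumed layout, taken from the blog post: passages separated by "<P>".
    """
    support_doc = "<P> " + " <P> ".join(passages)
    return "question: {} context: {}".format(question, support_doc)


# Placeholder passages standing in for retrieved wiki snippets.
example_input = build_seq2seq_input(
    "How do birds fly?",
    ["first retrieved passage", "second retrieved passage"],
)
print(example_input)
```

The resulting string is what would be tokenized and passed to the generation model.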

In [10]:
# and ask a question with the ask_question method
qakb.ask_question("How do birds fly?", max_answer_length=256)

['Birds use their wings to generate lift. When they flap their wings, the up and down motion of the wing pushes the air in front of them, creating lift.']

In [11]:
# If you want multiple answers, use this method. They can be a little nutty.
qakb.get_best_answers("How do birds fly?")

['>  Flapping involves two stages: the downstroke, which provides the thrust, and the upstroke, *which can also* provide some thrust. The bird can also adjust its wing angle between each flapping stroke, allowing the wing to generate lift on both the up and downstroke. Birds have much',
 'They flap their wings with up and down thrusts. The difference being they have a large number of tiny muscles in them which create a tremendous amount of energy.',
 "They have 2 arms, each with a different angle, their up/down movement is proportional to their wing's angle. They use a special flap between the legs of the bird so it can generate lift.",
 'They use their talons to generate lift. They then flap their wings. The wings make the air that is behind them lift. This works for a small bird, and for a large bird, too.',
 'Birds have small wings but they use their weight to fly. They\'re very light and have a high efficiency to make lift. So when birds flap their wings, the airflow is pushed upwa

So it works, sort of! Have fun playing around with it.