# Quick Start

This notebook walks through the minimum steps needed to run the ELI5 QA system, using the code in src/qa_utils.py.

For an explanation of what is going on under the hood, see the ELI5-redux.ipynb notebook and the original blog post: https://yjernite.github.io/lfqa.html.

In [1]:
import sys, os
sys.path.append("../src") # where the source code lies

import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSeq2SeqLM
from datasets import load_dataset

from qa_utils import embed_passages, make_index, QAKnowledgeBase

# Making the Embeddings

In [2]:
# load the dataset
wiki40b_snippets = load_dataset("wiki_snippets", name = "wiki40b_en_100_0")["train"]

Reusing dataset wiki_snippets (/home/rob/.cache/huggingface/datasets/wiki_snippets/wiki40b_en_100_0/1.0.0/d152a0e6a420c02b9b26e7f75f45fb54c818cae1d83e8f164f0b1a13ac7998ae)


In [3]:
# load the RetriBERT model used to embed the wiki40b_snippets passages
qar_tokenizer = AutoTokenizer.from_pretrained('yjernite/retribert-base-uncased')
qar_model = AutoModel.from_pretrained('yjernite/retribert-base-uncased')

Some weights of RetriBertModel were not initialized from the model checkpoint at yjernite/retribert-base-uncased and are newly initialized: ['bert_query.embeddings.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [4]:
# Make the embeddings for the index. Embedding the full dataset takes about 15 hours.

index_filename = "my_wiki_embeddings.dat" # change this to the full file path where you want to store the index

# Note: embed_passages overwrites the file if it already exists, so check for it first.
if not os.path.isfile(index_filename): # here as a safety measure
    print("starting to embed")
    embed_passages(wiki40b_snippets, 
                   qar_tokenizer, 
                   qar_model, 
                   n_batches = 2, # remove this line to embed the entire dataset
                   index_file_name = index_filename,
                   device="cuda:0")
else:
    print("file already exists are you sure you want to overwrite and re-embed?")

file already exists are you sure you want to overwrite and re-embed?
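
Before building an index from an embeddings file, it can help to check that the file has the size you expect. The sketch below does not call qa_utils. It assumes the file is a flat array of float32 values with 128 values per passage (the layout the blog post describes); `count_passage_rows`, `EMBED_DIM` and `BYTES_PER_VALUE` are hypothetical names introduced for illustration. If embed_passages pre-allocates the file for the whole dataset, the count reflects the allocated rows, not how many batches were actually embedded.

```python
import os
import tempfile
from array import array

EMBED_DIM = 128       # assumption: one 128-dimensional vector per passage
BYTES_PER_VALUE = 4   # assumption: values are stored as float32

def count_passage_rows(path, dim=EMBED_DIM, itemsize=BYTES_PER_VALUE):
    """Return how many passage vectors a flat embeddings file has room for."""
    size = os.path.getsize(path)
    row_bytes = dim * itemsize
    if size % row_bytes != 0:
        raise ValueError(f"{path}: {size} bytes is not a multiple of {row_bytes}")
    return size // row_bytes

# Demo on a throwaway file holding 3 fake passage vectors.
with tempfile.NamedTemporaryFile(suffix=".dat", delete=False) as tmp:
    array("f", [0.0] * (3 * EMBED_DIM)).tofile(tmp)
    demo_path = tmp.name

n_rows = count_passage_rows(demo_path)
os.remove(demo_path)
print(n_rows)  # prints 3
```

With the real data you would compare `count_passage_rows(index_filename)` against `wiki40b_snippets.num_rows`, provided the layout assumptions above hold for your file.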


# Asking a Question

In [4]:
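# Make the index from the saved embeddings. Note that this path differs from
# index_filename above; point it at the embeddings file you want to search.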
full_index_filename = "../kb_index/wiki_index_embed.dat"
wiki40b_index = make_index(full_index_filename, device="cpu")

Putting index on cpu.
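
The blog post describes retrieval as a max inner product search: the question is embedded into the same space as the passages, and the passages with the highest dot product against the question vector are returned. The sketch below illustrates that idea in plain Python on toy vectors. It is a stand-in for illustration only, not what make_index does internally, and `top_k_inner_product` is a hypothetical helper.

```python
import heapq

def top_k_inner_product(query, passages, k=2):
    """Return [(score, passage_index), ...] for the k highest inner products."""
    scored = (
        (sum(q * p for q, p in zip(query, vec)), idx)
        for idx, vec in enumerate(passages)
    )
    # nlargest sorts by score first, so the best match comes first.
    return heapq.nlargest(k, scored)

# Toy 3-dimensional "embeddings"; real passage vectors are much larger.
passage_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 0.2, 0.0]

hits = top_k_inner_product(query_vector, passage_vectors, k=2)
best_indices = [idx for _, idx in hits]
print(best_indices)  # prints [0, 2]
```

A brute-force scan like this is only practical for small collections; for millions of passages a dedicated similarity search library is the usual choice.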


In [5]:
# load the BART model that is fine-tuned on the ELI5 task
qa_s2s_tokenizer = AutoTokenizer.from_pretrained('yjernite/bart_eli5')
qa_s2s_model = AutoModelForSeq2SeqLM.from_pretrained('yjernite/bart_eli5')

In [6]:
# Make an instance of the QAKnowledgeBase class
qakb = QAKnowledgeBase(qar_model, qar_tokenizer, qa_s2s_model, qa_s2s_tokenizer, wiki40b_index, wiki40b_snippets)

In [7]:
# and ask a question with the ask_question method
qakb.ask_question("How does a computer work?", max_answer_length=256)

["[This video]( URL_0 ) does a pretty good job of explaining how a computer works. It's a bit long, but it's worth it."]
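
The answer above points at `URL_0`, which looks like a placeholder token, not a real link, so the "video" it mentions cannot be followed. If that matters for your use case, a small filter can flag such answers. This is a minimal sketch that assumes placeholders always have the form `URL_<number>` as in the output above; `has_url_placeholder` is a hypothetical helper, not part of qa_utils.

```python
import re

# Assumption: placeholders look like URL_0, URL_1, ... as in the output above.
URL_PLACEHOLDER = re.compile(r"\bURL_\d+\b")

def has_url_placeholder(answer):
    """Return True if the answer text contains a URL_<n> placeholder."""
    return URL_PLACEHOLDER.search(answer) is not None

answers = [
    "[This video]( URL_0 ) does a pretty good job of explaining how a computer works.",
    "A computer is basically just a bunch of transistors connected together in a circuit.",
]

# Keep only answers that do not lean on a missing link.
usable = [a for a in answers if not has_url_placeholder(a)]
print(len(usable))  # prints 1
```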

In [8]:
# If you want multiple answers, use this method. They can be a little nutty.
qakb.get_best_answers("How does a computer work?")

["I'm not an expert in this area, but just to make a short answer, the computer operates by manipulating electrical signals from one type of light bulb (transistors) on an electromagnet (transistor) to a different type of wire (transicluder). The switches (hard drive and RAM",
 'A computer is basically just a bunch of transistors connected together in a circuit. The transistors in turn run electrical devices (the processor) in parallel.',
 'There is a lot of interesting stuff here. So a computer is actually an extremely complex network of switches that are a little like switches in a computer. There are hundreds of millions and billions of switches in that network, which are used to hold millions of things like all the data and to do things. In some',
 "A computer's processor has a very specific set of instructions on what the computer does. The computer takes this instruction (which it's programmed in) and does something (generally, like taking a number and converting it into a digit,