# Retriever, ranker and question answering

The objective of this notebook is to build a simple pipeline for extractive question answering. We will combine a retriever, a ranker and a question answering model. The retriever acts as a first filter, based on word matches between the question and the documents. The ranker filters the documents a second time, based on the semantic similarity between the documents and the question. Finally, the question answering model extracts the answer from the filtered documents. Extractive question answering models are slow, so filtering the documents first reduces the time needed to answer a question. A GPU is recommended for this type of pipeline.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from cherche import data, rank, retrieve, qa
from sentence_transformers import SentenceTransformer
from transformers import pipeline

We use the `towns` corpus for this example, which lets us ask questions about the cities of Bordeaux, Toulouse, Paris and Lyon.

In [3]:
documents = data.load_towns()
documents[:4]

[{'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017

We start by creating a retriever whose job is to quickly filter the documents. The `on` parameter tells the retriever to match on both the title and the content of each article, and `k` limits the output to the 30 best matches.

In [4]:
retriever = retrieve.TfIdf(on=["title", "article"], k = 30)
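
To make the word-matching idea concrete, here is a minimal pure-Python sketch of TF-IDF retrieval. It is an illustration only, not how `retrieve.TfIdf` is implemented: the `tokenize` and `tfidf_retrieve` helpers, the smoothed IDF formula and the toy documents are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs only (hypothetical, simplistic tokenizer).
    return re.findall(r"\w+", text.lower())


def tfidf_retrieve(query, documents, on, k):
    """Return up to k documents sharing words with the query, best first."""
    # Concatenate the fields listed in `on`, mirroring the idea of on=["title", "article"].
    tokenized = [tokenize(" ".join(doc[field] for field in on)) for doc in documents]
    n = len(tokenized)
    # Document frequency: in how many documents each token appears.
    df = Counter(token for tokens in tokenized for token in set(tokens))

    scored = []
    for doc, tokens in zip(documents, tokenized):
        tf = Counter(tokens)
        score = sum(
            (tf[token] / len(tokens)) * (math.log((1 + n) / (1 + df[token])) + 1)
            for token in set(tokenize(query))
            if token in tf
        )
        if score > 0:  # documents with no word in common are filtered out
            scored.append((score, doc))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


# Toy documents, invented for this sketch.
toy_documents = [
    {"title": "Paris", "article": "Paris is the capital of France."},
    {"title": "Lyon", "article": "Lyon was historically known for silk."},
    {"title": "Bordeaux", "article": "Bordeaux is a world capital of wine."},
]

top = tfidf_retrieve("capital of wine", toy_documents, on=["title", "article"], k=2)
print([doc["title"] for doc in top])
```

The Lyon document shares no word with the query, so it is dropped before any expensive model runs, which is the purpose of this first stage.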

We then add a ranker to the pipeline to filter the results according to the semantic similarity between the query and the retriever's output documents. The ranker will be based on the content of the article.

In [5]:
ranker = rank.Encoder(
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    on = "article",
    k = 3,
    path = "encoder.pkl"
)
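
The ranker scores each retrieved document by cosine similarity between its embedding and the query embedding, then keeps the top `k`. The sketch below shows that ranking step in plain Python. The `cosine` and `rank_by_similarity` helpers and the three-dimensional vectors are made up for illustration; real embeddings come from the sentence encoder and are much larger.

```python
import math


def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_embedding, candidates, k):
    """Attach a `similarity` score to each document and keep the k best."""
    scored = [
        {**doc, "similarity": cosine(query_embedding, embedding)}
        for doc, embedding in candidates
    ]
    scored.sort(key=lambda doc: doc["similarity"], reverse=True)
    return scored[:k]


# Hypothetical (document, embedding) pairs; the vectors are invented.
candidates = [
    ({"article": "A"}, [0.0, 1.0, 0.0]),
    ({"article": "B"}, [1.0, 1.0, 0.0]),
    ({"article": "C"}, [2.0, 0.0, 0.0]),
]

ranked = rank_by_similarity([1.0, 0.0, 0.0], candidates, k=2)
print([doc["article"] for doc in ranked])
```

Cosine similarity ignores vector length, so document C, which points in the same direction as the query, ranks first even though its embedding is twice as long.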

In [6]:
question_answering = qa.QA(
    model = pipeline("question-answering", model = "deepset/roberta-base-squad2", tokenizer = "deepset/roberta-base-squad2"),
    on = "article",
)

We initialise the pipeline, then ask the retriever to index the documents and the ranker to pre-compute the document embeddings. This step can take some time if you have a lot of documents; a GPU can speed it up. The ranker's embeddings are stored in the file `encoder.pkl`.

In [7]:
search = retriever + ranker + question_answering
search.add(documents)

TfIdf retriever
 	 on: title, article
 	 documents: 105
Encoder ranker
	 on: article
	 k: 3
	 similarity: cosine
	 embeddings stored at: encoder.pkl
Question Answering
	 model: deepset/roberta-base-squad2
	 on: article
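
The `+` operator chains the three stages so that each one receives the previous stage's output. One way such chaining could be built is with `__add__`, as in the sketch below. The `Stage` and `Pipeline` classes and the toy stage functions are hypothetical and written only for this example; they are not taken from the library's source.

```python
class Pipeline:
    """Ordered list of stages applied one after the other (hypothetical)."""

    def __init__(self, stages):
        self.stages = list(stages)

    def __add__(self, other):
        # pipeline + stage -> a longer pipeline
        return Pipeline(self.stages + [other])

    def __call__(self, query, documents):
        # Each stage narrows down the documents produced by the previous one.
        for stage in self.stages:
            documents = stage.fn(query, documents)
        return documents


class Stage:
    """Wraps a function (query, documents) -> documents (hypothetical)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __add__(self, other):
        # stage + stage -> a two-stage pipeline
        return Pipeline([self]) + other


def word_match(query, documents):
    # Toy first filter: keep documents sharing at least one word with the query.
    words = set(query.lower().split())
    return [doc for doc in documents if words & set(doc.lower().split())]


def top_two(query, documents):
    # Toy second filter: keep the first two candidates.
    return documents[:2]


toy_pipeline = Stage("retriever", word_match) + Stage("ranker", top_two)
result = toy_pipeline(
    "silk festival",
    ["silk weaving in Lyon", "wine in Bordeaux", "a silk festival", "festival of lights"],
)
print(result)
```

Ordering the stages from cheapest to most expensive means the slowest step only ever sees the few documents that survived the earlier filters.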

Paris Saint-Germain is the name of Paris's football club. The pipeline returns two scores with each answer: `similarity`, which comes from the ranker, and `qa_score`, which comes from the question answering model. The higher the `qa_score`, the more likely the answer is correct. The answers are sorted from the most likely to the least likely.

In [8]:
search("What is the name of the football club of Paris?")

[{'start': 18,
  'end': 37,
  'answer': 'Paris Saint-Germain',
  'qa_score': 0.9848363399505615,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.',
  'similarity': 0.69058955},
 {'start': 16,
  'end': 31,
  'answer': 'Stade de France',
  'qa_score': 0.8121964335441589,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The 80,000-seat Stade de France, built for the 1998 FIFA World Cup, is located just north of Paris in the neighbouring commune of Saint-Denis.',
  'similarity': 0.44287413},
 {'start': 29,
  'end': 35,
  'answer': '\u200b[paʁi',
  'qa_score': 2.7218469767831266e-05,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in 
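
Because each answer carries a `qa_score`, low-confidence answers such as the third one above can be discarded after the fact. The sketch below does this with a plain list comprehension. The `confident_answers` helper and the 0.5 threshold are assumptions chosen for illustration; the answers and scores are copied from the output above.

```python
def confident_answers(results, min_qa_score):
    """Keep only the answers whose qa_score reaches the threshold."""
    return [r["answer"] for r in results if r["qa_score"] >= min_qa_score]


# Answers and scores copied from the output above (other fields omitted).
results = [
    {"answer": "Paris Saint-Germain", "qa_score": 0.9848363399505615},
    {"answer": "Stade de France", "qa_score": 0.8121964335441589},
    {"answer": "\u200b[paʁi", "qa_score": 2.7218469767831266e-05},
]

# 0.5 is an arbitrary cut-off; a suitable value depends on the model and the data.
kept = confident_answers(results, min_qa_score=0.5)
print(kept)
```

A high `qa_score` does not guarantee a correct answer: "Stade de France" passes this threshold even though it is a stadium, not a football club.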

Toulouse in France is known as "The Pink City".

In [9]:
search("What is the color of Toulouse?")

[{'start': 39,
  'end': 46,
  'answer': 'pinkish',
  'qa_score': 0.525703489780426,
  'title': 'Toulouse',
  'url': 'https://en.wikipedia.org/wiki/Toulouse',
  'article': 'The city\'s unique architecture made of pinkish terracotta bricks has earned Toulouse the nickname La Ville Rose ("The Pink City").',
  'similarity': 0.5668377},
 {'start': 11,
  'end': 35,
  'answer': 'too-LOOZ, French: [tuluz',
  'qa_score': 0.024280384182929993,
  'title': 'Toulouse',
  'url': 'https://en.wikipedia.org/wiki/Toulouse',
  'article': 'Toulouse ( too-LOOZ, French: [tuluz] (listen); Occitan: Tolosa [tuˈluzɔ]; Latin: Tolosa [tɔˈloːsa]) is the prefecture of the French department of Haute-Garonne and of the larger region of Occitanie.',
  'similarity': 0.60262233},
 {'start': 165,
  'end': 169,
  'answer': 'Nice',
  'qa_score': 0.0002655540592968464,
  'title': 'Toulouse',
  'url': 'https://en.wikipedia.org/wiki/Toulouse',
  'article': 'It is the fourth-largest commune in France, with 479,553 inhabitants 

Bordeaux is known worldwide for its wines.

In [10]:
search("What is the speciality of Bordeaux ?")

[{'start': 31,
  'end': 35,
  'answer': 'wine',
  'qa_score': 0.7739053964614868,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': "Bordeaux is a world capital of wine, with its castles and vineyards of the Bordeaux region that stand on the hillsides of the Gironde and is home to the world's main wine fair, Vinexpo.",
  'similarity': 0.65491754},
 {'start': 85,
  'end': 89,
  'answer': 'port',
  'qa_score': 0.3996890187263489,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux ( bor-DOH, French: [bɔʁdo] (listen); Gascon Occitan: Bordèu [buɾˈðɛw]) is a port city on the river Garonne in the Gironde department, Southwestern France.',
  'similarity': 0.65354306},
 {'start': 57,
  'end': 92,
  'answer': 'architectural and cultural heritage',
  'qa_score': 0.376557856798172,
  'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux is an international tourist destination

Every year there is a silk festival in Lyon.

In [11]:
search("What is the speciality of Lyon ?")

[{'start': 41,
  'end': 48,
  'answer': 'banking',
  'qa_score': 0.6367459297180176,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Economically, Lyon is a major centre for banking, as well as for the chemical, pharmaceutical and biotech industries.',
  'similarity': 0.6472855},
 {'start': 74,
  'end': 78,
  'answer': 'silk',
  'qa_score': 0.3443596661090851,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Lyon was historically an important area for the production and weaving of silk.',
  'similarity': 0.64086735},
 {'start': 132,
  'end': 142,
  'answer': 'urban area',
  'qa_score': 0.0976468175649643,
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Lyon or Lyons (UK: , US: , French: [ljɔ̃] (listen); Arpitan: Liyon, pronounced [ʎjɔ̃]) is the third-largest city and second-largest urban area of France.',
  'similarity': 0.66174024}]