# Retriever, Ranker and Summarization

This notebook is a tutorial on building a summarization pipeline with a retriever, ranker and summarizer architecture. The retriever acts as a first filter based on word matches between the query and the documents. The ranker then filters the retrieved documents a second time, based on their semantic similarity to the query. Finally, a summarization model summarizes the documents relevant to the query.

In [1]:
%load_ext autoreload
%autoreload 2

In [9]:
from cherche import data, rank, retrieve, summary
from sentence_transformers import SentenceTransformer
from transformers import pipeline

We use the `towns` corpus for this example and ask the pipeline to summarize the documents relevant to a given query.

In [3]:
documents = data.load_towns()
documents[:4]

[{'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017

We start by creating a retriever whose job is to quickly filter the documents. It matches the query against both the title and the content of each article by combining two TF-IDF retrievers with the union operator `|`.

In [4]:
retriever = retrieve.TfIdf(on="title", k = 30) | retrieve.TfIdf(on="article", k = 30) 
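
To make the idea of a union of retrievers concrete, here is a minimal plain-Python sketch. It is not how `cherche` implements `TfIdf` or `|`: the helpers `keyword_retrieve` and `union` and the toy documents are hypothetical, and a simple word-overlap count stands in for TF-IDF scoring. The sketch only illustrates that each retriever searches one field and that the union keeps every document found by either, without duplicates.

```python
def keyword_retrieve(documents, query, on, k):
    """Return the indices of up to k documents whose `on` field shares words with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for index, document in enumerate(documents):
        overlap = len(query_terms & set(document[on].lower().split()))
        if overlap > 0:
            scored.append((overlap, index))
    # Highest overlap first; ties keep the original document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:k]]


def union(*result_lists):
    """Merge result lists, keeping only the first occurrence of each document index."""
    seen, merged = set(), []
    for results in result_lists:
        for index in results:
            if index not in seen:
                seen.add(index)
                merged.append(index)
    return merged


# Toy corpus (hypothetical, much smaller than the `towns` dataset).
toy_documents = [
    {"title": "Paris", "article": "Paris is the capital of France."},
    {"title": "Lyon", "article": "Lyon is known for its cuisine."},
    {"title": "Bordeaux", "article": "Bordeaux is a port city near Lyon by train."},
]

by_title = keyword_retrieve(toy_documents, "lyon", on="title", k=30)
by_article = keyword_retrieve(toy_documents, "lyon", on="article", k=30)
candidates = union(by_title, by_article)
print(candidates)
```

Document 1 is found by both retrievers but appears only once in `candidates`, while document 2 is found through its article alone.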

We then add a ranker to the pipeline to filter the results according to the semantic similarity between the query and the retriever's output documents. The ranker is based on the content of the article.

In [5]:
ranker = rank.Encoder(
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    on = "article",
    k = 3,
    path = "encoder.pkl"
)
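
As a rough illustration of what ranking by cosine similarity means, the sketch below scores a few hand-written two-dimensional vectors against a query vector and keeps the top `k`. The vectors are made up and stand in for the sentence embeddings a real encoder would produce; `cosine_similarity` and `rank_by_similarity` are hypothetical helper names, not part of `cherche`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, candidate_embeddings, k):
    """Return the keys of the k candidates most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), key)
        for key, embedding in candidate_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:k]]


# Toy embeddings (assumed values, for illustration only).
toy_embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.6, 0.8],
    "doc_c": [0.0, 1.0],
}

top = rank_by_similarity([1.0, 0.1], toy_embeddings, k=2)
print(top)
```

Pre-computing the document embeddings once, as the ranker does with `encoder.pkl`, means only the query has to be encoded at search time.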

In [30]:
summarization = summary.Summary(
    model = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6", tokenizer="sshleifer/distilbart-cnn-12-6", framework="pt"),
    on = "article",
)

We initialise the pipeline, asking the retrievers to index the documents and the ranker to pre-compute the document embeddings. This step can take some time if you have a lot of documents; a GPU could speed up the process. The ranker's embeddings are stored in the file `encoder.pkl`.

In [31]:
search = retriever + ranker + summarization
search.add(documents)

Union
-----
TfIdf retriever
 	 on: title
 	 documents: 420
TfIdf retriever
 	 on: article
 	 documents: 420
-----
Encoder ranker
	 on: article
	 k: 3
	 similarity: cosine
	 embeddings stored at: encoder.pkl
Summarization model
	 on: article
	 min length: 5
	 max length: 30
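
The chaining of the three stages can be pictured with a small plain-Python sketch. Everything here is an assumption made for illustration: the stage functions, `run_pipeline` and the toy documents are hypothetical, word overlap replaces both TF-IDF and embedding similarity, and the "summarizer" merely keeps the first sentence instead of calling a model. The point is only that each stage receives the query and the output of the previous stage.

```python
def overlap(query, document):
    """Number of query words that also appear in the document's article."""
    return len(set(query.lower().split()) & set(document["article"].lower().split()))


def retrieve_stage(query, documents):
    # First filter: keep documents sharing at least one word with the query.
    return [document for document in documents if overlap(query, document) > 0]


def rank_stage(query, documents, k=1):
    # Second filter: keep the k best candidates (placeholder score, not embeddings).
    return sorted(documents, key=lambda document: -overlap(query, document))[:k]


def summarize_stage(query, documents):
    # Placeholder summarizer: first sentence of the concatenated articles.
    text = " ".join(document["article"] for document in documents)
    return text.split(". ")[0].rstrip(".") + "."


def run_pipeline(query, documents, stages):
    """Feed the output of each stage into the next one."""
    result = documents
    for stage in stages:
        result = stage(query, result)
    return result


# Toy corpus (hypothetical).
toy_documents = [
    {"article": "Lyon is the third-largest city of France. It lies on the Rhône."},
    {"article": "Lyon food is famous. Lyon is recognised for its cuisine."},
    {"article": "Bordeaux is a port city on the river Garonne."},
]

result = run_pipeline(
    "Lyon food", toy_documents, [retrieve_stage, rank_stage, summarize_stage]
)
print(result)
```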

A few examples of summarization:

Bordeaux is a French city located in Gironde.

In [32]:
search("Bordeaux")

'Bordeaux is a port city on the river Garonne in the Gironde department, Southwestern France. It is the centre of'

Toulouse is a French city located in Occitanie.

In [33]:
search("Toulouse")

'Toulouse is the prefecture of Haute-Garonne and of the larger region of Occitanie. It is the'

Lyon is known for its food.

In [34]:
search("Lyon food")

'Lyon is the third-largest city and second-largest urban area of France. The city is recognised for its cuisine and gastronomy'