# Retriever, ranker and summarization

Here we build a summarization pipeline made of a retriever, a ranker, and a summarizer. The retriever and the ranker filter the input documents so that only those relevant to the query are summarized.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from cherche import data, rank, retrieve, summary
from sentence_transformers import SentenceTransformer
from transformers import pipeline

We use the `towns` corpus for this example and ask the pipeline to summarize the documents relevant to a given query.

In [3]:
documents = data.load_towns()
documents[:4]

[{'id': 0,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'id': 1,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."},
 {'id': 2,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'},
 {'id': 3,
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The Paris Region had 

We start by creating a retriever whose job is to quickly filter the documents. The `on` parameter tells it to match on both the title and the content of each article.

In [4]:
retriever = retrieve.TfIdf(key="id", on=["title", "article"], documents=documents, k=30)
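To make the filtering step concrete, here is a minimal pure-Python sketch of TF-IDF retrieval over several fields. It is an illustration only and not how `retrieve.TfIdf` is implemented: the toy corpus, the `tokenize` and `tfidf_retrieve` helpers, the weighting formula, and the shape of the returned dictionaries are all assumptions made for this example.

```python
import math
from collections import Counter

# Toy corpus: field names mirror the `towns` documents, the texts are made up.
toy_docs = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 1, "title": "Bordeaux", "article": "Bordeaux is a port city known for wine."},
    {"id": 2, "title": "Lyon", "article": "Lyon is recognised for its cuisine."},
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return [token.strip(".,").lower() for token in text.split()]


def tfidf_retrieve(query, docs, on, k):
    """Score each document against the query and keep the k best matches."""
    # Concatenate the searched fields, in the spirit of the `on` parameter.
    tokens = {d["id"]: tokenize(" ".join(d[field] for field in on)) for d in docs}
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for toks in tokens.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        score = sum(
            tf[term] / len(toks) * math.log(1 + n_docs / df[term])
            for term in tokenize(query)
            if term in tf
        )
        if score > 0:  # documents sharing no term with the query are dropped
            scores[doc_id] = score
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return [{"id": doc_id, "similarity": scores[doc_id]} for doc_id in best]


results = tfidf_retrieve("bordeaux wine", toy_docs, on=["title", "article"], k=2)
print([r["id"] for r in results])
```

Only the Bordeaux document shares terms with the query, so it is the only candidate passed on, even though `k` allows two.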

We then add a ranker to the pipeline to filter the results according to the semantic similarity between the query and the retriever's output documents. As set by `on`, the ranker encodes the title and the content of each article.

In [5]:
ranker = rank.Encoder(
    key="id",
    on=["title", "article"],
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k=5,  # keep the 5 documents most similar to the query
)
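The re-ranking idea can be sketched without any model: score each candidate by the cosine similarity between its embedding and the query embedding, then keep the top `k`. The snippet below is a hedged illustration, not the `rank.Encoder` implementation. The `cosine` and `rerank` helpers are hypothetical, and the hand-made 3-dimensional vectors stand in for real sentence embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank(query_vec, candidates, embeddings, k):
    """Sort candidate ids by similarity to the query and keep the k best."""
    scored = [(doc_id, cosine(query_vec, embeddings[doc_id])) for doc_id in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical pre-computed embeddings, keyed by document id.
toy_embeddings = {0: [1.0, 0.0, 0.0], 1: [0.6, 0.8, 0.0], 2: [0.0, 0.0, 1.0]}

top = rerank([1.0, 0.1, 0.0], candidates=[0, 1, 2], embeddings=toy_embeddings, k=2)
print([doc_id for doc_id, _ in top])
```

Because the ranker only scores the ids handed over by the retriever, the cost of the similarity computation depends on the retriever's `k`, not on the size of the corpus.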

In [6]:
summarization = summary.Summary(
    model=pipeline(
        "summarization",
        model="sshleifer/distilbart-cnn-12-6",
        tokenizer="sshleifer/distilbart-cnn-12-6",
        framework="pt",
    ),
    on="article",
)

We initialize our pipeline. The `add` method asks the ranker to pre-compute the embeddings of each document. A GPU could speed up the process.

⚠️ When summarizing or doing question answering, we need to map the ids returned by the ranker back to the full documents. To do this, we add `documents` to our pipeline. This step is optional when using Elasticsearch, Meilisearch or TypeSense as a retriever.

In [7]:
search = retriever + ranker + documents + summarization
search.add(documents)

Ranker embeddings calculation.: 100%|█| 2/2 [00:02<00:


TfIdf retriever
 	 key: id
 	 on: title, article
 	 documents: 105
Encoder ranker
	 key: id
	 on: title, article
	 k: 5
	 similarity: cosine
	 Embeddings pre-computed: 105
Mapping to documents
Summarization model
	 on: article
	 min length: 5
	 max length: 30
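The "Mapping to documents" step in the printout above can be pictured as a lookup from ids to full records. The following is a rough sketch under stated assumptions rather than the library's code: the ranked output format, the toy records, and the way the `article` fields are joined before summarization are all hypothetical.

```python
# Toy records and a made-up ranker output that only carries ids and scores.
toy_documents = [
    {"id": 0, "title": "Paris", "article": "Paris is the capital of France."},
    {"id": 1, "title": "Lyon", "article": "Lyon is recognised for its cuisine."},
]
ranked = [{"id": 1, "similarity": 0.9}, {"id": 0, "similarity": 0.4}]

# Build an id -> document index once, then enrich each ranked hit with its fields.
index = {doc["id"]: doc for doc in toy_documents}
mapped = [{**index[hit["id"]], **hit} for hit in ranked]

# Assumption: the summarizer receives the `article` texts in ranked order.
text_to_summarize = " ".join(doc["article"] for doc in mapped)
print(text_to_summarize)
```

Without this lookup the summarizer would only see ids and scores, which is why the mapping sits between the ranker and the summarization model.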

A few examples of summarization:

Bordeaux is a French city based in Gironde.

In [8]:
search("Bordeaux")

'The "Pearl of Aquitaine" has been voted European Destination of the year in a 2015 online poll. Bordeaux is'

Toulouse is a French city based in Occitanie.

In [9]:
search("Toulouse")

'Toulouse is the prefecture of Haute-Garonne and of the larger region of Occitanie. It is the'

Lyon is known for its food.

In [10]:
search("Lyon food")

'Lyon is the third-largest city and second-largest urban area of France. The city is recognised for its cuisine and gastronomy'