# Union and intersection of rankers

This notebook builds a neural search pipeline from an ensemble of rankers. The union operator `|` keeps the documents proposed by at least one of the rankers. The intersection operator `&` keeps only the documents proposed by all of the rankers.
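
The two operators behave much like Python's own set operators. The sketch below uses plain sets of hypothetical document ids to illustrate the idea; it is not how cherche implements them, since a real pipeline also has to keep a ranking order.

```python
# Hypothetical document ids proposed by two rankers.
ranker_a = {"doc_1", "doc_2", "doc_3"}
ranker_b = {"doc_2", "doc_3", "doc_4"}

# Union: documents proposed by at least one ranker.
union = ranker_a | ranker_b

# Intersection: documents proposed by every ranker.
intersection = ranker_a & ranker_b

print(sorted(union))         # ['doc_1', 'doc_2', 'doc_3', 'doc_4']
print(sorted(intersection))  # ['doc_2', 'doc_3']
```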

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from cherche import data, rank, retrieve
from sentence_transformers import SentenceTransformer

The first step is to define the corpus on which we will perform the neural search. The towns dataset contains about a hundred documents, each with three attributes: the `title` of the article, its `url`, and the content of the `article`.

In [3]:
documents = data.load_towns()
documents[:4]

[{'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': "Since the 17th century, Paris has been one of Europe's major centres of finance, diplomacy, commerce, fashion, gastronomy, science, and arts."},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The City of Paris is the centre and seat of government of the region and province of Île-de-France, or Paris Region, which has an estimated population of 12,174,880, or about 18 percent of the population of France as of 2017.'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The Paris Region had a GDP of €709 billion ($808 billion) in 2017

## Union

We start by creating a retriever whose job is to quickly filter the documents. It is the union (`|`) of two TF-IDF retrievers: one matches the query against the content of the article, the other against its title.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

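retriever = (
    # Character n-grams (4 to 10 characters, within word boundaries) on the article text.
    retrieve.TfIdf(on="article", k=20, tfidf=TfidfVectorizer(ngram_range=(4, 10), analyzer="char_wb", max_df=0.3)) |
    # Same vectorizer settings, applied to the title.
    retrieve.TfIdf(on="title", k=20, tfidf=TfidfVectorizer(ngram_range=(4, 10), analyzer="char_wb", max_df=0.3))
)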

We will use a ranker composed of the union of two pre-trained sentence-transformer models. Each model keeps its `k = 5` best documents.

In [5]:
ranker = (
    # First ranker: general-purpose sentence embeddings.
    rank.Encoder(
        encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
        on="article",
        k=5,
        path="encoder.pkl"
    ) |
    # Second ranker: a different pre-trained model, with its own embeddings file.
    rank.Encoder(
        encoder=SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1").encode,
        on="article",
        k=5,
        path="encoder_2.pkl"
    )
)
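
As a rough illustration of what a union of two embedding-based rankers does, the sketch below ranks documents by cosine similarity with two sets of toy 2-d embeddings and merges the two top-`k` lists. The helpers `cosine`, `top_k` and `union` are hypothetical, and the merge order (first ranker's results, then new ones from the second) is an assumption rather than cherche's documented behaviour.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]


def union(*rankings):
    """Merge rankings in order, keeping the first occurrence of each id."""
    merged = []
    for ranking in rankings:
        for doc_id in ranking:
            if doc_id not in merged:
                merged.append(doc_id)
    return merged


# Toy embeddings standing in for the output of two different encoders.
docs_a = {"d1": [1.0, 0.0], "d2": [0.8, 0.6], "d3": [0.0, 1.0]}
docs_b = {"d1": [0.0, 1.0], "d2": [0.6, 0.8], "d3": [1.0, 0.0]}
query = [1.0, 0.0]

first = top_k(query, docs_a, k=2)   # ['d1', 'd2']
second = top_k(query, docs_b, k=2)  # ['d3', 'd2']
print(union(first, second))         # ['d1', 'd2', 'd3']
```

With `k` documents per ranker, the union can contain anywhere between `k` and `2k` documents, depending on how much the two models agree.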

In [6]:
search = retriever + ranker
search.add(documents)

Union
-----
TfIdf retriever
 	 on: article
 	 documents: 105
TfIdf retriever
 	 on: title
 	 documents: 105
-----
Union
-----
Encoder ranker
	 on: article
	 k: 5
	 similarity: cosine
	 embeddings stored at: encoder.pkl
Encoder ranker
	 on: article
	 k: 5
	 similarity: cosine
	 embeddings stored at: encoder_2.pkl
-----
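
The `+` operator chains the retriever and the ranker into a single callable. A minimal sketch of how such chaining could be written with operator overloading follows; the `Stage` class and the word-overlap scoring are purely illustrative assumptions, not cherche's implementation.

```python
class Stage:
    """A pipeline step: takes (query, candidates) and returns candidates."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, query, candidates):
        return self.fn(query, candidates)

    def __add__(self, other):
        # Chain two stages: feed this stage's output into the next one.
        return Stage(lambda q, c: other(q, self(q, c)))


def overlap(query, doc):
    """Number of words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


corpus = ["Paris football club", "Lyon silk", "Paris tennis", "Football in Lyon"]

# Cheap filter: keep documents sharing at least one word with the query.
toy_retriever = Stage(lambda q, c: [d for d in c if overlap(q, d) > 0])
# Costlier step: order the surviving documents by decreasing overlap.
toy_ranker = Stage(lambda q, c: sorted(c, key=lambda d: -overlap(q, d)))

toy_search = toy_retriever + toy_ranker
print(toy_search("Paris football", corpus))
# ['Paris football club', 'Paris tennis', 'Football in Lyon']
```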

In [7]:
search("Paris football")

[{'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris hosts the annual French Open Grand Slam tennis tournament on the red clay of Roland Garros.'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris received 12.'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'The 80,000-seat Stade de France, built for the 1998 FIFA World Cup, is located just north of Paris in the neighbouring commune of Sai

In [8]:
search("speciality Lyon")

[{'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Lyon or Lyons (UK: , US: , French: [ljɔ̃] (listen); Arpitan: Liyon, pronounced [ʎjɔ̃]) is the third-largest city and second-largest urban area of France.'},
 {'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Lyon is the prefecture of the Auvergne-Rhône-Alpes region and seat of the Departmental Council of Rhône (whose jurisdiction, however, no longer extends over the Metropolis of Lyon since 2015).'},
 {'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Lyon was historically an important area for the production and weaving of silk.'},
 {'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Economically, Lyon is a major centre for banking, as well as for the chemical, pharmaceutical and biotech industries.'},
 {'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': 'Lyon became a major economic hub during the

## Intersection

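We now combine the same two pre-trained models with the intersection operator `&`. The pipeline returns only the documents that are proposed by at least one of the two retrievers (their union) and by both rankers (their intersection). The retriever from the Union section is reused, and `add` is called on it a second time, which presumably indexes the corpus again and would explain why the summary below reports 210 documents instead of 105.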

In [9]:
ranker = (
    # Same two models as before, now combined with `&`:
    # a document is kept only if both rankers place it in their top k.
    rank.Encoder(
        encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
        on="article",
        k=5,
        path="encoder.pkl"
    ) &
    rank.Encoder(
        encoder=SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1").encode,
        on="article",
        k=5,
        path="encoder_2.pkl"
    )
)
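
An order-preserving intersection of two top-`k` lists can be sketched as follows. The `intersection` helper and the document ids are hypothetical, and keeping the order of the first ranking is an assumption made for the illustration.

```python
def intersection(*rankings):
    """Keep ids present in every ranking, in the order of the first one."""
    common = set(rankings[0]).intersection(*rankings[1:])
    return [doc_id for doc_id in rankings[0] if doc_id in common]


# Hypothetical top-5 lists from two rankers (k = 5 each).
first = ["d1", "d2", "d3", "d4", "d5"]
second = ["d2", "d1", "d6", "d4", "d3"]

kept = intersection(first, second)
print(kept)  # ['d1', 'd2', 'd3', 'd4']
```

Because a document must appear in both lists, the intersection returns at most `k` documents and possibly fewer, which may be why a query can come back with fewer than five results.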

In [10]:
search = retriever + ranker
search.add(documents)

Union
-----
TfIdf retriever
 	 on: article
 	 documents: 210
TfIdf retriever
 	 on: title
 	 documents: 210
-----
Intersection
-----
Encoder ranker
	 on: article
	 k: 5
	 similarity: cosine
	 embeddings stored at: encoder.pkl
Encoder ranker
	 on: article
	 k: 5
	 similarity: cosine
	 embeddings stored at: encoder_2.pkl
-----

In [11]:
search("Paris football")

[{'article': 'The football club Paris Saint-Germain and the rugby union club Stade Français are based in Paris.',
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris'},
 {'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).',
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris'},
 {'article': 'Paris hosts the annual French Open Grand Slam tennis tournament on the red clay of Roland Garros.',
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris'},
 {'article': 'Paris received 12.',
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris'},
 {'article': 'The 80,000-seat Stade de France, built for the 1998 FIFA World Cup, is located just north of Paris in the neighbouring commune of Saint-Denis.',
  'title': 'Paris',
  'url': 'https://en.wikipedia.org/w

In [12]:
search("speciality Lyon")

[{'article': 'Lyon or Lyons (UK: , US: , French: [ljɔ̃] (listen); Arpitan: Liyon, pronounced [ʎjɔ̃]) is the third-largest city and second-largest urban area of France.',
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon'},
 {'article': 'Lyon is the prefecture of the Auvergne-Rhône-Alpes region and seat of the Departmental Council of Rhône (whose jurisdiction, however, no longer extends over the Metropolis of Lyon since 2015).',
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon'},
 {'article': 'Lyon was historically an important area for the production and weaving of silk.',
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon'},
 {'article': 'Lyon became a major economic hub during the Renaissance.',
  'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon'}]