# Encoder as a retriever

Usually, semantic similarity models are used to rerank the documents returned by a lexical retriever such as BM25. Sometimes, however, the user's query shares no term with any document, especially in small corpora, and the lexical retriever returns nothing. This is where neural search becomes useful: the encoder can act as a fallback and retrieve documents when the traditional BM25 has not found anything.

In [1]:
from cherche import retrieve, rank, data
from sentence_transformers import SentenceTransformer

Let's load a dummy dataset.

In [2]:
documents = data.load_towns()
documents[0]

{'title': 'Paris',
 'url': 'https://en.wikipedia.org/wiki/Paris',
 'article': 'Paris (French pronunciation: \u200b[paʁi] (listen)) is the capital and most populous city of France, with an estimated population of 2,175,601 residents as of 2018, in an area of more than 105 square kilometres (41 square miles).'}

First, we search with a TfIdf retriever to show that a lexical model's ability to retrieve documents can be limited.

In [3]:
retriever = retrieve.TfIdf(on=["article", "title"], k=10)
retriever.add(documents=documents)

TfIdf retriever
 	 on: article, title
 	 documents: 105

Only a single document matches the query "food" with the default TfIdf.

In [4]:
retriever("food")

[{'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.'}]
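
A lexical retriever can only return documents that share at least one term with the query. The sketch below illustrates that behaviour with plain token overlap. It is a simplified stand-in, not the TfIdf implementation used by the library, and `tokens`, `lexical_matches` and the shortened `toy_documents` are hypothetical names and data introduced for illustration.

```python
import re


def tokens(text):
    """Lowercase a text and split it into a set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def lexical_matches(query, documents, on):
    """Keep the documents that share at least one token with the query."""
    query_terms = tokens(query)
    hits = []
    for doc in documents:
        doc_terms = set()
        for field in on:
            doc_terms |= tokens(doc[field])
        if query_terms & doc_terms:
            hits.append(doc)
    return hits


# Toy documents (assumption): shortened sentences, not the real dataset.
toy_documents = [
    {"title": "Montreal", "article": "A centre of culture, tourism, food and fashion."},
    {"title": "Lyon", "article": "The city is recognised for its cuisine and gastronomy."},
]

hits = lexical_matches("food", toy_documents, on=["title", "article"])
print([doc["title"] for doc in hits])
```

The second toy document is about gastronomy but never uses the word "food", so a purely lexical match cannot surface it.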

We can now compare these results with those of Sentence-BERT by initialising a retriever based on semantic similarity.

In [5]:
retriever = retrieve.Encoder(
    encoder=SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    on=["title", "article"],
    k=5,
    path="encoder.pkl",
)

retriever.add(documents=documents)

Encoder retriever
 	 on: title, article
 	 documents: 105

The encoder recalls more documents, including ones that do not contain the word "food", and these documents appear relevant.

In [6]:
retriever("food")

[{'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List.",
  'similarity': 0.6018157045009224},
 {'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux is also one of the centers of gastronomy and business tourism for the organization of international congresses.',
  'similarity': 0.5962209382904881},
 {'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.',
  'similarity': 0.5876269088959901},
 {'title': 'Paris',
  'url': 'https
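
The `similarity` values in the output above come from comparing the query embedding with each document embedding. A minimal sketch of that idea follows, assuming cosine similarity as the score (the ranker in this notebook reports `similarity: cosine`; the retriever's internal index may work differently). The three-dimensional vectors are hand-made for illustration and are not real encoder outputs; `cosine` and `top_k` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vector, document_vectors, k):
    """Return the k (title, score) pairs with the highest cosine similarity."""
    scored = [
        (title, cosine(query_vector, vector))
        for title, vector in document_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made toy embeddings (assumption): the first axis loosely stands for "food".
toy_vectors = {
    "Lyon": [0.9, 0.1, 0.0],
    "Bordeaux": [0.8, 0.3, 0.1],
    "Toulouse": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]

results = top_k(query_vector, toy_vectors, k=2)
print([title for title, _ in results])
```

Because the comparison happens in embedding space, a document can score highly without sharing any word with the query.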

We can build a neural search pipeline that combines the precision of a TfIdf retriever reranked by an encoder with the recall of an encoder used as a retriever. The union operator `|` makes this possible. The pipeline we will create consists of a TfIdf retriever whose documents are reranked by a Sentence-BERT model. In a second step, the union operator appends the documents returned by the encoder retriever while avoiding duplicates.

In [7]:
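# Load the sentence encoder once and share it between the ranker and the retriever.
encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode

# Precision: TfIdf proposes up to 30 candidates, then the encoder keeps the 5 most similar.
precision = (
    retrieve.TfIdf(on=["article", "title"], k=30)
    + rank.Encoder(encoder=encoder, on=["title", "article"], k=5, path="encoder.pkl")
)

# Recall: the encoder alone retrieves the 5 most similar documents.
recall = retrieve.Encoder(encoder=encoder, on=["title", "article"], k=5, path="encoder.pkl")

# Union: results of `precision` first, then those of `recall`, without duplicates.
search = precision | recall

search.add(documents=documents)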

Union Pipeline
-----
TfIdf retriever
 	 on: article, title
 	 documents: 105
Encoder ranker
	 on: title, article
	 k: 5
	 similarity: cosine
	 embeddings stored at: encoder.pkl
Encoder retriever
 	 on: title, article
 	 documents: 105
-----
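
The behaviour of the union operator described earlier (results of the first pipeline, then results of the second, without duplicates) can be sketched in plain Python. This is an illustration under assumptions, not the library's implementation: `union` is a hypothetical helper, duplicates are assumed to be documents with the same title and article text, and the article strings are shortened placeholders.

```python
def union(*result_lists):
    """Concatenate result lists in order, keeping the first copy of each document."""
    seen = set()
    merged = []
    for results in result_lists:
        for doc in results:
            # Assumption: a document is identified by its title and article text.
            identity = (doc["title"], doc["article"])
            if identity not in seen:
                seen.add(identity)
                merged.append(doc)
    return merged


# Toy results (assumption): shortened placeholders, not real pipeline outputs.
precision_results = [
    {"title": "Montreal", "article": "... tourism, food, fashion ..."},
]
recall_results = [
    {"title": "Lyon", "article": "... cuisine and gastronomy ..."},
    {"title": "Bordeaux", "article": "... centers of gastronomy ..."},
    {"title": "Montreal", "article": "... tourism, food, fashion ..."},
]

merged = union(precision_results, recall_results)
print([doc["title"] for doc in merged])
```

Identifying documents by title and article rather than by title alone matters here because the dataset holds several sentences per town, so two distinct documents can share a title.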

Our pipeline first returns documents from the `precision` pipeline, followed by documents from the `recall` pipeline. This neural search pipeline can therefore return documents even when the query words do not appear in any document.

In [8]:
search("food")

[{'title': 'Montreal',
  'url': 'https://en.wikipedia.org/wiki/Montreal',
  'article': 'It remains an important centre of commerce, aerospace, transport, finance, pharmaceuticals, technology, design, education, art, culture, tourism, food, fashion, video game development, film, and world affairs.'},
 {'title': 'Lyon',
  'url': 'https://en.wikipedia.org/wiki/Lyon',
  'article': "The city is recognised for its cuisine and gastronomy, as well as historical and architectural landmarks; as such, the districts of Old Lyon, the Fourvière hill, the Presqu'île and the slopes of the Croix-Rousse are inscribed on the UNESCO World Heritage List."},
 {'title': 'Bordeaux',
  'url': 'https://en.wikipedia.org/wiki/Bordeaux',
  'article': 'Bordeaux is also one of the centers of gastronomy and business tourism for the organization of international congresses.'},
 {'title': 'Paris',
  'url': 'https://en.wikipedia.org/wiki/Paris',
  'article': 'Paris received 12.'},
 {'title': 'Lyon',
  'url': 'https://en