# Handle large corpus

We designed Cherche primarily to build neural search pipelines on corpora of moderate size. A corpus is large when its documents do not all fit in memory, or when the retrievers `retrieve.TfIdf`, `retrieve.BM25Okapi` and `retrieve.Lunr` are not fast enough. If you work with large corpora, consider looking at [Jina](https://github.com/jina-ai/jina). Nevertheless, Cherche supports neural search on large corpora through the `retrieve.Elastic` retriever, which relies on Python's Elasticsearch client. **In this tutorial, we will use Elasticsearch both as a retriever and as a store for the ranker embeddings.**

To connect to Elasticsearch, we need a running Elasticsearch server, either remote or on our local machine. The installation of Elasticsearch is explained [here](https://www.elastic.co/downloads/elasticsearch). The first step is to initialise the `retrieve.Elastic` retriever, which takes a parameter `es` that establishes the connection with Elasticsearch.

To build a neural search pipeline on a large corpus, we also need a GPU, at least to pre-compute the document embeddings. A GPU is not mandatory in production, however, unless we want to answer questions or summarize documents.

**In this tutorial, we will present two distinct solutions for implementing the neural search pipeline.**

- **Scenario 1: Connect remotely to Elasticsearch from the GPU computer to index documents and embeddings.**
- **Scenario 2: Index documents and embeddings on Elasticsearch without a remote connection.**

### Scenario 1: Connect remotely to Elasticsearch from the GPU computer to index documents and embeddings.

Here, we are on the computer that has a GPU.

In [1]:
from cherche import retrieve, rank
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

My Elasticsearch server runs locally on port 9200. Replace `localhost:9200` with your own Elasticsearch address if your server is remote.

In [21]:
es = Elasticsearch(hosts="localhost:9200")

if es.ping():
    print("Elasticsearch is running.")
else:
    print("Elasticsearch is not running.")

Elasticsearch is running.


We declare our neural search pipeline, made of a retriever and a ranker.

In [3]:
retriever = retrieve.Elastic(
    es = es,
    key = "id",
    on = "document",
    k = 100,
    index = "large_corpus"
)

ranker = rank.Encoder(
    key = "id",
    on = "document",
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cuda").encode,
    k = 10,
)

We can now index our documents and their embeddings simultaneously. This process takes time when there are many documents; we could, for example, run it in parallel from several computers.

In [4]:
# Imagine 1 million documents instead of 3. 😅

documents = [
    {"id": 0, "document": "Toulouse is a municipality in south-west France. With 486,828 inhabitants as of 1 January 2018, Toulouse is the fourth most populous commune in France after Paris, Marseille and Lyon, having gained 101,000 inhabitants over the last 47 years (1968-2015)"},
    {"id": 1, "document": "Montreal is the main city of Quebec. A large island metropolis and port on the St. Laurent River at the foot of the Lachine Rapids, it is the second most populous city in Canada, after Toronto."},
    {"id": 2, "document": "Bordeaux is a French commune located in the Gironde department in the Nouvelle-Aquitaine region."}
]

retriever.add_embeddings(documents=documents, ranker=ranker)

Elastic retriever
 	 key: id
 	 on: document
 	 documents: 3

Et voilà.
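
For a corpus too large to index in one call, the documents can be split into batches that are indexed one after another, or shared between several machines. The sketch below only shows the batching step in plain Python; `make_batches` and `batch_size` are hypothetical names introduced for illustration, and the indexing call is left as a comment because it needs a running Elasticsearch server.

```python
from itertools import islice


def make_batches(documents, batch_size):
    """Yield successive lists of at most `batch_size` documents."""
    iterator = iter(documents)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# Stand-in corpus; in practice this would be the full list of documents.
documents = [{"id": i, "document": f"Document number {i}."} for i in range(10)]

batches = list(make_batches(documents, batch_size=4))

# Assumption: with two machines, each one could take every other batch.
batches_for_machine_0 = [batch for i, batch in enumerate(batches) if i % 2 == 0]

# With the retriever and ranker declared above, each batch would be indexed with:
# retriever.add_embeddings(documents=batch, ranker=ranker)
```

Smaller batches keep memory usage bounded on the GPU machine, at the cost of more calls to Elasticsearch.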

### Scenario 2: Index documents and embeddings on Elasticsearch without a remote connection.

We are on the GPU computer.

In [5]:
import json

from cherche import rank
from sentence_transformers import SentenceTransformer

In [6]:
ranker = rank.Encoder(
    key = "id",
    on = "document",
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2", device="cuda").encode,
    k = 10,
)

On the GPU machine, we can compute the document embeddings and save them as a JSON or pickle file, to be loaded later on the computer that has access to Elasticsearch.

In [7]:
# Imagine 1 million documents instead of 3. 😅

documents = [
    {"id": 0, "document": "Toulouse is a municipality in south-west France. With 486,828 inhabitants as of 1 January 2018, Toulouse is the fourth most populous commune in France after Paris, Marseille and Lyon, having gained 101,000 inhabitants over the last 47 years (1968-2015)"},
    {"id": 1, "document": "Montreal is the main city of Quebec. A large island metropolis and port on the St. Laurent River at the foot of the Lachine Rapids, it is the second most populous city in Canada, after Toronto."},
    {"id": 2, "document": "Bordeaux is a French commune located in the Gironde department in the Nouvelle-Aquitaine region."}
]

for document, embedding in zip(documents, ranker.embs(documents=documents)):

    # The key must be "embedding" here; do not rename it.
    document["embedding"] = embedding.tolist()

# You can also process the documents in batches and export them to separate JSON files.
with open("documents_embeddings.json", "w") as documents_embeddings:
    json.dump(documents, documents_embeddings, indent = 4)
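
As the comment above suggests, the export can be done in batches, one JSON file per batch. Here is a minimal sketch of that idea, assuming the embeddings are already available as lists of floats; `export_batches`, the file naming scheme and the stand-in data are hypothetical and not part of Cherche.

```python
import json
import tempfile
from pathlib import Path


def export_batches(documents, embeddings, batch_size, folder):
    """Write documents and their embeddings to numbered JSON files, one per batch."""
    paths = []
    for start in range(0, len(documents), batch_size):
        batch = [
            # The key must stay "embedding", as in the cell above.
            {**document, "embedding": list(embedding)}
            for document, embedding in zip(
                documents[start:start + batch_size],
                embeddings[start:start + batch_size],
            )
        ]
        path = Path(folder) / f"documents_embeddings_{start // batch_size}.json"
        with open(path, "w") as file:
            json.dump(batch, file)
        paths.append(path)
    return paths


# Stand-in data; real embeddings would come from the ranker on the GPU machine.
documents = [{"id": i, "document": f"Document number {i}."} for i in range(5)]
embeddings = [[float(i), 0.0] for i in range(5)]

folder = tempfile.mkdtemp()
paths = export_batches(documents, embeddings, batch_size=2, folder=folder)
```

Each file can then be transferred and indexed independently of the others.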

We are now on a computer that has access to the running Elasticsearch server. We have already transferred the JSON file from the GPU machine to this machine.

In [8]:
import json

from cherche import retrieve
from elasticsearch import Elasticsearch

My Elasticsearch server runs locally on port 9200. Replace `localhost:9200` with your own Elasticsearch address if your server is remote.

In [9]:
es = Elasticsearch(hosts="localhost:9200")

if es.ping():
    print("Elasticsearch is running.")
else:
    print("Elasticsearch is not running.")

Elasticsearch is running.


We can now index the documents and embeddings that we computed earlier.

In [10]:
with open("documents_embeddings.json", "r") as documents_embeddings:
    documents = json.load(documents_embeddings)

retriever = retrieve.Elastic(
    key = "id",
    on = "document",
    es = es,
    k = 100,
    index = "large_corpus"
)

retriever.add(documents)

Elastic retriever
 	 key: id
 	 on: document
 	 documents: 3

Et voilà.
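
Since the embeddings were computed on another machine, it can be worth checking the loaded documents before indexing them. The sketch below is one possible sanity check, not something Cherche provides: `check_documents` is a hypothetical helper that verifies each document carries the `id`, `document` and `embedding` fields and that all embeddings have the same size.

```python
def check_documents(documents, key="id", on="document"):
    """Return the number of documents, or raise ValueError if one is malformed."""
    sizes = set()
    for document in documents:
        missing = {key, on, "embedding"} - set(document)
        if missing:
            raise ValueError(
                f"Document {document.get(key)!r} is missing {sorted(missing)}"
            )
        sizes.add(len(document["embedding"]))
    if len(sizes) > 1:
        raise ValueError(f"Embeddings have inconsistent sizes: {sorted(sizes)}")
    return len(documents)


# Stand-in for the content of documents_embeddings.json.
loaded = [
    {"id": 0, "document": "Toulouse is a municipality.", "embedding": [0.1, 0.2]},
    {"id": 1, "document": "Montreal is a city.", "embedding": [0.3, 0.4]},
]

n_documents = check_documents(loaded)
```

Running such a check before `retriever.add(documents)` surfaces a missing or truncated embedding before anything is written to the index.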

You can now query your neural search pipeline via the `retrieve.Elastic` retriever without a GPU and have decent performance.

In [11]:
from cherche import retrieve, rank
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="localhost:9200")

retriever = retrieve.Elastic(
    key = "id",
    on = "document",
    es = es,
    k = 100,
    index = "large_corpus"
)

ranker = rank.Encoder(
    key = "id",
    on = "document",
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    k = 1,
)

search = retriever + ranker

In [12]:
search("Toulouse")

[{'id': 0,
  'document': 'Toulouse is a municipality in south-west France. With 486,828 inhabitants as of 1 January 2018, Toulouse is the fourth most populous commune in France after Paris, Marseille and Lyon, having gained 101,000 inhabitants over the last 47 years (1968-2015)',
  'similarity': 0.6678053305504484}]
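
The reason a CPU is enough at query time is that only the query needs to be encoded; the document embeddings are already stored. The following sketch illustrates the idea of re-ranking candidates by cosine similarity with pre-computed embeddings. It is a conceptual illustration in plain Python with toy two-dimensional vectors, not Cherche's actual implementation, and `cosine` and `rerank` are hypothetical names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rerank(query_embedding, candidates, k):
    """Keep the k candidates whose stored embedding is closest to the query."""
    scored = [
        {"id": c["id"], "similarity": cosine(query_embedding, c["embedding"])}
        for c in candidates
    ]
    return sorted(scored, key=lambda d: d["similarity"], reverse=True)[:k]


# Toy candidates, as if returned by the retriever with their stored embeddings.
candidates = [
    {"id": 0, "embedding": [1.0, 0.0]},
    {"id": 1, "embedding": [0.0, 1.0]},
    {"id": 2, "embedding": [1.0, 1.0]},
]

# Only this vector would have to be computed by the encoder at query time.
top = rerank(query_embedding=[1.0, 0.1], candidates=candidates, k=2)
```

With `k = 100` candidates from the retriever, this amounts to one encoder call and a hundred small dot products per query.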

### Time for a real demo on 600,000 documents - CPU.

To make the demonstration more convincing, I indexed 600,000 Wikipedia articles following scenario 2, using Google Colaboratory to compute the embeddings and an Elasticsearch server running locally on my PC. We no longer need a GPU, since the embeddings are pre-computed.

In [13]:
from cherche import retrieve, rank
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="localhost:9200")

retriever = retrieve.Elastic(
    key="id",
    es = es,
    on = "document",
    k = 100,
    index = "wiki" # My wiki index contains 700000 documents
)

ranker = rank.Encoder(
    key="id",
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2").encode,
    on = "document",
    k = 10,
)

search = retriever + ranker

Displaying the pipeline shows the number of documents indexed in `wiki`.

In [14]:
search

Elastic retriever
 	 key: id
 	 on: document
 	 documents: 700000
Encoder ranker
	 key: id
	 on: document
	 k: 10
	 similarity: cosine

On my computer, which only uses a CPU, a query over all these documents takes about 100 ms, which is great.

In [15]:
%timeit search("Toulouse")

96.5 ms ± 7.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
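
`%timeit` is only available in IPython and Jupyter. Outside a notebook, a similar measurement can be sketched with the standard library. In the example below, `measure_latency` is a hypothetical helper and `fake_search` is a stand-in for the real pipeline, so the timings it produces say nothing about Cherche itself.

```python
import statistics
import time


def measure_latency(search, query, runs=20):
    """Return the mean and standard deviation of the query time, in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        search(query)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.mean(timings), statistics.stdev(timings)


def fake_search(query):
    """Stand-in for the pipeline; replace it with `search` from the cell above."""
    return [{"id": 0, "similarity": 1.0}]


mean_ms, std_ms = measure_latency(fake_search, "Toulouse", runs=5)
print(f"{mean_ms:.3f} ms ± {std_ms:.3f} ms per query")
```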


In [16]:
search("Toulouse")

[{'document': 'Toulouse Métropole is the metropolis, an intercommunal structure, centred on the city of Toulouse. It is located in the Haute-Garonne department, in the Occitanie region, southern France.',
  'similarity': 0.6366361226971764},
 {'document': 'France is home to aerospace giant Airbus, which has its headquarters and main facilities located in Toulouse.',
  'similarity': 0.5271759181852023},
 {'document': 'It was created on January 1, 2015, succeeding the urban community of Toulouse, which had itself succeeded in 2009 and 2001 to previous districts created in 1992 with less powers than either the urban community or the current metropolitan region.',
  'similarity': 0.49571503621282487},
 {'document': 'Due to local political feuds, Toulouse Métropole only hosts 59% of the population of the metropolitan area (see infobox at Toulouse article for the metropolitan area), the other independent communes of the metropolitan area having refused to join in, notably Muret and the tec

Of course, we can connect a question answering model or a summarization model to our neural search pipeline. **However, these models are heavier and require a GPU to maintain the performance level.**

In [17]:
from cherche import qa
from transformers import pipeline

search = retriever + ranker + qa.QA(
    model = pipeline("question-answering", model = "deepset/roberta-base-squad2", tokenizer = "deepset/roberta-base-squad2"),
    on = "document",
    k = 2
)

In [18]:
search("What is Python?")

[{'start': 10,
  'end': 49,
  'answer': 'dynamically-typed and garbage-collected',
  'qa_score': 0.8619149327278137,
  'document': 'Python is dynamically-typed and garbage-collected. It supports multiple programming paradigms, including structured (particularly, procedural), object-oriented and functional programming.',
  'similarity': 0.701137335987238},
 {'start': 10,
  'end': 77,
  'answer': 'an interpreted, high-level and general-purpose programming language',
  'qa_score': 0.6966809630393982,
  'document': 'Python is an interpreted, high-level and general-purpose programming language. Pythons design philosophy emphasizes code readability with its notable use of significant whitespace.',
  'similarity': 0.8061618668192251}]

Summarization pipeline

In [19]:
from cherche import summary

search = retriever + ranker + summary.Summary(
    model = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", tokenizer="sshleifer/distilbart-cnn-6-6", framework="pt"),
    on = "document",
)

In [20]:
search("What is Python?")

'Python is an interpreted, high-level and general-purpose programming language. It supports multiple programming paradigms, including structured (particularly'