About [BEIR](https://github.com/beir-cellar/beir):

BEIR is a heterogeneous benchmark covering diverse information retrieval (IR) tasks. It also provides a common, easy-to-use framework for evaluating NLP-based retrieval models on those tasks.

See the linked repository for the full list of supported datasets.

Here, we test the `all-MiniLM-L6-v2` sentence-transformer embedding model, which is among the fastest models in its accuracy range. We set the retriever's `similarity_top_k` to 30 and evaluate on the nfcorpus dataset.

```python
from llama_index.evaluation.benchmarks import BeirEvaluator
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import LangchainEmbedding, ServiceContext, VectorStoreIndex


def create_retriever(documents):
    # Wrap the Hugging Face sentence-transformer so LlamaIndex can use it
    # as the embedding model.
    embed_model = LangchainEmbedding(
        HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    )
    service_context = ServiceContext.from_defaults(embed_model=embed_model)
    # Embed the corpus documents and build a vector index over them.
    index = VectorStoreIndex.from_documents(
        documents, service_context=service_context, show_progress=True
    )
    # Retrieve the 30 most similar nodes for each query.
    return index.as_retriever(similarity_top_k=30)


# Build the retriever for nfcorpus and evaluate it against the dataset's qrels.
BeirEvaluator().run(create_retriever, datasets=["nfcorpus"])
```

```text
Dataset: nfcorpus downloaded at: /home/jonch/.cache/llama_index/datasets/BeIR__nfcorpus
Evaluating on dataset: nfcorpus
-------------------------------------

100%|█████████████████████████████████████████████████████████████████████████████| 3633/3633 [00:00<00:00, 108080.99it/s]
Parsing documents into nodes: 100%|██████████████████████████████████████████████████| 3633/3633 [00:06<00:00, 552.33it/s]
Generating embeddings: 100%|██████████████████████████████████████████████████████████| 3649/3649 [03:41<00:00, 16.47it/s]

Retriever created for:  nfcorpus
Evaluating retriever on questions against qrels

100%|███████████████████████████████████████████████████████████████████████████████████| 323/323 [01:17<00:00,  4.18it/s]

Results for: nfcorpus
{'NDCG@10': 0.31348, 'MAP@10': 0.10961, 'Recall@10': 0.15835, 'precision@10': 0.23901}
-------------------------------------
```

All four metrics range from 0 to 1, and higher is better for each of them.

This [Towards Data Science article](https://towardsdatascience.com/ranking-evaluation-metrics-for-recommender-systems-263d0a66ef54) covers NDCG, MAP and MRR in greater depth.
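
To make the four reported metrics more concrete, here is a minimal, self-contained sketch that computes them for a single toy query. It is an illustration under stated assumptions, not the code BEIR or `BeirEvaluator` runs: the function names, the document IDs, and the relevance grades are all made up, and the MAP normalization (dividing by the total number of relevant documents) is one common convention that a given evaluation library may or may not share. The benchmark figures above are averages of such per-query scores over all test queries.

```python
import math


def precision_at_k(ranked_ids, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def recall_at_k(ranked_ids, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / len(relevant)


def average_precision_at_k(ranked_ids, relevant, k):
    """Mean of precision values at each rank where a relevant document occurs.

    Assumption: normalized by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def ndcg_at_k(ranked_ids, gains, k):
    """Discounted cumulative gain of the ranking divided by the ideal DCG."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg else 0.0


# Hypothetical data: graded relevance judgments (qrels) for one query,
# and the ranking a retriever returned for it.
gains = {"d1": 2, "d2": 1, "d9": 1}
relevant = {doc_id for doc_id, gain in gains.items() if gain > 0}
ranked = ["d3", "d1", "d7", "d2", "d5"]

k = 5
print("precision@5:", precision_at_k(ranked, relevant, k))
print("recall@5:   ", recall_at_k(ranked, relevant, k))
print("AP@5:       ", average_precision_at_k(ranked, relevant, k))
print("NDCG@5:     ", ndcg_at_k(ranked, gains, k))
```

Precision and recall only count how many relevant documents made it into the top k, while MAP and NDCG also reward placing them near the top of the ranking. NDCG is the only one of the four that uses graded relevance.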