[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/generation/better-rag/00-rerankers.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/generation/better-rag/00-rerankers.ipynb)

# Rerankers

Rerankers have been a common component of retrieval pipelines for many years. They add a final "reranking" step to a retrieval pipeline, such as the one used in **R**etrieval **A**ugmented **G**eneration (RAG), that reorders the retrieved results by relevance and can substantially improve retrieval accuracy.

In this example notebook we'll learn how to create retrieval pipelines with reranking using the [Cohere reranking model](https://txt.cohere.com/rerank/) (which is available for free).

To begin, we set up our prerequisite libraries.

In [None]:
!pip install -qU \
    datasets==2.14.5 \
    openai==0.28.1 \
    pinecone-client==2.2.4 \
    cohere==4.27

## Data Preparation

We start by downloading a dataset that we will encode and store. The dataset [`jamescalam/ai-arxiv-chunked`](https://huggingface.co/datasets/jamescalam/ai-arxiv-chunked) contains scraped data from many popular ArXiv papers centred around LLMs, including the Llama 2 and GPTQ papers and the GPT-4 technical report.

In [None]:
from datasets import load_dataset

data = load_dataset("jamescalam/ai-arxiv-chunked", split="train")
data

Dataset({
    features: ['doi', 'chunk-id', 'chunk', 'id', 'title', 'summary', 'source', 'authors', 'categories', 'comment', 'journal_ref', 'primary_category', 'published', 'updated', 'references'],
    num_rows: 41584
})

We have roughly 41.6K chunks, each about 1-2 paragraphs long. Here is an example of a single record:

In [None]:
data[0]

{'doi': '1910.01108',
 'chunk-id': '0',
 'chunk': 'DistilBERT, a distilled version of BERT: smaller,\nfaster, cheaper and lighter\nVictor SANH, Lysandre DEBUT, Julien CHAUMOND, Thomas WOLF\nHugging Face\n{victor,lysandre,julien,thomas}@huggingface.co\nAbstract\nAs Transfer Learning from large-scale pre-trained models becomes more prevalent\nin Natural Language Processing (NLP), operating these large models in on-theedge and/or under constrained computational training or inference budgets remains\nchallenging. In this work, we propose a method to pre-train a smaller generalpurpose language representation model, called DistilBERT, which can then be ﬁnetuned with good performances on a wide range of tasks like its larger counterparts.\nWhile most prior work investigated the use of distillation for building task-speciﬁc\nmodels, we leverage knowledge distillation during the pre-training phase and show\nthat it is possible to reduce the size of a BERT model by 40%, while retaining 97%\nof i

Next we reshape the data into the format we need: `id`, `text` (which we will embed), and `metadata`. We don't need metadata for this use case, but including it means we can use metadata filtering later if needed.

In [None]:
data = data.map(lambda x: {
    "id": f'{x["id"]}-{x["chunk-id"]}',
    "text": x["chunk"],
    "metadata": {
        "title": x["title"],
        "url": x["source"],
        "primary_category": x["primary_category"],
        "published": x["published"],
        "updated": x["updated"],
        "text": x["chunk"],
    }
})
# drop unneeded columns
data = data.remove_columns([
    "title", "summary", "source",
    "authors", "categories", "comment",
    "journal_ref", "primary_category",
    "published", "updated", "references",
    "doi", "chunk-id",
    "chunk"
])
data

Dataset({
    features: ['id', 'text', 'metadata'],
    num_rows: 41584
})
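
To make the mapping concrete, here is a minimal sketch of the same transformation applied to a single hand-written row in plain Python, with no dataset download. The field names come from the dataset, but the values are placeholders made up for illustration, and `to_record` is a hypothetical helper name rather than something the notebook defines.

```python
def to_record(x: dict) -> dict:
    """Reshape one raw row into the id / text / metadata layout used above."""
    return {
        "id": f'{x["id"]}-{x["chunk-id"]}',
        "text": x["chunk"],
        "metadata": {
            "title": x["title"],
            "url": x["source"],
            "primary_category": x["primary_category"],
            "published": x["published"],
            "updated": x["updated"],
            "text": x["chunk"],  # duplicated so query results can return the text
        },
    }

# illustrative row: field names match the dataset, values are placeholders
row = {
    "id": "1910.01108",
    "chunk-id": "0",
    "chunk": "DistilBERT, a distilled version of BERT",
    "title": "example title",
    "source": "http://example.com/paper",
    "primary_category": "cs.CL",
    "published": "20191002",
    "updated": "20200301",
}
record = to_record(row)
print(record["id"])  # 1910.01108-0
```

Storing the chunk text inside `metadata` as well as in `text` is what later lets us read the document text straight out of the query results.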

We need an embedding model to create our embedding vectors for retrieval. For that we will use OpenAI's `text-embedding-ada-002`. This model is paid, but the cost of running this notebook is under $1.

In [None]:
import os
import openai
import getpass

# get an API key from the top-right dropdown at platform.openai.com
openai.api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")

embed_model = "text-embedding-ada-002"

Now we create our vector DB to store our vectors. For this we need a [free Pinecone API key](https://app.pinecone.io). The API key and environment name are both shown under "API Keys" in the left navbar of the Pinecone dashboard.

In [None]:
import pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.getenv("PINECONE_API_KEY") or getpass.getpass()
# find your environment next to the api key in pinecone console
env = os.getenv("PINECONE_ENVIRONMENT") or input()

pinecone.init(api_key=api_key, environment=env)

us-west1-gcp


When creating the index, we set `dimension` equal to the dimensionality of Ada-002 (`1536`) and use a `metric` that is also compatible with Ada-002 (this can be either `cosine` or `dotproduct`).
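
Either metric works on the assumption that the embeddings are unit length, which OpenAI's documentation states for its embedding models. For unit-length vectors the dot product and cosine similarity give the same value, so both metrics rank results identically. A minimal sketch with made-up 3-dimensional vectors, using only the standard library:

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# toy vectors standing in for embeddings (real Ada-002 vectors have 1536 dims)
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 0.0, 1.0])

print(math.isclose(dot(a, b), cosine(a, b)))  # True
```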

In [None]:
import time

index_name = "rerankers"

# check if index already exists (it shouldn't if this is first time)
if index_name not in pinecone.list_indexes():
    # if does not exist, create index
    pinecone.create_index(
        index_name,
        dimension=1536,  # dimensionality of ada 002
        metric='dotproduct'
    )
    # wait for index to be initialized
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

# connect to index
index = pinecone.Index(index_name)
time.sleep(1)
# view index stats
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 41584}},
 'total_vector_count': 41584}

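On a first run the index is empty, with a `total_vector_count` of `0`. (The `describe_index_stats()` output shown here reports `41584`, presumably because it was captured after the index had already been populated.) We can populate the index with embeddings built by OpenAI's `text-embedding-ada-002` like so: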

In [None]:
from tqdm.auto import tqdm

batch_size = 100  # how many embeddings we create and insert at once

for i in tqdm(range(0, len(data), batch_size)):
    passed = False
    # find end of batch
    i_end = min(len(data), i+batch_size)
    # create batch
    batch = data[i:i_end]
    # create embeddings (exponential backoff to avoid RateLimitError)
    for j in range(5):  # max 5 attempts
        try:
            res = openai.Embedding.create(input=batch["text"], engine=embed_model)
            passed = True
            break  # stop retrying once the request succeeds
        except openai.error.RateLimitError:
            time.sleep(2**j)  # wait 2^j seconds before retrying
            print("Retrying...")
    if not passed:
        raise RuntimeError("Failed to create embeddings.")
    # get embeddings
    embeds = [record['embedding'] for record in res['data']]
    to_upsert = list(zip(batch["id"], embeds, batch["metadata"]))
    # upsert to Pinecone
    index.upsert(vectors=to_upsert)
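
The retry logic in the loop above can be isolated into a small helper. The sketch below is one way to do that, not the notebook's own code: `with_backoff`, the stand-in `RateLimitError` class, and the `flaky` function are all hypothetical names, and the `sleep` parameter is injectable so the demo runs without actually waiting.

```python
import time

class RateLimitError(Exception):
    """Stand-in for a provider's rate-limit exception (hypothetical)."""

def with_backoff(fn, max_retries=5, sleep=time.sleep):
    """Call fn(), waiting 2**attempt seconds after each rate-limit failure."""
    for attempt in range(max_retries):
        try:
            return fn()  # success: return immediately, no further attempts
        except RateLimitError:
            sleep(2 ** attempt)
    raise RuntimeError(f"Failed after {max_retries} attempts.")

# demo: a fake call that fails twice and then succeeds
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError()
    return "ok"

waits = []
result = with_backoff(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # ok [1, 2]
```

Returning from inside the `try` matters: without an early exit, a successful request would be repeated on every remaining iteration of the retry loop.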

Now let's test retrieval _without_ Cohere's reranking model.

In [None]:
def get_docs(query: str, top_k: int):
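    # encode query with the same embedding model used for indexing
    xq = openai.Embedding.create(input=[query], engine=embed_model)["data"][0]["embedding"]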
    # search pinecone index
    res = index.query(xq, top_k=top_k, include_metadata=True)
    # get doc text
    docs = {x["metadata"]['text']: i for i, x in enumerate(res["matches"])}
    return docs

In [None]:
query = "can you explain why we would want to do rlhf?"
docs = get_docs(query, top_k=25)
print("\n---\n".join(docs.keys()))

whichmodels areprompted toexplain theirreasoningwhen givena complexproblem, inorder toincrease
the likelihood that their ﬁnal answer is correct.
RLHF has emerged as a powerful strategy for ﬁne-tuning Large Language Models, enabling signiﬁcant
improvements in their performance (Christiano et al., 2017). The method, ﬁrst showcased by Stiennon et al.
(2020) in the context of text-summarization tasks, has since been extended to a range of other applications.
In this paradigm, models are ﬁne-tuned based on feedback from human users, thus iteratively aligning the
models’ responses more closely with human expectations and preferences.
Ouyang et al. (2022) demonstrates that a combination of instruction ﬁne-tuning and RLHF can help ﬁx
issues with factuality, toxicity, and helpfulness that cannot be remedied by simply scaling up LLMs. Bai
et al. (2022b) partially automates this ﬁne-tuning-plus-RLHF approach by replacing the human-labeled
ﬁne-tuningdatawiththemodel’sownself-critiquesandrevisions,

Good, but can we get better?

## Reranking Responses

We can easily capture the relevant records if we return _many_ results, but passing them all to an LLM doesn't work well. LLM recall [decreases as we add more into the context window](https://www.pinecone.io/blog/why-use-retrieval-instead-of-larger-context/). We call this excessive filling of the context window _"context stuffing"_.

Fortunately, reranking offers a solution: we retrieve many records, let the reranker find the relevant ones that may not be within the top-3 results, and pull them into a smaller set of results to give to the LLM.
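
The two-stage pattern can be sketched with toy scorers before we bring in any real models. Everything below is illustrative: `overlap_score` stands in for the vector search and `phrase_score` for the reranker, and neither resembles how the embedding model or Cohere's reranker actually scores text. The point is only the shape of the pipeline: a broad `top_k` retrieval followed by a narrower `top_n` rerank.

```python
def overlap_score(query: str, doc: str) -> int:
    """Toy first-stage score: number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_score(query: str, doc: str) -> int:
    """Toy second-stage score: word overlap plus a bonus for an exact phrase match."""
    bonus = 1 if query.lower() in doc.lower() else 0
    return overlap_score(query, doc) + bonus

def retrieve_then_rerank(query, corpus, top_k, top_n):
    # stage 1: broad candidate set from the first-stage scorer
    candidates = sorted(
        corpus, key=lambda d: overlap_score(query, d), reverse=True
    )[:top_k]
    # stage 2: reorder only those candidates, then keep the best top_n
    return sorted(
        candidates, key=lambda d: phrase_score(query, d), reverse=True
    )[:top_n]

corpus = [
    "a red car and a teaming crowd",
    "red teaming probes a model for harmful outputs",
    "the sky is red",
    "gptq quantizes model weights",
]
top = retrieve_then_rerank("red teaming", corpus, top_k=3, top_n=2)
print(top)
# ['red teaming probes a model for harmful outputs', 'a red car and a teaming crowd']
```

Here the first stage ties the two documents that contain both query words, and the second stage promotes the one that actually discusses red teaming.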

We will use Cohere's rerank endpoint for this. To use it you will need a [Cohere API key](https://dashboard.cohere.com/api-keys). Once you have your key, use it to authenticate your Cohere client like so:

In [None]:
import cohere

os.environ["COHERE_API_KEY"] = os.getenv("COHERE_API_KEY") or getpass.getpass()
# init client
co = cohere.Client(os.environ["COHERE_API_KEY"])


Now we can rerank our results with `co.rerank`. Let's try it with our earlier results.

In [None]:
rerank_docs = co.rerank(
    query=query, documents=docs.keys(), top_n=25, model="rerank-english-v2.0"
)

This returns a list of `RerankResult` objects:

In [None]:
type(rerank_docs[0])

cohere.responses.rerank.RerankResult

We access the text content of the docs like so:

In [None]:
rerank_docs[0].document["text"]

'whichmodels areprompted toexplain theirreasoningwhen givena complexproblem, inorder toincrease\nthe likelihood that their ﬁnal answer is correct.\nRLHF has emerged as a powerful strategy for ﬁne-tuning Large Language Models, enabling signiﬁcant\nimprovements in their performance (Christiano et al., 2017). The method, ﬁrst showcased by Stiennon et al.\n(2020) in the context of text-summarization tasks, has since been extended to a range of other applications.\nIn this paradigm, models are ﬁne-tuned based on feedback from human users, thus iteratively aligning the\nmodels’ responses more closely with human expectations and preferences.\nOuyang et al. (2022) demonstrates that a combination of instruction ﬁne-tuning and RLHF can help ﬁx\nissues with factuality, toxicity, and helpfulness that cannot be remedied by simply scaling up LLMs. Bai\net al. (2022b) partially automates this ﬁne-tuning-plus-RLHF approach by replacing the human-labeled\nﬁne-tuningdatawiththemodel’sownself-critiquesan

The reordered results look like so:

In [None]:
[docs[doc.document["text"]] for doc in rerank_docs]

[0,
 23,
 14,
 3,
 12,
 6,
 9,
 8,
 1,
 17,
 7,
 21,
 2,
 16,
 10,
 20,
 18,
 22,
 24,
 13,
 19,
 4,
 15,
 11,
 5]
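
Each number in this list is a document's original vector-search rank, and its position in the list is the new rank after reranking. A minimal sketch of that bookkeeping with made-up data (the chunk names and the reranked order below are hypothetical, not output from Cohere):

```python
# original vector-search order: text -> original rank (same shape as `docs` above)
docs = {"chunk A": 0, "chunk B": 1, "chunk C": 2, "chunk D": 3}
# hypothetical reranker output: a subset of the same texts, best first
reranked_texts = ["chunk A", "chunk D", "chunk B"]

# look up where each reranked text sat in the original results
original_ranks = [docs[text] for text in reranked_texts]
print(original_ranks)  # [0, 3, 1]

# (new position, original rank) for every entry that moved
moved = [(new, old) for new, old in enumerate(original_ranks) if new != old]
print(moved)  # [(1, 3), (2, 1)]
```

Because the lookup is keyed on the document text, two chunks with identical text would share a single entry, so this approach assumes the retrieved texts are distinct.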

Let's write a function to allow us to more easily compare the original results vs. reranked results.

In [None]:
def compare(query: str, top_k: int, top_n: int):
    # first get vec search results
    docs = get_docs(query, top_k=top_k)
    i2doc = {docs[doc]: doc for doc in docs.keys()}
    # rerank
    rerank_docs = co.rerank(
        query=query, documents=docs.keys(), top_n=top_n, model="rerank-english-v2.0"
    )
    original_docs = []
    reranked_docs = []
    # compare order change
    for i, doc in enumerate(rerank_docs):
        rerank_i = docs[doc.document["text"]]
        print(str(i)+"\t->\t"+str(rerank_i))
        if i != rerank_i:
            reranked_docs.append(f"[{rerank_i}]\n"+doc.document["text"])
            original_docs.append(f"[{i}]\n"+i2doc[i])
    for orig, rerank in zip(original_docs, reranked_docs):
        print("ORIGINAL:\n"+orig+"\n\nRERANKED:\n"+rerank+"\n\n---\n")

Beginning with our `"can you explain why we would want to do rlhf?"` query, let's take a look at the top-3 results with / without reranking:

In [None]:
compare(query, 25, 3)

0	->	0
1	->	23
2	->	14
ORIGINAL:
[1]
We examine the inﬂuence of the amount of RLHF training for two reasons. First, RLHF [13, 57] is an
increasingly popular technique for reducing harmful behaviors in large language models [3, 21, 52]. Some of
these models are already deployed [52], so we believe the impact of RLHF deserves further scrutiny. Second,
previous work shows that the amount of RLHF training can signiﬁcantly change metrics on a wide range of
personality, political preference, and harm evaluations for a given model size [41]. As a result, it is important
to control for the amount of RLHF training in the analysis of our experiments.
3.2 Experiments
3.2.1 Overview
We test the effect of natural language instructions on two related but distinct moral phenomena: stereotyping
and discrimination. Stereotyping involves the use of generalizations about groups in ways that are often
harmful or undesirable.4To measure stereotyping, we use two well-known stereotyping benchmarks, BBQ
[40] 

Both records that reranking pulled into the top 3 (originally ranked 23 and 14) provide many more reasons why we would want to use RLHF than the records they replaced. Let's try another query:

In [None]:
compare("what is red teaming?", top_k=25, top_n=3)

0	->	0
1	->	3
2	->	17
ORIGINAL:
[1]
red-teaming expertise valuable for organizations with suf ﬁcient resources. However, it would also be
beneﬁcial to experiment with the formation of a community of AI red teaming professionals that draws
together individuals from different organizations and bac kgrounds, speciﬁcally focused on some subset
of AI (versus AI in general) that is relatively well-deﬁned a nd relevant across multiple organizations.25
A community of red teaming professionals could take actions such as publish best practices, collectively
analyze particular case studies, organize workshops on eme rging issues, or advocate for policies that
would enable red teaming to be more effective.
Doing red teaming in a more collaborative fashion, as a commu nity of focused professionals across
23Red teaming could be aimed at assessing various properties o f AI systems, though we focus on safety and security in this
subsection given the expertise of the authors who contribut ed to it.
24F

Again, reranking surfaces more relevant results than the original search alone.

Don't forget to delete your index when you're done to save resources!

In [None]:
pinecone.delete_index(index_name)

---