# Question answering using a search API and re-ranking

Searching for relevant information can sometimes feel like looking for a needle in a haystack, but don’t despair: GPTs can do a lot of this work for us. In this guide, we explore a way to augment existing search systems with various AI techniques, helping us sift through the noise.

Two ways of retrieving information for GPT are:

1. **Mimicking Human Browsing:** [GPT triggers a search](https://openai.com/blog/chatgpt-plugins#browsing), evaluates the results, and modifies the search query if necessary. It can also follow up on specific search results to form a chain of thought, much like a human user would do.
2. **Retrieval with Embeddings:** Calculate [embeddings](https://platform.openai.com/docs/guides/embeddings) for your content and a user query, and then [retrieve the content](Question_answering_using_embeddings.ipynb) most related as measured by cosine similarity. This technique is [used heavily](https://blog.google/products/search/search-language-understanding-bert/) by search engines like Google.

These approaches are both promising, but each has its shortcomings: the first can be slow due to its iterative nature, and the second requires embedding your entire knowledge base in advance, continuously embedding new content, and maintaining a vector database.

By combining these approaches, and drawing inspiration from [re-ranking](https://www.sbert.net/examples/applications/retrieve_rerank/README.html) methods, we identify an approach that sits in the middle. **This approach can be implemented on top of any existing search system, like the Slack search API, or an internal Elasticsearch instance with private data**. Here’s how it works:

![search_augmented_by_query_generation_and_embeddings_reranking.png](../images/search_rerank_answer.png)

**Step 1: Search**

1.  User asks a question.
2.  GPT generates a list of potential queries.
3.  Search queries are executed in parallel.

**Step 2: Re-rank**

1.  Embeddings for each result are used to calculate semantic similarity to a generated hypothetical ideal answer to the user question.
2.  Results are ranked and filtered based on this similarity metric.

**Step 3: Answer**

1.  Given the top search results, the model generates an answer to the user’s question, including references and links.

This hybrid approach offers relatively low latency and can be integrated into any existing search endpoint, without requiring the upkeep of a vector database. Let's dive into it! We will use the [News API](https://newsapi.org/) as an example domain to search over.
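Before getting into the details, the three steps can be sketched as a single function. This is only a hypothetical skeleton: `answer_with_search` and every callable it takes are illustrative names, not part of the notebook, and each one stands in for a model call or a search request.

```python
from typing import Callable


def answer_with_search(
    question: str,
    generate_queries: Callable[[str], list[str]],
    search: Callable[[str], list[dict]],
    draft_answer: Callable[[str], str],
    similarity: Callable[[str, dict], float],
    write_answer: Callable[[str, list[dict]], str],
    top_k: int = 5,
) -> str:
    """Hypothetical skeleton of the search -> re-rank -> answer flow."""
    # Step 1: Search. Expand the question into several queries and pool the results.
    results: list[dict] = []
    for query in generate_queries(question):
        results.extend(search(query))

    # Step 2: Re-rank. Compare each result to a hypothetical ideal answer
    # and keep only the top_k most similar results.
    hypothetical = draft_answer(question)
    ranked = sorted(
        results, key=lambda result: similarity(hypothetical, result), reverse=True
    )[:top_k]

    # Step 3: Answer. Hand the surviving results to the answer generator.
    return write_answer(question, ranked)
```

The rest of this guide fills in each callable with a concrete implementation.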

## Setup

In addition to your `OPENAI_API_KEY`, you'll have to include a `NEWS_API_KEY` in your environment. You can get an API key [here](https://newsapi.org/).


In [1]:
%%capture
%env NEWS_API_KEY = YOUR_NEWS_API_KEY

In [1]:
# Dependencies
from datetime import date, timedelta  # date handling for fetching recent news
from IPython import display  # for pretty printing
import json  # for parsing the JSON api responses and model outputs
from numpy import dot  # for cosine similarity
import openai  # for using GPT and getting embeddings
import os  # for loading environment variables
import requests  # for making the API requests
from tqdm.notebook import tqdm  # for printing progress bars

In [2]:
# Load environment variables
news_api_key = os.getenv("NEWS_API_KEY")

GPT_MODEL = "gpt-3.5-turbo"


# Helper functions
def json_gpt(input: str):
    completion = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=[
            {"role": "system", "content": "Output only valid JSON"},
            {"role": "user", "content": input},
        ],
        temperature=0.5,
    )

    text = completion.choices[0].message.content
    parsed = json.loads(text)

    return parsed


def embeddings(input: list[str]) -> list[list[float]]:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=input)
    return [data.embedding for data in response.data]
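
Note that `json.loads` raises an error whenever the model's reply is not strict JSON, for instance if it adds a sentence around the object. The sketch below shows one possible fallback, assuming the reply still contains a single JSON object somewhere in the text. `parse_json_reply` is a hypothetical helper, not part of the notebook.

```python
import json


def parse_json_reply(text: str) -> dict:
    """Parse a model reply as JSON, tolerating extra text around one JSON object."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the span between the first "{" and the last "}"
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise  # nothing that looks like an object; surface the original error
        return json.loads(text[start : end + 1])
```

This is a heuristic, so a reply with several separate objects or unbalanced braces would still fail.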

## 1. Search

It all starts with a user question.


In [8]:
# User asks a question
USER_QUESTION = "Which famous American writer passed away in June 2023? Tell me about his or her work."

Now, in order to be as exhaustive as possible, we use the model to generate a list of diverse queries based on this question.


In [42]:
QUERIES_INPUT = f"""
You have access to a search API that returns recent news articles.
Generate an array of search queries that are relevant to this question.
Use a variation of related keywords for the queries, trying to be as general as possible.
Include as many queries as you can think of, including and excluding terms.
For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].
Be creative. 
Please be concise and include only the keywords in the queries. 
The more queries you include, the more likely you are to find relevant results.

User question: {USER_QUESTION}

Format: {{"queries": ["query_1", "query_2", "query_3"]}}
"""

queries = json_gpt(QUERIES_INPUT)["queries"]

# Let's include the original question as well for good measure
queries.append(USER_QUESTION)

print(queries)

['famous American writer death June 2023', 'American author death 2023', 'writer obituary June 2023', 'author passing June 2023', 'writer death bio', 'American writer legacy', 'author achievements', 'famous writer bibliography', 'American novelist death news', 'writer biography', 'author literary works', 'writer contribution to literature', 'Which famous American writer passed away in June 2023? Tell me about his or her work.']


The queries look good, so let's run the searches.


In [43]:
def search_news(
    query: str,
    news_api_key: str = news_api_key,
    num_articles: int = 50,
    from_datetime: str = "2023-06-05", 
    to_datetime: str = "2023-06-19",
) -> dict:
    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "apiKey": news_api_key,
            "pageSize": num_articles,
            "sortBy": "relevancy",
            "from": from_datetime,
            "to": to_datetime,
        },
    )

    return response.json()


articles = []

for query in tqdm(queries):
    result = search_news(query)
    if result["status"] == "ok":
        articles = articles + result["articles"]
    else:
        raise Exception(result["message"])

# remove duplicates
articles = list({article["url"]: article for article in articles}.values())

print("Total number of articles:", len(articles))
print("Top 5 articles of query 1:", "\n")

for article in articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print()



Total number of articles: 357
Top 5 articles of query 1: 

Title: 15 Trivia Tidbits About ‘All in the Family’
Description: By Zanandi Botes Published: June 13th, 2023
Content: The program you are about to see is All in the Family. It seeks to throw a humorous spotlight on our...

Title: This week on "Sunday Morning" (June 11)
Description: A look at the features for this week's broadcast of the #1 Sunday morning news program, hosted by Jane Pauley.
Content: The Emmy Award-winning "CBS News Sunday Morning" is broadcast on CBS Sundays beginning at 9:00 a.m. ...

Title: Apple TV+ shows and movies: Everything to watch on Apple TV Plus
Description: Apple TV+ offers exclusive Apple original TV shows and movies in 4K HDR quality. You can watch across all of your screens and pick up where you left off on any device. Apple TV+ costs $6.99 per month. Here’s every Apple original television show and movie avai…
Content: Apple TV+ offers exclusive Apple original TV shows and movies in 4K HDR quality

As we can see, oftentimes, the search queries will return a large number of results, many of which are not relevant to the original question asked by the user. In order to improve the quality of the final answer, we use embeddings to re-rank and filter the results.
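The overview describes running the search queries in parallel, while the loop above issues them one at a time. A minimal sketch of a concurrent version using a thread pool might look like the following. `search_all` and `search_fn` are hypothetical names, and the sketch assumes `search_fn` returns a list of article dicts that each have a `"url"` key.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def search_all(
    queries: list[str],
    search_fn: Callable[[str], list[dict]],
    max_workers: int = 8,
) -> list[dict]:
    """Run search_fn for every query concurrently and merge the results by URL."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `queries`
        per_query_results = list(pool.map(search_fn, queries))

    unique_by_url: dict[str, dict] = {}
    for results in per_query_results:
        for article in results:
            # setdefault keeps the first article seen for each URL
            unique_by_url.setdefault(article["url"], article)
    return list(unique_by_url.values())
```

In this notebook, `search_fn` could be something like `lambda q: search_news(q)["articles"]`, although that shortcut skips the status check done in the original loop. Threads suit this workload because each request spends most of its time waiting on the network.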

## 2. Re-rank

Drawing inspiration from [HyDE (Gao et al.)](https://arxiv.org/abs/2212.10496), we first generate a hypothetical ideal answer to compare our results against. This helps prioritize results that look like good answers, rather than those similar to our question. Here’s the prompt we use to generate our hypothetical answer.


In [15]:
HA_INPUT = f"""
Generate a hypothetical answer to the user's question. This answer will be used to rank search results.
Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders
like NAME did something, or NAME said something at PLACE. Also pretend you are in the year 2030 and have access to all the knowledge up to that year. 

User question: {USER_QUESTION}

Format: {{"hypotheticalAnswer": "hypothetical answer text"}}
"""

hypothetical_answer = json_gpt(HA_INPUT)["hypotheticalAnswer"]

print(hypothetical_answer)

Mark Twain passed away in June 2023. He was known for his witty and satirical writing, with notable works such as The Adventures of NAME Finn and The NAME of the NAME. His writing often tackled social issues and provided commentary on American life during his time.


Now, let's generate embeddings for the search results and the hypothetical answer. We then calculate the cosine similarity between these embeddings, giving us a semantic similarity metric. Note that we can simply calculate the dot product instead of the full cosine similarity, because the OpenAI API returns embeddings normalized to unit length.
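To see why the dot product is enough for unit-length vectors, here is a small pure-Python sketch with made-up 2-D vectors. The helper names are illustrative and not part of the notebook, which uses `numpy.dot` on the real embeddings.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: list[float]) -> float:
    return math.sqrt(dot_product(a, a))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Full definition: dot product divided by the product of the vector lengths
    return dot_product(a, b) / (norm(a) * norm(b))


def normalize(a: list[float]) -> list[float]:
    length = norm(a)
    return [x / length for x in a]


# Toy vectors, scaled to unit length
u = normalize([3.0, 4.0])
v = normalize([4.0, 3.0])

# Both norms are 1, so the denominator drops out and the two values match
assert math.isclose(dot_product(u, v), cosine_similarity(u, v))
```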


In [45]:
# embeddings() expects a list of strings, so wrap the single answer in a list
hypothetical_answer_embedding = embeddings([hypothetical_answer])[0]
article_embeddings = embeddings(
    [
        f"{article['title']} {article['description']} {article['content'][0:100]}"
        for article in articles
    ]
)

# Calculate cosine similarity
cosine_similarities = []
for article_embedding in article_embeddings:
    cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))

cosine_similarities[0:10]

[0.6962791375040007,
 0.7181057968180035,
 0.6612304187122091,
 0.661220916024921,
 0.7087449610313865,
 0.7846035385705046,
 0.716807401567893,
 0.6982201729846651,
 0.7489642967838637,
 0.6937529201134262]

Finally, we use these similarity scores to sort and filter the results.


In [46]:
scored_articles = zip(articles, cosine_similarities)

# Sort articles by cosine similarity
sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)

# Print top 5 articles
print("Top 5 articles:", "\n")

for article, score in sorted_articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print("Score:", score)
    print()


Top 5 articles: 

Title: Cormac McCarthy, revered American novelist, dies at 89
Description: NEW YORK - Celebrated author Cormac McCarthy, an unflinching chronicler of America's bleak frontiers and grim underbelly, died on Tuesday aged 89, his publisher said.
Content: NEW YORK - Celebrated author Cormac McCarthy, an unflinching chronicler of America's bleak frontiers...
Score: 0.8271407791897417

Title: Great American Novelist Cormac McCarthy Boldly Waded Into The Bleakness Of The Human Condition
Description: McCarthy explored how even a 'Christianized' American can remain a violent wilderness in search of God and meaning.
Content: Quintessentially American novelist Cormac McCarthy died on Tuesday, June 13, 2023. Since his death, ...
Score: 0.8244918938044037

Title: Cormac McCarthy, Revered American Novelist, Dies At 89
Description: Celebrated author Cormac McCarthy, an unflinching chronicler of America's bleak frontiers and grim underbelly, died on Tuesday aged 89, his publisher said

In [47]:
[s[1] for s in sorted_articles[:20]]

[0.8271407791897417,
 0.8244918938044037,
 0.8211298941056193,
 0.8145666870741505,
 0.8134215641004402,
 0.8103722855349912,
 0.8095520087274358,
 0.8067701485571332,
 0.8060943165897994,
 0.8027768100931316,
 0.8023430179372025,
 0.8018686253231713,
 0.7991268498142123,
 0.7989662834219854,
 0.7960174867696671,
 0.7955277816338275,
 0.7954717275490183,
 0.792841239391181,
 0.7927699058886994,
 0.7925346097681247]

Awesome! These results look a lot more relevant to our original query. Now, let's use the top 5 results to generate a final answer.
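The ranking and filtering step can also be wrapped in a small helper. The sketch below is only an illustration: `select_top_results` and its optional `min_score` cutoff are hypothetical and not part of the notebook, which filters by slicing the sorted list.

```python
from typing import Any, Optional


def select_top_results(
    scored: list[tuple[Any, float]],
    top_k: int = 5,
    min_score: Optional[float] = None,
) -> list[tuple[Any, float]]:
    """Sort (item, score) pairs by score, highest first, and keep the best ones."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        # Optional cutoff: drop anything below the similarity threshold
        ranked = [pair for pair in ranked if pair[1] >= min_score]
    return ranked[:top_k]
```

With the notebook's variables, this could be called as `select_top_results(list(zip(articles, cosine_similarities)), top_k=5)`. A suitable `min_score` depends on the embedding model and the data, so it would need to be tuned rather than copied from the scores shown here.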

## 3. Answer


In [49]:
formatted_top_results = [
    {
        "title": article["title"],
        "description": article["description"],
        "url": article["url"],
    }
    for article, _score in sorted_articles[0:5]
]

ANSWER_INPUT = f"""
Generate an answer to the user's question based on the given search results. 
TOP_RESULTS: {formatted_top_results}
USER_QUESTION: {USER_QUESTION}

Include as much information as possible in the answer. Reference the relevant search result urls as markdown links.
"""

completion = openai.ChatCompletion.create(
    model=GPT_MODEL,
    messages=[{"role": "user", "content": ANSWER_INPUT}],
    temperature=0.5,
    stream=True,
)

text = ""
for chunk in completion:
    text += chunk.choices[0].delta.get("content", "")
    display.clear_output(wait=True)
    display.display(display.Markdown(text))

Cormac McCarthy, a celebrated American novelist known for his unflinching depictions of America's bleak frontiers and grim underbelly, passed away in June 2023 at the age of 89. McCarthy was a Pulitzer Prize-winning author who wrote acclaimed works such as "The Road," "No Country for Old Men," and "Blood Meridian." His writing often explored the violence and darkness of the human condition, and his work has been hailed as some of the greatest American literature of the modern era. For more information on McCarthy's life and work, you can read articles from [Bangkok Post](https://www.bangkokpost.com/world/2591170/cormac-mccarthy-revered-american-novelist-dies-at-89), [The Federalist](https://thefederalist.com/2023/06/15/great-american-novelist-cormac-mccarthy-boldly-waded-into-the-bleakness-of-the-human-condition/), [IBTimes](https://www.ibtimes.com/cormac-mccarthy-revered-american-novelist-dies-89-3699623), [Japan Today](https://japantoday.com/category/world/cormac-mccarthy-revered-american-novelist-dies-at-891), and [CNN](https://www.cnn.com/style/article/cormac-mccarthy-author-death/index.html).