# Question Answering with Langchain, Qdrant and OpenAI

This notebook shows how to implement a Question Answering system with Langchain, using Qdrant as a knowledge base and OpenAI embeddings. If you are not familiar with Qdrant, check out the [Getting_started_with_Qdrant_and_OpenAI.ipynb](Getting_started_with_Qdrant_and_OpenAI.ipynb) notebook first.

This notebook presents an end-to-end process of:
1. Calculating the embeddings with OpenAI API.
2. Storing the embeddings in a local instance of Qdrant to build a knowledge base.
3. Converting raw text query to an embedding with OpenAI API.
4. Using Qdrant to perform the nearest neighbour search in the created collection to find some context.
5. Asking an LLM to find the answer in the given context.

Each of these steps reduces to a call to the corresponding Langchain method.
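To make steps 3 and 4 more concrete, here is a minimal sketch of a nearest neighbour search over embeddings, written in plain Python. The three-dimensional vectors, the texts and the `nearest` helper are all made up for illustration; neither OpenAI nor Qdrant is called here, and real embeddings have far more dimensions. A vector database is designed to avoid the linear scan used below.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vector, stored, k=1):
    """Return the texts of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for the knowledge base.
stored = [
    ("text about zombies", [0.9, 0.1, 0.0]),
    ("text about trees", [0.0, 0.2, 0.9]),
    ("text about cars", [0.1, 0.9, 0.1]),
]

# Hypothetical embedding of the user's query.
query_vector = [0.8, 0.2, 0.1]

top = nearest(query_vector, stored, k=2)
print(top)
```

The texts returned by such a search are what later gets passed to the LLM as context.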

## Prerequisites

For the purposes of this exercise we need to prepare a couple of things:

1. A Qdrant server instance, in our case a local Docker container.
2. The [qdrant-client](https://github.com/qdrant/qdrant_client) library to interact with the vector database.
3. [Langchain](https://github.com/hwchase17/langchain) as a framework.
4. An [OpenAI API key](https://beta.openai.com/account/api-keys).

### Start Qdrant server

We're going to use a local Qdrant instance running in a Docker container. The easiest way to launch it is to use the attached [docker-compose.yaml] file and run the following command:

In [1]:
! docker-compose up -d

qdrant_qdrant_1 is up-to-date


We can verify that the server launched successfully by running a simple curl command:

In [2]:
! curl http://localhost:6333

{"title":"qdrant - vector search engine","version":"1.0.1"}
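The same check can be done from Python. The sketch below uses only the standard library; the helper names `parse_qdrant_info` and `fetch_qdrant_info` are hypothetical, and the code assumes the root endpoint returns the JSON shape shown above. Only the parsing step is executed here, so the example also runs without a server.

```python
import json
import urllib.request


def parse_qdrant_info(body):
    """Return (title, version) from the JSON body of the root endpoint."""
    info = json.loads(body)
    return info["title"], info["version"]


def fetch_qdrant_info(url="http://localhost:6333", timeout=5):
    """Query a running Qdrant instance; raises URLError if it is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_qdrant_info(response.read().decode("utf-8"))


# Offline check against the response shown above.
sample_body = '{"title":"qdrant - vector search engine","version":"1.0.1"}'
title, version = parse_qdrant_info(sample_body)
print(title, version)
```

With the container running, `fetch_qdrant_info()` should return the same pair for the live server.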

### Install requirements

This notebook requires the `openai`, `langchain` and `qdrant-client` packages.


In [None]:
! pip install openai qdrant-client "langchain==0.0.100" wget

### Prepare your OpenAI API key

The OpenAI API key is used for vectorization of the documents and queries.

If you don't have an OpenAI API key, you can get one from [https://beta.openai.com/account/api-keys](https://beta.openai.com/account/api-keys).

Once you get your key, please add it to your environment variables as `OPENAI_API_KEY`.

In [4]:
# Test that your OpenAI API key is correctly set as an environment variable
# Note: if you run this notebook locally, you will need to reload your terminal and the notebook for the env variables to take effect.
import os

# Note: alternatively, you can set a temporary env variable like this:
# os.environ["OPENAI_API_KEY"] = "-----------------------------"

if os.getenv("OPENAI_API_KEY") is not None:
    print("OPENAI_API_KEY is ready")
else:
    print("OPENAI_API_KEY environment variable not found")

OPENAI_API_KEY is ready


## Load data

In this section we load a dataset of natural questions and their answers. All of this data will be used to build a Langchain application with Qdrant as the knowledge base.

In [5]:
# All the examples come from https://ai.google.com/research/NaturalQuestions
# This is a sample of the training set that we download and extract for
# further processing.

!wget -c https://storage.googleapis.com/dataset-natural-questions/questions.json
!wget -c https://storage.googleapis.com/dataset-natural-questions/answers.json

--2023-02-16 18:06:29--  https://storage.googleapis.com/dataset-natural-questions/questions.json
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.203.208, 216.58.208.208, 142.250.75.16, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.203.208|:443... connected.
HTTP request sent, awaiting response... 416 Requested range not satisfiable

    The file is already fully retrieved; nothing to do.

--2023-02-16 18:06:29--  https://storage.googleapis.com/dataset-natural-questions/answers.json
Resolving storage.googleapis.com (storage.googleapis.com)... 216.58.208.208, 142.250.186.208, 216.58.215.112, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.208.208|:443... connected.
HTTP request sent, awaiting response... 416 Requested range not satisfiable

    The file is already fully retrieved; nothing to do.



In [7]:
import json

with open("questions.json", "r") as fp:
    questions = json.load(fp)

with open("answers.json", "r") as fp:
    answers = json.load(fp)

In [8]:
print(questions[0])

when is the last episode of season 8 of the walking dead


In [9]:
print(answers[0])

No . overall No. in season Title Directed by Written by Original air date U.S. viewers ( millions ) 100 `` Mercy '' Greg Nicotero Scott M. Gimple October 22 , 2017 ( 2017 - 10 - 22 ) 11.44 Rick , Maggie , and Ezekiel rally their communities together to take down Negan . Gregory attempts to have the Hilltop residents side with Negan , but they all firmly stand behind Maggie . The group attacks the Sanctuary , taking down its fences and flooding the compound with walkers . With the Sanctuary defaced , everyone leaves except Gabriel , who reluctantly stays to save Gregory , but is left behind when Gregory abandons him . Surrounded by walkers , Gabriel hides in a trailer , where he is trapped inside with Negan . 101 `` The Damned '' Rosemary Rodriguez Matthew Negrete & Channing Powell October 29 , 2017 ( 2017 - 10 - 29 ) 8.92 Rick 's forces split into separate parties to attack several of the Saviors ' outposts , during which many members of the group are killed ; Eric is critically injure

## Chain definition

Langchain is already integrated with Qdrant and performs all the indexing for a given list of documents. In our case we are going to store the set of answers we have.

In [10]:
from langchain.vectorstores import Qdrant
from langchain.embeddings import OpenAIEmbeddings
from langchain import VectorDBQA, OpenAI

embeddings = OpenAIEmbeddings()
doc_store = Qdrant.from_texts(
    answers, embeddings, host="localhost" 
)

At this stage all the possible answers are already stored in Qdrant, so we can define the whole QA chain.

In [11]:
llm = OpenAI()
qa = VectorDBQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    vectorstore=doc_store,
    return_source_documents=False,
)

## Search data

Once the data is stored in Qdrant, we can start asking questions. Each question is automatically vectorized by an OpenAI model, and the resulting vector is used to find possibly matching answers in Qdrant. Once retrieved, the most similar answers are incorporated into the prompt sent to the OpenAI Large Language Model. The communication between all the services is shown in the diagram:

![](https://qdrant.tech/articles_data/langchain-integration/flow-diagram.png)
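The last step of that flow, incorporating the retrieved answers into the prompt, can be sketched as follows. This only illustrates the idea behind the `"stuff"` chain type (all retrieved texts placed into a single prompt). The `build_stuffed_prompt` helper and the prompt wording are hypothetical and are not Langchain's actual template, and the passages are placeholders rather than real search results.

```python
def build_stuffed_prompt(question, retrieved_texts):
    """Join all retrieved texts into one context block and append the question."""
    context = "\n\n".join(retrieved_texts)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages standing in for the documents returned by Qdrant.
prompt = build_stuffed_prompt(
    "where do frankenstein and the monster first meet",
    ["Hypothetical retrieved passage A.", "Hypothetical retrieved passage B."],
)
print(prompt)
```

Because every retrieved text goes into one prompt, the number and length of retrieved documents is bounded by the model's context window.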


In [16]:
import random

random.seed(52)
selected_questions = random.choices(questions, k=5)
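Note that `random.choices` samples with replacement, so the same question may be selected more than once. The sketch below contrasts it with `random.sample`, which draws without replacement; the placeholder list is hypothetical, and `random.sample` is shown only as an alternative, not as what this notebook uses.

```python
import random

# Hypothetical stand-in for the loaded list of questions.
placeholder_questions = [f"question {i}" for i in range(10)]

random.seed(52)
with_replacement = random.choices(placeholder_questions, k=5)  # duplicates possible

random.seed(52)
without_replacement = random.sample(placeholder_questions, k=5)  # always distinct

print(with_replacement)
print(without_replacement)
```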

In [17]:
for question in selected_questions:
    print(">", question)
    print(qa.run(question), end="\n\n")

> where do frankenstein and the monster first meet
 Victor retreats into the mountains and the Creature finds him and pleads for Victor to hear his tale, so they first meet in the mountains.

> who are the actors in fast and furious
 The actors in Fast and Furious are Vin Diesel as Dominic Toretto, Paul Walker as Brian O'Conner, Michelle Rodriguez as Letty Ortiz, Jordana Brewster as Mia Toretto, Tyrese Gibson as Roman Pearce, Ludacris as Tej Parker, Lucas Black as Sean Boswell, Sung Kang as Han Lue, Gal Gadot as Gisele Yashar, Dwayne Johnson as Luke Hobbs, Matt Schulze as Vince, Chad Lindberg as Jesse, Johnny Strong as Leon, Eva Mendes as Monica Fuentes, Devon Aoki as Suki, Nathalie Kelley as Neela, Bow Wow as Twinkie, Tego Calderón as Tego Leo, Don Omar as Rico Santos, Elsa Pataky as Elena Neves, and Kurt Russell as Mr. Nobody.

> properties of red black tree in data structure
 Red black trees have the following properties: each node is either red or black, the root is black, all leav