# Developing a Movie Q&A Bot with RAG Using LangChain, Pinecone, OpenAI, and OpenEmbedding


Retrieval-augmented generation, or _RAG_, is a technique that retrieves relevant documents from an external knowledge source and supplies them to a large language model (LLM) as context in the prompt, without the need for fine-tuning or retraining. This addresses a common limitation of language models: their tendency to generate responses that lack factual accuracy or miss events after their training data ends. By grounding answers in retrieved text, RAG improves the factual reliability of generated responses.

## Project Overview

The primary objective of this project is to develop a sophisticated question-answering bot capable of addressing movie-related queries with high accuracy. To achieve this, we will employ RAG to supply the language model with pertinent factual information. The workflow involves several key steps, each contributing to the overall functionality and performance of the bot.

### Tools and Models

We will utilize the following tools and models to build our question-answering system:

- **OpenAI's `gpt-3.5-turbo` model**: This model will be used for generating prompt completions and answering user queries.
- **OpenAI's `text-embedding-ada-002` model**: This model will create vector embeddings of movie descriptions, enabling efficient similarity searches.
- **Pinecone**: A vector database that will store the embeddings and facilitate rapid retrieval of relevant information.
- **LangChain**: A framework that connects OpenAI's models and Pinecone, handling document loading, embedding, retrieval, and prompting.

### Dataset

The dataset for this project is sourced from the Kaggle dataset [IMDb Movies/Shows with Descriptions](https://www.kaggle.com/datasets/ishikajohari/imdb-data-with-descriptions). It contains comprehensive information about various movies, including titles, genres, descriptions, ratings, and more.

### Workflow

1. **Data Preparation**: 
   - Load and preprocess the dataset to ensure it is clean and structured.
   - Extract relevant columns such as movie titles, descriptions, and other metadata.

2. **Embedding Creation**:
   - Use the `text-embedding-ada-002` model to generate vector embeddings for each movie description.
   - Store these embeddings in the Pinecone vector database.

3. **Query Processing**:
   - When a user submits a movie-related question, convert the query into an embedding using the same embedding model.
   - Search the Pinecone database for the most relevant movie descriptions based on the query embedding.

4. **Response Generation**:
   - Provide the retrieved context to the `gpt-3.5-turbo` model.
   - Generate a coherent and factual response to the user's query using the context.

5. **Evaluation and Iteration**:
   - Continuously evaluate the performance of the bot.
   - Tune the retrieval and generation steps (not the model itself) to enhance accuracy and relevance.
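
The workflow above can be sketched end to end in plain Python. This is only an illustration under loud assumptions: word overlap stands in for embedding similarity, a two-movie list stands in for Pinecone, and the sketch stops at building the prompt rather than calling an LLM. The names `MOVIES`, `tokenize`, `retrieve`, and `build_prompt` are hypothetical and do not appear elsewhere in this project.

```python
import re

# Toy knowledge base standing in for the vector database (assumption).
MOVIES = [
    {"title": "Gladiator", "text": "A Roman general is forced to fight as a gladiator."},
    {"title": "Zootopia", "text": "A rabbit police officer solves a case in a city of animals."},
]


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, k=1):
    """Rank movies by word overlap with the question (stand-in for vector search)."""
    q_words = tokenize(question)
    ranked = sorted(
        MOVIES,
        key=lambda movie: len(q_words & tokenize(movie["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, context_docs):
    """Combine the retrieved context and the question into one prompt string."""
    context = "\n".join(f"{doc['title']}: {doc['text']}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


question = "Is there a movie about a Roman gladiator?"
top_docs = retrieve(question)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In the real pipeline, the retrieval step is replaced by an embedding search in Pinecone and the prompt is sent to `gpt-3.5-turbo`.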

### Goals and Scope

The ultimate goal of this project is to create a robust and reliable question-answering bot that can handle a wide range of movie-related questions. By integrating RAG, we aim to achieve the following:

- **Enhanced Accuracy**: Provide factually correct answers by leveraging external knowledge sources.
- **Improved User Experience**: Deliver coherent and contextually relevant responses to user queries.
- **Scalability**: Ensure the system can handle a large volume of queries efficiently.

This project will serve as a valuable addition to your generative AI portfolio, showcasing your ability to implement advanced techniques and build practical applications using state-of-the-art models and tools.

Let's dive into the implementation and bring this movie question-answering bot to life!

## Setup

To perform this analysis, we need to install the following packages:

- `openai`: for interacting with OpenAI.
- `pinecone-client`: for interacting with Pinecone.
- `langchain`: a framework for developing with generative AI.
- `langchain-openai` and `langchain-pinecone`: LangChain extension modules with functionality for OpenAI and Pinecone.
- `tiktoken`: a string encoder that generates tokens used by OpenAI. It is useful for estimating the number of tokens used.

### Install the corresponding packages

In [119]:
!pip install openai==1.27
!pip install pinecone-client==4.0.0
!pip install langchain==0.1.19
!pip install langchain-openai==0.1.6
!pip install langchain-pinecone==0.1.0
!pip install tiktoken==0.7.0
!pip install typing_extensions==4.11.0

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: python3 -m pip install --upgrade pip
Defaulting to user installation because normal site-packages is not writeable
Collecting pinecone-client==4.0.0
  Using cached pinecone_client-4.0.0-py3-none-any.whl.metadata (16 kB)
Using cached pinecone_client-4.0.0-py3-none-any.whl (214 kB)
Installing collected packages: pinecone-client
  Attempting uninstall: pinecone-client
    Found existing installation: pinecone-client 3.2.2
    Uninstalling pinecone-client-3.2.2:
      Successfully uninstalled pinecone-client-3.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.

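Because pip can report dependency conflicts during installation, it may be worth confirming which versions actually ended up in the environment. The sketch below uses only the standard library; the `PINNED` dictionary simply repeats the pins from the install cell, and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install cell above.
PINNED = {
    "openai": "1.27",
    "pinecone-client": "4.0.0",
    "langchain": "0.1.19",
    "langchain-openai": "0.1.6",
    "langchain-pinecone": "0.1.0",
    "tiktoken": "0.7.0",
    "typing_extensions": "4.11.0",
}


def installed_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, installed in installed_versions(PINNED).items():
    print(f"{name}: pinned {PINNED[name]}, installed {installed or 'not found'}")
```
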
## Import the Movies Data

We'll start with importing the dataset we mentioned at the top of this project. You have the dataset available as a CSV in your workspace: `"IMDB.csv"`. We need to import the dataset and transform it into a convenient format.

### Open dataset

- Import the `pandas` package as `pd`
- Import `"IMDB.csv"` into a variable `movies_raw`.
- Print the head of `movies_raw`.

In [120]:
# Import pandas
import pandas as pd

# Import IMDB.csv
movies_raw = pd.read_csv("IMDB.csv")

# Print the data
print(movies_raw.head())
print("Basic info about the Dataset:\n", movies_raw.describe()) 
print("Columns in Dataset:\n", movies_raw.columns) 

   Unnamed: 0  ...                                        Description
0           0  ...  Jodie Foster stars as Clarice Starling, a top ...
1           1  ...  In this sequel set eleven years after "The Ter...
2           2  ...  This Disney animated feature follows the adven...
3           3  ...  Vincent Vega (John Travolta) and Jules Winnfie...
4           4  ...  Andy Dufresne (Tim Robbins) is sentenced to tw...

[5 rows x 21 columns]
Basic info about the Dataset:
        Unnamed: 0         index  ...     ordering  isOriginalTitle
count  7850.00000   7850.000000  ...  7850.000000           7850.0
mean   3924.50000   5286.804076  ...    16.440382              0.0
std    2266.24414   2851.482333  ...    12.871290              0.0
min       0.00000      0.000000  ...     1.000000              0.0
25%    1962.25000   2847.250000  ...     6.000000              0.0
50%    3924.50000   5284.500000  ...    13.000000              0.0
75%    5886.75000   7602.750000  ...    24.000000        

In [121]:
display(movies_raw)

Unnamed: 0.1,Unnamed: 0,index,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,ordering,title,region,language,types,attributes,isOriginalTitle,Description
0,0,0,tt0102926,movie,The Silence of the Lambs,The Silence of the Lambs,0,1991,\N,118,"Crime,Drama,Thriller",8.6,1473918,50,The Silence of the Lambs,US,en,\N,\N,0,"Jodie Foster stars as Clarice Starling, a top ..."
1,1,1,tt0103064,movie,Terminator 2: Judgment Day,Terminator 2: Judgment Day,0,1991,\N,137,"Action,Sci-Fi",8.6,1128166,17,Terminator 2: Judgment Day,US,en,dvd,\N,0,"In this sequel set eleven years after ""The Ter..."
2,2,3,tt0110357,movie,The Lion King,The Lion King,0,1994,\N,88,"Adventure,Animation,Drama",8.5,1090882,18,The Lion King 3D,US,en,\N,3-D version,0,This Disney animated feature follows the adven...
3,3,4,tt0110912,movie,Pulp Fiction,Pulp Fiction,0,1994,\N,154,"Crime,Drama",8.9,2118762,22,Pulp Fiction,US,en,\N,\N,0,Vincent Vega (John Travolta) and Jules Winnfie...
4,4,5,tt0111161,movie,The Shawshank Redemption,The Shawshank Redemption,0,1994,\N,142,Drama,9.3,2759621,2,The Shawshank Redemption,US,en,\N,\N,0,Andy Dufresne (Tim Robbins) is sentenced to tw...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7845,7845,10269,tt9789686,movie,The Blonde One,Un rubio,0,2019,\N,108,"Drama,Romance",7.3,3779,11,The Blonde One,CA,en,imdbDisplay,\N,0,Two men begin a romantic relationship in Bueno...
7846,7846,10270,tt9814900,tvSeries,Trailer Park Boys: The Animated Series,Trailer Park Boys: The Animated Series,0,2019,2020,25,"Animation,Comedy",7.5,3318,4,Trailer Park Boys: The Animated Series,CA,en,imdbDisplay,\N,0,Nova Scotia's trailer parks are colorful thank...
7847,7847,10271,tt9845110,movie,Two of Us,Deux,0,2019,\N,99,"Drama,Romance",7.2,3346,29,Two of Us,CA,en,imdbDisplay,\N,0,"Two retired women, Nina and Madeleine, have be..."
7848,7848,10272,tt9845398,movie,End of the Century,Fin de siglo,0,2019,\N,84,Drama,6.9,3646,6,Fin De Siglo,CA,en,imdbDisplay,\N,0,A 30-something Argentine poet on vacation in B...


### EDA: Transform `movies_raw` and assign to `movies`
  
- Rename `primaryTitle` to `movie_title` and `Description` to `movie_description`
- Create a column `source` that contains the identifier of the movie, prefixed by `"https://www.imdb.com/title/"`. The end result should be a working link to the movie. The identifier can be found in the `tconst` column of `"IMDB.csv"`. For example, `"https://www.imdb.com/title/tt0102926"`.
- Filter out all rows that do not have `"movie"` as a `titleType`
- Select the `movie_title`, `movie_description`, `source` and `genres` columns
- Show the head of `movies`.

In [122]:
# Rename columns
movies = movies_raw.rename(columns = {
    "primaryTitle": "movie_title",
    "Description": "movie_description",
})

# Add source column from tconst
movies["source"] = "https://www.imdb.com/title/" + movies["tconst"]

# Subset for titleType equal to "movie"
movies = movies.loc[movies["titleType"] == "movie"]

# Select important columns (copy so that adding columns later does not trigger pandas' SettingWithCopyWarning)
movies = movies[["movie_title", "movie_description", "source", "genres"]].copy()

# output
movies.head(10)

Unnamed: 0,movie_title,movie_description,source,genres
0,The Silence of the Lambs,"Jodie Foster stars as Clarice Starling, a top ...",https://www.imdb.com/title/tt0102926,"Crime,Drama,Thriller"
1,Terminator 2: Judgment Day,"In this sequel set eleven years after ""The Ter...",https://www.imdb.com/title/tt0103064,"Action,Sci-Fi"
2,The Lion King,This Disney animated feature follows the adven...,https://www.imdb.com/title/tt0110357,"Adventure,Animation,Drama"
3,Pulp Fiction,Vincent Vega (John Travolta) and Jules Winnfie...,https://www.imdb.com/title/tt0110912,"Crime,Drama"
4,The Shawshank Redemption,Andy Dufresne (Tim Robbins) is sentenced to tw...,https://www.imdb.com/title/tt0111161,Drama
5,Titanic,"James Cameron's ""Titanic"" is an epic, action-p...",https://www.imdb.com/title/tt0120338,"Drama,Romance"
6,Corpse Bride,Victor (Johnny Depp) and Victoria's (Emily Wat...,https://www.imdb.com/title/tt0121164,"Animation,Drama,Family"
7,Gladiator,Commodus (Joaquin Phoenix) takes power and str...,https://www.imdb.com/title/tt0172495,"Action,Adventure,Drama"
8,A.I. Artificial Intelligence,"A robotic boy, the first programmed to love, D...",https://www.imdb.com/title/tt0212720,"Drama,Sci-Fi"
12,The Dummy,A murderous ventriloquist dummy terrorizes new...,https://www.imdb.com/title/tt0246592,"Comedy,Drama,Romance"


## Create Documents from the Data

Later in this project, we will be creating vector embeddings for all of the rows in the `movies` DataFrame. Before we do so, we need to create [Document](https://docs.langchain.com/docs/components/schema/document) objects from the data in the DataFrame. To accomplish this, we can use the `DataFrameLoader` class provided by LangChain, which creates documents from a pandas DataFrame.

For the main content of the documents, we will create a summary string that includes relevant information about each movie. To achieve this, we will combine the movie title, description, and genre into a `page_content` column. Additionally, we will retain the IMDB link in the `source` column as metadata.

- Import `DataFrameLoader` from `langchain.document_loaders`
- Create a column `page_content` that creates strings that contain information about the movie title, genre and description. For example, the first movie should look like this:
```
Title: The Silence of the Lambs
Genre: Crime,Drama,Thriller
Description: Jodie Foster stars as Clarice Starling, a top student at the FBI's training academy. Jack Crawford (Scott Glenn) wants Clarice to interview Dr. Hannibal Lecter (Anthony Hopkins), a brilliant psychiatrist who is also a violent psychopath, serving life behind bars for various acts of murder and cannibalism. Crawford believes that Lecter may have insight into a case and that Starling, as an attractive young woman, may be just the bait to draw him out.
```
- Only keep the columns `page_content` and `source` in the movies DataFrame

In [123]:
# Import DataFrameLoader
from langchain.document_loaders import DataFrameLoader

# Build the page_content column from title, genre, and description
movies["page_content"] = "Title: " + movies["movie_title"] + "\n" + \
                         "Genre: " + movies["genres"] + "\n" + \
                         "Description: " + movies["movie_description"]

# Select page_content and source columns
movies = movies[["page_content", "source"]]
movies.head(10)

Unnamed: 0,page_content,source
0,"Title: The Silence of the Lambs\nGenre: Crime,...",https://www.imdb.com/title/tt0102926
1,Title: Terminator 2: Judgment Day\nGenre: Acti...,https://www.imdb.com/title/tt0103064
2,"Title: The Lion King\nGenre: Adventure,Animati...",https://www.imdb.com/title/tt0110357
3,"Title: Pulp Fiction\nGenre: Crime,Drama\nDescr...",https://www.imdb.com/title/tt0110912
4,Title: The Shawshank Redemption\nGenre: Drama\...,https://www.imdb.com/title/tt0111161
5,"Title: Titanic\nGenre: Drama,Romance\nDescript...",https://www.imdb.com/title/tt0120338
6,"Title: Corpse Bride\nGenre: Animation,Drama,Fa...",https://www.imdb.com/title/tt0121164
7,"Title: Gladiator\nGenre: Action,Adventure,Dram...",https://www.imdb.com/title/tt0172495
8,Title: A.I. Artificial Intelligence\nGenre: Dr...,https://www.imdb.com/title/tt0212720
12,"Title: The Dummy\nGenre: Comedy,Drama,Romance\...",https://www.imdb.com/title/tt0246592
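
One thing to watch for: with pandas string concatenation, a missing value in any of the three columns makes the whole `page_content` value missing. The raw data also marks missing values with the literal string `\N` (visible in the `endYear` column earlier). The plain-Python sketch below shows one way to build the same summary string while skipping incomplete rows. The helper `build_page_content`, the `MISSING` set, and the second example row are hypothetical, and whether the dataset actually has gaps in these columns is not checked here.

```python
# Values treated as missing; the raw IMDb export uses the literal string \N.
MISSING = {None, "", r"\N"}


def build_page_content(row):
    """Return the summary string for one movie, or None if a field is missing."""
    fields = (row.get("movie_title"), row.get("genres"), row.get("movie_description"))
    if any(value in MISSING for value in fields):
        return None
    title, genres, description = fields
    return f"Title: {title}\nGenre: {genres}\nDescription: {description}"


rows = [
    {
        "movie_title": "The Silence of the Lambs",
        "genres": "Crime,Drama,Thriller",
        "movie_description": "Jodie Foster stars as Clarice Starling, a top student at the FBI's training academy.",
    },
    # Made-up row with a missing genre, for illustration only.
    {"movie_title": "Untitled", "genres": r"\N", "movie_description": "No genre recorded."},
]

contents = [build_page_content(row) for row in rows]
complete = [text for text in contents if text is not None]
print(f"{len(complete)} of {len(rows)} rows produced a page_content string")
```
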


### Using `DataFrameLoader()` to generate the document

- Use DataFrameLoader to load documents from the movies DataFrame into docs. Use "page_content" as the `page_content_column`.
- Print the first 3 documents and the total number of documents

In [124]:
# Load documents
docs = DataFrameLoader(movies, page_content_column = "page_content").load()

# output
print(f"First 3 documents:\n {docs[:3]}\n")
print(f"Total Number of documents:\n {len(docs)}")

First 3 documents:
 [Document(page_content="Title: The Silence of the Lambs\nGenre: Crime,Drama,Thriller\nDescription: Jodie Foster stars as Clarice Starling, a top student at the FBI's training academy. Jack Crawford (Scott Glenn) wants Clarice to interview Dr. Hannibal Lecter (Anthony Hopkins), a brilliant psychiatrist who is also a violent psychopath, serving life behind bars for various acts of murder and cannibalism. Crawford believes that Lecter may have insight into a case and that Starling, as an attractive young woman, may be just the bait to draw him out.", metadata={'source': 'https://www.imdb.com/title/tt0102926'}), Document(page_content='Title: Terminator 2: Judgment Day\nGenre: Action,Sci-Fi\nDescription: In this sequel set eleven years after "The Terminator," young John Connor (Edward Furlong), the key to civilization\'s victory over a future robot uprising, is the target of the shape-shifting T-1000 (Robert Patrick), a Terminator sent from the future to kill him. Anothe

## Estimate the Cost of Embedding

We're going to be using OpenAI to calculate [vector embeddings](https://platform.openai.com/docs/guides/embeddings/embeddings) of the document texts. Creating embeddings is a form of dimensionality reduction, where we assign the text to a point in an N-dimensional space. Texts that are semantically close to each other should end up being close to each other in the N-dimensional space.
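
To make "close to each other" concrete, one common measure is cosine similarity, which is also the metric this project later selects for the Pinecone index. The sketch below uses made-up three-dimensional vectors purely for illustration; real `text-embedding-ada-002` vectors have 1536 dimensions, and the variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" (assumption: not produced by any real model).
gladiator = [0.9, 0.1, 0.0]
roman_epic = [0.8, 0.2, 0.1]
cartoon = [0.0, 0.2, 0.9]

similar = cosine_similarity(gladiator, roman_epic)
dissimilar = cosine_similarity(gladiator, cartoon)
print(f"gladiator vs roman_epic: {similar:.3f}")
print(f"gladiator vs cartoon:    {dissimilar:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated ones score close to 0.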

Luckily, OpenAI has several models that are trained to calculate these kinds of embeddings, so we don't have to do that ourselves. Of course, a cost is associated with this. You can derive the cost from the [pricing page of OpenAI](https://openai.com/pricing).

The calculation is based on the number of _tokens_ in the text. All text is encoded into tokens to be used by OpenAI. On average, a token corresponds to roughly 4 characters of English text. However, we can calculate the exact number of tokens for a string of text by using the `tiktoken` package.

The goal of this task is to calculate the number of tokens in the documents, to then extrapolate the estimated cost.

### Let's calculate the cost

- Import `tiktoken`
- Create the encoder, use the `"cl100k_base"` encoder. This is the encoder used by OpenAI to calculate the embeddings for text using the `text-embedding-ada-002` model.
- Create a list that contains the number of tokens for each document
- Calculate the estimated cost: the sum of all tokens, divided by 1000 tokens, multiplied by $0.0001

In [125]:
# Import tiktoken
import tiktoken

# encoder
encoder = tiktoken.get_encoding("cl100k_base")

# list containing the number of tokens for each document
tokens_per_doc = [len(encoder.encode(doc.page_content)) for doc in docs]

# Show estimated cost
total_tokens = sum(tokens_per_doc)
cost_per_1000_tokens = 0.0001
cost = (total_tokens / 1000) * cost_per_1000_tokens
print("Total Cost in USD:", cost)

Total Cost in USD: 0.0374556
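
The same arithmetic can be wrapped in a small reusable function. In this sketch, the function name `estimate_embedding_cost` and its `usd_per_1k_tokens` parameter are hypothetical; the default rate is the $0.0001 per 1000 tokens used above. Working backwards, the printed cost corresponds to 374,556 tokens at that rate, which the example reproduces with two made-up per-document counts that sum to the same total.

```python
def estimate_embedding_cost(token_counts, usd_per_1k_tokens=0.0001):
    """Return the estimated cost in USD for a list of per-document token counts."""
    total_tokens = sum(token_counts)
    return total_tokens / 1000 * usd_per_1k_tokens


# Two illustrative counts that add up to 374,556 tokens.
example_counts = [300_000, 74_556]
cost = estimate_embedding_cost(example_counts)
print(f"Estimated cost for {sum(example_counts)} tokens: ${cost:.7f}")
```

Passing a different `usd_per_1k_tokens` value makes it easy to re-run the estimate if the price on the pricing page changes.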


## Create the Index on Pinecone

Looks like calculating the embeddings is not going to be too expensive. It's always smart to get a rough estimate of the number of tokens used, so you get an idea of the cost of calculating the embeddings using OpenAI.

Now we're ready to create the index on Pinecone. An [index in Pinecone](https://docs.pinecone.io/docs/indexes) can be used to store vectors. You can compare an index in Pinecone to a table in SQL: it stores information about one type of object.

In a later task, we'll be creating vectors from the documents we just created using OpenAI's second-generation embedding model. It's important to already know the embeddings we're going to use since we need to know the output dimensions to create an index. For `text-embedding-ada-002`, this is `1536` dimensions ([source](https://platform.openai.com/docs/guides/embeddings/second-generation-models)).

At the end of this task, you should be able to find your new index, `imdb-movies`, in the [Pinecone UI](https://app.pinecone.io/).

### Initialize Pinecone

- Import the `os` package.
- Import the `pinecone` package.
- Set the Pinecone API key from the environment variable. Assign to `api_key`.
- Initialize Pinecone using the API key. Assign to `pc`.

In [126]:
import os
import pinecone

# Read the Pinecone API key from the environment
api_key = os.getenv("PINECONE_API_KEY")

# Initialize the Pinecone client
pc = pinecone.Pinecone(api_key=api_key)

- List the names of available indexes. Assign to `existing_index_names`.
- Use `.create_index` to create an index with the name `"imdb-movies"`, but only if it does not exist yet. The metric we'll use is the `"cosine"` distance, and as we mentioned above, the embeddings will have `1536` dimensions. Use a Serverless specification setting `cloud` to `aws` and `region` to `us-east-1`.

In [127]:
print("Available indexes:")
pc.list_indexes()

Available indexes:


{'indexes': [{'dimension': 1536,
              'host': 'squad-search-yy0cjqi.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'squad-search',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
              'host': 'imdb-movies-yy0cjqi.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'imdb-movies',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
              'host': 'vector-index-yy0cjqi.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'vector-index',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
      

In [128]:
print("Check the available indexes again:")
pc.list_indexes()

Check the available indexes again:


{'indexes': [{'dimension': 1536,
              'host': 'squad-search-yy0cjqi.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'squad-search',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
              'host': 'imdb-movies-yy0cjqi.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'imdb-movies',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
              'host': 'vector-index-yy0cjqi.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'vector-index',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 1536,
      

In [129]:
# set index name and check the names of available indexes
index_name = "imdb-movies"

existing_index_names = [idx.name for idx in pc.list_indexes().indexes]

# Only create the index if it does not exist yet
if index_name not in existing_index_names:
    print(f"Index {index_name} does not exist yet, creating it")
    pc.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,
        spec=pinecone.ServerlessSpec(cloud="aws", region="us-east-1")
    )

In [130]:
# index stats
index_stats = pc.describe_index(index_name)
index_stats

{'dimension': 1536,
 'host': 'imdb-movies-yy0cjqi.svc.aped-4627-b74a.pinecone.io',
 'metric': 'cosine',
 'name': 'imdb-movies',
 'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
 'status': {'ready': True, 'state': 'Ready'}}
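
A freshly created index may take a moment before it reports `ready`. A small polling helper, sketched below, can wait for that. It assumes `pc` is the Pinecone client created above and relies on the `describe_index(...).status["ready"]` field shown in the output; the function name `wait_until_ready` and the `timeout_s` and `poll_s` parameters are hypothetical.

```python
import time


def wait_until_ready(pc, index_name, timeout_s=60.0, poll_s=1.0):
    """Poll describe_index until the index reports ready or the timeout expires.

    Returns True if the index became ready in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if pc.describe_index(index_name).status["ready"]:
            return True
        time.sleep(poll_s)
    return False
```

With the client from above, `wait_until_ready(pc, index_name)` could be called right after `pc.create_index(...)` and before any vectors are uploaded.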

## Fill the Index with the Documents

Now that we have the vector index at our disposal, it's time to populate it with some vectors. In this task, we'll need to:

1. Generate vector embeddings for all documents in `docs`. We'll utilize OpenAI for this purpose. LangChain provides a convenient helper for this task, `langchain_openai.OpenAIEmbeddings`, which generates embeddings using the `text-embedding-ada-002` model by default.
2. Populate the vector index in Pinecone with these embeddings. Fortunately, LangChain also offers assistance with this through the [`langchain_pinecone.PineconeVectorStore`](https://python.langchain.com/docs/integrations/vectorstores/pinecone) helper.

These two steps can be combined using the convenient helper method `.from_documents` of the `PineconeVectorStore` class. This method accepts the documents and an embedding model as input, calculates the embeddings, and uploads them to Pinecone. We will also introduce some control flow to the code to ensure we do not add data to the Pinecone index if it already contains data. To achieve this, we can make use of the `.from_existing_index` method of `PineconeVectorStore`.

In addition to storing vectors, Pinecone allows the storage of additional metadata. When using the langchain helpers, it automatically assumes that vectors should be created from the `page_content` property of each `Document`. All other properties will be included as metadata.

You can verify that everything has worked correctly by accessing the `imdb-movies` index in the Pinecone UI.

- From the `langchain_openai` package, import `OpenAIEmbeddings`.
- From the `langchain_pinecone` package, import `PineconeVectorStore`.
- Create the embeddings object.
- Get a handle to the existing index by its name.

In [131]:
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore

# Create the embeddings object (reads the OpenAI API key from the OPENAI_API_KEY environment variable)
embeddings = OpenAIEmbeddings()

# Connect to the existing index by name
index = pc.Index(index_name)

### Fill the index. _Some control flow is provided._

- Count the number of vectors in the index. Assign to `n_vectors`.
- Check if there is already some data in the index. (If `n_vectors` is greater than zero.)
- If there is data in the index, get the documents to search from the index. Assign to `docsearch`.
- If there is no data, fill the index from the documents and return those docs to assign to `docsearch`.
- Use the predefined question to ask about movies.
- Convert the vector database to a retriever and get the relevant documents for a question.

In [132]:
n_vectors = index.describe_index_stats()['total_vector_count']
print(f"There are {n_vectors} vectors in the index already.")

if n_vectors > 0:
    docsearch = PineconeVectorStore.from_existing_index(index_name, embeddings)
else:
    docsearch = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)

# first question about movies to search
question = "What's a good movie about gladiators or the Roman Empire?"
    
print("These are the documents most relevant to the question:")
docsearch.as_retriever().invoke(question)

There are 4963 vectors in the index already.
These are the documents most relevant to the question:


[Document(page_content='Title: Gladiator\nGenre: Action,Adventure,Drama\nDescription: Commodus (Joaquin Phoenix) takes power and strips rank from Maximus (Russell Crowe), one of the favored generals of his predecessor and father, Emperor Marcus Aurelius, the great stoical philosopher. Maximus is then relegated to fighting to the death in the gladiator arenas.', metadata={'source': 'https://www.imdb.com/title/tt0172495'}),
 Document(page_content="Title: Imperium\nGenre: Biography,Crime,Drama\nDescription: An idealistic FBI agent (Daniel Radcliffe) goes under cover to infiltrate a white supremacist group that's plotting an act of terror.", metadata={'source': 'https://www.imdb.com/title/tt4781612'}),
 Document(page_content='Title: Caesar Must Die\nGenre: Drama\nDescription: Inmates in a high-security prison prepare for a public performance of Shakespeare\'s "Julius Caesar."', metadata={'source': 'https://www.imdb.com/title/tt2177511'}),
 Document(page_content='Title: War for the Planet o

In [133]:
#second query for another topic
question_2 = "What's a good movie about animals and animation?"

print("These are the documents most relevant to animals and animation movies:")
docsearch.as_retriever().invoke(question_2)

These are the documents most relevant to animals and animation movies:


[Document(page_content='Title: Life, Animated\nGenre: Comedy,Documentary,Drama\nDescription: "The Little Mermaid," "The Lion King" and other animated Disney movies help a young autistic man to develop reading, writing and communication skills.', metadata={'source': 'https://www.imdb.com/title/tt3917210'}),
 Document(page_content="Title: The Big Bad Fox and Other Tales\nGenre: Adventure,Animation,Comedy\nDescription: The countryside isn't always as calm and peaceful as it's made out to be, and the animals on this farm are particularly agitated: a fox who mothers a family of chicks, a rabbit who plays the stork, and a duck who wants to be Santa Claus.", metadata={'source': 'https://www.imdb.com/title/tt5851904'}),
 Document(page_content='Title: Zootopia\nGenre: Adventure,Animation,Comedy\nDescription: From the largest elephant to the smallest shrew, the city of Zootopia is a mammal metropolis where various animals live and thrive. When Judy Hopps (Ginnifer Goodwin) becomes the first rabb

## Create Prompts for RAG

In the previous task, we observed that the vector store can be utilized to retrieve relevant documents related to specific queries. For instance, the question about gladiators returned 'Gladiator' as the top result, and a question such as "What's a good movie about vikings?" returns the movie 'The Northman'. It is important to note that we did not incorporate any measure of movie quality into the system, so the notion of the movie being "good" is not explicitly encoded in the embeddings. Always consider the data provided to the system and interpret the results of the AI system within that context. To enhance the results, one approach could be to include information about movie quality, such as the `averageRating` column, in the movie description.

The remarkable aspect of RAG is the ability to provide relevant context to the LLM within the prompt itself. In the aforementioned example, we would include a description of 'The Northman' in the prompt, enabling the LLM to generate factual information beyond its knowledge cutoff. 'The Northman' was released in 2022, while the knowledge cutoff for the GPT-3.5 Turbo model is set at September 2021.

Now that you understand how the retriever can be employed to retrieve relevant documents from the vector database, we need to devise a prompt that presents this information to the LLM when we pose a question.

We require two types of [prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/):
- A template that demonstrates how the information in relevant documents is presented to the LLM
- A template that combines the context with the rest of the prompt

Some example prompt templates are provided in the sample code below, which you are free to edit. Notice that these example templates contain `=========` separators between different parts of the text. These kinds of delimiters are a common tactic to help the LLM distinguish between different parts of your input prompt.

### Create the question and document prompts.

- Import `PromptTemplate` from `langchain.prompts`
- Some example prompt templates are already provided for you. You are free to adapt them at your will. There are two prompt templates:
  - `DOCUMENT_PROMPT`: this template shows how a summary text is created for each document. The placeholders between curly brackets (`{}`) are replaced with the properties of each `Document`.
  - `QUESTION_PROMPT`: this template creates the full prompt that is sent to the LLM. `question` is replaced by the question asked by the user, and `summaries` is replaced with the summary of each relevant document, created by the `DOCUMENT_PROMPT` template
- Create the `PromptTemplate` objects by using `PromptTemplate.from_template`. Call them `document_prompt` and `question_prompt`, respectively.

In [134]:
# Import PromptTemplate
from langchain.prompts import PromptTemplate

DOCUMENT_PROMPT = """{page_content}
IMDB link: {source}
========="""

QUESTION_PROMPT = """Given the following extracted parts of a movie database and a question, create a final answer with the IMDB link as source ("SOURCE").
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCE" part in your answer.

QUESTION: What's a good movie about a robot to watch with my kid?
=========
Title: A.I. Artificial Intelligence
Genre: Drama,Sci-Fi
Description: A robotic boy, the first programmed to love, David (Haley Joel Osment) is adopted as a test case by a Cybertronics employee (Sam Robards) and his wife (Frances O'Connor). Though he gradually becomes their child, a series of unexpected circumstances make this life impossible for David. Without final acceptance by humans or machines, David embarks on a journey to discover where he truly belongs, uncovering a world in which the line between robot and machine is both vast and profoundly thin.
IMDB link: https://www.imdb.com/title/tt0212720
=========
Title: I, Robot
Genre: Action,Mystery,Sci-Fi
Description: In 2035, highly intelligent robots fill public service positions throughout the world, operating under three rules to keep humans safe. Despite his dark history with robotics, Detective Del Spooner (Will Smith) investigates the alleged suicide of U.S. Robotics founder Alfred Lanning (James Cromwell) and believes that a human-like robot (Alan Tudyk) murdered him. With the help of a robot expert (Bridget Moynahan), Spooner discovers a conspiracy that may enslave the human race.
IMDB link: https://www.imdb.com/title/tt0343818
=========
Title: The Iron Giant
Genre: Action,Adventure,Animation
Description: In this animated adaptation of Ted Hughes' Cold War fable, a giant alien robot (Vin Diesel) crash-lands near the small town of Rockwell, Maine, in 1957. Exploring the area, a local 9-year-old boy, Hogarth, discovers the robot, and soon forms an unlikely friendship with him. When a paranoid government agent, Kent Mansley, becomes determined to destroy the robot, Hogarth and beatnik Dean McCoppin (Harry Connick Jr.) must do what they can to save the misunderstood machine.
IMDB link: https://www.imdb.com/title/tt0129167
=========
FINAL ANSWER: 'The Iron Giant' is an animated movie about a friendship between a robot and a kid. It would be a good movie to watch with a kid.
SOURCE: https://www.imdb.com/title/tt0129167

QUESTION: {question}
=========
{summaries}
FINAL ANSWER:"""

# Create prompt template objects
document_prompt = PromptTemplate.from_template(DOCUMENT_PROMPT)
question_prompt = PromptTemplate.from_template(QUESTION_PROMPT)
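
To make the substitution concrete, here is a minimal plain-Python sketch of how the two templates fit together. It uses `str.format` rather than Langchain, and everything in it is an assumption made for illustration: the shortened templates `DOC_TEMPLATE` and `QUESTION_TEMPLATE`, the two made-up documents, the `build_prompt` helper, and the newline separator between summaries. It is not how Langchain implements this internally.

```python
# Shortened, hypothetical copies of the two templates, filled with plain str.format
DOC_TEMPLATE = """{page_content}
IMDB link: {source}
========="""

QUESTION_TEMPLATE = """QUESTION: {question}
=========
{summaries}
FINAL ANSWER:"""

# Made-up documents standing in for the Pinecone search results
docs = [
    {"page_content": "Title: Example Movie A\nGenre: Sci-Fi", "source": "https://example.com/a"},
    {"page_content": "Title: Example Movie B\nGenre: Animation", "source": "https://example.com/b"},
]


def build_prompt(question, docs, separator="\n"):
    """Format each document, join the summaries, and fill the question template."""
    summaries = separator.join(DOC_TEMPLATE.format(**doc) for doc in docs)
    return QUESTION_TEMPLATE.format(question=question, summaries=summaries)


prompt = build_prompt("What's a good movie about a robot?", docs)
print(prompt)
```

Printing `prompt` shows the question, one delimited block per document, and the trailing `FINAL ANSWER:` cue that the LLM is expected to complete.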

## Chain Everything Together to Perform RAG

The vector index is now filled with information and the prompt templates are set up. That is everything we need to build a question-answering bot that uses the information retrieved from Pinecone to answer questions about movies.

We'll use OpenAI's GPT-3.5 Turbo model to provide a completion for the question prompt above.

Langchain provides a convenient concept called [chains](https://python.langchain.com/docs/modules/chains/) that does some of the heavy lifting when you need to combine multiple AI systems into a single application. For this project, we'll use the `RetrievalQAWithSourcesChain` class. This chain is built around a `retriever` and accepts a `question` as input. When asked a question, it first uses the retriever to fetch relevant documents. It then combines those documents into a prompt and sends it to the LLM for a completion.
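
The retrieve-then-generate flow can be sketched in plain Python. This is only an illustration under loud assumptions: `MOVIES`, `retrieve`, `fake_llm`, and `answer_with_sources` are hypothetical stand-ins for Pinecone, the retriever, the chat model, and the chain, and the word-overlap ranking replaces real embedding similarity. It does not reproduce Langchain's implementation.

```python
import re

# Hypothetical in-memory "index" standing in for Pinecone
MOVIES = [
    {"page_content": "Title: Example Robot Movie\nDescription: A robot befriends a child.",
     "source": "https://example.com/robot"},
    {"page_content": "Title: Example Arena Movie\nDescription: A gladiator fights in Rome.",
     "source": "https://example.com/arena"},
]


def retrieve(question, k=1):
    """Toy retriever: rank documents by the number of words shared with the question."""
    q_words = set(re.findall(r"\w+", question.lower()))

    def overlap(doc):
        return len(q_words & set(re.findall(r"\w+", doc["page_content"].lower())))

    return sorted(MOVIES, key=overlap, reverse=True)[:k]


def fake_llm(prompt):
    """Stand-in for the chat model: cites the first link it finds in the prompt."""
    link = re.search(r"IMDB link: (\S+)", prompt).group(1)
    return f"See the movie at the link below.\nSOURCE: {link}"


def answer_with_sources(question):
    # Step 1: retrieve relevant documents
    docs = retrieve(question)
    # Step 2: stuff the documents into a single prompt
    summaries = "\n".join(
        f"{d['page_content']}\nIMDB link: {d['source']}\n=========" for d in docs
    )
    prompt = f"QUESTION: {question}\n=========\n{summaries}\nFINAL ANSWER:"
    # Step 3: ask the (fake) LLM and split the completion into answer and sources
    answer, _, sources = fake_llm(prompt).partition("SOURCE:")
    return {"question": question, "answer": answer.strip(), "sources": sources.strip()}


result = answer_with_sources("What's a good movie about a gladiator in Rome?")
print(result)
```

The returned dictionary has the same three keys (`question`, `answer`, `sources`) as the output of the real chain shown later in this section.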

- From the `langchain.chains` module, import `RetrievalQAWithSourcesChain`.
- From the `langchain_openai` package, import `ChatOpenAI`.
- Create an OpenAI client LLM model. Make it a `ChatOpenAI()` object with `model_name` `"gpt-3.5-turbo-1106"` and `temperature` set to `0` (minimal randomness). Assign to `llm`.

In [135]:
from langchain.chains import RetrievalQAWithSourcesChain
from langchain_openai import ChatOpenAI

# Create an OpenAI client LLM model
llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)

### Create the QA bot LLM chain

- Create a `RetrievalQAWithSourcesChain` with `RetrievalQAWithSourcesChain.from_chain_type` to answer questions. Assign it to `qa_with_sources`.
  - Set `chain_type` to `"stuff"`. This is the simplest type of chain: it stuffs all of the document context into a single prompt.
  - Set `llm` to the `ChatOpenAI` instance you just created.
  - Pass the `PromptTemplate` objects you created above in `chain_type_kwargs`.
  - As the retriever, use the `docsearch.as_retriever` method you've seen before.
- Invoke `qa_with_sources` to ask the LLM a question about movies.

In [136]:
# Create the QA bot LLM chain
qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    chain_type="stuff",
    llm=llm,
    chain_type_kwargs={
        "document_prompt": document_prompt,
        "prompt": question_prompt,
    },
    retriever=docsearch.as_retriever(),
)

In [137]:
# Invoke qa_with_sources to ask the LLM the first question about movies
qa_with_sources.invoke(question)

[chain/start] [chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
{
  "question": "What's a good movie about gladiators or the Roman Empire?"
}
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "What's a good movie about gladiators or the Roman Empire?",
}
[llm/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
  ]
}
[llm/end] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] [746ms] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "'Gladiator' is a good movie about 

{'question': "What's a good movie about gladiators or the Roman Empire?",
 'answer': "'Gladiator' is a good movie about the Roman Empire and gladiators.\n",
 'sources': 'https://www.imdb.com/title/tt0172495'}

In [138]:
# Invoke qa_with_sources to ask the LLM the second question about movies
qa_with_sources.invoke(question_2)

[chain/start] [chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
{
  "question": "What's a good movie about animals and animation?"
}
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "What's a good movie about animals and animation?",
}
[llm/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
  ]
}
[llm/end] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] [699ms] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "'Zootopia' is a good animated movie about animals. I

{'question': "What's a good movie about animals and animation?",
 'answer': "'Zootopia' is a good animated movie about animals. It's a comedy and adventure movie that would be great to watch with a kid.\n",
 'sources': 'https://www.imdb.com/title/tt2948356'}

## Add Debug Logging in Langchain

Let's take a moment to review what we've achieved with RAG that GPT-3.5 Turbo could not do on its own:

1. We enabled the LLM to answer the question with factual information, which can even include information from after the model's knowledge cutoff (September 2021 for GPT-3.5 Turbo).
2. We enabled the LLM to provide sources with the answer it generates.

Pretty neat, right?

We saw that Langchain makes it very convenient to build smart AI systems quickly. When you are learning, however, it can be hard to see what is happening behind the scenes. For example, from the code in Task 7, it's not clear that `qa_with_sources` first calls Pinecone to retrieve documents and then uses those documents to fill in the prompt that is sent to the `gpt-3.5-turbo` LLM.

Let's look at how we can get some more insights into how this all works.

- Import `langchain`
- Set `.debug` to `True` on `langchain`
- Run `qa_with_sources.invoke(question)` again

Observe the information that is printed in the output. Langchain enables you to run chains of LLMs or other AI systems, one after the other. The input for the next chain is passed on from the previous, where new information can be added by, for example, using embeddings to find relevant documents. Each chain or LLM is marked with a tag like `[chain/start]` or `[llm/start]`. When a final response is fetched from the last part of the chain, the output travels back up the chain. This is marked with the `[chain/end]` and `[llm/end]` marks.
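
To see why every `start` tag is eventually matched by an `end` tag, here is a small sketch of nested tracing. It is not Langchain's implementation: `traced`, `events`, and `path` are hypothetical names, and the context manager only imitates the tag format seen in the debug output.

```python
from contextlib import contextmanager

events = []  # collected log lines, in the order they are produced


@contextmanager
def traced(kind, name, path):
    """Record start and end markers for one step, nested under the steps already in path."""
    path.append(f"{kind}:{name}")
    label = " > ".join(path)
    events.append(f"[{kind}/start] [{label}]")
    try:
        yield
    finally:
        # The end marker is recorded on the way back up, innermost step first
        events.append(f"[{kind}/end] [{label}]")
        path.pop()


path = []
with traced("chain", "RetrievalQAWithSourcesChain", path):
    with traced("chain", "StuffDocumentsChain", path):
        with traced("chain", "LLMChain", path):
            with traced("llm", "ChatOpenAI", path):
                pass  # the model call would happen here

print("\n".join(events))
```

The four `start` lines are printed on the way down and the four `end` lines on the way back up, which mirrors how the output travels from the LLM back to the outermost chain.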

In [139]:
import langchain

# Enable debug logging
langchain.debug = True

# Invoke qa_with_sources to ask the LLM the same question about movies
qa_with_sources.invoke(question)

[chain/start] [chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
{
  "question": "What's a good movie about gladiators or the Roman Empire?"
}
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "What's a good movie about gladiators or the Roman Empire?",
}
[llm/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
  ]
}
[llm/end] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] [1.20s] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "'Gladiator' is a good movie about 

{'question': "What's a good movie about gladiators or the Roman Empire?",
 'answer': "'Gladiator' is a good movie about the Roman Empire and gladiators.\n",
 'sources': 'https://www.imdb.com/title/tt0172495'}

### Final Conclusion

In this project, we explored how Langchain manages and executes chains of large language models (LLMs) and other AI systems. By enabling debug logging, we were able to follow the step-by-step execution of the chains, which is useful for understanding and optimizing the performance of our AI workflows.

#### Steps Performed:
1. **Importing Langchain**: We started by importing the Langchain library, which is essential for creating and managing chains of LLMs.
2. **Enabling Debug Logging**: By setting `langchain.debug` to `True`, we enabled detailed logging, which helped us trace the flow of data and the execution of each chain and LLM.
3. **Executing `qa_with_sources`**: We invoked the `qa_with_sources` chain to ask questions about movies, using the Langchain framework to fetch and process relevant information.

#### Key Insights:
- **Chain Execution**: Langchain allows for the seamless execution of multiple LLMs or AI systems in a chain, where the output of one serves as the input for the next. This chaining mechanism is marked with tags like `[chain/start]` and `[chain/end]`, providing clear visibility into the process.
- **Debugging and Optimization**: The debug logs provided by Langchain are invaluable for debugging and optimizing the chains. They offer detailed information about each step, including the inputs and outputs of each LLM, which helps in identifying bottlenecks and improving performance.
- **Relevance and Accuracy**: By using embeddings to find relevant documents, the chain grounds the prompt in information that is relevant to the question, which helps the LLM give accurate answers. This is particularly important for tasks like question answering, where the quality of the final response depends on the quality of the intermediate steps.

#### Potential for Industry Applications:
The Langchain framework is highly versatile and can be applied to a wide range of AI projects in the industry. Its ability to manage complex chains of LLMs makes it suitable for tasks such as:
- **Customer Support**: Automating customer support with accurate and context-aware responses.
- **Content Generation**: Creating high-quality content by chaining multiple LLMs for brainstorming, drafting, and editing.
- **Data Analysis**: Performing complex data analysis by chaining models that handle data extraction, transformation, and interpretation.

In conclusion, Langchain is a powerful tool for managing and executing chains of LLMs, and its debug logging makes those chains easier to inspect and optimize. Its potential applications in the industry are broad, which makes it a useful building block for AI-driven projects.