# Using Retrieval-Augmented Generation to Search a Movie Database

Retrieval-augmented generation, or _RAG_, is a technique for giving a large language model additional context at query time, without fine-tuning or retraining it. It improves the model's ability to give factual responses, which is a known limitation of language models used on their own.

The goal of this project is to build a question-answering bot for movie-related questions. To achieve this, we will use RAG to provide factual information to the language model. We will upload movie descriptions to a vector database and use it to search for relevant context for the language model.

We will be using the following tools and models:
- [OpenAI](https://openai.com)'s `gpt-3.5-turbo` model for prompt completions
- OpenAI's `text-embedding-ada-002` model to create vector embeddings
- [Pinecone](https://www.pinecone.io/) as the vector database to store the embeddings
- [langchain](https://www.langchain.com/) as the tool to interact with OpenAI and Pinecone

The dataset used for this project is sourced from the Kaggle dataset [IMDb Movies/Shows with Descriptions](https://www.kaggle.com/datasets/ishikajohari/imdb-data-with-descriptions).

### Maintenance note, May 2024

Since this code-along was released, the Python packages for working with the Pinecone and OpenAI APIs have changed their syntax. The instructions, hints, and code have been updated to use the latest syntax, but the video has not been updated. Consequently, it is now slightly out of sync. Trust the workbook, not the video.

## Before you begin

To get started with this project, you'll need a developer account for OpenAI and Pinecone. Follow the steps in the [getting-started.ipynb](https://app.datacamp.com/workspace/w/f1d996aa-0aaa-47e3-bd61-2b5b5a0fa558/edit/getting-started.ipynb) notebook to create an API key and store it in Workspace.

For this project, we will assume that you have already set the `OPENAI_API_KEY` and `PINECONE_API_KEY` environment variables.
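
A quick check before running the rest of the notebook can save you from a confusing authentication error later on. The sketch below is one possible way to do that; the helper name `missing_env_vars` is made up for this example and is not part of any library.

```python
import os

# The two variables this project assumes are set
REQUIRED_VARS = ["OPENAI_API_KEY", "PINECONE_API_KEY"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required API keys are set.")
```

If you stored your keys under different names, adjust `REQUIRED_VARS` accordingly.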

## Task 0: Setup

To perform this analysis, we need to install the following packages:

- `openai`: for interacting with OpenAI.
- `pinecone-client`: for interacting with Pinecone.
- `langchain`: a framework for developing with generative AI.
- `langchain-openai` and `langchain-pinecone`: Langchain extension modules with functionality for OpenAI and Pinecone.
- `tiktoken`: a string encoder that generates tokens used by OpenAI. It is useful for estimating the number of tokens used.

### Instructions

Run the cell below to install the corresponding packages.

In [20]:
# Install the openai package, locked to version 1.27
!pip install openai==1.27

# Install the pinecone-client package, locked to version 4.0.0
!pip install pinecone-client==4.0.0

# Install the langchain package, locked to version 0.1.19
!pip install langchain==0.1.19

# Install the langchain-openai package, locked to version 0.1.6
!pip install langchain-openai==0.1.6

# Install the langchain-pinecone package, locked to version 0.1.0
!pip install langchain-pinecone==0.1.0

# Install the tiktoken package, locked to version 0.7.0
!pip install tiktoken==0.7.0

# Install the typing_extensions package, locked to version 4.11.0
!pip install typing_extensions==4.11.0


## Task 1: Import the Movies Data

We'll start with importing the dataset we mentioned at the top of this project. You have the dataset available as a CSV in your workspace: `"IMDB.csv"`. We need to import the dataset and transform it into a convenient format.

### Instructions

- Import the `pandas` package as `pd`
- Import `"IMDB.csv"` into a variable `movies_raw`.
- Print the head of `movies_raw`.

In [21]:
# Import pandas as pd
import pandas as pd

# Import IMDB.csv. Assign to movies_raw.
movies_raw = pd.read_csv("IMDB.csv")

# Print the head of movies_raw
movies_raw.head()

Unnamed: 0.1,Unnamed: 0,index,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,averageRating,numVotes,ordering,title,region,language,types,attributes,isOriginalTitle,Description
0,0,0,tt0102926,movie,The Silence of the Lambs,The Silence of the Lambs,0,1991,\N,118,"Crime,Drama,Thriller",8.6,1473918,50,The Silence of the Lambs,US,en,\N,\N,0,"Jodie Foster stars as Clarice Starling, a top ..."
1,1,1,tt0103064,movie,Terminator 2: Judgment Day,Terminator 2: Judgment Day,0,1991,\N,137,"Action,Sci-Fi",8.6,1128166,17,Terminator 2: Judgment Day,US,en,dvd,\N,0,"In this sequel set eleven years after ""The Ter..."
2,2,3,tt0110357,movie,The Lion King,The Lion King,0,1994,\N,88,"Adventure,Animation,Drama",8.5,1090882,18,The Lion King 3D,US,en,\N,3-D version,0,This Disney animated feature follows the adven...
3,3,4,tt0110912,movie,Pulp Fiction,Pulp Fiction,0,1994,\N,154,"Crime,Drama",8.9,2118762,22,Pulp Fiction,US,en,\N,\N,0,Vincent Vega (John Travolta) and Jules Winnfie...
4,4,5,tt0111161,movie,The Shawshank Redemption,The Shawshank Redemption,0,1994,\N,142,Drama,9.3,2759621,2,The Shawshank Redemption,US,en,\N,\N,0,Andy Dufresne (Tim Robbins) is sentenced to tw...


### Instructions

Transform `movies_raw` and assign the result to `movies`.
  
- Rename `primaryTitle` to `movie_title` and `Description` to `movie_description`
- Create a column `source` that contains the identifier of the movie, prefixed by `"https://www.imdb.com/title/"`. The end result should be a working link to the movie. The identifier can be found in the `"tconst"` column in `"IMDB.csv"`. For example, `"https://www.imdb.com/title/tt0102926"`.
- Filter out all rows that do not have `"movie"` as a `titleType`
- Select the `movie_title`, `movie_description`, `source` and `genres` columns
- Show the head of `movies`.

In [22]:
# Rename primaryTitle, Description columns. Assign to movies.
movies = movies_raw.rename(columns = {
    "primaryTitle": "movie_title",
    "Description": "movie_description",
})

# Add source column from tconst
movies["source"] = "https://www.imdb.com/title/" + movies["tconst"]

# Subset for titleType equal to "movie"
movies = movies.loc[movies["titleType"] == "movie"]

# Select movie_title, movie_description, source, genres columns
# (copy so that adding columns later does not trigger a SettingWithCopyWarning)
movies = movies[["movie_title", "movie_description", "source", "genres"]].copy()

# Show the head of movies


In [23]:
movies.head()

Unnamed: 0,movie_title,movie_description,source,genres
0,The Silence of the Lambs,"Jodie Foster stars as Clarice Starling, a top ...",https://www.imdb.com/title/tt0102926,"Crime,Drama,Thriller"
1,Terminator 2: Judgment Day,"In this sequel set eleven years after ""The Ter...",https://www.imdb.com/title/tt0103064,"Action,Sci-Fi"
2,The Lion King,This Disney animated feature follows the adven...,https://www.imdb.com/title/tt0110357,"Adventure,Animation,Drama"
3,Pulp Fiction,Vincent Vega (John Travolta) and Jules Winnfie...,https://www.imdb.com/title/tt0110912,"Crime,Drama"
4,The Shawshank Redemption,Andy Dufresne (Tim Robbins) is sentenced to tw...,https://www.imdb.com/title/tt0111161,Drama


## Task 2: Create Documents from the Data

Later in this project, we will be creating vector embeddings for all of the rows in the `movies` DataFrame. Before we do so, we need to create [Document](https://docs.langchain.com/docs/components/schema/document) objects from the data in the DataFrame. To accomplish this, we can utilize the `DataFrameLoader` class provided by langchain, which allows us to create documents from a pandas DataFrame.

For the main content of the documents, we will create a summary string that includes relevant information about each movie. To achieve this, we will combine the movie title, description, and genre into a `page_content` column. Additionally, we will retain the IMDB link in the `source` column as metadata.

### Instructions

- Import `DataFrameLoader` from `langchain.document_loaders`
- Create a column `page_content` that creates strings that contain information about the movie title, genre and description. For example, the first movie should look like this:
```
Title: The Silence of the Lambs
Genre: Crime,Drama,Thriller
Description: Jodie Foster stars as Clarice Starling, a top student at the FBI's training academy. Jack Crawford (Scott Glenn) wants Clarice to interview Dr. Hannibal Lecter (Anthony Hopkins), a brilliant psychiatrist who is also a violent psychopath, serving life behind bars for various acts of murder and cannibalism. Crawford believes that Lecter may have insight into a case and that Starling, as an attractive young woman, may be just the bait to draw him out.
```
- Only keep the columns `page_content` and `source` in the movies DataFrame
- Use `DataFrameLoader` to load documents from the `movies` DataFrame into `docs`. Use `"page_content"` as the `page_content_column`.
- Print the first 3 documents and the total number of documents

In [24]:
# Import DataFrameLoader
from langchain.document_loaders import DataFrameLoader

# Create page content column
movies["page_content"] = "title: "+ movies["movie_title"] + \
                         "\n Genre: " + movies["genres"]  + \
                         "\n Description: " + movies["movie_description"]

# Select page_content and source columns
movies = movies[["page_content","source"]]


In [25]:

# Load the documents from the dataframe into docs
# The page content column is 'page_content'
docs = DataFrameLoader(
    movies,
    page_content_column="page_content",
).load()

# Print the first 3 documents and the number of documents
print(f"First 3 documents: {docs[:3]}")
print(f"Number of documents: {len(docs)}")

First 3 documents: [Document(page_content="title: The Silence of the Lambs\n Genre: Crime,Drama,Thriller\n Description: Jodie Foster stars as Clarice Starling, a top student at the FBI's training academy. Jack Crawford (Scott Glenn) wants Clarice to interview Dr. Hannibal Lecter (Anthony Hopkins), a brilliant psychiatrist who is also a violent psychopath, serving life behind bars for various acts of murder and cannibalism. Crawford believes that Lecter may have insight into a case and that Starling, as an attractive young woman, may be just the bait to draw him out.", metadata={'source': 'https://www.imdb.com/title/tt0102926'}), Document(page_content='title: Terminator 2: Judgment Day\n Genre: Action,Sci-Fi\n Description: In this sequel set eleven years after "The Terminator," young John Connor (Edward Furlong), the key to civilization\'s victory over a future robot uprising, is the target of the shape-shifting T-1000 (Robert Patrick), a Terminator sent from the future to kill him. Ano

## Task 3: Estimate the Cost of Embedding

We're going to be using OpenAI to calculate [vector embeddings](https://platform.openai.com/docs/guides/embeddings/embeddings) of the document texts. Creating embeddings is a form of dimensionality reduction, where we assign the text to a point in an N-dimensional space. Texts that are semantically close to each other should end up being close to each other in the N-dimensional space.
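
To make "close to each other" more concrete, here is a minimal sketch of cosine similarity, one common way to measure how close two embedding vectors are. The three vectors are made up for illustration and are not real OpenAI embeddings, which have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings", for illustration only
viking_movie = [0.9, 0.1, 0.0]
norse_myth_movie = [0.8, 0.3, 0.1]
romantic_comedy = [0.0, 0.2, 0.9]

# The first pair points in a similar direction, the second pair does not
print(cosine_similarity(viking_movie, norse_myth_movie))
print(cosine_similarity(viking_movie, romantic_comedy))
```

A value near 1 means the vectors point in almost the same direction, while a value near 0 means they are unrelated.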

Luckily, OpenAI has several models that are trained to calculate these kinds of embeddings, so we don't have to do that ourselves. Of course, a cost is associated with this. You can derive the cost from the [pricing page of OpenAI](https://openai.com/pricing).

The calculation is based on the number of _tokens_ in the text. All text is encoded into tokens before OpenAI's models process it. As a rule of thumb, a token corresponds to roughly 4 characters of English text. However, we can calculate the exact number of tokens for a string by using the `tiktoken` package.

The goal of this task is to calculate the number of tokens in the documents, to then extrapolate the estimated cost.

### Instructions

- Import `tiktoken`
- Create the encoder, use the `"cl100k_base"` encoder. This is the encoder used by OpenAI to calculate the embeddings for text using the `text-embedding-ada-002` model.
- Create a list that contains the number of tokens for each document
- Calculate the estimated cost: the sum of all tokens, divided by 1000 tokens, multiplied by $0.0001

In [26]:
# Import tiktoken
import tiktoken

# Create the encoder
encoder = tiktoken.get_encoding("cl100k_base")

# Create a list containing the number of tokens for each document
tokens_per_doc = [len(encoder.encode(doc.page_content)) for doc in docs]

# Show the estimated cost, which is the total number of tokens divided by 1000, times $0.0001
total_tokens = sum(tokens_per_doc)
cost_per_1000_tokens = 0.0001
cost = (total_tokens / 1000) * cost_per_1000_tokens
cost

0.0374556
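
If you want to repeat this estimate for other datasets or prices, the arithmetic can be wrapped in a small function. This is a sketch only: the function name `estimate_embedding_cost` and the example token counts are hypothetical, and the default price is simply the one used in this notebook, so check the pricing page for the current value.

```python
def estimate_embedding_cost(token_counts, price_per_1k_tokens=0.0001):
    """Estimate the embedding cost in dollars from per-document token counts."""
    return sum(token_counts) / 1000 * price_per_1k_tokens

# Hypothetical token counts for three short documents
example_counts = [120, 95, 210]
print(f"Estimated cost: ${estimate_embedding_cost(example_counts):.8f}")
```

Working backwards, the estimate of 0.0374556 shown above corresponds to 374,556 tokens at $0.0001 per 1000 tokens.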

## Task 4: Create the Index on Pinecone

Looks like calculating the embeddings is not going to be too expensive. It's always smart to get a rough estimate of the number of tokens used, so you get an idea of the cost of calculating the embeddings using OpenAI.

Now we're ready to create the index on Pinecone. An [index in Pinecone](https://docs.pinecone.io/docs/indexes) can be used to store vectors. You can compare an index in Pinecone to a table in SQL: it stores information about one type of object.

In a later task, we'll be creating vectors from the documents we just created using OpenAI's second-generation embedding model. It's important to already know the embeddings we're going to use since we need to know the output dimensions to create an index. For `text-embedding-ada-002`, this is `1536` dimensions ([source](https://platform.openai.com/docs/guides/embeddings/second-generation-models)).

At the end of this task, you should be able to find your new index, `imdb-movies`, in the [Pinecone UI](https://app.pinecone.io/).

![Pinecone UI](pinecone_ui.png)

### Instructions

Initialize Pinecone, getting setup details from DataLab environment variables.

- Import the `os` package.
- Import the `pinecone` package.
- Set the pinecone api key from the environment variable. Assign to `api_key`.
- Initialize Pinecone using the API key. Assign to `pc`.

<details>
<summary>Code hints</summary>
<p>
    
The Pinecone environment variable is usually called `PINECONE_API_KEY`, but check what you called it!
    
---
    
To initialize Pinecone, call `pinecone.Pinecone()`, setting `api_key` to the API key.

</p>
</details>

In [27]:
# Import os and pinecone
import os
import pinecone

# Set the pinecone api key from the environment variable. Assign to api_key.
api_key = os.environ["PINECONE_API_KEY"]

# Initialize Pinecone using the API key. Assign to pc.
pc = pinecone.Pinecone(api_key=api_key)


### Instructions

- List the names of available indexes. Assign to `existing_index_names`.
- Use `.create_index` to create an index with the name `"imdb-movies"`, but only if it does not exist yet. The metric we'll use is the `"cosine"` distance, and as we mentioned above, the embeddings will have `1536` dimensions. Use a Serverless specification setting `cloud` to `aws` and `region` to `us-east-1`.

<details>
<summary>Code hints</summary>
<p>
    
Get the list of available indexes with `pc.list_indexes()`. The code pattern to get all available index names is as follows.
    
```py
[idx.name for idx in pc.list_indexes().indexes]
```
    
---
    
Create an index with `pc.create_index()`, passing the index name, and setting the dimension, metric, and spec. Currently, only AWS is supported and not all regions are available. Try `us-east-1` as your first option, and `us-west-2` as a backup plan. The code pattern to create an index is as follows.
    
```py
pc.create_index(
        index_name,
        dimension=n_dims,
        metric="cosine|dotproduct|euclidean",
        spec=pinecone.ServerlessSpec(
            cloud="aws",
            region="us-east-1"
        )
    )
```
    
</p>
</details>

In [28]:
# Use this index name
index_name = "imdb-movies"

# List the names of available indexes. Assign to existing_index_names.
existing_index_names = [idx.name for idx in pc.list_indexes().indexes]

# First check that the given index does not exist yet
if index_name not in existing_index_names:
    # Create the 'imdb-movies' index with cosine metric, 1536 dims, serverless spec: aws in us-east-1
    pc.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,
        spec=pinecone.ServerlessSpec(cloud="aws", region="us-east-1")
    )

## Task 5: Fill the Index with the Documents

Now that we have the vector index at our disposal, it's time to populate it with some vectors. In this task, we'll need to:

1. Generate vector embeddings for all documents in `docs`. We'll utilize OpenAI for this purpose. langchain provides a convenient helper for this task, `langchain_openai.OpenAIEmbeddings`, which you can use to generate embeddings using the `text-embedding-ada-002` model.
2. Populate the vector index in Pinecone with these embeddings. Fortunately, langchain also offers assistance with this through the [`langchain_pinecone.PineconeVectorStore`](https://python.langchain.com/docs/integrations/vectorstores/pinecone) helper.

These two steps can be combined using the `.from_documents` helper method of the `PineconeVectorStore` class. This method takes the documents and an embedding model, calculates the embeddings, and uploads them to Pinecone. We will also add some control flow so that we do not add data to the Pinecone index if it already contains data. In that case, we connect to the existing index with the `.from_existing_index` method of `PineconeVectorStore` instead.

In addition to storing vectors, Pinecone allows the storage of additional metadata. When using the langchain helpers, it automatically assumes that vectors should be created from the `page_content` property of each `Document`. All other properties will be included as metadata.
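
The split between embedded text and metadata can be pictured with a small sketch. This is a conceptual illustration in plain Python, not langchain's actual implementation; the function name `split_record` is made up for this example.

```python
def split_record(record, page_content_column="page_content"):
    """Split a row-like dict into (text to embed, metadata dict)."""
    text = record[page_content_column]
    # Everything except the page content column is kept as metadata
    metadata = {key: value for key, value in record.items() if key != page_content_column}
    return text, metadata

row = {
    "page_content": "Title: The Northman\nGenre: Action,Adventure,Drama",
    "source": "https://www.imdb.com/title/tt11138512",
}
text, metadata = split_record(row)
print(text)      # this part would be turned into a vector
print(metadata)  # this part would be stored alongside the vector
```

In our case the only metadata field is `source`, which is what later lets the bot cite an IMDb link with its answer.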

You can verify that everything has worked correctly by accessing the `imdb-movies` index in the Pinecone UI.

![Pinecone UI showing the imdb-movies index](pinecone_ui_index.png)

### Instructions

- From the `langchain_openai` package, import `OpenAIEmbeddings`.
- From the `langchain_pinecone` package, import `PineconeVectorStore`.
- Create the embeddings object.
- Create an index from its name.

In [29]:
# From the langchain_openai package, import OpenAIEmbeddings
from langchain_openai import OpenAIEmbeddings

# From the langchain_pinecone package, import PineconeVectorStore
from langchain_pinecone import PineconeVectorStore

# Create the embeddings object
embeddings = OpenAIEmbeddings()

# Create an index from its name
index = pc.Index(index_name)

### Instructions

Fill the index. _Some control flow is provided._

- Count the number of vectors in the index. Assign to `n_vectors`.
- Check if there is already some data in the index. (If `n_vectors` is greater than zero.)
- If there is data in the index, get the documents to search from the index. Assign to `docsearch`.
- If there is no data, fill the index from the documents and return those docs to assign to `docsearch`.
- Use the predefined question to ask about movies.
- Convert the vector database to a retriever and get the relevant documents for a question.

<details>
<summary>Code hints</summary>
<p>
    
Get statistics about the index with `index.describe_index_stats()`. The `total_vector_count` element contains the number of vectors in the index.
    
---

If the index already contains data, you can connect to it using `PineconeVectorStore.from_existing_index()`, passing the index name and the embeddings object (in this case an `OpenAIEmbeddings()` instance).

---
  
If there is no data in the index, you can populate it with `PineconeVectorStore.from_documents()`, passing the documents, the embeddings object, and setting `index_name` to the index name.
    
---

Convert the documents to a retriever using the `.as_retriever()` method, passing no arguments.

---
    
Get the relevant documents from the retriever using `.invoke()`, passing the question.
    
</p>
</details>

In [30]:
index.describe_index_stats().total_vector_count

4963

In [31]:
# Count the number of vectors in the index
n_vectors = index.describe_index_stats().total_vector_count
print(f"There are {n_vectors} vectors in the index already.")

# Check if there is already some data in the index on Pinecone
if n_vectors > 0:
    # If there is, get the documents to search from the index. Assign to docsearch.
    docsearch = PineconeVectorStore.from_existing_index(index_name, embeddings)
else:
    # If not, fill the index from the documents and return those docs to assign to docsearch
    docsearch = PineconeVectorStore.from_documents(docs, embeddings, index_name=index_name)

# Define a question about movies to ask
question = "What's a good movie about an epic viking?"
    
# Convert the vector database to a retriever and get the relevant documents for a question
print("These are the documents most relevant to the question:")
docsearch.as_retriever().invoke(question)

There are 4963 vectors in the index already.
These are the documents most relevant to the question:


[Document(page_content='title: The Northman\n Genre: Action,Adventure,Drama\n Description: The Northman is an epic revenge thriller, that explores how far a Viking prince will go to seek justice for his murdered father.', metadata={'source': 'https://www.imdb.com/title/tt11138512'}),
 Document(page_content="title: Thor\n Genre: Action,Fantasy\n Description: As the son of Odin (Anthony Hopkins), king of the Norse gods, Thor (Chris Hemsworth) will soon inherit the throne of Asgard from his aging father. However, on the day that he is to be crowned, Thor reacts with brutality when the gods' enemies, the Frost Giants, enter the palace in violation of their treaty. As punishment, Odin banishes Thor to Earth. While Loki (Tom Hiddleston), Thor's brother, plots mischief in Asgard, Thor, now stripped of his powers, faces his greatest threat.", metadata={'source': 'https://www.imdb.com/title/tt0800369'}),
 Document(page_content="title: Thor: Ragnarok\n Genre: Action,Adventure,Comedy\n Descriptio

## Task 6: Create Prompts for RAG

In the previous task, we saw that the vector store can retrieve documents relevant to a specific query. For instance, when asked "What's a good movie about an epic viking?", the retriever returned 'The Northman' as the top result. Note that we did not incorporate any measure of movie quality into the system, so the notion of a movie being "good" is not explicitly encoded in the embeddings. Always consider the data provided to the system and interpret the results of the AI system within that context. One way to improve the results would be to include information about movie quality in the movie description.
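
As a rough sketch of that last idea, the text that gets embedded could carry a rating line. The raw dataset has an `averageRating` column that could supply the value. The function name `build_page_content` and the "out of 10" wording are assumptions made for this example, and the index would need to be refilled before the change has any effect.

```python
def build_page_content(title, genres, description, rating=None):
    """Build the text that gets embedded for one movie, optionally with a rating line."""
    lines = [f"Title: {title}", f"Genre: {genres}"]
    if rating is not None:
        # Assumption: ratings are on a 10-point scale
        lines.append(f"Average rating: {rating}/10")
    lines.append(f"Description: {description}")
    return "\n".join(lines)

print(build_page_content(
    "The Shawshank Redemption",
    "Drama",
    "Andy Dufresne (Tim Robbins) is sentenced to tw...",
    rating=9.3,
))
```

Whether a numeric rating actually shifts the embedding towards "good" is something you would have to test; passing the rating to the LLM as context is likely the more dependable benefit.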

The remarkable aspect of RAG is the ability to provide relevant context to the LLM within the prompt itself. In the aforementioned example, we would include a description of 'The Northman' in the prompt, enabling the LLM to generate factual information beyond its knowledge cutoff. 'The Northman' was released in 2022, while the knowledge cutoff for the GPT-3.5 Turbo model is set at September 2021.

Now that you understand how the retriever can be employed to retrieve relevant documents from the vector database, we need to devise a prompt that presents this information to the LLM when we pose a question.

We require two types of [prompt templates](https://python.langchain.com/docs/modules/model_io/prompts/prompt_templates/):
- A template that demonstrates how the information in relevant documents is presented to the LLM
- A template that combines the context with the rest of the prompt

Some example prompt templates are provided in the sample code below, which you are free to edit. Notice that these example templates contain `=========` separators between different parts of the text. These kinds of delimiters are a common tactic to help the LLM distinguish between different parts of your input prompt.
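
To see how the two templates fit together, here is a minimal sketch that uses plain `str.format` instead of langchain. The templates are shortened stand-ins for the ones used in this task, and the blank line used to join the documents is an arbitrary choice for this example.

```python
# Shortened stand-ins for the two templates
DOCUMENT_TEMPLATE = "{page_content}\nIMDB link: {source}\n========="
QUESTION_TEMPLATE = "QUESTION: {question}\n=========\n{summaries}\nFINAL ANSWER:"

def render_prompt(question, documents):
    """Format each document, join the results, and fill the question template."""
    summaries = "\n\n".join(DOCUMENT_TEMPLATE.format(**doc) for doc in documents)
    return QUESTION_TEMPLATE.format(question=question, summaries=summaries)

documents = [
    {
        "page_content": "Title: The Northman\nGenre: Action,Adventure,Drama",
        "source": "https://www.imdb.com/title/tt11138512",
    },
]
print(render_prompt("What's a good movie about an epic viking?", documents))
```

The document template is applied once per retrieved document, and the question template is applied once per question.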

### Instructions

Create the question and document prompts.

- Import `PromptTemplate` from `langchain.prompts`
- Some example prompt templates are already provided for you. Feel free to adapt them. There are two prompt templates:
  - `DOCUMENT_PROMPT`: this template shows how a summary text is created for each document. The properties between the curly brackets (`{`) are replaced with the properties of each `Document`.
  - `QUESTION_PROMPT`: this template creates the full prompt that is sent to the LLM. `question` is replaced by the question asked by the user, and `summaries` is replaced with the summary of each relevant document, created by the `DOCUMENT_PROMPT` template
- Create the `PromptTemplate` objects by using `PromptTemplate.from_template`. Call them `document_prompt` and `question_prompt`, respectively.

In [32]:
# Import PromptTemplate
from langchain.prompts import PromptTemplate 

# Read/adapt the prompts below at will
DOCUMENT_PROMPT = """{page_content}
IMDB link: {source}
========="""

QUESTION_PROMPT = """Given the following extracted parts of a movie database and a question, create a final answer with the IMDB link as source ("SOURCE").
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCE" part in your answer.

QUESTION: What's a good movie about a robot to watch with my kid?
=========
Title: A.I. Artificial Intelligence
Genre: Drama,Sci-Fi
Description: A robotic boy, the first programmed to love, David (Haley Joel Osment) is adopted as a test case by a Cybertronics employee (Sam Robards) and his wife (Frances O'Connor). Though he gradually becomes their child, a series of unexpected circumstances make this life impossible for David. Without final acceptance by humans or machines, David embarks on a journey to discover where he truly belongs, uncovering a world in which the line between robot and machine is both vast and profoundly thin.
IMDB link: https://www.imdb.com/title/tt0212720
=========
Title: I, Robot
Genre: Action,Mystery,Sci-Fi
Description: In 2035, highly intelligent robots fill public service positions throughout the world, operating under three rules to keep humans safe. Despite his dark history with robotics, Detective Del Spooner (Will Smith) investigates the alleged suicide of U.S. Robotics founder Alfred Lanning (James Cromwell) and believes that a human-like robot (Alan Tudyk) murdered him. With the help of a robot expert (Bridget Moynahan), Spooner discovers a conspiracy that may enslave the human race.
IMDB link: https://www.imdb.com/title/tt0343818
=========
Title: The Iron Giant
Genre: Action,Adventure,Animation
Description: In this animated adaptation of Ted Hughes' Cold War fable, a giant alien robot (Vin Diesel) crash-lands near the small town of Rockwell, Maine, in 1957. Exploring the area, a local 9-year-old boy, Hogarth, discovers the robot, and soon forms an unlikely friendship with him. When a paranoid government agent, Kent Mansley, becomes determined to destroy the robot, Hogarth and beatnik Dean McCoppin (Harry Connick Jr.) must do what they can to save the misunderstood machine.
IMDB link: https://www.imdb.com/title/tt0129167
=========
FINAL ANSWER: 'The Iron Giant' is an animated movie about a friendship between a robot and a kid. It would be a good movie to watch with a kid.
SOURCE: https://www.imdb.com/title/tt0129167

QUESTION: {question}
=========
{summaries}
FINAL ANSWER:"""

# Create prompt template objects
document_prompt = PromptTemplate.from_template(DOCUMENT_PROMPT)
question_prompt = PromptTemplate.from_template(QUESTION_PROMPT)


## Task 7: Chain Everything Together to Perform RAG

Finally, the vector index is filled with information and the prompt templates are set up. That means we have everything we need to build a question-answering bot that uses the information retrieved from Pinecone to answer questions about movies.

We'll use the GPT-3.5 Turbo model of OpenAI in order to provide a completion for the question prompt above.

Langchain provides a convenient concept, called [chains](https://python.langchain.com/docs/modules/chains/), that does some of the heavy lifting when you need to combine multiple AI systems into a single application. For the purpose of this project, we'll be using the `RetrievalQAWithSourcesChain` class. This chain will accept a `question` and a `retriever`. When asked a question, it will first use the retriever to retrieve relevant documents. Afterwards, it will combine the documents into a prompt and send it to the LLM to provide a completion.
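
The control flow described above can be sketched in a few lines of plain Python. This is a simplified illustration, not how `RetrievalQAWithSourcesChain` is implemented: `fake_retrieve` and `fake_complete` are stand-ins for the Pinecone retriever and the OpenAI model, and the default `k=4` is an arbitrary choice for this example.

```python
def answer_with_context(question, retrieve, complete, k=4):
    """Retrieve documents, put them all into one prompt, and ask the model."""
    documents = retrieve(question)[:k]
    context = "\n=========\n".join(documents)
    prompt = f"QUESTION: {question}\n=========\n{context}\n=========\nFINAL ANSWER:"
    return complete(prompt)

# Stand-in for the Pinecone retriever: always returns the same document
def fake_retrieve(question):
    return ["Title: The Northman\nIMDB link: https://www.imdb.com/title/tt11138512"]

# Stand-in for the LLM: reports the prompt size instead of answering
def fake_complete(prompt):
    return f"(model reply to a prompt of {len(prompt)} characters)"

print(answer_with_context("What's a good movie about an epic viking?", fake_retrieve, fake_complete))
```

Passing the retriever and the model in as functions keeps the two steps visible: the model only ever sees what the retriever returned.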

#### Note about GPT model versions

The release of GPT 3.5 turbo from Jan 2024, `gpt-3.5-turbo-0125`, cannot consistently answer this question. The older release, `gpt-3.5-turbo-1106` performs better, or for best results, use the (more expensive) GPT-4o, `gpt-4o`.

### Instructions

- From the `langchain.chains` module, import `RetrievalQAWithSourcesChain`.
- From the `langchain_openai` package, import `ChatOpenAI`.
- Create an OpenAI client LLM model. Make it a `ChatOpenAI()` object with `model_name` `"gpt-3.5-turbo-1106"` and `temperature` set to `0` (minimal randomness). Assign to `llm`.

In [33]:
# From the langchain.chains module, import RetrievalQAWithSourcesChain
from langchain.chains import RetrievalQAWithSourcesChain

# From the langchain_openai package, import ChatOpenAI
from langchain_openai import ChatOpenAI

# Create an OpenAI client LLM model. Assign to llm.
llm = ChatOpenAI(model_name="gpt-3.5-turbo-1106", temperature=0)

### Instructions

Create the QA bot LLM chain.

- Create a `RetrievalQAWithSourcesChain` from the chain type to answer questions. 
  - Set `chain_type` to `"stuff"`. This is the simplest type of chain: it stuffs all of the document context into one prompt.
  - Set `llm` to the instance of `ChatOpenAI` you recently created.
  - Use the `PromptTemplate` objects you created above to pass to `chain_type_kwargs`
  - As a retriever, use the `docsearch.as_retriever` method you've seen before.
- Invoke `qa_with_sources` to ask the LLM the question about movies.

<details>
<summary>Code hints</summary>
<p>
    
To create a `RetrievalQAWithSourcesChain` bot chain from the chain type, call the `.from_chain_type()` method of `RetrievalQAWithSourcesChain`. The code pattern is as follows.
    
    
```py
qa_bot = RetrievalQAWithSourcesChain.from_chain_type(
    chain_type="type of chain",
    llm=llm,
    chain_type_kwargs={
        "document_prompt": document_prompt,
        "prompt": question_prompt,
    },
    retriever=vector_of_documents.as_retriever(),
)
```
    
</p>
</details>

In [40]:
# Create the QA bot LLM chain
qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    chain_type="stuff",
    llm=llm,
    chain_type_kwargs={
        "document_prompt": document_prompt,
        "prompt": question_prompt,
    },
    retriever=docsearch.as_retriever(),
)

# Invoke qa_with_sources to ask the LLM the question about movies
qa_with_sources.invoke("good movie on epic viking")

{'question': 'good movie on epic viking',
 'answer': 'The Northman is an epic revenge thriller that explores how far a Viking prince will go to seek justice for his murdered father. It would be a good movie to watch about a Viking.\n',
 'sources': 'https://www.imdb.com/title/tt11138512'}

## Task 8: Add Debug Logging

Let's take a moment to address what we've achieved by using RAG, which would be impossible to achieve with just using GPT-3.5 Turbo as an LLM:

1. We enabled the LLM to answer the question with factual information, which can even be information from after GPT-3.5 Turbo's knowledge cutoff (September 2021).
2. We enabled the LLM to provide sources with the answer it generates.

Pretty neat, right?

We saw that langchain is very convenient when it comes to quickly creating smart AI systems. However, for learning, it can be quite challenging to understand what's happening behind the scenes. For example, from the code in Task 7, it's not clear that `qa_with_sources` actually first calls Pinecone to retrieve documents, then uses those documents to fill in the prompt to send along to the `gpt-3.5-turbo` LLM.

Let's look at how we can get some more insights into how this all works.

### Instructions

- Import `langchain`
- Set `.debug` to `True` on `langchain`
- Invoke `qa_with_sources` with `question` again

Observe the information that is printed in the output. Langchain enables you to run chains of LLMs or other AI systems, one after the other. The input for the next chain is passed on from the previous, where new information can be added by, for example, using embeddings to find relevant documents. Each chain or LLM is marked with a tag like `[chain/start]` or `[llm/start]`. When a final response is fetched from the last part of the chain, the output travels back up the chain. This is marked with the `[chain/end]` and `[llm/end]` marks.

In [41]:
# Import langchain
import langchain

# Enable debug logging
langchain.debug = True

# Invoke qa_with_sources to ask the LLM the same question about movies
qa_with_sources.invoke(question)

[chain/start] [chain:RetrievalQAWithSourcesChain] Entering Chain run with input:
{
  "question": "What's a good movie about an epic viking?"
}
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain] Entering Chain run with input:
[inputs]
[chain/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain] Entering Chain run with input:
{
  "question": "What's a good movie about an epic viking?",
}
[llm/start] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] Entering LLM run with input:
{
  "prompts": [
  ]
}
[llm/end] [chain:RetrievalQAWithSourcesChain > chain:StuffDocumentsChain > chain:LLMChain > llm:ChatOpenAI] [546ms] Exiting LLM run with output:
{
  "generations": [
    [
      {
        "text": "I don't know.",
        "generation_info": {
          "finish_rea

{'question': "What's a good movie about an epic viking?",
 'answer': "I don't know.",
 'sources': ''}