# Building a RAG application from scratch

Here is a high-level overview of the system we want to build:

<img src='images/system1.png' width="1200">

Let's start by loading the environment variables we need to use.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

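# load_dotenv() has already copied OPENAI_API_KEY into os.environ if it is defined in .env.
# Fail early with a clear message instead of a TypeError when the key is missing.
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set. Add it to your .env file.")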

# This is the YouTube video we're going to use.
YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=lXUZvyajciY"

In [2]:
from openai import OpenAI
client = OpenAI()

## Setting up the model
Let's define the LLM that we'll use as part of the workflow.

In [3]:
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(model="gpt-3.5-turbo")

We can test the model by asking a simple question.

In [4]:
model.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

AIMessage(content='The Los Angeles Dodgers won the World Series during the COVID-19 pandemic. They defeated the Tampa Bay Rays in six games in the 2020 World Series.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 32, 'prompt_tokens': 21, 'total_tokens': 53, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-80063e32-e9f5-4eee-b6f5-a6f9802a6f56-0', usage_metadata={'input_tokens': 21, 'output_tokens': 32, 'total_tokens': 53})

In [5]:
model.invoke("What was the Covid Pandemic?")

AIMessage(content='The Covid-19 pandemic, commonly referred to as the Covid Pandemic, was a global outbreak of the novel coronavirus SARS-CoV-2 that began in late 2019. The virus causes the respiratory illness known as Covid-19. The pandemic led to widespread illness, death, and economic disruption around the world, resulting in numerous restrictions, lockdowns, and public health measures to control the spread of the virus. The World Health Organization declared the Covid-19 outbreak a global pandemic on March 11, 2020.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 107, 'prompt_tokens': 14, 'total_tokens': 121, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-3.5-turbo-0125', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-5f191f0a-1

The result from the model is an `AIMessage` instance containing the answer. We can extract this answer by chaining the model with an [output parser](https://python.langchain.com/docs/how_to/#output-parsers).

Here is what chaining the model with an output parser looks like:

<img src='images/chain1.png' width="1200">

For this example, we'll use a simple `StrOutputParser` to extract the answer as a string.

In [6]:
from langchain_core.output_parsers import StrOutputParser

parser = StrOutputParser()

chain = model | parser
chain.invoke("What MLB team won the World Series during the COVID-19 pandemic?")

'The Los Angeles Dodgers won the World Series during the COVID-19 pandemic in 2020. They defeated the Tampa Bay Rays in six games to win their first championship since 1988.'
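To build intuition for what the `|` operator does, here is a minimal sketch in plain Python. The `Step` class, `fake_model`, and `fake_parser` are hypothetical stand-ins invented for this illustration; they are not LangChain's implementation, and no API call is made.

```python
# Toy illustration only: these names are hypothetical, not LangChain classes.
class Step:
    """Wraps a function so that steps can be composed with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs `a`, then feeds its output to `b`.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Stand-in for a chat model: returns a message-like dict instead of calling an API.
fake_model = Step(lambda question: {"content": f"Echo: {question}"})

# Stand-in for a string output parser: extracts the text from the message.
fake_parser = Step(lambda message: message["content"])

toy_chain = fake_model | fake_parser
result = toy_chain.invoke("Hello")
print(result)
```

The point of the sketch is only that each step receives the previous step's output, which is why the parser can turn the model's message into a plain string.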

## Introducing prompt templates

We want to provide the model with some context and the question. [Prompt templates](https://python.langchain.com/docs/how_to/#prompt-templates) are a simple way to define and reuse prompts.

In [7]:
from langchain_core.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)
prompt.format(context="Mary's sister is Susana", question="Who is Mary's sister?")

'Human: \nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know".\n\nContext: Mary\'s sister is Susana\n\nQuestion: Who is Mary\'s sister?\n'

In [8]:
prompt

ChatPromptTemplate(input_variables=['context', 'question'], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template='\nAnswer the question based on the context below. If you can\'t \nanswer the question, reply "I don\'t know".\n\nContext: {context}\n\nQuestion: {question}\n'))])
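Conceptually, a prompt template is a string with named placeholders that get filled in at call time. The sketch below mimics that idea with Python's built-in `str.format`; the `fill_template` helper is a hypothetical name used for illustration, and it leaves out the message roles that `ChatPromptTemplate` adds.

```python
# Simplified stand-in for a prompt template, built on str.format.
template_text = """Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def fill_template(text, **variables):
    """Return the template with every {placeholder} replaced by its value."""
    return text.format(**variables)


filled = fill_template(
    template_text,
    context="Mary's sister is Susana",
    question="Who is Mary's sister?",
)
print(filled)
```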

We can now chain the prompt with the model and the output parser.

<img src='images/chain2.png' width="1200">

In [9]:
chain = prompt | model | parser
chain.invoke({
    "context": "Mary's sister is Susana",
    "question": "Who is Mary's sister?"
})

'Susana'

## Combining chains

We can combine different chains to create more complex workflows. For example, let's create a second chain that translates the answer from the first chain into a different language.

Let's start by creating a new prompt template for the translation chain:

In [10]:
translation_prompt = ChatPromptTemplate.from_template(
    "Translate {answer} to {language}"
)

We can now create a new translation chain that combines the result from the first chain with the translation prompt.

Here is what the new workflow looks like:

<img src='images/chain3.png' width="1200">

In [11]:
from operator import itemgetter

translation_chain = (
    {"answer": chain, "language": itemgetter("language")} | translation_prompt | model | parser
)

translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "Spanish",
    }
)

'María tiene una hermana.'

In [12]:
translation_chain.invoke(
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "French",
    }
)

'Marie a une sœur.'
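The dictionary at the start of `translation_chain` can be read as "run every value on the same input and collect the results under the same keys." The following sketch reproduces that behavior in plain Python under that assumption; `run_map` and `fake_answer_chain` are hypothetical names, and the answer is hard-coded rather than produced by a model.

```python
from operator import itemgetter


def run_map(steps, inputs):
    """Call every step with the same inputs and collect results by key."""
    return {key: step(inputs) for key, step in steps.items()}


def fake_answer_chain(inputs):
    # Hypothetical stand-in for the first chain; a real chain would call the model.
    return "Mary has one sister."


steps = {"answer": fake_answer_chain, "language": itemgetter("language")}

mapped = run_map(
    steps,
    {
        "context": "Mary's sister is Susana. She doesn't have any more siblings.",
        "question": "How many sisters does Mary have?",
        "language": "Spanish",
    },
)

# The mapped dictionary has exactly the keys the translation prompt expects.
translation_text = "Translate {answer} to {language}".format(**mapped)
print(translation_text)
```

Here `itemgetter("language")` simply picks the `language` entry out of the original input, so it reaches the translation prompt unchanged.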

## Transcribing the YouTube video

The context we want to send the model comes from a YouTube video. Let's download the video and transcribe it using [OpenAI's Whisper](https://openai.com/research/whisper).

In [15]:
import os, tempfile
import yt_dlp, whisper

MODEL = "medium"

if not os.path.exists("transcription.txt"):
    with tempfile.TemporaryDirectory() as tmpdir:
        ydl = yt_dlp.YoutubeDL({
            "format": "bestaudio/best",
            "outtmpl": f"{tmpdir}/%(id)s.%(ext)s",
            "quiet": True,
        })
        info = ydl.extract_info(YOUTUBE_VIDEO, download=True)
        file = next(os.path.join(tmpdir, f) for f in os.listdir(tmpdir)
                    if f.endswith((".m4a", ".webm", ".mp3", ".opus")))

        whisper_model = whisper.load_model(MODEL)
        transcription = whisper_model.transcribe(file, fp16=False)["text"].strip()
        with open("transcription.txt", "w") as out:
            out.write(transcription)

Let's read the transcription and display the first few characters to ensure everything works as expected.

In [16]:
with open("transcription.txt") as file:
    transcription = file.read()

transcription[:100]

'Reinfersional learning is terrible. It just so happens that everything that we had before is much wo'

## Using the entire transcription as context

If we try to invoke the chain using the transcription as context, the model will return an error because the context is too long.

Large language models support limited context sizes. The transcription of the video we are using is too long for the model to handle in a single request, so we need to find a different solution.

In [17]:
try:
    chain.invoke({
        "context": transcription,
        "question": "Can you explain what Heated Rivalry book is about?"
    })
except Exception as e:
    print(e)

Error code: 400 - {'error': {'message': "This model's maximum context length is 16385 tokens. However, your messages resulted in 37181 tokens. Please reduce the length of the messages.", 'type': 'invalid_request_error', 'param': 'messages', 'code': 'context_length_exceeded'}}
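A quick back-of-the-envelope calculation with the two numbers from the error message shows how far over the limit the request is. This is a rough sketch: it ignores the tokens needed for the prompt template and for the model's response, and `min_pieces_needed` is a hypothetical helper name.

```python
import math

MAX_CONTEXT_TOKENS = 16385  # limit reported in the error message
REQUEST_TOKENS = 37181      # size of our request reported in the error message


def min_pieces_needed(total_tokens, limit):
    """Smallest number of pieces such that each piece fits within the limit."""
    return math.ceil(total_tokens / limit)


excess = REQUEST_TOKENS - MAX_CONTEXT_TOKENS
pieces = min_pieces_needed(REQUEST_TOKENS, MAX_CONTEXT_TOKENS)

print(f"Tokens over the limit: {excess}")
print(f"Minimum number of pieces: {pieces}")
```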


## Splitting the transcription

Since we can't use the entire transcription as the context for the model, a potential solution is to split the transcription into smaller chunks. We can then invoke the model using only the relevant chunks to answer a particular question:

<img src='images/system2.png' width="1200">

Let's start by loading the transcription in memory:

In [18]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("transcription.txt")
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'transcription.txt'}, page_content="Reinfersional learning is terrible. It just so happens that everything that we had before is much worse. I'm actually optimistic. I think this will work. I think it's tractable. I'm only sounding pessimistic because when I go on my Twitter timeline, I see all this stuff. That makes no sense to me. A lot of it is, I think, honestly just fundraising. We're not actually building animals. We're building ghosts. These sort of ethereal spirit entities, because they're fully digital, and they're kind of like mimicking humans. And it's a different kind of intelligence. It's business as usual, because we're in an intelligence explosion already and have been for decades. Everything is gradually being automated, has been for hundreds of years. Don't write blog posts. Don't do slides. Don't do any of that. Build the code. Arrange it. Get it to work. It's the only way to go. Otherwise, you're missing knowledge. If you have a perfect 

There are many different ways to split a document. For this example, we'll use a simple splitter that splits the document into chunks of a fixed size. Check [Text Splitters](https://python.langchain.com/docs/how_to/#text-splitters) for more information about different approaches to splitting documents.

For illustration purposes, let's split the transcription into chunks of 100 characters with an overlap of 20 characters and display the first few chunks:

In [19]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
text_splitter.split_documents(text_documents)[:10]

[Document(metadata={'source': 'transcription.txt'}, page_content='Reinfersional learning is terrible. It just so happens that everything that we had before is much'),
 Document(metadata={'source': 'transcription.txt'}, page_content="had before is much worse. I'm actually optimistic. I think this will work. I think it's tractable."),
 Document(metadata={'source': 'transcription.txt'}, page_content="it's tractable. I'm only sounding pessimistic because when I go on my Twitter timeline, I see all"),
 Document(metadata={'source': 'transcription.txt'}, page_content='timeline, I see all this stuff. That makes no sense to me. A lot of it is, I think, honestly just'),
 Document(metadata={'source': 'transcription.txt'}, page_content="honestly just fundraising. We're not actually building animals. We're building ghosts. These sort"),
 Document(metadata={'source': 'transcription.txt'}, page_content="ghosts. These sort of ethereal spirit entities, because they're fully digital, and they're kind of

For our specific application, let's use 1000 characters instead:

In [20]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)
documents = text_splitter.split_documents(text_documents)
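To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size splitting with overlap. It is an assumption-laden approximation: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and spaces, which this hypothetical `split_fixed` helper does not attempt.

```python
def split_fixed(text, chunk_size, chunk_overlap):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_fixed(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

In this toy example, the last three characters of each chunk reappear at the start of the next one, which is what keeps a sentence that straddles a boundary from being lost entirely.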

## Finding the relevant chunks

Given a particular question, we need to find the relevant chunks from the transcription to send to the model. Here is where the idea of **embeddings** comes into play.

An embedding is a mathematical representation of the semantic meaning of a word, sentence, or document. It's a projection of a concept in a high-dimensional space. Embeddings have a simple characteristic: the projections of related concepts will be close to each other, while concepts with different meanings will lie far apart. You can use [Cohere's Embed Playground](https://dashboard.cohere.com/playground/embed) to visualize embeddings in two dimensions.

To provide the model with the most relevant chunks, we can use the embeddings of the question and the chunks of the transcription to compute the similarity between them. We can then select the chunks with the highest similarity to the question and use them as the context for the model:

<img src='images/system3.png' width="1200">

Let's generate embeddings for an arbitrary query:

In [21]:
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
embedded_query = embeddings.embed_query("Where is this video being taken?")

print(f"Embedding length: {len(embedded_query)}")
print(embedded_query[:10])

Embedding length: 1536
[-0.010090172290802002, -0.0172368586063385, -0.00587761914357543, -0.01030843984335661, -0.006211255677044392, 0.017897896468639374, -0.010620249435305595, -0.02469535358250141, 0.001753931399434805, -0.02579292468726635]


To illustrate how embeddings work, let's first generate the embeddings for two different sentences:

In [22]:
sentence1 = embeddings.embed_query("Mary's sister is Susana")
sentence2 = embeddings.embed_query("This video shows the Andes")

We can now compute the similarity between the query and each of the two sentences. The closer the embeddings are, the more similar the sentences will be.

We can use [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to calculate the similarity between the query and each of the sentences:

In [23]:
from sklearn.metrics.pairwise import cosine_similarity

query_sentence1_similarity = cosine_similarity([embedded_query], [sentence1])[0][0]
query_sentence2_similarity = cosine_similarity([embedded_query], [sentence2])[0][0]

query_sentence1_similarity, query_sentence2_similarity

(0.7238114027498846, 0.8280962306940765)
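For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python on small hand-made vectors (not real embeddings) so the arithmetic is easy to follow; the function name `cosine` is chosen for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 0.0], [2.0, 0.0])   # vectors point the same way
perpendicular = cosine([1.0, 0.0], [0.0, 1.0])    # vectors are unrelated
in_between = cosine([1.0, 1.0], [1.0, 0.0])       # 45 degrees apart

print(same_direction, perpendicular, in_between)
```

Because the measure depends only on direction, scaling a vector does not change its similarity to another vector.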

## Setting up a vector store

We need an efficient way to store document chunks and their embeddings, and to perform similarity searches at scale. To do this, we'll use a **vector store**.

A vector store is a database of embeddings that specializes in fast similarity searches. 

<img src='images/system4.png' width="1200">

To understand how a vector store works, let's create one in memory and add a few embeddings to it:

In [24]:
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore1 = DocArrayInMemorySearch.from_texts(
    [
        "Mary's sister is Susana",
        "John and Tommy are brothers",
        "Patricia likes white cars",
        "Pedro's mother is a teacher",
        "Lucia drives an Audi",
        "Mary has two siblings",
    ],
    embedding=embeddings,
)

We can now query the vector store to find the most similar embeddings to a given query:

In [25]:
vectorstore1.similarity_search_with_score(query="Who is Mary's sister?", k=3)

[(Document(page_content="Mary's sister is Susana"), 0.917323542883913),
 (Document(page_content='Mary has two siblings'), 0.9045029978848259),
 (Document(page_content='John and Tommy are brothers'), 0.8013182122337675)]
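The core idea behind a similarity search can be sketched as a brute-force scan: score every stored vector against the query and keep the top `k`. The `ToyVectorStore` class below is a hypothetical illustration with hand-made two-dimensional vectors; real vector stores use actual embeddings and typically rely on specialized indexes rather than scanning everything.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class ToyVectorStore:
    """Hypothetical in-memory store that searches by scanning every item."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self._items.append((text, vector))

    def search(self, query_vector, k=3):
        scored = [(text, _cosine(query_vector, vector)) for text, vector in self._items]
        # Highest similarity first.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


store = ToyVectorStore()
# Made-up vectors: first axis stands for "family", second axis for "cars".
store.add("about family", [1.0, 0.0])
store.add("about cars", [0.0, 1.0])
store.add("family and cars", [1.0, 1.0])

results = store.search([1.0, 0.1], k=2)
print(results)
```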

## Connecting the vector store to the chain

We can use the vector store to find the most relevant chunks from the transcription to send to the model. Here is how we can connect the vector store to the chain:

<img src='images/chain4.png' width="1200">

We need to configure a [Retriever](https://python.langchain.com/docs/how_to/#retrievers). The retriever will run a similarity search in the vector store and pass the most similar documents to the next step in the chain.

We can get a retriever directly from the vector store we created before: 

In [26]:
retriever1 = vectorstore1.as_retriever()
retriever1.invoke("Who is Mary's sister?")

[Document(page_content="Mary's sister is Susana"),
 Document(page_content='Mary has two siblings'),
 Document(page_content='John and Tommy are brothers'),
 Document(page_content="Pedro's mother is a teacher")]

Our prompt expects two parameters, "context" and "question." We can use the retriever to find the chunks we'll use as the context to answer the question.

We can create a map with the two inputs by using the [`RunnableParallel`](https://python.langchain.com/docs/how_to/parallel/) and [`RunnablePassthrough`](https://python.langchain.com/docs/how_to/passthrough/) classes. This will allow us to pass the context and question to the prompt as a map with the keys "context" and "question."

In [27]:
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

setup = RunnableParallel(context=retriever1, question=RunnablePassthrough())
setup.invoke("What color is Patricia's car?")

{'context': [Document(page_content='Patricia likes white cars'),
  Document(page_content='Lucia drives an Audi'),
  Document(page_content="Pedro's mother is a teacher"),
  Document(page_content="Mary's sister is Susana")],
 'question': "What color is Patricia's car?"}
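Note that `context` in this output is a list of document objects, not a single string. The notebook passes that list to the prompt as is. One possible refinement, shown here only as a sketch, is to join the text of the retrieved documents before it reaches the prompt. The `Doc` class and `format_docs` helper are hypothetical names used for illustration and are not part of the notebook.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str


def format_docs(docs):
    """Join the text of each document, separated by a blank line."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [Doc("Patricia likes white cars"), Doc("Lucia drives an Audi")]
context_text = format_docs(retrieved)
print(context_text)
```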

Let's now add the setup map to the chain and run it:



In [28]:
chain = setup | prompt | model | parser
chain.invoke("What color is Patricia's car?")

'White'

Let's invoke the chain using another example:

In [29]:
chain.invoke("What car does Lucia drive?")

'Lucia drives an Audi.'

## Loading transcription into the vector store

We initialized the first vector store with a few sample strings. Let's create a new vector store using the chunks from the video transcription.

In [30]:
vectorstore2 = DocArrayInMemorySearch.from_documents(documents, embeddings)

In [31]:
len(documents)

173

In [32]:
documents[10]

Document(metadata={'source': 'transcription.txt'}, page_content="everything at once. Animals are maybe a better example, because they don't even have the scaffold of language. They just get thrown out into the world. And they just have to make sense of everything without any labels. And the vision for AGI then should just be something which just looks at sensory data, looks at the computer screen, and it just figures out what's going on from scratch. I mean, if a human was put in a similar situation, that would be trained from scratch. But I mean, this is like a human growing up, or an animal growing up. So why shouldn't that be the vision for AI rather than this thing where we're doing millions of years of training? I think that's a really good question. I mean, so Sutton was on your podcast, and I saw the podcast. And I had a write up about that podcast almost that gets into a little bit of how I see things. And I kind of feel like I'm very careful to make analogies to animals, becau

Let's set up a new chain using the vector store that holds the transcription. This time we are using a different but equivalent syntax to specify the [`RunnableParallel`](https://python.langchain.com/docs/how_to/parallel/) portion of the chain:

In [33]:
chain = (
    {"context": vectorstore2.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
chain.invoke("Why are LLMs Different?")

'LLMs are different because they do not have the equivalent of culture, they are extremely good at memorization, and they lack the ability to learn abstract concepts quickly like a child can.'

## Setting up Pinecone

So far we've used an in-memory vector store. In practice, we need a vector store that can handle large amounts of data and perform similarity searches at scale. For this example, we'll use [Pinecone](https://www.pinecone.io/).

The first step is to create a Pinecone account, set up an index, get an API key, and set it as an environment variable `PINECONE_API_KEY`.
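Before loading any documents, it can help to confirm that the required keys are actually available in the environment. The following is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper, and the variable names are the ones this notebook relies on.

```python
import os


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(["OPENAI_API_KEY", "PINECONE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```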

Then, we can load the transcription documents into Pinecone:

In [34]:
from langchain_pinecone import PineconeVectorStore

index_name = "youtube-rag-llm"

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)

Let's now run a similarity search on Pinecone to make sure everything works:

In [35]:
pinecone.similarity_search("Who is Andrej Karpathy?")[:3]

[Document(metadata={'source': 'transcription.txt'}, page_content="geniuses of today are bare discussion to surface of what a human mind can do, I think. Today, I'm speaking with Andre Karpathy. Andre, why do you say that this will be the decade of agents and not the year of agents? Well, first of all, thank you for having me here. Excited to be here. So the quote that you've just mentioned, it's the decade of agents, that's actually a reaction to an existing pre-existing quote, I should say, where I think some of the labs, I'm not actually sure who said this, but they were alluding to this being the year of agents with respect to LLMs and how they were going to evolve. And I think I was triggered by that, because I feel like there's some overpredictions going on in the industry. And in my mind, this is really a lot more accurately described as the decade of agents. And we have some very early agents that are actually extremely impressive that I use daily, Claude and Codex and so on. Bu

Let's set up the new chain using Pinecone as the vector store:

In [36]:
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)

chain.invoke("What do this people in the podcast talking about")

'The people in the podcast are discussing the advancements and challenges in artificial intelligence, particularly focusing on the development of agents and the need for continual learning in AI systems. They also touch upon the importance of clear and direct communication in explaining complex ideas.'

In [38]:
chain.invoke("Who are the poeple talking in the podcast?")

'Andre Karpathy and the interviewer are the people talking in the podcast.'

In [39]:
chain.invoke("What is Andre Karpathy expertise?")

"Andre Karpathy's expertise is in the field of AI, specifically in training neural networks for tasks, developing agents, and working with LLMs (Large Language Models)."