# Crafting Advanced Multimodal AI Solutions with LangChain & OpenAI API

## Introduction

In today's digital age, videos are a treasure trove of information, but extracting that information efficiently can be a challenge. Watching an entire video or skimming through it can be time-consuming and often inefficient. This project aims to streamline this process by leveraging the power of multimodal AI to create an interactive Q&A bot that can answer questions about the content of a YouTube video.

### Project Overview

In this project, we will:

1. **Download a Tutorial Video from YouTube**: We'll start by selecting a tutorial video from YouTube that contains valuable information.
2. **Transcribe the Audio**: Using the Whisper API, we'll convert the audio from the video into text.
3. **Create a Q&A Bot**: We'll utilize LangChain to build a bot that can answer questions based on the transcribed text.

### Goals

- **Understand the Building Blocks of Multimodal AI Projects**: Gain insights into how different AI models can be combined to create powerful applications.
- **Apply Fundamental Concepts of LangChain**: Explore the core functionalities of LangChain and how it can be used to build interactive applications.
- **Use the Whisper API for Transcription**: Learn how to transcribe audio to text using the Whisper API.
- **Combine LangChain and Whisper API**: Integrate both tools to create a seamless Q&A experience for any YouTube video.

### Workflow

1. **Video Selection and Download**:
   - Choose a tutorial video from YouTube.
   - Use a YouTube downloader tool to save the video locally.

2. **Audio Transcription**:
   - Extract the audio from the downloaded video.
   - Use the Whisper API to transcribe the audio into text.

3. **Text Processing**:
   - Clean and preprocess the transcribed text to ensure it is suitable for querying.
   - Segment the text into manageable chunks for better processing.

4. **Q&A Bot Development**:
   - Utilize LangChain to build a Q&A bot.
   - Feed the processed text into the bot.
   - Implement a user interface to interact with the bot and ask questions.

5. **Interactive Insights Extraction**:
   - Allow users to input questions and receive answers in real-time.
   - Highlight key insights and information from the video based on user queries.
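
The segmentation step in the workflow above could be sketched as follows. This is a minimal illustration using only the standard library; the function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions made for this example, and the notebook itself loads the whole transcript as a single document rather than chunking it.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that text near a
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop once the remaining text is already covered by the previous chunk
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "abcdefghij" * 3  # 30 characters standing in for a transcript
chunks = chunk_text(sample, chunk_size=12, overlap=2)
print([len(chunk) for chunk in chunks])
```

Character-based chunks are the simplest option; splitting on sentence or token boundaries is usually a better fit for real transcripts.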

### Thought Process

- **Efficiency**: The primary goal is to make information extraction from videos faster and more efficient.
- **Interactivity**: By creating a Q&A bot, we provide an interactive way for users to engage with the content.
- **Scalability**: The approach can be scaled to handle multiple videos and different types of content.

### Tech Stack

- **YouTube Downloader**: For downloading videos.
- **Whisper API**: For transcribing audio to text.
- **LangChain**: For building the Q&A bot.
- **Python**: The primary programming language for scripting and development.
- **Jupyter Notebook**: For interactive development and demonstration.

### Skills Utilized

- **Python Programming**: Writing scripts for downloading videos, transcribing audio, and building the bot.
- **API Integration**: Using the Whisper API for transcription and LangChain for the Q&A bot.
- **Natural Language Processing (NLP)**: Processing and querying the transcribed text.
- **Interactive Development**: Using Jupyter Notebook to develop and demonstrate the project.

### Step-by-Step Description

1. **Download the Video**:
   - Use a Python script to download the video from YouTube.
   - Example (command-line equivalent): `yt-dlp -x --audio-format mp3 <video_url>`

2. **Extract and Transcribe Audio**:
   - Extract audio from the video file.
   - Use the Whisper API to transcribe the audio.
   - Example: 
     ```python
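     import openai
     client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment
     with open("audio_file.mp3", "rb") as audio:
         result = client.audio.transcriptions.create(file=audio, model="whisper-1")
     text = result.text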
     ```

3. **Process the Transcribed Text**:
   - Clean the text to remove any noise or irrelevant information.
   - Segment the text into smaller chunks for better querying.

4. **Build the Q&A Bot**:
   - Initialize LangChain and load the processed text.
   - Implement the Q&A functionality.
   - Example:
     ```python
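     from langchain.chains import RetrievalQA
     from langchain.document_loaders import TextLoader
     from langchain.vectorstores import DocArrayInMemorySearch
     from langchain_openai import ChatOpenAI, OpenAIEmbeddings

     # Load the saved transcript, embed it, and build a retrieval QA chain
     docs = TextLoader("files/transcripts/transcript.txt").load()
     db = DocArrayInMemorySearch.from_documents(docs, OpenAIEmbeddings())
     qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0.0), chain_type="stuff", retriever=db.as_retriever())
     print(qa.invoke("What is the main topic of the video?")["result"])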
     ```

5. **Interactive Insights Extraction**:
   - Create a user interface to input questions and display answers.
   - Highlight key insights based on user queries.

By following these steps, we can efficiently extract and interact with valuable information from YouTube videos, making the process faster and more user-friendly.

## Setup

The project requires several packages that need to be installed into Workspace.

- `langchain` is a framework for developing generative AI applications.
- `yt_dlp` lets you download YouTube videos.
- `tiktoken` converts text into tokens.
- `docarray` makes it easier to work with multimodal data (in this case mixing audio and text).

### Run the following code to install the packages.

In [1]:
# Install the openai package, locked to version 1.27
!pip install openai==1.27

# Install the langchain package, locked to version 0.1.19
!pip install langchain==0.1.19

# Install the langchain-openai package, locked to version 0.1.6
!pip install langchain-openai==0.1.6

# Install the yt_dlp package, locked to version 2024.4.9
!pip install yt_dlp==2024.4.9

# Install the tiktoken package, locked to version 0.6.0
!pip install tiktoken==0.6.0

# Install the docarray package, locked to version 0.40.0
!pip install docarray==0.40.0

Defaulting to user installation because normal site-packages is not writeable
Collecting openai==1.27
  Downloading openai-1.27.0-py3-none-any.whl.metadata (21 kB)
Downloading openai-1.27.0-py3-none-any.whl (314 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 314.1/314.1 kB 29.4 MB/s eta 0:00:00
Installing collected packages: openai
Successfully installed openai-1.27.0
Defaulting to user installation because normal site-packages is not writeable
Collecting langchain==0.1.19
  Downloading langchain-0.1.19-py3-none-any.whl.metadata (13 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain==0.1.19)
  Downloading aiohttp-3.9.5-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.5 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain==0.1.19)
  Downloading dataclasses_json-0.6.6-py3-none-any.whl.metadata (25 kB)
Collecting langchain-community<0.1,>=0.0.38 (from langchain==0.1.19)
  Downloading langchain_community-0.0.38
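
After installation, it can be useful to confirm that the pinned versions are the ones actually importable in the notebook kernel. The following is a small sketch using the standard-library `importlib.metadata` module; the helper names (`matches_pin`, `find_mismatches`) are made up for this example, and `yt-dlp` is assumed to be the distribution name under which `yt_dlp` is registered.

```python
from importlib import metadata

# Pins copied from the install cell; "yt-dlp" is assumed to be the
# distribution name of the yt_dlp package
PINNED = {
    "openai": "1.27",
    "langchain": "0.1.19",
    "langchain-openai": "0.1.6",
    "yt-dlp": "2024.4.9",
    "tiktoken": "0.6.0",
    "docarray": "0.40.0",
}


def _parts(version):
    """Split '2024.04.09' into ['2024', '4', '9'] (leading zeros dropped)."""
    return [part.lstrip("0") or "0" for part in version.split(".")]


def matches_pin(installed, pinned):
    """True if the installed version starts with the pinned components."""
    wanted = _parts(pinned)
    return _parts(installed)[:len(wanted)] == wanted


def find_mismatches(pins):
    """Return {name: installed_version_or_None} for every unsatisfied pin."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed is None or not matches_pin(installed, pinned):
            mismatches[name] = installed
    return mismatches


print(find_mismatches(PINNED) or "All pinned packages are installed")
```

An empty result means every pin is satisfied; a value of `None` means the package was not found at all.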

## Required Libraries 

For this project, we need the following packages:

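- `os`: This package provides a way of using operating system-dependent functionality, such as reading or writing to the file system.
- `glob`: This package finds file paths that match a pattern. We will use it to locate the downloaded `.mp3` file.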
- `yt_dlp`: This package allows us to download YouTube videos. We will use it to download a video of your choice, convert it to an `.mp3` format, and save the file locally.
- `openai`: This package simplifies making API calls to OpenAI models, which we will use for various generative AI tasks.

Make sure all these packages are installed and properly configured before proceeding with the project.

### Import the following packages.

- Import `os`.
- Import `glob`.
- Import `openai`.
- Import `yt_dlp` with the alias `youtube_dl`.
- From the `yt_dlp` package, import `DownloadError`.
- Import `docarray`.
- Assign `openai_api_key` to `os.getenv("OPENAI_API_KEY")`.

In [2]:
# Import the os package
import os 

# Import the glob package
import glob

# Import the openai package 
import openai 

# Import the yt_dlp package as youtube_dl
import yt_dlp as youtube_dl

# Import DownloadError from yt_dlp
from yt_dlp import DownloadError 

# Import DocArray 
import docarray 


In [3]:
openai_api_key = os.getenv("OPENAI_API_KEY")
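
`os.getenv` returns `None` when the variable is missing, so a missing key only surfaces later as an authentication error. A small guard can fail earlier with a clearer message. This is a sketch; the helper name `require_env` is an assumption for this example.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


try:
    openai_api_key = require_env("OPENAI_API_KEY")
    print("OPENAI_API_KEY found")
except RuntimeError as err:
    print(err)
```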

## Download the YouTube Video

After setting up the environment, the first step is to download a YouTube video and convert it to an audio file (.mp3).

For this example, we'll download a coding tutorial about machine learning in Python.

Here are the steps we'll follow:

1. **Set Variables**: Define the `youtube_url` and the `output_dir` where the file will be stored.
2. **Download and Convert**: Use `yt_dlp` to download the video and convert it to an `.mp3` file.
3. **List Audio Files**: Use `glob` to search the `output_dir` for any `.mp3` files and store them in a list called `audio_files`. This list will be used later to send each file to the Whisper model for transcription.

- Run the code to set the URL of the video, `youtube_url`, the directory to store the downloaded video, `output_dir`, and the download settings, `ydl_config`.

In [4]:
# An example YouTube tutorial video
youtube_url = "https://www.youtube.com/watch?v=aqzxYofJ_ck"

# Directory to store the downloaded video
output_dir = "files/audio/"

# Config for yt-dlp
ydl_config = {
    "format": "bestaudio/best",
    "postprocessors": [
        {
            "key": "FFmpegExtractAudio",
            "preferredcodec": "mp3",
            "preferredquality": "192",
        }
    ],
    "outtmpl": os.path.join(output_dir, "%(title)s.%(ext)s"),
    "verbose": True
}

- Check whether `output_dir` exists; if not, create it.
- Try to download the video using the specified configuration.
  -  If a DownloadError occurs, attempt to download the video again.

In [5]:
# Check if the output directory exists, if not create it
if not os.path.exists(output_dir): 
    os.makedirs(output_dir)

# Print a message indicating which video is being downloaded
print(f"Downloading video from {youtube_url}")

# Try to download the video using the specified configuration
# If a DownloadError occurs, attempt to download the video again
try: 
    with youtube_dl.YoutubeDL(ydl_config) as ydl: 
        ydl.download([youtube_url])
except DownloadError: 
    with youtube_dl.YoutubeDL(ydl_config) as ydl: 
        ydl.download([youtube_url])

Downloading video from https://www.youtube.com/watch?v=aqzxYofJ_ck


[debug] Encodings: locale UTF-8, fs utf-8, pref UTF-8, out UTF-8 (No ANSI), error UTF-8 (No ANSI), screen UTF-8 (No ANSI)
[debug] yt-dlp version stable@2024.04.09 from yt-dlp/yt-dlp [ff0779267] (pip) API
[debug] params: {'format': 'bestaudio/best', 'postprocessors': [{'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3', 'preferredquality': '192'}], 'outtmpl': 'files/audio/%(title)s.%(ext)s', 'verbose': True, 'compat_opts': set(), 'http_headers': {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.115 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en-us,en;q=0.5', 'Sec-Fetch-Mode': 'navigate'}}
[debug] Python 3.8.10 (CPython x86_64 64bit) - Linux-5.10.209-198.858.amzn2.x86_64-x86_64-with-glibc2.29 (OpenSSL 1.1.1f  31 Mar 2020, glibc 2.31)
[debug] exe versions: ffmpeg 4.2.7, ffprobe 4.2.7
[debug] Optional libraries: Cryptodome-3.20.0, brotli-1.1.0, certifi-2019

[youtube] Extracting URL: https://www.youtube.com/watch?v=aqzxYofJ_ck
[youtube] aqzxYofJ_ck: Downloading webpage
[youtube] aqzxYofJ_ck: Downloading ios player API JSON
[youtube] aqzxYofJ_ck: Downloading android player API JSON




[youtube] aqzxYofJ_ck: Downloading player a960a0cb


[debug] [youtube] Extracting nsig function with jsinterp
[debug] Saving youtube-nsig.a960a0cb to cache
[debug] [youtube] Decrypted nsig mpAKXPmdsQ2KLfnJ6 => n6t9edO8ZlO1rQ
[debug] Loading youtube-nsig.a960a0cb from cache
[debug] [youtube] Decrypted nsig 6vesnfccWruEJ7TsY => 9L5aMNsiYTAS1A


[youtube] aqzxYofJ_ck: Downloading m3u8 information


[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, vcodec:vp9.2, channels, acodec, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), channels, acodec, lang, proto, size, br, asr, vext, aext, hasaud, id


[info] aqzxYofJ_ck: Downloading 1 format(s): 251


[debug] Invoking http downloader on "https://rr3---sn-ab5sznzl.googlevideo.com/videoplayback?expire=1715630570&ei=ih1CZuOBFumGkucPrJeq-Ac&ip=3.88.83.220&id=o-AHVPdBMFqeO-p8gUUPS3FAQvBxGpl0qMTMfcmbkwR7_G&itag=251&source=youtube&requiressl=yes&xpc=EgVo2aDSNQ%3D%3D&mh=zw&mm=31%2C29&mn=sn-ab5sznzl%2Csn-ab5l6nrd&ms=au%2Crdu&mv=u&mvi=3&pl=24&bui=AWRWj2QtjOQl4cf0mEA8gqEqhT8RfXSBmgAs1IMzKB9ULqJSkNNTkP0YMtauKozZKci1-VePuFsZqKUR&spc=UWF9fyVN2w86wGgK3jV5cD7Rpeks0welMDcfn7W000Hsm-MOut6t75c&vprv=1&svpuc=1&mime=audio%2Fwebm&ns=5SzWiS3Fugfd-wqkdoAalnEQ&rqh=1&gir=yes&clen=10932652&dur=752.701&lmt=1654008313150389&mt=1715608446&fvip=4&keepalive=yes&c=WEB&sefc=1&txp=5318224&n=9L5aMNsiYTAS1A&sparams=expire%2Cei%2Cip%2Cid%2Citag%2Csource%2Crequiressl%2Cxpc%2Cbui%2Cspc%2Cvprv%2Csvpuc%2Cmime%2Cns%2Crqh%2Cgir%2Cclen%2Cdur%2Clmt&sig=AJfQdSswRQIhANvav2j_Tdc8rl0cr3diroFt_xfrfHJCevUe4_XuD2AZAiBLDLcv27c3q2YeWuvpwUUgS5AbGi7m35jPKgteYnzkSw%3D%3D&lsparams=mh%2Cmm%2Cmn%2Cms%2Cmv%2Cmvi%2Cpl&lsig=AHWaYeowRQIhAPDqIbbhr6

[download] Destination: files/audio/Python Machine Learning Tutorial ｜ Splitting Your Data ｜ Databytes.webm
[download] 100% of   10.43MiB in 00:00:00 at 18.53MiB/s  


[debug] ffmpeg command line: ffprobe -show_streams 'file:files/audio/Python Machine Learning Tutorial ｜ Splitting Your Data ｜ Databytes.webm'


[ExtractAudio] Destination: files/audio/Python Machine Learning Tutorial ｜ Splitting Your Data ｜ Databytes.mp3


[debug] ffmpeg command line: ffmpeg -y -loglevel repeat+info -i 'file:files/audio/Python Machine Learning Tutorial ｜ Splitting Your Data ｜ Databytes.webm' -vn -acodec libmp3lame -b:a 192.0k -movflags +faststart 'file:files/audio/Python Machine Learning Tutorial ｜ Splitting Your Data ｜ Databytes.mp3'


Deleting original file files/audio/Python Machine Learning Tutorial ｜ Splitting Your Data ｜ Databytes.webm (pass -k to keep)
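
The cell above retries the download exactly once when a `DownloadError` occurs. A more general pattern is a small retry helper with a bounded number of attempts and an increasing delay. The sketch below uses a stand-in function instead of a real download so it runs without network access; the names `retry` and `flaky_download` are assumptions for this example.

```python
import time


def retry(func, attempts=3, exceptions=(Exception,), base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or the attempts are used up.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between failed
    attempts and re-raises the last error if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for the download call: fails twice, then succeeds
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "downloaded"


# sleep is replaced with a no-op so the demo finishes immediately
print(retry(flaky_download, attempts=3, exceptions=(ConnectionError,), sleep=lambda seconds: None))
```

In the notebook, the wrapped function would be the `ydl.download([youtube_url])` call and `exceptions` would be `(DownloadError,)`.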


To find the audio files, we will use the `glob` module, which searches the `output_dir` for any `.mp3` files. We will store the matches in a list called `audio_files`, which will be used later to send each file to the Whisper model for transcription.

### Find the audio file in the output directory.

- Find all the MP3 audio files in the output directory by joining the output directory to the pattern `*.mp3` and using glob to list them.
- Select the first file in the list and assign it to `audio_filename`.
- _Check your work._ Print `audio_filename`.

In [13]:
# Find the audio file in the output directory
audio_files = glob.glob(os.path.join(output_dir, "*.mp3"))

# Select the first audio file in the list
audio_filename = audio_files[0]

# Print the name of the selected audio file
print(audio_filename)

files/audio/Python Machine Learning Tutorial ｜ Splitting Your Data ｜ Databytes.mp3
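
Indexing `audio_files[0]` raises an `IndexError` if the download failed and no `.mp3` file exists. A slightly more defensive version might look like the sketch below, which also sorts the matches so the selection is deterministic. The helper name `find_audio_files` is an assumption, and the demo uses a temporary directory with empty placeholder files.

```python
import glob
import os
import tempfile


def find_audio_files(directory, pattern="*.mp3"):
    """Return the sorted paths matching pattern, or raise if there are none."""
    files = sorted(glob.glob(os.path.join(directory, pattern)))
    if not files:
        raise FileNotFoundError(f"No files matching {pattern!r} in {directory!r}")
    return files


# Demo with empty placeholder files in a temporary directory
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("b.mp3", "a.mp3", "notes.txt"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_files = [os.path.basename(path) for path in find_audio_files(demo_dir)]
    print(demo_files)
```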


## Transcribe the Video using Whisper

In this step, we will send the audio file extracted from the YouTube video to the Whisper model to be transcribed. To do this, we will create variables for the `audio_file`, the `output_file`, and the `model`.

Using these variables, we will:
- Create an OpenAI client
- Read the audio file
- Send the file to the Whisper model using the `openai` package

### Transcribe the audio file.

- _The audio file, output file, and model are defined for you._
- Define an OpenAI client. Assign to `client`.
- Open the audio file as read-binary (`"rb"`).
  - Use the Whisper model to create a transcription of the opened audio file. Assign to `response`.
- Extract the transcript from the response.

In [14]:
# Use these settings
audio_file = audio_filename
output_file = "files/transcripts/transcript.txt"
model = "whisper-1"

# Transcribe the audio file to text using OpenAI API
print("converting audio to text...")

# Define an OpenAI client model. Assign to client.
client = openai.OpenAI()

# Open the audio file as read-binary
with open(audio_file, "rb") as audio:
    # Use the model to create a transcription
    response = client.audio.transcriptions.create(file=audio, model=model)

# Extract the transcript from the response
transcript = response.text

# Print the transcript
print(transcript)

converting audio to text...
Hi, in this tutorial, we're going to look at a data pre-processing technique for machine learning called splitting your data. That is splitting your data set into a training set and a testing set. Now, before we get to the code, you might wonder, why do I need to do this? And really, there are going to be two problems if you don't. So if you train your machine learning model on your whole data set, then you've not tested the model on anything else. And that means you don't know how well your model is going to perform on other data sets. Secondly, it's actually even worse than this, because you risk overfitting the model. And that means that you've made your model work really well for one data set, but that gives a cost of model performance on other data sets. So not only do you not know how well the model is going to perform on other data sets, it's probably going to be worse than it could be. So you might also wonder when in your machine learning workflow, 
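
One practical constraint on the upload step is file size. The sketch below assumes a 25 MB upload limit for the transcription endpoint (an assumption to verify against the current OpenAI documentation) and estimates the MP3 size from the duration and bitrate that appear in the download log (`dur=752.701`, `192` kbps), assuming a constant bitrate. The helper names are made up for this example.

```python
import os
import tempfile

# Assumed upload limit; verify against the current API documentation
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def fits_upload_limit(path, limit=MAX_UPLOAD_BYTES):
    """True if the file at path is no larger than the limit in bytes."""
    return os.path.getsize(path) <= limit


def estimate_mp3_bytes(duration_seconds, bitrate_kbps):
    """Rough size of a constant-bitrate MP3: seconds * kilobits per second / 8."""
    return duration_seconds * bitrate_kbps * 1000 / 8


# Duration and bitrate taken from the download log above
estimated = estimate_mp3_bytes(752.701, 192)
print(f"Estimated size: {estimated / 1e6:.1f} MB")

# Demo of the size check on a small temporary file
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
    tmp.write(b"\0" * 1024)
demo_ok = fits_upload_limit(tmp.name)
os.remove(tmp.name)
print(demo_ok)
```

For longer videos, lowering `preferredquality` in `ydl_config` or splitting the audio into parts are two ways to stay under such a limit.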

### Save the transcript to a text file.

- If the directory for the output file doesn't exist, make it.
- Write the transcript to the output file

In [8]:
# Create the directory for the output file if it doesn't exist
os.makedirs(os.path.dirname(output_file), exist_ok=True)

# Write the transcript to the output file
with open(output_file, "w") as file:
    file.write(transcript)

## Create a TextLoader using LangChain 

### Loading Text Data with LangChain

To utilize text or other types of data with LangChain, we need to convert that data into `Document` objects. This conversion is facilitated by using loaders. In this tutorial, we will employ the `TextLoader` to take the text from our transcript and load it into a `Document`. This process is essential for preparing the data for further processing and analysis within the LangChain framework.

By following the steps below, you will learn how to:

1. Initialize a `TextLoader`.
2. Load the transcript text into a `Document`.

Let's get started!

### Load the documents from the text file using a TextLoader.

- From the `langchain.document_loaders` module, import `TextLoader`.
- Create a `TextLoader`, passing it the path of the transcript file, `output_file` (`"files/transcripts/transcript.txt"`). Assign to `loader`.
- Use the TextLoader to load the documents. Assign to `docs`.

In [15]:
# From the langchain.document_loaders module, import TextLoader
from langchain.document_loaders import TextLoader

# Create a `TextLoader`, passing the path of the transcript file. Assign to `loader`.
loader = TextLoader(output_file)

# Use the TextLoader to load the documents. Assign to docs.
docs = loader.load()

In [10]:
# Show the first element of docs to verify it has been loaded 
docs[0]

Document(page_content="Hi, in this tutorial, we're going to look at a data pre-processing technique for machine learning called splitting your data. That is splitting your data set into a training set and a testing set. Now, before we get to the code, you might wonder, why do I need to do this? And really, there are going to be two problems if you don't. So if you train your machine learning model on your whole data set, then you've not tested the model on anything else. And that means you don't know how well your model is going to perform on other data sets. Secondly, it's actually even worse than this, because you risk overfitting the model. And that means that you've made your model work really well for one data set, but that gives a cost of model performance on other data sets. So not only do you not know how well the model is going to perform on other data sets, it's probably going to be worse than it could be. So you might also wonder when in your machine learning workflow, as yo

## Create an In-Memory Vector Store 

Now that we have loaded the transcript into a `Document`, we will store it in a vector store. A vector store holds an embedding (a numeric vector) for each piece of text, which lets us find the passages most similar to a query by comparing how close their vectors are.

For large amounts of data, it is best to use a dedicated vector database. Since we are only using one transcript for this project, we can create an in-memory vector store using the `docarray` package.
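
To make the idea of similarity search concrete, here is a toy sketch that ranks a few texts against a query using cosine similarity, one common similarity measure. The three-dimensional vectors are invented for illustration; real embeddings come from an embedding model and have many more dimensions, and this is not a description of how `docarray` is implemented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, invented for this example
store = {
    "splitting data into training and testing sets": [0.9, 0.1, 0.0],
    "overfitting hurts performance on new data": [0.7, 0.6, 0.1],
    "football results": [0.0, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.0]  # stands in for an embedded question

best_match = max(store, key=lambda text: cosine_similarity(store[text], query_vector))
print(best_match)
```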

We will also use the `tiktoken` package to tokenize text. Tokenizing separates text into smaller parts (words, parts of words, or characters), each of which is mapped to an integer token ID that helps the model "understand" the text and its relationships with other tokens.

### Import the `tiktoken` package. 

In [11]:
# Import the tiktoken package
import tiktoken

## Create the Document Search 

We will now use the LangChain library to perform essential operations for creating a Question and Answer experience. This will be achieved using the integrated `RetrievalQA` class, which facilitates the retrieval of relevant information from our vector store to answer user queries.

### Instructions

- From the `langchain.chains` module, import `RetrievalQA`.
- From the `langchain_openai` package, import `ChatOpenAI`, `OpenAIEmbeddings`.
- From the `langchain.vectorstores` module, import `DocArrayInMemorySearch`.

In [17]:
# From the langchain.chains module, import RetrievalQA
from langchain.chains import RetrievalQA

# From the langchain_openai package, import ChatOpenAI, OpenAIEmbeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# From the langchain.vectorstores module, import DocArrayInMemorySearch
from langchain.vectorstores import DocArrayInMemorySearch

### Create an In-Memory Search Object

To create an in-memory search object, we will use the `DocArrayInMemorySearch` class. This class will allow us to search through the embeddings generated by the `OpenAIEmbeddings` class.

In this code:
- We first initialize the `OpenAIEmbeddings` object.
- Then, we create the `DocArrayInMemorySearch` object by calling the `from_documents` method with our documents (`docs`) and the initialized embeddings.

In [18]:
# Create a new DocArrayInMemorySearch instance from the specified documents and embeddings
db = DocArrayInMemorySearch.from_documents(
    docs, 
    OpenAIEmbeddings()
)

We will now create a retriever from the `db` we created in the last step. This enables the retrieval of the stored embeddings. Since we are also using the `ChatOpenAI` model, we will assign it as our LLM.

Recall that the temperature of an LLM refers to how random the results are. Setting the temperature to zero makes the results more repeatable.

- Convert the `DocArrayInMemorySearch` instance to a retriever. Assign to `retriever`.
- Create a new `ChatOpenAI` instance with a temperature of `0.0`. Assign to `llm`. 

In [19]:
# Convert the DocArrayInMemorySearch instance to a retriever
retriever = db.as_retriever()

# Create a new ChatOpenAI instance 
llm = ChatOpenAI(temperature = 0.0)
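
To illustrate why a lower temperature gives more repeatable results, the sketch below applies the textbook softmax-with-temperature formula to some invented scores. It is a conceptual illustration only, not a description of how the OpenAI API implements the setting; a temperature of exactly zero is conventionally treated as always picking the highest-scoring token, which the formula cannot express directly.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this formula")
    scaled = [score / temperature for score in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - largest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens
for temperature in (1.0, 0.5, 0.1):
    probabilities = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

As the temperature drops, almost all of the probability mass moves to the top-scoring candidate, so repeated sampling returns the same choice more often.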

### Create a new RetrievalQA instance from the chain type. 

- Set `llm` to the ChatOpenAI instance you just created.
- Set `chain_type` to `"stuff"`.
- Set `retriever` to the retriever you just created.
- Set `verbose` to `True`.

In [20]:
# RetrievalQA instance 
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm,            
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True       
)
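
The `"stuff"` chain type places all retrieved documents into a single prompt together with the question. The sketch below shows the general idea with plain strings. The prompt wording and the function name `build_stuff_prompt` are invented for illustration and are not LangChain's actual template.

```python
def build_stuff_prompt(documents, question):
    """Concatenate every retrieved document into one prompt with the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


retrieved = [
    "Splitting your data means creating a training set and a testing set.",
    "Training on the whole data set risks overfitting.",
]
prompt = build_stuff_prompt(retrieved, "What is this tutorial about?")
print(prompt)
```

Because everything goes into one prompt, this approach works well for a short transcript but can exceed the model's context window when many or very long documents are retrieved.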

## Queries 

Now we are ready to create queries about the YouTube video and read the responses from the LLM. This is done by first creating a query and then passing it to the `RetrievalQA` instance we set up in the last step.

### Ask GPT some questions about the transcript.

- Create question, `"What is this tutorial about?"`. Assign to `query`.
- Invoke the query through the RetrievalQA instance. Assign to `response`. 
- Print the response.

In [21]:
# Set the query to be used for the QA system
query = "What is this tutorial about?"

# Invoke the query through the RetrievalQA instance
response = qa_stuff.invoke(query)

# output
response



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What is this tutorial about?',
 'result': "This tutorial is about a data pre-processing technique for machine learning called splitting your data. It explains the importance of splitting your data set into a training set and a testing set, the problems that can arise if you don't do this, and when in your machine learning workflow you should split your data. The tutorial also provides a practical example using loan application data and demonstrates how to split the data into training and testing sets using Python."}

We can continue asking questions, including questions that we know the video does not answer, to see how the model responds.

In [22]:
# Set the query to be used for the QA system
query = "What is the difference between a training set and test set?"

# Invoke the query through the RetrievalQA instance and store the response
response = qa_stuff.invoke(query)

# Print the response to the console
response



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'What is the difference between a training set and test set?',
 'result': 'The training set is used to train the machine learning model, meaning the model learns patterns and relationships from this data. The test set, on the other hand, is used to evaluate the performance of the trained model on unseen data. The training set helps the model learn, while the test set helps assess how well the model generalizes to new data.'}

In [23]:
# Set the query to be used for the QA system
query = "Who should watch this lesson?"

# Invoke the query through the RetrievalQA instance and store the response
response = qa_stuff.invoke(query)

# Print the response to the console
response 



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Who should watch this lesson?',
 'result': 'This lesson on splitting data into training and testing sets is beneficial for individuals who are learning or working with machine learning models. It provides essential information on why data splitting is necessary, when to perform it in the machine learning workflow, and how to implement it using Python libraries like pandas and scikit-learn.'}

In [24]:
# Set the query to be used for the QA system
query = "Who is the greatest football team on earth?"

# Invoke the query through the RetrievalQA instance and store the response
response = qa_stuff.invoke(query)

# Print the response to the console
response 



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'Who is the greatest football team on earth?',
 'result': "I don't know the answer to that question as it is subjective and varies based on personal preferences and opinions."}

In [25]:
# Set the query to be used for the QA system
query = "How long is the circumference of the earth?"

# Invoke the query through the RetrievalQA instance and store the response
response = qa_stuff.invoke(query)

# Print the response to the console
response 



> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'How long is the circumference of the earth?',
 'result': "I don't know the exact length of the circumference of the Earth."}
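
The query cells above repeat the same three steps, so they could be folded into a small helper that asks several questions and keeps only the `result` field of each response. In this sketch, `EchoChain` is a stand-in with the same `invoke` interface so that the code runs without an API key; both `EchoChain` and `ask_all` are names invented for this example.

```python
class EchoChain:
    """Stand-in for the RetrievalQA instance; returns a canned response."""

    def invoke(self, query):
        return {"query": query, "result": f"(answer to: {query})"}


def ask_all(chain, queries):
    """Invoke the chain for each query and map the query to its result text."""
    answers = {}
    for query in queries:
        response = chain.invoke(query)
        answers[query] = response["result"]
    return answers


questions = [
    "What is this tutorial about?",
    "Who should watch this lesson?",
]
answers = ask_all(EchoChain(), questions)
for question, answer in answers.items():
    print(f"Q: {question}\nA: {answer}\n")
```

In the notebook, passing `qa_stuff` in place of `EchoChain()` would send each question to the real chain.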

## Conclusion of the Project

In this project, we implemented a question-answering (QA) system using a retrieval-based approach. The primary goal was to enable the system to answer questions about a YouTube video by retrieving relevant information from its Whisper-generated transcript.

### Logic Applied

1. **Query Formulation**: We formulated specific queries that we wanted the QA system to answer. These queries ranged from factual questions like "What is the difference between a training set and test set?" to more subjective ones like "Who is the greatest football team on earth?"

2. **RetrievalQA Instance**: We used a `RetrievalQA` instance, which takes each query, retrieves the most relevant text from the vector store by comparing embeddings, and passes that text to the LLM to generate an answer.

3. **Response Handling**: For each query, we called the `invoke` method of the `RetrievalQA` instance, which returned a dictionary containing the query and the result. This response was then displayed in the notebook for review.

### Major Advantages of This Workflow

- **Efficiency**: The retrieval-based approach allows for quick responses to queries by leveraging pre-existing data, making it suitable for real-time applications.

- **Scalability**: As the dataset grows, the system can scale to handle more queries without significant changes to the underlying architecture.

- **Flexibility**: The system can handle a wide range of queries about the video and, as the football and circumference examples show, it says that it doesn't know when the transcript does not contain the answer.

- **Ease of Use**: By abstracting the complexity of the retrieval process, the system provides a simple interface for users to interact with, requiring only the input of a query to obtain an answer.

Overall, this project showcases the potential of retrieval-based QA systems in providing accurate and efficient answers to diverse queries, highlighting their applicability in various domains.