<a href="https://colab.research.google.com/github/vilsonrodrigues/youtube-retrieval-qa/blob/main/notebooks/YoutubeRetrievalQA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install langchain openai youtube-transcript-api faiss-cpu tiktoken

In [37]:
import os
os.environ["OPENAI_API_KEY"] = 'sk-...'

Load the document

In [5]:
from langchain.document_loaders import YoutubeLoader

In [8]:
## load transcripts
video_url = 'https://www.youtube.com/watch?v=ibNCc74ni1c'
loader = YoutubeLoader.from_youtube_url(video_url, add_video_info=False)
data = loader.load()

In [12]:
data



In [15]:
print(f"You have {len(data)} document")
print(f"You have {len(data[0].page_content)} characters in that document")

You have 1 document
You have 190661 characters in that document
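
A transcript of this size cannot be sent to the model in a single request, which is why the following steps split it into chunks. The sketch below gives a rough estimate. It assumes OpenAI's rule of thumb of about 4 characters per English token and a 4,096-token context window for the original gpt-3.5-turbo. Both numbers are assumptions rather than measurements, and `tiktoken` would give an exact count.

```python
# Rough token estimate for the transcript loaded above.
# Assumptions: ~4 characters per token (a rule of thumb, not an exact count)
# and a 4,096-token context window for the original gpt-3.5-turbo.
CHARS_PER_TOKEN = 4
CONTEXT_WINDOW_TOKENS = 4096


def estimate_tokens(num_characters: int) -> int:
    """Return an approximate token count for a text of the given length."""
    return num_characters // CHARS_PER_TOKEN


transcript_chars = 190661  # length printed by the cell above
approx_tokens = estimate_tokens(transcript_chars)
fits = approx_tokens <= CONTEXT_WINDOW_TOKENS
print(f"~{approx_tokens:,} tokens; fits in one request: {fits}")
```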


Insert into the vector store

In [17]:
# Splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
# The embedding engine that will convert our text to vectors
from langchain.embeddings.openai import OpenAIEmbeddings
# The vectorstore we'll be using
from langchain.vectorstores import FAISS

Split into chunks

In [19]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=400)
docs = text_splitter.split_documents(data)

In [21]:
# Get the total number of characters so we can see the average later
num_total_characters = sum(len(x.page_content) for x in docs)

print(f"Now you have {len(docs)} documents that have an average of {num_total_characters / len(docs):,.0f} characters (smaller pieces)")

Now you have 74 documents that have an average of 2,969 characters (smaller pieces)
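
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, not LangChain's algorithm. The real `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines and spaces, so its chunks vary in length and are at most `chunk_size` characters long.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows.

    Consecutive windows share `chunk_overlap` characters, so a sentence cut
    at a chunk boundary still appears whole in one of the two chunks.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each printed chunk starts with the last four characters of the previous one, which is the overlap.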


Generate embeddings and ingest them into the vector store

In [38]:
# Get your embeddings engine ready
embeddings = OpenAIEmbeddings()

# Embed your documents and combine with the raw text in a pseudo db. Note: This will make an API call to OpenAI
docsearch = FAISS.from_documents(docs, embeddings)

Set retriever configs

In [39]:
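retriever = docsearch.as_retriever()
# Return the 4 most similar chunks for each query.
# 'distance_metric' is a Deep Lake search option; the LangChain FAISS store
# fixes its metric when the index is built, so it is not set here.
retriever.search_kwargs['k'] = 4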
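
The retriever returns the `k` chunks whose embeddings are closest to the embedding of the query. Below is a minimal pure-Python sketch of that idea using cosine similarity. The three-dimensional vectors and chunk names are made up for illustration, and FAISS itself relies on optimized index structures rather than a loop like this.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_vectors, k):
    """Return the names of the k vectors most similar to query_vector."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in indexed_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Toy "index": hypothetical chunk names mapped to made-up embeddings.
toy_index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.2, 0.0]
print(top_k(toy_query, toy_index, k=2))
```

For unit-length vectors, ranking by cosine similarity and ranking by Euclidean distance give the same order, so the choice of metric matters less when the embeddings are normalized.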

Create your retrieval engine

Chat models such as gpt-3.5-turbo are cheaper per token than the older completion models, so we use one here

In [29]:
from langchain.chat_models import ChatOpenAI

In [31]:
chat_model = ChatOpenAI(temperature=0, model='gpt-3.5-turbo')

In [40]:
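# The LangChain chain that retrieves the relevant chunks and answers the question from them
from langchain.chains import RetrievalQA
# Pass the retriever configured above so that its search settings are actually used
qa = RetrievalQA.from_chain_type(llm=chat_model, chain_type="stuff", retriever=retriever)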
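
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. Here is a minimal sketch of that idea. `build_stuff_prompt` and the template wording are hypothetical, not the prompt LangChain actually sends.

```python
def build_stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Concatenate all retrieved chunks and the question into one prompt."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for what the retriever would return.
toy_chunks = ["First retrieved passage.", "Second retrieved passage."]
prompt = build_stuff_prompt("What is discussed?", toy_chunks)
print(prompt)
```

Because everything goes into one request, the retrieved text has to fit in the model's context window. Four chunks of about 3,000 characters each come to roughly 12,000 characters, or around 3,000 tokens at the rule of thumb of 4 characters per token.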

In [43]:
query = "What does the author describe about Vladimir Putin?"
qa.run(query)

'The author mentions that a few weeks ago, Vladimir Putin said that whoever controls AI will control the world.'

Try a conversational retrieval pipeline using Memory

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

In [None]:
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(chat_model, retriever=docsearch.as_retriever(), memory=memory)

In [None]:
question = "your question"  # replace with your own question
result = qa({"question": question})
result["answer"]
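
With memory attached, each question and answer is stored, and later questions can refer back to earlier turns. The sketch below shows only the buffering idea. `ToyBufferMemory` is a hypothetical stand-in, not LangChain's `ConversationBufferMemory`, and the questions and answers are made up.

```python
class ToyBufferMemory:
    """Keeps every (question, answer) pair in order, like a conversation buffer."""

    def __init__(self):
        self.turns = []

    def save_turn(self, question: str, answer: str) -> None:
        """Append one completed exchange to the history."""
        self.turns.append((question, answer))

    def as_text(self) -> str:
        """Render the whole history as text that could be prepended to a prompt."""
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


toy_memory = ToyBufferMemory()
toy_memory.save_turn("Who is mentioned?", "A speaker is mentioned.")
toy_memory.save_turn("What did they say?", "They made a statement.")
print(toy_memory.as_text())
```

A buffer like this grows with every turn, so long conversations eventually compete with the retrieved chunks for space in the context window.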