# Knowledge Base

For this exercise, we want to build a Research Assistant that retrieves videos from YouTube, transcribes them, and then answers questions about the related topics using this human-generated knowledge.

Ready? 🌶️ 

<Note type="note">

We used OpenAI's Whisper model, which is not free (at least for now), to transcribe the YouTube videos. We have included the code for reference, but you do not need to run it; you can use the results directly.

Step 0 is included for reference. Start with Step I of the exercise, which is to retrieve the transcripts from a CSV file, split the documents, and load them into a vector database.

</Note>

## Step 0 - Exercise Setup 

For this exercise, you can pull the usual docker image:

```bash 
docker run -v $(pwd):/home/jovyan -p 8888:8888 jupyter/datascience-notebook
```

Then install the following packages:

In [1]:
# install package
%pip install -Uqq langchain-weaviate
%pip install langchain langchain_mistralai langchain_huggingface -q
%pip install -qU langchain-community beautifulsoup4 arxiv pymupdf
%pip install -qU weaviate-client
%pip install sentence-transformers -q 
%pip install transformers -q

Note: you may need to restart the kernel to use updated packages.


## Step 0b - Save all transcripts to a CSV file

If you want to know where the CSV file comes from, we scraped the data from YouTube, also using LangChain. We didn't ask students to do it themselves because it requires Whisper, which is a paid model.

<Note type="note">

The code below is purely for your information. If you want to start the exercise right away, go directly to Step I.

</Note>

```python
# Install Youtube dependencies
%pip install --upgrade --quiet  yt_dlp
%pip install --upgrade --quiet  pydub
%pip install --upgrade --quiet  librosa
%pip install openai --quiet

# Set OPENAI_API_KEY
%env OPENAI_API_KEY=REPLACE_WITH_YOUR_API_KEY
```

```python
import os
import pandas as pd 
from langchain_community.document_loaders.blob_loaders.youtube_audio import (
    YoutubeAudioLoader,
)
from langchain_community.document_loaders.generic import GenericLoader
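# Both parsers are imported because the `local` flag below selects one of them
from langchain_community.document_loaders.parsers.audio import (
    OpenAIWhisperParser,
    OpenAIWhisperParserLocal,
)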
# set a flag to switch between local and remote parsing
# change this to True if you want to use local parsing
local = False

# Scrape LLM videos
urls = [
    "https://youtu.be/zjkBMFhNj_g?si=Fl3oBi7tD6TA93a-", 
    "https://youtu.be/5sLYAQS9sWQ?si=0rcqcfZOdsXYkMPA",
    "https://youtu.be/zizonToFXDs?si=hR76h7x9UGSBr_lp",
    "https://youtu.be/T-D1OfcDW1M?si=3td68NnXV3uwx7pC",
    "https://youtu.be/9vM4p9NN0Ts?si=EaHt52VTJc4YYnSu",
    "https://youtu.be/wjZofJX0v4M?si=tK4IMk2y8L99ceyd",
]

# Directory to save audio files
save_dir = f"{os.getcwd()}/YouTube"

# Transcribe the videos to text
if local:
    loader = GenericLoader(
        YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParserLocal()
    )
else:
    loader = GenericLoader(YoutubeAudioLoader(urls, save_dir), OpenAIWhisperParser())
docs = loader.load()


# This part of the code will decode the docs variable and load
# each part of the metadata and content into separate columns 
# of a pandas dataframe
sources = [doc.metadata["source"] for doc in docs]
chunk_index = [doc.metadata["chunk"] for doc in docs]
content = [doc.page_content for doc in docs]


df = pd.DataFrame({
    "sources": sources,
    "chunk_id": chunk_index,
    "transcript": content
})

df["sources"] = df["sources"].apply(lambda x: x.split("/")[-1])
df.to_csv("LLM_transcripts.csv")
df.head()
```

In [10]:
df = pd.read_csv("https://full-stack-assets.s3.eu-west-3.amazonaws.com/LLM_transcripts.csv").iloc[:, 1:]
df.head()

| | sources | chunk_id | transcript |
|---|---|---|---|
| 0 | Large Language Models (LLMs) - Everything You ... | 0 | This video is going to give you everything you... |
| 1 | Large Language Models (LLMs) - Everything You ... | 1 | is hallucinations, which means that they somet... |
| 2 | ＂okay, but I want Llama 3 for my specific use ... | 0 | My name is David Andrzej and in this video, I'... |
| 3 | ＂okay, but I want Llama 3 for my specific use ... | 1 | It's probably the main two things it's used fo... |
| 4 | A Practical Introduction to Large Language Mod... | 0 | Hey everyone, I'm Shaw and I'm back with a new... |


## Step I - Load documents 

Alright, let's start now with the real stuff 😉 For the first step, you will need to load the CSV file as LangChain Documents. Feel free to check the LangChain documentation to pick your preferred document loader:

* [Document Loaders](https://python.langchain.com/docs/integrations/document_loaders/)

You can download the CSV file here: 

* [Transcripts](https://full-stack-assets.s3.eu-west-3.amazonaws.com/LLM_transcripts.csv)

Load the csv file using the link provided, use an appropriate data loader and display the first five elements.
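
To make the goal of this step concrete, here is a minimal sketch of what a row-based loader does conceptually: one column becomes the document text and the remaining columns become metadata. `SimpleDocument` and `rows_to_documents` are hypothetical names used only for illustration (they are not LangChain APIs), and the rows are made-up placeholders, not the real transcripts.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, content_column):
    """Turn each row (a dict) into a document.

    The value of `content_column` becomes the text; every other
    column is kept as metadata.
    """
    documents = []
    for row in rows:
        metadata = {key: value for key, value in row.items() if key != content_column}
        documents.append(
            SimpleDocument(page_content=row[content_column], metadata=metadata)
        )
    return documents


# Made-up rows that mimic the column names of the transcript CSV
rows = [
    {"sources": "video_a.m4a", "chunk_id": 0, "transcript": "First part of video A."},
    {"sources": "video_a.m4a", "chunk_id": 1, "transcript": "Second part of video A."},
]

documents = rows_to_documents(rows, content_column="transcript")
print(documents[0])
```

Keeping `sources` and `chunk_id` as metadata means every chunk can later be traced back to the video it came from.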

In [1]:
import pandas as pd 
from langchain_community.document_loaders import DataFrameLoader

df = pd.read_csv("https://full-stack-assets.s3.eu-west-3.amazonaws.com/LLM_transcripts.csv").iloc[:, 1:]

loader = DataFrameLoader(df, page_content_column="transcript")

docs = loader.load()
docs[:5]

[Document(metadata={'sources': 'Large Language Models (LLMs) - Everything You NEED To Know.m4a', 'chunk_id': 0}, page_content="This video is going to give you everything you need to go from knowing absolutely nothing about artificial intelligence and large language models to having a solid foundation of how these revolutionary technologies work. Over the past year, artificial intelligence has completely changed the world, with products like ChatGPT potentially appending every single industry and how people interact with technology in general. And in this video, I will be focusing on LLMs, how they work, ethical considerations, applications, and so much more. And this video was created in collaboration with an incredible program called AI Camp, in which high school students learn all about artificial intelligence. And I'll talk more about that later in the video. Let's go. So first, what is an LLM? Is it different from AI? And how is ChatGPT related to all of this? LLMs stand for large 

## Step II - Split texts

Now you will need to split your transcripts into chunks of text, here are the steps:
1. Use a tokenizer to tokenize the texts in the dataset
2. Use a text splitter to create chunks of text
3. Display the results
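
The idea behind token-based chunking can be sketched without any library. The example below is a deliberately simplified illustration: it uses whitespace splitting as a stand-in tokenizer and cuts fixed-size windows with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter recursively splits on separators and only uses the tokenizer to measure length); `chunk_tokens` and its parameters are hypothetical names chosen for this sketch.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most `chunk_size` tokens.

    Consecutive windows share `chunk_overlap` tokens so that context
    is not lost at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window reaches the end of the token list
        if start + chunk_size >= len(tokens):
            break
    return chunks


text = "large language models are neural networks trained on massive amounts of text"
tokens = text.split()  # whitespace "tokenizer": an assumption, not a real model tokenizer

chunks = chunk_tokens(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

Measuring chunk size in tokens rather than characters is useful because embedding models and LLMs have limits expressed in tokens.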

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

# Here we use a pretrained tokenizer from Hugging Face so that chunk length
# is measured in model tokens rather than in characters
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-cased")

# Instantiate a splitter
# There are plenty of different splitters; see the LangChain documentation to learn more
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer)

# Now create the splits
splitted_docs = splitter.split_documents(docs)

# Compare the number of documents before and after splitting
print(f"Nb documents : {len(docs)}")
print(f"Nb splitted documents : {len(splitted_docs)}")

Nb documents : 21
Nb splitted documents : 34


## Step III - Load into Weaviate

Finally, load all the split documents into a Weaviate database. You can use either Weaviate Cloud or a Weaviate Docker container.
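
Since either deployment works, one way to keep a notebook portable is to decide the target from the environment. The sketch below is only an assumption about how you might organize this: `resolve_weaviate_target` is a hypothetical helper (not part of the Weaviate client), it only returns settings and does not open a connection, and the URL and key in the example are placeholders. The variable names `WEAVIATE_URL` and `WEAVIATE_API_KEY` and the ports 8080 and 50051 are the ones used in this exercise.

```python
def resolve_weaviate_target(env):
    """Pick cloud or local settings from a mapping of environment variables.

    Hypothetical helper: returns a plain dict of settings and never
    contacts Weaviate itself.
    """
    url = env.get("WEAVIATE_URL")
    api_key = env.get("WEAVIATE_API_KEY")

    if url and api_key:
        return {"mode": "cloud", "cluster_url": url, "api_key": api_key}
    if url or api_key:
        # Only one of the two is set: fail early with a clear message
        raise ValueError("Set both WEAVIATE_URL and WEAVIATE_API_KEY, or neither")
    # Nothing set: fall back to a local Docker container
    return {"mode": "local", "port": 8080, "grpc_port": 50051}


# Placeholder values for illustration only
example_env = {
    "WEAVIATE_URL": "https://example.invalid",
    "WEAVIATE_API_KEY": "placeholder-key",
}

print(resolve_weaviate_target(example_env)["mode"])
print(resolve_weaviate_target({})["mode"])
```

In a notebook you could pass `os.environ` after calling `load_dotenv()`, then use the returned settings to choose between the cloud and local connection calls.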

In [3]:
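# The active code below connects to Weaviate Cloud.
# We have found a local Weaviate Docker container more reliable, since Weaviate
# Cloud is still under heavy development as of today. To use it, swap in the
# commented-out connect_to_local block further down.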
from dotenv import load_dotenv
import weaviate
from weaviate.classes.init import Auth
import os

load_dotenv()

weaviate_url = os.environ["WEAVIATE_URL"]
weaviate_api_key = os.environ["WEAVIATE_API_KEY"]

# Connect to Weaviate Cloud
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,
    auth_credentials=Auth.api_key(weaviate_api_key),
)

# client = weaviate.connect_to_local(
#     # host="host.docker.internal",  # Use host.docker.internal if you are running it inside a docker container
#     port=8080,
#     grpc_port=50051,
# )

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()

from langchain_weaviate.vectorstores import WeaviateVectorStore

# Now we can load our documents into our Database 
# Depending on the amount of data 
# The time necessary to execute the cell will vary
vectorstore = WeaviateVectorStore.from_documents(
    splitted_docs, 
    embeddings, 
    client=client, 
    by_text=False, 
    tenant="knowledge_base_llm"
)

vectorstore

2025-May-16 03:54 PM - langchain_weaviate.vectorstores - INFO - Tenant knowledge_base_llm does not exist in index LangChain_5bc7e27ecd0747218db36fbf82ce55b8. Creating tenant.


<langchain_weaviate.vectorstores.WeaviateVectorStore at 0x17b6cb740>

## Step IV - Query your DB

Alright now let's verify that everything works by querying our database!
Query the database by asking about LLMs.
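
Before running the query, it can help to see what a similarity search does in principle: the query is embedded into a vector, and stored vectors are ranked by how close they are to it. The sketch below is a toy illustration under stated assumptions: the three-dimensional vectors are invented by hand (real embeddings have hundreds of dimensions), cosine similarity is used as the metric, and the search is an exhaustive scan, whereas a real vector database typically relies on an approximate index. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made toy vectors, not real embeddings
toy_store = [
    ("LLMs are neural networks trained on text", [0.9, 0.1, 0.0]),
    ("How to fine-tune Llama 3", [0.6, 0.7, 0.1]),
    ("Subscribe to the channel", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]  # pretend embedding of "What's an LLM ?"

results = top_k(toy_query, toy_store, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The `k` argument here plays the same role as `k=5` in the query cell: it caps how many of the closest chunks are returned.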

In [4]:
query = "What's an LLM ?"

docs = vectorstore.similarity_search(
    query, 
    k=5,
    tenant="knowledge_base_llm"
)

# Print the full content of each result
for i, doc in enumerate(docs):
    print(f"\n## DOCUMENT {i+1}:\n")
    print(doc.page_content)


## DOCUMENT 1:

This video is going to give you everything you need to go from knowing absolutely nothing about artificial intelligence and large language models to having a solid foundation of how these revolutionary technologies work. Over the past year, artificial intelligence has completely changed the world, with products like ChatGPT potentially appending every single industry and how people interact with technology in general. And in this video, I will be focusing on LLMs, how they work, ethical considerations, applications, and so much more. And this video was created in collaboration with an incredible program called AI Camp, in which high school students learn all about artificial intelligence. And I'll talk more about that later in the video. Let's go. So first, what is an LLM? Is it different from AI? And how is ChatGPT related to all of this? LLMs stand for large language models, which is a type of neural network that's trained on massive amounts of text data. It's genera

Finally, close the Weaviate connection:

In [5]:
client.close()