<a href="https://colab.research.google.com/github/ritzfy/yt-comment-rag/blob/main/youtube_comment_rag_using_phi_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -q jq langchain https://github.com/egbertbouman/youtube-comment-downloader/archive/master.zip


  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


In [None]:
!youtube-comment-downloader --url https://www.youtube.com/watch?v=T-D1OfcDW1M --output IBM_RAG.json

Downloading Youtube comments for https://www.youtube.com/watch?v=T-D1OfcDW1M
Downloaded 335 comment(s)
[11.89 seconds] Done!


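The downloader writes a stream of separate JSON objects, one per line, rather than a single JSON document, so Python's `json.load` cannot parse the file as a whole. We therefore parse it line by line, deduplicate the comments by their `cid`, and dump the result as one JSON array into a new file, `IBM_RAG_fixed.json`.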

In [None]:
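import json
from collections import OrderedDict

with open("IBM_RAG.json", "r") as f_in, open("IBM_RAG_fixed.json", "w") as f_out:
    # Key by comment ID so duplicate comments collapse into one entry.
    data = OrderedDict()
    for line in f_in:
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        data[obj["cid"]] = obj
    # Write all comments as a single JSON array.
    json.dump(list(data.values()), f_out, indent=4)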

!head IBM_RAG_fixed.json -n 20

[
    {
        "cid": "Ugxg3KAFt9zS3hNw0eN4AaABAg",
        "text": "is she writing backwards?",
        "time": "19 uur geleden",
        "author": "@GenesisSoon",
        "channel": "UCWSgqBddRzQh6hVMnvphXcg",
        "votes": "0",
        "replies": "",
        "photo": "https://yt3.ggpht.com/ytc/AIdro_mwP_Tc9vGB_VzWNEfagp0SgQERmNs7w4hfU5XTzLPMwA8=s88-c-k-c0x00ffffff-no-rj",
        "heart": false,
        "reply": false,
        "time_parsed": 1713986389.852828
    },
    {
        "cid": "UgxgN7eTpTfmqL3krZJ4AaABAg",
        "text": "This is a bit confusing or I might just be naive but I thought LLM  leverages its generative model capability by scrapping and crawling the web (Internet) to give responses already so what is the point of the retrieval Aug generation when it already does the same thing. Is the difference just  real-time vs pre-trained )Non real-time?",
        "time": "1 dag geleden",
        "author": "@idan007-sp3pk",
        "channel": "UCsI04xiduHIUz9NfaYKGWWw",
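
If the downloader output ever arrives as JSON objects run together without reliable newlines, line-by-line parsing would fail. A minimal sketch of a more tolerant parser, using the standard library's `json.JSONDecoder.raw_decode`, is shown below. The helper name `iter_json_objects` and the sample string are made up for illustration; only the `cid` and `text` keys come from the notebook.

```python
import json

def iter_json_objects(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Hypothetical sample: two objects with no separator, then a duplicate on a new line.
sample = '{"cid": "a", "text": "first"}{"cid": "b", "text": "second"}\n{"cid": "a", "text": "first"}'

# Deduplicate by comment ID, as the cell above does.
unique = {obj["cid"]: obj for obj in iter_json_objects(sample)}
print(list(unique))
```

This handles both newline-delimited and fully concatenated input, at the cost of reading the whole file into memory first.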


In [None]:
from langchain_community.document_loaders import JSONLoader

file_path = "IBM_RAG_fixed.json"
loader = JSONLoader(file_path=file_path, jq_schema=".[] |.text")

data = loader.load()

print(data)

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_data = splitter.split_documents(data)

In [None]:
!pip install -q transformers sentence-transformers faiss-gpu

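`HuggingFaceEmbeddings` fails to load embedding models that ship their own custom model code (such as `Alibaba-NLP/gte-large-en-v1.5`) unless you explicitly agree to trust and run that code. Passing `trust_remote_code=True` through `model_kwargs` resolves this.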

[link to solution of trust_remote_code problem](https://github.com/langchain-ai/langchain/issues/6080)

In [None]:
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings

# The model ships custom code, so loading it requires opting in explicitly.
model_kwargs = {'trust_remote_code': True}
embeddings = HuggingFaceEmbeddings(model_name='Alibaba-NLP/gte-large-en-v1.5', model_kwargs=model_kwargs)

# Embed every chunk and build the FAISS index.
db = FAISS.from_documents(chunked_data, embeddings)

A retriever is an object that searches the FAISS index for documents similar to a given query.

The `as_retriever` method takes the arguments:

- `search_type="similarity"`: perform a similarity search, that is, return the documents most similar to the given query.
- `search_kwargs={"k": 4}`: the number of most similar documents to return for the query.

In [None]:
retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 4})
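
To make the idea of a top-`k` similarity search concrete, here is a small pure-Python sketch over toy 2-D vectors. It is only an illustration of the concept, not how FAISS is implemented; the function name `top_k_similar`, the toy vectors, and the choice of squared Euclidean distance are all assumptions, and the index in this notebook may use a different metric.

```python
import heapq

def top_k_similar(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors closest to query_vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Pair each distance with its document index, then keep the k smallest.
    scored = [(sq_dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    return [idx for _, idx in heapq.nsmallest(k, scored)]

# Toy "embeddings"; real ones have hundreds of dimensions.
toy_docs = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [-1.0, 0.0]]
nearest = top_k_similar([1.0, 0.0], toy_docs, k=2)
print(nearest)
```

A larger `k` gives the LLM more context to draw on but also lengthens the prompt, which matters for a model with a 4k-token window.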

# Load model

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


With the model loaded, we can:
- Create the text-generation pipeline using the Hugging Face Transformers library
- Define the prompt template and instantiate it
- Chain the prompt template, the pipeline-backed LLM, and an output parser

In [None]:
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

<|end|>
<|user|>
{question}
<|end|>
<|assistant|>

 """

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()

In [None]:
from langchain_core.runnables import RunnablePassthrough

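# Note: this re-creates the retriever using as_retriever's default settings.
retriever = db.as_retriever()

# The retriever fills {context}; the question string is passed through unchanged to {question}.
rag_chain = {"context": retriever, "question": RunnablePassthrough()} | llm_chain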
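
As written, the chain hands the retriever's list of `Document` objects straight to the prompt, so they are rendered via their string representation, metadata included. A common alternative is to join only the text of each document. The sketch below shows such a helper; `format_docs` is a hypothetical name, and `SimpleNamespace` stands in for LangChain's `Document` so the example runs on its own.

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join the text of the retrieved documents, one per paragraph."""
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for LangChain Document objects; only page_content is used.
fake_docs = [
    SimpleNamespace(page_content="So, RAG = giving the LLM data to work with"),
    SimpleNamespace(page_content="Is RAG used at inference time or training time?"),
]
context = format_docs(fake_docs)
print(context)
```

Assuming the installed LangChain version coerces plain functions when they are piped after a runnable, the helper could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()} | llm_chain`.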

First test without context

In [None]:
question = "RAG combines the generative power of LLMs with the precision of?"
llm_chain.invoke({"context": "", "question": question})

'\n<|system|>\nAnswer the question based on your knowledge. Use the following context to help:\n\n\n\n</s>\n<|user|>\nRAG combines the generative power of LLMs with the precision of?\n</s>\n<|assistant|>\n\n  RAG, which stands for "Reinforced Adversarial Generation," combines the generative power of Large Language Models (LLMs) like GPT-3 or BERT with the precision of adversarial training techniques. In this approach, an LLM is used as a base model that generates text outputs. Then, through adversarial training, these outputs are refined by introducing challenges and feedback loops where another system attempts to identify weaknesses or inaccuracies within the generated content. The original LLM then adjusts its parameters to improve accuracy and reduce vulnerabilities, resulting in more precise and reliable output. This method leverages both the creativity and breadth of language models along with targeted improvements from adversarial processes.'
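
The output above repeats the whole prompt before the answer because the pipeline was created with `return_full_text=True`. Setting that parameter to `False` should return only the completion. Alternatively, the answer can be cut out afterwards, as in this minimal sketch. The helper name `extract_answer` and the sample string are made up for illustration; the `<|assistant|>` marker is the one used in the prompt template.

```python
def extract_answer(generated, marker="<|assistant|>"):
    """Return the text after the last assistant marker, or all of it if absent."""
    _, found, tail = generated.rpartition(marker)
    return tail.strip() if found else generated.strip()

# Hypothetical generation in the same shape as the notebook's output.
raw = (
    "\n<|system|>\nAnswer the question.\n</s>\n"
    "<|user|>\nWhat is RAG?\n</s>\n"
    "<|assistant|>\n\n  RAG retrieves documents first."
)
answer = extract_answer(raw)
print(answer)
```

Splitting on the last marker keeps the helper safe even if a retrieved comment happens to contain the marker text itself.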

Testing with context

In [None]:
rag_chain.invoke(question)

'\n<|system|>\nAnswer the question based on your knowledge. Use the following context to help:\n\n[Document(page_content=\'RAG combines the generative power of LLMs with the precision of specialized data search mechanisms, resulting in nuanced and contextually relevant responses.\', metadata={\'source\': \'/content/IBM_RAG_fixed.json\', \'seq_num\': 11}), Document(page_content="Are there any simplified LLMs for RAG? They don\'t need to have all the information, so they can be much smaller and faster.", metadata={\'source\': \'/content/IBM_RAG_fixed.json\', \'seq_num\': 161}), Document(page_content=\'Is RAG used at inference time or training time? It seems to be at inference time\', metadata={\'source\': \'/content/IBM_RAG_fixed.json\', \'seq_num\': 239}), Document(page_content=\'So, RAG = giving the LLM data to work with\', metadata={\'source\': \'/content/IBM_RAG_fixed.json\', \'seq_num\': 103})]\n\n</s>\n<|user|>\nRAG combines the generative power of LLMs with the precision of?\n</s>