# RAG

## Requirements

In [1]:
%%capture
!pip install transformers accelerate bitsandbytes langchain langchain-community sentence-transformers faiss-gpu pandas gdown

## Dataset

In [2]:
!gdown --fuzzy https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=ee13769f-40be-449d-9834-c73fdd518996
To: /content/IMDB_crawled.json
100% 292M/292M [00:02<00:00, 125MB/s] 


## Config

In [3]:
class Config:
    EMBEDDING_MODEL_NAME="thenlper/gte-base"
    LLM_MODEL_NAME="HuggingFaceH4/zephyr-7b-beta"
    K = 5 # top K retrieval

## Preprocessing

In [4]:
import pandas as pd

df = pd.read_json('IMDB_crawled.json')

In [8]:
import os

os.makedirs('data', exist_ok=True)

# Preprocess the data and keep only the fields you need, because the embedding model's context window is limited.

df.to_csv('data/imdb.csv', index=False)
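One possible way to do the preprocessing is sketched below. The field names (`title`, `genres`, `summary`) and the `compact_record` helper are hypothetical, since the actual columns of `IMDB_crawled.json` are not listed here; adjust them to the real schema.

```python
def compact_record(record, fields, max_chars=1000):
    """Keep only `fields` and truncate each value to `max_chars` characters."""
    compact = {}
    for field in fields:
        value = record.get(field)
        if value is None:
            continue  # skip missing fields instead of storing "None"
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)  # flatten list-valued columns
        compact[field] = str(value)[:max_chars]
    return compact


# Hypothetical column names; replace with the columns present in the dataset.
FIELDS = ["title", "genres", "summary"]

example = {
    "title": "Goodfellas",
    "genres": ["Crime", "Drama"],
    "summary": "x" * 5000,
    "id": "tt-example",  # dropped because it is not in FIELDS
}
print(compact_record(example, FIELDS, max_chars=20))

# With pandas, the same helper could be applied row by row, for example:
# df = pd.DataFrame([compact_record(r, FIELDS) for r in df.to_dict("records")])
```

Truncating by characters is only a rough proxy for the token limit of the embedding model; a tokenizer-based cut would be more precise.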

## Vectorizer

Load the CSV file and vectorize the rows using HuggingFaceEmbeddings.
Store the results in a FAISS vectorstore.
Save the vectorstore in a pickle file for future use.

In [9]:
import pickle

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.vectorstores.faiss import FAISS

from langchain_community.embeddings import HuggingFaceEmbeddings

# load the csv

# load the embeddings model

# embed the documents with the model and store them in a vectorstore

# with open("data/vectorstore.pkl", "wb") as f:
#     pickle.dump(vectorstore, f)
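A minimal sketch of the three steps (load, embed, index), assuming the `langchain_community` import paths and cosine distance over normalized embeddings. The `build_vectorstore` helper name and these settings are assumptions for illustration, not part of the original notebook.

```python
def build_vectorstore(csv_path, embedding_model_name):
    """Load a CSV, embed each row, and index the vectors with FAISS."""
    # Local imports: the heavy libraries are only needed when the function runs.
    from langchain_community.document_loaders.csv_loader import CSVLoader
    from langchain_community.embeddings import HuggingFaceEmbeddings
    from langchain_community.vectorstores.faiss import FAISS
    from langchain_community.vectorstores.utils import DistanceStrategy

    # CSVLoader produces one Document per CSV row.
    documents = CSVLoader(file_path=csv_path).load()

    # Normalizing the embeddings is an assumption that pairs with cosine distance.
    embeddings = HuggingFaceEmbeddings(
        model_name=embedding_model_name,
        encode_kwargs={"normalize_embeddings": True},
    )

    return FAISS.from_documents(
        documents, embeddings, distance_strategy=DistanceStrategy.COSINE
    )


# Usage in this notebook (not executed here):
# vectorstore = build_vectorstore("data/imdb.csv", Config.EMBEDDING_MODEL_NAME)
```

Embedding the full dataset can take a while, which is why the notebook caches the result on disk.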



Load the vectorstore as a retriever.

In [10]:
# with open("data/vectorstore.pkl", "rb") as f:
#     vectorstore = pickle.load(f)

# load the retriever from the vectorstore
retriever = None
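With a LangChain vectorstore, the retriever is typically obtained with `vectorstore.as_retriever(search_kwargs={"k": Config.K})`. To illustrate what top-K retrieval does conceptually, here is a small pure-Python sketch with toy vectors; the `cosine` and `top_k` helpers are hypothetical and stand in for the work FAISS does at scale.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k documents most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy 2-d "embeddings": documents 0 and 1 point roughly the same way as the query.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(top_k([1.0, 0.0], docs, k=2))  # prints [0, 1]
```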

## LLM

Load the quantized LLM.

In [11]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline

# load the quantization config
bnb_config = None

model = AutoModelForCausalLM.from_pretrained(Config.LLM_MODEL_NAME, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)

# init the pipeline
READER_LLM = None

llm = HuggingFacePipeline(
    pipeline=READER_LLM,
)
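A sketch of how the quantization config and the generation pipeline could be filled in, assuming 4-bit NF4 quantization through `bitsandbytes`. The `build_reader` helper and all numeric settings are illustrative assumptions rather than values from the notebook.

```python
# Generation settings are illustrative assumptions; tune them for your use case.
GENERATION_KWARGS = {
    "task": "text-generation",
    "do_sample": True,
    "temperature": 0.2,
    "repetition_penalty": 1.1,
    "max_new_tokens": 500,
}


def build_reader(model_name):
    """Load a 4-bit quantized causal LM and wrap it in a text-generation pipeline."""
    # Local imports: torch and transformers are only needed when the function runs.
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
        pipeline,
    )

    # 4-bit NF4 with double quantization, a common choice for 7B models on one GPU.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name, quantization_config=bnb_config, device_map="cuda:0"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return pipeline(model=model, tokenizer=tokenizer, **GENERATION_KWARGS)


# Usage in this notebook (not executed here):
# READER_LLM = build_reader(Config.LLM_MODEL_NAME)
```

Whether the generated text echoes the prompt can depend on the installed `transformers` and LangChain versions; if it does, passing `return_full_text=False` to the pipeline is one option to try.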



Initialize the prompt template for the query chain. The query chain condenses the chat history into a single search query for the retriever. You may change the prompt to get better results.

In [35]:
from langchain.prompts import PromptTemplate

from langchain_core.output_parsers import StrOutputParser

class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # process the LLM output
        print(f"QUERY: {text}")
        return text

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""<|system|>You are a helpful assistant.
{messages}
<|user|>
give me the search query about the above conversation.
<|assistant|>"""
)

# init the query chain
query_transforming_retriever_chain = None
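LangChain runnables can be composed with the `|` operator, so one way to build the query chain is to pipe the prompt into the LLM, parse the output to a string, and feed that string to the retriever. The `build_query_chain` helper below is a hypothetical wrapper that shows only the composition order.

```python
def build_query_chain(prompt, llm, parser, retriever):
    """Compose prompt -> LLM -> output parser -> retriever with the `|` operator.

    Each argument is expected to be a LangChain runnable (or any object that
    supports `|` composition in the same way).
    """
    return prompt | llm | parser | retriever


# Usage in this notebook (not executed here):
# query_transforming_retriever_chain = build_query_chain(
#     query_transform_prompt, llm, LoggerStrOutputParser(), retriever
# )
```

The resulting chain takes a dict with a `messages` key and returns the documents retrieved for the generated search query.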

Initialize the main retrieval chain, which passes the retrieved documents to the LLM and returns its answer.

In [36]:
from langchain.chains.combine_documents import create_stuff_documents_chain

from langchain_core.runnables import RunnablePassthrough

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""<|system|>You are a helpful assistant.

Here are the movies you MUST choose from:

{context}
-----------------
{messages}
<|assistant|>""")

# init the retrieval chain
retrieval_chain = None
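One way to wire this up, sketched under the assumption that the query chain returns a list of documents: `RunnablePassthrough.assign` adds the retrieved documents under the `context` key, and `create_stuff_documents_chain` inserts them into the prompt before calling the LLM. The `build_retrieval_chain` helper is a hypothetical name.

```python
def build_retrieval_chain(query_chain, llm, prompt):
    """Retrieve documents with `query_chain`, then answer with `llm` using `prompt`."""
    # Local imports: LangChain is only needed when the function runs.
    from langchain.chains.combine_documents import create_stuff_documents_chain
    from langchain_core.runnables import RunnablePassthrough

    # Formats the documents in `context` into the prompt and calls the LLM.
    document_chain = create_stuff_documents_chain(llm, prompt)

    # Keep the original input and add `context` computed by the query chain.
    return RunnablePassthrough.assign(context=query_chain) | document_chain


# Usage in this notebook (not executed here):
# retrieval_chain = build_retrieval_chain(
#     query_transforming_retriever_chain, llm, prompt
# )
# answer = retrieval_chain.invoke({"messages": "<|user|>\ngive me a cool gangster movie"})
```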

Write the conversation helper class for easier testing.

In [37]:
class Conversation:
    def __init__(self):
        self.messages = []

    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages with the roles in the instruction format
        pass

    def chat(self, message):
        self.add_user_message(message)
        messages = self.get_messages()
        # invoke the chain
        response = None
        self.add_assistant_message(response)
        return response
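A possible body for `get_messages` is sketched below as a standalone function. It assumes the `<|role|>` tags used in the prompt templates and `</s>` as the end-of-turn marker, which is an assumption about the model's chat format; `format_messages` is a hypothetical name.

```python
def format_messages(messages):
    """Render (role, text) pairs as '<|role|>' blocks, one turn per block."""
    return "\n".join(f"<|{role}|>\n{text}</s>" for role, text in messages)


history = [
    ("user", "give me a cool gangster movie"),
    ("assistant", "Goodfellas (1990)"),
    ("user", "give me a newer one"),
]
print(format_messages(history))

# Inside the Conversation class this could be used as:
#     def get_messages(self):
#         return format_messages(self.messages)
# and the chain invoked in chat() as:
#     response = retrieval_chain.invoke({"messages": messages})
```

Because both prompt templates end with `<|assistant|>`, the formatted history only needs to contain the turns so far, with the latest user message last.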

## Test

Talk with the RAG system to see how well it performs.

In [42]:
c = Conversation()
A = c.chat('give me a cool gangster movie')
print(A)

QUERY: gangster movies with gritty storylines and intense action sequences
Title: Goodfellas (1990)
Genre: Biographical crime drama
Movie Rating: 8.7

Plot: Based on the true story of Henry Hill, a young man who grew up in the violent world of the mafia. As he rises through the ranks, he becomes increasingly consumed by the criminal lifestyle, eventually leading to his downfall.

Review: If you're looking for a classic gangster movie that will leave you on the edge of your seat, look no further than Goodfellas. Martin Scorsese's masterful direction and Robert De Niro's captivating performance as Henry Hill will draw you into the gritty world of organized crime. With its gripping storyline and unforgettable characters, Goodfellas is a must-watch for any fan of the genre. Get ready to be swept away by this timeless cinematic masterpiece.


In [43]:
A = c.chat('give me a newer one')
print(A)

QUERY: Goodfellas-inspired biographical crime dramas with gritty storylines and intense action set in the world of organized crime released after 1990.
Title: The Irishman (2019)
Genre: Biographical crime drama
Movie Rating: 7.4

Plot: Frank Sheeran, a truck driver and union official, becomes involved with the Bufalino crime organization and befriends mob boss Russell Bufalino. Their relationship leads Sheeran to play a role in some of the most infamous unsolved mysteries in American history.

Review: If you're looking for a more recent addition to the gangster genre, then The Irishman is the perfect choice for you. Directed by Martin Scorsese and starring an all-star cast including Robert De Niro, Al Pacino, and Joe Pesci, this film is a true masterpiece. With its intricate plot and stunning visual effects, The Irishman is a must-watch for any fan of the genre. Prepare yourself for a thrilling ride filled with suspense, action, and unforgettable performances.
