<a href="https://colab.research.google.com/github/AI-Maker-Space/LLM-Ops-Vault/blob/main/Week%201/First%20Session/Barbie_Retrieval_Augmented_Question_Answering_(RAQA)_Assignment%20(Assignment%20Version).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Questioning Barbie Reviews with RAQA (Retrieval Augmented Question Answering)

In the following notebook, you are tasked with creating a system that answers questions based on information found in reviews of the 2023 Barbie movie.

## Build 🏗️

There are 3 main tasks in this notebook:

1. Obtain and parse reviews from a review website
2. Create a Vectorstore from the reviews
3. Create a `RetrievalQA` using the VectorStore

## Ship 🚢

Create a Hugging Face Space that hosts your application.

## Share 🚀

Make a social media post about your final application.

>### Why RAQA and not RAG?
>If we look at the original [paper](https://arxiv.org/abs/2005.11401), we find that RAG is a fairly specific and well defined term that isn't exactly the same as "retrieve context, feed context to model in the prompt".
>For that reason, we're making the decision to delineate between "actual" RAG, and Retrieval Augmented Question Answering - which is not a well defined phrase.

### Pre-task Work

All we really need to do to get started is to get our prerequisites!

We'll be leveraging `langchain` and `openai` today.

Check out the docs:
- [LangChain](https://docs.langchain.com/docs/)
- [OpenAI](https://github.com/openai/openai-python)

In [8]:
!pip install -q -U openai langchain

In [2]:
import os
import getpass
os.environ["OPENAI_API_KEY"] = getpass.getpass("Open AI API Key:")

### Task 1: Data Preparation

In this task we'll be collecting, and then parsing, our data.

#### Scraping IMDB Reviews of Barbie

We'll use some Selenium based trickery to get the reviews we need to make our application.

Check out the docs here:
- [Selenium](https://www.selenium.dev/documentation/)

In [10]:
!pip install -q -U requests pandas 

In [11]:
!pip install -U ipykernel



In [12]:
!pip install -q -U scrapy selenium

In [13]:
!apt install chromium-chromedriver 

The operation couldn’t be completed. Unable to locate a Java Runtime that supports apt.
Please visit http://www.java.com for information on installing Java.



In [14]:
import numpy as np
import pandas as pd
from scrapy.selector import Selector
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")

In [15]:
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

In [16]:
url = "https://www.imdb.com/title/tt1517268/reviews/?ref_=tt_ov_rt"
driver.get(url)

In [17]:
sel = Selector(text = driver.page_source)
review_counts = sel.css('.lister .header span::text').extract_first().replace(',','').split(' ')[0]
more_review_pages = int(int(review_counts)/25)

In [18]:
for i in tqdm(range(more_review_pages)):
    try:
        css_selector = 'load-more-trigger'
        driver.find_element(By.ID, css_selector).click()
    except:
        pass

100%|██████████| 44/44 [00:02<00:00, 15.59it/s]


In [11]:
rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []
review_url_list = []
error_url_list = []
error_msg_list = []
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')

for d in tqdm(reviews):
    try:
        sel2 = Selector(text = d.get_attribute('innerHTML'))
        try:
            rating = sel2.css('.rating-other-user-rating span::text').extract_first()
        except:
            rating = np.NaN
        try:
            review = sel2.css('.text.show-more__control::text').extract_first()
        except:
            review = np.NaN
        try:
            review_date = sel2.css('.review-date::text').extract_first()
        except:
            review_date = np.NaN
        try:
            author = sel2.css('.display-name-link a::text').extract_first()
        except:
            author = np.NaN
        try:
            review_title = sel2.css('a.title::text').extract_first()
        except:
            review_title = np.NaN
        try:
            review_url = sel2.css('a.title::attr(href)').extract_first()
        except:
            review_url = np.NaN
        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)
        review_url_list.append(review_url)
    except Exception as e:
        error_url_list.append(url)
        error_msg_list.append(e)
review_df = pd.DataFrame({
    'Review_Date':review_date_list,
    'Author':author_list,
    'Rating':rating_list,
    'Review_Title':review_title_list,
    'Review':review_list,
    'Review_Url':review_url
    })

100%|██████████| 75/75 [00:00<00:00, 200.61it/s]


In [12]:
review_df

Unnamed: 0,Review_Date,Author,Rating,Review_Title,Review,Review_Url
0,21 July 2023,LoveofLegacy,6,"Beautiful film, but so preachy\n","Margot does the best with what she's given, bu...",/review/rw9201978/?ref_=tt_urv
1,22 July 2023,imseeg,7,3 reasons FOR seeing it and 1 reason AGAINST.\n,The first reason to go see it:,/review/rw9201978/?ref_=tt_urv
2,22 July 2023,Natcat87,6,Too heavy handed\n,"As a woman that grew up with Barbie, I was ver...",/review/rw9201978/?ref_=tt_urv
3,31 July 2023,ramair350,10,"As a guy I felt some discomfort, and that's o...",As much as it pains me to give a movie called ...,/review/rw9201978/?ref_=tt_urv
4,24 July 2023,heatherhilgers,9,A Technicolor Dream\n,"Wow, this movie was a love letter to cinema. F...",/review/rw9201978/?ref_=tt_urv
...,...,...,...,...,...,...
70,28 July 2023,klastaitas,8,We're All Dolls\n,"Without consumerism, without belief, without p...",/review/rw9201978/?ref_=tt_urv
71,20 July 2023,FeastMode,7,I'll beach you off so hard\n,I had no idea what to expect going into Barbie...,/review/rw9201978/?ref_=tt_urv
72,7 August 2023,misscattik,,No questions just answers\n,The title of the review practically sums it al...,/review/rw9201978/?ref_=tt_urv
73,23 July 2023,acdc_mp3,9,A really fun time\n,Weeks ago I heard that there was going to be a...,/review/rw9201978/?ref_=tt_urv


Let's save this `pd.DataFrame` as a `.csv` to our local session (this will be terminated when you terminate the Colab session) so we can leverage it in LangChain!

Check out the docs if you get stuck:
- [`to_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html)

In [13]:
### YOUR CODE HERE
review_df.to_csv("./movie_review.csv")

#### Data Parsing

Now that we have our data - let's go ahead and start parsing it into a more usable format for LangChain!

We'll be using the `CSVLoader` for this application.

We also want to be sure to track the sources that our review came from!

Check out the docs here:
- [`CSVLoader`](https://python.langchain.com/docs/integrations/document_loaders/csv)

In [20]:
from langchain.document_loaders.csv_loader import CSVLoader

In [21]:
loader = CSVLoader(
    file_path="movie_review.csv",
    source_column='Review_Url'
    )
 
data = loader.load()

In [38]:
#print(data)

In [23]:
assert len(data) == 125

AssertionError: 

In [49]:
len(data)

75

Now that we have collected our review information into a loader - we can go ahead and chunk the reviews into more manageable pieces.

We'll be leveraging the `RecursiveCharacterTextSplitter` for this task today.

While splitting our text seems like a simple enough task - getting this correct/incorrect can have massive downstream impacts on your application's performance.

You can read the docs here:
- [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter)

We want to split our documents into 1000 character length chunks, with 100 characters of overlap.

> ### HINT:
>It's always worth it to check out the LangChain source code if you're ever in a bind - for instance, if you want to know how to transform a set of documents, check it out [here](https://github.com/langchain-ai/langchain/blob/5e9687a196410e9f41ebcd11eb3f2ca13925545b/libs/langchain/langchain/text_splitter.py#L268C18-L268C18)

In [50]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000, # the character length of the chunk
    chunk_overlap = 100, # the character length of the overlap between chunks
    length_function = len # the length function
)

In [26]:
documents = text_splitter.transform_documents(data)### YOUR CODE HERE

In [37]:
#print(documents)

In [28]:
assert len(documents) == 181

AssertionError: 

With our documents transformed into more manageable sizes, and with the correct metadata set-up, we're now ready to move on to creating our VectorStore!

### Task 2: Creating an "Index"

The term "index" is used largely to mean: Structured documents parsed into a useful format for querying, retrieving, and use in the LLM application stack.

#### Selecting Our VectorStore

There are a number of different VectorStores, and a number of different strengths and weaknesses to each.

In this notebook, we will be keeping it very simple by leveraging [Facebook AI Similarity Search](https://ai.meta.com/tools/faiss/#:~:text=FAISS%20(Facebook%20AI%20Similarity%20Search,more%20scalable%20similarity%20search%20functions.), or `FAISS`.

In [29]:
!pip install -q -U faiss-cpu tiktoken

We're going to be setting up our VectorStore with the OpenAI embeddings model. While this embeddings model does not need to be consistent with the LLM selection, it does need to be consistent between embedding our index and embedding our queries over that index.

While we don't have to worry too much about that in this example - it's something to keep in mind for more complex applications.

We're going to leverage a [`CacheBackedEmbeddings`](https://python.langchain.com/docs/modules/data_connection/caching_embeddings )flow to prevent us from re-embedding similar queries over and over again.

Not only will this save time, it will also save us precious embedding tokens, which will reduce the overall cost for our application.

>#### Note:
>The overall cost savings needs to be compared against the additional cost of storing the cached embeddings for a true cost/benefit analysis. If your users are submitting the same queries often, though, this pattern can be a massive reduction in cost.

In [51]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import CacheBackedEmbeddings
from langchain.vectorstores import FAISS
from langchain.storage import LocalFileStore

store = LocalFileStore("./cache/") ### YOUR CODE HERE ??
core_embeddings_model = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY) ### YOUR CODE HERE

embedder = CacheBackedEmbeddings.from_bytes_store(
core_embeddings_model, store, namespace=core_embeddings_model.model
)

vector_store =FAISS.from_documents(documents,embedder)### YOUR CODE HERE

Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: You exceeded your current quota, please check your plan and billing details..


In [52]:
list(store.yield_keys())

['text-embedding-ada-002f5603ee0-e479-5d7f-8e90-6c8c138b6291',
 'text-embedding-ada-002153556fb-dc15-5f1e-b70e-e2f2c40622b6',
 'text-embedding-ada-002a4d71e0f-d166-5907-ad05-6a62f21e67f6',
 'text-embedding-ada-002357297a9-b625-5f60-9556-c3f9b0c1941f',
 'text-embedding-ada-002f3d81969-92be-51c6-a768-b9748c4dcfc6',
 'text-embedding-ada-002df35a0ba-760e-59db-87ef-c39984028c19',
 'text-embedding-ada-002d7e02b54-9403-59f4-b032-a2576ea37fa4',
 'text-embedding-ada-002b279039b-208a-52b9-bfd5-c5a80c3e55ab',
 'text-embedding-ada-002a3a27c64-e8c2-50ab-8eb4-58a9157e1584',
 'text-embedding-ada-002ac68c522-5e8b-57ad-9b29-9f9d9d13591c',
 'text-embedding-ada-00207a4bfb1-5185-5dc7-a88f-62639d85b721',
 'text-embedding-ada-0020acd5952-16a1-5062-9dea-c1d87aeaf3e1',
 'text-embedding-ada-0025387825d-9525-5a60-9f85-fa76afb6a8ca',
 'text-embedding-ada-002282d3357-6ed4-5f85-90c7-ed4ab7da0319',
 'text-embedding-ada-0022d544bcd-dbd6-57c1-a2f7-a5632c70bbfc',
 'text-embedding-ada-00267f79849-b5e3-50c2-8f38-f85e20d

Now that we've created the VectorStore, we can check that it's working by embedding a query and retrieving passages from our reviews that are close to it.

In [53]:
query = "How is Will Ferrell in this movie?"
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

for page in docs:
  print(page.page_content)

: 61
Review_Date: 23 July 2023
Author: agjbull
Rating: 6
Review_Title: Just a little empty
Review: I really wanted to enjoy this and I know that I am not the target audience but there were massive plot holes and no real flow. The film was very disjointed. Ryan Gosling as good as he is seemed to old to play Ken and Will Ferrell ruined every scene he was in. I just didn't get it, it seemed hollow artificial and hackneyed. A waste of some great talent. It was predictable without being reassuring and trying so hard to be woke in the most superficial way in that but trying to tick so many boxes it actually ticked none. Margo Robbie looks beautiful throughout, the costumes and the sets were amazing but the story was way too weak and didn't make much sense at all.
Review_Url: /review/rw9201978/?ref_=tt_urv
I can imagine that in the boardroom they went through a discussion like this: People are going to think this is a feel good movie so we need to give them something deeper but also lets make

Let's see how much time the `CacheBackedEmbeddings` pattern saves us:

In [54]:
%%timeit -n 1 -r 1
query = "I really wanted to enjoy this and I know that I am not the target audience but there were massive plot holes and no real flow."
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

321 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [57]:
#%%timeit
query = "I really wanted to enjoy this and I know that I am not the target audience but there were massive plot holes and no real flow."
embedding_vector = core_embeddings_model.embed_query(query)
docs = vector_store.similarity_search_by_vector(embedding_vector, k = 4)

In [58]:
for page in docs:
  print(page.page_content)

disappointing. The pacing was horrible, the villain won and was pretty irrelevant in the long run, the story was all over the place, and it is totally not for kids. I think the worst part is how disappointed I was at the opportunity to really make something special and it just wasn't. 6 points for all the laughs and fun I had and I would expect 6-7 to be an appropriate rating.
events that made the world think that this movie is going to be all about fun and adventure. Sadly, it was not nearly as enjoyable as the misleading marketing made it seem.
: 61
Review_Date: 23 July 2023
Author: agjbull
Rating: 6
Review_Title: Just a little empty
Review: I really wanted to enjoy this and I know that I am not the target audience but there were massive plot holes and no real flow. The film was very disjointed. Ryan Gosling as good as he is seemed to old to play Ken and Will Ferrell ruined every scene he was in. I just didn't get it, it seemed hollow artificial and hackneyed. A waste of some great t

As we can see, even over a significant number of runs - the cached query is significantly faster than the first instance of the query!

With that, we're ready to move onto Task 3!

### Task 3: Building a Retrieval Chain

In this task, we'll be making a Retrieval Chain which will allow us to ask semantic questions over our data.

This part is rather abstracted away from us in LangChain and so it seems very powerful.

Be sure to check the documentation, the source code, and other provided resources to build a deeper understanding of what's happening "under the hood"!

#### A Basic RetrievalQA Chain

We're going to leverage `return_source_documents=True` to ensure we have proper sources for our reviews - should the end user want to verify the reviews themselves.

Hallucinations [are](https://arxiv.org/abs/2202.03629) [a](https://arxiv.org/abs/2305.15852) [massive](https://arxiv.org/abs/2303.16104) [problem](https://arxiv.org/abs/2305.18248) in LLM applications.

Though it has been tenuously shown that using Retrieval Augmentation [reduces hallucination in conversations](https://arxiv.org/pdf/2104.07567.pdf), one sure fire way to ensure your model is not hallucinating in a non-transparent way is to provide sources with your responses. This way the end-user can verify the output.

#### Our LLM

In this notebook, we'll continue to leverage OpenAI's suite of models - this time we'll be using the `gpt-3.5-turbo` model to power our RetrievalQAWithSources chain.

Check out the relevant documentation if you get stuck:
- [`OpenAIChat()`](https://python.langchain.com/docs/modules/model_io/models/chat/)

In [None]:
from langchain.llms.openai import OpenAIChat

# llm = ### YOUR CODE HERE

In [62]:
from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model_name="gpt-3.5-turbo"),
#chat = ChatOpenAI(model_name="gpt-3.5-turbo")

Now we can set up our chain.

We'll need to make our VectorStore a retriever in order to be able to leverage it in our chain - let's check out the `as_retriever()` method.

Relevant Documentation:
- [`FAISS`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.faiss.FAISS.html#langchain.vectorstores.faiss.FAISS.as_retriever)

In [76]:
retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={'k': 3, 'fetch_k': 50}
)

# retriever = vector_store.as_retriever(
#     search_type="mmr",
#     search_kwargs={'k': 6, 'lambda_mult': 0.25}
# )

In [77]:
#https://python.langchain.com/docs/use_cases/question_answering/how_to/vector_db_qa
from langchain.chains import RetrievalQA
from langchain.callbacks import StdOutCallbackHandler

handler = StdOutCallbackHandler()

qa_with_sources_chain = RetrievalQA.from_chain_type(
    #chat = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, streaming=True),
    llm= ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0, streaming=True),### YOUR CODE HERE,
    retriever=retriever,### YOUR CODE HERE,
    callbacks=[handler],
    return_source_documents=True
)

In [82]:
qa_with_sources_chain({"query" : "Should I take my boy friend to see this movie?"})



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


{'query': 'Should I take my boy friend to see this movie?',
 'result': 'Based on the given context, the review does not mention anything about whether this movie is suitable for couples or whether it would be enjoyable for a boyfriend. Therefore, it is uncertain whether you should take your boyfriend to see this movie.',
 'source_documents': [Document(page_content=": 5\nReview_Date: 21 July 2023\nAuthor: G-Joshua-Benjamin\nRating: 6\nReview_Title: My mom and I saw this yesterday. Here are my thoughts.\nReview: I don't know if I put spoilers in here. I am super worn out. Haha So I will just put that I did.\nReview_Url: /review/rw9201978/?ref_=tt_urv", metadata={'source': '/review/rw9201978/?ref_=tt_urv', 'row': 5}),
  Document(page_content="there specifically for Margot then I am pretty sure there will be multiple movies at your local cinema at the time you're reading this, which are way more worth watching than this one...", metadata={'source': '/review/rw9201978/?ref_=tt_urv', 'row': 

In [83]:
qa_with_sources_chain({"query" : "how is the music in this movie?"})



[1m> Entering new RetrievalQA chain...[0m

[1m> Finished chain.[0m


{'query': 'how is the music in this movie?',
 'result': 'The music in this movie is described as enjoyable, with musical numbers and a beautiful and emotional song by Billie Eilish called "What Was I Made For?" However, without more specific information, it is difficult to provide a comprehensive evaluation of the music in the film.',
 'source_documents': [Document(page_content=': 6\nReview_Date: 21 July 2023\nAuthor: Sleepin_Dragon\nRating: 8\nReview_Title: Well this really did come as a surprise.\nReview: It pains me to say it, but I enjoyed this movie so much more then I was expecting to, musical numbers, humour, there truly is something for the whole family yo enjoy.\nReview_Url: /review/rw9201978/?ref_=tt_urv', metadata={'source': '/review/rw9201978/?ref_=tt_urv', 'row': 6}),
  Document(page_content='but Billie Eilish "What Was I Made For?" tune that kept haunting in the background until it finally get the perfect scene to played it was really the best thing because that song was 

And with that, we have our Barbie Review RAQA Application built!

### Conclusion

Now that we have our application ready, let's deploy it to a Hugging Face space with their Docker spaces.

You can find the next part [here](https://ai-maker-space-barbie-raqa-application-chainlit-demo.hf.space/readme)!

You've built, now it's time to ship! 🚢