<a href="https://colab.research.google.com/github/wjleece/Adwords-for-RAG-LLMs/blob/main/wjleece_AdWords_RAG_LLM_API_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

If you use this code, please cite:

{
  title = {Adwords RAG},

  author = {Bill Leece},

  year = {2024}
}

#Intro

I was curious to create a RAG search system that would return hyperlinked results, similar to what happens in web search. I did this because I assume at some point companies like Perplexity will monetize through advertising. A fair bit of the code below involves dealing with data to make it hyperlinkable, so you can treat that lightly if your main interest is getting up to speed on AI and RAG. Finally, for anyone who is worried about not knowing Python well, I'm far from fluent in Python, so for sure *you can do it too!*

This notebook demonstrates conversational querying on Amazon product reviews for Nike shoes. Amazon recently announced a production-ready version of the concepts in this demo, which they call "Rufus": https://www.aboutamazon.com/news/retail/how-to-use-amazon-rufus

#Using This Colab Notebook



You have commenting rights to this notebook, so feel free to ask questions. To run the code, make a copy of this notebook for yourself; you can then edit and play with it as you like. Have fun!

#Setup

In [None]:
!pip install openai migrate --quiet
!pip install sentence_transformers --quiet
!pip install --upgrade langchain langchain_openai langchain_community --quiet
## !pip install tiktoken --quiet
!pip install gradio faiss-gpu --quiet

In [None]:
!python --version

Python 3.10.12


In [None]:
!openai --version

openai 1.43.0


In [None]:
import json
import os
import re
import pandas as pd
import requests
import gradio as gr
from IPython.display import HTML
from getpass import getpass
from langchain_community.document_loaders import CSVLoader
from langchain_openai import OpenAIEmbeddings
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain_openai import OpenAI
from langchain_community.vectorstores import FAISS

#Get Amazon Nike Review data (from Jan 2024, URLs may not work in the future)

In [None]:
url = 'https://raw.githubusercontent.com/wjleece/adwords-for-LLMs/main/updated_Amazon_reviews_batch_all.json' #JSON formatted data

response = requests.get(url, timeout=30)

# Stop here if the request failed; otherwise all_reviews would be undefined in the cells below
response.raise_for_status()
all_reviews = response.text

# Now all_reviews contains the data from the URL

In [None]:
print(type(all_reviews))

<class 'str'>


In [None]:
product_and_review_dict = json.loads(all_reviews) #ensures we have all the data formatted properly. Returns a dictionary.

In [None]:
type(product_and_review_dict)

dict

In [None]:
# Let's get our data in something that makes it easy to view, like a DataFrame
# DataFrame expects a list input, and we have a nested dictionary
# Let's convert the nested dictionary to a list

# Flatten the structure
flattened_reviews = []
for batch in product_and_review_dict.values():    #each value in product_and_review_dict is one batch of reviews, as all_reviews was created by combining 4 separate batches of Amazon review data
                                                  #I used Octoparse, which had a limit (50) on the number of ASINs it could process in one batch.
                                                  #There were 190 or so ASINs, hence 4 batches to get all the review data
    flattened_reviews.extend(batch)               #modify the flattened_reviews list by adding each review in the batch to the end of the list

In [None]:
# Create DataFrame
product_and_reviews_df = pd.DataFrame(flattened_reviews)

In [None]:
product_and_reviews_df.head() #this is just so we can easily see our data in a convenient format. I prefer it to running a for loop b/c the results are simply easier to read in a DataFrame

Unnamed: 0,Product_Title,ASIN,Link_Url,Review_Date,Rating,Combined_Review
0,nike air max 270,B078X16RP7,https://www.amazon.com/NIKE-Running-Shoes-Blac...,"January 19, 2024",5.0,Están Hermosos. | A mi esposo le encantaron es...
1,nike air max 270,B078X16RP7,https://www.amazon.com/NIKE-Running-Shoes-Blac...,"January 17, 2024",5.0,True to size | There’s nothing I don’t like ab...
2,nike air max 270,B078X16RP7,https://www.amazon.com/NIKE-Running-Shoes-Blac...,"January 17, 2024",1.0,I have a lot of these… | These are my favorite...
3,nike air max 270,B078X16RP7,https://www.amazon.com/NIKE-Running-Shoes-Blac...,"January 17, 2024",5.0,"Nike Air Max 270. | Beautiful, comfy, stylish,..."
4,nike air max 270,B078X16RP7,https://www.amazon.com/NIKE-Running-Shoes-Blac...,"January 17, 2024",5.0,Tennis | Excelente calidad


In [None]:
product_and_reviews_df.shape

(6614, 6)

#Listify Product Review Data for RAG Usage

In [None]:
product_and_reviews_df.Combined_Review[3]

'Nike Air Max 270. | Beautiful, comfy, stylish, just perfect, nothing much to say!'

In [None]:
# -p keeps mkdir from failing if the directory already exists
!mkdir -p data


In [None]:
product_and_reviews_df.to_csv('data/df_embed.csv',index=False)

In [None]:
# FAISS.from_documents requires a list of LangChain Documents, so we load the CSV through CSVLoader

path = 'data/df_embed.csv'

loader = CSVLoader(file_path=path, source_column="Combined_Review")

data = loader.load() #creates a list of langchain Documents, see https://python.langchain.com/docs/modules/data_connection/document_loaders/

In [None]:
type(loader)

In [None]:
type(data) # Now we're good to go, as we need: FAISS.from_documents(list of Documents, embedding function)

list

In [None]:
#debug

data[3]

Document(metadata={'source': 'Nike Air Max 270. | Beautiful, comfy, stylish, just perfect, nothing much to say!', 'row': 3}, page_content='Product_Title: nike air max 270\nASIN: B078X16RP7\nLink_Url: https://www.amazon.com/NIKE-Running-Shoes-Black-White-AH8050-100/dp/B078X16RP7/ref=cm_cr_arp_d_bdcrb_top?ie=UTF8\nReview_Date: January 17, 2024\nRating: 5.0\nCombined_Review: Nike Air Max 270. | Beautiful, comfy, stylish, just perfect, nothing much to say!')
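
The `page_content` above packs every CSV column into one string of `name: value` lines. As a rough sketch of that idea (this is not CSVLoader's actual code; the helper name `rows_to_page_content` and the shortened sample row are made up for illustration), the same shape can be produced with the standard library:

```python
import csv
import io

# Shortened, hypothetical sample with the same kind of columns as df_embed.csv
csv_text = (
    "Product_Title,Rating,Combined_Review\n"
    "nike air max 270,5.0,Tennis | Excelente calidad\n"
)

def rows_to_page_content(csv_text):
    """Turn each CSV row into one 'column: value' text block."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{key}: {value}" for key, value in row.items()) for row in reader]

pages = rows_to_page_content(csv_text)
print(pages[0])
```

Because the column names and the URL are part of this text, they are embedded along with the review itself, which is worth keeping in mind when judging retrieval quality.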

In [None]:
# for interest / debugging only

total_char_count = 0

for doc in data:
    # len() on a string counts characters, and page_content holds every column as "name: value" text,
    # so this is a character count across all fields; use len(doc.page_content.split()) to count words instead
    total_char_count += len(doc.page_content)

print(f'You have {len(data)} review(s) in your data')
print(f'There are {total_char_count} characters in total in all of the reviews')

You have 6614 review(s) in your data
There are 2455211 characters in total in all of the reviews


#Authenticate OpenAI API & Get Embeddings

In [None]:
if os.getenv("OPENAI_API_KEY") is None:
  if any(['VSCODE' in x for x in os.environ.keys()]):
    print('Please enter password in the VS Code prompt at the top of your VS Code window!')
  os.environ["OPENAI_API_KEY"] = getpass("")

assert os.getenv("OPENAI_API_KEY", "").startswith("sk-"), "This doesn't look like a valid OpenAI API key"
print("OpenAI API key configured")

··········
OpenAI API key configured


In [None]:
# Only run this once per session as it can get expensive - and this data is static!

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

#embeddings = OpenAIEmbeddings(model="text-embedding-3-small") #use this in the future as it's cheaper; currently there's a formatting issue w/ the hyperlinks w/ this model

db = FAISS.from_documents(data, embeddings) #instantiates a FAISS vector database & generates embeddings for each document in data using the OpenAIEmbeddings
                                            #those embeddings are indexed in the FAISS vector database
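
To make the retrieval step less of a black box, here is a minimal sketch of what a flat vector index does conceptually: compare the query vector with every stored vector and keep the closest ones. The function name `top_k_l2` and the toy 3-dimensional vectors are made up for illustration (real `text-embedding-ada-002` vectors have 1536 dimensions), and I'm assuming the LangChain FAISS wrapper is using its default Euclidean (L2) distance here.

```python
import math

def top_k_l2(query_vec, doc_vecs, k=2):
    """Return the indices of the k vectors in doc_vecs closest to query_vec (L2 distance)."""
    distances = []
    for idx, vec in enumerate(doc_vecs):
        dist = math.sqrt(sum((q - d) ** 2 for q, d in zip(query_vec, vec)))
        distances.append((dist, idx))
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]

# Toy "embeddings" with hypothetical values, one per document
doc_vecs = [
    [0.9, 0.1, 0.0],  # document 0
    [0.0, 1.0, 0.0],  # document 1
    [0.8, 0.2, 0.1],  # document 2
]
query_vec = [1.0, 0.0, 0.0]

nearest = top_k_l2(query_vec, doc_vecs, k=2)
print(nearest)
```

FAISS does the same job far more efficiently, but the result has the same meaning: the `k` documents whose embeddings sit nearest to the query's embedding.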

In [None]:
dir(db)

['_FAISS__add',
 '_FAISS__from',
 '__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_aembed_documents',
 '_aembed_query',
 '_asimilarity_search_with_relevance_scores',
 '_cosine_relevance_score_fn',
 '_create_filter_func',
 '_embed_documents',
 '_embed_query',
 '_euclidean_relevance_score_fn',
 '_get_retriever_tags',
 '_max_inner_product_relevance_score_fn',
 '_normalize_L2',
 '_select_relevance_score_fn',
 '_similarity_search_with_relevance_scores',
 'aadd_documents',
 'aadd_texts',
 'add_documents',
 'add_embeddings',
 'add_texts',
 'adelete',
 'afrom_documents',
 'afrom_embeddings',
 'afrom_texts',
 'aget_by_ids',
 '

In [None]:
dir(embeddings)

['Config',
 '__abstractmethods__',
 '__annotations__',
 '__class__',
 '__class_vars__',
 '__config__',
 '__custom_root_type__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__exclude_fields__',
 '__fields__',
 '__fields_set__',
 '__format__',
 '__ge__',
 '__get_validators__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__include_fields__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__json_encoder__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__post_root_validators__',
 '__pre_root_validators__',
 '__pretty__',
 '__private_attributes__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__repr_args__',
 '__repr_name__',
 '__repr_str__',
 '__rich_repr__',
 '__schema_cache__',
 '__setattr__',
 '__setstate__',
 '__signature__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__try_update_forward_refs__',
 '__validators__',
 '__weakref__',
 '_abc_impl',
 '_aget_len_safe_embeddings',
 '_calculate_keys',
 '_copy

#Create RAG Response Function Returning JSON Formatted Data + Response Quality Function

In [None]:
query = "What are some great Nike running shoes that customers love?"

#This is only a seed query for the test call to get_rag_response_from_query() further down, so you can check the pipeline before running greet().
#It only needs to be run once per session; after that, greet() takes the actual customer query.

In [None]:
from langchain_openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema import StrOutputParser

def get_rag_response_from_query(db, query, k=5):
    # Similarity search
    reviews = db.similarity_search(query, k=k)
    review_content = " ".join([r.page_content for r in reviews])

    # Initialize ChatOpenAI with gpt-4o
    llm = ChatOpenAI(model_name="gpt-4o", temperature=0.5)

    # Main prompt
    prompt = ChatPromptTemplate.from_template("""
    You are a bot that has specialized knowledge of Nike shoes, which you have obtained from Nike shoe customer review data from Amazon that has been provided to you.
    When you are asked questions about Nike shoes, make sure to mention specific key phrases from the reviews that you have access to, as long as that information
    is relevant to the question asked. Return the response in the form of an introduction, a recommendation section, and a conclusion.
    Return the data in the recommendation section in JSON format using "product_name" as the key and "description" as the value.
    You do not need to tell me when the JSON formatted section begins or ends as I expect that to be clearly indicated with "[" and "]", respectively.
    Similarly you do not need to tell me what the intro is and what the conclusion is as it will precede and follow the "[" and "]", respectively.

    Answer the following question: {question}
    By searching the following articles: {docs}

    If the question is not related to the information in the Nike shoe customer review data from Amazon
    that you have been given access to, respond that you have expertise in Nike shoes but not on the topic the user has asked.
    """)

    # Create and invoke chain
    chain = prompt | llm | StrOutputParser()
    rag_response_text = chain.invoke({"question": query, "docs": review_content})

    # Evaluation prompt
    prompt_eval = ChatPromptTemplate.from_template("""
    Your job is to evaluate whether the following response is faithful to the reviews it was generated from: {answer}
    The reviews are: {docs}
    Start with a Yes or a No, then give a reason why the response is or is not faithful to the reviews.
    """)

    # Create and invoke evaluation chain
    chain_part_2 = prompt_eval | llm | StrOutputParser()
    evals = chain_part_2.invoke({"answer": rag_response_text, "docs": review_content})

    return rag_response_text, review_content, evals
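
Since the evaluation prompt asks the model to start with a Yes or a No, the free-text `evals` string can be reduced to a simple flag. The helper below is a sketch of one way to do that; `parse_faithfulness_verdict` is a hypothetical name, and it assumes the model actually follows the instruction, returning `None` when it doesn't.

```python
import re

def parse_faithfulness_verdict(evals_text):
    """Return True for a leading 'Yes', False for a leading 'No', None otherwise."""
    # Skip leading punctuation/whitespace (e.g. markdown asterisks), then require a whole word
    match = re.match(r"\W*(yes|no)\b", evals_text, flags=re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).lower() == "yes"

print(parse_faithfulness_verdict("Yes, the response quotes the reviews directly."))
print(parse_faithfulness_verdict("No. The response mentions a shoe that is not in the reviews."))
print(parse_faithfulness_verdict("Note: the reviews are ambiguous."))
```

A flag like this could be logged per query to see how often the second LLM call disagrees with the first.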

In [None]:
rag_response_text, review_content, evals = get_rag_response_from_query(db, query)

In [None]:
rag_response_text

'When it comes to Nike running shoes, customers have shared a wealth of feedback that highlights some standout options. Based on reviews from Amazon, here are some top recommendations that have garnered high praise for their comfort, durability, and performance.\n\n[\n    {\n        "product_name": "nike",\n        "description": "Work great for tall guys | My husband is 6’5 and weighs about 300lb. He is a size 14-15. They are the best comfortable shoes, he loves them they are great quality and have held pretty well. Definitely recommend if you need to go out or workout, walking long periods of time or running."\n    },\n    {\n        "product_name": "nike react infinity run flyknit 3",\n        "description": "Durable, comfortable everyday shoes | As title says, these are super durable and comfortable for everyday use and running. Been wearing these every day for four months and they still have no visible wear and tear."\n    },\n    {\n        "product_name": "nike",\n        "descr

#Create OpenAI Response Dictionary from RAG Response for Later Hyperlinking; Separate JSON Response from Intro and Conclusion. (This part is not really "AI related" so you can treat this lightly)

In [None]:
def create_response_tuple(rag_response_text):
    openai_response_formatted = rag_response_text.strip()

    # Finding the start and end of the JSON data
    json_start = openai_response_formatted.find('[')
    json_end = openai_response_formatted.rfind(']') + 1  # +1 to include the closing bracket

    # Extracting the intro and conclusion text
    intro_text = openai_response_formatted[:json_start].strip()
    conclusion_text = openai_response_formatted[json_end:].strip()

    # Extracting the product description data from the response
    openai_response_products = openai_response_formatted[json_start:json_end].strip()

    # Converting JSON data to a list of dictionaries
    data = json.loads(openai_response_products)

    # Creating a dictionary with product names and descriptions
    response_product_dict = {}
    for item in data:
        response_product_dict[item['product_name']] = item['description']

    return intro_text, response_product_dict, conclusion_text
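
Two edge cases are worth knowing about in `create_response_tuple`. If the model answers an off-topic question without any `[` ... `]` section, the slice passed to `json.loads` is empty and it raises an error; and if the model returns the same `product_name` twice, the later description overwrites the earlier one in the dictionary. The sketch below is one possible, more defensive variant (the name `split_response` and the choice to return a list of tuples are my own assumptions, not part of the pipeline above):

```python
import json

def split_response(text):
    """Split a model response into (intro, products, conclusion).

    products is a list of (product_name, description) tuples, so repeated
    product names are kept. If no parseable JSON array is found, the whole
    text comes back as the intro with an empty product list.
    """
    text = text.strip()
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end == -1 or end < start:
        return text, [], ""
    try:
        items = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return text, [], ""
    products = [
        (item["product_name"], item["description"])
        for item in items
        if isinstance(item, dict) and "product_name" in item and "description" in item
    ]
    return text[:start].strip(), products, text[end + 1:].strip()

# Made-up sample response with a repeated product name
sample = """Intro text.
[
  {"product_name": "nike", "description": "A"},
  {"product_name": "nike", "description": "B"}
]
Wrap-up."""

intro_part, product_list, conclusion_part = split_response(sample)
print(product_list)
```

With this shape, an off-topic answer such as "I only have expertise in Nike shoes" simply comes back as intro text with no products, instead of stopping the app with an exception.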

In [None]:
#debug
response_tuple = create_response_tuple(rag_response_text)

In [None]:
intro = response_tuple[0]
intro

'When it comes to Nike running shoes, customers have shared a wealth of feedback that highlights some standout options. Based on reviews from Amazon, here are some top recommendations that have garnered high praise for their comfort, durability, and performance.'

In [None]:
response_product_dict = response_tuple[1]
response_product_dict

{'nike': "BEST NIKE SNEAKER | I tried these on and was in heaven. I'm picky about comfort due to a wider foot. These are like walking on big cushions of air. So much support and I feel like I walk faster...lol",
 'nike react infinity run flyknit 3': 'Durable, comfortable everyday shoes | As title says, these are super durable and comfortable for everyday use and running. Been wearing these every day for four months and they still have no visible wear and tear.',
 'nike sport trail': 'Great @ Pounding Pavement | I like these shoes quite a bit. Wore them for a half marathon and they are comfortable for the long haul. Currently, I have the green ones as well and they are still “okay” after 175 miles. I ended up buying new ones in blue to keep a bounce in my step.'}

In [None]:
conclusion = response_tuple[2]
conclusion

"In conclusion, whether you're looking for a shoe that can handle long runs, provide exceptional comfort for wider feet, or offer durability for everyday use, these Nike running shoes come highly recommended by customers. Their positive experiences highlight the quality and performance you can expect from these models."

#Create Hyperlinks and Insert into OpenAI response (This part is also not really "AI related" so you can treat this lightly)

In [None]:
# this works well with text-embedding-ada-002 but less well with text-embedding-3-small.
# This is undoubtedly due to some formatting nuances that I'll figure out later
# text-embedding-3-small is cheaper than text-embedding-ada-002 so I should use it

def create_hyperlink_mapping(response_product_dict, product_and_review_dict):
    """
    Create a mapping of product names to their hyperlinked versions, using the first product title that contains the name.

    Args:
    response_product_dict (dict): A dictionary of product names from an OpenAI response
    product_and_review_dict (dict): A nested dictionary of product details including URLs.

    Returns:
    dict: A dictionary mapping product names to hyperlinked versions.
    """
    hyperlink_mapping = {}

    # Flatten the nested product_and_review_dict
    flattened_product_dict = {}
    for batch in product_and_review_dict.values():
        for product in batch:
            product_title = product['Product_Title'].lower()
            url = product['Link_Url']
            flattened_product_dict[product_title] = url

    # Iterate over products in response_product_dict
    for product_name in response_product_dict.keys():
        # Lowercase the product name for matching
        matched_product_lower = product_name.lower()

        # Find the corresponding URL in flattened_product_dict
        for title, url in flattened_product_dict.items():
            if matched_product_lower in title:
                # Create a hyperlink version of the product name
                hyperlink_mapping[product_name] = f'<a href="{url}">{product_name}</a>'
                break  # Stop searching once a match is found

    return hyperlink_mapping

In [None]:
#debug

hyperlink_match_dict = create_hyperlink_mapping(response_product_dict, product_and_review_dict)
hyperlink_match_dict

{'nike': '<a href="https://www.amazon.com/NIKE-Running-Shoes-Black-White-AH8050-100/dp/B078X16RP7/ref=cm_cr_arp_d_bdcrb_top?ie=UTF8">nike</a>',
 'nike react infinity run flyknit 3': '<a href="https://www.amazon.com/Nike-Infinity-Flyknit-Running-Grey-Grey/dp/B0BR58HL83/ref=cm_cr_arp_d_bdcrb_top?ie=UTF8">nike react infinity run flyknit 3</a>',
 'nike sport trail': '<a href="https://www.amazon.com/Nike-Sport-Trail-Running-Anthracite/dp/B09XXYWN3R/ref=cm_cr_arp_d_bdcrb_top?ie=UTF8">nike sport trail</a>'}
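
Notice that the generic name 'nike' was linked to the Air Max 270 page simply because that title is the first one containing the substring "nike". If you would rather pick the most similar title, or no link at all when nothing is close, Python's standard `difflib` module is one option. This is only a sketch: `closest_title`, the three sample titles, and the 0.6 cutoff are illustrative choices, not tuned values.

```python
import difflib

def closest_title(product_name, titles, cutoff=0.6):
    """Return the title most similar to product_name, or None if none is similar enough."""
    matches = difflib.get_close_matches(product_name.lower(), titles, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# A few lowercase titles like those in flattened_product_dict
titles = [
    "nike air max 270",
    "nike react infinity run flyknit 3",
    "nike sport trail",
]

print(closest_title("Nike React Infinity Run Flyknit 3", titles))
print(closest_title("nike sport trial", titles))  # small typo still finds a match
print(closest_title("nike", titles))              # too generic, so no link
```

The trade-off is that a too-generic product name gets no hyperlink rather than a possibly wrong one, which may or may not be what you want in an ad-style setting.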

#Create API

In [None]:
def linkify_response_for_gradio(response_product_dict, hyperlink_match_dict):
    updated_response = ""
    for product, description in response_product_dict.items():
        # Embed hyperlink for the product, if available
        hyperlink = hyperlink_match_dict.get(product, None)
        if hyperlink:
            # Add target="_blank" to open in new tab and embed in the text
            updated_hyperlink = hyperlink.replace('<a href=', '<a target="_blank" href=')
        else:
            # If no hyperlink, use the product name as plain text
            updated_hyperlink = f'<a>{product}</a>'

        # Combine the hyperlinked product with its description
        # Remove the product name from the description to avoid repetition
        description_without_product_name = description.replace(product, '').strip()
        updated_response += f"<p>{updated_hyperlink}: {description_without_product_name}</p>"

    # Replace newline characters with HTML line breaks (if needed)
    response_with_breaks = updated_response.replace('\n', '<br>')
    return response_with_breaks
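
One thing `linkify_response_for_gradio` does not do is escape the text it places into HTML. Review text and model output can contain characters such as `<`, `>` or `&`, which a browser may interpret as markup. Below is a small sketch of building one paragraph with escaping, using the standard `html` module; the function name `safe_product_paragraph` and the example URL are hypothetical.

```python
import html

def safe_product_paragraph(product, description, url=None):
    """Build one <p> element, escaping any text that did not come from our own template."""
    name = html.escape(product)
    body = html.escape(description)
    if url:
        # quote=True also escapes double quotes so the href attribute cannot be broken out of
        link = f'<a target="_blank" href="{html.escape(url, quote=True)}">{name}</a>'
    else:
        link = name
    return f"<p>{link}: {body}</p>"

print(safe_product_paragraph("nike <b>", "5 > 4 & fine"))
print(safe_product_paragraph("nike air", "Great shoe", url="https://example.com/?a=1&b=2"))
```

Escaping matters more once responses are shown through a public share link, since the displayed text ultimately comes from scraped reviews and an LLM.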

In [None]:
#Putting it all together

def main():

  def greet(query):

      # Step 1: Get OpenAI response
      rag_response_text, review_content, evals = get_rag_response_from_query(db, query, k=5)

      # Step 2: Convert string response to product_dict
      response_tuple = create_response_tuple(rag_response_text)
      intro = response_tuple[0]
      response_product_dict = response_tuple[1]
      conclusion = response_tuple[2]

      # Step 3: Create hyperlink mapping
      hyperlink_match_dict = create_hyperlink_mapping(response_product_dict, product_and_review_dict)

      # Step 4: Linkify response for Gradio
      hyperlinked_response = linkify_response_for_gradio(response_product_dict, hyperlink_match_dict)

      # Step 5: Return the response with its intro and conclusion
      intro_html = f"<p>{intro}</p>" #things get messed up if I don't do this, not entirely sure why
      conclusion_html = f"<p>{conclusion}</p>"
      final_response = intro_html + hyperlinked_response + conclusion_html
      final_response = final_response.replace('\n', '<br>')

      return final_response, review_content, evals

  examples = [
      ["Can you recommend some lightweight Nike running shoes that dry easily if they get wet and that customers really love?"],
      ["Can you recommend some lightweight Nike basketball shoes that have good ankle support?"],
      ["Can you recommend some Nike shoes that customers love?"],
      ["Can you recommend some Nike shoes that are comfortable and durable?"],
      ]

  nike_product_search = gr.Interface(fn=greet, title="Bill Leece's Nike Product Sentiment Search & Adwords Simulator", inputs="text",
                            outputs=[ #gr.components.Textbox(lines=3, label="Response"),
                            gr.HTML(label="Response"),
                            gr.components.Textbox(lines=3, label="Source"),
                            gr.components.Textbox(lines=3, label="Evaluation")],
                            examples=examples, #show the sample queries defined above as clickable examples
                                    )

  nike_product_search.launch(share=True)

if __name__ == "__main__":
    main()

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://ca40464e3268cdc305.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
