In this notebook, we define the `amazon_asin` function, which searches for a term and returns any amazon.com products in the first ten results, along with their asins, and a "matching score" via rapidfuzz.

For example running `amazon_asin("Ace the Data Science Interview: 201 Real Interview Questions by Nick Singh and Kevin Huo")` outputs 
the following:
```
[{'name': 'Ace Data Science Interview Questions',
  'asin': '0578973839',
  'score': 100.0},
 {'name': 'Ace Data Science Interview Interviews',
  'asin': '1956591133',
  'score': 82.53968253968254},
 {'name': 'Ace Data Engineering Interview Questions',
  'asin': 'B0F18SQNYL',
  'score': 82.35294117647058}]
  ```
**The results are in descending order of closeness of match (according to duckduckgo.com). This may not agree with the rapidfuzz ``score``.**

  This is then applied to the incidents reports data in the "asin_search_results" column and then saved to asin_search_results.csv.


In [31]:
# If TESTING is True, the code will only run on a random sample of at most MAX_INCIDENTS rows of the incidient report data.
# If TESTING is False, the code will run on the whole dataset.
TESTING=False
MAX_INCIDENTS=20

In [32]:
import time
import undetected_chromedriver as uc
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import rapidfuzz as rf
from tqdm import tqdm
tqdm.pandas()

In [33]:
def duckduckgo_search(query, max_results=10, headless=False):
    """
    Performs a DuckDuckGo search using Selenium and returns the parsed HTML of the results page.

    This function automates a headless (or optionally visible) Chrome browser to open DuckDuckGo,
    perform a search for the given query, and return the resulting page content as a BeautifulSoup object.

    Args:
        query (str): The search term to query DuckDuckGo with.
        max_results (int, optional): Unused in current implementation; reserved for future use to limit results. Defaults to 10.
        headless (bool, optional): If True, runs Chrome in headless mode (no browser window). Defaults to True.

    Returns:
        BeautifulSoup or list: Parsed HTML of the search results page as a BeautifulSoup object.
        Returns an empty list if an error occurs during the search process.

    Notes:
        - Requires undetected-chromedriver (uc), Selenium, BeautifulSoup4, and ChromeDriver v136.
        - The function waits up to 12 seconds for the search box to appear.
        - The function currently does not paginate or limit results using `max_results`.
        - May print errors to the console during scraping.
    """
    options = uc.ChromeOptions()
    
    if headless:
        options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--remote-debugging-port=9222")

    driver = uc.Chrome(options=options, version_main=136)

    try:
        driver.get("https://duckduckgo.com/")

        search_box = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "searchbox_input"))
        )
        search_box.send_keys(query)
        search_box.send_keys(Keys.RETURN)


        time.sleep(2)  # give results a moment to fully load

        source=driver.page_source
        return BeautifulSoup(source,"html.parser")

    except Exception as e:
        print("Error during scraping:", e)
        return []

    finally:
        driver.quit()

In [34]:
def amazon_asin(query:str):
    """
    Searches Amazon for a given product query and attempts to extract ASINs from search results.

    This function uses DuckDuckGo search (via duckduckgo_search) to find product pages on Amazon 
    related to the input query. It parses the search results to identify Amazon product URLs containing 
    ASINs (Amazon Standard Identification Numbers), extracts the product name and ASIN, and computes a 
    fuzzy match score between the query and the extracted product name.

    Args:
        query (str): The product search query (e.g., product description or name). Truncated to 200 characters.

    Returns:
        list[dict] or None: A list of dictionaries, each containing:
            - 'name' (str): Extracted product name from the URL.
            - 'asin' (str): The ASIN string extracted from the Amazon URL.
            - 'score' (int): A fuzzy match score (0–100) indicating relevance to the original query.
        
        Returns None if the query is empty, invalid, or if no ASINs were found.
    
    Note:
        - Relies on an external `duckduckgo_search` function for search and a fuzzy matching tool `rf.fuzz`.
        - This function does not verify if ASINs are still valid on Amazon.
    """
    try:
        # Ensure query is a string and truncate to 200 characters
        query=str(query)[0:200]
        
        # Remove any newline, carriage return, or tab characters
        query=query.replace('\r', '').replace('\n', '').replace('\t','')
        if query=='' or query=='nan':
            return None
        
        # Build the DuckDuckGo search query to limit results to amazon.com
        search_query="site:amazon.com "+str(query)
        
        # Perform the search using DuckDuckGo (browser-based)
        page= duckduckgo_search(search_query)
        
        # Extract all Amazon product links from the search result page
        links=[link.attrs['href'] for link in page.find_all('a',class_='eVNpHGjtxRBq_gLOfGDr LQNqh2U1kzYxREs65IJu')]
        
        # Loop through all extracted links, saving the product name, asin and fuzzscore to results.
        results=[]
        for link in links:
            index=link.find('/dp/')
            if index!=-1:
                asin=link[index+4:]
                name=link[23:index].replace('-',' ')
                score=rf.fuzz.token_set_ratio(name,query)
                results.append({'name': name,'asin': asin,'score':score})
        
        # Return None if no valid ASINs were found
        if results==list():
            return None
    
        return results
    except Exception as error:
        print("An error occurred:", error)
        return None


In [35]:
# Import incident reports with asin numbers from links
%run asin_in_text.ipynb
reports=pd.read_csv("reports.csv",index_col=0)

# Limit the number of records if TESTING=true
if TESTING and len(reports)>MAX_INCIDENTS:
    reports=reports.sample(MAX_INCIDENTS,random_state=1066)

In [36]:
# We define the query column, which is used to search for a product.
# It concatenates the brand, model_name_or_number, and product_description.

reports['query']=reports.brand.fillna('').astype(str)+' '+\
                 reports.model_name_or_number.fillna('').astype(str) + ' ' +\
                 reports.product_description.fillna('').astype(str)

In [37]:
reports['search_result']=reports['query'].progress_apply(amazon_asin)

100%|██████████| 2514/2514 [5:01:31<00:00,  7.20s/it]  


In [39]:
for i,item in reports.iterrows():
    print("Brand:", item.brand)
    print("Model No:", item.model_name_or_number)
    print("Description:", item.product_description)

    print("Search Results:")
    for result in item.search_result:
        print(result)
    print("\n \n")

Brand: POLKA DROP SLIME
Model No: nan
Description: Slime globe with colored spheres which resemble [REDACTED] cereal or [REDACTED]
Search Results:
{'name': 'Polka Dot Slime 12 Pack', 'asin': 'B0CJ9XS1NJ', 'score': 35.71428571428571}
{'name': 'YOPINSAND Galaxy Making Add ins Glitters', 'asin': 'B0D5LX83X3', 'score': 20.799999999999997}
{'name': 'SLIMYGLOOP MixEms Horizon Sparkly Glitter', 'asin': 'B07N84BJ63', 'score': 20.634920634920633}
{'name': 'GirlZone Cosmic Premade Glitter Christmas', 'asin': 'B0B2Q8MR4Q', 'score': 28.57142857142857}

 

Brand: Nickledodeon Slime
Model No: Lot #281117
Description: Slime kit from Nickelodeon by Cra-Z-Art
Search Results:
{'name': 'Cra Z Art Nickelodeon Stress Less Slime', 'asin': 'B07VB9PHLH', 'score': 60.714285714285715}
{'name': 'Cra Z Art Nickelodeon Pre Made Slime Super', 'asin': 'B089MWDPVB', 'score': 57.6271186440678}

 

Brand: Lalaloopsy Color Me ( Squiggles N. Shapes )
Model No: 531463/531470
Description: Lalaloopsy Color Me Doll ( Squiggl

TypeError: 'NoneType' object is not iterable

In [40]:
reports=reports.drop(columns=['query'])

In [41]:
reports.to_csv("asin_with_search_results.csv")