In this notebook, we define the `amazon_asin` function, which searches DuckDuckGo for a term and returns any amazon.com products found on the first page of results (about ten), along with their ASINs and a "matching score" computed with rapidfuzz.

For example, running `amazon_asin("Ace the Data Science Interview: 201 Real Interview Questions by Nick Singh and Kevin Huo")` outputs the following:
```
[{'name': 'Ace Data Science Interview Questions',
  'asin': '0578973839',
  'score': 100.0},
 {'name': 'Ace Data Science Interview Interviews',
  'asin': '1956591133',
  'score': 82.53968253968254},
 {'name': 'Ace Data Engineering Interview Questions',
  'asin': 'B0F18SQNYL',
  'score': 82.35294117647058}]
```
**The results appear in the order DuckDuckGo ranked them, most relevant first. This order may not agree with the rapidfuzz `score`.**
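
If ranking by the fuzzy score is preferred over the search engine's ordering, the returned list can be re-sorted. A minimal sketch, assuming the list-of-dicts shape shown above; `sort_by_score` is a hypothetical helper that is not part of this notebook, and the two example entries are taken from the output above and deliberately placed out of order:

```python
def sort_by_score(results):
    """Return a new list ordered by descending 'score'.

    `results` is assumed to be the list of dicts returned by amazon_asin;
    None (no matches) is treated as an empty list.
    """
    return sorted(results or [], key=lambda r: r["score"], reverse=True)


# Two entries from the example output, reordered for illustration
example = [
    {"name": "Ace Data Science Interview Interviews", "asin": "1956591133", "score": 82.53968253968254},
    {"name": "Ace Data Science Interview Questions", "asin": "0578973839", "score": 100.0},
]

print([r["asin"] for r in sort_by_score(example)])  # ['0578973839', '1956591133']
```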

The function is then applied to the `query` column of the incident reports data. The results are stored in the `search_result` column and saved to `asin_with_search_results.csv`.


In [20]:
# If TESTING is True, the code will only run on a random sample of at most MAX_INCIDENTS rows of the incident report data.
# If TESTING is False, the code will run on the whole dataset.
TESTING=True
MAX_INCIDENTS=20

In [21]:
import time
import undetected_chromedriver as uc
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
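import pandas as pd  # used below by pd.read_csv and required by tqdm.pandas()
import rapidfuzz as rf
from tqdm import tqdm
tqdm.pandas()  # registers progress_apply on pandas objects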

In [22]:
def duckduckgo_search(query, max_results=10, headless=False):
    """
    Performs a DuckDuckGo search using Selenium and returns the parsed HTML of the results page.

    This function automates a Chrome browser (visible by default, optionally headless) to open DuckDuckGo,
    perform a search for the given query, and return the resulting page content as a BeautifulSoup object.

    Args:
        query (str): The search term to query DuckDuckGo with.
        max_results (int, optional): Unused in current implementation; reserved for future use to limit results. Defaults to 10.
        headless (bool, optional): If True, runs Chrome in headless mode (no browser window). Defaults to False.

    Returns:
        BeautifulSoup or list: Parsed HTML of the search results page as a BeautifulSoup object.
        Returns an empty list if an error occurs during the search process.

    Notes:
        - Requires undetected-chromedriver (uc), Selenium, BeautifulSoup4, and ChromeDriver v136.
        - The function waits up to 10 seconds for the search box to appear.
        - The function currently does not paginate or limit results using `max_results`.
        - May print errors to the console during scraping.
    """
    options = uc.ChromeOptions()
    
    if headless:
        options.add_argument("--headless=new")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--remote-debugging-port=9222")

    driver = uc.Chrome(options=options, version_main=136)

    try:
        driver.get("https://duckduckgo.com/")

        search_box = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "searchbox_input"))
        )
        search_box.send_keys(query)
        search_box.send_keys(Keys.RETURN)


        time.sleep(2)  # give results a moment to fully load

        source=driver.page_source
        return BeautifulSoup(source,"html.parser")

    except Exception as e:
        print("Error during scraping:", e)
        return []

    finally:
        driver.quit()

In [23]:
def amazon_asin(query:str):
    """
    Searches Amazon for a given product query and attempts to extract ASINs from search results.

    This function uses DuckDuckGo search (via duckduckgo_search) to find product pages on Amazon 
    related to the input query. It parses the search results to identify Amazon product URLs containing 
    ASINs (Amazon Standard Identification Numbers), extracts the product name and ASIN, and computes a 
    fuzzy match score between the query and the extracted product name.

    Args:
        query (str): The product search query (e.g., product description or name). Truncated to 200 characters.

    Returns:
        list[dict] or None: A list of dictionaries, each containing:
            - 'name' (str): Extracted product name from the URL.
            - 'asin' (str): The ASIN string extracted from the Amazon URL.
            - 'score' (float): A fuzzy match score (0–100) indicating relevance to the original query.
        
        Returns None if the query is empty or 'nan', if no ASINs were found, or if an error occurs.
    
    Note:
        - Relies on an external `duckduckgo_search` function for search and a fuzzy matching tool `rf.fuzz`.
        - This function does not verify if ASINs are still valid on Amazon.
    """
    try:
        # Ensure query is a string and truncate to 200 characters
        query=str(query)[0:200]
        
        # Remove any newline, carriage return, or tab characters
        query=query.replace('\r', '').replace('\n', '').replace('\t','')
        if query=='' or query=='nan':
            return None
        
        # Build the DuckDuckGo search query to limit results to amazon.com
        search_query="site:amazon.com "+str(query)
        
        # Perform the search using DuckDuckGo (browser-based)
        page= duckduckgo_search(search_query)
        
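        # Extract all Amazon product links from the search result page.
        # NOTE: the class names below are generated by DuckDuckGo's front end and may change without notice.
        links=[link.attrs['href'] for link in page.find_all('a',class_='eVNpHGjtxRBq_gLOfGDr LQNqh2U1kzYxREs65IJu')]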
        
        # Loop through all extracted links, saving the product name, asin and fuzzscore to results.
        results=[]
        for link in links:
            index=link.find('/dp/')
            if index!=-1:
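                # An ASIN is the 10 characters that follow '/dp/'; drop any trailing path or query string
                asin=link[index+4:index+14]
                # Skip the 23-character prefix 'https://www.amazon.com/' and turn the URL slug into a name
                name=link[23:index].replace('-',' ')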
                score=rf.fuzz.token_set_ratio(name,query)
                results.append({'name': name,'asin': asin,'score':score})
        
        # Return None if no valid ASINs were found
        if not results:
            return None
    
        return results
    except Exception as error:
        print("An error occurred:", error)
        return None
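
The slicing in `amazon_asin` assumes that every link starts with `https://www.amazon.com/` and has the form `<slug>/dp/<ASIN>`. A regular expression makes those assumptions explicit and rejects links that do not fit. The following is only a sketch: `PRODUCT_URL` and `parse_product_url` are hypothetical names, the URL shape is an assumption inferred from the example results, and the second URL in the usage lines is made up to show a trailing query string.

```python
import re

# Assumed URL shape: https://www.amazon.com/<slug>/dp/<ASIN>[/...][?...]
# An ASIN is taken to be exactly 10 upper-case letters or digits.
PRODUCT_URL = re.compile(
    r"^https?://(?:www\.)?amazon\.com/"
    r"(?:(?P<slug>[^/?#]+)/)?"          # optional product-name slug
    r"dp/(?P<asin>[A-Z0-9]{10})"        # the ASIN itself
    r"(?![A-Za-z0-9])"                  # must not run on into a longer token
)


def parse_product_url(url):
    """Return (name, asin) for an Amazon product URL, or None if it does not match."""
    match = PRODUCT_URL.match(url)
    if match is None:
        return None
    name = (match.group("slug") or "").replace("-", " ")
    return name, match.group("asin")


print(parse_product_url("https://www.amazon.com/Ace-Data-Science-Interview-Questions/dp/0578973839"))
# ('Ace Data Science Interview Questions', '0578973839')
print(parse_product_url("https://www.amazon.com/37649-New-Century-Racer/dp/B001CEH2DG?th=1"))
# ('37649 New Century Racer', 'B001CEH2DG')
print(parse_product_url("https://www.amazon.com/s?k=pogo"))
# None
```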


In [24]:
# Import incident reports with asin numbers from links
%run asin_in_text.ipynb
reports=pd.read_csv("reports.csv",index_col=0)

# Limit the number of records if TESTING is True
if TESTING and len(reports)>MAX_INCIDENTS:
    reports=reports.sample(MAX_INCIDENTS,random_state=1066)

In [25]:
# We define the query column, which is used to search for a product.
# It concatenates the brand, model_name_or_number, and product_description.

reports['query']=reports.brand.fillna('').astype(str)+' '+\
                 reports.model_name_or_number.fillna('').astype(str) + ' ' +\
                 reports.product_description.fillna('').astype(str)
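
When the brand or model is missing, the concatenation above leaves leading or doubled spaces in `query`. One possible alternative is to join only the parts that are present. This is a sketch, not the notebook's method; `build_query` is a hypothetical helper written in plain Python so it can be tried without the dataset:

```python
import math


def build_query(*parts):
    """Join the non-missing parts with single spaces.

    None, NaN, and blank strings are skipped, so a missing brand or model
    does not leave stray spaces in the query.
    """
    kept = []
    for part in parts:
        if part is None or (isinstance(part, float) and math.isnan(part)):
            continue
        text = " ".join(str(part).split())  # collapse runs of whitespace
        if text:
            kept.append(text)
    return " ".join(kept)


print(build_query("Century", "L2265", "Century Racer cars"))   # Century L2265 Century Racer cars
print(build_query(float("nan"), None, "Stomp Rocket  Ultra"))  # Stomp Rocket Ultra
```

It could be applied row by row, for example with `reports.apply(lambda r: build_query(r.brand, r.model_name_or_number, r.product_description), axis=1)`.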

In [26]:
reports['search_result']=reports['query'].progress_apply(amazon_asin)

100%|██████████| 20/20 [04:22<00:00, 13.12s/it]


In [30]:
for i,item in reports.iterrows():
    print("Brand:", item.brand)
    print("Model No:", item.model_name_or_number)
    print("Description:", item.product_description)

    print("Search Results:")
    # search_result is None when no ASINs were found, so fall back to an empty list
    for result in item.search_result or []:
        print(result)
    print("\n \n")

Brand: Century
Model No: L2265
Description: Century Racer cars
Search Results:
{'name': '37649 New Century Racer', 'asin': 'B001CEH2DG', 'score': 72.22222222222223}
{'name': 'Buick Century California Highway Patrol', 'asin': 'B005OQNPRA', 'score': 45.16129032258065}

 

Brand: nan
Model No: nan
Description: Bravo Sports, of Santa Fe Springs, Calif.Twin Stick Pogo046HE
Search Results:
{'name': 'Bravo Sports Pop Stick Pogo', 'asin': 'B005QNYGYC', 'score': 57.89473684210526}
{'name': 'Santa Fe Springs California CA', 'asin': 'B0966FG2RY', 'score': 54.94505494505494}
{'name': 'Bravo Sports Disney Minnie Skates', 'asin': 'B07PZK7HKY', 'score': 44.680851063829785}
{'name': 'bravo sports shade folding beach', 'asin': 'B009L4LQHC', 'score': 36.55913978494624}
{'name': 'Mytee Products 40 Air Electric', 'asin': 'B075LTPSBH', 'score': 24.175824175824175}

 

Brand: Zekpro
Model No: The anti-anxiety 360 spinner (gold blue)
Description: Fidget spinner. The kind that has 3 bearings.
Search Results:


In [28]:
reports

Unnamed: 0,report_no.,report_date,sent_to_manufacturer/importer/private_labeler,publication_date,category_of_submitter,product_description,product_category,product_sub_category,product_type,product_code,...,damage_repaired,product_was_modified_before_incident,have_you_contacted_the_manufacturer,if_not_do_you_plan_to,answer_explanation,company_comments,associated_report_numbers,asin_in_report,query,search_result
1427,20110718-34E6D-1192035,7/18/2011,3/11/2014,8/26/2011,Consumer,Century Racer cars,Toys & Children,Toys,Toy Vehicles (Excluding Riding Toys) (5021),5021,...,,No,No,No,"Consumer is unsure who to contact, has only co...",,,,Century L2265 Century Racer cars,"[{'name': '37649 New Century Racer', 'asin': '..."
1789,20110321-2E7AA-2147481168,3/21/2011,,4/11/2011,Consumer,"Bravo Sports, of Santa Fe Springs, Calif.Twin ...",Toys & Children,Toys,Pogo Sticks (1310),1310,...,,No,No,No,,Thank you for transmitting this to us. We wo...,,,"Bravo Sports, of Santa Fe Springs, Calif.Twi...","[{'name': 'Bravo Sports Pop Stick Pogo', 'asin..."
2123,20170628-7B636-2147399345,6/28/2017,8/24/2017,1/2/2018,Consumer,Fidget spinner. The kind that has 3 bearings.,Toys & Children,Toys,"Toys, Not Elsewhere Classified (1381)",1381,...,No,No,Yes,,When my daughter dropped the fidget spinner th...,,,,Zekpro The anti-anxiety 360 spinner (gold blue...,[{'name': 'Zekpro Anti Anxiety Spinner Focusin...
1656,20240209-D2AF9-4607683,2/9/2024,4/2/2024,4/23/2024,Consumer,Latex balloons,Toys & Children,Toys,Balloons (Toy) (1347),1347,...,,No,No,No,The consumer has the product.,,,,Power Balloon Latex balloons,[{'name': 'Treasures Gifted Power Rangers Ball...
1864,20130301-5255E-2147458337,3/1/2013,3/14/2013,3/28/2013,Consumer,"More ammo and more distance, Xploderz® creates...",Toys & Children,Toys,Toy Guns With Projectiles (1399),1399,...,,No,Yes,,I have contacted the manufacturer who wrote th...,,,,Xploderz X2 Retaliator 2000 X2 Retaliator 2000...,"[{'name': 'Xploderz 45211 Target Strike Set', ..."
1755,20110811-DF9B0-2147476356,8/11/2011,,9/1/2011,Consumer,Stomp Rocket Ultra,Toys & Children,Toys,Rocketry Sets (1314),1314,...,,No,No,Yes,I feel compelled to tell them of teh danger an...,,,,Stomp Rocket Ultra Ultra Stomp Rocket Ultra,[{'name': 'Stomp Rocket Launcher Kids Backyard...
1423,20120627-F455A-1255602,6/27/2012,7/27/2012,8/10/2012,Consumer,Electrical helicopter (3.5 channel RC helicopter),Toys & Children,Toys,Toy Vehicles (Excluding Riding Toys) (5021),5021,...,,No,No,Yes,The consumer will be contacting the manufactur...,,,,Syma S031G Electrical helicopter (3.5 channel...,[{'name': 'SYMA Military Helicopters Helicopte...
214,20110527-2D18E-2147478799,5/27/2011,,6/21/2011,Consumer,Razor 4 wheeler quad riding vehicle for childr...,Toys & Children,Riding Toys,Powered Riding Toys (1330),1330,...,,,Yes,,the manufacture was contacted by the family's ...,Razor appreciates all comments from its custo...,,,Razor The Razor Dirt Quad Electric Ride-On Raz...,[{'name': 'Razor Dirt Quad Variable Speed Hand...
1763,20110817-C461B-1197066,8/17/2011,10/7/2011,10/24/2011,Consumer,Dexton 12' Great Plains Teepee,Toys & Children,Toys,"Children's Play Tents, Play Tunnels or Other E...",1322,...,,No,Yes,,Consumer contacted mfr and explained incident ...,,,,Dexton Dexton 12' Great Plains Teepee Dexton ...,"[{'name': 'Dexton Great Plains Teepee', 'asin'..."
1646,20210703-DDAB6-2147363155,7/6/2021,7/8/2021,7/22/2021,Consumer,Earth magnets,Toys & Children,Toys,Building Sets (1345),1345,...,,,,,,Retrospective Goods LLC: The report of harm in...,,,Speks Spectrum 512 Speks Earth magnets,[{'name': 'Speks Magnetic Sensory Stuffer Mons...


In [29]:
reports.to_csv("asin_with_search_results.csv")
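
Because `to_csv` writes each list of dicts as its string representation, the `search_result` column comes back as text when the file is reloaded. A minimal sketch of turning a cell back into a list with `ast.literal_eval`; `parse_search_result` is a hypothetical helper, and it assumes that an empty or missing cell means no results were found:

```python
import ast


def parse_search_result(cell):
    """Convert a CSV cell back into a list of result dicts.

    Non-string cells (for example NaN for rows with no results) and blank
    strings are returned as an empty list.
    """
    if not isinstance(cell, str) or not cell.strip():
        return []
    return ast.literal_eval(cell)


# A cell as it would appear in the saved file (taken from the output above)
cell = "[{'name': '37649 New Century Racer', 'asin': 'B001CEH2DG', 'score': 72.22222222222223}]"

print(parse_search_result(cell)[0]["asin"])  # B001CEH2DG
print(parse_search_result(float("nan")))     # []
```

After `pd.read_csv("asin_with_search_results.csv", index_col=0)`, the helper could be applied with `reports["search_result"].apply(parse_search_result)`.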