# `duckduckgo_asins.ipynb`

In this notebook we match incident reports with the ASINs (Amazon Standard Identification Numbers) of the products they concern. 

We first use the `find_amazon_link_in_report` function to extract ASINs from Amazon links (including "a.co" short URLs) in the report text.

We then use the `amazon_asin` function, which searches for a product on duckduckgo.com and returns any amazon.com products among the first ten results, along with their ASINs and a "matching score" computed with rapidfuzz.

For example, running `amazon_asin("Ace the Data Science Interview: 201 Real Interview Questions by Nick Singh and Kevin Huo")` outputs the following:
```
[{'name': 'Ace Data Science Interview Questions',
  'asin': '0578973839',
  'score': 100.0},
 {'name': 'Ace Data Science Interview Interviews',
  'asin': '1956591133',
  'score': 82.53968253968254},
 {'name': 'Ace Data Engineering Interview Questions',
  'asin': 'B0F18SQNYL',
  'score': 82.35294117647058}]
```
**The results are in descending order of closeness of match (according to duckduckgo.com). This may not agree with the rapidfuzz ``score``.**
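Because the search order and the rapidfuzz `score` can disagree, it may be useful to re-rank a result list by score before choosing a match. The following is a minimal sketch, not part of `scrape_duckduckgo`: the helper `rank_by_score` and its `min_score` parameter are hypothetical names introduced for illustration, and the input is copied from the sample output shown later in this notebook.

```python
def rank_by_score(results, min_score=0.0):
    """Sort search results by descending score, dropping those below min_score.

    Hypothetical helper: it only assumes each result is a dict with a
    numeric 'score' key, as in the output of amazon_asin.
    """
    kept = [r for r in results if r["score"] >= min_score]
    # sorted() is stable, so results with equal scores keep their search order.
    return sorted(kept, key=lambda r: r["score"], reverse=True)


# Results in search order, taken from the sample output in this notebook.
search_results = [
    {"name": "Polka Dot Slime 12 Pack", "asin": "B0CJ9XS1NJ", "score": 35.71428571428571},
    {"name": "YOPINSAND Galaxy Making Add ins Glitters", "asin": "B0D5LX83X3", "score": 20.799999999999997},
    {"name": "SLIMYGLOOP MixEms Horizon Sparkly Glitter", "asin": "B07N84BJ63", "score": 20.634920634920633},
    {"name": "GirlZone Cosmic Premade Glitter Christmas", "asin": "B0B2Q8MR4Q", "score": 28.57142857142857},
]

ranked = rank_by_score(search_results)
ranked_asins = [r["asin"] for r in ranked]
print(ranked_asins)

# Keep only results scoring at least 25 (an arbitrary threshold for illustration).
strong_asins = [r["asin"] for r in rank_by_score(search_results, min_score=25)]
print(strong_asins)
```

Here the fourth search result moves up to second place once the list is ordered by score, which illustrates the disagreement described above.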

The `amazon_asin` function is then applied to the incident report data, and the results are saved to the `search_result` column of `asin_with_search_results.csv`.

Finally, we save the results to `df_asin.csv`, which consists of Amazon ASINs annotated with match status and report indices.

In [1]:
# By default, we will use the saved results from the scrape in the file asin_with_search_results.csv.
# The scraping took about 5 hours on my computer.
# Setting USE_SAVED_SCRAPE_RESULTS to False will carry out the scrape again, which is necessary if we are using new incident data.
USE_SAVED_SCRAPE_RESULTS=True

# If TESTING is True, the code will only run on a random sample of at most MAX_INCIDENTS rows of the incident report data.
# If TESTING is False, the code will run on the whole dataset.
TESTING=False
MAX_INCIDENTS=10

In [2]:
import pandas as pd

In [3]:
#Import incident report data

report_files=['../Data/Current Version of Toys Incidence+Recall/Toysandchildren_ArtsandCrafts.csv',
              '../Data/Current Version of Toys Incidence+Recall/Toysandchildren_Riding_Toys.csv',
              '../Data/Current Version of Toys Incidence+Recall/Toysandchildren_Toys.csv']
df_arts=pd.read_csv(report_files[0],header=2)
df_riding=pd.read_csv(report_files[1],header=1)
df_toy=pd.read_csv(report_files[2],header=1)
reports=pd.concat([df_arts,df_riding,df_toy],ignore_index=True)


to_rename=dict()
for column in reports.columns:
    to_rename[column]=column.replace(' / ','/').replace(' ','_').lower()
reports=reports.rename(columns=to_rename)

del df_riding, df_arts, df_toy, to_rename, report_files,column


In [4]:
# This contains a few "a.co" short URLs whose ASINs I entered manually.
link_lookup={'a.co/d/2299M1u': 'B0BWYHLHHZ',
                'a.co/d/63Rrqnu': 'B08BJYVMHS',
                'a.co/d/6QDqgip': 'B086WNVDTG',
                'a.co/d/7gfkDxl': 'B09Z68Q2K7',
                'a.co/d/7r4Z3S5': 'B095WQHW8L',
                'a.co/d/8jsyfzE': 'B09YHVY42K',
                'a.co/d/aCLKMub': 'B0CDLDCW38',
                'a.co/d/ckJk1jl': 'B0BQQP5WK5',
                'a.co/d/dai68Xv': 'B07Y1V52BT',
                'a.co/d/eOVs4JR': 'B0CHJTD1FS',
                'a.co/d/eV1EXP7': 'B0BWS22GVD',
                'a.co/d/glkRka6': 'B07BQFS9W8',
                'a.co/d/iPLpSs3': 'B0CH8DCDCC',
                'a.co/d/iqx6cmg': 'B0BBDX4W8T'}

In [5]:
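def find_amazon_link_in_report(x):
    # Scan every field of the report row for an Amazon link.
    for item in x:
        text=str(item)
        # Short URL: "a.co/d/" plus a 7-character code (14 characters in total).
        idx=text.find('a.co/d/')
        if idx!=-1:
            return link_lookup[text[idx:idx+14]]
        # Full product URL: the 10-character ASIN follows "/dp/".
        idx=text.find('/dp/')
        if idx!=-1:
            return text[idx+4:idx+14]
    return None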

In [6]:
reports['asin_in_report']=reports.apply(find_amazon_link_in_report,axis=1)
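As an alternative to the fixed-offset slicing in `find_amazon_link_in_report`, the same extraction could be written with regular expressions, which also checks that the characters after `/dp/` look like an ASIN. This is only a sketch under assumptions: `extract_asin`, `DP_PATTERN`, `SHORT_PATTERN` and `example_lookup` are hypothetical names, and the patterns assume 10-character alphanumeric ASINs and 7-character short-link codes, as in `link_lookup` above.

```python
import re

# Assumption: an ASIN is 10 alphanumeric characters directly after "/dp/".
DP_PATTERN = re.compile(r"/dp/([A-Za-z0-9]{10})")
# Assumption: short links are "a.co/d/" plus a 7-character code, as in link_lookup.
SHORT_PATTERN = re.compile(r"a\.co/d/[A-Za-z0-9]{7}")


def extract_asin(text, short_links):
    """Return the first ASIN found in text, or None if there is none.

    Hypothetical helper: short_links maps "a.co/d/..." keys to ASINs.
    """
    text = str(text)
    short = SHORT_PATTERN.search(text)
    if short and short.group(0) in short_links:
        return short_links[short.group(0)]
    # Fall back to a full product URL (also used when a short link is unknown).
    dp = DP_PATTERN.search(text)
    return dp.group(1) if dp else None


# One entry copied from link_lookup, for illustration.
example_lookup = {"a.co/d/2299M1u": "B0BWYHLHHZ"}

full_url = "https://www.amazon.com/Original-Glitter-Assorted-JA-RU-Scented/dp/B098TYWVR1?th=1"
from_full = extract_asin("Purchased from " + full_url, example_lookup)
from_short = extract_asin("Purchased from https://a.co/d/2299M1u", example_lookup)
from_none = extract_asin("No link in this report", example_lookup)
print(from_full, from_short, from_none)
```

Unlike the notebook's function, this sketch returns `None` for a short link that is missing from the lookup rather than raising a `KeyError`; whether that is preferable depends on whether missing entries should be noticed and filled in by hand.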


In [7]:
from tqdm import tqdm
tqdm.pandas()

In [8]:
from scrape_duckduckgo import amazon_asin

In [9]:
# Limit the number of records if TESTING is True
if TESTING and len(reports)>MAX_INCIDENTS:
    reports=reports.sample(MAX_INCIDENTS,random_state=1066)

In [10]:
# We define the query column, which is used to search for a product.
# It concatenates the brand, model_name_or_number, and product_description.

reports['query']=reports.brand.fillna('').astype(str)+' '+\
                 reports.model_name_or_number.fillna('').astype(str) + ' ' +\
                 reports.product_description.fillna('').astype(str)
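The concatenation above leaves extra spaces in the query when a field is missing. If that matters for the search, the query could instead be built by joining only the non-empty fields. A minimal sketch in plain Python, assuming missing values arrive as `None` or `NaN`; `build_query` is a hypothetical helper, not something the notebook defines.

```python
import math


def build_query(*fields):
    """Join the non-missing, non-empty fields with single spaces.

    Hypothetical helper: None and float NaN are treated as missing.
    """
    parts = []
    for field in fields:
        if field is None:
            continue
        if isinstance(field, float) and math.isnan(field):
            continue
        text = str(field).strip()
        if text:
            parts.append(text)
    return " ".join(parts)


# Brand, model and description from the sample output in this notebook;
# the first report has no model number.
query_without_model = build_query("POLKA DROP SLIME", float("nan"), "Slime globe with colored spheres")
query_with_model = build_query("Nickledodeon Slime", "Lot #281117", "Slime kit from Nickelodeon by Cra-Z-Art")
print(query_without_model)
print(query_with_model)
```

With pandas, such a helper could be applied row by row, although the vectorised string concatenation used in the notebook is likely faster on large data.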

In [11]:
if USE_SAVED_SCRAPE_RESULTS:
    reports=pd.read_csv("asin_with_search_results.csv",index_col=0)
    from ast import literal_eval
    reports.fillna({'search_result':"[]"},inplace=True)
    reports.search_result=reports.search_result.apply(lambda x : literal_eval(x))
else:
    reports['search_result']=reports['query'].progress_apply(amazon_asin)
    reports=reports.drop(columns=['query'])
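The `literal_eval` step is needed because a CSV file stores each list of result dictionaries as its string representation. The round trip can be sketched with the standard library alone; this illustrates the idea and is not the notebook's pandas code, and the column name `report_index` is an assumption made for the example.

```python
import csv
import io
from ast import literal_eval

# One result list copied from the sample output in this notebook.
results = [{"name": "Polka Dot Slime 12 Pack", "asin": "B0CJ9XS1NJ", "score": 35.71428571428571}]

# Write two rows to an in-memory CSV: one with results, one with an empty cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["report_index", "search_result"])
writer.writerow([0, str(results)])  # the list is stored as text
writer.writerow([1, ""])            # no search results for this report

# Read the rows back and turn the text into Python objects again.
buffer.seek(0)
restored = {}
for row in csv.DictReader(buffer):
    cell = row["search_result"] or "[]"  # treat an empty cell as an empty list
    restored[int(row["report_index"])] = literal_eval(cell)

print(restored[0][0]["asin"], restored[1])
```

`literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text read from a file.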

We now look at some of the results. It is important to note that the product names in the results do not necessarily summarise what the product is.
For example, consider the product with description ``FLARP- Noise Putty (Pink)Slime product``. One of the returned results is ``Original Glitter Assorted JA RU Scented``, but on visiting the [product's amazon.com page](https://www.amazon.com/Original-Glitter-Assorted-JA-RU-Scented/dp/B098TYWVR1?th=1), we see that the full product name is ``JA-RU Flip & Flarp Noise Putty for Kids Double Pack Original & Glitter (1 Pack Assorted),Farrt Gas Noise Maker Slime Cloud & Scented Putty Fidget Stress Toy for Boys, Girls & Adults. 047-1A ``, which better matches the product description.

In [12]:
for i,item in reports.iloc[:30].iterrows():
    print("Brand:", item.brand)
    print("Model No:", item.model_name_or_number)
    print("Description:", item.product_description)

    print("Search Results:")
    for result in item.search_result:
        print(result)
    print("\n \n")

Brand: POLKA DROP SLIME
Model No: nan
Description: Slime globe with colored spheres which resemble [REDACTED] cereal or [REDACTED]
Search Results:
{'name': 'Polka Dot Slime 12 Pack', 'asin': 'B0CJ9XS1NJ', 'score': 35.71428571428571}
{'name': 'YOPINSAND Galaxy Making Add ins Glitters', 'asin': 'B0D5LX83X3', 'score': 20.799999999999997}
{'name': 'SLIMYGLOOP MixEms Horizon Sparkly Glitter', 'asin': 'B07N84BJ63', 'score': 20.634920634920633}
{'name': 'GirlZone Cosmic Premade Glitter Christmas', 'asin': 'B0B2Q8MR4Q', 'score': 28.57142857142857}

 

Brand: Nickledodeon Slime
Model No: Lot #281117
Description: Slime kit from Nickelodeon by Cra-Z-Art
Search Results:
{'name': 'Cra Z Art Nickelodeon Stress Less Slime', 'asin': 'B07VB9PHLH', 'score': 60.714285714285715}
{'name': 'Cra Z Art Nickelodeon Pre Made Slime Super', 'asin': 'B089MWDPVB', 'score': 57.6271186440678}

 

Brand: Lalaloopsy Color Me ( Squiggles N. Shapes )
Model No: 531463/531470
Description: Lalaloopsy Color Me Doll ( Squiggl

In [13]:
# Load the ASINs from the Amazon metadata
amazon_meta_df=pd.read_pickle("../Data/metadata_raw.pkl")[['asin']]

In [14]:
# Add "match" and "indices" columns to amazon_meta_df

def build_asin_dict():
    # Map each lowercased ASIN to the indices of the reports that mention it,
    # either via a link in the report or via a search result.
    asins=dict()
    for index,report in reports.iterrows():
        if report.notna().asin_in_report:
            asin=report.asin_in_report.lower()
            asins.setdefault(asin,[]).append(index)
        if report.notna().search_result:
            for result in report.search_result:
                asin=result['asin'].lower()
                asins.setdefault(asin,[]).append(index)
    return asins

asins=build_asin_dict()
# 'indices' lists the matching report indices; 'match' is 1 if there are any, else 0.
amazon_meta_df['indices']=amazon_meta_df.asin.apply(lambda x : asins.get(x.lower(),[]))
amazon_meta_df['match']=amazon_meta_df.indices.apply(lambda x: int(x!=[]))
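The cell above builds an inverted index from ASINs to report indices and then looks up each metadata ASIN in it. The same idea can be sketched on toy data without pandas. The names `build_asin_index`, `example_reports` and `metadata_asins` are hypothetical, and `B000000000` is a made-up placeholder ASIN used only to show a non-match.

```python
from collections import defaultdict


def build_asin_index(reports):
    """Map each lowercased ASIN to the list of report ids that mention it.

    Hypothetical helper: reports maps a report id to a dict with an optional
    'asin_in_report' string and a 'search_result' list of dicts.
    """
    index = defaultdict(list)
    for report_id, report in reports.items():
        if report.get("asin_in_report"):
            index[report["asin_in_report"].lower()].append(report_id)
        for result in report.get("search_result") or []:
            index[result["asin"].lower()].append(report_id)
    return dict(index)


# Toy reports: ASINs taken from the sample output in this notebook.
example_reports = {
    0: {"asin_in_report": None,
        "search_result": [{"asin": "B07VB9PHLH"}, {"asin": "B089MWDPVB"}]},
    1: {"asin_in_report": "B07VB9PHLH", "search_result": []},
}
asin_index = build_asin_index(example_reports)

# Toy metadata: one ASIN that appears in the reports, one placeholder that does not.
metadata_asins = ["B07VB9PHLH", "B000000000"]
indices = {a: asin_index.get(a.lower(), []) for a in metadata_asins}
match = {a: int(bool(indices[a])) for a in metadata_asins}
print(indices)
print(match)
```

As in the notebook's version, a report index can appear twice for the same ASIN if the ASIN is found both in a link and in the search results.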

In [15]:
amazon_meta_df.to_csv("df_asin.csv",columns=['match','indices'])

In [None]:
amazon_meta_df=pd.read_pickle("../Data/metadata_raw.pkl")