# Documentation for `asin_in_text.ipynb`

## Overview
This notebook processes incident and recall report data for toys and children’s products to extract Amazon ASINs (Amazon Standard Identification Numbers) directly from the report text. It identifies both standard Amazon product links and shortened "a.co" URLs, using a lookup table for the latter, and saves the extracted ASINs for further analysis.

## Workflow Steps
1. **Load Incident Report Data:**  
   Reads three CSV files containing incident and recall reports for different toy categories:
   - `Toysandchildren_ArtsandCrafts.csv`
   - `Toysandchildren_Riding_Toys.csv`
   - `Toysandchildren_Toys.csv`

   These are concatenated into a single DataFrame, and the column names are standardized for consistency.
2. **Define Link Lookup Table:**  
   A dictionary (`link_lookup`) is defined to map shortened "a.co" URLs to their corresponding ASINs.
3. **Extract ASINs from Reports:**  
   The function `find_amazon_link_in_report` scans each row for Amazon product links:
   - If a shortened "a.co/d/" URL is found, it uses the lookup table to retrieve the ASIN.
   - If a standard Amazon `/dp/` URL is found, it extracts the 10-character ASIN following `/dp/`.
   - The first ASIN found is returned; if none is found, the function returns `None`.

   The extracted ASIN is stored in a new column, `asin_in_report`.
4. **Save Results:**  
   The processed DataFrame, including the new `asin_in_report` column, is saved to `reports.csv`.
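
The column standardization in step 1 can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the column names and the helper `standardize` are hypothetical, and only the three string operations (collapse `' / '` to `'/'`, replace spaces with underscores, lowercase) are taken from the notebook.

```python
import pandas as pd

# Hypothetical column names, for illustration only; the real CSV headers may differ.
sample = pd.DataFrame(columns=["Report No.", "Product Description", "Brand / Manufacturer"])


def standardize(name: str) -> str:
    """Apply the same three string operations the notebook uses on column names."""
    return name.replace(" / ", "/").replace(" ", "_").lower()


# rename() accepts a callable and applies it to every column label.
sample = sample.rename(columns=standardize)
print(list(sample.columns))
# ['report_no.', 'product_description', 'brand/manufacturer']
```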

## Input Files
- `../Data/Current Version of Toys Incidence+Recall/Toysandchildren_ArtsandCrafts.csv`
- `../Data/Current Version of Toys Incidence+Recall/Toysandchildren_Riding_Toys.csv`
- `../Data/Current Version of Toys Incidence+Recall/Toysandchildren_Toys.csv`
- `link_lookup` (dictionary defined in the notebook)
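
The notebook reads the Arts and Crafts file with `header=2` and the other two files with `header=1`, which suggests that the files carry a different number of preamble lines above the column names. The sketch below shows the effect of that parameter on made-up in-memory data; the file contents are invented for illustration and do not reflect the real CSVs.

```python
import io

import pandas as pd

# Made-up file contents: the first has two preamble lines, the second has one.
arts_csv = "Exported report\nCategory: Arts and Crafts\nReport No.,Product\n1,Paint set\n"
riding_csv = "Exported report\nReport No.,Product\n2,Scooter\n"

# header=N tells pandas that the column names are on line N (0-indexed);
# everything above that line is discarded.
df_arts = pd.read_csv(io.StringIO(arts_csv), header=2)
df_riding = pd.read_csv(io.StringIO(riding_csv), header=1)

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
combined = pd.concat([df_arts, df_riding], ignore_index=True)
print(combined)
```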

## Output Files
- `reports.csv` (incident reports with extracted ASINs in the `asin_in_report` column)
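
Because the notebook calls `to_csv` without `index=False`, the row index is written as an unnamed first column. The following sketch, which uses a toy DataFrame and an in-memory buffer in place of the real `reports.csv`, shows one way to read such a file back and count the rows that contain an ASIN.

```python
import io

import pandas as pd

# Toy stand-in for the processed reports; the values are illustrative only.
reports = pd.DataFrame({"asin_in_report": ["B0BWYHLHHZ", None, "B07Y1V52BT"]})

buffer = io.StringIO()
reports.to_csv(buffer)  # same call shape as the notebook: the index is written too
buffer.seek(0)

# index_col=0 restores the saved index instead of creating an "Unnamed: 0" column.
loaded = pd.read_csv(buffer, index_col=0)
n_found = int(loaded["asin_in_report"].notna().sum())
print(f"{n_found} of {len(loaded)} reports contain an ASIN")
```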

## Key Function

### `find_amazon_link_in_report(x)`
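Scans the values of one DataFrame row, converting each to a string, and extracts an Amazon ASIN from the first Amazon product link it finds. Within each value, the short-link check runs before the `/dp/` check.
- If a shortened "a.co/d/" URL is found, the 14-character short URL (`a.co/d/` plus a 7-character code) is looked up in the `link_lookup` dictionary to get the ASIN. A short URL that is missing from `link_lookup` raises a `KeyError`.
- If a standard Amazon `/dp/` URL is found, it extracts the 10 characters following `/dp/` as the ASIN, without further validation.
- Returns the first ASIN found, or `None` if no link is present.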
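
As an alternative to the substring search, the same idea can be sketched with regular expressions. This is not the notebook's implementation: the names `find_asin`, `SHORT_URL`, and `DP_URL` are hypothetical, the pattern assumes that an ASIN is 10 uppercase letters or digits, and an unknown short link falls through to the `/dp/` check instead of raising `KeyError`.

```python
import re
from typing import Iterable, Optional

# Two entries copied from the notebook's lookup table; the real table has more.
link_lookup = {
    "a.co/d/2299M1u": "B0BWYHLHHZ",
    "a.co/d/63Rrqnu": "B08BJYVMHS",
}

# Assumption: short codes are 7 alphanumeric characters, ASINs 10 uppercase alphanumerics.
SHORT_URL = re.compile(r"a\.co/d/[A-Za-z0-9]{7}")
DP_URL = re.compile(r"/dp/([A-Z0-9]{10})")


def find_asin(cells: Iterable[object]) -> Optional[str]:
    """Return the first ASIN found in any cell, or None if there is none."""
    for cell in cells:
        text = str(cell)
        short = SHORT_URL.search(text)
        if short:
            # .get() returns None for unknown short links instead of raising.
            asin = link_lookup.get(short.group(0))
            if asin is not None:
                return asin
        dp = DP_URL.search(text)
        if dp:
            return dp.group(1)
    return None


print(find_asin(["see https://www.amazon.com/dp/B07Y1V52BT?th=1"]))  # B07Y1V52BT
print(find_asin(["bought at https://a.co/d/2299M1u", None]))         # B0BWYHLHHZ
print(find_asin(["no link here", 3.5]))                              # None
```

Silently skipping unknown short links trades the notebook's loud failure for convenience, so in practice it may be worth logging those links so that they can be added to the lookup table.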

## Usage
Run the notebook cells sequentially to extract ASINs from incident report data and save the results for further analysis.

---

In [3]:
import pandas as pd

In [4]:
# Import incident report data

report_files = ['../Data/Current Version of Toys Incidence+Recall/Toysandchildren_ArtsandCrafts.csv',
                '../Data/Current Version of Toys Incidence+Recall/Toysandchildren_Riding_Toys.csv',
                '../Data/Current Version of Toys Incidence+Recall/Toysandchildren_Toys.csv']
# The header row is at index 2 in the first file and at index 1 in the other two.
df_arts = pd.read_csv(report_files[0], header=2)
df_riding = pd.read_csv(report_files[1], header=1)
df_toy = pd.read_csv(report_files[2], header=1)
reports = pd.concat([df_arts, df_riding, df_toy], ignore_index=True)

# Standardize column names: ' / ' becomes '/', spaces become underscores, all lowercase.
to_rename = dict()
for column in reports.columns:
    to_rename[column] = column.replace(' / ', '/').replace(' ', '_').lower()
reports = reports.rename(columns=to_rename)

# Remove temporary variables from the notebook namespace.
del df_riding, df_arts, df_toy, to_rename, report_files, column


In [13]:
# A few "a.co" short URLs whose ASINs I looked up and entered manually.
link_lookup={'a.co/d/2299M1u': 'B0BWYHLHHZ',
                'a.co/d/63Rrqnu': 'B08BJYVMHS',
                'a.co/d/6QDqgip': 'B086WNVDTG',
                'a.co/d/7gfkDxl': 'B09Z68Q2K7',
                'a.co/d/7r4Z3S5': 'B095WQHW8L',
                'a.co/d/8jsyfzE': 'B09YHVY42K',
                'a.co/d/aCLKMub': 'B0CDLDCW38',
                'a.co/d/ckJk1jl': 'B0BQQP5WK5',
                'a.co/d/dai68Xv': 'B07Y1V52BT',
                'a.co/d/eOVs4JR': 'B0CHJTD1FS',
                'a.co/d/eV1EXP7': 'B0BWS22GVD',
                'a.co/d/glkRka6': 'B07BQFS9W8',
                'a.co/d/iPLpSs3': 'B0CH8DCDCC',
                'a.co/d/iqx6cmg': 'B0BBDX4W8T'}

In [6]:
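def find_amazon_link_in_report(x):
    """Return the first ASIN found in the values of row x, or None."""
    for item in x:
        text = str(item)
        # Short link: 'a.co/d/' plus a 7-character code (14 characters in total).
        # Raises KeyError if the short URL is missing from link_lookup.
        idx = text.find('a.co/d/')
        if idx != -1:
            return link_lookup[text[idx:idx+14]]
        # Standard link: the ASIN is the 10 characters right after '/dp/'.
        idx = text.find('/dp/')
        if idx != -1:
            return text[idx+4:idx+14]
    return None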

In [7]:
reports['asin_in_report']=reports.apply(find_amazon_link_in_report,axis=1)


In [8]:
reports.to_csv("reports.csv")