# Documentation for `3_generate_df_asin.ipynb`

## Overview
This notebook builds a DataFrame of Amazon ASINs and annotates each one with whether it appears in incident reports (either directly in the report text or via DuckDuckGo search results) and with the indices of the matching reports. The result is a per-ASIN match table.

## Workflow Steps
1. **Load Amazon Metadata:**  
   Reads `../Data/amazon_meta.json` (JSON Lines format) and keeps only the `asin` column.
2. **Load Incident Reports:**  
   Loads `asin_with_search_results.csv`, which contains incident reports with ASINs extracted from text and DuckDuckGo search results.
3. **Parse Search Results:**  
   Converts the `search_result` column from a string to a list of dictionaries. Missing values become empty lists.
4. **Build ASIN-to-Report Mapping:**  
   Iterates through the incident reports to build a dictionary that maps each ASIN (from both direct extraction and search results) to the indices of the reports where it appears. ASINs are lowercased so that matching is case-insensitive.
5. **Annotate ASINs:**  
   Adds two columns to the ASIN DataFrame:
   - `indices`: List of report indices where the ASIN appears.
   - `match`: 1 if the ASIN appears in any report, 0 otherwise.
6. **Save Results:**  
   Writes the `match` and `indices` columns of the annotated ASIN DataFrame to `../Data/df_asin.csv`.
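
Steps 3 to 5 can be sketched on a toy example. This is a minimal illustration, not the notebook's exact code: the ASIN values, the report indices, and the name `asin_to_reports` are hypothetical, and only the column names `asin`, `asin_in_report`, and `search_result` come from the notebook.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical toy data standing in for amazon_meta.json and
# asin_with_search_results.csv.
asin_df = pd.DataFrame({"asin": ["B000AAA111", "B000BBB222", "B000CCC333"]})
report_df = pd.DataFrame(
    {
        "asin_in_report": ["b000aaa111", None],
        "search_result": [None, "[{'asin': 'B000BBB222'}]"],
    },
    index=[10, 11],
)

# Step 3: parse the stringified lists, treating missing values as empty lists.
report_df["search_result"] = report_df["search_result"].fillna("[]").apply(literal_eval)

# Step 4: map each lowercased ASIN to the report indices that mention it.
asin_to_reports = {}
for index, report in report_df.iterrows():
    if pd.notna(report["asin_in_report"]):
        asin_to_reports.setdefault(report["asin_in_report"].lower(), []).append(index)
    for result in report["search_result"]:
        asin_to_reports.setdefault(result["asin"].lower(), []).append(index)

# Step 5: annotate every ASIN with its report indices and a 0/1 match flag.
asin_df["indices"] = asin_df["asin"].apply(lambda a: asin_to_reports.get(a.lower(), []))
asin_df["match"] = asin_df["indices"].apply(lambda idx: int(len(idx) > 0))

print(asin_df)
```

In this toy run the first ASIN matches report 10 through direct extraction, the second matches report 11 through a search result, and the third has no match.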

## Input Files
- `../Data/amazon_meta.json` (Amazon product metadata)
- `asin_with_search_results.csv` (incident reports with ASINs from text and search)

## Output Files
- `../Data/df_asin.csv` (per-ASIN `match` flag and `indices` list, keyed by the row index of the ASIN DataFrame)

## Usage
Run each cell in order to generate the annotated ASIN list for downstream analysis.
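
For downstream use, note that a CSV stores the `indices` lists as text, so they need to be parsed when the file is read back. The sketch below round-trips a small hypothetical DataFrame through a temporary file to show one way of doing this; the sample values and the temporary path are assumptions, and only the column names come from the notebook.

```python
import tempfile
from ast import literal_eval
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the annotated ASIN DataFrame.
annotated = pd.DataFrame({"match": [1, 0], "indices": [[10, 12], []]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "df_asin.csv"
    annotated.to_csv(path, columns=["match", "indices"])

    # Lists are serialized as strings such as "[10, 12]"; parse them on load.
    loaded = pd.read_csv(path, index_col=0, converters={"indices": literal_eval})

# Keep only the rows that matched at least one report.
matched = loaded[loaded["match"] == 1]
print(matched)
```

Because the notebook writes only `match` and `indices`, the row index is the link back to the ASINs: it corresponds to the row order of the metadata file.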

---

In [13]:
TESTING = False
AMAZON_TEST_SAMPLE = 200_000

In [14]:
import pandas as pd


In [None]:
# Load the ASIN column from the Amazon metadata (JSON Lines format)
amazon_meta_df = pd.read_json("../Data/amazon_meta.json", lines=True)[["asin"]]

In [17]:
# Load incident reports with embedded ASINs (generated via asin_in_text.ipynb)
# and duckduckgo.com results (from duckduckgo_asins.ipynb)
report_df = pd.read_csv("asin_with_search_results.csv", index_col=0)

In [18]:
# Convert report_df.search_result from str to list of dicts
from ast import literal_eval

# Missing search results become empty lists
report_df["search_result"] = report_df["search_result"].fillna("[]").apply(literal_eval)

In [19]:
# Add "indices" and "match" columns to amazon_meta_df

def build_asin_dict():
    """Map each lowercased ASIN to the indices of the reports that mention it."""
    asins = {}
    for index, report in report_df.iterrows():
        # ASIN extracted directly from the report text
        if pd.notna(report.asin_in_report):
            asins.setdefault(report.asin_in_report.lower(), []).append(index)
        # ASINs found through search results (an empty list if there were none)
        for result in report.search_result:
            asins.setdefault(result["asin"].lower(), []).append(index)
    return asins

asins = build_asin_dict()
amazon_meta_df["indices"] = amazon_meta_df.asin.apply(lambda x: asins.get(x.lower(), []))
amazon_meta_df["match"] = amazon_meta_df.indices.apply(lambda x: int(len(x) > 0))

In [None]:
# Save the match flag and report indices; the row index is written, the asin column is not
amazon_meta_df.to_csv("../Data/df_asin.csv", columns=["match", "indices"])