# Documentation: `generate_df.ipynb`

## Purpose
Extracts ASINs from `amazon_meta.json`, flags those that appear in incident reports or in their DuckDuckGo search results, and writes CSVs for downstream fuzzy matching and search result analysis.

## Main Steps
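- Reads raw Amazon metadata (`amazon_meta.json`, one JSON object per line)
- Extracts the `asin` column and writes it to `only_asins.csv` (skipped if the file already exists)
- Reads `asin_with_search_results.csv` and parses its `search_result` column from strings into lists of dicts
- Adds `indices` (matching report rows) and `match` (0/1 flag) columns to the ASIN table and writes them to `df_asin.csv`
- Prepares data for downstream fuzzy matching and search result analysis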
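
The first two steps above can be sketched on an in-memory sample. This is a minimal illustration rather than the notebook's own code: the two records and their `title` field are made up, and `io.StringIO` buffers stand in for `amazon_meta.json` and `only_asins.csv`.

```python
import io

import pandas as pd

# Hypothetical two-record sample in JSON Lines form (one JSON object per line).
sample = io.StringIO(
    '{"asin": "B000000001", "title": "Widget"}\n'
    '{"asin": "B000000002", "title": "Gadget"}\n'
)

# lines=True tells pandas to parse one JSON object per line.
meta = pd.read_json(sample, lines=True)[["asin"]]

# Write the asin column with the default integer index, then read it back
# with index_col=0 so the index is restored instead of becoming a column.
buffer = io.StringIO()
meta.to_csv(buffer, columns=["asin"])
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)
```

Reading back with `index_col=0` matters because `to_csv` writes the row index as the first column by default; without it the index would come back as an extra unnamed column.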

## Input Files
- `../Data/amazon_meta.json`
- `only_asins.csv`
- `df_asin.csv`
- `asin_with_search_results.csv`
- `link_lookup.csv`

## Output Files
- `only_asins.csv`
- `df_asin.csv`

## Usage
Run each cell in order to generate and process the ASIN lists.

---

In [13]:
TESTING = False
AMAZON_TEST_SAMPLE = 200_000

In [14]:
import pandas as pd


In [15]:
# Create only_asins.csv (just the asin column of the Amazon metadata) if it doesn't exist yet
import os

if not os.path.exists("only_asins.csv"):
    # lines=True: amazon_meta.json holds one JSON object per line
    amazon_meta_df = pd.read_json("../Data/amazon_meta.json", lines=True)[["asin"]]
    amazon_meta_df.to_csv("only_asins.csv", columns=["asin"])

In [16]:
# Load amazon metadata asins
amazon_meta_df=pd.read_csv("only_asins.csv",index_col=0)

In [17]:
# Load incident reports with embedded asins (generated via asin_in_text.ipynb) and duckduckgo.com results (from duckduckgo_asins.ipynb)
report_df=pd.read_csv("asin_with_search_results.csv",index_col=0)

In [18]:
# Convert report_df.search_result from str to list of dicts
from ast import literal_eval

# Missing values become the string "[]" so that every cell parses to a list
report_df.fillna({'search_result': "[]"}, inplace=True)
report_df.search_result = report_df.search_result.apply(literal_eval)
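
As a small illustration of the conversion above: a list of dicts saved to CSV comes back as its string representation, and `ast.literal_eval` turns that string back into Python objects without executing arbitrary code. The sample string and its `title` key are hypothetical, not taken from the real data.

```python
from ast import literal_eval

# Hypothetical cell value as it might look after a CSV round trip.
raw = "[{'asin': 'B000000001', 'title': 'Widget'}]"
parsed = literal_eval(raw)

# An empty-list literal stands in for a missing value, mirroring the fillna step.
empty = literal_eval("[]")

# literal_eval accepts only Python literals; anything else raises ValueError.
try:
    literal_eval("__import__('os')")
    rejected = False
except ValueError:
    rejected = True
```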

In [19]:
# Add "match" and "indices" columns to amazon_meta_df

def build_asin_dict():
    """Map each lowercased ASIN to the report_df row indices that mention it."""
    asins = dict()
    for index, report in report_df.iterrows():
        # ASIN embedded in the report text, if any
        if pd.notna(report.asin_in_report):
            asins.setdefault(report.asin_in_report.lower(), []).append(index)
        # ASINs from the search results (an empty list when there were none)
        for result in report.search_result:
            asins.setdefault(result['asin'].lower(), []).append(index)
    return asins

asins = build_asin_dict()
# indices: report rows that mention the ASIN; match: 1 if there is at least one, else 0
amazon_meta_df['indices'] = amazon_meta_df.asin.apply(lambda x: asins.get(x.lower(), []))
amazon_meta_df['match'] = amazon_meta_df.indices.apply(lambda x: int(len(x) > 0))
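
The cell above builds an inverted index from ASIN to report rows. A standalone sketch of the same idea, using plain Python structures and made-up data (the `reports` and `catalog` names are hypothetical and `defaultdict` is swapped in for the dict handling used in the notebook), might look like this:

```python
from collections import defaultdict

# Hypothetical reports: row index -> (embedded ASIN or None, search results).
reports = {
    0: ("B000000001", [{"asin": "b000000002"}]),
    1: (None, [{"asin": "B000000001"}]),
    2: (None, []),
}

# Inverted index: lowercased ASIN -> list of report rows that mention it.
asin_to_rows = defaultdict(list)
for row, (embedded, results) in reports.items():
    if embedded is not None:
        asin_to_rows[embedded.lower()].append(row)
    for result in results:
        asin_to_rows[result["asin"].lower()].append(row)

# Look up each catalog ASIN; .get avoids creating keys for misses.
catalog = ["B000000001", "B000000002", "B000000003"]
indices = [asin_to_rows.get(asin.lower(), []) for asin in catalog]
match = [int(bool(rows)) for rows in indices]
```

Lowercasing on both sides makes the lookup case-insensitive, so `b000000002` in a search result still matches `B000000002` in the catalog.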

In [20]:
amazon_meta_df.to_csv("df_asin.csv",columns=['match','indices'])
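
One consequence of writing a list-valued column such as `indices` to CSV is that it is stored as text. A downstream reader could parse it back with `literal_eval`, as in this sketch; the two-row frame is made up and an in-memory buffer stands in for `df_asin.csv`.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical frame shaped like the match/indices output of the notebook.
df = pd.DataFrame({"match": [1, 0], "indices": [[3, 7], []]})

buffer = io.StringIO()
df.to_csv(buffer, columns=["match", "indices"])
buffer.seek(0)

# converters applies literal_eval to each cell of the indices column on read,
# turning the stored text back into Python lists.
restored = pd.read_csv(buffer, index_col=0, converters={"indices": literal_eval})
```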