# Processor

This Python notebook contains a set of functions that take the JSON output of the `scraper` tool and convert it into a format that we can serve on the final website.

## JSON to CSV

As our input data is currently formatted in JSON, we want to convert that to CSV so we can work with entire columns, and eventually push to a database like PostgreSQL.

To avoid corrupting the original data that was collected, place a copy of the data in the convenience folder `./_raw_json` in this directory, and only modify this copied data as you work with the modules in this notebook.

The following block of code goes through all the JSON files in the specified directory, flattens them, converts them to CSV files, and saves them to disk in the `./_raw_csv` directory.
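To make "flatten" concrete, here is a minimal standard-library sketch of the idea. It is a stand-in for illustration only, not the `flatten_json` implementation: the helper name `flatten_sketch` and the sample record are hypothetical, and the underscore separator is assumed to mirror that library's default.

```python
def flatten_sketch(obj, parent_key="", sep="_"):
    """Collapse nested dicts/lists into a single-level dict of joined keys."""
    if isinstance(obj, dict):
        children = obj.items()
    elif isinstance(obj, list):
        # List positions become part of the key (0, 1, 2, ...).
        children = enumerate(obj)
    else:
        # Leaf value: store it under the key built so far.
        return {parent_key: obj}

    items = {}
    for key, value in children:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        items.update(flatten_sketch(value, new_key, sep))
    return items


# Made-up record, shaped like one part after its specs have been re-keyed.
part = {
    "mpn": "ABC123",
    "specs": {"capacitance": {"display_value": "100 nF", "id": "548"}},
}
flat_part = flatten_sketch(part)
print(flat_part)
```

Each nested path becomes one key, which is what lets every part occupy a single CSV row with one column per path.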

**NOTE**: This is the only part of the process that's somewhat hardcoded. To combine the JSON into a single CSV, the columns have to be named consistently, and it's easier to achieve this by editing the JSON before saving it as CSV.
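One way to spot inconsistent names before combining is to compare the column sets of each file. The sketch below is not part of the original notebook: the helper `column_mismatches` and the column names in the sample are hypothetical.

```python
def column_mismatches(columns_by_file):
    """Return, per file, the columns that are not shared by every file."""
    all_sets = [set(cols) for cols in columns_by_file.values()]
    shared = set.intersection(*all_sets) if all_sets else set()
    return {
        name: sorted(set(cols) - shared)
        for name, cols in columns_by_file.items()
        if set(cols) - shared
    }


# Hypothetical headers: the second file spells one column differently.
columns_by_file = {
    "film": ["part_mpn", "part_specs_capacitance_display_value"],
    "mica": ["part_mpn", "part_specs_capacitance_displayValue"],
}
mismatches = column_mismatches(columns_by_file)
print(mismatches)
```

Not every reported column is an error, since some attributes only exist for some part types, but near-duplicate spellings like the one above are the cases worth fixing in the JSON.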

In [25]:
import json
import os
import pandas as pd

from flatten_json import flatten
from tqdm import tqdm

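JSON_DIR = './_raw_json'
CSV_DIR = './_raw_csv'

# Make sure the output directory exists before any CSV is written to it.
os.makedirs(CSV_DIR, exist_ok=True)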

#### Helper Functions

In [32]:
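def delete_keys_from_dict(d, to_delete):
    """Recursively remove the given key(s) from `d`, modifying it in place.

    `d` may be a dict or a list; `to_delete` may be a single key or a list of keys.
    """
    if isinstance(to_delete, str):
        to_delete = [to_delete]
    if isinstance(d, dict):
        # Drop the unwanted keys at this level first...
        for single_to_delete in set(to_delete):
            if single_to_delete in d:
                del d[single_to_delete]
        # ...then recurse into whatever values remain.
        for v in d.values():
            delete_keys_from_dict(v, to_delete)
    elif isinstance(d, list):
        for i in d:
            delete_keys_from_dict(i, to_delete)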
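A quick check of the helper on a small fragment shows the in-place behaviour. The sample data is made up for illustration, and the function is repeated here only so that the snippet runs on its own.

```python
def delete_keys_from_dict(d, to_delete):
    if isinstance(to_delete, str):
        to_delete = [to_delete]
    if isinstance(d, dict):
        for single_to_delete in set(to_delete):
            if single_to_delete in d:
                del d[single_to_delete]
        for k, v in d.items():
            delete_keys_from_dict(v, to_delete)
    elif isinstance(d, list):
        for i in d:
            delete_keys_from_dict(i, to_delete)


# Hypothetical fragment with unwanted keys at several nesting levels.
sample = {
    "__typename": "PartResult",
    "part": {
        "mpn": "ABC123",
        "best_image": {"url": "https://example.com/img.png"},
        "specs": [{"__typename": "Spec", "display_value": "100 nF"}],
    },
}

# The function returns None; `sample` itself is modified.
delete_keys_from_dict(sample, ["__typename", "best_image"])
print(sample)
```

Because the data is modified in place, work on the copy in `./_raw_json` rather than on the originally collected files.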

#### Run

In [36]:
for filename in tqdm(os.listdir(JSON_DIR)):
    if filename.endswith('.json'):
        with open(f'{JSON_DIR}/{filename}', 'r', encoding='utf-8') as f:
            data = json.load(f)
            
            # 1. First globally delete all the keys that we don't want.
            delete_keys_from_dict(data, ['__typename', 'best_datasheet', 'best_image', 'manufacturer_url'])
            
            for result in data['data']['search']['results']:
                # 2. Within the JSON, flatten the `specs` array from
                #    'specs': [{'attribute': {'id': '548',
                #                             'name': 'Capacitance',
                #                             'shortname': 'capacitance',
                #                             '__typename': 'Attribute'
                #                            },
                #               'display_value': '100 nF'
                #              },
                #              { ... },
                #              { ... },
                #              ...
                #             ]
                #    to
                #    'specs': {'capacitance': {'display_value': '100 nF', 'id': '548'},
                #              'case_package': {'display_value': 'Radial', 'id': '842'},
                #              'depth': {'display_value': '8 mm', 'id': '291'},
                #              ...
                #             }    
                #    and remove some fields that we don't want to include.
                spec_json = {}
                for spec in result['part']['specs']:
                    title = spec['attribute']['shortname']
                    spec['attribute']['display_value'] = spec['display_value']
                    spec = spec['attribute']
                    del spec['shortname']
                    del spec['name']
                    spec_json[title] = spec
                result['part']['specs'] = spec_json


                # 3. Remove specific parts of the JSON that we don't want (duplicate fields, etc).
                # `pop(..., None)` so a part that lacks one of these keys doesn't abort the run.
                result['part'].pop('_cache_id', None)
                result['part'].pop('descriptions', None)
                result['part'].pop('counts', None)
            
            # 4. Run the `flatten` function on each of the parts, place it in a list, and convert 
            #    to a Pandas DF.
            flat = [flatten(d) for d in data['data']['search']['results']]
    
            df = pd.DataFrame(flat, dtype='str')
            # `splitext` strips only the final extension, so dots in the base name are kept.
            df.to_csv(f"{CSV_DIR}/{os.path.splitext(filename)[0]}.csv", index=False)


100%|███████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:12<00:00, 18.07s/it]
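Step 2 of the cell above can be tried in isolation on a couple of spec entries. This is a sketch rather than the notebook's code: the helper name `reshape_specs` is hypothetical, the `name` of the second attribute is made up, and unlike the loop above it builds a new dict and keeps only `id` and `display_value` instead of editing the entries in place.

```python
def reshape_specs(specs):
    """Turn the list of spec entries into a dict keyed by attribute shortname."""
    reshaped = {}
    for spec in specs:
        attribute = spec["attribute"]
        reshaped[attribute["shortname"]] = {
            "id": attribute["id"],
            "display_value": spec["display_value"],
        }
    return reshaped


specs = [
    {
        "attribute": {"id": "548", "name": "Capacitance", "shortname": "capacitance"},
        "display_value": "100 nF",
    },
    {
        "attribute": {"id": "842", "name": "Case/Package", "shortname": "case_package"},
        "display_value": "Radial",
    },
]
reshaped = reshape_specs(specs)
print(reshaped)
```

Keying by `shortname` is what gives each attribute a stable column name after flattening, instead of a position-based one that depends on the order of the `specs` list.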


#### Combine

In [45]:
# Exclude only the combined output itself, so re-running the cell doesn't fold `all.csv` back in.
filenames = [f"{CSV_DIR}/{filename}" for filename in os.listdir(CSV_DIR)
             if filename.endswith('.csv') and filename != 'all.csv']
df = pd.concat(map(pd.read_csv, filenames), ignore_index=True)
df = df.astype(str)
df.to_csv(f"{CSV_DIR}/all.csv", index=False)

['./_raw_csv/film.csv', './_raw_csv/mica.csv', './_raw_csv/ceramic.csv', './_raw_csv/aluminum_electrolytic.csv']
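The effect of the combine step on columns that exist in only some files can be seen on two tiny in-memory CSVs. This is a sketch with made-up data and column names; it reads every column as a string, which is an assumption and differs slightly from the cell above.

```python
import io

import pandas as pd

# Two hypothetical per-category CSVs that share only the `part_mpn` column.
film_csv = io.StringIO(
    "part_mpn,part_specs_capacitance_display_value\nABC123,100 nF\n"
)
mica_csv = io.StringIO(
    "part_mpn,part_specs_voltage_display_value\nXYZ789,500 V\n"
)

frames = [pd.read_csv(f, dtype=str) for f in (film_csv, mica_csv)]

# `pd.concat` takes the union of the columns; cells with no source value are missing.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

A column that appears under two different names ends up as two half-empty columns in `all.csv`, which is why the names are aligned before this step.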
