# Processor

This Python notebook is a set of functions that take the JSON output of the `scraper` tool and transform it into a format that we can serve on the final website.

## JSON to CSV

As our input data is currently formatted in JSON, we want to convert that to CSV so we can work with entire columns, and eventually push to a database like PostgreSQL.

To avoid corrupting the original data that was collected, place a copy of the data in the convenience folder `./_raw_json/` in this directory, and only modify this copied data as you work with the modules in this notebook.

For the script to work properly, the `./_raw_json/` directory _must_ be partitioned into one subdirectory per `year`, and each JSON file must sit inside the subdirectory for its year.

A correct directory structure would look something like this:
```
_raw_json
    2022
        aluminum_electrolytic.json
        ceramic.json
        film.json
        mica.json
        polymer.json
        tantalum.json
    2023
        aluminum_electrolytic.json
        capacitor_arrays.json
        capacitor_kits.json
        ceramic.json
        film.json
        mica.json
        polymer.json
        tantalum.json
```
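Because the conversion step depends on this layout, it can help to check it before running anything. The following is a minimal sketch, not part of the original notebook: `find_layout_problems` is a hypothetical helper, and the rule that every subdirectory name is a four-digit year is an assumption based on the example tree above.

```python
import re
from pathlib import Path


def find_layout_problems(json_dir: str) -> list[str]:
    """Return human-readable problems found in a `<json_dir>/<year>/*.json` layout."""
    root = Path(json_dir)
    if not root.is_dir():
        return [f"{json_dir} is not a directory"]

    problems = []
    for entry in sorted(root.iterdir()):
        if entry.name.startswith("."):
            continue  # ignore hidden files such as .DS_Store
        if entry.is_file():
            problems.append(f"{entry.name}: file found outside a year subdirectory")
        elif not re.fullmatch(r"\d{4}", entry.name):
            # Assumption: year directories are named with four digits, e.g. 2023.
            problems.append(f"{entry.name}: subdirectory name is not a four-digit year")
        elif not any(entry.glob("*.json")):
            problems.append(f"{entry.name}: no .json files found")
    return problems
```

Calling `find_layout_problems('./_raw_json')` before the conversion cell and stopping when the returned list is non-empty would surface a misplaced file early.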

The following block of code goes through all the JSON files in the specified directory, flattens them, converts them to CSVs, and saves them to disk in the `./_raw_csv/` directory.

**NOTE**: This is the only part of the process that's somewhat hardcoded. To combine the JSON files into a single CSV, the column names have to be consistent, and it's easier to achieve this by editing the JSON before saving it as CSV.

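One way to make key names consistent before flattening is a recursive rename. This is only a sketch: `rename_keys` and the `mfr` to `manufacturer` mapping are hypothetical, chosen to illustrate the idea rather than to describe the scraper's actual output.

```python
def rename_keys(obj, renames):
    """Recursively rename dict keys according to `renames`, leaving values untouched."""
    if isinstance(obj, dict):
        return {renames.get(key, key): rename_keys(value, renames) for key, value in obj.items()}
    if isinstance(obj, list):
        return [rename_keys(item, renames) for item in obj]
    return obj


# Hypothetical mismatch: one year's export calls the field "mfr", another "manufacturer".
RENAMES = {"mfr": "manufacturer"}
old_export = {"part": {"mpn": "EXAMPLE-001", "mfr": {"name": "Acme"}}}

normalized = rename_keys(old_export, RENAMES)
print(normalized)
```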
In [1]:
%load_ext autoreload
%autoreload 2

%aimport compute
%aimport categories

import json
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from flatten_json import flatten
from tqdm import tqdm

JSON_DIR = './_raw_json'
CSV_DIR = './_raw_csv'
CSV_COMBINED = 'combined.csv'
CSV_FINAL = 'final.csv'

#### Helper Functions

In [2]:
def delete_keys_from_dict(d, to_delete):
    if isinstance(to_delete, str):
        to_delete = [to_delete]
    if isinstance(d, dict):
        for single_to_delete in set(to_delete):
            if single_to_delete in d:
                del d[single_to_delete]
        for k, v in d.items():
            delete_keys_from_dict(v, to_delete)
    elif isinstance(d, list):
        for i in d:
            delete_keys_from_dict(i, to_delete)

#### Run

In [3]:
for subdir, dirs, files in os.walk(JSON_DIR):
    for filename in tqdm(files):
        if filename.endswith(".json"):
            # The year is the name of the directory that contains the file.
            year = os.path.basename(subdir)
            with open(os.path.join(subdir, filename), 'r', encoding='utf-8') as f:
                data = json.load(f)

                # 1. First globally delete all the keys that we don't want.
                delete_keys_from_dict(data, ['__typename', 'best_datasheet', 'best_image', 'manufacturer_url'])

                for result in data['data']['search']['results']:
                    # 2. Within the JSON, flatten the `specs` array from 
                    #    'specs': [{'attribute': {'id': '548',
                    #                             'name': 'Capacitance' 
                    #                             'shortname': 'capacitance'
                    #                             '__typename': 'Attribute'
                    #                            },
                    #               'display_value': '100 nF'
                    #              },
                    #              { ... },
                    #              { ... },
                    #              ...
                    #             ]
                    #    to
                    #    'specs': {'capacitance': {'display_value': '100 nF', 'id': '548'},
                    #              'case_package': {'display_value': 'Radial', 'id': '842'},
                    #              'depth': {'display_value': '8 mm', 'id': '291'},
                    #              ...
                    #             }    
                    #    and remove some fields that we don't want to include.
                    spec_json = {}
                    for spec in result['part']['specs']:
                        title = spec['attribute']['shortname']
                        spec['attribute']['display_value'] = spec['display_value']
                        spec = spec['attribute']
                        del spec['shortname']
                        del spec['name']
                        spec_json[title] = spec
                    result['part']['specs'] = spec_json

                    # 3. Remove specific parts of the JSON that we don't want (duplicate fields, etc).
                    del result['part']['_cache_id']
                    del result['part']['descriptions']
                    del result['part']['counts']
                    
                    # 4. Add column for the year of the component based on the subdirectory name.
                    result[year] = True

                # 5. Run the `flatten` function on each of the parts, place it in a list, and convert 
                #    to a Pandas DF.
                flat = [flatten(d) for d in data['data']['search']['results']]

                df = pd.DataFrame(flat, dtype='str')
                os.makedirs(f"{CSV_DIR}/{year}", exist_ok=True)
                df.to_csv(f"{CSV_DIR}/{year}/{filename.split('.')[0]}.csv", index=False)


100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26886.56it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:59<00:00,  8.45s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████| 10/10 [03:14<00:00, 19.47s/it]


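To make steps 2 and 5 of the cell above concrete, here is a small standalone sketch run on a made-up part. `reshape_specs` mirrors the reshaping described in the comments, and `flatten_nested` is a simplified stand-in for `flatten_json.flatten` that joins nested dict keys with underscores and does not handle lists. Both names and the sample data are illustrative assumptions.

```python
def reshape_specs(specs):
    """Turn the list of spec entries into a dict keyed by attribute shortname."""
    reshaped = {}
    for spec in specs:
        attribute = spec["attribute"]
        reshaped[attribute["shortname"]] = {
            "id": attribute["id"],
            "display_value": spec["display_value"],
        }
    return reshaped


def flatten_nested(obj, prefix=""):
    """Simplified stand-in for flatten_json.flatten: join nested dict keys with '_'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}_{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten_nested(value, name))
        else:
            flat[name] = value
    return flat


# Made-up part record in the same shape as one entry of data['data']['search']['results'].
result = {
    "part": {
        "mpn": "EXAMPLE-001",
        "specs": [
            {"attribute": {"id": "548", "name": "Capacitance", "shortname": "capacitance"},
             "display_value": "100 nF"},
            {"attribute": {"id": "291", "name": "Depth", "shortname": "depth"},
             "display_value": "8 mm"},
        ],
    }
}

result["part"]["specs"] = reshape_specs(result["part"]["specs"])
row = flatten_nested(result)
print(row)
```

Each key of `row` becomes one CSV column, which is where names such as `part_specs_capacitance_display_value` in the later cells come from.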
#### Combine

The combine block aims to:
1. Create a list of years based on the file structure of the `./_raw_csv/` directory.
2. Loop through each file and add a `year` column to each CSV. For each column that changes year over year, create a new column with the year appended to the name, e.g. `original_name_year`.
3. Merge the CSVs on `part_mpn`, resolving conflicts in favor of the most recent year.
4. Drop the intermediate columns, including the `year` column.

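The steps above can be sketched in plain Python on two made-up rows. This is not the implementation in the `compute` module, which is not shown here: `suffix_with_year` and `merge_latest_wins` are hypothetical helpers, and the `price` and `manufacturer` columns are example names.

```python
def suffix_with_year(row, yearly_columns, year):
    """Append the year to every column that changes year over year (price -> price_2023)."""
    return {
        (f"{column}_{year}" if column in yearly_columns else column): value
        for column, value in row.items()
    }


def merge_latest_wins(rows):
    """Collapse rows that share a part_mpn; values from later years overwrite earlier ones."""
    merged = {}
    # Process the oldest year first so that newer rows overwrite older values.
    for row in sorted(rows, key=lambda r: int(r["year"])):
        target = merged.setdefault(row["part_mpn"], {})
        for column, value in row.items():
            if value is not None:  # keep the older value when the newer one is missing
                target[column] = value
    # Drop the intermediate `year` column once conflicts are resolved.
    for record in merged.values():
        record.pop("year", None)
    return list(merged.values())


raw_rows = [
    {"part_mpn": "EXAMPLE-001", "year": 2022, "manufacturer": "Acme", "price": 0.12},
    {"part_mpn": "EXAMPLE-001", "year": 2023, "manufacturer": "Acme Corp", "price": 0.15},
]
renamed = [suffix_with_year(row, {"price"}, row["year"]) for row in raw_rows]
merged = merge_latest_wins(renamed)
print(merged)
```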
In [4]:
years = [d for d in os.listdir(CSV_DIR) if os.path.isdir(os.path.join(CSV_DIR, d))]

dataframes = []
for year in years:
    year_dir = os.path.join(CSV_DIR, year)
    for filename in tqdm(os.listdir(year_dir)):
        if "all" not in filename and filename.endswith(".csv"):
            df = pd.read_csv(os.path.join(year_dir, filename))
            df = compute.add_year(df, year)
            df = compute.rename_by_year(df, categories.columns_that_update_yearly_preprocess.keys(), year)
            dataframes.append(df)

combined_df = pd.concat(dataframes, ignore_index=True)
combined_df = compute.merge(combined_df, years)
combined_df = compute.drop_columns(combined_df, ['year'])
combined_df.to_csv(os.path.join(CSV_DIR, CSV_COMBINED), index=False)

100%|███████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:04<00:00,  1.24it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:13<00:00,  1.51s/it]


## CSV Postprocessing

In this step, we take the combined CSV generated in the previous step and reshape it into the final format that we will upload to the PostgreSQL database.

We will use a modular approach. For each step of updating the CSV, we will implement a function that takes in a pandas dataframe and outputs another pandas dataframe in the desired format. 

In [5]:
df = pd.read_csv(f'{CSV_DIR}/{CSV_COMBINED}', index_col=False)



In [6]:
years = [d for d in os.listdir(CSV_DIR) if os.path.isdir(os.path.join(CSV_DIR, d))]

year_specific_columns = {}
year_specific_columns_postprocessed = []
for year in years:
    for col, data_type in categories.columns_that_update_yearly_preprocess.items():
        year_specific_col = f"{col}_{year}"
        if data_type not in year_specific_columns:
            year_specific_columns[data_type] = []
        year_specific_columns[data_type].append(year_specific_col)
    for col in categories.columns_that_update_yearly_postprocess:
        year_specific_columns_postprocessed.append(f"{col}_{year}")

In [7]:
string_to_float_cols = [
    'part_specs_tolerance_display_value', 
    'part_specs_temperaturecoefficient_display_value', 
    'part_specs_maxjunctiontemperature_display_value', 
    'part_specs_maxoperatingtemperature_display_value', 
    'part_specs_minoperatingtemperature_display_value', 
    'part_specs_dissipationfactor_display_value', 
    'part_specs_failurerate_display_value', 
    'part_specs_frequencytolerance_display_value', 
    'part_specs_qfactor_display_value', 
    'part_specs_frequencystability_display_value', 
    'part_specs_accuracy_display_value', 
    'part_specs_speedgrade_display_value', 
    'part_specs_inductancetolerance_display_value', 
    'part_specs_ambienttemperaturerangehigh_display_value'
]

string_to_int_cols = [
    'part_specs_numberofpins_display_value', 
    'part_specs_life_hours__display_value', 
    'part_specs_life_cycles__display_value', 
    'part_specs_numberofcapacitors_display_value', 
]

string_to_base_float_cols = [
    'part_specs_capacitance_display_value', 
    'part_specs_depth_display_value', 
    'part_specs_height_display_value', 
    'part_specs_height_seated_max__display_value', 
    'part_specs_length_display_value', 
    'part_specs_voltage_display_value', 
    'part_specs_voltagerating_display_value', 
    'part_specs_voltagerating_ac__display_value', 
    'part_specs_voltagerating_dc__display_value', 
    'part_specs_width_display_value', 
    'part_specs_weight_display_value', 
    'part_specs_insulationresistance_display_value', 
    'part_specs_diameter_display_value', 
    'part_specs_thickness_display_value', 
    'part_specs_esr_equivalentseriesresistance__display_value', 
    'part_specs_resistance_display_value', 
    'part_specs_dcresistance_dcr__display_value', 
    'part_specs_inductance_display_value', 
    'part_specs_maxdccurrent_display_value', 
    'part_specs_powerrating_display_value', 
    'part_specs_seriesresistance_display_value', 
    'part_specs_currentrating_display_value', 
    'part_specs_characterheight_display_value', 
    'part_specs_ripplecurrent_display_value',
    'part_specs_maxlength_display_value', 
    'part_specs_maxthickness_display_value', 
    'part_specs_maxwidth_display_value', 
    'part_specs_minlength_display_value', 
    'part_specs_minthickness_display_value', 
    'part_specs_minwidth_display_value', 
    'part_specs_insidediameter_display_value', 
    'part_specs_selfresonantfrequency_display_value', 
    'part_specs_current_display_value', 
    'part_specs_maxcurrentrating_display_value', 
    'part_specs_maxvoltagerating_dc__display_value', 
    'part_specs_maxfrequency_display_value', 
    'part_specs_leakagecurrent_display_value', 
    'part_specs_testfrequency_display_value', 
    'part_specs_ripplecurrent_ac__display_value', 
    'part_specs_impedance_display_value', 
    'part_specs_outsidediameter_display_value', 
    'part_specs_workingvoltage_display_value', 
    'part_specs_frequency_display_value'
]

string_to_float_cols.extend(year_specific_columns.get('float', []))
string_to_int_cols.extend(year_specific_columns.get('int', []) + years)

In [8]:
df = compute.spec_string_to_float(df, cols=string_to_float_cols)
df = compute.spec_string_to_int(df, cols=string_to_int_cols)
df = compute.spec_string_to_base_float(df, cols=string_to_base_float_cols)

df = compute.classify_ceramic(df)
df = compute.classify_dielectric(df)

df = compute.process_category(df)
df = compute.process_manufacturer(df)
df = compute.process_mpn(df)
df = compute.process_voltage(df)
df = compute.process_current(df)
df = compute.process_capacitance(df)
df = compute.process_esr(df)
df = compute.process_esr_frequency(df)
df = compute.process_price(df, years)

df = compute.compute_volume(df)
df = compute.compute_mass(df)
df = compute.compute_energy(df)
df = compute.compute_rated_power(df)
df = compute.compute_gravimetric_energy_density(df)
df = compute.compute_volumetric_energy_density(df)
df = compute.compute_gravimetric_power_density(df)
df = compute.compute_volumetric_power_density(df)
df = compute.compute_energy_per_cost(df, years)


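The `display_value` columns hold strings such as `'100 nF'` or `'8 mm'`, so the "base float" conversion has to turn a number plus an SI-prefixed unit into a plain number in base units. The sketch below shows one way to do that. It is an assumption about what such a conversion involves, not the code behind `compute.spec_string_to_base_float`, and the supported prefixes and units are an illustrative subset.

```python
import re

# SI prefixes and their multipliers; "u" is accepted as an ASCII stand-in for "µ".
SI_PREFIXES = {
    "p": 1e-12, "n": 1e-9, "u": 1e-6, "µ": 1e-6, "m": 1e-3,
    "": 1.0, "k": 1e3, "M": 1e6, "G": 1e9,
}
# Base units recognised by this sketch (grams are treated as the base unit of mass).
BASE_UNITS = ("F", "V", "A", "Hz", "Ω", "H", "W", "g", "m")

_VALUE_RE = re.compile(r"^\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)\s*(\S*)\s*$")


def to_base_units(display_value):
    """Convert a string like '100 nF' to a float in base units, or None if unparseable."""
    match = _VALUE_RE.match(display_value)
    if match is None:
        return None
    number, unit = float(match.group(1)), match.group(2)
    for base in BASE_UNITS:
        if unit.endswith(base):
            prefix = unit[: -len(base)]
            if prefix in SI_PREFIXES:
                return number * SI_PREFIXES[prefix]
    return None


for text in ("100 nF", "8 mm", "4.7 kΩ", "Radial"):
    print(text, "->", to_base_units(text))
```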
In [9]:
cols_to_drop = [item for item in list(categories.column_map.values()) if item not in categories.columns_that_update_yearly_postprocess]
cols_to_drop = cols_to_drop + years + year_specific_columns_postprocessed
df = compute.drop_all_except(df, cols=cols_to_drop)

In [10]:
df.to_csv(f'{CSV_DIR}/{CSV_FINAL}', index=False)

## Deploy

In this step, we take the final CSV generated above and upload it to our PostgreSQL database. We use [postgres-csv-uploader](https://pypi.org/project/postgres-csv-uploader/), a package developed for this project, to infer the type of each column in the CSV, create the necessary table schema, and upload the contents to the database.

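To illustrate what inferring column types and building a table schema from a CSV can look like, here is a small standard-library sketch. It does not reproduce how `postgres-csv-uploader` works internally. `infer_sql_type`, `create_table_sql`, the type choices, and the sample columns are all assumptions made for the example.

```python
import csv
import io


def infer_sql_type(values):
    """Pick the narrowest PostgreSQL type that fits every non-empty value."""
    non_empty = [v for v in values if v not in ("", None)]
    if not non_empty:
        return "TEXT"
    for caster, sql_type in ((int, "BIGINT"), (float, "DOUBLE PRECISION")):
        try:
            for value in non_empty:
                caster(value)
            return sql_type
        except ValueError:
            continue  # at least one value does not fit; try the next, wider type
    return "TEXT"


def create_table_sql(csv_text, table_name):
    """Build a CREATE TABLE statement from CSV text (identifiers are quoted, not escaped)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    column_defs = [
        f'"{name}" {infer_sql_type([row[name] for row in rows])}'
        for name in (reader.fieldnames or [])
    ]
    return f'CREATE TABLE "{table_name}" (' + ", ".join(column_defs) + ");"


sample = "mpn,capacitance,life_hours\nEXAMPLE-001,1e-07,2000\nEXAMPLE-002,,5000\n"
print(create_table_sql(sample, "final"))
```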
In [11]:
%pip install --upgrade pip
%pip install postgres_csv_uploader
%pip install psycopg2

[0m[33mDEPRECATION: Loading egg at /opt/homebrew/Cellar/gpgme/1.23.2/lib/python3.12/site-packages/gpg-1.23.2-py3.12-macosx-13-arm64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330[0m[33m
[0m[33mDEPRECATION: Loading egg at /opt/homebrew/Cellar/gpgme/1.23.2/lib/python3.12/site-packages/gpg-1.23.2-py3.12-macosx-13-arm64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330[0m[33m
[0m[33mDEPRECATION: Loading egg at /opt/homebrew/Cellar/gpgme/1.23.2/lib/python3.12/site-packages/gpg-1.23.2-py3.12-macosx-13-arm64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issue

In [12]:
from postgres_csv_uploader.uploader import PostgresCSVUploader
import psycopg2 as ps

In [None]:
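# Read connection settings from the standard libpq environment variables
# rather than hardcoding credentials in the notebook.
host = os.environ["PGHOST"]
port = int(os.environ.get("PGPORT", 5432))
database = os.environ["PGDATABASE"]
user = os.environ["PGUSER"]
password = os.environ["PGPASSWORD"]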

conn = ps.connect(
    host=host,
    user=user,
    password=password,
    port=port,
    database=database
)

uploader = PostgresCSVUploader(conn)
uploader.upload(
    f'{CSV_DIR}/{CSV_FINAL}',
    CSV_FINAL.split('.')[0]
)

## Sanity Checks

In [None]:
DATA_JSON_2022 = "./_raw_json/2022"
DATA_JSON_2023 = "./_raw_json/2023"

DATA_CSV_2022 = "./_raw_csv/2022"
DATA_CSV_2023 = "./_raw_csv/2023"

DATA_CSV_COMBINED = "./_raw_csv/combined.csv"
DATA_CSV_FINAL = "./_raw_csv/final.csv"

In [None]:
def print_unique_json(path: str):
    json_files = [f for f in os.listdir(path) if f.endswith('.json')]

    for json_file in json_files:
        with open(os.path.join(path, json_file), 'r', encoding='utf-8') as file:
            data = json.load(file)
            results = data["data"]["search"]["results"]
            total_results = len(results)
            unique_results = len({result["part"]["mpn"] for result in results})
            print(f"{json_file} total elements: {total_results}, unique elements: {unique_results}")

def print_unique_csv(path: str):
    csv_files = [f for f in os.listdir(path) if f.endswith('.csv') and "all" not in f and "combined" not in f]

    for csv_file in csv_files:
        df = pd.read_csv(os.path.join(path, csv_file))
        print(f"{csv_file} unique elements: {len(df['part_mpn'].unique())}")

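When the unique count is lower than the total, it is useful to know which part numbers repeat. A minimal sketch, using a hypothetical helper `duplicated_mpns` that is not part of the original notebook:

```python
from collections import Counter


def duplicated_mpns(mpns):
    """Return {mpn: count} for every MPN that appears more than once."""
    return {mpn: count for mpn, count in Counter(mpns).items() if count > 1}


# Made-up part numbers for illustration.
print(duplicated_mpns(["EXAMPLE-001", "EXAMPLE-002", "EXAMPLE-001"]))
```

The input could be the list of `result["part"]["mpn"]` values from a JSON file or the `part_mpn` column of a CSV.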
### Before and After Converting to CSV

#### JSON [2022]

In [None]:
print_unique_json(DATA_JSON_2022)

#### JSON [2023]

In [None]:
print_unique_json(DATA_JSON_2023)

#### CSV [2022]

In [None]:
print_unique_csv(DATA_CSV_2022)

#### CSV [2023]

In [None]:
print_unique_csv(DATA_CSV_2023)

### Before and After Merging Categories & Years

In [None]:
years = [d for d in os.listdir(CSV_DIR) if os.path.isdir(os.path.join(CSV_DIR, d))]

unique_mpn_raw = set()
for year in years:
    year_dir = os.path.join(CSV_DIR, year)
    for filename in os.listdir(year_dir):
        if "all" not in filename and "combined" not in filename and filename.endswith(".csv"):
            df = pd.read_csv(os.path.join(year_dir, filename))
            unique_mpn_raw.update(df['part_mpn'].unique())

print(f"Total unique components across {years}: {len(unique_mpn_raw)}")

# Assuming combined_df is the dataframe after custom merging
combined_df = pd.read_csv(DATA_CSV_COMBINED)
unique_mpn_combined = combined_df['part_mpn'].nunique()
print(f"Total unique components in combined CSV: {unique_mpn_combined}")

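Comparing counts shows whether parts were lost, but not which ones. A set difference gives the exact part numbers. The helper below is a hypothetical sketch that works on any two collections of MPNs:

```python
def compare_mpns(raw_mpns, combined_mpns):
    """Report MPNs that disappeared during merging and MPNs that appeared unexpectedly."""
    raw, combined = set(raw_mpns), set(combined_mpns)
    return {
        "missing_after_merge": sorted(raw - combined),
        "unexpected_after_merge": sorted(combined - raw),
    }


# Made-up part numbers for illustration.
report = compare_mpns(["A-1", "B-2", "C-3"], ["A-1", "C-3"])
print(report)
```

In the cell above, this could be called as `compare_mpns(unique_mpn_raw, combined_df['part_mpn'])`; both lists in the report should be empty if the merge preserved every part.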
## Visualizations

In [None]:
data_viz = pd.read_csv(f'{CSV_DIR}/{CSV_FINAL}', index_col=False)

### Capacitance vs. Rated DC Voltage

In [None]:
x = "voltage"
y = "capacitance"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Capacitance vs. Rated DC Voltage') 
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Rated Voltage vs. Volumetric Energy Density

In [None]:
x = "voltage"
y = "volumetric_energy_density"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Rated Voltage vs. Volumetric Energy Density') 
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Rated Voltage vs. Gravimetric Energy Density

In [None]:
x = "voltage"
y = "gravimetric_energy_density"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Rated Voltage vs. Gravimetric Energy Density') 
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Rated Voltage vs. Energy per Cost

In [None]:
x = "voltage"
y = "energy_per_cost"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Rated Voltage vs. Energy per Cost') 
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Volumetric Energy Density vs Power

In [None]:
x = "volumetric_energy_density"
y = "power"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Volumetric Energy Density vs Power') 
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Volumetric Energy Density vs Volumetric Power Density

In [None]:
x = "volumetric_energy_density"
y = "volumetric_power_density"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Volumetric Energy Density vs Volumetric Power Density') 
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Volumetric Energy Density vs Gravimetric Power Density

In [None]:
x = "volumetric_energy_density"
y = "gravimetric_power_density"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Volumetric Energy Density vs Gravimetric Power Density') 
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Capacitance vs. Mass

In [None]:
x = "capacitance"
y = "mass"
one = data_viz.loc[data_viz["part_category_id"] == 6331].plot.scatter(x=x, y=y, label="Aluminum Electrolytic Capacitors", loglog=True, color="Blue", s=5, title='Capacitance vs. Mass')
data_viz.loc[data_viz["part_category_id"] == 6332].plot.scatter(x=x, y=y, label="Ceramic Capacitors", loglog=True, ax=one, color="Green", s=5)
data_viz.loc[data_viz["part_category_id"] == 6333].plot.scatter(x=x, y=y, label="Film Capacitors", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[data_viz["part_category_id"] == 6334].plot.scatter(x=x, y=y, label="Mica Capacitors", loglog=True, ax=one, color="Orange", s=5)
data_viz.loc[data_viz["part_category_id"] == 6336].plot.scatter(x=x, y=y, label="Tantalum Capacitors", loglog=True, ax=one, color="Purple", s=5)

### Mass vs. Capacitance by ESR Frequency

In [None]:
x = "mass"
y = "capacitance"
one = data_viz.loc[pd.notna(data_viz["esr_frequency_low"])].plot.scatter(x=x, y=y, label="Low Frequency", loglog=True, color="Blue", s=5, title='Mass vs. Capacitance by ESR Frequency')
data_viz.loc[pd.notna(data_viz["esr"]) & pd.isna(data_viz["esr_frequency"])].plot.scatter(x=x, y=y, label="Other", loglog=True, ax=one, color="Red", s=5)
data_viz.loc[pd.notna(data_viz["esr_frequency_high"])].plot.scatter(x=x, y=y, label="High Frequency", loglog=True, ax=one, color="Green", s=5)