# BoardBot Extracted Data Cleaning and Standardization



## Introduction



**Project Overview**

BoardBot is an AI-powered system that answers user queries about computer hardware products, with a focus on embedded systems, development kits, and industrial communication devices. The data in this notebook comes from a feature extraction process applied to product descriptions and specifications.



**Objective**

Our goal is to refine and standardize the extracted product data to optimize it for efficient searching, filtering, and comparison operations. The cleaned data will be used in a vector database for semantic search capabilities and to provide accurate, consistent responses to user queries about hardware specifications.

**Notebook Objectives**

1. **Modular Structure**: Create utility functions for common operations and implement a consistent structure for cleaning functions across all columns.

2. **Comprehensive Documentation**: Provide clear explanations for the cleaning process of each column, document the rationale behind standardization choices, and include examples of how the cleaned data will be used in search and filter operations.

3. **Standardized Data Formats**: Implement a consistent approach for storing range values across all relevant columns and standardize units where applicable.

4. **Optimal Cleaning Strategies**: Ensure no relevant data is lost during the cleaning process, implement column-specific cleaning strategies that preserve the nuances of each data type, and handle special cases and outliers appropriately.

5. **Multi-value Field Handling**: Develop a consistent approach for columns that may contain multiple values and ensure the cleaned format is optimized for both searching and filtering operations.



**Deliverable**

A well-structured, thoroughly commented Jupyter notebook implementing the improved data cleaning pipeline.



---


## 1. Setup and Data Loading

First, we import the necessary libraries and load the extracted data.

In [1]:
# Import required libraries
import os
import re
import ast
import logging
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Set pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.max_rows', 1000)

# Set up plotting styles
sns.set(style="whitegrid", font_scale=1.2)
plt.rcParams['figure.figsize'] = (12, 8)
%matplotlib inline

# Configure logging
logging.basicConfig(level=logging.ERROR, format='%(asctime)s - %(levelname)s - %(message)s')

In [2]:
# Load the data
data_file = "../data/processed_feature_extraction_results.csv"
df = pd.read_csv(data_file)

# Display the first few rows
df.head()

Unnamed: 0,product_id,name,manufacturer,form_factor,evaluation_or_commercialization,processor_architecture,processor_core_count,processor_manufacturer,processor_tdp,memory,onboard_storage,input_voltage,io_count,wireless,operating_system_bsp,operating_temperature_max,operating_temperature_min,certifications,short_summary,full_summary,full_product_description,target_applications,duplicate_ids
0,iW-G27S-SCQM-4L004G-E016G-BIC,IMX QUAD MAX QUAD PLUS PICO ITX SBC,IWAVE SYSTEMS,PICO ITX,False,ARM,6,NXP,,Up to 8GB LPDDR4,Up to 64GB eMMC,12V,"['USB 3.0', 'USB 2.0', 'HDMI', 'Ethernet', 'PCIe', 'CAN', 'UART', 'SPI', 'I2C', 'MIPI CSI', 'MIPI DSI', 'LVDS']","['Wi-Fi', 'Bluetooth']","['Linux', 'Android Pie', 'QNX']",85°C,-40°C,"['RoHS', 'REACH']","The iMX Quad Max Quad Plus Pico ITX SBC is a high-performance, compact single board computer designed for industrial and embedded applications.","The iMX Quad Max Quad Plus Pico ITX SBC integrates dual Cortex-A72 and Cortex-A53 cores, dual GPUs, and a VPU for enhanced multimedia capabilities. It supports up to 8GB LPDDR4 memory and 64GB eMMC storage, with extensive I/O options including USB, HDMI, Ethernet, and wireless connectivity.","The iMX Quad Max Quad Plus Pico ITX SBC is engineered for high-performance applications in industrial, automotive, and medical domains. It features a robust set of interfaces including dual Ethernet, multiple USB ports, PCIe, and advanced display outputs like HDMI and MIPI DSI. The board supports Linux, Android Pie, and QNX operating systems, making it versatile for various development needs.","['Remote Energy Management', 'Intelligent Edge', 'Augmented and Virtual Reality', '4K Media Streaming', 'Industrial Automation', 'Automotive eCockpit']","['iW-G27S-SCQM-4L004G-E016G-BIC', 'iW-G27S-SCQM-4L008G-E032G-BIC']"
1,conga-TC170slash3955U,CONGATC,CONGATEC,COM EXPRESS COMPACT,True,X86,2,INTEL,15W,Up to 32 GByte dual channel DDR4,Optional eMMC 5.1 on board mass storage,21 VDC,"['x PCI Express GEN 3 lanes', 'x Serial ATA Gen 3', 'x USB 3.0', 'x USB 2.0', 'LPC bus', 'I2C bus', 'x UART', 'Digital High Definition Audio Interface']",Available,"['Microsoft Windows', 'Microsoft Windows IoT Enterprise', 'Linux', 'Microsoft Windows Embedded Standard']",60°C,0°C,"FCC Class A, UL 94V0","Compact module with Intel Core processors, dual channel DDR4 memory, and extensive I/O options.","The congaTC is a COM Express Compact module featuring Intel Core i7/i5/i3 and Celeron processors, supporting dual channel DDR4 memory, and offering a wide range of I/O interfaces including PCIe, SATA, USB, and more.","The congaTC module is designed for high-performance applications, equipped with Intel's 6th generation Core processors and Celeron options. It supports up to 32 GByte of dual channel DDR4 memory, and provides a variety of interfaces such as PCIe, SATA, USB, and audio. The module is suitable for industrial and embedded applications, offering features like Intel Turbo Boost, Hyper-Threading, and virtualization technologies.","['Industrial automation', 'Embedded systems', 'Digital signage', 'Medical devices', 'Gaming']",['conga-TC170slash3955U']
2,34012-0416-N1-2,COMEMEL E,KONTRON,COM EXPRESS MINI,True,X86,Up to 14 cores,INTEL,31W,UP TO 8 GBYTE LPDDR,UP TO 64 GBYTE EMMC,5V,"['Up to 2x USB 3.0, 8x USB 2.0, up to 4x serial interfaces or CAN bus']",Supports wireless technologies with miniature RF connectors for WLAN and Bluetooth,"['Windows', 'Linux', 'VxWorks']",85°C,-40°C,EN50155:2021 for Edge AI Platforms,"COM Express mini Type 10 module with Intel Atom, Pentium, and Celeron processors.","The COMemEL E is a COM Express mini Type 10 module featuring Intel Atom, Pentium, and Celeron processors, designed for low power and high performance in industrial applications.","The COMemEL E module is a compact, low-power solution optimized for industrial applications. It supports up to 8 GByte LPDDR memory with In-Band ECC, multiple USB ports, SATA, eMMC Flash, and optional GbE with TSN support. It is designed to operate in industrial temperature ranges and supports various operating systems including Windows, Linux, and VxWorks.","['Industrial', 'Embedded Systems']","['34012-0416-N1-2', '34012-0432-J2-4', '34013-0416-R1-2', '34013-0432-R1-4', '34013-0832-R2-4']"
3,conga-TCA7slashi-x6212RE,CONGATCA,CONGATEC,COM EXPRESS COMPACT,False,X86,4,INTEL,6W,"DDR4 SODIMM, 3200 MT/s, up to 32GB","eMMC, 64GB",,"['PCIe, USB, SATA, UART, CAN, GPIO']",,"['Microsoft Windows IoT Enterprise', 'Linux Yocto', 'RTS RealTime Hypervisor']",85°C,-40°C,,"COM Express Compact module with Intel Atom, Pentium, and Celeron processors, supporting industrial temperature ranges.","The congaTCA is a COM Express Compact module featuring Intel Atom xE, Pentium, and Celeron processors, designed for industrial applications with support for DDR4 SODIMM memory, eMMC storage, and a wide range of I/O interfaces.","The congaTCA module by congatec is built on the COM Express Compact form factor, integrating Intel Atom xE, Pentium, and Celeron processors. It supports dual-channel DDR4 SODIMM memory up to 32GB, onboard eMMC storage, and various I/O interfaces including PCIe, USB, SATA, UART, CAN, and GPIO. The module is suitable for industrial applications with an operating temperature range from -40°C to 85°C. It supports multiple operating systems including Microsoft Windows IoT Enterprise and Linux Yocto.","['Industrial', 'Embedded Systems', 'IoT']","['conga-TCA7slashi-x6212RE', 'conga-TCA7slashi-x6414RE', 'conga-TCA7slashi-x6425RE', 'conga-TCA7slashJ6413', 'conga-TCA7slashJ6426', 'conga-TCA7slashx6211E', 'conga-TCA7slashx6413E', 'conga-TCA7slashx6425E']"
4,MIC-7700H-00A1,MIC,ADVANTECH,Compact Fanless System,False,x86,24C,INTEL,35W,"Dual-channel DDR3 1600 MHz, up to 16GB","HDD, CFast, mSATA",9-36VDC,"['4x USB 3.0, 4x USB 2.0, 6x COM, 2x LAN']","WiFi Module 802.11 a/b/g/n/ac 2T2R w/BT5.1, Intel AC9260",Embedded OS,60°C,-20°C,"['CE', 'FCC Class A', 'CCC', 'BSMI', 'UL', 'RoHS']","Compact fanless system with Intel Core CPU, multiple I/O ports, and wide input voltage range.","The MIC series is a compact fanless system featuring Intel's 4th and 5th Gen Core CPUs, supporting a wide range of I/O interfaces and expansion modules, suitable for industrial applications.","The MIC series by Advantech is designed for industrial applications, offering a compact fanless design with Intel Core processors. It supports a variety of I/O interfaces including USB, COM, and LAN ports, and offers flexible storage options with HDD, CFast, and mSATA. The system is certified with CE, FCC, CCC, BSMI, UL, and RoHS standards, ensuring compliance and reliability in various environments.","['Industrial Automation', 'Embedded Systems', 'IoT Gateways']","['MIC-7700H-00A1', 'MIC-7700Q-00A1']"


---


## 2. Utility Functions

To create a modular and reusable data processing pipeline, we define utility functions for common operations such as:

- Computing before-and-after statistics.
- Standardizing units.
- Handling range values.
- Parsing multi-value fields.

### 2.1 Before-and-After Statistics


In [3]:
def print_before_after_stats(df, column_name, clean_func):
    """
    Print statistics before and after cleaning a column.

    Parameters:
    - df: pandas DataFrame
    - column_name: str, name of the column to clean
    - clean_func: function, the cleaning function to apply
    """
    print(f"\n=== Cleaning '{column_name}' Column ===")
    original_unique = df[column_name].nunique()
    print(f"Original unique values: {original_unique}")

    # Apply the cleaning function
    df[column_name] = df[column_name].apply(clean_func)

    cleaned_unique = df[column_name].nunique()
    print(f"Cleaned unique values:  {cleaned_unique}")
    reduction = original_unique - cleaned_unique
    print(f"Reduction:              {reduction}")
    reduction_pct = (reduction / original_unique * 100) if original_unique else 0
    print(f"Reduction percentage:   {reduction_pct:.2f}%")

    # Display the top 50 most frequent values after cleaning
    print("\nTop 50 most frequent values after cleaning:")
    value_counts = df[column_name].value_counts().head(50)
    print(value_counts.to_string())

    # Return the cleaned DataFrame
    return df

### 2.2 Standardizing Units


In [4]:

def standardize_units(value, from_unit, to_unit):
    """
    Convert a numeric value from one unit to another.

    Parameters:
    - value: float or int
    - from_unit: str, the current unit
    - to_unit: str, the desired unit

    Returns:
    - Converted value
    """
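    conversion_factors = {
        ('KB', 'MB'): 1 / 1024,
        ('MB', 'GB'): 1 / 1024,
        ('GB', 'MB'): 1024,
        ('GB', 'TB'): 1 / 1024,
        ('TB', 'GB'): 1024,
        ('W', 'mW'): 1000,
        ('mW', 'W'): 1 / 1000,
    }

    # Compare units case-insensitively. Upper-casing only the lookup key would
    # turn 'mW' into 'MW', which never matches the ('W', 'mW') entries above.
    lookup = {(src.upper(), dst.upper()): factor
              for (src, dst), factor in conversion_factors.items()}
    factor = lookup.get((from_unit.upper(), to_unit.upper()))
    if factor is not None:
        return value * factor
    # No conversion needed or conversion factor not defined
    return value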
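
As a hedged illustration of how unit standardization might be applied to free-text capacity strings such as "Up to 8GB LPDDR4" or "UP TO 64 GBYTE EMMC" from the sample rows, the sketch below extracts the first number-and-unit pair and expresses it in GB. The helper name `to_gigabytes`, the regular expression, and the 1024-based factors are assumptions made for illustration; they are not part of the pipeline.

```python
import re

# Hypothetical factors to GB (binary, 1024-based), keyed by upper-case unit.
_TO_GB = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}

# Matches e.g. "8GB", "64 GByte", "512 MB"; the optional "YTE(S)" covers "GByte".
_CAPACITY_RE = re.compile(r"(\d+(?:\.\d+)?)\s*([KMGT])B(?:YTES?)?\b", re.IGNORECASE)


def to_gigabytes(text):
    """Return the first capacity found in `text` as a float in GB, or None."""
    if not isinstance(text, str):
        return None
    match = _CAPACITY_RE.search(text)
    if match is None:
        return None
    number = float(match.group(1))
    unit = match.group(2).upper() + "B"
    return number * _TO_GB[unit]


for sample in ["Up to 8GB LPDDR4", "UP TO 64 GBYTE EMMC", "512 MB DDR3", "HDD, CFast, mSATA"]:
    print(f"{sample!r:24} -> {to_gigabytes(sample)}")
```

Strings without a recognizable capacity return `None` rather than a guessed value, so missing data stays distinguishable from zero.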


### 2.3 Parsing Range Values


In [5]:
def parse_range(value):
    """
    Parse a string representing a range and return a tuple (min_value, max_value).

    Parameters:
    - value: str, the range string

    Returns:
    - tuple of (min_value, max_value) or None
    """
    if not isinstance(value, str):
        return None

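    # The lazy separator lets a negative upper bound keep its sign ("-40 to -20");
    # the (?!\.\d) guard stops a single decimal such as "3.3V" from being split in two.
    match = re.match(r'(-?\d+(?:\.\d+)?)(?!\.\d)[^\d]+?(-?\d+(?:\.\d+)?)', value)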
    if match:
        min_value = float(match.group(1))
        max_value = float(match.group(2))
        return min_value, max_value
    else:
        return None
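
The sample rows also contain one-sided values such as "Up to 14 cores" and single values such as "12V", which a two-number pattern does not cover. The sketch below shows one possible way to store all three shapes in a uniform `{"min": ..., "max": ...}` form. The helper name `to_range_dict` and the dictionary layout are assumptions, not the notebook's chosen format.

```python
import re

_NUMBER = r"-?\d+(?:\.\d+)?"
# Two numbers separated by non-digits. The lazy separator keeps the sign of the
# second number, and (?!\.\d) prevents "3.3" from being read as the range 3..3.
_TWO_SIDED = re.compile(rf"({_NUMBER})(?!\.\d)[^\d]+?({_NUMBER})")
_UP_TO = re.compile(rf"up\s*to\s*({_NUMBER})", re.IGNORECASE)
_SINGLE = re.compile(rf"({_NUMBER})")


def to_range_dict(value):
    """Hypothetical helper: normalize a range-like string to {'min': x, 'max': y}."""
    if not isinstance(value, str):
        return None
    match = _UP_TO.search(value)
    if match:
        # One-sided upper bound: the minimum is unknown.
        return {"min": None, "max": float(match.group(1))}
    match = _TWO_SIDED.search(value)
    if match:
        # Sort so the result is well-formed even if the text lists the larger value first.
        low, high = sorted(float(g) for g in match.groups())
        return {"min": low, "max": high}
    match = _SINGLE.search(value)
    if match:
        number = float(match.group(1))
        return {"min": number, "max": number}
    return None


for sample in ["9-36VDC", "-40°C to 85°C", "Up to 14 cores", "12V", "Embedded OS"]:
    print(f"{sample!r:18} -> {to_range_dict(sample)}")
```

Storing a single value as `min == max` keeps range filters (for example, "supports 12 V") expressible as one comparison against both bounds.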


### 2.4 Handling Multi-value Fields


In [6]:
def parse_multi_values(value):
    """
    Parse a string containing multiple values into a list.

    Parameters:
    - value: str or list

    Returns:
    - list of values
    """
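    if isinstance(value, list):
        return value
    if isinstance(value, str):
        # Try to parse as a list literal, e.g. "['RoHS', 'REACH']"
        try:
            value_list = ast.literal_eval(value)
            if isinstance(value_list, list):
                return value_list
        except (ValueError, TypeError, SyntaxError):
            pass
        # Not a list literal: split on common delimiters and drop empty parts
        return [part.strip() for part in re.split(r'[,;/]', value) if part.strip()]
    return []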
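
Once a multi-value column holds real Python lists, filtering becomes a membership test rather than a substring search. The sketch below uses a small made-up DataFrame (the rows and the `has_value` helper are illustrative assumptions) to show two common patterns: a per-row membership filter, and `DataFrame.explode` for counting individual values.

```python
import pandas as pd

# Toy data standing in for an already-parsed multi-value column.
toy = pd.DataFrame(
    {
        "product_id": ["A", "B", "C"],
        "operating_system_bsp": [
            ["Linux", "Android Pie", "QNX"],
            ["Windows", "Linux", "VxWorks"],
            ["Windows"],
        ],
    }
)


def has_value(values, wanted):
    """Hypothetical helper: case-insensitive membership test on a list cell."""
    return isinstance(values, list) and wanted.lower() in (v.lower() for v in values)


# Filtering: keep products whose list contains "Linux".
linux_products = toy[toy["operating_system_bsp"].apply(has_value, wanted="linux")]
print(linux_products["product_id"].tolist())

# Counting: one row per (product, value) pair, then a frequency table.
exploded = toy.explode("operating_system_bsp")
os_counts = exploded["operating_system_bsp"].value_counts()
print(os_counts.to_dict())
```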


---



## 3. Data Cleaning Functions

For each column, we define a cleaning function following a consistent structure:

- **Input**: Original value.
- **Processing**: Steps to clean and standardize the value.
- **Output**: Cleaned value.

We also document the rationale behind the cleaning choices.


### 3.1 Name

**Rationale**:

- Standardize product names to improve matching and searching.
- Remove unnecessary suffixes and common terms.
- Normalize casing and remove special characters.


In [7]:
df["name"].value_counts().head(50)

name
AIMB                                                       47
SOM                                                        19
ASMB                                                       12
PCM                                                        11
ARK                                                        10
MIO                                                        10
RSB                                                         9
RASPBERRY PI MODEL B                                        7
AIMB-275                                                    7
VENICE GW                                                   7
COM EXPRESS COMPACT MODULE                                  6
COM EXPRESS BASIC MODULE                                    6
TREK                                                        6
EDHMIC                                                      6
EMETXEI                                                     5
PCA                                                         5
CON


**Cleaning Function**:


In [8]:
def clean_name(name):
    if not isinstance(name, str):
        return name

    # Convert to uppercase
    name = name.upper()

    # Remove extra spaces
    name = " ".join(name.split())

    # Remove common suffixes
    suffixes_to_remove = [
        " MODULE", " BOARD", " COMPUTER", " SERIES", " MOTHERBOARD", " CPU",
        " SINGLE", " DEVELOPMENT KIT", " DEVELOPMENT", " PLATFORM", " SYSTEM",
    ]
    for suffix in suffixes_to_remove:
        if name.endswith(suffix):
            name = name[:-len(suffix)]

    # Standardize common terms and abbreviations
    replacements = {
        "COM EXPRESS": "COM-EXPRESS",
        "RASPBERRY PI": "RASPBERRY-PI",
        "COMPUTER ON": "COM",
        "TX COMPUTER ON": "TX COM",
        "ICES COM EXPRESS": "ICES COM-EXPRESS",
        "SYSTEM ON": "SOM",
        "COMPUTER-ON-MODULE": "COM",
        "MINI ITX": "MINI-ITX",
        "PICO ITX": "PICO-ITX",
        "MICRO ITX": "MICRO-ITX",
        "NANO ITX": "NANO-ITX",
        "QSEVEN": "Q7",
        "IMXM": "IMX-M",
        "IMX M": "IMX-M",
        "IMXUL": "IMX-UL",
        "IMX UL": "IMX-UL",
    }
    for old, new in replacements.items():
        name = name.replace(old, new)

    # Remove parentheses and their contents
    name = re.sub(r"\([^)]*\)", "", name)

    # Remove non-alphanumeric characters except hyphens and spaces
    name = re.sub(r"[^A-Z0-9\- ]", "", name)

    # Remove extra spaces
    name = " ".join(name.split())

    return name

**Apply Cleaning Function and Display Statistics**:


In [9]:
df = print_before_after_stats(df, 'name', clean_name)


=== Cleaning 'name' Column ===
Original unique values: 677
Cleaned unique values:  671
Reduction:              6
Reduction percentage:   0.89%

Top 50 most frequent values after cleaning:
name
AIMB                                                47
SOM                                                 20
ASMB                                                12
PCM                                                 12
ARK                                                 11
MIO                                                 10
RSB                                                  9
RASPBERRY-PI MODEL B                                 7
VENICE GW                                            7
AIMB-275                                             7
EDHMIC                                               6
COM-EXPRESS COMPACT                                  6
COM-EXPRESS BASIC                                    6
TREK                                                 6
MIC                                 

**Example changes**:

- Before cleaning: "Raspberry Pi Model B+ (2014)"
- After cleaning: "RASPBERRY-PI MODEL B" (the parenthesized year and the "+" are both removed)



---



### 3.2 Manufacturer

**Rationale**:

- Standardize manufacturer names to consolidate variations.
- Improve grouping and filtering by manufacturer.


In [10]:
df["manufacturer"].value_counts().head(50)

manufacturer
ADVANTECH                             404
IEI                                    60
CONGATEC                               39
KONTRON                                37
ARBOR TECHNOLOGY                       25
VERSALOGIC CORPORATION                 22
IBASE                                  21
ADLINK TECHNOLOGY INC                  18
MYIR ELECTRONICS LIMITED               15
AXIOMTEK                               14
KARO ELECTRONICS GMBH                  14
VERSALOGIC                             13
IWAVE SYSTEMS TECHNOLOGIES             11
EUROTECH                               11
PHOENIX CONTACT                        11
SOLIDRUN                               10
ADLINK TECHNOLOGY                      10
EDA TECHNOLOGY CO LTD                   9
CONGATEC GMBH                           9
IWAVE SYSTEMS                           9
NEXCOBOT                                9
SOLIDRUN LTD                            9
GATEWORKS CORPORATION                   8
RASPBERRY PI TRADING 


**Cleaning Function**:


In [11]:
def clean_manufacturer(value):
    if not isinstance(value, str):
        return value

    # Convert to uppercase
    value = value.upper().strip()

    # Remove extra spaces
    value = " ".join(value.split())

    # Standardize manufacturer names
    replacements = {
        "ADLINK": "ADLINK TECHNOLOGY",
        "ADLINK TECHNOLOGY INC": "ADLINK TECHNOLOGY",
        "IWAVE SYSTEMS TECHNOLOGIES": "IWAVE SYSTEMS",
        "IWAVE SYSTEMS TECHNOLOGIES PVT LTD": "IWAVE SYSTEMS",
        "IWAVE SYSTEMS TECH": "IWAVE SYSTEMS",
        "SOLIDRUN LTD": "SOLIDRUN",
        "IBASE TECHNOLOGY": "IBASE",
        "RASPBERRY PI TRADING LTD": "RASPBERRY PI",
        "RASPBERRY PI LTD": "RASPBERRY PI",
        "RASPBERRY PI FOUNDATION": "RASPBERRY PI",
        "CONGATEC GMBH": "CONGATEC",
        "CONGATEC AG": "CONGATEC",
        "VERSALOGIC CORPORATION": "VERSALOGIC",
        "DIGI INTERNATIONAL INC": "DIGI INTERNATIONAL",
        "KARO ELECTRONICS GMBH": "KARO ELECTRONICS",
        "FORLINX EMBEDDED TECHNOLOGY CO LTD": "FORLINX EMBEDDED TECHNOLOGY",
        "SHANGHAI EDA TECHNOLOGY CO LTD": "EDA TECHNOLOGY",
        "ESPRESSIF SYSTEMS": "ESPRESSIF",
        "SEEED STUDIO": "SEEEDSTUDIO",
        "ADVANTECH CO LTD": "ADVANTECH",
        "ADVANTECH INNOCORE": "ADVANTECH",
        "INTRINSYC TECHNOLOGIES CORPORATION": "INTRINSYC TECHNOLOGIES",
        "INTRINSYC TECHNOLOGIES CORP": "INTRINSYC TECHNOLOGIES",
        "SECO SPA": "SECO",
        "KUNBUS GMBH": "KUNBUS",
        "GATEWORKS CORPORATION": "GATEWORKS",
        "MYIR ELECTRONICS LIMITED": "MYIR ELECTRONICS",
        "ACURA EMBEDDED SYSTEMS INC": "ACURA EMBEDDED SYSTEMS",
        "WINSYSTEMS INC": "WINSYSTEMS",
        "AUVIDEA GMBH": "AUVIDEA",
        "MOXA INC": "MOXA",
    }
    for old, new in replacements.items():
        if value == old or value.startswith(old):
            return new

    return value



**Apply Cleaning Function and Display Statistics**:


In [12]:
df = print_before_after_stats(df, 'manufacturer', clean_manufacturer)


=== Cleaning 'manufacturer' Column ===
Original unique values: 112
Cleaned unique values:  88
Reduction:              24
Reduction percentage:   21.43%

Top 50 most frequent values after cleaning:
manufacturer
ADVANTECH                      406
IEI                             60
CONGATEC                        50
KONTRON                         37
VERSALOGIC                      35
ADLINK TECHNOLOGY               29
IBASE                           27
ARBOR TECHNOLOGY                25
IWAVE SYSTEMS                   22
RASPBERRY PI                    21
SOLIDRUN                        19
KARO ELECTRONICS                17
MYIR ELECTRONICS                15
AXIOMTEK                        14
GATEWORKS                       12
EUROTECH                        11
PHOENIX CONTACT                 11
EDA TECHNOLOGY CO LTD            9
SECO                             9
NEXCOBOT                         9
DIGI INTERNATIONAL               8
GIGAIPC                          6
FORLINX EMBEDDED TE

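**Example changes**:
- Before cleaning: "Adlink"
- After cleaning: "ADLINK TECHNOLOGY"

Note that "EDA TECHNOLOGY CO LTD" remains in the cleaned output: the mapping only lists "SHANGHAI EDA TECHNOLOGY CO LTD", which is not a prefix of the shorter form.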



---


### 3.3 Form Factor

**Rationale**:

- Standardize form factors to enable filtering and comparison.
- Group similar form factors under common terms.


In [13]:
df["form_factor"].value_counts().head(50)

form_factor
MINI-ITX                    56
SBC                         55
ATX                         51
SMARC                       38
COM EXPRESS COMPACT         38
QSEVEN                      35
SINGLE BOARD COMPUTER       32
Single Board Computer       30
COM EXPRESS                 29
COM EXPRESS BASIC           25
MICROATX                    22
RASPBERRY PI                20
COM EXPRESS MINI            18
SODIMM                      17
PICO-ITX                    16
COM EXPRESS BASIC MODULE    13
Mini-ITX                    13
PICOITX                     13
PCPLUS                      12
EBX                         12
EPIC                        11
SYSTEM ON MODULE            11
Fanless Box PC              10
MICRO ATX                   10
QFN                         10
ETX                          9
SOM                          9
Box PC                       9
COM                          7
PICMG                        7
ALL-IN-ONE                   7
DIN RAIL                   


**Cleaning Function**:


In [14]:
def clean_form_factor(value):
    if not isinstance(value, str):
        return value

    value = value.upper().strip()
    value = " ".join(value.split())

    replacements = {
        "MINI-ITX": "MINI-ITX",
        "MINI ITX": "MINI-ITX",
        "MINIITX": "MINI-ITX",
        "THIN MINI-ITX": "THIN MINI-ITX",
        "THIN MINIITX": "THIN MINI-ITX",
        "MICRO-ATX": "MICRO-ATX",
        "MICRO ATX": "MICRO-ATX",
        "MICROATX": "MICRO-ATX",
        "MATX": "MICRO-ATX",
        "PICO-ITX": "PICO-ITX",
        "PICO ITX": "PICO-ITX",
        "PICOITX": "PICO-ITX",
        "NANO-ITX": "NANO-ITX",
        "NANO ITX": "NANO-ITX",
        "NANOITX": "NANO-ITX",
        "COM EXPRESS": "COM EXPRESS",
        "COMEXPRESS": "COM EXPRESS",
        "COM-EXPRESS": "COM EXPRESS",
        "QSEVEN": "QSEVEN",
        "Q7": "QSEVEN",
        "SMARC": "SMARC",
        "SINGLE BOARD COMPUTER": "SBC",
        "SINGLEBOARD COMPUTER": "SBC",
        "SINGLE-BOARD COMPUTER": "SBC",
        "SINGLEBOARD": "SBC",
        "EMBEDDED SBC": "SBC",
        "SYSTEM ON MODULE": "SOM",
        "SYSTEM-ON-MODULE": "SOM",
        "SYSTEM MODULE": "SOM",
        "COMPUTER ON MODULE": "COM",
        "RASPBERRY PI": "RASPBERRY PI",
        "RASPBERRY PI COMPATIBLE": "RASPBERRY PI",
        "FANLESS BOX PC": "BOX PC",
        "FANLESS EMBEDDED BOX PC": "BOX PC",
        "EMBEDDED BOX PC": "BOX PC",
        "EMBEDDED BOX COMPUTER": "BOX PC",
        "BOX COMPUTER": "BOX PC",
        "FANLESS PC": "BOX PC",
        "FANLESS COMPACT SYSTEM": "BOX PC",
        "COMPACT EXPANDABLE FANLESS SYSTEM": "BOX PC",
        "DINRAIL": "DIN RAIL",
        "DIN RAIL MOUNTABLE": "DIN RAIL",
        "PICMG": "PICMG",
        "FULLSIZE PICMG": "PICMG FULLSIZE",
        "PICMG FULLSIZE CPU CARD": "PICMG FULLSIZE",
        "HALF-SIZE PICMG": "PICMG HALF-SIZE",
        "HALFSIZE PICMG": "PICMG HALF-SIZE",
        "COMPACTPCI": "COMPACT PCI",
        "MINI PCIE": "MINI PCIE",
        "HALF-SIZE": "HALF-SIZE",
        "HALFSIZE": "HALF-SIZE",
        "HALF SIZE": "HALF-SIZE",
        "ALL-IN-ONE": "ALL-IN-ONE",
        "RACKMOUNT": "RACKMOUNT",
        "U RACKMOUNT": "RACKMOUNT",
        "U RACKMOUNT CHASSIS": "RACKMOUNT",
        "EMBEDDED": "EMBEDDED",
        "EMBEDDED SYSTEM": "EMBEDDED",
        "EMBEDDED IPC": "EMBEDDED",
        "EMBEDDED MODULE": "EMBEDDED",
    }

    for old, new in replacements.items():
        if value == old or value.startswith(old + " "):
            return new

    # Remove non-alphanumeric characters except hyphens and spaces
    value = re.sub(r"[^A-Z0-9\- ]", "", value)

    return value



**Apply Cleaning Function and Display Statistics**:


In [15]:
df = print_before_after_stats(df, 'form_factor', clean_form_factor)


=== Cleaning 'form_factor' Column ===
Original unique values: 227
Cleaned unique values:  140
Reduction:              87
Reduction percentage:   38.33%

Top 50 most frequent values after cleaning:
form_factor
COM EXPRESS                         147
SBC                                 133
MINI-ITX                             76
ATX                                  51
SMARC                                40
BOX PC                               38
QSEVEN                               37
MICRO-ATX                            35
PICO-ITX                             35
SOM                                  31
RASPBERRY PI                         25
SODIMM                               17
DIN RAIL                             13
EBX                                  12
PCPLUS                               12
EPIC                                 11
EMBEDDED                             11
HALF-SIZE                            11
QFN                                  10
ETX                           

**Example changes**:

- Before cleaning: "Mini ITX"
- After cleaning: "MINI-ITX"


---



### 3.4 Evaluation or Commercialization

**Rationale**:

- Standardizing this field enables users to distinguish between products intended for evaluation and those ready for commercialization.


In [16]:
df["evaluation_or_commercialization"].value_counts().head(50)

evaluation_or_commercialization
False                826
True                 151
Commercialization      1
Name: count, dtype: int64

**Cleaning Function**:


In [17]:
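def clean_evaluation_or_commercialization(value):
    # Pass real booleans through unchanged
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        value = value.strip().lower()
        if value in ["true", "false"]:
            return value == "true"
    # Anything else (e.g. "Commercialization" or NaN) is treated as unknown
    return None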

**Apply Cleaning Function and Display Statistics**:


In [18]:
df = print_before_after_stats(df, 'evaluation_or_commercialization', clean_evaluation_or_commercialization)


=== Cleaning 'evaluation_or_commercialization' Column ===
Original unique values: 3
Cleaned unique values:  2
Reduction:              1
Reduction percentage:   33.33%

Top 50 most frequent values after cleaning:
evaluation_or_commercialization
False    826
True     151



### 3.5 Processor Architecture

**Rationale**:

- Standardize processor architectures for accurate filtering.
- Group similar architectures (e.g., x86 variants).


In [19]:
df["processor_architecture"].value_counts().head(50)

processor_architecture
x86                                                         372
X86                                                         251
ARM                                                         250
ARM Cortex-A53                                                9
x86-64                                                        8
ARM Cortex-A                                                  5
SILVERMONT                                                    4
ARM Cortex-A72                                                3
INTEL                                                         3
ARM Cortex-A8                                                 3
ARM Cortex-A9                                                 2
ARM v8                                                        2
Intel 64                                                      2
Skylake-U                                                     2
Apollo Lake                                                   2
ARM big.LITTLE   

**Cleaning Function**:


In [20]:
def clean_processor_architecture(value):
    if not isinstance(value, str):
        return value

    value = value.upper().strip()
    value = " ".join(value.split())

    replacements = {
        "X86": "X86",
        "X86-64": "X86-64",
        "X86_64": "X86-64",
        "INTEL 64": "X86-64",
        "IA-32": "X86",
        "ARM": "ARM",
        "ARM CORTEX-A53": "ARM CORTEX-A53",
        "ARM CORTEX A53": "ARM CORTEX-A53",
        "CORTEX-A53 64-BIT": "ARM CORTEX-A53",
        "ARM CORTEX-A": "ARM CORTEX-A",
        "ARM CORTEX A": "ARM CORTEX-A",
        "CORTEX-A": "ARM CORTEX-A",
        "ARM CORTEXA": "ARM CORTEX-A",
        "ARM CORTEX-A72": "ARM CORTEX-A72",
        "ARM CORTEX-A8": "ARM CORTEX-A8",
        "ARM CORTEX-A9": "ARM CORTEX-A9",
        "ARM CORTEX-A7": "ARM CORTEX-A7",
        "CORTEX-A7": "ARM CORTEX-A7",
        "ARM CORTEX-M3": "ARM CORTEX-M3",
        "ARM CORTEX-M4": "ARM CORTEX-M4",
        "ARM CORTEX-M33": "ARM CORTEX-M33",
        "ARMV8": "ARMV8",
        "ARM BIG.LITTLE": "ARM BIG.LITTLE",
        "ARM CORTEX-A72/A53": "ARM BIG.LITTLE",
        "SILVERMONT": "INTEL ATOM",
        "APOLLO LAKE": "INTEL ATOM",
        "BAY TRAIL": "INTEL ATOM",
        "INTEL ATOM C3000": "INTEL ATOM",
        "ZEN": "AMD ZEN",
        "SKYLAKE-U": "INTEL CORE",
        "HASWELL": "INTEL CORE",
        "ALDER LAKE": "INTEL CORE",
        "ALDER LAKE-S": "INTEL CORE",
        "ALDER LAKE N": "INTEL CORE",
        "RAPTOR LAKE-S": "INTEL CORE",
        "INTEL PERFORMANCE HYBRID ARCHITECTURE": "INTEL HYBRID",
        "PERFORMANCE HYBRID ARCHITECTURE": "INTEL HYBRID",
        "X86 HYBRID ARCHITECTURE": "INTEL HYBRID",
        "INTEL® CORE™ ULTRA METEOR LAKE-H/U": "INTEL HYBRID",
        "RISCV": "RISC-V",
        "XTENSA": "XTENSA",
        "XTENSA LX6": "XTENSA",
        "MOBILE ATHLON ARCHITECTURE": "AMD",
        "8-BIT": "8-BIT",
        "16-BIT": "16-BIT",
        "32-BIT ARM ARCHITECTURE": "ARM",

    }
    for old, new in replacements.items():
        if value == old or value.startswith(old + " "):
            return new

    return value

**Apply Cleaning Function and Display Statistics**:


In [21]:
df = print_before_after_stats(df, 'processor_architecture', clean_processor_architecture)



=== Cleaning 'processor_architecture' Column ===
Original unique values: 60
Cleaned unique values:  19
Reduction:              41
Reduction percentage:   68.33%

Top 50 most frequent values after cleaning:
processor_architecture
X86                                                         626
ARM                                                         291
X86-64                                                       11
INTEL ATOM                                                    9
INTEL CORE                                                    7
INTEL HYBRID                                                  4
AMD ZEN                                                       3
INTEL                                                         3
RISC-V                                                        2
XTENSA                                                        2
ARM CORTEX-A7                                                 1
PICMZ                                                         1
8-


**Example changes**:

- Before cleaning: "x86 64-bit"
- After cleaning: "X86-64"




---



### 3.6 Processor Core Count

**Rationale**:

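- Standardize core counts to numeric strings (e.g., "Quad-core" becomes "4").
- Represent ranges and "up to N" values in a min-max format (e.g., "Up to 14 cores" becomes "1-14"), and map textual representations such as "dual" or "octa" to digits.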


In [22]:
df["processor_core_count"].value_counts().head(50)

processor_core_count
4                                            270
2                                             95
Quad-core                                     56
Quad Core                                     51
6                                             36
8                                             27
Quad                                          20
1                                             20
Dual-core                                     18
Quad core                                     16
Quad-Core                                     15
Dual                                          12
Dual Core                                     11
Up to 14 cores                                10
24                                            10
Quad/Dual                                      9
Up to 16 cores                                 8
16 cores                                       7
16                                             7
Dual/Quad                                      7


**Cleaning Function**:


In [23]:
def clean_processor_core_count(value):
    if isinstance(value, str):
        value = value.lower()
        value = re.sub(r"\s+", " ", value)  # Remove extra spaces

        # Handle specific mappings
        mapping = {
            "single": "1",
            "dual": "2",
            "quad": "4",
            "hexa": "6",
            "octa": "8",
            "four": "4",
            "eight": "8",
            "nine": "9",
        }
        for key, replacement in mapping.items():
            if key in value:
                return replacement

        # Handle ranges
        range_match = re.search(r"(\d+)\s*(?:to|-)\s*(\d+)", value)
        if range_match:
            return f"{range_match.group(1)}-{range_match.group(2)}"

        # Handle "up to X cores"
        up_to_match = re.search(r"up to (\d+)", value)
        if up_to_match:
            return f"1-{up_to_match.group(1)}"

        # Handle "X to Y cores"
        to_match = re.search(r"(\w+)\s+to\s+(\w+)\s+cores", value)
        if to_match:
            start = mapping.get(to_match.group(1), to_match.group(1))
            end = mapping.get(to_match.group(2), to_match.group(2))
            return f"{start}-{end}"

        # Handle "X processing cores"
        processing_cores_match = re.search(r"(\w+)\s+processing\s+cores", value)
        if processing_cores_match:
            return mapping.get(processing_cores_match.group(1), processing_cores_match.group(1))

        # Extract single number if present
        number_match = re.search(r"\d+", value)
        if number_match:
            return number_match.group()

    return value


**Apply Cleaning Function and Display Statistics**:


In [24]:
df = print_before_after_stats(df, 'processor_core_count', clean_processor_core_count)



=== Cleaning 'processor_core_count' Column ===
Original unique values: 141
Cleaned unique values:  30
Reduction:              111
Reduction percentage:   78.72%

Top 50 most frequent values after cleaning:
processor_core_count
4       460
2       202
1        60
6        41
8        37
24       15
16       15
1-14     12
14       10
1-16     10
1-24      9
6-10      6
12        6
1-4       5
10        5
9         2
1-8       2
128       2
1-6       2
1-28      2
80        1
3         1
64        1
5         1
32        1
2-16      1
1-60      1
20        1
1-10      1
1-32      1



**Example changes**:

- Before cleaning: "Quad-core"
- After cleaning: "4"
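
For range-aware filtering, the cleaned strings can be turned into numeric bounds. The following is a minimal sketch rather than part of the original pipeline. The helper names `core_count_bounds` and `supports_core_count` are hypothetical, and the sketch assumes cleaned values are either a single integer ("4") or a min-max range ("1-14"), as in the output above.

```python
import re


def core_count_bounds(cleaned):
    """Return (min_cores, max_cores) for a cleaned core-count string, or None."""
    if not isinstance(cleaned, str):
        return None
    match = re.fullmatch(r"(\d+)(?:-(\d+))?", cleaned.strip())
    if match is None:
        return None  # leave unparsed values for manual review
    low = int(match.group(1))
    high = int(match.group(2)) if match.group(2) else low
    return low, high


def supports_core_count(cleaned, wanted):
    """True if `wanted` cores falls inside the bounds of the cleaned value."""
    bounds = core_count_bounds(cleaned)
    return bounds is not None and bounds[0] <= wanted <= bounds[1]


print(core_count_bounds("4"))          # (4, 4)
print(core_count_bounds("1-14"))       # (1, 14)
print(supports_core_count("6-10", 8))  # True
```

Values that do not fit either shape return `None`, so they can be reviewed separately instead of being silently coerced.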


---


### 3.7 Processor Manufacturer

**Rationale**:

- Variations in manufacturer names (e.g., "INTEL" vs. "Intel", or "TI" vs. "TEXAS INSTRUMENTS") hinder effective grouping and filtering.
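
As a small illustration of the grouping problem, the sketch below normalizes case and whitespace and then applies an exact-or-prefix match, which is the matching rule used by the cleaning function in this section. The alias table is a shortened subset chosen for illustration, `normalize_manufacturer` is a hypothetical name, and "Titan Micro" is a made-up vendor used only to show a non-match.

```python
ALIASES = {
    "INTEL": "INTEL",
    "TI": "TEXAS INSTRUMENTS",
    "ST": "STMICROELECTRONICS",
}


def normalize_manufacturer(raw):
    """Upper-case, collapse whitespace, then map known aliases to one name."""
    value = " ".join(raw.upper().split())
    for alias, canonical in ALIASES.items():
        # Match the alias exactly or as a whole leading word, so "TI" does not
        # swallow unrelated names that merely start with the same letters.
        if value == alias or value.startswith(alias + " "):
            return canonical
    return value


print(normalize_manufacturer("  intel   corporation "))  # INTEL
print(normalize_manufacturer("ST Microelectronics"))     # STMICROELECTRONICS
print(normalize_manufacturer("Titan Micro"))             # TITAN MICRO
```

Requiring a trailing space after the alias is what keeps short aliases such as "TI" or "ST" from matching longer, unrelated names.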

In [25]:
df["processor_manufacturer"].value_counts().head(50)

processor_manufacturer
INTEL                        543
NXP                          120
Intel                         84
AMD                           61
BROADCOM                      23
ROCKCHIP                      22
FREESCALE                     11
Broadcom                      11
ARM                           10
TEXAS INSTRUMENTS             10
NVIDIA                         8
ALLWINNER                      7
XILINX                         5
TI                             5
RENESAS                        5
DMP                            5
QUALCOMM                       4
ESPRESSIF SYSTEMS              3
MEDIATEK                       3
MARVELL                        3
STMICROELECTRONICS             3
Freescale                      2
DMP ELECTRONICS INC            2
RASPBERRY PI LTD               2
KNERON                         1
STERICSSON                     1
Rabbit                         1
Marvell                        1
ST Microelectronics            1
RABBIT              

**Cleaning Function**:


In [26]:
def clean_processor_manufacturer(value):
    if not isinstance(value, str):
        return value

    # Convert to uppercase
    value = value.upper()

    # Remove extra spaces
    value = " ".join(value.split())

    # Standardize common terms
    replacements = {
        "INTEL": "INTEL",
        "NXP": "NXP",
        "AMD": "AMD",
        "BROADCOM": "BROADCOM",
        "ROCKCHIP": "ROCKCHIP",
        "FREESCALE": "FREESCALE",
        "ARM": "ARM",
        "ARM LIMITED": "ARM",
        "TEXAS INSTRUMENTS": "TEXAS INSTRUMENTS",
        "TI": "TEXAS INSTRUMENTS",
        "NVIDIA": "NVIDIA",
        "ALLWINNER": "ALLWINNER",
        "ALLWINNER TECH": "ALLWINNER",
        "ALLWINNER TECHNOLOGY": "ALLWINNER",
        "XILINX": "XILINX",
        "RENESAS": "RENESAS",
        "DMP": "DMP",
        "DMP ELECTRONICS INC": "DMP",
        "QUALCOMM": "QUALCOMM",
        "QUALCOMM TECHNOLOGIES INC": "QUALCOMM",
        "ESPRESSIF SYSTEMS": "ESPRESSIF",
        "ESPRESSIF": "ESPRESSIF",
        "MEDIATEK": "MEDIATEK",
        "MARVELL": "MARVELL",
        "STMICROELECTRONICS": "STMICROELECTRONICS",
        "ST MICROELECTRONICS": "STMICROELECTRONICS",
        "ST": "STMICROELECTRONICS",
        "STM": "STMICROELECTRONICS",
        "RASPBERRY PI LTD": "RASPBERRY PI",
        "RASPBERRY PI": "RASPBERRY PI",
        "KNERON": "KNERON",
        "STERICSSON": "STERICSSON",
        "RABBIT": "RABBIT",
        "RABBIT SEMICONDUCTOR": "RABBIT",
        "RABBIT SEMICONDUCTOR INC": "RABBIT",
        "WESTERN DESIGN CENTER": "WESTERN DESIGN CENTER",
        "ADVANTECH": "ADVANTECH",
        "ATMEL": "ATMEL",
        "VIA": "VIA",
        "MICROCHIP": "MICROCHIP",
    }

    for old, new in replacements.items():
        if value == old or value.startswith(old + " "):
            return new

    # If no match found, return the original value
    return value

**Apply Cleaning Function and Display Statistics**:


In [27]:
df = print_before_after_stats(df, 'processor_manufacturer', clean_processor_manufacturer)


=== Cleaning 'processor_manufacturer' Column ===
Original unique values: 49
Cleaned unique values:  27
Reduction:              22
Reduction percentage:   44.90%

Top 50 most frequent values after cleaning:
processor_manufacturer
INTEL                    627
NXP                      120
AMD                       61
BROADCOM                  34
ROCKCHIP                  23
TEXAS INSTRUMENTS         15
FREESCALE                 13
ARM                       12
ALLWINNER                 10
NVIDIA                     8
DMP                        7
STMICROELECTRONICS         6
RENESAS                    6
QUALCOMM                   5
XILINX                     5
RABBIT                     4
ESPRESSIF                  4
MARVELL                    4
MEDIATEK                   3
RASPBERRY PI               3
VIA                        1
MICROCHIP                  1
ATMEL                      1
ADVANTECH                  1
WESTERN DESIGN CENTER      1
KNERON                     1
STERICSSON      

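### 3.8 Processor TDP

**Rationale**:

- Standardize TDP values to watts with one decimal place (e.g., "15W" becomes "15.0W").
- Represent ranges in a min-max format and "up to X" values as a range starting at 0.0W.
- Map qualitative descriptions such as "low power" to a single "LOW POWER" label.
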

In [28]:
df["processor_tdp"].value_counts().head(50)

processor_tdp
15W                                                         90
6W                                                          86
35W                                                         43
45W                                                         35
95W                                                         25
65W                                                         25
10W                                                         23
125W                                                        15
12W                                                         12
6W/6W/12W/9W/6W up to 45W                                   10
28W (passive cooled)                                        10
65W / 35W                                                    8
25 W                                                         8
125 W                                                        7
10 watts                                                     7
28W                                      

**Cleaning Function**:


In [29]:
def clean_processor_tdp(value):
    if not isinstance(value, str):
        return value

    value = value.lower().strip()
    value = re.sub(r"\s+", " ", value)  # Remove extra spaces

    # Handle special cases
    if value in ["ultra low power", "low power consumption", "low power design", "low power"]:
        return "LOW POWER"

    # Convert ranges to standard format
    range_match = re.search(r"(\d+(?:\.\d+)?)\s*(?:w|watts)?\s*(?:to|-)\s*(\d+(?:\.\d+)?)\s*(?:w|watts)?", value)
    if range_match:
        return f"{float(range_match.group(1)):.1f}W-{float(range_match.group(2)):.1f}W"

    # Handle "up to X" format
    up_to_match = re.search(r"(?:up to|max) (\d+(?:\.\d+)?)\s*(?:w|watts)", value)
    if up_to_match:
        return f"0.0W-{float(up_to_match.group(1)):.1f}W"

    # Handle single values
    single_match = re.search(r"(\d+(?:\.\d+)?)\s*(?:w|watts)?", value)
    if single_match:
        return f"{float(single_match.group(1)):.1f}W"

    return value

**Apply Cleaning Function and Display Statistics**:


In [30]:
df = print_before_after_stats(df, "processor_tdp", clean_processor_tdp)


=== Cleaning 'processor_tdp' Column ===
Original unique values: 143
Cleaned unique values:  83
Reduction:              60
Reduction percentage:   41.96%

Top 50 most frequent values after cleaning:
processor_tdp
6.0W           102
15.0W           94
35.0W           49
45.0W           45
65.0W           38
10.0W           33
95.0W           26
125.0W          22
28.0W           20
12.0W           16
0.0W-45.0W      12
25.0W           10
6.0W-12.0W       9
35.0W-65.0W      8
LOW POWER        7
5.0W             6
205.0W           5
12.0W-25.0W      5
15.0W-45.0W      5
15.0W-25.0W      4
6.7W             4
9.0W             4
6.0W-45.0W       4
31.0W            3
80.0W            3
25.0W-45.0W      3
8.0W             3
70.0W            3
17.0W            3
5.0W-10.0W       3
6.0W-10.0W       3
19.0W            3
12.0W-28.0W      2
12.0W-54.0W      2
77.0W            2
1.0W             2
10.0W-54.0W      2
13.0W            2
0.0W-125.0W      2
2.0W             2
19.5W            1
0.0W-13.

### 3.9 Memory

**Rationale**:

- Standardize memory sizes to GB.
- Handle ranges and different units.


In [31]:

df["memory"].value_counts().head(50)

memory
4 GB LPDDR4                                     24
8GB DDR3L                                       17
1 GB DDR3                                       13
2 GB LPDDR4                                     11
8GB LPDDR4                                      10
8 GB DDR3L                                      10
4GB LPDDR4                                       9
4 GB DDR3L                                       9
Up to 8GB DDR3L                                  8
Up to 32GB DDR4                                  8
512 MB DDR3                                      7
8GB DDR3L SDRAM                                  7
4 GB DDR3L SDRAM                                 7
8 GB DDR4                                        6
4 GByte LPDDR4                                   6
4 GB LPDDR3                                      6
512 MB DDR3 SDRAM                                6
2GB LPDDR4                                       6
8GB DDR4 SODIMM                                  5
Up to 32GB DDR4 SODIMM  


**Cleaning Function**:


In [32]:
def clean_memory(value):
    if not isinstance(value, str):
        return value

    value = value.upper().strip()
    value = re.sub(r'\s+', ' ', value)

    # Handle "Up to X" format
    up_to_match = re.match(r'UP TO (\d+(\.\d+)?)\s*(GB|MB|TB)', value)
    if up_to_match:
        size = float(up_to_match.group(1))
        unit = up_to_match.group(3)
        size_gb = standardize_units(size, unit, 'GB')
        return f"0-{size_gb:.1f}GB {value.split(maxsplit=3)[-1].split()[0]}"

    # Handle ranges
    range_match = re.match(r'(\d+(\.\d+)?)\s*(GB|MB|TB)\s*-\s*(\d+(\.\d+)?)\s*(GB|MB|TB)', value)
    if range_match:
        size1 = float(range_match.group(1))
        unit1 = range_match.group(3)
        size2 = float(range_match.group(4))
        unit2 = range_match.group(6)
        size1_gb = standardize_units(size1, unit1, 'GB')
        size2_gb = standardize_units(size2, unit2, 'GB')
        return f"{size1_gb:.1f}-{size2_gb:.1f}GB {value.split()[-1]}"

    # Handle single values
    single_match = re.match(r'(\d+(\.\d+)?)\s*(GB|MB|TB)', value)
    if single_match:
        size = float(single_match.group(1))
        unit = single_match.group(3)
        size_gb = standardize_units(size, unit, 'GB')
        memory_type = ' '.join(re.findall(r'(DDR\d+|LPDDR\d+)', value))
        return f"{size_gb:.1f}GB {memory_type}".strip()

    # Handle multiple memory types
    if 'DDR' in value and ',' in value:
        return f"UNKNOWN {value}"

    # Remove unnecessary words
    value = re.sub(r'\b(GB|MB|TB|GBYTE|GBYTES|DRAM|SDRAM|SODIMM)\b', '', value)
    value = re.sub(r'\d+MHZ', '', value)

    return value.strip()




**Apply Cleaning Function and Display Statistics**:


In [33]:
df = print_before_after_stats(df, 'memory', clean_memory)



=== Cleaning 'memory' Column ===
Original unique values: 638
Cleaned unique values:  376
Reduction:              262
Reduction percentage:   41.07%

Top 50 most frequent values after cleaning:
memory
8.0GB DDR3                                              60
4.0GB LPDDR4                                            47
4.0GB DDR3                                              36
1.0GB DDR3                                              26
2.0GB LPDDR4                                            26
8.0GB LPDDR4                                            23
0-8.0GB DDR3L                                           22
0-32.0GB DDR4                                           21
0.5GB DDR3                                              20
8.0GB DDR4                                              19
0-32.0GB GBYTE                                          18
0-64.0GB DDR4                                           18
64.0GB DDR4                                             15
2.0GB DDR3                      

**Example changes**:


- Before cleaning: "Up to 8GB DDR3L"
- After cleaning: "0-8.0GB DDR3L"
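
The output above contains values such as "0-32.0GB GBYTE", where the token following the size was taken as the memory type by position. An input shaped like "Up to 32 GByte DDR4" would produce that result. A minimal sketch of an alternative, assuming the type can be recognized by its DDR/LPDDR prefix, is shown below. `extract_memory_type` is a hypothetical helper and not part of the original notebook.

```python
import re

# Matches DDR3, DDR3L, DDR4, LPDDR4, LPDDR4X, ... in upper-cased text.
MEMORY_TYPE_RE = re.compile(r"\b(?:LP)?DDR\d+[A-Z]?\b")


def extract_memory_type(text):
    """Return the first DDR-style memory type found in `text`, or ''."""
    match = MEMORY_TYPE_RE.search(text.upper())
    return match.group(0) if match else ""


print(extract_memory_type("Up to 32 GByte DDR4"))  # DDR4
print(extract_memory_type("8GB DDR3L SDRAM"))      # DDR3L
print(extract_memory_type("4 GB LPDDR4"))          # LPDDR4
```

Searching for the type by pattern keeps suffixes such as the "L" in DDR3L and does not depend on where the type appears in the string.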


---


### 3.10 Onboard Storage

**Rationale**:

- Standardize storage capacities to GB.
- Handle different storage types and units.


In [34]:
df["onboard_storage"].value_counts().head(50)

onboard_storage
mSATA                                                         43
16 GB eMMC                                                    35
64GB eMMC                                                     30
Up to 64GB eMMC                                               20
16GB eMMC                                                     17
8 GB eMMC                                                     16
eMMC                                                          15
64 GB eMMC                                                    13
Optional NVMe SSD                                             12
32 GB eMMC                                                    10
8GB eMMC                                                       9
8 GB eMMC NAND Flash                                           8
CompactFlash Type III                                          8
CompactFlash socket                                            7
32Gb/s M.2                                                     7
Up to 64 


**Cleaning Function**:


In [35]:
def clean_onboard_storage(value):
    if not isinstance(value, str):
        return value

    value = value.upper().strip()
    value = re.sub(r'\s+', ' ', value)

    # Standardize storage sizes
    size_matches = re.findall(r'(\d+(\.\d+)?)\s*(GB|MB|TB)', value)
    for match in size_matches:
        size = float(match[0])
        unit = match[2]
        size_gb = standardize_units(size, unit, 'GB')
        value = value.replace(''.join(match), f"{size_gb:.1f}GB")

    # Remove unnecessary words
    value = re.sub(r'\b(UP TO|OPTIONAL|ONBOARD|SUPPORTS|STORAGE)\b', '', value)

    # Standardize eMMC format
    value = re.sub(r'EMMC\s+(\d+\.?\d*)GB', r'\1GB EMMC', value)
    value = re.sub(r'(\d+\.?\d*)GB\s+EMMC', r'\1GB EMMC', value)

    # Remove extra spaces
    value = " ".join(value.split())

    return value


**Apply Cleaning Function and Display Statistics**:


In [36]:
df = print_before_after_stats(df, 'onboard_storage', clean_onboard_storage)



=== Cleaning 'onboard_storage' Column ===
Original unique values: 508
Cleaned unique values:  451
Reduction:              57
Reduction percentage:   11.22%

Top 50 most frequent values after cleaning:
onboard_storage
64.0GB EMMC                                                   65
MSATA                                                         49
16 GB EMMC                                                    36
EMMC                                                          22
64 GB EMMC                                                    21
NVME SSD                                                      20
16.0GB EMMC                                                   20
8 GB EMMC                                                     16
8.0GB EMMC                                                    14
32 GB EMMC                                                    11
32.0GB EMMC                                                   11
COMPACTFLASH TYPE III                                         10
NO

In [37]:
df["onboard_storage"].value_counts().head(50)

onboard_storage
64.0GB EMMC                                                   65
MSATA                                                         49
16 GB EMMC                                                    36
EMMC                                                          22
64 GB EMMC                                                    21
NVME SSD                                                      20
16.0GB EMMC                                                   20
8 GB EMMC                                                     16
8.0GB EMMC                                                    14
32 GB EMMC                                                    11
32.0GB EMMC                                                   11
COMPACTFLASH TYPE III                                         10
NON-VOLATILE USER DATA                                         9
64 GBYTE EMMC                                                  9
8 GB EMMC NAND FLASH                                           8
32.0GB/S 



**Example changes**:

- Before cleaning: "Up to 256GB SSD"
- After cleaning: "256.0GB SSD"
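
The output above lists both "16 GB EMMC" and "16.0GB EMMC", which suggests that sizes written with a space before the unit are not rewritten. One possible way to normalize both spellings is a single `re.sub` with a callback, sketched below. The helper name `normalize_storage_sizes`, the binary conversion factors (1 GB = 1024 MB), and the decision to leave throughput figures such as "32GB/S" untouched are assumptions made for illustration.

```python
import re

# A size followed by a unit (optionally spelled "GBYTE"/"GBYTES"), but not a
# per-second rate such as "32GB/S".
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(TB|GB|MB)(?:YTES?)?\b(?!/S)")
TO_GB = {"MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def normalize_storage_sizes(text):
    """Rewrite every capacity in `text` as '<value>GB' with one decimal place."""
    def to_gb(match):
        size_gb = float(match.group(1)) * TO_GB[match.group(2)]
        return f"{size_gb:.1f}GB"

    return SIZE_RE.sub(to_gb, text.upper())


print(normalize_storage_sizes("16 GB eMMC"))     # 16.0GB EMMC
print(normalize_storage_sizes("16GB eMMC"))      # 16.0GB EMMC
print(normalize_storage_sizes("64 GByte eMMC"))  # 64.0GB EMMC
print(normalize_storage_sizes("32Gb/s M.2"))     # 32GB/S M.2
```

Because the substitution works on the matched span itself, it does not depend on rebuilding the original text from the captured groups.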



---


### 3.11 Input Voltage

**Rationale**:

- Handles special cases like 'ATX', 'AC', 'VDC', etc.
- Standardizes all ATX-related entries to simply 'ATX'.
- Categorizes entries like "WIDE VOLTAGE INPUT RANGE" as 'VARIABLE'.
- Converts all numeric values to floats for consistency.
- Handles ranges that don't include 'V' at the end.
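
To illustrate how the standardized format supports filtering, the sketch below checks whether a given supply voltage falls within a cleaned value. It assumes cleaned values look like "12.0V" or "9.0V-36.0V" and treats labels such as "ATX" or "VARIABLE" as not comparable. `accepts_voltage` is a hypothetical helper name.

```python
import re

VOLTAGE_RE = re.compile(r"(\d+(?:\.\d+)?)V(?:-(\d+(?:\.\d+)?)V)?")


def accepts_voltage(cleaned, supply_volts):
    """Return True/False for numeric values, or None when `cleaned` is a label."""
    if not isinstance(cleaned, str):
        return None
    match = VOLTAGE_RE.fullmatch(cleaned)
    if match is None:
        return None  # e.g. "ATX", "VARIABLE", "DC POWER"
    low = float(match.group(1))
    high = float(match.group(2)) if match.group(2) else low
    return low <= supply_volts <= high


print(accepts_voltage("9.0V-36.0V", 24.0))  # True
print(accepts_voltage("12.0V", 5.0))        # False
print(accepts_voltage("ATX", 12.0))         # None
```

Returning `None` for non-numeric labels keeps them distinguishable from a definite mismatch.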


In [38]:
df["input_voltage"].value_counts().head(50)

input_voltage
12V                                            308
5V                                             144
12V DC                                         103
12 VDC                                          34
5V DC                                           28
24 V DC                                         15
12 V                                            14
9-36 VDC                                        12
5 VDC                                           12
3.3V                                            12
8-60 VDC                                        12
5 V                                             11
12 V DC                                         11
12VDC                                           10
5 V DC                                           8
100-240V AC                                      8
9-36V DC                                         8
3.3 V                                            7
12V DC-in                                        7
+12V/+5V/+3.3V/+5

**Cleaning Function**:


In [39]:
def clean_input_voltage(value):
    if not isinstance(value, str):
        return value

    value = value.upper().strip()
    value = re.sub(r"\s+", " ", value)  # Remove extra spaces

    # Handle special cases
    if value in ["ATX", "AC", "VDC", "DC POWER", "DC POWER INPUT", "USB POWERED"]:
        return value
    if "ATX" in value or "AT POWER" in value:
        return "ATX"
    if "WIDE VOLTAGE INPUT RANGE" in value or "NOMINAL VOLTAGE" in value:
        return "VARIABLE"

    # Standardize range format
    range_match = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*V", value)
    if range_match:
        return f"{float(range_match.group(1))}V-{float(range_match.group(2))}V"

    # Handle single values
    single_match = re.search(r"(\d+(?:\.\d+)?)\s*V(?:OLT)?S?", value)
    if single_match:
        return f"{float(single_match.group(1))}V"

    # Handle ranges without 'V'
    range_match_no_v = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", value)
    if range_match_no_v:
        return f"{float(range_match_no_v.group(1))}V-{float(range_match_no_v.group(2))}V"

    return value



**Apply Cleaning Function and Display Statistics**:


In [40]:
df = print_before_after_stats(df, 'input_voltage', clean_input_voltage)


=== Cleaning 'input_voltage' Column ===
Original unique values: 137
Cleaned unique values:  42
Reduction:              95
Reduction percentage:   69.34%

Top 50 most frequent values after cleaning:
input_voltage
12.0V                    513
5.0V                     239
3.3V                      33
9.0V-36.0V                28
24.0V                     23
ATX                       21
100.0V-240.0V             17
12.0V-24.0V               14
8.0V-60.0V                12
19.0V                      6
8.0V-30.0V                 5
3.0V                       3
VARIABLE                   2
AC                         2
8.5V                       2
12.0V-28.0V                2
36.0V                      2
12.0V-60.0V                1
8.0V                       1
30.0V                      1
DC POWER INPUT             1
7.0V-36.0V                 1
9.0V-15.0V                 1
4.2V-5.5V                  1
DC POWER                   1
7.0V-12.0V                 1
3.0V-3.6V                  1
9.0V

### 3.12 IO Count

**Rationale**:
- USB standardization:
    - "USB 2" and "USB 2.0" are treated as the same version and stored as "USB 2.0".
    - USB 3.0, 3.1, and 3.2 are kept as separate labels ("USB 3.0", "USB 3.1", "USB 3.2"), and a bare "USB 3" is stored as "USB 3.0".
    - Unspecified USB remains as just "USB".
- SATA and PCIe versioning:
    - Version information for SATA and PCIe is captured when available.
    - This differentiates between, for example, SATA II ("SATA 2.0") and SATA III ("SATA 3.0").
- Other interfaces:
    - Other interfaces (such as ETHERNET, SERIAL, DISPLAY) are kept without version information, as it is less commonly specified and less critical for differentiation.
- Handling ambiguity:
    - When a product specifies only "USB" without a version, it is kept as "USB". This preserves the ambiguity in the data and allows for further analysis if needed.
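
The ambiguity rule above can be carried through to filtering. The sketch below is one way to query a product's cleaned I/O list for a specific USB version while keeping unversioned "USB" entries distinguishable. The helper name `usb_support` and its three-way return value are assumptions made for illustration.

```python
def usb_support(io_types, version):
    """Classify a product's cleaned I/O list against a requested USB version.

    Returns "yes" when the exact version is listed, "unknown" when only an
    unversioned "USB" entry is present, and "no" otherwise.
    """
    if f"USB {version}" in io_types:
        return "yes"
    if "USB" in io_types:
        return "unknown"
    return "no"


print(usb_support(["USB 3.0", "ETHERNET"], "3.0"))  # yes
print(usb_support(["USB", "SERIAL"], "3.0"))        # unknown
print(usb_support(["USB 2.0", "PCIE"], "3.0"))      # no
```

A search layer could then choose whether "unknown" products are shown, hidden, or ranked lower.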



In [41]:
df["io_count"].value_counts().head(50)


io_count
['Not Available']                                                                                                                 68
['PCIe, USB, SATA, Ethernet']                                                                                                      5
['Multiple I/Os including PCIe4, USB3.2, and 10GBase-KR']                                                                          4
['2 x DP, 4 x USB, 2 x COM ports, 2 x Ethernet']                                                                                   4
['Multiple USB, PCIe, SATA, Ethernet']                                                                                             4
['PCIe, SATA, USB, Ethernet']                                                                                                      3
['4x USB', '2x LAN', '2x DisplayPort', '1x Audio In/Out', '1x PCIe x4', '1x PCIe x1']                                              2
['8 x USB 2.0, 8-bit GPIO']                                 

**Cleaning Function**:


In [42]:
def standardize_io_type(io_type):
    io_type = io_type.upper().strip()

    # Standardize USB types
    usb_match = re.search(r"\bUSB\s*([\d.]+)?(?:\s*(?:GEN)?\s*([\d.]+))?", io_type)
    if usb_match:
        version = usb_match.group(1) or usb_match.group(2)
        if version:
            # Standardize version numbers
            if version in ["1", "1.0"]:
                return "USB 1.0"
            if version in ["2", "2.0"]:
                return "USB 2.0"
            elif version in ["3", "3.0"]:
                return "USB 3.0"
            elif version in ["4", "4.0"]:
                return "USB 4.0"
            else:
                return f"USB {version}"
        return "USB"

    # Standardize SATA
    sata_match = re.search(r"\bSATA\s*(I{1,3}|[\d.]+)?", io_type)
    if sata_match:
        version = sata_match.group(1)
        if version:
            if version == "III":
                return "SATA 3.0"
            elif version == "II":
                return "SATA 2.0"
            elif version == "I":
                return "SATA 1.0"
            elif version in ["3", "3.0"]:
                return "SATA 3.0"
            elif version in ["4", "4.0"]:
                return "SATA 4.0"
            elif version in ["5", "5.0"]:
                return "SATA 5.0"
            elif version in ["6", "6.0"]:
                return "SATA 6.0"
            else:
                return f"SATA {version}"
        return "SATA"

    # Standardize PCIe
    pcie_match = re.search(r"\bPCIE\s*([\d.]+)?", io_type)
    if pcie_match:
        version = pcie_match.group(1)
        if version:
            return f"PCIE {version}"
        return "PCIE"

    # Standardize other common I/O types
    elif re.search(r"\b(ETHERNET|LAN|RJ45)\b", io_type):
        return "ETHERNET"
    elif re.search(r"\b(RS-?232|RS-?485|RS-?422|UART|COM|SERIAL)\b", io_type):
        return "SERIAL"
    elif re.search(r"\b(HDMI|VGA|DVI|DISPLAYPORT|DP)\b", io_type):
        return "DISPLAY"
    elif re.search(r"\b(GPIO|DIGITAL I/O)\b", io_type):
        return "GPIO"
    elif re.search(r"\b(I2C|SPI|CAN|SDIO)\b", io_type):
        return "OTHER_BUS"
    elif re.search(r"\b(AUDIO|MIC|LINE-IN|LINE-OUT)\b", io_type):
        return "AUDIO"
    else:
        return "OTHER"


def clean_io_count(value):
    if isinstance(value, str):
        try:
            value_list = ast.literal_eval(value)
            if isinstance(value_list, list):
                if value_list == ["Not Available"]:
                    return []

                # Extract I/O types using regex
                io_types = re.findall(r"\b[\w\s.-]+\b", " ".join(value_list))

                # Standardize and filter I/O types
                return list(set(standardize_io_type(io) for io in io_types if io))
        except Exception:
            # If ast.literal_eval fails, try to extract I/O types directly from the string
            io_types = re.findall(r"\b[\w\s.-]+\b", value)
            return list(set(standardize_io_type(io) for io in io_types if io))
    return value if isinstance(value, list) else []

**Apply Cleaning Function and Display Statistics**:


In [43]:
# Apply the cleaning function
df["io_count"] = df["io_count"].apply(clean_io_count)

# Count unique IO configurations, treating each list as a single entity
unique_io_configs = df["io_count"].apply(lambda x: tuple(sorted(x))).nunique()
print(f"Number of unique IO configurations: {unique_io_configs}")

# Count occurrences of specific IO types
io_type_counts = df["io_count"].explode().value_counts()
print("\nTop 20 most common IO types:")
print(io_type_counts.head(20))

# Count number of IO types per product
io_count_per_product = df["io_count"].apply(len)
print(f"\nAverage number of IO types per product: {io_count_per_product.mean():.2f}")
print(f"Median number of IO types per product: {io_count_per_product.median()}")
print(f"Max number of IO types per product: {io_count_per_product.max()}")

Number of unique IO configurations: 280

Top 20 most common IO types:
io_count
USB          414
OTHER        321
SERIAL       233
ETHERNET     206
USB 3.0      186
USB 2.0      185
DISPLAY      143
PCIE         138
SATA         104
GPIO          81
USB 3.2       65
OTHER_BUS     60
AUDIO         50
USB 3.1       43
SATA 3.0      38
USB 1.0       31
SATA 6.0      13
SATA 2.0      11
USB 4.0        7
PCIE 4         5
Name: count, dtype: int64

Average number of IO types per product: 2.39
Median number of IO types per product: 2.0
Max number of IO types per product: 8



### 3.13 Wireless

**Rationale**:
- Standardizes Wi-Fi versions (Wi-Fi 6, Wi-Fi 5, and generic Wi-Fi).
- Standardizes Bluetooth versions (Bluetooth 5+, Bluetooth 4, and generic Bluetooth).
- Standardizes cellular technologies (5G, 4G/LTE, 3G, and generic cellular).
- Identifies other common wireless technologies (GPS, NFC, Zigbee, LoRa).
- Groups less common or unspecified wireless technologies as "OTHER".



In [44]:
df["wireless"].value_counts().head(50)

wireless
Wi-Fi, Bluetooth                                                                   35
['Wi-Fi', 'Bluetooth']                                                             30
Wi-Fi                                                                              15
WiFi, Bluetooth                                                                    13
WiFi                                                                               12
802.11 a/b/g/n/ac+BT5.0                                                             9
5G connectivity                                                                     7
Wi-Fi, Bluetooth, Cellular                                                          6
M.2 E key for wireless                                                              6
Wi-Fi, Bluetooth, GSM                                                               6
Supports WiFi modems                                                                6
Wi-Fi/Bluetooth, 3g and GPS                  

**Cleaning Function**:


In [45]:
def standardize_wireless(item):
    item = item.upper().strip()

    # Standardize Wi-Fi
    if re.search(r"\b(WI-?FI|WLAN|802\.11)\b", item):
        if re.search(r"\b(6E?|AX)\b", item):
            return "WI-FI 6"
        elif re.search(r"\b(AC|5)\b", item):
            return "WI-FI 5"
        else:
            return "WI-FI"

    # Standardize Bluetooth
    elif re.search(r"\bBLUETOOTH\b", item):
        if re.search(r"\b(5\.\d|LE)\b", item):
            return "BLUETOOTH 5+"
        elif re.search(r"\b4\.\d\b", item):
            return "BLUETOOTH 4"
        else:
            return "BLUETOOTH"

    # Standardize Cellular
    elif re.search(r"\b(5G|LTE|4G|3G|CELLULAR|WWAN)\b", item):
        if "5G" in item:
            return "5G"
        elif "LTE" in item or "4G" in item:
            return "4G/LTE"
        elif "3G" in item:
            return "3G"
        else:
            return "CELLULAR"

    # Other common wireless technologies
    elif "GPS" in item:
        return "GPS"
    elif "NFC" in item:
        return "NFC"
    elif "ZIGBEE" in item:
        return "ZIGBEE"
    elif "LORA" in item:
        return "LORA"

    # For items that don't match any specific category
    else:
        return "OTHER"


def clean_wireless(value):
    if isinstance(value, str):
        try:
            value_list = ast.literal_eval(value)
            if isinstance(value_list, list):
                return list(set(standardize_wireless(item) for item in value_list if item.strip()))
        except Exception:  # Not a parseable list literal: fall back to splitting on commas
            return list(set(standardize_wireless(item) for item in value.split(",") if item.strip()))
    return value if isinstance(value, list) else []
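
The multi-value columns (`wireless`, `operating_system_bsp`, `certifications`) all follow the same parse-then-standardize pattern. A minimal sketch of how that pattern could be factored into one helper is shown below. `parse_multi_value` and its parameters are hypothetical names, not part of the notebook. Unlike the functions here, it returns a sorted list (an assumption made for reproducible ordering) and also standardizes values that are already lists.

```python
import ast
import re


def parse_multi_value(value, standardize, sep=r","):
    """Parse a list-like string and standardize each non-empty item.

    Hypothetical helper: `standardize` is any function mapping one raw
    string to one label; `sep` is a regex used when the value is plain text.
    """
    if isinstance(value, list):
        items = value
    elif isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            parsed = None
        # Use the parsed list if there is one, otherwise split the raw text
        items = parsed if isinstance(parsed, list) else re.split(sep, value)
    else:
        return []  # NaN, None, numbers, ...
    labels = {standardize(str(item)) for item in items if str(item).strip()}
    return sorted(labels)


# Example with a toy standardizer (upper-case and trim)
print(parse_multi_value("['Wi-Fi', 'Bluetooth']", lambda s: s.upper().strip()))
print(parse_multi_value("WiFi, Bluetooth", lambda s: s.upper().strip()))
```

Passing `sep=r"[,;]"` would cover the certifications column, which also splits on semicolons.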



**Apply Cleaning Function and Display Statistics**:


In [46]:

# Apply the cleaning function
df["wireless"] = df["wireless"].apply(clean_wireless)

# Display the updated value counts
print(df["wireless"].value_counts().head(20))

# Count unique wireless configurations
unique_wireless_configs = df["wireless"].apply(lambda x: tuple(sorted(x))).nunique()
print(f"\nNumber of unique wireless configurations: {unique_wireless_configs}")

# Count number of wireless types per product
wireless_count_per_product = df["wireless"].apply(len)
print(f"\nAverage number of wireless types per product: {wireless_count_per_product.mean():.2f}")
print(f"Median number of wireless types per product: {wireless_count_per_product.median()}")
print(f"Max number of wireless types per product: {wireless_count_per_product.max()}")

wireless
[]                                320
[WI-FI]                           189
[WI-FI, BLUETOOTH]                111
[OTHER]                            84
[WI-FI 5]                          27
[WI-FI 6]                          26
[CELLULAR, WI-FI, BLUETOOTH]       22
[WI-FI 5, BLUETOOTH]               15
[5G]                               14
[WI-FI, BLUETOOTH, 4G/LTE]         14
[OTHER, WI-FI, BLUETOOTH]          12
[BLUETOOTH 5+, WI-FI]               9
[BLUETOOTH 4, WI-FI 5]              8
[BLUETOOTH 5+, WI-FI 5]             7
[BLUETOOTH 4, WI-FI]                7
[OTHER, WI-FI]                      7
[WI-FI, 3G]                         6
[BLUETOOTH 5+, OTHER, WI-FI 5]      6
[4G/LTE]                            6
[GPS, WI-FI, BLUETOOTH]             5
Name: count, dtype: int64

Number of unique wireless configurations: 79

Average number of wireless types per product: 1.12
Median number of wireless types per product: 1.0
Max number of wireless types per product: 5
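
Because each cleaned `wireless` value is a list of standardized labels, filters can test list membership instead of matching free text. A minimal sketch follows, using a small hypothetical DataFrame in place of `df`; the product names and the `has_label_prefix` helper are illustrative assumptions.

```python
import pandas as pd

# Hypothetical rows in the cleaned format (lists of standardized labels)
products = pd.DataFrame(
    {
        "name": ["BOARD-A", "BOARD-B", "BOARD-C"],
        "wireless": [["WI-FI 6", "BLUETOOTH 5+"], ["5G"], []],
    }
)


def has_label_prefix(labels, prefix):
    """True if any label in the list starts with the given prefix."""
    return any(label.startswith(prefix) for label in labels)


# Products with any Wi-Fi generation (WI-FI, WI-FI 5, WI-FI 6)
wifi_mask = products["wireless"].apply(lambda labels: has_label_prefix(labels, "WI-FI"))
print(products.loc[wifi_mask, "name"].tolist())

# Products offering both Wi-Fi and Bluetooth, whatever the version
bt_mask = products["wireless"].apply(lambda labels: has_label_prefix(labels, "BLUETOOTH"))
both_mask = wifi_mask & bt_mask
print(products.loc[both_mask, "name"].tolist())
```

Prefix matching is used so that a query for "Wi-Fi" also returns products labeled `WI-FI 5` or `WI-FI 6`.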


### 3.14 Operating System BSP



**Rationale**:

- Standardize operating system names for accurate filtering.
- Group similar OS variants.


In [47]:
df["operating_system_bsp"].value_counts().head(50)

operating_system_bsp
Linux                                                                                                                   32
['Linux', 'Android']                                                                                                    30
['Windows', 'Linux']                                                                                                    29
['Linux']                                                                                                               16
Windows                                                                                                                 15
['Windows', 'Linux', 'VxWorks']                                                                                         14
['Windows', 'Windows Embedded', 'Linux', 'VxWorks', 'QNX']                                                              12
['Windows 10', 'Linux']                                                                                               


**Cleaning Function**:


In [48]:
def standardize_os(item):
    item = item.upper().strip()

    # Standardize Windows
    if "WINDOWS" in item:
        if "IOT" in item:
            return "WINDOWS IOT"
        elif "EMBEDDED" in item:
            return "WINDOWS EMBEDDED"
        elif "10" in item:
            return "WINDOWS 10"
        elif "11" in item:
            return "WINDOWS 11"
        elif "SERVER" in item:
            return "WINDOWS SERVER"
        else:
            return "WINDOWS"

    # Standardize Linux
    elif "LINUX" in item:
        if "UBUNTU" in item:
            return "UBUNTU"
        elif "YOCTO" in item:
            return "YOCTO LINUX"
        elif "DEBIAN" in item:
            return "DEBIAN"
        elif "FEDORA" in item:
            return "FEDORA"
        elif "CENTOS" in item:
            return "CENTOS"
        elif "REDHAT" in item or "RED HAT" in item:
            return "RED HAT ENTERPRISE LINUX"
        else:
            return "LINUX"

    # Other common operating systems
    elif "ANDROID" in item:
        return "ANDROID"
    elif "VXWORKS" in item:
        return "VXWORKS"
    elif "QNX" in item:
        return "QNX"
    elif "RASPBIAN" in item or "RASPBERRY PI OS" in item:
        return "RASPBERRY PI OS"

    # For items that don't match any specific category
    else:
        return "OTHER"


def clean_operating_system_bsp(value):
    if isinstance(value, str):
        try:
            value_list = ast.literal_eval(value)
            if isinstance(value_list, list):
                return list(set(standardize_os(item) for item in value_list if item.strip()))
        except Exception:  # Not a parseable list literal: fall back to splitting on commas
            return list(set(standardize_os(item) for item in value.split(",") if item.strip()))
    return value if isinstance(value, list) else []
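
`standardize_os` checks for distribution names only when the string also contains `LINUX`, so a value such as `Ubuntu 20.04` falls through to `OTHER`. One possible alternative, sketched here as an assumption rather than the notebook's method, is an ordered rule table in which the first matching rule wins. `OS_RULES` and `match_os` are hypothetical names and the table is deliberately partial.

```python
# Ordered (required keywords, label) pairs: more specific rules come first
OS_RULES = [
    (("WINDOWS", "IOT"), "WINDOWS IOT"),
    (("WINDOWS",), "WINDOWS"),
    (("UBUNTU",), "UBUNTU"),
    (("DEBIAN",), "DEBIAN"),
    (("YOCTO",), "YOCTO LINUX"),
    (("LINUX",), "LINUX"),
    (("ANDROID",), "ANDROID"),
]


def match_os(item, rules=OS_RULES):
    """Return the label of the first rule whose keywords all appear, else 'OTHER'."""
    text = item.upper().strip()
    for keywords, label in rules:
        if all(keyword in text for keyword in keywords):
            return label
    return "OTHER"


for raw in ["Ubuntu 20.04", "Yocto Linux", "Windows 10 IoT", "FreeRTOS"]:
    print(raw, "->", match_os(raw))
```

Keeping the rules in a list makes the precedence explicit and lets new operating systems be added without touching the matching logic.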




**Apply Cleaning Function and Display Statistics**:


In [49]:
# Apply the cleaning function
df["operating_system_bsp"] = df["operating_system_bsp"].apply(clean_operating_system_bsp)

# Display the updated value counts
print(df["operating_system_bsp"].value_counts().head(20))

# Count unique OS configurations
unique_os_configs = df["operating_system_bsp"].apply(lambda x: tuple(sorted(x))).nunique()
print(f"\nNumber of unique OS configurations: {unique_os_configs}")

# Count number of OS types per product
os_count_per_product = df["operating_system_bsp"].apply(len)
print(f"\nAverage number of OS types per product: {os_count_per_product.mean():.2f}")
print(f"Median number of OS types per product: {os_count_per_product.median()}")
print(f"Max number of OS types per product: {os_count_per_product.max()}")

operating_system_bsp
[]                                                  109
[OTHER]                                              92
[LINUX]                                              67
[LINUX, WINDOWS]                                     61
[LINUX, ANDROID]                                     42
[WINDOWS]                                            28
[OTHER, WINDOWS 10]                                  25
[WINDOWS 10]                                         23
[RASPBERRY PI OS]                                    22
[WINDOWS EMBEDDED]                                   21
[LINUX, OTHER]                                       19
[LINUX, WINDOWS 10]                                  18
[LINUX, WINDOWS IOT]                                 18
[LINUX, WINDOWS EMBEDDED]                            18
[LINUX, VXWORKS, WINDOWS]                            18
[WINDOWS IOT]                                        17
[QNX, WINDOWS, LINUX, VXWORKS, WINDOWS EMBEDDED]     14
[LINUX, OTHER, WINDOWS]    

### 3.15 Operating Temperature Max

**Rationale**:
- Raw values mix several notations for the same temperature (`60°C`, `60 C`, `60C`, `+60°C`, `85°C (185°F)`).
- Normalizing to a single `N°C` form makes values directly comparable.

In [50]:
df["operating_temperature_max"].value_counts()

operating_temperature_max
60°C                  451
85°C                  279
70°C                   38
60 C                   35
85 C                   27
85C                    14
40°C                   13
50°C                   13
+85°C                   7
60° C                   7
70 C                    7
75°C                    6
60C                     6
85 °C                   5
50 C                    4
70C                     4
35°C                    3
60 °C                   3
+85C                    3
55 C                    3
125°C                   2
85℃                     2
100°C                   2
+60°C                   2
125 °C                  2
+85° C                  2
105°C                   2
40 °C                   2
85°C (185°F)            1
50 degrees Celsius      1
85 oC                   1
+85 C                   1
70° C                   1
95°C                    1
158°F (70°C)            1
70 °C                   1
85 degrees Celsius      1
+70 °C      

**Cleaning Function**:


In [51]:
def clean_temperature(value):
    if isinstance(value, str):
        value = value.upper()
        value = re.sub(r"\s+", "", value)  # Remove all spaces
        # Standardize temperature format
        value = re.sub(r"(\d+(?:\.\d+)?)°?C", r"\1°C", value)
        value = re.sub(r"(\d+(?:\.\d+)?)°?F", r"\1°F", value)
        # Handle 'DEGREESCELSIUS' format
        value = re.sub(r"(\d+(?:\.\d+)?)DEGREESCELSIUS", r"\1°C", value)
        # Handle Fahrenheit to Celsius conversion
        f_to_c_match = re.search(r"(\d+(?:\.\d+)?)°F\((\d+(?:\.\d+)?)°C\)", value)
        if f_to_c_match:
            return f"{f_to_c_match.group(2)}°C"
        # Handle Celsius to Fahrenheit (remove Fahrenheit)
        c_to_f_match = re.search(r"(\d+(?:\.\d+)?)°C\(\d+(?:\.\d+)?°F\)", value)
        if c_to_f_match:
            return f"{c_to_f_match.group(1)}°C"
        # Handle ranges
        range_match = re.search(r"(-?\d+(?:\.\d+)?)°?C\s*(?:TO|-)\s*(-?\d+(?:\.\d+)?)°?C", value)
        if range_match:
            return f"{range_match.group(1)}°C-{range_match.group(2)}°C"
        # Handle single Celsius values
        celsius_match = re.search(r"(-?\d+(?:\.\d+)?)°?C", value)
        if celsius_match:
            return f"{celsius_match.group(1)}°C"
    return value




**Apply Cleaning Function and Display Statistics**:


In [52]:


original_unique = df["operating_temperature_max"].nunique()
df["operating_temperature_max"] = df["operating_temperature_max"].apply(clean_temperature)
cleaned_unique = df["operating_temperature_max"].nunique()

# Compare the number of unique values before and after cleaning
print(f"\nOriginal unique operating temperature max: {original_unique}")
print(f"Cleaned unique operating temperature max: {cleaned_unique}")
print(f"Reduction: {original_unique - cleaned_unique} ({((original_unique - cleaned_unique) / original_unique) * 100:.2f}%)")

df["operating_temperature_max"].value_counts()


Original unique operating temperature max: 48
Cleaned unique operating temperature max: 21
Reduction: 27 (56.25%)


operating_temperature_max
60°C     504
85°C     340
70°C      54
50°C      18
40°C      15
75°C       6
125°C      4
55°C       4
105°C      3
35°C       3
100°C      2
85℃        2
83°C       1
+80℃       1
65°C       1
90°C       1
45°C       1
80°C       1
85OC       1
95°C       1
70℃        1
Name: count, dtype: int64

### 3.16 Operating Temperature Min

The minimum operating temperature reuses the `clean_temperature` function defined in Section 3.15.


**Apply Cleaning Function and Display Statistics**:


In [53]:
# Apply the cleaning function
original_unique = df["operating_temperature_min"].nunique()
df["operating_temperature_min"] = df["operating_temperature_min"].apply(clean_temperature)
cleaned_unique = df["operating_temperature_min"].nunique()

# Compare the number of unique values before and after cleaning
print(f"\nOriginal unique operating temperature min: {original_unique}")
print(f"Cleaned unique operating temperature min: {cleaned_unique}")
print(
    f"Reduction: {original_unique - cleaned_unique} ({((original_unique - cleaned_unique) / original_unique) * 100:.2f}%)"
)

df["operating_temperature_min"].value_counts()


Original unique operating temperature min: 27
Cleaned unique operating temperature min: 10
Reduction: 17 (62.96%)


operating_temperature_min
-40°C    384
0°C      373
-20°C    161
-25°C     17
-30°C     10
-10°C      8
-40℃       3
5°C        2
-40OC      1
-20℃       1
Name: count, dtype: int64
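
The cleaned outputs for both temperature columns still contain a few unmerged variants (`85℃`, `+80℃`, `85OC`, `-40℃`, `-40OC`), because `clean_temperature` recognizes only the two-character `°C` form. A possible follow-up pass is sketched below, assuming every remaining value is a single Celsius reading. It maps these variants to the same `N°C` string and also extracts a numeric value for range filters. `normalize_celsius` and `celsius_to_float` are hypothetical names.

```python
import re

# Optional sign, a number, then "°C", "℃" (U+2103), "OC"/"oC", or a bare "C"
_CELSIUS = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*(?:°\s*C|℃|OC|C)", re.IGNORECASE)


def celsius_to_float(value):
    """Return the Celsius reading as a float, or None if nothing matches."""
    if not isinstance(value, str):
        return None
    match = _CELSIUS.search(value)
    return float(match.group(1)) if match else None


def normalize_celsius(value):
    """Rewrite residual variants as 'N°C'; leave unmatched values unchanged."""
    number = celsius_to_float(value)
    if number is None:
        return value
    return f"{number:g}°C"


for raw in ["85℃", "+80℃", "85OC", "-40OC", "60°C"]:
    print(raw, "->", normalize_celsius(raw), celsius_to_float(raw))
```

Storing the float alongside the display string would let a query such as "operates at 70 °C or above" be answered with a numeric comparison.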


### 3.17 Certifications

**Rationale**:
- Standardizes common certifications (CE, FCC, ROHS, UL, CCC, BSMI, CB, MIL-STD-810, EN 50155, IEC 60068, REACH, WEEE, ISO, VCCI, KC, TELEC, IC, RCM).
- Distinguishes between FCC Class A and FCC Class B.
- Groups less common or unspecified certifications as "OTHER".

**Cleaning Function**:


In [54]:
def standardize_certification(cert):
    cert = cert.upper().strip()

    # Common certifications
    if re.search(r"\bCE\b", cert):
        return "CE"
    elif re.search(r"\bFCC\b", cert):
        if "CLASS A" in cert:
            return "FCC CLASS A"
        elif "CLASS B" in cert:
            return "FCC CLASS B"
        else:
            return "FCC"
    elif re.search(r"\bROHS\b", cert):
        return "ROHS"
    elif re.search(r"\bUL\b", cert):
        return "UL"
    elif re.search(r"\bCCC\b", cert):
        return "CCC"
    elif re.search(r"\bBSMI\b", cert):
        return "BSMI"
    elif re.search(r"\bCB\b", cert):
        return "CB"
    elif re.search(r"\bMIL-STD-810[GH]?\b", cert):
        return "MIL-STD-810"
    elif re.search(r"\bEN\s*50155\b", cert):
        return "EN 50155"
    elif re.search(r"\bIEC\s*60068\b", cert):
        return "IEC 60068"
    elif re.search(r"\bREACH\b", cert):
        return "REACH"
    elif re.search(r"\bWEEE\b", cert):
        return "WEEE"
    elif re.search(r"\bISO\s*\d+", cert):
        return "ISO"
    elif re.search(r"\bVCCI\b", cert):
        return "VCCI"
    elif re.search(r"\bKC\b", cert):
        return "KC"
    elif re.search(r"\bTELEC\b", cert):
        return "TELEC"
    elif re.search(r"\bIC\b", cert):
        return "IC"
    elif re.search(r"\bRCM\b", cert):
        return "RCM"

    # For items that don't match any specific category
    return "OTHER"


def clean_certifications(value):
    if isinstance(value, str):
        try:
            value_list = ast.literal_eval(value)
            if isinstance(value_list, list):
                return list(set(standardize_certification(item) for item in value_list if item.strip()))
        except Exception:  # Not a parseable list literal: fall back to splitting on commas and semicolons
            return list(set(standardize_certification(item) for item in re.split(r"[,;]", value) if item.strip()))
    return value if isinstance(value, list) else []



**Apply Cleaning Function and Display Statistics**:


In [55]:


# Apply the cleaning function
df["certifications"] = df["certifications"].apply(clean_certifications)

# Display the updated value counts
print(df["certifications"].value_counts().head(20))

# Count unique certification configurations
unique_cert_configs = df["certifications"].apply(lambda x: tuple(sorted(x))).nunique()
print(f"\nNumber of unique certification configurations: {unique_cert_configs}")

# Count number of certifications per product
cert_count_per_product = df["certifications"].apply(len)
print(f"\nAverage number of certifications per product: {cert_count_per_product.mean():.2f}")
print(f"Median number of certifications per product: {cert_count_per_product.median()}")
print(f"Max number of certifications per product: {cert_count_per_product.max()}")

certifications
[]                          221
[OTHER]                     108
[ROHS]                       89
[CE, FCC]                    81
[CE]                         56
[CE, FCC CLASS B]            37
[CE, FCC CLASS B, ROHS]      29
[CE, FCC, ROHS]              16
[MIL-STD-810]                15
[CE, FCC, CCC, UL, BSMI]     14
[MIL-STD-810, ROHS]          14
[FCC]                        14
[CE, FCC CLASS A]            13
[IEC 60068, OTHER]           12
[FCC CLASS B]                12
[REACH, ROHS]                12
[OTHER, ROHS]                11
[ISO]                        11
[OTHER, EN 50155]             9
[FCC CLASS A]                 9
Name: count, dtype: int64

Number of unique certification configurations: 134

Average number of certifications per product: 1.73
Median number of certifications per product: 1.0
Max number of certifications per product: 8



---


## 4. Full Cleaned Data Overview

In [56]:
columns = [
    "name",
    "manufacturer",
    "form_factor",
    "evaluation_or_commercialization",
    "processor_architecture",
    "processor_core_count",
    "processor_manufacturer",
    "processor_tdp",
    "memory",
    "onboard_storage",
    "input_voltage",
    "io_count",
    "wireless",
    "operating_system_bsp",
    "operating_temperature_max",
    "operating_temperature_min",
    "certifications",
]

In [57]:
# Display up to 50 of the most frequent values for each column
for column in columns:
    if df[column].dtype == "object":
        # Handle columns with list values
        if df[column].apply(lambda x: isinstance(x, list)).any():
            # Convert lists to tuples for hashing
            value_counts = df[column].apply(lambda x: tuple(x) if isinstance(x, list) else x).value_counts().head(50)
        else:
            value_counts = df[column].value_counts().head(50)
    else:
        value_counts = df[column].value_counts().head(50)

    # Note: value_counts was truncated by head(50), so this count is capped at 50
    unique_count = len(value_counts)

    print(f"\n{'-' * 100}")
    print(f"{column.upper()} (Total unique values: {unique_count})")
    print(f"{'-' * 100}")

    if not value_counts.empty:
        max_value_length = max(len(str(value)) for value in value_counts.index)
        max_count_length = max(len(str(count)) for count in value_counts.values)

        for value, count in value_counts.items():
            percentage = (count / len(df)) * 100
            print(f"{str(value):<{max_value_length}} | {count:>{max_count_length}} | {percentage:.2f}%")
    else:
        print("No data available for this column")

    print(f"{'-' * 100}")


----------------------------------------------------------------------------------------------------
NAME (Total unique values: 50)
----------------------------------------------------------------------------------------------------
AIMB                                             | 47 | 4.80%
SOM                                              | 20 | 2.04%
ASMB                                             | 12 | 1.23%
PCM                                              | 12 | 1.23%
ARK                                              | 11 | 1.12%
MIO                                              | 10 | 1.02%
RSB                                              |  9 | 0.92%
RASPBERRY-PI MODEL B                             |  7 | 0.72%
VENICE GW                                        |  7 | 0.72%
AIMB-275                                         |  7 | 0.72%
EDHMIC                                           |  6 | 0.61%
COM-EXPRESS COMPACT                              |  6 | 0.61%
COM-EXPRESS BASIC     

In [62]:
# Display up to 100 of the most frequent values for each column
for column in columns:
    print(f"\n{column.upper()}: ", end="")

    if df[column].dtype == "object" and df[column].apply(lambda x: isinstance(x, list)).any():
        # For columns with list values, count each list element separately
        value_counts = df[column].explode().value_counts()
    else:
        # For non-list columns
        value_counts = df[column].value_counts()

    top_values = value_counts.head(100)
    print(", ".join(map(str, top_values.index)))


NAME: AIMB, SOM, ASMB, PCM, ARK, MIO, RSB, RASPBERRY-PI MODEL B, VENICE GW, AIMB-275, EDHMIC, COM-EXPRESS COMPACT, COM-EXPRESS BASIC, TREK, MIC, EMETXEI, PCA, UNOG, AIMB-580, CONGATC, COM MODULES, PCEGA, ARKL, CONGATS, EMETXEIM, VENICE GW SINGLE BOARD, PCEGAE, COM-EXPRESS MINI, SOM COM-EXPRESS COMPACT, AGS GPU SERVER, SOM INTEL ATOMCELERON PROCESSOR COM-EXPRESS MINI, CONGASA, MIO EXTENSION SBC, COMPUTE, ARKDS, RASPBERRY-PI COMPUTE, CONGATCA, RASPBERRY-PI, ROCK PI N, UBIQUITOUS TOUCH, IBASE IBAF, AIMB-225, MIOJUAE, ODYSSEY XJ, ROCK PI S, ROM Q7, AIMB KIOSK, EMNANOAM, CONGAMA, KINODH, RKCONS, ROM NXP ARM CORTEXA IMX Q7, SNMPB, TOYBRICK RKPRO AI, MIONSAE, EDIPC, MIOCPQA, AMIAF FANLESS BOX, PCM-NS-AE, CORAL DEV, CONGAQMX, SBCTGL, IQBT, COMECTL, ARKFSAE, AIMB-273, BL PPC W, WAFERULT, TX COM, BL BPC EW, MANO, NEX MICROATX, CONGAQA, SMARC SHORT SIZE, CONGAPA, KINOTGLU, UBC, TXCOM, AIMB-212, QSXM, AIMB-GS-AE, TOMCAT, UNO, IMX-M MINI SOM, DIGI CONNECTCORE MP, CONGA-IC, ESRPCMSUV, PCE-5120, QBI


## 5. Saving Cleaned Data


In [59]:
# Save cleaned data
cleaned_data_file = "../data/cleaned_data.csv"
df.to_csv(cleaned_data_file, index=False)
print(f"Cleaned data saved to {cleaned_data_file}")


Cleaned data saved to ../data/cleaned_data.csv
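
`to_csv` writes list-valued cells (for example in `wireless` or `certifications`) as their string representation, so a consumer reading the file back gets strings rather than lists. A minimal sketch of restoring them on load follows. The in-memory buffer and sample frame are stand-ins for `../data/cleaned_data.csv`, and `LIST_COLUMNS` is an assumed name for the multi-value columns.

```python
import ast
import io

import pandas as pd

LIST_COLUMNS = ["wireless", "certifications"]  # assumed multi-value columns

# Stand-in for the cleaned DataFrame
sample = pd.DataFrame(
    {
        "name": ["BOARD-A", "BOARD-B"],
        "wireless": [["WI-FI", "BLUETOOTH"], []],
        "certifications": [["CE", "FCC"], ["ROHS"]],
    }
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# Parse the stringified lists back into Python lists while reading
restored = pd.read_csv(buffer, converters={col: ast.literal_eval for col in LIST_COLUMNS})

print(type(restored.loc[0, "wireless"]), restored.loc[0, "wireless"])
```

A format with native list support, such as JSON or Parquet, would avoid the round trip through strings, at the cost of a less portable file.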



---



## 6. Conclusion

In this notebook, we have:

- Created utility functions for common operations to build a modular data cleaning pipeline.

- Implemented consistent cleaning functions for each column, preserving the nuances of each data type.

- Provided clear explanations and rationales for each cleaning step.

- Standardized data formats, units, and handling of multi-value fields to optimize the data for searching and filtering operations.

- Aimed to avoid losing relevant data during cleaning and handled special cases appropriately; a few residual variants (for example `85℃` and `85OC` in the temperature columns) still remain unmerged.

By standardizing and cleaning the data, we have prepared it for efficient use in the BoardBot system, enabling accurate semantic search and filtering capabilities.

---