<div style="background-color: #f0f7ff; border-left: 6px solid #2980b9; padding: 24px; margin-bottom: 32px; border-radius: 0 12px 12px 0; box-shadow: 0 2px 8px rgba(41,128,185,0.08);">

<h1 style="color:#2980b9; font-size:2.3em; margin-bottom:0.2em;">Data Quality as a Driver of Sales Growth</h1>
<p style="font-size:1.15em; color:#34495e; font-weight:500;">
Building a Reliable Foundation: Python Data Pipeline for Product Catalog Analytics and Business Insights
</p>
</div>

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Objective</span>

This notebook implements a Python-based data pipeline for **cleaning, standardizing, integrating, and exploring product catalog data** in the industrial supply sector.  
The main objectives are:
- **Transform raw, inconsistent data into a reliable, analysis-ready dataset**
- **Enforce the presence and correctness of all join keys and essential fields**
- **Provide a robust foundation for further analytics, dashboarding, and business intelligence**

</div>

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Background</span>

The data originates from legacy exports and includes product descriptions, technical properties, and manufacturer information.  
It contains various inconsistencies, missing values, and formatting issues that must be resolved before meaningful analysis.

**Key cleaning and preparation steps include:**
- Normalizing invalid values (e.g., `N/A`, `None`, empty strings, and zeros in physical dimensions) to `NaN`
- Removing columns with 100% missing values
- Enforcing the presence of join keys (`Articlenumber`, `Manufacturernumber`) and removing records with missing keys
- Validating uniqueness of product-manufacturer pairs to ensure reliable joins

**All subsequent analysis, feature engineering, and visualization are based on this cleaned, integrated dataset.**

</div>

<div style="background-color: #fff4e6; border-left: 4px solid #e67e22; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#e67e22;">Dataset Characteristics</span>

- **Source:** Raw CSV files (`product_descriptions.csv`, `product_properties.csv`, `manufacturers.csv`)
- **Content:** Product identifiers, multilingual descriptions, technical properties, manufacturer data, engineered features (e.g., product volume, length category), and more
- **Quality:** Incomplete, inconsistent, and duplicate records addressed; all join keys enforced; outliers reviewed and retained if legitimate

<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 10px; margin-top: 10px;">

<div style="background-color: #f5f5f5; padding: 10px; border-radius: 6px;">
<h3 style="color: #2980b9;">Python Data Pipeline Steps</h3>

1. **Data Loading & Exploration:** Import and inspect raw tables  
2. **Standardization & Null Handling:** Normalize missing/invalid values and harmonize schema  
3. **Duplicate & Outlier Review:** Identify and remove duplicates; review outliers for legitimacy  
4. **Data Type Conversion:** Ensure columns have appropriate data types  
5. **Data Integration:** Merge tables using required join keys  
6. **Feature Engineering:** Create new fields for enhanced analysis 
7. **Descriptive Analysis & Visualization:** Generate statistics, distributions, and correlations  
8. **Final Validation:** Confirm data quality and export cleaned, enriched dataset  
</div>

<div style="background-color: #f0f7ff; padding: 10px; border-radius: 6px;">
<h3 style="color: #27ae60;">Pipeline Scope</h3>

- **Data integrity:** All join keys present and valid  
- **Schema standardization:** Consistent column names and types  
- **Feature enrichment:** Engineered features for deeper insights  
- **Ready for analysis:** Output is suitable for SQL queries, dashboarding, and reporting  
</div>

<div style="background-color: #fff0f0; padding: 10px; border-radius: 6px;">
<h3 style="color: #e67e22;">Output</h3>

- **Cleaned, analysis-ready product catalog table**  
- **Engineered features and documented data quality checks**  
- **Summary tables, visualizations, and insights**  
- **Exported CSV for reproducibility and further use**  
</div>
</div>
</div>

<div style="background-color: #fff4e6; border-left: 4px solid #e74c3c; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#e74c3c;">Expected Deliverables</span>

1. **Python scripts/notebook:** Fully documented code for all cleaning, integration, and analysis steps  
2. **Summary tables and visualizations:** Before/after statistics, engineered features, and key insights  
3. **Exported cleaned and enriched dataset:** For use in further analysis and dashboarding  
4. **Clear markdown cells:** Explaining methodology, rationale, and findings throughout the workflow  

</div>

<div style="background-color: #f5f5f5; padding: 15px; text-align: center; border-left: 4px solid #9b59b6; border-radius: 0 8px 8px 0;">
<p style="font-weight: bold; color: #2c3e50;">"A robust Python pipeline that transforms raw product catalog data into a foundation for actionable business insights and effective analytics."</p>
</div>

In [1]:
import sys
import os

srcPath = os.path.abspath("../src")
if srcPath not in sys.path:
    sys.path.append(srcPath)

import logging

logging.basicConfig(
    level=logging.INFO,  # or DEBUG, WARNING, etc.
    format='[%(levelname)s] %(message)s'
)
logger = logging.getLogger(__name__)

In [None]:
from config import manufacturersFile, productDescriptionsFile, productPropertiesFile, outputCleanedFile, badValues
from ingest_datasets import loadCsv
from clean import cleanBadValues, dropRowsMissingKeys, dropFullyNullColumns
from merge import mergeCatalogTables
from engineer import engineerFeatures
from analyze import describeDf, printDatasetInfo, getMissingValueReport, getDuplicateReport, describeNumerics
from visualize import plotCorrelationHeatmap, plotFrequencyGridSmart, plotBoxplotGrid

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Manufacturer Dataset</span>

This section focuses on loading and exploring the manufacturer data, which provides essential reference information (IDs and names) for all manufacturers in the product catalog.  
I will inspect its structure, check for missing or duplicate entries, and prepare it for integration with product records.

</div>

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Data Loading and Initial Exploration
</div>

In [None]:
manufacturersDf = loadCsv(manufacturersFile)
manufacturersDf

In [None]:
printDatasetInfo(manufacturersDf)

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Data Cleaning and Preprocessing
</div>

In [None]:
cleanManufacturersDf = manufacturersDf.copy()

In [None]:
# Clean the working copy so the raw frame stays available for comparison
cleanManufacturersDf = cleanBadValues(cleanManufacturersDf, badValues)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

To ensure a high level of data quality, I created a flexible cleaning function that addresses common issues in both string and numeric columns:

- **String Columns:**  
  The function strips whitespace and replaces any values from a predefined list of "bad values" (such as `"None"`, `"N/A"`, empty strings, etc.) with `NaN`.

- **Numeric Columns:**  
  The function also checks for zero values in all numeric columns and replaces them with `NaN`. This is important for cases where a value of zero is not realistic (such as product dimensions), and likely indicates missing or invalid data.

This approach ensures that both obvious and subtle data quality issues are addressed before analysis or visualization, resulting in a cleaner and more reliable dataset.

</div>
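
The two rules above can be sketched as follows. This is a minimal illustration, not the actual code in `clean.py`: the function name `clean_bad_values_sketch`, the sample rows, and the placeholder list passed in are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def clean_bad_values_sketch(df: pd.DataFrame, bad_values: list[str]) -> pd.DataFrame:
    """Return a cleaned copy of df; the input frame is not modified."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_bool_dtype(series):
            continue  # False is a valid value, not a missing measurement
        if pd.api.types.is_numeric_dtype(series):
            # Zero is treated as "not measured" and becomes NaN.
            out[col] = series.mask(series == 0)
        elif pd.api.types.is_object_dtype(series) or pd.api.types.is_string_dtype(series):
            # Strip whitespace from strings only, then blank out placeholders.
            stripped = series.map(lambda v: v.strip() if isinstance(v, str) else v)
            out[col] = stripped.mask(stripped.isin(bad_values))
    return out


# Made-up rows for illustration only.
raw = pd.DataFrame({
    "Manufacturername": [" BOSCH ", "N/A", ""],
    "Width m": [0.0, 0.25, 1.5],
})
cleaned = clean_bad_values_sketch(raw, bad_values=["N/A", "None", ""])
print(cleaned)
```

Note that replacing every numeric zero is only appropriate when zero is never a legitimate value, as is the case for physical dimensions; a column where zero is meaningful would need to be excluded.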

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

After running this cleaning step on the manufacturer dataset, **no bad values were found**.  
This confirms that the manufacturer data is already clean with respect to these common invalid entries.
</div>

In [None]:
getMissingValueReport(cleanManufacturersDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

After cleaning, I checked for missing values in the manufacturer dataset:

All manufacturer names are present, but there are 24 records with missing `Manufacturernumber`.  
Since the assignment requires all join keys to be present, these records will need to be removed before further analysis or merging with other datasets.

</div>
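
A missing-value report and the key-based row filter could look roughly like this. The helper names and the three sample rows are hypothetical; the real `getMissingValueReport` and `dropRowsMissingKeys` may return different shapes.

```python
import pandas as pd


def missing_value_report_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, only for columns that have any."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


def drop_rows_missing_keys_sketch(df: pd.DataFrame, keys: list[str]):
    """Return (rows with all keys present, number of rows dropped)."""
    kept = df.dropna(subset=keys).copy()
    return kept, len(df) - len(kept)


# Made-up rows for illustration only.
manufacturers = pd.DataFrame({
    "Manufacturernumber": ["M1", None, "M3"],
    "Manufacturername": ["BOSCH", "FEIN", "BOSCH"],
})
report = missing_value_report_sketch(manufacturers)
kept, dropped = drop_rows_missing_keys_sketch(manufacturers, ["Manufacturernumber"])
print(report)
print(f"dropped {dropped} row(s), {len(kept)} remain")
```

Returning the dropped count alongside the filtered frame makes it easy to log how many records were lost to the join-key requirement.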

In [None]:
cleanManufacturersDf.dropna(subset=['Manufacturernumber'], inplace=True)

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Descriptive Statistics and Data Quality Assessment
</div>

In [None]:
describeDf(cleanManufacturersDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">
 
- Each manufacturer number is unique, so the manufacturer dataset contains no duplicate `Manufacturernumber` values.
- The dataset covers 5 distinct manufacturers, with BOSCH being the most represented.
- The distribution suggests some manufacturers have significantly more products or entries than others.

</div>

In [None]:
cleanManufacturersDf

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Product Description Dataset</span>

This section focuses on the **product description dataset**, which contains multilingual descriptions for each product in the catalog.  
I will load the data, inspect its structure, and perform initial checks for missing values, duplicates, and data consistency.
</div>

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Data Loading and Initial Exploration
</div>

In [None]:
productDescriptionsDf = loadCsv(productDescriptionsFile)
productDescriptionsDf

In [None]:
printDatasetInfo(productDescriptionsDf)

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Data Cleaning and Preprocessing
</div>

In [None]:
cleanProductDescriptionsDf = productDescriptionsDf.copy()

In [None]:
# Clean the working copy so the raw frame stays available for comparison
cleanProductDescriptionsDf = cleanBadValues(cleanProductDescriptionsDf, badValues)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

I applied the `cleanBadValues` function to the product description dataset to identify and replace common invalid entries (such as `"N/A"`, `"None"`, empty strings, and other placeholders) with `NaN`.

- No bad values were found in the product description dataset.  
- This indicates that all entries are free from the predefined set of problematic values, and the dataset is already clean in this regard.
</div>

In [None]:
getMissingValueReport(cleanProductDescriptionsDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

After cleaning, I checked for missing values in the product description dataset:

- All records have valid article numbers and language codes.  
- However, some description fields are missing for a subset of records, most notably `Short description 2` and `Long description`.

According to the assignment requirements, every record in the `product_descriptions` dataset must have a valid `Articlenumber` to ensure reliable joins with other tables. As confirmed in the missing values check, there are **zero missing values in the `Articlenumber` column**. No rows needed to be dropped to satisfy the join key requirement.
</div>

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Descriptive Statistics and Data Quality Assessment
</div>

In [None]:
# Describe the cleaned frame so the statistics match the data used downstream
describeDf(cleanProductDescriptionsDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">
 
- Most article numbers are unique, but a few are repeated, possibly indicating variants or duplicates.
- The dataset includes two languages, with the majority of records in German (`de`).
- Some descriptions are shared across multiple products, which may reflect similar or related items.

</div>

<div style="background-color: #fff4e6; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Duplicate Analysis
</div>

In [None]:
dupCount, dupRows = getDuplicateReport(cleanProductDescriptionsDf, ['Articlenumber'])
dupCount

In [None]:
dupRows['Articlenumber'].value_counts()

In [None]:
dupRows[['Articlenumber', 'Short description', 'Short description 2', 'Long description', 'Language']].sort_values('Articlenumber')

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

I checked for duplicate values in the `Articlenumber` column of the product description dataset:

- **Number of duplicate Articlenumber entries:** 40
- **Pattern:** Each duplicated article number appears exactly twice in the dataset.

On further inspection, these duplicates are not exact copies. Instead, each duplicated `Articlenumber` corresponds to different language versions (typically German and English) of the same product.  
For example:

| Articlenumber | Language | Short description (sample)                    |
|---------------|----------|-----------------------------------------------|
| 06012A0400    | en       | Cordless band saw GCB 18V-63                  |
| 06012A0400    | de       | Akku-Bandsäge GCB 18V-63                      |
| ...           | ...      | ...                                           |

- The presence of duplicates is expected due to the multilingual nature of the dataset.
- Some products may have multiple description records (one per language). This multilingual structure is intentional and important, as it allows users to access product information in their preferred language. These entries should therefore be retained to support both German- and English-speaking users; they do not represent a data quality issue.
</div>
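
The check described above, that repeated article numbers differ only by language, can be sketched as below. The helper name is hypothetical, and counting every row that shares a key (`keep=False`) is an assumption about how `getDuplicateReport` counts; the third sample row uses a made-up article number.

```python
import pandas as pd


def duplicate_report_sketch(df: pd.DataFrame, subset: list[str]):
    """Return (number of rows sharing a key with another row, those rows)."""
    mask = df.duplicated(subset=subset, keep=False)
    return int(mask.sum()), df[mask]


# Two rows taken from the example table above plus one made-up row.
descriptions = pd.DataFrame({
    "Articlenumber": ["06012A0400", "06012A0400", "0000000000"],
    "Language": ["en", "de", "de"],
    "Short description": [
        "Cordless band saw GCB 18V-63",
        "Akku-Bandsäge GCB 18V-63",
        "Placeholder product",
    ],
})

dup_count, dup_rows = duplicate_report_sketch(descriptions, ["Articlenumber"])

# A group is "language-only" when it has exactly one row per distinct language.
groups = dup_rows.groupby("Articlenumber")["Language"].agg(["size", "nunique"])
language_only = bool((groups["size"] == groups["nunique"]).all())

# Equivalent view: the composite key (Articlenumber, Language) has no repeats.
composite_dups = int(descriptions.duplicated(subset=["Articlenumber", "Language"]).sum())
print(dup_count, language_only, composite_dups)
```

If `language_only` were `False`, at least one article number would have two rows in the same language, which would be a genuine duplicate worth investigating.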

In [None]:
cleanProductDescriptionsDf

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Product Properties Dataset</span>

This section covers the **product properties dataset**, which contains detailed technical attributes and specifications for each product in the catalog.  
I will load the data, examine its structure, and perform initial checks for missing values, duplicates, and data consistency.

</div>

In [None]:
productPropertiesDf = loadCsv(productPropertiesFile)
productPropertiesDf

In [None]:
printDatasetInfo(productPropertiesDf)

In [None]:
cleanProductPropertiesDf = productPropertiesDf.copy()

In [None]:
# Clean the working copy so the raw frame stays available for comparison
cleanProductPropertiesDf = cleanBadValues(cleanProductPropertiesDf, badValues)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">
    
I applied the `cleanBadValues` function to the product properties dataset to identify and replace common invalid entries (such as `"N/A"`, `"None"`, empty strings, and other placeholders) with `NaN`.
 
- Bad values were detected and successfully replaced with `NaN` in the product properties dataset.  
- This step ensures that all such problematic values are treated consistently as missing data, improving data quality for further analysis.

</div>

In [None]:
getMissingValueReport(cleanProductPropertiesDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

After cleaning, I checked for missing values in the product properties dataset:

- There are substantial missing values in several columns, especially in classification and feature fields (such as `Product category`, `Technical specifications`, `ECLASS`, `PROFICLASS`, etc.).  
- Additionally, some key columns used for joining (`Manufacturernumber`, `Articlenumber`) also contain missing values.  
- According to the assignment requirements, any records missing these join keys will need to be removed before further analysis.

</div>

In [None]:
cleanProductPropertiesDf.dropna(subset=['Articlenumber','Manufacturernumber'], inplace=True)

In [None]:
cleanProductPropertiesDf = dropFullyNullColumns(cleanProductPropertiesDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

- Several columns in the product properties dataset are completely empty (contain only missing values).  
- Since these columns provide no information for analysis, SQL queries, or dashboarding, I am dropping them from the cleaned dataset.  
- This step improves data quality and ensures that all remaining columns are potentially useful for further analysis.

</div>
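
Dropping fully empty columns amounts to a single filter in pandas. The sketch below is an assumption about what `dropFullyNullColumns` does; the helper name and the `Unused column` field are invented for the example.

```python
import numpy as np
import pandas as pd


def drop_fully_null_columns_sketch(df: pd.DataFrame):
    """Return (frame without all-NaN columns, names of the dropped columns)."""
    empty_cols = [col for col in df.columns if df[col].isna().all()]
    return df.drop(columns=empty_cols), empty_cols


# Made-up rows for illustration only.
properties = pd.DataFrame({
    "Articlenumber": ["A1", "A2"],
    "Weight kg": [0.35, np.nan],          # partially missing: kept
    "Unused column": [np.nan, np.nan],    # fully missing: dropped
})
trimmed, dropped_cols = drop_fully_null_columns_sketch(properties)
print(dropped_cols, list(trimmed.columns))
```

Returning the dropped column names keeps the step auditable, since the list can be logged or documented alongside the cleaned export.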

In [None]:
describeDf(cleanProductPropertiesDf).style.background_gradient(cmap='Pastel2_r')

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

Here are the summary statistics for the main numerical columns in the product properties dataset:

- The minimum value for **`Depth m`**, **`Width m`** and **`Length m`** is 0.001, reflecting only realistic, positive measurements.
- **Weight kg** varies widely, with a minimum of 0.35 kg and a maximum of 32.7 kg.
- **Delivery time days** and **Price quantity** have very limited variation, with most values being the same.
- **Counts** for each column reflect the number of non-missing, valid entries after cleaning.

</div>


In [None]:
dupCount, dupRows = getDuplicateReport(cleanProductPropertiesDf, ['Articlenumber', 'Manufacturernumber'])
dupCount

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">
 
- Each product-manufacturer pair is unique in the dataset.  
- This ensures data integrity and prevents issues during joins or further analysis.

</div>

In [None]:
cleanProductPropertiesDf

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Data Integration
</div>

In [None]:
finalDf = mergeCatalogTables(cleanProductDescriptionsDf, cleanProductPropertiesDf, cleanManufacturersDf)
finalDf.shape

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

To prepare a comprehensive dataset for analysis, I performed the following joins:

1. **Join product_properties with product_descriptions**  
   - **Key:** `Articlenumber`  
   - **Type:** Inner join  
   - **Purpose:** Combine technical product details with multilingual product descriptions.
<br>
<br>
2. **Join the result with manufacturers**  
   - **Key:** `Manufacturernumber`  
   - **Type:** Inner join  
   - **Purpose:** Add manufacturer information to each product entry.


- All records in the final dataset have valid `Articlenumber` and `Manufacturernumber` values.
- Inner joins retain only records whose keys match in all three datasets, so every row in the final table has complete product, description, and manufacturer information. Records without a match in any one of the tables are excluded.

</div>
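
The two joins can be sketched with `DataFrame.merge`. This is not the code in `merge.py`: the function name and sample rows are hypothetical, and the `validate="many_to_one"` guard on the second join is an added assumption that relies on manufacturer numbers being unique.

```python
import pandas as pd


def merge_catalog_sketch(descriptions, properties, manufacturers):
    """Inner-join properties with descriptions, then with manufacturers."""
    merged = properties.merge(descriptions, on="Articlenumber", how="inner")
    # validate raises MergeError if a manufacturer number repeats on the right.
    merged = merged.merge(
        manufacturers, on="Manufacturernumber", how="inner", validate="many_to_one"
    )
    return merged


# Made-up rows for illustration only.
properties = pd.DataFrame({
    "Articlenumber": ["A1", "A2", "A3"],
    "Manufacturernumber": ["M1", "M1", "M9"],
    "Weight kg": [1.0, 2.0, 3.0],
})
descriptions = pd.DataFrame({
    "Articlenumber": ["A1", "A1", "A2"],
    "Language": ["de", "en", "de"],
})
manufacturers = pd.DataFrame({
    "Manufacturernumber": ["M1"],
    "Manufacturername": ["BOSCH"],
})

merged = merge_catalog_sketch(descriptions, properties, manufacturers)

# Count property rows that the inner join silently excludes.
no_description = int((~properties["Articlenumber"].isin(descriptions["Articlenumber"])).sum())
print(merged.shape, no_description)
```

In this toy data, article `A3` has neither a description nor a known manufacturer, so it does not appear in the result. Counting such rows before merging makes the effect of the inner joins visible.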

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Dataset Initial Exploration
</div>

In [None]:
finalDf.head(2).style.background_gradient(cmap='Pastel1')

In [None]:
finalDf.tail(2).style.background_gradient(cmap='Pastel1')

In [None]:
logger.info("The Dataset contains %d rows & %d columns", finalDf.shape[0], finalDf.shape[1])

In [None]:
printDatasetInfo(finalDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

After joining all relevant tables, the final dataset contains **308 records** and **19 columns**. Here is an overview of the data structure and completeness:

- All join keys (`Articlenumber`, `Manufacturernumber`) and manufacturer names are present for every record.
- Most technical and descriptive fields are well-populated, but some columns (e.g., `Delivery time days`, `ETIM Features`, `Short description 2`) have significant missing values.
</div>

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Data Cleaning and Preprocessing
</div>

<div style="background-color: #fff4e6; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Duplicate Analysis
</div>

In [None]:
dupCount, dupRows = getDuplicateReport(finalDf, ['Articlenumber'])
dupCount

In [None]:
dupRows[['Articlenumber', 'Short description', 'Short description 2', 'Long description', 'Language']].sort_values('Articlenumber')

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

During data quality assessment, I identified **27 duplicate entries** in the `Articlenumber` column of the final dataset.

### Example of Duplicate Entries

| Articlenumber | Short description (de/en)         | Short description 2 | Long description | Language |
|---------------|-----------------------------------|---------------------|------------------|----------|
| 06012B4001    | Cordless straight grinder GGS 18V-10 SLC (en) <br> Akku-Geradschleifer GGS 18V-10 SLC (de) | GGS 18V-10 SLC (C) | (de has full text, en is NaN) | en/de |
| 06014A3100    | Radio GPB 18V-2 SC (en) <br> (NaN) (de)          | GPB 18V-2 SC (C) CLC | (de has full text, en is NaN) | en/de |
| 06016B4000    | Akku-Tauchsäge BITURBO GKT 18V-52 GC in L-BOXX (de) <br> Cordless plunge saw BITURBO GKT 18V-52 GC in L-BOXX (en) | GKT 18V-52 GC (L) solo CLC | (de has full text, en is NaN) | de/en |

- **Some products (`Articlenumber`) appear once per language.**
- These are **not true duplicates** in the sense of redundant data; rather, they represent **multilingual product records**.
- The presence of both German and English versions for these products is intentional and valuable for supporting internationalization and localization.

### Recommendation

- **No action is needed** to remove these records unless a truly unique, language-agnostic product list is required.
- For language-specific analysis or display, filter by the `Language` column as needed.
- If a unique product list is required (one row per `Articlenumber`), aggregate or select the desired language version.

</div>
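
Selecting one language version per product, as the last recommendation suggests, might be done as follows. The helper name and the preference order (German first, then English) are assumptions for illustration; the article numbers are taken from the example table above.

```python
import pandas as pd


def one_row_per_article_sketch(df: pd.DataFrame, preferred=("de", "en")) -> pd.DataFrame:
    """Keep one row per Articlenumber, choosing the most preferred language."""
    rank = {lang: i for i, lang in enumerate(preferred)}
    # Languages not listed in `preferred` sort last.
    ordered = df.assign(_rank=df["Language"].map(rank).fillna(len(rank)))
    ordered = ordered.sort_values(["Articlenumber", "_rank"], kind="stable")
    return ordered.drop_duplicates("Articlenumber", keep="first").drop(columns="_rank")


catalog = pd.DataFrame({
    "Articlenumber": ["06012B4001", "06012B4001", "06014A3100"],
    "Language": ["en", "de", "en"],
})
unique_products = one_row_per_article_sketch(catalog)
chosen = dict(zip(unique_products["Articlenumber"], unique_products["Language"]))
print(chosen)
```

Products that exist in only one language keep that row, so no product is lost when collapsing to a language-agnostic list.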

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Descriptive Statistics and Data Quality Assessment
</div>

In [None]:
describeNumerics(finalDf).T.style.background_gradient(cmap='Pastel2')

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

I calculated comprehensive descriptive statistics for all numeric columns in the final analytical dataset, including **mean, standard deviation, min, max, quartiles, median, mode, and range**.

- **Product dimensions** (`Depth m`, `Width m`, `Length m`) and **weight** show a wide range, indicating significant variety in the product catalog.
- **Delivery time days** and **price quantity** have very limited variation, with most values being the same.
- **Median and mode** values are very close to the quartiles, suggesting a somewhat symmetrical distribution for most numeric fields.
- The **range** for `Weight kg` (32.35 kg) highlights a few very heavy products compared to the rest.

This summary provides a clear statistical overview of the numeric fields in the final dataset, supporting further analysis and visualization.

</div>
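
The extended statistics listed above (median, mode, and range on top of the usual `describe()` output) could be assembled like this. The helper is a hypothetical stand-in for `describeNumerics` and returns one row per column; the sample weights are made up, apart from the minimum and maximum quoted in this notebook.

```python
import pandas as pd


def describe_numerics_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """describe() plus median, first mode, and range, one row per numeric column."""
    numeric = df.select_dtypes(include="number")
    stats = numeric.describe().T
    stats["median"] = numeric.median()
    stats["mode"] = numeric.mode().iloc[0]   # first mode when several tie
    stats["range"] = stats["max"] - stats["min"]
    return stats


toy = pd.DataFrame({"Weight kg": [0.35, 2.0, 2.0, 32.7]})
stats = describe_numerics_sketch(toy)
print(stats[["min", "max", "median", "mode", "range"]])
```

With a minimum of 0.35 kg and a maximum of 32.7 kg, the range works out to 32.35 kg, matching the figure quoted above.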

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Feature Engineering
</div>

In [None]:
engineeredDf = finalDf.copy()

In [None]:
engineeredDf = engineerFeatures(engineeredDf)
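
As a rough idea of what the engineered fields involve, the sketch below derives `Volume_m3` and `Product_length_category` from the dimension columns. It is not the code in `engineer.py`: computing volume as depth × width × length is an assumption, and the bin edges (0.2 m and 0.5 m) are hypothetical thresholds chosen only for the example.

```python
import numpy as np
import pandas as pd


def engineer_features_sketch(df: pd.DataFrame, length_bins=(0.0, 0.2, 0.5, np.inf)) -> pd.DataFrame:
    """Add a volume column and a three-level length category to a copy of df."""
    out = df.copy()
    # NaN in any dimension propagates to the volume.
    out["Volume_m3"] = out["Depth m"] * out["Width m"] * out["Length m"]
    # Right-closed bins: (0, 0.2] Small, (0.2, 0.5] Medium, (0.5, inf) Large.
    out["Product_length_category"] = pd.cut(
        out["Length m"], bins=list(length_bins), labels=["Small", "Medium", "Large"]
    )
    return out


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Depth m": [0.1, 0.2, 0.5, 0.1],
    "Width m": [0.1, 0.1, 0.4, 0.1],
    "Length m": [0.1, 0.3, 1.2, np.nan],
})
engineered = engineer_features_sketch(toy)
print(engineered[["Volume_m3", "Product_length_category"]])
```

Rows with a missing length receive no category rather than a default one, which keeps the missing-data pattern visible in later frequency plots.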

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Exploratory Data Analysis (EDA) and Visualization
</div>

<div style="background-color: #fff4e6; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Correlation Analysis of Numeric Features
</div>

In [None]:
plotCorrelationHeatmap(engineeredDf)

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

The heatmap summarizes the correlation coefficients between all numeric columns in the dataset. Values close to 1 indicate a strong positive relationship, values close to -1 indicate a strong negative relationship, and values near 0 suggest no linear correlation.

- Product dimensions and weight are moderately to strongly positively correlated.
- Product volume (`Volume_m3`) is highly correlated with all three dimensions and weight.
- `Delivery time days` shows strong negative correlations with all size-related features, but this should be interpreted with caution due to the very small sample size for delivery time.
</div>
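
One way to make the small-sample caveat explicit is to compute the number of paired observations next to each coefficient and suppress coefficients that rest on too few pairs. The helper name, the threshold of 10, and the sample data are assumptions for illustration; `min_periods` is a standard parameter of `DataFrame.corr`.

```python
import pandas as pd


def correlation_with_counts_sketch(df: pd.DataFrame, min_periods: int = 10):
    """Return (Pearson correlations, pairwise non-null observation counts)."""
    numeric = df.select_dtypes(include="number")
    # Pairs with fewer than min_periods shared observations become NaN.
    corr = numeric.corr(method="pearson", min_periods=min_periods)
    present = numeric.notna().astype(int)
    pair_counts = present.T.dot(present)
    return corr, pair_counts


# Made-up data: delivery time is known for only 3 of 12 rows.
weights = [float(i) for i in range(1, 13)]
toy = pd.DataFrame({
    "Weight kg": weights,
    "Length m": [w * 0.1 for w in weights],
    "Delivery time days": [5.0, 4.0, 4.0] + [float("nan")] * 9,
})
corr, pair_counts = correlation_with_counts_sketch(toy, min_periods=10)
print(corr)
print(pair_counts)
```

Reporting the pair counts alongside the heatmap lets readers judge which coefficients are supported by enough data.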

<div style="background-color: #fff4e6; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Frequencies and Distributions of Variables
</div>

In [None]:
plotFrequencyGridSmart(engineeredDf, title="Frequencies and Distributions of Variables")

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

The visualizations above provide an overview of the frequencies and distributions for both numeric and categorical variables in the dataset.

### **Numeric Variables**
- **EAN:** Most products cluster around a few EAN values, indicating possible batches or product families.
- **Depth m, Width m, Length m:** These dimensions show right-skewed distributions, with most products being relatively compact and a few much larger items.
- **Weight kg:** Also right-skewed, with most products under 10 kg and a few heavy outliers.
- **Delivery time days:** Very limited variation, with most products having a delivery time of 4 or 5 days.
- **Volume_m3:** Most products have a small volume, with a long tail for larger items.

### **Categorical Variables**
- **Type of product:** Nearly all products are labeled as "main product".
- **Price quantity:** All products have a price quantity of 1, indicating unit-based pricing.
- **ETIM:** A few ETIM codes dominate, suggesting a concentration in certain product categories.
- **Language:** The majority of entries are in German (`de`), with a smaller portion in English (`en`).
- **Manufacturername:** "BOSCH" and "FEIN" are the most common manufacturers.
- **Product_length_category:** Most products fall into the "Medium" category, with fewer "Small" and "Large" items.

</div>

<div style="background-color: #fff4e6; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Outlier Detection and Handling
</div>

In [None]:
plotBoxplotGrid(engineeredDf, title="Outlier Detection in Variables")

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 12px; margin-bottom: 18px; border-radius: 0 8px 8px 0;">

During the data exploration phase, several outliers were detected in the numeric columns (e.g., product dimensions, weight, and volume). Upon closer inspection, I verified that these outlier values are legitimate and correspond to actual products in the dataset.

Since the goal of this project is **insight generation and descriptive analysis**, there is no need to remove or adjust these outliers. In fact, retaining them provides a more accurate and comprehensive view of the product assortment, capturing the full diversity of the catalog.

- All outliers are kept in the dataset.
- Insights and visualizations will be based on the complete, unfiltered data to ensure transparency and business relevance.

This approach ensures that the analysis reflects real-world product variety and supports meaningful business conclusions.

</div>
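
To list the flagged rows for manual review rather than only seeing them as points on a boxplot, the usual 1.5 × IQR rule can be applied directly. This is a sketch under the assumption that the boxplots use that conventional whisker rule; the helper name and sample weights are made up, apart from the minimum and maximum quoted earlier.

```python
import pandas as pd


def iqr_outlier_mask_sketch(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


weights = pd.Series([0.35, 1.0, 1.2, 1.5, 2.0, 2.4, 32.7], name="Weight kg")
mask = iqr_outlier_mask_sketch(weights)
flagged = weights[mask]
print(flagged)
```

The mask only flags candidates; whether a flagged value is a legitimate product or a data error still has to be checked against the catalog, as was done here.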

In [None]:
engineeredDf.to_csv(outputCleanedFile, index=False)

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 18px; margin-bottom: 24px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Conclusion</span>

This notebook provided a thorough, Python-based exploration and assessment of product catalog data quality, integrating product properties, descriptions, and manufacturer information. Using a combination of data cleaning, feature engineering, and descriptive analytics, I systematically evaluated the completeness and consistency of key product attributes across the catalog.

**Key achievements include:**

- Rigorous data cleaning and normalization, including the removal of fully empty columns and the handling of invalid or missing values in both categorical and numeric fields.
- Validation of unique product-manufacturer combinations, ensuring robust joins and reliable downstream analysis.
- Comprehensive descriptive statistics and visualizations, revealing the distribution, diversity, and interrelationships among product attributes.
- Feature engineering to create new analytical perspectives, such as product length categories and volumetric calculations.
- Careful investigation of outliers, confirming their legitimacy and retaining them to preserve the true diversity of the product range.
- Identification of data quality gaps, especially in fields like ETIM classification, delivery time, and certain product dimensions, highlighting areas for potential enrichment.

This analysis equips business stakeholders with actionable insights to prioritize data quality improvements, optimize product information management, and enhance the discoverability and utility of the product catalog. The workflow demonstrated here can be readily adapted for continuous monitoring and iterative enhancement of catalog data quality.

**Next steps:**  
The cleaned and enriched dataset is now ready for advanced SQL queries, dashboarding in Power BI, and further business analysis. Stakeholders can use these insights to guide targeted data enrichment, monitor catalog completeness over time, and support data-driven decision-making across product management, logistics, and marketing functions.
</div>