<div style="background-color: #f0f7ff; border-left: 6px solid #2980b9; padding: 24px; margin-bottom: 32px; border-radius: 0 12px 12px 0; box-shadow: 0 2px 8px rgba(41,128,185,0.08);">

<h1 style="color:#2980b9; font-size:2.3em; margin-bottom:0.2em;">Data Quality as a Driver of Sales Growth</h1>
<p style="font-size:1.15em; color:#34495e; font-weight:500;">
SQL Analysis of Product Catalogs for Business Value Creation
</p>
</div>

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Project Focus</span>

This SQL analysis focuses on evaluating and benchmarking the **data quality of product catalog records** for multiple manufacturers in the hardware and industrial supply sector. The primary objectives are:
- **Identify manufacturers with the greatest potential for data quality improvement** (both in absolute and relative terms)
- **Assess field-level data completeness** for key product attributes across manufacturers
- **Highlight actionable insights** for business stakeholders to drive improvements in catalog data

</div>

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Background</span>

All SQL analyses in this notebook are performed on a **cleaned product catalog dataset**. The original raw data, including product descriptions, properties, and manufacturer information, was first processed and cleaned using Python (Pandas).

**Key cleaning steps included:**
- Removing or normalizing invalid values (e.g., `N/A`, empty strings) to `NULL`
- Discarding records with missing join keys
- Ensuring only relevant, high-quality records are retained

**All SQL queries and results below are based exclusively on this cleaned, pre-processed product catalog table.**

</div>
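
The cleaning steps listed above might look roughly like the following sketch. The sample rows, the list of placeholder tokens, and the helper name `clean_catalog` are assumptions made for illustration; the actual pre-processing code is not part of this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical raw sample: column names follow the catalog, values are invented.
raw = pd.DataFrame({
    "Manufacturername": ["BOSCH", "FEIN", None, "FISCHER"],
    "Articlenumber": ["A1", "N/A", "C3", "D4"],
    "EAN": ["4012079000000", "", "N/A", "  "],
})

# Assumed placeholder pattern: empty or whitespace-only strings, "N/A", "NA" (any case).
INVALID_PATTERN = r"(?i)^\s*(?:n/a|na)?\s*$"

def clean_catalog(df, join_keys):
    """Normalize placeholder values to NaN, then drop rows without join keys."""
    cleaned = df.replace(INVALID_PATTERN, np.nan, regex=True)
    return cleaned.dropna(subset=join_keys).reset_index(drop=True)

cleaned = clean_catalog(raw, ["Manufacturername", "Articlenumber"])
print(cleaned)
```

Normalizing first and dropping second matters: a join key that contains only `N/A` is treated as missing and the record is discarded, rather than surviving as a string.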

<div style="background-color: #fff4e6; border-left: 4px solid #e67e22; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#e67e22;">Dataset Characteristics</span>

- **Source**: Cleaned product catalog table (Python pre-processing)
- **Content**: Product identifiers, descriptions, technical details, dimensions, manufacturer data, and more
- **Quality**: Invalid and incomplete records removed, columns standardized, and all join keys present

<div style="display: grid; grid-template-columns: repeat(auto-fit, minmax(200px, 1fr)); gap: 10px; margin-top: 10px;">

<div style="background-color: #f5f5f5; padding: 10px; border-radius: 6px;">
<h3 style="color: #2980b9;">SQL Analysis Pipeline</h3>

1. **Data Quality Scoring**: For each product and manufacturer, calculate the number and percentage of missing or invalid fields  
2. **Field-Level Assessment**: Compute the percentage of "good" and "bad" entries for each key attribute, per manufacturer  
3. **Manufacturer Benchmarking**: Rank manufacturers by total and relative improvement potential based on data quality metrics  
</div>

<div style="background-color: #f0f7ff; padding: 10px; border-radius: 6px;">
<h3 style="color: #27ae60;">Analysis Scope</h3>

- **Manufacturer-level data quality**: Absolute and percentage of missing/problematic fields  
- **Attribute-level completeness**: Which fields are most/least complete per manufacturer  
- **Product-level scoring**: Individual product data quality scores   
</div>

<div style="background-color: #fff0f0; padding: 10px; border-radius: 6px;">
<h3 style="color: #e67e22;">Output Metrics</h3>

- **Total and % of good and bad fields** per manufacturer  
- **Completeness scores** for each attribute and manufacturer   
- **Product-level data quality scores**  
</div>
</div>
</div>
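
As a minimal sketch of the product-level scoring step, the query below counts missing fields per row. It assumes a reduced two-attribute table with invented values, and it uses the standard-library `sqlite3` module as a stand-in for pandasql (which runs its queries on SQLite); the aliases `Missing_Fields` and `Pct_Missing` are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE catalog ("
    "Manufacturername TEXT, Articlenumber TEXT, EAN REAL, [Depth m] REAL)"
)
# Invented sample rows; the real table has 21 columns.
con.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        ("BOSCH", "06016B4000", None, 0.254),
        ("BOSCH", "06019J4002", 4.0e12, None),
        ("FEIN", "F100", None, None),
    ],
)

scored_columns = ["EAN", "Depth m"]
# In SQLite, (x IS NULL) evaluates to 0 or 1, so the terms can be added per row.
missing_expr = " + ".join(f"([{c}] IS NULL)" for c in scored_columns)

query = f"""
SELECT Manufacturername, Articlenumber,
       {missing_expr} AS Missing_Fields,
       ROUND(({missing_expr}) * 100.0 / {len(scored_columns)}, 1) AS Pct_Missing
FROM catalog
ORDER BY Pct_Missing DESC, Articlenumber
"""
product_scores = con.execute(query).fetchall()
for row in product_scores:
    print(row)
```

Sorting by `Pct_Missing` puts the least complete products first, which is the order a data steward would typically work through.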

<div style="background-color: #fff4e6; border-left: 4px solid #e74c3c; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#e74c3c;">Expected Deliverables</span>

1. **SQL scripts**: Well-documented queries for all analysis steps  
2. **Data quality summary tables**: Manufacturer and field-level completeness/deficiency metrics  
3. **Product-level scores**: For granular quality assessment  
4. **Actionable insights**: Short explanations accompanying each SQL result  
5. **Documentation**: Clear markdown cells explaining methodology, findings, and recommendations  

</div>

<div style="background-color: #f5f5f5; padding: 15px; text-align: center; border-left: 4px solid #9b59b6; border-radius: 0 8px 8px 0;">
<p style="font-weight: bold; color: #2c3e50;">"A focused SQL-driven analysis of product catalog data quality to drive measurable business value"</p>
</div>

In [23]:
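import warnings
import pandas as pd
from pandasql import PandaSQL

pd.options.mode.chained_assignment = None
warnings.filterwarnings('ignore')

# Create the PandaSQL instance once so every query cell below can reuse it
pandasql_instance = PandaSQL()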

In [35]:
productCatalogDf = pd.read_csv('../data/product_catalog_cleaned.csv')

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Data Preview</span>

The table below summarizes the cleaned product catalog data used for this analysis.  
Each row represents a unique product, and each column contains key information or attributes about that product.
</div>

In [36]:
productCatalogDf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308 entries, 0 to 307
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Manufacturernumber       308 non-null    object 
 1   Articlenumber            308 non-null    object 
 2   EAN                      282 non-null    float64
 3   Technical details        304 non-null    object 
 4   Picture normal reduced   304 non-null    object 
 5   Depth m                  264 non-null    float64
 6   Width m                  264 non-null    float64
 7   Length m                 264 non-null    float64
 8   Weight kg                304 non-null    float64
 9   Delivery time days       4 non-null      float64
 10  Type of product          241 non-null    object 
 11  Price quantity           308 non-null    int64  
 12  ETIM Features            44 non-null     object 
 13  ETIM                     44 non-null     object 
 14  Short description        3

<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### **Column Descriptions**

- **Manufacturernumber**: Unique identifier assigned by the manufacturer (string)
- **Articlenumber**: Unique article or SKU number for the product (string)
- **EAN**: European Article Number (barcode, numeric; may be missing for some products)
- **Technical details**: Key technical specifications (string)
- **Picture normal reduced**: URL or path to the product image (string)
- **Depth m, Width m, Length m**: Product dimensions in meters (numeric; may be missing)
- **Weight kg**: Product weight in kilograms (numeric)
- **Delivery time days**: Estimated delivery time in days (numeric; rarely present)
- **Type of product**: Product category/type (string)
- **Price quantity**: Quantity associated with the listed price (integer)
- **ETIM Features, ETIM**: Standardized product classification and features (string; often missing)
- **Short description**: Main short product description (string)
- **Short description 2**: Alternative/secondary short description (string; often missing)
- **Long description**: Detailed product description (string)
- **Language**: Language of the description (string)
- **Manufacturername**: Name of the manufacturer (string)
- **Product_length_category**: Categorical length classification (string; may be missing)
- **Volume_m3**: Product volume in cubic meters (numeric; may be missing)

> **Note:**  
> Some fields (such as `Delivery time days`, `ETIM Features`, `Short description 2`) have a significant number of missing values, which is a key focus of the data quality analysis.

---
</div>

In [79]:
query = """
SELECT *
FROM productCatalogDf
"""

result = pandasql_instance(query, locals())
result

Unnamed: 0,Manufacturernumber,Articlenumber,EAN,Technical details,Picture normal reduced,Depth m,Width m,Length m,Weight kg,Delivery time days,...,Price quantity,ETIM Features,ETIM,Short description,Short description 2,Long description,Language,Manufacturername,Product_length_category,Volume_m3
0,0 601 6B4 000,06016B4000,,§Titel§Akku-Tauchsäge BITURBO GKT 18V-52 GC Pr...,'https://www.nexmart.com/media/catalog/ampshar...,0.254,0.36,0.444,4.032,,...,1,,,Akku-Tauchsäge BITURBO GKT 18V-52 GC in L-BOXX,GKT 18V-52 GC (L) solo CLC,"Akku-Tauchsäge BITURBO GKT 18V-52 GC, Die Akku...",de,BOSCH,Medium,0.040599
1,0 601 6B4 000,06016B4000,,§Titel§Akku-Tauchsäge BITURBO GKT 18V-52 GC Pr...,'https://www.nexmart.com/media/catalog/ampshar...,0.254,0.36,0.444,4.032,,...,1,,,Cordless plunge saw BITURBO GKT 18V-52 GC in L...,,,en,BOSCH,Medium,0.040599
2,0 601 9J4 002,06019J4002,,§Titel§Akku-Winkelschleifer GWS 18V-10 Profess...,'https://www.nexmart.com/media/catalog/ampshar...,0.135,0.16,0.395,1.424,,...,1,,,"Akku-Winkelschleifer GWS 18V-10, 125 mm,","GWS 18V-10 (C, 125 mm) CLC","Akku-Winkelschleifer GWS 18V-10, Der kleine Ak...",de,BOSCH,Medium,0.008532
3,0 601 9J4 002,06019J4002,,§Titel§Akku-Winkelschleifer GWS 18V-10 Profess...,'https://www.nexmart.com/media/catalog/ampshar...,0.135,0.16,0.395,1.424,,...,1,,,,"GWS 18V-10 (C, 125 mm) CLC",,en,BOSCH,Medium,0.008532
4,0 601 9H6 000,06019H6000,,§Titel§Akku-Winkelschleifer BITURBO GWS 18V-15...,'https://www.nexmart.com/media/catalog/ampshar...,0.100,0.10,0.150,2.300,,...,1,,,Akku-Winkelschleifer BITURBO GWS 18V-15 C,GWS 18V-15 C 125 mm (L) solo CLC,"Akku-Winkelschleifer BITURBO GWS 18V-15 C, Der...",de,BOSCH,Small,0.001500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
303,LS100FLEXCFB,LS100FLEXCFB,4.012079e+12,§Lochergrößen§152.4§mm|§Lochergrößen PG§48§|§L...,,,,,6.050,,...,1,§Lochergrößen§152.4§§|§Lochergrößen PG§48§§|§L...,EC002121,"Lochstanze m. drehb. Kopf mit Bosch Akku, mit ...",,"LS 100 FLEX Akkuhydraulisches Stanzwerkzeug, h...",de,GUSTAV KLAUKE GMBH,,
304,LS50FLEXCFB,LS50FLEXCFB,4.012079e+12,§Betätigungsart§akku-hydraulisch§|§Stanzkraft§...,'https://www.nexmart.com/media/catalog/ampshar...,,,,5.500,,...,1,§Betätigungsart§akku-hydraulisch§§|§Stanzkraft...,EC001085,"Lochstanze m. drehb. Kopf, Akku von Bosch",,"Akkuhydraulisches Stanzwerkzeug, hohe Flexibil...",de,GUSTAV KLAUKE GMBH,,
305,RALB1EU,RALB1EU,4.012079e+12,§Nennspannung§18§V|§Kapazität§2§Ah|§Ausführung...,'https://www.nexmart.com/media/catalog/ampshar...,,,,0.350,,...,1,§Nennspannung§18§§|§Kapazität§2§§|§Ausführung§...,EC001199,"Akku Li-Ion Bosch, 18V/2Ah",,"Bosch Li-Ion Akku 18V/2Ah Akku, geeignet für d...",de,BOSCH,,
306,RALB2EU,RALB2EU,4.012079e+12,§Nennspannung§18§V|§Kapazität§5§Ah|§Ausführung...,'https://www.nexmart.com/media/catalog/ampshar...,,,,0.640,,...,1,§Nennspannung§18§§|§Kapazität§5§§|§Ausführung§...,EC001199,"Akku Li-Ion Bosch, 18V/5Ah",,"Bosch Li-Ion Akku 18V/5Ah Akku, geeignet für d...",de,BOSCH,,


<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Duplicate Records and Language Impact
*How do duplicates arise in the product catalog, and how does multilingual data (German and English) affect duplicate detection?*

</div>

In [70]:
query = """
SELECT COUNT(*) - COUNT(DISTINCT Manufacturername || '_' || Articlenumber) AS DuplicateCount
FROM productCatalogDf;
"""

result = pandasql_instance(query, locals())
result

Unnamed: 0,DuplicateCount
0,27


<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">
    
During the duplicate check, **27 records** were identified as duplicates based on the combination of manufacturer name and article number. However, previous analysis in Python revealed that these are not true duplicates: they represent the same products provided in two languages, **German (Deutsch)** and **English**.

This multilingual structure is intentional and important, as it allows users to access product information in their preferred language. These entries should therefore be retained to support both German- and English-speaking users; they do not represent a data quality issue.
</div>
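
One way to confirm in SQL that repeated keys are language variants rather than true duplicates is to compare the row count with the number of distinct languages per key. The sketch below uses invented rows and the standard-library `sqlite3` module as a stand-in for pandasql; the alias `True_Duplicates` is hypothetical.

```python
import sqlite3

rows = [
    ("BOSCH", "06016B4000", "de"),
    ("BOSCH", "06016B4000", "en"),  # same article, second language
    ("FEIN", "F100", "de"),
    ("FEIN", "F100", "de"),         # same article, same language
    ("FISCHER", "X1", "de"),
]
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE catalog (Manufacturername TEXT, Articlenumber TEXT, Language TEXT)"
)
con.executemany("INSERT INTO catalog VALUES (?, ?, ?)", rows)

# A key is a true duplicate only if it has more rows than distinct languages.
query = """
SELECT Manufacturername, Articlenumber,
       COUNT(*) AS Row_Count,
       COUNT(DISTINCT Language) AS Language_Count,
       COUNT(*) - COUNT(DISTINCT Language) AS True_Duplicates
FROM catalog
GROUP BY Manufacturername, Articlenumber
HAVING COUNT(*) > 1
ORDER BY Manufacturername
"""
repeated_keys = con.execute(query).fetchall()
for row in repeated_keys:
    print(row)
```

In this invented sample the BOSCH article is a translation pair (`True_Duplicates` is 0), while the FEIN article would be flagged as a genuine duplicate.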

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Manufacturers with Biggest Data Quality Improvement Potential
*Which manufacturers have the biggest improvement potential in their data quality in absolute and relative numbers?*

</div>

In [80]:
query = """
SELECT
    Manufacturername,
    COUNT(DISTINCT Articlenumber) AS Products_Count,

    -- Bad fields count per column
    SUM(CASE WHEN Manufacturernumber IS NULL THEN 1 ELSE 0 END) AS bad_Manufacturernumber,
    SUM(CASE WHEN Articlenumber IS NULL THEN 1 ELSE 0 END) AS bad_Articlenumber,
    SUM(CASE WHEN EAN IS NULL THEN 1 ELSE 0 END) AS bad_EAN,
    SUM(CASE WHEN [Technical details] IS NULL THEN 1 ELSE 0 END) AS bad_TechnicalDetails,
    SUM(CASE WHEN [Picture normal reduced] IS NULL THEN 1 ELSE 0 END) AS bad_PictureURL,
    SUM(CASE WHEN [Depth m] IS NULL THEN 1 ELSE 0 END) AS bad_Depth,
    SUM(CASE WHEN [Width m] IS NULL THEN 1 ELSE 0 END) AS bad_Width,
    SUM(CASE WHEN [Length m] IS NULL THEN 1 ELSE 0 END) AS bad_Length,
    SUM(CASE WHEN [Weight kg] IS NULL THEN 1 ELSE 0 END) AS bad_Weight,
    SUM(CASE WHEN [Delivery time days] IS NULL THEN 1 ELSE 0 END) AS bad_DeliveryTime,
    SUM(CASE WHEN [Type of product] IS NULL THEN 1 ELSE 0 END) AS bad_TypeOfProduct,
    SUM(CASE WHEN [Price quantity] IS NULL THEN 1 ELSE 0 END) AS bad_PriceQuantity,
    SUM(CASE WHEN [ETIM Features] IS NULL THEN 1 ELSE 0 END) AS bad_ETIMFeatures,
    SUM(CASE WHEN ETIM IS NULL THEN 1 ELSE 0 END) AS bad_ETIM,
    SUM(CASE WHEN [Short description] IS NULL THEN 1 ELSE 0 END) AS bad_ShortDesc,
    SUM(CASE WHEN [Short description 2] IS NULL THEN 1 ELSE 0 END) AS bad_ShortDesc2,
    SUM(CASE WHEN [Long description] IS NULL THEN 1 ELSE 0 END) AS bad_LongDesc,
    SUM(CASE WHEN Language IS NULL THEN 1 ELSE 0 END) AS bad_Language,
    SUM(CASE WHEN Manufacturername IS NULL THEN 1 ELSE 0 END) AS bad_Manufacturername,
    SUM(CASE WHEN Product_length_category IS NULL THEN 1 ELSE 0 END) AS bad_LengthCategory,
    SUM(CASE WHEN Volume_m3 IS NULL THEN 1 ELSE 0 END) AS bad_Volume,

    -- Total number of fields evaluated (rows x 21 columns)
    COUNT(*) * 21 AS Total_Records,

    -- Total Bad Fields
    (
        SUM(CASE WHEN Manufacturernumber IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Articlenumber IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN EAN IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Technical details] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Picture normal reduced] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Depth m] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Width m] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Length m] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Weight kg] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Delivery time days] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Type of product] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Price quantity] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [ETIM Features] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN ETIM IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Short description] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Short description 2] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Long description] IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Language IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Manufacturername IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Product_length_category IS NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Volume_m3 IS NULL THEN 1 ELSE 0 END)
    ) AS Total_Bad_Fields,
    
    -- Percentage of Bad Fields
    ROUND(
        (
            (
                SUM(CASE WHEN Manufacturernumber IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Articlenumber IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN EAN IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Technical details] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Picture normal reduced] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Depth m] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Width m] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Length m] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Weight kg] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Delivery time days] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Type of product] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Price quantity] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [ETIM Features] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN ETIM IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Short description] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Short description 2] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Long description] IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Language IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Manufacturername IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Product_length_category IS NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Volume_m3 IS NULL THEN 1 ELSE 0 END)
            ) * 1.0 / (COUNT(*) * 21)
        ) * 100, 1
    ) AS Percentage_Bad_Fields

FROM productCatalogDf
GROUP BY Manufacturername
ORDER BY Percentage_Bad_Fields DESC;
"""

pandasql_instance = PandaSQL()
result = pandasql_instance(query, locals())
result.T

Unnamed: 0,0,1,2,3,4
Manufacturername,GUSTAV KLAUKE GMBH,ROTHENBERGER,FEIN,BOSCH,FISCHER
Products_Count,40,23,100,113,5
bad_Manufacturernumber,0,0,0,0,0
bad_Articlenumber,0,0,0,0,0
bad_EAN,0,0,0,26,0
bad_TechnicalDetails,0,0,0,4,0
bad_PictureURL,4,0,0,0,0
bad_Depth,40,0,0,4,0
bad_Width,40,0,0,4,0
bad_Length,40,0,0,4,0


<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

Analysis of data quality by manufacturer reveals significant differences in both the absolute number and percentage of missing or incomplete fields.  
The table below summarizes, for each manufacturer, the total number of products, the count of "bad" (missing) entries per key field, and the overall percentage of problematic fields.

**Key insights:**
- **GUSTAV KLAUKE GMBH** has the highest percentage of bad fields (38.6%), with missing values notably in dimensions, delivery time, type of product, and secondary descriptions.
- **ROTHENBERGER** and **FEIN** also show considerable improvement potential, especially in ETIM features and delivery time.
- **BOSCH** has the highest absolute number of bad fields (506), driven by the large number of products and missing values in EAN, long descriptions, and ETIM features.
- **FISCHER** demonstrates the best data quality among these manufacturers, with only 14.3% bad fields and minimal missing data.

| Manufacturername      | Products | Total Bad Fields | % Bad Fields | Main Issues                         |
|----------------------|----------|------------------|--------------|--------------------------------------|
| GUSTAV KLAUKE GMBH   | 40       | 324              | 38.6%        | Dimensions, Delivery Time, Short Desc 2, Type of Product |
| ROTHENBERGER         | 23       | 111              | 23.0%        | Delivery Time, ETIM, Short Desc 2    |
| FEIN                 | 100      | 407              | 19.4%        | Delivery Time, ETIM, Short Desc 2    |
| BOSCH                | 113      | 506              | 17.2%        | EAN, Long Desc, ETIM, Delivery Time  |
| FISCHER              | 5        | 15               | 14.3%        | ETIM, Delivery Time                  |

Improving data completeness in the highlighted fields would have the most impact on overall catalog quality and, by extension, business value.

</div>
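
The manufacturer-level query repeats the same `CASE` expression for every column, so it can also be generated from a column list. The sketch below is one possible way to do that, not the notebook's own method: the helper names `bad_flag` and `build_bad_fields_query` are hypothetical, the sample rows are invented, only three of the 21 columns are used, and `sqlite3` stands in for pandasql.

```python
import sqlite3

def bad_flag(column):
    """SQL expression that is 1 when the column is NULL, else 0."""
    return f"(CASE WHEN [{column}] IS NULL THEN 1 ELSE 0 END)"

def build_bad_fields_query(table, columns):
    """Build the per-manufacturer bad-field summary for the given columns."""
    per_column = ",\n    ".join(
        f"SUM({bad_flag(c)}) AS [bad_{c.replace(' ', '_')}]" for c in columns
    )
    total_bad = "SUM(" + " + ".join(bad_flag(c) for c in columns) + ")"
    n = len(columns)
    return f"""
SELECT
    Manufacturername,
    {per_column},
    COUNT(*) * {n} AS Total_Fields,
    {total_bad} AS Total_Bad_Fields,
    ROUND({total_bad} * 100.0 / (COUNT(*) * {n}), 1) AS Percentage_Bad_Fields
FROM {table}
GROUP BY Manufacturername
ORDER BY Percentage_Bad_Fields DESC
"""

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE catalog ("
    "Manufacturername TEXT, EAN REAL, [Depth m] REAL, [Delivery time days] REAL)"
)
con.executemany(
    "INSERT INTO catalog VALUES (?, ?, ?, ?)",
    [
        ("GUSTAV KLAUKE GMBH", 4.0e12, None, None),
        ("GUSTAV KLAUKE GMBH", 4.0e12, None, None),
        ("BOSCH", None, 0.254, None),
        ("BOSCH", 4.0e12, 0.135, None),
    ],
)

query = build_bad_fields_query("catalog", ["EAN", "Depth m", "Delivery time days"])
summary = con.execute(query).fetchall()
for row in summary:
    print(row)
```

The percentage is always total bad fields divided by rows times scored columns. For the reported figures, GUSTAV KLAUKE GMBH has 324 bad fields over 40 × 21 = 840 fields, which gives 38.6%.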

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Product Variables with Best Data Quality per Manufacturer
*What product variable/column (description or property) usually contains data of good quality per manufacturer? And what is the % of good quality records per variable/column and manufacturer?*

</div>

In [48]:
query = """
SELECT
    Manufacturername,
    COUNT(DISTINCT Articlenumber) AS Products_Count,

    -- Good fields count per column
    SUM(CASE WHEN Manufacturernumber IS NOT NULL THEN 1 ELSE 0 END) AS good_Manufacturernumber,
    SUM(CASE WHEN Articlenumber IS NOT NULL THEN 1 ELSE 0 END) AS good_Articlenumber,
    SUM(CASE WHEN EAN IS NOT NULL THEN 1 ELSE 0 END) AS good_EAN,
    SUM(CASE WHEN [Technical details] IS NOT NULL THEN 1 ELSE 0 END) AS good_TechnicalDetails,
    SUM(CASE WHEN [Picture normal reduced] IS NOT NULL THEN 1 ELSE 0 END) AS good_PictureURL,
    SUM(CASE WHEN [Depth m] IS NOT NULL THEN 1 ELSE 0 END) AS good_Depth,
    SUM(CASE WHEN [Width m] IS NOT NULL THEN 1 ELSE 0 END) AS good_Width,
    SUM(CASE WHEN [Length m] IS NOT NULL THEN 1 ELSE 0 END) AS good_Length,
    SUM(CASE WHEN [Weight kg] IS NOT NULL THEN 1 ELSE 0 END) AS good_Weight,
    SUM(CASE WHEN [Delivery time days] IS NOT NULL THEN 1 ELSE 0 END) AS good_DeliveryTime,
    SUM(CASE WHEN [Type of product] IS NOT NULL THEN 1 ELSE 0 END) AS good_TypeOfProduct,
    SUM(CASE WHEN [Price quantity] IS NOT NULL THEN 1 ELSE 0 END) AS good_PriceQuantity,
    SUM(CASE WHEN [ETIM Features] IS NOT NULL THEN 1 ELSE 0 END) AS good_ETIMFeatures,
    SUM(CASE WHEN ETIM IS NOT NULL THEN 1 ELSE 0 END) AS good_ETIM,
    SUM(CASE WHEN [Short description] IS NOT NULL THEN 1 ELSE 0 END) AS good_ShortDesc,
    SUM(CASE WHEN [Short description 2] IS NOT NULL THEN 1 ELSE 0 END) AS good_ShortDesc2,
    SUM(CASE WHEN [Long description] IS NOT NULL THEN 1 ELSE 0 END) AS good_LongDesc,
    SUM(CASE WHEN Language IS NOT NULL THEN 1 ELSE 0 END) AS good_Language,
    SUM(CASE WHEN Manufacturername IS NOT NULL THEN 1 ELSE 0 END) AS good_Manufacturername,
    SUM(CASE WHEN Product_length_category IS NOT NULL THEN 1 ELSE 0 END) AS good_LengthCategory,
    SUM(CASE WHEN Volume_m3 IS NOT NULL THEN 1 ELSE 0 END) AS good_Volume,

    -- Total number of fields evaluated (rows x 21 columns)
    COUNT(*) * 21 AS Total_Records,

    -- Total Good Fields
    (
        SUM(CASE WHEN Manufacturernumber IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Articlenumber IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN EAN IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Technical details] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Picture normal reduced] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Depth m] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Width m] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Length m] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Weight kg] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Delivery time days] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Type of product] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Price quantity] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [ETIM Features] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN ETIM IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Short description] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Short description 2] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN [Long description] IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Language IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Manufacturername IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Product_length_category IS NOT NULL THEN 1 ELSE 0 END) +
        SUM(CASE WHEN Volume_m3 IS NOT NULL THEN 1 ELSE 0 END)
    ) AS Total_Good_Fields,


    -- Percentage of Good Fields
    ROUND(
        (
            (
                SUM(CASE WHEN Manufacturernumber IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Articlenumber IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN EAN IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Technical details] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Picture normal reduced] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Depth m] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Width m] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Length m] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Weight kg] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Delivery time days] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Type of product] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Price quantity] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [ETIM Features] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN ETIM IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Short description] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Short description 2] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN [Long description] IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Language IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Manufacturername IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Product_length_category IS NOT NULL THEN 1 ELSE 0 END) +
                SUM(CASE WHEN Volume_m3 IS NOT NULL THEN 1 ELSE 0 END)
            ) * 1.0 / (COUNT(*) * 21)
        ) * 100, 1
    ) AS Percentage_Good_Fields

FROM productCatalogDf
GROUP BY Manufacturername
ORDER BY Percentage_Good_Fields DESC;
"""

pandasql_instance = PandaSQL()
result = pandasql_instance(query, locals())
result.T

Unnamed: 0,0,1,2,3,4
Manufacturername,FISCHER,BOSCH,FEIN,ROTHENBERGER,GUSTAV KLAUKE GMBH
Products_Count,5,113,100,23,40
good_Manufacturernumber,5,140,100,23,40
good_Articlenumber,5,140,100,23,40
good_EAN,5,114,100,23,40
good_TechnicalDetails,5,136,100,23,40
good_PictureURL,5,140,100,23,36
good_Depth,5,136,100,23,0
good_Width,5,136,100,23,0
good_Length,5,136,100,23,0


<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

This analysis evaluates which product fields typically have the highest data quality (i.e., lowest missing rate) for each manufacturer.  
For each manufacturer, the table summarizes the count of products, the number of "good" (complete) entries per field, and the overall percentage of good fields.

**Key insights:**
- **FISCHER** leads with the highest overall data completeness (85.7%), with almost all fields well-populated except for the ETIM fields and delivery time.
- **BOSCH** and **FEIN** also maintain strong data quality (82.8% and 80.6% respectively), especially in core fields such as article number and technical details, although BOSCH is missing the EAN for 18.6% of its records and has gaps in long descriptions.
- **GUSTAV KLAUKE GMBH** has the lowest field completeness (61.4%), mainly due to missing dimensional data and secondary descriptions.
- Across all manufacturers, **core identifiers** (manufacturer number, article number, language, and manufacturer name) and **short descriptions** are consistently well-populated.
- Fields such as **ETIM features**, **delivery time**, and **secondary descriptions** often have the lowest completeness rates.

| Manufacturername      | Products | % Good Fields | Core Fields with High Completeness      | Fields Needing Improvement      |
|----------------------|----------|---------------|-----------------------------------------|---------------------------------|
| FISCHER              | 5        | 85.7%         | All except ETIM, Delivery Time          | ETIM, Delivery Time             |
| BOSCH                | 113      | 82.8%         | Identifiers, Descriptions, Dimensions   | ETIM, Delivery Time, ShortDesc2 |
| FEIN                 | 100      | 80.6%         | Identifiers, Descriptions, Dimensions   | ETIM, Delivery Time, ShortDesc2 |
| ROTHENBERGER         | 23       | 77.0%         | Identifiers, Descriptions, Dimensions   | ETIM, Delivery Time, ShortDesc2 |
| GUSTAV KLAUKE GMBH   | 40       | 61.4%         | Identifiers, Descriptions               | Dimensions, ETIM, Delivery Time |

Focusing on improving the least complete fields (such as ETIM features and delivery time) will further enhance catalog quality for all manufacturers.

</div>
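
The share of good records per field and manufacturer can be cross-checked directly in pandas. This is a minimal sketch on an invented four-row sample with two attributes; the variable names `pct_good` and `best_field` are hypothetical.

```python
import pandas as pd

# Invented sample; the real table has 21 columns and 308 rows.
sample = pd.DataFrame({
    "Manufacturername": ["BOSCH", "BOSCH", "FEIN", "FEIN"],
    "EAN": [None, 4.0e12, 4.1e12, 4.2e12],
    "ETIM": [None, None, None, "EC001199"],
})

fields = ["EAN", "ETIM"]

# notna() gives True/False per cell; the group mean of booleans is the share of good cells.
pct_good = (
    sample[fields]
    .notna()
    .groupby(sample["Manufacturername"])
    .mean()
    .mul(100)
    .round(1)
)

# Field with the highest completeness for each manufacturer.
best_field = pct_good.idxmax(axis=1)

print(pct_good)
print(best_field)
```

Each cell of `pct_good` answers the question posed above for one field and one manufacturer, and `idxmax` picks out the most complete field per manufacturer (the first one in case of ties).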

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Percentage of Missing Values per Product Attribute (Manufacturer-wise)
*Which product fields are most and least complete in our catalog, and where do the largest data gaps exist at the product level?*

</div>

In [109]:
query = """
SELECT
    Manufacturername,
    COUNT(*) AS Total_Products,
    COUNT(DISTINCT Articlenumber) AS Products_Count,

    ROUND(SUM(CASE WHEN Manufacturernumber IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Manufacturernumber],
    ROUND(SUM(CASE WHEN Articlenumber IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Articlenumber],
    ROUND(SUM(CASE WHEN EAN IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_EAN],
    ROUND(SUM(CASE WHEN [Technical details] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Technical_Details],
    ROUND(SUM(CASE WHEN [Picture normal reduced] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Picture_URL],
    ROUND(SUM(CASE WHEN [Depth m] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Depth],
    ROUND(SUM(CASE WHEN [Width m] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Width],
    ROUND(SUM(CASE WHEN [Length m] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Length],
    ROUND(SUM(CASE WHEN [Weight kg] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Weight],
    ROUND(SUM(CASE WHEN [Delivery time days] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Delivery_Time_Days],
    ROUND(SUM(CASE WHEN [Type of product] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Type_Of_Product],
    ROUND(SUM(CASE WHEN [Price quantity] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Price_Quantity],
    ROUND(SUM(CASE WHEN [ETIM Features] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_ETIM_Features],
    ROUND(SUM(CASE WHEN ETIM IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_ETIM],
    ROUND(SUM(CASE WHEN [Short description] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Short_Description],
    ROUND(SUM(CASE WHEN [Short description 2] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Short_Description_2],
    ROUND(SUM(CASE WHEN [Long description] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Long_Description],
    ROUND(SUM(CASE WHEN Language IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Language],
    ROUND(SUM(CASE WHEN Manufacturername IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Manufacturername],
    ROUND(SUM(CASE WHEN Product_length_category IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Product_Length_Category],
    ROUND(SUM(CASE WHEN Volume_m3 IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) AS [%Bad_Volume]

FROM productCatalogDf
GROUP BY Manufacturername
"""

result = pandasql_instance(query, locals())
result.T

Unnamed: 0,0,1,2,3,4
Manufacturername,BOSCH,FEIN,FISCHER,GUSTAV KLAUKE GMBH,ROTHENBERGER
Total_Products,140,100,5,40,23
Products_Count,113,100,5,40,23
%Bad_Manufacturernumber,0.0,0.0,0.0,0.0,0.0
%Bad_Articlenumber,0.0,0.0,0.0,0.0,0.0
%Bad_EAN,18.6,0.0,0.0,0.0,0.0
%Bad_Technical_Details,2.9,0.0,0.0,0.0,0.0
%Bad_Picture_URL,0.0,0.0,0.0,10.0,0.0
%Bad_Depth,2.9,0.0,0.0,100.0,0.0
%Bad_Width,2.9,0.0,0.0,100.0,0.0


<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

This analysis highlights the **completeness of key product attributes across the entire catalog**, summarised per manufacturer.  
By examining the percentage of missing values for each field, we can see which product attributes are consistently available and which are frequently incomplete.

**Key product-level insights:**

- **Core identifiers** such as manufacturer number, article number, manufacturer name, and language are present for nearly every product, ensuring reliable catalog structure and searchability.
- **Essential descriptions** (short and long descriptions) are generally well-populated for most products, supporting clear product identification and customer understanding.
- **Critical gaps** exist in certain fields:
    - **Delivery time** and **ETIM-related fields** (ETIM features and ETIM classification) are missing for the vast majority of products, indicating these attributes are rarely provided.
    - **Secondary descriptions** (short description 2) and **dimensional data** (depth, width, length, volume, and length category) are also missing for a significant portion of products, which may limit the catalog’s usefulness for technical or logistical queries.
- **EAN (European Article Number)** and **long descriptions** are missing in a notable subset of products, which could impact product traceability and customer decision-making.
- **Price quantity** and **technical details** are mostly complete, ensuring that basic commercial and technical information is available for most products.

While foundational product data is robust, there is substantial room for improvement in advanced classification (ETIM), logistics (delivery time, dimensions), and secondary descriptive fields.  
Enhancing completeness in these areas at the product level would significantly increase the overall value and usability of the product catalog.
</div>
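
The field-level query above repeats the same `CASE WHEN ... IS NULL` expression for every attribute. As a minimal sketch, the column list could be generated from a list of field names instead. The helper `missing_pct_query`, the three-field subset, and the toy rows are illustrative assumptions, not part of the original notebook; plain `sqlite3` stands in for the notebook's `pandasql_instance` helper.

```python
import sqlite3

# Hypothetical subset of catalog fields; the real table has 21 of them.
FIELDS = ["EAN", "Width m", "Delivery time days"]


def missing_pct_query(fields, table="productCatalogDf", group_col="Manufacturername"):
    """Build a query returning the percentage of NULLs per field and group."""
    cols = ",\n    ".join(
        f"ROUND(SUM(CASE WHEN [{f}] IS NULL THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 1) "
        f"AS [%Bad_{f.replace(' ', '_')}]"
        for f in fields
    )
    return (
        f"SELECT {group_col}, COUNT(*) AS Total_Products,\n    {cols}\n"
        f"FROM {table}\nGROUP BY {group_col}\nORDER BY {group_col}"
    )


# Toy data (invented for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productCatalogDf "
    "(Manufacturername TEXT, EAN TEXT, [Width m] REAL, [Delivery time days] INTEGER)"
)
conn.executemany(
    "INSERT INTO productCatalogDf VALUES (?, ?, ?, ?)",
    [
        ("A", "400", 0.1, None),
        ("A", None, 0.2, None),
        ("B", "401", None, 3),
    ],
)

missing_rows = conn.execute(missing_pct_query(FIELDS)).fetchall()
print(missing_rows)
# [('A', 2, 50.0, 0.0, 100.0), ('B', 1, 0.0, 100.0, 0.0)]
```

Generating the columns keeps the field list in one place, so adding or removing an attribute does not require editing 21 nearly identical lines.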

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Data Completeness
*What is the average data completeness score per product across the entire catalog?*
</div>

In [107]:
query = """
SELECT
  AVG(Completeness_Score) AS Avg_Completeness_Score
FROM (
  SELECT
    (
    (
      (CASE WHEN Manufacturernumber IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN Articlenumber IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN EAN IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Technical details] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Picture normal reduced] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Depth m] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Width m] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Length m] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Weight kg] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Delivery time days] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Type of product] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Price quantity] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [ETIM Features] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN ETIM IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Short description] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Short description 2] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN [Long description] IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN Language IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN Manufacturername IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN Product_length_category IS NOT NULL THEN 1 ELSE 0 END) +
      (CASE WHEN Volume_m3 IS NOT NULL THEN 1 ELSE 0 END)
    ) * 1.0 / 21 
    )*100 AS Completeness_Score
    
  FROM productCatalogDf
) AS sub;
"""


result = pandasql_instance(query, locals())
result

Unnamed: 0,Avg_Completeness_Score
0,78.927025


<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

The analysis shows that the **average product in the catalog has a completeness score of 78.9%**. This means that, on average, each product entry has nearly 79% of the key information fields populated, while about 21% of fields are missing.

This metric provides a concise summary of the overall data quality in the catalog and highlights the opportunity to further improve completeness for a better customer and business experience.

</div>
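
Because every product is scored against the same number of fields, the average per-product score equals the share of populated cells in the whole table. The sketch below checks this on a toy table; the four-field subset and the rows are invented for illustration, and `sqlite3` is used in place of the notebook's `pandasql_instance` helper.

```python
import sqlite3

FIELDS = ["EAN", "Width m", "Long description", "ETIM"]  # hypothetical subset

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productCatalogDf "
    "(EAN TEXT, [Width m] REAL, [Long description] TEXT, ETIM TEXT)"
)
conn.executemany(
    "INSERT INTO productCatalogDf VALUES (?, ?, ?, ?)",
    [
        ("400", 0.1, "text", None),   # 3 of 4 populated -> 75.0
        (None, None, "text", None),   # 1 of 4 populated -> 25.0
        ("401", 0.3, "text", "EC1"),  # 4 of 4 populated -> 100.0
    ],
)

# Same scoring expression as in the query above, built from the field list.
filled = " + ".join(f"(CASE WHEN [{f}] IS NOT NULL THEN 1 ELSE 0 END)" for f in FIELDS)
score_sql = f"({filled}) * 100.0 / {len(FIELDS)}"

scores = [r[0] for r in conn.execute(f"SELECT {score_sql} FROM productCatalogDf ORDER BY rowid")]
avg_score = conn.execute(f"SELECT AVG({score_sql}) FROM productCatalogDf").fetchone()[0]

# Cross-check: share of populated cells over the whole table.
cells = conn.execute("SELECT * FROM productCatalogDf").fetchall()
populated_share = 100.0 * sum(v is not None for row in cells for v in row) / (len(cells) * len(FIELDS))

print(scores)                                # [75.0, 25.0, 100.0]
print(round(avg_score, 2), round(populated_share, 2))  # 66.67 66.67
```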

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Product-Level Data Quality Scores
*What is the data quality score for each product, considering the completeness of all key fields?*
</div>

In [106]:
query = """
SELECT
    Articlenumber,
    Manufacturername,
    (
        (CASE WHEN [Manufacturernumber] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Articlenumber] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [EAN] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Technical details] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Picture normal reduced] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Depth m] IS NOT NULL AND [Depth m] > 0 THEN 1 ELSE 0 END) +
        (CASE WHEN [Width m] IS NOT NULL AND [Width m] > 0 THEN 1 ELSE 0 END) +
        (CASE WHEN [Length m] IS NOT NULL AND [Length m] > 0 THEN 1 ELSE 0 END) +
        (CASE WHEN [Weight kg] IS NOT NULL AND [Weight kg] > 0 THEN 1 ELSE 0 END) +
        (CASE WHEN [Delivery time days] IS NOT NULL AND [Delivery time days] > 0 THEN 1 ELSE 0 END) +
        (CASE WHEN [Type of product] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Price quantity] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [ETIM Features] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [ETIM] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Short description] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Short description 2] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Long description] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Language] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Manufacturername] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Product_length_category] IS NOT NULL THEN 1 ELSE 0 END) +
        (CASE WHEN [Volume_m3] IS NOT NULL AND [Volume_m3] > 0 THEN 1 ELSE 0 END)
    ) * (100.0 / 21) AS Data_Quality_Score
FROM productCatalogDf
ORDER BY Data_Quality_Score DESC;
"""

result = pandasql_instance(query, locals())
result.style.background_gradient(cmap='Reds_r')

Unnamed: 0,Articlenumber,Manufacturername,Data_Quality_Score
0,06016C0000,BOSCH,85.714286
1,06012B4001,BOSCH,85.714286
2,06014A6200,BOSCH,85.714286
3,06014A6000,BOSCH,85.714286
4,06019H6L01,BOSCH,85.714286
5,06016B8000,BOSCH,85.714286
6,06016B9000,BOSCH,85.714286
7,06015B3001,BOSCH,85.714286
8,06019L5000,BOSCH,85.714286
9,0611920000,BOSCH,85.714286


<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

This analysis calculates a **data quality score (0–100%) for every product** in the catalog, based on 21 essential attributes such as identifiers, descriptions, technical details, dimensions, ETIM classification, and more. Unlike the catalog-wide average above, numeric fields (dimensions, weight, delivery time, volume) count as populated only when they are greater than zero.

A distribution analysis of product data quality scores reveals that while most products have reasonably high completeness, none is fully complete: in the output above, the highest-scoring products reach 85.7%, which corresponds to 18 of the 21 fields. About 60% of products score above 80%. The lowest-scoring products typically lack dimensional data, ETIM classification, or delivery time information.  
Focusing on these specific fields for the bottom 20% of products could have a disproportionately positive impact on the overall catalog quality.
 
Prioritizing enrichment of low-scoring products will directly enhance the overall quality, consistency, and business value of the product catalog, making it more reliable and useful for customers and partners.

</div>
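
The distribution figures mentioned above can be derived from the per-product scores with a few lines of Python. This is a minimal sketch: the function `summarize_scores` and the ten toy scores are hypothetical; in the notebook, the input would be the `Data_Quality_Score` column of `result`.

```python
def summarize_scores(scores, threshold=80.0, bottom_pct=20):
    """Summarise a list of 0-100 scores.

    Returns the share of products above `threshold`, the share with a
    perfect score, and the lowest `bottom_pct` percent of scores.
    """
    ordered = sorted(scores)
    n = len(ordered)
    k = max(1, n * bottom_pct // 100)  # size of the bottom group
    return {
        "pct_above_threshold": 100 * sum(s > threshold for s in ordered) / n,
        "pct_perfect": 100 * sum(s == 100.0 for s in ordered) / n,
        "bottom_group": ordered[:k],
    }


# Toy scores (invented for illustration only).
toy_scores = [85.7, 85.7, 81.0, 81.0, 81.0, 81.0, 76.2, 71.4, 66.7, 52.4]
summary = summarize_scores(toy_scores)
print(summary)
# {'pct_above_threshold': 60.0, 'pct_perfect': 0.0, 'bottom_group': [52.4, 66.7]}
```

The products in `bottom_group` are the natural candidates for the targeted enrichment described above.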

<div style="background-color: #fff0f0; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">

### Descriptive Field Interdependency
*How do the presence or absence of short description, short description 2, and long description fields interrelate across products?*

</div>

In [114]:
query = """
SELECT
  CASE
    WHEN [Short description] IS NULL
         AND [Short description 2] IS NULL
         AND [Long description] IS NULL THEN 'Missing All Descriptions'

    WHEN [Short description] IS NOT NULL
         AND [Short description 2] IS NOT NULL
         AND [Long description] IS NOT NULL THEN 'Complete Descriptions'

    WHEN [Short description] IS NULL
         AND [Short description 2] IS NULL
         AND [Long description] IS NOT NULL THEN 'Only Long Description Present'

    WHEN [Short description] IS NULL
         AND [Short description 2] IS NOT NULL
         AND [Long description] IS NULL THEN 'Only Short Description 2 Present'

    WHEN [Short description] IS NOT NULL
         AND [Short description 2] IS NULL
         AND [Long description] IS NULL THEN 'Only Short Description Present'

    WHEN [Short description] IS NOT NULL
         AND [Short description 2] IS NULL
         AND [Long description] IS NOT NULL THEN 'Short Description + Long Present'

    WHEN [Short description] IS NULL
         AND [Short description 2] IS NOT NULL
         AND [Long description] IS NOT NULL THEN 'Short Description 2 + Long Present'

    ELSE 'Other / Mixed'
  END AS Description_Completeness_Combination,
  COUNT(*) AS Characteristics_Count
FROM productCatalogDf
GROUP BY Description_Completeness_Combination
ORDER BY Characteristics_Count DESC;
"""

result = pandasql_instance(query, locals())
result

Unnamed: 0,Description_Completeness_Combination,Characteristics_Count
0,Short Description + Long Present,168
1,Complete Descriptions,109
2,Other / Mixed,25
3,Short Description 2 + Long Present,2
4,Only Short Description Present,2
5,Only Short Description 2 Present,1
6,Only Long Description Present,1


<div style="background-color: #f5f5f5; border-left: 4px solid #3498db; padding: 15px; margin-bottom: 20px; border-radius: 0 8px 8px 0;">
 
This analysis explores the interdependency and co-occurrence of key descriptive fields for each product in the catalog.

- The most common scenario (**168 products**) is having both a short description and a long description present, but missing short description 2.
- **109 products** have all three descriptions (short, short 2, and long) fully populated, representing the most complete descriptive records.
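- **25 products** fall into "Other / Mixed". Since the query names every other combination explicitly, these are products with both short descriptions present but no long description.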
- Very few products have only one description present:  
  - Only short description 2 (1 product)
  - Only long description (1 product)
  - Only short description (2 products)
- A small number of products have combinations like "Short Description 2 + Long Present" (2 products).
- **Importantly, there are no products completely missing all three descriptive fields.** This means that even if one or two descriptions are absent, at least one form of descriptive information is available for each product.
 
While short description 2 is often missing, the catalog is robust in that every product retains at least some descriptive content. This redundancy ensures that users can still access essential product information, even if not all description fields are fully populated.
</div>
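
The `CASE` expression in the query names seven of the eight possible combinations and leaves the last one to the `ELSE` branch. As an alternative sketch, grouping directly on three presence flags enumerates every combination without a catch-all label. The alias names `has_short`, `has_short_2`, `has_long` and the toy rows are assumptions for illustration; `sqlite3` stands in for the notebook's `pandasql_instance` helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE productCatalogDf "
    "([Short description] TEXT, [Short description 2] TEXT, [Long description] TEXT)"
)
# Toy data (invented for illustration only).
conn.executemany(
    "INSERT INTO productCatalogDf VALUES (?, ?, ?)",
    [
        ("s", None, "l"),
        ("s", None, "l"),
        ("s", "s2", "l"),
        ("s", "s2", None),
    ],
)

query = """
SELECT
  [Short description]   IS NOT NULL AS has_short,
  [Short description 2] IS NOT NULL AS has_short_2,
  [Long description]    IS NOT NULL AS has_long,
  COUNT(*) AS Products_Count
FROM productCatalogDf
GROUP BY has_short, has_short_2, has_long
ORDER BY Products_Count DESC, has_short_2, has_long
"""

# Each row: (has_short, has_short_2, has_long, count); flags are 1 or 0.
combos = conn.execute(query).fetchall()
print(combos)
# [(1, 0, 1, 2), (1, 1, 0, 1), (1, 1, 1, 1)]
```

Only combinations that actually occur appear in the result, so a missing `(0, 0, 0)` row confirms that no product lacks all three descriptions.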

<div style="background-color: #e8f4f8; border-left: 4px solid #3498db; padding: 18px; margin-bottom: 24px; border-radius: 0 8px 8px 0;">

## <span style="color:#2980b9;">Conclusion</span>

This notebook delivered a comprehensive, SQL-driven exploration of product catalog data quality across multiple manufacturers and key product attributes. Through a combination of manufacturer-level and product-level analyses, it identified critical data gaps, benchmarked manufacturers, and highlighted both strengths and weaknesses in catalog completeness.

**Key achievements include:**
- Systematic assessment of missing and complete fields for each manufacturer and attribute
- Calculation of product-level data quality scores, enabling granular benchmarking and targeted improvement
- Discovery of interdependencies among descriptive fields, showing that even when some descriptions are missing, at least one is present to maintain product clarity
- Identification of specific fields (such as ETIM classification, delivery time, and certain dimensions) that most frequently contribute to lower data quality
- Actionable recommendations for focusing data enrichment efforts where they will have the greatest impact

Overall, this analysis provides business stakeholders with clear, actionable insights to prioritize catalog data improvements, enhance product discoverability, and drive measurable business value. The approach and results can serve as a blueprint for ongoing data quality monitoring and continuous improvement in product information management.

</div>