Exploratory Data Analysis (EDA)

Project: NordTech Order Data

This notebook explores the raw NordTech order dataset prior to any permanent cleaning or transformation.
The purpose of this exploratory data analysis is to understand the dataset’s structure, grain, data quality, and potential issues in order to design a robust and well-informed data cleaning and transformation pipeline in the next stage of the project.

Scope

This EDA focuses exclusively on identifying data quality issues, structural properties, and inconsistencies.
Any data type conversions performed in this notebook are temporary and used only for inspection purposes.
All permanent cleaning, normalization, and imputation decisions are deferred to the transformation pipeline.

In this notebook, we:
- Load and inspect the raw dataset
- Examine column data types and missing values
- Identify inconsistent formats (dates, prices, regions, payment methods)
- Check for duplicates and logical inconsistencies
- Document findings that inform the transformation step

Dataset Grain

Each row in the dataset represents a single order line (orderrad) within a customer order.
An order (order_id) may contain multiple order lines, each corresponding to a specific product (produkt_sku) and quantity (antal).
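
The stated grain can be checked by testing whether orderrad_id is unique per row and by counting rows per order_id. The sketch below is only an illustration: it runs on a small hypothetical sample frame rather than the real file, and the helper name check_grain is an assumption, not part of the project code.

```python
import pandas as pd


def check_grain(frame: pd.DataFrame) -> dict:
    """Summarize how well orderrad_id works as a row-level key."""
    rows_per_order = frame.groupby("order_id")["orderrad_id"].size()
    return {
        "rows": len(frame),
        "unique_order_lines": int(frame["orderrad_id"].nunique()),
        # Rows whose orderrad_id already appeared earlier in the frame.
        "repeated_order_line_rows": int(frame["orderrad_id"].duplicated().sum()),
        "max_rows_per_order": int(rows_per_order.max()),
    }


# Hypothetical sample: the last row repeats an order line id.
sample = pd.DataFrame({
    "order_id": ["ORD-1", "ORD-2", "ORD-2", "ORD-2"],
    "orderrad_id": ["ORD-1-1", "ORD-2-1", "ORD-2-2", "ORD-2-2"],
})
grain_report = check_grain(sample)
print(grain_report)
```

If the number of unique order line ids is lower than the row count, some order lines are repeated and the grain does not hold strictly for every row.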

1. Load Raw Dataset

In [20]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/raw/nordtech_data.csv')

2. Initial Data Inspection

In [2]:
df.head()

Unnamed: 0,order_id,orderrad_id,orderdatum,leveransdatum,produkt_sku,produktnamn,kategori,antal,pris_per_enhet,region,kundtyp,betalmetod,kund_id,leveransstatus,recension_text,recensionsdatum,betyg
0,ORD-2024-00001,ORD-2024-00001-1,2024-05-19,2024-05-22,SKU-WC001,Webbkamera HD,Tillbehör,1,SEK 799,Uppsala,Privat,Kort,KND-53648,Levererad,,,
1,ORD-2024-00002,ORD-2024-00002-1,2024-12-02,5 december 2024,SKU-HB001,USB-C Hub 7-port,Tillbehör,1,549.00,Göteborg,Privat,Swish,KND-84095,Levererad,,,
2,ORD-2024-00003,ORD-2024-00003-1,2024-12-31,2025-01-03,SKU-SD001,Extern SSD 1TB,Lagring,1,1199.00,,Företag,Faktura,KND-91748,Levererad,Stämmer inte överens med produktbeskrivningen.,2025-01-12,2.0
3,ORD-2024-00003,ORD-2024-00003-2,2024-12-31,2025-01-03,SKU-SD002,Extern SSD 500GB,Lagring,10,699 kr,Stockholm,Företag,FAKTURA,KND-91748,Mottagen,"Leveransen tog lite längre än utlovat, men pro...",2025-01-14,3.0
4,ORD-2024-00003,ORD-2024-00003-3,2024-12-31,2025-01-03,SKU-MS001,Trådlös Mus X1,Tillbehör,1,399.00,Stockholm,Företag,Faktura,KND-91748,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2767 entries, 0 to 2766
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   order_id         2767 non-null   object 
 1   orderrad_id      2767 non-null   object 
 2   orderdatum       2767 non-null   object 
 3   leveransdatum    2767 non-null   object 
 4   produkt_sku      2767 non-null   object 
 5   produktnamn      2767 non-null   object 
 6   kategori         2767 non-null   object 
 7   antal            2767 non-null   object 
 8   pris_per_enhet   2767 non-null   object 
 9   region           2612 non-null   object 
 10  kundtyp          2767 non-null   object 
 11  betalmetod       2651 non-null   object 
 12  kund_id          2767 non-null   object 
 13  leveransstatus   2673 non-null   object 
 14  recension_text   1355 non-null   object 
 15  recensionsdatum  1355 non-null   object 
 16  betyg            1355 non-null   float64
dtypes: float64(1), object(16)

In [4]:
df.shape

(2767, 17)

Finding:
- The dataset contains a mix of numerical, categorical, textual, and date-related fields.
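- The date columns (orderdatum, leveransdatum, recensionsdatum), the price column (pris_per_enhet), and the quantity column (antal) are all stored as object, so none of them can be used in calculations as loaded.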
- The dataset includes multiple order lines per order, not one row per order.

Implication:
Data type validation and conversion will be required during transformation.
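
One way to see what blocks a conversion, without changing the data, is to coerce a column temporarily and list the raw values that fail. The following is a minimal sketch using pd.to_numeric with errors="coerce". The sample values are hypothetical because the actual contents of antal are not shown in this notebook, and coercion_report is an illustrative helper name.

```python
import pandas as pd


def coercion_report(series: pd.Series) -> pd.Series:
    """Return the distinct raw values that fail numeric conversion."""
    converted = pd.to_numeric(series, errors="coerce")
    # A failure is a value that was present but became NaN after coercion.
    failed = converted.isna() & series.notna()
    return series[failed].drop_duplicates()


# Hypothetical values, not taken from the real antal column.
sample_antal = pd.Series(["1", "10", "two", "3 st", None, "2"])
unparseable = coercion_report(sample_antal)
print(unparseable.tolist())
```

The conversion result is discarded here, which keeps the check in line with the scope above: types are converted for inspection only.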

3. Missing Values Analysis

In [5]:
df.isna().sum()

order_id              0
orderrad_id           0
orderdatum            0
leveransdatum         0
produkt_sku           0
produktnamn           0
kategori              0
antal                 0
pris_per_enhet        0
region              155
kundtyp               0
betalmetod          116
kund_id               0
leveransstatus       94
recension_text     1412
recensionsdatum    1412
betyg              1412
dtype: int64

Finding:
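Missing values are concentrated in six columns: region (155 rows, 5.6%), betalmetod (116 rows, 4.2%), leveransstatus (94 rows, 3.4%), and the three review columns recension_text, recensionsdatum, and betyg (1412 rows each, 51.0%). They affect both numerical (betyg) and categorical features.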

Implication:
A structured and type-aware missing value handling strategy is required.
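
Before choosing a strategy it can help to look at missing shares per column and to test whether related columns are missing in the same rows. The sketch below shows one possible way to do both. It runs on a small hypothetical frame, and the helper names missing_summary and missing_together are assumptions made for illustration.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Missing count and percentage per column."""
    missing = frame.isna().sum()
    return pd.DataFrame({
        "missing": missing,
        "missing_pct": (missing / len(frame) * 100).round(1),
    })


def missing_together(frame: pd.DataFrame, columns: list) -> bool:
    """True when every row has either all or none of the columns missing."""
    flags = frame[columns].isna()
    return bool((flags.all(axis=1) | ~flags.any(axis=1)).all())


# Hypothetical sample, not taken from the real dataset.
sample = pd.DataFrame({
    "region": ["Uppsala", None, "Stockholm", "Malmö"],
    "recension_text": [None, "Bra", None, None],
    "betyg": [None, 4.0, None, None],
})
report = missing_summary(sample)
reviews_aligned = missing_together(sample, ["recension_text", "betyg"])
print(report)
print(reviews_aligned)
```

If the review columns turn out to be missing together on the real data, they can be handled as one group (for example "no review submitted") rather than imputed column by column.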

In [6]:
# Column Summary Table
summary = pd.DataFrame({
    "dtype": df.dtypes,
    "missing": df.isna().sum(),
    "unique": df.nunique()
})
summary

Unnamed: 0,dtype,missing,unique
order_id,object,0,1657
orderrad_id,object,0,2700
orderdatum,object,0,536
leveransdatum,object,0,544
produkt_sku,object,0,17
produktnamn,object,0,17
kategori,object,0,5
antal,object,0,22
pris_per_enhet,object,0,76
region,object,155,36


4. Date Columns Investigation

In [7]:
date_cols = ["orderdatum", "leveransdatum", "recensionsdatum"]

for col in date_cols:
    print(f"\nUnique samples from {col}:")
    print(df[col].dropna().astype(str).unique()[:20])


Unique samples from orderdatum:
['2024-05-19' '2024-12-02' '2024-12-31' '2024-04-22' '2024-07-01'
 '2024-03-10' '2024-06-16' '2024-08-07' '2024-06-10' '2024/06/10'
 '2024-10-26' '2024-07-26' '2024-04-08' '2024-02-23' '2024-05-14'
 '2024-11-25' '2024-04-27' '2024-10-24' '2024-12-05' '2024-06-20']

Unique samples from leveransdatum:
['2024-05-22' '5 december 2024' '2025-01-03' '2024-04-26' '2024-07-05'
 '2024-03-12' '2024-06-20' '2024-08-14' '2024-06-12' '2024-10-30'
 '2024-07-28' '2024-04-11' '2024-02-25' '2024-05-16' '2024-11-27'
 '2024-05-01' '2024/10/27' '2024-12-07' '2024-06-22' '2024-11-02']

Unique samples from recensionsdatum:
['2025-01-12' '2025-01-14' '2024-07-12' '2024-03-19' '2024-08-24'
 '2024-08-21' '2024-06-15' '2024-05-23' '7 maj 2024' '2024-12-08'
 '2024-12-16' '2024-07-06' '2024-07-21' '2024-12-09' '2024-03-17'
 '2024-09-26' '2024-11-08' '2024-02-15' '2024-02-05' '2024-08-30']


Finding:
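- Multiple date formats exist across all date columns: ISO dates ("2024-05-19"), slash-separated dates ("2024/06/10"), and Swedish long-form dates ("5 december 2024", "7 maj 2024").
- Some values cannot be parsed directly; Swedish month names such as "maj" are not recognized by a default English-language date parser.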
- Some delivery dates occur before order dates.

Implication:
A robust date-parsing function and logical validation rules are required during transformation.
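
As a rough sketch of what such a function and rule could look like, the code below handles the three formats seen in the samples above and then flags rows where delivery precedes the order. It is not the project's transformation code: parse_mixed_date is a hypothetical helper, the sample rows are invented, and formats not shown in the samples would come back as NaT.

```python
import re
import pandas as pd

# Swedish month names mapped to month numbers.
SWEDISH_MONTHS = {
    "januari": 1, "februari": 2, "mars": 3, "april": 4, "maj": 5, "juni": 6,
    "juli": 7, "augusti": 8, "september": 9, "oktober": 10, "november": 11,
    "december": 12,
}
LONG_DATE = re.compile(r"^(\d{1,2})\s+([a-zåäö]+)\s+(\d{4})$")


def parse_mixed_date(value):
    """Parse one raw date value; return pd.NaT when it cannot be parsed."""
    if pd.isna(value):
        return pd.NaT
    text = str(value).strip().lower()
    match = LONG_DATE.match(text)
    if match:
        # Long form such as "7 maj 2024": rebuild it as an ISO string.
        day, month_name, year = match.groups()
        month = SWEDISH_MONTHS.get(month_name)
        if month is None:
            return pd.NaT
        text = f"{year}-{month:02d}-{int(day):02d}"
    else:
        # Slash form such as "2024/06/10" becomes "2024-06-10".
        text = text.replace("/", "-")
    return pd.to_datetime(text, format="%Y-%m-%d", errors="coerce")


# Hypothetical rows; the last one has a delivery date before its order date.
sample = pd.DataFrame({
    "orderdatum": ["2024-12-02", "2024/06/10", "2024-05-01"],
    "leveransdatum": ["5 december 2024", "2024-06-12", "2024-04-28"],
})
order_dates = pd.to_datetime(sample["orderdatum"].map(parse_mixed_date))
delivery_dates = pd.to_datetime(sample["leveransdatum"].map(parse_mixed_date))
delivered_before_ordered = sample[delivery_dates < order_dates]
print(delivered_before_ordered)
```

Rows where either date is NaT never satisfy the comparison, so unparseable dates need their own count in addition to this check.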

5. Price Format Inspection

In [9]:
df["pris_per_enhet"].astype(str).unique()[:30]

array(['SEK 799', '549.00', '1199.00', '699 kr', '399.00', '799.00',
       '1899.00', '599.00', '4999.00', '5999.00', '18999.00', '1299.00',
       '14999.00', '699.00', '2499.00', '7999.00', '499.00', '4999 kr',
       '599 kr', '899.00', '899 kr', '799:-', '18999 kr', '1 899,00',
       '7999 kr', '399 kr', '599:-', 'SEK 1899', '549 kr', '1299:-'],
      dtype=object)

Finding:
Prices are stored in inconsistent textual formats (currency symbols, spacing, and decimal separators),
e.g. "SEK 799", "699 kr", "599:-", "1 899,00".

Implication:
Price values must be standardized and converted to a numerical format before analysis.
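
A minimal sketch of such a standardization, covering only the variants visible in the output above ("SEK 799", "699 kr", "599:-", "1 899,00", "549.00"), could look as follows. The function name parse_price is hypothetical, and the sketch assumes that a comma is always a decimal separator and never a thousands separator, which should be verified against the full column.

```python
import re
import pandas as pd


def parse_price(value) -> float:
    """Convert one raw price string to a float; return NaN if it does not fit."""
    if pd.isna(value):
        return float("nan")
    text = str(value).lower()
    # Drop currency markers: "sek", "kr", and the ":-" suffix.
    text = re.sub(r"sek|kr|:-", "", text)
    # Drop all whitespace, including the thousands space in "1 899,00".
    text = re.sub(r"\s+", "", text)
    # Assumption: a comma is a decimal separator.
    text = text.replace(",", ".")
    if not re.fullmatch(r"\d+(\.\d+)?", text):
        return float("nan")
    return float(text)


raw_prices = pd.Series(["SEK 799", "699 kr", "599:-", "1 899,00", "549.00"])
parsed_prices = raw_prices.map(parse_price)
print(parsed_prices.tolist())
```

Anything that does not reduce to plain digits with an optional decimal part is returned as NaN, so unexpected formats surface as missing values instead of being silently misread.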

6. Region Values Inspection

In [17]:
df["region"].unique()

array(['Uppsala', 'Göteborg', nan, 'Stockholm', 'Örebro', 'örebro',
       'Orebro', 'Norrland', 'Linköping', 'Malmö', 'Västerås', 'GÖTEBORG',
       'LINKÖPING', 'stockholm', 'Gothenburg', 'STHLM', 'STOCKHOLM',
       'malmo', 'linköping', 'uppsala', 'Sthlm', 'GBGB', 'västerås',
       'MALMÖ', 'Sthml', 'Gbg', 'NORRLAND', 'ÖREBRO', 'norrland', 'Norr',
       'Vasteras', 'UPPSALA', 'göteborg', 'Malmo', 'Linkoping',
       'VÄSTERÅS', 'malmö'], dtype=object)

Finding:
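Region names are inconsistently formatted: the 36 distinct non-null values differ in casing ("Stockholm", "STOCKHOLM", "stockholm"), missing diacritics ("Orebro", "malmo", "Vasteras"), abbreviations ("STHLM", "Gbg", "Norr"), an English name ("Gothenburg"), and apparent typos ("Sthml", "GBGB").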

Implication:
Region values must be standardized during data cleaning to ensure correct aggregation and reporting.
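
One possible approach is to fold case and diacritics into a plain key and then resolve abbreviations through an alias table. The sketch below illustrates this and is not the final mapping. The names normalize_region, CANONICAL, and ALIASES are hypothetical, and treating "Sthml" and "GBGB" as typos and "Norr" as Norrland is an assumption that should be confirmed with the data owner.

```python
import unicodedata
import pandas as pd

# Folded key -> display name.
CANONICAL = {
    "stockholm": "Stockholm", "goteborg": "Göteborg", "malmo": "Malmö",
    "uppsala": "Uppsala", "orebro": "Örebro", "linkoping": "Linköping",
    "vasteras": "Västerås", "norrland": "Norrland",
}
# Assumed abbreviations, English names, and typos -> folded key.
ALIASES = {
    "sthlm": "stockholm", "sthml": "stockholm",
    "gbg": "goteborg", "gbgb": "goteborg", "gothenburg": "goteborg",
    "norr": "norrland",
}


def fold(text: str) -> str:
    """Lowercase the text and strip diacritics, e.g. 'ÖREBRO' -> 'orebro'."""
    decomposed = unicodedata.normalize("NFKD", text.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_region(value):
    """Return the canonical region name, or the stripped input if unmapped."""
    if pd.isna(value):
        return None
    key = fold(str(value))
    key = ALIASES.get(key, key)
    return CANONICAL.get(key, str(value).strip())


raw_regions = pd.Series(["GÖTEBORG", "Gbg", "Sthml", "malmo", "Vasteras", None])
clean_regions = raw_regions.map(normalize_region)
print(clean_regions.tolist())
```

Unmapped values are returned unchanged rather than dropped, so any new variant remains visible in a later unique() check.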

7. Payment Method Inspection

In [21]:
df["betalmetod"].unique()

array(['Kort', 'Swish', 'Faktura', 'FAKTURA', 'Mobilbetalning', 'faktura',
       'KORT', 'SWISH', 'Kreditkort', 'swish', nan, 'Invoice', 'Visa',
       'kort', 'Mastercard'], dtype=object)

Finding:
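Payment methods are inconsistently labeled: the same method appears in different casings ("Kort", "KORT", "kort"), under alternative names ("Kreditkort", "Visa", "Mastercard", "Mobilbetalning"), and in English ("Invoice").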

Implication:
Payment methods require standardization to avoid duplicated categories.

In [22]:
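def clean_payment(df: pd.DataFrame) -> pd.DataFrame:
    """Temporarily map raw payment labels to a few categories for inspection."""
    df_cleaned = df.copy()
    if "betalmetod" not in df_cleaned.columns:
        return df_cleaned
    # Keys are lowercase because values are lowercased before mapping.
    mapping = {
        "kort": "card", "kreditkort": "card", "visa": "card", "mastercard": "card",
        "swish": "swish", "mobilbetalning": "swish",
        "faktura": "invoice",
    }
    # The .str accessor leaves NaN untouched, so fillna can label missing values.
    # Casting with astype(str) first would turn NaN into the string "nan".
    df_cleaned["betalmetod"] = (
        df_cleaned["betalmetod"].str.strip().str.lower().replace(mapping).fillna("unknown")
    )
    return df_cleaned


df_cleaned = clean_payment(df)
df_cleaned["betalmetod"].unique()

array(['card', 'swish', 'invoice', 'unknown'], dtype=object)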