Exploratory Data Analysis (EDA)

Project: NordTech Order Data

This notebook explores the raw NordTech order dataset before any permanent cleaning or transformation.
The goal is to understand the dataset's structure, grain, and data quality issues, so that the cleaning and transformation pipeline in the next stage of the project is designed around documented findings rather than assumptions.

Scope

This EDA focuses exclusively on identifying data quality issues, structural properties, and inconsistencies.
Any data type conversions performed in this notebook are temporary and used only for inspection purposes.
All permanent cleaning, normalization, and imputation decisions are deferred to the transformation pipeline.
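
As an illustration of a temporary, inspection-only conversion, here is a minimal sketch on a small hand-made sample rather than the real file. The names `raw_sample` and `inspect_view` are hypothetical and are not used elsewhere in this notebook; the date strings copy formats visible in the first rows of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the raw frame; values copy formats seen in the data.
raw_sample = pd.DataFrame(
    {"leveransdatum": ["2024-05-22", "5 december 2024", "2025-01-03"]}
)

# Convert on a copy so the raw values stay available for later checks.
inspect_view = raw_sample.copy()
inspect_view["leveransdatum"] = pd.to_datetime(
    inspect_view["leveransdatum"], format="%Y-%m-%d", errors="coerce"
)

# Values that do not match the ISO format become NaT and can be listed for review.
non_iso = raw_sample.loc[inspect_view["leveransdatum"].isna(), "leveransdatum"]
print(non_iso.tolist())
```

Working on a copy keeps the raw strings intact, so the non-conforming values can still be documented verbatim for the transformation step.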

In this notebook, we:
- Load and inspect the raw dataset
- Examine column data types and missing values
- Identify inconsistent formats (dates, prices, regions, payment methods)
- Check for duplicates and logical inconsistencies
- Document findings that inform the transformation step
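
The checks listed above could be sketched roughly as follows. The sample frame and variable names are assumptions made for illustration; only the column names and the "Faktura" / "FAKTURA" spelling variants come from the dataset itself.

```python
import pandas as pd

# Hypothetical four-row sample; the real checks would run on df.
sample = pd.DataFrame(
    {
        "orderrad_id": ["A-1", "A-2", "A-2", "B-1"],
        "region": ["Stockholm", "Stockholm", "Stockholm", None],
        "betalmetod": ["Faktura", "FAKTURA", "FAKTURA", "Kort"],
    }
)

# Missing values per column.
missing_per_column = sample.isna().sum()

# Fully identical rows (every column equal to an earlier row).
duplicate_rows = int(sample.duplicated().sum())

# Spelling variants that collapse once case is ignored.
raw_variants = sample["betalmetod"].nunique()
case_folded_variants = sample["betalmetod"].str.lower().nunique()

print(missing_per_column["region"], duplicate_rows, raw_variants, case_folded_variants)
```

A gap between `raw_variants` and `case_folded_variants` hints at inconsistent casing; other inconsistencies, such as trailing whitespace or abbreviations, would need their own comparison.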

Dataset Grain

Each row in the dataset represents a single order line (orderrad, identified by orderrad_id) within a customer order.
An order (order_id) may contain multiple order lines, each corresponding to a specific product (produkt_sku) and quantity (antal).
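
The stated grain can be checked rather than assumed. The sketch below uses identifiers copied from the first rows of the dataset; the frame `lines` and the prefix check are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Identifiers taken from the first rows shown by df.head().
lines = pd.DataFrame(
    {
        "order_id": [
            "ORD-2024-00001",
            "ORD-2024-00003",
            "ORD-2024-00003",
            "ORD-2024-00003",
        ],
        "orderrad_id": [
            "ORD-2024-00001-1",
            "ORD-2024-00003-1",
            "ORD-2024-00003-2",
            "ORD-2024-00003-3",
        ],
    }
)

# If the grain is one row per order line, orderrad_id must be unique.
grain_is_unique = lines["orderrad_id"].is_unique

# Number of order lines per order.
lines_per_order = lines.groupby("order_id").size()

# Assumed convention: each orderrad_id starts with its order_id plus a hyphen.
prefix_ok = all(
    line_id.startswith(order_id + "-")
    for order_id, line_id in zip(lines["order_id"], lines["orderrad_id"])
)

print(grain_is_unique, lines_per_order.to_dict(), prefix_ok)
```

On the full dataset, a `False` from `is_unique` would point to duplicated order lines that need investigation before any aggregation.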

1. Load Raw Dataset

In [1]:
import pandas as pd
import numpy as np

# Load the raw export as-is; no cleaning is applied at this stage.
df = pd.read_csv("../data/raw/nordtech_data.csv")

2. Initial Data Inspection

In [2]:
df.head()

Unnamed: 0,order_id,orderrad_id,orderdatum,leveransdatum,produkt_sku,produktnamn,kategori,antal,pris_per_enhet,region,kundtyp,betalmetod,kund_id,leveransstatus,recension_text,recensionsdatum,betyg
0,ORD-2024-00001,ORD-2024-00001-1,2024-05-19,2024-05-22,SKU-WC001,Webbkamera HD,Tillbehör,1,SEK 799,Uppsala,Privat,Kort,KND-53648,Levererad,,,
1,ORD-2024-00002,ORD-2024-00002-1,2024-12-02,5 december 2024,SKU-HB001,USB-C Hub 7-port,Tillbehör,1,549.00,Göteborg,Privat,Swish,KND-84095,Levererad,,,
2,ORD-2024-00003,ORD-2024-00003-1,2024-12-31,2025-01-03,SKU-SD001,Extern SSD 1TB,Lagring,1,1199.00,,Företag,Faktura,KND-91748,Levererad,Stämmer inte överens med produktbeskrivningen.,2025-01-12,2.0
3,ORD-2024-00003,ORD-2024-00003-2,2024-12-31,2025-01-03,SKU-SD002,Extern SSD 500GB,Lagring,10,699 kr,Stockholm,Företag,FAKTURA,KND-91748,Mottagen,"Leveransen tog lite längre än utlovat, men pro...",2025-01-14,3.0
4,ORD-2024-00003,ORD-2024-00003-3,2024-12-31,2025-01-03,SKU-MS001,Trådlös Mus X1,Tillbehör,1,399.00,Stockholm,Företag,Faktura,KND-91748,,,,


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2767 entries, 0 to 2766
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   order_id         2767 non-null   object 
 1   orderrad_id      2767 non-null   object 
 2   orderdatum       2767 non-null   object 
 3   leveransdatum    2767 non-null   object 
 4   produkt_sku      2767 non-null   object 
 5   produktnamn      2767 non-null   object 
 6   kategori         2767 non-null   object 
 7   antal            2767 non-null   object 
 8   pris_per_enhet   2767 non-null   object 
 9   region           2612 non-null   object 
 10  kundtyp          2767 non-null   object 
 11  betalmetod       2651 non-null   object 
 12  kund_id          2767 non-null   object 
 13  leveransstatus   2673 non-null   object 
 14  recension_text   1355 non-null   object 
 15  recensionsdatum  1355 non-null   object 
 16  betyg            1355 non-null   float64
dtypes: float64(1), object(16)

In [4]:
df.shape

(2767, 17)

Finding:
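- The dataset contains a mix of numerical, categorical, textual, and date-related fields, but 16 of the 17 columns are stored as object; only betyg is numeric (float64).
- The date columns (orderdatum, leveransdatum, recensionsdatum), pris_per_enhet, and antal are all stored as object. The first rows already show mixed formats, such as "5 december 2024" next to ISO dates, and "SEK 799", "699 kr", and "549.00" for prices.
- The dataset includes multiple order lines per order, not one row per order. For example, ORD-2024-00003 appears on three rows.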

Implication:
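Dates, prices, and quantities will need to be validated and converted to proper types during transformation, and any order-level aggregation must account for the order-line grain.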
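
To estimate how much conversion work pris_per_enhet needs, one option is to count the values that are not already plain numbers and then test a candidate cleaning rule. This is a minimal sketch under the assumption that "SEK" and "kr" are the only currency markers; the five values are copied from the first rows of the dataset, and the full column may contain other formats.

```python
import pandas as pd

# Price strings copied from the first five rows of the dataset.
prices = pd.Series(["SEK 799", "549.00", "1199.00", "699 kr", "399.00"])

# Which values are already plain numbers such as "549.00"?
is_plain_number = prices.str.fullmatch(r"\d+(?:\.\d+)?")

# Candidate rule (assumption): drop "SEK"/"kr" markers, trim, then convert.
stripped = prices.str.replace(r"\b(?:sek|kr)\b", "", case=False, regex=True).str.strip()
numeric = pd.to_numeric(stripped, errors="coerce")

# Anything still NaN after the rule is a format the rule does not cover.
uncovered = prices[numeric.isna()]

print(int(is_plain_number.sum()), numeric.tolist(), uncovered.tolist())
```

Using `errors="coerce"` keeps the inspection from failing on unexpected formats and instead surfaces them in `uncovered`, which can be documented for the transformation pipeline.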