# 🛠️ Nordtech ETL Pipeline Development Notebook

This notebook demonstrates and validates the Nordtech ETL pipeline.

It includes:

- Loading the raw dataset  
- Step-by-step cleaning with BEFORE/AFTER comparisons  
- Running the full transformation pipeline  
- Validating the cleaned dataset  
- Saving the final processed data  

This notebook is for *development and verification*, not KPI analysis.

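Fix the import path. This notebook lives inside /notebooks, so the project root (the folder containing src/) must be added to `sys.path` before the `src` modules can be imported.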

In [2]:
import sys
from pathlib import Path

# Automatically find the project root (folder containing src/)
current = Path().resolve()
while current != current.parent:
    if (current / "src").exists():
        sys.path.append(str(current))
        print("Project root added:", current)
        break
    current = current.parent

Project root added: C:\Users\zinah\nordtech_etl_project
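
The root lookup above can also be wrapped in a small reusable helper. The following is a minimal sketch, not the project's own code: the function names `find_project_root` and `add_to_sys_path` are hypothetical, and the sketch assumes the root is identified by a `src` directory, as in the cell above. Unlike the loop above, it also avoids adding the same path twice when the cell is re-run.

```python
import sys
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


def add_to_sys_path(root: Path) -> bool:
    """Put `root` at the front of sys.path; return False if it was already listed."""
    root_str = str(root)
    if root_str in sys.path:
        return False
    sys.path.insert(0, root_str)
    return True


# Usage: search upward from the current working directory.
root = find_project_root(Path.cwd())
if root is not None:
    add_to_sys_path(root)
```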


In [3]:
import pandas as pd

from src.extract import load_raw_data
from src.transform import (
    clean_column_names,
    clean_id_columns,
    clean_date,
    fix_reversed_dates,
    clean_kundtyp,
    clean_antal,
    clean_prices,
    clean_payment,
    clean_leveransstatus,
    clean_region,
    clean_betyg,
    clean_recension_text,
    remove_duplicates,
    transform_data,
)
from src.load import load_clean_data

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 140)


## 1. 🔍 Load Raw Data

We begin by loading the raw Nordtech dataset using the extraction module.

In [4]:
df_raw = load_raw_data()
df_raw.head()

[EXTRACT] Loading raw dataset from: C:\Users\zinah\nordtech_etl_project\data\raw\nordtech_data.csv
[EXTRACT] Loaded CSV: C:\Users\zinah\nordtech_etl_project\data\raw\nordtech_data.csv (rows=2767, cols=17)


Unnamed: 0,order_id,orderrad_id,orderdatum,leveransdatum,produkt_sku,produktnamn,kategori,antal,pris_per_enhet,region,kundtyp,betalmetod,kund_id,leveransstatus,recension_text,recensionsdatum,betyg
0,ORD-2024-00001,ORD-2024-00001-1,2024-05-19,2024-05-22,SKU-WC001,Webbkamera HD,Tillbehör,1,SEK 799,Uppsala,Privat,Kort,KND-53648,Levererad,,,
1,ORD-2024-00002,ORD-2024-00002-1,2024-12-02,5 december 2024,SKU-HB001,USB-C Hub 7-port,Tillbehör,1,549.00,Göteborg,Privat,Swish,KND-84095,Levererad,,,
2,ORD-2024-00003,ORD-2024-00003-1,2024-12-31,2025-01-03,SKU-SD001,Extern SSD 1TB,Lagring,1,1199.00,,Företag,Faktura,KND-91748,Levererad,Stämmer inte överens med produktbeskrivningen.,2025-01-12,2.0
3,ORD-2024-00003,ORD-2024-00003-2,2024-12-31,2025-01-03,SKU-SD002,Extern SSD 500GB,Lagring,10,699 kr,Stockholm,Företag,FAKTURA,KND-91748,Mottagen,"Leveransen tog lite längre än utlovat, men pro...",2025-01-14,3.0
4,ORD-2024-00003,ORD-2024-00003-3,2024-12-31,2025-01-03,SKU-MS001,Trådlös Mus X1,Tillbehör,1,399.00,Stockholm,Företag,Faktura,KND-91748,,,,


### 1.1 Raw Data Overview

We inspect:

- Shape  
- Column names  
- Data types  
- Missing values  
- Basic statistics  

In [5]:
print("Shape:", df_raw.shape)
df_raw.info()
df_raw.isna().sum()  # not rendered: a notebook cell displays only its last expression
df_raw.describe(include="all")

Shape: (2767, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2767 entries, 0 to 2766
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   order_id         2767 non-null   object 
 1   orderrad_id      2767 non-null   object 
 2   orderdatum       2767 non-null   object 
 3   leveransdatum    2767 non-null   object 
 4   produkt_sku      2767 non-null   object 
 5   produktnamn      2767 non-null   object 
 6   kategori         2767 non-null   object 
 7   antal            2767 non-null   object 
 8   pris_per_enhet   2767 non-null   object 
 9   region           2612 non-null   object 
 10  kundtyp          2767 non-null   object 
 11  betalmetod       2651 non-null   object 
 12  kund_id          2767 non-null   object 
 13  leveransstatus   2673 non-null   object 
 14  recension_text   1355 non-null   object 
 15  recensionsdatum  1355 non-null   object 
 16  betyg            1355 non-null   float64
d

Unnamed: 0,order_id,orderrad_id,orderdatum,leveransdatum,produkt_sku,produktnamn,kategori,antal,pris_per_enhet,region,kundtyp,betalmetod,kund_id,leveransstatus,recension_text,recensionsdatum,betyg
count,2767,2767,2767,2767,2767,2767,2767,2767.0,2767.0,2612,2767,2651,2767,2673,1355,1355,1355.0
unique,1657,2700,536,544,17,17,5,22.0,76.0,36,12,14,1644,14,45,442,
top,ORD-2024-00643,ORD-2024-01535-1,2024-12-08,2024-02-25,SKU-MS002,Ergonomisk Mus Pro,Tillbehör,1.0,699.0,Stockholm,Privat,Faktura,KND-60669,Levererad,Fantastisk produkt! Fungerar precis som utlovat.,2024-12-20,
freq,7,3,18,22,222,222,1202,1839.0,408.0,834,1535,938,7,2005,73,11,
mean,,,,,,,,,,,,,,,,,3.667159
std,,,,,,,,,,,,,,,,,1.26447
min,,,,,,,,,,,,,,,,,1.0
25%,,,,,,,,,,,,,,,,,3.0
50%,,,,,,,,,,,,,,,,,4.0
75%,,,,,,,,,,,,,,,,,5.0
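
The output above shows the `info()` and `describe()` results but no explicit missing-value table. One way to get one is sketched below. The helper name `missing_summary` is hypothetical and the demo frame is made up; it only reuses three column names from the dataset.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest gap first."""
    missing = df.isna().sum()
    summary = pd.DataFrame(
        {"missing": missing, "pct": (missing / len(df) * 100).round(1)}
    )
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up demo data, not rows from the Nordtech file.
demo = pd.DataFrame(
    {
        "order_id": ["A", "B", "C", "D"],
        "region": ["Stockholm", None, "Uppsala", None],
        "betyg": [4.0, None, None, None],
    }
)
summary = missing_summary(demo)
print(summary)
```

In the notebook, `display(missing_summary(df_raw))` would render the table even when it is not the last expression in the cell.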


## 2. 🧹 Step-by-Step Cleaning

We apply each cleaning function individually and show BEFORE/AFTER results.

In [6]:
# Initialize df_step
df_step = df_raw.copy()
df_step.head()

Unnamed: 0,order_id,orderrad_id,orderdatum,leveransdatum,produkt_sku,produktnamn,kategori,antal,pris_per_enhet,region,kundtyp,betalmetod,kund_id,leveransstatus,recension_text,recensionsdatum,betyg
0,ORD-2024-00001,ORD-2024-00001-1,2024-05-19,2024-05-22,SKU-WC001,Webbkamera HD,Tillbehör,1,SEK 799,Uppsala,Privat,Kort,KND-53648,Levererad,,,
1,ORD-2024-00002,ORD-2024-00002-1,2024-12-02,5 december 2024,SKU-HB001,USB-C Hub 7-port,Tillbehör,1,549.00,Göteborg,Privat,Swish,KND-84095,Levererad,,,
2,ORD-2024-00003,ORD-2024-00003-1,2024-12-31,2025-01-03,SKU-SD001,Extern SSD 1TB,Lagring,1,1199.00,,Företag,Faktura,KND-91748,Levererad,Stämmer inte överens med produktbeskrivningen.,2025-01-12,2.0
3,ORD-2024-00003,ORD-2024-00003-2,2024-12-31,2025-01-03,SKU-SD002,Extern SSD 500GB,Lagring,10,699 kr,Stockholm,Företag,FAKTURA,KND-91748,Mottagen,"Leveransen tog lite längre än utlovat, men pro...",2025-01-14,3.0
4,ORD-2024-00003,ORD-2024-00003-3,2024-12-31,2025-01-03,SKU-MS001,Trådlös Mus X1,Tillbehör,1,399.00,Stockholm,Företag,Faktura,KND-91748,,,,


### 2.1 Clean Column Names

In [7]:
before = list(df_step.columns)
df_step = clean_column_names(df_step)
after = list(df_step.columns)

print("Before:", before)
print("After:", after)

Before: ['order_id', 'orderrad_id', 'orderdatum', 'leveransdatum', 'produkt_sku', 'produktnamn', 'kategori', 'antal', 'pris_per_enhet', 'region', 'kundtyp', 'betalmetod', 'kund_id', 'leveransstatus', 'recension_text', 'recensionsdatum', 'betyg']
After: ['order_id', 'orderrad_id', 'orderdatum', 'leveransdatum', 'produkt_sku', 'produktnamn', 'kategori', 'antal', 'pris_per_enhet', 'region', 'kundtyp', 'betalmetod', 'kund_id', 'leveransstatus', 'recension_text', 'recensionsdatum', 'betyg']
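
The BEFORE and AFTER lists are identical here because the raw column names are already lowercase snake_case. The body of `clean_column_names` is not shown in this notebook, so the following is only a sketch of what such a step commonly does. The function name `normalize_column_names` and the messy demo headers are hypothetical.

```python
import re

import pandas as pd


def normalize_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Strip, lowercase, and snake_case every column name."""

    def normalize(name) -> str:
        text = str(name).strip().lower()
        # Collapse any run of non-word characters into a single underscore.
        text = re.sub(r"[^\w]+", "_", text)
        return text.strip("_")

    return df.rename(columns=normalize)


# Made-up headers showing a case where the step changes something.
messy = pd.DataFrame(columns=[" Order ID ", "Pris per enhet", "betyg"])
tidy = normalize_column_names(messy)
print(list(tidy.columns))
```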


### 2.2 Clean ID Columns

In [8]:
before = df_step[["order_id", "orderrad_id", "kund_id"]].head()
df_step = clean_id_columns(df_step)
after = df_step[["order_id", "orderrad_id", "kund_id"]].head()

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


Unnamed: 0,order_id,orderrad_id,kund_id
0,ORD-2024-00001,ORD-2024-00001-1,KND-53648
1,ORD-2024-00002,ORD-2024-00002-1,KND-84095
2,ORD-2024-00003,ORD-2024-00003-1,KND-91748
3,ORD-2024-00003,ORD-2024-00003-2,KND-91748
4,ORD-2024-00003,ORD-2024-00003-3,KND-91748



After:


Unnamed: 0,order_id,orderrad_id,kund_id
0,ORD-2024-00001,ORD-2024-00001-1,KND-53648
1,ORD-2024-00002,ORD-2024-00002-1,KND-84095
2,ORD-2024-00003,ORD-2024-00003-1,KND-91748
3,ORD-2024-00003,ORD-2024-00003-2,KND-91748
4,ORD-2024-00003,ORD-2024-00003-3,KND-91748


### 2.3 Clean Date Columns

In [9]:
before = df_step[["orderdatum", "leveransdatum", "recensionsdatum"]].head(10)
df_step = clean_date(df_step)
after = df_step[["orderdatum", "leveransdatum", "recensionsdatum"]].head(10)

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


Unnamed: 0,orderdatum,leveransdatum,recensionsdatum
0,2024-05-19,2024-05-22,
1,2024-12-02,5 december 2024,
2,2024-12-31,2025-01-03,2025-01-12
3,2024-12-31,2025-01-03,2025-01-14
4,2024-12-31,2025-01-03,
5,2024-04-22,2024-04-26,
6,2024-07-01,2024-07-05,
7,2024-07-01,2024-07-05,2024-07-12
8,2024-07-01,2024-07-05,
9,2024-07-01,2024-07-05,



After:


Unnamed: 0,orderdatum,leveransdatum,recensionsdatum
0,2024-05-19,2024-05-22,NaT
1,2024-12-02,2024-12-05,NaT
2,2024-12-31,2025-01-03,2025-01-12
3,2024-12-31,2025-01-03,2025-01-14
4,2024-12-31,2025-01-03,NaT
5,2024-04-22,2024-04-26,NaT
6,2024-07-01,2024-07-05,NaT
7,2024-07-01,2024-07-05,2024-07-12
8,2024-07-01,2024-07-05,NaT
9,2024-07-01,2024-07-05,NaT
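
Row 1 above shows a Swedish long-form date (`5 december 2024`) being converted to `2024-12-05`. How `clean_date` does this is not shown, so the following is a sketch of one possible approach: translate Swedish month names to numbers, then parse as ISO. The names `SWEDISH_MONTHS` and `parse_mixed_date` are hypothetical, and the sketch assumes only these two input formats occur.

```python
import re

import pandas as pd

# Swedish month names mapped to month numbers.
SWEDISH_MONTHS = {
    "januari": 1, "februari": 2, "mars": 3, "april": 4,
    "maj": 5, "juni": 6, "juli": 7, "augusti": 8,
    "september": 9, "oktober": 10, "november": 11, "december": 12,
}
# Matches strings such as "5 december 2024".
_DAY_MONTH_YEAR = re.compile(r"^(\d{1,2})\s+([a-zåäö]+)\s+(\d{4})$", re.IGNORECASE)


def parse_mixed_date(value):
    """Parse an ISO date or a 'day month-name year' string; return NaT on failure."""
    if pd.isna(value):
        return pd.NaT
    text = str(value).strip()
    match = _DAY_MONTH_YEAR.match(text)
    if match:
        day, month_name, year = match.groups()
        month = SWEDISH_MONTHS.get(month_name.lower())
        if month is None:
            return pd.NaT
        text = f"{year}-{month:02d}-{int(day):02d}"
    return pd.to_datetime(text, format="%Y-%m-%d", errors="coerce")


raw_dates = pd.Series(["2024-12-02", "5 december 2024", None, "not a date"])
parsed = pd.to_datetime(raw_dates.map(parse_mixed_date))
print(parsed)
```

Values that match neither format become `NaT` instead of raising, which mirrors the `errors="coerce"` style used elsewhere in this notebook.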


### 2.4 Fix Reversed Dates

In [10]:
# Create mask for reversed dates BEFORE fixing
mask_before = pd.to_datetime(df_step["leveransdatum"], errors="coerce") < \
              pd.to_datetime(df_step["orderdatum"], errors="coerce")

before = df_step.loc[mask_before, ["orderdatum", "leveransdatum"]].copy()

print("Before (rows where delivery date is earlier than order date):")
display(before)

# Apply the fix
df_step = fix_reversed_dates(df_step)

# Recalculate mask AFTER fixing
mask_after = pd.to_datetime(df_step["leveransdatum"], errors="coerce") < \
             pd.to_datetime(df_step["orderdatum"], errors="coerce")

after = df_step.loc[mask_before, ["orderdatum", "leveransdatum"]]

print("\nAfter (same rows after applying fix_reversed_dates):")
display(after)


Before (rows where delivery date is earlier than order date):


Unnamed: 0,orderdatum,leveransdatum
36,2024-12-09,2024-09-15
54,2024-02-17,2024-02-14
59,2024-11-01,2024-10-31
68,2024-10-15,2024-10-10
112,2024-04-13,2024-04-11
143,2024-04-26,2024-04-22
161,2024-07-10,2024-07-05
221,2024-11-03,2024-10-30
278,2024-05-14,2024-05-09
293,2024-11-27,2024-11-22



After (same rows after applying fix_reversed_dates):


Unnamed: 0,orderdatum,leveransdatum
36,2024-09-15,2024-12-09
54,2024-02-14,2024-02-17
59,2024-10-31,2024-11-01
68,2024-10-10,2024-10-15
112,2024-04-11,2024-04-13
143,2024-04-22,2024-04-26
161,2024-07-05,2024-07-10
221,2024-10-30,2024-11-03
278,2024-05-09,2024-05-14
293,2024-11-22,2024-11-27
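
In the output above, each reversed pair comes back with the two dates exchanged. A swap of that kind can be sketched as follows. This assumes both columns are already datetime typed, and the helper name `swap_reversed_dates` is hypothetical. The first two demo rows reuse the dates shown for rows 36 and 54.

```python
import pandas as pd


def swap_reversed_dates(df, start="orderdatum", end="leveransdatum"):
    """Swap start/end on rows where end is earlier than start."""
    out = df.copy()
    # Comparisons involving NaT are False, so rows with a missing date are left alone.
    reversed_mask = out[end] < out[start]
    out[start] = df[start].where(~reversed_mask, df[end])
    out[end] = df[end].where(~reversed_mask, df[start])
    return out, int(reversed_mask.sum())


demo = pd.DataFrame(
    {
        "orderdatum": pd.to_datetime(["2024-12-09", "2024-02-17", "2024-05-19", None]),
        "leveransdatum": pd.to_datetime(
            ["2024-09-15", "2024-02-14", "2024-05-22", "2024-07-05"]
        ),
    }
)
fixed, n_swapped = swap_reversed_dates(demo)
print(fixed)
print("Rows swapped:", n_swapped)

# After the fix, no row should have a delivery date before its order date.
still_reversed = int((fixed["leveransdatum"] < fixed["orderdatum"]).sum())
```

Swapping assumes the two values were entered in the wrong fields. Whether that holds for a case like row 36, where the dates are almost three months apart, is a judgment call that a sketch cannot settle.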


### 2.5 Clean Kundtyp

In [11]:
before = df_step["kundtyp"].value_counts(dropna=False)
df_step = clean_kundtyp(df_step)
after = df_step["kundtyp"].value_counts(dropna=False)

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


kundtyp
Privat       1535
Företag       783
privat         66
b2c            64
Konsument      60
PRIVAT         58
B2C            45
B2B            39
b2b            35
FÖRETAG        29
Firma          27
företag        26
Name: count, dtype: int64


After:


kundtyp
private     1828
business     939
Name: count, dtype: int64
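
The counts above are consistent with a case-insensitive lookup: the six private-style labels sum to 1828 and the six business-style labels to 939. A minimal sketch of such a lookup follows. `KUNDTYP_MAP` and `normalize_kundtyp` are hypothetical names, and the mapping is inferred from the counts rather than copied from `clean_kundtyp`.

```python
import pandas as pd

# Lowercased raw label -> cleaned category (inferred from the counts above).
KUNDTYP_MAP = {
    "privat": "private",
    "b2c": "private",
    "konsument": "private",
    "företag": "business",
    "b2b": "business",
    "firma": "business",
}


def normalize_kundtyp(series: pd.Series):
    """Map raw labels to categories and report labels that had no mapping."""
    key = series.str.strip().str.lower()
    cleaned = key.map(KUNDTYP_MAP)
    unmapped = sorted(key[cleaned.isna() & key.notna()].unique())
    return cleaned, unmapped


raw = pd.Series(
    ["Privat", "PRIVAT", "b2c", "Konsument", "Företag", "FÖRETAG", "B2B", "Firma", "Okänd"]
)
cleaned, unmapped = normalize_kundtyp(raw)
print(cleaned.value_counts(dropna=False))
print("Unmapped labels:", unmapped)
```

Returning the unmapped labels makes new spellings visible instead of silently turning them into missing values.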

### 2.6 Clean Antal

In [12]:
before = df_step["antal"].head()
df_step = clean_antal(df_step)
after = df_step["antal"].head()

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


0     1
1     1
2     1
3    10
4     1
Name: antal, dtype: object


After:


0     1
1     1
2     1
3    10
4     1
Name: antal, dtype: int64

### 2.7 Clean Prices

In [13]:
before = df_step["pris_per_enhet"].head()
df_step = clean_prices(df_step)
after = df_step["pris_per_enhet"].head()

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


0    SEK 799
1     549.00
2    1199.00
3     699 kr
4     399.00
Name: pris_per_enhet, dtype: object


After:


0     799.0
1     549.0
2    1199.0
3     699.0
4     399.0
Name: pris_per_enhet, dtype: float64
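
The five sample values above mix plain numbers with `SEK` prefixes and `kr` suffixes. One way to reach the same AFTER values is sketched below. The helper name `parse_price` is hypothetical, and the sketch assumes currency markers and whitespace are the only noise in the column.

```python
import re

import pandas as pd


def parse_price(series: pd.Series) -> pd.Series:
    """Strip 'SEK'/'kr' markers and whitespace, then convert to float."""
    text = series.astype(str).str.replace(
        r"sek|kr|\s", "", regex=True, flags=re.IGNORECASE
    )
    # Anything that is still not numeric becomes NaN rather than raising.
    return pd.to_numeric(text, errors="coerce")


raw_prices = pd.Series(["SEK 799", "549.00", "1199.00", "699 kr", "399.00"])
prices = parse_price(raw_prices)
print(prices)
```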

### 2.8 Clean Payment Method

In [14]:
before = df_step["betalmetod"].value_counts(dropna=False)
df_step = clean_payment(df_step)
after = df_step["betalmetod"].value_counts(dropna=False)

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


betalmetod
Faktura           938
Kort              775
Swish             550
NaN               116
Invoice            51
FAKTURA            50
faktura            44
Kreditkort         36
KORT               33
swish              31
SWISH              31
kort               30
Mobilbetalning     29
Visa               28
Mastercard         25
Name: count, dtype: int64


After:


betalmetod
invoice    1083
card        927
swish       641
unknown     116
Name: count, dtype: int64
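
Comparing two `value_counts()` outputs shows that the totals add up, but not which raw label ended up in which category. A cross-tabulation of the raw column against the cleaned one makes that explicit. The sketch below uses hand-typed demo values that mirror what the counts above imply (for example `Visa` under `card` and `Mobilbetalning` under `swish`). It does not call `clean_payment`.

```python
import pandas as pd

raw = pd.Series(
    ["Faktura", "Invoice", "FAKTURA", "Kort", "Visa", "Swish", "Mobilbetalning", None],
    name="raw",
)
clean = pd.Series(
    ["invoice", "invoice", "invoice", "card", "card", "swish", "swish", "unknown"],
    name="clean",
)

# crosstab drops missing values, so give them a visible label first.
audit = pd.crosstab(raw.fillna("<missing>"), clean)
print(audit)

# Every raw label should land in exactly one cleaned category.
targets_per_label = (audit > 0).sum(axis=1)
ambiguous = targets_per_label[targets_per_label != 1]
print("Labels mapped to more than one category:", list(ambiguous.index))
```

In the notebook, the same check could be run on a copy of `betalmetod` taken before the cleaning step and on the cleaned column afterwards.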

### 2.9 Clean Leveransstatus

In [15]:
before = df_step["leveransstatus"].value_counts(dropna=False)
df_step = clean_leveransstatus(df_step)
after = df_step["leveransstatus"].value_counts(dropna=False)

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


leveransstatus
Levererad          2005
Under transport     152
Retur               132
Skickad              95
NaN                  94
Mottagen             92
levererad            86
LEVERERAD            68
Returnerad           10
Återsänd              8
På väg                7
retur                 6
under transport       6
RETUR                 4
UNDER TRANSPORT       2
Name: count, dtype: int64


After:


leveransstatus
delivered     2159
in_transit     167
returned       160
sent            95
unknown         94
received        92
Name: count, dtype: int64

### 2.10 Clean Region

In [16]:
before = df_step["region"].value_counts(dropna=False)
df_step = clean_region(df_step)
after = df_step["region"].value_counts(dropna=False)

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


region
Stockholm     834
Göteborg      434
Malmö         236
Uppsala       192
NaN           155
Norrland      125
Örebro        117
Linköping     117
Västerås       75
STOCKHOLM      50
Sthml          44
stockholm      42
STHLM          39
uppsala        25
Sthlm          25
göteborg       24
GÖTEBORG       22
UPPSALA        21
Gothenburg     19
MALMÖ          16
Gbg            14
malmo          12
GBGB           11
LINKÖPING      11
Orebro         10
Vasteras       10
örebro          9
ÖREBRO          9
norrland        9
linköping       8
NORRLAND        8
Linkoping       8
Malmo           8
västerås        7
Norr            7
VÄSTERÅS        7
malmö           7
Name: count, dtype: int64


After:


region
stockholm    1034
göteborg      524
malmö         279
uppsala       238
None          155
norrland      149
örebro        145
linköping     144
västerås       99
Name: count, dtype: int64
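
The raw region labels differ in three ways: letter case, abbreviations or typos (`Sthlm`, `Sthml`, `Gbg`, `GBGB`, `Norr`), and spellings without diacritics or in English (`Malmo`, `Orebro`, `Gothenburg`). A sketch that handles all three with lowercasing plus an alias table is given below. `REGION_ALIASES` and `normalize_region` are hypothetical names, and the alias list is inferred from the labels above rather than taken from `clean_region`.

```python
import pandas as pd

# Lowercased variant -> canonical region name (inferred from the labels above).
REGION_ALIASES = {
    "sthlm": "stockholm",
    "sthml": "stockholm",
    "gothenburg": "göteborg",
    "gbg": "göteborg",
    "gbgb": "göteborg",
    "malmo": "malmö",
    "orebro": "örebro",
    "linkoping": "linköping",
    "vasteras": "västerås",
    "norr": "norrland",
}


def normalize_region(series: pd.Series) -> pd.Series:
    """Lowercase region labels and resolve known aliases; missing values stay missing."""
    key = series.str.strip().str.lower()
    return key.map(lambda value: REGION_ALIASES.get(value, value), na_action="ignore")


raw = pd.Series(
    ["Stockholm", "STHLM", "Sthml", "Gbg", "GBGB", "Gothenburg", "malmo", "Vasteras", "Norr", None]
)
cleaned = normalize_region(raw)
print(cleaned.value_counts(dropna=False))
```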

### 2.11 Clean Betyg (Rating)

In [17]:
before = df_step["betyg"].describe()
df_step = clean_betyg(df_step)
after = df_step["betyg"].describe()

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


count    1355.000000
mean        3.667159
std         1.264470
min         1.000000
25%         3.000000
50%         4.000000
75%         5.000000
max         5.000000
Name: betyg, dtype: float64


After:


count    2767.000000
mean        3.837008
std         0.900207
min         1.000000
25%         4.000000
50%         4.000000
75%         4.000000
max         5.000000
Name: betyg, dtype: float64
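
The AFTER summary has 2767 ratings instead of 1355, all three quartiles equal 4.0, and the standard deviation drops from about 1.26 to about 0.90. That pattern is consistent with every missing rating being filled with 4.0, which is the BEFORE median. The notebook does not show the body of `clean_betyg`, so the following is only a sketch of median imputation under that assumption. The helper name `fill_missing_ratings` is hypothetical.

```python
import pandas as pd


def fill_missing_ratings(series: pd.Series) -> pd.Series:
    """Coerce ratings to numbers and replace missing ones with the median."""
    numeric = pd.to_numeric(series, errors="coerce")
    return numeric.fillna(numeric.median())


# Made-up ratings: five observed values (median 4.0) and two gaps.
raw_ratings = pd.Series([2.0, 3.0, None, 4.0, 5.0, 4.0, None])
filled = fill_missing_ratings(raw_ratings)

print("std before:", round(raw_ratings.std(), 3))
print("std after: ", round(filled.std(), 3))
```

Because roughly half of the rows in this dataset have no review, an imputed column says more about the fill value than about customers. Keeping a boolean flag for the originally missing rows is one way to let later analysis exclude them.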

### 2.12 Clean Recension Text

In [18]:
before = df_step["recension_text"].head()
df_step = clean_recension_text(df_step)
after = df_step["recension_text"].head()

print("Before:")
display(before)
print("\nAfter:")
display(after)

Before:


0                                                  NaN
1                                                  NaN
2       Stämmer inte överens med produktbeskrivningen.
3    Leveransen tog lite längre än utlovat, men pro...
4                                                  NaN
Name: recension_text, dtype: object


After:


0                                                  NaN
1                                                  NaN
2       Stämmer inte överens med produktbeskrivningen.
3    Leveransen tog lite längre än utlovat, men pro...
4                                                  NaN
Name: recension_text, dtype: object

### 2.13 Remove Duplicates

In [19]:
before_shape = df_step.shape
df_step = remove_duplicates(df_step)
after_shape = df_step.shape

print("Before:", before_shape)
print("After:", after_shape)

Before: (2767, 17)
After: (2700, 17)
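
The row count drops from 2767 to 2700, which equals the number of unique `orderrad_id` values reported in the raw overview. That is consistent with keeping one row per order line, although the notebook does not show whether `remove_duplicates` keys on that column or on whole rows. A sketch under the first assumption follows. The helper name `drop_duplicate_order_lines` is hypothetical.

```python
import pandas as pd


def drop_duplicate_order_lines(df: pd.DataFrame, key: str = "orderrad_id"):
    """Keep the first row for each `key` value and report how many rows were dropped."""
    rows_before = len(df)
    out = df.drop_duplicates(subset=key, keep="first")
    return out, rows_before - len(out)


demo = pd.DataFrame(
    {
        "orderrad_id": ["A-1", "A-2", "A-2", "B-1"],
        "antal": [1, 2, 2, 1],
    }
)
deduped, n_removed = drop_duplicate_order_lines(demo)
print(deduped)
print("Rows removed:", n_removed)
```

`drop_duplicates` keeps the original index labels, so gaps appear where rows were removed. Calling `reset_index(drop=True)` afterwards is optional.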


## 3. 🚀 Run Full Transformation Pipeline

Now that each cleaning function has been tested individually, we run the full
`transform_data()` pipeline to ensure all steps work together in sequence.

In [20]:
df_clean = transform_data(df_raw.copy())

print("Preview of fully cleaned dataset:")
display(df_clean.head())

[TRANSFORM] Starting transformation pipeline...
[TRANSFORM] Transformation pipeline completed.
Preview of fully cleaned dataset:


Unnamed: 0,order_id,orderrad_id,orderdatum,leveransdatum,produkt_sku,produktnamn,kategori,antal,pris_per_enhet,region,kundtyp,betalmetod,kund_id,leveransstatus,recension_text,recensionsdatum,betyg
0,ORD-2024-00001,ORD-2024-00001-1,2024-05-19,2024-05-22,SKU-WC001,Webbkamera HD,Tillbehör,1,799.0,uppsala,private,card,KND-53648,delivered,,NaT,4.0
1,ORD-2024-00002,ORD-2024-00002-1,2024-12-02,2024-12-05,SKU-HB001,USB-C Hub 7-port,Tillbehör,1,549.0,göteborg,private,swish,KND-84095,delivered,,NaT,4.0
2,ORD-2024-00003,ORD-2024-00003-1,2024-12-31,2025-01-03,SKU-SD001,Extern SSD 1TB,Lagring,1,1199.0,,business,invoice,KND-91748,delivered,Stämmer inte överens med produktbeskrivningen.,2025-01-12,2.0
3,ORD-2024-00003,ORD-2024-00003-2,2024-12-31,2025-01-03,SKU-SD002,Extern SSD 500GB,Lagring,10,699.0,stockholm,business,invoice,KND-91748,received,"Leveransen tog lite längre än utlovat, men pro...",2025-01-14,3.0
4,ORD-2024-00003,ORD-2024-00003-3,2024-12-31,2025-01-03,SKU-MS001,Trådlös Mus X1,Tillbehör,1,399.0,stockholm,business,invoice,KND-91748,unknown,,NaT,4.0
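
Running the cleaning steps in sequence amounts to passing a DataFrame through an ordered list of functions that each take and return a DataFrame. The sketch below shows that pattern with `DataFrame.pipe`. It is not the body of `transform_data`: the two stand-in steps, the `run_pipeline` helper, and the log format are all hypothetical.

```python
import pandas as pd


def strip_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in step: trim whitespace around order IDs."""
    return df.assign(order_id=df["order_id"].str.strip())


def lowercase_region(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in step: lowercase region labels."""
    return df.assign(region=df["region"].str.lower())


def run_pipeline(df: pd.DataFrame, steps) -> pd.DataFrame:
    """Apply each step in order, logging the row count before and after."""
    out = df.copy()
    for step in steps:
        rows_before = len(out)
        out = out.pipe(step)
        print(f"[sketch] {step.__name__}: {rows_before} -> {len(out)} rows")
    return out


demo = pd.DataFrame(
    {"order_id": [" ORD-1 ", "ORD-2"], "region": ["Stockholm", "UPPSALA"]}
)
result = run_pipeline(demo, [strip_ids, lowercase_region])
print(result)
```

Keeping the steps in a list makes the order explicit, which matters when one step depends on another, such as parsing dates before comparing them.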


## 4. ✅ Validate Cleaned Dataset

We inspect the final cleaned dataset to ensure:

- correct data types
- no unexpected missing values
- consistent categories
- reasonable numeric distributions

In [21]:
print("Shape:", df_clean.shape)
df_clean.info()
df_clean.isna().sum()  # not rendered: a notebook cell displays only its last expression
df_clean.describe(include="all")

Shape: (2700, 17)
<class 'pandas.core.frame.DataFrame'>
Index: 2700 entries, 0 to 2766
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   order_id         2700 non-null   object        
 1   orderrad_id      2700 non-null   object        
 2   orderdatum       2566 non-null   datetime64[ns]
 3   leveransdatum    2644 non-null   datetime64[ns]
 4   produkt_sku      2700 non-null   object        
 5   produktnamn      2700 non-null   object        
 6   kategori         2700 non-null   object        
 7   antal            2700 non-null   int64         
 8   pris_per_enhet   2700 non-null   float64       
 9   region           2550 non-null   object        
 10  kundtyp          2700 non-null   object        
 11  betalmetod       2700 non-null   object        
 12  kund_id          2700 non-null   object        
 13  leveransstatus   2700 non-null   object        
 14  recension_text   1317 non-n

Unnamed: 0,order_id,orderrad_id,orderdatum,leveransdatum,produkt_sku,produktnamn,kategori,antal,pris_per_enhet,region,kundtyp,betalmetod,kund_id,leveransstatus,recension_text,recensionsdatum,betyg
count,2700,2700,2566,2644,2700,2700,2700,2700.0,2700.0,2550,2700,2700,2700,2700,1317,1292,2700.0
unique,1657,2700,,,17,17,5,,,8,2,4,1644,6,45,,
top,ORD-2024-00038,ORD-2024-00001-1,,,SKU-MS002,Ergonomisk Mus Pro,Tillbehör,,,stockholm,private,invoice,KND-76077,delivered,Fantastisk produkt! Fungerar precis som utlovat.,,
freq,5,1,,,218,218,1170,,,1012,1788,1051,5,2101,70,,
mean,,,2024-07-06 18:52:28.246297600,2024-07-09 23:35:29.500756224,,,,1.59037,3510.277778,,,,,,,2024-07-14 04:38:38.266254080,3.836296
min,,,2023-12-27 00:00:00,2024-01-01 00:00:00,,,,1.0,399.0,,,,,,,2024-01-05 00:00:00,1.0
25%,,,2024-03-29 06:00:00,2024-04-02 00:00:00,,,,1.0,599.0,,,,,,,2024-04-05 00:00:00,4.0
50%,,,2024-07-03 00:00:00,2024-07-07 00:00:00,,,,1.0,899.0,,,,,,,2024-07-12 12:00:00,4.0
75%,,,2024-10-20 00:00:00,2024-10-24 00:00:00,,,,2.0,4999.0,,,,,,,2024-10-27 00:00:00,4.0
max,,,2024-12-31 00:00:00,2025-01-07 00:00:00,,,,10.0,18999.0,,,,,,,2025-01-14 00:00:00,5.0
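
The visual checks above can be backed by a few programmatic ones. The sketch below is one possible set, not part of the `src` package: the `validate` helper is hypothetical, and the allowed category values are copied from the AFTER outputs in section 2. Date columns are not required to be complete, since `orderdatum` and `leveransdatum` contain missing values in the summary above.

```python
import pandas as pd

# Allowed values, taken from the AFTER outputs of the cleaning steps.
EXPECTED_CATEGORIES = {
    "kundtyp": {"private", "business"},
    "betalmetod": {"invoice", "card", "swish", "unknown"},
    "leveransstatus": {
        "delivered", "in_transit", "returned", "sent", "unknown", "received",
    },
}


def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df["orderrad_id"].duplicated().any():
        problems.append("duplicate orderrad_id values")
    for column, allowed in EXPECTED_CATEGORIES.items():
        unexpected = set(df[column].dropna().unique()) - allowed
        if unexpected:
            problems.append(f"{column}: unexpected values {sorted(unexpected)}")
    # Rows with a missing date compare as False and are therefore not flagged.
    if (df["leveransdatum"] < df["orderdatum"]).any():
        problems.append("delivery date earlier than order date")
    if not df["betyg"].dropna().between(1, 5).all():
        problems.append("betyg outside the 1-5 range")
    return problems


good = pd.DataFrame(
    {
        "orderrad_id": ["A-1", "A-2"],
        "kundtyp": ["private", "business"],
        "betalmetod": ["card", "unknown"],
        "leveransstatus": ["delivered", "sent"],
        "orderdatum": pd.to_datetime(["2024-05-19", None]),
        "leveransdatum": pd.to_datetime(["2024-05-22", "2024-07-05"]),
        "betyg": [4.0, 2.0],
    }
)
bad = good.assign(kundtyp=["private", "Firma"], betyg=[4.0, 7.0])

print(validate(good))
print(validate(bad))
```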


## 5. 💾 Save Cleaned Data

We save the fully cleaned dataset using the load module.

In [26]:
load_clean_data(df_clean)
print("Cleaned dataset saved successfully.")

[LOAD] Starting load process...
[LOAD] CSV saved successfully → C:\Users\zinah\nordtech_etl_project\data\processed\nordtech_cleaned.csv
[LOAD] SQLite table 'clean_orders' updated successfully → C:\Users\zinah\nordtech_etl_project\database\nordtech.db
[LOAD] Load process completed.
Cleaned dataset saved successfully.
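
According to the log above, the load step writes a CSV file and updates a SQLite table named `clean_orders`. A minimal sketch of such a step is shown below, using pandas with the standard-library `sqlite3` module. The `save_clean` helper, the demo file names, and the choice to replace the table rather than append to it are assumptions, and the demo writes to a temporary directory instead of the project folders.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path

import pandas as pd


def save_clean(df: pd.DataFrame, csv_path: Path, db_path: Path, table: str = "clean_orders") -> None:
    """Write df to a CSV file and replace `table` in a SQLite database."""
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    db_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(csv_path, index=False)
    with closing(sqlite3.connect(db_path)) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)
        conn.commit()


# Made-up two-row frame; paths point into a throwaway directory.
demo = pd.DataFrame({"orderrad_id": ["A-1", "A-2"], "pris_per_enhet": [799.0, 549.0]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "processed" / "demo_cleaned.csv"
    db_path = Path(tmp) / "database" / "demo.db"
    save_clean(demo, csv_path, db_path)

    # Read both outputs back to confirm that every row was written.
    csv_rows = len(pd.read_csv(csv_path))
    with closing(sqlite3.connect(db_path)) as conn:
        db_rows = conn.execute("SELECT COUNT(*) FROM clean_orders").fetchone()[0]

print("CSV rows:", csv_rows, "| SQLite rows:", db_rows)
```

Reading the data back after writing is a cheap way to confirm that the saved row count matches the DataFrame.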


# 🎉 ETL Pipeline Completed

This notebook demonstrated:

- Raw data exploration
- Step-by-step cleaning with BEFORE/AFTER validation
- Full transformation pipeline execution
- Final dataset validation
- Saving the cleaned dataset

The cleaned dataset is now ready for KPI analysis and further modeling.