## E-Commerce Dataset EDA & Cleaning

### 🎯 Objective

- Check for missing values and data types  
- Clean the data by filling missing values  


In [2]:
import pandas as pd
import os

# Define paths
raw_path = '../datasets/Raw_data'
cleaned_path = '../datasets/cleaned_data'

# Load datasets
datasets = {
    "customers": pd.read_csv(os.path.join(raw_path, "customers.csv")),
    "geolocation": pd.read_csv(os.path.join(raw_path, "geolocation.csv")),
    "order_items": pd.read_csv(os.path.join(raw_path, "order_items.csv")),
    "orders": pd.read_csv(os.path.join(raw_path, "orders.csv")),
    "payments": pd.read_csv(os.path.join(raw_path, "payments.csv")),
    "products": pd.read_csv(os.path.join(raw_path, "products.csv")),
    "sellers": pd.read_csv(os.path.join(raw_path, "sellers.csv")),
}

print("Datasets loaded successfully.")


Datasets loaded successfully.


### 🔍 Missing Values Summary

For each dataset, count the missing values per column and record each column's data type.


In [None]:
summary_frames = []

for name, df in datasets.items():
    summary = pd.DataFrame({
        "Dataset": name,
        "Column": df.columns,
        "Missing Values": df.isnull().sum().values,
        "Data Type": df.dtypes.values
    })
    summary_frames.append(summary)

eda_summary = pd.concat(summary_frames, ignore_index=True)

# Save EDA result
os.makedirs("../outputs", exist_ok=True)
eda_summary.to_csv("../outputs/eda_missing_summary.csv", index=False)

# Show preview
eda_summary.head(20)

| | Dataset | Column | Missing Values | Data Type |
|---|---|---|---|---|
| 0 | customers | customer_id | 0 | object |
| 1 | customers | customer_unique_id | 0 | object |
| 2 | customers | customer_zip_code_prefix | 0 | int64 |
| 3 | customers | customer_city | 0 | object |
| 4 | customers | customer_state | 0 | object |
| 5 | geolocation | geolocation_zip_code_prefix | 0 | int64 |
| 6 | geolocation | geolocation_lat | 0 | float64 |
| 7 | geolocation | geolocation_lng | 0 | float64 |
| 8 | geolocation | geolocation_city | 0 | object |
| 9 | geolocation | geolocation_state | 0 | object |
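
The rows previewed here all report zero missing values, so it can be useful to narrow the summary to the columns that actually need cleaning. The following is a minimal sketch rather than part of the original notebook: `missing_report` is a hypothetical helper, and the `toy` dictionary with its `item_id` and `weight` columns is made-up data standing in for the real `datasets` dictionary.

```python
import numpy as np
import pandas as pd


def missing_report(frames):
    """Return one row per column that has at least one missing value.

    `frames` maps a dataset name to a DataFrame (hypothetical helper,
    shaped like the `datasets` dictionary used in this notebook).
    """
    columns = ["Dataset", "Column", "Missing Values", "Missing %", "Data Type"]
    rows = []
    for name, df in frames.items():
        missing = df.isnull().sum()
        for col in df.columns:
            if missing[col] > 0:
                rows.append({
                    "Dataset": name,
                    "Column": col,
                    "Missing Values": int(missing[col]),
                    # Share of rows in this dataset where the column is missing
                    "Missing %": round(100 * missing[col] / len(df), 2),
                    "Data Type": str(df[col].dtype),
                })
    return pd.DataFrame(rows, columns=columns)


# Made-up data for illustration only
toy = {
    "items": pd.DataFrame({
        "item_id": ["a", "b", "c", "d"],
        "weight": [100.0, np.nan, 250.0, np.nan],
    })
}

report = missing_report(toy)
print(report)
```

Passing the notebook's own `datasets` dictionary instead of `toy` should produce the same kind of report for the real files.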


### Data Cleaning Strategy

- Fill missing **numerical values** with the **median**
- Fill missing **categorical values** with the **mode**
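
As a small illustration of the two rules above, the sketch below applies them to a made-up DataFrame. The `price` and `state` columns and their values are assumptions for demonstration only and are not taken from the datasets. The example also shows one reason the median is a common choice: a single extreme value shifts the mean far more than the median.

```python
import numpy as np
import pandas as pd

# Made-up data: one outlier in `price`, one missing value in each column
toy = pd.DataFrame({
    "price": [10.0, 12.0, np.nan, 11.0, 500.0],
    "state": ["SP", "RJ", "SP", None, "SP"],
})

price_median = toy["price"].median()  # ignores NaN by default
price_mean = toy["price"].mean()      # pulled upward by the 500.0 outlier
state_mode = toy["state"].mode()[0]   # most frequent non-missing value

filled = toy.copy()
# Assign the result back rather than calling fillna(..., inplace=True)
# on a column selection
filled["price"] = filled["price"].fillna(price_median)
filled["state"] = filled["state"].fillna(state_mode)

print(f"median={price_median}, mean={price_mean}, mode={state_mode}")
print(filled)
```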


In [7]:
cleaned_datasets = {}

# Create the output directory once, before the loop
os.makedirs(cleaned_path, exist_ok=True)

for name, df in datasets.items():
    df_cleaned = df.copy()
    for col in df_cleaned.columns:
        if not df_cleaned[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(df_cleaned[col]):
            # Numerical column: fill with the median
            fill_value = df_cleaned[col].median()
        else:
            # Categorical column: fill with the most frequent value
            modes = df_cleaned[col].mode()
            if modes.empty:
                continue  # column is entirely missing; nothing to fill with
            fill_value = modes[0]
        # Assign back instead of calling fillna(..., inplace=True) on a column
        # selection, which raises a FutureWarning and stops working in pandas 3.0
        df_cleaned[col] = df_cleaned[col].fillna(fill_value)
    cleaned_datasets[name] = df_cleaned
    # Save cleaned CSV
    df_cleaned.to_csv(os.path.join(cleaned_path, f"cleaned_{name}.csv"), index=False)

print(f"Cleaned datasets saved to {cleaned_path}/")


Cleaned datasets saved to ../datasets/cleaned_data/
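
After cleaning, a quick check can confirm whether any missing values remain. The sketch below is an optional follow-up under stated assumptions: `remaining_missing` is a hypothetical helper, and `toy_cleaned` is made-up data standing in for the `cleaned_datasets` dictionary. A column that was entirely missing has no median or mode to fill with, so a nonzero count is possible.

```python
import numpy as np
import pandas as pd


def remaining_missing(frames):
    """Map each dataset name to its total number of missing cells."""
    return {name: int(df.isnull().sum().sum()) for name, df in frames.items()}


# Made-up data for illustration only
toy_cleaned = {
    "orders": pd.DataFrame({"order_id": ["o1", "o2"], "total": [10.0, 20.0]}),
    "notes": pd.DataFrame({"note_id": ["n1", "n2"], "text": [np.nan, np.nan]}),
}

leftover = remaining_missing(toy_cleaned)
for name, count in leftover.items():
    status = "OK" if count == 0 else f"{count} missing cells remain"
    print(f"{name}: {status}")
```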
