### Brazilian E-Commerce Public Dataset by Olist
This is a Brazilian e-commerce public dataset of orders made at Olist Store. The dataset has information on 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allow viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

The code in the following cell downloads the dataset from Kaggle using the `kagglehub` package and lists the files in the downloaded directory.

In [None]:
import kagglehub
import os
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import plotly.express as px

# Download the latest version of the dataset
path = kagglehub.dataset_download("olistbr/brazilian-ecommerce")

# List the files in the downloaded directory
file_list = os.listdir(path)
for file_name in file_list:
    print(file_name)


In [None]:
def detect_and_plot_outliers_iqr(df, column, exclude_zero=False):
    """
    Detects outliers using the IQR method, prints outlier information, and plots a box plot.

    Args:
        df (pd.DataFrame): The input DataFrame.
        column (str): The name of the column to analyze for outliers.
        exclude_zero (bool): Whether to exclude zero values from outlier calculation.
                             Useful for columns where zero is a meaningful non-outlier value.
    """
    print(f"\n--- Outlier Analysis for column: {column} ---")

    # Create box plot
    plt.figure(figsize=(8, 4))
    sns.boxplot(y=df[column])
    plt.title(f'Box plot of {column}')
    plt.ylabel(column)
    plt.show()

    # Calculate IQR and identify outliers
    if exclude_zero:
        # Consider only non-zero values for IQR calculation
        data_for_iqr = df[df[column] != 0][column]
    else:
        data_for_iqr = df[column]

    if data_for_iqr.empty:
        print(f"  No non-zero data in column '{column}' to calculate outliers.")
        return

    Q1 = data_for_iqr.quantile(0.25)
    Q3 = data_for_iqr.quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identify outliers in the original DataFrame
    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]

    num_outliers = len(outliers)
    percentage_outliers = (num_outliers / len(df)) * 100

    print(f"  Number of outliers: {num_outliers}")
    print(f"  Percentage of outliers: {percentage_outliers:.2f}%")
    print(f"  Lower bound (IQR): {lower_bound:.2f}")
    print(f"  Upper bound (IQR): {upper_bound:.2f}")

    if num_outliers > 0 and num_outliers < 20: # Display if not too many outliers
        print("\nSample Outlier Rows:")
        display(outliers.head())
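
The fence calculation inside `detect_and_plot_outliers_iqr` can be checked in isolation on a small series. The sketch below is illustrative only: `iqr_bounds` and the `toy` values are hypothetical names and data, not part of the Olist dataset, and plotting is left out.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up values with one obvious extreme point
toy = pd.Series([10, 12, 13, 15, 18, 20, 22, 500])

lower, upper = iqr_bounds(toy)

# Same mask as in the notebook function: anything outside the fences is flagged
flagged = toy[(toy < lower) | (toy > upper)]
print(lower, upper, flagged.tolist())
```

Here Q1 is 12.75 and Q3 is 20.5, so the fences are 1.125 and 32.125 and only the value 500 is flagged.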

### Processing the `olist_customers_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the customers dataset
customers_dataset_path = os.path.join(path, "olist_customers_dataset.csv")
customers_df = pd.read_csv(customers_dataset_path)

# Display the first few rows of the DataFrame
customers_df.head()

In [None]:
# Get information about the customers_df DataFrame
print(customers_df.info())

# Display missing values information
print("\nMissing values in customers_df:")
print(customers_df.isnull().sum())

In [None]:
# Check for duplicated rows and duplicated customer_id values
print(f"Duplicated values in customers_df: {customers_df.duplicated().sum()}")
print(f"Duplicated values in customers_df['customer_id']: {customers_df['customer_id'].duplicated().sum()}")

In [None]:
# Standardize city and state columns
print(f"Unique cities before standardization: {customers_df['customer_city'].nunique()}")
customers_df['customer_city'] = customers_df['customer_city'].str.strip().str.lower()
customers_df['customer_state'] = customers_df['customer_state'].str.strip().str.lower()
print(f"Unique cities after standardization: {customers_df['customer_city'].nunique()}")

In [None]:
# Identify numerical columns in customers_df
numerical_cols_customers = customers_df.select_dtypes(include=np.number).columns

print("Numerical columns in customers_df:", numerical_cols_customers)

# Apply outlier detection and plotting for each numerical column
for col in numerical_cols_customers:
    detect_and_plot_outliers_iqr(customers_df, col)

In [None]:
customers_df.to_parquet("olist_customers_cleaned_dataset.parquet", index=False)

Here are the key observations from the pre-processing of the `customers_df`:

*   **Missing Values:** No missing values were found in the dataset.
*   **Data Types:** All columns have been assigned appropriate data types.
*   **Duplicate Values:** There are no duplicate rows in the entire DataFrame or in the `customer_id` column. Duplicates in the `customer_unique_id` column are expected and represent repeat buyers, as explained in the dataset documentation.
*   **Row Uniqueness:** Each row represents a unique customer entry in this specific dataset instance.
*   **Key Identifiers:** `customer_id` serves as an anonymized link to orders, while `customer_unique_id` allows for tracking individual customers across multiple orders.
*   **Geographical Information:** The dataset includes `customer_city`, `customer_state`, and `customer_zip_code_prefix` for geographical analysis.
*   **Outliers:** No outliers were detected in the numerical column (`customer_zip_code_prefix`) using the IQR method.
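
The distinction between `customer_id` and `customer_unique_id` can be illustrated with a small example. This is a sketch on made-up rows (`toy_customers` is a hypothetical frame, not the real data): each order gets its own `customer_id`, so repeat buyers show up as several `customer_id` values under one `customer_unique_id`.

```python
import pandas as pd

# Made-up rows: u1 placed two orders, so it appears under two customer_id values
toy_customers = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u2", "u3"],
})

# Count how many distinct customer_id values each person has
ids_per_person = toy_customers.groupby("customer_unique_id")["customer_id"].nunique()

# Anyone with more than one customer_id is a repeat buyer
repeat_buyers = ids_per_person[ids_per_person > 1]
print(repeat_buyers)
```

The same `groupby` applied to `customers_df` would give the number of repeat buyers in the real dataset.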

### Processing the `olist_sellers_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the sellers dataset
seller_dataset_path = os.path.join(path, "olist_sellers_dataset.csv")
seller_df = pd.read_csv(seller_dataset_path)

# Display the first few rows of the DataFrame
seller_df.head()

In [None]:
# Get information about the seller_df DataFrame
print(seller_df.info())

# Display missing values information
print("\nMissing values in seller_df:")
print(seller_df.isnull().sum())

In [None]:
# Checking for duplicated values
print(f"Duplicated values in seller_df: {seller_df.duplicated().sum()}")
print(f"Duplicated values in seller_df['seller_id']: {seller_df['seller_id'].duplicated().sum()}")

In [None]:
# Standardize city and state columns
print(f"Unique seller cities before standardization: {seller_df['seller_city'].nunique()}")
seller_df['seller_city'] = seller_df['seller_city'].str.strip().str.lower()
seller_df['seller_state'] = seller_df['seller_state'].str.strip().str.lower()
print(f"Unique seller cities after standardization: {seller_df['seller_city'].nunique()}")

In [None]:
# Identify numerical columns in seller_df
numerical_cols_sellers = seller_df.select_dtypes(include=np.number).columns

print("Numerical columns in seller_df:", numerical_cols_sellers)

# Apply outlier detection and plotting for each numerical column
for col in numerical_cols_sellers:
    detect_and_plot_outliers_iqr(seller_df, col)

In [None]:
seller_df.to_parquet("olist_sellers_cleaned_dataset.parquet", index=False)

Here are the key observations from the pre-processing of the `seller_df`:

*   **Missing Values and Data Types:** No missing values were detected, and all columns have appropriate data types.
*   **Duplicate Values:** There are no duplicate values present in the DataFrame, ensuring each entry is unique per seller.
*   **Row Uniqueness:** Each row corresponds to a unique seller identified by `seller_id`.
*   **Geographical Information:** The dataset includes `seller_zip_code_prefix`, `seller_city`, and `seller_state`, allowing for geographical analysis of seller distribution.
*   **Geographical Distribution:** The `seller_state` column indicates the Brazilian state where each seller is located, which is important for analyzing logistics and regional presence.
*   **Outliers:** No outliers were detected in the numerical column (`seller_zip_code_prefix`) using the IQR method.
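
The city standardization step (strip whitespace, lowercase) can be seen on a few made-up spellings. The accent-stripping step at the end is not part of this notebook; it is shown only as a possible extension, on the assumption that accented and unaccented spellings of the same city should be merged.

```python
import pandas as pd

# Made-up spellings of the same city plus one other city
raw_cities = pd.Series([" São Paulo", "sao paulo ", "SAO PAULO", "Rio de Janeiro"])

# Step used in this notebook: trim whitespace and lowercase
basic = raw_cities.str.strip().str.lower()

# Hypothetical extra step (not in the notebook): drop accents by decomposing
# characters and discarding anything that is not plain ASCII
ascii_only = (
    basic.str.normalize("NFKD")
         .str.encode("ascii", errors="ignore")
         .str.decode("ascii")
)

print(raw_cities.nunique(), basic.nunique(), ascii_only.nunique())
```

With these values, strip and lowercase reduce four spellings to three, and removing accents reduces them to two.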

### Processing the `olist_order_reviews_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the order reviews dataset
order_reviews_dataset_path = os.path.join(path, "olist_order_reviews_dataset.csv")
order_reviews_df = pd.read_csv(order_reviews_dataset_path)

# Display the first few rows of the DataFrame
order_reviews_df.head()

In [None]:
# Get information about the order_reviews_df DataFrame
print(order_reviews_df.info())

# Display missing values information
print("\nMissing values in order_reviews_df:")
print(order_reviews_df.isnull().sum())

In [None]:
# Fill missing values in 'review_comment_title' and 'review_comment_message' with 'no comment'
order_reviews_df['review_comment_title'] = order_reviews_df['review_comment_title'].fillna('no comment')
order_reviews_df['review_comment_message'] = order_reviews_df['review_comment_message'].fillna('no comment')

# Verify that missing values have been handled
print("\nMissing values in order_reviews_df after filling:")
print(order_reviews_df.isnull().sum())

In [None]:
# Fix datatype mismatch: convert timestamp columns from object to datetime
order_reviews_df['review_answer_timestamp'] = pd.to_datetime(order_reviews_df['review_answer_timestamp'])
order_reviews_df['review_creation_date'] = pd.to_datetime(order_reviews_df['review_creation_date'])

print(f"Data types after conversion: \n{order_reviews_df.dtypes}")

In [None]:
# Duplicated Values
print(f"Duplicated values in order_reviews_df: {order_reviews_df.duplicated().sum()}")
print(f"Duplicated values in order_reviews_df['review_id']: {order_reviews_df['review_id'].duplicated().sum()}")
# Remove duplicate review_id values from order_reviews_df
order_reviews_df.drop_duplicates(subset='review_id', inplace=True)

# Verify that duplicates have been removed
print(f"Duplicated values in order_reviews_df['review_id'] after removing duplicates: {order_reviews_df['review_id'].duplicated().sum()}")

In [None]:
# Identify numerical columns in order_reviews_df
numerical_cols_reviews = order_reviews_df.select_dtypes(include=np.number).columns

print("Numerical columns in order_reviews_df:", numerical_cols_reviews)

# Apply outlier detection and plotting for each numerical column
for col in numerical_cols_reviews:
    detect_and_plot_outliers_iqr(order_reviews_df, col)

In [None]:
order_reviews_df.to_parquet("olist_order_reviews_cleaned_dataset.parquet", index=False)

Observations:
* Each row corresponds to exactly one review per order.
* `review_score` ranges from 1 to 5, with most orders having a rating.
* Many reviews lack textual comments (`review_comment_title` and `review_comment_message` have many nulls); these were replaced with the string 'no comment'.
* Review timestamps (`review_creation_date`, `review_answer_timestamp`) were stored as objects and were converted to datetime for time series analysis.
* There were 814 duplicated values in `review_id`; the duplicates were dropped, keeping the first occurrence.
* This dataset can provide insights on customer satisfaction and correlate ratings with delivery times, sellers, or product categories during EDA.
* Number of outliers: 14396. These outliers are genuine ratings given by customers, which can be highly negative or highly positive, so they were kept because they are essential for the ML stage.
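
Why the IQR method flags ratings as outliers can be shown with a small example. The score distribution below is made up for illustration (it is not the Olist distribution): when most scores are 4 or 5, the IQR is narrow and the lower fence lands above the low ratings.

```python
import pandas as pd

# Made-up distribution of 100 ratings, heavily skewed towards 5
scores = pd.Series([5] * 60 + [4] * 20 + [3] * 8 + [2] * 4 + [1] * 8)

q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
lower = q1 - 1.5 * (q3 - q1)

# Every score below the lower fence is reported as an outlier
flagged = scores[scores < lower]
print(q1, q3, lower, sorted(flagged.unique().tolist()), len(flagged))
```

Here Q1 is 4 and Q3 is 5, so the lower fence is 2.5 and every 1-star and 2-star rating is flagged even though these are valid values. That is the reason for keeping such rows rather than dropping them.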

### Processing the `olist_order_items_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the order items dataset
order_items_dataset_path = os.path.join(path, "olist_order_items_dataset.csv")
order_items_df = pd.read_csv(order_items_dataset_path)

# Display the first few rows of the DataFrame
order_items_df.head()

In [None]:
# Get information about the order_items_df DataFrame
print(order_items_df.info())

# Display missing values information
print("\nMissing values in order_items_df:")
print(order_items_df.isnull().sum())

In [None]:
# Rename the column for clarity and convert it from object to datetime
order_items_df.rename(columns={"shipping_limit_date" : "shipping_deadline"}, inplace=True)
order_items_df['shipping_deadline'] = pd.to_datetime(order_items_df['shipping_deadline'])
print(f"Data types after conversion: \n{order_items_df.dtypes}")

In [None]:
# Duplicated Values
print(f"Duplicated values in order_items_df: {order_items_df.duplicated().sum()}")
print(f"Duplicates in the combination of order_id and order_item_id: {order_items_df.duplicated(subset=['order_id', 'order_item_id']).sum()}")

In [None]:
# Identify numerical columns in order_items_df
numerical_cols_items = order_items_df.select_dtypes(include=np.number).columns

print("Numerical columns in order_items_df:", numerical_cols_items)

# Apply outlier detection and plotting for each numerical column
for col in numerical_cols_items:
    if col == 'order_item_id':  # Exclude order_item_id from outlier detection
        continue
    detect_and_plot_outliers_iqr(order_items_df, col)

In [None]:
# Let's find the outliers without removing them first
# Select only numerical columns for quantile calculation
numerical_order_items_df = order_items_df.select_dtypes(include=np.number)

Q1 = numerical_order_items_df.quantile(0.25)
Q3 = numerical_order_items_df.quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers for 'price' and 'freight_value' using their specific bounds
outliers_price = order_items_df[(order_items_df['price'] < lower_bound['price']) | (order_items_df['price'] > upper_bound['price'])]
outliers_freight = order_items_df[(order_items_df['freight_value'] < lower_bound['freight_value']) | (order_items_df['freight_value'] > upper_bound['freight_value'])]

print("--- Top 5 Price Outliers ---")
display(outliers_price.sort_values('price', ascending=False).head())

print("\n--- Top 5 Freight Value Outliers ---")
display(outliers_freight.sort_values('freight_value', ascending=False).head())

In [None]:
order_items_df['price_log'] = np.log1p(order_items_df['price'])
order_items_df['freight_value_log'] = np.log1p(order_items_df['freight_value'])

print("Log-transformed columns 'price_log' and 'freight_value_log' have been created.")


# --- Step 2: Visualize the "Before and After" ---
# This will clearly show you why this method is so effective.
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Original distributions
sns.histplot(order_items_df['price'], bins=50, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Original Price Distribution (Skewed)')

sns.histplot(order_items_df['freight_value'], bins=50, kde=True, ax=axes[1, 0])
axes[1, 0].set_title('Original Freight Value Distribution (Skewed)')

# Log-transformed distributions
sns.histplot(order_items_df['price_log'], bins=50, kde=True, ax=axes[0, 1])
axes[0, 1].set_title('Log-Transformed Price Distribution (Normalized)')

sns.histplot(order_items_df['freight_value_log'], bins=50, kde=True, ax=axes[1, 1])
axes[1, 1].set_title('Log-Transformed Freight Value Distribution (Normalized)')

plt.tight_layout()
plt.show()

In [None]:
order_items_df.to_parquet("olist_order_items_cleaned_dataset.parquet", index=False)

Observations:
* No missing values detected in this dataset.
* Each row represents an individual product item in an order, so an order can have multiple rows (one per item).
* `order_item_id` is not a unique identifier by itself, but it helps identify the position of an item within a given `order_id`.
* `price` and `freight_value` will be critical for revenue and cost analysis in the EDA stage.
* Multiple outliers were identified in `price` and `freight_value`, but they correspond to high-end/expensive products, so they were not removed. Instead, log-transformed columns (`price_log`, `freight_value_log`) were created with `np.log1p` to reduce the skew.
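
The effect of the `np.log1p` transformation can be checked on a handful of values. This is a sketch on made-up prices, not the real column: it shows that `log1p` handles zero safely, that `np.expm1` recovers the original values, and that the skew drops after the transform.

```python
import numpy as np
import pandas as pd

# Made-up prices with one very expensive item
prices = pd.Series([0.0, 9.9, 49.0, 120.0, 6735.0])

# log1p computes log(1 + x), so a value of 0 maps to 0 instead of -inf
prices_log = np.log1p(prices)

# expm1 is the inverse, which matters when reporting results in the original unit
recovered = np.expm1(prices_log)

print(prices.skew(), prices_log.skew())
```

Because the transform is invertible, no information is lost: the original `price` column can always be reconstructed from `price_log`.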

### Processing the `olist_products_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the products dataset
products_dataset_path = os.path.join(path, "olist_products_dataset.csv")
products_df = pd.read_csv(products_dataset_path)

# Display the first few rows of the DataFrame
products_df.head()

In [None]:
# Get information about the products_df DataFrame
print(products_df.info())

# Display missing values information
print("\nMissing values in products_df:")
print(products_df.isnull().sum())

In [None]:
products_df.rename(columns={
    "product_name_lenght" : "product_name_length",
    "product_description_lenght" : "product_description_length"
}, inplace=True)

In [None]:
# Fill missing values in 'product_category_name' with 'unknown'
products_df['product_category_name'] = products_df['product_category_name'].fillna('unknown')

# Verify that missing values in 'product_category_name' have been handled
print("\nMissing values in products_df after filling 'product_category_name':")
print(products_df.isnull().sum())

In [None]:
# Calculate the median of 'product_name_length', 'product_description_length', 'product_photos_qty', 'product_weight_g', 'product_length_cm', 'product_height_cm', and 'product_width_cm'
median_name_length = products_df['product_name_length'].median()
median_description_length = products_df['product_description_length'].median()
median_photos_qty = products_df['product_photos_qty'].median()
median_weight_g = products_df['product_weight_g'].median()
median_length_cm = products_df['product_length_cm'].median()
median_height_cm = products_df['product_height_cm'].median()
median_width_cm = products_df['product_width_cm'].median()


# Impute missing values in 'product_name_length', 'product_description_length', and 'product_photos_qty' with their medians
products_df['product_name_length'] = products_df['product_name_length'].fillna(median_name_length)
products_df['product_description_length'] = products_df['product_description_length'].fillna(median_description_length)
products_df['product_photos_qty'] = products_df['product_photos_qty'].fillna(median_photos_qty)

# Impute missing values in 'product_weight_g', 'product_length_cm', 'product_height_cm', and 'product_width_cm' with their medians
products_df['product_weight_g'] = products_df['product_weight_g'].fillna(median_weight_g)
products_df['product_length_cm'] = products_df['product_length_cm'].fillna(median_length_cm)
products_df['product_height_cm'] = products_df['product_height_cm'].fillna(median_height_cm)
products_df['product_width_cm'] = products_df['product_width_cm'].fillna(median_width_cm)


# Verify that missing values in the numerical columns have been handled
print("\nMissing values in products_df after imputing numerical columns:")
print(products_df.isnull().sum())

In [None]:
# Duplicated Values
print(f"Duplicated values in products_df: {products_df.duplicated().sum()}")
print(f"Duplicated values in products_df['product_id']: {products_df['product_id'].duplicated().sum()}")

In [None]:
# Identify numerical columns in products_df
numerical_cols_products = products_df.select_dtypes(include=np.number).columns

print("Numerical columns in products_df:", numerical_cols_products)

# Apply outlier detection and plotting for each numerical column
for col in numerical_cols_products:
    detect_and_plot_outliers_iqr(products_df, col)

In [None]:
products_df['product_weight_g_log'] = np.log1p(products_df['product_weight_g'])

# Optional but recommended: Create and transform volume
products_df['product_volume_cm3'] = products_df['product_length_cm'] * products_df['product_height_cm'] * products_df['product_width_cm']
products_df['product_volume_cm3_log'] = np.log1p(products_df['product_volume_cm3'])

print("Log-transformed columns for weight and volume have been created.")

# --- 2. Cap the Product Description Length ---
# Calculate the 99th percentile
desc_len_cap = products_df['product_description_length'].quantile(0.99)
print(f"Product description length will be capped at: {desc_len_cap:.0f} characters.")

# Create a new capped column
products_df['product_description_length_capped'] = products_df['product_description_length'].clip(upper=desc_len_cap)

print("Capped column for description length has been created.")

# --- 3. Do Nothing for Photos Qty and Name Length ---
print("No changes made to 'product_photos_qty' or 'product_name_length'.")

# Display the new columns
print("\nDataFrame with new transformed/capped columns:")
display(products_df[['product_weight_g', 'product_weight_g_log', 'product_volume_cm3', 'product_volume_cm3_log', 'product_description_length', 'product_description_length_capped']].head())


In [None]:
products_df.to_parquet("olist_products_cleaned_dataset.parquet", index=False)

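Observations:
* The dataset includes product identifiers, product categories (in Portuguese), and physical attributes such as length, height, width, and weight.
* Around 1.85% of entries had missing values in the category and product description-related columns. Missing `product_category_name` values were filled with 'unknown', and the numerical columns were imputed with their medians.
* Only a negligible number of missing values were present in the physical dimension columns; these were also imputed with their medians.
* Duplicate rows and duplicate `product_id` values were checked; none were found.
* Outliers were detected in several numerical columns. Log transformations (`np.log1p`) were applied to `product_weight_g` and `product_volume_cm3`, and `product_description_length` was capped at the 99th percentile.
* Some products share the same category name, which will be translated into English using the product category translation dataset.
* Product dimensions and weight will be valuable for analyzing shipping costs and understanding product characteristics in later EDA.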
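
The median imputation used for the products dataset can also be written without one line per column. The following is a minimal sketch on a made-up frame (`toy_products` is hypothetical); it relies on `DataFrame.fillna` accepting a Series of per-column fill values.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing value per column
toy_products = pd.DataFrame({
    "product_category_name": ["perfumaria", None, "artes"],
    "product_weight_g": [200.0, np.nan, 1000.0],
    "product_length_cm": [16.0, 30.0, np.nan],
})

# Categorical column: fill with a placeholder label
toy_products["product_category_name"] = toy_products["product_category_name"].fillna("unknown")

# Numerical columns: fill each column with its own median in one call
numeric_cols = toy_products.select_dtypes(include=np.number).columns
toy_products[numeric_cols] = toy_products[numeric_cols].fillna(toy_products[numeric_cols].median())

print(toy_products)
```

The result is the same as filling each column separately; the loop-free form just avoids repeating the column names.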

### Processing the `olist_geolocation_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the geolocation dataset
geolocation_dataset_path = os.path.join(path, "olist_geolocation_dataset.csv")
geolocation_df = pd.read_csv(geolocation_dataset_path)

# Display the first few rows of the DataFrame
geolocation_df.head()

In [None]:
# Get information about the geolocation_df DataFrame
geolocation_df.info()  # info() prints its summary and returns None, so there is nothing to assign

# Display missing values information
print("\nMissing values in geolocation_df:")
print(geolocation_df.isnull().sum())

In [None]:
# Check for duplicated rows in the geolocation_df
print(f"Number of duplicated rows in geolocation_df: {geolocation_df.duplicated().sum()}")

In [None]:
# Remove duplicate rows based on the subset of specified columns
geolocation_df.drop_duplicates(subset=['geolocation_zip_code_prefix', 'geolocation_lat', 'geolocation_lng'], keep='first', inplace=True)

# Verify that duplicates based on the subset have been removed
print(f"Number of duplicated rows based on zip code, lat, and lng after removing duplicates: {geolocation_df.duplicated(subset=['geolocation_zip_code_prefix', 'geolocation_lat', 'geolocation_lng']).sum()}")

In [None]:
# Identify numerical columns in geolocation_df
numerical_cols_geo = geolocation_df.select_dtypes(include=np.number).columns

print("Numerical columns in geolocation_df:", numerical_cols_geo)

# Apply outlier detection and plotting for each numerical column
for col in numerical_cols_geo:
    detect_and_plot_outliers_iqr(geolocation_df, col)

In [None]:
# Define the approximate geographical boundaries for Brazil
LAT_MIN, LAT_MAX = -34, 6
LON_MIN, LON_MAX = -74, -34

# Find coordinates that fall outside these boundaries
geo_outliers = geolocation_df[
    (geolocation_df['geolocation_lat'] < LAT_MIN) | (geolocation_df['geolocation_lat'] > LAT_MAX) |
    (geolocation_df['geolocation_lng'] < LON_MIN) | (geolocation_df['geolocation_lng'] > LON_MAX)
]

num_outliers = len(geo_outliers)

if num_outliers > 0:
    print("--- Geographic Outlier Analysis ---")
    print(f"Found {num_outliers} coordinates outside the plausible boundaries of Brazil.")
    print("\nDisplaying some of the detected outliers:")
    display(geo_outliers.head())

    # Drop the outliers from geolocation_df itself so the saved file excludes them
    geolocation_df = geolocation_df.drop(geo_outliers.index)
    print(f"\n{num_outliers} outlier rows have been removed.")

else:
    print("No geographic outliers found outside the plausible boundaries of Brazil.")


In [None]:
geolocation_df.to_parquet("olist_geolocation_cleaned_dataset.parquet", index=False)

Observations:
* Each zip code prefix is associated with one city and one state.
* This dataset enables mapping customer and seller locations geographically.
* Unique states provide insights into regional coverage and potential logistics challenges.
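
For mapping, it can help to have a single coordinate per zip code prefix, because after de-duplicating on prefix, latitude and longitude a prefix may still appear on several rows with different coordinates. The sketch below is one possible approach, not a step taken in this notebook: it averages the coordinates per prefix on made-up rows (`toy_geo` and `zip_centroids` are hypothetical names).

```python
import pandas as pd

# Made-up rows: prefix 1037 has two slightly different coordinates
toy_geo = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046],
    "geolocation_lat": [-23.54, -23.56, -23.55],
    "geolocation_lng": [-46.64, -46.62, -46.63],
})

# One row per prefix, using the mean latitude and longitude as a rough centroid
zip_centroids = (
    toy_geo.groupby("geolocation_zip_code_prefix", as_index=False)[
        ["geolocation_lat", "geolocation_lng"]
    ].mean()
)

print(zip_centroids)
```

A table like this can be joined to customers or sellers on the zip code prefix without multiplying rows.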

### Processing the `product_category_name_translation.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the product category name translation dataset
category_name_translation_dataset_path = os.path.join(path, "product_category_name_translation.csv")
category_name_translation_df = pd.read_csv(category_name_translation_dataset_path)

# Display the first few rows of the DataFrame
category_name_translation_df.head()

In [None]:
# Get information about the category_name_translation_df DataFrame
print(category_name_translation_df.info())

# Display missing values information
print("\nMissing values in category_name_translation_df:")
print(category_name_translation_df.isnull().sum())

In [None]:
print(f"Duplicated values in category_name_translation_df: {category_name_translation_df.duplicated().sum()}")
print(f"Duplicated values in category_name_translation_df['product_category_name']: {category_name_translation_df['product_category_name'].duplicated().sum()}")

In [None]:
category_name_translation_df.to_parquet("category_name_translation_cleaned_dataset.parquet", index=False)

Observations:
* The dataset contains two columns: the original Portuguese category names and their English translations.
* There are no missing or duplicate values.
* This will allow us to join with the products dataset to translate product categories for better readability in analysis.
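
The join mentioned above can be sketched as a left merge. The frames below are made up, and the English column name `product_category_name_english` is an assumption about the translation file that should be checked against `category_name_translation_df.columns`. Categories without a translation (such as the 'unknown' placeholder) fall back to the original name.

```python
import pandas as pd

# Made-up products, including one with the 'unknown' placeholder category
toy_products = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", "artes", "unknown"],
})

# Made-up translation table (column name assumed, see note above)
toy_translation = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes"],
    "product_category_name_english": ["perfumery", "art"],
})

# A left merge keeps every product, even when no translation exists
translated = toy_products.merge(toy_translation, on="product_category_name", how="left")

# Fall back to the original name where the translation is missing
translated["product_category_name_english"] = translated[
    "product_category_name_english"
].fillna(translated["product_category_name"])

print(translated)
```

Using `how="left"` rather than the default inner join avoids silently dropping products whose category has no translation.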

### Processing the `olist_orders_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the orders dataset
orders_dataset_path = os.path.join(path, "olist_orders_dataset.csv")
orders_df = pd.read_csv(orders_dataset_path)

# Display the first few rows of the DataFrame
orders_df.head()

In [None]:
# Get information about the orders_df DataFrame
print(orders_df.info())

# Display missing values information
print("\nMissing values in orders_df:")
print(orders_df.isnull().sum())

In [None]:
date_cols = [
    "order_purchase_timestamp",
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
    "order_estimated_delivery_date"
]

for col in date_cols:
    orders_df[col] = pd.to_datetime(orders_df[col])

In [None]:
# Keep only rows where 'order_status' is 'delivered'; copy so later edits do not act on a slice
orders_df = orders_df[orders_df['order_status'] == 'delivered'].copy()

# Now, drop rows with missing delivery dates
orders_df = orders_df.dropna(subset=['order_delivered_customer_date'])

print("Original DataFrame has been modified. ✅")
print("\nShape after filtering and dropping nulls:", orders_df.shape)
print("\nRemaining missing values:")
print(orders_df.isnull().sum())

In [None]:
# Drop the remaining rows that are missing an approval or carrier hand-over timestamp
orders_df = orders_df.dropna(subset=['order_approved_at', 'order_delivered_carrier_date'])

In [None]:
# Define the conditions for invalid orders
invalid_order_conditions = (
    (orders_df['order_approved_at'] < orders_df['order_purchase_timestamp']) |
    (orders_df['order_delivered_carrier_date'] < orders_df['order_approved_at']) |
    (orders_df['order_delivered_customer_date'] < orders_df['order_delivered_carrier_date']) |
    (orders_df['order_estimated_delivery_date'] < orders_df['order_purchase_timestamp'])
)

# Filter out the invalid orders
orders_df = orders_df[~invalid_order_conditions].reset_index(drop=True)

# Display the number of remaining orders
print(f"Number of orders after removing invalid ones: {len(orders_df)}")

In [None]:
print(f"Duplicated values in orders_df: {orders_df.duplicated().sum()}")
print(f"Duplicated values in orders_df['order_id']: {orders_df['order_id'].duplicated().sum()}")

In [None]:
orders_df.to_parquet("olist_orders_cleaned_dataset.parquet", index=False)

Observations:
* Missing Values: Missing values were present in `order_approved_at`, `order_delivered_carrier_date`, and `order_delivered_customer_date`. These were not filled; the data was restricted to delivered orders and rows still missing any of these timestamps were dropped.
* Data Types: All timestamp columns (`order_purchase_timestamp`, `order_approved_at`, `order_delivered_carrier_date`, `order_delivered_customer_date`, `order_estimated_delivery_date`) were converted to datetime objects.
* Duplicate Values: Checked for duplicate rows and duplicate `order_id` values; none were found.
* Outliers: No numerical columns were present in this dataset for standard outlier detection using the IQR method. However, orders with illogical timestamp sequences (e.g., delivery before purchase) were considered data quality issues and removed.
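
The timestamp sequence check can be expressed as a loop over consecutive stages. This is a sketch on made-up orders (`toy_orders` and `stages` are hypothetical) that uses three of the timestamp columns. It also shows that comparisons against a missing timestamp (`NaT`) evaluate to False, so rows with missing values are never flagged by this check and have to be handled separately.

```python
import pandas as pd

# Made-up orders: row 1 is approved before purchase, row 2 has no approval timestamp
toy_orders = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime(
        ["2018-01-01 10:00", "2018-01-05 09:00", "2018-01-07 12:00"]
    ),
    "order_approved_at": pd.to_datetime(
        ["2018-01-01 10:30", "2018-01-04 09:00", None]
    ),
    "order_delivered_customer_date": pd.to_datetime(
        ["2018-01-06", "2018-01-09", "2018-01-12"]
    ),
})

# Stages in the order they are expected to happen
stages = ["order_purchase_timestamp", "order_approved_at", "order_delivered_customer_date"]

# Flag a row if any stage happens before the stage that should precede it
out_of_order = pd.Series(False, index=toy_orders.index)
for earlier, later in zip(stages, stages[1:]):
    out_of_order |= toy_orders[later] < toy_orders[earlier]

print(out_of_order.tolist())
```

Only the second row is flagged. The third row passes despite its missing approval timestamp, which is why dropping rows with missing timestamps is a separate step.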

### Processing the `olist_order_payments_dataset.csv`, this includes checking for null/missing values, ensuring all datatypes are correctly assigned to the columns, checking for duplicate values and checking for outliers.

In [None]:
# Load the order payments dataset
order_payments_dataset_path = os.path.join(path, "olist_order_payments_dataset.csv")
order_payments_df = pd.read_csv(order_payments_dataset_path)

# Display the first few rows of the DataFrame
order_payments_df.head()

In [None]:
# Get information about the order_payments_df DataFrame
order_payments_df.info()  # info() prints its summary and returns None, so there is nothing to assign

# Display missing values information
print("\nMissing values in order_payments_df:")
print(order_payments_df.isnull().sum())

In [None]:
# Identify numerical columns in order_payments_df
numerical_cols_payments = order_payments_df.select_dtypes(include=np.number).columns

print("Numerical columns in order_payments_df:", numerical_cols_payments)

# Apply outlier detection and plotting for each numerical column
for col in numerical_cols_payments:
    detect_and_plot_outliers_iqr(order_payments_df, col)

In [None]:
# Merge datasets to get full order context
payments_and_orders_df = pd.merge(order_payments_df, orders_df, on='order_id')
full_order_details_df = pd.merge(payments_and_orders_df, order_items_df, on='order_id')

# Sort by payment_value to see the largest transactions
top_payments = full_order_details_df.sort_values(by='payment_value', ascending=False)

# Display the top 10 largest payments and their details
print("\nTop 10 Largest Transactions:")
print(top_payments.head(10)[['order_id', 'payment_value', 'price', 'freight_value', 'product_id']])

# Investigate zero-value payments
zero_value_payments = order_payments_df[order_payments_df['payment_value'] == 0]
print(f"\nNumber of zero-value payments: {len(zero_value_payments)}")
print("Payment types for zero-value transactions:")
print(zero_value_payments['payment_type'].value_counts())

In [None]:
# Determine the 99.9th percentile
cap_value = order_payments_df['payment_value'].quantile(0.999)
print(f"\n99.9th percentile value (capping threshold): {cap_value:.2f}")

# Create a new column with the capped values
order_payments_df['capped_payment_value'] = order_payments_df['payment_value'].clip(upper=cap_value)

# Compare the original and capped statistics
print("\nStatistics after capping:")
print(order_payments_df[['payment_value', 'capped_payment_value']].describe())

In [None]:
order_payments_df.to_parquet("olist_order_payments_cleaned_dataset.parquet", index=False)

Observations:
* Missing Values: Missing values were checked with `isnull().sum()`; no imputation was applied to this dataset.
* Outliers: The IQR method was applied to every numerical column. The largest payments were inspected by merging the payments with the orders and order items datasets, and zero-value payments were examined by `payment_type`.
* Capping: `payment_value` was capped at the 99.9th percentile in a new column, `capped_payment_value`; the original column was kept unchanged.
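
Percentile capping, as used in this notebook, can be sketched on a few made-up values. The 0.75 quantile is used here only so that the effect is visible on five values; the notebook itself caps at the 99th or 99.9th percentile.

```python
import pandas as pd

# Made-up payment values with one extreme transaction
payments = pd.Series([10.0, 25.0, 40.0, 60.0, 5000.0])

# Threshold taken from the data itself (0.75 is illustrative only)
cap = payments.quantile(0.75)

# clip returns a new Series; values above the cap are replaced by the cap
capped = payments.clip(upper=cap)

print(cap, capped.tolist())
```

Here the cap is 60.0, so the 5000.0 payment becomes 60.0 in the capped series. Because `clip` returns a new object, the original values remain available for auditing.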