# Olist E-commerce: Feature Engineering & Exploratory Analysis

**Author:** Zeeshan Akram <br>
**Date:** November 2025 <br>
**Project:** Olist E-commerce Performance & Profitability Analysis

---

## 1. Objective

The goal of this notebook is to transform the clean, merged dataset into a feature-rich table ready for deep analysis. I will perform exploratory data analysis (EDA) to uncover actionable, business-oriented insights across the key pillars of the operation.

My focus will be on **identifying key drivers, patterns, and anomalies** related to:
* **Logistics & Customer Satisfaction:** How does operational performance impact customer perception?
* **Product & Seller Performance:** Which products and sellers are driving profitability, and which are a drain on the business?
* **Customer Behavior & Value:** What are the purchasing patterns and geographical trends of our customers?

## 2. Data Source

I am loading the final, cleaned, and merged dataset created in the `01_Data_Cleaning_and_Integration` notebook.

* **File Path:** `../outputs/cleaned_data.csv`

## Importing libraries

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## Loading data

In [5]:
df = pd.read_csv('../outputs/cleaned_data.csv')
# preview
df.head()

| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | product_id | seller_id | ... | customer_city | customer_state | product_weight_g | product_length_cm | product_height_cm | product_width_cm | product_category | seller_zip_code_prefix | seller_city | seller_state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 | 87285b34884572647811a353c7ac498a | 3504c0cb71d7fa48d967e0e4c94d59d9 | ... | sao paulo | SP | 500.0 | 19.0 | 8.0 | 13.0 | housewares | 9350.0 | maua | SP |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 | 595fac2a385ac33a80bd5114aec74eb8 | 289cdb325fb7e7f891c38608bf9e0962 | ... | barreiras | BA | 400.0 | 19.0 | 13.0 | 19.0 | perfumery | 31570.0 | belo horizonte | SP |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 | aa4383b373c6aca5d8797843e5594415 | 4869f7a5dfa277a7dca6462dcf3b52b2 | ... | vianopolis | GO | 420.0 | 24.0 | 19.0 | 21.0 | auto | 14840.0 | guariba | SP |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 | d0b61bfb1de832b15ba9d266ca96e5b0 | 66922902710d126a0e7d26b0e3805106 | ... | sao goncalo do amarante | RN | 450.0 | 30.0 | 10.0 | 20.0 | pet_shop | 31842.0 | belo horizonte | MG |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 | 65266b2da20d04dbe00c5c2d3bb7859e | 2c9e548be18521d1c43cde1c582c6de8 | ... | santo andre | SP | 250.0 | 51.0 | 15.0 | 15.0 | stationery | 8752.0 | mogi das cruzes | SP |


## Handling Missing Values (NaNs)

A key part of my strategy is to **intentionally keep** the `NaN` values at this stage.

My `LEFT` joins correctly preserved all original orders, and the resulting `NaN`s are not errors; they represent real-world business scenarios. Dropping them would remove critical insights and bias the analysis.

* **Logistics NaNs** (e.g., `order_delivered_customer_date`) represent **canceled or failed deliveries**. These are essential for analyzing delivery failures.
* **Item NaNs** (e.g., `product_id`, `price`) represent the **775 orders that had no items**.
* **Review NaNs** (e.g., `review_score`) represent **orders that were never reviewed**.

My plan is to handle these `NaN`s during the analysis (e.g., `df[df['review_score'].notnull()]`) rather than dropping them now.
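As a minimal sketch of this filter-at-analysis-time approach, the example below uses a small made-up frame (the values are illustrative and not taken from the Olist data). The names `toy`, `reviewed`, and `unreviewed_share` are assumptions introduced only for this example.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the Olist dataset.
toy = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "review_score": [5.0, np.nan, 3.0, np.nan],
    "price": [10.0, 20.0, np.nan, 40.0],
})

# The master table keeps every row, including those with NaNs.
n_total = len(toy)

# Filter only inside the analysis that actually needs the column.
reviewed = toy[toy["review_score"].notna()]
mean_score = reviewed["review_score"].mean()

# The share of missing values is itself a metric worth reporting.
unreviewed_share = toy["review_score"].isna().mean()

print(n_total, len(reviewed), mean_score, unreviewed_share)
```

Because the filter lives inside each analysis, an order with no review still counts toward delivery or revenue metrics.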

## Feature Engineering

My first step in the analysis phase is to engineer new features from the existing data. These features are designed to convert raw data (like timestamps and prices) into actionable business metrics.

### Logistics Funnel Features

I am breaking the order lifecycle into three stage durations (approval, processing, and shipping) plus one delta against the promised delivery date, to pinpoint specific bottlenecks in the fulfillment process.

* **`approval_time`**
    * **Calculation:** `order_approved_at` - `order_purchase_timestamp`
    * **Business Purpose:** Measures the efficiency of the payment system. A long approval time can lead to early customer frustration.

* **`processing_time`**
    * **Calculation:** `order_delivered_carrier_date` - `order_approved_at`
    * **Business Purpose:** This is the **Seller's Lag**. It measures how long it takes the seller to prepare, pack, and hand the order to the shipping partner.

* **`shipping_time`**
    * **Calculation:** `order_delivered_customer_date` - `order_delivered_carrier_date`
    * **Business Purpose:** This is the **Shipper's Lag**. It isolates the performance of the logistics partner.

* **`delivery_delta`**
    * **Calculation:** `order_estimated_delivery_date` - `order_delivered_customer_date`
    * **Business Purpose:** This is the **most critical satisfaction feature**. A positive value means the order arrived *early*; a negative value means it arrived *late*.
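The `delivery_delta` definition above can be sketched as follows. This is an illustrative example on made-up timestamps, not the notebook's own cell; the `toy` frame and the `is_late` helper column are assumptions introduced here. Note that `.dt.days` floors toward negative infinity, so a package that arrives 3 days and 9 hours late is recorded as `-4`.

```python
import pandas as pd

# Toy timestamps for illustration only; not taken from the Olist data.
toy = pd.DataFrame({
    "order_estimated_delivery_date": pd.to_datetime(
        ["2018-01-20", "2018-01-20", "2018-01-20"]),
    "order_delivered_customer_date": pd.to_datetime(
        ["2018-01-15 10:00:00", "2018-01-23 09:00:00", None]),
})

# Estimated minus actual: positive = early, negative = late.
delta = (toy["order_estimated_delivery_date"]
         - toy["order_delivered_customer_date"])

# Whole days as a nullable integer; undelivered orders stay <NA>.
toy["delivery_delta"] = delta.dt.days.astype("Int16")

# Hypothetical helper column: nullable boolean, <NA> when undelivered.
toy["is_late"] = toy["delivery_delta"] < 0

print(toy[["delivery_delta", "is_late"]])
```

Keeping `is_late` nullable avoids silently classifying undelivered orders as on time.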

### Customer Friction Feature

I am creating one feature to measure the financial impact of shipping on the customer's purchase.

* **`freight_ratio`**
    * **Calculation:** `freight_value` / `price`
    * **Business Purpose:** This feature is a proxy for **shipping-cost friction** as the customer experiences it, *not* a measure of profit. It expresses the freight charge as a fraction of the item's price. A ratio of 0.5 means shipping cost 50% of the item's price on top of the item itself, which is a major friction point and a plausible driver of low `review_score`s.
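A minimal sketch of the `freight_ratio` calculation, assuming a zero `price` should be treated as missing so that the division does not produce `inf`. That guard is an assumption of this example rather than something the notebook states, and the `toy` values are made up.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the Olist dataset.
toy = pd.DataFrame({
    "price": [100.0, 20.0, 0.0, np.nan],
    "freight_value": [10.0, 10.0, 5.0, np.nan],
})

# Assumption: a zero price is treated as missing, so the ratio is NaN
# instead of inf for that row.
safe_price = toy["price"].replace(0, np.nan)
toy["freight_ratio"] = toy["freight_value"] / safe_price

print(toy)
```

Rows for orders with no items (where both columns are `NaN`) simply carry a `NaN` ratio, consistent with keeping missing values until analysis time.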

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113425 entries, 0 to 113424
Data columns (total 28 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       113425 non-null  object 
 1   customer_id                    113425 non-null  object 
 2   order_status                   113425 non-null  object 
 3   order_purchase_timestamp       113425 non-null  object 
 4   order_approved_at              113264 non-null  object 
 5   order_delivered_carrier_date   111457 non-null  object 
 6   order_delivered_customer_date  110196 non-null  object 
 7   order_estimated_delivery_date  113425 non-null  object 
 8   product_id                     112650 non-null  object 
 9   seller_id                      112650 non-null  object 
 10  price                          112650 non-null  float64
 11  freight_value                  112650 non-null  float64
 12  total_payment_value           

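## Converting Timestamps Back to DateTime

CSV files do not store dtypes, so the timestamp columns were reloaded as strings (the `object` dtype in the `df.info()` output above) and must be parsed again.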

In [9]:
date_cols = ['order_approved_at', 'order_purchase_timestamp', 'order_delivered_carrier_date', 
             'order_delivered_customer_date','order_estimated_delivery_date']
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')
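An alternative, sketched here on a two-row inline CSV, is to parse the columns at load time with the `parse_dates` argument of `pd.read_csv`. The CSV text is a toy example (one approval timestamp is deliberately left blank), and the names `csv_text`, `parsed`, and `at_load` are assumptions introduced for this illustration.

```python
import io
import pandas as pd

# Toy CSV for illustration; the second approval timestamp is left blank.
csv_text = """order_id,order_purchase_timestamp,order_approved_at
a,2017-10-02 10:56:33,2017-10-02 11:07:15
b,2018-07-24 20:41:37,
"""

date_cols = ["order_purchase_timestamp", "order_approved_at"]

# Option 1: load as text, then convert; unparseable values become NaT.
parsed = pd.read_csv(io.StringIO(csv_text))
for col in date_cols:
    parsed[col] = pd.to_datetime(parsed[col], errors="coerce")

# Option 2: let read_csv parse the columns while loading.
at_load = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)

print(parsed.dtypes)
print(at_load.dtypes)
```

The explicit loop with `errors="coerce"` is the more forgiving of the two, since malformed values become `NaT` instead of leaving the whole column as text.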

In [34]:
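# Approval time in whole days. The nullable Int16 dtype keeps missing approvals as <NA>
# and, unlike UInt8, leaves room for values above 255 or below 0.
df['approval_time'] = (df['order_approved_at'] - df['order_purchase_timestamp']).dt.days.astype('Int16')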

### Processing time

In [28]:
# Seller's lag in whole days; some rows come out negative (counted in the next cell)
df['processing_time'] = (df['order_delivered_carrier_date'] - df['order_approved_at']).dt.days.astype("Int16")

In [30]:
df['processing_time'].apply(lambda x: "Negative" if x<0 else "Positive").value_counts()

processing_time
Positive    111874
Negative      1551
Name: count, dtype: int64
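Note that the two buckets above sum to all 113,425 rows, so the "Positive" bucket also contains zeros and rows where `processing_time` is missing (everything that falls into the `else` branch). A hedged sketch of a tally that separates the three cases, on a made-up nullable series named `s`:

```python
import pandas as pd

# Toy nullable series for illustration only; not Olist data.
s = pd.Series([2, 0, -1, pd.NA, 5, -3], dtype="Int16", name="processing_time")

# Comparisons on nullable integers yield <NA> for missing entries,
# and .sum() skips those, so each bucket counts only real values.
counts = pd.Series({
    "Negative": int((s < 0).sum()),
    "Non-negative": int((s >= 0).sum()),
    "Missing": int(s.isna().sum()),
})

print(counts)
```

The negative count is unaffected by this change; only the remainder is split into real non-negative values and missing ones.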

In [42]:
# Use .loc rather than chained indexing so the assignment always hits df itself
df.loc[df['processing_time'] < 0, 'processing_time'] = 0

## Handling Logical Errors

While engineering the `processing_time` (`order_delivered_carrier_date` - `order_approved_at`), I discovered a significant data integrity issue.

* **Finding:** 1,551 records (about 1.4%) had a **negative processing time**. This is logically impossible, as it implies the seller shipped the item *before* the payment was approved.
* **Analysis:** This is dirty data, likely caused by a lag in the payment approval timestamp being recorded.
* **Action:** To neutralize these impossible values without dropping the rows, I will **impute all negative processing times to 0**. This assumes a best-case scenario (instant processing) and corrects the logical error.
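The imputation can be sketched on a toy series as below. The `was_negative` flag is a hypothetical addition, not part of the notebook; it is shown as one way to keep a record of which rows were altered so they can be excluded or inspected later.

```python
import pandas as pd

# Toy nullable series for illustration only; not Olist data.
s = pd.Series([3, -1, pd.NA, 0, -2], dtype="Int16", name="processing_time")

# Hypothetical audit flag: True only where the original value was negative.
# Missing values compare as <NA>, which fillna(False) maps to False.
was_negative = (s < 0).fillna(False)

# Set negatives to 0; missing values are left untouched.
fixed = s.copy()
fixed.loc[was_negative] = 0

print(fixed.tolist(), int(was_negative.sum()))
```

Since zero-imputation pulls averages downward, keeping such a flag makes it possible to check how sensitive later results are to these rows.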

### Shipping time

In [56]:
df['shipping_time'] = (df['order_delivered_customer_date'] - df['order_delivered_carrier_date']).dt.days.astype("Int16")

In [59]:
df['shipping_time'].apply(lambda x: 'Negative' if x<0 else 'Positive').value_counts()

shipping_time
Positive    113375
Negative        50
Name: count, dtype: int64

In [64]:
df.shape

(113425, 31)

In [73]:
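# Drop the 50 rows with a negative shipping time; keep rows where it is missing.
# .copy() makes the result an independent frame for later column assignments.
df = df[(df['shipping_time'] >= 0) | (df['shipping_time'].isna())].copy()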

In [74]:
## New shape
df.shape

(113375, 31)

## Handling Logical Errors (Shipping)

While engineering the `shipping_time` (`order_delivered_customer_date` - `order_delivered_carrier_date`), I discovered a second data integrity issue.

* **Finding:** 50 records had a **negative shipping time**, implying the customer received the package before the carrier scanned it as "picked up."
* **Analysis:** This is most likely a data-logging error. The number of affected rows (50, or 0.04% of the dataset) is negligible.
* **Action:** I will **drop these 50 rows** from the dataset. This is a cleaner solution than imputation, as it completely removes the logically impossible records and prevents them from skewing the analysis of shipping performance.
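A minimal sketch of this drop on made-up data, written so that rows with a missing `shipping_time` (undelivered orders) are kept. The `toy` frame and the `n_dropped` check are assumptions added for illustration.

```python
import pandas as pd

# Toy data for illustration only; not taken from the Olist dataset.
toy = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "shipping_time": pd.array([5, -1, pd.NA, 0], dtype="Int16"),
})

# <NA> comparisons are mapped to False, so missing rows are never dropped.
is_negative = (toy["shipping_time"] < 0).fillna(False)
cleaned = toy[~is_negative].copy()

# Sanity check: only the negative rows should have disappeared.
n_dropped = len(toy) - len(cleaned)
assert n_dropped == int(is_negative.sum())

print(cleaned)
```

Comparing the row counts before and after, as the notebook does with `df.shape`, confirms that exactly the flagged rows were removed.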