## Business Objective

The goal of this analysis is to understand the drivers of Fulfillment Efficiency across
products, customers, and logistics operations, and to identify actionable levers that
can reduce returns, cancellations, and revenue loss.


## Defining Fulfillment Efficiency

Fulfillment Efficiency measures how much of the gross order value is ultimately realized
as delivered revenue after accounting for returns.


Fulfillment Efficiency = (Delivered Revenue − Returned Revenue) / Gross Order Value
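
As a quick sanity check of the definition, the sketch below applies the formula to a tiny made-up order-line table. The column names `status` and `subtotal` mirror those used later in this notebook; the values and the helper name `fulfillment_efficiency` are invented for illustration only.

```python
import pandas as pd

# Toy order lines (made-up values, not taken from the real dataset)
toy_lines = pd.DataFrame({
    "status": ["fulfilled", "fulfilled", "returned", "cancelled", "pending"],
    "subtotal": [100.0, 50.0, 30.0, 20.0, 0.0],
})

def fulfillment_efficiency(lines: pd.DataFrame) -> float:
    """(Delivered Revenue - Returned Revenue) / Gross Order Value."""
    delivered = lines.loc[lines["status"] == "fulfilled", "subtotal"].sum()
    returned = lines.loc[lines["status"] == "returned", "subtotal"].sum()
    gov = lines["subtotal"].sum()  # GOV counts every line, whatever its status
    return (delivered - returned) / gov

toy_fe = fulfillment_efficiency(toy_lines)
print(round(toy_fe, 2))  # (150 - 30) / 200 = 0.6
```

With non-negative subtotals the ratio always lies between -1 and 1, and it turns negative whenever returned revenue exceeds delivered revenue.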


## Key Metrics Used

- Gross Order Value (GOV)
- Delivered Revenue
- Returned Revenue
- Cancellation Rate
- Return Rate
- Fulfillment Efficiency
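
Return Rate and Cancellation Rate are not defined explicitly above. One plausible reading, assumed here, is the share of order lines carrying the corresponding status; a value-weighted definition would be equally defensible. The data below is made up.

```python
import pandas as pd

# Toy order lines (made-up values)
toy_lines = pd.DataFrame({
    "status": ["fulfilled", "returned", "cancelled", "fulfilled", "shipped",
               "returned", "pending", "fulfilled", "cancelled", "fulfilled"],
})

# Assumed definition: share of lines in each status, by line count
status_share = toy_lines["status"].value_counts(normalize=True)

return_rate = status_share.get("returned", 0.0)
cancellation_rate = status_share.get("cancelled", 0.0)
print(f"return rate: {return_rate:.0%}, cancellation rate: {cancellation_rate:.0%}")
```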


## Hypotheses

H1: Categories with lower product ratings have higher return rates.

H2: Products with low stock availability experience higher cancellation rates.

H3: Price deviations of ±10% from category averages are associated with poorer
fulfillment outcomes.

H4: Certain logistics combinations (shipping + payment method) drive higher returns.
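
To make H3 and H4 concrete, here is a minimal sketch on made-up data. It assumes that H3 means "priced more than 10% above or below the category mean", and it uses hypothetical column names `shipping_method` and `payment_method` for H4; neither the interpretation nor those names is confirmed by the source tables.

```python
import pandas as pd

# Toy data (made-up values; shipping/payment column names are hypothetical)
toy = pd.DataFrame({
    "category": ["A", "A", "A", "B", "B", "B"],
    "price": [10.0, 10.0, 13.0, 50.0, 40.0, 60.0],
    "shipping_method": ["std", "express", "std", "std", "express", "express"],
    "payment_method": ["card", "card", "cod", "cod", "card", "card"],
    "status": ["fulfilled", "returned", "returned", "fulfilled", "fulfilled", "returned"],
})

# H3: relative deviation of each price from its category mean
category_mean = toy.groupby("category")["price"].transform("mean")
toy["price_dev"] = (toy["price"] - category_mean) / category_mean
toy["price_outlier"] = toy["price_dev"].abs() > 0.10  # assumed reading of "±10%"

# H4: return rate per shipping + payment combination
toy["is_returned"] = toy["status"].eq("returned")
combo_return_rate = toy.groupby(["shipping_method", "payment_method"])["is_returned"].mean()

print(toy[["category", "price", "price_dev", "price_outlier"]])
print(combo_return_rate)
```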


In [1]:
import gdown
import pandas as pd

## Data Import

The following cells download raw datasets from cloud storage.
These are setup steps and not part of the analysis.


In [2]:
file_id = "1iFlv5PjnezdaCcTzWsjAX-Ck9kMCBMMK"
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, "orders.csv", quiet=False)

Downloading...
From (original): https://drive.google.com/uc?id=1iFlv5PjnezdaCcTzWsjAX-Ck9kMCBMMK
From (redirected): https://drive.google.com/uc?id=1iFlv5PjnezdaCcTzWsjAX-Ck9kMCBMMK&confirm=t&uuid=d41aafa4-252c-40f4-a027-fe1440c126de
To: /content/orders.csv
100%|██████████| 377M/377M [00:07<00:00, 48.9MB/s]


'orders.csv'

In [3]:
file_id = "1o25JTcxDBEaigjCrdq_bzKb9BtzCZdy8"
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, "orderline.csv", quiet=False)


Downloading...
From (original): https://drive.google.com/uc?id=1o25JTcxDBEaigjCrdq_bzKb9BtzCZdy8
From (redirected): https://drive.google.com/uc?id=1o25JTcxDBEaigjCrdq_bzKb9BtzCZdy8&confirm=t&uuid=74688600-0fb7-4926-a13c-6a7e9962ada6
To: /content/orderline.csv
100%|██████████| 642M/642M [00:12<00:00, 52.6MB/s]


'orderline.csv'

In [4]:
file_id = "1Aa5oSSE-3Fn6RQpupqcg2sAf3l2VlccA"
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, "person.csv", quiet=False)


Downloading...
From: https://drive.google.com/uc?id=1Aa5oSSE-3Fn6RQpupqcg2sAf3l2VlccA
To: /content/person.csv
100%|██████████| 83.3M/83.3M [00:00<00:00, 121MB/s]


'person.csv'

In [5]:
file_id = "1dL388NuXzV8mpTJ44HEmp2LgxKR1d4z8"
url = f"https://drive.google.com/uc?id={file_id}"
gdown.download(url, "product.csv", quiet=False)


Downloading...
From: https://drive.google.com/uc?id=1dL388NuXzV8mpTJ44HEmp2LgxKR1d4z8
To: /content/product.csv
100%|██████████| 1.75M/1.75M [00:00<00:00, 61.3MB/s]


'product.csv'

In [6]:
df_person = pd.read_csv("person.csv", sep=";")
df_orders = pd.read_csv("orders.csv", sep=";")
df_orderline = pd.read_csv("orderline.csv", sep=";")
df_product = pd.read_csv("product.csv", sep=";")

In [7]:
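# Note: a notebook cell displays only its last expression,
# so the output below is df_person.shape.
df_orders.shape
df_orderline.shape
df_product.shape
df_person.shape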

(600000, 12)

No data cleaning was required at this stage: IDs and categorical fields were already well structured.

## Initial Data Inspection

To get a quick sense of the structure and fields in each table, I reviewed a small sample of rows from every dataset.
This helps confirm what each column means and spot obvious data problems early.

In [None]:
# Only the last expression is displayed, so the output below refers to product_id.
df_person['person_id'].is_unique
df_orders['order_id'].is_unique
df_product['product_id'].is_unique

True
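
Because only the last expression of a cell is displayed, one way to see every key check at once is to collect the results in a dictionary. The sketch below uses toy stand-ins for the real tables and also adds a referential check; the presence of a `person_id` column in the orders table is an assumption, not something shown in the source.

```python
import pandas as pd

# Toy stand-ins for the real tables (made-up values)
df_person_toy = pd.DataFrame({"person_id": [1, 2, 3]})
df_orders_toy = pd.DataFrame({"order_id": [10, 11, 11], "person_id": [1, 2, 4]})

key_checks = {
    "person_id unique": df_person_toy["person_id"].is_unique,
    "order_id unique": df_orders_toy["order_id"].is_unique,
    # Referential check: every order should point at a known person
    "orders reference known persons": bool(
        df_orders_toy["person_id"].isin(df_person_toy["person_id"]).all()
    ),
}
print(key_checks)
```

In this toy data the duplicated `order_id` 11 and the unknown `person_id` 4 make the last two checks fail, which is the kind of problem the inspection is meant to surface.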

In [None]:
# Only the last expression is displayed, so the output below is the order-line status count.
df_orders['status'].value_counts()
df_orderline['status'].value_counts()

| status | count |
|---|---|
| fulfilled | 2602356 |
| cancelled | 2600481 |
| returned | 2599406 |
| pending | 2599167 |
| shipped | 2598590 |


In [10]:
success_lines = ['fulfilled']
loss_lines = ['returned']

delivered_revenue = df_orderline.loc[
    df_orderline['status'].isin(success_lines), 'subtotal'
].sum()

returned_revenue = df_orderline.loc[
    df_orderline['status'].isin(loss_lines), 'subtotal'
].sum()

GOV = df_orderline['subtotal'].sum()

NRR = delivered_revenue - returned_revenue
fulfillment_efficiency = NRR / GOV
NRR, fulfillment_efficiency


(np.float64(10433400.990003586), np.float64(0.00014825130582453696))

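Net realized revenue (`NRR`) is about 10.43 million in subtotal units, which puts overall fulfillment efficiency across the platform at roughly 0.00015 (about 0.015% of GOV). Fulfilled and returned revenue nearly cancel each other out, and GOV also includes cancelled, pending, and shipped lines.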


In [15]:
df_combined = (
    df_orderline
        .merge(
            df_product[
                [
                    'product_id',
                    'category',
                    'rating_average',
                    'review_count',
                    'price',
                    'stock_quantity'
                ]
            ],
            on='product_id',
            how='left'
        )
)
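
A left merge silently keeps order lines whose `product_id` has no match and silently duplicates rows if the product table contains repeated keys. The sketch below, on made-up data, shows how the `validate` and `indicator` parameters of `DataFrame.merge` can guard against both; the toy frame names are placeholders.

```python
import pandas as pd

# Toy stand-ins (made-up values); product_id 9 has no match in the product table
lines_toy = pd.DataFrame({"product_id": [1, 1, 2, 9], "subtotal": [5.0, 7.0, 3.0, 4.0]})
products_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "category": ["Gaming", "Health", "Audiobooks"],
})

merged_toy = lines_toy.merge(
    products_toy,
    on="product_id",
    how="left",
    validate="many_to_one",  # raises MergeError if product_id repeats in products_toy
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)

# Order lines that found no product row
unmatched = int((merged_toy["_merge"] == "left_only").sum())
print(len(merged_toy), unmatched)
```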

In [13]:
# Keep subtotal only where the status matches, then sum per category
category_summary = df_combined.assign(
    Delivered=df_combined['subtotal'].where(df_combined['status'] == 'fulfilled', 0),
    Returned=df_combined['subtotal'].where(df_combined['status'] == 'returned', 0)
).groupby('category').agg(
    GOV=('subtotal', 'sum'),
    Delivered=('Delivered', 'sum'),
    Returned=('Returned', 'sum')
)

category_summary['Fulfillment_Efficiency'] = (
    (category_summary['Delivered'] - category_summary['Returned']) /
    category_summary['GOV']
)

category_summary.sort_values('Fulfillment_Efficiency')


| category | GOV | Delivered | Returned | Fulfillment_Efficiency |
|---|---|---|---|---|
| Beverages | 1472792000.0 | 292524500.0 | 297788400.0 | -0.003574 |
| Gaming | 1364663000.0 | 270812700.0 | 274798900.0 | -0.002921 |
| Kitchenware | 1174737000.0 | 232612700.0 | 235625200.0 | -0.002564 |
| Automotive | 1237309000.0 | 244923600.0 | 248096200.0 | -0.002564 |
| Reference | 1164652000.0 | 231472700.0 | 234192600.0 | -0.002335 |
| Car Electronics | 1259669000.0 | 251868800.0 | 254524800.0 | -0.002108 |
| Health | 1371456000.0 | 273870000.0 | 276138200.0 | -0.001654 |
| Audiobooks | 1510470000.0 | 300780200.0 | 302962900.0 | -0.001445 |
| Computing | 1534296000.0 | 306905200.0 | 308803000.0 | -0.001237 |
| Outdoors | 1396697000.0 | 277880600.0 | 279284700.0 | -0.001005 |


### Fulfillment Efficiency by Category

The categories shown above all have slightly negative fulfillment efficiency because returned revenue exceeds fulfilled revenue. The gaps are small (at most about 0.36% of category GOV, for Beverages), so they flag categories worth a closer look rather than proving operational or product-level issues such as customer dissatisfaction or fragile logistics.
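
To put the negative values in context, the sketch below recomputes delivered and returned revenue as shares of GOV for the first three rows of the category table above, with the figures copied as printed (so they carry the display rounding). The frame name `category_rows` is a placeholder.

```python
import pandas as pd

# First three rows of the category table above, copied as printed
category_rows = pd.DataFrame(
    {
        "GOV": [1472792000.0, 1364663000.0, 1174737000.0],
        "Delivered": [292524500.0, 270812700.0, 232612700.0],
        "Returned": [297788400.0, 274798900.0, 235625200.0],
    },
    index=pd.Index(["Beverages", "Gaming", "Kitchenware"], name="category"),
)

shares = pd.DataFrame({
    "delivered_share": category_rows["Delivered"] / category_rows["GOV"],
    "returned_share": category_rows["Returned"] / category_rows["GOV"],
})
# Efficiency is the difference between the two shares
shares["efficiency"] = shares["delivered_share"] - shares["returned_share"]
print(shares.round(4))
```

Both shares sit close to one fifth of GOV in each of these rows, so the sign of the efficiency metric is decided by a very small difference between two nearly equal quantities.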

In [27]:
# observed=False makes the current default for categorical bins explicit
df_combined.groupby(pd.cut(df_combined['rating_average'], bins=5), observed=False)['status'] \
           .value_counts(normalize=True)




| rating_average | status | proportion |
|---|---|---|
| (0.996, 1.8] | shipped | 0.200196 |
| (0.996, 1.8] | fulfilled | 0.200077 |
| (0.996, 1.8] | cancelled | 0.199992 |
| (0.996, 1.8] | pending | 0.199898 |
| (0.996, 1.8] | returned | 0.199838 |
| (1.8, 2.6] | returned | 0.200352 |
| (1.8, 2.6] | fulfilled | 0.200138 |
| (1.8, 2.6] | cancelled | 0.199917 |
| (1.8, 2.6] | pending | 0.199875 |
| (1.8, 2.6] | shipped | 0.199718 |


### Impact of Product Ratings on Fulfillment Outcomes

Across the rating bands, order status distributions remain nearly uniform: in the bands shown above, every status stays within about 0.04 percentage points of 20%.
Product ratings alone are therefore not a strong predictor of returns, cancellations, or fulfillment success in this dataset, and the data give no support for H1 at the product-rating level.

As a result, further analysis focuses on operational and pricing factors
that may have a stronger influence on fulfillment efficiency.
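
One way to quantify "nearly uniform" is an association measure such as Cramér's V, computed from the rating-band by status contingency table. The sketch below is not part of the original analysis; it runs on synthetic data in which status is drawn independently of rating, so the result should be close to zero.

```python
import numpy as np
import pandas as pd

# Synthetic data: status is assigned independently of rating (made-up values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "rating_average": rng.uniform(1, 5, size=5000),
    "status": rng.choice(
        ["fulfilled", "returned", "cancelled", "pending", "shipped"], size=5000
    ),
})
toy["rating_band"] = pd.cut(toy["rating_average"], bins=5)

# Observed counts and the counts expected under independence
observed = pd.crosstab(toy["rating_band"], toy["status"]).to_numpy()
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Pearson chi-square statistic
chi2 = ((observed - expected) ** 2 / expected).sum()
# Cramér's V: 0 means no association, 1 means perfect association
cramers_v = np.sqrt(chi2 / (n * (min(observed.shape) - 1)))
print(round(float(cramers_v), 3))
```

Applied to the real `df_combined`, a value near zero would back up the reading that rating band and status are essentially unrelated.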

In [25]:
# observed=False makes the current default for categorical bins explicit
df_combined.groupby(pd.cut(df_combined['stock_quantity'], bins=5), observed=False)['status'] \
           .value_counts(normalize=True)



| stock_quantity | status | proportion |
|---|---|---|
| (-0.5, 100.0] | cancelled | 0.200117 |
| (-0.5, 100.0] | returned | 0.200098 |
| (-0.5, 100.0] | fulfilled | 0.200061 |
| (-0.5, 100.0] | pending | 0.199921 |
| (-0.5, 100.0] | shipped | 0.199803 |
| (100.0, 200.0] | shipped | 0.200247 |
| (100.0, 200.0] | returned | 0.19995 |
| (100.0, 200.0] | fulfilled | 0.199939 |
| (100.0, 200.0] | pending | 0.199937 |
| (100.0, 200.0] | cancelled | 0.199928 |

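
The equal-width bins above place every product with up to 100 units in a single band, which may hide any effect that is specific to very low stock, the case H2 is about. A possible follow-up, sketched here on made-up data, compares cancellation rates below and above an explicit low-stock threshold. The threshold of 10 units is a hypothetical cut-off, not a value taken from the source.

```python
import pandas as pd

# Toy data (made-up values)
toy = pd.DataFrame({
    "stock_quantity": [0, 3, 8, 40, 120, 250, 400, 5],
    "status": ["cancelled", "cancelled", "fulfilled", "fulfilled",
               "returned", "fulfilled", "cancelled", "shipped"],
})

LOW_STOCK_THRESHOLD = 10  # hypothetical cut-off, chosen only for illustration

# Label each line as low or normal stock, and flag cancellations
toy["stock_band"] = (
    toy["stock_quantity"].lt(LOW_STOCK_THRESHOLD).map({True: "low", False: "normal"})
)
toy["is_cancelled"] = toy["status"].eq("cancelled")

# Cancellation rate within each stock band
cancel_rate_by_stock = toy.groupby("stock_band")["is_cancelled"].mean()
print(cancel_rate_by_stock)
```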