# Olist E-commerce Performance & Profitability Analysis  
## Notebook 01: Data Cleaning & Integration

### Objective
This notebook focuses on the **data understanding, inspection, cleaning, and integration** of all Olist datasets.  
The goal is to create one **clean, consistent, and analysis-ready master table** that combines data from all sources.

### Business Context
Olist is a Brazilian e-commerce marketplace connecting small businesses (sellers) to customers nationwide.  
However, the company has faced **declining customer satisfaction** and **inconsistent profitability** across categories.

To help the business recover, I aim to:
1. Identify the **key causes of customer dissatisfaction** (e.g., delays, product issues).
2. Find the **drivers of low profitability** (e.g., high shipping cost, poor-performing sellers).
3. Prepare clean data for **visual and dashboard analysis** in the next phase.

### This Notebook Covers:
1. Data Loading — importing all 9 CSV files  
2. Initial Inspection — overview of structure, size, and missing values  
3. Data Cleaning — handling missing values, date formatting, category mapping  
4. Data Integration — merging multiple tables into one “master” dataset  
5. Export — saving the cleaned dataset for the next analysis notebook

## Step 1: Import Required Libraries
I’ll import all the libraries required for data loading, cleaning, and merging.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

## Dataset Overview: Olist Data Dictionary

| File Name | Primary Purpose | Foreign Keys (for joining) |
| :--- | :--- | :--- |
| **`olist_orders_dataset.csv`** | **Master Order Table** (Status, Dates, Customer ID) | `customer_id` |
| **`olist_customers_dataset.csv`** | Customer location and unique user ID. | `customer_id` (from Orders) |
| **`olist_geolocation_dataset.csv`** | Geographical coordinates for zip codes. | `geolocation_zip_code_prefix` (from Customers/Sellers) |
| **`olist_order_items_dataset.csv`** | Transactional core (price, seller, product). | `order_id`, `product_id`, `seller_id` |
| **`olist_order_payments_dataset.csv`** | Payment method, installments, and total value. | `order_id` |
| **`olist_order_reviews_dataset.csv`** | Customer satisfaction score (1-5) and message. | `order_id` |
| **`olist_products_dataset.csv`** | Physical details, attributes, and category name. | `product_id` (from Order Items) |
| **`olist_sellers_dataset.csv`** | Seller registration and location information. | `seller_id` (from Order Items) |
| **`product_category_name_translation.csv`** | Look-up table to convert Portuguese names to English. | `product_category_name` (from Products) |

---

### Loading all files

In [20]:
orders_data = pd.read_csv('../raw_data/olist_orders_dataset.csv')
customers_data = pd.read_csv('../raw_data/olist_customers_dataset.csv')
geolocation_data = pd.read_csv('../raw_data/olist_geolocation_dataset.csv')
orders_item_data = pd.read_csv('../raw_data/olist_order_items_dataset.csv')
orders_payment_data = pd.read_csv('../raw_data/olist_order_payments_dataset.csv')
orders_review_data = pd.read_csv('../raw_data/olist_order_reviews_dataset.csv')
products_data = pd.read_csv('../raw_data/olist_products_dataset.csv')
sellers_data = pd.read_csv('../raw_data/olist_sellers_dataset.csv')
category_names_data = pd.read_csv('../raw_data/product_category_name_translation.csv')

In [21]:
def inspect(df, df_name='df'):
    print(df_name)
    print(f"\nShape: {df.shape}")
    print("\nDtypes & Null Counts:")

    df.info()
    print(f"\nData Preview: {df.head()}")
    print(f"\nNull Count: {df.isnull().sum()}\n")

In [22]:
data_objects_map = {
    'Orders Data': orders_data,
    'Customers Data': customers_data,
    'GeoLocation Data': geolocation_data,
    'Orders Items Data': orders_item_data,
    'Orders Payments Data': orders_payment_data,
    'Orders Reviews Data': orders_review_data,
    'Products Data': products_data,
    'Sellers Data': sellers_data,
    'Category Names Data': category_names_data
}

for name, data in data_objects_map.items():
    inspect(data, name)

Orders Data

Shape: (99441, 8)

Dtypes & Null Counts:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB

Data Preview:                            order_id                       customer_id  \
0  e481f51cbdc54678b7cc49136f2d6af7  9ef432eb6251297304e76186b10a928d   
1  53cdb2fc8bc7dce0b6741e2150273451  b0830fb474

## Initial Data Inspection Summary

### olist_orders_dataset

This is the main **orders timeline** table ($\approx 99.4k$ records). All date columns are currently stored as `object` (string) and **must be converted** to datetime objects for analysis. Approximately **3% of orders lack a customer delivery date** and $\approx 1.8\%$ lack a carrier dispatch date, strongly indicating **canceled or undelivered orders**.
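The date conversion described above could look like the sketch below. It runs on a small hand-made frame (`orders_toy`, a hypothetical stand-in for `orders_data`) so it does not depend on the CSV files. Using `errors="coerce"` is my assumption about how unparseable values should be handled.

```python
import pandas as pd

# Toy stand-in for orders_data (hypothetical values, same column names).
orders_toy = pd.DataFrame({
    "order_id": ["a1", "b2"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", None],
})

# Convert each timestamp column; errors="coerce" turns unparseable values into NaT.
date_cols = ["order_purchase_timestamp", "order_delivered_customer_date"]
for col in date_cols:
    orders_toy[col] = pd.to_datetime(orders_toy[col], errors="coerce")

# Share of orders that have no customer delivery date.
missing_delivery_share = orders_toy["order_delivered_customer_date"].isna().mean()

print(orders_toy.dtypes)
print(f"Missing delivery date: {missing_delivery_share:.1%}")
```

On the real table, `date_cols` would list all five timestamp columns shown in the `info()` output.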

### olist_customers_dataset

This dataset is clean and complete ($\approx 99.4k$ records with 0 nulls). The critical feature is the **`customer_unique_id`**, which is necessary for distinguishing between one-time buyers and repeat customers for RFM analysis.
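As a minimal sketch of how `customer_unique_id` separates one-time from repeat buyers, the example below counts `customer_id` values per unique customer. The frame `customers_toy` is hypothetical toy data, not the real table.

```python
import pandas as pd

# Toy stand-in for customers_data: u1 appears under three order-level customer_ids.
customers_toy = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3", "c4"],
    "customer_unique_id": ["u1", "u1", "u1", "u2"],
})

# Number of distinct customer_id values attached to each real customer.
ids_per_customer = customers_toy.groupby("customer_unique_id")["customer_id"].nunique()

# Customers who appear more than once are repeat buyers.
repeat_customers = int((ids_per_customer > 1).sum())

# The raw difference in unique counts measures extra purchases, not repeat buyers.
extra_ids = customers_toy["customer_id"].nunique() - customers_toy["customer_unique_id"].nunique()

print(repeat_customers, extra_ids)
```

The two numbers differ whenever a customer has bought three or more times, so they should be reported separately.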

### olist_geolocation_dataset

This table is **very large** ($\approx 1$ million records, 38+ MB in memory as loaded) and fully complete. Because of its size, I should **bring only the required zip code prefixes** into the merge phase rather than the full table, both to limit memory usage and to avoid multiplying rows when the same prefix appears many times.
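One way to shrink the geolocation table before merging is to collapse it to one row per zip code prefix. The sketch below does this on hypothetical toy data (`geo_toy`). Averaging the coordinates per prefix is my assumption; taking the first or median point would also work.

```python
import pandas as pd

# Toy stand-in for geolocation_data: prefix 1037 appears twice.
geo_toy = pd.DataFrame({
    "geolocation_zip_code_prefix": [1037, 1037, 1046],
    "geolocation_lat": [-23.54, -23.56, -23.55],
    "geolocation_lng": [-46.64, -46.62, -46.64],
})

# One row per prefix, using the mean coordinate as a representative point.
geo_by_prefix = (
    geo_toy.groupby("geolocation_zip_code_prefix", as_index=False)
    .agg(
        geolocation_lat=("geolocation_lat", "mean"),
        geolocation_lng=("geolocation_lng", "mean"),
    )
)

print(geo_by_prefix)
```

With the prefix unique, a later left join on `customer_zip_code_prefix` or `seller_zip_code_prefix` cannot add rows.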

### olist_order_items_dataset

This transactional table is complete and contains **$\approx 112.6k$ item records** linked to $\approx 98.7k$ orders (98,666 unique `order_id` values). Since an order with several items has one row per item, I must aggregate prices and freight values to one row per `order_id` before merging; otherwise order-level values would be duplicated.
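The per-order aggregation could be sketched as follows, again on hypothetical toy data (`items_toy`). The output column names `item_count`, `total_price`, and `total_freight` are illustrative choices of mine.

```python
import pandas as pd

# Toy stand-in for orders_item_data: order o1 contains two items.
items_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "order_item_id": [1, 2, 1],
    "price": [10.0, 20.0, 5.0],
    "freight_value": [2.0, 3.0, 1.5],
})

# Collapse to one row per order using named aggregation.
items_per_order = (
    items_toy.groupby("order_id", as_index=False)
    .agg(
        item_count=("order_item_id", "count"),
        total_price=("price", "sum"),
        total_freight=("freight_value", "sum"),
    )
)

print(items_per_order)
```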

### olist_order_payments_dataset

This table has $\approx 103.9k$ records, slightly more than the number of orders, which indicates that some **orders were paid for using multiple payment types** (e.g., credit card and voucher) or in several sequential transactions. No critical missing values were observed.
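To see which orders were split across several payments, the payment rows can be summarised per order. This is a sketch on hypothetical toy data (`payments_toy`); the summary column names are my own.

```python
import pandas as pd

# Toy stand-in for orders_payment_data: order o1 was paid by card plus voucher.
payments_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o2"],
    "payment_sequential": [1, 2, 1],
    "payment_type": ["credit_card", "voucher", "boleto"],
    "payment_value": [80.0, 20.0, 45.5],
})

# One row per order: number of payment rows, distinct methods, and total paid.
payments_per_order = (
    payments_toy.groupby("order_id", as_index=False)
    .agg(
        payment_rows=("payment_sequential", "count"),
        payment_types=("payment_type", "nunique"),
        total_paid=("payment_value", "sum"),
    )
)

# Orders with more than one payment row.
split_payment_orders = payments_per_order.loc[
    payments_per_order["payment_rows"] > 1, "order_id"
].tolist()

print(payments_per_order)
print(split_payment_orders)
```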

### olist_order_reviews_dataset

This table has major sparsity issues in its qualitative data. While review scores are complete, **$\approx 88\%$ of reviews lack a comment title** and $\approx 59\%$ lack a message. I can only rely on the **`review_score`** for satisfaction analysis unless advanced text analysis is performed on the available messages.
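The missing-value shares quoted above can be computed per column with `isna().mean()`. The sketch uses hypothetical toy data (`reviews_toy`). The comment column names follow the public Olist schema as I understand it and should be checked against `orders_review_data.columns`.

```python
import pandas as pd

# Toy stand-in for orders_review_data (column names assumed from the Olist schema).
reviews_toy = pd.DataFrame({
    "review_score": [5, 4, 1, 3],
    "review_comment_title": [None, None, "late", None],
    "review_comment_message": ["great", None, "arrived late", None],
})

# Fraction of missing values in each column.
null_share = reviews_toy.isna().mean()

print(null_share.map("{:.0%}".format))
```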

### olist_products_dataset

This catalog of $\approx 32.9k$ unique products is relatively clean. The main issue is that $\approx 2\%$ of records lack product attribute data (category, length, weight), which means these **products cannot be categorized or used in logistics analysis** (freight cost estimation).

### olist_sellers_dataset

This small table ($\approx 3k$ records) is fully complete and will be critical for linking seller performance metrics (derived from **Order Items**) to their geographical location.

### product_category_name_translation

This small, complete table (71 records) is essential for converting the Portuguese category names into **English** for immediate, human-readable reporting.
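Applying the translation is a left join on `product_category_name`. The sketch below uses hypothetical toy frames. The English column name `product_category_name_english` is assumed from the public Olist schema, and the `"unknown"` fallback label is an illustrative choice of mine.

```python
import pandas as pd

# Toy stand-ins for products_data and category_names_data.
products_toy = pd.DataFrame({
    "product_id": ["p1", "p2", "p3"],
    "product_category_name": ["perfumaria", "artes", None],
})
translation_toy = pd.DataFrame({
    "product_category_name": ["perfumaria", "artes"],
    "product_category_name_english": ["perfumery", "art"],
})

# A left join keeps every product, including those with no category.
products_en = products_toy.merge(translation_toy, on="product_category_name", how="left")

# Products without a translatable category get an explicit placeholder.
products_en["product_category_name_english"] = (
    products_en["product_category_name_english"].fillna("unknown")
)

print(products_en)
```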

## Finding the Keys

In [26]:
for name, data in data_objects_map.items():
    print(name)
    print(data.nunique())
    print('\n')

Orders Data
order_id                         99441
customer_id                      99441
order_status                         8
order_purchase_timestamp         98875
order_approved_at                90733
order_delivered_carrier_date     81018
order_delivered_customer_date    95664
order_estimated_delivery_date      459
dtype: int64


Customers Data
customer_id                 99441
customer_unique_id          96096
customer_zip_code_prefix    14994
customer_city                4119
customer_state                 27
dtype: int64


GeoLocation Data
geolocation_zip_code_prefix     19015
geolocation_lat                717360
geolocation_lng                717613
geolocation_city                 8011
geolocation_state                  27
dtype: int64


Orders Items Data
order_id               98666
order_item_id             21
product_id             32951
seller_id               3095
shipping_limit_date    93318
price                   5968
freight_value           6999
dtype: int64


Ord

## Unique Counts per Key Column

In [41]:
orders_payment_data

| | order_id | payment_sequential | payment_type | payment_installments | payment_value |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 1 | credit_card | 8 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 1 | credit_card | 1 | 24.39 |
| 2 | 25e8ea4e93396b6fa0d3dd708e76c1bd | 1 | credit_card | 1 | 65.71 |
| 3 | ba78997921bbcdc1373bb41e913ab953 | 1 | credit_card | 8 | 107.78 |
| 4 | 42fdf880ba16b47b59251dd489d4441a | 1 | credit_card | 2 | 128.45 |
| ... | ... | ... | ... | ... | ... |
| 103881 | 0406037ad97740d563a178ecc7a2075c | 1 | boleto | 1 | 363.31 |
| 103882 | 7b905861d7c825891d6347454ea7863f | 1 | credit_card | 2 | 96.80 |
| 103883 | 32609bbb3dd69b3c066a6860554a77bf | 1 | credit_card | 1 | 47.77 |
| 103884 | b8b61059626efa996a60be9bb9320e10 | 1 | credit_card | 5 | 369.54 |


In [48]:
print("Orders Data:")
print("Orders ID and Customer ID Unique Count:")
display(orders_data['order_id'].nunique())
display(orders_data['customer_id'].nunique())

print("\nCustomers Data:")
print("Customer ID and Customer Unique ID Unique count:")
display(customers_data['customer_id'].nunique())
display(customers_data['customer_unique_id'].nunique())

print("\nProducts Data:")
print("Product ID Unique count:")
display(products_data['product_id'].nunique())

print("\nOrders Item Data:")
print("Order ID and Product ID Unique count:")
display(orders_item_data['order_id'].nunique())
display(orders_item_data['product_id'].nunique())

print('\nOrders Review Data:')
print('Order ID Unique count:')
display(orders_review_data['order_id'].nunique())

print("\nOrders Payment Data:")
print('Order ID Unique count:')
display(orders_payment_data['order_id'].nunique())

Orders Data:
Orders ID and Customer ID Unique Count:


99441

99441


Customers Data:
Customer ID and Customer Unique ID Unique count:


99441

96096


Products Data:
Product ID Unique count:


32951


Orders Item Data:
Order ID and Product ID Unique count:


98666

32951


Orders Review Data:
Order ID Unique count:


98673


Orders Payment Data:
Order ID Unique count:


99440

## Data Model and Relational Integrity Check

### Primary Keys (PKs)

I successfully identified the primary identifier for each core entity table:

| Table | Primary Key (PK) |
| :--- | :--- |
| **Orders Data** | `order_id` |
| **Customers Data** | `customer_id` (`customer_unique_id` repeats across orders and identifies the person, not the row) |
| **Reviews Data** | `review_id` |
| **Products Data** | `product_id` |
| **Sellers Data** | `seller_id` |
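A candidate primary key can be verified rather than assumed: it must have no nulls and no duplicates. The helper `is_primary_key` below is a hypothetical function of mine, shown on toy data.

```python
import pandas as pd

def is_primary_key(df: pd.DataFrame, column: str) -> bool:
    """Return True when `column` has no nulls and no duplicate values."""
    return bool(df[column].notna().all() and df[column].is_unique)

# Toy stand-in for customers_data: u1 placed two orders.
customers_toy = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "customer_unique_id": ["u1", "u1", "u2"],
})

pk_check = {col: is_primary_key(customers_toy, col) for col in customers_toy.columns}
print(pk_check)
```

Running the same check on every table in `data_objects_map` with its proposed key would confirm the table above against the real data.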

### Inter-Table Relationships (Foreign Keys)

The transactional nature of the data means the tables are connected through the following joins:

| Data Set | Link Key (Foreign Key) | Target Table | Purpose |
| :--- | :--- | :--- | :--- |
| **Orders Items Data** | `order_id` | Orders Data | Links items back to the parent order and customer. |
| | `product_id` | Products Data | Links items to product attributes (category, weight). |
| | `seller_id` | Sellers Data | Identifies the seller responsible for fulfillment. |
| **Orders Reviews Data** | `order_id` | Orders Data | Links satisfaction score to the order timeline. |
| **Orders Data** | `customer_id` | Customers Data | Links the order to the customer's unique ID. |
| **GeoLocation Data** | *None* | *None* | Merging requires matching zip code prefixes, not a unique transaction key. |
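Each foreign-key link in the table can be tested in both directions with `isin`: child rows whose key has no parent, and parent rows that have no children. A minimal sketch on hypothetical toy frames:

```python
import pandas as pd

# Toy stand-ins: order o3 has no items; every item points to a known order.
orders_toy = pd.DataFrame({"order_id": ["o1", "o2", "o3"]})
items_toy = pd.DataFrame({"order_id": ["o1", "o1", "o2"]})

# Child rows whose order_id does not exist in the parent table (should be zero).
orphan_item_rows = int((~items_toy["order_id"].isin(orders_toy["order_id"])).sum())

# Parent rows with no matching child rows.
orders_without_items = orders_toy.loc[
    ~orders_toy["order_id"].isin(items_toy["order_id"]), "order_id"
].tolist()

print(orphan_item_rows, orders_without_items)
```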

### Integrity and Cardinality Observations

Deep inspection revealed key structural differences that must be handled during the merging phase:

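1.  **Customer De-duplication:** The `Customers Data` has **99,441** `customer_id` entries (one per order) but only **96,096** **`customer_unique_id`** values. The difference of 3,345 is the number of orders placed by customers who had already ordered before, so the number of distinct **repeat buyers** is at most 3,345. This distinction is crucial for RFM analysis.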

2.  **Missing Order Details:** The `Orders Data` table has **99,441** orders, but the child tables cover slightly fewer:
    * `Orders Items Data`: **98,666** unique `order_id` values (775 fewer, $\approx 0.8\%$).
    * `Orders Reviews Data`: **98,673** unique `order_id` values (768 fewer).
    * `Orders Payments Data`: **99,440** unique `order_id` values (1 fewer).
    These small gaps suggest that a few orders were placed but have **no items**, **no review**, or **no payment record**. I should use `Orders Data` as the master list and perform **left joins** to retain all original orders.

3.  **Order Item Duplication:** The `Orders Items Data` has $\approx 112.6k$ total rows but only **98,666** unique `order_id` values, confirming that **some orders contain multiple items**. This is a normal one-to-many relationship, requiring careful aggregation (summing `price` and `freight_value`) before joining.

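4.  **Geolocation Strategy:** The GeoLocation data cannot be joined directly using a single foreign key. It is a large, descriptive table ($\approx 1$ million rows covering only **19,015** distinct zip code prefixes) that must be merged using the **zip code prefix** (`customer_zip_code_prefix` or `seller_zip_code_prefix`). Because each prefix appears many times, it should be collapsed to one row per prefix first so the join does not multiply order rows.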
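The left-join strategy from the observations above can be sketched as follows. The frames are hypothetical toy data, and `items_per_order_toy` stands for an item table already aggregated to one row per order. The `validate` and `indicator` arguments of `DataFrame.merge` guard against accidental row multiplication and show which orders found no match.

```python
import pandas as pd

# Toy master list of orders; o3 has no item rows.
orders_toy = pd.DataFrame({
    "order_id": ["o1", "o2", "o3"],
    "customer_id": ["c1", "c2", "c3"],
})

# Toy item table, already aggregated to one row per order.
items_per_order_toy = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "total_price": [30.0, 5.0],
})

# Left join keeps every order; validate raises if either side repeats order_id.
master_toy = orders_toy.merge(
    items_per_order_toy,
    on="order_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# The row count must equal the number of orders in the master list.
assert len(master_toy) == len(orders_toy)

print(master_toy)
```

Orders flagged `left_only` in the `_merge` column are the ones with no item rows; their `total_price` stays missing rather than being dropped.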