### DAY 5 — UNDERSTANDING MISSING DATA (WHY IT EXISTS)

***Today’s Goal:***
- Where data is missing
- Why it might be missing
- Which missing data matters

#### Load merged dataset

In [1]:
import pandas as pd

orders_df = pd.read_csv("DATA/olist_orders_dataset.csv")
customers_df = pd.read_csv("DATA/olist_customers_dataset.csv")

merged_df = pd.merge(
    orders_df,
    customers_df,
    on='customer_id',
    how='left'
)

#### Count missing values column-wise

In [2]:
merged_df.isna().sum()

order_id                            0
customer_id                         0
order_status                        0
order_purchase_timestamp            0
order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
customer_unique_id                  0
customer_zip_code_prefix            0
customer_city                       0
customer_state                      0
dtype: int64
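Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of that idea, not part of the original notebook: `toy_df` is a small made-up frame standing in for `merged_df`, and `missing_summary` is a name chosen here for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for merged_df (values are invented)
toy_df = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_approved_at": ["2018-01-01", None, "2018-01-03", "2018-01-04"],
    "order_delivered_customer_date": ["2018-01-05", None, None, "2018-01-09"],
})

# isna().sum() counts missing cells; isna().mean() gives the fraction missing
missing_summary = pd.DataFrame({
    "missing_count": toy_df.isna().sum(),
    "missing_pct": (toy_df.isna().mean() * 100).round(2),
})

print(missing_summary)
```

Applied to `merged_df`, the same two calls would show each count from the output above alongside its percentage of all orders.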

#### Focus ONLY on date columns

In [3]:
date_cols = [
    'order_approved_at',
    'order_delivered_carrier_date',
    'order_delivered_customer_date',
    'order_estimated_delivery_date'
]

merged_df[date_cols].isna().sum()


order_approved_at                 160
order_delivered_carrier_date     1783
order_delivered_customer_date    2965
order_estimated_delivery_date       0
dtype: int64

#### Connect missing dates to order_status
- Which order_status values have a missing delivery date?

In [4]:
# Start with a single column: order_delivered_customer_date
merged_df[merged_df['order_delivered_customer_date'].isna()]['order_status'].value_counts()

order_status
shipped        1107
canceled        619
unavailable     609
invoiced        314
processing      301
delivered         8
created           5
approved          2
Name: count, dtype: int64

In [5]:
# Repeat the same check for every date column
for col in date_cols:
    print(f"\nMissing values in {col}:")
    print(
        merged_df[merged_df[col].isna()]['order_status'].value_counts()
    )



Missing values in order_approved_at:
order_status
canceled     141
delivered     14
created        5
Name: count, dtype: int64

Missing values in order_delivered_carrier_date:
order_status
unavailable    609
canceled       550
invoiced       314
processing     301
created          5
approved         2
delivered        2
Name: count, dtype: int64

Missing values in order_delivered_customer_date:
order_status
shipped        1107
canceled        619
unavailable     609
invoiced        314
processing      301
delivered         8
created           5
approved          2
Name: count, dtype: int64

Missing values in order_estimated_delivery_date:
Series([], Name: count, dtype: int64)


order_estimated_delivery_date has no missing values, so its result above is an empty Series.
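The per-column loop can also be condensed into a single table, with one row per order_status and one column per date field. This is a sketch under assumptions rather than the notebook's own method: `toy_df` is invented sample data and `missing_by_status` is a hypothetical name.

```python
import pandas as pd

# Hypothetical toy data standing in for merged_df (values are invented)
toy_df = pd.DataFrame({
    "order_status": ["delivered", "shipped", "canceled", "canceled", "delivered"],
    "order_approved_at": ["2018-01-01", "2018-01-02", None, "2018-01-04", "2018-01-05"],
    "order_delivered_carrier_date": ["2018-01-02", "2018-01-03", None, None, "2018-01-06"],
    "order_delivered_customer_date": ["2018-01-05", None, None, None, None],
})

date_cols = [
    "order_approved_at",
    "order_delivered_carrier_date",
    "order_delivered_customer_date",
]

# Turn each date cell into True/False (missing or not), then sum the
# True values within each order_status group
missing_by_status = (
    toy_df[date_cols]
    .isna()
    .groupby(toy_df["order_status"])
    .sum()
)

print(missing_by_status)
```

Reading across a row shows how far orders in that status got before the timestamps stop, which is harder to see when each column is printed separately.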

#### Conclusion

- Most missing delivery dates (2,335 of 2,965, about 79%) belong to shipped, canceled, and unavailable orders
- This matches real-world business logic: an order that is still in transit, or that was canceled or unavailable, has no customer delivery date to record
- Only 8 delivered orders have a missing delivery date (possible data issue)
- Missing delivery dates are informative and should not be blindly filled
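One way to act on these conclusions is to keep the gaps, record them in an explicit flag column, and isolate the delivered orders that lack a delivery date for review. The sketch below assumes invented sample data (`toy_df`); the names `delivery_date_missing` and `suspect_orders` are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for merged_df (values are invented)
toy_df = pd.DataFrame({
    "order_id": ["a", "b", "c", "d"],
    "order_status": ["delivered", "delivered", "shipped", "canceled"],
    "order_delivered_customer_date": ["2018-01-05", None, None, None],
})

# Keep the missing value as-is, but record it in an explicit flag column
toy_df["delivery_date_missing"] = toy_df["order_delivered_customer_date"].isna()

# A delivered order is expected to have a delivery date;
# rows that break this expectation are candidates for a data-quality review
suspect_orders = toy_df[
    (toy_df["order_status"] == "delivered") & toy_df["delivery_date_missing"]
]

print(suspect_orders["order_id"].tolist())
```

The flag preserves the information that a date was absent, while the filter separates expected gaps (shipped, canceled) from the unexpected ones.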


#### Learnings:
- We check missing values to make sure the data is complete and correct, so that analysis and models give accurate results.
- We focus on date columns first because they tell us when things happened, and almost all analysis depends on correct time information.
- Check missing values in every date column against order_status to validate the order flow and the correctness of the data.