### DAY 8 - VALIDATING THE FEATURE
***Today's Goal:***
- Check negative values
- Check very large values
- Decide what is right or wrong using logic (No fixing blindly.)

### Load data

In [1]:
# load datasets 
import pandas as pd

orders = pd.read_csv("DATA/olist_orders_dataset.csv")
customers = pd.read_csv("DATA/olist_customers_dataset.csv")

merged_df = pd.merge(
    orders,
    customers,
    on='customer_id',
    how='left'
)

# Convert date columns from object dtype to datetime (unparseable values become NaT)
date_cols = [
    'order_purchase_timestamp',
    'order_approved_at',
    'order_delivered_carrier_date',
    'order_delivered_customer_date',
    'order_estimated_delivery_date'
]
for col in date_cols:
    merged_df[col] = pd.to_datetime(merged_df[col], errors='coerce')

# Create delivery time feature
merged_df['delivery_time_days'] = (
    merged_df['order_delivered_customer_date'] - # date when the customer received the order
    merged_df['order_purchase_timestamp'] # date when the customer placed the order
).dt.days
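
A note on `.dt.days`: it returns the whole-day component of each timedelta, rounded toward negative infinity. A delivery 23 hours after purchase therefore gives 0, and a delivery recorded 3 hours before purchase gives -1. The minimal sketch below uses two made-up timestamp pairs (not rows from the Olist data) to show this. It also shows `dt.total_seconds()` as one possible way to get fractional days, if that precision matters.

```python
import pandas as pd

# Made-up timestamps for illustration only (not from the dataset)
purchase = pd.to_datetime(pd.Series(["2018-01-01 10:00:00", "2018-01-01 10:00:00"]))
delivered = pd.to_datetime(pd.Series(["2018-01-02 09:00:00", "2018-01-01 07:00:00"]))

delta = delivered - purchase  # +23 hours and -3 hours

# Whole-day component, floored toward negative infinity
whole_days = delta.dt.days

# Fractional days computed from total seconds
fractional_days = delta.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.round(3).tolist())
```

Because of the flooring, any delivery timestamp that is even slightly earlier than the purchase timestamp would show up as a negative day count, so the `< 0` check in the next step would still catch it.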

### Check negative delivery times
How can delivery be before purchase?

Data error or time issue?

In [2]:
merged_df[merged_df['delivery_time_days'] < 0][
    ['order_id', 'order_status', 'delivery_time_days']
].head()

order_id,order_status,delivery_time_days
(no rows returned)


### Count negative values

In [3]:
(merged_df['delivery_time_days'] < 0).sum()

np.int64(0)

### Check very large values

In [4]:
merged_df['delivery_time_days'].max()

np.float64(209.0)

In [5]:
merged_df['delivery_time_days'].mean()

np.float64(12.094085575687217)
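
The maximum (209 days) is far above the mean (about 12.1 days), but "very large" needs a threshold before it can be checked. One common option, not used in the original notebook, is Tukey's IQR rule: flag anything above Q3 + 1.5 × IQR. The sketch below applies it to a small made-up series; the helper name `iqr_upper_fence` and the toy values are assumptions for illustration.

```python
import pandas as pd

def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper fence of Tukey's rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Toy delivery times in days (made up, not from the dataset)
toy = pd.Series([3, 6, 8, 10, 12, 15, 20, 209], dtype="float64")

fence = iqr_upper_fence(toy)
flagged = toy[toy > fence]  # values worth inspecting by hand

print(fence)
print(flagged.tolist())
```

On the real data this would be called as `iqr_upper_fence(merged_df['delivery_time_days'].dropna())`. A value above the fence is only a candidate for review, not proof of an error.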

### Connect with order_status

In [6]:
merged_df.groupby('order_status')['delivery_time_days'].describe()


order_status,count,mean,std,min,25%,50%,75%,max
approved,0.0,,,,,,,
canceled,6.0,19.833333,13.136463,7.0,7.75,20.0,30.0,35.0
created,0.0,,,,,,,
delivered,96470.0,12.093604,9.55138,0.0,6.0,10.0,15.0,209.0
invoiced,0.0,,,,,,,
processing,0.0,,,,,,,
shipped,0.0,,,,,,,
unavailable,0.0,,,,,,,
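
`describe()` reports only the non-missing count, so it does not show how many orders in each status have no delivery time at all. A minimal sketch of one way to get both numbers side by side is below. It runs on a small made-up frame with the same two column names; the toy rows are assumptions, not data from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    "order_status": ["delivered", "delivered", "canceled", "canceled", "shipped"],
    "delivery_time_days": [10.0, np.nan, 20.0, np.nan, np.nan],
})

grouped = toy.groupby("order_status")["delivery_time_days"]

summary = pd.DataFrame({
    "present": grouped.count(),  # rows with a delivery time
    "total": grouped.size(),     # all rows, including NaN
})
summary["missing"] = summary["total"] - summary["present"]

print(summary)
```

Applied to `merged_df`, the same pattern would also make the 6 canceled orders that carry a delivery date easy to spot, since they are the only non-delivered rows with `present > 0`.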


### Observations

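- No delivery_time_days values are negative in this dataset (the count is 0)
- A negative value would be logically incorrect, since delivery cannot happen before purchase
- The maximum is 209 days against a mean of about 12.1 days, so extremely high values may be data errors or edge cases
- 6 canceled orders have a delivery time, which needs checking; all statuses other than delivered and canceled have none
- Feature needs cleaning before using in ML or analysis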
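
In the spirit of "no fixing blindly", one possible first cleaning step is to flag suspicious rows instead of dropping or editing them. The sketch below is an assumption about how that could look: the function name `flag_delivery_time`, the flag column names, and the 60-day cutoff are all hypothetical and should be replaced by whatever the analysis justifies.

```python
import numpy as np
import pandas as pd

def flag_delivery_time(df: pd.DataFrame, max_days: float = 60.0) -> pd.DataFrame:
    """Add boolean flag columns; the input frame is left unchanged."""
    out = df.copy()
    days = out["delivery_time_days"]
    out["is_negative"] = days < 0           # delivery before purchase
    out["is_very_large"] = days > max_days  # hypothetical cutoff
    # A delivery time on an order that is not marked as delivered
    out["status_mismatch"] = days.notna() & (out["order_status"] != "delivered")
    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    "order_status": ["delivered", "delivered", "canceled", "shipped"],
    "delivery_time_days": [10.0, 209.0, 20.0, np.nan],
})

flagged = flag_delivery_time(toy)
print(flagged)
```

Missing delivery times compare as `False` in both numeric checks, so they are never flagged as negative or very large; they stay visible as NaN for a separate decision.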


#### Learnings

- Calculated values can have mistakes
- Negative delivery time is not possible
- Very large values need checking
- Always validate data before using it
