### DAY 7 — CREATING FIRST MEANINGFUL FEATURE
- Creating features first helps decide how to handle missing values
- ***Today’s Goal:***
    - What a feature is
    - How to create a delivery-time feature
    - Why feature creation is important

- Workflow: Understand missing values  →  Create features  →  Decide how to handle missing values

- You know the order date and the delivery date.
- Subtract one from the other and you get how many days the delivery took.
- That number is more useful than the two dates on their own.
- That number is a feature.
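
To make the subtraction idea concrete, here is a minimal sketch for a single order. The timestamps are those of the first order in the dataset (row 0); the variable names are illustrative only.

```python
import pandas as pd

# Purchase and delivery timestamps of one order (row 0 of the dataset)
order_date = pd.Timestamp("2017-10-02 10:56:33")
delivery_date = pd.Timestamp("2017-10-10 21:25:13")

# Subtracting two timestamps gives a Timedelta
elapsed = delivery_date - order_date
print(elapsed)       # 8 days 10:28:40

# .days keeps only the whole-day part
print(elapsed.days)  # 8
```

The single number `8` is the feature; the two raw timestamps are only its ingredients.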

### Load and merge the datasets

In [1]:
import pandas as pd

orders = pd.read_csv("DATA/olist_orders_dataset.csv")
customers = pd.read_csv("DATA/olist_customers_dataset.csv")

merged_df = pd.merge(
    orders,
    customers,
    on='customer_id',
    how='left'
)
merged_df.head()

| | order_id | customer_id | order_status | order_purchase_timestamp | order_approved_at | order_delivered_carrier_date | order_delivered_customer_date | order_estimated_delivery_date | customer_unique_id | customer_zip_code_prefix | customer_city | customer_state |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | delivered | 2017-10-02 10:56:33 | 2017-10-02 11:07:15 | 2017-10-04 19:55:00 | 2017-10-10 21:25:13 | 2017-10-18 00:00:00 | 7c396fd4830fd04220f754e42b4e5bff | 3149 | sao paulo | SP |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | delivered | 2018-07-24 20:41:37 | 2018-07-26 03:24:27 | 2018-07-26 14:31:00 | 2018-08-07 15:27:45 | 2018-08-13 00:00:00 | af07308b275d755c9edb36a90c618231 | 47813 | barreiras | BA |
| 2 | 47770eb9100c2d0c44946d9cf07ec65d | 41ce2a54c0b03bf3443c3d931a367089 | delivered | 2018-08-08 08:38:49 | 2018-08-08 08:55:23 | 2018-08-08 13:50:00 | 2018-08-17 18:06:29 | 2018-09-04 00:00:00 | 3a653a41f6f9fc3d2a113cf8398680e8 | 75265 | vianopolis | GO |
| 3 | 949d5b44dbf5de918fe9c16f97b45f8a | f88197465ea7920adcdbec7375364d82 | delivered | 2017-11-18 19:28:06 | 2017-11-18 19:45:59 | 2017-11-22 13:39:59 | 2017-12-02 00:28:42 | 2017-12-15 00:00:00 | 7c142cf63193a1473d2e66489a9ae977 | 59296 | sao goncalo do amarante | RN |
| 4 | ad21c59c0840e6cb83a9ceb5573f8159 | 8ab97904e6daea8866dbdbc4fb7aad2c | delivered | 2018-02-13 21:18:39 | 2018-02-13 22:20:29 | 2018-02-14 19:46:34 | 2018-02-16 18:17:02 | 2018-02-26 00:00:00 | 72632f0f9dd73dfee390c9b22eb56dd6 | 9195 | santo andre | SP |


In [2]:
date_cols = [
    'order_purchase_timestamp',
    'order_approved_at',
    'order_delivered_carrier_date',
    'order_delivered_customer_date',
    'order_estimated_delivery_date'
]
# Convert each date column from text to datetime; values that cannot be parsed become NaT
for col in date_cols:
    merged_df[col] = pd.to_datetime(merged_df[col], errors='coerce')
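
The effect of `errors='coerce'` is easiest to see on a small example. The sketch below uses a made-up series (`raw` is not part of the dataset) containing one valid timestamp, one invalid string, and one missing value.

```python
import pandas as pd

# Hypothetical input: one valid timestamp, one bad string, one missing value
raw = pd.Series(["2017-10-02 10:56:33", "not a date", None])

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.dtype)         # datetime64[ns]
print(parsed.isna().sum())  # 2
```

Because bad values are silently turned into `NaT`, it can be worth comparing the missing-value count before and after the conversion.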

In [3]:
merged_df.dtypes

order_id                                 object
customer_id                              object
order_status                             object
order_purchase_timestamp         datetime64[ns]
order_approved_at                datetime64[ns]
order_delivered_carrier_date     datetime64[ns]
order_delivered_customer_date    datetime64[ns]
order_estimated_delivery_date    datetime64[ns]
customer_unique_id                       object
customer_zip_code_prefix                  int64
customer_city                            object
customer_state                           object
dtype: object

### Create delivery time feature
How many days did it take to deliver each order?  →  Delivery date minus order date

In [4]:
merged_df['delivery_time_days'] = (
    merged_df['order_delivered_customer_date'] - # date when the customer received the order
    merged_df['order_purchase_timestamp'] # date when the customer placed the order
).dt.days

- `.dt` is the pandas datetime accessor; it exposes the date and time properties of a datetime or timedelta column.
- `.dt.days` returns only the whole-day part of each difference and drops the hours, minutes, and seconds.
- Subtract the dates, get the number of days, and store the result in a new column.
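
Because `.dt.days` keeps only whole days, an order delivered after 2 days and 21 hours is recorded as 2. The following sketch compares it with a fractional alternative based on `.dt.total_seconds()`. The first two rows reuse timestamps from the dataset preview; the third row is a hypothetical undelivered order, and the `delivery_time_days_exact` column is an illustrative name, not something the notebook creates.

```python
import pandas as pd

df = pd.DataFrame({
    "order_purchase_timestamp": pd.to_datetime([
        "2017-10-02 10:56:33",
        "2018-02-13 21:18:39",
        "2018-03-01 09:00:00",   # hypothetical order, never delivered
    ]),
    "order_delivered_customer_date": pd.to_datetime([
        "2017-10-10 21:25:13",
        "2018-02-16 18:17:02",
        None,                    # missing delivery date becomes NaT
    ]),
})

elapsed = df["order_delivered_customer_date"] - df["order_purchase_timestamp"]

# Whole days only (what the notebook uses)
df["delivery_time_days"] = elapsed.dt.days

# Fractional days, keeping the hours/minutes/seconds information
df["delivery_time_days_exact"] = elapsed.dt.total_seconds() / 86400

print(df[["delivery_time_days", "delivery_time_days_exact"]])
```

The undelivered row comes out as `NaN` in both columns, which is also why the whole-day column is stored as floats (`8.0`) rather than integers.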

### Look at the new feature

In [5]:
merged_df[['delivery_time_days']].head()


| | delivery_time_days |
|---|---|
| 0 | 8.0 |
| 1 | 13.0 |
| 2 | 9.0 |
| 3 | 13.0 |
| 4 | 2.0 |


### Basic understanding

In [8]:
merged_df['delivery_time_days'].describe()

count    96476.000000
mean        12.094086
std          9.551746
min          0.000000
25%          6.000000
50%         10.000000
75%         15.000000
max        209.000000
Name: delivery_time_days, dtype: float64

#### Conclusion

- Created a new column, `delivery_time_days`, that holds the delivery time of each order.
- This helps understand how long orders take to reach customers.
- Missing values are expected for orders that were not delivered.
- Creating features before cleaning helps understand the data better.
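
One way to check that the missing delivery times really belong to undelivered orders is to count them per `order_status`. The sketch below uses a small made-up DataFrame so that it runs on its own; the same pattern should apply to `merged_df`, assuming it has the `order_status` and `delivery_time_days` columns.

```python
import pandas as pd

# Toy data standing in for merged_df (values are invented for illustration)
df = pd.DataFrame({
    "order_status": ["delivered", "delivered", "shipped", "canceled"],
    "delivery_time_days": [8.0, 13.0, None, None],
})

# Count missing delivery times within each order status
missing_by_status = (
    df["delivery_time_days"].isna()
    .groupby(df["order_status"])
    .sum()
)
print(missing_by_status)
```

If most of the missing values fall under statuses other than `delivered`, they describe the business process rather than a data error, and filling them in would be misleading.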

#### Learnings

- A feature can be created by combining existing columns
- Dates can be used to calculate meaningful values like delivery time
- Missing values can be valid and should not be filled blindly
- Understanding business logic is important before fixing data
