### DAY 2: CHECKING DATA QUALITY
***Today's Goal:*** Perform basic data quality checks on the orders dataset


### Import packages

In [1]:
import pandas as pd

### Load CSV

In [2]:
orders_df = pd.read_csv("DATA/olist_orders_dataset.csv")
orders_df

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date
0,e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33,2017-10-02 11:07:15,2017-10-04 19:55:00,2017-10-10 21:25:13,2017-10-18 00:00:00
1,53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37,2018-07-26 03:24:27,2018-07-26 14:31:00,2018-08-07 15:27:45,2018-08-13 00:00:00
2,47770eb9100c2d0c44946d9cf07ec65d,41ce2a54c0b03bf3443c3d931a367089,delivered,2018-08-08 08:38:49,2018-08-08 08:55:23,2018-08-08 13:50:00,2018-08-17 18:06:29,2018-09-04 00:00:00
3,949d5b44dbf5de918fe9c16f97b45f8a,f88197465ea7920adcdbec7375364d82,delivered,2017-11-18 19:28:06,2017-11-18 19:45:59,2017-11-22 13:39:59,2017-12-02 00:28:42,2017-12-15 00:00:00
4,ad21c59c0840e6cb83a9ceb5573f8159,8ab97904e6daea8866dbdbc4fb7aad2c,delivered,2018-02-13 21:18:39,2018-02-13 22:20:29,2018-02-14 19:46:34,2018-02-16 18:17:02,2018-02-26 00:00:00
...,...,...,...,...,...,...,...,...
99436,9c5dedf39a927c1b2549525ed64a053c,39bd1228ee8140590ac3aca26f2dfe00,delivered,2017-03-09 09:54:05,2017-03-09 09:54:05,2017-03-10 11:18:03,2017-03-17 15:08:01,2017-03-28 00:00:00
99437,63943bddc261676b46f01ca7ac2f7bd8,1fca14ff2861355f6e5f14306ff977a7,delivered,2018-02-06 12:58:58,2018-02-06 13:10:37,2018-02-07 23:22:42,2018-02-28 17:37:56,2018-03-02 00:00:00
99438,83c1379a015df1e13d02aae0204711ab,1aa71eb042121263aafbe80c1b562c9c,delivered,2017-08-27 14:46:43,2017-08-27 15:04:16,2017-08-28 20:52:26,2017-09-21 11:24:17,2017-09-27 00:00:00
99439,11c177c8e97725db2631073c19f07b62,b331b74b18dc79bcdf6532d51e1637c1,delivered,2018-01-08 21:28:27,2018-01-08 21:36:21,2018-01-12 15:35:03,2018-01-25 23:32:54,2018-02-15 00:00:00


### Check for duplicate rows

In [3]:
orders_df.duplicated().sum()  # Count rows that are exact copies of an earlier row

np.int64(0)

In [4]:
orders_df[orders_df.duplicated()]  # Show the duplicate rows (empty if there are none)

Unnamed: 0,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date


### Check unique identifier

In [5]:
orders_df['order_id'].nunique()

99441

In [6]:
len(orders_df)

99441

##### Both numbers are equal → `order_id` is unique
- If they were not equal, the next step would be to find the repeated `order_id` values
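If the two numbers had differed, one way to locate the repeated IDs is sketched below. The `toy_orders` frame is hypothetical illustration data, not part of the Olist dataset; on the real data the same mask would be applied to `orders_df`.

```python
import pandas as pd

# Toy data (hypothetical), with one order_id deliberately repeated
toy_orders = pd.DataFrame({
    "order_id": ["a1", "b2", "a1", "c3"],
    "order_status": ["delivered", "shipped", "delivered", "canceled"],
})

# keep=False marks every occurrence of a repeated order_id,
# not only the second and later ones
dup_mask = toy_orders["order_id"].duplicated(keep=False)

# Sort so that rows sharing an order_id sit next to each other
dup_rows = toy_orders[dup_mask].sort_values("order_id")
print(dup_rows)
```

Using `keep=False` makes it easier to compare the conflicting rows side by side before deciding which one to keep.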

### Explore important categorical column

In [7]:
orders_df['order_status'].value_counts()

order_status
delivered      96478
shipped         1107
canceled         625
unavailable      609
invoiced         314
processing       301
created            5
approved           2
Name: count, dtype: int64

`value_counts()` counts how many rows fall under each order status. This shows the distribution (about 97% of orders are `delivered`) and would expose unexpected or misspelled status values.
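To read the distribution as proportions instead of raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses a small hypothetical Series in place of `orders_df['order_status']`.

```python
import pandas as pd

# Toy data (hypothetical) standing in for orders_df['order_status']
status = pd.Series(
    ["delivered", "delivered", "delivered", "shipped"],
    name="order_status",
)

counts = status.value_counts()                # absolute number of rows per status
shares = status.value_counts(normalize=True)  # fraction of rows per status
print(shares)
```

Proportions make it easier to judge whether a rare status is small enough to ignore or needs separate handling.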

### Explore date columns
- We select only the columns we need so the output stays readable and focused on the dates we want to inspect.

In [8]:
orders_df[['order_purchase_timestamp',
        'order_delivered_customer_date']].head()

Unnamed: 0,order_purchase_timestamp,order_delivered_customer_date
0,2017-10-02 10:56:33,2017-10-10 21:25:13
1,2018-07-24 20:41:37,2018-08-07 15:27:45
2,2018-08-08 08:38:49,2018-08-17 18:06:29
3,2017-11-18 19:28:06,2017-12-02 00:28:42
4,2018-02-13 21:18:39,2018-02-16 18:17:02
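By default `pd.read_csv` loads these timestamps as plain text, so date arithmetic will not work until they are converted. A minimal sketch of the conversion follows, using two rows copied from the output above; `delivery_days` is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Two rows copied from the output above
dates = pd.DataFrame({
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-02-13 21:18:39"],
    "order_delivered_customer_date": ["2017-10-10 21:25:13", "2018-02-16 18:17:02"],
})

# Convert the text columns to datetime64 so they can be subtracted
for col in dates.columns:
    dates[col] = pd.to_datetime(dates[col])

# delivery_days (hypothetical): whole days between purchase and delivery
delivery_days = (
    dates["order_delivered_customer_date"] - dates["order_purchase_timestamp"]
).dt.days
print(delivery_days.tolist())
```

Rows with a missing delivery date would produce a missing value here rather than an error.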


Check missing values:

In [9]:
orders_df[['order_delivered_customer_date']].isna().sum()

order_delivered_customer_date    2965
dtype: int64

#### Conclusion

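- No duplicate rows found in the orders dataset
- `order_id` is unique (99,441 distinct values in 99,441 rows)
- `order_status` has 8 distinct values
- 2,965 delivery dates are missing
- Missing delivery dates probably belong to orders that were not delivered (for example canceled or unavailable); 2,963 orders have a status other than `delivered`, which is close to the 2,965 missing dates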
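The last point is a hypothesis, not something the checks above confirm. One way to test it is to count the missing delivery dates per order status, sketched here on hypothetical toy data; on the real data `toy_orders` would be replaced by `orders_df`.

```python
import pandas as pd

# Toy data (hypothetical); None marks a missing delivery date
toy_orders = pd.DataFrame({
    "order_status": ["delivered", "canceled", "shipped", "delivered", "canceled"],
    "order_delivered_customer_date": [
        "2017-10-10 21:25:13", None, None, "2018-02-16 18:17:02", None,
    ],
})

# Keep only rows without a delivery date, then count them per status
missing = toy_orders[toy_orders["order_delivered_customer_date"].isna()]
missing_by_status = missing["order_status"].value_counts()
print(missing_by_status)
```

If any `delivered` orders show up in this count on the real data, those rows deserve a closer look.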


#### Learnings:
- Checking for duplicate rows confirms that each record appears only once, so counts and totals are not inflated.
- Comparing `nunique()` with `len()` verifies that an identifier column is truly unique, and `value_counts()` reveals the categories in a column along with any unexpected values.