## Setting up

### Import pandas and read in the csv file and set it to a dataframe called baskets

In [8]:
import pandas as pd
import numpy as np

### questions left unanswered from the last notebook:

 - why is the count for top_cat_id and sub_cat_id different from the other columns?
 - statistics other than count, min, and max are not meaningful for ID columns, since they are identifiers; should we treat them as categorical?
 - why would some items have a price of 0?
 - unique placed_at is one greater than unique order_id; is it possible that two orders were made in exactly the same millisecond? In theory it is possible, but could it indicate fraud?
 - how can we find out which two orders happened in the exact same millisecond?
 - how many merchants transacted on a particular day, say December 31, 2021?
 - how do we work with a string object and get the date, hour, minute, second, and millisecond?

### plan for this notebook:
 - why is the count for top_cat_id and sub_cat_id different from the other columns?
 - how can we find out which two orders happened in the exact same millisecond?
 - how many merchants transacted on a particular day, say December 31, 2021?
 - how do we work with a string object and get the date, hour, minute, second, and millisecond?
 
### the remaining questions may need a business answer
 - statistics other than count, min, and max are not meaningful for ID columns, since they are identifiers; should we treat them as categorical?
 - why would some items have a price of 0?
 - unique placed_at is one greater than unique order_id; is it possible that two orders were made in exactly the same millisecond? In theory it is possible, but could it indicate fraud?


In [9]:
baskets = pd.read_csv('new_baskets_sample_random_10.csv')
baskets.count()

id             30803
order_id       30803
placed_at      30803
merchant_id    30803
sku_id         30803
top_cat_id     30803
sub_cat_id     30803
qty            30803
price          30803
dtype: int64

### why is the count for top_cat_id and sub_cat_id different from the other columns?
 - check the pandas reference for the "count" function
 - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.count.html
 - the count documentation says "Count non-NA cells for each column or row"
 - a difference in count between columns is usually caused by null values in some of them
 - use isnull or isna (they are aliases) to find out
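
As a small illustration of the point above, here is a minimal sketch on a made-up frame. The `toy` DataFrame and its values are assumptions for demonstration only and are not taken from the baskets data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in top_cat_id
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "top_cat_id": [4.0, np.nan, 3.0],
})

counts = toy.count()       # non-NA cells per column
nulls = toy.isna().sum()   # NA cells per column
print(counts)
print(nulls)

# For every column, non-NA count plus NA count equals the number of rows
assert ((counts + nulls) == len(toy)).all()
```

A column whose count is lower than the others is therefore a column that contains nulls.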

In [10]:
baskets.isnull().sum()

id             0
order_id       0
placed_at      0
merchant_id    0
sku_id         0
top_cat_id     0
sub_cat_id     0
qty            0
price          0
dtype: int64

In [11]:
baskets.isna().sum()

id             0
order_id       0
placed_at      0
merchant_id    0
sku_id         0
top_cat_id     0
sub_cat_id     0
qty            0
price          0
dtype: int64

In [12]:
baskets[baskets['top_cat_id'].isnull()]

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price


- what observations can you make from the above results?

 - the nulls in top_cat_id and sub_cat_id occur on the same rows: there are 7 nulls for top_cat_id and 7 nulls for sub_cat_id
 - this happened on different SKUs, different merchants, and different dates
 - at least one record with null top_cat_id and sub_cat_id has a price of 0; it is not clear whether this is a coincidence
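
Equal null counts alone do not prove that the nulls sit on the same rows. One way to check this directly is sketched below on a made-up frame; the `toy` data and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with nulls in both category columns
toy = pd.DataFrame({
    "sku_id": [59, 102, 101, 7],
    "top_cat_id": [4.0, np.nan, 3.0, np.nan],
    "sub_cat_id": [51.0, np.nan, 94.0, np.nan],
    "price": [12000.0, 0.0, 134000.0, 500.0],
})

top_null = toy["top_cat_id"].isna()
sub_null = toy["sub_cat_id"].isna()

# True only if the two columns are null on exactly the same rows
same_rows = bool((top_null == sub_null).all())

# Rows where both categories are missing and the price is 0
zero_price_nulls = toy[top_null & sub_null & (toy["price"] == 0)]
print(same_rows)
print(zero_price_nulls)
```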

## working with datetime

In [13]:
baskets[baskets.duplicated(keep=False)]

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price


In [14]:
from datetime import datetime

baskets = pd.read_csv('new_baskets_sample_random_10.csv')
# parse the placed_at strings into pandas Timestamps
baskets['datetime'] = pd.to_datetime(baskets['placed_at'])

# pull each component out of the Timestamp, one row at a time
baskets['date'] = baskets['datetime'].apply(lambda x: x.date())
baskets['year'] = baskets['datetime'].apply(lambda x: x.year)
baskets['month'] = baskets['datetime'].apply(lambda x: x.month)
baskets['day'] = baskets['datetime'].apply(lambda x: x.day)
baskets['hour'] = baskets['datetime'].apply(lambda x: x.hour)
# isoweekday() numbers the days Monday=1 ... Sunday=7
baskets['weekday'] = baskets['datetime'].apply(lambda x: x.isoweekday())
baskets.head(3)


Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price,datetime,date,year,month,day,hour,weekday
0,43,8,2021-04-17 20:16:49.181,84,59,4.0,51.0,1500,12000.0,2021-04-17 20:16:49.181,2021-04-17,2021,4,17,20,6
1,44,8,2021-04-17 20:16:49.181,84,102,3.0,94.0,100,169000.0,2021-04-17 20:16:49.181,2021-04-17,2021,4,17,20,6
2,45,8,2021-04-17 20:16:49.181,84,101,3.0,94.0,100,134000.0,2021-04-17 20:16:49.181,2021-04-17,2021,4,17,20,6


### another way to do the date conversions

In [15]:
baskets = pd.read_csv('new_baskets_sample_random_10.csv')
baskets['datetime'] = baskets['placed_at'].apply(lambda x: datetime.fromisoformat(x))

# pandas.Series.dt is an accessor on a pandas Series that gives convenient, vectorized access to the components of datetime values
baskets['date'] = baskets['datetime'].dt.date
baskets['year'] = baskets['datetime'].dt.year
baskets['month'] = baskets['datetime'].dt.month
baskets['day'] = baskets['datetime'].dt.day
baskets['hour'] = baskets['datetime'].dt.hour
# note: .dt.weekday numbers the days Monday=0 ... Sunday=6, so a Saturday is 5 here but 6 with isoweekday()
baskets['weekday'] = baskets['datetime'].dt.weekday
baskets.head(3)

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price,datetime,date,year,month,day,hour,weekday
0,43,8,2021-04-17 20:16:49.181,84,59,4.0,51.0,1500,12000.0,2021-04-17 20:16:49.181,2021-04-17,2021,4,17,20,5
1,44,8,2021-04-17 20:16:49.181,84,102,3.0,94.0,100,169000.0,2021-04-17 20:16:49.181,2021-04-17,2021,4,17,20,5
2,45,8,2021-04-17 20:16:49.181,84,101,3.0,94.0,100,134000.0,2021-04-17 20:16:49.181,2021-04-17,2021,4,17,20,5
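
The cells above stop at the hour, while the original question also asks for minute, second, and millisecond. A minimal sketch of how the same `.dt` accessor could provide them follows; the one-row `toy` frame reuses the timestamp shown in the output above, and the new column names are assumptions.

```python
import pandas as pd

# One-row frame with the timestamp string from the sample output
toy = pd.DataFrame({"placed_at": ["2021-04-17 20:16:49.181"]})
toy["datetime"] = pd.to_datetime(toy["placed_at"])

toy["minute"] = toy["datetime"].dt.minute
toy["second"] = toy["datetime"].dt.second
# .dt.microsecond holds the fractional second in microseconds;
# integer division by 1000 converts it to whole milliseconds
toy["millisecond"] = toy["datetime"].dt.microsecond // 1000
print(toy[["minute", "second", "millisecond"]])
```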


 - now we can answer the question: how many merchants transacted on a particular day, say December 31, 2021?

In [16]:
baskets[baskets['date']== pd.to_datetime('2021-12-31').date()].count()

id             90
order_id       90
placed_at      90
merchant_id    90
sku_id         90
top_cat_id     90
sub_cat_id     90
qty            90
price          90
datetime       90
date           90
year           90
month          90
day            90
hour           90
weekday        90
dtype: int64

 - the result above counts the non-null rows in each column for that day; it does not give the number of merchants

In [17]:
baskets[baskets['date']== pd.to_datetime('2021-12-31').date()].nunique()

id             90
order_id       27
placed_at      27
merchant_id    20
sku_id         71
top_cat_id     13
sub_cat_id     28
qty            14
price          67
datetime       27
date            1
year            1
month           1
day             1
hour            7
weekday         1
dtype: int64

 - what can be observed from the above result?
 - 23 merchants transacted on Dec 31, 2021, placing a total of 27 orders over 9 hours
 - ......
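
If only the merchant and order counts for one day are needed, the filter can target those columns directly rather than calling nunique on the whole frame. The sketch below uses a made-up `toy` frame; `dt.normalize()` is one possible way to compare on the date while keeping the column as a datetime type.

```python
import pandas as pd

# Hypothetical toy data: three rows on Dec 31, 2021 and one on the next day
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "merchant_id": [84, 84, 84, 12],
    "placed_at": [
        "2021-12-31 09:00:00.000",
        "2021-12-31 09:00:00.000",
        "2021-12-31 15:30:00.500",
        "2022-01-01 08:00:00.000",
    ],
})
toy["datetime"] = pd.to_datetime(toy["placed_at"])

# normalize() resets the time part to midnight, so the comparison is by date
on_day = toy["datetime"].dt.normalize() == pd.Timestamp("2021-12-31")

n_merchants = toy.loc[on_day, "merchant_id"].nunique()
n_orders = toy.loc[on_day, "order_id"].nunique()
print(n_merchants, n_orders)
```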

### how to find the two orders that were placed in the exact same millisecond

In [18]:
# we only need the order_id and placed_at columns, and want the rows that have a different order_id but the same placed_at
# rows with the same order_id and the same placed_at are line items of one order, so we keep only one row per pair
df = baskets.drop_duplicates(subset = ['order_id','placed_at'])
# among the remaining rows, any placed_at that appears more than once is shared by different orders
# the order_id values in those rows are the orders that share a placed_at time with another order
o_id = df[df.duplicated(subset = ['placed_at'],keep=False)]['order_id'].reset_index()
baskets[baskets['order_id'].isin(o_id['order_id'])]

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price,datetime,date,year,month,day,hour,weekday
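
The same lookup can also be written with a groupby, which some readers may find easier to follow. This is an alternative sketch on made-up data, not the method used in the cell above; the `toy` frame and variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data: orders 8 and 9 share a timestamp, order 10 does not
toy = pd.DataFrame({
    "order_id": [8, 8, 9, 10],
    "placed_at": [
        "2021-04-17 20:16:49.181",
        "2021-04-17 20:16:49.181",
        "2021-04-17 20:16:49.181",
        "2021-04-18 10:00:00.000",
    ],
})

# Number of distinct orders sharing each timestamp
orders_per_ts = toy.groupby("placed_at")["order_id"].nunique()
shared_ts = orders_per_ts[orders_per_ts > 1].index

# All rows belonging to a timestamp that more than one order shares
colliding = toy[toy["placed_at"].isin(shared_ts)]
colliding_orders = sorted(int(o) for o in colliding["order_id"].unique())
print(colliding_orders)
```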


 - what can be observed from the above results?
 - do the rows look normal? it is not impossible for two orders to be placed in exactly the same millisecond, is it?
 - should we check whether other columns are identical when they should not be?
 - check for duplicates?

In [19]:
baskets = pd.read_csv('new_baskets_sample_random_10.csv')
baskets[baskets.duplicated(keep=False)]

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price


In [20]:
baskets[baskets.duplicated(subset=['id'],keep=False)]


Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price


In [21]:
baskets.count()

id             30803
order_id       30803
placed_at      30803
merchant_id    30803
sku_id         30803
top_cat_id     30803
sub_cat_id     30803
qty            30803
price          30803
dtype: int64

In [22]:
baskets.nunique()

id             30803
order_id        5440
placed_at       5440
merchant_id      216
sku_id          1220
top_cat_id        33
sub_cat_id        89
qty              129
price           1036
dtype: int64

### gather all observations and questions

 - the nulls in top_cat_id and sub_cat_id occur on the same rows: there are 7 nulls for top_cat_id and 7 nulls for sub_cat_id, on different SKUs, different merchants, and different dates
 - at least one record with null top_cat_id and sub_cat_id has a price of 0; it is not clear whether this is a coincidence
 - 23 merchants transacted on Dec 31, 2021, placing a total of 27 orders over 9 hours
 - we found two orders placed in exactly the same millisecond; they look legitimate
 - there are two rows in the data that are duplicated once, so each appears twice

### new questions
 - should we remove the duplicates?
 - what should we do about nulls in the data?
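
Whether to remove duplicates or nulls is a judgment call, but the mechanics are simple. The sketch below shows a few possible options on made-up data; the `toy` frame, the choice of `keep="first"`, and the `-1` sentinel are all assumptions for illustration and not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: rows 1 and 2 are exact duplicates, row 3 has a null category
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "top_cat_id": [4.0, 3.0, 3.0, np.nan],
    "price": [100.0, 200.0, 200.0, 0.0],
})

# Duplicates: keep the first copy of fully identical rows
deduped = toy.drop_duplicates(keep="first")

# Nulls, option A: drop the rows that are missing a category
dropped = deduped.dropna(subset=["top_cat_id"])

# Nulls, option B: keep the rows and mark the missing category with a sentinel
# (-1 is an arbitrary placeholder chosen for this sketch)
filled = deduped.fillna({"top_cat_id": -1})
print(len(deduped), len(dropped), len(filled))
```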

### the remaining questions may need a business answer that the data alone cannot give
 - statistics other than count, min, and max are not meaningful for ID columns, since they are identifiers; should we treat them as categorical?
 - why would some items have a price of 0?
 - unique placed_at is one greater than unique order_id; is it possible that two orders were made in exactly the same millisecond? In theory it is possible, but could it indicate fraud?
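
While these questions wait for a business answer, the data can still be prepared for that discussion. The sketch below shows one way to treat ID columns as categorical and to pull out the zero-price rows; the `toy` frame and the choice of which columns count as IDs are assumptions.

```python
import pandas as pd

# Hypothetical toy data with two ID columns and one zero price
toy = pd.DataFrame({
    "merchant_id": [84, 84, 12],
    "sku_id": [59, 102, 59],
    "price": [12000.0, 0.0, 500.0],
})

id_cols = ["merchant_id", "sku_id"]
# Cast the identifiers to the category dtype so they are not summarized as numbers
as_cat = toy.astype({c: "category" for c in id_cols})

# describe() on categorical columns reports count, unique, top, and freq
id_summary = as_cat[id_cols].describe()
print(id_summary)

# Rows worth showing to the business when asking about zero prices
zero_price = toy[toy["price"] == 0]
print(zero_price)
```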

