## Setting up

### Import pandas and read the CSV file into a DataFrame called baskets

In [1]:
import pandas as pd
import numpy as np

In [2]:
baskets = pd.read_csv('new_baskets_full.csv')

## Conduct basic data inspection

 - take a look at the first three rows and the last few rows

In [3]:
baskets.head(3)

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price
0,1,2,2021-04-09 16:19:27.998,46,101,3.0,94.0,400,134000.0
1,2,2,2021-04-09 16:19:27.998,46,100,3.0,94.0,400,137000.0
2,3,2,2021-04-09 16:19:27.998,46,102,3.0,94.0,400,169000.0


In [4]:
baskets.tail()

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price
336467,336466,61811,2022-08-15 22:08:20.176,122,684,32.0,85.0,2,127000.0
336468,336467,61960,2022-08-15 22:43:04.251,552,1326,9.0,23.0,2,424500.0
336469,330841,59500,2022-08-15 23:43:03.647,1906,530,14.0,86.0,10,56000.0
336470,330848,59500,2022-08-15 23:43:03.647,1906,564,14.0,86.0,5,28000.0
336471,336468,59500,2022-08-15 23:43:03.647,1906,200,4.0,7.0,20,14750.0


### dataframe dimensions, column names, column data types, ranges of column values

In [5]:
baskets.shape

(336472, 9)

In [6]:
baskets.columns

Index(['id', 'order_id', 'placed_at', 'merchant_id', 'sku_id', 'top_cat_id',
       'sub_cat_id', 'qty', 'price'],
      dtype='object')

In [7]:
baskets.dtypes

id               int64
order_id         int64
placed_at       object
merchant_id      int64
sku_id           int64
top_cat_id     float64
sub_cat_id     float64
qty              int64
price          float64
dtype: object

 - noticed that "placed_at" is not numeric (dtype object); all the other columns are numeric

In [8]:
baskets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336472 entries, 0 to 336471
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   id           336472 non-null  int64  
 1   order_id     336472 non-null  int64  
 2   placed_at    336472 non-null  object 
 3   merchant_id  336472 non-null  int64  
 4   sku_id       336472 non-null  int64  
 5   top_cat_id   336461 non-null  float64
 6   sub_cat_id   336461 non-null  float64
 7   qty          336472 non-null  int64  
 8   price        336472 non-null  float64
dtypes: float64(3), int64(5), object(1)
memory usage: 23.1+ MB


 - question: what can you observe from the above result?

 - why are the counts for top_cat_id and sub_cat_id different from the others?
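
One way to look into the count difference: `info()` reports 336461 non-null values for the two category columns against 336472 rows, so 11 values are missing in each. Missing values are also the likely reason these ID-like columns are float64 rather than int64, since `read_csv` falls back to float when an integer column contains NaN. The sketch below uses a small made-up frame named `toy` (not the real data) to show how the missing values could be counted and located.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `baskets`; the values are made up for illustration.
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "top_cat_id": [3.0, np.nan, 4.0, 14.0],
    "sub_cat_id": [94.0, np.nan, 21.0, 86.0],
    "price": [134000.0, 137000.0, 552500.0, 56000.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows where at least one of the category columns is missing.
category_columns = ["top_cat_id", "sub_cat_id"]
missing_rows = toy[toy[category_columns].isna().any(axis=1)]
print(missing_rows)
```

On the real data, the same two calls on `baskets` should show whether the 11 missing top_cat_id values and the 11 missing sub_cat_id values fall on the same rows.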

In [9]:
baskets.describe()

Unnamed: 0,id,order_id,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price
count,336472.0,336472.0,336472.0,336472.0,336461.0,336461.0,336472.0,336472.0
mean,168236.5,29079.405656,798.592706,525.308685,10.319098,45.395065,37.89684,137895.6
std,97131.244225,18909.738357,550.271799,304.262943,7.906257,27.767388,10358.73,174468.9
min,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.04375
25%,84118.75,11485.0,352.0,322.0,4.0,27.0,1.0,46000.0
50%,168236.5,28436.0,664.0,438.0,8.0,43.0,2.0,107000.0
75%,252354.25,46193.25,1217.0,589.0,14.0,69.0,5.0,184500.0
max,336472.0,62048.0,2138.0,1617.0,33.0,96.0,4800000.0,58750000.0


 - noticed that "placed_at" is not shown in the above result: describe() summarizes only numeric columns by default, and "placed_at" has dtype object
 - the ID columns' statistics other than count, min, and max may not be meaningful, since they are identifiers
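
A minimal sketch of how the two notes above could be followed up, again on a made-up frame named `toy`: restrict `describe()` to the real measurements, describe the string column on its own, and summarize an identifier as a categorical so that the output is count, unique, top, and freq instead of a mean and standard deviation.

```python
import pandas as pd

# Toy stand-in for `baskets`; the values are made up for illustration.
toy = pd.DataFrame({
    "merchant_id": [46, 46, 516, 1906],
    "placed_at": [
        "2021-04-09 16:19:27.998",
        "2021-04-09 16:19:27.998",
        "2021-10-28 14:46:48.302",
        "2022-08-15 23:43:03.647",
    ],
    "qty": [400, 400, 1, 10],
    "price": [134000.0, 137000.0, 223000.0, 56000.0],
})

# Summary statistics for the measurement columns only.
numeric_summary = toy[["qty", "price"]].describe()
print(numeric_summary)

# A non-numeric Series is summarized as count / unique / top / freq.
placed_at_summary = toy["placed_at"].describe()
print(placed_at_summary)

# Treat an identifier as categorical to get the same kind of summary.
merchant_summary = toy["merchant_id"].astype("category").describe()
print(merchant_summary)
```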

## Conduct some more data inspection

 - take a look at 3 random rows

In [10]:
# seed NumPy's random number generator so that re-running the cell returns the same rows
np.random.seed(17)
baskets.iloc[np.random.randint(0, baskets.shape[0],3)]

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price
64753,64790,8373,2021-10-28 14:46:48.302,516,622,15.0,48.0,1,223000.0
297103,297212,54238,2022-05-30 13:12:56.574,1628,1541,4.0,28.0,1,205000.0
304441,304575,55813,2022-06-14 10:39:34.228,1646,1016,14.0,48.0,5,18500.0
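
An alternative worth knowing: `np.random.randint` draws positions with replacement, so the same row can in principle be picked twice. `DataFrame.sample` draws without replacement by default and takes its own `random_state`. The sketch below uses a made-up frame named `toy`; it would not return the same three rows as the seeded cell above.

```python
import pandas as pd

# Toy stand-in for `baskets`; the values are made up for illustration.
toy = pd.DataFrame({
    "id": range(1, 11),
    "price": [1000.0 * i for i in range(1, 11)],
})

# Three distinct random rows; random_state makes the draw repeatable.
picked = toy.sample(n=3, random_state=17)
picked_again = toy.sample(n=3, random_state=17)
print(picked)
```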


 - take a look at transactions for a specific merchant_id, say the first merchant

In [20]:
m_id = baskets.merchant_id[0]
baskets[baskets['merchant_id'] == m_id]

Unnamed: 0,id,order_id,placed_at,merchant_id,sku_id,top_cat_id,sub_cat_id,qty,price
0,1,2,2021-04-09 16:19:27.998,46,101,3.0,94.0,400,134000.0
1,2,2,2021-04-09 16:19:27.998,46,100,3.0,94.0,400,137000.0
2,3,2,2021-04-09 16:19:27.998,46,102,3.0,94.0,400,169000.0
3,4,2,2021-04-09 16:19:27.998,46,99,3.0,94.0,400,129000.0
4,5,2,2021-04-09 16:19:27.998,46,68,3.0,70.0,360,148500.0
5,6,2,2021-04-09 16:19:27.998,46,21,4.0,21.0,200,552500.0
6,7,2,2021-04-09 16:19:27.998,46,103,3.0,94.0,170,103000.0
7,8,2,2021-04-09 16:19:27.998,46,98,3.0,94.0,50,104000.0
72,65,15,2021-05-05 14:42:19.221,46,21,4.0,21.0,200,567500.0
73,66,15,2021-05-05 14:42:19.221,46,68,3.0,70.0,400,158000.0


 - how much did this particular merchant spend in total (summing the price column)?

In [21]:
baskets[baskets['merchant_id'] == m_id].price.sum()

4485500.0
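
A caveat on the total above: summing `price` ignores `qty`. The data does not say whether `price` is a per-unit price or a line total, so the quantity-weighted version below rests on the assumption that it is per unit. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for `baskets`; the values are made up for illustration.
toy = pd.DataFrame({
    "merchant_id": [46, 46, 516],
    "qty": [2, 3, 1],
    "price": [100.0, 50.0, 200.0],
})

m_id = 46
rows = toy[toy["merchant_id"] == m_id]

# What the cell above computes: the plain sum of the price column.
sum_of_prices = rows["price"].sum()

# ASSUMPTION: price is per unit, so each line costs qty * price.
qty_weighted_total = (rows["qty"] * rows["price"]).sum()

print(sum_of_prices, qty_weighted_total)
```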

 - what is the average price for the first order?

In [22]:
o_id = baskets.order_id[0]
baskets[baskets['order_id'] == o_id].price.mean()

184625.0

 - what are the average, minimum, and maximum prices across all rows in this dataset?

In [23]:
baskets['price'].mean(), baskets['price'].min(), baskets['price'].max()

(137895.60407784148, 0.04375, 58750000.0)

 - how many rows have a price of 0?
 - question: why would some items have a price of 0?

*** TODO: find out why some items would have a price of 0
  

In [24]:
baskets[baskets['price']==0].count()

id             0
order_id       0
placed_at      0
merchant_id    0
sku_id         0
top_cat_id     0
sub_cat_id     0
qty            0
price          0
dtype: int64

 - result: no rows have a price of exactly 0, which is consistent with the minimum price of 0.04375 seen earlier

 - check columns' number of unique values

In [25]:
baskets.nunique()

id             336472
order_id        62048
placed_at       62047
merchant_id      2138
sku_id           1617
top_cat_id         33
sub_cat_id         96
qty               355
price            2171
dtype: int64

 - question: what can you observe from the above result? what seems peculiar?

 - notice that the number of unique placed_at values (62047) is one less than the number of unique order_id values (62048)
 - question: is it possible that two orders were placed in exactly the same millisecond? In theory it is possible, but could it indicate fraud?

  *** TODO: how can we find out which two orders were placed in the exact same millisecond?
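
One possible way to tackle this TODO is to count the distinct order_id values behind each placed_at value and keep the timestamps that have more than one. The sketch below runs on a made-up frame named `toy`, in which orders 15 and 16 deliberately share a timestamp.

```python
import pandas as pd

# Toy stand-in for `baskets`; the values are made up for illustration.
toy = pd.DataFrame({
    "order_id": [2, 2, 15, 16, 17],
    "placed_at": [
        "2021-04-09 16:19:27.998",
        "2021-04-09 16:19:27.998",
        "2021-05-05 14:42:19.221",
        "2021-05-05 14:42:19.221",
        "2021-05-06 09:00:00.000",
    ],
})

# Number of distinct orders recorded at each timestamp.
orders_per_timestamp = toy.groupby("placed_at")["order_id"].nunique()

# Timestamps that belong to more than one order.
shared = orders_per_timestamp[orders_per_timestamp > 1]
print(shared)

# All rows of the orders that collide on a timestamp.
colliding_rows = toy[toy["placed_at"].isin(shared.index)]
print(colliding_rows)
```

Several rows of one order sharing a timestamp is expected (every item in an order has the same placed_at), which is why the sketch counts distinct order_id values rather than rows.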


 - can we check the min and max of the "placed_at" column?

In [26]:
baskets['placed_at'].min(), baskets['placed_at'].max()

('2021-04-09 16:19:27.998', '2022-08-15 23:43:03.647')

 - how many merchants transacted on a particular day, say December 31, 2021?
 - what is the type "object" anyway?

In [27]:
baskets['placed_at'][1], type(baskets['placed_at'][1])

('2021-04-09 16:19:27.998', str)

 - how do we work with a string object and get the date, hour, minute, second, and millisecond?
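
A minimal sketch of one answer: convert the strings with `pd.to_datetime`, then read the components through the `.dt` accessor. The same parsed column also answers the earlier question about merchants on a given day. The frame `toy` and its values are made up for illustration. (The string min/max shown earlier still gives the right answer because fixed-width timestamps in this year-first format sort the same way as text and as dates.)

```python
import pandas as pd

# Toy stand-in for `baskets`; the values are made up for illustration.
toy = pd.DataFrame({
    "merchant_id": [46, 516, 516, 1906],
    "placed_at": [
        "2021-12-31 08:15:00.123",
        "2021-12-31 14:46:48.302",
        "2021-12-31 19:00:00.000",
        "2022-08-15 23:43:03.647",
    ],
})

# Parse the strings into proper datetime values.
placed = pd.to_datetime(toy["placed_at"])

# Pull out the individual components with the .dt accessor.
parts = pd.DataFrame({
    "date": placed.dt.date,
    "hour": placed.dt.hour,
    "minute": placed.dt.minute,
    "second": placed.dt.second,
    "millisecond": placed.dt.microsecond // 1000,
})
print(parts)

# Distinct merchants with at least one row on December 31, 2021.
on_day = placed.dt.normalize() == pd.Timestamp("2021-12-31")
merchants_on_day = toy.loc[on_day, "merchant_id"].nunique()
print(merchants_on_day)
```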

### save some data to a file

In [28]:
baskets[baskets['merchant_id'] == 1004].to_csv("test_dave.csv", sep=",", index=False)

### gather all observations, questions, and TODOs

 - column "placed_at" is not numeric (dtype object); all the other columns are numeric
 - why are the counts for top_cat_id and sub_cat_id different from the others?
 - the ID columns' statistics other than count, min, and max may not be meaningful, since they are identifiers; should we treat them as categorical?
 - why would some items have a price of 0? (the check above found no such rows; the minimum price is 0.04375)
 - the number of unique placed_at values is one less than the number of unique order_id values
 - is it possible that two orders were placed in exactly the same millisecond? In theory it is possible, but could it indicate fraud?
 - how can we find out which two orders were placed in the exact same millisecond?
 - how many merchants transacted on a particular day, say December 31, 2021?
 - how do we work with a string object and get the date, hour, minute, second, and millisecond?