# Shopify 2022 Data Science Intern Challenge



* Think about what could be going wrong with our calculation. Think about a better way to evaluate this data.

The dataset most likely contains outliers that inflate the mean, as it is not plausible for a typical customer to spend more than $3,000 on a single sneaker order, let alone on any single product.

* What metric would you report for this dataset?

I would still report the average order amount, computed after removing the outlier orders.

* What is its value?

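After filtering, the value is about $151.69. Note that this figure is total order amount divided by total items sold, so it is an average per item rather than per order.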

Below is my thought process as I worked with this dataset.

In [1]:
import pandas as pd
shop_data = pd.read_csv("2019 Winter Data Science Intern Challenge Data Set - Sheet1.csv")

In [2]:
# first, check for any missing data
shop_data.isnull().values.any()

# basic summary statistics to see what is wrong with our data
shop_data.describe()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,50.0788,849.0924,3145.128,8.7872
std,1443.520003,29.006118,87.798982,41282.539349,116.32032
min,1.0,1.0,607.0,90.0,1.0
25%,1250.75,24.0,775.0,163.0,1.0
50%,2500.5,50.0,849.0,284.0,2.0
75%,3750.25,75.0,925.0,390.0,3.0
max,5000.0,100.0,999.0,704000.0,2000.0


We immediately see several problems with the dataset from the summary statistics. We know that each sneaker shop sells only one model of shoe, so a maximum order amount of 704000 and a maximum of 2000 items in a single order are highly unlikely.
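In the table above, the mean order amount (3145.128) is far above the median (the 50% row, 284). A minimal sketch of why this happens, using hypothetical order amounts rather than rows from the dataset: a single extreme order is enough to pull the mean away from the median.

```python
from statistics import mean, median

# Hypothetical order amounts (not rows from the dataset): five ordinary
# orders, then the same orders plus one 704000 order.
typical_orders = [160, 284, 390, 163, 306]
with_outlier = typical_orders + [704000]

mean_typical, median_typical = mean(typical_orders), median(typical_orders)
mean_outlier, median_outlier = mean(with_outlier), median(with_outlier)

print(f"without outlier: mean={mean_typical:.1f}, median={median_typical}")
print(f"with outlier:    mean={mean_outlier:.1f}, median={median_outlier}")
```

In this toy example the median only moves from 284 to 295, while the mean jumps from 260.6 to 117550.5.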

In [3]:
shop_data.sort_values(by = "order_amount", ascending = False)

shop_data[shop_data["total_items"] == 2000]

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
15,16,42,607,704000,2000,credit_card,2017-03-07 4:00:00
60,61,42,607,704000,2000,credit_card,2017-03-04 4:00:00
520,521,42,607,704000,2000,credit_card,2017-03-02 4:00:00
1104,1105,42,607,704000,2000,credit_card,2017-03-24 4:00:00
1362,1363,42,607,704000,2000,credit_card,2017-03-15 4:00:00
1436,1437,42,607,704000,2000,credit_card,2017-03-11 4:00:00
1562,1563,42,607,704000,2000,credit_card,2017-03-19 4:00:00
1602,1603,42,607,704000,2000,credit_card,2017-03-17 4:00:00
2153,2154,42,607,704000,2000,credit_card,2017-03-12 4:00:00
2297,2298,42,607,704000,2000,credit_card,2017-03-07 4:00:00


We observe that the orders with an order amount of 704000 and 2000 total items all come from user_id 607 at shop_id 42, and all of them were created at exactly 4:00:00, on various dates. This is most likely an error in the system, as it is quite unlikely that one person would order $11,968,000 worth of the same shoes.
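The $11,968,000 figure can be checked with simple arithmetic. This is only a sketch: the row counts are taken from the `describe()` outputs in this notebook (5000 rows in the raw data, 4983 after dropping user_id 607), and the variable names are my own.

```python
# Row counts taken from the describe() outputs in this notebook.
total_rows = 5000        # raw data
rows_without_607 = 4983  # after dropping user_id 607

# Number of 704000 orders placed by user_id 607 and their combined value.
bulk_orders = total_rows - rows_without_607
bulk_total = bulk_orders * 704000

# Implied price per pair in each of these orders.
unit_price = 704000 / 2000

print(bulk_orders, bulk_total, unit_price)
```

The implied unit price of 352 per pair is not unusual for sneakers, which suggests that the quantity of 2000, rather than the price, is what makes these orders extreme.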

To clean the data, it is quite safe to delete these outliers before calculating the AOV, since close to 5,000 samples remain. However, following up with shop_id 42 and user_id 607 to find out whether these orders are genuine, and what the correct order_amount should be, would be a normal step as well.

We proceed with the calculation of AOV after filtering these data points out.

In [4]:
shop_data_clean = shop_data[shop_data.user_id != 607]
shop_data_clean.describe()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items
count,4983.0,4983.0,4983.0,4983.0,4983.0
mean,2501.060405,50.106362,849.918322,754.091913,1.99398
std,1443.090253,29.051718,86.800308,5314.092293,0.98318
min,1.0,1.0,700.0,90.0,1.0
25%,1250.5,24.0,776.0,163.0,1.0
50%,2502.0,50.0,850.0,284.0,2.0
75%,3750.5,75.0,925.0,390.0,3.0
max,5000.0,100.0,999.0,154350.0,8.0


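We immediately see that the max order amount is still very high at 154350. Sorting by order amount shows that the largest remaining orders come from shop_id 78, where each order works out to 25,725 dollars per item (for example, 154350 / 6). Since more data points have implausible values, we decide to keep only orders with an order amount below 10,000 dollars.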

In [5]:
shop_data_clean.sort_values(by= 'order_amount', ascending = False)

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
691,692,78,878,154350,6,debit,2017-03-27 22:51:43
2492,2493,78,834,102900,4,debit,2017-03-04 4:37:34
3724,3725,78,766,77175,3,credit_card,2017-03-16 14:13:26
1259,1260,78,775,77175,3,credit_card,2017-03-27 9:27:20
4420,4421,78,969,77175,3,debit,2017-03-09 15:21:35
...,...,...,...,...,...,...,...
3871,3872,92,818,90,1,debit,2017-03-18 9:10:08
3363,3364,92,730,90,1,credit_card,2017-03-11 23:20:31
4923,4924,92,965,90,1,credit_card,2017-03-09 5:05:11
158,159,92,795,90,1,credit_card,2017-03-29 3:07:12


In [6]:
final_data = shop_data[shop_data["order_amount"] < 10000]
final_data.describe()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items
count,4937.0,4937.0,4937.0,4937.0,4937.0
mean,2499.551347,49.846465,849.752279,302.580514,1.994734
std,1444.069407,29.061131,86.840313,160.804912,0.982821
min,1.0,1.0,700.0,90.0,1.0
25%,1248.0,24.0,775.0,163.0,1.0
50%,2497.0,50.0,850.0,284.0,2.0
75%,3751.0,74.0,925.0,387.0,3.0
max,5000.0,100.0,999.0,1760.0,8.0


The summary statistics of the data with order amount less than 10,000 dollars give us more reasonable values for the max order_amount and total_items, so we use this filtered data for the final calculation.

In [7]:
final_data.order_amount.sum() / final_data.total_items.sum()

151.68968318440292

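This is a simplified way to show how I would compute the AOV for Shopify sneaker shops with the information given. Note that dividing the total order amount by the total number of items gives an average per item sold; the per-order mean of the filtered data is the 302.58 shown in the `describe()` output above. If given more information, such as what other products the sneaker shops sell other than the one specific shoe model, we would be able to calculate a more accurate AOV.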
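The expression in cell In [7] divides by the number of items, whereas an average per order divides by the number of orders. A minimal sketch of the two calculations on hypothetical orders (the values and the names `per_order` and `per_item` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical orders, not rows from the dataset.
orders = pd.DataFrame({
    "order_amount": [90, 284, 390, 480],
    "total_items": [1, 2, 3, 3],
})

# Average per order: total revenue divided by the number of orders.
per_order = orders["order_amount"].mean()

# Average per item: total revenue divided by the number of items sold,
# which is the form used in cell In [7].
per_item = orders["order_amount"].sum() / orders["total_items"].sum()

print(f"per order: {per_order:.2f}, per item: {per_item:.2f}")
```

Which of the two to report depends on the question being asked; the sketch only shows that they can differ substantially when orders contain more than one item.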

In [9]:
final_data.sort_values("total_items")

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
2687,2688,69,776,131,1,debit,2017-03-25 19:26:09
2686,2687,91,883,160,1,cash,2017-03-11 15:07:44
3834,3835,7,986,112,1,debit,2017-03-18 3:51:08
3836,3837,10,971,148,1,cash,2017-03-17 18:51:42
2265,2266,56,737,117,1,cash,2017-03-26 5:54:00
...,...,...,...,...,...,...,...
4847,4848,13,993,960,6,cash,2017-03-27 11:00:45
2307,2308,61,723,948,6,credit_card,2017-03-26 11:29:37
4711,4712,86,883,780,6,cash,2017-03-18 14:18:19
1563,1564,91,934,960,6,debit,2017-03-23 8:25:49
