# Fall 2022 Data Science Intern Challenge

In [2]:
import pandas as pd

## Analyzing Data

In [3]:
df = pd.read_csv("q1.csv")
print(df.count())
df.head()

order_id          5000
shop_id           5000
user_id           5000
order_amount      5000
total_items       5000
payment_method    5000
created_at        5000
dtype: int64


| | order_id | shop_id | user_id | order_amount | total_items | payment_method | created_at |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 53 | 746 | 224 | 2 | cash | 2017-03-13 12:36:56 |
| 1 | 2 | 92 | 925 | 90 | 1 | cash | 2017-03-03 17:38:52 |
| 2 | 3 | 44 | 861 | 144 | 1 | cash | 2017-03-14 4:23:56 |
| 3 | 4 | 18 | 935 | 156 | 1 | credit_card | 2017-03-26 12:43:37 |
| 4 | 5 | 18 | 883 | 156 | 1 | credit_card | 2017-03-01 4:35:11 |


## Calculating Average Order Value

The average order value (AOV) is defined as: $$\text{AOV} = \frac{\text{revenue}}{\text{orders}}$$

In [4]:
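# Total revenue across all 5000 rows.
revenue = df["order_amount"].sum()
# Note: this sums the items sold across all orders, so the denominator
# counts shoes rather than order rows (len(df) would count order rows).
number_orders = df["total_items"].sum()
average_order_value = revenue / number_orders
print("The average order value is $%.2f" % average_order_value)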

The average order value is $357.92


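Let's see if we can figure out where the miscalculated figure of $3145.13 (`false_aov` below) comes from. It differs from our result, so either the revenue or the number of orders must have been computed differently. Holding one fixed at our value, we can back out what the other would have to be.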

In [5]:
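false_aov = 3145.13
# Denominator that reproduces false_aov if our revenue is correct.
false_orders = revenue / false_aov
# Numerator that reproduces false_aov if our denominator is correct.
false_revenue = false_aov * number_orders
# Note: %d truncates the order count instead of rounding it.
print("Falsely calculated revenue: $%.2f, falsely calculated orders %d" % (false_revenue, false_orders))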

Falsely calculated revenue: $138184431.68, falsely calculated orders 4999


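The implied order count of 4999 (truncated by `%d`) is almost exactly the 5000 rows in the dataset. This suggests the $3145.13 figure was obtained by dividing total revenue by the number of order rows, without accounting for how many items each order contains.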
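To make the two denominators concrete, here is a minimal sketch on a made-up three-order table. The values are illustrative and not taken from `q1.csv`; the only assumption is one row per order, as in the dataset above.

```python
import pandas as pd

# Hypothetical toy data: one row per order, mimicking two columns of q1.csv.
toy = pd.DataFrame({
    "order_amount": [100, 200, 300],
    "total_items": [1, 2, 3],
})

toy_revenue = toy["order_amount"].sum()

# Revenue per order row: what dividing by the row count produces.
per_order = toy_revenue / len(toy)

# Revenue per item sold: what dividing by the summed item count produces.
per_item = toy_revenue / toy["total_items"].sum()

print("per order: %.2f, per item: %.2f" % (per_order, per_item))
```

The two results diverge as soon as any order contains more than one item, which is the kind of gap seen between the two figures above.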

## Looking at Another Metric

Instead of looking at the average order value, we might want to look at the average of the shoe prices themselves. Since each shop sells only one model of shoe, any single order from a shop gives that shop's price: divide its `order_amount` by its `total_items`.
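The one-price-per-shop assumption can be checked before relying on it. The sketch below does so on a small made-up table (the shops and amounts are hypothetical); the same `groupby` check could be run on `df`.

```python
import pandas as pd

# Hypothetical toy data: two shops, each charging one fixed price per shoe.
toy = pd.DataFrame({
    "shop_id": [1, 1, 2, 2],
    "order_amount": [200, 100, 450, 150],
    "total_items": [2, 1, 3, 1],
})

# Unit price implied by each individual order.
toy["unit_price"] = toy["order_amount"] / toy["total_items"]

# Count the distinct unit prices seen for each shop.
prices_per_shop = toy.groupby("shop_id")["unit_price"].nunique()

# The assumption holds only if every shop has exactly one distinct price.
one_price_per_shop = bool((prices_per_shop == 1).all())
print(one_price_per_shop)
```

If the check fails on real data, taking the first order per shop would pick an arbitrary price, and a per-shop aggregate would be a safer choice.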

In [32]:
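# Keep the first order seen for each shop; one order is enough to recover its price.
unique_shops = df.drop_duplicates(subset=["shop_id"])
# Per-shoe price for each shop.
shoe_prices = unique_shops["order_amount"] / unique_shops["total_items"]
shoe_prices.mean()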

407.99

This price still seems awfully high for a shoe. Let's see if we can figure out what's going on.

In [36]:
shoe_prices.describe()

count      100.000000
mean       407.990000
std       2557.462906
min         90.000000
25%        132.750000
50%        153.000000
75%        168.250000
max      25725.000000
dtype: float64

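The mean is being pulled up by a single shop whose shoe costs $25725.00, while three quarters of the shops charge $168.25 or less. With an outlier this extreme, the median is a better summary because it is not affected by how large the largest value is.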
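As a small illustration of why the median is the more robust choice, the sketch below compares both statistics with and without one extreme value. The four ordinary prices and the 1000 cutoff are made up for the example; only the 25725 value echoes the maximum reported above.

```python
import pandas as pd

# Hypothetical prices: four ordinary shoes plus one extreme outlier.
prices = pd.Series([130.0, 150.0, 155.0, 170.0, 25725.0])

mean_with = prices.mean()
median_with = prices.median()

# Drop the outlier (arbitrary cutoff, for illustration only) and recompute.
trimmed = prices[prices < 1000]
mean_without = trimmed.mean()
median_without = trimmed.median()

print("mean:   %.2f -> %.2f" % (mean_with, mean_without))
print("median: %.2f -> %.2f" % (median_with, median_without))
```

Removing one value moves the mean by thousands of dollars, while the median shifts by only a few dollars.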

In [34]:
print("The median shoe price is $%.2f." % shoe_prices.median())

The median shoe price is $153.00.


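A median of $153.00 is a far more plausible price for a pair of shoes, and it sits within the interquartile range of $132.75 to $168.25 shown above.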