## Shopify Data Science Challenge Problems:


#### Note: Please view this notebook in NBviewer to see all of its Plotly graphs

#### Question 1: Given some sample data, write a program to answer the following:

On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

1. Think about what could be going wrong with our calculation. 
2. Think about a better way to evaluate this data. 
3. What metric would you report for this dataset?
4. What is its value?




#### Question 2:

For this question you’ll need to use SQL. Follow this link: https://www.w3schools.com/SQL/TRYSQL.ASP?FILENAME=TRYSQL_SELECT_ALL to access the data set required for the challenge. Please use queries to answer the following questions. Paste your queries along with your final numerical answers below.
1. How many orders were shipped by Speedy Express in total?
2. What is the last name of the employee with the most orders?
3. What product was ordered the most by customers in Germany?

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv("../input/shopify-data-science-internship-challenge/Shopify.csv")
df.head()

## Doing basic data exploration

In [None]:
df.describe()

1. The mean order amount is 3145.13, which is far too high for a sneaker store. This suggests a small number of extreme orders (outliers), possibly fraudulent, tied to particular shops or users.
2. The median, 284, is much more reasonable. A mean computed after handling the outliers should land much closer to the median than to 3145.13.
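To make the mean-versus-median point concrete, here is a minimal sketch on made-up numbers (the `toy_orders` values are hypothetical and are not taken from the challenge dataset). It shows how a single extreme order drags the mean far from the bulk of the data while the median barely moves.

```python
import pandas as pd

# Hypothetical order amounts: five ordinary sneaker orders plus one bulk order.
toy_orders = pd.Series([150, 280, 290, 310, 450, 704000], name="order_amount")

naive_aov = toy_orders.mean()        # pulled upward by the single extreme order
median_order = toy_orders.median()   # depends only on the middle of the sorted values

print(f"mean:   {naive_aov:.2f}")
print(f"median: {median_order:.2f}")
```

On real data the same two calls would be applied to `df['order_amount']`.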

In [None]:
df.info() # we can safely proceed as there are no nulls in the dataset

In [None]:
duplicateRows = df[df.duplicated()]
duplicateRows

There are no missing values and no fully duplicated rows. The dataset looks clean and we can proceed.
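Note that `df.duplicated()` only flags rows that repeat in every column. A repeated `order_id` with different contents would slip through, so it may be worth checking the key column on its own. The sketch below does this on a small hypothetical frame; the helper name `summarize_integrity` is an illustration, not part of the original notebook.

```python
import pandas as pd

def summarize_integrity(frame: pd.DataFrame, key: str) -> dict:
    """Count null cells, fully duplicated rows, and repeated values of `key`."""
    return {
        "null_cells": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "duplicate_keys": int(frame[key].duplicated().sum()),
    }

# Hypothetical data: order_id 2 appears twice, but the two rows are not identical.
toy = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_amount": [100, 200, 250, 300],
})

report = summarize_integrity(toy, key="order_id")
print(report)
```

On the challenge data the equivalent call would be `summarize_integrity(df, key="order_id")`.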

In [None]:
df.head()

In [None]:
df['shop_id'].nunique()

In [None]:
df['shop_id'].value_counts()

In [None]:
df['order_amount'].value_counts()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(10,10))
sns.histplot(df['order_amount'], color='red', kde=True)

In [None]:
df['user_id'].value_counts()

In [None]:
df['user_id'].nunique()

In [None]:
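df_group_shop = df.groupby('shop_id')['order_amount'].mean()  # aggregate the amount, not the order_id key
df_group_shop.plot.bar(figsize=(20,10))

Aggregating `order_id` by shop would only describe the ID column, so the per-shop mean and median plots here use `order_amount`. Shops whose mean sits far above their median deserve a closer look, and we still need to dig into individual users and shops to pin down the outliers.

In [None]:
df_group_shop = df.groupby('shop_id')['order_amount'].median()
df_group_shop.plot.bar(figsize=(20,10))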

In [None]:
df_group_user = pd.DataFrame({'mean_amount': df.groupby('user_id')['order_amount'].mean()}).reset_index()
df_group_user

In [None]:
subset_df = df_group_user[df_group_user['mean_amount']>2000]
#subset_df
fig = plt.figure(figsize=(20,10))
plt.bar(subset_df['user_id'], subset_df['mean_amount'])

User 607 has an extremely high average purchase amount and is a strong candidate for fraud.
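The 2000-dollar cutoff used above is a hand-picked threshold. As an alternative, the sketch below flags high per-user averages with Tukey's IQR rule, which is a different technique from the one in this notebook. The user ids and amounts are hypothetical, and `iqr_outliers` is an illustrative helper.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying above Q3 + k * IQR (upper-tail outliers only)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    upper_fence = q3 + k * (q3 - q1)
    return values[values > upper_fence]

# Hypothetical per-user mean order amounts, indexed by user_id.
toy_user_means = pd.Series(
    {101: 280.0, 102: 300.0, 103: 310.0, 104: 295.0, 105: 704000.0},
    name="mean_amount",
)

flagged = iqr_outliers(toy_user_means)
print(flagged)
```

With the notebook's data, the input would be `df.groupby('user_id')['order_amount'].mean()`.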

In [None]:
df[df['user_id']==607]

## User 607, shopping at shop 42, purchased 2000 items per order at 704,000 dollars each time, for a total of about 12 million dollars over just 18 days, all with a credit card. This is a major fraud alert.

Let's remove user 607 and plot again

In [None]:
subset_df = df_group_user[df_group_user['user_id']!=607]
subset_df.head()

In [None]:
fig = plt.figure(figsize=(20,10))
plt.bar(subset_df['user_id'], subset_df['mean_amount'])

In [None]:
# plotting users with more than 2000 dollars of mean purchases
subset_df = subset_df[subset_df['mean_amount']>2000]
fig = plt.figure(figsize=(20,10))
plt.bar(subset_df['user_id'], subset_df['mean_amount'], color='red')

In [None]:
!pip install chart_studio
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable = True)
import chart_studio
import plotly.express as px
import chart_studio.plotly as py
import plotly.graph_objs as go
from plotly.offline import iplot

In [None]:
import plotly.express as px

px.bar(subset_df, x='user_id', y='mean_amount', title='Middle-High average spending Users (Hover to see the ID)')

In [None]:
subset_df[subset_df['user_id']==878]

In [None]:
df[df['user_id']==878]

User 878 has a very high one-time purchase, paid by debit card, at shop 78

In [None]:
df[df['user_id']==766]

User 766 also has a one-time high purchase at shop 78

In [None]:
df[df['user_id']==834]

Interestingly, user 834 also has a one-time high purchase by debit card at shop 78

#### Shop 78 clearly looks suspicious

In [None]:
df[df['shop_id']==78]

In [None]:
df_group_shop = pd.DataFrame({'mean_amount': df.groupby('shop_id')['order_amount'].mean()}).reset_index()
df_group_shop

In [None]:
px.bar(df_group_shop, x='shop_id', y='mean_amount')

Shops 42 and 78 stand out as the sources of the suspected fraud

In [None]:
df[df['shop_id']==42]
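A high order amount can come from a large quantity or from a high price per item. One way to separate the two is to compare the per-item price across shops. The sketch below assumes a quantity column named `total_items` (an assumption, since the column list is not shown here) and uses hypothetical numbers.

```python
import pandas as pd

# Hypothetical orders; `total_items` is an assumed column name for the quantity.
toy = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "order_amount": [300, 150, 500000, 250, 40000],
    "total_items": [2, 1, 2000, 1, 2],
})

# Per-item price of each order, then the typical per-item price of each shop.
toy["unit_price"] = toy["order_amount"] / toy["total_items"]
unit_price_by_shop = toy.groupby("shop_id")["unit_price"].median()

print(unit_price_by_shop)
```

In this toy data, shop 2 has an ordinary unit price despite one huge order (a quantity effect), while shop 3 has a very high unit price (a pricing effect).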

## Let's remove the 2 outliers, user id 607 and shop id 78, and observe the data again

In [None]:
clean_df = df[df['user_id']!=607]
clean_df = clean_df[clean_df['shop_id']!=78]
clean_df

In [None]:
clean_df.describe()

In [None]:
clean_df['order_amount'].iplot(kind='hist',
                              title='Cleaned dataframe Order Amount plot',
                              xTitle='Order Amount',
                              yTitle='Count',
                              theme='solar',
                              showgrid=False)

There still seems to be an outlier. Let's explore further.

In [None]:
fig = plt.figure(figsize=(20,10))
sns.histplot(clean_df['order_amount'], color='green', kde=True)

In [None]:
fig = plt.figure(figsize=(20,10))
plt.bar(clean_df['shop_id'], clean_df['order_amount'], color='#008000')

In [None]:
clean_df.head()

In [None]:
px.scatter(clean_df, x='created_at', y='order_amount', color='payment_method')

In [None]:
px.scatter(clean_df, x='created_at', y='order_amount', color='shop_id')

In [None]:
clean_df.describe()

## Conclusion - Answer to Question 1:

1. There are no missing values
2. There are no duplicates
3. There are 2 outliers, user id 607 and shop id 78, and these 2 are the main sources of the suspected fraud
4. The mean of the cleaned data (outliers removed) is $302.58
5. The median (284) could also be reported, but once the outliers are removed the mean is a better representative of the typical order value
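The cleaning-then-summarizing step behind these conclusions can be sketched as a single helper. This is an illustration under stated assumptions: the function name `aov_summary` and the toy rows are hypothetical, and only the column names `user_id`, `shop_id`, and `order_amount` come from the notebook.

```python
import pandas as pd

def aov_summary(frame: pd.DataFrame, excluded_users=(), excluded_shops=()) -> dict:
    """Drop orders from the excluded users or shops, then summarize order_amount."""
    keep = (~frame["user_id"].isin(excluded_users)) & (~frame["shop_id"].isin(excluded_shops))
    kept_amounts = frame.loc[keep, "order_amount"]
    return {
        "orders_kept": int(keep.sum()),
        "mean": float(kept_amounts.mean()),
        "median": float(kept_amounts.median()),
    }

# Hypothetical rows, not the challenge data.
toy = pd.DataFrame({
    "user_id": [607, 700, 701, 702, 703],
    "shop_id": [42, 42, 78, 10, 11],
    "order_amount": [704000, 352, 25725, 280, 310],
})

summary = aov_summary(toy, excluded_users=[607], excluded_shops=[78])
print(summary)
```

Reporting the mean and median side by side makes it easy to see whether any remaining extreme orders are still skewing the mean.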


## Question 2

#### Q1. How many orders were shipped by Speedy Express in total?

<br> SELECT COUNT(o.OrderID)
<br> FROM Orders o INNER JOIN Shippers s
<br> ON o.ShipperID = s.ShipperID
<br> WHERE s.ShipperName = 'Speedy Express'

<br> <b> Answer: 54 </b>

#### Q2. What is the last name of the employee with the most orders?

<br> SELECT e.LastName, COUNT(o.OrderID) AS NetOrders
<br> FROM Orders o INNER JOIN Employees e
<br> ON o.EmployeeID = e.EmployeeID
<br> GROUP BY e.EmployeeID, e.LastName
<br> ORDER BY NetOrders DESC
<br> LIMIT 1

<br> <b> Answer: Peacock (40 orders) </b>
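One way to sanity-check the "employee with the most orders" logic is to run a plain `GROUP BY` / `ORDER BY` / `LIMIT 1` formulation against a tiny in-memory SQLite database. The rows below are hypothetical stand-ins for the W3Schools tables, and only the columns needed for the question are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature versions of the Employees and Orders tables.
conn.executescript("""
CREATE TABLE Employees (EmployeeID INTEGER PRIMARY KEY, LastName TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, EmployeeID INTEGER);
INSERT INTO Employees VALUES (1, 'Alpha'), (2, 'Beta');
INSERT INTO Orders VALUES (10, 1), (11, 2), (12, 2), (13, 2);
""")

# Count orders per employee and keep the row with the highest count.
top_employee = conn.execute("""
SELECT e.LastName, COUNT(o.OrderID) AS NetOrders
FROM Orders o
INNER JOIN Employees e ON o.EmployeeID = e.EmployeeID
GROUP BY e.EmployeeID, e.LastName
ORDER BY NetOrders DESC
LIMIT 1
""").fetchone()

conn.close()
print(top_employee)
```

Grouping explicitly and sorting avoids relying on how a particular engine fills in non-aggregated columns next to `MAX(...)`.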

#### Q3. What product was ordered the most by customers in Germany?

<br> SELECT p.ProductName, COUNT(DISTINCT o.OrderID) AS NetOrders
<br> FROM Orders o
<br> INNER JOIN OrderDetails od ON o.OrderID = od.OrderID
<br> INNER JOIN Customers c ON o.CustomerID = c.CustomerID
<br> INNER JOIN Products p ON od.ProductID = p.ProductID
<br> WHERE c.Country = 'Germany'
<br> GROUP BY p.ProductID, p.ProductName
<br> ORDER BY NetOrders DESC
<br> LIMIT 1

<br> <b> Answer: Gorgonzola Telino (appears in 5 orders) </b>
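"Ordered the most" can be read as the number of orders containing a product (the reading used in this answer) or as the total quantity ordered. The sketch below, on hypothetical rows in an in-memory SQLite database, shows that the two readings can pick different products. The helper `top_german_product` and all data values are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical miniature tables with only the columns the question needs.
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, CustomerID INTEGER);
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT);
CREATE TABLE OrderDetails (OrderDetailID INTEGER PRIMARY KEY, OrderID INTEGER,
                           ProductID INTEGER, Quantity INTEGER);
INSERT INTO Customers VALUES (1, 'Germany'), (2, 'France');
INSERT INTO Orders VALUES (1, 1), (2, 1), (3, 1), (4, 2);
INSERT INTO Products VALUES (1, 'Widget A'), (2, 'Widget B');
INSERT INTO OrderDetails VALUES
    (1, 1, 1, 1), (2, 2, 1, 1), (3, 3, 1, 1),  -- Widget A: 3 German orders, 3 units
    (4, 1, 2, 50),                             -- Widget B: 1 German order, 50 units
    (5, 4, 2, 100);                            -- French order, must be ignored
""")

def top_german_product(metric_sql: str):
    """Rank products ordered from Germany by the given aggregate expression.

    `metric_sql` is a fixed expression written in this script, never user input.
    """
    return conn.execute(f"""
        SELECT p.ProductName, {metric_sql} AS Score
        FROM Orders o
        INNER JOIN OrderDetails od ON o.OrderID = od.OrderID
        INNER JOIN Customers c ON o.CustomerID = c.CustomerID
        INNER JOIN Products p ON od.ProductID = p.ProductID
        WHERE c.Country = 'Germany'
        GROUP BY p.ProductID, p.ProductName
        ORDER BY Score DESC
        LIMIT 1
    """).fetchone()

by_orders = top_german_product("COUNT(DISTINCT o.OrderID)")
by_quantity = top_german_product("SUM(od.Quantity)")
conn.close()

print(by_orders, by_quantity)
```

Whichever reading is chosen, stating it next to the answer removes the ambiguity.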