# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarter of an international fashion goods chain store. Your boss today asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (aka. 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentile**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, extract and import `Orders` dataset into a dataframe variable called `orders`. Print the head of `orders` to overview the data:

In [3]:
# your code here

orders = pd.read_csv("./Orders/Orders.csv")
orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


---

"Identify VIP and Preferred Customers" is the non-technical goal of your boss. You need to translate that goal into technical languages that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the  `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now in the workspace below, tackle each of the sub problems using the iterative problem solving workflow. Insert cells as necessary to write your codes and explain your steps.

In [None]:
# your code here
#1 aggregate the amount_spent for unique customers

In [47]:
problem_1 = orders.groupby(['CustomerID']).agg({'amount_spent':'sum'})
problem_1

Unnamed: 0_level_0,amount_spent
CustomerID,Unnamed: 1_level_1
12346,77183.60
12347,4310.00
12348,1797.24
12349,1757.55
12350,334.40
...,...
18280,180.60
18281,80.82
18282,178.05
18283,2094.88


In [None]:
#2 check dataframe to have an idea the amounts expend in general

In [7]:
orders.describe()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,year,month,day,hour,Quantity,UnitPrice,CustomerID,amount_spent
count,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0,397924.0
mean,278465.221859,560617.126645,2010.934259,7.612537,3.614555,12.728247,13.021823,3.116174,15294.315171,22.394749
std,152771.368303,13106.167695,0.247829,3.416527,1.928274,2.273535,180.42021,22.096788,1713.169877,309.055588
min,0.0,536365.0,2010.0,1.0,1.0,6.0,1.0,0.0,12346.0,0.0
25%,148333.75,549234.0,2011.0,5.0,2.0,11.0,2.0,1.25,13969.0,4.68
50%,284907.5,561893.0,2011.0,8.0,3.0,13.0,6.0,1.95,15159.0,11.8
75%,410079.25,572090.0,2011.0,11.0,5.0,14.0,12.0,3.75,16795.0,19.8
max,541908.0,581587.0,2011.0,12.0,7.0,20.0,80995.0,8142.75,18287.0,168469.6


In [14]:
#Create 3 categories for the type of client
labels = ["Small", "Medium","Big"]
bins = pd.qcut(orders['amount_spent'],3, labels = labels)
bins.value_counts()

Medium    133753
Small     132875
Big       131296
Name: amount_spent, dtype: int64

In [15]:
#Incorporate those categories to the dataframe
#Now we can segregate customers based on the amounts spent
orders['purchase type'] = bins

In [16]:
orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,purchase type
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,Medium
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,Big
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big


In [46]:
problem_2 = orders.groupby(['purchase type']).agg({'amount_spent':['sum','mean']})
problem_2

Unnamed: 0_level_0,amount_spent,amount_spent
Unnamed: 0_level_1,sum,mean
purchase type,Unnamed: 1_level_2,Unnamed: 2_level_2
Small,422187.814,3.177331
Medium,1597781.93,11.945765
Big,6891438.16,52.4878


In [None]:
# 3 Filter "VIP" or "Preferred" clients
# Since in here we need a more segregated distinction, we include an aditional category
# Small = Regular /  Medium = Mediun / Big = Preferred / Exclusive = VIP

In [29]:
#Create 4 categories for the type of client
labels = ["Regular", "Medium","Preferred", "VIP"]
bins_2 = pd.qcut(orders['amount_spent'],4, labels = labels)
bins_2.value_counts()

Preferred    109190
Regular       99669
Medium        99443
VIP           89622
Name: amount_spent, dtype: int64

In [30]:
orders['client type'] = bins_2

In [31]:
orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,purchase type,client type
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,Medium,Preferred
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big,VIP
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,Big,VIP
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big,VIP
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big,VIP


In [45]:
problem_3 = orders.groupby(['client type']).agg({'amount_spent':['sum','mean']})
problem_3

Unnamed: 0_level_0,amount_spent,amount_spent
Unnamed: 0_level_1,sum,mean
client type,Unnamed: 1_level_2,Unnamed: 2_level_2
Regular,239833.604,2.406301
Medium,791648.98,7.960832
Preferred,1747446.5,16.003723
VIP,6132478.82,68.426043


Now we'll leave it to you to solve Q2 & Q3, which you can leverage from your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [40]:
# your code here
#Filter dataframe to get only VIP clients
VIP_custumers = orders[orders['client type']=='VIP']
VIP_custumers

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,purchase type,client type
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big,VIP
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.00,Big,VIP
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big,VIP
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Big,VIP
6,6,536365,21730,2010,12,3,8,glass star frosted t-light holder,6,2010-12-01 08:26:00,4.25,17850,United Kingdom,25.50,Big,VIP
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397906,541891,581586,23275,2011,12,5,12,set of 3 hanging owls ollie beak,24,2011-12-09 12:49:00,1.25,13113,United Kingdom,30.00,Big,VIP
397907,541892,581586,21217,2011,12,5,12,red retrospot round cake tins,24,2011-12-09 12:49:00,8.95,13113,United Kingdom,214.80,Big,VIP
397908,541893,581586,20685,2011,12,5,12,doormat red retrospot,10,2011-12-09 12:49:00,7.08,13113,United Kingdom,70.80,Big,VIP
397909,541894,581587,22631,2011,12,5,12,circus parade lunch box,12,2011-12-09 12:50:00,1.95,12680,France,23.40,Big,VIP


In [44]:
problem_4 = VIP_custumers.groupby(['Country']).agg({'client type':'count'})
problem_4.sort_values('client type', ascending=False)

Unnamed: 0_level_0,client type
Country,Unnamed: 1_level_1
United Kingdom,72944
Germany,2893
France,2736
EIRE,2662
Netherlands,1927
Australia,843
Switzerland,771
Spain,599
Belgium,581
Norway,503


## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [54]:
# your code here

#Filter dataframe to get only VIP and Preferred Customers

especial_custumers =  orders.loc[(orders['client type'] == 'vip') | (orders['client type'] == 'Preferred')]
especial_custumers

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,purchase type,client type
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.30,Medium,Preferred
5,5,536365,22752,2010,12,3,8,set 7 babushka nesting boxes,2,2010-12-01 08:26:00,7.65,17850,United Kingdom,15.30,Medium,Preferred
10,10,536367,22745,2010,12,3,8,poppy's playhouse bedroom,6,2010-12-01 08:34:00,2.10,13047,United Kingdom,12.60,Medium,Preferred
11,11,536367,22748,2010,12,3,8,poppy's playhouse kitchen,6,2010-12-01 08:34:00,2.10,13047,United Kingdom,12.60,Medium,Preferred
15,15,536367,22623,2010,12,3,8,box of vintage jigsaw blocks,3,2010-12-01 08:34:00,4.95,13047,United Kingdom,14.85,Medium,Preferred
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397918,541903,581587,23256,2011,12,5,12,childrens cutlery spaceboy,4,2011-12-09 12:50:00,4.15,12680,France,16.60,Big,Preferred
397920,541905,581587,22899,2011,12,5,12,children's apron dolly girl,6,2011-12-09 12:50:00,2.10,12680,France,12.60,Medium,Preferred
397921,541906,581587,23254,2011,12,5,12,childrens cutlery dolly girl,4,2011-12-09 12:50:00,4.15,12680,France,16.60,Big,Preferred
397922,541907,581587,23255,2011,12,5,12,childrens cutlery circus parade,4,2011-12-09 12:50:00,4.15,12680,France,16.60,Big,Preferred


In [55]:
problem_5 = especial_custumers.groupby(['Country']).agg({'client type':'count'})
problem_5.sort_values('client type', ascending=False)

Unnamed: 0_level_0,client type
Country,Unnamed: 1_level_1
United Kingdom,91400
Germany,4168
France,3551
EIRE,3276
Spain,1048
Belgium,1034
Switzerland,726
Portugal,564
Norway,435
Italy,393
