# Challenge 3

In this challenge we will work with the `Orders.csv` data set from the previous [Subsetting and Descriptive Stats lab](../../lab-subsetting-and-descriptive-stats/your-code/main.ipynb). You will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (a.k.a. the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

# Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders.csv` from the "subsetting" lab folder into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [2]:
# enter your code here
orders = pd.read_csv("../../lab-subsetting-and-descriptive-stats/your-code/Orders.csv")
orders

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.30
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.00
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397919,541904,581587,22613,2011,12,5,12,pack of 20 spaceboy napkins,12,2011-12-09 12:50:00,0.85,12680,France,10.20
397920,541905,581587,22899,2011,12,5,12,children's apron dolly girl,6,2011-12-09 12:50:00,2.10,12680,France,12.60
397921,541906,581587,23254,2011,12,5,12,childrens cutlery dolly girl,4,2011-12-09 12:50:00,4.15,12680,France,16.60
397922,541907,581587,23255,2011,12,5,12,childrens cutlery circus parade,4,2011-12-09 12:50:00,4.15,12680,France,16.60


---

"Identify VIP and Preferred Customers" is the non-technical goal of your boss. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the  `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*
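
The three sub problems above can be sketched end to end on a small example before working with the real file. This is only an illustration under stated assumptions: `toy_orders` and its values are made up, and `np.select` is one of several ways to turn boolean masks into labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for Orders.csv (values are made up):
# 20 customers with two order rows each.
toy_orders = pd.DataFrame({
    "CustomerID": np.repeat(np.arange(1, 21), 2),
    "amount_spent": np.repeat(np.arange(1, 21) * 5.0, 2),
})

# Sub Problem 1: total spend per unique customer.
totals = toy_orders.groupby("CustomerID")["amount_spent"].sum()

# Sub Problem 2: quantile thresholds and a boolean mask for each range.
q_low, q_high = totals.quantile(0.75), totals.quantile(0.95)
is_vip = totals > q_high
is_preferred = (totals > q_low) & (totals <= q_high)

# Sub Problem 3: turn the masks into one label per customer.
labels = pd.Series(
    np.select([is_vip, is_preferred], ["VIP", "Preferred"], default=""),
    index=totals.index,
)
print(labels[labels != ""])
```

In this toy set only customer 20 ends up above the 0.95 quantile, and customers 16 to 19 fall between the two thresholds.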

Now in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [5]:
# Sub Problem 1: How to aggregate the `amount_spent` for unique customers?
# 1st: group by customer ID and sum the amount spent
customer_class = orders.groupby('CustomerID').agg({'amount_spent':'sum'})
customer_class

CustomerID,amount_spent
12346,77183.60
12347,4310.00
12348,1797.24
12349,1757.55
12350,334.40
...,...
18280,180.60
18281,80.82
18282,178.05
18283,2094.88


In [6]:
# Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?
# 2nd: sort the aggregated values in descending order
customer_class_ordered = customer_class.sort_values(by='amount_spent', ascending=False)
customer_class_ordered

CustomerID,amount_spent
14646,280206.02
18102,259657.30
17450,194550.79
16446,168472.50
14911,143825.06
...,...
17956,12.75
16454,6.90
14792,6.20
16738,3.75


In [8]:
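# 3rd: compute the 75th and 95th percentile thresholds
# (select the column first so quantile() returns a scalar, not a Series)
q_75 = customer_class_ordered['amount_spent'].quantile(0.75)
print(q_75)
q_95 = customer_class_ordered['amount_spent'].quantile(0.95)
print(q_95)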

1661.6400000000003
5840.181999999983
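
The thresholds printed above need not be values that any customer actually spent, because `quantile` interpolates linearly between neighbouring values by default. The following minimal sketch, using made-up totals, shows how the interpolation works. It also shows that the input does not have to be sorted first.

```python
import pandas as pd

# Hypothetical per-customer totals (made-up values), deliberately unsorted.
totals = pd.Series([30.0, 10.0, 50.0, 20.0, 40.0])

# With the default linear interpolation, the quantile sits at position
# q * (n - 1) in the sorted values: 10, 20, 30, 40, 50.
q_75 = totals.quantile(0.75)  # position 3.0, exactly the value 40.0
q_95 = totals.quantile(0.95)  # position 3.8, i.e. 40.0 + 0.8 * (50.0 - 40.0)

# Selecting a named column before calling quantile() returns a scalar directly.
frame = totals.to_frame("amount_spent")
same_q_75 = frame["amount_spent"].quantile(0.75)

print(q_75, q_95, same_q_75)
```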


In [9]:
#4th: identify VIP customers
VIP_customers = list(customer_class_ordered[customer_class_ordered['amount_spent'] > q_95].index)
VIP_customers

[14646,
 18102,
 17450,
 16446,
 14911,
 12415,
 14156,
 17511,
 16029,
 12346,
 16684,
 14096,
 13694,
 15311,
 13089,
 17949,
 15769,
 15061,
 14298,
 14088,
 15749,
 12931,
 17841,
 15098,
 13798,
 16013,
 16422,
 12748,
 15838,
 17404,
 17389,
 13098,
 14680,
 13081,
 13408,
 17857,
 16333,
 13777,
 12753,
 12744,
 16210,
 17675,
 17381,
 15039,
 12471,
 12731,
 15159,
 12901,
 12678,
 14031,
 17428,
 13767,
 13881,
 16839,
 12921,
 14607,
 15856,
 17677,
 15189,
 14051,
 15513,
 16133,
 14866,
 16705,
 12681,
 12621,
 12540,
 12433,
 15498,
 12477,
 17735,
 16525,
 14258,
 13078,
 12536,
 15640,
 16000,
 17340,
 12682,
 13113,
 14606,
 12557,
 12939,
 15125,
 14194,
 12971,
 14895,
 12409,
 15482,
 17581,
 13319,
 17107,
 13340,
 14769,
 17139,
 16779,
 17865,
 17706,
 15251,
 14062,
 15615,
 15502,
 16180,
 16843,
 12590,
 13001,
 13199,
 15078,
 12709,
 13458,
 14733,
 16523,
 12567,
 14367,
 14667,
 18092,
 13969,
 12451,
 13488,
 13090,
 16033,
 17017,
 17243,
 17306,
 16656,


In [10]:
#5th: identify preferred customers
preferred_customers = list(customer_class_ordered[(customer_class_ordered['amount_spent'] > q_75) & (customer_class_ordered['amount_spent'] <= q_95)].index)
preferred_customers

[13050,
 12720,
 15218,
 17686,
 13178,
 16553,
 13468,
 14110,
 14049,
 17049,
 17716,
 13004,
 18118,
 14688,
 17757,
 12481,
 12839,
 16767,
 16161,
 12539,
 15805,
 18225,
 16609,
 16985,
 16265,
 18198,
 12490,
 17809,
 16701,
 17719,
 17850,
 16303,
 13269,
 15150,
 17730,
 17061,
 18226,
 12362,
 15046,
 16258,
 13756,
 16191,
 15727,
 17858,
 13599,
 15903,
 13941,
 17602,
 16700,
 14562,
 15301,
 13139,
 12432,
 15719,
 17426,
 14329,
 14755,
 12444,
 15032,
 15547,
 12437,
 12700,
 15786,
 14709,
 17133,
 12664,
 12688,
 14292,
 17652,
 15738,
 13013,
 15555,
 17365,
 15298,
 15187,
 17690,
 12955,
 14180,
 15152,
 16722,
 12714,
 14159,
 14135,
 14191,
 15291,
 14189,
 14390,
 14286,
 16931,
 13802,
 15093,
 14004,
 17613,
 17068,
 12484,
 13267,
 15299,
 17975,
 12524,
 14426,
 16984,
 17364,
 12627,
 14085,
 14534,
 13069,
 15955,
 16626,
 15874,
 12712,
 13505,
 12752,
 16945,
 14016,
 14800,
 14525,
 17567,
 18251,
 12347,
 15034,
 17937,
 14422,
 12500,
 17419,
 14277,


In [16]:
# Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?
# 6th: create a `customer_label` column based on membership in the two lists (VIP and Preferred)
orders['customer_label'] = np.where(
    orders['CustomerID'].isin(VIP_customers), 'VIP',
    np.where(orders['CustomerID'].isin(preferred_customers), 'Preferred', ''))
orders

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,customer_label
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.30,Preferred
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Preferred
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.00,Preferred
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Preferred
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,Preferred
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397919,541904,581587,22613,2011,12,5,12,pack of 20 spaceboy napkins,12,2011-12-09 12:50:00,0.85,12680,France,10.20,
397920,541905,581587,22899,2011,12,5,12,children's apron dolly girl,6,2011-12-09 12:50:00,2.10,12680,France,12.60,
397921,541906,581587,23254,2011,12,5,12,childrens cutlery dolly girl,4,2011-12-09 12:50:00,4.15,12680,France,16.60,
397922,541907,581587,23255,2011,12,5,12,childrens cutlery circus parade,4,2011-12-09 12:50:00,4.15,12680,France,16.60,
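
As an alternative to building two lists and nesting `np.where`, the label can be computed once per customer and then mapped back onto the order rows. The sketch below is one possible approach, not the lab's reference solution. It assumes made-up toy data, and it relies on `pd.cut` using right-closed bins by default, which matches the `> q_75` and `<= q_95` conditions used earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical toy orders (made-up values): 20 customers, one row each.
toy_orders = pd.DataFrame({
    "CustomerID": np.arange(1, 21),
    "amount_spent": np.arange(1, 21) * 10.0,
})

totals = toy_orders.groupby("CustomerID")["amount_spent"].sum()
q_75, q_95 = totals.quantile(0.75), totals.quantile(0.95)

# Right-closed bins: (-inf, q_75], (q_75, q_95], (q_95, inf)
customer_label = pd.cut(
    totals,
    bins=[-np.inf, q_75, q_95, np.inf],
    labels=["", "Preferred", "VIP"],
).astype(str)

# Broadcast the per-customer label back onto every order row.
toy_orders["customer_label"] = toy_orders["CustomerID"].map(customer_label)
print(toy_orders["customer_label"].value_counts())
```

Keeping the label in a per-customer Series also makes it easy to count customers rather than order rows in later steps.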


Now we'll leave it to you to solve Q2 & Q3, which can build on your solution for Q1:

# Q2: How to identify which country has the most VIP Customers?

# Q3: How to identify which country has the most VIP+Preferred Customers combined?

Provide your solution for Q2 below:

In [22]:
# Q2: How to identify which country has the most VIP Customers?
# 1st: keep only the order rows of VIP customers, then count those rows per country
df_countries = orders[orders['customer_label'] == 'VIP'].groupby('Country').agg({'customer_label':'count'})
df_countries

Country,customer_label
Australia,898
Belgium,54
Channel Islands,364
Cyprus,248
Denmark,36
EIRE,7077
Finland,294
France,3290
Germany,3127
Japan,205


In [42]:
# 2nd: sort the countries in descending order and select the first one
df_countries_sorted = df_countries.sort_values('customer_label', ascending=False)
df_countries_sorted.iloc[0]


customer_label    84185
Name: United Kingdom, dtype: int64
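
Note that the count above tallies order rows, so a customer with many orders is counted many times. If the goal is the number of distinct VIP customers per country, one option is to count unique `CustomerID` values instead. The sketch below uses hypothetical toy data (values made up) to show how the two counts can disagree.

```python
import pandas as pd

# Hypothetical labeled order rows (made-up values): customer 1 places four
# orders in one country, customers 2 and 3 place one order each in another.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2, 3],
    "Country": ["EIRE", "EIRE", "EIRE", "EIRE", "France", "France"],
    "customer_label": ["VIP"] * 6,
})
vip_rows = toy[toy["customer_label"] == "VIP"]

# Counting rows measures how many order lines VIP customers generated.
rows_per_country = vip_rows.groupby("Country")["customer_label"].count()

# Counting distinct CustomerID values measures how many VIP customers there are.
customers_per_country = vip_rows.groupby("Country")["CustomerID"].nunique()

top_by_rows = rows_per_country.idxmax()
top_by_customers = customers_per_country.idxmax()
print(top_by_rows, top_by_customers)
```

Here the row count favours the country with one very active customer, while the unique-customer count favours the country with more customers.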

In [43]:
# Q3: How to identify which country has the most VIP+Preferred Customers combined?
# same steps, but keeping the rows labeled either VIP or Preferred
df_countries_2 = orders[orders['customer_label'].isin(['VIP','Preferred'])].groupby('Country').agg({'customer_label':'count'})
df_countries_2

Country,customer_label
Australia,1028
Austria,158
Belgium,1557
Canada,135
Channel Islands,589
Cyprus,451
Denmark,217
EIRE,7238
Finland,504
France,6301


In [44]:
# 2nd: sort the countries in descending order and select the first one
df_countries_2_sorted = df_countries_2.sort_values('customer_label', ascending=False)
df_countries_2_sorted.iloc[0]


customer_label    221635
Name: United Kingdom, dtype: int64
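
Q2 and Q3 can also be answered in one pass by counting distinct customers per country and label, then pivoting the labels into columns. This is a sketch on hypothetical toy data (values made up), assuming unique customers rather than order rows are the quantity of interest.

```python
import pandas as pd

# Hypothetical labeled order rows (made-up values).
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 6, 3, 4, 5, 5],
    "Country": ["UK", "UK", "UK", "UK", "France", "France", "France", "France"],
    "customer_label": ["VIP", "VIP", "Preferred", "Preferred", "VIP", "VIP", "", ""],
})

# Keep only rows that carry one of the two labels.
labeled = toy[toy["customer_label"].isin(["VIP", "Preferred"])]

# Distinct customers per (country, label), with labels pivoted into columns.
counts = (
    labeled.groupby(["Country", "customer_label"])["CustomerID"]
    .nunique()
    .unstack(fill_value=0)
)
counts["VIP+Preferred"] = counts["VIP"] + counts["Preferred"]

top_vip = counts["VIP"].idxmax()
top_combined = counts["VIP+Preferred"].idxmax()
print(counts)
print(top_vip, top_combined)
```

In this toy set the two questions have different answers, which is why it helps to keep both columns side by side.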