# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e., the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, extract and import the `Orders` dataset into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [2]:
# your code here

orders = pd.read_csv('Orders.csv')
orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [3]:
orders.shape

(397924, 14)

---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the  `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [4]:
# sub 1. aggregate the amount_spent for unique customers
agg_amount_spent = orders.groupby(['CustomerID'])['amount_spent'].sum()
agg_amount_spent

CustomerID
12346    77183.60
12347     4310.00
12348     1797.24
12349     1757.55
12350      334.40
           ...   
18280      180.60
18281       80.82
18282      178.05
18283     2094.88
18287     1837.28
Name: amount_spent, Length: 4339, dtype: float64

In [6]:
# broadcast each customer's total back onto every one of their order rows
orders['sum_amount_spent'] = orders.groupby('CustomerID')['amount_spent'].transform('sum')
orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,sum_amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,5391.21
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,5391.21
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21
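The two cells above aggregate the same column in two different shapes. As a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not part of `Orders`), `sum()` returns one entry per customer, while `transform('sum')` returns one entry per original row:

```python
import pandas as pd

# Hypothetical toy data standing in for the Orders table.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "amount_spent": [10.0, 20.0, 5.0, 100.0, 50.0],
})

# One value per customer: the result is indexed by CustomerID.
per_customer = toy.groupby("CustomerID")["amount_spent"].sum()

# One value per order row: the result keeps the index of `toy`,
# so it can be assigned straight back as a new column.
per_row = toy.groupby("CustomerID")["amount_spent"].transform("sum")

print(per_customer)
print(per_row.tolist())
```

The per-customer series is the right input for computing percentiles, because each customer counts once. The per-row series is convenient for filtering order rows by their customer's total.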


#### VIP customers

In [7]:
# sub 2. select customers whose aggregated amount_spent is in a given quantile range
vip_threshold = np.percentile(agg_amount_spent, 95)
print(vip_threshold)

5840.181999999983
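The threshold above uses `np.percentile`, which takes the percentile on a 0-100 scale. pandas offers `Series.quantile`, which takes the same quantity on a 0-1 scale. A small sketch on made-up totals (the values are assumptions for illustration) suggests that the two agree under their default linear interpolation:

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer totals.
totals = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# NumPy expects a percentile between 0 and 100.
p_np = np.percentile(totals, 95)

# pandas expects a quantile between 0 and 1.
p_pd = totals.quantile(0.95)

print(p_np, p_pd)
```

Either call should work for this challenge. What matters is that the threshold is computed on the per-customer totals rather than on the individual order rows.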


In [15]:
vip_costumers = orders.loc[(orders['sum_amount_spent'] > vip_threshold)]
vip_costumers.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,sum_amount_spent
26,26,536370,22728,2010,12,3,8,alarm clock bakelike pink,24,2010-12-01 08:45:00,3.75,12583,France,90.0,7281.38
27,27,536370,22727,2010,12,3,8,alarm clock bakelike red,24,2010-12-01 08:45:00,3.75,12583,France,90.0,7281.38
28,28,536370,22726,2010,12,3,8,alarm clock bakelike green,12,2010-12-01 08:45:00,3.75,12583,France,45.0,7281.38
29,29,536370,21724,2010,12,3,8,panda and bunnies sticker sheet,12,2010-12-01 08:45:00,0.85,12583,France,10.2,7281.38
30,30,536370,21883,2010,12,3,8,stars gift tape,24,2010-12-01 08:45:00,0.65,12583,France,15.6,7281.38


In [9]:
vip_costumers.shape

(104484, 15)

In [51]:
# List of customer IDs above the 95th percentile
vip_costumers_IDs = list(vip_costumers.CustomerID.unique())
print(vip_costumers_IDs)

[12583, 15311, 16029, 12431, 17511, 13408, 13767, 15513, 13694, 14849, 16210, 12748, 12433, 14911, 17841, 13093, 12921, 13777, 18229, 14606, 13576, 13090, 15694, 17017, 15601, 13418, 14060, 17381, 17581, 15061, 15640, 14031, 12971, 13798, 17396, 14156, 14680, 12557, 16013, 17949, 12682, 15769, 13081, 17243, 15465, 13089, 16033, 18055, 18109, 16839, 16814, 12567, 16353, 14527, 15023, 12472, 16422, 15502, 17677, 17428, 15039, 15078, 14667, 15194, 17450, 12681, 17735, 15838, 14733, 13488, 17675, 18102, 13078, 12709, 16779, 14796, 13199, 17706, 16525, 16558, 15498, 14051, 16713, 13113, 12766, 15005, 14866, 17340, 18092, 15358, 13319, 12621, 12683, 13854, 17857, 15856, 13102, 13969, 12471, 12731, 16656, 14952, 12989, 17865, 16873, 14062, 16923, 12753, 13668, 15044, 14505, 12540, 13225, 13209, 17338, 12476, 15159, 13324, 14961, 14057, 14298, 17404, 14415, 13097, 13458, 15290, 15615, 15482, 16705, 12980, 16746, 13534, 14735, 18223, 16684, 12931, 14769, 17315, 12705, 14646, 13027, 12678, 17306

#### Preferred customers


In [11]:
preferred_threshold = np.percentile(agg_amount_spent, 75)

print(preferred_threshold)

1661.6400000000003


In [59]:
preferred_costumers = orders.loc[(orders['sum_amount_spent'] > preferred_threshold) & (orders['sum_amount_spent'] <= vip_threshold)]
preferred_costumers.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,sum_amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,5391.21
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,5391.21
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21


In [13]:
preferred_costumers.shape

(151781, 15)

In [50]:
# List of customer IDs between the 75th and 95th percentiles
preferred_costumers_IDs = list(preferred_costumers.CustomerID.unique())
print(preferred_costumers_IDs)

[17850, 13047, 15291, 14688, 17809, 16098, 17924, 13448, 16218, 14307, 17920, 13758, 17377, 14001, 12662, 15485, 18144, 16456, 17346, 17873, 13468, 16928, 14696, 17690, 17069, 15235, 15752, 13941, 14135, 14388, 18041, 15955, 14390, 15260, 13305, 15544, 15738, 15827, 14180, 14466, 16186, 17685, 17567, 17838, 17228, 17659, 15299, 17757, 16754, 14395, 15093, 16150, 13520, 12841, 16905, 13013, 14210, 16477, 12600, 12779, 17787, 17954, 17819, 12712, 15373, 17238, 12395, 16455, 13069, 16241, 14800, 15708, 16168, 16931, 15351, 13269, 14810, 18118, 13831, 16983, 17059, 16327, 17211, 15570, 15808, 17858, 16393, 17863, 17402, 12647, 15867, 14506, 15555, 16143, 12720, 12747, 17965, 13174, 16161, 18219, 16638, 13094, 12708, 14189, 16719, 15301, 14825, 17596, 14085, 16919, 16722, 16710, 15984, 17682, 16550, 17068, 15356, 17191, 14409, 13495, 17519, 17218, 14215, 12913, 13564, 17091, 14907, 13756, 17491, 14282, 14673, 13769, 16904, 13880, 12347, 14739, 16293, 17419, 16775, 17591, 12839, 17870, 13267
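Sub Problem 3 asks for a label on each selected customer, which the cells above handle only implicitly by keeping two separate dataframes. One possible sketch, assuming made-up totals and a hypothetical `Regular` label for everyone else, uses `pd.cut` to turn the two thresholds into three labelled bins:

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer totals, indexed by CustomerID.
totals = pd.Series(
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 1000.0],
    index=pd.Index(range(1, 11), name="CustomerID"),
    name="sum_amount_spent",
)

# Thresholds at the 75th and 95th percentiles.
q75, q95 = totals.quantile([0.75, 0.95])

# Bins are right-closed: (-inf, q75], (q75, q95], (q95, inf).
# This matches "> preferred_threshold and <= vip_threshold" for Preferred.
segments = pd.cut(
    totals,
    bins=[-np.inf, q75, q95, np.inf],
    labels=["Regular", "Preferred", "VIP"],
)

print(segments.value_counts())
```

A labelled series like this could then be attached to the order rows, for example by mapping it over `CustomerID`, so that a single column carries the segment.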

Now it is up to you to solve Q2 and Q3, building on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [65]:
# your code here
vip_country = vip_costumers.groupby(['Country'])['CustomerID'].nunique().sort_values(ascending = False)
vip_country.head(1)

Country
United Kingdom    177
Name: CustomerID, dtype: int64

In [28]:
print('The country with the most VIP customers is', vip_country.index[0])

The country with the most VIP customers is United Kingdom


In [44]:
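# Sanity check: compare the sum of the per-country counts with the number of unique VIP customer IDs from Q1.
# The sum is one higher, which means one VIP customer appears under two different countries.

vip_country.sum() == len(vip_costumers_IDs) + 1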

True
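The check above only balances with an extra `+1` because per-country counts tally each (customer, country) pair, not each customer. A minimal sketch on made-up data (the `toy_vip` frame is an assumption) shows how such customers could be located:

```python
import pandas as pd

# Hypothetical VIP order rows; customer 3 has orders in two countries.
toy_vip = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3],
    "Country": ["United Kingdom", "United Kingdom", "France", "Spain", "Belgium"],
})

# Unique customers per country, as in the Q2 solution.
per_country = toy_vip.groupby("Country")["CustomerID"].nunique()

# Number of distinct countries each customer appears under.
countries_per_customer = toy_vip.groupby("CustomerID")["Country"].nunique()
multi_country = countries_per_customer[countries_per_customer > 1]

# How much the per-country sum exceeds the true number of customers.
surplus = per_country.sum() - toy_vip["CustomerID"].nunique()

print(multi_country)
print(surplus)
```

Running the same `groupby('CustomerID')['Country'].nunique()` on `vip_costumers` should reveal which customer causes the surplus in the real data.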

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [62]:
# combine both dataframes (DataFrame.append was removed in pandas 2.0, so use pd.concat)

vip_preferred = pd.concat([preferred_costumers, vip_costumers])
vip_preferred.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,sum_amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3,5391.21
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0,5391.21
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34,5391.21


In [67]:
vip_preferred.shape

(256265, 15)

In [76]:
# confirm that no customer was lost or counted twice when combining the two groups
len(vip_costumers_IDs) + len(preferred_costumers_IDs) == vip_preferred['CustomerID'].nunique()

True

In [78]:
vip_preferred_country = vip_preferred.groupby(['Country'])['CustomerID'].nunique().sort_values(ascending = False)
vip_preferred_country.head(1)

Country
United Kingdom    932
Name: CustomerID, dtype: int64

In [79]:
print('The country with the most VIP and Preferred customers combined is', vip_preferred_country.index[0])

The country with the most VIP and Preferred customers combined is United Kingdom
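Q2 and Q3 can also be answered from a single table. The sketch below assumes a hypothetical customer-level frame with one row per customer and a `segment` column (neither exists in `Orders` as loaded, and the values are made up), and uses `pd.crosstab` to count customers per country and segment:

```python
import pandas as pd

# Hypothetical customer-level table: one row per customer.
customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Country": ["United Kingdom", "United Kingdom", "France",
                "France", "France", "Germany"],
    "segment": ["VIP", "VIP", "VIP", "Preferred", "Preferred", "Preferred"],
})

# Rows are countries, columns are segments, cells are customer counts.
counts = pd.crosstab(customers["Country"], customers["segment"])
counts["VIP+Preferred"] = counts["VIP"] + counts["Preferred"]

# Country with the highest count in each column of interest.
top_vip = counts["VIP"].idxmax()
top_combined = counts["VIP+Preferred"].idxmax()

print(counts)
print(top_vip, top_combined)
```

In this toy data the two answers differ, which is why Q2 and Q3 are worth checking separately even though they coincide in the real data set.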
