# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e., the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, extract the `Orders` dataset and load it into a DataFrame variable called `orders`. Display `orders` to get an overview of the data:

In [3]:
orders = pd.read_csv("Orders.csv")
orders

| Unnamed: 0.1 | Unnamed: 0 | InvoiceNo | StockCode | year | month | day | hour | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | amount_spent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 536365.0 | 85123A | 2010.0 | 12.0 | 3.0 | 8.0 | white hanging heart t-light holder | 6.0 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom | 15.30 |
| 1 | 1 | 536365.0 | 71053 | 2010.0 | 12.0 | 3.0 | 8.0 | white metal lantern | 6.0 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 2 | 2 | 536365.0 | 84406B | 2010.0 | 12.0 | 3.0 | 8.0 | cream cupid hearts coat hanger | 8.0 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom | 22.00 |
| 3 | 3 | 536365.0 | 84029G | 2010.0 | 12.0 | 3.0 | 8.0 | knitted union flag hot water bottle | 6.0 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| 4 | 4 | 536365.0 | 84029E | 2010.0 | 12.0 | 3.0 | 8.0 | red woolly hottie white heart. | 6.0 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom | 20.34 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 388661 | 526725 | 580677.0 | 85034C | 2011.0 | 12.0 | 1.0 | 14.0 | 3 rose morris boxed candles | 1.0 | 2011-12-05 14:40:00 | 1.25 | 16200.0 | United Kingdom | 1.25 |
| 388662 | 526726 | 580677.0 | 22055 | 2011.0 | 12.0 | 1.0 | 14.0 | mini cake stand hanging strawbery | 1.0 | 2011-12-05 14:40:00 | 0.39 | 16200.0 | United Kingdom | 0.39 |
| 388663 | 526727 | 580677.0 | 84991 | 2011.0 | 12.0 | 1.0 | 14.0 | 60 teatime fairy cake cases | 1.0 | 2011-12-05 14:40:00 | 0.55 | 16200.0 | United Kingdom | 0.55 |
| 388664 | 526728 | 580677.0 | 23210 | 2011.0 | 12.0 | 1.0 | 14.0 | white rocking horse hand painted | 1.0 | 2011-12-05 14:40:00 | 1.25 | 16200.0 | United Kingdom | 1.25 |
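For a shorter overview than the full frame, `head()` returns only the first few rows. The sketch below is self-contained and illustrative: it assumes a three-row sample copied from the output above and trimmed to a few columns, read from an in-memory string instead of `Orders.csv`. The names `csv_text`, `sample`, and `preview` are hypothetical.

```python
import io

import pandas as pd

# Three rows copied from the output above, trimmed to a few columns
# (illustration only; the real data lives in Orders.csv).
csv_text = """InvoiceNo,StockCode,Quantity,UnitPrice,CustomerID,Country,amount_spent
536365,85123A,6,2.55,17850,United Kingdom,15.30
536365,71053,6,3.39,17850,United Kingdom,20.34
536365,84406B,8,2.75,17850,United Kingdom,22.00
"""

sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows rather than the whole frame
preview = sample.head(2)
print(preview)

# shape and dtypes give a quick structural overview
print(sample.shape)
print(sample.dtypes)
```

On the real data, `orders.head()` and `orders.dtypes` would play the same role.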


---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*
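As a rough illustration of Sub Problems 1 and 2, the sketch below works on a tiny made-up table, so the numbers are easy to check by hand. The frame `toy_orders` and the names `totals`, `q75`, `q95`, `top`, and `middle` are assumptions for this example, not part of the `Orders` data.

```python
import pandas as pd

# Made-up orders for illustration; not taken from Orders.csv
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3, 4, 5],
    "amount_spent": [10.0, 20.0, 5.0, 100.0, 50.0, 50.0, 40.0, 60.0],
})

# Sub Problem 1: one aggregated total per unique customer
totals = toy_orders.groupby("CustomerID")["amount_spent"].sum()

# Sub Problem 2: compute the quantile cutoffs, then filter with boolean masks
q75, q95 = totals.quantile([0.75, 0.95])
top = totals[totals > q95]                            # above the 95th percentile
middle = totals[(totals >= q75) & (totals <= q95)]    # between the 75th and 95th

print(q75, q95)
print(top)
print(middle)
```

With pandas' default linear interpolation, the five totals (5, 30, 40, 60, 200) give a 0.75 quantile of 60 and a 0.95 quantile of 172, so customer 3 lands in `top` and customer 5 in `middle`.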

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [7]:
# Total amount spent per unique customer
Agg_amount_spent = orders.groupby("CustomerID")["amount_spent"].sum()
Agg_amount_spent

CustomerID
12346.0    77183.60
12347.0     4085.18
12348.0     1797.24
12349.0     1757.55
12350.0      334.40
             ...   
18280.0      180.60
18281.0       80.82
18282.0      178.05
18283.0     1886.88
18287.0     1837.28
Name: amount_spent, Length: 4314, dtype: float64

In [8]:
Agg_amount_spent.max()

268478.0

In [9]:
Agg_amount_spent.min()

0.0

In [16]:
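# Cutoff values at the 75th and 95th percentiles of the aggregated totals
percentiles = Agg_amount_spent.quantile([0.75, 0.95])

# VIP: strictly above the 95th percentile
vip_customers = Agg_amount_spent[Agg_amount_spent > percentiles[0.95]]
# Preferred: from the 75th up to and including the 95th percentile
preferred_customers = Agg_amount_spent[(Agg_amount_spent >= percentiles[0.75]) & (Agg_amount_spent <= percentiles[0.95])]

display(vip_customers)
display(preferred_customers)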

CustomerID
12346.0     77183.60
12357.0      6207.67
12359.0      6372.58
12409.0     11072.67
12415.0    124914.53
             ...    
18109.0      7360.30
18139.0      8438.34
18172.0      7561.68
18223.0      6484.54
18229.0      7276.90
Name: amount_spent, Length: 216, dtype: float64

CustomerID
12347.0    4085.18
12348.0    1797.24
12349.0    1757.55
12352.0    2506.04
12356.0    2811.43
            ...   
18259.0    2338.60
18260.0    2643.20
18272.0    2710.70
18283.0    1886.88
18287.0    1837.28
Name: amount_spent, Length: 863, dtype: float64
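Sub Problem 3 asks for a label per customer rather than two separate Series. One possible approach, sketched here on made-up totals, is `np.select`, which assigns the first matching label to each customer. The `segment` column name and the `"Regular"` fallback label are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd

# Made-up aggregated totals per customer (illustration only)
totals = pd.Series(
    [30.0, 5.0, 200.0, 40.0, 60.0],
    index=pd.Index([1, 2, 3, 4, 5], name="CustomerID"),
    name="amount_spent",
)

q75, q95 = totals.quantile([0.75, 0.95])

# np.select picks the first condition that is True for each element,
# so the VIP test must come before the broader Preferred test.
conditions = [totals > q95, totals >= q75]
labels = np.select(conditions, ["VIP", "Preferred"], default="Regular")

labelled = totals.to_frame()
labelled["segment"] = labels  # hypothetical column name
print(labelled)
```

Applied to the notebook's data, the same pattern would use `Agg_amount_spent` in place of `totals`.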

Now we'll leave it to you to solve Q2 and Q3, building on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [18]:
vip_customer_counts = orders[orders['CustomerID'].isin(vip_customers.index)].groupby('Country')['CustomerID'].nunique()

country_with_most_vip_customers = vip_customer_counts.idxmax()
num_vip_customers_in_country = vip_customer_counts.max()

print(f"The country with the most VIP Customers is {country_with_most_vip_customers} with {num_vip_customers_in_country} VIP Customers.")


The country with the most VIP Customers is United Kingdom with 176 VIP Customers.
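The reason for `nunique()` rather than a plain row count is that each customer usually has many order lines. The sketch below shows the difference on a made-up table; `toy_orders` and `vip_ids` are hypothetical stand-ins for `orders` and `vip_customers.index`.

```python
import pandas as pd

# Made-up order lines for illustration; not taken from Orders.csv
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 4],
    "Country": [
        "United Kingdom", "United Kingdom", "France",
        "United Kingdom", "United Kingdom", "Germany",
    ],
})
vip_ids = [1, 3, 4]  # hypothetical VIP customer IDs

vip_rows = toy_orders[toy_orders["CustomerID"].isin(vip_ids)]

# size() counts order lines, so repeat buyers are counted several times
lines_per_country = vip_rows.groupby("Country").size()

# nunique() counts each customer once per country
vip_per_country = vip_rows.groupby("Country")["CustomerID"].nunique()

print(lines_per_country)
print(vip_per_country)
print(vip_per_country.idxmax())
```

Here the United Kingdom has four VIP order lines but only two distinct VIP customers, which is the figure the question asks for.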


## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [21]:
preferred_customer_counts = orders[orders["CustomerID"].isin(preferred_customers.index)].groupby("Country")["CustomerID"].nunique()

# Add the two per-country counts; fill_value=0 keeps countries that appear in only one group
combined_counts = vip_customer_counts.add(preferred_customer_counts, fill_value=0).astype(int)

country_with_most_both = combined_counts.idxmax()
num_both_in_country = combined_counts.max()

print(f"The country with the most VIP+Preferred Customers is {country_with_most_both} with {num_both_in_country} Customers.")

The country with the most VIP+Preferred Customers is United Kingdom with 922 Customers.
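When two per-country count Series are combined, pandas aligns them on their index, so a country present in only one Series becomes `NaN` under a plain `+` and the result turns into floats. The sketch below shows this on made-up counts and contrasts it with `Series.add(..., fill_value=0)`; the Series `vip_counts` and `preferred_counts` are hypothetical.

```python
import pandas as pd

# Made-up per-country customer counts (illustration only)
vip_counts = pd.Series({"United Kingdom": 3, "Germany": 1})
preferred_counts = pd.Series({"United Kingdom": 5, "France": 2})

# Plain + aligns on the index; countries missing from either side become NaN
naive = vip_counts + preferred_counts

# add(..., fill_value=0) treats a missing country as a count of zero
combined = vip_counts.add(preferred_counts, fill_value=0).astype(int)

print(naive)
print(combined)
print(combined.idxmax(), combined.max())
```

Taking `idxmax()` of the combined Series gives a single country name, whereas adding the two separate `idxmax()` results would concatenate two strings.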
