# Challenge 3

In this challenge you will work with the `Orders` data set, applying the thinking process and workflow shown in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (a.k.a. the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most VIP Customers, and which country has the most VIP and Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders` from the file `Orders.csv` located in the `data` folder into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [2]:
# your code here
orders = pd.read_csv('../data/Orders.csv')
orders.head(10)

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
5,5,536365,22752,2010,12,3,8,set 7 babushka nesting boxes,2,2010-12-01 08:26:00,7.65,17850,United Kingdom,15.3
6,6,536365,21730,2010,12,3,8,glass star frosted t-light holder,6,2010-12-01 08:26:00,4.25,17850,United Kingdom,25.5
7,7,536366,22633,2010,12,3,8,hand warmer union jack,6,2010-12-01 08:28:00,1.85,17850,United Kingdom,11.1
8,8,536366,22632,2010,12,3,8,hand warmer red polka dot,6,2010-12-01 08:28:00,1.85,17850,United Kingdom,11.1
9,9,536367,84879,2010,12,3,8,assorted colour bird ornament,32,2010-12-01 08:34:00,1.69,13047,United Kingdom,54.08


---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [3]:
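# your code here
# Sub Problem 1: total amount_spent per unique customer
unique_costum = orders.groupby(['CustomerID']).agg({'amount_spent':'sum'})
# Caution: this assignment aligns on index labels, so each CustomerID is matched
# to the row of `orders` carrying that label, not to the customer's own rows
unique_costum['Country'] = orders['Country']
unique_costum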

| CustomerID | amount_spent | Country |
|---|---|---|
| 12346 | 77183.60 | United Kingdom |
| 12347 | 4310.00 | United Kingdom |
| 12348 | 1797.24 | United Kingdom |
| 12349 | 1757.55 | United Kingdom |
| 12350 | 334.40 | United Kingdom |
| ... | ... | ... |
| 18280 | 180.60 | United Kingdom |
| 18281 | 80.82 | United Kingdom |
| 18282 | 178.05 | United Kingdom |
| 18283 | 2094.88 | United Kingdom |
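
One thing to watch in the cell above: assigning `orders['Country']` to a frame indexed by `CustomerID` makes pandas align on index labels, so each customer receives the country stored at the row of `orders` whose label equals their ID rather than a country from their own orders. A minimal sketch of one alternative, on made-up toy data (the names `toy_orders`, `per_customer`, and `misaligned` are hypothetical, and taking the first country seen assumes each customer orders from a single country), carries the country through the same `groupby`:

```python
import pandas as pd

# Hypothetical toy data, not taken from Orders.csv
toy_orders = pd.DataFrame({
    "CustomerID": [10, 10, 11, 12, 12],
    "Country": ["France", "France", "Spain", "Japan", "Japan"],
    "amount_spent": [5.0, 7.5, 20.0, 1.0, 2.0],
})

# Sum the spend and keep the first country seen for each customer,
# both inside a single groupby (assumes one country per customer)
per_customer = toy_orders.groupby("CustomerID").agg(
    amount_spent=("amount_spent", "sum"),
    Country=("Country", "first"),
)
print(per_customer)

# Assigning a column from the raw frame aligns on index labels instead:
# customer 10 would look for row label 10, which does not exist in this toy frame
misaligned = toy_orders.groupby("CustomerID").agg({"amount_spent": "sum"})
misaligned["Country"] = toy_orders["Country"]
print(misaligned["Country"].isna().all())
```

In the toy frame the misaligned column comes out entirely missing because no row label matches a customer ID. In a larger frame the labels can match by coincidence, which makes the problem harder to spot.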


In [4]:
quantiles = unique_costum.amount_spent.quantile([0, 0.25, 0.5, 0.75, 0.95, 1])
quantiles

0.00         0.000
0.25       307.245
0.50       674.450
0.75      1661.640
0.95      5840.182
1.00    280206.020
Name: amount_spent, dtype: float64
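
For Sub Problem 2, the quantile values can be used directly as thresholds in boolean masks. The sketch below uses a hypothetical series `spend` of 20 made-up totals rather than the real data. The boundary handling (strictly above the 0.95 quantile for VIP, above the 0.75 quantile and up to the 0.95 quantile for Preferred) is an assumption, chosen to match the right-closed intervals that `pd.qcut` produces.

```python
import pandas as pd

# Hypothetical per-customer totals: 10, 20, ..., 200 for made-up customer IDs 1..20
spend = pd.Series(range(10, 210, 10), index=range(1, 21),
                  dtype="float64", name="amount_spent")

# Thresholds at the 75th and 95th percentiles
q75 = spend.quantile(0.75)
q95 = spend.quantile(0.95)

# VIP: strictly above the 95th percentile
vip = spend[spend > q95]

# Preferred: above the 75th percentile, up to and including the 95th
preferred = spend[(spend > q75) & (spend <= q95)]

print(vip.index.tolist())
print(preferred.index.tolist())
```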

In [12]:
unique_costum["Quantiles"] = pd.qcut(unique_costum["amount_spent"], q=[0, 0.25, 0.5, 0.75, 0.95, 1], labels=["Very Low", "Low", "Medium", "Preferred", "VIP"])
unique_costum

| CustomerID | amount_spent | Country | Quantiles |
|---|---|---|---|
| 12346 | 77183.60 | United Kingdom | VIP |
| 12347 | 4310.00 | United Kingdom | Preferred |
| 12348 | 1797.24 | United Kingdom | Preferred |
| 12349 | 1757.55 | United Kingdom | Preferred |
| 12350 | 334.40 | United Kingdom | Low |
| ... | ... | ... | ... |
| 18280 | 180.60 | United Kingdom | Very Low |
| 18281 | 80.82 | United Kingdom | Very Low |
| 18282 | 178.05 | United Kingdom | Very Low |
| 18283 | 2094.88 | United Kingdom | Preferred |


In [8]:
unique_costum['Quantiles'].value_counts()

Low          1085
Very Low     1085
Medium       1084
Preferred     868
VIP           217
Name: Quantiles, dtype: int64
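
These counts follow the quantile edges passed to `pd.qcut`: 217 of the 4339 customers is about 5%, and 868 is about 20%. A minimal sketch on made-up data shows the same behaviour with only three bins. The series `spend` and the label `"Regular"` are hypothetical and not part of the exercise.

```python
import pandas as pd

# Hypothetical per-customer totals: 10, 20, ..., 200 for made-up customer IDs 1..20
spend = pd.Series(range(10, 210, 10), index=range(1, 21),
                  dtype="float64", name="amount_spent")

# Three quantile bins: bottom 75%, next 20%, top 5%
labels = pd.qcut(
    spend,
    q=[0, 0.75, 0.95, 1],
    labels=["Regular", "Preferred", "VIP"],
)

# Number of customers that received each label
counts = labels.value_counts()
print(counts)
```

With 20 customers, the bins hold 15, 4, and 1 customers, mirroring the 75% / 20% / 5% split of the edges.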

In [9]:
unique_costum['amount_spent'].sort_values(axis=0, ascending=False)

CustomerID
14646    280206.02
18102    259657.30
17450    194550.79
16446    168472.50
14911    143825.06
           ...    
17956        12.75
16454         6.90
14792         6.20
16738         3.75
13256         0.00
Name: amount_spent, Length: 4339, dtype: float64

Now we leave Q2 and Q3 to you. Both can build on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [10]:
# your code here
vip_costum = unique_costum[unique_costum['Quantiles'].isin(['VIP'])]
vip_costum['Country'].value_counts()

United Kingdom    191
Germany             6
France              6
Norway              6
EIRE                3
Spain               2
Portugal            2
Japan               1
Name: Country, dtype: int64
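
To report the top country as a single value rather than reading it off the printed counts, `idxmax()` on the `value_counts()` result returns the label with the highest count. The sketch below is a minimal illustration on a hypothetical frame `customers` with made-up rows, not the real data.

```python
import pandas as pd

# Hypothetical labelled customers, indexed by a made-up CustomerID
customers = pd.DataFrame(
    {
        "Country": ["United Kingdom", "France", "United Kingdom",
                    "Germany", "France", "United Kingdom"],
        "Quantiles": ["VIP", "VIP", "Preferred", "Low", "Preferred", "VIP"],
    },
    index=pd.Index([1, 2, 3, 4, 5, 6], name="CustomerID"),
)

# Count VIP customers per country
vip_counts = customers.loc[customers["Quantiles"] == "VIP", "Country"].value_counts()

# Label of the country with the highest VIP count
top_vip_country = vip_counts.idxmax()
print(top_vip_country, vip_counts.max())
```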

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [11]:
# your code here
vip_pref_costum = unique_costum[unique_costum['Quantiles'].isin(['VIP', 'Preferred'])]
vip_pref_costum['Country'].value_counts()

United Kingdom     971
Norway              24
Germany             22
France              19
Spain               15
EIRE                12
Denmark              8
Portugal             6
Japan                6
Channel Islands      2
Name: Country, dtype: int64
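
Q2 and Q3 can also be answered from a single table. As a sketch under the same assumptions as before (the frame `customers` and the column name `VIP+Preferred` are hypothetical, and the rows are made up), `pd.crosstab` counts customers per country and segment, and a derived column holds the combined total:

```python
import pandas as pd

# Hypothetical labelled customers with made-up rows
customers = pd.DataFrame({
    "Country": ["United Kingdom", "France", "United Kingdom",
                "Germany", "France", "United Kingdom"],
    "Quantiles": ["VIP", "VIP", "Preferred", "Low", "Preferred", "VIP"],
})

# Rows are countries, columns are segments, cells are customer counts
table = pd.crosstab(customers["Country"], customers["Quantiles"])

# Combined VIP and Preferred count per country
table["VIP+Preferred"] = table["VIP"] + table["Preferred"]

# Country with the most VIP and Preferred customers combined
top_combined = table["VIP+Preferred"].idxmax()
print(table)
print(top_combined)
```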