# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e., the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most VIP Customers, and which country has the most VIP and Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, extract and import the `Orders` dataset into a DataFrame variable called `orders`. Print the head of `orders` to get an overview of the data:

In [2]:
# your code here

orders = pd.read_csv('Orders.csv')
orders.head()

| Unnamed: 0.1 | Unnamed: 0 | InvoiceNo | StockCode | year | month | day | hour | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | amount_spent |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 536365 | 85123A | 2010 | 12 | 3 | 8 | white hanging heart t-light holder | 6 | 2010-12-01 08:26:00 | 2.55 | 17850 | United Kingdom | 15.3 |
| 1 | 1 | 536365 | 71053 | 2010 | 12 | 3 | 8 | white metal lantern | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom | 20.34 |
| 2 | 2 | 536365 | 84406B | 2010 | 12 | 3 | 8 | cream cupid hearts coat hanger | 8 | 2010-12-01 08:26:00 | 2.75 | 17850 | United Kingdom | 22.0 |
| 3 | 3 | 536365 | 84029G | 2010 | 12 | 3 | 8 | knitted union flag hot water bottle | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom | 20.34 |
| 4 | 4 | 536365 | 84029E | 2010 | 12 | 3 | 8 | red woolly hottie white heart. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850 | United Kingdom | 20.34 |


---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [3]:
# your code here

# aggregate amount spent for unique customers
unique_customers = orders['amount_spent'].groupby(orders['CustomerID']).sum()

unique_customers = pd.DataFrame(unique_customers)
unique_customers.head()

| CustomerID | amount_spent |
| --- | --- |
| 12346 | 77183.6 |
| 12347 | 4310.0 |
| 12348 | 1797.24 |
| 12349 | 1757.55 |
| 12350 | 334.4 |
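As a small illustration of Sub Problem 1, the sketch below sums order lines per customer on a made-up table. The names `toy_orders` and `toy_totals` and all values are hypothetical; only the column names follow the `Orders` data set.

```python
import pandas as pd

# Toy stand-in for the `orders` table; the values are made up for illustration.
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "amount_spent": [10.0, 5.0, 20.0, 1.0, 2.0, 3.0],
})

# Sum every order line per customer; as_index=False keeps CustomerID as a column.
toy_totals = toy_orders.groupby("CustomerID", as_index=False)["amount_spent"].sum()
print(toy_totals)
```

Grouping on the DataFrame and then selecting the column returns a DataFrame directly, so no extra `pd.DataFrame(...)` conversion is needed in this variant.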


In [20]:
# label customers by aggregated amount spent:
# above the 95th percentile (VIP), between the 75th and 95th percentiles (Preferred)

labels = ['No preferred status', 'Preferred', 'VIP']

unique_customers['Customer group'] = pd.qcut(unique_customers['amount_spent'], q = [0, 0.75, 0.95, 1], labels=labels)

In [21]:
unique_customers.head()

| CustomerID | amount_spent | Customer group |
| --- | --- | --- |
| 12346 | 77183.6 | VIP |
| 12347 | 4310.0 | Preferred |
| 12348 | 1797.24 | Preferred |
| 12349 | 1757.55 | Preferred |
| 12350 | 334.4 | No preferred status |
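To see what `pd.qcut` does with the cut points `[0, 0.75, 0.95, 1]`, it can help to compute the two thresholds explicitly on a tiny example. The sketch below uses made-up totals for 20 hypothetical customers; `toy_spend`, `q75`, `q95`, and `toy_group` are illustrative names, not part of the exercise.

```python
import pandas as pd

# Made-up totals for 20 hypothetical customers: 1.0, 2.0, ..., 20.0.
toy_spend = pd.Series(range(1, 21), dtype="float64", name="amount_spent")

# The two thresholds that separate the three groups.
q75 = toy_spend.quantile(0.75)
q95 = toy_spend.quantile(0.95)

# Same call as in the cell above, applied to the toy data.
labels = ["No preferred status", "Preferred", "VIP"]
toy_group = pd.qcut(toy_spend, q=[0, 0.75, 0.95, 1], labels=labels)

print(q75, q95)
print(toy_group.value_counts())
```

The bins produced by `pd.qcut` are closed on the right, so a customer whose total equals a threshold exactly lands in the lower group.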


Now we'll leave it to you to solve Q2 and Q3, which can build on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [23]:
# your code here

countries = unique_customers.join(orders['Country'], on = 'CustomerID')

In [24]:
countries.head()

| CustomerID | amount_spent | Customer group | Country |
| --- | --- | --- | --- |
| 12346 | 77183.6 | VIP | United Kingdom |
| 12347 | 4310.0 | Preferred | United Kingdom |
| 12348 | 1797.24 | Preferred | United Kingdom |
| 12349 | 1757.55 | Preferred | United Kingdom |
| 12350 | 334.4 | No preferred status | United Kingdom |
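Note that `DataFrame.join(other, on='CustomerID')` looks up each `CustomerID` value in the index of `other`, and `orders['Country']` is indexed by row number rather than by customer. A lookup keyed by `CustomerID` is one way to make the match explicit. The sketch below uses made-up toy data (`toy_orders`, `toy_totals`, `customer_country`, and `toy_countries` are hypothetical names) and assumes each customer belongs to exactly one country.

```python
import pandas as pd

# Toy stand-in for `orders`; the values are made up for illustration.
toy_orders = pd.DataFrame({
    "CustomerID": [10, 10, 11, 12],
    "Country": ["France", "France", "Spain", "France"],
    "amount_spent": [5.0, 7.0, 3.0, 9.0],
})

# Per-customer totals, indexed by CustomerID.
toy_totals = toy_orders.groupby("CustomerID")[["amount_spent"]].sum()

# One row per customer, indexed by CustomerID (assumes one country per customer).
customer_country = (
    toy_orders.drop_duplicates(subset="CustomerID")
    .set_index("CustomerID")["Country"]
)

# Index-on-index join: both sides are keyed by CustomerID.
toy_countries = toy_totals.join(customer_country)
print(toy_countries)
```

Comparing a few customers against their rows in `orders` is a quick way to confirm that the attached country is the expected one.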


In [32]:
#count customer groups by country and sort them in descending order
(countries['Customer group'].groupby(countries['Country']).value_counts()).sort_values(ascending=False)

Country          Customer group     
United Kingdom   No preferred status    2977
                 Preferred               780
                 VIP                     191
France           No preferred status      55
EIRE             No preferred status      48
Germany          No preferred status      44
Norway           No preferred status      36
Japan            No preferred status      33
Spain            No preferred status      31
Norway           Preferred                18
Germany          Preferred                16
Spain            Preferred                13
France           Preferred                13
Portugal         No preferred status      11
Channel Islands  No preferred status      10
EIRE             Preferred                 9
Denmark          No preferred status       9
                 Preferred                 8
Germany          VIP                       6
Norway           VIP                       6
France           VIP                       6
Japan            P

In [101]:
(countries[countries['Customer group'] == 'VIP']).groupby(countries['Country']).count().idxmax()

amount_spent      United Kingdom
Customer group    United Kingdom
Country           United Kingdom
dtype: object

In [102]:
# count VIP customers per country
(countries[countries['Customer group'] == 'VIP']).groupby(countries['Country']).count()

| Country | amount_spent | Customer group | Country |
| --- | --- | --- | --- |
| EIRE | 3 | 3 | 3 |
| France | 6 | 6 | 6 |
| Germany | 6 | 6 | 6 |
| Japan | 1 | 1 | 1 |
| Norway | 6 | 6 | 6 |
| Portugal | 2 | 2 | 2 |
| Spain | 2 | 2 | 2 |
| United Kingdom | 191 | 191 | 191 |


In [83]:
countries.groupby('Country').agg({'Customer group': 'value_counts'}).sort_values(ascending=False, by='Country')

| Country | Customer group | Customer group (count) |
| --- | --- | --- |
| United Kingdom | VIP | 191 |
| United Kingdom | Preferred | 780 |
| United Kingdom | No preferred status | 2977 |
| Spain | VIP | 2 |
| Spain | Preferred | 13 |
| Spain | No preferred status | 31 |
| Portugal | VIP | 2 |
| Portugal | Preferred | 4 |
| Portugal | No preferred status | 11 |
| Norway | VIP | 6 |
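A more compact way to answer Q2 is to filter to VIP rows first and then count countries with `value_counts`, which returns a single Series instead of one count per column. This is a sketch on made-up data; `toy_countries`, `vip_per_country`, and `top_vip_country` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the `countries` table; the values are made up for illustration.
toy_countries = pd.DataFrame(
    {
        "Customer group": ["VIP", "VIP", "Preferred", "VIP", "No preferred status"],
        "Country": ["France", "Spain", "France", "France", "Spain"],
    },
    index=pd.Index([10, 11, 12, 13, 14], name="CustomerID"),
)

# Keep only VIP customers, then count how many fall in each country.
is_vip = toy_countries["Customer group"] == "VIP"
vip_per_country = toy_countries.loc[is_vip, "Country"].value_counts()

# idxmax returns the index label (here, the country) with the largest count.
top_vip_country = vip_per_country.idxmax()
print(vip_per_country)
print(top_vip_country)
```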


## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [107]:
# your code here
counts = countries[(countries['Customer group'] == 'VIP')|(countries['Customer group'] == 'Preferred')]
counts_preferred_VIP = counts.groupby(by=['Country'])['Customer group'].count()
counts_preferred_VIP.sort_values(ascending=False)


Country
United Kingdom     971
Norway              24
Germany             22
France              19
Spain               15
EIRE                12
Denmark              8
Portugal             6
Japan                6
Channel Islands      2
Name: Customer group, dtype: int64
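The two equality tests joined with `|` can also be written with `Series.isin`, which scales more easily if further groups are added later. The sketch below applies the same filter-then-count idea to made-up data; `toy_countries`, `in_target`, and `combined_per_country` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the `countries` table; the values are made up for illustration.
toy_countries = pd.DataFrame({
    "Customer group": ["VIP", "Preferred", "Preferred", "No preferred status", "VIP"],
    "Country": ["France", "Spain", "Spain", "France", "Spain"],
})

# True for rows whose group is either VIP or Preferred.
in_target = toy_countries["Customer group"].isin(["VIP", "Preferred"])

# Count the selected customers per country, largest first.
combined_per_country = (
    toy_countries[in_target]
    .groupby("Country")["Customer group"]
    .count()
    .sort_values(ascending=False)
)
print(combined_per_country)
```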
