# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e., the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most VIP Customers, and which country has the most VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders` from Ironhack's database into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [13]:
orders.columns

Index(['Unnamed: 0', 'InvoiceNo', 'StockCode', 'year', 'month', 'day', 'hour',
       'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID',
       'Country', 'amount_spent'],
      dtype='object')

In [17]:
# Drop the leftover index column `Unnamed: 0`
orders = orders.drop(columns=["Unnamed: 0"])

In [18]:
# your code here
# Load the data and display it sorted by amount_spent, largest first
orders = pd.read_csv("./Orders.csv")
orders.sort_values('amount_spent', ascending=False)
# orders.head(10)

| | Unnamed: 0 | InvoiceNo | StockCode | year | month | day | hour | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | amount_spent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 397451 | 540421 | 581483 | 23843 | 2011 | 12 | 5 | 9 | paper craft , little birdie | 80995 | 2011-12-09 09:15:00 | 2.08 | 16446 | United Kingdom | 168469.60 |
| 37126 | 61619 | 541431 | 23166 | 2011 | 1 | 2 | 10 | medium ceramic top storage jar | 74215 | 2011-01-18 10:01:00 | 1.04 | 12346 | United Kingdom | 77183.60 |
| 155418 | 222680 | 556444 | 22502 | 2011 | 6 | 5 | 15 | picnic basket wicker 60 pieces | 60 | 2011-06-10 15:28:00 | 649.50 | 15098 | United Kingdom | 38970.00 |
| 118352 | 173382 | 551697 | POST | 2011 | 5 | 2 | 13 | postage | 1 | 2011-05-03 13:46:00 | 8142.75 | 16029 | United Kingdom | 8142.75 |
| 248706 | 348325 | 567423 | 23243 | 2011 | 9 | 2 | 11 | set of tea coffee sugar tins pantry | 1412 | 2011-09-20 11:05:00 | 5.06 | 17450 | United Kingdom | 7144.72 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 334242 | 454463 | 575579 | 22437 | 2011 | 11 | 4 | 11 | set of 9 black skull balloons | 20 | 2011-11-10 11:49:00 | 0.00 | 13081 | United Kingdom | 0.00 |
| 25379 | 40089 | 539722 | 22423 | 2010 | 12 | 2 | 13 | regency cakestand 3 tier | 10 | 2010-12-21 13:45:00 | 0.00 | 14911 | EIRE | 0.00 |
| 273926 | 379913 | 569716 | 22778 | 2011 | 10 | 4 | 8 | glass cloche small | 2 | 2011-10-06 08:17:00 | 0.00 | 15804 | United Kingdom | 0.00 |
| 353097 | 479546 | 577168 | M | 2011 | 11 | 5 | 10 | manual | 1 | 2011-11-18 10:42:00 | 0.00 | 12603 | Germany | 0.00 |


---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [20]:
# Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

# Sum amount_spent per customer, keeping CustomerID as a regular column
customers = orders.groupby('CustomerID', as_index=False).agg({'amount_spent': 'sum'})
customers.sort_values('amount_spent', ascending=False)

| | CustomerID | amount_spent |
|---|---|---|
| 1690 | 14646 | 280206.02 |
| 4202 | 18102 | 259657.30 |
| 3729 | 17450 | 194550.79 |
| 3009 | 16446 | 168472.50 |
| 1880 | 14911 | 143825.06 |
| ... | ... | ... |
| 4099 | 17956 | 12.75 |
| 3015 | 16454 | 6.90 |
| 1794 | 14792 | 6.20 |
| 3218 | 16738 | 3.75 |


In [41]:
# Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?
# Preferred: 0.75 quantile < amount_spent < 0.95 quantile
# VIP: amount_spent > 0.95 quantile
q75 = customers['amount_spent'].quantile(.75)
q95 = customers['amount_spent'].quantile(.95)
preferred = customers[(customers['amount_spent'] > q75) & (customers['amount_spent'] < q95)]
vip = customers[customers['amount_spent'] > q95]

# Merge the selected customers back onto their orders.
# Note: with no `on=` argument, merge joins on every shared column
# (here both CustomerID and amount_spent).

# preferred_customers = pd.merge(orders, preferred, on=['CustomerID'])
# preferred_customers
preferred_full = preferred.merge(orders, how='left')
vip_full = vip.merge(orders, how='left')
vip_full

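| | CustomerID | amount_spent | Unnamed: 0 | InvoiceNo | StockCode | year | month | day | hour | Description | Quantity | InvoiceDate | UnitPrice | Country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 12346 | 77183.60 | 61619.0 | 541431.0 | 23166 | 2011.0 | 1.0 | 2.0 | 10.0 | medium ceramic top storage jar | 74215.0 | 2011-01-18 10:01:00 | 1.04 | United Kingdom |
| 1 | 12357 | 6207.67 | | | | | | | | | | | | |
| 2 | 12359 | 6372.58 | | | | | | | | | | | | |
| 3 | 12409 | 11072.67 | | | | | | | | | | | | |
| 4 | 12415 | 124914.53 | | | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 212 | 18109 | 8052.97 | | | | | | | | | | | | |
| 213 | 18139 | 8438.34 | | | | | | | | | | | | |
| 214 | 18172 | 7561.68 | | | | | | | | | | | | |
| 215 | 18223 | 6484.54 | | | | | | | | | | | | |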
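
Sub Problem 3 is not worked out in the cells above. As a minimal sketch, the labeling could be done with `np.select`. Several things here are our own assumptions rather than part of the challenge: a customer exactly at the 95th percentile counts as Preferred, everyone outside both groups gets the placeholder label `Regular`, the new column is called `label`, and the data is a made-up toy table.

```python
import numpy as np
import pandas as pd


def label_customers(customers: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `customers` with a `label` column (hypothetical name)."""
    spent = customers["amount_spent"]
    # Compute both cut-offs once; unpacking the Series yields its two values.
    q75, q95 = spent.quantile([0.75, 0.95])
    conditions = [
        spent > q95,                      # VIP: above the 95th percentile
        (spent > q75) & (spent <= q95),   # Preferred: between the two cut-offs
    ]
    # "Regular" is an assumed placeholder for all remaining customers.
    labels = np.select(conditions, ["VIP", "Preferred"], default="Regular")
    return customers.assign(label=labels)


# Toy data: 20 customers spending 100, 200, ..., 2000.
toy = pd.DataFrame({
    "CustomerID": range(1, 21),
    "amount_spent": [100.0 * i for i in range(1, 21)],
})
labeled = label_customers(toy)
print(labeled["label"].value_counts())
```

On the real data you would pass the aggregated `customers` dataframe from Sub Problem 1 instead of `toy`.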


Now we'll leave it to you to solve Q2 and Q3, building on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [42]:
# your code here

vip_full['Country'].value_counts()

United Kingdom    1
Name: Country, dtype: int64
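
The count above shows a single row because `merge` was called without `on=`, so it joined on every column the two frames share (`CustomerID` and `amount_spent`). Only a customer whose total equals the amount of one order line finds a match. The sketch below reproduces that effect on made-up toy data and shows one possible way around it: merge on `CustomerID` only and count each customer once. It assumes each customer is tied to a single country, which may not hold for every customer in the real data.

```python
import pandas as pd

# Toy order lines (made-up values, not the real Orders data).
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "Country": ["UK", "UK", "France", "UK", "UK", "UK"],
    "amount_spent": [10.0, 20.0, 50.0, 5.0, 5.0, 5.0],
})
# Aggregated totals for the customers we pretend are VIPs.
vip = pd.DataFrame({"CustomerID": [1, 2, 3], "amount_spent": [30.0, 50.0, 15.0]})

# Without `on=`, pandas joins on CustomerID AND amount_spent, so only
# customer 2 (one order line equal to the total) gets a Country.
implicit = vip.merge(orders, how="left")

# Join on CustomerID only, against one row per (customer, country) pair,
# so each customer is counted once rather than once per order line.
customer_country = orders[["CustomerID", "Country"]].drop_duplicates()
explicit = vip[["CustomerID"]].merge(customer_country, on="CustomerID", how="left")

vip_per_country = explicit["Country"].value_counts()
print(vip_per_country)
```

Calling `vip_per_country.idxmax()` then gives the country with the most VIP customers.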

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [43]:
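# your code here
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
# Note: `preferred` has no Country column, so its rows are not counted below.
result = pd.concat([vip_full, preferred])
result['Country'].value_counts()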

United Kingdom    1
Name: Country, dtype: int64
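
As a hedged sketch of how the combined count could be obtained, the example below stacks the two groups with `pd.concat` (the replacement for `DataFrame.append`, which was removed in pandas 2.0), filters the order lines to those customers, and counts distinct customers per country. The toy data and the variable names `selected`, `combined`, and `per_country` are illustrative assumptions.

```python
import pandas as pd

# Toy order lines (made-up values, not the real Orders data).
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 4, 4],
    "Country": ["UK", "UK", "Germany", "UK", "Germany", "Germany"],
})
# Pretend customer 1 is a VIP and customers 2 and 4 are Preferred.
vip = pd.DataFrame({"CustomerID": [1]})
preferred = pd.DataFrame({"CustomerID": [2, 4]})

# Stack both groups into one table of selected customers.
selected = pd.concat([vip, preferred], ignore_index=True)

# Keep only the order lines of selected customers, then count
# distinct customers (not order lines) in each country.
combined = orders[orders["CustomerID"].isin(selected["CustomerID"])]
per_country = (
    combined.groupby("Country")["CustomerID"]
    .nunique()
    .sort_values(ascending=False)
)
print(per_country)
```

Counting with `nunique` rather than `value_counts` on the order lines keeps a customer with many orders from inflating their country's total.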