# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e. the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, load the `Orders` dataset into a dataframe variable called `orders`, then display `orders` to get an overview of the data:

In [2]:
orders = pd.read_csv('orders.csv')
orders

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.30
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.00
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397919,541904,581587,22613,2011,12,5,12,pack of 20 spaceboy napkins,12,2011-12-09 12:50:00,0.85,12680,France,10.20
397920,541905,581587,22899,2011,12,5,12,children's apron dolly girl,6,2011-12-09 12:50:00,2.10,12680,France,12.60
397921,541906,581587,23254,2011,12,5,12,childrens cutlery dolly girl,4,2011-12-09 12:50:00,4.15,12680,France,16.60
397922,541907,581587,23255,2011,12,5,12,childrens cutlery circus parade,4,2011-12-09 12:50:00,4.15,12680,France,16.60


---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [42]:
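# Sub Problem 1: total of every numeric column per customer (text columns are skipped)
groupedCID = orders.groupby('CustomerID').sum(numeric_only=True)

# Sub Problem 2: spending thresholds at the 75th and 95th percentiles
groupedCID['amount_spent'].quantile(0.75)
groupedCID['amount_spent'].quantile(0.95)
# Customers who spent at least $1661.64 but less than $5840.18 are Preferred Customers
# Customers who spent at least $5840.18 are VIP Customers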



5840.181999999983
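Both cut-offs can also be read in a single call by passing a list of quantiles. The sketch below is a minimal illustration on made-up data: `toy_orders` and its values are assumptions, not rows from `orders.csv`. It aggregates only the `amount_spent` column per customer before taking the quantiles.

```python
import pandas as pd

# Hypothetical toy data standing in for `orders`
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3, 4, 5],
    "amount_spent": [10.0, 20.0, 5.0, 50.0, 25.0, 25.0, 40.0, 200.0],
})

# Sub Problem 1: total spend per unique customer
spend_per_customer = toy_orders.groupby("CustomerID")["amount_spent"].sum()

# Sub Problem 2: both thresholds at once
# (pandas interpolates linearly between data points by default)
thresholds = spend_per_customer.quantile([0.75, 0.95])
preferred_min = thresholds.loc[0.75]
vip_min = thresholds.loc[0.95]

print(thresholds)
```

Selecting the single column before `.sum()` keeps the aggregation from touching identifiers such as `InvoiceNo`, whose sums carry no meaning.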

In [41]:
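# Thresholds on the per-customer total spend
q75 = groupedCID['amount_spent'].quantile(0.75)
q95 = groupedCID['amount_spent'].quantile(0.95)

# Sub Problem 3: label VIPs first, then add Preferred without overwriting the VIP labels
groupedCID['TypeOfCostumer'] = np.where(groupedCID['amount_spent'] >= q95, 'VIP Costumer', '')
groupedCID['TypeOfCostumer'] = np.where((groupedCID['amount_spent'] >= q75) & (groupedCID['amount_spent'] < q95), 'Preferred Costumer', groupedCID['TypeOfCostumer'])
whales = groupedCID['TypeOfCostumer']  # the label column on its own (not used below)

# Note: concat with axis=1 aligns on index labels (order row number vs. CustomerID),
# not on the CustomerID column of `orders`
result = pd.concat([orders, groupedCID], axis=1, join='inner')
result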

Unnamed: 0.2,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,...,Unnamed: 0.1,InvoiceNo.1,year.1,month.1,day.1,hour.1,Quantity.1,UnitPrice,amount_spent,TypeOfCostumer
12346,19174,537843,22776,2010,12,3,15,sweetheart cakestand 3 tier,2,2010-12-08 15:16:00,...,61619,541431,2011,1,2,10,74215,1.04,77183.60,VIP Costumer
12347,19175,537844,22112,2010,12,3,15,chocolate hot water bottle,1,2010-12-08 15:17:00,...,42441700,101296926,365971,1383,441,2219,2458,481.21,4310.00,Preferred Costumer
12348,19176,537844,21587,2010,12,3,15,cosy hour giant tube matches,1,2010-12-08 15:17:00,...,2807120,16869685,62324,257,111,472,2341,178.71,1797.24,Preferred Costumer
12349,19177,537844,22502,2010,12,3,15,picnic basket wicker small,1,2010-12-08 15:17:00,...,35444274,42165457,146803,803,73,657,631,605.10,1757.55,Preferred Costumer
12350,19178,537844,21935,2010,12,3,15,suki shoulder bag,1,2010-12-08 15:17:00,...,1365627,9231629,34187,34,51,272,197,65.30,334.40,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18280,28680,538652,21821,2010,12,1,15,glitter star garland with bells,2,2010-12-13 15:12:00,...,1110495,5457120,20110,30,10,90,45,47.65,180.60,
18281,28681,538652,20826,2010,12,1,15,silver aperitif glass,6,2010-12-13 15:12:00,...,1560699,3895248,14077,42,49,70,54,39.36,80.82,
18282,28682,538652,84947,2010,12,1,15,antique silver tea glass engraved,6,2010-12-13 15:12:00,...,4642134,6838540,24132,116,60,146,103,62.39,178.05,
18283,28683,538652,21145,2010,12,1,15,antique glass place setting,24,2010-12-13 15:12:00,...,233950830,425704048,1520316,5503,2489,10346,1397,1220.93,2094.88,Preferred Costumer
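`pd.concat(..., axis=1)` aligns rows on index labels, so the table above pairs the order stored at row label 12346 with customer 12346's totals rather than with that customer's own orders. One hedged alternative is to join on the `CustomerID` key instead. The sketch below uses made-up data. `toy_orders`, `totals`, `labelled` and the `customer_type` column are illustrative names, and `np.select` is only one of several ways to assign the labels.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for `orders`
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 4, 5, 6],
    "Country": ["France", "France", "Spain", "France", "France",
                "Spain", "Japan", "Japan"],
    "amount_spent": [10.0, 20.0, 5.0, 60.0, 40.0, 40.0, 60.0, 200.0],
})

# Total spend per customer, keeping CustomerID as a regular column
totals = toy_orders.groupby("CustomerID", as_index=False)["amount_spent"].sum()
q75, q95 = totals["amount_spent"].quantile([0.75, 0.95])

# Conditions are checked in order, so VIP takes precedence over Preferred
totals["customer_type"] = np.select(
    [totals["amount_spent"] >= q95, totals["amount_spent"] >= q75],
    ["VIP", "Preferred"],
    default="",
)

# Key-based join: every order line receives the label of its own customer
labelled = toy_orders.merge(
    totals[["CustomerID", "customer_type"]], on="CustomerID", how="left"
)
print(labelled)
```

With a left merge the result keeps one row per order line, so each customer's country information stays attached to the correct label.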


Now we'll leave it to you to solve Q2 & Q3, which can build on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [61]:
# Flag each row of `result`; this relies on `result` and `groupedCID` sharing the same row order
q75 = groupedCID['amount_spent'].quantile(0.75)
q95 = groupedCID['amount_spent'].quantile(0.95)
result['isVIP'] = np.where(groupedCID['amount_spent'] >= q95, 1, 0)
result['isPRE'] = np.where((groupedCID['amount_spent'] >= q75) & (groupedCID['amount_spent'] < q95), 1, 0)

# Sum the flags per country and keep the country with the most VIP rows
finaltable = result.groupby('Country').sum(numeric_only=True)
finaltable['isVIP'].nlargest(1)


Country
United Kingdom    191
Name: isVIP, dtype: int32
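Summing a flag column counts rows, so it equals the number of customers only when each customer appears exactly once. A sketch of one way to guarantee that, on made-up data, is to reduce the table to one row per customer before counting. The frame `toy_labelled`, its values, and the rule of keeping each customer's first listed country are assumptions for illustration.

```python
import pandas as pd

# Hypothetical order lines that already carry a customer label
toy_labelled = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 4, 5],
    "Country": ["France", "France", "Spain", "France", "France",
                "Japan", "Spain"],
    "customer_type": ["VIP", "VIP", "", "VIP", "VIP", "Preferred", "VIP"],
})

# One row per customer (keeps the first country seen for each CustomerID)
customers = toy_labelled.drop_duplicates(subset="CustomerID")

# Number of VIP customers in each country, largest first
vip_per_country = (
    customers.loc[customers["customer_type"] == "VIP", "Country"].value_counts()
)
top_vip_country = vip_per_country.idxmax()
print(top_vip_country)
```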

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [50]:
finaltable

Country,Unnamed: 0,InvoiceNo,year,month,day,hour,Quantity,UnitPrice,CustomerID,amount_spent,...,InvoiceNo,year,month,day,hour,Quantity,UnitPrice,amount_spent,isVIP,isPRE
Channel Islands,240097,6456024,24120,144,48,132,45,133.79,179184,269.88,...,392451583,1421677,4936,2940,9256,7152,2012.62,10968.98,0,2
Denmark,340436,9146051,34170,204,68,204,424,81.32,211293,1171.4,...,1061395207,3802752,13790,8535,22757,18303,5474.39,29092.94,0,8
EIRE,1341960,32289396,120600,720,276,720,452,286.05,894660,1529.8,...,3369316245,12142011,41848,21439,78844,82126,17122.1,135149.87,3,9
France,1555179,39815774,148740,888,307,897,1072,228.28,938478,2173.13,...,3782239881,13577661,52649,23481,85263,84402,18979.011,139736.201,6,13
Germany,1430560,35513833,132660,792,296,637,1060,230.09,826373,1764.6,...,4344806141,15560792,58592,27763,99838,94171,25548.12,173815.71,6,16
Japan,1005026,20997483,78390,468,273,429,2001,72.87,497367,2937.55,...,1838839935,6511549,29634,10714,42330,22502,8944.24,39354.69,1,5
Norway,1173767,32272080,120600,720,180,960,1394,117.71,745980,1531.86,...,3552173869,12763402,46640,22035,75686,86923,23380.46,158395.64,6,18
Portugal,375630,9148119,34170,204,77,197,151,99.7,217441,413.8,...,686640617,2451304,10494,3669,15625,35227,3744.97,39042.81,2,4
Spain,1089644,24759680,92460,552,133,559,308,191.16,708993,892.63,...,2232312876,7983442,31492,13604,49137,65104,10546.13,83184.88,2,13
United Kingdom,94937843,2125146220,7935480,47376,17339,51061,37826,11974.94,62465871,72279.98,...,201822833149,723984414,2739136,1304138,4586139,4685786,1124248.573,8102666.183,191,780


In [59]:
finaltable2 = finaltable['isVIP']+finaltable['isPRE']
finaltable2.nlargest(1)

Country
United Kingdom    971
dtype: int32
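The combined count can also be phrased as "distinct customers per country whose label is either VIP or Preferred". The sketch below shows that phrasing on made-up data. `toy_customers` and the `customer_type` column are hypothetical names, and the table is assumed to hold one row per customer.

```python
import pandas as pd

# Hypothetical one-row-per-customer table with labels
toy_customers = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Country": ["France", "Spain", "France", "Japan", "Spain", "Spain"],
    "customer_type": ["VIP", "Preferred", "Preferred", "", "VIP", "Preferred"],
})

# Keep customers in either tier, then count distinct customers per country
is_top_tier = toy_customers["customer_type"].isin(["VIP", "Preferred"])
combined_per_country = (
    toy_customers[is_top_tier].groupby("Country")["CustomerID"].nunique()
)
top_combined_country = combined_per_country.idxmax()
print(top_combined_country)
```

Using `nunique` on `CustomerID` keeps the count correct even if a customer were to appear on more than one row.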