# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (a.k.a. the 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [3]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders` from Ironhack's database into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [4]:
orders = pd.read_csv('../Orders.csv', index_col=0)  # load the file, using the first column as the index
orders2 = orders.copy()  # keep a backup copy
orders.head()  # show the first 5 rows of the DataFrame

Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [5]:
orders.info()  # check column dtypes and non-null counts

<class 'pandas.core.frame.DataFrame'>
Int64Index: 397924 entries, 0 to 541908
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   InvoiceNo     397924 non-null  int64  
 1   StockCode     397924 non-null  object 
 2   year          397924 non-null  int64  
 3   month         397924 non-null  int64  
 4   day           397924 non-null  int64  
 5   hour          397924 non-null  int64  
 6   Description   397924 non-null  object 
 7   Quantity      397924 non-null  int64  
 8   InvoiceDate   397924 non-null  object 
 9   UnitPrice     397924 non-null  float64
 10  CustomerID    397924 non-null  int64  
 11  Country       397924 non-null  object 
 12  amount_spent  397924 non-null  float64
dtypes: float64(2), int64(7), object(4)
memory usage: 36.4+ MB


---

"Identify VIP and Preferred Customers" is the non-technical goal of your boss. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*
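
One way the three sub problems could fit together is sketched below on a small made-up data set. The toy rows, the helper name `label_customers`, the `total_spent` and `label` columns, and the "Regular" label are assumptions for illustration, not part of the `Orders` data or a prescribed solution. Only the `CustomerID` and `amount_spent` column names come from the real data. Treating the upper bound of each range as inclusive is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 customers, two order lines each,
# so customer i spends i in total.
toy = pd.DataFrame({
    "CustomerID": list(range(1, 21)) * 2,
    "amount_spent": [i / 2 for i in range(1, 21)] * 2,
})

def label_customers(df, low=0.75, high=0.95):
    # Sub Problem 1: aggregate amount_spent per unique customer.
    totals = df.groupby("CustomerID")["amount_spent"].sum()
    # Sub Problem 2: compute the quantile cut-offs on the aggregated totals.
    q_low, q_high = totals.quantile([low, high])
    # Sub Problem 3: label each customer by the range its total falls in.
    # Conditions are checked in order, so VIP takes precedence.
    labels = np.select(
        [totals > q_high, totals > q_low],
        ["VIP", "Preferred"],
        default="Regular",
    )
    return pd.DataFrame({"total_spent": totals, "label": labels})

customers = label_customers(toy)
print(customers["label"].value_counts())
```

The key design choice is that the quantiles are taken over per-customer totals rather than over individual order lines, which is what "aggregated expenses" in the task statement asks for.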

Now in the workspace below, tackle each of the sub problems using the iterative problem solving workflow. Insert cells as necessary to write your code and explain your steps.

In [6]:
# sort the order lines by amount_spent, largest first
sorted_orders = orders.sort_values(by='amount_spent', ascending=False)
sorted_orders

Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
540421,581483,23843,2011,12,5,9,"paper craft , little birdie",80995,2011-12-09 09:15:00,2.08,16446,United Kingdom,168469.60
61619,541431,23166,2011,1,2,10,medium ceramic top storage jar,74215,2011-01-18 10:01:00,1.04,12346,United Kingdom,77183.60
222680,556444,22502,2011,6,5,15,picnic basket wicker 60 pieces,60,2011-06-10 15:28:00,649.50,15098,United Kingdom,38970.00
173382,551697,POST,2011,5,2,13,postage,1,2011-05-03 13:46:00,8142.75,16029,United Kingdom,8142.75
348325,567423,23243,2011,9,2,11,set of tea coffee sugar tins pantry,1412,2011-09-20 11:05:00,5.06,17450,United Kingdom,7144.72
...,...,...,...,...,...,...,...,...,...,...,...,...,...
454463,575579,22437,2011,11,4,11,set of 9 black skull balloons,20,2011-11-10 11:49:00,0.00,13081,United Kingdom,0.00
40089,539722,22423,2010,12,2,13,regency cakestand 3 tier,10,2010-12-21 13:45:00,0.00,14911,EIRE,0.00
379913,569716,22778,2011,10,4,8,glass cloche small,2,2011-10-06 08:17:00,0.00,15804,United Kingdom,0.00
479546,577168,M,2011,11,5,10,manual,1,2011-11-18 10:42:00,0.00,12603,Germany,0.00


In [7]:
orders_VIP = sorted_orders.amount_spent.quantile(0.95)  # 0.95 quantile of amount_spent across individual order lines
# Result: 67.5. Note this is a per-line threshold; amount_spent has not been aggregated per customer here.
orders_VIP

67.5

In [8]:
orders_pref = sorted_orders.amount_spent.quantile([0.75, 0.95])  # 0.75 and 0.95 quantiles that bound the Preferred range
# Result: amounts between 19.8 and 67.5 fall in the Preferred range; 67.5 or more is the VIP range
orders_pref

0.75    19.8
0.95    67.5
Name: amount_spent, dtype: float64

Now we'll leave it to you to solve Q2 & Q3, which you can build on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [14]:
orders_VIP = orders[orders.amount_spent >= 67.5]  # keep rows with amount_spent of 67.5 or more (the 0.95 quantile)
print(len(orders_VIP))  # number of rows that pass the filter
orders_VIP['Country'].value_counts()  # count matching rows per country; a customer with several rows is counted several times

# The United Kingdom and the Netherlands have the most rows above the VIP threshold

20321


United Kingdom          15366
Netherlands              1469
EIRE                      741
Australia                 579
Germany                   522
France                    407
Switzerland               181
Sweden                    177
Japan                     172
Spain                     130
Norway                    116
Denmark                    63
Finland                    62
Singapore                  48
Portugal                   41
Belgium                    40
Channel Islands            31
Israel                     28
Austria                    27
Cyprus                     24
Greece                     24
Italy                      19
Poland                     10
Iceland                     7
Lithuania                   7
Lebanon                     6
Malta                       4
United Arab Emirates        4
Brazil                      4
USA                         4
Canada                      3
Bahrain                     2
Unspecified                 2
Czech Repu
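
Because `value_counts()` on a filtered frame counts rows, a customer with many large order lines is counted many times. If the goal is the number of distinct VIP customers per country, one option is to aggregate per customer first and then count unique `CustomerID` values. The sketch below uses made-up rows; the toy data and the names `vip_ids` and `vip_by_country` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data: 40 customers with two order lines each.
# Customer i spends 10 * i in total; customers 30-40 are in the Netherlands.
ids = list(range(1, 41))
toy = pd.DataFrame({
    "CustomerID": ids * 2,
    "Country": ["Netherlands" if i >= 30 else "United Kingdom" for i in ids] * 2,
    "amount_spent": [5.0 * i for i in ids] * 2,
})

# Total spend per customer, then keep customers above the 0.95 quantile of those totals.
totals = toy.groupby("CustomerID")["amount_spent"].sum()
vip_ids = totals[totals > totals.quantile(0.95)].index

# Keep only rows belonging to VIP customers and count distinct customers per country.
vip_by_country = (
    toy[toy["CustomerID"].isin(vip_ids)]
    .groupby("Country")["CustomerID"]
    .nunique()
    .sort_values(ascending=False)
)
print(vip_by_country)
```

With this approach a customer who ordered from more than one country would be counted once in each of them, which may or may not be what the analysis needs.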

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [22]:
orders_pref_VIP = orders[orders.amount_spent >= 19.8]
# keep rows with amount_spent of 19.8 or more (the 0.75 quantile), covering both the Preferred and VIP ranges
print(len(orders_pref_VIP))  # number of rows that pass the filter

orders_pref_VIP['Country'].value_counts()  # count matching rows per country; a customer with several rows is counted several times

# The United Kingdom and Germany have the most rows in the combined VIP + Preferred range

100655


United Kingdom          81785
Germany                  3470
France                   3198
EIRE                     3038
Netherlands              1940
Switzerland               875
Australia                 859
Belgium                   714
Spain                     698
Norway                    550
Portugal                  504
Channel Islands           332
Finland                   322
Italy                     310
Sweden                    283
Japan                     241
Denmark                   232
Cyprus                    216
Singapore                 156
Austria                   148
Poland                    135
Israel                    122
Iceland                    79
USA                        66
Greece                     52
Canada                     50
United Arab Emirates       47
Unspecified                44
Malta                      38
Lithuania                  29
Lebanon                    27
European Community         24
RSA                        20
Czech Repu
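
For the combined VIP + Preferred group, a per-customer version could build one row per customer and then count customers per country. The sketch below is a minimal illustration on made-up data. The toy rows, the `total_spent` column, and the choice to assign each customer the country of their first order line are assumptions.

```python
import pandas as pd

# Hypothetical toy data: 40 customers with two order lines each.
# Customer i spends 10 * i in total; every fourth customer is in Germany.
ids = list(range(1, 41))
toy = pd.DataFrame({
    "CustomerID": ids * 2,
    "Country": ["Germany" if i % 4 == 0 else "United Kingdom" for i in ids] * 2,
    "amount_spent": [5.0 * i for i in ids] * 2,
})

# One row per customer: total spend and the country of the first order line.
customers = toy.groupby("CustomerID").agg(
    total_spent=("amount_spent", "sum"),
    Country=("Country", "first"),
)

# VIP + Preferred combined = every customer above the 0.75 quantile of total spend.
threshold = customers["total_spent"].quantile(0.75)
top = customers[customers["total_spent"] > threshold]

# Each customer now appears exactly once, so value_counts() counts customers.
combined_by_country = top["Country"].value_counts()
print(combined_by_country)
```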