# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e. the 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentile**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [239]:
# import required libraries
import numpy as np
import pandas as pd

Next, extract the `Orders` data set and import it into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [240]:
# your code here
# A forward slash avoids the invalid "\O" escape sequence and works on every operating system.
orders = pd.read_csv("Orders/Orders.csv")

In [241]:
orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


---

"Identify VIP and Preferred Customers" is the non-technical goal of your boss. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now in the workspace below, tackle each of the sub problems using the iterative problem solving workflow. Insert cells as necessary to write your code and explain your steps.

In [242]:
# your code here
#Sub Problem 1: How to aggregate the amount_spent for unique customers?
orders_by_customer = orders.groupby(['CustomerID']).agg({'amount_spent':'sum'})
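
To see what this aggregation does, here is a minimal sketch of the same `groupby(...).agg(...)` pattern on a made-up table. The `toy_orders` frame and its values are hypothetical and only stand in for the real `orders` data.

```python
import pandas as pd

# Hypothetical toy data standing in for the real `orders` frame.
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 3],
    "amount_spent": [10.0, 5.0, 20.0, 1.0, 2.0, 3.0],
})

# One row per customer, holding the sum of all of that customer's order lines.
toy_totals = toy_orders.groupby("CustomerID").agg({"amount_spent": "sum"})
print(toy_totals)
```

Note that the result is indexed by `CustomerID`, which is why the customer IDs appear as row labels in the output further down.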

In [243]:
orders_by_customer = orders_by_customer.sort_values(by='amount_spent', ascending=True)

In [244]:
len(orders_by_customer)

4339

In [245]:
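# numpy is already imported as np at the top of the notebook.
# The rows are sorted by amount_spent, so the row position is the customer's rank.
orders_by_customer['row_num'] = np.arange(len(orders_by_customer))
# Scale the rank to the range [0, 1] to get a rank-based quantile.
orders_by_customer['quantile'] = orders_by_customer['row_num'] / (len(orders_by_customer) - 1)

print(orders_by_customer)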


            amount_spent  row_num  quantile
CustomerID                                 
13256               0.00        0  0.000000
16738               3.75        1  0.000231
14792               6.20        2  0.000461
16454               6.90        3  0.000692
17956              12.75        4  0.000922
...                  ...      ...       ...
14911          143825.06     4334  0.999078
16446          168472.50     4335  0.999308
17450          194550.79     4336  0.999539
18102          259657.30     4337  0.999769
14646          280206.02     4338  1.000000

[4339 rows x 3 columns]
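
The `quantile` column above is a rank-based quantile: the position of each customer in the sorted table divided by `n - 1`. A possible alternative is to ask pandas for the spending cutoff directly with `Series.quantile` and compare each amount against it. The sketch below shows both on hypothetical totals; on real data with ties or borderline customers the two approaches may select slightly different rows.

```python
import numpy as np
import pandas as pd

# Hypothetical customer totals, already aggregated (one value per customer).
totals = pd.Series([5.0, 10.0, 20.0, 40.0, 80.0], name="amount_spent")

# Approach used above: position in the sorted order, scaled to [0, 1].
sorted_totals = totals.sort_values()
position_quantile = np.arange(len(sorted_totals)) / (len(sorted_totals) - 1)

# Alternative: compute the spending cutoff, then compare amounts against it.
cutoff_50 = totals.quantile(0.50)
above_median = totals[totals > cutoff_50]

print(position_quantile)
print(cutoff_50, above_median.tolist())
```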


In [246]:
# Sub Problem 2: How to select customers whose aggregated amount_spent is in a given quantile range?

# VIP Customers: aggregated expenses above the 95th percentile
orders_by_customer[orders_by_customer['quantile'] > 0.95]

# Preferred Customers: aggregated expenses between the 75th and 95th percentile
# orders_by_customer[orders_by_customer['quantile']>0.75 & orders_by_customer['quantile']<0.95]
# The line above fails: each comparison must be wrapped in parentheses before combining with &.
# The between() method avoids the problem:
orders_by_customer[orders_by_customer['quantile'].between(0.75, 0.95, inclusive='both')]

CustomerID,amount_spent,row_num,quantile
13012,1661.84,3254,0.750115
12530,1662.28,3255,0.750346
12912,1662.30,3256,0.750576
16115,1667.97,3257,0.750807
17656,1674.69,3258,0.751037
...,...,...,...
13178,5725.47,4117,0.949055
17686,5739.46,4118,0.949285
15218,5756.89,4119,0.949516
12720,5781.73,4120,0.949746
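
The commented-out filter in the cell above fails because `&` binds more tightly than `>` and `<` in Python, so `0.75 & orders_by_customer['quantile']` is evaluated before either comparison. Here is a minimal sketch of the parenthesised form next to `between()`, using a hypothetical Series `q` in place of the real `quantile` column:

```python
import pandas as pd

# Hypothetical quantile values, for illustration only.
q = pd.Series([0.10, 0.75, 0.80, 0.95, 0.99])

# Parenthesise each comparison so that & combines two boolean Series.
mask_parens = (q >= 0.75) & (q <= 0.95)

# Series.between includes both endpoints by default, so it matches the mask above.
mask_between = q.between(0.75, 0.95)

print(q[mask_parens].tolist())
```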


In [247]:
#Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

#VIP
#[orders_by_customer[orders_by_customer['quantile']>0.95]]

# This doesn't work: every call assigns a single label to the entire CustomerType column
# instead of returning a label for the current quantile value.
def CustomerType1(quantile):
    
    if quantile>0.95:
        print(quantile,'T1')
        orders_by_customer['CustomerType']='VIP'
    elif quantile>=0.75:
        print(quantile,'T2')
        orders_by_customer['CustomerType']='Preferred'
    else:
        print(quantile,'T3')
        orders_by_customer['CustomerType']='Standard'


In [248]:
# orders_by_customer['quantile'].apply(CustomerType1)  # abandoned attempt, left commented out

In [249]:
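# Abandoned attempt: this version has no return statement, so apply() would yield None for every row.
def CustomerType(row):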
    
    if row['quantile']>0.95:
        row['CustomerType']='VIP'
    elif row['quantile']>=0.75:
        row['CustomerType']='Preferred'
    else:
        row['CustomerType']='Standard'

In [250]:
#orders_by_customer.apply(CustomerType, axis=1)
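
The two attempts above fail for different reasons: `CustomerType1` overwrites the whole `CustomerType` column on every call, and `CustomerType` changes the row it receives but returns nothing, so `apply` collects `None`. A row-wise version works once the function returns the label. This is a minimal sketch; the names `label_row` and `demo` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a precomputed quantile column.
demo = pd.DataFrame({"quantile": [0.50, 0.80, 0.97]})

def label_row(row):
    """Return a label for one row instead of trying to modify the row."""
    if row["quantile"] > 0.95:
        return "VIP"
    elif row["quantile"] >= 0.75:
        return "Preferred"
    return "Standard"

# axis=1 passes each row to the function; the returned labels form the new column.
demo["CustomerType"] = demo.apply(label_row, axis=1)
print(demo)
```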

In [251]:
def CustomerType3(quantile):  # working version: returns the label for a single quantile value
    
    if quantile>0.95:
        return 'VIP'
    elif quantile>=0.75:
        return 'Preferred'
    else:
        return 'Standard'

In [252]:
orders_by_customer['CustomerType']=orders_by_customer['quantile'].apply(CustomerType3)

In [253]:
orders_by_customer

CustomerID,amount_spent,row_num,quantile,CustomerType
13256,0.00,0,0.000000,Standard
16738,3.75,1,0.000231,Standard
14792,6.20,2,0.000461,Standard
16454,6.90,3,0.000692,Standard
17956,12.75,4,0.000922,Standard
...,...,...,...,...
14911,143825.06,4334,0.999078,VIP
16446,168472.50,4335,0.999308,VIP
17450,194550.79,4336,0.999539,VIP
18102,259657.30,4337,0.999769,VIP
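
Calling a Python function once per value with `apply` is fine at this size. For larger tables, one vectorised alternative is `numpy.select`, which checks the conditions in order and uses the first one that matches. The sketch below uses the same thresholds on a hypothetical frame `demo_q`:

```python
import numpy as np
import pandas as pd

# Hypothetical quantile values, for illustration only.
demo_q = pd.DataFrame({"quantile": [0.10, 0.75, 0.90, 0.95, 0.99]})

# Conditions are evaluated in order; the first match decides the label.
conditions = [demo_q["quantile"] > 0.95, demo_q["quantile"] >= 0.75]
labels = ["VIP", "Preferred"]
demo_q["CustomerType"] = np.select(conditions, labels, default="Standard")

print(demo_q)
```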


Now we'll leave it to you to solve Q2 & Q3, building on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [254]:
# your code here
#orders_by_customer = orders.groupby(['CustomerID']).agg({'amount_spent':'sum'})
orders_by_customer.shape #(4339, 4)

(4339, 4)

In [255]:
orders_by_customer=orders_by_customer.reset_index() # reset the index; otherwise CustomerID stays as the index instead of a regular column
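
The reason for this step is that `groupby` puts the grouping key into the index of the result. Here is a small sketch on hypothetical data (`toy`) showing the indexed result, the effect of `reset_index()`, and the `as_index=False` option, which produces the flat layout in one step:

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({"CustomerID": [1, 1, 2], "amount_spent": [10.0, 5.0, 20.0]})

# Default: the grouping key becomes the index of the result.
indexed = toy.groupby("CustomerID").agg({"amount_spent": "sum"})

# reset_index() turns the index back into an ordinary column.
flat = indexed.reset_index()

# as_index=False gives the same flat layout directly.
flat_direct = toy.groupby("CustomerID", as_index=False).agg({"amount_spent": "sum"})

print(flat)
```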

In [256]:
orders_by_customerCountry = orders.groupby(['CustomerID','Country']).agg({'amount_spent':'sum'}) #(4347, 1)

In [257]:
orders_by_customerCountry.shape # 8 more rows than before: some customers appear in more than one country
# Let's find out which customers those are:

(4347, 1)

In [258]:
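# Move CustomerID and Country out of the index into ordinary columns.
# Run this cell only once: running reset_index() again adds spurious 'index' and 'level_0' columns.
orders_by_customerCountry = orders_by_customerCountry.reset_index()
orders_by_customerCountry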

Unnamed: 0,CustomerID,Country,amount_spent
0,12346,United Kingdom,77183.60
1,12347,Iceland,4310.00
2,12348,Finland,1797.24
3,12349,Italy,1757.55
4,12350,Norway,334.40
...,...,...,...
4342,18280,United Kingdom,180.60
4343,18281,United Kingdom,80.82
4344,18282,United Kingdom,178.05
4345,18283,United Kingdom,2094.88


In [238]:
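# Stale output from running reset_index() more than once: note the extra 'level_0' and 'index' columns.
orders_by_customerCountry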

Unnamed: 0,level_0,index,CustomerID,Country,amount_spent
0,0,0,12346,United Kingdom,77183.60
1,1,1,12347,Iceland,4310.00
2,2,2,12348,Finland,1797.24
3,3,3,12349,Italy,1757.55
4,4,4,12350,Norway,334.40
...,...,...,...,...,...
4342,4342,4342,18280,United Kingdom,180.60
4343,4343,4343,18281,United Kingdom,80.82
4344,4344,4344,18282,United Kingdom,178.05
4345,4345,4345,18283,United Kingdom,2094.88


In [232]:
orders_by_customerCountry['CustomerID'].value_counts()>1

12417    False
12429    False
12455    False
12394    False
12422    False
         ...  
14333    False
14334    False
14335    False
14336    False
18287    False
Name: CustomerID, Length: 4339, dtype: bool
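
The expression above returns a boolean Series with one entry per customer, not the matching customers themselves. One way to list only the customers that appear under more than one country is to count the distinct countries per customer and filter on that count. This is a minimal sketch on hypothetical data; the IDs and amounts in `pairs` are made up.

```python
import pandas as pd

# Hypothetical (customer, country) totals, for illustration only.
pairs = pd.DataFrame({
    "CustomerID": [100, 100, 200, 300],
    "Country": ["Austria", "Cyprus", "Spain", "Norway"],
    "amount_spent": [277.2, 3268.49, 50.0, 10.0],
})

# Number of distinct countries each customer has ordered from.
countries_per_customer = pairs.groupby("CustomerID")["Country"].nunique()

# Keep only the customers seen in more than one country.
multi_country_ids = countries_per_customer[countries_per_customer > 1].index.tolist()
print(multi_country_ids)
```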

In [262]:
orders_by_customerCountry[orders_by_customerCountry.duplicated(['CustomerID'])]

Unnamed: 0,CustomerID,Country,amount_spent
21,12370,Cyprus,3268.49
40,12394,Denmark,891.4
59,12417,Spain,436.3
64,12422,Switzerland,417.36
72,12429,Denmark,3312.42
75,12431,Belgium,972.78
97,12455,Spain,767.96
100,12457,Switzerland,1970.53


In [263]:
orders_by_customerCountry[orders_by_customerCountry['CustomerID']==12370] # this customer appears in both Austria and Cyprus
# The customer may have made purchases in different countries. I treat each (customer, country) pair as a separate customer:

Unnamed: 0,CustomerID,Country,amount_spent
20,12370,Austria,277.2
21,12370,Cyprus,3268.49


In [265]:
# I repeat everything I did before, but now on the dataframe grouped by customer and country:
orders_by_customerCountry = orders_by_customerCountry.sort_values(by='amount_spent', ascending=True)

orders_by_customerCountry['row_num'] = np.arange(len(orders_by_customerCountry))
orders_by_customerCountry['quantile'] = np.arange(len(orders_by_customerCountry))/(len(orders_by_customerCountry)-1)

# Label each (customer, country) row with the same thresholds as in Q1:
# VIP above the 95th percentile, Preferred between the 75th and 95th percentile.
orders_by_customerCountry['CustomerType']=orders_by_customerCountry['quantile'].apply(CustomerType3)
orders_by_customerCountry

Unnamed: 0,CustomerID,Country,amount_spent,row_num,quantile,CustomerType
693,13256,United Kingdom,0.00,0,0.00000,Standard
3226,16738,United Kingdom,3.75,1,0.00023,Standard
1802,14792,United Kingdom,6.20,2,0.00046,Standard
3023,16454,United Kingdom,6.90,3,0.00069,Standard
4107,17956,United Kingdom,12.75,4,0.00092,Standard
...,...,...,...,...,...,...
1888,14911,EIRE,143825.06,4342,0.99908,VIP
3017,16446,United Kingdom,168472.50,4343,0.99931,VIP
3737,17450,United Kingdom,194550.79,4344,0.99954,VIP
4210,18102,United Kingdom,259657.30,4345,0.99977,VIP


In [274]:
# Q2: How to identify which country has the most VIP Customers?
orders_CstCntVIP=orders_by_customerCountry[orders_by_customerCountry['CustomerType']=='VIP'] # 218 rows
orders_CstCntVIP.groupby(['Country']).agg({'amount_spent':'sum','CustomerID':'count'}).sort_values('CustomerID',ascending=False)

# The United Kingdom has the most VIP customers (178), followed by Germany (11) and France (9).

Country,amount_spent,CustomerID
United Kingdom,3423635.81,178
Germany,106369.06,11
France,106383.09,9
Switzerland,26315.86,3
Spain,25391.2,2
Portugal,14846.73,2
Japan,28406.43,2
EIRE,261204.69,2
Finland,7956.46,1
Channel Islands,8137.02,1


## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [289]:
# your code here
orders_CstCntVIPPref=orders_by_customerCountry[orders_by_customerCountry['CustomerType'].isin(['VIP','Preferred'])]
orders_CstCntVIPPref



Unnamed: 0,CustomerID,Country,amount_spent,row_num,quantile,CustomerType
1960,15024,United Kingdom,1661.33,3260,0.750115,Preferred
2110,15214,United Kingdom,1661.44,3261,0.750345,Preferred
517,13012,United Kingdom,1661.84,3262,0.750575,Preferred
155,12530,Germany,1662.28,3263,0.750805,Preferred
446,12912,United Kingdom,1662.30,3264,0.751035,Preferred
...,...,...,...,...,...,...
1888,14911,EIRE,143825.06,4342,0.999080,VIP
3017,16446,United Kingdom,168472.50,4343,0.999310,VIP
3737,17450,United Kingdom,194550.79,4344,0.999540,VIP
4210,18102,United Kingdom,259657.30,4345,0.999770,VIP


In [291]:
orders_CstCntVIPPref.groupby(['Country']).agg({'amount_spent':'sum','CustomerID':'count'}).sort_values('CustomerID',ascending=False)

# The United Kingdom has the most VIP+Preferred customers (934), followed by Germany (39) and France (29).

Country,amount_spent,CustomerID
United Kingdom,5627835.481,934
Germany,194078.54,39
France,174694.67,29
Belgium,31184.22,11
Switzerland,45381.99,9
Spain,46564.7,7
Portugal,26905.1,7
Norway,35065.63,7
Italy,12327.97,5
Finland,18686.81,5
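
To see the VIP and Preferred counts side by side for each country, `pd.crosstab` is one option. The sketch below builds the table on a hypothetical `labelled` frame and adds a combined column; the rows are made up.

```python
import pandas as pd

# Hypothetical labelled customers, for illustration only.
labelled = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "Germany", "Germany", "France"],
    "CustomerType": ["VIP", "Preferred", "Preferred", "VIP", "Standard", "Preferred"],
})

# Rows are countries, columns are customer types, cells are customer counts.
counts = pd.crosstab(labelled["Country"], labelled["CustomerType"])

# Combined VIP + Preferred count per country, and the country where it is largest.
counts["VIP+Preferred"] = counts["VIP"] + counts["Preferred"]
top_combined = counts["VIP+Preferred"].idxmax()
print(counts)
print(top_combined)
```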
