# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e., the 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [90]:
# import required libraries
import numpy as np
import pandas as pd

Next, extract and import the `Orders` dataset into a DataFrame variable called `orders`. Print the head of `orders` to get an overview of the data:

In [91]:
# your code here
orders = pd.read_csv('Orders.csv')

In [92]:
orders.head(10)

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
5,5,536365,22752,2010,12,3,8,set 7 babushka nesting boxes,2,2010-12-01 08:26:00,7.65,17850,United Kingdom,15.3
6,6,536365,21730,2010,12,3,8,glass star frosted t-light holder,6,2010-12-01 08:26:00,4.25,17850,United Kingdom,25.5
7,7,536366,22633,2010,12,3,8,hand warmer union jack,6,2010-12-01 08:28:00,1.85,17850,United Kingdom,11.1
8,8,536366,22632,2010,12,3,8,hand warmer red polka dot,6,2010-12-01 08:28:00,1.85,17850,United Kingdom,11.1
9,9,536367,84879,2010,12,3,8,assorted colour bird ornament,32,2010-12-01 08:34:00,1.69,13047,United Kingdom,54.08


In [93]:
orders.shape

(397924, 14)

In [94]:
orders.value_counts('Country')

Country
United Kingdom          354345
Germany                   9042
France                    8342
EIRE                      7238
Spain                     2485
Netherlands               2363
Belgium                   2031
Switzerland               1842
Portugal                  1462
Australia                 1185
Norway                    1072
Italy                      758
Channel Islands            748
Finland                    685
Cyprus                     614
Sweden                     451
Austria                    398
Denmark                    380
Poland                     330
Japan                      321
Israel                     248
Unspecified                244
Singapore                  222
Iceland                    182
USA                        179
Canada                     151
Greece                     145
Malta                      112
United Arab Emirates        68
European Community          60
RSA                         58
Lebanon                     45


In [95]:
list(orders.columns)

['Unnamed: 0',
 'InvoiceNo',
 'StockCode',
 'year',
 'month',
 'day',
 'hour',
 'Description',
 'Quantity',
 'InvoiceDate',
 'UnitPrice',
 'CustomerID',
 'Country',
 'amount_spent']

In [96]:
relevant_info = orders[['CustomerID', 'amount_spent','UnitPrice','Quantity','Country']]

---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the  `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [97]:
# taking relevant columns from orders.csv
relevant_info.head()

Unnamed: 0,CustomerID,amount_spent,UnitPrice,Quantity,Country
0,17850,15.3,2.55,6,United Kingdom
1,17850,20.34,3.39,6,United Kingdom
2,17850,22.0,2.75,8,United Kingdom
3,17850,20.34,3.39,6,United Kingdom
4,17850,20.34,3.39,6,United Kingdom


In [98]:
# How to aggregate the amount_spent for unique customers?
# grouping by CustomerID and taking the sum() of amount_spent
consolidate_customers = relevant_info.groupby('CustomerID', as_index=False)['amount_spent'].sum()

In [99]:
# How to select customers whose aggregated amount_spent is in a given quantile range?
consolidate_customers.head()

Unnamed: 0,CustomerID,amount_spent
0,12346,77183.6
1,12347,4310.0
2,12348,1797.24
3,12349,1757.55
4,12350,334.4


In [100]:
# 95th percentile of total spend: every customer who spent more than about 5840.18 is VIP
consolidate_customers['amount_spent'].quantile(0.95)

5840.181999999982
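The fractional threshold above comes from how `Series.quantile` works: by default it interpolates linearly between the two nearest data points. The following is a minimal sketch on made-up totals (the `spend` values and the names `threshold` and `vip_mask` are assumptions for illustration, not part of the `Orders` data):

```python
import pandas as pd

# Hypothetical aggregated totals for five customers
spend = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])

# Default interpolation is linear: position 0.95 * (5 - 1) = 3.8,
# so the result lies 80% of the way from 400 to 500
threshold = spend.quantile(0.95)

# Storing the threshold in a variable avoids retyping a rounded number later
vip_mask = spend > threshold
print(threshold, vip_mask.sum())
```

Because the threshold usually falls between two observed values, `>` and `>=` tend to select the same customers, but reusing the stored variable keeps every later filter consistent.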

In [101]:
# created a boolean mask identifying all VIP customers
consolidate_customers['Mask'] = (consolidate_customers['amount_spent'] >= consolidate_customers['amount_spent'].quantile(0.95))

In [102]:
consolidate_customers.head()

Unnamed: 0,CustomerID,amount_spent,Mask
0,12346,77183.6,True
1,12347,4310.0,False
2,12348,1797.24,False
3,12349,1757.55,False
4,12350,334.4,False


In [103]:
# filter the dataframe to display only VIP customers, reusing the boolean mask instead of a hard-coded threshold
consolidate_customers[consolidate_customers['Mask']]

Unnamed: 0,CustomerID,amount_spent,Mask
0,12346,77183.60,True
10,12357,6207.67,True
12,12359,6372.58,True
50,12409,11072.67,True
55,12415,124914.53,True
...,...,...,...
4207,18109,8052.97,True
4229,18139,8438.34,True
4253,18172,7561.68,True
4292,18223,6484.54,True


In [104]:
preferred_filter = relevant_info.groupby('CustomerID', as_index=False)['amount_spent'].sum()

In [105]:
preferred_filter.head()

Unnamed: 0,CustomerID,amount_spent
0,12346,77183.6
1,12347,4310.0
2,12348,1797.24
3,12349,1757.55
4,12350,334.4


In [106]:
# getting a new dataframe for calculating preferred customers
# Don't want to mix up the previous calculations thus taking new variable 'preferred'
preferred = relevant_info.groupby('CustomerID', as_index=False)['amount_spent'].sum()

In [107]:
preferred.head()

Unnamed: 0,CustomerID,amount_spent
0,12346,77183.6
1,12347,4310.0
2,12348,1797.24
3,12349,1757.55
4,12350,334.4


In [108]:
# getting both quantiles to determine the range for Preferred customers
preferred['amount_spent'].quantile(0.95)

5840.181999999982

In [109]:
preferred['amount_spent'].quantile(0.75)

1661.64

In [110]:
# filtering by preferred customer and displaying the dataframe
preferred[(preferred['amount_spent']<5840.18) & (preferred['amount_spent']>1661.64)]

Unnamed: 0,CustomerID,amount_spent
1,12347,4310.00
2,12348,1797.24
3,12349,1757.55
5,12352,2506.04
9,12356,2811.43
...,...,...
4319,18259,2338.60
4320,18260,2643.20
4328,18272,3078.58
4337,18283,2094.88
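The cells above select the two groups but never attach a label to them, which is what Sub Problem 3 asks for. One possible way to do that is `np.select`, sketched here on toy data. The frame `totals`, the column `segment`, and the fallback label `'Regular'` are assumptions for illustration; the strict `>` comparisons are also a choice, since the challenge text does not say which group a customer exactly on a boundary belongs to.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the aggregated per-customer totals
totals = pd.DataFrame({
    'CustomerID': range(1, 21),
    'amount_spent': [float(x) for x in range(100, 2100, 100)],
})

# Compute both thresholds once and reuse them
q75 = totals['amount_spent'].quantile(0.75)
q95 = totals['amount_spent'].quantile(0.95)

# Conditions are checked in order, so VIP takes precedence over Preferred
conditions = [
    totals['amount_spent'] > q95,
    totals['amount_spent'] > q75,
]
totals['segment'] = np.select(conditions, ['VIP', 'Preferred'], default='Regular')

print(totals['segment'].value_counts())
```

With the labels stored in a column, later questions can filter on `segment` rather than repeating the threshold comparison.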


Now we'll leave it to you to solve Q2 & Q3, which you can do by building on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [111]:
# your code here
# taking the initial dataframe and extracting CustomerID and Country
# Next step is to create a dictionary to eliminate duplicate CustomerIDs
# Next step is to take our VIP-filtered DataFrame and create a list of CustomerIDs and then compare the dict with the list
df = orders[['CustomerID', 'Country']]
   

In [112]:
df.head()

Unnamed: 0,CustomerID,Country
0,17850,United Kingdom
1,17850,United Kingdom
2,17850,United Kingdom
3,17850,United Kingdom
4,17850,United Kingdom


In [113]:
# our dictionary mapping each CustomerID to a Country (later rows overwrite earlier ones)
country_dict = dict(zip(df.CustomerID, df.Country))
type(country_dict)

dict

In [114]:
df2 = consolidate_customers[consolidate_customers['amount_spent']>5840.18]

In [115]:
df2.head()

Unnamed: 0,CustomerID,amount_spent,Mask
0,12346,77183.6,True
10,12357,6207.67,True
12,12359,6372.58,True
50,12409,11072.67,True
55,12415,124914.53,True


In [116]:
# collect the VIP CustomerIDs as a plain Python list
vip_id_list = df2['CustomerID'].tolist()
print(vip_id_list)

[12346, 12357, 12359, 12409, 12415, 12428, 12431, 12433, 12435, 12451, 12471, 12472, 12474, 12476, 12477, 12536, 12540, 12557, 12567, 12583, 12590, 12621, 12626, 12637, 12678, 12681, 12682, 12683, 12705, 12709, 12731, 12744, 12748, 12753, 12757, 12766, 12798, 12830, 12901, 12921, 12931, 12939, 12971, 12980, 12989, 13001, 13018, 13027, 13078, 13081, 13089, 13090, 13093, 13097, 13098, 13102, 13113, 13199, 13209, 13225, 13263, 13316, 13319, 13324, 13340, 13408, 13418, 13458, 13488, 13534, 13576, 13629, 13668, 13694, 13709, 13767, 13777, 13798, 13854, 13871, 13881, 13969, 13985, 14031, 14051, 14056, 14057, 14060, 14062, 14088, 14096, 14101, 14156, 14194, 14258, 14298, 14367, 14415, 14505, 14527, 14606, 14607, 14646, 14667, 14680, 14733, 14735, 14769, 14796, 14849, 14866, 14895, 14911, 14936, 14944, 14952, 14961, 15005, 15023, 15039, 15044, 15061, 15078, 15098, 15125, 15144, 15159, 15189, 15194, 15249, 15251, 15290, 15311, 15358, 15382, 15465, 15482, 15498, 15502, 15513, 15601, 15615, 15640

In [117]:
# iterating over the dict to find the country of each VIP CustomerID
# (a set gives constant-time membership checks instead of a nested loop)
country = []
vip_ids = set(vip_id_list)
for k, v in country_dict.items():
    if k in vip_ids:
        print(k, v)
        country.append(v)
            

12583 France
15311 United Kingdom
16029 United Kingdom
12431 Australia
17511 United Kingdom
13408 United Kingdom
13767 United Kingdom
15513 United Kingdom
13694 United Kingdom
14849 United Kingdom
16210 United Kingdom
12748 United Kingdom
12433 Norway
14911 EIRE
17841 United Kingdom
13093 United Kingdom
12921 United Kingdom
13777 United Kingdom
18229 United Kingdom
14606 United Kingdom
13576 United Kingdom
13090 United Kingdom
15694 United Kingdom
17017 United Kingdom
15601 United Kingdom
13418 United Kingdom
14060 United Kingdom
17381 United Kingdom
17581 United Kingdom
15061 United Kingdom
15640 United Kingdom
14031 United Kingdom
12971 United Kingdom
13798 United Kingdom
17396 United Kingdom
14156 EIRE
14680 United Kingdom
12557 Spain
16013 United Kingdom
17949 United Kingdom
12682 France
15769 United Kingdom
13081 United Kingdom
17243 United Kingdom
15465 United Kingdom
13089 United Kingdom
16033 United Kingdom
18055 United Kingdom
18109 United Kingdom
16839 United Kingdom
16814 Un

In [118]:
from collections import Counter

# tally VIP customers per country (a distinct name keeps the Counter class from being shadowed)
vip_counter = Counter(country)

# countries and their respective counts
most_occur = vip_counter.most_common(5) # taking top 5

print(most_occur)

# answer: 177 unique VIP customers from United Kingdom

[('United Kingdom', 177), ('Germany', 10), ('France', 9), ('Switzerland', 3), ('Australia', 2)]
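The same count can also be obtained without a dictionary or explicit loops by staying in pandas. The following is a minimal sketch on toy data, where `orders_demo` and `vip_demo` are hypothetical stand-ins for `orders` and the VIP-filtered frame `df2`:

```python
import pandas as pd

# Hypothetical miniature versions of `orders` and the VIP-filtered frame
orders_demo = pd.DataFrame({
    'CustomerID': [1, 1, 2, 3, 3, 4],
    'Country': ['United Kingdom', 'United Kingdom', 'France',
                'United Kingdom', 'United Kingdom', 'Germany'],
})
vip_demo = pd.DataFrame({'CustomerID': [1, 3, 4]})

# One row per customer; keep='last' mirrors dict(zip(...)),
# where a later row overwrites an earlier one for the same key
customer_country = orders_demo.drop_duplicates('CustomerID', keep='last')[['CustomerID', 'Country']]

# Attach a country to each VIP customer, then count customers per country
vip_by_country = (
    vip_demo.merge(customer_country, on='CustomerID', how='left')['Country']
    .value_counts()
)
print(vip_by_country)
```

Both approaches assign exactly one country to each customer, so a customer who ordered from more than one country is counted only under the last one seen.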


## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [119]:
# your code here
# I will repeat the same steps used for the VIP customers. Then I will create two lists, vip_list and preferred_list, and merge them
# to find out the country with the most VIP+Preferred customers

In [120]:
country_preferred = dict(zip(df.CustomerID, df.Country))

In [121]:
# creating a fresh DataFrame for preferred customers
df3 = preferred[(preferred['amount_spent']<5840.18) & (preferred['amount_spent']>1661.64)]

In [122]:
# collect the Preferred CustomerIDs as a plain Python list
preferred_id_list = df3['CustomerID'].tolist()
print(preferred_id_list)

[12347, 12348, 12349, 12352, 12356, 12360, 12362, 12370, 12371, 12378, 12380, 12381, 12383, 12388, 12395, 12397, 12405, 12406, 12407, 12408, 12417, 12423, 12424, 12429, 12432, 12437, 12438, 12444, 12449, 12454, 12455, 12456, 12457, 12473, 12480, 12481, 12483, 12484, 12490, 12500, 12501, 12502, 12517, 12518, 12520, 12523, 12524, 12528, 12530, 12539, 12553, 12560, 12562, 12569, 12578, 12584, 12585, 12594, 12597, 12598, 12600, 12610, 12613, 12615, 12619, 12625, 12627, 12633, 12635, 12643, 12645, 12647, 12653, 12656, 12662, 12664, 12668, 12669, 12670, 12674, 12684, 12685, 12688, 12700, 12704, 12708, 12712, 12714, 12720, 12721, 12726, 12727, 12747, 12749, 12752, 12754, 12755, 12758, 12762, 12764, 12779, 12782, 12783, 12823, 12836, 12839, 12840, 12841, 12843, 12853, 12856, 12867, 12876, 12906, 12909, 12910, 12912, 12913, 12916, 12928, 12935, 12948, 12949, 12950, 12955, 12957, 12963, 13004, 13012, 13013, 13014, 13015, 13021, 13047, 13048, 13050, 13069, 13082, 13094, 13115, 13124, 13126, 13134

In [123]:
# iterating over the dict to find the country of each Preferred CustomerID
# (a set gives constant-time membership checks instead of a nested loop)
country_pref = []
preferred_ids = set(preferred_id_list)
for k, v in country_preferred.items():
    if k in preferred_ids:
        print(k, v)
        country_pref.append(v)

17850 United Kingdom
13047 United Kingdom
15291 United Kingdom
14688 United Kingdom
17809 United Kingdom
16098 United Kingdom
17924 United Kingdom
13448 United Kingdom
16218 United Kingdom
14307 United Kingdom
17920 United Kingdom
13758 United Kingdom
17377 United Kingdom
14001 United Kingdom
12662 Germany
15485 United Kingdom
18144 United Kingdom
16456 United Kingdom
17346 United Kingdom
17873 United Kingdom
13468 United Kingdom
16928 United Kingdom
14696 United Kingdom
17690 United Kingdom
17069 United Kingdom
15235 United Kingdom
15752 United Kingdom
13941 United Kingdom
14135 United Kingdom
14388 United Kingdom
18041 United Kingdom
15955 United Kingdom
14390 United Kingdom
15260 United Kingdom
13305 United Kingdom
15544 United Kingdom
15738 United Kingdom
15827 United Kingdom
14180 United Kingdom
14466 United Kingdom
16186 United Kingdom
17685 United Kingdom
17567 United Kingdom
17838 United Kingdom
17228 United Kingdom
17659 United Kingdom
15299 United Kingdom
17757 United Kingdom

16128 United Kingdom
15622 United Kingdom
17738 United Kingdom
17426 United Kingdom
16709 United Kingdom
17652 United Kingdom
17700 United Kingdom
13014 United Kingdom
15220 United Kingdom
12584 Italy
14903 United Kingdom
15572 United Kingdom
13344 United Kingdom
17092 United Kingdom
16242 United Kingdom
15834 United Kingdom
15764 United Kingdom
15819 United Kingdom
14407 United Kingdom
13908 United Kingdom
12758 Portugal
13975 United Kingdom
14730 United Kingdom
13373 United Kingdom
16712 United Kingdom
16426 United Kingdom
17049 United Kingdom
14112 United Kingdom
15367 United Kingdom
14291 United Kingdom
15632 United Kingdom
18173 United Kingdom
14854 United Kingdom
14930 Channel Islands
13265 United Kingdom
12520 Germany
13654 United Kingdom
17164 United Kingdom
15671 United Kingdom
16871 United Kingdom
15493 United Kingdom
15150 United Kingdom
14226 United Kingdom
16152 United Kingdom
12597 Spain
12955 United Kingdom
17716 United Kingdom
13650 United Kingdom
13268 United Kingdom
1

In [124]:
from collections import Counter
  
# Counter variable
Count = Counter(country_pref)
  
# input values and their respective counts.
most_occ = Count.most_common(5) # taking top 5
  
print(most_occ)

# answer: 755 unique Preferred customers from United Kingdom

[('United Kingdom', 755), ('Germany', 29), ('France', 20), ('Belgium', 11), ('Spain', 6)]


In [125]:
# note: most_common() returns lists of (country, count) tuples, not dicts
vip_dict = most_occur
preferred_dict = most_occ


In [126]:
print(vip_dict)
print(preferred_dict)

[('United Kingdom', 177), ('Germany', 10), ('France', 9), ('Switzerland', 3), ('Australia', 2)]
[('United Kingdom', 755), ('Germany', 29), ('France', 20), ('Belgium', 11), ('Spain', 6)]
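The two lists printed above are still separate, while Q3 asks for the combined figure. One way to merge them is `Counter` addition, sketched below with the top-5 values copied from the printed output. The names `vip_top5`, `preferred_top5`, and `combined` are assumptions for illustration.

```python
from collections import Counter

# Top-5 lists copied from the printed output above
vip_top5 = [('United Kingdom', 177), ('Germany', 10), ('France', 9),
            ('Switzerland', 3), ('Australia', 2)]
preferred_top5 = [('United Kingdom', 755), ('Germany', 29), ('France', 20),
                  ('Belgium', 11), ('Spain', 6)]

# Adding two Counters sums the counts of matching keys
combined = Counter(dict(vip_top5)) + Counter(dict(preferred_top5))
print(combined.most_common(3))
# prints [('United Kingdom', 932), ('Germany', 39), ('France', 29)]
```

Because each list was cut to five entries, the combined counts for smaller countries may be incomplete. Adding the full counters, `Counter(country) + Counter(country_pref)`, would avoid that. The United Kingdom's lead is large enough that the truncation does not change the answer.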
