# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e., the 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders` from Ironhack's database into a dataframe variable called `orders`. Print the head of `orders` to overview the data:

In [2]:
#Importing Orders from DataSet folder
orders = pd.read_csv("../Datasets as CSV/Orders.csv", index_col = 0)
orders.head()

Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [3]:
# To identify VIP and Preferred Customers, we first aggregate the amount spent per customer and sort the result
# to see the largest and smallest values. We then look at the min, max, mean and median, and compute the
# quantiles that let us split the customers into the two groups.

---

"Identify VIP and Preferred Customers" is the non-technical goal of your boss. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the  `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.
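One possible way to chain the three sub problems is sketched below on a tiny made-up data set. The `toy_orders` frame, its values, and the `label` name are assumptions for illustration and are not part of `Orders`. The sketch takes "aggregated" to mean the sum of `amount_spent` per customer and computes the quantiles over those per-customer totals:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 20 customers, two order lines each;
# customer i spends i/2 on each line, so i in total.
toy_orders = pd.DataFrame({
    "CustomerID": np.repeat(np.arange(1, 21), 2),
    "amount_spent": np.repeat(np.arange(1, 21), 2) / 2.0,
})

# Sub problem 1: total amount spent per unique customer
totals = toy_orders.groupby("CustomerID")["amount_spent"].sum()

# Sub problem 2: quantile thresholds computed over the per-customer totals
q75, q95 = totals.quantile([0.75, 0.95])

# Sub problem 3: label each customer; conditions are checked in order,
# so a customer above q95 is "VIP" and never "Preferred"
labels = pd.Series(
    np.select([totals > q95, totals >= q75], ["VIP", "Preferred"], default="Regular"),
    index=totals.index,
    name="label",
)

print(labels.value_counts())
```

Whether the thresholds should come from per-customer totals or from individual order lines is a modelling decision; the sketch simply follows the wording "aggregated expenses" in the task.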

In [4]:
#The VIP customers are those whose aggregated expenses at your global chain stores are above the 95th percentile.

#Each CustomerID appears on many rows of the dataset above (one row per order line), so we group the rows
#by CustomerID and then calculate each customer's average amount spent.

amount_customer_spent = orders.groupby("CustomerID").agg({"amount_spent": "mean"}).sort_values("amount_spent", ascending = False)

#Now we have a dataframe where we can see what each customer spent on average per order line,
#sorted from the highest to the lowest average.

amount_customer_spent

CustomerID,amount_spent
12346,77183.600000
16446,56157.500000
15098,13305.500000
15749,4453.430000
15195,3861.000000
...,...
13271,2.264375
13684,2.241000
17816,2.150588
15503,2.101286
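Aggregating by mean and aggregating by sum can rank customers differently, which matters when the goal is stated in terms of aggregated expenses. A minimal sketch on made-up numbers (the `toy_orders` frame and the `total` and `average` column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: customer 1 places four small orders,
# customer 2 places a single larger one.
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 1, 2],
    "amount_spent": [10.0, 10.0, 10.0, 10.0, 25.0],
})

# Named aggregation: compute the total and the average side by side
per_customer = toy_orders.groupby("CustomerID")["amount_spent"].agg(
    total="sum",
    average="mean",
)

print(per_customer)
print("Top by total:  ", per_customer["total"].idxmax())
print("Top by average:", per_customer["average"].idxmax())
```

Here customer 1 leads on total spend while customer 2 leads on average spend, so the choice of aggregation function decides who ends up above a given threshold.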


In [5]:
#Now we have to identify the 95th percentile and the 75th percentile.
#Before computing them, we look at summary statistics of amount_spent: the max, min, mean and median.

maxvalue = orders.amount_spent.max()
minvalue = orders.amount_spent.min()
meanvalue = orders.amount_spent.mean()
medianvalue = orders.amount_spent.median()

print(meanvalue)
print(maxvalue)


22.394748504739596
168469.6


In [6]:
#The mean (about 22.39) is far below the maximum (168469.6), so the values are heavily right-skewed and we use quantiles to set the thresholds.

#Q3 is the 75th percentile: roughly 75% of the amount_spent values are at or below it.
Q3 = orders.amount_spent.quantile(0.75)
#Q95 is the 95th percentile: roughly 95% of the amount_spent values are at or below it.
Q95 = orders.amount_spent.quantile(0.95)

print(Q3)
print(Q95)

#Customers that have spent between 19.8 and 67.5 belong to the Preferred Customers
#Customers that have spent more than 67.5 belong to the VIP Customers

19.8
67.5
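To make the quantile values easier to interpret, here is a small worked example of how `Series.quantile` interpolates with its default linear method. The series `s` is made up for illustration:

```python
import pandas as pd

# Hypothetical values, already sorted for readability
s = pd.Series([10, 20, 30, 40, 50])

# With linear interpolation, quantile q sits at position q * (n - 1)
# in the sorted values (0-based).
# q = 0.75 -> position 3.0 -> exactly the 4th value
q75 = s.quantile(0.75)

# q = 0.95 -> position 3.8 -> 80% of the way from 40 to 50
q95 = s.quantile(0.95)

print(q75, q95)
```

The same call works on any numeric Series, so it can be applied either to individual order lines or to per-customer aggregates, depending on which population the thresholds should describe.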


In [7]:
#Now that we have aggregated the data and we know the threshold values, let's find the customers who meet these conditions.
#To identify the Preferred Customers, we need the CustomerID of every customer whose average amount spent is at least 19.8 (Q3) and at most 67.5 (Q95).

Preferred_customers = amount_customer_spent.loc[(amount_customer_spent["amount_spent"]>= Q3) & (amount_customer_spent["amount_spent"]<= Q95),["amount_spent"]]
Preferred_customers.count()

#There are 1372 Preferred Customers, shown in the dataframe below.
Preferred_customers

CustomerID,amount_spent
18144,67.180233
13216,66.748000
17679,66.403667
14185,65.883333
16696,65.790000
...,...
18257,19.810424
13813,19.809565
12610,19.806981
17324,19.804600


In [8]:
#We do the same to find the VIP customers: we need the customers whose average amount spent is more than 67.5 (Q95).

VIP_customers = amount_customer_spent.loc[(amount_customer_spent["amount_spent"]> Q95),["amount_spent"]]
VIP_customers.count()

#There are 320 VIP customers, whose average amount spent is above the 95th-percentile threshold.
VIP_customers

CustomerID,amount_spent
12346,77183.600000
16446,56157.500000
15098,13305.500000
15749,4453.430000
15195,3861.000000
...,...
18273,68.000000
18142,67.968000
18173,67.962581
14816,67.962500


Now we'll leave it to you to solve Q2 & Q3, for which you can leverage your solution to Q1:

## Q2: How to identify which country has the most VIP Customers?

In [9]:
# We already have the VIP Customers. If we look up the Country of each one and count how many times each country appears, we will know which country has the most VIP Customers.
# For each CustomerID in VIP_customers, find the corresponding rows (and therefore the Country) in orders.
CustomerID = VIP_customers.index.tolist()

Country_VIP = orders.loc[orders["CustomerID"].isin(CustomerID)]

#Since each customer appears on many rows, we keep a single row per customer so that each VIP Customer is counted only once
#We drop the rows with duplicate customer IDs
Country_VIP = Country_VIP.drop_duplicates("CustomerID")
#We count how many times each country appears.
Country_VIP.Country.value_counts()

#The United Kingdom has the largest number of VIP customers.


United Kingdom     290
Japan                4
Germany              4
France               3
Denmark              2
Spain                2
Channel Islands      2
Netherlands          2
Switzerland          2
Greece               1
Norway               1
Sweden               1
Finland              1
EIRE                 1
Canada               1
Singapore            1
Austria              1
Australia            1
Name: Country, dtype: int64
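An alternative to `drop_duplicates` followed by `value_counts` is to count distinct customers per country directly. A minimal sketch, assuming a made-up `toy_orders` frame and a hypothetical list `vip_ids` standing in for the real VIP customer IDs:

```python
import pandas as pd

# Hypothetical toy data: several order lines per customer
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3, 3, 4],
    "Country": [
        "United Kingdom", "United Kingdom", "France",
        "United Kingdom", "United Kingdom", "Japan",
    ],
})
vip_ids = [1, 3, 4]  # hypothetical VIP customer IDs

# Keep only VIP rows, then count distinct customers in each country
vip_per_country = (
    toy_orders[toy_orders["CustomerID"].isin(vip_ids)]
    .groupby("Country")["CustomerID"]
    .nunique()
    .sort_values(ascending=False)
)

top_country = vip_per_country.idxmax()
print(vip_per_country)
print(top_country)
```

Using `nunique` counts each customer once per country even when the customer has many order lines.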

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [10]:
#We are going to add two flag columns, VIP and Preferred, to the original dataframe
orders["VIP"]= 0
orders["Preferred"]= 0 
orders.head()
#Now that we have created the two columns, let's populate them 
orders.loc[orders["CustomerID"].isin(CustomerID),"VIP"]= 1
CustomerIDpreferred = Preferred_customers.index.tolist()
orders.loc[orders["CustomerID"].isin(CustomerIDpreferred),"Preferred"]= 1


In [11]:
orders.loc[(orders["VIP"]== 1)]

Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,VIP,Preferred
65,536374,21258,2010,12,3,9,victorian sewing box large,32,2010-12-01 09:09:00,10.95,15100,United Kingdom,350.40,1,0
105,536380,22961,2010,12,3,9,jam making set printed,24,2010-12-01 09:41:00,1.45,17809,United Kingdom,34.80,1,0
175,536386,84880,2010,12,3,9,white wire egg holder,36,2010-12-01 09:57:00,4.95,16029,United Kingdom,178.20,1,0
176,536386,85099C,2010,12,3,9,jumbo bag baroque black white,100,2010-12-01 09:57:00,1.65,16029,United Kingdom,165.00,1,0
177,536386,85099B,2010,12,3,9,jumbo bag red retrospot,100,2010-12-01 09:57:00,1.65,16029,United Kingdom,165.00,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
541702,581566,23404,2011,12,5,11,home sweet home blackboard,144,2011-12-09 11:50:00,3.26,18102,United Kingdom,469.44,1,0
541865,581583,20725,2011,12,5,12,lunch bag red retrospot,40,2011-12-09 12:23:00,1.45,13777,United Kingdom,58.00,1,0
541866,581583,85038,2011,12,5,12,6 chocolate love heart t-lights,36,2011-12-09 12:23:00,1.85,13777,United Kingdom,66.60,1,0
541867,581584,20832,2011,12,5,12,red flock love heart photo frame,72,2011-12-09 12:25:00,0.72,13777,United Kingdom,51.84,1,0


In [12]:
orders.loc[(orders["Preferred"]== 1)]

Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent,VIP,Preferred
26,536370,22728,2010,12,3,8,alarm clock bakelike pink,24,2010-12-01 08:45:00,3.75,12583,France,90.0,0,1
27,536370,22727,2010,12,3,8,alarm clock bakelike red,24,2010-12-01 08:45:00,3.75,12583,France,90.0,0,1
28,536370,22726,2010,12,3,8,alarm clock bakelike green,12,2010-12-01 08:45:00,3.75,12583,France,45.0,0,1
29,536370,21724,2010,12,3,8,panda and bunnies sticker sheet,12,2010-12-01 08:45:00,0.85,12583,France,10.2,0,1
30,536370,21883,2010,12,3,8,stars gift tape,24,2010-12-01 08:45:00,0.65,12583,France,15.6,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
541864,581582,23498,2011,12,5,12,classic bicycle clips,12,2011-12-09 12:21:00,1.45,17581,United Kingdom,17.4,0,1
541890,581586,22061,2011,12,5,12,large cake stand hanging strawbery,8,2011-12-09 12:49:00,2.95,13113,United Kingdom,23.6,0,1
541891,581586,23275,2011,12,5,12,set of 3 hanging owls ollie beak,24,2011-12-09 12:49:00,1.25,13113,United Kingdom,30.0,0,1
541892,581586,21217,2011,12,5,12,red retrospot round cake tins,24,2011-12-09 12:49:00,8.95,13113,United Kingdom,214.8,0,1


In [13]:
#Now that we have populated those columns, let's find the countries that have the most VIP and Preferred Customers

VIP_Preferred = orders.groupby(["CustomerID","Country"]).agg({"VIP":"sum", "Preferred":"sum"})

In [14]:
VIP_Preferred["Sum"]= VIP_Preferred["VIP"] + VIP_Preferred["Preferred"]

In [16]:
VIP_Preferred.Sum.sort_values(ascending = False)

CustomerID  Country       
14911       EIRE              5677
15311       United Kingdom    2379
14646       Netherlands       2080
13089       United Kingdom    1818
14298       United Kingdom    1637
                              ... 
14762       United Kingdom       0
16710       United Kingdom       0
14760       United Kingdom       0
14759       United Kingdom       0
15292       United Kingdom       0
Name: Sum, Length: 4347, dtype: int64

In [None]:
#Again, the United Kingdom has the highest concentration of Preferred and VIP customers
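Summing row-level flags counts order lines, not customers, so a single customer with many orders can dominate the result. A sketch of counting distinct VIP and Preferred customers per country, on made-up data (the `toy_orders` frame, the `toy_labels` mapping, and the `label` column are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data: customer 1 has many order lines in one country
toy_orders = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 4, 4, 5],
    "Country": [
        "EIRE", "EIRE", "EIRE",
        "United Kingdom", "United Kingdom",
        "United Kingdom", "United Kingdom", "France",
    ],
})

# Hypothetical per-customer labels, indexed by CustomerID
toy_labels = pd.Series(
    {1: "VIP", 2: "Preferred", 3: "Preferred", 4: "Regular", 5: "VIP"},
    name="label",
)

# Attach each customer's label to their order lines
labelled = toy_orders.join(toy_labels, on="CustomerID")
selected = labelled[labelled["label"].isin(["VIP", "Preferred"])]

# Count distinct customers, not rows, in each country
combined_per_country = (
    selected.groupby("Country")["CustomerID"]
    .nunique()
    .sort_values(ascending=False)
)

print(combined_per_country)
print("By rows instead:", selected["Country"].value_counts().idxmax())
```

In this toy data, counting rows would put EIRE first because of one very active customer, while counting distinct customers puts the United Kingdom first.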