# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e. the 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [2]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders` from Ironhack's database into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [3]:
from sqlalchemy import create_engine
import pandas as pd

driver = 'mysql+pymysql:'
user = 'ironhacker_read'
password = 'ir0nhack3r'
ip = '35.239.232.23'
database = 'orders'

In [4]:
connection_string = f'{driver}//{user}:{password}@{ip}/{database}'

In [5]:
engine = create_engine(connection_string)

In [6]:

query = """
        SELECT * FROM orders.orders;
"""


In [7]:
orders = pd.read_sql(query, engine)

In [8]:
orders.head()

Unnamed: 0,index,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [9]:
orders.to_csv('../orders.csv')  # Save a local CSV copy so the data can be used offline

In [10]:
## Data cleaning and adjusting types. 

orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397924 entries, 0 to 397923
Data columns (total 14 columns):
index           397924 non-null int64
InvoiceNo       397924 non-null int64
StockCode       397924 non-null object
year            397924 non-null int64
month           397924 non-null int64
day             397924 non-null int64
hour            397924 non-null int64
Description     397924 non-null object
Quantity        397924 non-null int64
InvoiceDate     397924 non-null object
UnitPrice       397924 non-null float64
CustomerID      397924 non-null int64
Country         397924 non-null object
amount_spent    397924 non-null float64
dtypes: float64(2), int64(8), object(4)
memory usage: 42.5+ MB


1. There are no null values in any column.
2. The `index` column duplicates the dataframe's own index, so we can drop it.
3. `InvoiceDate` is stored as `object`; we can convert it to `datetime64[ns]`.
4. We can rename the columns to snake_case: `invoice_no`, `stock_code`, and so on.
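The cleaning steps listed above can be sketched in one chain. This is a minimal illustration on a toy frame with made-up values, not the notebook's actual cells; `to_snake_case` is a hypothetical helper, and unlike the manual rename used below it produces `customer_id` rather than `customer_ID`.

```python
import re
import pandas as pd

# Toy stand-in for the real `orders` table (values are invented for illustration).
raw = pd.DataFrame({
    "index": [0, 1],
    "InvoiceNo": [536365, 536366],
    "InvoiceDate": ["2010-12-01 08:26:00", "2010-12-01 08:28:00"],
    "CustomerID": [17850, 17850],
    "amount_spent": [15.3, 11.1],
})

def to_snake_case(name: str) -> str:
    """Hypothetical helper: 'InvoiceNo' -> 'invoice_no', 'CustomerID' -> 'customer_id'."""
    # Insert an underscore wherever a lowercase letter or digit is followed by an uppercase letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

clean = (
    raw.drop(columns=["index"])        # step 2: drop the redundant index column
       .rename(columns=to_snake_case)  # step 4: snake_case column names
       .assign(invoice_date=lambda d: pd.to_datetime(d["invoice_date"]))  # step 3: parse dates
)

print(clean.dtypes)
```

Deriving the names with a function avoids retyping the whole column list, at the cost of less control over individual names.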

In [11]:
## remove index column

orders = orders.drop(['index'], axis=1)


In [12]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397924 entries, 0 to 397923
Data columns (total 13 columns):
InvoiceNo       397924 non-null int64
StockCode       397924 non-null object
year            397924 non-null int64
month           397924 non-null int64
day             397924 non-null int64
hour            397924 non-null int64
Description     397924 non-null object
Quantity        397924 non-null int64
InvoiceDate     397924 non-null object
UnitPrice       397924 non-null float64
CustomerID      397924 non-null int64
Country         397924 non-null object
amount_spent    397924 non-null float64
dtypes: float64(2), int64(7), object(4)
memory usage: 39.5+ MB


In [13]:
orders.columns  # We want to rename the columns to snake_case, following standardized guidelines such as PEP 8.

Index(['InvoiceNo', 'StockCode', 'year', 'month', 'day', 'hour', 'Description',
       'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country',
       'amount_spent'],
      dtype='object')

In [14]:
new_columns_name = ['invoice_no', 'stock_code', 'year', 'month', 'day', 'hour', 'description',
       'quantity', 'invoice_date', 'unit_price', 'customer_ID', 'Country','amount_spent']

In [15]:
orders.columns = new_columns_name

In [16]:
orders.head()

Unnamed: 0,invoice_no,stock_code,year,month,day,hour,description,quantity,invoice_date,unit_price,customer_ID,Country,amount_spent
0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [17]:
orders['invoice_date'] = pd.to_datetime(orders['invoice_date'])  # Convert the column to datetime

In [18]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397924 entries, 0 to 397923
Data columns (total 13 columns):
invoice_no      397924 non-null int64
stock_code      397924 non-null object
year            397924 non-null int64
month           397924 non-null int64
day             397924 non-null int64
hour            397924 non-null int64
description     397924 non-null object
quantity        397924 non-null int64
invoice_date    397924 non-null datetime64[ns]
unit_price      397924 non-null float64
customer_ID     397924 non-null int64
Country         397924 non-null object
amount_spent    397924 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(7), object(3)
memory usage: 39.5+ MB


Now our data is ready for the analysis. 

---

"Identify VIP and Preferred Customers" is the non-technical goal of your boss. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the  `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now in the workspace below, tackle each of the sub problems using the iterative problem solving workflow. Insert cells as necessary to write your code and explain your steps.

In [19]:
## Sub Problem 1: How to aggregate the amount_spent for unique customers?

# Group by customer ID and sum. Note that sum() adds up every summable column, not only amount_spent.

customers_agg = orders.groupby('customer_ID').sum()

In [20]:
customers_agg.head()

customer_ID,invoice_no,year,month,day,hour,quantity,unit_price,amount_spent
12346,541431,2011,1,2,10,74215,1.04,77183.6
12347,101296926,365971,1383,441,2219,2458,481.21,4310.0
12348,16869685,62324,257,111,472,2341,178.71,1797.24
12349,42165457,146803,803,73,657,631,605.1,1757.55
12350,9231629,34187,34,51,272,197,65.3,334.4


In [21]:
# Drop the columns we are not interested in. We only need amount_spent;
# customer_ID is already the index.

customers_agg = customers_agg[['amount_spent']]

In [22]:
customers_agg.columns = ['total_amount_cust']

In [23]:
customers_agg.head()

customer_ID,total_amount_cust
12346,77183.6
12347,4310.0
12348,1797.24
12349,1757.55
12350,334.4
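Summing every column also adds up invoice numbers, years and hours, which carry no meaning as totals, and the result can differ across pandas versions when non-numeric columns are present. A possible alternative, sketched here on a toy frame with invented values, selects `amount_spent` before aggregating; `orders_toy` and `total_per_customer` are illustrative names only.

```python
import pandas as pd

# Toy order lines (invented values): three customers with several purchases each.
orders_toy = pd.DataFrame({
    "customer_ID": [1, 1, 2, 3, 3, 3],
    "amount_spent": [10.0, 5.0, 7.5, 1.0, 2.0, 3.0],
})

# Select the column first, then sum per customer and give the result a descriptive name.
total_per_customer = (
    orders_toy.groupby("customer_ID")["amount_spent"]
              .sum()
              .rename("total_amount_cust")
              .to_frame()
)

print(total_per_customer)
```

This yields the same one-column frame indexed by `customer_ID` in a single step.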


In [24]:
# Sub Problem 2: How to select customers whose aggregated amount_spent is in a given quantile range?
# Compute the total spend at the 0.95 and 0.75 quantiles; these are the thresholds for the two groups.

customers_quantile = customers_agg.quantile([.95, .75])

In [25]:
customers_quantile

Unnamed: 0,total_amount_cust
0.95,5840.182
0.75,1661.64


In [41]:
customers_quantile.iloc[0,0]  # Total spend per customer at the 0.95 quantile

5840.181999999983

A customer reaches the 95th percentile with a total spend of at least 5840.18€, and the 75th percentile with a total spend of at least 1661.64€.

Hence, customers at or above the 0.95 quantile are treated as VIP, and customers between the 0.75 and 0.95 quantiles as Preferred.
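Positional access such as `.iloc[0, 0]` depends on the order in which the quantiles were requested. A small sketch, on an invented series, of reading the thresholds by their quantile label instead; the names `totals`, `q75` and `q95` are assumptions for illustration.

```python
import pandas as pd

# Invented per-customer totals, just to show the mechanics.
totals = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0], name="total_amount_cust")

# quantile() with a list returns a Series indexed by the requested quantiles.
thresholds = totals.quantile([0.75, 0.95])

# Look each threshold up by its label, so the request order no longer matters.
q75 = thresholds.loc[0.75]
q95 = thresholds.loc[0.95]

print(q75, q95)
```

pandas interpolates linearly between neighbouring values by default, which is why a threshold need not coincide with any single customer's total.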

In [42]:
# Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?
# VIP customers: total spend at or above the 0.95 quantile

VIP_customers = customers_agg[customers_agg['total_amount_cust'] >= customers_quantile.iloc[0,0]]

In [43]:
VIP_customers.head()

customer_ID,total_amount_cust
12346,77183.6
12357,6207.67
12359,6372.58
12409,11072.67
12415,124914.53


In [44]:
Preferred_customers = customers_agg[(customers_agg['total_amount_cust']>= customers_quantile.iloc[1,0]) & 
                                    (customers_agg['total_amount_cust']< customers_quantile.iloc[0,0]) ]

In [45]:
Preferred_customers.head()

customer_ID,total_amount_cust
12347,4310.0
12348,1797.24
12349,1757.55
12352,2506.04
12356,2811.43
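The two filtered frames above select the groups but do not attach a label to each customer. One possible way to do that, sketched on invented totals, is a single `segment` column built with `np.select`; the column name and the `"Regular"` fallback label are assumptions, not part of the challenge.

```python
import numpy as np
import pandas as pd

# Invented per-customer totals indexed by customer_ID.
totals = pd.DataFrame(
    {"total_amount_cust": [50.0, 120.0, 300.0, 800.0, 5000.0]},
    index=pd.Index([101, 102, 103, 104, 105], name="customer_ID"),
)

# Thresholds for the two groups, in the order requested.
q75, q95 = totals["total_amount_cust"].quantile([0.75, 0.95])

# np.select picks the first matching condition, so VIP must come before Preferred.
conditions = [
    totals["total_amount_cust"] >= q95,
    totals["total_amount_cust"] >= q75,
]
totals["segment"] = np.select(conditions, ["VIP", "Preferred"], default="Regular")

print(totals)
```

Keeping one labelled frame makes later steps, such as counting customers per country and segment, a matter of filtering on `segment`.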


Now we'll leave it to you to solve Q2 & Q3, building on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [46]:
VIP_list = list(VIP_customers.index)
pref_list = list(Preferred_customers.index)

In [47]:
# For this question we need to filter the customers list and have those that 
# appear on VIP list, and then we can aggregate per country. 

VIP_cust_country = orders[orders['customer_ID'].isin(VIP_list)]

In [48]:
VIP_cust_country.head()

Unnamed: 0,invoice_no,stock_code,year,month,day,hour,description,quantity,invoice_date,unit_price,customer_ID,Country,amount_spent
26,536370,22728,2010,12,3,8,alarm clock bakelike pink,24,2010-12-01 08:45:00,3.75,12583,France,90.0
27,536370,22727,2010,12,3,8,alarm clock bakelike red,24,2010-12-01 08:45:00,3.75,12583,France,90.0
28,536370,22726,2010,12,3,8,alarm clock bakelike green,12,2010-12-01 08:45:00,3.75,12583,France,45.0
29,536370,21724,2010,12,3,8,panda and bunnies sticker sheet,12,2010-12-01 08:45:00,0.85,12583,France,10.2
30,536370,21883,2010,12,3,8,stars gift tape,24,2010-12-01 08:45:00,0.65,12583,France,15.6


In [49]:
VIP_cust_country = VIP_cust_country.groupby('Country').count()

In [50]:
VIP_cust_country.head()

Country,invoice_no,stock_code,year,month,day,hour,description,quantity,invoice_date,unit_price,customer_ID,amount_spent
Australia,898,898,898,898,898,898,898,898,898,898,898,898
Belgium,54,54,54,54,54,54,54,54,54,54,54,54
Channel Islands,364,364,364,364,364,364,364,364,364,364,364,364
Cyprus,248,248,248,248,248,248,248,248,248,248,248,248
Denmark,36,36,36,36,36,36,36,36,36,36,36,36


In [51]:
VIP_cust_country = VIP_cust_country[['customer_ID']]

In [52]:
VIP_cust_country.columns = ['total_counts']

In [53]:
VIP_cust_country = VIP_cust_country.sort_values(by='total_counts', ascending=False )

In [54]:
VIP_cust_country

Country,total_counts
United Kingdom,84185
EIRE,7077
France,3290
Germany,3127
Netherlands,2080
Australia,898
Portugal,681
Switzerland,594
Spain,511
Norway,420


In [55]:
## The United Kingdom ranks first. Note: count() tallies order lines placed by VIP customers, not distinct VIP customers.
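To count each VIP customer once per country rather than once per order line, one option is `nunique()` on the customer column. The sketch below uses an invented toy frame to show how the two counts can rank countries differently; `orders_toy` and `vip_ids` are illustrative stand-ins for `orders` and `VIP_list`.

```python
import pandas as pd

# Invented order lines: one heavy buyer in France, two lighter buyers in Spain.
orders_toy = pd.DataFrame({
    "customer_ID": [1, 1, 1, 1, 2, 3, 4],
    "Country": ["France", "France", "France", "France", "Spain", "Spain", "Spain"],
})
vip_ids = [1, 2, 3]  # stand-in for the VIP customer IDs

vip_rows = orders_toy[orders_toy["customer_ID"].isin(vip_ids)]

# count() tallies order lines per country ...
rows_per_country = vip_rows.groupby("Country")["customer_ID"].count()

# ... while nunique() tallies distinct customers per country.
customers_per_country = (
    vip_rows.groupby("Country")["customer_ID"]
            .nunique()
            .sort_values(ascending=False)
)

print(rows_per_country)
print(customers_per_country)
```

In this toy data France leads on order lines while Spain leads on distinct customers, so the choice of aggregation can change the answer.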

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [56]:
#First of all we add both lists to have all the ids of customers that are in VIP and Pref lists. 

VIP_and_Pref_list = VIP_list + pref_list

In [57]:
VIP_Pref_cust_country = orders[orders['customer_ID'].isin(VIP_and_Pref_list)]

In [58]:
VIP_Pref_cust_country.head()

Unnamed: 0,invoice_no,stock_code,year,month,day,hour,description,quantity,invoice_date,unit_price,customer_ID,Country,amount_spent
0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [59]:
VIP_Pref_cust_country = VIP_Pref_cust_country.groupby('Country').count()

In [61]:
VIP_Pref_cust_country.head()

Country,invoice_no,stock_code,year,month,day,hour,description,quantity,invoice_date,unit_price,customer_ID,amount_spent
Australia,1028,1028,1028,1028,1028,1028,1028,1028,1028,1028,1028,1028
Austria,158,158,158,158,158,158,158,158,158,158,158,158
Belgium,1557,1557,1557,1557,1557,1557,1557,1557,1557,1557,1557,1557
Canada,135,135,135,135,135,135,135,135,135,135,135,135
Channel Islands,589,589,589,589,589,589,589,589,589,589,589,589


In [62]:
VIP_Pref_cust_country = VIP_Pref_cust_country[['customer_ID']]

In [63]:
VIP_Pref_cust_country = VIP_Pref_cust_country.sort_values(by='customer_ID', ascending=False)

In [64]:
VIP_Pref_cust_country.columns = ['Total_customers']

In [66]:

VIP_Pref_cust_country.head(10)

Country,Total_customers
United Kingdom,221635
Germany,7349
EIRE,7238
France,6301
Netherlands,2080
Spain,1569
Belgium,1557
Switzerland,1370
Portugal,1093
Norway,1028


In [None]:
## By this count, the United Kingdom ranks first and Germany second. As in Q2, count() tallies order lines, not distinct customers.
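For the combined group, a hedged alternative is to reduce the order lines to one row per customer and country before counting. The sketch below runs on invented data; `vip_ids` and `preferred_ids` stand in for `VIP_list` and `pref_list`.

```python
import pandas as pd

# Invented order lines across three countries.
orders_toy = pd.DataFrame({
    "customer_ID": [1, 1, 2, 3, 3, 4, 5],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "Germany", "Germany", "Germany", "France"],
})
vip_ids = [1]           # stand-in for the VIP customer IDs
preferred_ids = [3, 4]  # stand-in for the Preferred customer IDs

# A set union merges both groups and guards against duplicate IDs.
selected = set(vip_ids) | set(preferred_ids)

combined_per_country = (
    orders_toy[orders_toy["customer_ID"].isin(selected)]
    .drop_duplicates(subset=["customer_ID", "Country"])  # one row per customer and country
    ["Country"]
    .value_counts()                                      # sorted in descending order by default
)

print(combined_per_country)
```

A customer who ordered from more than one country would be counted once in each of them under this approach.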