# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers** whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (a.k.a. the 0.95 quantile). The second group is **Preferred Customers** whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, extract and import the `Orders` dataset into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [2]:
# your code here
orders = pd.read_csv("Orders.csv")
orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [3]:
orders.shape

(397924, 14)

In [4]:
orders.columns

Index(['Unnamed: 0', 'InvoiceNo', 'StockCode', 'year', 'month', 'day', 'hour',
       'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID',
       'Country', 'amount_spent'],
      dtype='object')

In [5]:
orders['amount_spent'].sum()

8911407.904

We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

In [6]:
# your code here
#Sub Problem 1: 

In [125]:
# Identify unique customers and sum the total amount spent for each one
customer_spent = orders.groupby(['CustomerID']).agg({'amount_spent':'sum'})
customer_spent


Unnamed: 0_level_0,amount_spent
CustomerID,Unnamed: 1_level_1
12346,77183.60
12347,4310.00
12348,1797.24
12349,1757.55
12350,334.40
...,...
18280,180.60
18281,80.82
18282,178.05
18283,2094.88


In [127]:
# Attach each customer's total to every order line; the shared amount_spent column gets _x (total) and _y (line) suffixes
a = customer_spent.merge(orders, on='CustomerID' )
a

Unnamed: 0.1,CustomerID,amount_spent_x,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,Country,amount_spent_y
0,12346,77183.60,61619,541431,23166,2011,1,2,10,medium ceramic top storage jar,74215,2011-01-18 10:01:00,1.04,United Kingdom,77183.60
1,12347,4310.00,14938,537626,85116,2010,12,2,14,black candelabra t-light holder,12,2010-12-07 14:57:00,2.10,Iceland,25.20
2,12347,4310.00,14939,537626,22375,2010,12,2,14,airline bag vintage jet set brown,4,2010-12-07 14:57:00,4.25,Iceland,17.00
3,12347,4310.00,14940,537626,71477,2010,12,2,14,colour glass. star t-light holder,12,2010-12-07 14:57:00,3.25,Iceland,39.00
4,12347,4310.00,14941,537626,22492,2010,12,2,14,mini paint set vintage,36,2010-12-07 14:57:00,0.65,Iceland,23.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
397919,18287,1837.28,392752,570715,22419,2011,10,3,10,lipstick pen red,12,2011-10-12 10:23:00,0.42,United Kingdom,5.04
397920,18287,1837.28,392753,570715,22866,2011,10,3,10,hand warmer scotty dog design,12,2011-10-12 10:23:00,2.10,United Kingdom,25.20
397921,18287,1837.28,423939,573167,23264,2011,10,5,9,set of 3 wooden sleigh decorations,36,2011-10-28 09:29:00,1.25,United Kingdom,45.00
397922,18287,1837.28,423940,573167,21824,2011,10,5,9,painted metal star with holly bells,48,2011-10-28 09:29:00,0.39,United Kingdom,18.72
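
In the merge above, both tables contain an `amount_spent` column, so pandas appends the `_x` and `_y` suffixes. A minimal sketch of the same idea on toy data, where the total gets an explicit name before merging, might look like the following. The column name `total_spent` and the toy rows are assumptions for illustration, not part of the `Orders` data.

```python
import pandas as pd

# Toy order lines (hypothetical values, not taken from Orders.csv)
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "Country": ["France", "France", "Spain"],
    "amount_spent": [10.0, 20.0, 5.0],
})

# One total per customer, renamed so it does not clash with the per-line amount
customer_total = (
    orders.groupby("CustomerID")["amount_spent"]
    .sum()
    .rename("total_spent")  # hypothetical column name
    .reset_index()
)

# validate="many_to_one" raises an error if a CustomerID repeats on the right side,
# which guards against accidentally multiplying rows
orders_with_total = orders.merge(
    customer_total, on="CustomerID", how="left", validate="many_to_one"
)
print(orders_with_total)
```

The row count of `orders_with_total` should match that of `orders`, which is a quick sanity check that the merge only added a column.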


In [8]:
# Sub Problem 2: How to select customers whose aggregated amount_spent is in a given quantile range?
# Calculate the 75th and 95th percentiles
customer_spent.quantile([0.75,0.95])

Unnamed: 0,amount_spent
0.75,1661.64
0.95,5840.182
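
The quantile table above is displayed rounded, so copying its values by hand can shift the cut-offs slightly. One possible way to keep the thresholds as variables and select customers in a quantile range is sketched below on toy data. The toy totals are assumptions; the `>=` comparison at the 95th percentile follows the convention used in this notebook.

```python
import pandas as pd

# Toy per-customer totals (hypothetical values standing in for customer_spent)
customer_spent = pd.DataFrame(
    {"amount_spent": [100.0, 200.0, 300.0, 400.0, 500.0,
                      600.0, 700.0, 800.0, 900.0, 1000.0]},
    index=pd.Index(range(1, 11), name="CustomerID"),
)

spent = customer_spent["amount_spent"]

# Keep the unrounded thresholds in variables instead of retyping them
q75, q95 = spent.quantile([0.75, 0.95])

# Customers between the 75th (inclusive) and 95th (exclusive) percentiles
in_band = customer_spent[(spent >= q75) & (spent < q95)]

# Customers at or above the 95th percentile
above_95 = customer_spent[spent >= q95]

print(q75, q95)
print(in_band)
print(above_95)
```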


In [52]:
# Create a function that assigns a customer type based on the aggregated amount spent.
# The thresholds are the 0.75 and 0.95 quantiles calculated above.
def cust_type(amount_spent):
    if amount_spent >= 5840.182:
        return 'VIP Customers'
    elif 1661.640 <= amount_spent < 5840.182:
        return 'Preferred Customers'
    return None  # below the 75th percentile: no label

In [53]:
#Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?
#Customer type assignment for each client
customer_spent['Customer Type'] = customer_spent['amount_spent'].apply(cust_type)


In [100]:
customer_spent.sample(10)


Unnamed: 0_level_0,amount_spent,Customer Type
CustomerID,Unnamed: 1_level_1,Unnamed: 2_level_1
16326,3110.96,Preferred Customers
12875,343.23,
15333,1028.56,
17736,377.44,
17238,3744.65,Preferred Customers
17293,1875.11,Preferred Customers
12623,305.1,
13591,1117.13,
16211,547.02,
16010,407.5,


In [12]:
customer_spent.shape

(4339, 2)

In [13]:
# Count how many VIP Customers vs Preferred Customers there are
customer_spent.groupby(['Customer Type']).agg({'Customer Type':'count'})

Unnamed: 0_level_0,Customer Type
Customer Type,Unnamed: 1_level_1
Preferred Customers,868
VIP Customers,217
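
The labelling in Sub Problem 3 can also be done without a row-wise `apply`. The sketch below uses `np.select` on toy data; the thresholds are the values shown in the quantile output above, while the toy totals and the `"Other"` label for unlabelled customers are assumptions made for illustration (the notebook leaves those customers as `None`).

```python
import numpy as np
import pandas as pd

# Toy per-customer totals (hypothetical values standing in for customer_spent)
customer_spent = pd.DataFrame(
    {"amount_spent": [50.0, 1700.0, 3000.0, 6000.0, 900.0]},
    index=pd.Index([1, 2, 3, 4, 5], name="CustomerID"),
)

# Thresholds as displayed in the quantile output; ideally computed, not retyped
q75, q95 = 1661.64, 5840.182

spent = customer_spent["amount_spent"]

# Conditions are checked in order, so the VIP test must come first
conditions = [spent >= q95, spent >= q75]
labels = ["VIP Customers", "Preferred Customers"]

# "Other" is a hypothetical label for customers below the 75th percentile
customer_spent["Customer Type"] = np.select(conditions, labels, default="Other")

print(customer_spent["Customer Type"].value_counts())
```

Note that with an explicit `"Other"` label, a plain `count` per country would include every customer, so later filters would need to select the VIP and Preferred labels explicitly.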


Now we'll leave it to you to solve Q2 & Q3, for which you can build on your solution for Q1:

## Q2: How to identify which country has the most VIP Customers?

In [86]:
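# One way to do it; still needs work because the join is not working correctly
# (DataFrame.join matches the `on` column against the index of `orders`, not its CustomerID column)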
#customer_countries = customer_spent.reset_index().join(orders, on="CustomerID",lsuffix='_left')
#customer_countries.head(3)
#customer_countries.shape
#customer_countries1 = customer_countries[['CustomerID','Customer Type','Country','amount_spent_left']]
#customer_countries1
#customer_countries1['amount_spent_left']

In [99]:
# your code here
# Q2:Option1
#Create a dataframe for Countries
country_analysis = orders[['CustomerID','Country','amount_spent']]
# QUESTION: after the groupby below, which includes Country, the shape shows that a few extra rows were created,
# which could distort a count by Country. Ideally customer_spent would get a vlookup-style lookup that only
# adds the country, but I could not do that with join or merge against orders without creating more rows.
country_analysis = country_analysis.groupby(['CustomerID','Country']).agg({'amount_spent':'sum'}).reset_index()
country_analysis['Customer Type'] =country_analysis['amount_spent'].apply(cust_type)
print(country_analysis.shape)

country_analysis[country_analysis["Customer Type"] == "VIP Customers"].groupby(['Country']).agg({'CustomerID':'count'}).reset_index()


# Answer: United Kingdom with 177




(4347, 4)


Unnamed: 0,Country,CustomerID
0,Australia,1
1,Channel Islands,1
2,Cyprus,1
3,Denmark,1
4,EIRE,2
5,Finland,1
6,France,9
7,Germany,10
8,Japan,2
9,Netherlands,1
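
The shape `(4347, 4)` printed above is larger than the 4339 customers found in Q1, which suggests that a few customers appear under more than one country. A minimal sketch for checking this, and for attaching a single country to each customer without creating extra rows, is shown below on toy data. Choosing the country where the customer spent the most is an assumption made here, not a rule given by the challenge.

```python
import pandas as pd

# Toy order lines (hypothetical); customer 2 has orders in two countries
orders = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2, 3],
    "Country": ["France", "France", "Spain", "Spain", "Belgium", "Japan"],
    "amount_spent": [10.0, 20.0, 5.0, 5.0, 30.0, 7.0],
})

# How many distinct countries does each customer appear under?
countries_per_customer = orders.groupby("CustomerID")["Country"].nunique()
multi_country = countries_per_customer[countries_per_customer > 1]

# One total per customer, independent of country
customer_spent = orders.groupby("CustomerID").agg({"amount_spent": "sum"})

# Assumption: keep, for each customer, the country with the highest spend
spend_by_country = orders.groupby(["CustomerID", "Country"])["amount_spent"].sum()
main_country = (
    spend_by_country.reset_index()
    .sort_values("amount_spent", ascending=False)
    .drop_duplicates("CustomerID")  # keeps the top-spend row per customer
    .set_index("CustomerID")["Country"]
)

# Index-aligned join: exactly one row per customer
customer_country = customer_spent.join(main_country)
print(multi_country)
print(customer_country)
```

With one row per customer, the customer type can be assigned from the full aggregated total rather than from a per-country partial sum.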


In [109]:
country_analysis1 

Unnamed: 0,CustomerID,Country,amount_spent
0,17850,United Kingdom,15.30
1,17850,United Kingdom,20.34
2,17850,United Kingdom,22.00
3,17850,United Kingdom,20.34
4,17850,United Kingdom,20.34
...,...,...,...
397919,12680,France,10.20
397920,12680,France,12.60
397921,12680,France,16.60
397922,12680,France,16.60


In [18]:
# Q2:Option2
vipdf = country_analysis[country_analysis["Customer Type"] == "VIP Customers"]
vipdf['Country'].value_counts()

United Kingdom     177
Germany             10
France               9
Switzerland          3
Spain                2
Japan                2
Portugal             2
EIRE                 2
Australia            1
Finland              1
Norway               1
Denmark              1
Cyprus               1
Singapore            1
Netherlands          1
Channel Islands      1
Sweden               1
Name: Country, dtype: int64

In [101]:
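# Same count for Preferred Customers (separate variable name so it is not confused with the VIP dataframe)
preferreddf = country_analysis[country_analysis["Customer Type"] == "Preferred Customers"]
preferreddf['Country'].value_counts()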

United Kingdom     755
Germany             29
France              20
Belgium             11
Norway               6
Switzerland          6
Spain                5
Italy                5
Portugal             5
Finland              4
Channel Islands      3
Australia            3
Denmark              2
Cyprus               2
Israel               2
Japan                2
EIRE                 1
Canada               1
Greece               1
Iceland              1
Poland               1
Lebanon              1
Sweden               1
Austria              1
Malta                1
Name: Country, dtype: int64

## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [102]:
country_analysis

Unnamed: 0,CustomerID,Country,amount_spent,Customer Type
0,12346,United Kingdom,77183.60,VIP Customers
1,12347,Iceland,4310.00,Preferred Customers
2,12348,Finland,1797.24,Preferred Customers
3,12349,Italy,1757.55,Preferred Customers
4,12350,Norway,334.40,
...,...,...,...,...
4342,18280,United Kingdom,180.60,
4343,18281,United Kingdom,80.82,
4344,18282,United Kingdom,178.05,
4345,18283,United Kingdom,2094.88,Preferred Customers


In [103]:
# Validate the unique values of Customer Type: there are 3. The None at the end does not count as a label
# because it is empty, but in other cases it could cause problems in calculations.
country_analysis['Customer Type'].unique()

array(['VIP Customers', 'Preferred Customers', None], dtype=object)

In [104]:
# First way to do it:
# Filter the rows of country_analysis labelled VIP Customers or Preferred Customers into a new dataframe,
# VIP_Preferred, and then count the customers in each country
VIP_Preferred = country_analysis[(country_analysis['Customer Type'] == 'VIP Customers') | (country_analysis['Customer Type'] == 'Preferred Customers')]
VIP_Preferred['Country'].value_counts()

United Kingdom     932
Germany             39
France              29
Belgium             11
Switzerland          9
Spain                7
Portugal             7
Norway               7
Italy                5
Finland              5
Channel Islands      4
Australia            4
Japan                4
Cyprus               3
Denmark              3
EIRE                 3
Israel               2
Sweden               2
Singapore            1
Lebanon              1
Poland               1
Iceland              1
Greece               1
Netherlands          1
Austria              1
Canada               1
Malta                1
Name: Country, dtype: int64

In [105]:
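# Second way to do it:
# your code here
# 'count' ignores missing values, so only labelled (VIP or Preferred) customers are counted per country
country_analysis.groupby(['Country']).agg({'Customer Type':'count'}).sort_values('Customer Type', ascending = False)
# Answer: United Kingdom with 932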


Unnamed: 0_level_0,Customer Type
Country,Unnamed: 1_level_1
United Kingdom,932
Germany,39
France,29
Belgium,11
Switzerland,9
Spain,7
Portugal,7
Norway,7
Italy,5
Finland,5
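
Q2 and Q3 can also be answered from a single table. The sketch below builds a country by customer type cross-tabulation with `pd.crosstab` on toy data and adds a combined column. The toy rows and the column name `Combined` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for country_analysis (hypothetical rows)
country_analysis = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6, 7],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom",
                "Germany", "Germany", "France", "France"],
    "Customer Type": ["VIP Customers", "VIP Customers", "Preferred Customers",
                      "VIP Customers", None, "Preferred Customers", None],
})

# Keep only customers that received a label
labelled = country_analysis.dropna(subset=["Customer Type"])

# Rows: countries, columns: customer types, values: number of customers
summary = pd.crosstab(labelled["Country"], labelled["Customer Type"])

# "Combined" is a hypothetical column name for VIP + Preferred
summary["Combined"] = summary.sum(axis=1)
summary = summary.sort_values("Combined", ascending=False)

top_vip = summary["VIP Customers"].idxmax()   # country with the most VIP customers
top_combined = summary["Combined"].idxmax()   # country with the most VIP + Preferred
print(summary)
print(top_vip, top_combined)
```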
