<a href="https://colab.research.google.com/github/sachinkun21/Customer_Segmentation-Cohort-Analysis/blob/master/RFM_Analysis(Recency_Frequency_Monetory).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Summary:
1. Introduction
2. Problem Statement

### Introduction
In this Data Tale, you will learn about a customer segmentation technique known as RFM segmentation.

Specifically, you will:
- Learn what recency, frequency and monetary value mean
- Create customer segments based on these concepts
- Analyze the results

#### What is RFM Segmentation?
RFM segmentation is a very popular method of customer segmentation in the retail domain.
It is a behavioral segmentation approach that focuses on the following 3 customer behaviour metrics:
1. **Recency (R)**: Measures how recently the customer made their last purchase.
2. **Frequency (F)**: Measures how many times the customer has purchased during the period for which data was collected.
3. **Monetary (M)**: Measures the total amount the customer has spent during the data collection period.



## 2. Problem Statement
In this Data Tale, we will perform Time Cohort Analysis.
Time-based cohort analysis groups customers by the time they completed their first activity.

The flow of this analysis will be:

We will calculate the 3 RFM metrics for each customer and then use them to group customers into predefined categories based on business requirements or demand, such as High, Medium and Low.

## 3. Data Description:
#### About the dataset.
We will use a randomly subsampled subset of the very popular Online Retail transactional dataset provided by the UCI Machine Learning Repository.


**Data Set Information:**

This is a transactional data set containing the transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail firm. It contains realistic customer transaction information in a format commonly used in industry.


#### Attribute information:

- **InvoiceNo**: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. 
- **StockCode**: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product. 
- **Description**: Product (item) name. Nominal. 
- **Quantity**: The quantities of each product (item) per transaction. Numeric.
- **InvoiceDate**: Invoice date and time. Numeric, the day and time when each transaction was generated. 
- **UnitPrice**: Unit price. Numeric, Product price per unit in sterling. 
- **CustomerID**: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer. 
- **Country**: Country name. Nominal, the name of the country where each customer resides.

#### Grouping criteria using RFM values

To group our customers on any business metric, we need to decide on the number of groups as well as the thresholds (limits) of that metric for each group.

There are many ways to decide this. Some popular approaches are:

- Equal percentiles: We can divide our customers into groups of equal size, i.e. percentiles, and assign them corresponding labels. For example, if we select 4 groups, then each group will contain 25% of all customers.

- Pareto 80:20 cut: The Pareto principle states that 80 percent of outcomes come from 20 percent of inputs. This can be extended to the e-commerce domain with the assumption that 80% of revenue is generated by the top 20% of customers. The 80:20 rule can therefore be used as a criterion to segment customers into two groups: the top 20%, i.e. high-value customers, and the remaining 80%, i.e. low-value customers.

- Custom, based on business knowledge: We can also use existing business knowledge or consult subject matter experts (SMEs) to decide the thresholds and the number of segments to create.
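As a rough illustration of the first two approaches, here is a minimal sketch on made-up spend figures. The customer labels and amounts are hypothetical and are not taken from the dataset; `pd.qcut` is one possible way to build equal-sized groups, and a quantile threshold is one possible way to make the 80:20 cut.

```python
import pandas as pd

# Hypothetical total spend per customer (illustrative values, not from the dataset)
spend = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000],
    index=[f"C{i}" for i in range(1, 11)],
    name="MonetaryValue",
)

# Equal percentiles: 4 groups of roughly equal size, labelled 1 (lowest) to 4 (highest)
quartile = pd.qcut(spend, q=4, labels=[1, 2, 3, 4])

# Pareto-style cut: customers above the 80th percentile of spend are "High", the rest "Low"
threshold = spend.quantile(0.80)
pareto = spend.gt(threshold).map({True: "High", False: "Low"})

print(pd.DataFrame({"spend": spend, "quartile": quartile, "pareto": pareto}))
```

With only 10 customers the quartile groups cannot be exactly equal in size; on a real customer base the groups come out much closer to 25% each.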

### Importing Libraries and transaction Dataset from Drive into Dataframe

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [10]:
ls /content/drive/'My Drive'/DataScience

Employee.csv  Employee_HR.csv  transaction.csv


The `transaction.csv` dataset is present, so we will use it for this Data Tale.

Let's import the necessary libraries and Dataset:

In [18]:
# linear algebra
import numpy as np
# data processing
import pandas as pd

# data visualization
import seaborn as sns
%matplotlib inline
from matplotlib import pyplot as plt
from matplotlib import style

# Read transaction.csv into a DataFrame, parsing the InvoiceDate column as datetime instead of object (string)
# and using the 'Unnamed: 0' column as the index
df = pd.read_csv('drive/My Drive/DataScience/transaction.csv', parse_dates=['InvoiceDate'], index_col='Unnamed: 0')

# Printing top 5 rows of Dataframe
df.head()

| Unnamed: 0 | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 416792 | 572558 | 22745 | POPPY'S PLAYHOUSE BEDROOM | 6 | 2011-10-25 08:26:00 | 2.1 | 14286 | United Kingdom |
| 482904 | 577485 | 23196 | VINTAGE LEAF MAGNETIC NOTEPAD | 1 | 2011-11-20 11:56:00 | 1.45 | 16360 | United Kingdom |
| 263743 | 560034 | 23299 | FOOD COVER WITH BEADS SET 2 | 6 | 2011-07-14 13:35:00 | 3.75 | 13933 | United Kingdom |
| 495549 | 578307 | 72349B | SET/6 PURPLE BUTTERFLY T-LIGHTS | 1 | 2011-11-23 15:53:00 | 2.1 | 17290 | United Kingdom |
| 204384 | 554656 | 21756 | BATH BUILDING BLOCK WORD | 3 | 2011-05-25 13:36:00 | 5.95 | 17663 | United Kingdom |


Let's check the data types and columns of the dataset:

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 70864 entries, 416792 to 312243
Data columns (total 8 columns):
InvoiceNo      70864 non-null int64
StockCode      70864 non-null object
Description    70864 non-null object
Quantity       70864 non-null int64
InvoiceDate    70864 non-null object
UnitPrice      70864 non-null float64
CustomerID     70864 non-null int64
Country        70864 non-null object
dtypes: float64(1), int64(3), object(4)
memory usage: 4.9+ MB


Now we will calculate the RFM values for each customer.


Dataset and preparations:

- We use the same online retail dataset as in the previous lessons.
- Some data preparation is needed first.
- We add a new TotalSum column = Quantity x UnitPrice.

In [19]:
print('Min:{}; Max:{}'.format(min(df.InvoiceDate),max(df.InvoiceDate)))

Min:2010-12-01 08:26:00; Max:2011-12-09 12:49:00


### Let's create a hypothetical snapshot date, as if we were running the analysis one day after the last recorded transaction

In [20]:
import datetime
snapshot_date = max(df.InvoiceDate) + datetime.timedelta(days=1)
snapshot_date

Timestamp('2011-12-10 12:49:00')

### Calculating RFM metrics:
- First, we will group our dataset by CustomerID.
- Then, for each unique CustomerID, we will aggregate to calculate the RFM values as described below:
  1. **Recency:** On the InvoiceDate column, we will select the maximum invoice date of each customer, i.e. the latest transaction date, and subtract it from our date of analysis, i.e. the snapshot_date value.

      This gives us the recency in days, i.e. the number of days since the last transaction.
  2. **Frequency:** For frequency, we will count the InvoiceNo entries recorded for each customer.

      This gives us the number of times the customer has shopped with us in the last 12 months.

  3. **MonetaryValue:** For the monetary value metric, we will use the TotalSum column (Quantity x UnitPrice) and sum it for each customer.

      This gives us the total revenue generated per customer.

In [0]:
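# Revenue per line item (Quantity x UnitPrice), needed for the monetary metric
df['TotalSum'] = df['Quantity'] * df['UnitPrice']

# Aggregate data on a customer level
datamart = df.groupby(['CustomerID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,  # Recency: days since last purchase
    'InvoiceNo': 'count',                                      # Frequency: rows recorded per customer
    'TotalSum': 'sum'})                                        # MonetaryValue: total spend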
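One detail worth keeping in mind: `'count'` on InvoiceNo counts rows, so an invoice containing several products is counted several times. If frequency should reflect the number of distinct invoices instead, `'nunique'` is a possible alternative. The sketch below shows the difference on a tiny made-up table; all values and the `toy`, `snapshot` and `rfm` names are hypothetical, and the named-aggregation syntax is simply one convenient way to get descriptive column names.

```python
import pandas as pd

# Toy transactions (hypothetical values): customer 1 has two line items on invoice 100
toy = pd.DataFrame({
    "CustomerID":  [1, 1, 1, 2],
    "InvoiceNo":   [100, 100, 101, 102],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-01", "2011-12-05", "2011-11-20"]),
    "TotalSum":    [5.0, 7.5, 2.5, 20.0],
})

# Snapshot date: one day after the latest transaction in the toy data
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda x: (snapshot - x.max()).days),  # days since last purchase
    Frequency=("InvoiceNo", "nunique"),  # distinct invoices; "count" would give 3 for customer 1
    MonetaryValue=("TotalSum", "sum"),   # total spend
)
print(rfm)
```

Whether rows or distinct invoices is the better definition of frequency depends on the business question, so treat this as an option rather than a correction.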

Process of calculating percentiles:

1. Sort customers based on the metric.
2. Break customers into a predefined number of groups of equal size.
3. Assign a label to each group.

The highest score goes to the best value of the metric, and best is not always highest. Recency is the typical example: the more recent the customer's last purchase (i.e. the lower the recency value), the better, so the labels are assigned in inverse order.
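The steps above can be sketched with `pd.qcut`, which sorts the values into equal-sized bins and attaches a label to each bin. The metric values below are hypothetical, and the 4-group split and the `R` and `M` column names are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical per-customer metrics (illustrative values only)
rfm = pd.DataFrame(
    {"Recency": [2, 10, 35, 60, 120, 200, 300, 365],
     "MonetaryValue": [900.0, 40.0, 350.0, 75.0, 500.0, 20.0, 150.0, 10.0]},
    index=pd.Index(range(1, 9), name="CustomerID"),
)

# Recency: lower is better, so the labels run from 4 down to 1
rfm["R"] = pd.qcut(rfm["Recency"], q=4, labels=[4, 3, 2, 1]).astype(int)

# Monetary value: higher is better, so the labels run from 1 up to 4
rfm["M"] = pd.qcut(rfm["MonetaryValue"], q=4, labels=[1, 2, 3, 4]).astype(int)

print(rfm)
```

Here the two most recent customers receive an `R` score of 4, while the two biggest spenders receive an `M` score of 4, so a higher score means a better customer on both metrics.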