  **Customer Segmentation using RFM analysis**
* Successful companies know their customers well enough to anticipate their needs.
* Segmenting customers helps to serve them better.
* This notebook uses an online retail e-commerce dataset containing anonymized customer transactions.
* The customer segments are built using RFM analysis.
* The goal is a practical understanding of customer behavioral analytics and segmentation techniques.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# import required libraries
import pandas as pd
import numpy as np
from datetime import datetime as dt
import seaborn as sns # statistical plotting built on matplotlib
import matplotlib.pyplot as plt

In [None]:
# Import the dataset

data = pd.read_csv('../input/online-retail-ii-uci/online_retail_II.csv')

# Check the first 5 rows

data.head()

**Let's do some basic data preparation**

In [None]:
# Number of records and columns
data.shape

In [None]:
#check for any duplicate record first

data.duplicated().sum()

In [None]:
# Remove duplicate records
data = data[~data.duplicated()]

# Check the number of records now

data.shape

In [None]:
# check for the missing values 

total = data.isnull().sum().sort_values(ascending=False)

percent = (data.isnull().sum()/data.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

missing_data.head(10)


About 22% of the Customer ID values are missing, along with a small number of Description values. A missing customer ID cannot be imputed from the dataset, so those records are removed.
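As a small illustration of this step, the sketch below computes the share of missing values per column and drops the rows without a customer ID. The `toy` frame and its values are hypothetical; only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the real dataset
toy = pd.DataFrame({
    'Customer ID': [13085.0, np.nan, 13078.0, np.nan, 15362.0],
    'Description': ['MUG', 'CANDLE', None, 'BAG', 'LAMP'],
    'Quantity': [12, 6, 4, 2, 8],
})

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Rows without a customer ID cannot be attributed to anyone, so drop them
toy_clean = toy.dropna(subset=['Customer ID'])
print(toy_clean.shape)
```

`isnull().mean()` is a shorter equivalent of dividing the null counts by the row count.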

In [None]:
# Records without a customer ID are not needed, so drop them from the dataframe

data = data.dropna(axis=0, subset=['Customer ID'])

data.shape  # the number of records is reduced now

In [None]:
# check the data types of variables 
data.info()

In [None]:
# Convert Customer ID to string and InvoiceDate to a datetime object

data['Customer ID'] = data['Customer ID'].astype(str)

data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])

In [None]:
# Check the descriptive statistics of the numerical variables

data.describe()

The data is not normalized: there are large differences between the minimum and maximum values and large standard deviations, so the variables would need to be unskewed before applying any machine learning technique.

Some Quantity values are negative, which needs further investigation.
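One way to investigate is to check whether negative quantities coincide with invoice numbers that start with the letter C. The sketch below does this on hypothetical toy rows; the invoice numbers and quantities are invented for illustration.

```python
import pandas as pd

# Hypothetical toy rows; invoice numbers and quantities are invented
toy = pd.DataFrame({
    'Invoice': ['489434', 'C489449', '489435', 'C489459'],
    'Quantity': [12, -12, 6, -3],
})

# Flag rows with a negative quantity and rows whose invoice starts with 'C'
negative = toy['Quantity'] < 0
cancelled = toy['Invoice'].astype(str).str.startswith('C')

# Cross-tabulate the two flags to see whether they coincide
print(pd.crosstab(negative, cancelled))
```

If the two flags agree on every row, filtering on the invoice prefix also removes the negative quantities.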

In [None]:
# There are negative quantities, and the corresponding Invoice numbers start with the letter C (cancelled items)

display(data.sort_values('Quantity')[:5])

In [None]:
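# Remove transactions with cancelled items.
# .copy() makes data_new an independent dataframe so that new columns can be added safely later

data_new = data[~data.Invoice.str.contains('C', na=False)].copy()

# Check the number of records, which is further reduced

data_new.shape 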

In [None]:
# Now there are no negative values in the Quantity variable

display(data_new.sort_values('Quantity')[:3])

In [None]:
# Check which countries have the largest number of customers

customer_country=data_new[['Country','Customer ID']].drop_duplicates()

customer_country.groupby(['Country'])['Customer ID'].aggregate('count').reset_index().sort_values('Customer ID', ascending=False)

More than 90% of the customers in the data are from the United Kingdom.
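The share per country can also be read off directly as a fraction. This is a minimal sketch on hypothetical toy data, so the resulting numbers are not those of the real dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per transaction line
toy = pd.DataFrame({
    'Customer ID': ['1', '1', '2', '3', '4', '4'],
    'Country': ['United Kingdom', 'United Kingdom', 'United Kingdom',
                'France', 'United Kingdom', 'United Kingdom'],
})

# Keep one row per (country, customer) pair so each customer is counted once
pairs = toy[['Country', 'Customer ID']].drop_duplicates()

# normalize=True returns fractions instead of raw counts
share = pairs['Country'].value_counts(normalize=True)
print(share)
```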

**RFM analysis: how recently, how often, and how much customers buy.**

In [None]:
# Keep only the date part of InvoiceDate

from datetime import timedelta  # used below to build the snapshot date

data_new['InvoiceDate'] = pd.to_datetime(data_new['InvoiceDate']).dt.date

In [None]:
# oldest and latest date 
print('Min:{}; Max:{}'.format(min(data_new.InvoiceDate),
max(data_new.InvoiceDate)))   

In [None]:
# Create a hypothetical snapshot date, one day after the latest invoice, as if the analysis were run on that day.

snapshot_date = max(data_new.InvoiceDate) + timedelta(days=1)
print(snapshot_date)

In [None]:
# Create a new column TotalSum = Quantity x Price.

data_new['TotalSum'] = data_new['Quantity'] * data_new['Price']

In [None]:
data_new.head() # to check new column 

**RFM Customer Segmentation**

Let's calculate the RFM metrics to build a dataset with Recency, Frequency, and Monetary values.

This creates three customer-level features (Recency, Frequency, and MonetaryValue) that summarize each customer's purchasing behavior.

These features capture highly predictive customer behavior patterns.

In [None]:
# Aggregate data on a customer level
# Note: 'count' on Invoice counts invoice lines (rows), not distinct invoices
datamart = data_new.groupby(['Customer ID']).agg({
    'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
    'Invoice': 'count',
    'TotalSum': 'sum'})

In [None]:
# Rename the columns to Recency, Frequency and MonetaryValue for easier interpretation
datamart.rename(columns = {'InvoiceDate': 'Recency','Invoice': 'Frequency','TotalSum': 'MonetaryValue'}, inplace=True)

In [None]:
# check the first few rows
datamart.head()

* Next, quartile values are calculated for each column (Recency, Frequency and MonetaryValue) and named R, F and M.
* Recency - days since the customer's last transaction
* Frequency - number of transaction records (invoice lines) for the customer in the dataset
* Monetary Value - total spend of the customer in the dataset
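To make the three definitions concrete, here is a minimal sketch on a hypothetical toy table. The customers, dates and amounts are invented, and it uses pandas Timestamps and named aggregation rather than the date objects and dict-style aggregation used in this notebook.

```python
import pandas as pd

# Hypothetical toy transactions; one row per invoice line
toy = pd.DataFrame({
    'Customer ID': ['A', 'A', 'B', 'B', 'B', 'C'],
    'Invoice': ['1', '2', '3', '3', '4', '5'],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-08', '2011-10-01',
                                   '2011-10-01', '2011-11-20', '2011-06-15']),
    'TotalSum': [20.0, 35.0, 10.0, 5.0, 50.0, 100.0],
})

# Snapshot date: one day after the latest invoice in the data
snapshot = toy['InvoiceDate'].max() + pd.Timedelta(days=1)

rfm = toy.groupby('Customer ID').agg(
    Recency=('InvoiceDate', lambda x: (snapshot - x.max()).days),  # days since last purchase
    Frequency=('Invoice', 'count'),                                # number of invoice lines
    MonetaryValue=('TotalSum', 'sum'),                             # total spend
)
print(rfm)
```

Customer A bought most recently, so A has the smallest Recency; customer B has the most invoice lines, so B has the highest Frequency.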

In [None]:
 
# Recency labels with a decreasing range of 4 through 1

r_labels = range(4, 0, -1)  # labels 4, 3, 2, 1

# Create recency quartiles with 4 groups and pass the previously created labels 
recency_quartiles = pd.qcut(datamart['Recency'], q=4, labels=r_labels)

# Assign the quartile values to the Recency_Quartile column in `datamart`
datamart['Recency_Quartile'] = recency_quartiles 

# Print `datamart` sorted by Recency values
print(datamart.sort_values('Recency'))              

The most recent customers are assigned 4 in Recency_Quartile and the least recent are assigned 1 (which labeling is appropriate depends on the business goal).
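A quick check of how the reversed labels behave, as a sketch on hypothetical recency values: `pd.qcut` assigns the first label to the lowest quartile, so with labels 4, 3, 2, 1 the smallest Recency values receive 4.

```python
import pandas as pd

# Hypothetical recency values in days, sorted from most to least recent
recency = pd.Series([1, 3, 10, 20, 45, 90, 180, 365], name='Recency')

# Reversed labels: the lowest-recency quartile gets 4, the highest gets 1
r_quartile = pd.qcut(recency, q=4, labels=list(range(4, 0, -1)))

print(pd.DataFrame({'Recency': recency, 'R': r_quartile}))
```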

In [None]:
# Create labels for the Frequency and Monetary quartiles
f_labels = range(1,5)
m_labels = range(1,5)

# Assign these labels to 4 equal percentile groups based on Frequency
f_quartiles = pd.qcut(datamart['Frequency'], 4, labels = f_labels)

# Assign these labels to 4 equal percentile groups based on MonetaryValue
m_quartiles = pd.qcut(datamart['MonetaryValue'], 4, labels = m_labels)

# Create new columns F and M

datamart = datamart.assign(F = f_quartiles.values)
datamart = datamart.assign(M = m_quartiles.values)

In [None]:
# Check the first few rows with the new columns
datamart.head()

* Now build the RFM Segment and the RFM Score.
* RFM_Segment is the concatenation of the R, F and M quartile values.
* RFM_Score is the sum of the R (here Recency_Quartile), F and M quartile values.

In [None]:
# Concatenate the RFM quartile values, converted to strings, into RFM_Segment

def join_rfm(x):
    return str(x['Recency_Quartile']) + str(x['F']) + str(x['M'])

In [None]:
# Build the RFM_Segment string for each customer from the Recency, Frequency and Monetary quartiles

datamart['RFM_Segment'] = datamart.apply(join_rfm, axis=1)

# Sum the RFM quartile values into RFM_Score

datamart['RFM_Score'] = datamart[['Recency_Quartile','F','M']].sum(axis=1)

In [None]:

datamart.head()

The final result has two new columns: RFM_Segment, the concatenation of the R, F and M quartile values, and RFM_Score, their sum. For example, a customer with quartile values 2, 2 and 4 has RFM_Score 2 + 2 + 4 = 8.
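The same two columns can be sketched on a hypothetical toy frame. The column names `R`, `F`, `M` and the values are illustrative, and the string concatenation is vectorized here instead of using a row-wise `apply`.

```python
import pandas as pd

# Hypothetical quartile values for three customers
toy = pd.DataFrame({'R': [2, 4, 1], 'F': [2, 4, 1], 'M': [4, 4, 1]},
                   index=['A', 'B', 'C'])

# Segment: the three quartile digits joined as a string, e.g. '224'
toy['RFM_Segment'] = toy['R'].astype(str) + toy['F'].astype(str) + toy['M'].astype(str)

# Score: the sum of the three quartile values, e.g. 2 + 2 + 4 = 8
toy['RFM_Score'] = toy[['R', 'F', 'M']].sum(axis=1)

print(toy)
```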

In [None]:
# Analyzing RFM segments : Largest 10 RFM segments

datamart.groupby('RFM_Segment').size().sort_values(ascending=False)[:10]

In [None]:
# Filtering on RFM segments 

# Select bottom RFM segment "111" and view top 5 rows

datamart[datamart['RFM_Segment']=='111'][:5]

Selecting customers by their score is a simple way to assess the worth of different customers and to find, for example, our top 10 customers.

In [None]:
#Summary metrics per RFM Score

datamart.groupby('RFM_Score').agg({'Recency': 'mean','Frequency': 'mean','MonetaryValue': ['mean', 'count'] }).round(1)


* Each mean RFM value improves as the RFM score increases.

* For example, for an RFM_Score of 12.0 the mean Recency is 10.7, the mean Frequency is 560.7 and the mean MonetaryValue is 14708.7.

* In conclusion, customers with a high mean Frequency also have a high mean MonetaryValue, a higher RFM_Score and a lower mean Recency.

* A high RFM_Score therefore corresponds to very recent customers with high monetary value.

* Finally, the customers are grouped into named segments.

* The custom segments are based on the RFM score.

* A function builds the segmentation and is then applied to each customer.

In [None]:
# Use the RFM score to group customers into Gold, Silver and Bronze segments.

def segment_me(df):
    if df['RFM_Score'] >= 9:
        return 'Gold'
    elif df['RFM_Score'] >= 5:
        return 'Silver'
    else:
        return 'Bronze'

In [None]:
# Create a new column called Segment_level

datamart['Segment_level'] = datamart.apply(segment_me, axis=1)

In [None]:
#Analyze average values of Recency, Frequency and MonetaryValue for the custom segments created.

datamart.groupby('Segment_level').agg({'Recency': 'mean','Frequency': 'mean','MonetaryValue': ['mean', 'count']}).round(1)

Comparing the mean RFM values across the custom segment levels shows that customers in the Gold segment have the lowest mean Recency (the most recent purchases) and the highest mean Frequency and MonetaryValue.
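The Gold, Silver and Bronze thresholds could also be expressed with `pd.cut` instead of a row-wise function. This is only a sketch on hypothetical scores, and it assumes integer scores between 3 and 12, which is the range produced by summing three quartile values.

```python
import pandas as pd

# Hypothetical RFM scores; three quartiles of 1..4 sum to a value in 3..12
scores = pd.Series([3, 4, 5, 8, 9, 12], name='RFM_Score')

# Bins are right-closed: (2, 4] -> Bronze, (4, 8] -> Silver, (8, 12] -> Gold
levels = pd.cut(scores, bins=[2, 4, 8, 12], labels=['Bronze', 'Silver', 'Gold'])

print(pd.DataFrame({'RFM_Score': scores, 'Segment_level': levels}))
```

For integer scores this yields the same assignment as the thresholds 9 and 5 used in `segment_me`.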

For further analysis, the RFM matrix can be used as input for K-means clustering.
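K-means relies on distances, so skewed variables on very different scales are usually transformed first. The sketch below shows one possible preparation, a log transform followed by standardization, on a hypothetical toy RFM table. The values are invented, and this step is an assumption about how the follow-up could start rather than part of this notebook; the scaled matrix could then be passed to a K-means implementation such as scikit-learn's `KMeans`.

```python
import numpy as np
import pandas as pd

# Hypothetical RFM values for four customers
rfm = pd.DataFrame({
    'Recency': [1, 19, 177, 40],
    'Frequency': [2, 3, 1, 50],
    'MonetaryValue': [55.0, 65.0, 100.0, 5000.0],
})

# log1p compresses large values and reduces right skew (requires values > -1)
rfm_log = np.log1p(rfm)

# Standardize each column to mean 0 and standard deviation 1
rfm_scaled = (rfm_log - rfm_log.mean()) / rfm_log.std(ddof=0)

print(rfm_scaled.round(2))
```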