# Cohort Analysis in R

Cohort analysis is the study of a group of people who share a characteristic. Most often the cohorts are customers, grouped by the month (or other cohort period) in which they joined a company's customer base. The analysis then looks at each cohort's buying patterns and at how well those customers are retained.

This is a simple step-by-step guide to cohort analysis in R.

We'll start by loading the basic libraries for reading and manipulating data.

In [1]:
library(data.table)
library(dplyr)

"package 'dplyr' was built under R version 3.5.1"
Attaching package: 'dplyr'

The following objects are masked from 'package:data.table':

    between, first, last

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



In [2]:
data<-fread("D:\\relay-foods.csv")
head(data)

OrderId,OrderDate,UserId,TotalCharges,CommonId,PupId,PickupDate
262,11/01/2009,47,$50.67,TRQKD,2,12/01/2009
278,20/01/2009,47,$26.60,4HH2S,3,20/01/2009
294,03/02/2009,47,$38.71,3TRDC,2,04/02/2009
301,06/02/2009,47,$53.38,NGAZJ,2,09/02/2009
302,06/02/2009,47,$14.28,FFYHD,2,09/02/2009
321,17/02/2009,47,$29.50,HA5R3,3,17/02/2009


In our data, the 'OrderDate' attribute is what we will use for the cohort analysis. However, it is read in as day/month/year text rather than as a date, so we have to convert it and extract the month (the order period) from it.

In [3]:
data$Order_Period <- format(as.Date(data$OrderDate, "%d/%m/%Y"), "%Y-%m")
data$Order_Period<-as.Date(paste(data$Order_Period,"-01",sep=""))
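
As a quick sanity check on the conversion, the sketch below applies the same two steps to three of the order dates shown in the output earlier. The helper name `to_month_start` is hypothetical and only base R is assumed.

```r
# Hypothetical helper: map day/month/year strings to the first day of their month.
to_month_start <- function(x, fmt = "%d/%m/%Y") {
  parsed <- as.Date(x, format = fmt)  # NA for strings that do not match fmt
  # Rebuild "YYYY-MM-01" and parse it with an explicit format so NA stays NA.
  as.Date(paste0(format(parsed, "%Y-%m"), "-01"), format = "%Y-%m-%d")
}

sample_dates <- c("11/01/2009", "20/01/2009", "03/02/2009")
month_starts <- to_month_start(sample_dates)
print(month_starts)
```

Both January orders collapse to the same month start, which is what makes the later grouping by order period possible.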

In [4]:
head(data)

OrderId,OrderDate,UserId,TotalCharges,CommonId,PupId,PickupDate,Order_Period
262,11/01/2009,47,$50.67,TRQKD,2,12/01/2009,2009-01-01
278,20/01/2009,47,$26.60,4HH2S,3,20/01/2009,2009-01-01
294,03/02/2009,47,$38.71,3TRDC,2,04/02/2009,2009-02-01
301,06/02/2009,47,$53.38,NGAZJ,2,09/02/2009,2009-02-01
302,06/02/2009,47,$14.28,FFYHD,2,09/02/2009,2009-02-01
321,17/02/2009,47,$29.50,HA5R3,3,17/02/2009,2009-02-01


Now that we have the order period in our data, we can find each user's cohort group by taking the minimum 'Order_Period' for that user.

In [5]:
cohort_col<-as.data.frame(aggregate(data$Order_Period, by=list(data$UserId), min))
colnames(cohort_col)<-c('UserId','CohortGp')

# aggregate() returns one row per user, i.e. fewer rows than the original data,
# so the result is kept in its own data frame and merged back on UserId

df<-merge(data, cohort_col, by = "UserId")
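
The same cohort assignment can also be written without a separate merge. The sketch below is an alternative, not the approach used in this guide: it uses dplyr's `group_by()` and `mutate()` on a small made-up data frame (`toy`, with invented user ids and dates) so that every order row carries its user's first order month.

```r
library(dplyr)

# Made-up orders for two users; Order_Period is already the first day of the month.
toy <- data.frame(
  UserId = c(1, 1, 2, 2, 2),
  Order_Period = as.Date(c("2009-02-01", "2009-01-01",
                           "2009-03-01", "2009-03-01", "2009-05-01"))
)

# A grouped mutate keeps every row and repeats the per-user minimum on each one.
toy_cohorts <- toy %>%
  group_by(UserId) %>%
  mutate(CohortGp = min(Order_Period)) %>%
  ungroup()

print(toy_cohorts)
```

Whether this or `aggregate()` plus `merge()` reads better is a matter of taste; both leave one `CohortGp` value per order row.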

In [6]:
head(df)

UserId,OrderId,OrderDate,TotalCharges,CommonId,PupId,PickupDate,Order_Period,CohortGp
47,262,11/01/2009,$50.67,TRQKD,2,12/01/2009,2009-01-01,2009-01-01
47,278,20/01/2009,$26.60,4HH2S,3,20/01/2009,2009-01-01,2009-01-01
47,294,03/02/2009,$38.71,3TRDC,2,04/02/2009,2009-02-01,2009-01-01
47,301,06/02/2009,$53.38,NGAZJ,2,09/02/2009,2009-02-01,2009-01-01
47,302,06/02/2009,$14.28,FFYHD,2,09/02/2009,2009-02-01,2009-01-01
47,321,17/02/2009,$29.50,HA5R3,3,17/02/2009,2009-02-01,2009-01-01


Now that every customer has a cohort group, we need to summarise, for each cohort group and order period, how many distinct customers ordered and how many orders they placed.

In [7]:
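# Per cohort group and order period: distinct customers and distinct orders
cohorts<-df %>% group_by(CohortGp, Order_Period) %>% summarise(Ret_cust=n_distinct(UserId),num_order=n_distinct(OrderId))
# Number the periods within each cohort in date order (1 = the joining month)
cohorts<-cohorts %>% arrange(CohortGp, Order_Period) %>% group_by(CohortGp) %>% mutate(counter = row_number())
# data.table's dcast() expects a data.table, so convert the grouped tibble first
user_retention<-dcast(as.data.table(cohorts), counter ~ CohortGp, value.var= "Ret_cust" )
user_retention$counter<-NULL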

"package 'bindrcpp' was built under R version 3.5.1"

In [8]:
t(user_retention)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
2009-01-01,22,8.0,10.0,9.0,10.0,8.0,8.0,7.0,7.0,7.0,7.0,8.0,11.0,7.0,6.0
2009-02-01,15,3.0,5.0,1.0,4.0,4.0,4.0,5.0,5.0,4.0,3.0,3.0,5.0,,
2009-03-01,13,4.0,5.0,4.0,1.0,2.0,2.0,3.0,2.0,1.0,3.0,2.0,1.0,,
2009-04-01,39,13.0,10.0,13.0,6.0,7.0,4.0,6.0,2.0,4.0,3.0,2.0,,,
2009-05-01,50,13.0,12.0,5.0,4.0,6.0,3.0,5.0,5.0,4.0,3.0,,,,
2009-06-01,32,15.0,9.0,6.0,7.0,5.0,3.0,3.0,10.0,3.0,,,,,
2009-07-01,50,23.0,13.0,10.0,11.0,10.0,11.0,7.0,7.0,,,,,,
2009-08-01,31,11.0,9.0,7.0,6.0,8.0,4.0,4.0,,,,,,,
2009-09-01,37,15.0,14.0,8.0,13.0,9.0,8.0,,,,,,,,
2009-10-01,54,17.0,12.0,13.0,13.0,7.0,,,,,,,,,


In the matrix above, each row is a cohort group. The first column gives the number of new customers in that cohort, and each subsequent column gives how many of those customers ordered again in the following order periods.
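
One caveat with numbering periods by row position (the `counter` column): if a cohort places no orders in some month, the later months shift one column to the left. A sketch of one way to avoid that, assuming `CohortGp` and `Order_Period` are `Date` values set to the first day of a month as above, is to compute the period as the number of calendar months between the two dates. The helper `months_between` is hypothetical and the example dates are invented.

```r
# Hypothetical helper: whole calendar months from `start` to `end`.
months_between <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  (e$year - s$year) * 12 + (e$mon - s$mon)
}

cohort_gp    <- as.Date(c("2009-01-01", "2009-01-01", "2009-11-01"))
order_period <- as.Date(c("2009-01-01", "2009-04-01", "2010-02-01"))

# Period 0 is the joining month; a skipped month leaves a gap instead of shifting.
cohort_period <- months_between(cohort_gp, order_period)
print(cohort_period)
```

With periods computed this way, a month without orders shows up as a missing value in the reshaped table rather than silently moving later months forward.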

The retention matrix is easier to read as a retention heatmap, which we build below.

In [9]:
# Convert the counts to a numeric matrix (rows = periods, columns = cohort groups)
mat <- as.matrix(user_retention)
user_perc<-t(t(mat) / mat[1, ])
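
To see what the division for `user_perc` does, here is a small worked example on a made-up 3 x 2 matrix (rows are periods, columns are cohorts, mirroring the layout of `user_retention`). The numbers are invented, and `sweep()` is shown only as an equivalent base R way of dividing each column by its first entry.

```r
# Made-up counts: rows = cohort period, columns = cohort group.
toy_mat <- matrix(c(20, 10, 5,
                    40, 10, NA),
                  nrow = 3,
                  dimnames = list(NULL, c("2009-01-01", "2009-02-01")))

# Double transpose: after t(), recycling of the divisor runs along cohorts.
perc_transpose <- t(t(toy_mat) / toy_mat[1, ])

# Same result with sweep(): divide column j by toy_mat[1, j].
perc_sweep <- sweep(toy_mat, 2, toy_mat[1, ], "/")

print(perc_sweep)
```

Each column starts at 1 (100% of the cohort in its joining month), and periods that have not happened yet stay `NA`.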

Heatmaps in R are not well suited to our purpose: they are relatively more difficult to set up for this particular task, and they are also less aesthetic. Hence I have used Python's seaborn instead.

We use the 'reticulate' library to bridge R and Python.

In [10]:
library(reticulate)
use_python('C:\\Users\\Saket Singh\\AppData\\Local\\Programs\\Python\\Python36\\python.exe')
sns <- import('seaborn')
plt <- import('matplotlib.pyplot')

# Create axes and plot. Save fig 
a4_dims <-c(18, 12)
plt$subplots(figsize=a4_dims)
plt$xlabel('Cohort Period', fontsize=18)
plt$ylabel('Cohort Group', fontsize=16)
sns$set(font_scale=0.7)
sns$heatmap(t(user_perc), annot=TRUE, fmt='.0%')
#plt$show()
plt$savefig("Cohort.png")


"package 'reticulate' was built under R version 3.5.1"

[[1]]
Figure(1800x1200)

[[2]]
AxesSubplot(0.125,0.11;0.775x0.77)


Text(0.5, 0, 'Cohort Period')

Text(0, 0.5, 'Cohort Group')

AxesSubplot(0.125,0.11;0.62x0.77)

![Cohort retention heatmap](Cohort.png)