# Cosmetics Marketing Analysis

A comprehensive customer segmentation and lifetime value analysis project.

## Overview
This notebook performs:
- RFM Analysis (Recency, Frequency, Monetary)
- Customer Segmentation (Statistical and Rule-based)
- Predictive Modeling (Retention and Revenue)
- Customer Lifetime Value (CLV) Calculation
- Marketing Strategy Optimization


# 1. Data Preparation

## 1.1 Transaction Data (X)


In [None]:
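# Packages used throughout the notebook
library(readr); library(dplyr); library(ggplot2); library(plotly)
library(caTools); library(ROCR); library(chorddiag)

# Load transaction data and keep purchase events only
X = read_csv("C:/data/") %>% 
  data.frame %>% setNames(c(
    "date","type","prod","cat","code","brand","price","cid","session"))
X <- X %>% filter(type == "purchase")
X$date = as.Date(X$date)
summary(X)  # Number of transactions: 263797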


In [None]:
# Visualize daily transactions
par(cex=0.8)
# windows()  # Optional: open a new plot window (use quartz() on Mac)
hist(X$date, "days", col=rainbow(4), las=2, freq=T, xlab="", main="Daily Number of Transactions")
# "days" can be changed to "weeks", "months", "quarters", or "years"


In [None]:
# Count distinct customers
# n_distinct() is a function in the dplyr package
n_distinct(X$cid)  # Number of customers: 28220


## 1.2 Customer Data (A)

Calculate RFM metrics for each customer:
- **Recency (R)**: Days since last purchase
- **Frequency (F)**: Purchase frequency
- **Monetary (M)**: Average purchase amount
- **Seniority (S)**: Days since first purchase


In [None]:
# Calculate customer-level metrics
A = X %>% 
  mutate(days = as.integer(as.Date("2022-02-01") - date)) %>% 
  group_by(cid) %>% summarise(
    recent = min(days),     # Days since last purchase (select minimum)
    freq = n(),             # Purchase frequency
    money = mean(price),   # Average purchase amount
    senior = max(days),     # Days since first purchase (select maximum)
    since = min(date)       # First purchase date
  ) %>% data.frame
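
As a quick sanity check of the four definitions above, the sketch below computes the same metrics in base R on a made-up three-customer table. The data frame `toy` and all of its values are illustrative assumptions, not part of the dataset; only the reference date matches the cell above.

```r
# Hypothetical toy transactions (illustrative values only)
toy <- data.frame(
  cid   = c("a", "a", "a", "b", "c", "c"),
  date  = as.Date(c("2022-01-01", "2022-01-15", "2022-01-30",
                    "2021-12-20", "2022-01-10", "2022-01-12")),
  price = c(10, 20, 30, 5, 8, 12)
)
ref <- as.Date("2022-02-01")              # same reference date as above
toy$days <- as.integer(ref - toy$date)    # age of each transaction in days

# One row of metrics per customer
rfm <- do.call(rbind, lapply(split(toy, toy$cid), function(d) data.frame(
  recent = min(d$days),    # days since last purchase
  freq   = nrow(d),        # number of purchase rows
  money  = mean(d$price),  # average amount per purchase row
  senior = max(d$days)     # days since first purchase
)))
print(rfm)
```

Customer `a` bought on three dates, so `recent` comes from the latest one (2 days before the reference date) and `senior` from the earliest one (31 days before).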


## 1.3 Customer Data Summary


In [None]:
summary(A)


## 1.4 Variable Distribution


In [None]:
# Plot RFM distributions
# Note: On Mac, use quartz() instead of windows()
# windows()
par(cex=0.8, mfrow=c(2,2), mar=c(3,3,4,2))
hist(A$recent,20,main="recency(R Activity)",ylab="",xlab="")
hist(pmin(A$freq, 10),0:10,main="frequency(F Loyalty)",ylab="",xlab="")
hist(A$senior,20,main="seniority(S Tenure)",ylab="",xlab="")
hist(log(A$money,10),main="log(money)(M Contribution)",ylab="",xlab="")


In [None]:
# Save data
A0 = A; X0 = X
save(X0, A0, file="c:/data/tf0_W.rdata")


In [None]:
# Reload data (if needed)
rm(list = ls())
cat("\014") 
load("c:/data/tf0_W.rdata")
A = A0; X = X0


# 2. K-means Cluster Analysis

## 2.1 RFM Customer Segmentation

Using K-means clustering to segment customers based on RFM metrics.


In [None]:
# K-means clustering
set.seed(111)
A$grp = kmeans(scale(A[,2:4]),10)$cluster
table(A$grp)  # Group size
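
The notebook does not say why 10 clusters were chosen. One common heuristic is to compare the total within-cluster sum of squares across several values of k and look for an "elbow". The sketch below illustrates the idea on synthetic data; `demo` is a hypothetical stand-in for `A[,2:4]`, and this is not necessarily how the value 10 was picked.

```r
set.seed(111)
# Synthetic stand-in for the RFM columns: three well-separated blobs (assumption)
demo <- rbind(
  matrix(rnorm(150, mean = 0),  ncol = 3),
  matrix(rnorm(150, mean = 5),  ncol = 3),
  matrix(rnorm(150, mean = 10), ncol = 3)
)
z <- scale(demo)   # standardize, as in the clustering cell above

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) kmeans(z, centers = k, nstart = 10)$tot.withinss)

plot(ks, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
```

On data like this the curve drops sharply up to k = 3 and flattens afterwards; on real customer data the elbow is usually less clear, so business interpretability of the groups also matters.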


## 2.2 Customer Group Attributes

Five-dimensional visualization:
1. **X-axis**: Purchase frequency (Frequency)
2. **Y-axis**: Average transaction amount (Money)
3. **Bubble size**: Revenue contribution
4. **Bubble color**: Recency (redder = longer since last purchase)
5. **Number in bubble**: Group size


In [None]:
# Create bubble chart
# Note: On Mac, use quartz() instead of windows()
# windows()
group_by(A, grp) %>% summarise(
  recent = mean(recent), 
  freq = mean(freq), 
  money = mean(money), 
  size = n() ) %>% 
  mutate( revenue = size*money/1000 )  %>% 
  filter(size > 1) %>% 
  ggplot(aes(x=freq, y=money)) +
  geom_point(aes(size=revenue, col=recent),alpha=0.5) +
  scale_size(range=c(4,30)) +
  scale_color_gradient(low="green",high="red") +
  scale_x_log10() + scale_y_log10() + 
  geom_text(aes(label = size ),size=3) +
  theme_bw() + guides(size="none") +
  labs(title="Customer Segments",
       subtitle="(bubble_size:revenue_contribution; text:group_size)",
       color="Recency") +
  xlab("Frequency (log)") + ylab("Average Transaction Amount (log)")


In [None]:
# Standardized data for visualization
AN = scale(A[, c(2,3,4,5)]) %>% data.frame
kg = A$grp

# Bar chart
# windows()
par(cex=0.8)
split(AN,kg) %>% sapply(colMeans) %>% barplot(beside=T,col=rainbow(4))
legend('topleft',legend=colnames(AN),fill=rainbow(4))


# 3. Rule-Based Segmentation

## 3.1 Customer Segmentation Rules

Customer segments:
- **N1**: New Customers
- **N2**: New Potential Customers
- **R1**: Key Customers
- **R2**: Core Customers
- **S1**: Drowsy Customers
- **S2**: Half-Asleep Customers
- **S3**: Deep-Sleep Customers


In [None]:
# Define segmentation function
STS = c("N1","N2","R1","R2","S1","S2","S3")
Status = function(rx,fx,mx,sx,K) {factor(
  ifelse(sx < 2*K,
         ifelse(fx*mx > 50, "N2", "N1"),
         ifelse(rx < 2*K,
                ifelse(sx/fx < 0.75*K,"R2","R1"),
                ifelse(rx < 3*K,"S1",
                       ifelse(rx < 4*K,"S2","S3")))), STS)}
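
To make the rules easier to read, the sketch below repeats the function and applies it to seven made-up customers, one per segment. The input values and the cycle length `K = 10` are illustrative assumptions chosen so that each branch is hit once.

```r
STS <- c("N1", "N2", "R1", "R2", "S1", "S2", "S3")
# Same rules as the cell above
Status <- function(rx, fx, mx, sx, K) {factor(
  ifelse(sx < 2*K,
         ifelse(fx*mx > 50, "N2", "N1"),          # new: split by total spend
         ifelse(rx < 2*K,
                ifelse(sx/fx < 0.75*K, "R2", "R1"), # active: split by purchase interval
                ifelse(rx < 3*K, "S1",
                       ifelse(rx < 4*K, "S2", "S3")))), STS)}

# Hypothetical customers (illustrative values only)
toy <- data.frame(
  recent = c( 3,  5, 10,  4, 25, 35,  50),
  freq   = c( 1,  3,  2, 12,  2,  2,   2),
  money  = c(20, 30, 25, 25, 25, 25,  25),
  senior = c( 3, 15, 60, 60, 80, 90, 100)
)
seg <- Status(toy$recent, toy$freq, toy$money, toy$senior, K = 10)
print(data.frame(toy, status = seg))
```

Read top to bottom: customers with seniority under 2K are new (N1 or N2, depending on whether frequency times average amount exceeds 50); otherwise customers seen within 2K days are active (R2 if they buy more often than every 0.75K days, else R1); the rest are sleeping, graded by how long ago they last purchased.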


## 3.2 Average Purchase Cycle

Estimate the average purchase cycle K (in days) as total seniority divided by total purchase count, using only customers with more than one purchase.


In [None]:
# Calculate average purchase cycle
K = as.integer(sum(A$senior[A$freq>1]) / sum(A$freq[A$freq>1]))
K

# Need at least two visits to calculate purchase cycle
# Verification calculations
length(A$freq[A$freq>1])  # Customers with >1 purchase
length(A$freq[A$freq<2])  # Customers with ≤1 purchase
table(A$freq>1) %>% prop.table()  # Proportions
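
The arithmetic behind K can be checked by hand on a small example. The vectors below are made-up values, not taken from `A`.

```r
# Hypothetical seniority (days) and purchase counts for four customers
senior <- c(30, 60, 90, 10)
freq   <- c( 3,  2,  5,  1)

multi <- freq > 1                    # only repeat buyers define a cycle
K_toy <- as.integer(sum(senior[multi]) / sum(freq[multi]))
K_toy                                # (30 + 60 + 90) / (3 + 2 + 5)
```

The one-time buyer is excluded, so the estimate is 180 days of tenure spread over 10 purchases.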


## 3.3 Sliding Data Window

Create annual customer snapshots to track customer lifecycle changes.


In [None]:
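# Create annual customer data frames
Y = list()
for(y in 2022) {
  # D[1]: snapshot date; D[2]: one year earlier (assumed start of the latest period)
  D = as.Date(paste0(c(y, y-1),"-01-31"))
  Y[[paste0("Y",y)]] = X %>%        
    filter(date <= D[1]) %>%                         # transactions up to the snapshot date
    mutate(days = 1 + as.integer(D[1] - date)) %>%   # age of each transaction in days
    group_by(cid) %>% summarise(    
      recent = min(days),           # days since last purchase
      freq = n(),                   # purchase frequency
      money = mean(price),          # average purchase amount
      senior = max(days),           # days since first purchase
      status = Status(recent,freq,money,senior,K),   # rule-based segment
      since = min(date),                      # first purchase date
      y_freq = sum(date > D[2]),              # purchases in the latest period
      y_revenue = sum(price[date > D[2]])     # revenue in the latest period
    ) %>% data.frame 
}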

head(Y$Y2022)


## 3.4 Cumulative Customer Count at End of Each Year


In [None]:
# Count customers by year
sapply(Y, nrow)

# Count by status
sapply(Y, function(df) table(df$status))


## 3.5 Group Size Change Trends


In [None]:
# Bar chart of group sizes
# windows()
par(cex=0.8, mfrow=c(1,1))
cols = c("gold","orange","blue","green","pink","magenta","darkred")
sapply(Y, function(df) table(df$status)) %>% barplot(col=cols) 
legend("topleft",rev(STS),fill=rev(cols))


## 3.6 Dynamic Analysis of Group Attributes


In [None]:
# Aggregate customer segment statistics
CustSegments = do.call(rbind, lapply(Y, function(d) {
  group_by(d, status) %>% summarise(
    average_frequency = mean(freq),
    average_amount = mean(money),
    total_revenue = sum(y_revenue),
    total_no_orders = sum(y_freq),
    average_recency = mean(recent),
    average_seniority = mean(senior),
    group_size = n()
  )})) %>% ungroup %>% 
  mutate(year=rep(2022)) %>% data.frame
head(CustSegments)


In [None]:
# Prepare data for interactive visualization
df = CustSegments %>% transmute(
  `Group` = as.character(status), year = year, 
  `Average Purchase Frequency` = average_frequency, 
  `Average Unit Price` = average_amount,
  `Total Revenue Contribution` = total_revenue
)

# Interactive plot
ggplot(df, aes(
  x=`Average Purchase Frequency`,y=`Average Unit Price`,color=`Group`,group=`Group`,ids=year)) +
  geom_point(aes(size=`Total Revenue Contribution`,frame=year),alpha=0.8) +
  scale_size(range=c(2,12)) -> g
ggplotly(g)


In [None]:
# Focus on active customer groups
filter(df,`Group`%in%c('N1','N2','R1','R2')) %>% 
  ggplot(aes(
    x=`Average Purchase Frequency`,y=`Average Unit Price`,color=`Group`,group=`Group`,ids=year)) +
  geom_path(alpha=0.5,size=2) +
  geom_point(aes(size=`Total Revenue Contribution`),alpha=0.8) +
  scale_size(range=c(2,12)) -> g
ggplotly(g)


## 3.7 Interactive Flow Analysis

Chord diagram showing customer flow between segments.


In [None]:
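# Create flow matrix from the earliest and latest snapshots in Y
# Note: Y currently holds a single snapshot (Y2022), so both sides are the same
# table and the matrix is diagonal; add an earlier snapshot in 3.3 to see real flows
df = merge(Y[[1]][,c(1,6)], Y[[length(Y)]][,c(1,6)], by="cid", all.x=TRUE)
tx = table(df$status.x, df$status.y) %>% 
  as.data.frame.matrix() %>% as.matrix()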

# Flow matrix percentages
tx %>% prop.table(1) %>% round(3)

# Interactive chord diagram
# Note: May need to install: devtools::install_github("mattflor/chorddiag", force = TRUE)
chorddiag(tx, groupColors=cols)
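
A flow matrix of this kind is a cross-tabulation of each customer's segment at two points in time. The sketch below builds one in base R from made-up statuses; the vectors `before` and `after` are hypothetical and only illustrate how the row percentages are read.

```r
STS <- c("N1", "N2", "R1", "R2", "S1", "S2", "S3")
# Hypothetical statuses of six customers at two snapshot dates (illustrative only)
before <- factor(c("N1", "N1", "R1", "R2", "S1", "S1"), levels = STS)
after  <- factor(c("R1", "S1", "R1", "R2", "S2", "R1"), levels = STS)

flow  <- table(before, after)            # counts: rows = from, columns = to
share <- round(prop.table(flow, 1), 3)   # row shares; empty rows become NaN
print(share[c("N1", "S1"), ])
```

Each row of `share` sums to 1 and shows where customers of one segment ended up; for example, half of the N1 customers here moved to R1 and half to S1. Segments with no customers at the first date produce `NaN` rows.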


# 4. Build Models

## 4.1 Prepare Data


In [None]:
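# Prepare data for modeling: join each customer's earlier-snapshot features with
# the later snapshot's y_freq / y_revenue (columns 8:9) as outcomes
# Note: with only Y2022 in Y, both sides are the same snapshot, so the outcomes
# are not true next-period values; add an earlier snapshot in 3.3 first
CX = left_join(Y[[1]], Y[[length(Y)]][,c(1,8,9)], by="cid")
head(CX)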

names(CX)[8:11] = c("freq0","revenue0","Retain", "Revenue") 
CX$Retain = CX$Retain > 0
head(CX)

# Average retention probability
table(CX$Retain) %>% prop.table()


## 4.2 Build Classification Model

Logistic regression to predict customer retention.


In [None]:
# Build retention model
mRet = glm(Retain ~ ., CX[,c(2:3,6,8:10)], family=binomial())
summary(mRet)


## 4.3 Estimate Classification Model Accuracy


In [None]:
# Predictions and evaluation
pred = predict(mRet,type="response")

# Confusion matrix
table(pred>0.5,CX$Retain)

# Accuracy at threshold 0.5
table(pred>0.5,CX$Retain) %>%
  {sum(diag(.))/sum(.)}

# AUC
colAUC(pred,CX$Retain)
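
To show what the accuracy and AUC figures measure, the sketch below computes both by hand in base R on eight made-up predictions. The vectors `p` and `y` are hypothetical; the AUC is obtained from the rank-sum (Mann-Whitney) identity, which should agree with `colAUC()` when there are no ties.

```r
# Hypothetical predicted probabilities and actual outcomes (illustrative only)
p <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
y <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Confusion matrix and accuracy at a 0.5 threshold
# (diag() is only valid when both classes appear in both dimensions)
cm  <- table(predicted = p > 0.5, actual = y)
acc <- sum(diag(cm)) / sum(cm)

# AUC = share of (positive, negative) pairs in which the positive scores higher
r   <- rank(p)
n1  <- sum(y)
n0  <- sum(!y)
auc <- (sum(r[y]) - n1 * (n1 + 1) / 2) / (n1 * n0)

c(accuracy = acc, auc = auc)
```

Accuracy depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is why the notebook reports both.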


In [None]:
# ROC curve
# windows()
prediction(pred, CX$Retain) %>% 
  performance("tpr", "fpr") %>% 
  plot(print.cutoffs.at=seq(0,1,0.1))


## 4.4 Build Quantity Model

Linear regression on log revenue to predict the purchase amount of customers who did purchase.


In [None]:
# Build revenue model (only for customers who purchased)
dx = subset(CX, Revenue > 0)
mRev = lm(log(Revenue) ~ recent + freq + log(1+money) + senior +
            status + freq0 + log(1+revenue0), dx)  
summary(mRev)  # R² = 0.713


In [None]:
# Model diagnostics
# windows()
plot(log(dx$Revenue), predict(mRev), col='pink', cex=0.65)
abline(0,1,col='red')


# 5. Estimate Customer Lifetime Value

## 5.1 Predictions

Generate predictions for retention rate and purchase amount.


In [None]:
# Prepare data for predictions
CX = Y$Y2022
names(CX)[8:9] = c("freq0","revenue0")

# Predict retention rate
CX$ProbRetain = predict(mRet,CX,type='response')

# Predict purchase amount
CX$PredRevenue = exp(predict(mRev,CX))


## 5.2 Estimate Customer Lifetime Value (CLV)

Formula: $\text{CLV} = g \times \text{PredictedRevenue} \times \sum_{t=0}^{N} \left(\frac{\text{RetentionRate}}{1+d}\right)^t$

Where:
- $g$ = Profit margin (0.5)
- $N$ = Number of periods (5)
- $d$ = Discount rate (0.1)
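
For a single customer the formula is a finite geometric series, so it can be checked against its closed form. The sketch below does this in base R with the parameter values listed above; the revenue and retention figures are hypothetical.

```r
g <- 0.5; N <- 5; d <- 0.1   # same parameter values as listed above
revenue <- 100               # hypothetical predicted revenue per period
retain  <- 0.6               # hypothetical retention probability

ratio <- retain / (1 + d)    # discounted retention factor per period

# Direct sum over t = 0..N
clv_sum <- g * revenue * sum(ratio^(0:N))

# Closed form of the same geometric series (valid when ratio != 1)
clv_closed <- g * revenue * (1 - ratio^(N + 1)) / (1 - ratio)

c(sum = clv_sum, closed = clv_closed)
```

The term for t = 0 is the undiscounted profit of the first period, so CLV is never below `g * revenue`; higher retention adds progressively smaller contributions from later periods.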


In [None]:
# Calculate CLV
g = 0.5   # (Pre-tax) profit margin
N = 5     # Number of periods
d = 0.1   # Discount rate = 10%

CX$CLV = g * CX$PredRevenue * rowSums(sapply(
  0:N, function(i) (CX$ProbRetain/(1+d))^i ) )

summary(CX$CLV)


In [None]:
# CLV distribution
par(mar=c(2,2,3,1), cex=0.8)
# windows()
hist(log(CX$CLV,10), xlab="", ylab="", main="Customer Lifetime Value Distribution")


In [None]:
# Average metrics by group
CX %>% group_by(status) %>% summarise_at(vars(ProbRetain:CLV), mean)


In [None]:
# Boxplot of CLV by group
# windows()
par(mar=c(3,3,4,2), cex=0.8)
boxplot(log(CLV,10)~status, CX, main="CLV by Groups")


# 7. Select Marketing Targets

## 7.1 Retention for R2 Group


In [None]:
# R2 group analysis
# windows()
par(mfrow=c(1,2), mar=c(4,3,3,2), cex=0.8)
hist(CX$ProbRetain[CX$status=="R2"],main="ProbRetain",xlab="")
hist(log(CX$PredRevenue[CX$status=="R2"],10),main="PredRevenue",xlab="")


## 7.2 Estimate Expected Return

Calculate expected return for marketing campaigns.


In [None]:
# Marketing tool parameters
cost = 10        # Cost per customer
effect = 0.75    # Benefit: next period's purchase probability

# Calculate expected return for S3 group
Target = subset(CX, status=="S3")
Target$ExpReturn = (effect - Target$ProbRetain) * Target$PredRevenue - cost
summary(Target$ExpReturn)
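
The expected-return formula treats the campaign as lifting a customer's purchase probability to `effect`, so the gain is the probability uplift times predicted revenue, minus the cost of contact. The sketch below applies it to three made-up customers; the probabilities and revenues are illustrative assumptions.

```r
cost   <- 10      # cost per customer, as above
effect <- 0.75    # purchase probability after the campaign, as above

# Hypothetical customers (illustrative values only)
toy <- data.frame(
  ProbRetain  = c(0.2, 0.7, 0.5),
  PredRevenue = c(100, 100, 30)
)
toy$ExpReturn <- (effect - toy$ProbRetain) * toy$PredRevenue - cost
print(toy)

sum(toy$ExpReturn > 0)   # number of customers worth targeting
```

Only the first customer is worth contacting: the second is already likely to return, so the uplift is small, and the third spends too little for the uplift to cover the cost.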


## 7.3 Select Marketing Targets


In [None]:
# Top targets
Target %>% arrange(desc(ExpReturn)) %>% select(cid, ExpReturn) %>% head(258)

# Count feasible targets
sum(Target$ExpReturn > 0)

# Total expected return
sum(Target$ExpReturn[Target$ExpReturn > 0])


In [None]:
# Expected return by all groups
Target = CX
Target$ExpReturn = (effect - Target$ProbRetain) * Target$PredRevenue - cost
filter(Target, ExpReturn > 0) %>%
  group_by(status) %>% summarise(
    No.Target = n(),
    AvgROI = mean(ExpReturn),
    TotalROI = sum(ExpReturn)) %>% data.frame


# 8. Conclusion

This analysis enables:
- **Customer Retention**: Identify at-risk customers (S1, S2, S3)
- **Revenue Optimization**: Focus on high-value segments (R1, R2)
- **Marketing ROI**: Calculate expected returns for campaigns
- **Resource Allocation**: Prioritize customer segments based on CLV
- **Strategic Planning**: Understand customer lifecycle and migration patterns


In [None]:
# Final data check and save
is.na(X) %>% colSums
is.na(A) %>% colSums

A0 = A; X0 = X
save(X0, A0, file="c:/data/tf0.rdata")
