### 1-Preparing the Data
##### Online Retail II Dataset
###### Invoice: Unique invoice code. Invoice numbers containing the letter C or A indicate returns.
###### StockCode: Unique product code.
###### Description: Product name.
###### Quantity: Number of products.
###### InvoiceDate: Invoice date and time.
###### Price: Product price.
###### Customer ID: Unique customer code.
###### Country: Country name.

### Importing Libraries

In [6]:
import pandas as pd
import numpy as np
# Show all columns.
pd.set_option("display.max_columns",None)
# Show all rows.
pd.set_option("display.max_rows",None)

In [5]:
data=pd.read_excel(r"online_retail_II.xlsx",sheet_name="Year 2010-2011")
df=data.copy()
print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541910 entries, 0 to 541909
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      541910 non-null  object        
 1   StockCode    541910 non-null  object        
 2   Description  540456 non-null  object        
 3   Quantity     541910 non-null  int64         
 4   InvoiceDate  541910 non-null  datetime64[ns]
 5   Price        541910 non-null  float64       
 6   Customer ID  406830 non-null  float64       
 7   Country      541910 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
None


,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


### Rows with a null Customer ID must be dropped, because customers we cannot identify cannot be analysed. Rows with a null Description will be dropped as well.

In [7]:
df.isnull().sum()

Invoice             0
StockCode           0
Description      1454
Quantity            0
InvoiceDate         0
Price               0
Customer ID    135080
Country             0
dtype: int64
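
As a small illustration of the null handling described above, the sketch below drops rows only when one of the two columns of interest is missing. The toy rows are made up for this example, and `dropna(subset=...)` is one possible variant; the notebook itself later calls `dropna()` on the whole frame.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the retail data (values are invented for illustration).
toy = pd.DataFrame({
    "Invoice": [536365, 536366, 536367, 536368],
    "Description": ["LANTERN", None, "HOLDER", "HANGER"],
    "Customer ID": [17850.0, 17850.0, np.nan, 13047.0],
})

# Count the missing values per column, as df.isnull().sum() does above.
print(toy.isnull().sum())

# Drop a row only if Customer ID or Description is missing.
toy_clean = toy.dropna(subset=["Customer ID", "Description"])
print(len(toy), "->", len(toy_clean))
```

Restricting the check to named columns keeps rows that are only missing values in columns the analysis does not use.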

### There are negative values in the Quantity and Price variables. We need to drop them.

In [8]:
df.describe().T

,count,mean,std,min,25%,50%,75%,max
Quantity,541910.0,9.55223,218.08096,-80995.0,1.0,3.0,10.0,80995.0
Price,541910.0,4.61114,96.75977,-11062.06,1.25,2.08,4.13,38970.0
Customer ID,406830.0,15287.68416,1713.60307,12346.0,13953.0,15152.0,16791.0,18287.0


In [11]:
df.describe(exclude=["int64","float64"],datetime_is_numeric=True).T

,count,unique,top,freq,mean,min,25%,50%,75%,max
Invoice,541910.0,25900.0,573585.00000,1114.0,NaT,NaT,NaT,NaT,NaT,NaT
StockCode,541910.0,4070.0,85123A,2313.0,NaT,NaT,NaT,NaT,NaT,NaT
Description,540456.0,4223.0,WHITE HANGING HEART T-LIGHT HOLDER,2369.0,NaT,NaT,NaT,NaT,NaT,NaT
InvoiceDate,541910.0,,,,2011-07-04 13:35:22.342073344,2010-12-01 08:26:00,2011-03-28 11:34:00,2011-07-19 17:17:00,2011-10-19 11:27:00,2011-12-09 12:50:00
Country,541910.0,38.0,United Kingdom,495478.0,NaT,NaT,NaT,NaT,NaT,NaT


### In the Invoice variable, some values contain C or A, which marks them as returns. We should drop them.

In [15]:
df=df[~(df["Invoice"].str.contains("C",na=False))]
df=df[~(df["Invoice"].str.contains("A",na=False))]
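
The two filters above can also be expressed as a single pattern. The sketch below is one possible variant on made-up rows. It assumes the flag letter always appears at the start of the invoice code, and it converts the column to strings first because the column mixes integers and text.

```python
import pandas as pd

# Toy invoices: two plain numeric codes and two flagged codes (invented values).
toy = pd.DataFrame({
    "Invoice": [536365, "C536379", "A563185", 536370],
    "Quantity": [6, -1, 1, 24],
})

# str.match anchors the pattern at the start of each string,
# so "[CA]" is True only for codes that begin with C or A.
is_flagged = toy["Invoice"].astype(str).str.match(r"[CA]")

# Keep only the rows that are not flagged.
toy_sales = toy[~is_flagged]
print(toy_sales)
```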

In [20]:
df=df[df["Quantity"]>0]
df=df[df["Price"]>0]

In [22]:
df.dropna(inplace=True)

In [23]:
df.head()

,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


### We know the quantity and unit price on each invoice line. From these two variables we can compute the Total Price as well. Let's do it.

In [24]:
df["TotalPrice"]=df["Quantity"] * df["Price"]
df.head()

,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country,TotalPrice
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


### Now we need per-customer values for the CLTV calculation: the total number of invoices, the total quantity purchased, and the total expenditure. I used the groupby() function to compute these values for each unique customer.

In [28]:
df["Customer ID"]=df["Customer ID"].astype(int)

In [30]:
cltv_df=df.groupby("Customer ID").agg({"Invoice": lambda x:x.nunique()
                               ,"Quantity": lambda x:x.sum()
                               ,"TotalPrice": lambda x:x.sum()})
cltv_df

Customer ID,Invoice,Quantity,TotalPrice
12346,1,74215,77183.60000
12347,7,2458,4310.00000
12348,4,2341,1797.24000
12349,1,631,1757.55000
12350,1,197,334.40000
...,...,...,...
18280,1,45,180.60000
18281,1,54,80.82000
18282,2,103,178.05000
18283,16,1397,2094.88000


In [32]:
cltv_df.columns=["total_transaction","total_unit","total_price"]
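
The aggregation and the renaming can also be done in one step with pandas named aggregation. The sketch below uses a made-up four-row frame. It is an alternative way of writing the same summary, not the form used in the notebook.

```python
import pandas as pd

# Toy invoice lines for two customers (invented values).
toy = pd.DataFrame({
    "Invoice": [1, 1, 2, 3],
    "Customer ID": [10, 10, 10, 11],
    "Quantity": [2, 1, 5, 4],
    "TotalPrice": [4.0, 3.0, 10.0, 8.0],
})

# Each keyword is the output column name; the tuple is (source column, function).
summary = toy.groupby("Customer ID").agg(
    total_transaction=("Invoice", "nunique"),
    total_unit=("Quantity", "sum"),
    total_price=("TotalPrice", "sum"),
)
print(summary)
```

Naming the outputs inside `agg()` removes the dependency on column order that a positional rename has.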

### Now it is time to calculate the CLTV metrics.

##### CLTV = (Customer Value / Churn Rate) * Profit Margin
##### Customer Value = Average Order Value * Purchase Frequency
##### Average Order Value = Total Price / Total Transaction
##### Purchase Frequency = Total Transaction / Total Number of Customers
##### Churn Rate = 1 - Repeat Rate
##### Profit Margin = Total Price * n (the company's profit rate)
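
To make the formulas concrete, here is a worked example for a single customer in plain Python. The inputs are copied from this notebook's own output tables for customer 12347 (total price, transaction count, number of customers, churn rate) together with the assumed profit rate of 0.2. The variable names are chosen for this sketch only.

```python
# Inputs for customer 12347, taken from the notebook's output tables.
total_price = 4310.00
total_transaction = 7
n_customers = 4338
churn_rate = 0.3441678192715537
profit_rate = 0.2  # the assumed n

# Apply the formulas in the order they are listed above.
average_order_value = total_price / total_transaction
purchase_frequency = total_transaction / n_customers
customer_value = average_order_value * purchase_frequency
profit_margin = total_price * profit_rate
cltv = (customer_value / churn_rate) * profit_margin

print("average_order_value:", round(average_order_value, 5))
print("purchase_frequency:", round(purchase_frequency, 5))
print("customer_value:", round(customer_value, 5))
print("profit_margin:", round(profit_margin, 2))
print("cltv:", round(cltv, 2))
```

Note that the transaction count cancels out in Customer Value, so with these definitions it equals Total Price / Total Number of Customers.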

### Purchase Frequency : 

In [34]:
cltv_df["purchase_frequency"]=cltv_df["total_transaction"]/cltv_df.shape[0]
cltv_df.head()

Customer ID,total_transaction,total_unit,total_price,purchase_frequency
12346,1,74215,77183.6,0.00023
12347,7,2458,4310.0,0.00161
12348,4,2341,1797.24,0.00092
12349,1,631,1757.55,0.00023
12350,1,197,334.4,0.00023


### Average Order Value : 

In [35]:
cltv_df["average_order_value"]=cltv_df["total_price"]/cltv_df["total_transaction"]
cltv_df.head()

Customer ID,total_transaction,total_unit,total_price,purchase_frequency,average_order_value
12346,1,74215,77183.6,0.00023,77183.6
12347,7,2458,4310.0,0.00161,615.71429
12348,4,2341,1797.24,0.00092,449.31
12349,1,631,1757.55,0.00023,1757.55
12350,1,197,334.4,0.00023,334.4


### Churn Rate & Repeat Rate : 

##### Repeat Rate = Number of customers with more than 1 transaction / number of all customers

In [38]:
repeat_rate=cltv_df[cltv_df["total_transaction"]>1].shape[0] / cltv_df.shape[0]
churn_rate=1-repeat_rate
repeat_rate, churn_rate

(0.6558321807284463, 0.3441678192715537)
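
The repeat rate is simply the share of customers with more than one transaction, so it can also be computed as the mean of a boolean mask. The sketch below shows this on five made-up customers. It is an equivalent formulation, not a change to the method.

```python
import pandas as pd

# Transaction counts for five hypothetical customers.
transactions = pd.Series([1, 3, 1, 2, 5],
                         index=[101, 102, 103, 104, 105],
                         name="total_transaction")

# True counts as 1 and False as 0, so the mean is the share of repeat customers.
repeat_rate = (transactions > 1).mean()
churn_rate = 1 - repeat_rate

# The shape-based form used in the notebook gives the same number.
repeat_rate_by_shape = transactions[transactions > 1].shape[0] / transactions.shape[0]
print(repeat_rate, churn_rate, repeat_rate_by_shape)
```

Either way, the result is a single churn rate shared by all customers, which then scales every customer's CLTV by the same factor.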

### Profit Margin : 

In [40]:
# We assume our n is 0.2
cltv_df["profit_margin"]=cltv_df["total_price"] * 0.2
cltv_df.head()

Customer ID,total_transaction,total_unit,total_price,purchase_frequency,average_order_value,profit_margin
12346,1,74215,77183.6,0.00023,77183.6,15436.72
12347,7,2458,4310.0,0.00161,615.71429,862.0
12348,4,2341,1797.24,0.00092,449.31,359.448
12349,1,631,1757.55,0.00023,1757.55,351.51
12350,1,197,334.4,0.00023,334.4,66.88


### Customer Value : 

In [42]:
cltv_df["customer_value"]=cltv_df["average_order_value"] * cltv_df["purchase_frequency"]
cltv_df.head()

Customer ID,total_transaction,total_unit,total_price,purchase_frequency,average_order_value,profit_margin,customer_value
12346,1,74215,77183.6,0.00023,77183.6,15436.72,17.79244
12347,7,2458,4310.0,0.00161,615.71429,862.0,0.99355
12348,4,2341,1797.24,0.00092,449.31,359.448,0.4143
12349,1,631,1757.55,0.00023,1757.55,351.51,0.40515
12350,1,197,334.4,0.00023,334.4,66.88,0.07709


### Customer Lifetime Value : 

In [44]:
cltv_df["CLTV"]=(cltv_df["customer_value"] / churn_rate) * cltv_df["profit_margin"]
cltv_df.head()

Customer ID,total_transaction,total_unit,total_price,purchase_frequency,average_order_value,profit_margin,customer_value,CLTV
12346,1,74215,77183.6,0.00023,77183.6,15436.72,17.79244,798031.89671
12347,7,2458,4310.0,0.00161,615.71429,862.0,0.99355,2488.42599
12348,4,2341,1797.24,0.00092,449.31,359.448,0.4143,432.69546
12349,1,631,1757.55,0.00023,1757.55,351.51,0.40515,413.79531
12350,1,197,334.4,0.00023,334.4,66.88,0.07709,14.97969


In [46]:
cltv_df.describe().T

,count,mean,std,min,25%,50%,75%,max
total_transaction,4338.0,4.27201,7.698,1.0,1.0,2.0,5.0,209.0
total_unit,4338.0,1191.2893,5046.08151,1.0,160.0,379.0,992.75,196915.0
total_price,4338.0,2054.27061,8989.2299,3.75,307.415,674.485,1661.74,280206.02
purchase_frequency,4338.0,0.00098,0.00177,0.00023,0.00023,0.00046,0.00115,0.04818
average_order_value,4338.0,419.16733,1796.53783,3.45,178.625,293.9,430.11375,84236.25
profit_margin,4338.0,410.85412,1797.84598,0.75,61.483,134.897,332.348,56041.204
customer_value,4338.0,0.47355,2.07221,0.00086,0.07087,0.15548,0.38307,64.59337
CLTV,4338.0,11387.49555,241381.00404,0.00188,12.65961,60.94173,369.91023,10517804.90881


In [45]:
cltv_df.sort_values(by="CLTV",ascending=False)

Customer ID,total_transaction,total_unit,total_price,purchase_frequency,average_order_value,profit_margin,customer_value,CLTV
14646,73,196915,280206.02000,0.01683,3838.43863,56041.20400,64.59337,10517804.90881
18102,60,64124,259657.30000,0.01383,4327.62167,51931.46000,59.85645,9031736.56307
17450,46,69993,194550.79000,0.01060,4229.36500,38910.15800,44.84804,5070329.52306
16446,2,80997,168472.50000,0.00046,84236.25000,33694.50000,38.83645,3802141.09260
14911,201,80265,143825.06000,0.04633,715.54756,28765.01200,33.15469,2771017.80094
...,...,...,...,...,...,...,...,...
16878,1,12,13.30000,0.00023,13.30000,2.66000,0.00307,0.02370
17956,1,1,12.75000,0.00023,12.75000,2.55000,0.00294,0.02178
16454,2,3,6.90000,0.00046,3.45000,1.38000,0.00159,0.00638
14792,1,2,6.20000,0.00023,6.20000,1.24000,0.00143,0.00515


### We must segment customers based on the CLTV results. I've divided them into 3 segments here. In the next cell, I'll examine whether this split makes sense.

In [47]:
cltv_df["segment"]=pd.qcut(cltv_df["CLTV"],3,labels=["Low_Value","Medium_Value","High_Value"])
cltv_df.sort_values(by="segment", ascending=False).head(10)

Customer ID,total_transaction,total_unit,total_price,purchase_frequency,average_order_value,profit_margin,customer_value,CLTV,segment
12346,1,74215,77183.6,0.00023,77183.6,15436.72,17.79244,798031.89671,High_Value
15123,2,1195,2199.31,0.00046,1099.655,439.862,0.50699,647.95237,High_Value
15172,5,1611,1624.05,0.00115,324.81,324.81,0.37438,353.32062,High_Value
15171,1,290,1289.5,0.00023,1289.5,257.9,0.29726,222.74752,High_Value
15159,30,11441,18641.01,0.00692,621.367,3728.202,4.29714,46548.86186,High_Value
15157,3,2392,1936.07,0.00069,645.35667,387.214,0.4463,502.12553,High_Value
15152,13,2704,4745.69,0.003,365.05308,949.138,1.09398,3016.9556,High_Value
15150,10,3658,5314.47,0.00231,531.447,1062.894,1.2251,3783.46837,High_Value
15146,2,1473,1327.16,0.00046,663.58,265.432,0.30594,235.94825,High_Value
15144,17,3050,6580.44,0.00392,387.08471,1316.088,1.51693,5800.69532,High_Value
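
`pd.qcut` splits by rank, so each of the 3 segments receives about the same number of customers and the boundaries are simply the CLTV quantiles. The sketch below uses nine made-up CLTV values and the `retbins=True` option to show where the cut points fall. The numbers are illustrative only.

```python
import pandas as pd

# Nine hypothetical CLTV values with a long right tail, like the real data.
cltv = pd.Series([0.5, 2.0, 7.5, 12.0, 60.0, 75.0, 400.0, 3000.0, 50000.0])

# retbins=True additionally returns the bin edges chosen by qcut.
segment, edges = pd.qcut(
    cltv, 3,
    labels=["Low_Value", "Medium_Value", "High_Value"],
    retbins=True,
)

print(edges)                   # the CLTV values that separate the segments
print(segment.value_counts())  # how many customers landed in each segment
```

Because the split is by rank, a High_Value label only says that a customer is in the top third, which is why that segment can span a very wide range of CLTV values.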


### As you can see, the means and sums differ greatly between the segments. I have concluded that dividing customers into these 3 segments is reasonable.

In [49]:
cltv_df.groupby("segment").agg(["count","mean","sum"])

,total_transaction,total_transaction,total_transaction,total_unit,total_unit,total_unit,total_price,total_price,total_price,purchase_frequency,purchase_frequency,purchase_frequency,average_order_value,average_order_value,average_order_value,profit_margin,profit_margin,profit_margin,customer_value,customer_value,customer_value,CLTV,CLTV,CLTV
segment,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum,count,mean,sum
Low_Value,1446,1.32503,1916,1446,142.63347,206248,1446,220.60091,318988.91,1446,0.00031,0.44168,1446,184.60089,266932.88843,1446,44.12018,63797.782,1446,0.05085,73.53364,1446,7.75962,11220.41452
Medium_Value,1446,2.574,3722,1446,433.3686,626651,1446,712.59973,1030419.213,1446,0.00059,0.858,1446,349.52794,505417.40373,1446,142.51995,206083.8426,1446,0.16427,237.53324,1446,74.78352,108136.96694
High_Value,1446,8.91701,12894,1446,2997.86584,4334914,1446,5229.61119,7562017.781,1446,0.00206,2.97234,1446,723.37315,1045997.57143,1446,1045.92224,1512403.5562,1446,1.20554,1743.20373,1446,34079.94352,49279598.33298


### Automating the whole process with a function.

In [56]:
def create_cltv_df(dataframe,csv=False,profit=0.10):

    # Preparing the Data
    dataframe = dataframe[~(dataframe["Invoice"].str.contains("C", na=False))]
    dataframe = dataframe[~(dataframe["Invoice"].str.contains("A", na=False))]
    dataframe = dataframe[dataframe["Quantity"] > 0]
    dataframe = dataframe[dataframe["Price"] > 0]
    # Work on an explicit copy so the column assignments below do not touch a slice
    dataframe = dataframe.dropna().copy()
    dataframe["TotalPrice"] = dataframe["Quantity"] * dataframe["Price"]
    dataframe["Customer ID"] = dataframe["Customer ID"].astype(int)
    cltv_df = dataframe.groupby("Customer ID").agg({"Invoice": lambda x: x.nunique(),
                                                   "Quantity": lambda x: x.sum(),
                                                   "TotalPrice": lambda x: x.sum()})
    cltv_df.columns = ["total_transaction", 'total_unit', 'total_price']
    # purchase_frequency
    cltv_df["purchase_frequency"] = cltv_df["total_transaction"] / cltv_df.shape[0]
    # average_order_value
    cltv_df["avg_order_value"] = cltv_df["total_price"] / cltv_df["total_transaction"]
    # churn rate & repeat rate
    repeat_rate = cltv_df[cltv_df["total_transaction"] > 1].shape[0] / cltv_df.shape[0]
    churn_rate = 1 - repeat_rate
    # profit_margin
    cltv_df["profit_margin"] = cltv_df["total_price"] * profit
    # customer_value
    cltv_df["customer_value"] = (cltv_df["avg_order_value"] * cltv_df["purchase_frequency"])
    # Customer Lifetime Value (CLTV)
    cltv_df["CLTV"] = (cltv_df["customer_value"] / churn_rate) * cltv_df["profit_margin"]
    # Segment
    cltv_df["segment"] = pd.qcut(cltv_df["CLTV"], 3, labels=["Low_Value"
                                                             , "Medium_Value"
                                                             , "High_Value"])

    
    if csv:
        cltv_df.to_csv("cltv_c.csv")
    return cltv_df

In [57]:
df=data

In [58]:
create_cltv_df(df,csv=True)

Customer ID,total_transaction,total_unit,total_price,purchase_frequency,avg_order_value,profit_margin,customer_value,CLTV,segment
12346,1,74215,77183.60000,0.00023,77183.60000,7718.36000,17.79244,399015.94836,High_Value
12347,7,2458,4310.00000,0.00161,615.71429,431.00000,0.99355,1244.21299,High_Value
12348,4,2341,1797.24000,0.00092,449.31000,179.72400,0.41430,216.34773,High_Value
12349,1,631,1757.55000,0.00023,1757.55000,175.75500,0.40515,206.89766,High_Value
12350,1,197,334.40000,0.00023,334.40000,33.44000,0.07709,7.48984,Low_Value
...,...,...,...,...,...,...,...,...,...
18280,1,45,180.60000,0.00023,180.60000,18.06000,0.04163,2.18462,Low_Value
18281,1,54,80.82000,0.00023,80.82000,8.08200,0.01863,0.43750,Low_Value
18282,2,103,178.05000,0.00046,89.02500,17.80500,0.04104,2.12336,Low_Value
18283,16,1397,2094.88000,0.00369,130.93000,209.48800,0.48291,293.93987,High_Value
