### Installing libraries for accessing Excel data

In [None]:
!pip install xlrd

In [None]:
!pip install openpyxl

### Loading required libraries

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import datetime as dt

In [None]:
df = pd.read_excel('../input/uci-online-retail-ii-data-set/online_retail_II.xlsx', sheet_name="Year 2010-2011")
df.head()

### General information and playing with dataset

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.nunique()

#### Exploring individual variables

In [None]:
df['StockCode'].value_counts()

#### Printing the top 10 most ordered `StockCode` items (first item from each code)

In [None]:
# Use the string 'sum' rather than the Python builtin, which newer pandas versions warn about
top_codes = df.groupby('StockCode').agg({'Quantity': 'sum'}).sort_values('Quantity', ascending=False).head(10)
top_codes

In [None]:
idx = []
for code in top_codes.index:
    idx.append(df[df['StockCode'] == code].index[0])

df.loc[idx].sort_values('Quantity', ascending=False)

In [None]:
quantity = df['Country'].value_counts().head()
quantity

In [None]:
ax = sns.barplot(x=quantity.values, y=quantity.index)
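# value_counts() counts transaction lines per country, not ordered quantities
ax.set(title='Countries with the most transaction lines', xlabel='Number of transaction lines', ylabel='Country')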

It is clear that the dataset consists predominantly of transactions from the UK, followed by Germany and France.

#### Let's see the largest single order line in each country

In [None]:
df.loc[df.groupby('Country')['Quantity'].idxmax()].sort_values('Quantity', ascending=False).head(15)

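##### Well, it looks like Rabbit Night Light is pretty popular in the Netherlands, Japan and France, as it is the largest single order line in all three countries! Note that `idxmax` picks the row with the highest `Quantity`, which is not necessarily the item with the highest total quantity.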
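
The difference between "largest single order line" and "most ordered item" can be sketched on toy data. The values below are made up for illustration; only the column names mirror the dataset. Summing `Quantity` per country and item before taking the maximum gives the item with the highest total, which can differ from the row that `idxmax` would pick on the raw data.

```python
import pandas as pd

# Toy data: column names mirror the dataset, values are invented for illustration.
toy = pd.DataFrame({
    "Country": ["France", "France", "France", "Japan", "Japan"],
    "Description": ["A", "B", "B", "A", "C"],
    "Quantity": [10, 6, 6, 3, 8],
})

# Total quantity per (country, item), then keep the top item in each country.
totals = toy.groupby(["Country", "Description"], as_index=False)["Quantity"].sum()
top_per_country = totals.loc[totals.groupby("Country")["Quantity"].idxmax()]
print(top_per_country)
```

In this toy frame, the largest single French row is item A, while item B has the higher total once its two rows are summed.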

In [None]:
df_cop = df.copy()

### Data Pre-processing
#### Since many null values are present in the data, we drop those rows to make the analysis cleaner and safer.

In [None]:
df.isnull().sum()

In [None]:
df.dropna(inplace=True)

In [None]:
df.isnull().sum()

#### There are also many transactions with negative `Quantity`. Let's remove these as well, since they might interfere with the further analysis.

In [None]:
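print("Number of non-positive-quantity transactions before: ", (df['Quantity'] <= 0).sum())
# Keep strictly positive quantities; `> 1` would also drop single-unit purchases.
# .copy() avoids a SettingWithCopyWarning when columns are added later.
df = df[df['Quantity'] > 0].copy()
print("Number of non-positive-quantity transactions now: ", (df['Quantity'] <= 0).sum())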

#### Before moving further, let's add another column `Revenue` to the dataset which will denote the money made from each transaction.

In [None]:
df['Revenue'] = df['Quantity']*df['Price']

In [None]:
df.head()

## Customer Segmentation using RFM Analysis: (R)ecency, (F)requency, (M)onetary

#### This technique is used for determining marketing, PR and sales strategies based on the consumers' previous transaction data and habits.

### Recency
#### Recency refers to the time since the customer made their last purchase. Thus, a lower recency value means that the customer was recently in contact with the company.


### Frequency
#### It determines how often the customer has been making purchases. The more, the merrier.


### Monetary
#### Total money spent by the customer in all transactions

In [None]:
df.info()

### Recency Score

#### Usually, the Recency score is calculated as: today's date - last transaction date.
#### Using today's actual date is not realistic here, as it keeps changing. Instead, we use the date of the last transaction in the dataset as the reference date.

In [None]:
df['InvoiceDate'].max()

In [None]:
today_date = dt.datetime(2011, 12, 9)
today_date

In [None]:
# Convert Customer ID from float to integer (possible now that null IDs have been dropped)
df['Customer ID'] = df['Customer ID'].astype(int)

In [None]:
recency_df = today_date - df.groupby('Customer ID').agg({'InvoiceDate': 'max'})
recency_df.rename(columns={'InvoiceDate': 'Recency'}, inplace=True)
recency_df.head()

#### Convert recency score to numbers based on days

In [None]:
recency_df['Recency'] = recency_df['Recency'].dt.days
recency_df.head()
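
One caveat with a midnight reference date is that customers who bought later on that same day end up with a negative recency. A minimal sketch of one common workaround, assuming a snapshot date of midnight after the last invoice (the `snapshot` name and the toy values are illustrative, not part of the notebook):

```python
import pandas as pd

# Toy data: values are invented; column names mirror the dataset.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01 10:00", "2011-12-09 12:50", "2011-11-30 09:00"]
    ),
})

# Assumption: the snapshot is midnight after the last invoice,
# so no customer can have a negative recency.
snapshot = toy["InvoiceDate"].max().normalize() + pd.Timedelta(days=1)

# Last purchase per customer, then whole days elapsed until the snapshot.
last_purchase = toy.groupby("Customer ID")["InvoiceDate"].max()
recency = (snapshot - last_purchase).dt.days.rename("Recency")
print(recency)
```

Switching the reference date shifts recency values, so any figures quoted from the original run would change slightly.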

### Frequency Score

#### We can find the frequency for each customer by counting the number of distinct invoices per `Customer ID`

In [None]:
temp_df = df.groupby(["Customer ID","Invoice"]).agg({"Invoice":"count"})
temp_df.head()

In [None]:
freq_df = temp_df.groupby('Customer ID').agg({'Invoice': 'count'})
freq_df.rename(columns={'Invoice': 'Frequency'}, inplace=True)
freq_df.head()
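
The same count of distinct invoices per customer can also be obtained in one step with `nunique`. This is a sketch on toy data (the invoice codes are made up), shown as an alternative rather than a change to the cells above:

```python
import pandas as pd

# Toy data: customer 1 has two invoices (one with two lines), customer 2 has one.
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A1", "A1", "A2", "B1"],
})

# Number of distinct invoices per customer.
freq = toy.groupby("Customer ID")["Invoice"].nunique().rename("Frequency")
print(freq)
```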

### Monetary Score
#### We can calculate the monetary score by simply adding the Revenue for each CustomerID group

In [None]:
df

In [None]:
monetary_df = df.groupby("Customer ID").agg({"Revenue":"sum"})
monetary_df['Revenue'] = monetary_df['Revenue'].astype(int)
monetary_df.rename(columns={'Revenue': 'Monetary'}, inplace=True)
monetary_df.head()

In [None]:
# Concatenating all three
rfm = pd.concat([recency_df, freq_df, monetary_df],  axis=1)
rfm.head()

#### The table shown below is commonly used in RFM analysis.
#### It generally uses two parameters: Recency and Frequency.
#### Each parameter takes a score between 1 and 5, where 5 means very good and 1 means very bad. Customers are divided into segments according to these two scores.

#### The group that deserves the most attention is not the Champions group, as one might expect: those customers already visit, shop and spend. The most important segment is "Can't Loose". These customers used to shop a lot (Frequency score of 5) but have hardly visited recently, so they are likely to leave us. They require attention, and we should analyze their behaviour to win them back.

#### There is also a "Need Attention" group, with a Recency score of 3 and a Frequency score of 3. We can reactivate this group, as well as the dormant segments, by sending targeted mails and promotions, moving them to the right or upwards in the table. In short, this table captures the main logic of RFM analysis.

![](https://guillaume-martin.github.io/images/rfm-segments.png)

#### We divide each metric into 5 score categories with the `qcut` function, where 5 is the best and 1 is the worst. For Recency, the most recent transaction (the lowest value) is the best, so its labels are reversed.

In [None]:
rfm["RecencyScore"] = pd.qcut(rfm['Recency'], 5, labels=[5, 4, 3, 2, 1])
rfm["FrequencyScore"] = pd.qcut(rfm['Frequency'].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
rfm["MonetaryScore"] = pd.qcut(rfm['Monetary'], 5, labels=[1, 2, 3, 4, 5])
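
The `rank(method="first")` step for Frequency matters because many customers share the same low frequency, which can make the quantile edges collide. A small sketch on invented values illustrates this; the `tie_error` flag is only there for demonstration:

```python
import pandas as pd

# Invented frequencies with many ties at 1, as is typical for purchase counts.
freq = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 5, 9], name="Frequency")

# Direct qcut fails here because several quantile edges are identical.
try:
    pd.qcut(freq, 5, labels=[1, 2, 3, 4, 5])
    tie_error = False
except ValueError:
    tie_error = True

# Ranking first breaks ties by order of appearance, so five equal-sized bins exist.
scores = pd.qcut(freq.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
print(tie_error)
print(scores.tolist())
```

A side effect of this choice is that customers with identical frequencies can receive different scores, depending only on their row order.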

#### After calculating RFM scores, we combine scores as a new column

In [None]:
rfm["RFM_SCORE"] = (rfm['RecencyScore'].astype(str) +
                    rfm['FrequencyScore'].astype(str) +
                    rfm['MonetaryScore'].astype(str))

In [None]:
rfm

#### Here, segments are assigned based on the RFM scores. The score range for each segment is given below.
#### For example, the Hibernating segment covers customers with a Recency score of 1-2 and a Frequency score of 1-2.
#### The other segments are read in the same way.
#### Only Recency and Frequency are used here because the table includes only these two parameters, but Monetary could be added alongside them.

In [None]:
seg_map = {
    r'[1-2][1-2]': 'Hibernating',
    r'[1-2][3-4]': 'At Risk',
    r'[1-2]5': 'Can\'t Loose',
    r'3[1-2]': 'About to Sleep',
    r'33': 'Need Attention',
    r'[3-4][4-5]': 'Loyal Customers',
    r'41': 'Promising',
    r'51': 'New Customers',
    r'[4-5][2-3]': 'Potential Loyalists',
    r'5[4-5]': 'Champions'
}

In [None]:
# We add a new column to rfm dataframe as "Segment"

rfm['Segment'] = rfm['RecencyScore'].astype(str) + rfm['FrequencyScore'].astype(str)
rfm['Segment'] = rfm['Segment'].replace(seg_map, regex=True)
rfm.head(20)
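
To check how the patterns in `seg_map` partition the score grid, the mapping can be applied to all 25 Recency/Frequency combinations. The sketch below copies the dictionary from the cell above and uses a hypothetical helper, `segment`, built on `re.fullmatch` instead of `DataFrame.replace`:

```python
import re
from itertools import product

# Same patterns as in the notebook; the first digit is Recency, the second Frequency.
seg_map = {
    r'[1-2][1-2]': 'Hibernating',
    r'[1-2][3-4]': 'At Risk',
    r'[1-2]5': "Can't Loose",
    r'3[1-2]': 'About to Sleep',
    r'33': 'Need Attention',
    r'[3-4][4-5]': 'Loyal Customers',
    r'41': 'Promising',
    r'51': 'New Customers',
    r'[4-5][2-3]': 'Potential Loyalists',
    r'5[4-5]': 'Champions',
}

def segment(rf_code):
    """Return the segment name for a two-digit RF code (hypothetical helper)."""
    for pattern, name in seg_map.items():
        if re.fullmatch(pattern, rf_code):
            return name
    return None

# Map every possible RF code to its segment.
grid = {f"{r}{f}": segment(f"{r}{f}") for r, f in product(range(1, 6), repeat=2)}
print(grid["55"], "|", grid["15"], "|", grid["33"])
```

Every code from `11` to `55` falls into exactly one of the ten segments, so no customer is left unassigned.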

### In this way, we have divided all the customers into 10 segments based on their Recency and Frequency Scores.
#### We can now easily identify which customers the company needs to focus on, which ones are new, which ones are loyal, and so on.

### Visualizing the percentage of different segments

In [None]:
segments_counts = rfm['Segment'].value_counts().sort_values(ascending=True)

fig, ax = plt.subplots()

bars = ax.barh(range(len(segments_counts)),
              segments_counts,
              color='silver')
ax.set_frame_on(False)
ax.tick_params(left=False,
               bottom=False,
               labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)

for i, bar in enumerate(bars):
        value = bar.get_width()
        if segments_counts.index[i] == 'Can\'t Loose':  # must match the label used in seg_map exactly
            bar.set_color('firebrick')
        ax.text(value,
                bar.get_y() + bar.get_height()/2,
                '{:,} ({:}%)'.format(int(value),
                                   int(value*100/segments_counts.sum())),
                va='center',
                ha='left'
               )

plt.title('Different Customer Segments')
plt.show()

In [None]:
rfm[["Segment", "Recency","Frequency","Monetary"]].groupby("Segment").agg(["mean","count","max"]).head(20)

## Comments and Insights

#### The comments below are based on the descriptive statistics above (mean, count and max per segment). Four segments are selected:

### 1. Champions
#### There are 665 customers in this segment.
#### On average, their last purchase took place 4 days ago.
#### Their average shopping frequency is 15, and the maximum is 248 purchases.
#### The maximum amount spent by a customer is 279489 currency units.

### 2. About to Sleep
#### There are 369 customers in this segment.
#### On average, their last purchase took place 51 days ago.
#### Their average shopping frequency is 1, and the maximum is 2 purchases.
#### The maximum amount spent by a customer is 6208 currency units.

### 3. Need Attention
#### There are 190 customers in this segment.
#### On average, their last purchase took place 48 days ago.
#### Their average shopping frequency is 3, and the maximum is 4 purchases.
#### The maximum amount spent by a customer is 3546 currency units.

### 4. Can't Loose 
#### There are 68 customers in this segment.
#### On average, their last purchase took place 132 days ago.
#### Their average shopping frequency is 10, and the maximum is 35 purchases.
#### The maximum amount spent by a customer is 10217 currency units.

### ACTION 
Communicating with customers in the Champions segment makes them feel valued and appreciated. These customers are likely to account for a disproportionately high percentage of total revenue, so keeping them happy should be a top priority. Further analysis of individual preferences and affinities provides additional opportunities for personalized messaging, such as birthday greetings. They may also be early adopters of new products, and they can help promote our brand at no cost if we introduce our new products to them.

By sharing valuable resources with customers in the About to Sleep segment and recommending popular products or renewals at a discount, we can reconnect with them and win them back.

We can make limited-time offers to customers in the Need Attention segment to encourage them to shop with us, and make recommendations based on their past purchases. That way, we can reactivate them.

Can't Loose is one of the most important segments: these are the customers we do not want to lose. The fact that their last purchase was on average 132 days ago is a bad sign. However, their purchase counts are high, reaching up to 35. They do shop, but in bursts, and then do not return for a long time. We can win them back by offering renewals, newer products, special promotions or small loyalty-point rewards, and by reaching out to them via e-mail and SMS, so that we do not lose them to the competition. Conducting surveys to find out what went wrong, and acting on the results, can also help.

# Customer Lifetime Value Calculation

Firms use many techniques and methods to make critical decisions about their customers. Customer lifetime value (also called CLV or CLTV) is one technique that is rapidly gaining acceptance as a metric to acquire, grow, and retain the “right” customers in customer relationship management (CRM).

Customer lifetime value for a firm is the net profit or loss to the firm from a customer over the entire life of transactions of that customer with the firm.

In this part of the project, an example customer lifetime value calculation is worked through, and customers are divided into 4 segments according to their CLTV.

In [None]:
# cltv_df is created for ease of calculation and observation
# 'nunique' counts distinct invoices (orders); len() would count invoice lines instead
cltv_df = df.groupby('Customer ID').agg({'Invoice': 'nunique',
                                         'Quantity': 'sum',
                                         'Revenue': 'sum'})

In [None]:
# total transaction, total unit, total price
cltv_df.columns = ['total_transaction', 'total_unit', 'total_price']

In [None]:
cltv_df.head()

## Required formulas for Customer Lifetime Value Calculation

### - Average_Order_Value = Total_Revenue / Total_Number_of_Orders
### - Purchase_Frequency =  Total_Number_of_Orders / Total_Number_of_Customers
### - Customer_Value(CV)  = Average_Order_Value * Purchase_Frequency

### - Churn_Rate = 1 - Repeat_Rate
### - Profit_margin
### - CLTV = (Customer_Value / Churn_Rate) * Profit_margin
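
The formulas above can be traced with a small worked example for a single customer. All numbers below are invented for illustration, and the variable names are hypothetical. As in the cells that follow, the "profit margin" term is taken to be 5% of the customer's total revenue, which is an assumption of this notebook rather than a general rule.

```python
# Invented inputs for one customer and the overall customer base.
total_revenue = 1000.0   # revenue from this customer
orders = 4               # number of orders by this customer
n_customers = 200        # number of customers overall
repeat_rate = 0.6        # share of customers who ordered more than once
margin = 0.05            # assumed profit margin of 5%

average_order_value = total_revenue / orders            # 1000 / 4
purchase_frequency = orders / n_customers               # 4 / 200
customer_value = average_order_value * purchase_frequency
churn_rate = 1 - repeat_rate
profit = total_revenue * margin                         # plays the role of Profit_margin
cltv = (customer_value / churn_rate) * profit

print(average_order_value, purchase_frequency, customer_value, churn_rate, profit, cltv)
```

Because the number of customers and the churn rate are the same for everyone, they only rescale the result. The ranking of customers by CLTV under these formulas is driven by their revenue.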

The CV calculation requires the AOV and the purchase frequency.

Average Order Value = Total Revenue / Total Number of Orders

AOV Calculation:

In [None]:
cltv_df["avg_order_value"] = cltv_df["total_price"]/ cltv_df["total_transaction"]

In [None]:
#Customer number:
cltv_df.shape[0]

cltv_df["purchase_frequency"] = cltv_df["total_transaction"]/cltv_df.shape[0]


### Repeat Rate & Churn Rate


A total transaction count greater than 1 means that the customer purchased more than once.
For the repeat rate, customers whose total transaction count is greater than 1 are selected.


**Repeat Rate** = Number of customers who have purchased more than once / Number of Customers

The churn rate, also known as customer churn, is the rate at which customers stop doing business with an institution.

Formula for the calculation of churn rate:
**churn_rate** = 1- repeat_rate


In [None]:
repeat_rate = cltv_df[cltv_df.total_transaction > 1].shape[0]/cltv_df.shape[0]

churn_rate = 1- repeat_rate

### Profit Margin


For this problem the profit margin is taken as 5%. Strictly speaking, the value calculated here (5% of each customer's total revenue) is a profit.
It is nevertheless named `profit_margin` to stay consistent with the formulas.


In [None]:
# profit margin : 5 %
cltv_df["profit_margin"] = cltv_df["total_price"]* 0.05
cltv_df.head()


## Calculate Customer Lifetime Value

Customer value (CV) is calculated by using average order value and purchase frequency. Then by using CV, profit margin and churn rate CLTV is calculated.


In [None]:
# Customer Value(CV):
cltv_df["CV"] = cltv_df["avg_order_value"] * cltv_df["purchase_frequency"]
cltv_df.head()

#Customer Lifetime Value(CLTV):
cltv_df["CLTV"] = (cltv_df["CV"] /churn_rate)* cltv_df["profit_margin"]

cltv_df.sort_values("CLTV", ascending = False)

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range= (1,100))
scaler.fit(cltv_df[["CLTV"]])
cltv_df["SCALED_CLTV"] = scaler.transform(cltv_df[["CLTV"]])
cltv_df.sort_values("CLTV", ascending = False)
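
What the scaler does to the `CLTV` column can be reproduced by hand. The sketch below applies the min-max formula with a target range of 1 to 100 to three invented values, using pandas only. It is meant to show the arithmetic, not to replace `MinMaxScaler`.

```python
import pandas as pd

# Invented CLTV values for three customers.
cltv = pd.Series([0.0, 50.0, 200.0], name="CLTV")

# Min-max scaling to the range [lo, hi]:
# scaled = lo + (x - min) / (max - min) * (hi - lo)
lo, hi = 1, 100
scaled = lo + (cltv - cltv.min()) / (cltv.max() - cltv.min()) * (hi - lo)
print(scaled.tolist())
```

The transformation is monotonic, so the order of customers is unchanged. Quantile-based segments computed on the scaled column therefore match those computed on the raw `CLTV`.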

For comparison, selected columns of the dataframe are shown, sorted by `SCALED_CLTV`.


In [None]:
cltv_df[["total_transaction", "total_unit","total_price","CLTV", "SCALED_CLTV"]].\
sort_values(by = "SCALED_CLTV",ascending= False).head()

Segments are created by using qcut.

Segments are helpful to make the customer experience personalized

In [None]:
cltv_df["Segment"] = pd.qcut(cltv_df["SCALED_CLTV"], 4, labels = ["D", "C", "B", "A"])

Total transaction, total unit, total price, CLTV and SCALED CLTV values are interpreted with their count, sum, and mean values.

In [None]:
# A list (not a set) keeps the order of the aggregations deterministic
cltv_df.groupby("Segment")[["total_transaction", "total_unit", "total_price", "CLTV", "SCALED_CLTV"]].agg(
    ["count", "mean", "sum"])

### This way, we can segment the customers according to their Life Time Value

### Calculating the most popular month of the year

In [None]:
def num_to_name(month):
    names = {
        1: "January",
        2: "February",
        3: "March",
        4: "April",
        5: "May",
        6: "June",
        7: "July",
        8: "August",
        9: "September",
        10: "October",
        11: "November",
        12: "December"
    }
    
    return names[month]

In [None]:
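# Count invoice lines per calendar month; value_counts() already sorts in descending order.
# Naming the column explicitly keeps the result stable across pandas versions.
monthly_data = df['InvoiceDate'].dt.month.value_counts().rename('Invoices').to_frame()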
monthly_data

In [None]:
monthly_data['Month'] = monthly_data.index
monthly_data['Month'] = monthly_data['Month'].apply(lambda num: num_to_name(num))
monthly_data.rename(columns={'InvoiceDate': 'Invoices'}, inplace=True)
monthly_data

In [None]:
ax = sns.barplot(x='Invoices', y='Month', data=monthly_data)
ax.set(title='Popularity across months', xlabel='Number of invoices', ylabel='Month')

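### Therefore, we can conclude that November was the most popular month for the company by number of invoice lines. Confirming that it was also the most profitable month would require summing `Revenue` per month.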
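
A minimal sketch of that check on invented data: counting lines and summing revenue per month side by side. The toy values are chosen so that the busiest month is not the one with the highest revenue, to show that the two can differ. On the real data the same aggregation could be run on `df`, assuming the `Revenue` column created earlier.

```python
import pandas as pd

# Toy data: values are invented; column names mirror the dataset.
toy = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(
        ["2011-10-03", "2011-11-05", "2011-11-20", "2011-11-21"]
    ),
    "Revenue": [500.0, 100.0, 150.0, 120.0],
})

# Number of lines and total revenue per month name.
by_month = toy.groupby(toy["InvoiceDate"].dt.month_name()).agg(
    Lines=("Revenue", "count"),
    Revenue=("Revenue", "sum"),
)
print(by_month)
```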