# Business Problem with Customer Segmentation


An e-commerce company wants to segment its customers and tailor its marketing strategy to each segment.

To do this, we will describe customer behavior and group customers with similar behavior together.

In other words, customers who behave alike are placed in the same group, and we then try to develop sales and marketing techniques specific to each group.



### Data Set Story:

https://archive.ics.uci.edu/ml/datasets/Online+Retail+II

This Online Retail II data set contains all the transactions of a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011.

The company mainly sells unique all-occasion gift-ware. 

Many customers of the company are wholesalers.




### Attribute Information:

- InvoiceNo: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
- StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric. The day and time when a transaction was generated.
- UnitPrice: Unit price. Numeric. Product price per unit in sterling (£).
- CustomerID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal. The name of the country where a customer resides.
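
The attribute names above follow the data set documentation, while the code in this notebook addresses the columns as "Invoice", "Price" and "Customer ID". If a copy of the file used the documented names instead, a rename along these lines could align them. This is only a sketch: the toy frame and the `DOC_TO_FILE` mapping are assumptions for illustration, not part of the data set.

```python
import pandas as pd

# Hypothetical mapping from the documented attribute names to the
# column names used in the rest of this notebook.
DOC_TO_FILE = {
    "InvoiceNo": "Invoice",
    "UnitPrice": "Price",
    "CustomerID": "Customer ID",
}

# Toy frame standing in for a file that uses the documented names.
toy = pd.DataFrame({
    "InvoiceNo": ["489434", "C489449"],
    "UnitPrice": [6.95, 2.10],
    "CustomerID": [13085.0, 16321.0],
})

toy = toy.rename(columns=DOC_TO_FILE)
print(list(toy.columns))
```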



# Questions from data set


All questions refer to the 2009-2010 data.

1. What is the number of unique products?
2. Which products are there, and how often does each appear?
3. Which product is the most ordered?
4. How do we rank this output?
5. How many invoices have been issued?
6. How much money has been earned per invoice?
7. Which are the most expensive products?
8. How many orders came from which country?
9. Which country gained how much?
10. Which product is the most returned?
11. What should we do for customer segmentation with RFM?
12. Scoring for RFM.
13. Finally, save the New Customers segment to a file.

# Data Understanding 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

# to display all columns and rows:
pd.set_option('display.max_columns', None); pd.set_option('display.max_rows', None);


Set how many digits are shown after the decimal point. The option below uses 0, so float variables such as 'Price' are displayed rounded to whole numbers; the stored values themselves are unchanged.

In [None]:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
import matplotlib.pyplot as plt
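
To see what the display option does, the following sketch prints the same values with zero and with two decimals. The `prices` series is made-up toy data, and `pd.option_context` is used here only so that the setting stays local to the example.

```python
import pandas as pd

prices = pd.Series([2.55, 6.95, 0.42])  # toy values, not from the data set

# Display-only option: the stored values are unchanged.
with pd.option_context("display.float_format", lambda x: "%.0f" % x):
    rounded_view = prices.to_string(index=False)

with pd.option_context("display.float_format", lambda x: "%.2f" % x):
    two_decimal_view = prices.to_string(index=False)

print(rounded_view)
print(two_decimal_view)
```

With zero decimals the pence are hidden in every printed table, so a two-decimal format may be easier to read for price columns.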

In [None]:
df_2009_2010 = pd.read_excel("../input/online-retail-ii-data-set-from-ml-repository/online_retail_II.xlsx", sheet_name = "Year 2009-2010")

In [None]:
df = df_2009_2010.copy()

Take a first look at the data with the usual pandas inspection functions, such as head(), info() and describe().

## 1. What is the number of unique products?

In [None]:
df["Description"].nunique()

## 2. Which products are there, and how often does each appear?

In [None]:
df["Description"].value_counts().head()

## 3. Which product is the most ordered?

In [None]:
df.groupby("Description").agg({"Quantity":"sum"}).head()

## 4. How do we rank this output?

In [None]:
df.groupby("Description").agg({"Quantity":"sum"}).sort_values("Quantity", ascending = False).head()

## 5. How many invoices have been issued?

In [None]:
df["Invoice"].nunique()

## 6. How much money has been earned per invoice?

In [None]:
# it is necessary to create a new variable by multiplying two variables

df["TotalPrice"] = df["Quantity"]*df["Price"]

In [None]:
df.head()

In [None]:
df.groupby("Invoice").agg({"TotalPrice":"sum"}).head()

## 7. Which are the most expensive products?

In [None]:
df.sort_values("Price", ascending = False).head()

## 8. How many orders came from which country?

In [None]:
df["Country"].value_counts()

## 9. Which country gained how much?

In [None]:
df.groupby("Country").agg({"TotalPrice":"sum"}).sort_values("TotalPrice", ascending = False).head()

## 10. Which product is the most returned?

In [None]:
df[df['Invoice'].str.startswith("C", na=False)].sort_values("Quantity", ascending = True).head()
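
The cell above lists individual return rows. To answer the question per product, the returned quantities could be summed by description, as in this sketch. The toy frame is invented for illustration; the column names follow the ones used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Invoice": ["C1", "C1", "C2", 489434],
    "Description": ["MUG", "CANDLE", "MUG", "MUG"],
    "Quantity": [-2, -10, -5, 12],
})

# Cast to str so integer invoice numbers do not break the .str accessor.
is_return = toy["Invoice"].astype(str).str.startswith("C")

returned = (
    toy[is_return]
    .groupby("Description")["Quantity"]
    .sum()
    .sort_values()  # most negative total (most returned) first
)
print(returned.head())
```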

# Data Preparation

In [None]:
df.isnull().sum()

In [None]:
df.dropna(inplace = True)

In [None]:
df.shape

In [None]:
df.describe([0.01,0.05,0.10,0.25,0.50,0.75,0.90,0.95, 0.99]).T

In [None]:
for feature in ["Quantity","Price","TotalPrice"]:

    # Note: despite the names, Q1 and Q3 are the 1st and 99th percentiles,
    # so the fences below are much wider than the classic IQR rule.
    Q1 = df[feature].quantile(0.01)
    Q3 = df[feature].quantile(0.99)
    IQR = Q3-Q1
    upper = Q3 + 1.5*IQR
    lower = Q1 - 1.5*IQR

    # rows that fall outside the fences
    outliers = df[(df[feature] > upper) | (df[feature] < lower)]

    if not outliers.empty:
        print(feature,"yes")
        print(outliers.shape[0])
    else:
        print(feature, "no")
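
The same check can be wrapped in a small helper so that the quantiles and the multiplier are explicit. This is a sketch on toy data; `outlier_bounds` and its parameter names are hypothetical and not part of pandas.

```python
import pandas as pd

def outlier_bounds(series, low_q=0.01, high_q=0.99, k=1.5):
    """Return (lower, upper) fences built from two quantiles."""
    q_low = series.quantile(low_q)
    q_high = series.quantile(high_q)
    spread = q_high - q_low
    return q_low - k * spread, q_high + k * spread

# Toy data: the integers 1..100 plus one extreme value.
toy = pd.Series(list(range(1, 101)) + [10_000])
lower, upper = outlier_bounds(toy)
n_outliers = int(((toy < lower) | (toy > upper)).sum())
print(lower, upper, n_outliers)
```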

# Customer Segmentation with RFM Scores

RFM stands for Recency, Frequency, and Monetary.

It is a technique that helps determine marketing and sales strategies based on customers' buying habits.

- Recency: Time since the customer's last purchase

     -- In other words, it is the “time since the last contact of the customer”.

     -- Today's date - Last purchase date

     -- For example, if we run this analysis today, recency is today's date minus the date of the customer's last purchase.

     -- This can be for example 20 or 100 days. A customer with a recency of 20 is hotter, because they have been in contact with us more recently.

- Frequency: Total number of purchases.

- Monetary (Monetary Value): Total spending by the customer.
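
The three definitions above can be sketched in a single groupby on toy data. The frame, the `reference_date`, and the choice to count distinct invoices for Frequency are assumptions made for this illustration; the notebook itself builds the three measures step by step in the cells that follow.

```python
import datetime as dt
import pandas as pd

toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A", "A", "B", "C"],
    "InvoiceDate": pd.to_datetime(
        ["2010-11-01", "2010-11-01", "2010-12-01", "2010-10-10"]),
    "TotalPrice": [10.0, 5.0, 20.0, 7.5],
})
reference_date = dt.datetime(2010, 12, 10)  # assumed "today"

rfm_toy = toy.groupby("Customer ID").agg(
    # days between the reference date and the last purchase
    Recency=("InvoiceDate", lambda d: (reference_date - d.max()).days),
    # number of distinct invoices
    Frequency=("Invoice", "nunique"),
    # total amount spent
    Monetary=("TotalPrice", "sum"),
)
print(rfm_toy)
```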


In [None]:
df.head()

In [None]:
df.info()

In [None]:
df["InvoiceDate"].min()

In [None]:
df["InvoiceDate"].max()

What counts as "today"? If we used the actual current date, every customer would appear to have been inactive for a very long time.

For this reason, we define our own "today" based on the data set.

We can set it to the last day recorded in the data set.

Segmentation is then done relative to the date of the last record.

In [None]:
import datetime as dt

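# Midnight at the start of 2010-12-09. Any invoice issued later on that
# same day would get a recency of -1 days.
today_date = dt.datetime(2010,12,9)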

In [None]:
today_date
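
One way to avoid negative recency values is to derive the reference date from the data and move it one day past the last record. The sketch below compares both choices on two made-up timestamps; `reference_date` is a hypothetical name, and the timestamps are toy values rather than a claim about the data set.

```python
import pandas as pd

invoice_dates = pd.Series(pd.to_datetime([
    "2010-12-08 09:15", "2010-12-09 20:01",
]))

# Midnight of the last day: a purchase later that day yields -1.
midnight = pd.Timestamp(2010, 12, 9)
naive_recency = (midnight - invoice_dates).dt.days

# Assumed alternative: midnight one day after the last recorded date.
reference_date = invoice_dates.max().normalize() + pd.Timedelta(days=1)
safe_recency = (reference_date - invoice_dates).dt.days

print(naive_recency.tolist(), safe_recency.tolist())
```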

## 11. Show the last shopping dates of each customer.

In [None]:
df.groupby("Customer ID").agg({"InvoiceDate":"max"}).head()

Now we have the last shopping date of each customer. Let's convert "Customer ID" from float to integer.

In [None]:
df["Customer ID"] = df["Customer ID"].astype(int)

## 12. What should we do for customer segmentation with RFM?

For each customer, we subtract the last purchase date from today's date.

This gives a single recency value per customer.

In [None]:
(today_date - df.groupby("Customer ID").agg({"InvoiceDate":"max"})).head()

In [None]:
temp_df = (today_date - df.groupby("Customer ID").agg({"InvoiceDate":"max"}))

In [None]:
temp_df.rename(columns={"InvoiceDate": "Recency"}, inplace = True)

In [None]:
temp_df.head()

In [None]:
recency_df = temp_df["Recency"].apply(lambda x: x.days)

In [None]:
recency_df.head()

In [None]:
#df.groupby("Customer ID").agg({"InvoiceDate": lambda x: (today_date - x.max()).days}).head()

# Frequency

In [None]:
temp_df = df.groupby(["Customer ID","Invoice"]).agg({"Invoice":"count"})

In [None]:
temp_df.head()

In [None]:
temp_df.groupby("Customer ID").agg({"Invoice":"count"}).head()

In [None]:
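# "sum" adds up the per-invoice row counts, so Frequency here is the number of
# invoice lines per customer; "count" in the previous cell gives the number of invoices.
freq_df = temp_df.groupby("Customer ID").agg({"Invoice":"sum"})
freq_df.rename(columns={"Invoice": "Frequency"}, inplace = True)
freq_df.head()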
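
The difference between counting invoice lines and counting invoices can be seen on a small example. The toy frame below is invented; which of the two definitions suits Frequency better is a modelling choice, and this sketch only shows how each could be computed.

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["A", "A", "B", "C"],
})

# Rows per customer: what summing the per-invoice row counts gives.
line_items = toy.groupby("Customer ID")["Invoice"].count()

# Distinct invoices per customer: one count per purchase.
invoices = toy.groupby("Customer ID")["Invoice"].nunique()

print(line_items.to_dict(), invoices.to_dict())
```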

# Monetary

In [None]:
monetary_df = df.groupby("Customer ID").agg({"TotalPrice":"sum"})

In [None]:
monetary_df.head()

In [None]:
# let's rename the column

monetary_df.rename(columns={"TotalPrice": "Monetary"}, inplace = True)

In [None]:
print(recency_df.shape,freq_df.shape,monetary_df.shape)

In [None]:
rfm = pd.concat([recency_df, freq_df, monetary_df],  axis=1)

In [None]:
rfm.head()

## Now we need to score customers on how recently they bought (Recency), how often they bought (Frequency) and how much they spent (Monetary).

## 13. Scoring for RFM

- Let's start with Recency: the most recent buyers get a score of 5. We use the 'qcut' method to score.

In [None]:
rfm["RecencyScore"] = pd.qcut(rfm['Recency'], 5, labels = [5, 4, 3, 2, 1])   

In [None]:
rfm["FrequencyScore"] = pd.qcut(rfm['Frequency'].rank(method = "first"), 5, labels = [1, 2, 3, 4, 5])

In [None]:
rfm["MonetaryScore"] = pd.qcut(rfm['Monetary'], 5, labels = [1, 2, 3, 4, 5])

In [None]:
rfm.head()
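
Two details of the scoring cells are easy to miss: the Recency labels are reversed because a small recency is good, and Frequency is ranked first because heavily tied values can make the quantile bin edges non-unique. The sketch below illustrates both on made-up numbers.

```python
import pandas as pd

recency = pd.Series([1, 5, 10, 20, 40, 80, 120, 200, 300, 365])
frequency = pd.Series([1, 1, 1, 1, 1, 1, 2, 3, 5, 9])

# Low recency is good, so the labels run from 5 down to 1.
r_score = pd.qcut(recency, 5, labels=[5, 4, 3, 2, 1])

# Many ties make the raw bin edges non-unique, so qcut raises ValueError.
try:
    pd.qcut(frequency, 5, labels=[1, 2, 3, 4, 5])
    raw_qcut_failed = False
except ValueError:
    raw_qcut_failed = True

# Ranking with method="first" breaks the ties before binning.
f_score = pd.qcut(frequency.rank(method="first"), 5, labels=[1, 2, 3, 4, 5])

print(r_score.tolist(), f_score.tolist(), raw_qcut_failed)
```

Note that ranking with method="first" assigns tied customers to different bins based on row order, so equal frequencies can end up with different scores.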

Let's concatenate the three scores into a single RFM string.

In [None]:
(rfm['RecencyScore'].astype(str) + 
 rfm['FrequencyScore'].astype(str) + 
 rfm['MonetaryScore'].astype(str)).head()

In [None]:
rfm["RFM_SCORE"] = rfm['RecencyScore'].astype(str) + rfm['FrequencyScore'].astype(str) + rfm['MonetaryScore'].astype(str)

In [None]:
rfm.head()

In [None]:
rfm.describe().T

Customers who score 5 on all three measures (555) are the champions.

In [None]:
rfm[rfm["RFM_SCORE"] == "555"].head()

Customers who score 1 on all three measures (111) are the lowest.

In [None]:
rfm[rfm["RFM_SCORE"] == "111"].head()

Let's build the segments with regex. We set the monetary score aside and use only the recency (r) and frequency (f) scores.

Example: if r is 1-2 and f is 1-2, the segment is 'Hibernating'.

In [None]:
seg_map = {
    r'[1-2][1-2]': 'Hibernating',
    r'[1-2][3-4]': 'At Risk',
    r'[1-2]5': 'Can\'t Lose',
    r'3[1-2]': 'About to Sleep',
    r'33': 'Need Attention',
    r'[3-4][4-5]': 'Loyal Customers',
    r'41': 'Promising',
    r'51': 'New Customers',
    r'[4-5][2-3]': 'Potential Loyalists',
    r'5[4-5]': 'Champions'
}

In [None]:
rfm['Segment'] = rfm['RecencyScore'].astype(str) + rfm['FrequencyScore'].astype(str)
rfm['Segment'] = rfm['Segment'].replace(seg_map, regex=True)
rfm.head()
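
How the regex replacement turns a two-character score into a segment name can be checked on a few values. The sketch uses only three of the patterns from the map above; `seg_map_toy` and the sample scores are made up for illustration.

```python
import pandas as pd

seg_map_toy = {
    r"[1-2][1-2]": "Hibernating",
    r"33": "Need Attention",
    r"5[4-5]": "Champions",
}

# Recency score followed by frequency score, as two-character strings.
rf = pd.Series(["12", "33", "55", "21"])
segments = rf.replace(seg_map_toy, regex=True)

print(segments.tolist())
```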

In [None]:
rfm[["Segment", "Recency","Frequency","Monetary"]].groupby("Segment").agg(["mean","count"])

## As an example, let's interpret the Champions segment.

- There are 666 customers in this segment, and their average Recency is 5 days,
- their average Frequency is 272,
- and their average Monetary value is 6533 currency units.

Now, let's select the Need Attention segment.
As a strategy: you can export these customers' "Customer ID" values to Excel and send them to the sales department, which can then prepare a targeted campaign for them.

In [None]:
rfm[rfm["Segment"] == "Need Attention"].head()

## 14. Finally, save the New Customers segment to a CSV file named new_customers.csv.

In [None]:
rfm[rfm["Segment"] == "New Customers"].index

In [None]:
new_df = pd.DataFrame()
new_df["NewCustomerID"] = rfm[rfm["Segment"] == "New Customers"].index

In [None]:
new_df.head()

In [None]:
new_df.to_csv("new_customers.csv")
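
By default to_csv also writes the row index as an extra unnamed column. If only the customer IDs are wanted, passing index=False is one option, as sketched here. The toy frame is invented, and an in-memory buffer stands in for the output file.

```python
import io
import pandas as pd

rfm_toy = pd.DataFrame(
    {"Segment": ["New Customers", "Champions", "New Customers"]},
    index=pd.Index([12347, 12348, 12349], name="Customer ID"),
)

new_ids = rfm_toy[rfm_toy["Segment"] == "New Customers"].index
out = pd.DataFrame({"NewCustomerID": new_ids})

buffer = io.StringIO()           # stands in for a file path
out.to_csv(buffer, index=False)  # index=False avoids an unnamed column
csv_text = buffer.getvalue()
print(csv_text)
```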



# Conclusion

After this notebook, my aim is to prepare a kernel on a data set that is not clean.

If you have any suggestions, please write to me. I will be happy to receive comments and criticism!

Thank you for your suggestions and votes ;)

