In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **What is customer behavior?** 

The decisions and instincts that make a customer buy a certain product or service can be described as customer behavior.

With the advent of targeted marketing, traditional marketing techniques are getting obsolete with every new day. The rise of digital marketing, where every customer is shown advertisements particular to their interests and habits, has taken over the world.

This insight into customer’s interests and habits is obtained through an extensive customer behavior analysis approach. We will try to implement a very basic level of this approach that will include finding the products that are selling more and at which time of the day. Then we will group the customers according to their buying habits.

# **Why is it important?**

Do you know that the average attention span of a person is at an all-time low? This means that an average advertiser or salesperson has only seven seconds to grasp a customer’s attention before they move to another product as there are so many options available for them to choose from.

A customer will only be interested in your product if they somehow get convinced that it aligns with their interests and habits.



Do users prefer the products of a specific brand?

What is the user’s activity (view, cart, purchase) throughout the day?

Items from which brands and categories are most preferred by users?

Can we effectively conduct targeted marketing?

# Brand analysis 

A brand is a term that differentiates one product from another. In this analysis, we will review whether people like to purchase products with a popular brand or a product without a brand.

For this analysis, only the products actually bought by the users will be considered. In our dataset, the products which have no brand are given a NaN value.
This will be done in two steps:

Separate the original DataFrame into two DataFrames. One with all the products with brands and one with all the products without brands.

Fetch all those rows from the two DataFrames where the event_type value is purchase.

As a final result, two DataFrames will be obtained: one containing the purchased products with a brand, and one containing the purchased products without a brand.

In [None]:
#BRAND_ANALYSIS

df = pd.read_csv('../input/ecommerce-behavior-data-from-multi-category-store/2019-Nov.csv') # Reading the data from file

print(df.head())

In [None]:
# Step 1

# Fetch rows with brand
with_brand = df[df['brand'].notna()]

# Fetch rows without brand
without_brand = df[df['brand'].isna()]

# Step 2

# Purchased products with brands
with_brand = with_brand[with_brand['event_type'] == 'purchase']
print(with_brand)

# Purchased products without brands
without_brand = without_brand[without_brand['event_type'] == 'purchase']
print(without_brand)

In [None]:
# Get length of original dataframe with purchased products
org = len(df[df['event_type'] == 'purchase'])

# Divide the length of with_brand dataframe with length org dataframe
brand_p = len(with_brand) / org
print(brand_p * 100)

# Divide the length of without_brand dataframe with length org dataframe
brand_a = len(without_brand) / org

print(brand_a * 100)

According to the above output, approximately 92% of the purchased products were associated with a brand, and only about 8% of the purchased products had no brand.
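
The same split can also be computed in one pass, without building two separate DataFrames. The following is a minimal sketch on a made-up toy table (the rows and the names `toy`, `has_brand`, `branded_pct`, and `unbranded_pct` are assumptions for illustration, not part of the original notebook):

```python
import pandas as pd

# Toy stand-in for the real event log (values are invented for illustration)
toy = pd.DataFrame({
    "event_type": ["purchase", "view", "purchase", "purchase", "cart", "purchase"],
    "brand": ["samsung", "apple", None, "xiaomi", None, "apple"],
})

# Keep only the purchase events
toy_purchases = toy[toy["event_type"] == "purchase"]

# True where the purchased product has a brand, False where the brand is missing
has_brand = toy_purchases["brand"].notna()

# The mean of a boolean Series is the fraction of True values
branded_pct = has_brand.mean() * 100
unbranded_pct = 100 - branded_pct

print(branded_pct, unbranded_pct)
```

On the real data, replacing `toy` with `df` should give the same two percentages as the cell above.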

# **THE HYPOTHESIS**

A hypothesis can be drawn based on the above results.

For marketers, most of the marketing budget should be allotted to the advertisement of branded products.

For inventors or entrepreneurs, always introduce the product with a brand name because products without a brand have a very low probability of getting bought.



# **Users activity**

The user can perform three actions that get recorded in the event_type column of the dataset.

view: The user can view an item.

cart: The user can add the item to the cart.

purchase: The user can purchase the item.

Analyzing the view and purchasing actions of the user across the different timelines in a month can provide very important information as to at what time most of the users visit the site. When such times are known, resources can be allocated according to that information to optimize performance.

For example, if we know that a significant number of users visit the site on Sunday just to view the products, resources from other components can be transferred to the viewing components to enhance the user experience. The same approach can be used on other components if we know at what times a certain user activity is preferred.

Let’s apply this approach to our data and review what analysis can be drawn from it.

# **Preprocessing** 

Before we move to extract information, some preprocessing needs to be done on our initial DataFrame. The time values are separated from the event_time column and are made into separate columns. The day, week_day, and hour are computed for each event_time value.

In [None]:

#Convert the type of event_time column to datetime
df['event_time'] = pd.to_datetime(df.event_time)

# Calculate and add relevant columns to track users activity
df["week_day"] = df['event_time'].map(lambda x: x.dayofweek + 1)
df["day"] = df['event_time'].map(lambda x: x.day)
df["hour"] = df['event_time'].map(lambda x: x.hour)

print(df)
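
The three columns can also be derived with the vectorized `.dt` accessor instead of `map` with a lambda, which is usually faster on a file of this size. A minimal sketch on a toy frame (the two timestamps and the name `toy` are assumptions for illustration):

```python
import pandas as pd

# Two invented timestamps: a Friday at midnight and a Sunday afternoon
toy = pd.DataFrame({"event_time": ["2019-11-01 00:00:00", "2019-11-03 15:30:00"]})
toy["event_time"] = pd.to_datetime(toy["event_time"])

# .dt exposes datetime components for the whole column at once
toy["week_day"] = toy["event_time"].dt.dayofweek + 1  # 1 = Monday ... 7 = Sunday
toy["day"] = toy["event_time"].dt.day
toy["hour"] = toy["event_time"].dt.hour

print(toy)
```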

# **Weekly analysis**

In this part, we will review a weekly analysis of the number of views. This will reveal the day of the week on which the most or least number of views occur for the website.

In [None]:

# Get all the view events of all users
viewed = df[df['event_type'] == 'view']

# Count the views per week day (1 = Monday ... 7 = Sunday) and plot them in a line chart
views_per_day = viewed['week_day'].value_counts().sort_index()
view_plot = views_per_day.plot(kind = 'line', figsize = (15,6))

# Set properties of the plot
view_plot.set_xlabel('Day of the week',fontsize = 15)
view_plot.set_ylabel('Number of Views',fontsize = 15)
view_plot.set_title('Number of views for different Week Days',fontsize = 15)
# Fix the tick positions first so that each label lines up with its week day
view_plot.set_xticks(range(1, 8))
view_plot.set_xticklabels(('Mon','Tue','Wed','Thur','Fri', 'Sat','Sun'), rotation = 'horizontal', fontsize = 15)

#plot the graph
plt.show()

*In the above graph, it can be observed that most items are viewed during the working days instead of on the weekends. This represents the aggregated number of website views for each day of the week in November 2019.*

# **Hourly analysis**

In this part, an hourly analysis of the number of views will be created. This will reveal at which hour of the day the most and least number of views occur for the website.

In [None]:
#Convert the type of event_time column to datetime
df['event_time'] = pd.to_datetime(df.event_time)

df["week_day"] = df['event_time'].map(lambda x: x.dayofweek + 1)
df["day"] = df['event_time'].map(lambda x: x.day)
df["hour"] = df['event_time'].map(lambda x: x.hour)

# Get all the view events of all users
viewed = df[df['event_type'] == 'view']

# Count the views per hour; reindex so that all 24 hours (0-23) appear, even those with no views
views_per_hour = viewed['hour'].value_counts().reindex(range(24), fill_value = 0)

# Plot the number of views against all 24 hours of the day in a bar chart
view_plot = views_per_hour.plot(kind = 'bar', figsize = (15,6))

# Set properties of the plot
view_plot.set_xlabel('Hour',fontsize = 15)
view_plot.set_ylabel('Number of Views',fontsize = 15)
view_plot.set_title('Number of views for different Hours of Days',fontsize = 15)
view_plot.set_xticklabels(range(24), rotation='horizontal', fontsize=15)

#plot the graph
plt.show()

In the above graph, it can be observed that most items are viewed in the working hours instead of the free hours. The number of views starts increasing from the start of the day, reaching its peak between 3 and 5 P.M. Then it starts to drop. This is the combined result for each day of November 2019.

The code is the same as for the weekly analysis. Only the week_day column is replaced with the hour column in the plotting statement, and the plot properties are renamed to match the new analysis.

# **The hypothesis**

From the above weekly and hourly analyses, it can be observed that most users browse items during the working hours of working days. Other time slots are also important, but during these peak slots most resources should be allocated to the viewing or browsing component of the website to optimize the user experience, which in turn brings profit.

Try doing the same weekly and hourly analysis for the number of products purchased to determine whether the view results hold for the purchase part or not.
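
As a starting point for that exercise, the counting step could look like the sketch below. It runs on a small invented table (the rows and the names `toy`, `bought`, `purchases_per_day`, and `purchases_per_hour` are assumptions for illustration); on the real data, the two resulting Series can be plotted the same way as the view counts.

```python
import pandas as pd

# Toy stand-in for the preprocessed event log (values are invented)
toy = pd.DataFrame({
    "event_type": ["view", "purchase", "purchase", "view", "purchase"],
    "event_time": pd.to_datetime([
        "2019-11-01 09:00:00", "2019-11-01 10:00:00", "2019-11-02 10:00:00",
        "2019-11-03 20:00:00", "2019-11-08 16:00:00",
    ]),
})
toy["week_day"] = toy["event_time"].dt.dayofweek + 1  # 1 = Monday ... 7 = Sunday
toy["hour"] = toy["event_time"].dt.hour

# Keep purchases instead of views
bought = toy[toy["event_type"] == "purchase"]

# reindex guarantees one entry per week day / hour, with 0 where nothing was bought
purchases_per_day = bought["week_day"].value_counts().reindex(range(1, 8), fill_value=0)
purchases_per_hour = bought["hour"].value_counts().reindex(range(24), fill_value=0)

print(purchases_per_day)
print(purchases_per_hour)
```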


# Identifying Famous Brands and Categories
The most common problem faced by any business is inventory management. Business owners sometimes have too much of a product that is not being sold, or too little of a product whose demand is very high. This can cause a substantial loss to a company’s profits and reputation.

If we somehow know what products from which brands and categories are selling the most in the market, then inventory management can be optimized to some level. Here, products from which brand and category were bought the most will be determined.

# Top brands 

The data has already been read and the event_time column converted to datetime format in the cells above. The following steps are performed to obtain the top brands.

In [None]:

# Get rows where products are purchased
purchase = df[df['event_type'] == 'purchase']

# Group the DataFrame on brands
top_brands = purchase.groupby('brand')

# Get number of products bought by computing length of each grouped brand
top_brands = top_brands['brand'].agg([len])

# Sort the result on obtained length in descending order
top_brands.sort_values('len', ascending = False, inplace = True)

print(top_brands)

*According to this, Samsung is the most popular brand: its products are purchased more often than those of any other brand.*

# Top categories 
The same steps as above will be performed here, but instead of the brand column, the category_code column will be used.

In [None]:


# Get rows where products are purchased
purchase = df[df['event_type'] == 'purchase']

# Group the DataFrame on category_code
top_catg = purchase.groupby('category_code')

# Get number of products bought by computing length of each grouped category_code
top_catg = top_catg['category_code'].agg([len])

# Sort the result on obtained length in descending order
top_catg.sort_values('len', ascending = False, inplace = True)

print(top_catg)

The same technique and codes to find the top brands are used to get the top categories. Only the brand column is replaced with the category_code column.
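
Since only the column name changes between the two analyses, the steps can be wrapped in one small helper. This is a sketch under stated assumptions: `top_values` is a hypothetical function name, and the toy rows are invented. It uses `value_counts()`, which counts rows per value, ignores missing values, and sorts in descending order, so it should match the group-and-count approach above.

```python
import pandas as pd

def top_values(events, column, n=5):
    """Return the n most frequently purchased values of `column`."""
    purchases = events[events["event_type"] == "purchase"]
    # value_counts drops NaN and sorts from most to least frequent
    return purchases[column].value_counts().head(n)

# Toy stand-in for the real event log (values are invented)
toy = pd.DataFrame({
    "event_type": ["purchase"] * 6 + ["view"],
    "brand": ["samsung", "apple", "samsung", "xiaomi", "samsung", "apple", "apple"],
    "category_code": ["electronics.smartphone"] * 5 + [None, "electronics.audio.headphone"],
})

print(top_values(toy, "brand", n=2))
print(top_values(toy, "category_code", n=2))
```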

According to the above output, the smartphone category is by far the most purchased. The gap between it and the other categories in the number of products bought is clearly visible.

# **The hypothesis** 

According to the above analysis, the top brands are all mobile and mobile accessory companies. The top category is smartphones, with over 300,000 sales, and the other categories don’t even come close to this number. It can be concluded that the inventory should be well stocked with products from the smartphone category, concentrated on the top five or six brands.

# RFM analysis 

RFM is a categorizing technique that uses the previous purchasing behavior of the customers to divide customers into groups so that an optimal marketing strategy can be developed for each individual. RFM stands for recency, frequency, and monetary, respectively.

Recency: How many days have passed since a customer has bought an item

Frequency: How many orders a customer has placed

Monetary: How much money a customer has spent

# Need for RFM analysis 

This technique efficiently categorizes the customers into specific rank-based groups taking into account their past online behaviors.

This can help marketers and advertisers target each group of consumers separately, enabling them to cater to the needs of groups instead of each individual.

This technique also informs us of the most and least profit yielding customers so relevant resources can be deployed to each group according to their needs.

If the results of this technique are correctly used, then even customers who don’t engage in much activity (view, cart, purchase) can be influenced to become high-potential customers.

# RFM technique and steps to perform
In this process, the customers are separated into four groups under each of the RFM metrics, i.e., recency, frequency, and monetary. This means we’ll have a maximum of (4 x 4 x 4) sixty-four groups to deal with, which is not very large considering that the total number of customers can be in the thousands. Quantiles will be used to divide the customers into groups.

The following steps will be performed to get the final list of segmented customers.

**Step 1: Get the purchase data of all customers.**

In [None]:
# Get rows with event_type equals purchase
purchased = df[df['event_type'] == 'purchase']

# Filter relevant data from Data Frame
purchased = purchased[['user_id', 'user_session', 'event_time' ,'price']]

print(purchased)

We can now tell which customer placed how many orders, at what price, and at what time. Because only purchase events remain in the data set, the number of user_session values recorded against a single user_id gives the number of orders that user placed. The same user_session can appear more than once for a user_id, since a user might make multiple purchases during a single visit, and a user can also have several different sessions.
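
The difference between counting purchase rows and counting distinct sessions can be seen on a small invented example (the rows and the names `toy_purchased` and `per_user` are assumptions for illustration). The notebook uses the row count as the frequency; the distinct-session count is shown only for comparison.

```python
import pandas as pd

# User 1 buys twice in session s1 and once in session s2; user 2 buys once
toy_purchased = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "user_session": ["s1", "s1", "s2", "s3"],
    "price": [10.0, 5.0, 20.0, 7.5],
})

per_user = toy_purchased.groupby("user_id")["user_session"].agg(
    purchase_rows="count",        # every purchase event, repeated sessions included
    distinct_sessions="nunique",  # each session counted once
)

print(per_user)
```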

**Step 2: Compute the RFM metrics for each customer.**

In [None]:
# Compute the R, F, and M values for each user
rfm = purchased.groupby('user_id').agg({'event_time': lambda date: ((purchased['event_time'].max()) - date.max()),
                                    'user_session': lambda num: num.count(),
                                    'price': lambda price: price.sum()})
print(rfm)


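# Convert the elapsed time to whole days; add 1 so that the most recent purchase gets recency 1
rfm['event_time'] = pd.to_timedelta(rfm['event_time']).dt.days + 1

# Rename the columns to the RFM metric names
rfm.columns=['recency','frequency','monetary']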

print(rfm)
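
To make the three metrics concrete, here is a minimal sketch of the same computation on three invented purchases, written with named aggregation instead of lambdas. The rows and the names `toy_purchased`, `snapshot`, and `toy_rfm` are assumptions for illustration.

```python
import pandas as pd

# Toy purchase log (values are invented)
toy_purchased = pd.DataFrame({
    "user_id": [1, 1, 2],
    "user_session": ["s1", "s2", "s3"],
    "event_time": pd.to_datetime([
        "2019-11-01 10:00:00", "2019-11-20 12:00:00", "2019-11-30 09:00:00",
    ]),
    "price": [100.0, 50.0, 20.0],
})

# Reference point: the latest purchase in the whole data set
snapshot = toy_purchased["event_time"].max()

toy_rfm = toy_purchased.groupby("user_id").agg(
    last_purchase=("event_time", "max"),
    frequency=("user_session", "count"),
    monetary=("price", "sum"),
)

# Whole days since the user's last purchase, plus 1 so the minimum recency is 1
toy_rfm["recency"] = (snapshot - toy_rfm["last_purchase"]).dt.days + 1
toy_rfm = toy_rfm[["recency", "frequency", "monetary"]]

print(toy_rfm)
```

User 1 last bought 9 days and 21 hours before the snapshot, which truncates to 9 whole days and gives a recency of 10. User 2 made the latest purchase and gets a recency of 1.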

**Step 3: Compute ranks for each RFM metric using quantiles.**

Now that we have the correct R, F, and M values, it’s time to rank them. It should be noted that in the case of recency, the lower the value the better, but for frequency and monetary, the higher the value the better. So, recency is inverse of frequency and monetary.

As mentioned above, each of the RFM values needs to be divided into four groups, so quantiles are used to categorize the R, F, and M values into the correct groups.

The 1st, 2nd, and 3rd quartiles (the 0.25, 0.50, and 0.75 quantiles) of the recency, frequency, and monetary columns are calculated from the rfm DataFrame and then converted to a dictionary for easy access.

In [None]:
# Compute the quartiles of each RFM column
quantiles = rfm.quantile(q=[0.25,0.50,0.75])

# Convert to a nested dictionary: {metric: {quantile: value}}
quantiles = quantiles.to_dict()

print(quantiles)


Now, for each R, F, and M metric in the rfm DataFrame, their values will be compared with their quantile values and will be assigned a rank between 1 and 4 based on the comparison. Here, 1 indicates the highest rank and 4 indicates the lowest rank.

The following functions will compute ranks for the R, F, and M values in the rfm DataFrame.

In [None]:
# Compute Ranks for Recency metric
def Compute_R(val,metric,quantile):
    if val <= quantile[metric][0.25]:
        return 1
    elif val <= quantile[metric][0.50]:
        return 2
    elif val <= quantile[metric][0.75]: 
        return 3
    else:
        return 4
    
# Compute Ranks for Frequency & Monetary metrics
def Compute_FM(val,metric,quantile):
    if val <= quantile[metric][0.25]:
        return 4
    elif val <= quantile[metric][0.50]:
        return 3
    elif val <= quantile[metric][0.75]: 
        return 2
    else:
        return 1

Both functions in the above code snippet have the same parameters and return values and are only different in their comparisons. As low values are better for recency and high values are better for frequency and monetary, two functions are created.

The val parameter is the value of the R, F, or M metric from the rfm DataFrame that is compared with its respective quantile values.

The metric parameter can be recency, frequency or monetary and is used to access the correct quantile value from the quantiles dictionary.

The quantile parameter is the calculated quantiles dictionary which contains the 1st, 2nd, and 3rd quantiles of the R, F, and M metrics from the rfm DataFrame.

The above functions compare the input values with the quantile values of the respective metrics and return ranks based on the mentioned conditions. For recency, the lower the value the higher the rank. For frequency and monetary, the higher the value the higher the rank.
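
The ranking rule can be checked by hand on four invented customers. The sketch below folds both comparisons into a single hypothetical helper, `quartile_rank`, with a `lower_is_better` flag; this helper and the toy values are assumptions for illustration, and the two functions above remain the ones used in the rest of the notebook.

```python
import pandas as pd

def quartile_rank(val, metric, quantile, lower_is_better):
    """Rank val from 1 (best) to 4 (worst) against the quartiles of metric."""
    cuts = quantile[metric]
    # 1 if val is within the first quartile, up to 4 if it exceeds the third
    position = 1 + sum(val > cuts[q] for q in (0.25, 0.50, 0.75))
    return position if lower_is_better else 5 - position

# Toy metric values (invented)
toy_rfm = pd.DataFrame({
    "recency": [1, 5, 10, 30],
    "monetary": [500.0, 200.0, 80.0, 10.0],
})
toy_quantiles = toy_rfm.quantile(q=[0.25, 0.50, 0.75]).to_dict()

# Low recency is good, high monetary is good
toy_rfm["R_rank"] = toy_rfm["recency"].apply(
    quartile_rank, args=("recency", toy_quantiles, True))
toy_rfm["M_rank"] = toy_rfm["monetary"].apply(
    quartile_rank, args=("monetary", toy_quantiles, False))

print(toy_rfm)
```

The first customer bought most recently and spent the most, so both ranks are 1. The last customer is the opposite case and gets rank 4 on both metrics.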

In [None]:
# Compute new column with recency rank of that row
rfm['R_rank'] = rfm['recency'].apply(Compute_R, args=('recency',quantiles))

# Compute new column with frequency rank of that row
rfm['F_rank'] = rfm['frequency'].apply(Compute_FM, args=('frequency',quantiles))

# Compute new column with monetary rank of that row
rfm['M_rank'] = rfm['monetary'].apply(Compute_FM, args=('monetary',quantiles))

print(rfm)

Three new rank columns are created for the R, F, and M metrics. The apply() function is used on each of the respective columns of the rfm DataFrame. The Compute_R() function is used for recency, and the Compute_FM() function is used for frequency and monetary. The second and third parameters of the functions are placed in the args parameter of the apply() function.

The above output displays the new resultant DataFrame with ranks of each of the RFM metrics.

**Step 4: Combine the RFM values to obtain a combined RFM Score.**

Now, the individual R, F, and M metrics are combined to generate the RFM score which is then added as a column in our rfm dataframe.

In [None]:
# Convert RFM values to type string
R = rfm.R_rank.astype(str)
F = rfm.F_rank.astype(str)
M = rfm.M_rank.astype(str)

# Compute new column with combined RFM values
rfm['RFM_Score'] = R + F + M

print(rfm)

**Step 5: Sort the final DataFrame in ascending order.**

Now, the final DataFrame is sorted in ascending order according to the RFM Score to get customer groups from best to worst.

In [None]:
# Sort the DataFrame by RFM_Score values
rfm = rfm.sort_values('RFM_Score')

print(rfm)

In the table below, the RFM score is mentioned with the corresponding customer group, what it means to have that RFM score, and what type of marketing strategy can be developed to deal with that group of customers. You should keep in mind that these groups are not a standard and can be adjusted to one’s requirement or problem.

| Customer Type | RFM Score | Explanation | Marketing Strategy |
|---|---|---|---|
| Best Customers | 111 | Bought most recently and frequently, and spends the most money | Introduce new products |
| Current Customers | 1XX | Bought most recently | Upsell products related to current purchase |
| Loyal Customers | X1X | Bought most frequently | Use R and M metrics to further segment |
| Big Spenders | XX1 | Spends the most money | Suggest costly products |
| Absent Customers | 411 | Purchased frequently and spent the most | Suggest products with discounts |
| Absent Common Customers | 444 | Purchased long ago, purchased few and spent little | Pay least amount of attention |

The X in the RFM score indicates that any value can occur, and it does not affect the result as long as the 1 stays in its position.
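
The wildcard matching can be sketched as a small lookup function. Everything here is an illustration under stated assumptions: `SEGMENT_PATTERNS` and `label_segment` are hypothetical names, the fallback label "Other" is not part of the table, and the order in which patterns are tried (exact scores before wildcard patterns) is one possible choice, since a score such as 411 also matches X1X and XX1.

```python
# Patterns are tried from top to bottom; exact scores come before wildcards
SEGMENT_PATTERNS = [
    ("111", "Best Customers"),
    ("411", "Absent Customers"),
    ("444", "Absent Common Customers"),
    ("1XX", "Current Customers"),
    ("X1X", "Loyal Customers"),
    ("XX1", "Big Spenders"),
]

def label_segment(score):
    """Return the first customer type whose pattern matches the RFM score string."""
    for pattern, label in SEGMENT_PATTERNS:
        # An X matches any rank; any other character must match exactly
        if all(p in ("X", s) for p, s in zip(pattern, score)):
            return label
    return "Other"  # assumed fallback for scores the table does not cover

for score in ["111", "123", "214", "341", "411", "444", "234"]:
    print(score, label_segment(score))
```

With the rfm DataFrame from Step 4, the labels could then be added with `rfm['RFM_Score'].apply(label_segment)`.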

We just reduced the marketing load from managing thousands of customers to managing only sixty-four groups. Resources of the website can be allocated in the best possible way if the above information is correctly calculated and used.