
## EDA

For this EDA, let's skim through the transaction data and identify product and customer behaviours that can be used as features when building recommendation models.

* [Read Data](#section-one)
* [What is the Active Customer Base?](#custbase)
* [Do customers always buy less expensive products?](#productcust)
* [Which Segment of Customers Sticks Most to the Platform?](#custseg)
* [Does markdown affect the buying pattern of customers?](#buypattern)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import numpy as np
import gc
pd.set_option('display.max_colwidth', None)

<a id="section-one"></a>
### Read Data

In [None]:
def read_data():
    data=pd.read_parquet('../input/hm2022-low-memory-fast-loading/transactions_train.parquet', engine='pyarrow')
    data['price_band']=pd.qcut(data['price'], q=4, labels=['low','medium','high','very high'])
    data['price']=data['price'].astype('float32')
    data['sales_channel_id']=data['sales_channel_id'].astype('int32')
    return data
def read_customer_data():
    data=pd.read_parquet('../input/hm2022-low-memory-fast-loading/customers.parquet', engine='pyarrow')
    data['FN']=data['FN'].astype('float32')
    data['Active']=data['Active'].astype('float32')
    # Missing ages become 0; assign the result back rather than filling a column in place
    data['age']=data['age'].fillna(0).astype('int32')
    return data

<a id="custbase"></a>
### What is the Active Customer Base?

Active customers are customers who have bought anything on the platform in the last 365 days. We calculate this relative to the last training date available.

In [None]:
cust_data=read_customer_data()
trans_data=read_data()


In [None]:
unique_cust=trans_data.loc[trans_data['t_dat'] >= '2019-09-22','customer_id'].nunique()
total_cust=cust_data.shape[0]
active_customers=unique_cust*100/total_cust

In [None]:
del cust_data,trans_data
gc.collect()

<b>The current Active Customer Base for the platform is 72%</b>
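
The cutoff date in the cell above is hardcoded. A minimal sketch, assuming toy data (the `toy_trans` and `toy_cust` frames and their values are made up for illustration), of how the 365-day window could instead be derived from the last transaction date in the data:

```python
import pandas as pd

# Hypothetical toy data standing in for the transactions and customers tables.
toy_trans = pd.DataFrame({
    "customer_id": ["a", "b", "a", "c"],
    "t_dat": pd.to_datetime(["2019-01-05", "2019-10-01", "2020-09-22", "2020-03-15"]),
})
toy_cust = pd.DataFrame({"customer_id": ["a", "b", "c", "d"]})

# Window start: 365 days before the last transaction date in the data.
last_date = toy_trans["t_dat"].max()
cutoff = last_date - pd.Timedelta(days=365)

# Customers with at least one purchase on or after the cutoff.
active = toy_trans.loc[toy_trans["t_dat"] >= cutoff, "customer_id"].nunique()
active_share = 100 * active / len(toy_cust)
print(f"{active_share:.0f}% active")
```

Deriving the cutoff this way keeps the definition consistent if the data is refreshed with later dates.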

### Seasonal Trend Analysis of Sales

In [None]:
trans_data=read_data()
trans_data['t_dat']=pd.to_datetime(trans_data['t_dat'])
trans_data['YearMonth'] = trans_data['t_dat'].dt.strftime('%Y%m')

In [None]:
gc.collect()

In [None]:
sales_month=trans_data.groupby(['YearMonth'])['article_id'].size().reset_index(name='totalsales')

In [None]:
sales_month=sales_month.sort_values('YearMonth')
fig = px.line(sales_month, x="YearMonth", y="totalsales")
fig.show()

The graph shows a seasonal pattern, with high sales in June and July every year.

In [None]:
del sales_month,trans_data
gc.collect()

<a id="productcust"></a>
### Do customers always buy less expensive products?

In [None]:
data=read_data()


In [None]:
cust_nature=data.groupby(['customer_id','price_band'])['article_id'].count().reset_index(name='totalbought')
cust_price=cust_nature.groupby('price_band')['totalbought'].sum().reset_index(name='totalitems')
cust_price['perc_share']=(cust_price['totalitems']/cust_price['totalitems'].sum())*100

In [None]:
cust_price

In [None]:
import plotly.express as px
fig = px.pie(cust_price, values='perc_share'
                 , names='price_band', title='% Share by Product Price Type')
fig.show()

Most of the products that customers buy fall in the medium price band, followed by low and very high priced products. Next we can slice the data to understand which segment of customers contributes most of H&M's revenue.
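
When reading these shares, it helps to remember how `price_band` is built: `read_data()` uses `pd.qcut(..., q=4)` on the transaction prices, so the bands are quartiles of transaction rows. A minimal sketch on made-up prices (the `toy` frame is hypothetical) shows that without repeated prices each band holds the same number of rows, so differences in the real shares come mostly from many transactions sharing the same price at the band edges:

```python
import pandas as pd

# Hypothetical toy prices with no repeated values.
toy = pd.DataFrame({"price": [1, 2, 3, 4, 5, 6, 7, 8]})

# Same call shape as in read_data(): four quantile-based bands.
toy["price_band"] = pd.qcut(toy["price"], q=4,
                            labels=["low", "medium", "high", "very high"])

# Percentage of rows falling in each band.
share = toy["price_band"].value_counts(normalize=True) * 100
print(share)
```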

If you are not familiar with the types of customers an online business deals with, there are four main types:

1. New Customers - These are customers who are on the website trying out the platform for the first time. They will move into one of the categories below based on their experience.

2. Returning Customers - These are the core customers for any business, as they contribute most of the revenue. They keep coming back to the website to buy products at regular intervals.

3. Reactivated Customers - These are customers who have not visited the website in a long time and have come back because of marketing activity or some other trigger. The period can vary, but some businesses use 365 days as the reactivation window: any customer who has not visited the website in the past 365 days and then visits becomes a reactivated customer.

4. Churn Customers - These are short-lived customers with whom the product could not establish a long-term relationship. They might have come to the website because of a promotional email or a display campaign and would mostly have made a one-time purchase or just browsed the website. The churn period differs between businesses, but some use 180 or 365 days as the period after which a customer is considered churned.
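
A minimal sketch of how the first three segments could be derived from purchase dates alone (churn needs a view of customers who stopped buying, so it is left out). The `toy` frame, the `label_visit` helper and its `reactivation_days` parameter are hypothetical names for illustration. The sketch collapses the data to one row per customer per purchase day, which is an assumption on my part, so that several items bought on the same day do not show up as a 0-day gap:

```python
import pandas as pd

# Hypothetical toy transactions: customer "a" buys two items on the same day.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "b", "b"],
    "t_dat": pd.to_datetime(
        ["2018-09-20", "2018-10-01", "2018-10-01", "2018-09-25", "2019-12-01"]),
})

# One row per customer per purchase day, sorted so diff() compares consecutive visits.
visits = (toy[["customer_id", "t_dat"]]
          .drop_duplicates()
          .sort_values(["customer_id", "t_dat"]))
visits["gap_days"] = visits.groupby("customer_id")["t_dat"].diff().dt.days

def label_visit(gap_days, reactivation_days=365):
    """Label a visit by the gap since the customer's previous purchase day."""
    if pd.isna(gap_days):
        return "New Customers"      # first purchase day seen for this customer
    if gap_days >= reactivation_days:
        return "Reactivated"
    return "Returning"

visits["customer_type"] = visits["gap_days"].apply(label_visit)
print(visits)
```

Keeping the first visit as a missing gap, rather than filling it with 0, lets the label separate a true first purchase from a repeat purchase.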

<a id="custseg"></a>
### Which Segment of Customers Sticks Most to the Platform?

In [None]:
data['t_dat']=pd.to_datetime(data['t_dat'])
data['days_since_last_event'] = (data.sort_values(['customer_id','t_dat'])
                                 .groupby('customer_id')['t_dat'].diff()
                                 .dt.days)

In [None]:
cust_repeat_data=data[['t_dat','customer_id','days_since_last_event']].drop_duplicates()
cust_repeat_data['days_since_last_event']=cust_repeat_data['days_since_last_event'].fillna(0)

In [None]:
gc.collect()

In [None]:
def map_days_since(days_since_last_event):
    # A gap of 0 covers a customer's first purchase (NaN filled with 0 above)
    # as well as additional rows bought on the same day as the previous row.
    if days_since_last_event >= 365:
        return "Reactivated"
    elif days_since_last_event == 0:
        return "New Customers"
    else:
        return "Returning"

cust_repeat_data["customer_type"] = cust_repeat_data["days_since_last_event"].apply(map_days_since)


In [None]:
cust_repeat_data=cust_repeat_data[['t_dat','customer_id','customer_type']].drop_duplicates()

In [None]:
cust_type_data=cust_repeat_data.groupby(['t_dat','customer_type'])['customer_id'].nunique().reset_index(name='totalcusttype')

We will plot the data from the first date onward, which we treat as the first date of sale on the platform.

In [None]:
cust_type_data['percentage_share']=cust_type_data['totalcusttype'] / \
cust_type_data.groupby('t_dat')['totalcusttype'].transform('sum')
cust_type_data['percentage_share']=cust_type_data['percentage_share']*100

#### Note
For the sake of simplicity, I am treating reactivated customers as those who come back to the website after 365 days or more, and since we are only looking at the transaction data I have excluded churn analysis here.
Also, we will consider the first transaction date, 2018-09-20, as the start date for all customers.

In [None]:
fig = px.line(cust_type_data, x="t_dat", y="percentage_share", color='customer_type')
fig.show()

In [None]:
del cust_type_data,cust_repeat_data 
gc.collect()

This is an interesting graph that shows the behaviour and nature of the business from the transactional data. Based on our assumptions, we can see that initially we have new customers, and then, as the platform establishes itself, there are more and more returning customers. The anomalous behaviour around April 2020 might be due to Covid-related restrictions and lockdowns.

It is interesting to see that the new customer rate has not dropped much, which suggests the market is not yet saturated and there is room to onboard more new customers.

Reactivation rates are quite low, but this needs to be looked at together with churn analysis before we can tell whether it is fine or not.

<a id="buypattern"></a>
### Does markdown affect the buying pattern of customers?

Another factor that impacts online business is that people look for good offers when they buy products. There are two possibilities to consider regarding the price and stock of the products sold on the website:

1. We have an expensive product and people will buy it only when it is on sale. These are products marked high or very high; during the life cycle of the product there might be offers, and purchases vary depending on the offer.

2. We have a less appealing product on the website and, because of the price, people do not want to buy it, so it never gets traction unless we sell it at a low price to clear the stock.

Let's have a look at point 1. We need to identify products that switch between price bands during their life cycle. We will take the products that satisfy this condition and try to understand the buying pattern. For simplicity we will only consider products that are marked very high.
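
A minimal sketch of that selection on toy data (the `toy` frame and its values are made up, and treating the earliest transaction as the product's starting band is an assumption): keep articles seen in more than one price band whose first observed band is `very high`.

```python
import pandas as pd

# Hypothetical toy data: article 1 starts "very high" and is later sold as "high".
toy = pd.DataFrame({
    "article_id": [1, 1, 1, 2, 2, 3, 3],
    "t_dat": pd.to_datetime(["2020-01-01", "2020-01-10", "2020-02-01",
                             "2020-01-03", "2020-01-20",
                             "2020-01-05", "2020-03-01"]),
    "price_band": ["very high", "high", "very high",
                   "very high", "very high",
                   "low", "medium"],
})

# Articles observed in more than one price band.
bands_per_article = toy.groupby("article_id")["price_band"].nunique()
switchers = bands_per_article[bands_per_article > 1].index

# First observed band per article (earliest transaction date).
first_band = (toy.sort_values(["article_id", "t_dat"])
                 .drop_duplicates("article_id", keep="first")
                 .set_index("article_id")["price_band"])

# Keep switchers whose first observed band is "very high".
candidates = [a for a in switchers if first_band[a] == "very high"]
print(candidates)
```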

In [None]:
data=read_data()

In [None]:
product_price_data=data[['t_dat','article_id','price_band']].drop_duplicates()

In [None]:
single_valued=product_price_data.groupby('article_id')['price_band'].nunique().reset_index(name='uniqueprices')
single_valued_skus=single_valued[single_valued['uniqueprices']==1]['article_id'].unique().tolist()
product_price_data=product_price_data[~(product_price_data['article_id'].isin(single_valued_skus))]


In [None]:

date_range=product_price_data.groupby(['article_id','price_band']).agg({'t_dat': ['min','max']}).reset_index()
date_range.columns = date_range.columns.droplevel(0)
date_range.columns = ['article_id','price_band','datemin','datemax']

In [None]:
# nth is 0-based: nth(0) takes the earliest observed row per article
initial_data=product_price_data.sort_values(['article_id','t_dat']).groupby('article_id').nth(0).reset_index()
initial_data=initial_data[initial_data['price_band'].isin(['very high'])]

In [None]:
merged_data=pd.merge(date_range,initial_data, how='inner')
del date_range,single_valued_skus,initial_data,product_price_data
gc.collect()

In [None]:
data=data[data['article_id'].isin(merged_data['article_id'].tolist())]

In [None]:
import pandasql  as ps

sqlcode = '''
select pp.t_dat,
pp.article_id,
pp.price_band,
md.datemin,
md.datemax
from data pp
inner join merged_data  md on 
md.article_id=pp.article_id
and 
pp.t_dat >= md.datemin and pp.t_dat<= md.datemax
'''

data = ps.sqldf(sqlcode,locals())


In [None]:
price_data=data.groupby(['price_band','article_id']).size().reset_index(name='total_bought')
price_data['percentage'] = 100 * price_data['total_bought'] / price_data.groupby('article_id')['total_bought'].transform('sum')

In [None]:
article_data=price_data.loc[price_data.groupby('article_id')['total_bought'].idxmax()][['price_band','article_id']]

In [None]:
product_data=article_data.groupby('price_band').size().reset_index(name='total_products')
product_data['share']=product_data['total_products']/product_data['total_products'].sum()

In [None]:
product_data

#### Observation based on Scenario 1 for high value products
What the above table indicates is that, of the 20878 very high value products sold on the website whose price band shifted during their life cycle and reverted to the original at some point, 92% sold mostly in the original price band and about 8% saw most of their sales when promos happened. This is a useful insight because the latter may be a specific category of products that people buy only when they are on offer. Overall, though, a high percentage of products are well received at the original price.

In [None]:
fig = px.pie(product_data, values='share'
                 , names='price_band', title='% Product Original Price Stickiness likelihood')
fig.show()

In [None]:
del product_data,price_data,article_data,data
gc.collect()

#### Scenario 2 
Let's look at the overall product markdown trend. We will treat any purchase made in a price band other than the product's original price band as a markdown, and see how the general sales trend holds for the products on the website.
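
A minimal sketch of this markdown definition on toy data (the `toy` frame, the `original_band` and `at_original` columns, and the 0.5 threshold are all hypothetical choices for illustration; the band of the earliest transaction is assumed to be the original band):

```python
import pandas as pd

# Hypothetical toy data: article 1 sells mostly in its first band,
# article 2 sells mostly after its band changes.
toy = pd.DataFrame({
    "article_id": [1, 1, 1, 1, 2, 2, 2],
    "t_dat": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-02-01",
                             "2020-01-01", "2020-02-01", "2020-02-02"]),
    "price_band": ["high", "high", "high", "medium",
                   "very high", "high", "high"],
})

# Assumption: the band of the earliest transaction is the article's original band.
toy = toy.sort_values(["article_id", "t_dat"])
toy["original_band"] = toy.groupby("article_id")["price_band"].transform("first")
toy["at_original"] = toy["price_band"] == toy["original_band"]

# Share of each article's purchases made in its original band.
share_at_original = toy.groupby("article_id")["at_original"].mean()

# Share of articles that sold mostly in their original band.
mostly_original = (share_at_original > 0.5).mean()
print(share_at_original)
print(mostly_original)
```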

In [None]:
data=read_data()
product_price_data=data[['t_dat','article_id','price_band']].drop_duplicates()
# nth is 0-based: nth(0) takes the earliest observed row per article
initial_data=product_price_data.sort_values(['article_id','t_dat']).groupby('article_id').nth(0).reset_index()
product_data=data.groupby(['price_band','article_id']).size().reset_index(name='total_products')


In [None]:
merge_data=pd.merge(product_data,initial_data, how='left')

In [None]:
merge_data['original']=np.where(~(merge_data['t_dat'].isnull()),1,0)

In [None]:
saledata=merge_data.groupby(['original','article_id'])['total_products'].sum().reset_index(name='total_sale')
saledata['share']=saledata['total_sale']/saledata.groupby('article_id')['total_sale'].transform('sum')

In [None]:
final_sku_md_data=saledata.loc[saledata.groupby('article_id')['share'].idxmax()][['original','article_id']]
product_data=final_sku_md_data.groupby('original').size().reset_index(name='total_products')
product_data['share']=product_data['total_products']/product_data['total_products'].sum()

In [None]:
product_data

In [None]:
fig = px.pie(product_data, values='share'
                 , names='original', title='% Product Sold Max share at Original Price')
fig.update_layout(legend_title_text='Original Price or Not')
fig.show()

In [None]:
del product_data,final_sku_md_data,merge_data,data, saledata
gc.collect()

#### What the above graph shows us
About 77% of products sold a larger share at their original price band than at markdown. They may have been marked down towards the end of their life cycle. This is pretty good, as selling products in the original price band supports profitability downstream.

### WIP

This section is a work in progress; I will be adding to the above analysis.

Some of the other questions that I am looking to add here are:
1. Churn data analysis from the customer data
2. Returning customer purchase behaviour

Please let me know in the comments if you would like me to add something more.
