From the transaction data we can extract useful information such as out-of-stock items, and detect some retail marketing campaigns:
* 'Online channel sale': items whose price is lower in the online channel
* 'Offline channel sale': items whose price is lower in the offline channel (store)
* 'Current month sale': items whose price in September 2020 is lower than in other months

In this notebook I will show how to derive hidden, informative item features from the existing transaction features. In practice, 'campaign' features are very useful, and on-campaign items are usually given a higher weight in recommendation. We will extract 4 campaign features:

* out-of-stock items
* 'Current month sale'
* 'Offline channel sale'
* 'Online channel sale'

If you want to use the campaign data directly in your recommendation model, it is available here:
https://www.kaggle.com/astrung/hm-article-capaign

# 1. Find out-of-stock items

In [None]:
import pandas as pd
df = pd.read_csv(r"/kaggle/input/h-and-m-personalized-fashion-recommendations/transactions_train.csv", 
                 dtype={'article_id': 'str'})
df.head()

In [None]:
df['t_dat'] = pd.to_datetime(df['t_dat'], format="%Y-%m-%d")
df['month'] = df['t_dat'].dt.strftime('%m')
df['year'] = df['t_dat'].dt.strftime('%Y')
df.head()

We will mark items that have no sales in July, August, and September 2020 as 'out_of_stock'.
Out-of-stock items will be removed from the recommendation candidates.

In [None]:
df_month_price = df[df['year'] == '2020'][['article_id', 'price', 'month', 'year']].drop_duplicates(
    ['article_id', 'price', 'month', 'year']).copy()
df_month_price.head()

In [None]:
df_month_avg_price = df_month_price.groupby(['article_id', 'month'])['price'].mean().unstack().reset_index()
df_month_avg_price

In the table above, a month in which an item was not sold shows `NaN`. Many items have no sales in many months. Let's find the items that were not sold in the last 3 months and mark them as out-of-stock.
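
To make the `NaN` behaviour concrete, here is a minimal sketch on a made-up table (the article IDs and prices are invented for illustration). `unstack()` creates one column per month and leaves `NaN` wherever an article has no row for that month, which is what the out-of-stock check relies on.

```python
import pandas as pd

# Toy data: hypothetical article IDs and prices, for illustration only.
toy = pd.DataFrame({
    'article_id': ['A', 'A', 'B', 'B', 'C'],
    'month':      ['06', '09', '05', '06', '08'],
    'price':      [0.030, 0.025, 0.050, 0.050, 0.010],
})

# One row per article, one column per month; months without sales become NaN.
toy_pivot = toy.groupby(['article_id', 'month'])['price'].mean().unstack()

# An article is flagged when all of the last three months are NaN.
# reindex guarantees the three columns exist even if nothing sold in one of them.
last_3_months = ['07', '08', '09']
toy_flag = toy_pivot.reindex(columns=last_3_months).isna().all(axis=1)
print(toy_flag)
```

In this toy table only article `B` is flagged, because its last sale was in June.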

In [None]:
df_out_of_stock = df_month_avg_price[df_month_avg_price['07'].isna() & 
                                     df_month_avg_price['08'].isna() & df_month_avg_price['09'].isna()]
df_out_of_stock

In [None]:
df_out_of_stock = pd.DataFrame({'article_id': df_out_of_stock.article_id.values})
df_out_of_stock['out_of_stock'] = 1
df_out_of_stock
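
One caveat: the monthly price table is built from 2020 transactions only, so an article with no 2020 transactions at all has no row in it and is not flagged by the check above. If such articles should also count as out-of-stock, one option is to start from the full article list. The following is only a sketch on made-up data; `toy_articles` stands in for the article catalogue and `toy_month_avg` for the monthly price table.

```python
import numpy as np
import pandas as pd

# Stand-in for the full catalogue (IDs are made up).
toy_articles = pd.DataFrame({'article_id': ['A', 'B', 'C']})

# Stand-in for the monthly price table built from 2020 transactions.
# Article 'C' has no 2020 transactions, so it has no row here at all.
toy_month_avg = pd.DataFrame({
    'article_id': ['A', 'B'],
    '07': [np.nan, np.nan],
    '08': [0.03, np.nan],
    '09': [np.nan, np.nan],
})

# A left join from the catalogue keeps 'C' and fills its months with NaN.
toy_full = toy_articles.merge(toy_month_avg, on='article_id', how='left')
toy_full['out_of_stock'] = (
    toy_full[['07', '08', '09']].isna().all(axis=1).astype(int)
)
print(toy_full[['article_id', 'out_of_stock']])
```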

# 2. Find items in the 'Current month sale' campaign

We calculate the average price in 2020 for each item, then compare it with the item's price in September 2020. If the September price is more than 10% lower than the average, we mark the item as being in the 'current_month_sale' campaign (stored in the `on_sale` column).
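
The comparison described above can also be written without a row-wise `apply`. The following is a minimal sketch on made-up prices: the column names `price` (yearly average) and `09` (September average) mirror the merged table in this notebook, while the data and the `threshold` name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy table: 'price' is the 2020 average, '09' the September average.
# NaN in '09' means the article had no sales in September.
toy_sale = pd.DataFrame({
    'article_id': ['A', 'B', 'C', 'D'],
    'price':      [0.050, 0.050, 0.050, 0.050],
    '09':         [0.040, 0.048, np.nan, 0.060],
})

threshold = 0.1  # flag discounts larger than 10% of the yearly average

# Relative discount: positive when September is cheaper than the yearly average.
discount = (toy_sale['price'] - toy_sale['09']) / toy_sale['price']

# Comparisons against NaN evaluate to False, so unsold articles are never flagged.
toy_sale['on_sale'] = (discount > threshold).astype(int)
print(toy_sale)
```

This sketch flags price drops only: article `D`, whose September price is above its yearly average, is not marked as on sale, and article `C` is skipped because it has no September price.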

In [None]:
df_year_avg_price = df_month_price.groupby(['article_id'])['price'].mean().reset_index()
df_year_avg_price.head()

In [None]:
df_on_sale = pd.merge(df_month_avg_price, df_year_avg_price, on='article_id')
df_on_sale = df_on_sale.fillna(-1)
df_on_sale

In [None]:
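# -1 marks "no sales in September" (set by fillna above).
# Flag only price drops: September price more than 10% below the 2020 average.
df_on_sale['on_sale'] = df_on_sale.apply(
    lambda x: 1 if x['09'] != -1 and (x['price'] - x['09']) / x['price'] > 0.1 else 0, axis=1)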
df_on_sale

In [None]:
df_on_sale = df_on_sale[['article_id', 'on_sale']].copy()
df_on_sale.head()

# 3. Find items in the 'Offline/Online channel sale' campaigns

As the plot below shows, some items have a different price in each channel. The online price can be higher or lower than the store price, sometimes by more than 50%, which suggests a campaign to draw attention to one channel. We will extract this campaign information and use it as an item feature.

In [None]:
import matplotlib.pyplot as plt
def plot_comparing_price_channel(article_id):
    test2 = df[df.article_id == article_id][['t_dat', 'price', 'sales_channel_id']].drop_duplicates().copy()
    fig, ax = plt.subplots()
    test2[test2.sales_channel_id == 1].set_index("t_dat")['price'].plot(label='store')
    test2[test2.sales_channel_id == 2].set_index("t_dat")['price'].plot(label='online')
    ax.legend()
    plt.show()
    plt.close()

In [None]:
plot_comparing_price_channel('0562245001')

**Here the online channel has a higher price than the store, so the item may be in a campaign. Let's detect this from the transactions.**

In [None]:
df_2020 = df[df['year'] == '2020'][
    ['article_id', 'price', 'sales_channel_id', 'month']].drop_duplicates().copy().reset_index(drop=True)
df_2020

In [None]:
df_2020['month'] = df_2020['month'].astype(int)
df_2020

In [None]:
# Find the average price per channel over the last 2 months (August and September 2020)
# sales_channel_id 1 = store, 2 = online
df_chanel_1 = df_2020[(df_2020.sales_channel_id == 1) & (df_2020['month'] >= 8)].groupby(
    'article_id')['price'].mean().reset_index()
df_chanel_2 = df_2020[(df_2020.sales_channel_id == 2) & (df_2020['month'] >= 8)].groupby(
    'article_id')['price'].mean().reset_index()
# The inner merge keeps only articles sold in both channels during these months
df_compare = pd.merge(df_chanel_2, df_chanel_1, on='article_id', suffixes=('_online', '_store'))
# Ratio below 1 means the online price is lower than the store price
df_compare['price_diff_ratio'] = abs(df_compare.price_online / df_compare.price_store)
df_compare

* We will mark items with price_online / price_store <= 0.8 (online price at least 20% lower) as 'online channel sale' items
* We will mark items with price_online / price_store >= 1.2 (online price at least 20% higher) as 'offline channel sale' items
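
As a quick illustration of the two thresholds, here is a minimal sketch on made-up prices (the article IDs and values are invented). The column names follow the comparison table built in this notebook.

```python
import pandas as pd

# Toy comparison table with hypothetical average prices per channel.
toy_compare = pd.DataFrame({
    'article_id':   ['A', 'B', 'C'],
    'price_online': [0.020, 0.050, 0.030],
    'price_store':  [0.040, 0.050, 0.020],
})

# Ratio below 1: cheaper online. Ratio above 1: cheaper in store.
toy_compare['price_diff_ratio'] = (
    toy_compare['price_online'] / toy_compare['price_store']
)

toy_compare['online_channel_sale'] = (
    toy_compare['price_diff_ratio'] <= 0.8).astype(int)
toy_compare['offline_channel_sale'] = (
    toy_compare['price_diff_ratio'] >= 1.2).astype(int)
print(toy_compare)
```

Article `A` (ratio 0.5) falls into the online campaign, `C` (ratio 1.5) into the offline campaign, and `B` (ratio 1.0) into neither.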

In [None]:
df_online_sale = df_compare[df_compare.price_diff_ratio <= 0.8].copy()
df_online_sale['online_channel_sale'] = 1
df_online_sale

In [None]:
df_offline_sale = df_compare[df_compare.price_diff_ratio >= 1.2].copy()
df_offline_sale['offline_channel_sale'] = 1
df_offline_sale

# 4. Finally, combine all the extracted campaigns into one dataset

In [None]:
df_article = pd.read_csv(r"../input/h-and-m-personalized-fashion-recommendations/articles.csv", 
                         dtype={'article_id': 'str'})
df_article.head()

In [None]:
result = pd.merge(df_article[['article_id']], df_out_of_stock, on='article_id', how='outer')
result = pd.merge(result, df_on_sale, on='article_id', how='outer')
result = pd.merge(result, df_online_sale[['article_id', 'online_channel_sale']], on='article_id', how='outer')
result = pd.merge(result, df_offline_sale[['article_id', 'offline_channel_sale']], on='article_id', how='outer')
result

In [None]:
result = result.fillna(0)
result.to_csv('article_campaign.csv', index=False)
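
As a rough illustration of how these flags might be used downstream, the sketch below removes out-of-stock articles from a candidate list and boosts the score of on-campaign articles. Everything here is an assumption for illustration: the candidate scores, the `boost` value, and the in-memory `campaign` table, which stands in for `article_campaign.csv`. When loading the real file, pass `dtype={'article_id': 'str'}` to `pd.read_csv` so that leading zeros in the IDs are preserved.

```python
import pandas as pd

# Stand-in for article_campaign.csv (made-up IDs and flags).
campaign = pd.DataFrame({
    'article_id':           ['A', 'B', 'C'],
    'out_of_stock':         [0, 1, 0],
    'on_sale':              [1, 0, 0],
    'online_channel_sale':  [0, 0, 0],
    'offline_channel_sale': [0, 0, 0],
})

# Hypothetical candidate scores produced by some recommendation model.
candidates = pd.DataFrame({
    'article_id': ['A', 'B', 'C'],
    'score':      [0.5, 0.9, 0.6],
})

boost = 1.5  # hypothetical weight for on-campaign items; tune on validation data

ranked = candidates.merge(campaign, on='article_id', how='left')

# Drop out-of-stock articles from the candidates.
ranked = ranked[ranked['out_of_stock'] != 1].copy()

# Boost every article that is in at least one campaign.
campaign_cols = ['on_sale', 'online_channel_sale', 'offline_channel_sale']
on_campaign = ranked[campaign_cols].max(axis=1) == 1
ranked.loc[on_campaign, 'score'] *= boost

ranked = ranked.sort_values('score', ascending=False)
print(ranked[['article_id', 'score']])
```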