# Overview
This note is an EDA of the relationship between the number of units sold and the price of an article. For each article, it computes the correlation coefficient between the number of units sold per month and the average price per month.  
Many of the other notes do not use the "price" column in the train data, which made me wonder how price affects sales. (Most likely I have simply missed the ones that do.)  
Here is what this note indicates:  
・For in-store sales, there is almost no correlation between the number of units sold and the price of the articles.  
・For online sales, there is a slight correlation between the number of units sold and the price of the articles.  
・Overall, there is a slight correlation between the number of units sold and the price of the articles.  
The code is partially based on the code at https://www.kaggle.com/code/negoto/h-m-sales-period-of-fashion-items-with-k-means/notebook.  

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import tqdm
from datetime import datetime as dt
from collections import Counter
from pathlib import Path
path = Path("../input/h-and-m-personalized-fashion-recommendations/")
articles = pd.read_csv(path / "articles.csv", dtype = {'article_id': str})
train = pd.read_csv(path / "transactions_train.csv", dtype = {'article_id': str})
train = train[["t_dat", "article_id", "sales_channel_id","price"]]
train["t_dat"] = pd.to_datetime(train["t_dat"])
# Uncomment the following if you want to limit channel_id
# train = train.query("sales_channel_id == 1") 
# train = train.query("sales_channel_id == 2")
train = train.sort_values(["article_id", "t_dat"], ascending=False)

# Data processing

In [None]:
# Add columns for average number of units purchased and average price for each article over all time periods.
sales_counts = Counter(train.article_id)
for i in articles.index:
    articles.at[i, "sales_count"] = sales_counts[articles.at[i, "article_id"]]/25 # 25: number of calendar months (2018-09 to 2020-09), matching the 25 monthly columns added below
sales_price_ave = {}
sales_price_ave = train.groupby("article_id").price.mean().to_dict()
for i in articles.index:
    if articles.at[i, "article_id"] in sales_price_ave:
        articles.at[i, "sales_price_ave"] = sales_price_ave[articles.at[i, "article_id"]]
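
The two loops above can also be written without iterating over `articles.index`. The following is a minimal sketch on toy data, not the notebook's actual code: `toy_train`, `toy_articles` and `N_MONTHS` are hypothetical stand-ins, and 25 months is an assumption taken from the monthly columns created later in this note.

```python
import pandas as pd

# Toy stand-ins for the notebook's `train` and `articles` frames (hypothetical data).
toy_train = pd.DataFrame({
    "article_id": ["A", "A", "B", "A"],
    "price": [0.02, 0.04, 0.10, 0.03],
})
toy_articles = pd.DataFrame({"article_id": ["A", "B", "C"]})

N_MONTHS = 25  # assumption: calendar months 2018-09 .. 2020-09

# One pass over the transactions: number of rows and mean price per article.
stats = toy_train.groupby("article_id")["price"].agg(["count", "mean"])

# Map the per-article statistics back onto the article table.
# Articles that never sold get a count of 0 and a missing (NaN) average price.
toy_articles["sales_count"] = (
    toy_articles["article_id"].map(stats["count"]).fillna(0) / N_MONTHS
)
toy_articles["sales_price_ave"] = toy_articles["article_id"].map(stats["mean"])
```

On the full data this should give the same columns as the loops, provided `price` has no missing values, since `count` only counts non-null prices.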

In [None]:
# Add the number of units purchased and average price for each month
YM = [201809, 201810]
while YM[0] < 202010:
    start, end = "-".join(map(str, [YM[0] // 100, YM[0] % 100, 1])), "-".join(map(str, [YM[1] // 100, YM[1] % 100, 1]))
    monthly_sales = Counter(train.query(f"'{start}' <= t_dat < '{end}'").article_id)
    # Sales num
    articles[YM[0]] = 0
    for i in articles.index:
        articles.at[i, YM[0]]= monthly_sales[articles.at[i, "article_id"]]
    YM[0] = YM[1]
    YM[1] = (YM[1] + 100 - 11) if YM[1] % 100 == 12 else (YM[1] + 1)

YM = [201809, 201810]
while YM[0] < 202010:
    start, end = "-".join(map(str, [YM[0] // 100, YM[0] % 100, 1])), "-".join(map(str, [YM[1] // 100, YM[1] % 100, 1]))
    # Sales price
    monthly_price_ave = train.query(f"'{start}' <= t_dat < '{end}'").groupby("article_id").price.mean().to_dict()
    # Float column keyed by YM + 100000000; NaN means no purchase in that month
    articles[YM[0]+100000000] = np.nan
    for i in articles.index:
        if articles.at[i, "article_id"] in monthly_price_ave:
            articles.at[i, YM[0]+100000000] =  monthly_price_ave[articles.at[i, "article_id"]]
        # Treat a near-zero average price as missing as well
        if articles.at[i,YM[0]+100000000] < 1e-4: 
            articles.at[i,YM[0]+100000000] = np.nan
    YM[0] = YM[1]
    YM[1] = (YM[1] + 100 - 11) if YM[1] % 100 == 12 else (YM[1] + 1)

articles.head()
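
The monthly count and monthly average price tables can also be built in one grouped pass. This is a sketch under assumptions: `toy_train` is hypothetical data, and the month labels are strings such as `"201809"` rather than the integer column names used above.

```python
import pandas as pd

# Toy transactions (hypothetical data) with the same columns as `train`.
toy_train = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2018-09-20", "2018-09-25", "2018-10-03", "2018-10-04"]
    ),
    "article_id": ["A", "A", "A", "B"],
    "price": [0.02, 0.04, 0.03, 0.10],
})

# Label every transaction with its calendar month, e.g. "201809".
month = toy_train["t_dat"].dt.strftime("%Y%m").rename("month")

# Group by article and month, then spread the months into columns.
grouped = toy_train.groupby(["article_id", month])["price"]
monthly_count = grouped.count().unstack(fill_value=0)  # 0 = nothing sold
monthly_price = grouped.mean().unstack()               # NaN = nothing sold
```

Keeping counts and prices in two separate tables avoids the positional column slices (`27:52`, `52:77`) that the rest of this note relies on.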

In [None]:
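# Standardize the monthly units sold (columns 27:52) and the monthly average price (columns 52:)
# per article: subtract the article's overall average (columns 25 and 26) and divide by the
# standard deviation of its monthly values.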
for i in tqdm.tqdm(articles.index):
    count_std = articles.iloc[i,27:52].std()
    price_std = articles.iloc[i,52:].std()
    articles.iloc[i,27:52] = (articles.iloc[i,27:52]-articles.iloc[i,25]) / count_std
    articles.iloc[i,52:] = (articles.iloc[i,52:]-articles.iloc[i,26]) / price_std
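
Row-wise standardization can be expressed without a Python loop. The sketch below is an illustration on a hypothetical `counts` table, and it differs from the cell above in one respect: it centers each row on the mean of its own monthly values rather than on the stored overall average. The Pearson correlation computed later does not change under a shift or positive rescaling of either variable, so this difference should not affect the coefficients.

```python
import numpy as np
import pandas as pd

# Toy monthly table: one row per article, one column per month (hypothetical data).
counts = pd.DataFrame(
    {"201809": [1.0, 5.0], "201810": [2.0, 5.0], "201811": [3.0, 5.0]},
    index=["A", "B"],
)

def standardize_rows(df):
    """Return (value - row mean) / row std for every row of `df`."""
    mean = df.mean(axis=1)
    # A constant row has std 0; replace it by NaN so the row becomes NaN
    # instead of raising a division warning or producing inf.
    std = df.std(axis=1).replace(0, np.nan)
    return df.sub(mean, axis=0).div(std, axis=0)

z = standardize_rows(counts)
```

Like `Series.std`, `DataFrame.std` uses the sample standard deviation (`ddof=1`) by default, which matches the loop above.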

In [None]:
articles.head()


# Transition of article sales and price

In [None]:
def plot_sales_num_and_price(article_id):
    plt.figure(figsize=(24, 1.5))
    plot_df = articles.query(f"article_id == '{article_id}'")
    # Month labels (columns 27:52) converted to datetimes for the x axis
    months = plot_df.columns[27:52].map(lambda x: dt.strptime(str(x),'%Y%m'))
    values = list(*plot_df.values)
    # Red: standardized units sold, blue: standardized average price
    sns.lineplot(x=months, y=values[27:52], linestyle='None', marker="o", markersize=5, color='r')
    sns.lineplot(x=months, y=values[52:77], linestyle='None', marker="o", markersize=5, color='b')
    plt.legend(['stand_count','stand_price'])
    plt.title(" ".join(["Monthly Sales of ID :", article_id, "sales_count:", str(plot_df.iloc[0, 25])[:10], "    price_average :", str(plot_df.iloc[0, 26])[:10]]))

Check the monthly sales quantity and sales price of three sample articles.

In [None]:
# Sample 1
plot_sales_num_and_price(articles.loc[0,"article_id"])

In [None]:
# Sample 2
plot_sales_num_and_price(articles.loc[1,"article_id"])

In [None]:
# Sample 3
print(articles.shape)
plot_sales_num_and_price(articles.loc[3,"article_id"])

# Calculating the correlation coefficient between the number of units sold and the average price

In [None]:
# Correlation coefficients between price and quantity are calculated for each product.
articles["count_price_coef"] = np.nan  # float column that receives each article's coefficient
articles["valid_coef_num"] = 0  # number of months with both a count and a price
for i in tqdm.tqdm(articles.index):
    stand_count = articles.iloc[i,27:52].values
    stand_price = articles.iloc[i,52:77].values
    new_a = []
    new_b = []
    for item1,item2 in zip(stand_count,stand_price):
        if not (np.isnan(item1) or np.isnan(item2)):
            new_a.append(item1)
            new_b.append(item2)
    stand_count = new_a
    stand_price = new_b
    tmp = np.corrcoef(stand_count,stand_price)
    articles.loc[i,"count_price_coef"] = tmp[0,1]
    articles.loc[i,"valid_coef_num"] = len(stand_count)
    
# Keep only products with at least 10 months of valid (count, price) pairs,
# i.e. exclude products with many months in which not a single unit was sold.
articles = articles[articles["valid_coef_num"]>=10]
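
The per-article loop computes a Pearson coefficient over the months in which both values are present. As a rough sketch, the same calculation can be vectorized with NumPy; `rowwise_corr`, `toy_count` and `toy_price` are hypothetical names, and the two inputs are assumed to be tables with one row per article and identical month columns.

```python
import numpy as np
import pandas as pd

def rowwise_corr(count_df, price_df):
    """Pearson correlation per row, using only months where both values exist."""
    x = count_df.to_numpy(dtype=float)
    y = price_df.to_numpy(dtype=float)
    valid = ~(np.isnan(x) | np.isnan(y))   # months usable for each article
    n = valid.sum(axis=1)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Means over the valid months only
        mean_x = np.where(valid, x, 0.0).sum(axis=1) / n
        mean_y = np.where(valid, y, 0.0).sum(axis=1) / n
        # Centered values; months that are not valid contribute 0
        xc = np.where(valid, x - mean_x[:, None], 0.0)
        yc = np.where(valid, y - mean_y[:, None], 0.0)
        r = (xc * yc).sum(axis=1) / np.sqrt(
            (xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1)
        )
    return pd.DataFrame(
        {"count_price_coef": r, "valid_coef_num": n}, index=count_df.index
    )

# Toy data: article B has one month without a price.
toy_count = pd.DataFrame([[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]], index=["A", "B"])
toy_price = pd.DataFrame([[2.0, 4.0, 6.0, 8.0], [8.0, np.nan, 4.0, 2.0]], index=["A", "B"])
result = rowwise_corr(toy_count, toy_price)
```

The filter used above would then read `result[result["valid_coef_num"] >= 10]`. Rows with fewer than two valid months, or with a constant series, come out as NaN.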


In [None]:
# Create a histogram of correlation coefficients
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.hist(articles["count_price_coef"],bins=10)
plt.xlabel('Correlation coefficient [-]')
plt.ylabel('Frequency [-]')

In [None]:
# Check the scatter and transition graphs of one sample product with a large correlation coefficient.
large_corr_id = articles[articles["count_price_coef"]>0.8].iloc[0].article_id
# `articles` was filtered above, so its index labels no longer match row positions.
# Select the row first, then slice its columns by position.
row = articles[articles["article_id"] == large_corr_id].iloc[0]
stand_count = row.iloc[27:52].values
stand_price = row.iloc[52:77].values
# Keep only the months in which both the count and the price are available
new_a = []
new_b = []
for item1,item2 in zip(stand_count,stand_price):
    if not (np.isnan(item1) or np.isnan(item2)):
        new_a.append(item1)
        new_b.append(item2)
stand_count = new_a
stand_price = new_b
plt.plot(stand_count,stand_price,'x')
plot_sales_num_and_price(large_corr_id)

# Restricted to channel = 1 and channel = 2
Uncomment one of the `train.query("sales_channel_id == ...")` lines in the first code cell to get the result for a single channel.  
- channel = 1  
https://cdn.discordapp.com/attachments/957917652846796830/958133553940557824/channel1.png  
- channel = 2  
https://cdn.discordapp.com/attachments/957917652846796830/958133554154459227/channel2.png
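
Instead of rerunning the notebook once per channel, the per-channel coefficients can be computed in a single run. The following is a sketch on hypothetical data: `price_count_corr`, `toy_train` and the `min_months` threshold are illustrative names, and `Series.corr` is used here because it drops months in which either value is missing.

```python
import pandas as pd

def price_count_corr(transactions, min_months=3):
    """Per-article correlation between monthly units sold and monthly mean price."""
    month = transactions["t_dat"].dt.strftime("%Y%m").rename("month")
    grouped = transactions.groupby(["article_id", month])["price"]
    count = grouped.count().unstack(fill_value=0)
    price = grouped.mean().unstack()  # NaN where nothing was sold
    # Series.corr ignores months with a missing price and returns NaN
    # when fewer than `min_months` valid months remain.
    return pd.Series(
        {aid: count.loc[aid].corr(price.loc[aid], min_periods=min_months)
         for aid in count.index},
        name="count_price_coef",
    )

# Toy data: in channel 1 count and price rise together, in channel 2 they move oppositely.
records = (
    [("2018-09-20", "A", 1, 0.04)] * 2
    + [("2018-10-05", "A", 1, 0.03)] * 1
    + [("2018-11-05", "A", 1, 0.05)] * 3
    + [("2018-09-20", "A", 2, 0.05)] * 1
    + [("2018-10-05", "A", 2, 0.04)] * 2
    + [("2018-11-05", "A", 2, 0.03)] * 3
)
toy_train = pd.DataFrame(
    records, columns=["t_dat", "article_id", "sales_channel_id", "price"]
)
toy_train["t_dat"] = pd.to_datetime(toy_train["t_dat"])

# One coefficient table per sales channel.
per_channel = {
    channel: price_count_corr(group)
    for channel, group in toy_train.groupby("sales_channel_id")
}
```

With the real data, `min_months=10` would correspond to the `valid_coef_num >= 10` filter used earlier, and each entry of `per_channel` could be passed to `ax.hist` to compare the two distributions side by side.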

# Summary and Discussion
・I would have expected the number of units sold to increase when an article goes on sale, but the trend seems rather to be that prices are lowered when sales go down.  
・The correlation between the number of units sold and price is more apparent online. It is difficult to imagine that a lower price would cause fewer sales; it is more reasonable to assume that fewer sales lead to a lower price. We considered the following explanations.  
offline: Customers buy items that have been marked down for inventory clearance because they are cheap.  
online: Even if an item is cheaper, customers do not buy things that are out of season or out of style (perhaps because online they can buy them whenever they need them).  
・If a correlation were found, we thought that knowing the cycle of price reductions would let us determine when the number of units sold would increase, but two years of data were not sufficient for that.  
・From the transition graphs: overall, the prices of many articles fall over time (which is perhaps only natural).  
・I still think it would be better to use the number of units sold as a feature rather than the price.
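
One way to probe whether price reductions follow a drop in sales, rather than the other way round, is to correlate the count in one month with the price in a later month. This is only a sketch on made-up numbers: `lagged_corr`, `toy_count` and `toy_price` are hypothetical, and the notebook itself does not perform this check.

```python
import pandas as pd

def lagged_corr(count, price, lag):
    """Pearson correlation between the count in month t and the price in month t + lag."""
    # shift(-lag) moves the price of month t + lag onto row t;
    # Series.corr skips the trailing rows that become NaN.
    return count.corr(price.shift(-lag))

# Toy series in which the price of each month is half the count of the previous month.
toy_count = pd.Series([10.0, 8.0, 6.0, 4.0, 2.0, 2.0])
toy_price = pd.Series([5.0, 5.0, 4.0, 3.0, 2.0, 1.0])

same_month = lagged_corr(toy_count, toy_price, 0)
next_month = lagged_corr(toy_count, toy_price, 1)
```

In this toy example the one-month lag gives a higher coefficient than the same-month comparison, which is the pattern one would look for if prices react to earlier sales.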