# Is there a difference between product sales in the requested week and the prior week of last year?
Yesterday, I shared my finding of an anomaly in sales data for the last days of September. You can find my observation here: [!!! WARNING!!! Sales spike at end of September](https://www.kaggle.com/amanabdullayev/warning-sales-spike-at-end-of-september)

## Brief summary of the previous observation
- I observed an abnormally high amount of sales at the end of September in previous years. Perhaps this marks the start of winter sales, or a discount on summer/fall products?
- Put simply, the last days of September are vibrant! H&M asked us to predict the articles a customer will buy in exactly those days (the 7 days after 22-09-2020), so these findings are critical for model building.
- High-scoring public notebooks follow the same logic: "find hot products for the period just before 22-09-2020 (I will call this period "prior week" hereafter) and make recommendations from these hot products". However, as noted above, sales behave anomalously at the end of September (I will call this period "interested week" hereafter). For that reason, simply identifying the hot products of recent weeks and recommending them is not a good idea. We can see why by analyzing last year's data for the same period.
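
To make the "hot products" logic concrete, here is a minimal sketch of such a popularity baseline on toy data. The helper name `popular_articles`, the article IDs and the dates are all made up for illustration; this is not the code of any particular public notebook.

```python
import pandas as pd

# Toy transactions; the article IDs and dates are made up for illustration.
toy = pd.DataFrame(
    {
        "t_dat": pd.to_datetime(
            [
                "2020-09-14", "2020-09-16", "2020-09-17", "2020-09-18",
                "2020-09-20", "2020-09-21", "2020-09-22",
            ]
        ),
        "article_id": ["0001", "0002", "0002", "0003", "0002", "0003", "0004"],
    }
).set_index("t_dat")


def popular_articles(df, start, end, k):
    """Return the k most frequently sold article IDs between start and end (inclusive)."""
    window = df.loc[start:end, "article_id"]
    return window.value_counts().head(k).index.tolist()


# The same list would be recommended to every customer.
recommendations = popular_articles(toy, "2020-09-16", "2020-09-22", k=2)
print(recommendations)  # ['0002', '0003']
```

Such a baseline only works if what sold well in the window before the cutoff keeps selling well afterwards, which is exactly the assumption questioned in this notebook.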


## Goal of this notebook
- Show briefly and clearly that sales are much higher in the last days of September;
- Find the top 12 hot products from the prior week of 2019, i.e. 15-09-2019 to 22-09-2019, and check whether they differ from those of the interested week of 2019, i.e. 23-09-2019 to 30-09-2019.

Acknowledgments:
Thanks for your [comments](https://www.kaggle.com/amanabdullayev/warning-sales-spike-at-end-of-september/comments) and for pushing me to dig deeper into this data:
[ActulVerma](https://www.kaggle.com/atulverma).

In [None]:
# necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from tqdm.auto import tqdm
import functools

In [None]:
# load only the required columns to save memory and loading time
transactions = pd.read_csv(
    "../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv",
    index_col="t_dat",
    usecols=["t_dat", "price", "article_id"],
    parse_dates=["t_dat"],
    # infer_datetime_format is deprecated in pandas 2.x, so it is omitted here
    dtype={"article_id": str},  # read ids as strings so leading zeros are kept
)
transactions.head()

In [None]:
# the pandas resample() function is quite useful for dealing with time series data
# we take the daily sum of the price column and plot it for the two years
transactions["price"].resample("1D").sum().plot(figsize=(20, 6), alpha=0.6)
plt.title("Daily sales data for 2 years", fontsize=18, color="r")
plt.show()

### Here we see that sales peak from time to time, most likely due to the start of sales or new seasonal products. Since we are interested in the period from 23-09-2020 to 30-09-2020, let's take a deep dive into that period's data.
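
As a small illustration of the daily aggregation behind the plot, the sketch below applies `resample("1D").sum()` to a toy series (the dates and prices are made up). A day without any transactions still appears in the result with a sum of 0, so gaps in the data show up as dips rather than as missing points.

```python
import pandas as pd

# Toy data: several purchases per day, with one day (2020-01-03) missing entirely.
toy = pd.DataFrame(
    {"price": [0.02, 0.03, 0.05, 0.01, 0.04]},
    index=pd.to_datetime(
        ["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-04", "2020-01-04"]
    ),
)

# One row per calendar day between the first and last timestamp.
daily = toy["price"].resample("1D").sum()
print(daily)
```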

In [None]:
# apply similar technique as above but for the data before 01-10-2018
plt.figure(figsize=(14, 10))
plt.subplot(311)
transactions["price"].resample("1D").sum().loc[:"2018-09-30"].plot(
    alpha=0.8, ax=plt.gca()
)
plt.title("September 2018", fontsize=16, color="r")
plt.ylim(500, 6500)
plt.xlim("2018-09-15", "2018-09-30")
plt.grid(True, linestyle='--')


# apply similar technique as above but for the data from 15-09-2019 to 30-09-2019
plt.subplot(312)
transactions["price"].resample("1D").sum().loc["2019-09-15":"2019-09-30"].plot(
    alpha=0.8, ax=plt.gca()
)
plt.title("September 2019", fontsize=16, color="r")
plt.ylim(500, 6500)
plt.grid(True, linestyle='--')

# apply similar technique as above but for the data after 15-09-2020
plt.subplot(313)
transactions["price"].resample("1D").sum().loc["2020-09-15":].plot(
    alpha=0.8, ax=plt.gca()
)
plt.title("September 2020", fontsize=16, color="r")
plt.ylim(500, 6500)
plt.xlim("2020-09-15", "2020-09-30")
plt.text(
    "2020-09-23",
    1000,
    "We have to predict sales for this vibrant period",
    color="blue",
    fontsize=14,
)
plt.grid(True, linestyle='--')

plt.tight_layout()
plt.show()

### As the graphs above show, there are sharp peaks at the end of September. Surprisingly, H&M chose exactly that vibrant period for our predictions. My hypothesis is that the top-selling products from before this vibrant period are not very relevant for the requested period. To test it, let's analyze the same period of last year, i.e. 2019.
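
One possible way to put a number on this hypothesis is to measure how much two top-k lists overlap. The sketch below uses the Jaccard similarity (shared items divided by all distinct items); the helper `top_k_overlap` and the article IDs are hypothetical and only serve as an illustration.

```python
def top_k_overlap(prior, target):
    """Return the shared items and the Jaccard similarity of two top-k lists."""
    prior_set, target_set = set(prior), set(target)
    shared = prior_set & target_set
    union = prior_set | target_set
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Made-up article IDs, for illustration only.
prior_week = ["A", "B", "C", "D"]
target_week = ["C", "D", "E", "F"]

shared, jaccard = top_k_overlap(prior_week, target_week)
print(sorted(shared), round(jaccard, 2))  # ['C', 'D'] 0.33
```

A value close to 1 would mean the prior week's best sellers carry over almost unchanged, while a value close to 0 would support the hypothesis.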

In [None]:
# take top-selling products for the same periods of previous year
hot_12_prior_2019 = transactions.loc["2019-09-15":"2019-09-22"]['article_id'].value_counts()[0:12].index.tolist()
hot_12_interested_2019 = transactions.loc["2019-09-23":"2019-09-30"]['article_id'].value_counts()[0:12].index.tolist()


# compare the two lists: an identical ranking, or only a partial overlap?
common = set(hot_12_prior_2019) & set(hot_12_interested_2019)
if hot_12_prior_2019 == hot_12_interested_2019:
    print("Hot products of prior and interested weeks are the same.")
else:
    print("Hot products are not the same!")
    print(f"{len(common)} of 12 articles appear in both lists.")

In [None]:

fig = plt.figure(figsize=(24, 18))
root_path = '../input/h-and-m-personalized-fashion-recommendations/images/'
for count, i in enumerate(hot_12_prior_2019, start=1):
    # the image folder is the first three characters of the article id
    img_path = root_path + i[:3] + '/' + i + '.jpg'
    img = plt.imread(img_path)
    fig.add_subplot(3, 4, count)
    plt.imshow(img)
    plt.title(f'Article {i}')
    # remove axes and place the images closer to one another for a more compact output
    plt.xticks([])
    plt.yticks([])
# set the figure title and tighten the layout once, after all images are drawn
plt.suptitle('Top-selling products of the prior week', y=1.01, fontsize=20, color='b')
plt.tight_layout()

In [None]:
fig = plt.figure(figsize=(24, 18))
root_path = '../input/h-and-m-personalized-fashion-recommendations/images/'
for count, i in enumerate(hot_12_interested_2019, start=1):
    # the image folder is the first three characters of the article id
    img_path = root_path + i[:3] + '/' + i + '.jpg'
    img = plt.imread(img_path)
    fig.add_subplot(3, 4, count)
    plt.imshow(img)
    plt.title(f'Article {i}')
    # remove axes and place the images closer to one another for a more compact output
    plt.xticks([])
    plt.yticks([])
# set the figure title and tighten the layout once, after all images are drawn
plt.suptitle('Top-selling products of the interested week', y=1.01, fontsize=20, color='b')
plt.tight_layout()

### Judging by the images, the prior week is dominated by thinner, brightly colored clothes, while the interested week shows darker, warmer clothes. Even though this data is for 2019, I expect a similar trend in the data for 2020.
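
The visual impression could also be checked numerically, for example by counting colour groups among the top sellers of each week. The sketch below is only an illustration on toy data: the `articles` frame, the column name `colour_group_name` and all values are assumptions, not taken from this notebook.

```python
import pandas as pd

# Toy article metadata; the column name and all values are assumptions.
articles = pd.DataFrame(
    {
        "article_id": ["0001", "0002", "0003", "0004"],
        "colour_group_name": ["White", "Light Pink", "Black", "Dark Grey"],
    }
)

# Made-up top sellers for the two weeks.
prior_top = ["0001", "0002", "0003"]
target_top = ["0003", "0004"]


def colour_counts(article_ids):
    """Count how often each colour group occurs among the given article IDs."""
    subset = articles[articles["article_id"].isin(article_ids)]
    return subset["colour_group_name"].value_counts().to_dict()


print(colour_counts(prior_top))
print(colour_counts(target_top))
```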

Stay safe and healthy! If you like my findings, please upvote it and leave your comments/improvement suggestions below!
