Let's analyze the price distribution.
For simplicity, the price has been scaled to a range between 0 and 1 by dividing it by the maximum price.

# Imports

In [None]:
import numpy as np 
import pandas as pd 

# Read and prepare Normalized Price

In [None]:
# Read article_id as a string so that any leading zeros are preserved
df = pd.read_csv(
    "../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv",
    dtype={"article_id": str},
)

In [None]:
# Divide by the maximum price so that the values lie between 0 and 1
df["price_normalized"] = df["price"] / df["price"].max()

# Plots of price frequency, overall and by sales channel id

As we could expect, these plots show that the most frequently bought items lie in the lowest price range (price < 0.2).

In [None]:
df["price_normalized"].plot.hist(bins=50, range=[-0.1, 1.0])

In [None]:
df[df["sales_channel_id"]==1]["price_normalized"].plot.hist(bins=50, range=[-0.1, 1.0])

In [None]:
df[df["sales_channel_id"]==2]["price_normalized"].plot.hist(bins=50, range=[-0.1, 1.0])

## Log scale for Y axis

Once we introduce a logarithmic scale, we can see that the products with the highest prices are still being bought, but at a much lower frequency (around 10^4 articles per bin, compared to 10^7 per bin for the lowest prices).
This is a clear example of a long tail. One of the main tasks in this challenge will be to use the data about the long tail correctly to customize user recommendations.
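The long tail described here can also be expressed as a number instead of being read off a plot. The sketch below is only an illustration: it uses a synthetic right-skewed sample in place of `df["price_normalized"]`, reuses the 0.2 cut-off from the text, and counts values in 50 bins over [0, 1] (the notebook's plots use a slightly wider range). The names `prices`, `tail_threshold`, `head_share` and `tail_share` are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df["price_normalized"]; NOT the real transaction data.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=-3.5, sigma=0.6, size=100_000))
prices = prices / prices.max()  # same scaling as in the notebook

tail_threshold = 0.2  # cut-off taken from the text (price < 0.2 is the "head")
tail_share = (prices >= tail_threshold).mean()  # fraction of rows in the tail
head_share = 1.0 - tail_share

# Per-bin counts, i.e. the numbers a histogram with 50 bins would draw.
counts, edges = np.histogram(prices, bins=50, range=(0.0, 1.0))

print(f"head share: {head_share:.3f}, tail share: {tail_share:.3f}")
print("fullest bin:", counts.max(), "emptiest non-empty bin:", counts[counts > 0].min())
```

On the real data, replacing `prices` with `df["price_normalized"]` would give the share of transactions in the tail and the per-bin counts behind the log-scale histograms.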

In [None]:
df["price_normalized"].plot.hist(bins=50, log=True, range=[-0.1, 1.0])

In [None]:
df[df["sales_channel_id"]==1]["price_normalized"].plot.hist(bins=50,log=True, range=[-0.1, 1.0])

In [None]:
df[df["sales_channel_id"]==2]["price_normalized"].plot.hist(bins=50,log=True, range=[-0.1, 1.0])

# User Price range

In [None]:
# Select the two columns first, then copy, to avoid copying the whole DataFrame
df_users = df[["customer_id", "price_normalized"]].copy()

In [None]:
df_users=df_users.groupby(by=["customer_id"]).agg(["mean","std"]).reset_index()

In [None]:
df_users.head()

## Plots of user price range (mean and std)

These plots show behaviour similar to the price distribution of the transactions in the dataset.
In this case, most users buy products with an average price of less than 0.1.
The std plot also shows that users tend to buy products within a certain price range and rarely buy products priced far from their usual range.
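One way to make "far from their usual range" concrete is to measure each purchase against the buyer's own mean and std. The following is a minimal sketch on hypothetical toy data, not part of the original analysis; the `toy` frame and the `price_zscore` column are names invented here, while `customer_id` and `price_normalized` mirror the notebook's columns.

```python
import pandas as pd

# Hypothetical toy transactions; column names mirror the notebook's df.
toy = pd.DataFrame({
    "customer_id": ["a", "a", "a", "a", "b", "b", "b"],
    "price_normalized": [0.05, 0.06, 0.04, 0.30, 0.10, 0.11, 0.09],
})

grouped = toy.groupby("customer_id")["price_normalized"]
user_mean = grouped.transform("mean")  # each row gets its customer's mean
user_std = grouped.transform("std")    # sample std (ddof=1), as in .agg("std")

# Distance of each purchase from the customer's usual price,
# measured in units of that customer's own standard deviation.
toy["price_zscore"] = (toy["price_normalized"] - user_mean) / user_std

print(toy)
```

Customers with a single purchase have no defined std, so their z-score is NaN and would need separate handling.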

In [None]:
df_users["price_normalized"]["mean"].plot.hist(bins=50, range=[-0.1, 1.0])

In [None]:
df_users["price_normalized"]["std"].plot.hist(bins=50, range=[-0.1, 1.0])

### Log scale for Y axis

The log-scaled version of the plot shows that some users have a different buying pattern. There are also some outlier users who buy products with a very high average price compared with a generic user.
These users are really important for a business case: they tend to have different patterns from general users, and a top-popular approach usually doesn't work for them.
The std graph also shows that some users buy products from a broader price range.

For the users in the long tail there could be different approaches: clustering them, and combining a user-user similarity approach with an item-item similarity approach so that data from similar users and similar items is pooled. Their buying history is also important. Combining these three elements could lead to better recommendations the further we move from a generic user towards the end of the long tail.
An additional analysis of the popularity bias introduced by item-item and user-user similarity approaches is also important, to avoid overly generic recommendations.
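Before any of these approaches can be applied, the long-tail users have to be separated from the generic ones. The sketch below shows one possible way to do that, under assumptions that are not in the original analysis: the per-user mean prices are synthetic, and the 99th percentile cut-off is arbitrary. `user_mean_price`, `cutoff` and `segment` are illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic per-user mean prices, standing in for
# df_users["price_normalized"]["mean"]; NOT the real data.
rng = np.random.default_rng(1)
user_mean_price = pd.Series(
    rng.lognormal(mean=-3.0, sigma=0.5, size=10_000),
    index=[f"user_{i}" for i in range(10_000)],
).clip(upper=1.0)

# Arbitrary cut-off (an assumption): users above the 99th percentile
# of mean price are treated as the long tail.
cutoff = user_mean_price.quantile(0.99)
segment = pd.Series(
    np.where(user_mean_price > cutoff, "long_tail", "generic"),
    index=user_mean_price.index,
)

print(segment.value_counts())
```

Each segment could then be routed to a different recommender, for example top-popular items for `generic` users and a similarity-based approach for `long_tail` users.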

In [None]:
df_users["price_normalized"]["mean"].plot.hist(bins=50,log=True, range=[-0.1, 1.0])

In [None]:
df_users["price_normalized"]["std"].plot.hist(bins=50,log=True, range=[-0.1, 1.0])

# Products Price range

In [None]:
# Select the two columns first, then copy, to avoid copying the whole DataFrame
df_products = df[["article_id", "price_normalized"]].copy()

In [None]:
df_products=df_products.groupby(by=["article_id"]).agg(["mean","std"]).reset_index()

In [None]:
df_products.head()

## Plots of product price range (mean and std)

These plots show behaviour similar to the price distribution of the transactions in the dataset.
In this case, the average price of most products lies in the lowest price range (< 0.1).
The std plot also shows that the price of most products varies only within a small range.

In [None]:
df_products["price_normalized"]["mean"].plot.hist(bins=50, range=[-0.1, 1.0])

In [None]:
df_products["price_normalized"]["std"].plot.hist(bins=50, range=[-0.1, 1.0])

### Log scale for Y axis

The log-scaled versions of the graphs resemble the ones seen for the user price range. An interesting approach could be to segment these ranges and use a binned top-popularity approach: for a user who buys products with an average price of 0.15, suggest the most popular item with a price of 0.15 or close to it. One could divide the price range into an arbitrary number of bins, place each user and each item in those bins, and then use the bins to create recommendations.
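The binned top-popularity idea can be sketched in a few lines of pandas. This is a minimal illustration on hypothetical toy data, not a tuned recommender: the `toy` frame, the choice of 10 equal-width bins, and the helper `to_bin` are assumptions made for this example, while the column names mirror the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy transactions; column names mirror the notebook's df.
toy = pd.DataFrame({
    "customer_id": ["u1", "u1", "u2", "u2", "u3", "u3", "u3"],
    "article_id": ["A", "B", "C", "C", "A", "A", "D"],
    "price_normalized": [0.05, 0.07, 0.16, 0.16, 0.05, 0.05, 0.14],
})

n_bins = 10  # arbitrary number of equal-width bins over [0, 1]
edges = np.linspace(0.0, 1.0, n_bins + 1)

def to_bin(values):
    # include_lowest keeps a price of exactly 0 inside the first bin
    return pd.cut(values, bins=edges, labels=False, include_lowest=True)

# Bin each article by its mean price and count how often it was bought.
articles = toy.groupby("article_id")["price_normalized"].agg(
    mean_price="mean", purchases="count"
)
articles["price_bin"] = to_bin(articles["mean_price"])

# Most purchased article in every price bin.
top_per_bin = (
    articles.sort_values("purchases", ascending=False)
    .groupby("price_bin")
    .head(1)
    .reset_index()
    .set_index("price_bin")["article_id"]
)

# Bin each user by mean purchase price, then look up that bin's top article.
users = toy.groupby("customer_id")["price_normalized"].mean().rename("mean_price").to_frame()
users["price_bin"] = to_bin(users["mean_price"])
users["recommendation"] = users["price_bin"].map(top_per_bin)

print(users)
```

This sketch recommends a single article per bin and does not filter out articles the user has already bought; returning the top N per bin or falling back to neighbouring bins when a bin is empty would be natural extensions.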

In [None]:
df_products["price_normalized"]["mean"].plot.hist(bins=50,log=True, range=[-0.1, 1.0])

In [None]:
df_products["price_normalized"]["std"].plot.hist(bins=50,log=True, range=[-0.1, 1.0])