# H&M Consumer Analytics-RFM Segmentation, Market Basket Analysis and Collaborative Filtering

### Consumer analytics applied to the H&M dataset

**Overview:**

* Exploratory Data Analysis
  * Overview of Data
  * Missing value treatment / Feature Engineering
  * Customer Segmentations
  * Product Segmentations
  * Transactions-sales
* Data Problem
* Solution Approach
* Market Basket Analysis
* Recommender Algorithm

In [1]:
import numpy as np 
import pandas as pd
 

import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('seaborn-white')
sns.set_style("whitegrid")
sns.despine()
plt.rc("figure", autolayout=True)
plt.rc("axes", labelweight="bold", labelsize="large", titleweight="bold", titlesize=14, titlepad=10)

import matplotlib as mpl

mpl.rcParams['axes.spines.left'] = False
mpl.rcParams['axes.spines.right'] = False
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.bottom'] = False
plt.rcParams["font.weight"] = "bold"
plt.rcParams["axes.labelweight"] = "bold"

### Data import

In [2]:

articles = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/articles.csv", 
                       encoding="ISO-8859-1", header=0)
customers = pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/customers.csv",
                        encoding="ISO-8859-1", header=0)
transactions =  pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv",
                           encoding="ISO-8859-1", header=0)

### Analyzing the customer data


In [3]:
customers.head(3)

In [4]:
customers.info()

In [5]:
customers.nunique()

In [6]:
customers.shape[0] - customers['customer_id'].nunique()

There are no duplicates in customers

In [8]:
def value_counts(feature_list):
    '''
    To print the value counts of the categories within a feature. Takes a list of features as an argument.
    '''
    for i in feature_list:
        print(i.upper())
        print(customers[i].value_counts())
        print('Missing values: ', customers[i].isnull().sum(), '\n')

In [9]:
check_features = ['FN', 'Active', 'club_member_status', 'fashion_news_frequency']
value_counts(check_features)

Handling the missing values:

- FN: fill the missing values with 0, since all non-null values are 1
- Active: same as FN
- club_member_status: fill with the mode (ACTIVE)
- fashion_news_frequency: replace the string `None` with `NONE`, then fill the missing values with the mode (`NONE`) for the time being

In [10]:
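# normalise the string 'None' to 'NONE'; assign back instead of a chained inplace call
customers['fashion_news_frequency'] = customers['fashion_news_frequency'].replace(to_replace='None', value='NONE')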
values = {"FN": 0, "Active": 0, "club_member_status": 'ACTIVE', "fashion_news_frequency": 'NONE'}
customers.fillna(value=values, inplace=True)

In [11]:
check_features = ['FN', 'Active', 'club_member_status', 'fashion_news_frequency']
value_counts(check_features)

In [12]:
customers['age'].median()

In [14]:
# Impute missing ages with the median, which is robust to the outliers in this column
customers['age'] = customers['age'].fillna(customers['age'].median())

In [15]:
sns.boxplot(x=customers['age'])

In [None]:
customers['age'].mean()

In [16]:
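age_bins = [15,26,36,46,56,66,100]
# right=False makes the bins left-closed ([15,26), [26,36), ...) so that they match the labels
customers['age'] = pd.cut(customers['age'], bins=age_bins, right=False, labels=['Below 26','26-35','36-45','46-55', '56-65', 'Above 65'])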
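
The labels imply left-closed bins: a customer aged 26 should land in '26-35', not in 'Below 26'. The following is a minimal plain-Python sketch of that rule, assuming the same edges and labels as the cell above; the helper `age_group` and the sample ages are hypothetical and only illustrate the behaviour that `pd.cut(..., right=False)` is expected to have.

```python
from bisect import bisect_right

age_bins = [15, 26, 36, 46, 56, 66, 100]
labels = ['Below 26', '26-35', '36-45', '46-55', '56-65', 'Above 65']

def age_group(age):
    """Left-closed binning: [15, 26) -> 'Below 26', [26, 36) -> '26-35', and so on."""
    if not age_bins[0] <= age < age_bins[-1]:
        return None  # outside the covered range
    # bisect_right counts the edges that are <= age; the bin index is one less
    return labels[bisect_right(age_bins, age) - 1]

# Hypothetical ages chosen to sit on or next to the bin edges
for age in [16, 25, 26, 35, 36, 66, 99]:
    print(age, age_group(age))
```

Checking the edge values this way is a cheap guard against off-by-one mismatches between bin edges and labels.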

In [17]:
customers = customers.drop(['postal_code'], axis=1)
customers = customers.drop_duplicates()

In [18]:
plt.figure(figsize=(10,6))
customers.groupby('age').size().plot(kind='pie', autopct='%1.2d%%')
plt.ylabel('Age Distribution', size=20)
plt.tight_layout()

In [19]:
from matplotlib import pyplot as plt
sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=customers, x='age', bins=10 )
ax.set_xlabel('Distribution of the customers age')
plt.show()

In [20]:
#sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=customers, x='club_member_status' )
ax.set_xlabel('Distribution of club member status')
plt.show()

In [21]:
#sns.set_style("darkgrid")
f, ax = plt.subplots(figsize=(10,5))
ax = sns.histplot(data=customers, x='fashion_news_frequency' )
ax.set_xlabel('Distribution of fashion_news_frequency')
plt.show()

In [22]:
import plotly.express as px

In [23]:
px.sunburst(customers, path=['club_member_status','age'], title="Customers' behaviour")

### Articles dataframe analysis

In [24]:
articles.head(1)

In [26]:
articles.shape[0] - articles['article_id'].nunique()
#Duplicates check

In [27]:
print(f"The dataframe articles has {len(articles)} rows")

The "articles" dataframe has 25 columns and more than 100k rows.<br>
For our analysis we will select only the following columns:
- article_id
- prod_name
- product_type_name
- product_group_name
- colour_group_name
- index_name

By considering only these columns we can also save lots of memory.

In [None]:
articles.nunique()

In [None]:
articles.head(4)

# <font size='3' color='SteelBlue'><b>Text mining: unsupervised cleaning of prod_name and detail_desc, for example correcting "strap top(1)" into "strap top"</b></font>

In [28]:
import nltk
nltk.download('stopwords')

# <font size='3' color='SteelBlue'><b>Dealing with Stop Words</b></font>

- According to [OPINOSIS ANALYTICS](https://www.opinosis-analytics.com/knowledge-base/stop-words-explained/#.YeQh1FnRZhE)

> *Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, “are” and etc. Stop words are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so commonly used that they carry very little useful information.*

- You can find a list of these words in this [GitHub repository](https://github.com/igorbrigadir/stopwords/blob/master/en/terrier.txt).

To remove these words from our product names and descriptions, we can use `nltk` to download the list of stop words and filter them out of the text.
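
As a minimal sketch of the idea, the snippet below filters a sentence against a tiny hand-written stop-word set. The set `STOP_WORDS`, the helper `remove_stop_words` and the sample sentence are assumptions made for illustration; the notebook itself relies on the much larger NLTK and gensim lists.

```python
# Tiny illustrative stop-word set; real lists contain a few hundred entries
STOP_WORDS = {"a", "the", "is", "are", "with", "at", "and"}

def remove_stop_words(text, stop_words=STOP_WORDS, min_len=3):
    """Lowercase, split on whitespace, then drop stop words and very short tokens."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in stop_words and len(t) >= min_len]
    return " ".join(kept)

print(remove_stop_words("Jersey top with a zip at the front"))
```

Here "zip" survives because it is not in the toy set, whereas the notebook adds it to its own list explicitly.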

In [29]:
from wordcloud import WordCloud, STOPWORDS
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
stop_words.extend(['the', 'with', 'at', 'zip'])     

In [30]:
# Remove stop words and remove words with 2 or less characters
def preprocess(text):
    ''' keeping only the words which are not in stop_words list '''
    result = []
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 2 and token not in stop_words:
            result.append(token)
            
    return ' '.join(result)

In [31]:
articles['detail_desc'].fillna('None', inplace=True)
articles['clean_desc'] = articles['detail_desc'].apply(preprocess)
articles['clean_prod_name'] = articles['prod_name'].apply(preprocess)

In [32]:
articles['clean_prod_name'].nunique()

In [33]:
import re

In [34]:
articles["clean_prod_name"] = articles["clean_prod_name"].str.strip()
articles["clean_prod_name"] = articles["clean_prod_name"].str.lower()
# remove anything between parentheses (raw strings avoid invalid-escape warnings)
articles['clean_prod_name'] = articles['clean_prod_name'].apply(lambda string: re.sub(r"\(.*?\)", "", string))
# remove digits
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r'\d+', '', regex=True)
articles["clean_prod_name"].nunique()

In [35]:
articles["clean_desc"] = articles["clean_desc"].str.strip()
articles["clean_desc"] = articles["clean_desc"].str.lower()
# remove anything between parentheses (raw strings avoid invalid-escape warnings)
articles['clean_desc'] = articles['clean_desc'].apply(lambda string: re.sub(r"\(.*?\)", "", string))
# remove digits
articles['clean_desc'] = articles['clean_desc'].str.replace(r'\d+', '', regex=True)
articles["clean_desc"].nunique()

# <font size='3' color='SteelBlue'><b>Dealing with Special Cases</b></font>

In [36]:
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace("yr", 'year', regex=False)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r"\.|==|=|'|`|(PM)|(AM)|(UTC)", ' ', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace("...", '', regex=False)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r"(\s[^iIaA]\s)", ' ', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r'\s(\w$)|\s(\w[^me]$)', '', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r'\s ', ' ', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r'(^\s)|(\s$)', '', regex=True)

articles['clean_prod_name'] = articles['clean_prod_name'].str.replace("yr", 'year', regex=False)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r"\.|==|=|'|`|(PM)|(AM)|(UTC)", ' ', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace("...", ' ', regex=False)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r"(\s[^iIaA]\s)", ' ', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r'\s(\w$)|\s(\w[^me]$)', '', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r'\s ', ' ', regex=True)
articles['clean_prod_name'] = articles['clean_prod_name'].str.replace(r'(^\s)|(\s$)', '', regex=True)
articles["clean_prod_name"].nunique()

In [37]:
articles['clean_desc'] = articles['clean_desc'].str.replace("yr", 'year', regex=False)
articles['clean_desc'] = articles['clean_desc'].str.replace(r"\.|==|=|'|`|(PM)|(AM)|(UTC)", ' ', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace("...", '', regex=False)
articles['clean_desc'] = articles['clean_desc'].str.replace(r"(\s[^iIaA]\s)", ' ', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace(r'\s(\w$)|\s(\w[^me]$)', '', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace(r'\s ', ' ', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace(r'(^\s)|(\s$)', '', regex=True)

articles['clean_desc'] = articles['clean_desc'].str.replace("yr", 'year', regex=False)
articles['clean_desc'] = articles['clean_desc'].str.replace(r"\.|==|=|'|`|(PM)|(AM)|(UTC)", ' ', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace("...", ' ', regex=False)
articles['clean_desc'] = articles['clean_desc'].str.replace(r"(\s[^iIaA]\s)", ' ', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace(r'\s(\w$)|\s(\w[^me]$)', '', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace(r'\s ', ' ', regex=True)
articles['clean_desc'] = articles['clean_desc'].str.replace(r'(^\s)|(\s$)', '', regex=True)
articles["clean_desc"].nunique()

In [38]:
from typing import Text, Dict, List, Union, Tuple, Optional, Set

<p style='font-size:14px;color:RoyalBlue'><b>Dealing with Special Characters</b></p>

Some characters in the text are not related to its content; we consider them **noise** that hampers understanding of the text. We can remove them by matching characters like `+@«#_\-!$%%^&*¬()£<>?/\\|}\]\[{;\,~:`; any such character is treated as **noise** that does not help in understanding a product name or description.

Then we have to check whether any kind of tags are present in the text; if so, we remove them. Such tags typically come from scraping.
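
A minimal sketch of this kind of noise removal is shown below. It uses a simpler character class than the notebook's own `PUNC` string (everything that is not a word character or whitespace), and the sample product name is made up.

```python
import re

# Hypothetical noisy product name
raw = "Strap top (1) #new!"

# Replace every character that is neither a word character nor whitespace with a space
no_punct = re.sub(r"[^\w\s]", " ", raw)

# Collapse repeated whitespace and trim both ends
clean = re.sub(r"\s+", " ", no_punct).strip()

print(clean)
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together; the extra spaces are collapsed afterwards.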

In [39]:
s_chars = '¥₽ÏïŰŬĎŸæ₿₪ÚŇÀèÅ”ĜåŽÖéříÿý€ŝĤ₹áŜŮÂ₴ûÌÇšŘúüëÓ₫ŠčÎŤÆÒœ₩öËäøÍťìĈôàĥÝ¢ç“žðÙÊĉŭÈŒÐÉÔĵùÁů„âÄűĴóêĝÞîØòď฿ČÜþňÛ'
PUNC = '+@«#_\-!$%%^&*¬()£<>?/\\|}\]\[{;\,~:\"\''

In [40]:
def special_char(text: Text) -> Text:
    # first, drop any non-ASCII characters
    text = text.encode('ascii', 'ignore').decode()
    # replace whitespace and control characters (tab, newline, backspace, bell) with a space
    text = re.sub(r'[\t\s\n\r\b\a]', ' ', text)
    # Special letters
    text = re.sub(r'[{}]'.format(s_chars), '', text)
    # Punctuation
    text = re.sub(r'[{}]'.format(PUNC), ' ', text)
    # space at the start or the end of the context
    text = re.sub(r'(^\s)|(\s$)', '', text)
    # Single character
    text = re.sub(r'(\s[^iIaA]\s)', ' ', text)
    return text

In [41]:
articles["clean_prod_name"]  = articles['clean_prod_name'].apply(lambda sentence: special_char(sentence))
articles["clean_prod_name"].nunique()

In [42]:
articles["clean_desc"]  = articles['clean_desc'].apply(lambda sentence: special_char(sentence))
articles["clean_desc"].nunique()

# <font size='3' color='SteelBlue'><b>English contractions</b></font>

**[What is a contraction?](https://www.grammarly.com/blog/contractions/)**

A contraction is a shortened form of a word (or group of words) that omits certain letters or sounds. In most contractions, an apostrophe represents the missing letters. The most common contractions are made up of verbs, auxiliaries, or modals attached to other words: `He would`=`He’d`. `I have`=`I’ve`. `They are`=`They’re`. `You cannot`=`You can’t`.
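
To make the idea concrete, here is a minimal dictionary-based sketch. The mapping `CONTRACTIONS` and the helper `expand_contractions` are hypothetical and deliberately tiny; the `contractions` package used in the cells below ships a far larger mapping.

```python
# Hypothetical, deliberately small mapping from contraction to expanded form
CONTRACTIONS = {
    "he'd": "he would",
    "i've": "i have",
    "they're": "they are",
    "can't": "cannot",
}

def expand_contractions(text):
    """Replace each whitespace-separated token found in the mapping by its expansion."""
    return " ".join(CONTRACTIONS.get(word, word) for word in text.lower().split())

print(expand_contractions("They're sure you can't lose"))
```

The lookup is per token, so it only works when the apostrophe is still present, which means it has to run before punctuation is stripped.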

In [43]:
!pip install contractions
import contractions

In [44]:
def en_contractions(text: Text) -> Text:
    return ' '.join([contractions.fix(word)
                     if word in contractions.contractions_dict else word
                     for word in text.split()])

In [45]:
articles["clean_prod_name"]  = articles['clean_prod_name'].apply(lambda sentence: en_contractions(sentence))
articles["clean_prod_name"].nunique()

In [46]:
articles["clean_desc"]  = articles['clean_desc'].apply(lambda sentence: en_contractions(sentence))
articles["clean_desc"].nunique()

In [47]:
# removing numbers
articles["clean_prod_name"] = articles["clean_prod_name"].str.replace(r'\d+', '', regex=True)
articles["clean_prod_name"].nunique()

In [48]:
# removing numbers
articles["clean_desc"] = articles["clean_desc"].str.replace(r'\d+', '', regex=True)
articles["clean_desc"].nunique()

In [49]:
!pip install pandarallel

In [51]:
!pip install pyspellchecker

In [53]:
from spellchecker import SpellChecker

# <font size='3' color='SteelBlue'><b>(B) Spell Correction</b></font>

*Thanks to [@sudalairajkumar – Chat Words Conversion](https://www.kaggle.com/sudalairajkumar/getting-started-with-text-preprocessing?scriptVersionId=19201884&cellId=56)*

In [54]:
import multiprocessing
# report the available CPU cores without leaving an unused pool open
print(f'Number of workers for parallelism: {multiprocessing.cpu_count()}')

In [55]:
spell = SpellChecker()
def correct_spellings(text: Text) -> Text:
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            # correction() may return None when no candidate is found; keep the word in that case
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

In [56]:
spelling = SpellChecker(distance=1)
def misspelt_words_fn(dataframe: pd.DataFrame, column: Text) -> Set[Text]:
    '''Collect every word in the column that the spell checker does not know.'''
    misspelt_words = set()
    for text in dataframe[column].str.casefold():
        misspelt_words.update(spelling.unknown(text.split()))
    return misspelt_words


In [57]:
WORD = re.compile(r'\w+')
def reTokenize(tweet: Text) -> List[Text]:
    return WORD.findall(tweet.casefold())

In [58]:
PATTERN = re.compile(r"(.)\1{2,}")
def reduce_lengthening(text: Text) -> Text:
    return PATTERN.sub(r"\1\1", text)
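
The lengthening rule is easy to check in isolation. The sketch below repeats the same pattern on a made-up input; the sample string is an assumption. Runs are cut to two characters rather than one, presumably because many English words contain legitimate double letters, with the spell checker left to handle the rest.

```python
import re

# A character followed by two or more repeats of itself, i.e. three or more in a row
PATTERN = re.compile(r"(.)\1{2,}")

def reduce_lengthening(text):
    """Shrink runs of three or more identical characters down to two."""
    return PATTERN.sub(r"\1\1", text)

print(reduce_lengthening("sooooft cottton"))
```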

In [59]:
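def spell_correction(text: Text) -> Text:
    # relies on the global `misspelt_words` set built by misspelt_words_fn;
    # correction() may return None, in which case the original word is kept
    return ' '.join([(spelling.correction(word) or word)
                     if word in misspelt_words else word
                     for word in reTokenize(reduce_lengthening(text))])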

In [61]:
misspelt_words = misspelt_words_fn(articles, 'clean_prod_name')
len(misspelt_words)

In [62]:
articles['clean_prod_name'] = articles['clean_prod_name'].apply(lambda sentence: spell_correction(sentence))

In [63]:
articles['clean_prod_name'] = articles['clean_prod_name'].str.lower()
articles['clean_desc'] = articles['clean_desc'].str.lower()

In [64]:
articles['clean_prod_name'].nunique()

In [65]:
misspelt_words = misspelt_words_fn(articles, 'clean_desc')
len(misspelt_words)

In [66]:
articles['clean_desc'] = articles['clean_desc'].apply(lambda sentence: spell_correction(sentence))

In [67]:
articles['clean_desc'].nunique()

In [68]:
# Reduction (%) in the number of unique product names after text cleaning
(1-38226/45875)*100

In [69]:
# Reduction (%) in the number of unique product descriptions after text cleaning
(1-42538/43405)*100
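
The two percentages can be reproduced with a small helper. This is only a sketch: the function name `pct_reduction` is made up, and it assumes the larger number in each cell above is the unique count before cleaning and the smaller one the count after.

```python
def pct_reduction(before, after):
    """Percentage drop in the number of unique values after cleaning."""
    return (1 - after / before) * 100

# Counts hard-coded in the two cells above
name_reduction = round(pct_reduction(45875, 38226), 2)   # product names
desc_reduction = round(pct_reduction(43405, 42538), 2)   # product descriptions

print(name_reduction, desc_reduction)
```

Computing the counts with `nunique()` before and after cleaning, instead of hard-coding them, would keep these figures in sync with the data.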

In [70]:
#px.sunburst(articles, path=[  'product_group_name'], title="Product Groups -In stock-behaviour")

In [71]:
px.sunburst(articles, path=[  'index_name'], title="Product Index name -Catalog behaviour") 

In [None]:
#px.sunburst(articles, path=[  'perceived_colour_value_name'], title="Product Colour behaviour" )
1+1

In [None]:
#px.sunburst(articles, path=[  'index_group_name', 'index_name','garment_group_name'], title="Product Index Group -Catalog behaviour")
1+1

In [72]:
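# value_counts() returns the groups and their counts in the same order, so bars and labels line up
garment_counts = articles['garment_group_name'].value_counts()
fig = px.bar(x=garment_counts.index, y=garment_counts.values, color=garment_counts.index,
             color_discrete_sequence=px.colors.sequential.Plotly3
            )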
fig.update_layout(
    title="Count of Garment Group Name",
    xaxis_title="Garment Group Name",
    yaxis_title="Count"
)
fig.show()
 

In [None]:
#px.sunburst(articles, path=[  'garment_group_name'], title="Product Index Group -Catalog behaviour")

### Transactions dataframe analysis

In [None]:
transactions.head(7)

In [None]:
transactions.count()

In [73]:
transactions=transactions.drop_duplicates()

In [75]:
print(f"The dataframe Transactions has {len(transactions)} rows")

The Transactions dataframe has more than 29 million rows. We keep only the columns needed for the analysis: "customer_id", "article_id", "price", "t_dat" and "sales_channel_id".

In [76]:
transactions_rcd = transactions[["customer_id", "article_id", "price","t_dat","sales_channel_id"]]

# Transactions analysis 

First, we extract the quantities sold per article using the value counts method on the "article_id" column of the transaction dataframe.

In [77]:
df_sold_qty = transactions_rcd["article_id"].value_counts()
df_sold_qty

**Then we create a dataframe based on this pandas series: this is necessary since later this dataframe will be joined with the "articles" dataframe on the article_id column, in order to get information on the products.**

**We analyse sales at the product-name grain rather than at the article_id level, since article_id is very granular and distinguishes individual product characteristics.**
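
The difference between the two grains can be seen on toy data. In the sketch below, the article ids, product names and transactions are all invented; it only illustrates how several articles roll up into one product name.

```python
from collections import Counter

# Hypothetical mapping: articles 101 and 102 are variants of the same product
article_to_name = {101: "strap top", 102: "strap top", 201: "slim jeans"}

# Hypothetical transactions, one entry per purchased unit
transactions_toy = [101, 102, 102, 201]

# Article grain: each variant is counted separately
sold_per_article = Counter(transactions_toy)

# Product-name grain: variants are summed under their shared name
sold_per_product = Counter(article_to_name[a] for a in transactions_toy)

print(sold_per_article.most_common())
print(sold_per_product.most_common())
```

At article grain no single id stands out, while at product grain "strap top" is clearly the best seller.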
 

In [78]:
transactions_rcd=transactions_rcd.drop_duplicates()

In [84]:
a1=articles[['article_id','clean_prod_name']].drop_duplicates()

In [85]:
transactions_article=pd.merge(transactions_rcd,a1,how='inner',on='article_id')

In [86]:
del transactions_rcd

In [87]:
# keep one representative article (the largest article_id) per cleaned product name
temp=articles.groupby('clean_prod_name').agg({'article_id':'max'})

In [88]:

articles_temp=pd.merge(temp,articles,how='inner',on='article_id')

In [89]:
articles_temp=articles_temp.drop_duplicates()

In [None]:
articles_temp=articles_temp.drop('article_id', axis=1)

In [None]:
transactions_article=pd.merge(transactions_article,articles_temp,how='inner',on='clean_prod_name') 

In [None]:
prod_sold_qty = transactions_article["clean_prod_name"].value_counts()
prod_sold_qty

In [None]:
# name the index and the counts explicitly so the result does not depend on the pandas version
prod_sold_qty = prod_sold_qty.rename_axis("clean_prod_name").reset_index(name="sold_qty")
prod_sold_qty.head()

 

In [None]:
prod_sold_qty["sold_qty"].sum()

In [None]:
prod_sold_qty.head(10)

In [None]:
plt.figure(figsize=(12,4))
plt.title("Sold Quantity KDE plot")
sns.kdeplot(prod_sold_qty["sold_qty"])
plt.xlabel("Sold Quantity")
plt.show()

In [None]:
prod_sold_qty["sold_qty"].quantile([1.0,0.90,0.95,0.99,0.999,.80])

In [None]:
prod_sold_qty[prod_sold_qty["sold_qty"]==644]

In [None]:
transactions_article.head()

In [None]:
transactions_article['Quantity'] = 1


In [None]:
#px.sunburst(transactions_article, path=['product_group_name'], title="Product Names Group -Transaction-Sales behaviour")
1+1

In [None]:
px.sunburst(transactions_article, path=[  'index_name'], title="Product Index name --Transaction-Sales Catalog behaviour")

In [None]:
#px.sunburst(transactions_article, path=[  'index_group_name'], title="Product Index Group name --Transaction-Sales Catalog behaviour")
1+1

In [None]:
px.sunburst(transactions_article, path=[  'perceived_colour_value_name'], title="Product Colour Transaction-Sales behaviour" )

In [None]:
#px.sunburst(transactions_article, path=[  'garment_group_name'], title="Product garment_group_name - Transaction-Sales Catalog behaviour")
1+1

**Detailed Description Feature**

In [None]:
!pip install WordCloud

In [None]:
!pip install gensim

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
from wordcloud import WordCloud, STOPWORDS
from gensim.parsing.preprocessing import STOPWORDS
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['the', 'with', 'at', 'zip'])                        # adding in the list

In [None]:
def word_show(group):
    ''' to display the word cloud on the basis of index_group_name '''
    plt.figure(figsize=(15,10))
    wc = WordCloud(max_words=2000, width=1600, height=800, stopwords=stop_words).generate(str(transactions_article[transactions_article['index_group_name']==group].clean_desc))
    plt.title('Dominating words in description of {} section'.format(group), fontsize=25)
    plt.imshow(wc)

In [None]:
word_show('Ladieswear')


In [None]:
word_show('Divided')

In [None]:
transactions_article['year'] = pd.DatetimeIndex(transactions_article['t_dat']).year
transactions_article['month'] = pd.DatetimeIndex(transactions_article['t_dat']).month
transactions_article['dayofweek'] = pd.DatetimeIndex(transactions_article['t_dat']).dayofweek

In [None]:
import matplotlib
import warnings
warnings.simplefilter('ignore')
%matplotlib inline

In [None]:
!pip install matplotlib

In [None]:
month_wise_sales = transactions_article.groupby(['year', 'month']).size().reset_index().rename(columns={0:'Count'})
month_wise_sales['month'] = month_wise_sales['month'].map({1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'})
matplotlib.rcParams['figure.figsize'] = (12,6)
sns.barplot(x='month', y='Count', data=month_wise_sales, hue='year', palette=['salmon', 'yellow', 'grey'])
plt.title('Month wise Number of Items Sold')
plt.show()

In [None]:
day_wise_sales = transactions_article.groupby(['dayofweek']).size().reset_index().rename(columns={0:'Count'})
day_wise_sales['dayofweek'] = day_wise_sales['dayofweek'].map({0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'})
matplotlib.rcParams['figure.figsize'] = (12,6)
sns.barplot(x='dayofweek', y='Count', data=day_wise_sales)
plt.title('WeekDay wise Number of Items Sold')
plt.xticks(rotation=45)
plt.show()

In [None]:
sales = transactions_article[['year', 'month', 'price']].groupby(['year', 'month']).sum().reset_index().rename(columns={0:'Sales'})
sales['price'] = sales['price'].astype(int)
sns.lineplot(data=sales, x="month", y="price", hue='year', palette=['red', 'green', 'blue'])

In [None]:
transactions_article['t_dat'].min()

In [None]:
transactions_article['t_dat'].max()

In [None]:
prod_sold_qty["sold_qty"].quantile([1.0,0.90,0.95,0.99,0.999,.80])

In [None]:
prod_sold_qty[prod_sold_qty["sold_qty"]==658]

In [None]:
top_80_Perct_details=prod_sold_qty.iloc[:7000]
#del top_100_details

In [None]:
top_80_Perct_details[['sold_qty']].sum()

In [None]:
prod_sold_qty[['sold_qty']].sum()

In [None]:
24342759/28771271

In [None]:
top_80_Perct_details=pd.merge(top_80_Perct_details,articles_temp,how='inner',on='clean_prod_name').drop_duplicates()
 

#  Which are the TOP 100 products in terms of sold quantity?

Since the sold quantities are sorted in descending order, we can extract the 100 most sold products simply by taking the first 100 rows of the dataframe.

In [None]:
plt.figure(figsize=(10,8))
plt.title("TOP 30 most sold products", fontsize=33, fontweight="bold")
no=30
g = sns.barplot(y="clean_prod_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("clean_prod_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            palette="mako", ci=False)
 
plt.xlabel("Sold Quantity (%)", size=25, fontweight="bold")
plt.ylabel("")
plt.grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

As for the other product characteristics (besides the product name), the plots remain readable even when we consider 100 products:

In [None]:
fig, ax = plt.subplots(2,2, figsize=(13,9.5))
plt.suptitle("TOP 100 most sold products characteristics", fontweight="bold",fontsize=30)

no=100

g = sns.barplot(y="product_type_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("product_type_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[0,0],palette="mako", ci=False)
 
ax[0,0].set_ylabel("")
ax[0,0].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[0,0].set_title("Product Type",fontweight="bold",fontsize=28)
ax[0,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="index_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("index_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[0,1],palette="viridis", ci=False)
 
ax[0,1].set_ylabel("")
ax[0,1].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[0,1].set_title("Index",fontweight="bold",fontsize=28)
ax[0,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="colour_group_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("colour_group_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[1,0],palette="mako", ci=False)
 
ax[1,0].set_ylabel("")
ax[1,0].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[1,0].set_title("Colour Group",fontweight="bold",fontsize=28)
ax[1,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="product_group_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("product_group_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[1,1],palette="Reds_r", ci=False)
 
ax[1,1].set_ylabel("")
ax[1,1].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[1,1].set_title("Product Group",fontweight="bold",fontsize=28)
ax[1,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)
fig.tight_layout()

plt.show() 

Among the TOP 100 most sold products:
- Almost 30% are trousers
- 38% are Ladieswear
- 30% are Lingeries/Tights
- Over 70% are black
- Almost 40% are lower-body garments

In [None]:
fig, ax = plt.subplots(2,2, figsize=(13,9.5))
plt.suptitle("TOP 80 percentile of most sold products characteristics", fontweight="bold",fontsize=30)

no=7000

g = sns.barplot(y="product_type_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("product_type_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[0,0],palette="mako", ci=False)
 
ax[0,0].set_ylabel("")
ax[0,0].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[0,0].set_title("Product Type",fontweight="bold",fontsize=28)
ax[0,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="index_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("index_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[0,1],palette="viridis", ci=False)
 
ax[0,1].set_ylabel("")
ax[0,1].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[0,1].set_title("Index",fontweight="bold",fontsize=28)
ax[0,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="colour_group_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("colour_group_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[1,0],palette="mako", ci=False)
 
ax[1,0].set_ylabel("")
ax[1,0].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[1,0].set_title("Colour Group",fontweight="bold",fontsize=28)
ax[1,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="product_group_name", x="sold_qty(%)", data=top_80_Perct_details.iloc[:no].groupby("product_group_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[1,1],palette="Reds_r", ci=False)
 
ax[1,1].set_ylabel("")
ax[1,1].set_xlabel("Sold Quantity (%)", size=22, fontweight="bold")
ax[1,1].set_title("Product Group",fontweight="bold",fontsize=28)
ax[1,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)
fig.tight_layout()

plt.show() 

#  What are the characteristics of the products in the bottom 5% of sales?

There are 24,409 products that together contribute the last 5% of sales.

In [None]:
prod_sold_qty["sold_qty"].quantile([1.0,0.90,0.95,0.99,0.999,.25])

In [None]:
prod_sold_qty[prod_sold_qty["sold_qty"]==250]

In [None]:
worst_20_Perct_details=prod_sold_qty.iloc[13499:]

In [None]:
worst_20_Perct_details.count()

In [None]:
worst_20_Perct_details['sold_qty'].sum()

In [None]:
1570213/prod_sold_qty['sold_qty'].sum()

**About 24,409 different products sold fewer than 250 times each, which puts them at the bottom of the sales distribution. <br>
Taken together, these low-volume products account for only about 5% of the total sold quantity.**
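
The relationship between a sales threshold and its share of total volume can be sketched as follows. The quantities below are invented and do not come from the dataset; only the threshold of 250 is taken from the cells above.

```python
# Hypothetical sold quantities per product, sorted in descending order
sold_qty = [5000, 3000, 1200, 500, 200, 60, 30, 10]
threshold = 250

total = sum(sold_qty)
low_sellers = [q for q in sold_qty if q < threshold]

# Share of the catalogue versus share of the sold volume
share_of_products = len(low_sellers) / len(sold_qty) * 100
share_of_volume = sum(low_sellers) / total * 100

print(f"{len(low_sellers)} of {len(sold_qty)} products sold fewer than {threshold} units")
print(f"They make up {share_of_products:.1f}% of products and {share_of_volume:.1f}% of volume")
```

In this toy example half of the products generate a very small part of the volume, which is the long-tail pattern the analysis above describes.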

In [None]:
worst_20_Perct_details=pd.merge(worst_20_Perct_details,articles_temp,how='inner',on='clean_prod_name').drop_duplicates()


The merge above joins `worst_20_Perct_details` with `articles_temp` to get the product characteristics.

In [None]:
worst_20_Perct_details.head(1)

In [None]:
fig, ax = plt.subplots(2,2, figsize=(19,14))
plt.suptitle("Characteristics of the least sold products", size=38, fontweight="bold")

no=24409

g = sns.barplot(y="product_type_name", x="sold_qty(%)", data=worst_20_Perct_details.iloc[:no].groupby("product_type_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[0,0],palette="viridis_r", ci=False)
 
ax[0,0].set_ylabel("")
ax[0,0].set_xlabel("Sold Quantity (%)", size=20, fontweight="bold")
ax[0,0].set_title("Product Type", size=25, fontweight="bold")
ax[0,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="index_name", x="sold_qty(%)", data=worst_20_Perct_details.iloc[:no].groupby("index_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[0,1],palette="Reds_r", ci=False)
 
ax[0,1].set_ylabel("")
ax[0,1].set_xlabel("Sold Quantity (%)", size=20, fontweight="bold")
ax[0,1].set_title("Index", size=25, fontweight="bold")
ax[0,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="colour_group_name", x="sold_qty(%)", data=worst_20_Perct_details.iloc[:no].groupby("colour_group_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[1,0],palette="mako", ci=False)
 
ax[1,0].set_ylabel("")
ax[1,0].set_xlabel("Sold Quantity (%)", size=20, fontweight="bold")
ax[1,0].set_title("Colour Group", size=25, fontweight="bold")
ax[1,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="product_group_name", x="sold_qty(%)", data=worst_20_Perct_details.iloc[:no].groupby("product_group_name")["sold_qty"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('sold_qty(%)').reset_index().sort_values(by="sold_qty(%)", ascending=False), \
            ax=ax[1,1],palette="Blues_r", ci=False)
 
ax[1,1].set_ylabel("")
ax[1,1].set_xlabel("Sold Quantity (%)", size=20, fontweight="bold")
ax[1,1].set_title("Product Group", size=25, fontweight="bold")
ax[1,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

fig.tight_layout()

plt.show() 

# Which are the TOP 100 articles that generated the most earnings for the company?

After analyzing the sold quantities for each product, it can be interesting to analyze the total earnings generated by each product.<br>
**The earnings of a product are the sum of the prices of all its transactions (equivalently, its price multiplied by its total sold quantity when the price is constant)**. <br>
*NOTE: For privacy reasons, the prices have been transformed/scaled by the creator of the dataset, and so do not represent any known currency.*

We will now create a new dataframe df_prices which will include the earnings generated by each product.

In [None]:
transactions_article.head(1)

In [None]:
df_prices = transactions_article[["price","clean_prod_name"]].groupby("clean_prod_name").sum().sort_values(by="price", ascending=False)

In [None]:
df_prices.rename(columns={"price":"earning"}, inplace=True)
df_prices = df_prices.reset_index()

In [None]:
df_prices.head()

In [None]:
df_prices.count()

**We can see that the highest earnings generated by a single product are 1631**. <br>
What are the total earnings?

In [None]:
print("Number of different sold articles:",len(df_prices["earning"]))
print("Total Earnings:",df_prices["earning"].sum())

In [None]:
df_prices=pd.merge(df_prices,articles_temp,how='inner',on='clean_prod_name').drop_duplicates()


In [None]:
for i in [10,50,100,200,300,400,1000,1500,2000,2500,3000,4000,5000,6000]:
    print("The TOP {} of products that generate most earnings, account for the {:.2f} % of total earnings".format(i, df_prices["earning"].iloc[:i].sum() / df_prices["earning"].iloc[:].sum() * 100) ) 

**The TOP 100 of over 38K products generate around 14% of the total earnings. It can be interesting to check the names and characteristics of these products. An RFM analysis would be needed for a clearer picture.**
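The share of earnings captured by the TOP k products can also be read off a cumulative sum. The following is a minimal sketch on made-up numbers, not the H&M figures; the helper name `top_k_share` and the toy series are assumptions introduced only for illustration.

```python
import pandas as pd

def top_k_share(earnings: pd.Series, ks):
    """Return {k: percent of total earnings captured by the k highest-earning products}."""
    ranked = earnings.sort_values(ascending=False).reset_index(drop=True)
    cumulative_pct = ranked.cumsum() / ranked.sum() * 100
    # Position k-1 holds the share of the top k products; k is clipped to the series length.
    return {k: float(cumulative_pct.iloc[min(k, len(ranked)) - 1]) for k in ks}

# Toy earnings per product (hypothetical values)
toy_earnings = pd.Series([50.0, 20.0, 15.0, 10.0, 5.0])
shares = top_k_share(toy_earnings, [1, 2, 5])
print(shares)
```

In the notebook, the same helper could be called as `top_k_share(df_prices["earning"], [10, 50, 100])`, assuming df_prices holds one row per product.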

So we create a new dataframe top_100_prices, where we include only the TOP 100 articles from the df_prices dataframe.

In [None]:
top_100_prices=df_prices.iloc[:100]

As seen before, the article information has already been joined in: df_prices was merged with the articles dataframe "articles_temp" above, so top_100_prices carries the article characteristics.

In [None]:
plt.figure(figsize=(10,11))
plt.title("TOP 50 most profitable products", size=40, fontweight="bold")
no=50
g = sns.barplot(y="clean_prod_name", x="earning(%)", data=top_100_prices.iloc[:no].groupby("clean_prod_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            palette="mako", ci=False)
 
plt.xlabel("Earnings (%)", size=25, fontweight="bold")
plt.ylabel("")
plt.grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

In [None]:
fig, ax = plt.subplots(2,2, figsize=(13,9))
plt.suptitle("TOP 100 most profitable products characteristics", fontweight="bold", fontsize=30)

no=100

g = sns.barplot(y="product_type_name", x="earning(%)", data=top_100_prices.iloc[:no].groupby("product_type_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[0,0],palette="Blues_r", ci=False)
 
ax[0,0].set_ylabel("")
ax[0,0].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[0,0].set_title("Product Type", size=25,fontweight="bold")
ax[0,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)


g = sns.barplot(y="index_name", x="earning(%)", data=top_100_prices.iloc[:no].groupby("index_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[0,1],palette="viridis", ci=False)
 
ax[0,1].set_ylabel("")
ax[0,1].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[0,1].set_title("Index", size=25,fontweight="bold")
ax[0,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)


g = sns.barplot(y="colour_group_name", x="earning(%)", data=top_100_prices.iloc[:no].groupby("colour_group_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[1,0],palette="mako", ci=False)
 
ax[1,0].set_ylabel("")
ax[1,0].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[1,0].set_title("Colour Group", size=25,fontweight="bold")
ax[1,0].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)

g = sns.barplot(y="product_group_name", x="earning(%)", data=top_100_prices.iloc[:no].groupby("product_group_name")["earning"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index().sort_values(by="earning(%)", ascending=False), \
            ax=ax[1,1],palette="Reds_r", ci=False)
 
ax[1,1].set_ylabel("")
ax[1,1].set_xlabel("Earnings (%)", size=20,fontweight="bold")
ax[1,1].set_title("Product Group", size=25,fontweight="bold")
ax[1,1].grid(axis="x",color = 'grey', linestyle = '--', linewidth = 1.5)
fig.tight_layout()

plt.show()

  

Insights:
- Over 60% of the earnings of the TOP 100 products are generated by selling trousers
- Around 50% of these products belong to Divided (an H&M teenage collection)
- 37% of the products are from the Ladieswear line
- 55% of the products are black
- 66.2% of the products are related to the lower body

**NOTE: It is also important to notice that the TOP 100 most profitable products list does not exactly match the TOP 100 most sold products list, since many products that sell in large quantities are cheap and so generate less earnings.**
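The mismatch between the two rankings can be checked directly by intersecting the two TOP N lists. This is a small sketch on invented data; the dataframe `toy` and the helper `top_n_overlap` are hypothetical, and the column names simply mirror the ones used in this notebook.

```python
import pandas as pd

# Toy product table (made-up values): cheap items sell a lot but earn little
toy = pd.DataFrame({
    "clean_prod_name": ["socks", "tee", "jeans", "coat"],
    "sold_qty": [900, 700, 300, 50],
    "earning": [9.0, 14.0, 30.0, 25.0],
})

def top_n_overlap(df, n):
    """Products that appear in both the TOP n by quantity and the TOP n by earnings."""
    top_qty = set(df.nlargest(n, "sold_qty")["clean_prod_name"])
    top_earn = set(df.nlargest(n, "earning")["clean_prod_name"])
    return top_qty & top_earn

overlap = top_n_overlap(toy, 2)
print(overlap)  # empty set: the best sellers are not the best earners here
```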

# Customer Analysis

In the following, we will analyze the customers to find interesting insights and understand which customers are responsible for most purchases.

To perform the analysis, we first create a new dataframe that includes, for each row, a unique customer_id, the total quantity purchased by that customer and the earnings generated for the company by that customer's purchases.

First, we create a dataframe which includes the unique customer ids and the earnings generated by their purchases.

In [None]:
df_cust_prices = transactions_article[["customer_id", "price"]].groupby("customer_id").sum()

In [None]:
df_cust_prices.head()

In [None]:
transactions_article.head(1)

Second, we create a dataframe that will include the unique customer ids and their total purchased quantity of products.

In [None]:
df_cust_qty = transactions_article[["customer_id", "clean_prod_name"]].groupby("customer_id").count()

In [None]:
df_cust_qty.head()

Then, we join these two dataframes into a new one, "cust_qty_price", which includes the unique customer ids, their purchased quantity and the earnings generated for the company by their purchases.

In [None]:
cust_qty_price = pd.merge(df_cust_prices, df_cust_qty, on='customer_id', how='inner')

In [None]:
cust_qty_price.count()

Finally, we can join this new dataframe to the "customers" dataframe, so that the resulting "cust_details" dataframe also contains some information about each customer.

In [None]:
customers.head()

In [None]:
cust_details = pd.merge(cust_qty_price, customers, on='customer_id', how='inner')

In [None]:
cust_details.head()

In [None]:
print(f"In total there are {len(cust_details)} different customers")

# Purchased Quantity by Customer Analysis

Now we will analyze the purchased quantity by the customers.

In [None]:
cust_details.clean_prod_name.describe()

By calling the "describe" method on the "clean_prod_name" column (which holds the purchased quantity per customer), we can observe that:
- The minimum purchased quantity by a single customer is 1
- 25% of customers purchased 3 or fewer items
- 50% of customers purchased 9 or fewer items
- 75% of customers purchased 25 or fewer items
- The maximum purchased quantity by a single customer is 1636 products

In [None]:
plt.figure(figsize=(10,4))
plt.title("Distribution of purchased quantity by customer", fontweight="bold", size=20)
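# "clean_prod_name" holds the purchased quantity per customer (cust_details has no "article_id" column)
sns.kdeplot(cust_details["clean_prod_name"])
plt.xlabel("purchased quantity",fontweight="bold", size=20)
plt.ylabel("Density",fontweight="bold", size=20)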
plt.show()

Indeed, the distribution of this variable is highly skewed.

Next, we will analyze the age and other provided features of the customer to better find insights on the customers and their purchase behaviour.

# Purchase Behaviors according to Age

In [None]:
plt.figure(figsize=(10,5))
plt.title("Customers age distribution", fontweight="bold", size=30)
plt.hist(cust_details["age"], bins=70, edgecolor="black", color="#1ABC9C")
plt.xlabel("Age",fontweight="bold", size=20)
plt.ylabel("Count",fontweight="bold", size=20)
plt.show()

**The distribution of the age feature is bimodal. In order to create more effective plots, we will use a categorical column for age which divides the ages into age groups.**
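A possible way to build such age groups is `pd.cut`. The sketch below is only an illustration: the bin edges are inferred from the group labels used in the insights (16-20, 20-30, ..., 70+), while the left-closed intervals and the column name `age_group` are assumptions, not necessarily what was used to produce the plots.

```python
import numpy as np
import pandas as pd

# Assumed bin edges; with right=False each group is [lower, upper)
age_bins = [16, 20, 30, 40, 50, 60, 70, np.inf]
age_labels = ["16-20", "20-30", "30-40", "40-50", "50-60", "60-70", "70+"]

# Toy ages (made up), including a missing value
toy_customers = pd.DataFrame({"age": [16, 19, 20, 29, 45, 70, 99, np.nan]})
toy_customers["age_group"] = pd.cut(
    toy_customers["age"], bins=age_bins, labels=age_labels, right=False
)
print(toy_customers)
```

Missing ages stay missing after binning, so they drop out of any later `groupby` on the age group unless handled explicitly.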

# Q5 - Which age group purchases more articles?

In [None]:
plt.figure(figsize=(8,5))
plt.title("Purchased quantity by age group\n", fontweight="bold", size=28)
g = sns.barplot(x="age", y="Purchased Quantity(%)", data=cust_details.groupby("age")["clean_prod_name"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('Purchased Quantity(%)').reset_index(), palette="icefire", edgecolor="black")
plt.xlabel("Age Group",fontweight="bold", size=22)
plt.ylabel("Purchased Quantity (%)",fontweight="bold", size=19)
 
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

Insights:
- Customers in the range 20-30 are responsible for more than 42% of the total purchased products.
- Customers in the ranges 16-20, 60-70 and 70+ are together responsible for 8% of the total purchased products.
- Customers in the ranges 30-40, 40-50 and 50-60 are responsible for 16% of the purchased quantity each.

After analyzing the purchased quantity, it could be interesting to analyze the earnings provided to the company by each customer.

# Q6 - Which age group generates more earnings for the company?

In [None]:
plt.figure(figsize=(8,5))
plt.title("Company Earnings by age group\n", fontweight="bold", size=28)
g = sns.barplot(x="age", y="earning(%)", data=cust_details.groupby("age")["price"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('earning(%)').reset_index(), palette="icefire",edgecolor="black")
plt.xlabel("Age Group",fontweight="bold", size=22)
plt.ylabel("Earnings (%)",fontweight="bold", size=25)
 
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()


**Indeed, the earnings analysis shows a situation very similar to the purchased quantity, since customers who buy more lead, on average, to higher earnings for the company. <br>
The age group 20-30 is by far responsible for the highest earnings for the company (41.9% of total earnings).**

# Q7 - Do active customers on the fashion news purchase more articles?

In [None]:
plt.figure(figsize=(9,5))
plt.title("Purchased quantity by Fashion News Frequency\n", fontweight="bold", size=20)
g = sns.barplot(x="fashion_news_frequency", y="Purchased Quantity(%)", data=cust_details.groupby("fashion_news_frequency")["clean_prod_name"].sum() \
            .transform(lambda x: (x / x.sum() * 100)).rename('Purchased Quantity(%)').reset_index(), palette="Spectral", edgecolor="black")
plt.xlabel("Fashion News Frequency",fontweight="bold", size=22)
plt.ylabel("Purchased Quantity (%)",fontweight="bold", size=25)
 
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

**Customers who are active on the fashion news are responsible for 43% of the total purchases, while the remaining 57% of the purchased quantity comes from customers not registered for the fashion news.** <br>
The other 2 categories "Monthly" and "None" can be ignored and won't be considered in the further analysis.

It could then be interesting to check the fashion news frequency by age group, to find more useful insights.

In [None]:
x, y = 'age', 'fashion_news_frequency'
df_age_news = cust_details.groupby(x)[y].value_counts(normalize=True)
df_age_news = df_age_news.mul(100)
df_age_news = df_age_news.rename('percent(%)').reset_index()
df_age_news = df_age_news[df_age_news["fashion_news_frequency"].isin(["Regularly","NONE"])]

In [None]:
palette1 = {"Regularly":'#46C646', "NONE":'#FF0000'}

plt.figure(figsize=(13,6))
plt.title("Fashion News Frequency by age group\n",fontweight="bold", size=33)
g=sns.barplot(x="age", y="percent(%)",data=df_age_news, hue="fashion_news_frequency", palette=palette1)
plt.xlabel("Age group",fontweight="bold", size=22)
plt.ylabel("Percentage (%)",fontweight="bold", size=25)
 
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.legend(title='News\nFrequency',bbox_to_anchor=(1.0, 1.0), ncol=1, fancybox=True, shadow=True, fontsize=17,title_fontsize=22)
plt.show()

We can see that **customers in the ranges 20-30 and 30-40 have the lowest percentage of regular fashion news readers, while being the groups which buy the most**.<br>
Moreover, **the share of customers that regularly check the fashion news starts increasing from the range 40-50, with a peak value of 43.7% of regular/active users for customers in the range 70+ years old**. This means that **checking fashion news seems to be more effective for older customers, who still represent a small percentage of total sold products, while younger customers do not need to check the news to buy new products**. <br>
**It could be effective for the company to invite younger customers (range 20-40) to check the news more frequently in order to increase the sold items.**

# Q8 - Does the club member status influence the purchased quantity?

In [None]:
cust_details["club_member_status"].value_counts(normalize=True)

We can see that:
- More than 93% of the customers belong to the ACTIVE category
- 6.8% of the customers belong to the PRE-CREATE category
- 0.3% of the customers belong to the LEFT CLUB category

**This shows a very high imbalance among the classes: if we consider the sum of purchased products per category, it will likely show that most of the purchased products belong to the ACTIVE members.**

In [None]:
cust_details.groupby("club_member_status")["clean_prod_name"].sum()

**Indeed, more customers in a group lead to higher total purchases. For this reason, it is wiser to consider the mean purchased quantity instead of the sum:**

In [None]:
print("The average quantity of purchased products by the customers is {:.0f} products ".format(cust_details["clean_prod_name"].mean()))

In [None]:
print("The average quantity of purchased products by the ACTIVE customers is {:.0f} products ".format(cust_details.groupby("club_member_status")["clean_prod_name"].mean()["ACTIVE"]))
print("The average quantity of purchased products by the LEFT-CLUB customers is {:.0f} products ".format(cust_details.groupby("club_member_status")["clean_prod_name"].mean()["LEFT CLUB"]))
print("The average quantity of purchased products by the PRE-CREATE customers is {:.0f} products ".format(cust_details.groupby("club_member_status")["clean_prod_name"].mean()["PRE-CREATE"]))

By considering the mean, we can see a very different situation, which is shown in the following plot:

In [None]:
plt.figure(figsize=(9,5))
plt.title("Average Purchased Quantity by Club Member Status\n", fontweight="bold", size=22)
g = sns.barplot(x="club_member_status", y="clean_prod_name", data=cust_details.groupby("club_member_status")["clean_prod_name"].mean().astype(int).reset_index(), palette="viridis", edgecolor="black")
plt.axhline(y = cust_details["clean_prod_name"].mean(), color = 'r', linestyle = '--')
plt.text(0.76, 23.7, 'Mean Purchased Quantity: {:.0f}'.format(cust_details["clean_prod_name"].mean()), size=16, color="red",fontweight="bold")
plt.xlabel("Club Member Status",fontweight="bold", size=20)
plt.ylabel("Average Purchased Quantity",fontweight="bold", size=16)
 
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

**This plot shows that the average purchased quantity differs a lot among the categories**. <br>
In particular, **customers belonging to the ACTIVE club purchase more products than those in other categories, while those in the PRE-CREATE category purchase on average less than a third of what ACTIVE customers do**.

Finally, since the distribution of the purchased quantity is heavily right-skewed, it could be interesting to also check the median purchased quantity.

In [None]:
plt.figure(figsize=(9,5))
plt.title("Median Purchased Quantity by Club Member Status\n", fontweight="bold", size=22)
g = sns.barplot(x="club_member_status", y="clean_prod_name", data=cust_details.groupby("club_member_status")["clean_prod_name"].median().reset_index(), palette="viridis", edgecolor="black")
plt.axhline(y = cust_details["clean_prod_name"].median(), color = 'r', linestyle = '--')
plt.text(0.76, 9.3, 'Median Purchased Quantity: {:.2f}'.format(cust_details["clean_prod_name"].median()), size=16, color="red",fontweight="bold")
plt.xlabel("Club Member Status",fontweight="bold", size=20)
plt.ylabel("Median Purchased Quantity",fontweight="bold", size=16)
 
plt.grid(axis="y",color = 'grey', linestyle = '--', linewidth = 1.5)
plt.show()

Indeed, even if the median is quite different from the mean due to the high skewness of the data, a situation very similar to the mean purchased quantity can be observed, where ACTIVE customers buy more products.

**Online vs Offline**

In [None]:
 
sns.displot(data=transactions, x='t_dat', hue='sales_channel_id', kind='kde', fill=True)

# RFM Analysis


In [None]:
# import required libraries for clustering
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import linkage
from scipy.cluster.hierarchy import dendrogram
from scipy.cluster.hierarchy import cut_tree

In [None]:
transactions=pd.merge(transactions,articles,how='inner',on='article_id')
transactions=pd.merge(transactions,customers,how='inner',on='customer_id')
#consider transactions which are present in customer and articles.

transactions['InvoiceDate'] = pd.to_datetime(transactions['t_dat'],format='%Y-%m-%d')
transactions=transactions[["InvoiceDate", "customer_id", "article_id", "price","sales_channel_id"]].drop_duplicates()

In [None]:
transactions.shape

In [None]:
# checking df's missing value's attribution in %
df_null = round(100*(transactions.isnull().sum())/len(transactions), 2)
df_null

In [None]:
# checking df's missing value's attribution in %
df_null = round(100*(transactions.isna().sum())/len(transactions), 2)
df_null

In [None]:
transactions.loc[(transactions["customer_id"]=="03e315b7d4d160a7a3651cc558120165af7cbe5c67c343b87a7e7d81ccdd4595")
                &
                 (transactions["InvoiceDate"]=='2019-09-06')
                ]

In [None]:
transactions.loc[(transactions["customer_id"]=="2ddeebea89ec886bf30d163b1ccad3c8147b9fe9a12c972ff6552c6170743319")
                &
                 (transactions["article_id"]==778075001)
                ]

In [None]:
##Generate Invoice ID as combination of Customer id and Transaction Date.
transactions['_ID'] = transactions['customer_id']  + transactions['InvoiceDate'].astype(str) 

transactions['Invoice_id'] = pd.factorize(transactions['_ID'])[0]


In [None]:
transactions.head()

In [None]:
import datetime as dt

In [None]:
analysis_date = max(transactions['InvoiceDate']) + dt.timedelta(days= 1)
print((analysis_date).date())

In [None]:
rfm = transactions.groupby('customer_id').agg({
    'InvoiceDate': lambda x: (analysis_date - x.max()).days,
    'Invoice_id': 'count',
    'price': 'sum'})
rfm.columns=["Recency","Frequency","Monetary"]
rfm = rfm[rfm["Monetary"] > 0]
rfm.describe().T
 
rfm

#https://www.kaggle.com/code/kanberburak/rfm-analysis/notebook

In [None]:
#Date from customer's last purchase.The nearest date gets 5 and the furthest date gets 1.
rfm["recency_score"] = pd.qcut(rfm['Recency'], 5, labels=[5, 4, 3, 2, 1])
# Total number of purchases.The least frequency gets 1 and the maximum frequency gets 5.
rfm["frequency_score"] = pd.qcut(rfm["Frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
#Total spend by the customer.The least money gets 1, the most money gets 5.
rfm["monetary_score"]= pd.qcut(rfm["Monetary"],5,labels=[1,2,3,4,5])
rfm.head()

In [None]:
#RFM - The value of 2 different variables that were formed was recorded as a RFM_SCORE
rfm["RFM_SCORE"] = (rfm["recency_score"].astype(str) + rfm["frequency_score"].astype(str))

In [None]:
seg_map = {
    r'[1-2][1-2]': 'hibernating',
    r'[1-2][3-4]': 'at_Risk',
    r'[1-2]5': 'cant_loose',
    r'3[1-2]': 'about_to_sleep',
    r'33': 'need_attention',
    r'[3-4][4-5]': 'loyal_customers',
    r'41': 'promising',
    r'51': 'new_customers',
    r'[4-5][2-3]': 'potential_loyalists',
    r'5[4-5]': 'champions'
}
rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex=True)
rfm.head()

In [None]:
rfm[["segment", "Recency","Frequency","Monetary"]].groupby("segment").agg(["mean","count","max"]).round()

In [None]:
 
import plotly.express as px  # needed for px.treemap below

x = rfm.segment.value_counts()
fig = px.treemap(x, path=[x.index], values=x)
fig.update_layout(title_text='Distribution of the RFM Segments', title_x=0.5,
                  title_font=dict(size=20))
fig.update_traces(textinfo="label+value+percent root")
fig.show()

Due to the absence of customer-related data, customer segmentation is carried out based on the RFM (recency, frequency, monetary) value.
With RFM a business can divide customers into three segments:

- Low Value: Customers who are less active than others, are not very frequent buyers/visitors, and generate very low, zero, or even negative revenue.
- Mid Value: In the middle of everything. They use the platform often (but not as much as the High Value customers), are fairly frequent, and generate moderate revenue.
- High Value: The group we don't want to lose. High revenue, high frequency and low inactivity.

**Key k-means assumptions**

- Symmetric distribution of variables (not skewed)
- Variables with the same average values
- Variables with the same variance

In [None]:
def rfm_dist(column):
    r_skew = rfm[column].skew()

    plt.rcParams['font.size']=13
    plt.figure(figsize=(10,6))
    sns.distplot(rfm[column], color = 'tomato', kde_kws={"color": "k"})
    plt.title('{} Distribution'.format(column))

    plt.show()
    print('Skewness value of data {}: {}'.format(column, r_skew))

In [None]:
rfm_dist('Recency')


In [None]:
rfm_dist('Monetary')

In [None]:
rfm_dist('Frequency')

In [None]:
print("The following is the skewness value for Recency, Frequency, and Monetary data:",
      '\nRecency   : {}'.format(rfm['Recency'].skew()),
      '\nFrequency : {}'.format(rfm['Frequency'].skew()),
      '\nMonetary  : {}'.format(rfm['Monetary'].skew()))

As a general rule of thumb:

- If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
- If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the data are moderately skewed.
- If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are highly skewed.

From this rule of thumb, it can be concluded that Recency, Frequency, and Monetary are highly positively skewed. So, the data must be transformed to bring its distribution closer to a normal one.
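The rule of thumb can be written as a small helper. This is a sketch only: the function name `describe_skew` and the toy series are hypothetical, and values exactly on a boundary (for example 0.5) are assigned to the milder category here, which the rule itself does not specify.

```python
import pandas as pd

def describe_skew(skew):
    """Map a skewness value to the rule-of-thumb category."""
    magnitude = abs(skew)
    if magnitude <= 0.5:
        return "fairly symmetrical"
    direction = "positively" if skew > 0 else "negatively"
    if magnitude <= 1:
        return f"moderately {direction} skewed"
    return f"highly {direction} skewed"

# Toy data (made up) with one large outlier on the right
toy = pd.Series([1, 1, 1, 2, 2, 3, 50])
print(describe_skew(toy.skew()))
```

In the notebook it could be applied per column, for example `rfm[['Recency','Frequency','Monetary']].skew().map(describe_skew)`.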

In [None]:
rfm_norm = rfm[['Recency','Frequency','Monetary']]
rfm_norm.describe()

In [None]:
from sklearn.preprocessing import PowerTransformer, StandardScaler


In [None]:
power = PowerTransformer().fit(rfm_norm)
rfm_transform =power.transform(rfm_norm)

#Standard Scaler to Scale the data
scaler = StandardScaler().fit(rfm_transform)
rfm_transform = scaler.transform(rfm_transform)

rfm_transform =pd.DataFrame(rfm_transform, index=rfm_norm.index, columns=rfm_norm.columns)
print('These are the mean value after data transformation:')
rfm_transform.describe()

In [None]:
def rfm_dist_transform(column):
    r_skew_transform = rfm_transform[column].skew()

    plt.rcParams['font.size']=13
    plt.figure(figsize=(10,6))
    sns.distplot(rfm_transform[column], color = 'g', kde_kws={"color": "k"})
    plt.title('{} Distribution after Data Transformation'.format(column))

    plt.show()
    print('Skewness value of data {} after data transformation: {}'.format(column, r_skew_transform))

In [None]:
rfm_dist_transform('Recency')


In [None]:
rfm_dist_transform('Frequency')


In [None]:
rfm_dist_transform('Monetary')


In [None]:
print("The following is the skewness value for Recency, Frequency, and Monetary data after data transformation:",
      '\nRecency   : {}'.format(rfm_transform['Recency'].skew()),
      '\nFrequency : {}'.format(rfm_transform['Frequency'].skew()),
      '\nMonetary  : {}'.format(rfm_transform['Monetary'].skew()))

Since the skewness values are between -0.5 and 0.5, i.e. close to 0, the transformed RFM variables can be treated as approximately symmetric, which is what k-means expects.



**Determining the number of clusters**

**Elbow Method**

The elbow method runs k-means clustering on the dataset for a range of values of k (say from 1 to 10) and, for each value of k, calculates the sum of squared errors (SSE).
The idea of the elbow method is to choose the k at which the SSE stops decreasing abruptly, i.e. the point after which adding more clusters yields only a small improvement.

In [None]:
#Elbow Method
inertia=[]
cl = 15

for i in range(1,cl+1):
    kmeans = KMeans(n_clusters= i, init='k-means++', random_state=42)
    kmeans.fit(rfm_transform)
    inertia.append(kmeans.inertia_) #sum of squared distances to closest cluster center


fig,ax = plt.subplots()
fig.set_size_inches(10, 7)
plt.rcParams['font.size']=16
plt.plot(range(1,cl+1), inertia, 'o-', color = 'g')
plt.xticks(np.arange(1, cl + 1))  # ticks match the plotted k values 1..cl
plt.title('The Elbow Method')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Sum of Squared Distances')
plt.grid(True)
plt.show()

**Silhouette Analysis**

The silhouette score measures how close each point in one cluster is to the points in the neighbouring clusters. It has a range of [-1, 1]:

- A value near +1 indicates that the sample is far away from the neighbouring clusters.
- A value of 0 indicates that the sample is on or very close to the decision boundary between two neighbouring clusters.
- Negative values indicate that the sample might have been assigned to the wrong cluster.
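A silhouette analysis along these lines could be sketched as follows. This is an illustration on synthetic blobs, not the notebook's actual computation: the data `X` stands in for the transformed RFM features, and the `sample_size` value is an assumption chosen because the full pairwise computation becomes expensive on a large customer base.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for rfm_transform[['Recency', 'Frequency', 'Monetary']]
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 7):  # the silhouette score is undefined for a single cluster
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit_predict(X)
    # sample_size evaluates the score on a random subset to keep the cost manageable
    scores[k] = silhouette_score(X, labels, sample_size=300, random_state=42)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("Highest silhouette score at k =", best_k)
```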

We will choose 4 clusters based on the elbow method, because there is no significant decrease in the sum of squared distances (SSE) after 4 clusters. Based on the silhouette analysis, the silhouette score is closest to 1 for 2 and 4 clusters.
By combining the elbow method and the silhouette analysis, 4 clusters are selected.

**KMeans Clustering**


In [None]:
n = 4

model = KMeans(n_clusters=n,init='k-means++',max_iter=300, n_init=10, random_state=0)

#fit the data
model_kmeans = model.fit(rfm_transform)

# These are our fitted labels for clusters -- the first cluster has label 0, and the second has label 1.
labels = model_kmeans.labels_

centers = model_kmeans.cluster_centers_

#Append labels/clusters to scaled and original dataframe
rfm['Cluster'] = labels
clusters_seg = pd.DataFrame(rfm_transform,columns=['Recency','Frequency','Monetary'])
clusters_seg['Cluster'] = labels+1

clusters = []

for i in range(0,n):
    clusters.append(rfm[rfm['Cluster']==i])

values = power.inverse_transform(scaler.inverse_transform(centers))

for i in range(0,n):
    print('-----CLUSTER {}-----'.format(i+1))
    print('Number of Customers: {}'.format(clusters[i].shape[0]))
    print('(Recency)Last Day Shopping: {:4.0f}'.format(values[i][0]))
    print('(Frec)Purchases Frequency : {:4.0f}'.format(values[i][1]))
    print('(Monetary)Total purchases: {:6.2f}\n'.format(values[i][2]))


In [None]:
rfm_transform['Cluster'] = model_kmeans.labels_
rfm_transform.head()

In [None]:
rfm.groupby('Cluster').agg({'Recency':'mean',
                                 'Frequency':'mean',
                                 'Monetary':['mean','count']}).round(1)

**Snake plot**

- Market research technique to compare different segments
- Visual representation of each segment's attributes

In [None]:
data_melt = pd.melt(rfm_transform.reset_index(), id_vars=['customer_id','Cluster'], value_vars=['Recency','Frequency','Monetary'], var_name='Attribute', value_name='Value')
plt.figure(figsize=(14,10))
plt.title('Snake plot of standardized variables')
ax = sns.lineplot(x='Attribute', y='Value', hue='Cluster',palette=['tomato','cornflowerblue','violet','mediumseagreen'], data=data_melt, sort=False)
ax.legend(['1','2','3','4'])
plt.show()

In [None]:
from mpl_toolkits.mplot3d import Axes3D


In [None]:
fig = plt.figure(figsize=(16,8))
ax = fig.add_subplot(projection='3d')  # Axes3D(fig) no longer attaches itself to the figure in recent matplotlib
ax.scatter(rfm_transform[rfm_transform['Cluster']==0]['Recency'],rfm_transform[rfm_transform['Cluster']==0]['Frequency'],rfm_transform[rfm_transform['Cluster']==0]['Monetary'],color='tomato', label='Cluster 1')
ax.scatter(rfm_transform[rfm_transform['Cluster']==1]['Recency'],rfm_transform[rfm_transform['Cluster']==1]['Frequency'],rfm_transform[rfm_transform['Cluster']==1]['Monetary'],color='cornflowerblue', label='Cluster 2')
ax.scatter(rfm_transform[rfm_transform['Cluster']==2]['Recency'],rfm_transform[rfm_transform['Cluster']==2]['Frequency'],rfm_transform[rfm_transform['Cluster']==2]['Monetary'],color='violet', label='Cluster 3')
ax.scatter(rfm_transform[rfm_transform['Cluster']==3]['Recency'],rfm_transform[rfm_transform['Cluster']==3]['Frequency'],rfm_transform[rfm_transform['Cluster']==3]['Monetary'],color='mediumseagreen', label='Cluster 4')
ax.scatter(centers[:,0], centers[:,1], centers[:,2], s = 250, c = 'k', marker='*', label = 'Centroid')
plt.legend(loc='center left',markerscale=2)
plt.title('3D Customer Segmentation', pad = 15)
ax.set_xlabel('Recency', labelpad = 15)
ax.set_ylabel('Frequency', labelpad = 15)
ax.set_zlabel('Monetary')
plt.show()

In [None]:
rfm.head(1)

In [None]:
rfm=rfm.reset_index().rename({'customer_id':'customer_id'}, axis = 'columns')
rfm=rfm.reset_index().set_index('customer_id', drop=False)


In [None]:
rfm.head(1)

### Conclusion: Customer Segmentation

* **Cluster 3: Champions.** These are our best customers: they bought recently, buy most frequently, and spend the most. Reward them. They can be early adopters of new products and will help promote the brand.
* **Cluster 4: At Risk.** These customers buy often and spend a lot of money, but have not shopped recently. Offers or discounts can bring them back to our store.
* **Cluster 1: Promising.** These customers have shopped recently, but their purchase frequency and spend are still small or below average. Build brand awareness and provide discounts.
* **Cluster 2: Lost.** These customers have not shopped for a long time, and their purchase frequency and spend are also very low. Try to revive their interest with a reach-out campaign; otherwise ignore them.
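The cluster-to-segment reading above can be attached to the data as a label column. The following is a minimal sketch, not the notebook's own code: it assumes the zero-based `Cluster` ids match the plot legend (`Cluster == 0` is "Cluster 1", and so on), and the names `CLUSTER_NAMES`, `label_clusters` and `cluster_segment` are hypothetical.

```python
import pandas as pd

# Assumption: the 3D plot labels Cluster == 0 as "Cluster 1", Cluster == 1 as "Cluster 2", etc.
CLUSTER_NAMES = {
    0: "Promising",  # Cluster 1
    1: "Lost",       # Cluster 2
    2: "Champions",  # Cluster 3
    3: "At Risk",    # Cluster 4
}

def label_clusters(frame, cluster_col="Cluster"):
    """Return a copy of `frame` with a hypothetical `cluster_segment` column."""
    labelled = frame.copy()
    labelled["cluster_segment"] = labelled[cluster_col].map(CLUSTER_NAMES)
    return labelled

# Toy data standing in for the real rfm table
demo = pd.DataFrame({"customer_id": ["a", "b", "c", "d"], "Cluster": [2, 3, 0, 1]})
print(label_clusters(demo))
```

Keeping the mapping in one dictionary means the business names only need to change in one place if the clustering is rerun and the cluster ids shift.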


In [None]:
# Build rfm_customers first; the sunburst chart in the next cell depends on it
rfm_customers=pd.merge(rfm,customers ,how='inner',on='customer_id')

In [None]:
px.sunburst(rfm_customers, path=[ 'Cluster', 'segment'], title="Customers' behaviour")

In [None]:
#px.sunburst(rfm_customers, path=['club_member_status','Cluster'], title="Customers' behaviour")

# Recommend Items Frequently Purchased Together


The most popular method is item-based collaborative filtering built on cosine similarity.

Given the time available and the scope of the course work, we decided not to pursue complex techniques. Instead, as a proof of concept, we follow the 'common sense method', which relies purely on each customer's past article purchases combined with an exponential time factor.

As a future recommendation, we suggest that H&M proceed with a collaborative-filtering recommendation system.

The common sense method is the simplest approach. It can be extended with similarity, purchase patterns, and likeability; we leave this to future research, and our work recommends next steps, including candidate algorithms.

The recommendation for our proof of concept is based on the 'common sense method':

    (1) Part 1 of 2: recommend items purchased together (market basket analysis)
    (2) Part 2 of 2: recommend each customer's most often previously purchased items (time-decaying popularity)
    (3) Part 2 of 2: recommend last week's most popular items (time-decaying popularity)

    Results
     Part 1: we obtained 1912 association rules at support >= 0.01
     Part 2: mAP score on the validation set: 0.0243
    


    
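Mean Average Precision @ 12 (MAP@12):

$$\text{MAP@12} = \frac{1}{U} \sum_{u=1}^{U} \frac{1}{\min(m, 12)} \sum_{k=1}^{\min(n, 12)} P(k) \times rel(k)$$

where $U$ is the number of customers, $P(k)$ is the precision at cutoff $k$, $n$ is the number of predictions per customer, $m$ is the number of ground truth values per customer, and $rel(k)$ is an indicator function equaling 1 if the item at rank $k$ is a relevant (correct) label, zero otherwise.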
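The metric described above can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the scoring code behind the 0.0243 result: the function names are hypothetical, repeated predictions are counted only once, and customers without any ground-truth purchases score 0 here, so callers may prefer to exclude them before averaging.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one customer: `actual` holds purchased items, `predicted` is a ranked list."""
    actual = set(actual)
    if not actual:
        return 0.0  # assumption: no ground truth means no score
    hits = 0
    score = 0.0
    seen = set()
    for rank, item in enumerate(predicted[:k], start=1):
        # rel(k) is 1 only for a correct item that has not been predicted earlier
        if item in actual and item not in seen:
            hits += 1
            score += hits / rank  # P(k) at this rank
        seen.add(item)
    return score / min(len(actual), k)


def map_at_k(actual_lists, predicted_lists, k=12):
    """Mean of AP@k over all customers."""
    scores = [average_precision_at_k(a, p, k)
              for a, p in zip(actual_lists, predicted_lists)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: two customers, item ids as strings
truth = [["a", "b"], ["c"]]
preds = [["a", "x", "b"], ["x", "y"]]
print(map_at_k(truth, preds))
```

For the first toy customer the hits fall at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2; the second customer has no hits and contributes 0.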

https://www.kaggle.com/code/cdeotte/recommend-items-purchased-together-0-021/notebook
https://www.kaggle.com/code/mayukh18/time-decaying-popularity-benchmark-0-0216/notebook
https://www.analyticsvidhya.com/blog/2021/07/recommendation-system-understanding-the-basic-concepts/#:~:text=A%20recommendation%20system%20is%20a,suggests%20relevant%20items%20to%20users.

# Market Basket Analysis
Market Basket Analysis is one of the key techniques large retailers use to uncover associations between items. It looks for products, or groups of products, that frequently occur together in the same buying transactions (baskets).
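As a small illustration of what "occurring together" means in numbers, the sketch below computes support, confidence and lift for one item pair on four made-up baskets. The baskets, item names and the `pair_metrics` helper are hypothetical and only meant to show the definitions; supports are kept as fractions rather than percentages.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets, one set of items per transaction
baskets = [
    {"jeans", "t-shirt"},
    {"jeans", "t-shirt", "socks"},
    {"jeans", "socks"},
    {"t-shirt"},
]
n_baskets = len(baskets)

# How many baskets contain each item, and each (alphabetically ordered) pair of items
item_count = Counter(item for basket in baskets for item in basket)
pair_count = Counter(pair for basket in baskets
                     for pair in combinations(sorted(basket), 2))


def pair_metrics(item_a, item_b):
    """Support, confidence and lift for the pair (item_a, item_b)."""
    support_a = item_count[item_a] / n_baskets
    support_b = item_count[item_b] / n_baskets
    support_ab = pair_count[tuple(sorted((item_a, item_b)))] / n_baskets
    return {
        "support": support_ab,                        # share of baskets with both items
        "confidence_a_to_b": support_ab / support_a,  # P(B | A)
        "confidence_b_to_a": support_ab / support_b,  # P(A | B)
        "lift": support_ab / (support_a * support_b), # 1.0 means independence
    }


print(pair_metrics("jeans", "t-shirt"))
```

A lift above 1 suggests the two items are bought together more often than chance would predict, while a lift below 1 suggests the opposite.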


In [None]:
#query = ''' select distinct t.*,a.product_code, a.prod_name from transactions as t
 #INNER JOIN articles as a ON t.article_id=a.article_id'''
#transaction_article=pysqldf(query)
#transaction_article.columns
 
transaction_article=pd.merge(transactions,articles[["article_id","clean_prod_name"
                      ,"product_type_name","index_name","product_group_name"
                     ]],how='inner',on='article_id')


In [None]:
articles.head(1)

In [None]:
transaction_article['Quantity'] = 1
#tst1['type'] = pd.Series(["omnichannel" for x in range(len(tst1.index))])
transaction_article.head()

In [None]:
#transaction_article.count()
##prod_name, product_type_name, index_name, colour_group_name
# One row per customer, purchase date and product, with quantity and price summed.
# (The earlier drop_duplicates() version was overwritten immediately, so it is omitted.)
transaction_article1 = transaction_article.groupby(["customer_id","InvoiceDate","clean_prod_name"]).agg({
    'Quantity': 'sum',
    'price': 'sum'})


In [None]:
transaction_article1=transaction_article1.reset_index()


In [None]:
transaction_article1['_ID'] = transaction_article1['customer_id']  + transaction_article1['InvoiceDate'].astype(str) 
##+transactions['article_id'].astype(str) 


transaction_article1['Invoice_id'] = pd.factorize(transaction_article1['_ID'])[0]

In [None]:
# Sanity check: list any Invoice_id linked to more than one customer.
# Since the id is built from customer_id + InvoiceDate, none are expected.
agg_func_count = {'customer_id': [ 'nunique']}

tst=transaction_article1.groupby(['Invoice_id']).agg(agg_func_count)
tst.columns = tst.columns.to_flat_index().str.join('_')
tst=tst.reset_index()
tst1=tst.loc[tst['customer_id_nunique']!=1]
tst1.head()

In [None]:
basket = transaction_article1.groupby(['Invoice_id','clean_prod_name'])['Quantity'].sum()

In [None]:
basket=basket.fillna(0)

In [None]:
basket=basket.reset_index()

In [None]:
basket.head(10)

In [None]:
basket.count()

In [None]:
#!pip install pandas==0.21
import pandas as pd

In [None]:
# Convert from DataFrame to a Series, with Invoice_id as index and the product name as value
orders  = basket.set_index('Invoice_id')['clean_prod_name'].rename('item_id')
display(orders.head(10))
type(orders)

In [None]:
import sys
from collections import Counter
from itertools import combinations, groupby

import numpy as np
from IPython.display import display

In [None]:
# Returns frequency counts for items and item pairs
def freq(iterable):
    if isinstance(iterable, pd.Series):
        return iterable.value_counts().rename("freq")
    else: 
        return pd.Series(Counter(iterable)).rename("freq")

    
# Returns number of unique orders
def order_count(order_item):
    return len(set(order_item.index))


# Returns generator that yields item pairs, one at a time
def get_item_pairs(order_item):
    order_item = order_item.reset_index().to_numpy()
    for order_id, order_object in groupby(order_item, lambda x: x[0]):
        item_list = [item[1] for item in order_object]
              
        for item_pair in combinations(item_list, 2):
            yield item_pair
            

# Returns frequency and support associated with item
def merge_item_stats(item_pairs, item_stats):
    return (item_pairs
                .merge(item_stats.rename(columns={'freq': 'freqA', 'support': 'supportA'}), left_on='item_A', right_index=True)
                .merge(item_stats.rename(columns={'freq': 'freqB', 'support': 'supportB'}), left_on='item_B', right_index=True))


# Returns name associated with item
def merge_item_name(rules, item_name):
    columns = ['itemA','itemB','freqAB','supportAB','freqA','supportA','freqB','supportB', 
               'confidenceAtoB','confidenceBtoA','lift']
    rules = (rules
                .merge(item_name.rename(columns={'item_name': 'itemA'}), left_on='item_A', right_on='item_id')
                .merge(item_name.rename(columns={'item_name': 'itemB'}), left_on='item_B', right_on='item_id'))
    return rules[columns]  

In [None]:
def association_rules(order_item, min_support):

    print("Starting order_item: {:22d}".format(len(order_item)))


    # Calculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Filter from order_item items below min support 
    qualifying_items       = item_stats[item_stats['support'] >= min_support].index
    order_item             = order_item[order_item.isin(qualifying_items)]

    print("Items with support >= {}: {:15d}".format(min_support, len(qualifying_items)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Filter from order_item orders with less than 2 items
    order_size             = freq(order_item.index)
    qualifying_orders      = order_size[order_size >= 2].index
    order_item             = order_item[order_item.index.isin(qualifying_orders)]

    print("Remaining orders with 2+ items: {:11d}".format(len(qualifying_orders)))
    print("Remaining order_item: {:21d}".format(len(order_item)))


    # Recalculate item frequency and support
    item_stats             = freq(order_item).to_frame("freq")
    item_stats['support']  = item_stats['freq'] / order_count(order_item) * 100


    # Get item pairs generator
    item_pair_gen          = get_item_pairs(order_item)


    # Calculate item pair frequency and support
    item_pairs              = freq(item_pair_gen).to_frame("freqAB")
    item_pairs['supportAB'] = item_pairs['freqAB'] / len(qualifying_orders) * 100

    print("Item pairs: {:31d}".format(len(item_pairs)))


    # Filter from item_pairs those below min support
    item_pairs              = item_pairs[item_pairs['supportAB'] >= min_support]

    print("Item pairs with support >= {}: {:10d}\n".format(min_support, len(item_pairs)))


    # Create table of association rules and compute relevant metrics
    item_pairs = item_pairs.reset_index().rename(columns={'level_0': 'item_A', 'level_1': 'item_B'})
    item_pairs = merge_item_stats(item_pairs, item_stats)
    
    item_pairs['confidenceAtoB'] = item_pairs['supportAB'] / item_pairs['supportA']
    item_pairs['confidenceBtoA'] = item_pairs['supportAB'] / item_pairs['supportB']
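    # Supports are stored as percentages, so multiply by 100 to keep lift on its usual scale (1 = independence)
    item_pairs['lift']           = item_pairs['supportAB'] / (item_pairs['supportA'] * item_pairs['supportB']) * 100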
    
    
    # Return association rules sorted by lift in descending order
    return item_pairs.sort_values('lift', ascending=False)

In [None]:
 
# min_support is compared with supports expressed in percent, so 0.01 means 0.01% of orders
rules = association_rules(orders, 0.01)

In [None]:
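import matplotlib.image as mpimg

# `pairs` must already exist: a dict whose values expose `article_id_A` and `article_id_B`.
# The image folder below is a local path and has to be adjusted to your environment.
for i, (k, v) in enumerate(pairs.items()):

    # Show at most 50 pairs
    if i == 50: break

    plt.figure(figsize=(20,5))

    img1 = mpimg.imread(f'F:/images/0{str(v["article_id_A"])[:2]}/0{int(v["article_id_A"])}.jpg')
    img2 = mpimg.imread(f'F:/images/0{str(v["article_id_B"])[:2]}/0{int(v["article_id_B"])}.jpg')
    plt.subplot(1,4,1)
    plt.title('When customers buy this',size=18)
    plt.imshow(img1)
    plt.subplot(1,4,2)
    plt.title('They buy this',size=18)
    plt.imshow(img2)

    plt.show()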

rules.count()


# Item-Based Collaborative Filtering Using Probabilistic Matrix Factorization



**Preparing the data**

We restrict the data to transactions on or after a minimum date. This reduces the dimensionality of the problem and discards transactions that carry little weight under time-decaying popularity.

We also drop articles that have not been bought often enough (more than 10 purchases are required).


https://www.kaggle.com/code/luisrodri97/item-based-collaborative-filtering

In [1]:
import pandas as pd
import numpy as np
import datetime
from tqdm import tqdm

In [2]:
transactions =  pd.read_csv("../input/h-and-m-personalized-fashion-recommendations/transactions_train.csv",
                           encoding="ISO-8859-1", dtype={'article_id':str},header=0).drop_duplicates()

In [3]:
start_date = datetime.datetime(2020,9,1)
# Filter transactions by date
transactions["t_dat"] = pd.to_datetime(transactions["t_dat"])
transactions = transactions.loc[transactions["t_dat"] >= start_date]

In [21]:

# Keep only articles that have been bought more than 10 times
article_bought_count = transactions[['article_id', 't_dat']].groupby('article_id').count().reset_index().rename(columns={'t_dat': 'count'})
most_bought_articles = article_bought_count[article_bought_count['count']>10]['article_id'].values
# .copy() so that adding the "bought" column does not trigger a SettingWithCopyWarning
transactions = transactions[transactions['article_id'].isin(most_bought_articles)].copy()
transactions["bought"]=1 

In [22]:
# Generate negative samples: random (article, customer) pairs labelled as not bought
np.random.seed(0)

negative_samples = pd.DataFrame({
    'article_id': np.random.choice(transactions.article_id.unique(), transactions.shape[0]),
    'customer_id': np.random.choice(transactions.customer_id.unique(), transactions.shape[0]),
    'bought': np.zeros(transactions.shape[0])
})

In [23]:
negative_samples.head()

In [16]:
from sklearn.metrics.pairwise import cosine_similarity


class ItemBased_RecSys:
    ''' Collaborative filtering using a custom sim(u,u'). '''

    def __init__(self, positive_transactions, negative_transactions, num_components=10):
        ''' Constructor '''
        self.positive_transactions = positive_transactions
        self.transactions = pd.concat([positive_transactions, negative_transactions])
        self.customers = self.transactions.customer_id.values
        self.articles = self.transactions.article_id.values
        self.bought = self.transactions.bought.values
        self.num_components = num_components

        self.customer_id2index = {c: i for i, c in enumerate(np.unique(self.customers))}
        self.article_id2index = {a: i for i, a in enumerate(np.unique(self.articles))}
        
    def __sdg__(self):
        for idx in tqdm(self.training_indices):
            # Get the current sample
            customer_id = self.customers[idx]
            article_id = self.articles[idx]
            bought = self.bought[idx]

            # Get the index of the user and the article
            customer_index = self.customer_id2index[customer_id]
            article_index = self.article_id2index[article_id]

            # Compute the prediction and the error
            prediction = self.predict_single(customer_index, article_index)
            error = (bought - prediction) # error
            
            # Update latent factors in terms of the learning rate and the observed error
            self.customers_latent_matrix[customer_index] += self.learning_rate * \
                                    (error * self.articles_latent_matrix[article_index] - \
                                     self.lmbda * self.customers_latent_matrix[customer_index])
            self.articles_latent_matrix[article_index] += self.learning_rate * \
                                    (error * self.customers_latent_matrix[customer_index] - \
                                     self.lmbda * self.articles_latent_matrix[article_index])
                
                
    def fit(self, n_epochs=10, learning_rate=0.001, lmbda=0.1):
        ''' Compute the matrix factorization R = P x Q '''
        self.learning_rate = learning_rate
        self.lmbda = lmbda
        n_samples = self.transactions.shape[0]
        
        # Initialize latent matrices
        self.customers_latent_matrix = np.random.normal(scale=1., size=(len(np.unique(self.customers)), self.num_components))
        self.articles_latent_matrix = np.random.normal(scale=1., size=(len(np.unique(self.articles)), self.num_components))

        for epoch in range(n_epochs):
            print('Epoch: {}'.format(epoch))
            self.training_indices = np.arange(n_samples)
            
            # Shuffle training samples and follow stochastic gradient descent
            np.random.shuffle(self.training_indices)
            self.__sdg__()

    def predict_single(self, customer_index, article_index):
        ''' Make a prediction for a specific user and article '''
        prediction = np.dot(self.customers_latent_matrix[customer_index], self.articles_latent_matrix[article_index])
        prediction = np.clip(prediction, 0, 1)
        
        return prediction

    def default_recommendation(self):
        ''' Calculate time decaying popularity '''
        # Weight each purchase by 1 / (days before 2020-09-23), so items bought more recently count more in the popularity list.
        # In simple words, item A bought 5 times on the first day of the train period ranks below item B bought 4 times on the last day of the train period.
        self.positive_transactions['pop_factor'] = self.positive_transactions['t_dat'].apply(lambda x: 1/(datetime.datetime(2020,9,23) - x).days)
        transactions_by_article = self.positive_transactions[['article_id', 'pop_factor']].groupby('article_id').sum().reset_index()
        return transactions_by_article.sort_values(by='pop_factor', ascending=False)['article_id'].values[:12]


    def predict(self, customers):
        ''' Make recommendations '''
        recommendations = []
        self.articles_latent_matrix[np.isnan(self.articles_latent_matrix)] = 0
        # Compute similarity matrix (cosine)
        similarity_matrix = cosine_similarity(self.articles_latent_matrix, self.articles_latent_matrix, dense_output=False)

        # Convert similarity matrix into a matrix containing the 12 most similar items' index for each item,
        # most similar first (an item is normally its own nearest neighbour, so it comes first)
        similarity_matrix = np.argsort(similarity_matrix, axis=1)
        similarity_matrix = similarity_matrix[:, -12:][:, ::-1]

        # Row i of the latent matrix belongs to the i-th unique article id (see article_id2index),
        # so indices must be mapped back through the unique ids, not through self.articles
        index2article_id = np.unique(self.articles)

        # Get default recommendation (time decay popularity)
        default_recommendation = self.default_recommendation()

        # Group articles by user and articles to compute the number of times each article has been bought by each user
        transactions_by_customer = self.positive_transactions[['customer_id', 'article_id', 'bought']].groupby(['customer_id', 'article_id']).count().reset_index()
        most_bought_article = transactions_by_customer.loc[transactions_by_customer.groupby('customer_id').bought.idxmax()]['article_id'].values

        # Make predictions
        for customer in tqdm(customers):
            try:
                # Retrieve the most bought article by customer
                user_most_bought_article_id = most_bought_article[self.customer_id2index[customer]]

                # Using the similarity matrix, get the articles most similar to it
                similar_articles = index2article_id[similarity_matrix[self.article_id2index[user_most_bought_article_id]]]

                # Interleave the 6 most popular articles with the 6 most similar articles
                aux = []
                for rec_idx in range(6):
                    aux.append(default_recommendation[rec_idx])
                    aux.append(similar_articles[rec_idx])

                recommendations.append(' '.join(aux))
            except (KeyError, IndexError):
                # Customer without transactions in the training window: return the default recommendation
                recommendations.append(' '.join(default_recommendation))
        
        return pd.DataFrame({
            'customer_id': customers,
            'prediction': recommendations,
        })

In [32]:
rec = ItemBased_RecSys(transactions, negative_samples, num_components=1000)
rec.fit(n_epochs=25)


In [33]:
customers = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/sample_submission.csv'
                       ,encoding="ISO-8859-1", dtype={'article_id':str},header=0  ).customer_id.unique()

In [34]:
recommendations = rec.predict(customers)

In [35]:
recommendations.head()

In [36]:
recommendations.to_csv('submission.csv', index=False)