In [None]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules

In [None]:
df = pd.read_csv('../input/groceries-dataset/Groceries_dataset.csv', parse_dates=[1]) 

In [None]:
df.dtypes

In [None]:
df.isna().sum()

In [None]:
df

Each row of the data holds a member id, a date, and a single purchased item (the itemDescription column).
Since each row contains only one item, we first group the rows by member id and date to create itemsets (also known as baskets).

The issue with this grouping is that we only have the date of each purchase, not the time, so we cannot tell whether a member's items were bought in one visit or across several visits on the same day. Since there is no way to separate them, we assume that no member visited the store more than once a day.

In [None]:
# First, get a list of all items. The easiest way is to one-hot encode (dummify) the item description column
items_dummies = pd.get_dummies(df['itemDescription'])

In [None]:
# Baskets will be a list of items bought together.
# Collect the items into a list directly, so item names containing commas are not split.
baskets = df.groupby(['Date', 'Member_number'])['itemDescription'].agg(list).values

In [None]:
baskets, len(baskets)

In [None]:
df = df.join(items_dummies)

In [None]:
df

In [None]:
df.drop('itemDescription', axis=1, inplace=True)

In [None]:
df = df.groupby(['Date', 'Member_number']).sum() 

In [None]:
df.head()

In [None]:
df['basket'] = baskets
df.head(1) # Added the basket column to the dataset

In [None]:
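# Check that the one-hot counts match the basket lengths.
# The 'basket' column holds lists, so exclude it before summing the item counts.
(df.drop('basket', axis=1).sum(axis=1) != df['basket'].apply(len)).sum() # Perfect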

In [None]:
# Some baskets contain the same item more than once (e.g. {milk, milk})
# We only care whether an item is present, so each count needs to be capped at 1
len(np.where(df.drop('basket', axis=1)>1)[0])

In [None]:
for i in df.drop('basket', axis=1):
    df[i] = df[i].map(lambda x: 1 if x >1 else x)

In [None]:
df

In [None]:
len(np.where(df.drop('basket', axis=1)>1)[0])

In [None]:
df['UHT-milk'].sum()/len(df) # An example of calculating support for an item

In [None]:
# Drop the item list from the dataframe. No longer needed, since we have verified that the encoding is correct.
df.drop('basket', axis=1, inplace=True)

The support is the probability of an item (or itemset) being bought. Companies usually ignore promoting or working on products that have low support, since there is no point promoting a product that in any case hardly sells. 
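
As a small illustration of the definition, here is a sketch that computes support by hand on a made-up one-hot table. The `toy` data and its item names are hypothetical and are not taken from the groceries dataset.

```python
import pandas as pd

# Toy one-hot basket table (hypothetical data): one row per basket, one column per item.
toy = pd.DataFrame(
    {
        "milk":  [1, 1, 0, 1],
        "bread": [1, 0, 0, 1],
        "eggs":  [0, 1, 0, 0],
    }
).astype(bool)

# Support of a single item: fraction of baskets that contain it.
support_milk = toy["milk"].mean()

# Support of an itemset: fraction of baskets that contain every item in the set.
support_milk_bread = toy[["milk", "bread"]].all(axis=1).mean()

print(support_milk, support_milk_bread)
```

Here milk appears in 3 of 4 baskets, and milk and bread appear together in 2 of 4.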

In [None]:
# Let's keep the minimum support threshold at 0.1%
supports = apriori(df, min_support=1e-3, use_colnames=True)

In [None]:
# Down to 69 items. How many associations can we have?
supports

In [None]:
# Itemsets that contain more than one item
supports[supports['itemsets'].map(lambda x: len(x)>1)] 

It is good to have a threshold metric for association rules.
The two main options are "confidence" and "lift".

Confidence is the proportion of baskets containing the antecedent itemset that also contain the consequent item. For example, if 70% of all baskets with {Egg, Cheese} also contain {Milk}, then the confidence of the rule is 70%.

Lift is the influence the antecedent itemset has on the consequent item: the confidence of the rule divided by the support of the consequent. For instance, if 70% of baskets with {Egg, Cheese} contain {Milk}, but 80% of all baskets contain milk, then the lift is 70%/80% = 0.875. This means that although the confidence of the rule is high, a customer with egg and cheese in the basket is actually less likely to buy milk than the average customer.

Using confidence as a pruning metric may work in certain scenarios, but lift is usually the safer choice, because it corrects for how popular the consequent already is.
Ideally, the minimum lift threshold should be above 1. However, since the data we have is extremely limited, I have gone with a threshold of 1.
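
The egg, cheese and milk example can be worked through in a few lines. This is only a sketch: the 70% and 80% figures come from the text above, while the 10% support for {Egg, Cheese} and the function names `confidence` and `lift` are assumptions made up for illustration.

```python
def confidence(support_both, support_antecedent):
    """Share of antecedent baskets that also contain the consequent."""
    return support_both / support_antecedent


def lift(support_both, support_antecedent, support_consequent):
    """Confidence of the rule relative to the baseline support of the consequent."""
    return confidence(support_both, support_antecedent) / support_consequent


# Assumed supports (hypothetical): 10% of baskets contain {Egg, Cheese},
# 7% contain {Egg, Cheese, Milk}, and 80% contain Milk.
support_egg_cheese = 0.10
support_egg_cheese_milk = 0.07
support_milk = 0.80

conf = confidence(support_egg_cheese_milk, support_egg_cheese)
lft = lift(support_egg_cheese_milk, support_egg_cheese, support_milk)

# Confidence is about 0.7 and lift about 0.875, i.e. below 1.
print(round(conf, 3), round(lft, 3))
```

A lift below 1 means the antecedent makes the consequent less likely than its baseline, which is why a rule with high confidence can still be uninteresting.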

In [None]:
associations = association_rules(supports, metric='lift', min_threshold=1)

In [None]:
# Antecedent support is the support of the antecedent, i.e. the itemset before adding the new item
# Consequent support is the support of the consequent, i.e. the new item (in row 127, the consequent support
# is the same as the support of sausage)
associations = associations.sort_values('confidence', ascending=False)


The table below is sorted by confidence. The topmost association {Yogurt, Sausage} -> Whole Milk has a confidence of 0.25, i.e. 25% of all baskets containing yogurt and sausage also contain whole milk.

Similarly, 21% of baskets containing {Rolls/Buns, Sausage} also contain Whole Milk.

Whole milk is highly prevalent in this table, since its individual support is 15.7%, i.e. 15.7% of all baskets contain whole milk.

In [None]:
associations.head(10)

The table below is sorted by lift. This is a better metric to use when creating association rules, since it tells you how much more likely a customer is to buy an item given that the other items are already in the basket.

For the first two examples, a customer is twice as likely to buy yogurt and milk if he/she has already bought sausages, or sausages if he/she has already bought yogurt and milk.

An interesting association here is between citrus fruit and specialty chocolate (lift of 1.65) and between tropical fruit and flour (lift of 1.61). However, since the dataset is limited (only ~14000 baskets) and the support for the new rules is extremely low, this could be due to chance as well. 
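
To put the low support in perspective, a support value can be converted back into a number of baskets. The sketch below uses the approximate basket count quoted above (~14000) and the 0.1% support threshold used earlier. The helper name `supporting_baskets` is made up for illustration.

```python
def supporting_baskets(support, n_baskets):
    """Approximate number of baskets behind a given support value."""
    return round(support * n_baskets)


n_baskets = 14_000   # approximate figure quoted in the text, not the exact count
min_support = 1e-3   # the 0.1% threshold passed to apriori

# An itemset that just clears the threshold is backed by only about 14 baskets.
print(supporting_baskets(min_support, n_baskets))
```

With so few baskets behind a rule, a handful of chance co-purchases can move its lift noticeably.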

In [None]:
associations.sort_values('lift', ascending=False).head(10)