# Objective

In this project, we will explore the grocery items sold by a store. I will try to answer three questions: which items are best-sellers, which item sets are frequent (extracted with the Apriori algorithm), and how those items are associated with each other. With these answers, the store could decide where to place items to get more sales.

We'll start by importing the required libraries.

In [0]:
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

In [0]:
# Loading the dataset
data = pd.read_csv('groceries.csv', header=None)

Here, we added header=None so that the first row is not treated as a header row.

In [5]:
# Checking the data
data.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,22,23,24,25,26,27,28,29,30,31
0,frankfurter,sausage,liver loaf,ham,chicken,beef,citrus fruit,tropical fruit,root vegetables,other vegetables,...,roll products,flour,pasta,margarine,specialty fat,sugar,soups,skin care,hygiene articles,candles
1,citrus fruit,semi-finished bread,margarine,ready soups,,,,,,,...,,,,,,,,,,
2,tropical fruit,yogurt,coffee,,,,,,,,...,,,,,,,,,,
3,whole milk,,,,,,,,,,...,,,,,,,,,,
4,pip fruit,yogurt,cream cheese,meat spreads,,,,,,,...,,,,,,,,,,


In [6]:
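# Sum of the per-column unique counts reported by describe().
# An item that appears in several columns is counted once per column,
# so this is an upper bound, not the number of distinct items.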
data.describe().iloc[1, :].sum()

2201

In this wide format, finding associations between items would take a lot of computation, iteration, and code. To simplify this, I'll convert the dataframe into a **list of transactions**, where each transaction is the list of items bought together. Despite its name, the function below returns a plain Python list rather than a generator.

In [0]:
def p_item_generator(dataset):
    # Build one list of item names per row (one row = one transaction)
    transactions = []
    for row in dataset.values:
        # Skip the empty (NaN) cells that pad shorter transactions
        transactions.append([str(item) for item in row if pd.notna(item)])
    return transactions
        
transactions = p_item_generator(data)

In [10]:
# Checking the best-sellers or most sold items in the store.
from collections import Counter
Counter(x for sublist in transactions for x in sublist).most_common()[0:10]

[('whole milk', 2513),
 ('other vegetables', 1903),
 ('rolls/buns', 1809),
 ('soda', 1715),
 ('yogurt', 1372),
 ('bottled water', 1087),
 ('root vegetables', 1072),
 ('tropical fruit', 1032),
 ('shopping bags', 969),
 ('sausage', 924)]

The top 3 best-sellers are whole milk, other vegetables, and rolls/buns.

# Apriori Algorithm

In layman's terms, the Apriori algorithm finds item sets that occur frequently. These item sets are then used in association rule analysis to show how items are associated with each other.

The common metrics to analyze association are:

**Support:** The fraction of transactions in the dataset that contain an item (or a set of items). Let's say we have these transactions:

1.   ['tropical fruit', 'yogurt', 'coffee', 'chicken']
2.   ['pip fruit', 'yogurt', 'cream cheese', 'chicken', 'meat spreads']
3.   ['whole milk', 'cereals']
4.   ['chicken', 'tropical fruit']

```
support(chicken) = 3/4
support(tropical fruit) = 2/4
support(chicken, tropical fruit) = 2/4
```
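
The same numbers can be checked with a few lines of Python. This is a minimal sketch: the helper `support` and the variable `toy_transactions` are names made up for illustration and are not part of mlxtend.

```python
# Toy transactions from the example above (illustrative, not from groceries.csv)
toy_transactions = [
    ['tropical fruit', 'yogurt', 'coffee', 'chicken'],
    ['pip fruit', 'yogurt', 'cream cheese', 'chicken', 'meat spreads'],
    ['whole milk', 'cereals'],
    ['chicken', 'tropical fruit'],
]

def support(transactions, *items):
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    hits = sum(1 for t in transactions if wanted.issubset(t))
    return hits / len(transactions)

print(support(toy_transactions, 'chicken'))                    # 0.75
print(support(toy_transactions, 'tropical fruit'))             # 0.5
print(support(toy_transactions, 'chicken', 'tropical fruit'))  # 0.5
```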





**Confidence:** The likelihood that a customer who bought item1 also bought item2. For example, how often a customer who bought 'chicken' also bought 'tropical fruit'. Let's run the above example:

```
confidence{chicken -> tropical fruit} = support(chicken, tropical fruit) / support(chicken)
= (2/4) / (3/4) 
= 0.67 
= 67%
```

A confidence value of 0.67 implies that, of all transactions that contain chicken, 67% also contain tropical fruit.



```
confidence{tropical fruit -> chicken} = support(tropical fruit, chicken) / support(tropical fruit)
= (2/4) / (2/4) 
= 1
= 100%
```

Here, we can see that all transactions containing tropical fruit also contain chicken.
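
As a quick check, confidence can be computed directly from support. The sketch below assumes the four toy transactions from the support example; `support` and `confidence` are illustrative helper names, not mlxtend functions.

```python
# Toy transactions stored as sets so that subset tests are straightforward
toy_transactions = [
    {'tropical fruit', 'yogurt', 'coffee', 'chicken'},
    {'pip fruit', 'yogurt', 'cream cheese', 'chicken', 'meat spreads'},
    {'whole milk', 'cereals'},
    {'chicken', 'tropical fruit'},
]

def support(itemset):
    """Fraction of toy transactions that contain every item in `itemset`."""
    return sum(1 for t in toy_transactions if itemset <= t) / len(toy_transactions)

def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(round(confidence({'chicken'}, {'tropical fruit'}), 2))  # 0.67
print(confidence({'tropical fruit'}, {'chicken'}))            # 1.0
```

Note that confidence is not symmetric: swapping antecedent and consequent changes the denominator.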



**Lift:** It measures the strength of the relationship between items, that is, whether they appear together more often than would be expected by chance. There are two equivalent ways to calculate the lift:

**Method 1:**

```
lift{chicken, tropical fruit} = support(chicken, tropical fruit) / (support(chicken) * support(tropical fruit))
= (2/4) / ((3/4) * (2/4))
= 1.333333333
```

**Method 2:**


```
lift{chicken, tropical fruit} = confidence{chicken -> tropical fruit} / support(tropical fruit)
= (2/3) / (2/4)
= 1.333333333
```

**Note:** lift{chicken, tropical fruit} = lift{tropical fruit, chicken}. Let's test this:


```
lift{tropical fruit, chicken} = confidence{tropical fruit -> chicken} / support(chicken)
= 1 / (3/4)
= 1.333333333
```

To understand lift, keep the following three points in mind.

1. lift = 1 implies chicken and tropical fruit occur together only by chance (No relationship)
2. lift > 1 implies chicken and tropical fruit occur together more often than random. (Positive relationship)
3. lift < 1 implies chicken and tropical fruit occur together less often than random. (Negative relationship)


In the above example, lift{chicken, tropical fruit} = 1.33 implies that chicken and tropical fruit occur together about 1.33 times as often as would be expected if they were bought independently.
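
The two ways of computing lift can be compared in code. This sketch again assumes the toy transactions from the support example and uses exact fractions, which avoids rounding the confidence to 0.67 along the way. The helper name `support` is illustrative.

```python
from fractions import Fraction

toy_transactions = [
    {'tropical fruit', 'yogurt', 'coffee', 'chicken'},
    {'pip fruit', 'yogurt', 'cream cheese', 'chicken', 'meat spreads'},
    {'whole milk', 'cereals'},
    {'chicken', 'tropical fruit'},
]

def support(itemset):
    """Exact fraction of toy transactions that contain every item in `itemset`."""
    hits = sum(1 for t in toy_transactions if itemset <= t)
    return Fraction(hits, len(toy_transactions))

chicken, fruit = {'chicken'}, {'tropical fruit'}
both = chicken | fruit

# Method 1: joint support divided by the product of the individual supports
lift_method_1 = support(both) / (support(chicken) * support(fruit))

# Method 2: confidence divided by the support of the consequent
confidence = support(both) / support(chicken)
lift_method_2 = confidence / support(fruit)

print(lift_method_1, lift_method_2)  # 4/3 4/3
print(float(lift_method_1))          # about 1.33
```

Both methods give exactly 4/3, and the value is the same whichever item is treated as the antecedent.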

## Encoding

For the Apriori algorithm to work, the dataset needs to be one-hot encoded. For encoding, we are going to use TransactionEncoder. Read the docs here: https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/

In [11]:
tc = TransactionEncoder()
tc_ary = tc.fit(transactions).transform(transactions)
df = pd.DataFrame(tc_ary, columns=tc.columns_)
df.head(2)

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,baby food,bags,baking powder,bathroom cleaner,beef,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,True,...,False,False,False,True,False,False,False,True,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
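
To make the idea of one-hot encoding concrete, here is a minimal pure-Python sketch on the four toy transactions from the support example. It only illustrates the shape of the result (one boolean column per item, one row per transaction) and is not how TransactionEncoder is implemented. The names `columns` and `encoded` are illustrative.

```python
toy_transactions = [
    ['tropical fruit', 'yogurt', 'coffee', 'chicken'],
    ['pip fruit', 'yogurt', 'cream cheese', 'chicken', 'meat spreads'],
    ['whole milk', 'cereals'],
    ['chicken', 'tropical fruit'],
]

# One column per distinct item (sorted here for readability)
columns = sorted({item for t in toy_transactions for item in t})

# One row per transaction: True where the transaction contains the column's item
encoded = [[item in t for item in columns] for t in toy_transactions]

print(columns)
for row in encoded:
    print(row)
```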


## Frequent Item Sets

In [12]:
# Creating frequent item sets.
# apriori function doc: https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
frequent_itemsets = apriori(df, min_support=0.002, use_colnames=True)
frequent_itemsets.sample(5)

Unnamed: 0,support,itemsets
3560,0.005389,"(pastry, soda, rolls/buns)"
2402,0.002644,"(tropical fruit, cat food, whole milk)"
166,0.002542,"(napkins, UHT-milk)"
177,0.002237,"(sugar, UHT-milk)"
1642,0.002644,"(red/blush wine, shopping bags)"
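
For intuition about what `apriori` computes, the level-wise idea can be sketched in plain Python: keep the frequent single items, combine them into larger candidates, and discard any candidate that has an infrequent subset. This is a simplified teaching sketch run on the toy transactions from the support example. The function `apriori_sketch` is a made-up name and this is not mlxtend's implementation.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent item sets."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b) / n

    # Level 1: frequent single items
    items = sorted({item for b in baskets for item in b})
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = support(itemset)
        k += 1
        current_set = set(current)
        # Join step: merge frequent (k-1)-item sets into candidates of size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in current_set for s in combinations(c, k - 1))}
        current = [c for c in candidates if support(c) >= min_support]
    return frequent

toy_transactions = [
    ['tropical fruit', 'yogurt', 'coffee', 'chicken'],
    ['pip fruit', 'yogurt', 'cream cheese', 'chicken', 'meat spreads'],
    ['whole milk', 'cereals'],
    ['chicken', 'tropical fruit'],
]

result = apriori_sketch(toy_transactions, min_support=0.5)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a minimum support of 0.5, only chicken, tropical fruit, yogurt, and the pairs (chicken, tropical fruit) and (chicken, yogurt) survive on this toy data.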


Setting min_support is difficult: a good value changes from domain to domain and requires knowledge of the domain.

My formula for calculating it is simple, since we are generating associations for market basket optimization.

Suppose chicken is bought 3 times a day, i.e. 21 times a week. Dividing 21 by the total number of transactions, 9835, gives 21/9835 ≈ 0.0021.

You can try different minimum support.

Here, I extracted frequent item sets using apriori with at least 0.2% support. It means item sets with less than 0.2% support are excluded.
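
The arithmetic behind this threshold can be written out as follows. The figure of 3 sales a day is the assumption stated above, and 9835 is the total number of transactions. The variable names are illustrative.

```python
import math

total_transactions = 9835
assumed_weekly_sales = 3 * 7          # assumption: an item sells 3 times a day

# Support corresponding to the assumed weekly sales
estimated_support = assumed_weekly_sales / total_transactions
print(round(estimated_support, 4))    # 0.0021

# Smallest number of transactions an item set needs to reach min_support=0.002
min_support = 0.002
min_count = math.ceil(min_support * total_transactions)
print(min_count)                      # 20
```

In other words, with min_support=0.002 an item set must appear in at least 20 of the 9835 transactions to be kept.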

## Generating Rules

Documentation: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

In [13]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2).sort_values(by='lift', ascending=False)
rules.head(20)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
19645,"(tropical fruit, root vegetables)","(citrus fruit, other vegetables, whole milk)",0.021047,0.013015,0.003152,0.149758,11.506831,0.002878,1.160829
19644,"(citrus fruit, other vegetables, whole milk)","(tropical fruit, root vegetables)",0.013015,0.021047,0.003152,0.242188,11.506831,0.002878,1.291814
1,(hamburger meat),(Instant food products),0.033249,0.008033,0.00305,0.091743,11.421438,0.002783,1.092166
0,(Instant food products),(hamburger meat),0.008033,0.033249,0.00305,0.379747,11.421438,0.002783,1.55864
8904,(flour),"(sugar, whole milk)",0.017387,0.015048,0.002847,0.163743,10.881144,0.002585,1.177809
8901,"(sugar, whole milk)",(flour),0.015048,0.017387,0.002847,0.189189,10.881144,0.002585,1.21189
19613,"(yogurt, other vegetables, whole milk)","(tropical fruit, butter)",0.022267,0.009964,0.002339,0.105023,10.539791,0.002117,1.106213
19616,"(tropical fruit, butter)","(yogurt, other vegetables, whole milk)",0.009964,0.022267,0.002339,0.234694,10.539791,0.002117,1.277571
19639,"(tropical fruit, other vegetables, whole milk)","(root vegetables, citrus fruit)",0.017082,0.017692,0.003152,0.184524,10.429837,0.00285,1.204582
19650,"(root vegetables, citrus fruit)","(tropical fruit, other vegetables, whole milk)",0.017692,0.017082,0.003152,0.178161,10.429837,0.00285,1.195998


I extracted the association rules that have a lift of at least 1.2.
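
The columns of a rule can be cross-checked by hand. The sketch below takes the three support values of the rule (Instant food products) -> (hamburger meat) from the table above and recomputes the other metrics. The formulas for leverage and conviction are the standard definitions as I understand them from the mlxtend documentation. Small differences from the table are expected because the inputs shown here are already rounded.

```python
# Values copied from the rule (Instant food products) -> (hamburger meat)
antecedent_support = 0.008033
consequent_support = 0.033249
support = 0.00305

# confidence = support(A and C) / support(A)
confidence = support / antecedent_support

# lift = confidence / support(C)
lift = confidence / consequent_support

# leverage = support(A and C) - support(A) * support(C)
leverage = support - antecedent_support * consequent_support

# conviction = (1 - support(C)) / (1 - confidence)
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence={confidence:.4f} lift={lift:.2f} "
      f"leverage={leverage:.6f} conviction={conviction:.3f}")
```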

Let's say we are interested in rules with at least 2 antecedents, confidence above 0.75, and lift above 1.2.

In [15]:
# First add antecedent_len column to the rules
rules["antecedent_len"] = rules["antecedents"].apply(len)

rules[ (rules['antecedent_len'] >= 2) &
       (rules['confidence'] > 0.75) &
       (rules['lift'] > 1.2) ].head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
19632,"(tropical fruit, root vegetables, citrus fruit...",(other vegetables),0.003559,0.193493,0.003152,0.885714,4.577509,0.002463,7.05694,4
17295,"(tropical fruit, grapes, whole milk)",(other vegetables),0.002542,0.193493,0.002034,0.8,4.134524,0.001542,4.032537,3
19722,"(root vegetables, yogurt, fruit/vegetable juic...",(other vegetables),0.002542,0.193493,0.002034,0.8,4.134524,0.001542,4.032537,4
15802,"(tropical fruit, root vegetables, citrus fruit)",(other vegetables),0.005694,0.193493,0.004474,0.785714,4.060694,0.003372,3.763701,3
17098,"(tropical fruit, root vegetables, fruit/vegeta...",(other vegetables),0.003254,0.193493,0.002542,0.78125,4.037622,0.001912,3.686891,3
17743,"(onions, whipped/sour cream, whole milk)",(other vegetables),0.002745,0.193493,0.002135,0.777778,4.019677,0.001604,3.629283,3
19751,"(tropical fruit, root vegetables, pip fruit, w...",(other vegetables),0.003152,0.193493,0.00244,0.774194,4.001153,0.00183,3.571676,4
18467,"(root vegetables, sliced cheese, whole milk)",(other vegetables),0.003152,0.193493,0.00244,0.774194,4.001153,0.00183,3.571676,3
15639,"(root vegetables, citrus fruit, frozen vegetab...",(other vegetables),0.002644,0.193493,0.002034,0.769231,3.975504,0.001522,3.494865,3
15331,"(tropical fruit, butter, whipped/sour cream)",(other vegetables),0.00305,0.193493,0.002339,0.766667,3.962253,0.001748,3.45646,3


# Conclusion

From the output, we can see that one set of items tends to be purchased together with another set of items from the same family. For example, {root vegetables, tropical fruit} is purchased together with {other vegetables, whole milk, citrus fruit}.

