# Association Rule Mining - Market Basket Analysis
### Raul Almuzara

------------------------

We mine association rules with a support of at least $0.2$ and a confidence of at least $0.7$ from the list of transactions in *supermarket.arff*.

--------------------------

**Basic definitions in association rule mining:**

We consider rules *R* of the form *IF A THEN B* where *A* and *B* are sets of items.

- The **support** of a set of items is the number of transactions which contain all the items of the set divided by the total number of transactions in the dataset.

- The **support** of an association rule is the number of transactions which contain all the items involved in the rule (in A and B) divided by the total number of transactions in the dataset.

- The **confidence** of an association rule is the support of the union of A and B divided by the support of A.

$$\text{conf}(R) = \frac{\text{supp}(A\cup B)}{\text{supp}(A)}$$

- The **lift** of an association rule is the support of the union of A and B divided by the product of the support of *A* and the support of *B*.

$$\text{lift}(R) = \frac{\text{supp}(A\cup B)}{\text{supp}(A)\cdot\text{supp}(B)}$$
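To make the three definitions concrete, here is a minimal sketch that evaluates them on a handful of toy transactions. The transactions and the helper names `supp`, `conf` and `lift` are invented for illustration and are not part of the dataset or of any library.

```python
# Toy transactions, invented purely for illustration (not taken from supermarket.arff).
toy_transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def supp(items, transactions):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def conf(a, b, transactions):
    """Confidence of the rule IF a THEN b: supp(a ∪ b) / supp(a)."""
    return supp(set(a) | set(b), transactions) / supp(a, transactions)


def lift(a, b, transactions):
    """Lift of the rule IF a THEN b: supp(a ∪ b) / (supp(a) * supp(b))."""
    joint = supp(set(a) | set(b), transactions)
    return joint / (supp(a, transactions) * supp(b, transactions))


A, B = {"bread"}, {"milk"}
print("supp(A ∪ B) =", supp(A | B, toy_transactions))  # 3 of the 5 transactions
print("conf(A → B) =", conf(A, B, toy_transactions))   # (3/5) / (4/5), about 0.75
print("lift(A → B) =", lift(A, B, toy_transactions))   # (3/5) / ((4/5) * (4/5)), below 1
```

In this toy example the lift is slightly below 1: bread and milk appear together a little less often than they would if they were independent, even though the confidence of the rule is fairly high.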

--------------------------

### Libraries

In [273]:
import numpy as np
import pandas as pd
from scipy.io import arff
from apyori import apriori  # Apriori implementation from the apyori package

### Load data

In [274]:
data = arff.loadarff('supermarket.arff')
df = pd.DataFrame(data[0])

In [275]:
df

Unnamed: 0,department1,department2,department3,department4,department5,department6,department7,department8,department9,grocery misc,...,department208,department209,department210,department211,department212,department213,department214,department215,department216,total
0,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'high'
1,b't',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'low'
2,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'low'
3,b't',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'low'
4,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'low'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4622,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'low'
4623,b'?',b'?',b'?',b't',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'high'
4624,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'low'
4625,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',...,b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'?',b'low'


As the output shows, there are 4627 transactions and 217 attributes. The first 216 attributes indicate whether a specific product or department was purchased (*t*) or not (*?*). The 217th attribute (*total*) describes the total cost of the transaction (*low* or *high*); we drop it because we are only interested in the lists of products purchased.

In [276]:
df = df.drop(columns=['total'])

Full list of attributes of interest indicating the different products available:

In [277]:
attributes = list(df.columns)

In [278]:
attributes

['department1',
 'department2',
 'department3',
 'department4',
 'department5',
 'department6',
 'department7',
 'department8',
 'department9',
 'grocery misc',
 'department11',
 'baby needs',
 'bread and cake',
 'baking needs',
 'coupons',
 'juice-sat-cord-ms',
 'tea',
 'biscuits',
 'canned fish-meat',
 'canned fruit',
 'canned vegetables',
 'breakfast food',
 'cigs-tobacco pkts',
 'cigarette cartons',
 'cleaners-polishers',
 'coffee',
 'sauces-gravy-pkle',
 'confectionary',
 'puddings-deserts',
 'dishcloths-scour',
 'deod-disinfectant',
 'frozen foods',
 'razor blades',
 'fuels-garden aids',
 'spices',
 'jams-spreads',
 'insecticides',
 'pet foods',
 'laundry needs',
 'party snack foods',
 'tissues-paper prd',
 'wrapping',
 'dried vegetables',
 'pkt-canned soup',
 'soft drinks',
 'health food other',
 'beverages hot',
 'health&beauty misc',
 'deodorants-soap',
 'mens toiletries',
 'medicines',
 'haircare',
 'dental needs',
 'lotions-creams',
 'sanitary pads',
 'cough-cold-pain',
 'de

The values loaded from the *.arff* file are *bytes* objects (note the *b'?'* and *b't'* entries in the table above), so we convert them to strings and store them in a NumPy array for easier handling.

In [279]:
sales = df.to_numpy().astype(str)

In [280]:
sales

array([['?', '?', '?', ..., '?', '?', '?'],
       ['t', '?', '?', ..., '?', '?', '?'],
       ['?', '?', '?', ..., '?', '?', '?'],
       ...,
       ['?', '?', '?', ..., '?', '?', '?'],
       ['?', '?', '?', ..., '?', '?', '?'],
       ['t', '?', '?', ..., '?', '?', '?']], dtype='<U1')

Now, we want to build lists with the specific attributes involved in each individual transaction (those with the *t* symbol).

In [281]:
# For each row of the sales matrix, keep the names of the products marked "t" (purchased).
# The result is one list of product names per transaction.
transactions = [
    [name for name, flag in zip(attributes, row) if flag == "t"]
    for row in sales
]

In [282]:
transactions

[['baby needs',
  'bread and cake',
  'baking needs',
  'juice-sat-cord-ms',
  'biscuits',
  'canned vegetables',
  'cleaners-polishers',
  'coffee',
  'sauces-gravy-pkle',
  'confectionary',
  'dishcloths-scour',
  'frozen foods',
  'razor blades',
  'party snack foods',
  'tissues-paper prd',
  'wrapping',
  'mens toiletries',
  'cheese',
  'milk-cream',
  'margarine',
  'small goods',
  'fruit',
  'vegetables',
  'department122',
  '750ml white nz'],
 ['department1',
  'canned fish-meat',
  'canned fruit',
  'canned vegetables',
  'sauces-gravy-pkle',
  'deod-disinfectant',
  'frozen foods',
  'pet foods',
  'laundry needs',
  'tissues-paper prd',
  'deodorants-soap',
  'haircare',
  'milk-cream',
  'fruit',
  'vegetables'],
 ['bread and cake',
  'baking needs',
  'juice-sat-cord-ms',
  'biscuits',
  'canned fruit',
  'sauces-gravy-pkle',
  'puddings-deserts',
  'wrapping',
  'health food other',
  'small goods',
  'dairy foods',
  'beef',
  'lamb',
  'fruit',
  'vegetables',
  'sta

Finally, we use the *apriori* algorithm to find association rules with a minimum support of $0.2$ and a minimum confidence of $0.7$.

In [283]:
mins = 0.2
minc = 0.7

results = list(apriori(transactions,min_support=mins,min_confidence=minc))

In [284]:
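# Each result is a RelationRecord; its ordered_statistics field lists the rules
# derived from one frequent itemset. Only the first rule of each record is printed here.
for record in results:
    first_rule = record.ordered_statistics[0]
    print(list(first_rule.items_base), '->', list(first_rule.items_add))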

[] -> ['bread and cake']
['baking needs'] -> ['bread and cake']
['breakfast food'] -> ['baking needs']
['canned vegetables'] -> ['baking needs']
['jams-spreads'] -> ['baking needs']
['laundry needs'] -> ['baking needs']
['margarine'] -> ['baking needs']
['tissues-paper prd'] -> ['baking needs']
['wrapping'] -> ['baking needs']
['beef'] -> ['bread and cake']
['beef'] -> ['vegetables']
['biscuits'] -> ['bread and cake']
['biscuits'] -> ['fruit']
['breakfast food'] -> ['bread and cake']
['canned fruit'] -> ['bread and cake']
['canned vegetables'] -> ['bread and cake']
['cheese'] -> ['bread and cake']
['cleaners-polishers'] -> ['bread and cake']
['confectionary'] -> ['bread and cake']
['dairy foods'] -> ['bread and cake']
['department137'] -> ['bread and cake']
['frozen foods'] -> ['bread and cake']
['fruit'] -> ['bread and cake']
['jams-spreads'] -> ['bread and cake']
['juice-sat-cord-ms'] -> ['bread and cake']
['laundry needs'] -> ['bread and cake']
['margarine'] -> ['bread and cake']
['

In [285]:
print(f'There are {len(results)} rules with a minimum support of {mins}'
      f' and a minimum confidence of {minc}')

There are 363 rules with a minimum support of 0.2 and a minimum confidence of 0.7
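A note on reading these results: in *apyori*, each element returned by `apriori` corresponds to one frequent itemset and keeps every qualifying rule for that itemset in its `ordered_statistics` field. Indexing `[2][0]` therefore shows only the first rule per itemset, and `len(results)` counts itemsets with at least one qualifying rule rather than individual rules. The sketch below lists every rule together with its support, confidence and lift. To stay runnable without the dataset, it uses stand-in namedtuples that mirror the field names of apyori's records; the helper `flatten_rules` and all numbers in `demo_records` are made up for illustration.

```python
from collections import namedtuple

# Stand-ins mirroring the field names of apyori's record types, so the sketch
# runs without the dataset. With real data, pass the `results` list instead.
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"]
)
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"]
)


def flatten_rules(records):
    """Return one (antecedent, consequent, support, confidence, lift) tuple per rule."""
    return [
        (sorted(stat.items_base), sorted(stat.items_add),
         record.support, stat.confidence, stat.lift)
        for record in records
        for stat in record.ordered_statistics
    ]


# Made-up record for illustration only: one itemset that yields two rules.
demo_records = [
    RelationRecord(
        items=frozenset({"x", "y"}),
        support=0.4,
        ordered_statistics=[
            OrderedStatistic(frozenset({"x"}), frozenset({"y"}), 0.8, 1.6),
            OrderedStatistic(frozenset({"y"}), frozenset({"x"}), 0.8, 1.6),
        ],
    ),
]

rules = flatten_rules(demo_records)
for antecedent, consequent, support, confidence, lift in rules:
    print(antecedent, "->", consequent,
          f"(supp={support:.2f}, conf={confidence:.2f}, lift={lift:.2f})")
print(len(demo_records), "record(s),", len(rules), "rule(s)")
```

Applied to the real `results`, the number of flattened rules may be larger than the number of records reported above.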


Of course, not all of these rules are equally informative. A useful measure of interest is the lift: a lift of 1 means that *A* and *B* occur together exactly as often as expected if they were independent, so the interesting rules are those with a lift far from 1. The *apriori* function also accepts a minimum lift as an argument. For example, with a minimum lift of $1.33$, we get the following reduced set of more interesting rules.

In [286]:
minl = 1.33

results_minlift = list(apriori(transactions,min_support=mins,min_confidence=minc,min_lift=minl))

In [287]:
# As before, only the first rule of each record is printed.
for record in results_minlift:
    first_rule = record.ordered_statistics[0]
    print(list(first_rule.items_base), '->', list(first_rule.items_add))

['margarine', 'party snack foods'] -> ['biscuits']
['baking needs', 'bread and cake', 'frozen foods'] -> ['biscuits']
['baking needs', 'bread and cake', 'party snack foods'] -> ['biscuits']
['baking needs', 'fruit', 'frozen foods'] -> ['biscuits']
['bread and cake', 'frozen foods', 'margarine'] -> ['biscuits']
['bread and cake', 'party snack foods', 'frozen foods'] -> ['biscuits']
['fruit', 'biscuits', 'frozen foods'] -> ['bread and cake', 'vegetables']


In [288]:
print(f'There are {len(results_minlift)} rules with a minimum support of {mins},'
      f' a minimum confidence of {minc}'
      f' and a minimum lift of {minl}')

There are 7 rules with a minimum support of 0.2, a minimum confidence of 0.7 and a minimum lift of 1.33
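A short derivation helps explain why so few rules pass the lift threshold. Dividing the definition of lift by the definition of confidence gives

$$\text{lift}(R) = \frac{\text{conf}(R)}{\text{supp}(B)}$$

Since $\text{conf}(R)\le 1$, the lift of any rule is bounded by $1/\text{supp}(B)$. The rule `[] -> ['bread and cake']` in the first listing suggests that the support of *bread and cake* alone is at least $0.7$ (this assumes that *apyori* treats the empty antecedent as having support 1, so that the confidence of that rule equals the support of its consequent). Under that assumption,

$$\text{lift}(A \rightarrow \{\text{bread and cake}\}) \le \frac{1}{0.7} \approx 1.43$$

so a rule whose only consequent is *bread and cake* can exceed the threshold of $1.33$ only when its confidence is very close to 1. More generally, rules that point to very frequent items tend to have high confidence but a lift close to 1, which is why most of the rules found earlier are filtered out.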
