# Association Rules ~ Market Basket Analysis
<br>
<div>
<img src="images/rules.jpg" width="300" style="float:left;"/>
<img src="images/market.jpg" width="300"/>
</div>
<br>
<br>
<br>
<br>

1. Market basket analysis is the most popular application of association rules, so much so that the two names are often used interchangeably. The underlying question is: when a customer buys a given item, which other items are they likely to buy as well?
2. Market basket analysis can also be used to make sure that a customer shopping at a store does not forget to buy a related item on the same trip. When two items are associated in this way, the store or supermarket can shelve them next to each other so that customers remember to buy both.

## Outline:
- Introduction

## Why is Association Analysis (AA) important in Data Mining?

Association Analysis looks for relationships (links) between variables across the set of records in the data. These links are called associations.

<img src="images/ecommerce.jpg"/>

In the Industry 4.0 era, marketplace businesses are growing rapidly in Indonesia, whether customer-to-customer, business-to-customer, or business-to-business. This is a large industry with a strong influence on people's daily lives, both online and offline.

1. Recommendation models such as association rules (market basket analysis) can be used to increase sales, for example through cross-marketing (to sell more items), catalog design, and sales campaign analysis (for marketing).
2. Association rules can also be applied to web log analysis, DNA sequence analysis, and so on, because such data is sequential.
<br>
<br>
<br>

Image Source: 
- https://www.liputan6.com/tekno/read/2586238/pasar-online-indonesia-kian-tumbuh-ecommerce-berjaya
- https://ginbusiness.wordpress.com/2016/02/27/jenis-e-commerce-di-indonesia/

## Association Rules (AR) in one paragraph

AR tries to find all sets of ITEMS (ITEMSETS) whose SUPPORT is greater than a MINIMUM SUPPORT, then uses these frequent itemsets to generate RULES whose CONFIDENCE is greater than a MINIMUM CONFIDENCE. Each rule is then judged valuable (significant) based on its LIFT. The most popular application of AR is Market Basket Analysis (MBA).

## Items and Itemsets

- AR data comes as "transactions": a collection of itemsets, each of whose elements is an item
- Items: Bread, Milk, Coke, etc.
- Itemset: {Bread, Milk}
- Example transactions on a given day at a store:

<img src = "images/tabel.jpg" />

## Formally (Summary of AR Theory)

- An item is an element of the set of all items in the data, e.g. Milk, Bread, Eggs.
- An itemset is any subset that can be formed from the items, e.g. {Milk,Bread,Eggs} or {Milk,Eggs}.
- The frequency with which an item or itemset X occurs in the data is called its support: s(X) = (number of transactions containing X) / (total number of transactions).
- If the support exceeds a given threshold, the itemset is called a frequent itemset.
- A rule has the form X⇒Y, where X (the antecedent) and Y (the consequent) are itemsets. Example: {Milk,Diaper}⇒{Beer}
- The support of a rule is the number (or fraction) of transactions that contain both X and Y:
s(X⇒Y)=s(X∪Y)
- The confidence of a rule is the fraction of the transactions containing X that also contain Y: c(X⇒Y)=s(X∪Y)/s(X)
- In association rule mining, we look for rules whose support and confidence are both significant.
- The ratio of a rule's confidence to its expected (unconditional) confidence s(Y) is called "lift": lift(X⇒Y)=c(X⇒Y)/s(Y). 1) Lift < 1 is considered "negative" (less than expected). 2) Lift = 1 is neutral. 3) Lift > 1 is "positive" (more than expected).
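The definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used later in the notebook: the helper names `support`, `confidence` and `lift` and the five toy baskets are assumptions made for the example.

```python
# Toy transactions (an assumption for illustration), stored as sets so that
# an item repeated inside one basket is counted only once.
transactions = [
    {'Bread', 'Milk'},
    {'Beer', 'Bread', 'Diaper', 'Eggs', 'Milk'},
    {'Beer', 'Coke', 'Diaper', 'Milk'},
    {'Beer', 'Bread', 'Diaper', 'Milk'},
    {'Bread', 'Coke', 'Diaper', 'Milk'},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    """c(X => Y) = s(X u Y) / s(X)."""
    return support(set(X) | set(Y)) / support(X)

def lift(X, Y):
    """lift(X => Y) = c(X => Y) / s(Y)."""
    return confidence(X, Y) / support(Y)

X, Y = {'Milk', 'Diaper'}, {'Beer'}
print(round(support(X | Y), 3), round(confidence(X, Y), 3), round(lift(X, Y), 3))
```

For these five baskets the rule {Milk,Diaper}⇒{Beer} has support 0.6, confidence 0.75 and lift 1.25, so Beer appears somewhat more often than expected in baskets that already contain Milk and Diaper.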

## Example Rule:
Instant Noodles ==> Chili Sauce

Rules are used in marketing to make various decisions, for example:

1. Place the two items close together (so customers do not forget to buy both).
2. Place the two items far apart (so customers browse other items on the way).
3. Bundle the two items in a promotion (the promotion is more attractive because customers need both).
4. Bundle the two items with another item that sells poorly (cross-selling).
5. Raise the price of one item and lower the price of the other (a way of competing with "the store next door").
6. Do not advertise the two items together.
7. Offer a free sachet of sauce with every purchase of premium instant noodles.

## Rule, Support, Confidence, Lift by Example
<br>
<br>
<img src="images/support_confidence_lift.png"/>

Image Source: http://www.saedsayad.com/association_rules.htm

## Apriori Principle (Anti-monotone Property)
If an itemset is frequent, then all of its subsets must also be frequent. Equivalently (this is the contrapositive), if an itemset is infrequent, then all of its supersets must also be infrequent. Formally:

∀A,B:(A⊂B)=>s(A)≥s(B)

In other words, the support of an itemset never exceeds the support of any of its subsets. This property becomes important later for reducing the computational complexity of deriving rules from the data.
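The anti-monotone property can be checked empirically on a small example. The sketch below is only an illustration under assumed toy data: it enumerates every itemset of up to three items and looks for a pair A ⊂ B with s(A) < s(B), which the property says cannot exist.

```python
from itertools import combinations

# Toy transactions (an assumption for illustration).
transactions = [
    {'Bread', 'Milk'},
    {'Beer', 'Bread', 'Diaper', 'Eggs', 'Milk'},
    {'Beer', 'Coke', 'Diaper', 'Milk'},
    {'Beer', 'Bread', 'Diaper', 'Milk'},
    {'Bread', 'Coke', 'Diaper', 'Milk'},
]
items = sorted(set().union(*transactions))

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# All itemsets with 1, 2 or 3 items.
itemsets = [frozenset(c) for k in range(1, 4) for c in combinations(items, k)]

# Collect every proper-subset pair that would violate s(A) >= s(B).
violations = [(A, B) for A in itemsets for B in itemsets
              if A < B and support(A) < support(B)]
print(len(violations))

# Pruning consequence: once {'Eggs'} is infrequent, so is every superset of it.
eggs_supersets = [s for s in itemsets if 'Eggs' in s]
print(max(support(s) for s in eggs_supersets) <= support(frozenset({'Eggs'})))
```

No violations are found, and no itemset containing Eggs has higher support than Eggs alone, which is why an algorithm can discard all of those supersets without counting them.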

## Apriori Algorithm:

- Candidate itemsets are generated using only the large (i.e. frequent) itemsets of the previous pass without considering the transactions in the database.
- The large itemset of the previous pass is joined with itself to generate all itemsets whose size is higher by 1.
- Each generated itemset that has a subset which is not large is deleted. The remaining itemsets are the candidate ones.

<p><font face="Calibri"><img alt="" src="images/Apriori_Alg.png" /></font></p>

Image Source: http://www.saedsayad.com/association_rules.htm
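The join and prune steps listed above can be sketched as follows. This is a simplified illustration, not the `mlxtend` implementation used later in the notebook: the names `apriori_gen` and `apriori_sketch`, the absolute `min_count` threshold and the toy baskets are all assumptions made for the example.

```python
from itertools import combinations

def apriori_gen(prev_large, k):
    """Build candidate k-itemsets from the large (k-1)-itemsets only."""
    prev_large = set(prev_large)
    # Join step: unite pairs of large (k-1)-itemsets that form a k-itemset.
    joined = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
    # Prune step: drop candidates that have a (k-1)-subset which is not large.
    return {c for c in joined
            if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}

def apriori_sketch(transactions, min_count):
    """Return all itemsets contained in at least `min_count` transactions."""
    def large(candidates):
        # Only here is the transaction database scanned.
        return {c for c in candidates
                if sum(c <= t for t in transactions) >= min_count}

    level = large({frozenset([item]) for t in transactions for item in t})
    result, k = set(level), 2
    while level:
        level = large(apriori_gen(level, k))
        result |= level
        k += 1
    return result

# Toy transactions (an assumption for illustration).
transactions = [
    {'Bread', 'Milk'},
    {'Beer', 'Bread', 'Diaper', 'Eggs', 'Milk'},
    {'Beer', 'Coke', 'Diaper', 'Milk'},
    {'Beer', 'Bread', 'Diaper', 'Milk'},
    {'Bread', 'Coke', 'Diaper', 'Milk'},
]
large_itemsets = apriori_sketch(transactions, min_count=3)
for itemset in sorted(large_itemsets, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

With `min_count=3`, the candidate {Beer, Bread, Milk} is pruned without a database scan because its subset {Beer, Bread} occurs in only two baskets.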

In [29]:
import warnings; warnings.simplefilter('ignore')
import pandas as pd, matplotlib.pyplot as plt, seaborn as sns
from itertools import combinations
from collections import Counter
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
#from pycaret.arules import *

%matplotlib inline
plt.style.use('bmh'); sns.set()

In [2]:
# In Python
T = [
 ('Bread', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
 ('Beer', 'Coke', 'Diaper', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Milk'),
 ('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper'),
]
T

[('Bread', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
 ('Beer', 'Coke', 'Diaper', 'Milk'),
 ('Beer', 'Bread', 'Diaper', 'Milk'),
 ('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper')]

In [3]:
# Calculating item sets
# A bit of discrete mathematics nostalgia :)
def subsets(S, k):
  return [set(s) for s in combinations(S, k)]

subsets({1, 2, 3, 7, 8}, 3)

[{1, 2, 3},
 {1, 2, 7},
 {1, 2, 8},
 {1, 3, 7},
 {1, 3, 8},
 {1, 7, 8},
 {2, 3, 7},
 {2, 3, 8},
 {2, 7, 8},
 {3, 7, 8}]

In [4]:
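# Counting items within a single transaction (duplicates included);
# support, by contrast, counts the transactions that contain an item
Counter(T[1])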

Counter({'Beer': 1, 'Bread': 2, 'Diaper': 1, 'Eggs': 1, 'Milk': 3})
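The count above is per item within one basket, so `Milk` is counted three times. A support count needs each basket to contribute at most once per itemset. The sketch below is one possible way to get there with `Counter` and `combinations`; the variable names `item_counts`, `pair_counts` and `pair_support` are assumptions for this example.

```python
from collections import Counter
from itertools import combinations

# Same toy transactions as above, including the repeated items.
T = [
    ('Bread', 'Milk'),
    ('Beer', 'Bread', 'Diaper', 'Eggs', 'Milk', 'Bread', 'Milk', 'Milk'),
    ('Beer', 'Coke', 'Diaper', 'Milk'),
    ('Beer', 'Bread', 'Diaper', 'Milk'),
    ('Bread', 'Coke', 'Diaper', 'Milk', 'Diaper'),
]

item_counts = Counter()
pair_counts = Counter()
for t in T:
    unique_items = sorted(set(t))          # deduplicate, fix an item order
    item_counts.update(unique_items)       # 1-itemsets
    pair_counts.update(combinations(unique_items, 2))  # 2-itemsets

# Relative support: share of transactions containing the pair.
pair_support = {pair: n / len(T) for pair, n in pair_counts.items()}

print(item_counts['Milk'])
print(pair_counts.most_common(3))
```

Sorting the deduplicated items keeps every pair in a single canonical order, so `('Bread', 'Milk')` and `('Milk', 'Bread')` are never counted separately.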

In [5]:
# Using a module
# Taken from https://pbpython.com/market-basket-analysis.html
# First, load the data: use the local CSV if available, otherwise download the Excel file
try:
    # skip malformed rows (on_bad_lines requires pandas >= 1.3)
    df = pd.read_csv('data/Online_Retail.csv', on_bad_lines='skip', low_memory=False)
except Exception:
    df = pd.read_excel('http://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx')
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [6]:
# Preprocessing
df['Description'] = df['Description'].str.strip() # remove unnecessary spaces
df['Description'] = df['Description'].str.lower() # lower case normalization
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True) # delete rows with no invoice no
df['InvoiceNo'] = df['InvoiceNo'].astype('str') # Change data type
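# Meant to drop credit invoices, whose numbers start with an uppercase 'C'.
# str.contains is case-sensitive, so the lowercase 'c' here matches nothing and the
# 'C...' invoices are still present in the outputs shown later; use 'C' to drop them.
df = df[~df['InvoiceNo'].str.contains('c')]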
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,white metal lantern,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [7]:
df.to_csv("data/Online_Retail.csv", encoding='utf8', index=False)
'Done'

'Done'

In [8]:
filter_ = {'pls', 'plas'}
for f in filter_:
    df = df[~df['InvoiceNo'].str.contains(f)] # filtering invoice

In [9]:
print(set(df['Country']))

{'Bahrain', 'Iceland', 'United Arab Emirates', 'Cyprus', 'EIRE', 'Sweden', 'European Community', 'Spain', 'Israel', 'Germany', 'Greece', 'Unspecified', 'Australia', 'Singapore', 'Poland', 'Hong Kong', 'Switzerland', 'United Kingdom', 'Portugal', 'Finland', 'Czech Republic', 'Norway', 'Japan', 'Saudi Arabia', 'Belgium', 'Denmark', 'Brazil', 'Netherlands', 'Canada', 'USA', 'Malta', 'Lebanon', 'RSA', 'France', 'Austria', 'Italy', 'Channel Islands', 'Lithuania'}


In [10]:
df_A = df[df['Country'] =="Australia"]
df_A.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
197,536389,22941,christmas lights 10 reindeer,6,2010-12-01 10:03:00,8.5,12431.0,Australia
198,536389,21622,vintage union jack cushion cover,8,2010-12-01 10:03:00,4.95,12431.0,Australia
199,536389,21791,vintage heads and tails card game,12,2010-12-01 10:03:00,1.25,12431.0,Australia
200,536389,35004C,set of 3 coloured flying ducks,6,2010-12-01 10:03:00,5.45,12431.0,Australia
201,536389,35004G,set of 3 gold flying ducks,4,2010-12-01 10:03:00,6.35,12431.0,Australia


In [11]:
# Let's sample the data
basket = df[df['Country'] =="Australia"]
basket.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
197,536389,22941,christmas lights 10 reindeer,6,2010-12-01 10:03:00,8.5,12431.0,Australia
198,536389,21622,vintage union jack cushion cover,8,2010-12-01 10:03:00,4.95,12431.0,Australia
199,536389,21791,vintage heads and tails card game,12,2010-12-01 10:03:00,1.25,12431.0,Australia
200,536389,35004C,set of 3 coloured flying ducks,6,2010-12-01 10:03:00,5.45,12431.0,Australia
201,536389,35004G,set of 3 gold flying ducks,4,2010-12-01 10:03:00,6.35,12431.0,Australia


In [12]:
# Group the transaction
basket = basket.groupby(['InvoiceNo', 'Description'])['Quantity']
basket.head()

197         6
198         8
199        12
200         6
201         4
202         6
203         3
204         2
205         4
206         4
207         2
208         2
209        24
210        24
17067      24
17068     120
17069      12
17070      24
17071       4
17072       6
17073      12
17074      12
29278      -7
29279      -5
29280      -1
34673      10
34674      50
34675      10
34676      10
34677       4
         ... 
468133    100
468134    100
468135    100
468136     48
468137     96
468138     36
468139    960
468140     96
468141    144
468142     48
468143     36
468144     50
468145     50
468146     48
468147     48
468148     72
468149     96
468150    160
468151     80
468152     10
468153    120
469139    240
497678     96
497679     20
497680     20
497681     20
497682     24
497683     20
497684     12
497685     12
Name: Quantity, Length: 1259, dtype: int64

In [13]:
# Sum the quantities, unstack, fill nulls with 0, and use the invoice number as the row index
basket = basket.sum().unstack().reset_index().fillna(0).set_index('InvoiceNo')
basket.head()

Description,10 colour spaceboy pen,12 pencil small tube woodland,12 pencils tall tube posy,12 pencils tall tube red retrospot,16 piece cutlery set pantry design,20 dolly pegs retrospot,3 hook hanger magic garden,3 stripey mice feltcraft,3 tier cake tin green and cream,3 tier cake tin red and cream,...,wrap doiley design,wrap dolly girl,wrap english rose,wrap i love london,wrap poppies design,wrap red apples,wrap red vintage doily,wrap vintage leaf design,wrap wedding day,yellow giant garden thermometer
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536389,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
537676,0.0,0.0,0.0,0.0,0.0,24.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
539419,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
540267,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
540280,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
def encode_units(x):
    # 1 if the item was bought in this invoice (quantity of at least 1), otherwise 0
    return 1 if x >= 1 else 0

basket_sets = basket.applymap(encode_units) # binary (0/1) encoding of item presence
basket_sets.head()

Description,10 colour spaceboy pen,12 pencil small tube woodland,12 pencils tall tube posy,12 pencils tall tube red retrospot,16 piece cutlery set pantry design,20 dolly pegs retrospot,3 hook hanger magic garden,3 stripey mice feltcraft,3 tier cake tin green and cream,3 tier cake tin red and cream,...,wrap doiley design,wrap dolly girl,wrap english rose,wrap i love london,wrap poppies design,wrap red apples,wrap red vintage doily,wrap vintage leaf design,wrap wedding day,yellow giant garden thermometer
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536389,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537676,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
539419,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540267,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
540280,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
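To see what the grouping, unstacking and encoding steps do, it may help to run them on a tiny table. The sketch below uses a hypothetical six-row DataFrame (`toy`, with invented invoice numbers and items), and it swaps the `applymap(encode_units)` step for a vectorized comparison, which should give the same 0/1 matrix for integer quantities.

```python
import pandas as pd

# Hypothetical line items: one row per (invoice, product) entry.
toy = pd.DataFrame({
    'InvoiceNo':   ['A1', 'A1', 'A1', 'A2', 'A2', 'A3'],
    'Description': ['pen', 'pen', 'ruler', 'pen', 'eraser', 'ruler'],
    'Quantity':    [2, 1, 5, 4, 1, -3],   # the negative row mimics a return
})

# Total quantity per invoice and product, products spread into columns.
toy_basket = (toy.groupby(['InvoiceNo', 'Description'])['Quantity']
                 .sum()
                 .unstack()
                 .fillna(0))

# Presence/absence matrix: 1 where the summed quantity is positive.
toy_sets = (toy_basket > 0).astype(int)
print(toy_sets)
```

Each row of `toy_sets` is one transaction and each column one item, which is the layout the frequent-itemset step expects. Invoice `A3` ends up as an all-zero row because its only line has a negative quantity.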


## Understanding the Data Structure

In [15]:
basket_sets.columns

Index(['10 colour spaceboy pen', '12 pencil small tube woodland',
       '12 pencils tall tube posy', '12 pencils tall tube red retrospot',
       '16 piece cutlery set pantry design', '20 dolly pegs retrospot',
       '3 hook hanger magic garden', '3 stripey mice feltcraft',
       '3 tier cake tin green and cream', '3 tier cake tin red and cream',
       ...
       'wrap doiley design', 'wrap dolly girl', 'wrap english rose',
       'wrap i love london', 'wrap poppies  design', 'wrap red apples',
       'wrap red vintage doily', 'wrap vintage leaf design',
       'wrap wedding day', 'yellow giant garden thermometer'],
      dtype='object', name='Description', length=609)

In [16]:
basket_sets.index

Index(['536389', '537676', '539419', '540267', '540280', '540557', '540700',
       '541149', '541271', '541520', '541657', '542542', '543357', '543372',
       '543376', '543989', '545065', '545475', '546135', '547659', '548661',
       '549313', '552956', '553546', '554037', '554126', '556917', '556918',
       '558536', '558537', '559919', '559920', '560033', '560473', '560491',
       '561040', '561228', '563179', '563614', '565145', '565146', '565466',
       '567085', '568145', '568687', '568695', '568708', '569647', '569650',
       '569722', '569723', '574014', '574138', '574469', '576394', '576586',
       '578459', 'C538723', 'C543375', 'C545525', 'C548729', 'C551348',
       'C555046', 'C555288', 'C560540', 'C561227', 'C568694', 'C574019',
       'C574344'],
      dtype='object', name='InvoiceNo')

In [17]:
basket_sets.iloc[0]

Description
10 colour spaceboy pen                0
12 pencil small tube woodland         0
12 pencils tall tube posy             0
12 pencils tall tube red retrospot    0
16 piece cutlery set pantry design    0
20 dolly pegs retrospot               0
3 hook hanger magic garden            0
3 stripey mice feltcraft              0
3 tier cake tin green and cream       0
3 tier cake tin red and cream         0
36 doilies vintage christmas          0
36 pencils tube red retrospot         0
36 pencils tube skulls                0
4 traditional spinning tops           0
6 gift tags vintage christmas         0
6 ribbons rustic charm                0
60 cake cases vintage christmas       0
70's alphabet wall art                0
72 sweetheart fairy cake cases        0
abc treasure book box                 0
advent calendar gingham sack          0
alarm clock bakelike chocolate        0
alarm clock bakelike green            1
alarm clock bakelike ivory            0
alarm clock bakelike orange 

In [18]:
basket_sets.iloc[0].sum()

14

In [19]:
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
frequent_itemsets.sort_values(by='support', ascending=False, na_position='last', inplace = True)
frequent_itemsets

Unnamed: 0,support,itemsets
33,0.130435,(set of 3 cake tins pantry design)
28,0.130435,(red toadstool led night light)
31,0.115942,(roses regency teacup and saucer)
15,0.115942,(lunch bag red retrospot)
4,0.115942,(baking set spaceboy design)
21,0.115942,(party bunting)
16,0.115942,(lunch bag spaceboy design)
38,0.101449,(spotty bunting)
35,0.101449,(set of 6 soldier skittles)
26,0.101449,(red harmonica in box)


In [20]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules.sort_values(by='lift', ascending=False, na_position='last', inplace = True)
rules.head(5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
68,"(spaceboy lunch box, regency cakestand 3 tier)","(dolly girl lunch box, roses regency teacup an...",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
73,"(dolly girl lunch box, roses regency teacup an...","(spaceboy lunch box, regency cakestand 3 tier)",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
72,"(spaceboy lunch box, roses regency teacup and ...","(regency cakestand 3 tier, dolly girl lunch box)",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
69,"(regency cakestand 3 tier, dolly girl lunch box)","(spaceboy lunch box, roses regency teacup and ...",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
0,(spaceboy lunch box),(dolly girl lunch box),0.086957,0.086957,0.086957,1.0,11.5,0.079395,inf


In [21]:
# Filtering
rules[ (rules['lift'] >= 6) & (rules['confidence'] >= 0.8) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
68,"(spaceboy lunch box, regency cakestand 3 tier)","(dolly girl lunch box, roses regency teacup an...",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
73,"(dolly girl lunch box, roses regency teacup an...","(spaceboy lunch box, regency cakestand 3 tier)",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
72,"(spaceboy lunch box, roses regency teacup and ...","(regency cakestand 3 tier, dolly girl lunch box)",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
69,"(regency cakestand 3 tier, dolly girl lunch box)","(spaceboy lunch box, roses regency teacup and ...",0.072464,0.072464,0.072464,1.0,13.8,0.067213,inf
0,(spaceboy lunch box),(dolly girl lunch box),0.086957,0.086957,0.086957,1.0,11.5,0.079395,inf
47,(spaceboy lunch box),"(dolly girl lunch box, roses regency teacup an...",0.086957,0.072464,0.072464,0.833333,11.5,0.066163,5.565217
46,"(dolly girl lunch box, roses regency teacup an...",(spaceboy lunch box),0.072464,0.086957,0.072464,1.0,11.5,0.066163,inf
45,"(spaceboy lunch box, roses regency teacup and ...",(dolly girl lunch box),0.072464,0.086957,0.072464,1.0,11.5,0.066163,inf
43,(circus parade lunch box),"(spaceboy lunch box, dolly girl lunch box)",0.072464,0.086957,0.072464,1.0,11.5,0.066163,inf
42,(dolly girl lunch box),"(spaceboy lunch box, circus parade lunch box)",0.086957,0.072464,0.072464,0.833333,11.5,0.066163,5.565217


In [22]:
basket['postage'].sum()

0.0