# Market Basket Analysis using association rules

**`In this notebook we will learn`**

1. What is Association Rule Learning?

2. How does Apriori work?

3. Implementing Apriori With Python - in 2 ways



**Problem Statement :**

When we go grocery shopping, we often have a standard list of things to buy. Each shopper has a distinctive list, depending on one’s needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a bachelor might buy beer and chips. Understanding these buying patterns can help to increase sales in several ways. If there is a pair of items, X and Y, that are frequently bought together:

> Both X and Y can be placed on the same shelf, so that buyers of one item would be prompted to buy the other.

> Promotional discounts could be applied to just one out of the two items.

> Advertisements on X could be targeted at buyers who purchase Y.

> X and Y could be combined into a new product, such as having Y in flavors of X.

> **While we may know that certain items are frequently bought together, the question is, how do we uncover these associations?**

*Besides increasing sales profits, association rules can also be used in other fields. In medical diagnosis, for instance, understanding which symptoms tend to be comorbid can help to improve patient care and medicine prescription.*


What is Association Rule Learning?
--

Association Rule Learning is rule-based learning for identifying associations between different variables in a database. **`One of the best and most popular examples of Association Rule Learning is the Market Basket Analysis`**. The problem analyses the associations between items that have the highest probability of being bought together by a customer.

For example, the association rule {onions, chicken masala} => {chicken} says that a person who has both onions and chicken masala in his or her basket has a high probability of buying chicken as well.


Apriori Algorithm
--

The algorithm was first proposed in 1994 by Rakesh Agrawal and Ramakrishnan Srikant. The Apriori algorithm finds the frequent itemsets in a transaction database and identifies association rules between the items, just like the example mentioned above.


How Apriori works
--

To construct association rules between elements or items, the algorithm considers 3 important factors: support, confidence and lift. Each of these factors is explained as follows:

**Support:**

The support of item I is defined as the ratio of the number of transactions containing item I to the total number of transactions, expressed as:

<img src="https://miro.medium.com/max/403/0*pyOADkeaWyrVP2ft.png" />

<font color='darkgreen'><b>Support</b></font> indicates how popular an itemset is, as measured by the proportion of transactions in which an itemset appears. In Table 1 below, the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%.

<img src='https://annalyzin.files.wordpress.com/2016/04/association-rule-support-table.png?w=503&h=447' />

<br />

<font color='red'><em>If you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits, you might consider using that proportion as your support threshold. You may then identify itemsets with support values above this threshold as significant itemsets.</em></font>
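The support calculation can be sketched in a few lines of plain Python. The eight transactions below are hypothetical: they are not taken from the table image, only chosen so that the results agree with the figures quoted above (50% for {apple}, 25% for {apple, beer, rice}). The helper name `support` is also just an illustration.

```python
# Hypothetical transactions, chosen so the results match the figures quoted above.
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


support_apple = support({"apple"}, transactions)                 # 4 of 8 transactions
support_abr = support({"apple", "beer", "rice"}, transactions)   # 2 of 8 transactions
print(support_apple, support_abr)
```

Representing each transaction as a set makes the "contains every item" check a simple subset test.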


**Confidence:**

Confidence is measured by the proportion of transactions containing item I1 in which item I2 also appears. The confidence of the rule I1 -> I2 is defined as the number of transactions containing both I1 and I2 divided by the number of transactions containing I1. (Below, I1 is written as X and I2 as Y.)

<img src='https://miro.medium.com/max/576/1*50GI4dR58MnhwBP9dw6nFQ.png' />

<font color='darkgreen'><b>Confidence</b></font> says how likely item Y is to be purchased when item X is purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with item X in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.

<img src='https://annalyzin.files.wordpress.com/2016/03/association-rule-confidence-eqn.png?w=527&h=77' />

<font color='red'><em>One drawback of the confidence measure is that it might misrepresent the importance of an association. This is because it only accounts for how popular apples are, but not beers. If beers are also very popular in general, there will be a higher chance that a transaction containing apples will also contain beers, thus inflating the confidence measure. To account for the base popularity of both constituent items, we use a third measure called lift. </em></font>


**Lift:**

Lift is the ratio of the confidence of a rule to the support of its consequent: Lift(X -> Y) = Confidence(X -> Y) / Support(Y).

<font color='darkgreen'><b>Lift</b></font> says how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought. ( *here X represents apple and Y represents beer* )

<img src='https://annalyzin.files.wordpress.com/2016/03/association-rule-lift-eqn.png?w=566&h=80' />
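Confidence and lift follow directly from support, which a short sketch can make concrete. The transactions are again hypothetical, picked only so that the numbers agree with the values quoted for {apple -> beer} (confidence 75%, lift 1); the function names are illustrative, not part of any library.

```python
# Hypothetical transactions, chosen so the results match the figures quoted above.
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)


def confidence(x, y):
    """Confidence(X -> Y) = Support(X and Y together) / Support(X)."""
    return support(set(x) | set(y)) / support(x)


def lift(x, y):
    """Lift(X -> Y) = Confidence(X -> Y) / Support(Y)."""
    return confidence(x, y) / support(y)


conf_apple_beer = confidence({"apple"}, {"beer"})   # 3 of the 4 apple baskets contain beer
lift_apple_beer = lift({"apple"}, {"beer"})         # beer is in 6 of 8 baskets overall
print(conf_apple_beer, lift_apple_beer)
```

Here the confidence looks high only because beer is popular in general; dividing by the support of beer brings the lift back to 1.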

<hr>

For **extra reading**, refer to <a href='https://towardsdatascience.com/association-rules-2-aa9a77241654'> this </a>


Implementing Apriori With Python - in 2 ways
--

<h3>In the <b><u>first way</u></b>, we use the apyori package.</h3>


In [27]:
# apyori is an external package and needs to be installed first
!pip install apyori




In [28]:
#import all required packages..
import pandas as pd
import numpy as np
from apyori import apriori



In [29]:
#loading market basket dataset..

df = pd.read_csv('/content/Market_Basket_Optimisation.csv',header=None)



In [30]:
df.head()



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


In [31]:
## Data Cleaning step

# replacing empty value with 0.
df.fillna(0,inplace=True)



In [32]:
df.head()



Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,chutney,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,turkey,avocado,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,mineral water,milk,energy bar,whole wheat rice,green tea,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [33]:
# Data pre-processing step

# apyori expects the data as a list of transactions, each of which is a list of items, e.g.
# transactions = [['apple', 'almonds'], ['apple'], ['banana', 'apple']]

transactions = []

# keep every non-empty cell of each row; the '0' placeholders written by fillna are skipped.
for i in range(len(df)):
    transactions.append([str(df.values[i, j]) for j in range(df.shape[1]) if str(df.values[i, j]) != '0'])


In [34]:
## verifying - by printing the 0th transaction
transactions[0]



['shrimp',
 'almonds',
 'avocado',
 'vegetables mix',
 'green grapes',
 'whole weat flour',
 'yams',
 'cottage cheese',
 'energy drink',
 'tomato juice',
 'low fat yogurt',
 'green tea',
 'honey',
 'salad',
 'mineral water',
 'salmon',
 'antioxydant juice',
 'frozen smoothie',
 'spinach',
 'olive oil']

In [35]:
## verifying - by printing the 1st transaction
transactions[1]



['burgers', 'meatballs', 'eggs']
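The same list-of-lists structure can also be built without first filling the empty cells with 0. The following is a minimal sketch on a small hypothetical frame (`toy` and `toy_transactions` are illustrative names, and the rows are copied from the sample output above); it drops the missing cells of each row directly.

```python
import pandas as pd

# Small hypothetical frame with the same shape of data: one row per basket,
# missing cells where a basket has fewer items.
toy = pd.DataFrame(
    [
        ["burgers", "meatballs", "eggs"],
        ["chutney", None, None],
        ["turkey", "avocado", None],
    ]
)

# For each row, drop the missing cells and keep the remaining items as strings.
toy_transactions = [row.dropna().astype(str).tolist() for _, row in toy.iterrows()]
print(toy_transactions)
```

This avoids relying on the string '0' as a sentinel, which would silently drop a product that happened to be named "0".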

In [36]:
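# Call apriori with minimum support, confidence and lift thresholds.
# Caution: apyori reads its options from keyword arguments and silently ignores names it
# does not recognise. The keyword must be spelled min_confidence; the output recorded
# below was produced with the misspelling min_confidance, so the 20% confidence filter
# was not applied in that run (rules with confidence below 0.2 appear in it).
rules = apriori(transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)

## min_support = 0.003 -> keep itemsets that appear in at least 0.3% of transactions
## min_confidence = 0.2 -> keep rules with a confidence of at least 20%
## min_lift = 3 -> keep rules with a lift of at least 3
## min_length = 2 -> not an option apyori recognises (it only offers max_length), so it has no effect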


In [37]:
# apriori returns the rules as a generator object
rules


<generator object apriori at 0x7a4a81f9dfe0>

In [38]:
# the generator needs to be converted into a list to inspect the rules
Results = list(rules)
Results


[RelationRecord(items=frozenset({'brownies', 'cottage cheese'}), support=0.0034662045060658577, ordered_statistics=[OrderedStatistic(items_base=frozenset({'brownies'}), items_add=frozenset({'cottage cheese'}), confidence=0.10276679841897232, lift=3.225329518580382), OrderedStatistic(items_base=frozenset({'cottage cheese'}), items_add=frozenset({'brownies'}), confidence=0.10878661087866107, lift=3.2253295185803816)]),
 RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'chicken'}), items_add=frozenset({'light cream'}), confidence=0.07555555555555556, lift=4.843950617283951), OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'escalope', 'mushroom cream sauce'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'escalope'}),

In [39]:
# convert the results into a dataframe for further processing
df_results = pd.DataFrame(Results)


In [40]:
# as we can see, "ordered_statistics" is itself a list, so it needs to be converted into a proper format
df_results.head()


Unnamed: 0,items,support,ordered_statistics
0,"(brownies, cottage cheese)",0.003466,"[((brownies), (cottage cheese), 0.102766798418..."
1,"(light cream, chicken)",0.004533,"[((chicken), (light cream), 0.0755555555555555..."
2,"(escalope, mushroom cream sauce)",0.005733,"[((escalope), (mushroom cream sauce), 0.072268..."
3,"(escalope, pasta)",0.005866,"[((escalope), (pasta), 0.07394957983193277, 4...."
4,"(fresh bread, tomato juice)",0.004266,"[((fresh bread), (tomato juice), 0.09907120743..."


In [41]:
# keep support in a separate Series so we can use it later
support = df_results.support


In [42]:
'''
Convert ordered_statistics into a proper format.
Each entry can hold the rule in both directions (lhs => rhs as well as rhs => lhs);
for convenience we keep only the first one, which is 'df_results['ordered_statistics'][i][0]'.
'''

# four empty lists which will hold lhs, rhs, confidence and lift respectively.
first_values = []
second_values = []
third_values = []
fourth_value = []

# loop over the rows and append the values one by one to the separate lists.
# the first and second elements are frozensets, which need to be converted to lists.
for i in range(df_results.shape[0]):
    single_list = df_results['ordered_statistics'][i][0]
    first_values.append(list(single_list[0]))
    second_values.append(list(single_list[1]))
    third_values.append(single_list[2])
    fourth_value.append(single_list[3])


In [43]:
# convert all four lists into dataframes for further processing
lhs = pd.DataFrame(first_values)
rhs = pd.DataFrame(second_values)

confidance=pd.DataFrame(third_values,columns=['Confidance'])

lift=pd.DataFrame(fourth_value,columns=['lift'])


In [44]:
# concatenate all the pieces side by side into a single dataframe
df_final = pd.concat([lhs,rhs,support,confidance,lift], axis=1)
df_final


Unnamed: 0,0,1,0.1,1.1,2,support,Confidance,lift
0,brownies,,cottage cheese,,,0.003466,0.102767,3.225330
1,chicken,,light cream,,,0.004533,0.075556,4.843951
2,escalope,,mushroom cream sauce,,,0.005733,0.072269,3.790833
3,escalope,,pasta,,,0.005866,0.073950,4.700812
4,fresh bread,,tomato juice,,,0.004266,0.099071,3.259356
...,...,...,...,...,...,...,...,...
89,pancakes,ground beef,mineral water,spaghetti,,0.003066,0.211009,3.532991
90,ground beef,,mineral water,spaghetti,tomatoes,0.003066,0.031208,3.344117
91,olive oil,,milk,mineral water,spaghetti,0.003333,0.050607,3.216994
92,milk,mineral water,spaghetti,shrimp,,0.003066,0.063889,3.014029


In [45]:
'''
 Some rules have a single item on a side and others have two or three, so the unused item columns hold None.
 To give the user a readable representation, replace None with ' ' before combining the item columns into one.
 example : coffee, None, None is converted to coffee, ' ', ' '
'''
df_final.fillna(value=' ', inplace=True)
df_final.head()


Unnamed: 0,0,1,0.1,1.1,2,support,Confidance,lift
0,brownies,,cottage cheese,,,0.003466,0.102767,3.22533
1,chicken,,light cream,,,0.004533,0.075556,4.843951
2,escalope,,mushroom cream sauce,,,0.005733,0.072269,3.790833
3,escalope,,pasta,,,0.005866,0.07395,4.700812
4,fresh bread,,tomato juice,,,0.004266,0.099071,3.259356


In [46]:
#set column name
df_final.columns = ['lhs',1,'rhs',2,3,'support','confidance','lift']
df_final.head()



Unnamed: 0,lhs,1,rhs,2,3,support,confidance,lift
0,brownies,,cottage cheese,,,0.003466,0.102767,3.22533
1,chicken,,light cream,,,0.004533,0.075556,4.843951
2,escalope,,mushroom cream sauce,,,0.005733,0.072269,3.790833
3,escalope,,pasta,,,0.005866,0.07395,4.700812
4,fresh bread,,tomato juice,,,0.004266,0.099071,3.259356


In [47]:
# merge the extra item columns into the itemsets: column 1 into lhs, columns 2 and 3 into rhs
df_final['lhs'] = df_final['lhs'] + ", " + df_final[1]

df_final['rhs'] = df_final['rhs'] + ", " + df_final[2] + ", " + df_final[3]


In [48]:
df_final.head()



Unnamed: 0,lhs,1,rhs,2,3,support,confidance,lift
0,"brownies,",,"cottage cheese, ,",,,0.003466,0.102767,3.22533
1,"chicken,",,"light cream, ,",,,0.004533,0.075556,4.843951
2,"escalope,",,"mushroom cream sauce, ,",,,0.005733,0.072269,3.790833
3,"escalope,",,"pasta, ,",,,0.005866,0.07395,4.700812
4,"fresh bread,",,"tomato juice, ,",,,0.004266,0.099071,3.259356


In [49]:
# drop columns 1, 2 and 3 because their values have already been appended to the lhs and rhs columns.

df_final.drop(columns=[1,2,3],inplace=True)


In [50]:
# this is the final output. You can sort it by support, lift or confidence.
df_final.head()


Unnamed: 0,lhs,rhs,support,confidance,lift
0,"brownies,","cottage cheese, ,",0.003466,0.102767,3.22533
1,"chicken,","light cream, ,",0.004533,0.075556,4.843951
2,"escalope,","mushroom cream sauce, ,",0.005733,0.072269,3.790833
3,"escalope,","pasta, ,",0.005866,0.07395,4.700812
4,"fresh bread,","tomato juice, ,",0.004266,0.099071,3.259356


In [51]:
## Showing the top 10 rules by lift, sorted in descending order
df_final.sort_values('lift', ascending=False).head(10)


Unnamed: 0,lhs,rhs,support,confidance,lift
58,"olive oil,","mineral water, whole wheat pasta,",0.003866,0.058704,6.115863
6,"fromage blanc,","honey, ,",0.003333,0.245098,5.164271
49,"ground beef,","tomato sauce, spaghetti,",0.003066,0.031208,4.9806
1,"chicken,","light cream, ,",0.004533,0.075556,4.843951
3,"escalope,","pasta, ,",0.005866,0.07395,4.700812
28,"ground beef,","herb & pepper, french fries,",0.0032,0.032564,4.697422
11,"pasta,","shrimp, ,",0.005066,0.322034,4.506672
23,"ground beef,","chocolate, herb & pepper,",0.003999,0.040706,4.490183
69,"frozen vegetables,","chocolate, mineral water, shrimp",0.0032,0.033566,4.417225
10,"olive oil,","whole wheat pasta, ,",0.007999,0.121457,4.12241
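The column-by-column reshaping above can be condensed, and the trailing ", ," artefacts avoided, by building one row per rule directly. The sketch below does not call apyori: it uses stand-in namedtuples that mirror the field names visible in the printed output (`items`, `support`, `ordered_statistics`, `items_base`, `items_add`, `confidence`, `lift`), with values rounded from one record shown earlier. The names `records` and `rules_table` are illustrative.

```python
from collections import namedtuple

import pandas as pd

# Stand-ins mirroring the field names printed in the apyori output above.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

# One record, with values rounded from the output shown earlier.
records = [
    RelationRecord(
        items=frozenset({"light cream", "chicken"}),
        support=0.004533,
        ordered_statistics=[
            OrderedStatistic(frozenset({"chicken"}), frozenset({"light cream"}), 0.075556, 4.843951),
            OrderedStatistic(frozenset({"light cream"}), frozenset({"chicken"}), 0.290598, 4.843951),
        ],
    )
]

# One row per rule direction; the items of each side are joined into a single string.
rows = [
    {
        "lhs": ", ".join(sorted(stat.items_base)),
        "rhs": ", ".join(sorted(stat.items_add)),
        "support": rec.support,
        "confidence": stat.confidence,
        "lift": stat.lift,
    }
    for rec in records
    for stat in rec.ordered_statistics
]
rules_table = pd.DataFrame(rows)
print(rules_table)
```

Unlike the approach above, this keeps both directions of every rule, so the direction with the higher confidence is not lost.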


Another way of doing Apriori in Python
--

<h3>In the <u>second way</u>, we use the <b>mlxtend.frequent_patterns</b> package.</h3>

<font color='red'><b>Why are we doing it this way?</b></font>

1.  **A limitation of the first approach** was the need to convert the data into a list format. In real life a store has many thousands of transactions, so this is computationally expensive.

2. `The apyori package is outdated`.

3. The results come back in a less friendly format, so there is a need for post-processing.

4. Instead, let's use the **`mlxtend`** package. It generates both frequent itemsets and association rules.

In [52]:
'''
Load the apriori and association_rules functions from mlxtend.frequent_patterns.
A different dataset is used here because mlxtend needs the data in the format below.

    transaction_name    apple banana grapes
transaction  1            0     1      1
             2            1     0      1
             3            1     0      0
             4            0     1      0

 The earlier data could have been used as well, but it would first have to be converted into this format, so a separate dataset is used instead.
'''

import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

df1 = pd.read_csv('/content/data-2.csv', encoding="ISO-8859-1")
df1.head()


Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
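For reference, a list of transactions like the one used in the first approach can be brought into the one-column-per-item format with a few lines of pandas. This is a minimal sketch on the three example transactions from the earlier comment; `onehot` is an illustrative name. mlxtend also provides a `TransactionEncoder` class in `mlxtend.preprocessing` for the same purpose.

```python
import pandas as pd

# Example transactions in list format, as used by the first approach.
transactions = [["apple", "almonds"], ["apple"], ["banana", "apple"]]

# One column per distinct item, one row per transaction, True where the item is present.
items = sorted({item for t in transactions for item in t})
onehot = pd.DataFrame(
    [[item in t for item in items] for t in transactions],
    columns=items,
)
print(onehot)
```

Each lookup `item in t` scans a list, which is fine for a small illustration; converting every transaction to a set first would scale better.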


In [53]:
# the data covers many countries; we will pick one of them to work with.
df1.Country.value_counts().head(5)


Unnamed: 0_level_0,count
Country,Unnamed: 1_level_1
United Kingdom,495478
Germany,9495
France,8557
EIRE,8196
Spain,2533


In [54]:
# using only the France data for now; the same steps can be repeated for other countries.
# .copy() gives an independent DataFrame, so the column assignment in the next cell
# does not trigger pandas' SettingWithCopyWarning.
df1 = df1[df1.Country == 'France'].copy()


In [55]:
# strip leading and trailing spaces from the descriptions; otherwise they cause problems in later operations.
df1['Description'] = df1['Description'].str.strip()


In [56]:
# some transactions have a negative quantity (typically returns or cancellations); keep only rows with a positive quantity.
df1 = df1[df1.Quantity >0]


In [57]:
#df1[df1.Country == 'France'].head(10)
df1.head(10)



Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
26,536370,22728,ALARM CLOCK BAKELIKE PINK,24,12/1/2010 8:45,3.75,12583.0,France
27,536370,22727,ALARM CLOCK BAKELIKE RED,24,12/1/2010 8:45,3.75,12583.0,France
28,536370,22726,ALARM CLOCK BAKELIKE GREEN,12,12/1/2010 8:45,3.75,12583.0,France
29,536370,21724,PANDA AND BUNNIES STICKER SHEET,12,12/1/2010 8:45,0.85,12583.0,France
30,536370,21883,STARS GIFT TAPE,24,12/1/2010 8:45,0.65,12583.0,France
31,536370,10002,INFLATABLE POLITICAL GLOBE,48,12/1/2010 8:45,0.85,12583.0,France
32,536370,21791,VINTAGE HEADS AND TAILS CARD GAME,24,12/1/2010 8:45,1.25,12583.0,France
33,536370,21035,SET/2 RED RETROSPOT TEA TOWELS,18,12/1/2010 8:45,2.95,12583.0,France
34,536370,22326,ROUND SNACK BOXES SET OF4 WOODLAND,24,12/1/2010 8:45,2.95,12583.0,France
35,536370,22629,SPACEBOY LUNCH BOX,24,12/1/2010 8:45,1.95,12583.0,France


In [58]:
# convert the data into the required format:
# a pivot table with one row per invoice, one column per product and the summed Quantity as values; missing combinations are filled with 0

basket = pd.pivot_table(data=df1,index='InvoiceNo',columns='Description',values='Quantity', aggfunc='sum',fill_value=0)


In [59]:
basket.head()



Description,10 COLOUR SPACEBOY PEN,12 COLOURED PARTY BALLOONS,12 EGG HOUSE PAINTED WOOD,12 MESSAGE CARDS WITH ENVELOPES,12 PENCIL SMALL TUBE WOODLAND,12 PENCILS SMALL TUBE RED RETROSPOT,12 PENCILS SMALL TUBE SKULL,12 PENCILS TALL TUBE POSY,12 PENCILS TALL TUBE RED RETROSPOT,12 PENCILS TALL TUBE WOODLAND,...,WRAP VINTAGE PETALS DESIGN,YELLOW COAT RACK PARIS FASHION,YELLOW GIANT GARDEN THERMOMETER,YELLOW SHARK HELICOPTER,ZINC STAR T-LIGHT HOLDER,ZINC FOLKART SLEIGH BELLS,ZINC HERB GARDEN CONTAINER,ZINC METAL HEART DECORATION,ZINC T-LIGHT HOLDER STAR LARGE,ZINC T-LIGHT HOLDER STARS SMALL
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536370,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536852,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536974,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537065,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
537463,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [60]:
# inspect the raw summed quantities for one product, to compare against the 0/1 values after conversion
basket['10 COLOUR SPACEBOY PEN'].head(10)


Unnamed: 0_level_0,10 COLOUR SPACEBOY PEN
InvoiceNo,Unnamed: 1_level_1
536370,0
536852,0
536974,0
537065,0
537463,0
537468,24
537693,0
537897,0
537967,0
538008,0


In [61]:
# we don't need the summed quantity;
# we only need to know whether the item was bought or not,
# so mark an item as 1 if the invoice contains it and as 0 otherwise.

def convert_into_binary(x):
    if x > 0:
        return 1
    else:
        return 0


In [62]:
# DataFrame.applymap is deprecated in recent pandas releases; the vectorised comparison
# below gives the same 0/1 result as applying convert_into_binary to every cell.
basket_sets = (basket > 0).astype(int)


In [63]:
# check : has quantity now converted to 1 or 0.
basket_sets['10 COLOUR SPACEBOY PEN'].head(10)



Unnamed: 0_level_0,10 COLOUR SPACEBOY PEN
InvoiceNo,Unnamed: 1_level_1
536370,0
536852,0
536974,0
537065,0
537463,0
537468,1
537693,0
537897,0
537967,0
538008,0


In [64]:
# remove the POSTAGE column: it is a shipping charge rather than a product, and almost every transaction contains it.
print(basket_sets['POSTAGE'].head())

basket_sets.drop(columns=['POSTAGE'],inplace=True)

InvoiceNo
536370    1
536852    1
536974    1
537065    1
537463    1
Name: POSTAGE, dtype: int64




In [65]:
# call the apriori function with a minimum support of 7%:
# an itemset must appear in at least 7% of all transactions to count as frequent.
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)


In [66]:
#it will generate frequent itemsets
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.071429,(4 TRADITIONAL SPINNING TOPS)
1,0.096939,(ALARM CLOCK BAKELIKE GREEN)
2,0.102041,(ALARM CLOCK BAKELIKE PINK)
3,0.094388,(ALARM CLOCK BAKELIKE RED)
4,0.081633,(BAKING SET 9 PIECE RETROSPOT)
5,0.071429,(CHILDRENS CUTLERY DOLLY GIRL)
6,0.09949,(DOLLY GIRL LUNCH BOX)
7,0.096939,(JUMBO BAG RED RETROSPOT)
8,0.076531,(JUMBO BAG WOODLAND ANIMALS)
9,0.125,(LUNCH BAG APPLE DESIGN)


Generating frequent itemsets from a list of items -
--

The first step in generating association rules is to find all the frequent itemsets, on which binary partitions can then be performed to get the antecedent and the consequent.


<img src='https://miro.medium.com/max/576/1*NoUFBxsokdnw3wrYyVXkgw.png' />


`For example`, if there are 6 items {Bread, Butter, Egg, Milk, Notebook, Toothbrush} across all the transactions combined, itemsets will look like {Bread}, {Butter}, {Bread, Notebook}, {Milk, Toothbrush}, {Milk, Egg, Butter} etc. The size of an itemset can vary from one to the total number of items that we have. We keep only the frequent itemsets, not all of them, to put a check on the total number of itemsets generated.

<img src='https://miro.medium.com/max/256/1*gZujUCrD9nanRr-ETBFtbg.png' />

<br>

Frequent itemsets are the ones which occur at least a minimum number of times in the transactions. Technically, these are the itemsets whose support value (the fraction of transactions containing the itemset) is at least a **minimum threshold, min_support. We have kept min_support=0.07 in our above code**.

So, {Bread, Notebook} might not be a frequent itemset if it occurs only 2 times out of 100 transactions and (2/100) = 0.02 falls below the value of min_support.
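The level-wise search behind Apriori can be sketched in plain Python. This is a simplified illustration, not the mlxtend implementation: the function name `frequent_itemsets_sketch`, the five toy transactions and the 0.4 threshold are all assumptions made for the example. It relies on the Apriori property that every subset of a frequent itemset must itself be frequent, so larger candidates are built only from smaller frequent itemsets.

```python
from itertools import combinations


def frequent_itemsets_sketch(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s for s in singles if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result


# Hypothetical toy data; min_support=0.4 means at least 2 of the 5 transactions.
toy = [
    ["Bread", "Butter", "Milk"],
    ["Bread", "Butter"],
    ["Bread", "Egg"],
    ["Milk", "Egg"],
    ["Bread", "Butter", "Egg"],
]
found = frequent_itemsets_sketch(toy, min_support=0.4)
for itemset, sup in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

In this toy data the candidate {Bread, Butter, Egg} is never counted against the transactions, because its subset {Butter, Egg} is already known to be infrequent. That pruning is what keeps the number of candidates manageable.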

In [67]:
# Derive association rules from the frequent itemsets.
# Here the rules are filtered on lift, keeping those with a lift of at least 1.

rules_mlxtend = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules_mlxtend.head()


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE PINK),0.096939,0.102041,0.07398,0.763158,7.478947,1.0,0.064088,3.791383,0.959283,0.591837,0.736244,0.744079
1,(ALARM CLOCK BAKELIKE PINK),(ALARM CLOCK BAKELIKE GREEN),0.102041,0.096939,0.07398,0.725,7.478947,1.0,0.064088,3.283859,0.964734,0.591837,0.69548,0.744079
2,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,1.0,0.069932,5.568878,0.976465,0.704545,0.820431,0.826814
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,1.0,0.069932,4.916181,0.979224,0.704545,0.79659,0.826814
4,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE PINK),0.094388,0.102041,0.07398,0.783784,7.681081,1.0,0.064348,4.153061,0.960466,0.604167,0.759214,0.754392


Generating all possible rules from the frequent itemsets
--

Once the frequent itemsets are generated, identifying rules from them is comparatively less taxing. Rules are formed by binary partition of each itemset. If {Bread, Egg, Milk, Butter} is the frequent itemset, some of the candidate rules will look like:

(Egg, Milk, Butter → Bread), (Bread, Milk, Butter → Egg), (Bread, Egg → Milk, Butter), (Egg, Milk → Bread, Butter), (Butter → Bread, Egg, Milk)

From a list of all possible candidate rules, we aim to identify rules that fall above a minimum threshold level (like min_confidence or min_lift).
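The binary partition step can be sketched with `itertools.combinations`: every non-empty proper subset of the itemset becomes an antecedent, and the remaining items form the consequent. The function name `candidate_rules` is illustrative only.

```python
from itertools import combinations


def candidate_rules(itemset):
    """List every (antecedent, consequent) split of `itemset` into two non-empty parts."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules


all_rules = candidate_rules({"Bread", "Egg", "Milk", "Butter"})
for antecedent, consequent in all_rules:
    print(", ".join(antecedent), "→", ", ".join(consequent))
```

An itemset with k items yields 2^k - 2 candidate rules, so the four-item example above has 14 of them; each one is then scored and kept only if it meets the chosen threshold.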



In [68]:
# rules_mlxtend.rename(columns={'antecedents':'lhs','consequents':'rhs'})

# depending on the business use case, we can filter the rules on confidence and lift.
rules_mlxtend[ (rules_mlxtend['lift'] >= 4) & (rules_mlxtend['confidence'] >= 0.8) ].head()


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
2,(ALARM CLOCK BAKELIKE RED),(ALARM CLOCK BAKELIKE GREEN),0.094388,0.096939,0.079082,0.837838,8.642959,1.0,0.069932,5.568878,0.976465,0.704545,0.820431,0.826814
3,(ALARM CLOCK BAKELIKE GREEN),(ALARM CLOCK BAKELIKE RED),0.096939,0.094388,0.079082,0.815789,8.642959,1.0,0.069932,4.916181,0.979224,0.704545,0.79659,0.826814
16,(SET/6 RED SPOTTY PAPER PLATES),(SET/20 RED RETROSPOT PAPER NAPKINS),0.127551,0.132653,0.102041,0.8,6.030769,1.0,0.085121,4.336735,0.95614,0.645161,0.769412,0.784615
18,(SET/6 RED SPOTTY PAPER PLATES),(SET/6 RED SPOTTY PAPER CUPS),0.127551,0.137755,0.122449,0.96,6.968889,1.0,0.104878,21.556122,0.981725,0.857143,0.953609,0.924444
19,(SET/6 RED SPOTTY PAPER CUPS),(SET/6 RED SPOTTY PAPER PLATES),0.137755,0.127551,0.122449,0.888889,6.968889,1.0,0.104878,7.852041,0.993343,0.857143,0.872645,0.924444


How to read the above data?
--

> antecedents and consequents  -> The IF component of an association rule is known as the antecedent. The THEN component is known as the consequent. The antecedent and the consequent are disjoint; they have no items in common.

> Example of antecedents and consequents -> Consider antecedents as X and consequents as Y.  

<img src='https://miro.medium.com/max/576/1*NoUFBxsokdnw3wrYyVXkgw.png' />

> antecedent support -> This measure gives an idea of how frequent the antecedent is across all the transactions. For example, (ALARM CLOCK BAKELIKE RED) is present in 9.4% of the transactions.

> consequent support -> This measure gives an idea of how frequent the consequent is across all the transactions. For example, (ALARM CLOCK BAKELIKE GREEN) is present in 9.7% of the transactions.

> support -> This measure gives an idea of how frequent the whole `ItemSet` is across all the transactions. For example, {ALARM CLOCK BAKELIKE RED, ALARM CLOCK BAKELIKE GREEN} is present in 7.9% of the transactions.

> confidence -> This measure defines the likelihood of the consequent being in the cart given that the cart already has the antecedent. So {ALARM CLOCK BAKELIKE RED} -> {ALARM CLOCK BAKELIKE GREEN} has a confidence of 83.8%. In simple words, there is an 83.8% chance of finding {ALARM CLOCK BAKELIKE GREEN} if the cart contains {ALARM CLOCK BAKELIKE RED}.

> lift -> This measure defines the likelihood of the consequent being in the cart given that the cart already has the antecedent, while controlling for the popularity of the consequent. The lift of {ALARM CLOCK BAKELIKE GREEN} w.r.t. {ALARM CLOCK BAKELIKE RED} is 8.64, which is well above 1: the two items are bought together far more often than would be expected if they were independent. Any lift value > 1 implies that the association rule is worth considering.  [ Reference : https://en.wikipedia.org/wiki/Lift_(data_mining)]

> leverage -> leverage(X -> Y) = P(X and Y) - (P(X)P(Y))

*Leverage measures the difference between how often X and Y appear together in the data set and what would be expected if X and Y were statistically independent. The rationale in a sales setting is to find out how many more units (items X and Y together) are sold than expected from the independent sales.*  [Reference : https://michael.hahsler.net/research/recommender/associationrules.html]

> conviction -> Ignore this parameter.  Not much of use in most situations.
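As a quick check, the confidence, lift and leverage columns can be recomputed by hand from the three support values. The inputs below are copied from the {ALARM CLOCK BAKELIKE RED} -> {ALARM CLOCK BAKELIKE GREEN} row of the rules table above; the variable names are illustrative, and because the inputs are already rounded to six decimals the results agree with the table only up to rounding.

```python
# Support values copied from the RED -> GREEN row of the rules table above.
support_x = 0.094388    # antecedent support: ALARM CLOCK BAKELIKE RED
support_y = 0.096939    # consequent support: ALARM CLOCK BAKELIKE GREEN
support_xy = 0.079082   # support of both items together

# confidence(X -> Y) = P(X and Y) / P(X)
confidence = support_xy / support_x

# lift(X -> Y) = confidence(X -> Y) / P(Y)
lift = confidence / support_y

# leverage(X -> Y) = P(X and Y) - P(X) * P(Y)
leverage = support_xy - support_x * support_y

print(round(confidence, 4), round(lift, 2), round(leverage, 4))
```

The three values come out close to the 0.837838, 8.642959 and 0.069932 reported in the table.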
