
# <center>  ECLAT ALGORITHM  </center>


<b>Association rule</b> mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a set of transactions. A typical example is Market Basket Analysis.



The ECLAT (Equivalence Class Clustering and bottom-up Lattice Traversal) algorithm is a data mining algorithm for association rule mining designed to solve market basket analysis problems. The goal is to understand which products in the basket are commonly bought together.

The Apriori and FP-growth algorithms require data to be in the row format (often called “horizontal format”), where each row lists the items of one transaction. In contrast, the ECLAT algorithm is designed to deal with data stored in the column-oriented format (often called “vertical format”), where each item is mapped to the IDs of the transactions that contain it.


<img src="image/eclat_image.png" width="700" height="600" >



This vertical approach often makes the ECLAT algorithm faster than the Apriori and FP-growth algorithms, as it scans the database only once: after that scan, the support of an itemset is computed by intersecting transaction ID lists instead of rereading the data. The Apriori algorithm scans the database in every iteration, and the FP-growth algorithm scans it twice.





## Example

Let’s assume that we have the following database, and we set the minimum support value (the minimum number of transactions in which an item must occur) to 2.

<img src="image/example_1.png" width="300" height="200"/>

As we can see, the data is in horizontal format. So, we need to transform it:

<img src="image/example_2.png" width="300" height="200"/>
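
The transformation from horizontal to vertical format can be sketched in a few lines of plain Python. The transactions below are made up for illustration and are not the ones shown in the figures; `to_vertical` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical horizontal data: transaction ID -> items bought
horizontal = {
    "T1": ["bread", "butter", "jam"],
    "T2": ["butter", "coke"],
    "T3": ["butter", "milk"],
    "T4": ["bread", "butter", "coke"],
    "T5": ["bread", "milk"],
}

def to_vertical(transactions):
    """Map each item to the set of transaction IDs that contain it."""
    vertical = defaultdict(set)
    for tid, items in transactions.items():
        for item in items:
            vertical[item].add(tid)
    return dict(vertical)

vertical = to_vertical(horizontal)
for item, tids in sorted(vertical.items()):
    print(item, sorted(tids))
```

The single pass over `horizontal` is the only time the original transactions are read; everything after this point works on the per-item ID sets.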

As our minimum support value is 2, all items that appeared in only one transaction will be excluded from the dataset.

<img src="image/example_3.png" width="600" height="300"/>
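
As a rough sketch, this filtering step is a comparison between the size of each item's transaction ID set and the minimum support. The data is again hypothetical and the variable names are illustrative.

```python
# Hypothetical vertical data: item -> set of transaction IDs
vertical = {
    "bread": {"T1", "T4", "T5"},
    "butter": {"T1", "T2", "T3", "T4"},
    "coke": {"T2", "T4"},
    "jam": {"T1"},
    "milk": {"T3", "T5"},
}

# Absolute count, as in the example above
min_support = 2

# Keep only the items that occur in at least `min_support` transactions
frequent_items = {
    item: tids for item, tids in vertical.items() if len(tids) >= min_support
}

print(sorted(frequent_items))
```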

The next step is to list every combination of two items (every itemset of length 2) that can be formed from the remaining items.

All possible combinations are the following:

<img src="image/example_4.png" width="150" height="200"/>


Now, we need to associate all item combinations with corresponding transaction IDs. In our example, we’ll get the following table:

<img src="image/example_5.png" width="300" height="200"/>

We need to remove the item combinations whose support is below the minimum support:

<img src="image/example_6.png" width="600" height="300"/>
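
Associating a pair of items with its transaction IDs amounts to intersecting the two items' ID sets, and the support of the pair is the size of that intersection. A minimal sketch on hypothetical data (not the data from the figures):

```python
from itertools import combinations

# Hypothetical frequent single items: item -> set of transaction IDs
frequent_items = {
    "bread": {"T1", "T4", "T5"},
    "butter": {"T1", "T2", "T3", "T4"},
    "coke": {"T2", "T4"},
    "milk": {"T3", "T5"},
}
min_support = 2

# Transactions containing both items = intersection of the two ID sets
pair_tids = {
    (a, b): frequent_items[a] & frequent_items[b]
    for a, b in combinations(sorted(frequent_items), 2)
}

# Drop the pairs whose support is below the minimum support
frequent_pairs = {
    pair: tids for pair, tids in pair_tids.items() if len(tids) >= min_support
}

for pair, tids in frequent_pairs.items():
    print(pair, sorted(tids))
```

With this toy data, six pairs are generated and two of them survive the pruning step.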

We repeat these steps as many times as needed to analyze itemsets of the required length. In our current example, when we build combinations of three products, we find that only one group of three items occurs together, and it appears in just a single transaction.
<img src="image/example_7.png" width="300" height="200"/>


But one transaction is less than the minimum support value (two transactions), so we will generate association rules based on the output of the previous step.
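
The repeated "combine, intersect, prune" loop is commonly written as a depth-first recursion. The following is a minimal sketch under that assumption, using hypothetical data; it is not the implementation used by any particular library, and the function name `eclat` is chosen here for illustration.

```python
def eclat(prefix, candidates, min_support, results):
    """Depth-first search: extend `prefix` with every candidate that is frequent."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) < min_support:
            continue  # prune: no superset of an infrequent itemset can be frequent
        itemset = prefix + (item,)
        results[itemset] = len(tids)
        # Next level: intersect this ID set with those of the items that follow
        extensions = [
            (other, tids & other_tids) for other, other_tids in candidates[i + 1:]
        ]
        eclat(itemset, extensions, min_support, results)
    return results

# Hypothetical vertical data: item -> set of transaction IDs
vertical = {
    "bread": {"T1", "T4", "T5"},
    "butter": {"T1", "T2", "T3", "T4"},
    "coke": {"T2", "T4"},
    "jam": {"T1"},
    "milk": {"T3", "T5"},
}

frequent = eclat((), sorted(vertical.items()), min_support=2, results={})
for itemset, support in frequent.items():
    print(itemset, support)
```

On this toy data the recursion stops at pairs, because every three-item intersection contains fewer than two transactions.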

Here’s the recommended items list generated by the ECLAT algorithm based on our conditions and dataset:

<img src="image/example_8.png" width="150" height="200"/>


## Implementation of ECLAT using Python


In [15]:
# importing the dataset (example 1 and example 2 are datasets in pyECLAT)
from pyECLAT import Example2

# storing the dataset in a variable
dataset = Example2().get()

# printing the dataset
dataset.head()


Unnamed: 0,0,1,2,3,4,5,6
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams
1,burgers,meatballs,eggs,,,,
2,chutney,,,,,,
3,turkey,avocado,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,


Each row represents a customer’s purchase at a supermarket in this dataset. For example, in row 1, the customer purchased only burgers, meatballs, and eggs.
Let’s get more information about the dataset by printing more details.

In [16]:
# Info method
dataset.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       3001 non-null   object
 1   1       2315 non-null   object
 2   2       1774 non-null   object
 3   3       1374 non-null   object
 4   4       1048 non-null   object
 5   5       775 non-null    object
 6   6       581 non-null    object
dtypes: object(7)
memory usage: 164.2+ KB


The output shows that the dataset contains 3001 rows and 7 columns.


### Visualizing the frequent items
To visualize the frequent items, let’s load the dataset into the ECLAT class and generate a binary DataFrame:

In [17]:
# importing the ECLAT module
from pyECLAT import ECLAT

# loading transactions DataFrame to ECLAT class
eclat = ECLAT(data=dataset)

# DataFrame of binary values
eclat.df_bin



Unnamed: 0,blueberries,chocolate bread,turkey,asparagus,mayonnaise,tomato juice,almonds,nonfat milk,ketchup,mashed potato,...,tomatoes,cottage cheese,mushroom cream sauce,energy drink,chili,fresh bread,whole wheat pasta,pickles,mint,strong cheese
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2999,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


In this binary dataset, every row represents a transaction. The columns are the products that may appear in any transaction. Every cell contains one of two possible values:

     0 the product was not included in the transaction
     1 the transaction contains the product
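
To make the binary layout concrete, here is one way such a table could be built with plain pandas. This is only an illustrative sketch on a few hypothetical rows and does not claim to show how pyECLAT builds `df_bin` internally.

```python
import pandas as pd

# Hypothetical transactions in the same ragged layout as the dataset
transactions = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", None, None],
    ["turkey", "avocado", None],
])

# One dict per transaction: every purchased item gets the value 1
rows = [
    {item: 1 for item in row if pd.notna(item)}
    for row in transactions.itertuples(index=False)
]

# Items missing from a transaction become NaN, which we turn into 0
binary = pd.DataFrame(rows).fillna(0).astype(int)
print(binary)
```
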
Now, we need to count items for every column in the DataFrame:

In [18]:
# count items in each column
items_total = eclat.df_bin.astype(int).sum(axis=0)

items_total


blueberries           26
chocolate bread       10
turkey               198
asparagus              7
mayonnaise            10
                    ... 
fresh bread           91
whole wheat pasta     91
pickles               17
mint                  43
strong cheese         25
Length: 119, dtype: int64

It is also helpful to count the items in every row of the DataFrame:


In [19]:
# count items in each row
items_per_transaction = eclat.df_bin.astype(int).sum(axis=1)

items_per_transaction


0       7
1       3
2       1
3       2
4       5
       ..
2996    1
2997    2
2998    3
2999    7
3000    5
Length: 3001, dtype: int64

Now, we can use these Series to visualize the item distribution:



In [20]:
# frequency item list
import pandas as pd

# Loading items per column stats to the DataFrame
df = pd.DataFrame({'items': items_total.index, 'transactions': items_total.values}) 

# sorting items by transaction count (sort_values returns a new DataFrame)
df_table = df.sort_values("transactions", ascending=False)

#  Top 5 most popular products/items
df_table.head(5).style.background_gradient(cmap='Blues')

Unnamed: 0,items,transactions
43,mineral water,711
91,spaghetti,549
21,eggs,532
100,chocolate,485
23,french fries,463


We can also visualize the frequently occurring items using a treemap:



In [21]:
# importing required module
import plotly.express as px

# add a common root node so that all items share the same origin
df_table["all"] = "Tree Map" 

# creating tree map using plotly
fig = px.treemap(df_table.head(50), path=['all', "items"], values='transactions',
                  color=df_table["transactions"].head(50), hover_data=['items'],
                  color_continuous_scale='Blues',
                )
# plotting the treemap
fig.show()






### Generating association rules
To generate association rules, we need to define:

    Minimum support – should be provided as a fraction of the total number of transactions in the dataset (for example, 0.05 for 5%)
    Minimum combinations – the minimum number of items in an item combination
    Maximum combinations – the maximum number of items in an item combination
Note: the higher the value of the maximum combinations the longer the calculation will take.
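
As a quick illustration of what a fractional support means, the sketch below computes it directly from a small binary table. The table and the `support` helper are hypothetical and independent of pyECLAT.

```python
import pandas as pd

# Hypothetical binary matrix: rows are transactions, columns are items
df_bin = pd.DataFrame({
    "mineral water": [1, 1, 0, 1],
    "spaghetti":     [1, 0, 0, 1],
    "eggs":          [0, 1, 1, 0],
})

def support(df, items):
    """Fraction of transactions (rows) that contain every item in `items`."""
    return df[list(items)].all(axis=1).mean()

# Two of the four transactions contain both items, so the support is 0.5
print(support(df_bin, ["mineral water", "spaghetti"]))
```

An item combination is kept only if this fraction is at least the chosen minimum support.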

In [16]:
# ECLAT algorithm
# an item combination should appear in at least 5% of transactions
min_support = 5/100

# start from combinations of at least 2 items
min_combination = 2

# go up to the largest number of items found in a single transaction
max_combination = max(items_per_transaction)

rule_indices, rule_supports = eclat.fit(min_support=min_support,
                                        min_combination=min_combination,
                                        max_combination=max_combination,
                                        separator=' & ',
                                        verbose=True)

Combination 2 by 2


253it [00:01, 200.08it/s]


Combination 3 by 3


1771it [00:08, 196.80it/s]


Combination 4 by 4


8855it [00:46, 189.51it/s]


Combination 5 by 5


33649it [03:10, 176.55it/s]


Combination 6 by 6


100947it [10:26, 161.19it/s]


Combination 7 by 7


245157it [2:19:23, 29.31it/s] 
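
The iteration counts in the log above equal the number of ways to choose k items out of 23. This suggests that 23 single items passed the support threshold and that every combination of them was evaluated, although that is an inference from the numbers rather than something stated by the library. The following check reproduces the counts; `n_items = 23` is the inferred value.

```python
from math import comb

# Inferred from the log (comb(23, 2) == 253); not stated in the source
n_items = 23

# Number of candidate combinations for each combination size in the log
candidates = {k: comb(n_items, k) for k in range(2, 8)}

for k, count in candidates.items():
    print(f"Combination {k} by {k}: {count} candidates")
```

The count grows quickly with the combination size, which is consistent with the run time jumping from seconds to hours.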


In [18]:
import pandas as pd

result = pd.DataFrame(rule_supports.items(),columns=['Item', 'Support'])
result.sort_values(by=['Support'], ascending=False)


Unnamed: 0,Item,Support
0,mineral water & spaghetti,0.060646


We found that mineral water and spaghetti are commonly purchased together: with the minimum support value we’ve provided, this is the only combination of two or more items returned for our dataset, with a support of about 6% of the transactions.
