# Note

In the thesis, I implicitly suggested that the association rule approach on its own, without any corrections, is a dead end. This guide shows the reasons behind that conclusion.

Throughout this guide we will use our own tiny set of transactions, because it runs in a fraction of the time. On our real dataset, the computation with either the Apriori or the FP-Growth implementation from the third-party libraries used here would be unnecessarily slow.

Transactions:

In [64]:
human = ['beer', 'water', 'milk']
alien = ['beer', 'red', 'orange']

And the whole dataset:

In [65]:
entries = [human, alien]

First we will show the problems with the FP-Growth algorithm from a third-party library called pyfpgrowth, then we will show the Apriori approach.

# Using FP-Growth algorithm implementation from pyfpgrowth

In [66]:
import pyfpgrowth

This implementation expects a list of transactions as input, with the items represented as integers. We can use LabelEncoder from scikit-learn for this purpose.

In [67]:
from sklearn import preprocessing

encoder = preprocessing.LabelEncoder()
encoder.fit([x for y in entries for x in y])
labeled_entries = [encoder.transform(entry) for entry in entries]
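
To make the encoding step concrete, here is a minimal standard-library sketch of the same idea: every distinct item gets an integer, assigned in sorted order. The names `item_to_id`, `encode` and `decode` are made up for this illustration and are not part of scikit-learn.

```python
# Toy transactions from this guide.
human = ['beer', 'water', 'milk']
alien = ['beer', 'red', 'orange']
entries = [human, alien]

# Assign an integer to every distinct item, in sorted order.
vocabulary = sorted({item for entry in entries for item in entry})
item_to_id = {item: index for index, item in enumerate(vocabulary)}

def encode(entry):
    """Replace each item of a transaction with its integer label."""
    return [item_to_id[item] for item in entry]

def decode(labels):
    """Map integer labels back to the original item names."""
    return [vocabulary[label] for label in labels]

encoded_entries = [encode(entry) for entry in entries]
print(item_to_id)
print(encoded_entries)
```

Decoding is the reverse lookup, which is the role `inverse_transform` plays when the rules are printed later.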

Now we can feed the labeled data to the algorithm and find frequent patterns. The support threshold is set to 0, so every pattern that occurs in the transactions is kept.

In [68]:
support = 0
patterns = pyfpgrowth.find_frequent_patterns(labeled_entries, support)

When the desired frequent patterns are found, the algorithm can generate rules based on them. The confidence is also set to 0, so every possible rule is generated.

In [69]:
confidence = 0
rules = pyfpgrowth.generate_association_rules(patterns, confidence)

To get a quick look at the rules, we can print them out.

In [70]:
for rule in rules:
    print(tuple(encoder.inverse_transform(rule)), encoder.inverse_transform(rules[rule][0]))

('beer', 'milk') ['water']
('beer',) ['orange']
('orange', 'red') ['beer']
('milk',) ['beer' 'water']
('beer', 'water') ['milk']
('orange',) ['beer']
('beer', 'red') ['orange']
('red',) ['beer']
('beer', 'orange') ['red']
('milk', 'water') ['beer']
('water',) ['beer']


One problem here is that we cannot limit the maximal length of a rule so that every rule has the shape {x} -> {y}, as suggested in the thesis.

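Another problem is the 'incompleteness' of the generated rules. Every antecedent appears in the output only once. For example, (water) -> (beer) is present, but (water) -> (milk) is missing, although water and milk occur in the same transaction. Similarly, (milk) appears only with the consequent (beer, water), so the shorter rules (milk) -> (beer) and (milk) -> (water) never show up, and (beer) appears only with (orange).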
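
The gap can be illustrated without the library. The sketch below enumerates every antecedent/consequent split of the itemsets found in the `human` transaction and then stores the splits in a dictionary keyed by the antecedent, which is the shape the printing loop above reads from (`rules[rule][0]`). That a later rule with the same antecedent replaces an earlier one is an assumption about how such a dictionary gets filled, not a statement about pyfpgrowth's internals. `all_splits` is a hypothetical helper.

```python
from itertools import combinations

def all_splits(itemset):
    """Yield every (antecedent, consequent) split of an itemset."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

human = ['beer', 'water', 'milk']

# Every itemset with at least two items that occurs in the transaction.
itemsets = [c for size in (2, 3) for c in combinations(sorted(human), size)]

every_rule = [split for itemset in itemsets for split in all_splits(itemset)]

# Keyed by antecedent: a later rule with the same antecedent
# replaces the earlier one (assumed behaviour, see the note above).
by_antecedent = {}
for antecedent, consequent in every_rule:
    by_antecedent[antecedent] = consequent

print(len(every_rule), 'rules in total')
print(len(by_antecedent), 'rules survive in the dictionary')
print(by_antecedent[('milk',)])
```

In this sketch half of the candidate rules are lost, and the single entry that survives for (milk) has the consequent (beer, water), the same one that appears in the output above.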

To summarize, this implementation cannot be used because its API lacks an option for the maximal length of a rule. If we ignored the maximal length, the only way to use this algorithm would be to keep all variations of the rules, which would end up with a horrible space complexity. Even then, it would still not be a correct approach, because the generated rules are incomplete, as mentioned above.

# Using Apriori algorithm implementation from apyori

This library has a better API than the FP-Growth implementation from pyfpgrowth, but it is unfortunately very slow at computing the rules from a dataset.

In [1]:
from apyori import apriori

An association rule algorithm produces rules as its output. Each rule generally carries three main attributes: support, confidence and, optionally, lift. If you want to know more about what each of them means, see chapter 9 of the thesis attached to this guide.
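
As a quick reminder of the three attributes, here is a minimal standard-library sketch that computes them by brute force on the toy dataset, using the usual textbook definitions. The function names are chosen for this illustration only and do not come from apyori.

```python
human = ['beer', 'water', 'milk']
alien = ['beer', 'red', 'orange']
entries = [human, alien]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    wanted = set(itemset)
    hits = sum(1 for transaction in transactions if wanted <= set(transaction))
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

# Rule {water} -> {milk}
print(support({'water', 'milk'}, entries))       # 0.5
print(confidence({'water'}, {'milk'}, entries))  # 1.0
print(lift({'water'}, {'milk'}, entries))        # 2.0
```

A lift above 1 means the two items occur together more often than their individual frequencies would suggest; for {beer} -> {milk} the same functions give a confidence of 0.5 and a lift of 1.0.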

Let's assume that we set the required number of occurrences to 1. This means that the algorithm will recognize every transaction in the dataset, because each transaction has to appear at least once.

In [82]:
occurence = 1

In order to make the algorithm "understand" what we mean, we need to convert the occurrence count to support, that is, to a fraction of the total number of transactions. Again, more detailed information about how the association rule algorithms work can be found in the attached thesis.

In [83]:
support_num = occurence / len(entries)
print('Support number that will be used in apriori algorithm:', support_num)

Support number that will be used in apriori algorithm: 0.5


The confidence will be set to 0, again for example purposes only.

In [84]:
confidence = 0

And last but not least, the documentation says that we can specify the parameter max_length, the maximum length of relations. So I set the maximal length of a rule to 2. The reason behind this is described in the attached thesis.

In [91]:
length = 2

The apriori algorithm is set up and we can generate the rules.

In [92]:
association_rules = apriori(entries, min_support=support_num, min_confidence=confidence, max_lenght=length)

Let's print these rules in a readable way.

In [93]:
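iterations = 0
for item in association_rules:
    # item[0] is the itemset of the record; it is unordered,
    # so ' -> ' is only a separator here, not a rule direction
    print(' -> '.join(item[0]))
    # item[1] is the support of the whole itemset
    print('support: ', str(item[1]))
    # item[2][0] is the first ordered statistic of the record;
    # its fields 2 and 3 are the confidence and the lift
    print('confidence: ', str(item[2][0][2]))
    print('lift: ' + str(item[2][0][3]))
    print()
    iterations += 1

print("number of rules: ", iterations)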

beer
support:  1.0
confidence:  1.0
lift: 1.0

milk
support:  0.5
confidence:  0.5
lift: 1.0

orange
support:  0.5
confidence:  0.5
lift: 1.0

red
support:  0.5
confidence:  0.5
lift: 1.0

water
support:  0.5
confidence:  0.5
lift: 1.0

milk -> beer
support:  0.5
confidence:  0.5
lift: 1.0

orange -> beer
support:  0.5
confidence:  0.5
lift: 1.0

red -> beer
support:  0.5
confidence:  0.5
lift: 1.0

water -> beer
support:  0.5
confidence:  0.5
lift: 1.0

water -> milk
support:  0.5
confidence:  1.0
lift: 2.0

orange -> red
support:  0.5
confidence:  1.0
lift: 2.0

water -> milk -> beer
support:  0.5
confidence:  1.0
lift: 2.0

orange -> red -> beer
support:  0.5
confidence:  1.0
lift: 2.0

number of rules:  13


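Unfortunately, the output again contains more than we asked for. The last two entries should not be there: we specified a maximal rule length of 2, but they have a length of 3. A possible cause is visible in the call above, where the keyword is spelled max_lenght instead of max_length; if apyori silently ignores keyword arguments it does not recognize, the limit was never applied.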
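
Whatever the cause, one workaround is to filter the results afterwards. The sketch below uses plain frozensets as stand-ins for some of the itemsets printed above; they are not real apyori objects, and `keep_up_to` is a hypothetical helper.

```python
# Stand-ins for some of the itemsets printed above (not real apyori records).
itemsets = [
    frozenset({'beer'}),
    frozenset({'water', 'milk'}),
    frozenset({'orange', 'red'}),
    frozenset({'water', 'milk', 'beer'}),
    frozenset({'orange', 'red', 'beer'}),
]

def keep_up_to(itemsets, max_length):
    """Keep only the itemsets with at most max_length items."""
    return [itemset for itemset in itemsets if len(itemset) <= max_length]

short = keep_up_to(itemsets, 2)
print(len(short))
```

With real results, the same length check would be applied to the first field of each record (`item[0]` in the printing loop above).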

# Conclusion

A custom algorithm implemented from scratch would probably be the best solution, but the aim of the thesis was rather to use an existing API available in Python, so I did not include these models in the results.
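
As an illustration of what such a from-scratch approach could look like, here is a minimal sketch that produces only rules of the shape {x} -> {y} by counting single items and ordered pairs. It is not the method used in the thesis and was not evaluated on the real dataset; `pair_rules` and its parameters are hypothetical names.

```python
from collections import Counter
from itertools import permutations

def pair_rules(transactions, min_support=0.0, min_confidence=0.0):
    """Return (x, y, support, confidence, lift) for every rule {x} -> {y}."""
    total = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for transaction in transactions:
        items = set(transaction)  # ignore duplicates within a transaction
        item_counts.update(items)
        # ordered pairs (x, y): both directions of every co-occurrence
        pair_counts.update(permutations(sorted(items), 2))

    rules = []
    for (x, y), count in pair_counts.items():
        support = count / total
        confidence = count / item_counts[x]
        lift = confidence / (item_counts[y] / total)
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence, lift))
    return rules

entries = [['beer', 'water', 'milk'], ['beer', 'red', 'orange']]
rules = pair_rules(entries, min_support=0.5)
print(len(rules))
```

Because only pairs are counted, no rule can be longer than two items. The trade-off is that the number of counted pairs grows quadratically with the number of items in a transaction.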