
Improve performance of apriori #619

Merged
rasbt merged 6 commits into rasbt:master from dbarbier:db/perf on Nov 3, 2019

Conversation

@dbarbier
Contributor

dbarbier commented Nov 2, 2019

Description

This PR improves performance of apriori for medium to large datasets, in particular with low_memory=True. Situation is less clear for very small datasets.

Related issues or pull requests

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

old_combination is sorted, thus its max() is its last element.
Since items_types_in_previous_step is a NumPy array, we can find all
valid elements with a single call, which shortens the inner loop.
Let generate_new_combinations return ints instead of tuples,
and collect them with np.fromiter.
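The idea behind this commit can be sketched roughly as follows (a minimal illustration with made-up data; the variable names only loosely mirror mlxtend's internals):

```python
import numpy as np

# Combinations are kept sorted, so the maximum is the last element:
# no scan through the tuple is needed.
old_combination = (1, 4, 7)
items_in_previous_step = np.array([0, 2, 3, 5, 7, 8, 9])

max_item = old_combination[-1]  # == max(old_combination)

# One vectorized comparison selects all valid extension candidates,
# replacing a Python-level inner loop over every item.
valid_items = items_in_previous_step[items_in_previous_step > max_item]

# Yielding flat ints instead of tuples lets the caller collect
# results cheaply with np.fromiter.
def new_items():
    for item in valid_items:
        yield item

collected = np.fromiter(new_items(), dtype=int)
print(collected)  # [8 9]
```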

This is slower with low_memory=True; that will be fixed by the next commit.
The verbose output has to be modified, since we now loop over valid
combinations only.

Performance is now equivalent to or better than the version with low_memory=False.

Adjust test_fpbase.py output.
@dbarbier
Contributor Author

dbarbier commented Nov 2, 2019

Here is the benchmark script I used to compare master and this branch:

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from time import time

np.random.seed(42)
x = np.random.randint(0, high=5, size=100_000)
df = pd.DataFrame((x/4).astype(bool).reshape(-1, 100))
ds = df.to_sparse()

def bench(df, min_support, low_memory):
    tick = time()
    res = apriori(df, min_support=min_support, verbose=0, use_colnames=False, low_memory=low_memory)
    print("min_support=", min_support, "low_memory=", low_memory, "sparse=", isinstance(df, pd.core.sparse.frame.SparseDataFrame), time() - tick)
    return res

for min_support in [0.8, 0.7, 0.6, 0.5]:
    for d in [df, ds]:
        for low_memory in [True, False]:
            if low_memory:
                low = bench(d, min_support, low_memory)
            else:
                assert low.equals(bench(d, min_support, low_memory))

Results:

dataframe 40x10                 master (fe0e22a)  db/perf (3056e4a2)
  
min_support= 0.8 low_memory=1 sparse=0   0.001493   0.001509
min_support= 0.8 low_memory=0 sparse=0   0.001202   0.001275
min_support= 0.8 low_memory=1 sparse=1   0.007267   0.007668
min_support= 0.8 low_memory=0 sparse=1   0.003154   0.003353

min_support= 0.7 low_memory=1 sparse=0   0.002396   0.002653
min_support= 0.7 low_memory=0 sparse=0   0.002384   0.002260
min_support= 0.7 low_memory=1 sparse=1   0.021657   0.014279
min_support= 0.7 low_memory=0 sparse=1   0.004684   0.005507

min_support= 0.6 low_memory=1 sparse=0   0.003882   0.003697
min_support= 0.6 low_memory=0 sparse=0   0.003035   0.003104
min_support= 0.6 low_memory=1 sparse=1   0.039360   0.030811
min_support= 0.6 low_memory=0 sparse=1   0.005814   0.006533

min_support= 0.5 low_memory=1 sparse=0   0.005517   0.006016
min_support= 0.5 low_memory=0 sparse=0   0.003736   0.004259
min_support= 0.5 low_memory=1 sparse=1   0.074175   0.075586
min_support= 0.5 low_memory=0 sparse=1   0.007868   0.007929

dataframe 40x100                master (fe0e22a)  db/perf (3056e4a2)

min_support= 0.8 low_memory=1 sparse=0    0.0334    0.0057
min_support= 0.8 low_memory=0 sparse=0    0.0076    0.0052
min_support= 0.8 low_memory=1 sparse=1    0.9123    0.1024
min_support= 0.8 low_memory=0 sparse=1    0.0197    0.0167

min_support= 0.7 low_memory=1 sparse=0    0.7691    0.0574
min_support= 0.7 low_memory=0 sparse=0    0.1541    0.0525
min_support= 0.7 low_memory=1 sparse=1   23.1043    1.3285
min_support= 0.7 low_memory=0 sparse=1    0.2057    0.1165

min_support= 0.6 low_memory=1 sparse=0   15.2329    1.0777
min_support= 0.6 low_memory=0 sparse=0    3.0321    0.9738
min_support= 0.6 low_memory=1 sparse=1  443.7326   27.6753
min_support= 0.6 low_memory=0 sparse=1    4.3635    2.4218

min_support= 0.5 low_memory=1 sparse=0  394.4081   32.4573
min_support= 0.5 low_memory=0 sparse=0   83.9022   28.8984
min_support= 0.5 low_memory=1 sparse=1 unfinished 834.6909
min_support= 0.5 low_memory=0 sparse=1  130.8865   78.8501

dataframe 1000x100              master (fe0e22a)  db/perf (3056e4a2)

min_support= 0.8 low_memory=1 sparse=0      0.02191      0.00567
min_support= 0.8 low_memory=0 sparse=0      0.00732      0.00714
min_support= 0.8 low_memory=1 sparse=1      0.46083      0.06502
min_support= 0.8 low_memory=0 sparse=1      0.03289      0.03401

min_support= 0.7 low_memory=1 sparse=0      0.08458      0.01312
min_support= 0.7 low_memory=0 sparse=0      0.02404      0.02052
min_support= 0.7 low_memory=1 sparse=1      1.90439      0.11286
min_support= 0.7 low_memory=0 sparse=1      0.09701      0.09027

min_support= 0.6 low_memory=1 sparse=0      2.84277      0.34374
min_support= 0.6 low_memory=0 sparse=0      0.87047      0.64033
min_support= 0.6 low_memory=1 sparse=1     66.11381      4.17402
min_support= 0.6 low_memory=0 sparse=1      3.75937      3.54072

min_support= 0.5 low_memory=1 sparse=0     58.83895      7.19620
min_support= 0.5 low_memory=0 sparse=0 memory error memory error
min_support= 0.5 low_memory=1 sparse=1   unfinished    100.69009
min_support= 0.5 low_memory=0 sparse=1 memory error memory error

Results with the smallest dataset are not really relevant IMO: times are too small, and it is unlikely that one would use a sparse matrix or the low_memory option there. They are included anyway to show that there is no dramatic regression.

@coveralls

coveralls commented Nov 2, 2019

Coverage Status

Coverage increased (+0.007%) to 92.354% when pulling a59cceb on dbarbier:db/perf into fe0e22a on rasbt:master.

@rasbt
Owner

rasbt commented Nov 3, 2019

Thanks for the PR. The benchmarks look pretty convincing. And I agree with you: it is unlikely that someone would apply this to small data frames, and even then the performance differences are negligible, because either version finishes in negligible time.

If all columns are boolean, there is nothing to check.

In apriori.py, call valid_input_check.
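The fast path this commit describes might look roughly like this (a hedged sketch only; the actual signature and body of valid_input_check in apriori.py may differ):

```python
import numpy as np
import pandas as pd

def valid_input_check(df):
    # Sketch of the idea above: if every column is already boolean,
    # there is nothing to check and we can return immediately.
    if all(dtype == bool for dtype in df.dtypes):
        return
    # Otherwise fall back to scanning every value for 0/1 validity,
    # which is the expensive part for large integer dataframes.
    if not np.isin(df.values, (0, 1)).all():
        raise ValueError("DataFrame must contain only 0/1 or boolean values")

valid_input_check(pd.DataFrame(np.eye(3, dtype=bool)))  # fast path, no scan
valid_input_check(pd.DataFrame(np.eye(3, dtype=int)))   # falls back to the scan
```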
@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

I found #549 when looking for other benchmarks, and ran the same tests with this script:

import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder   
from time import time

with open("kosarak-50k.dat", "rt") as f:
    data = f.readlines() 

dataset = [list(map(int, f.split())) for f in data]    

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_) 

def bench(df, min_support, no_crash=False):
    if not no_crash:
        tick = time()
        res1 = apriori(df, min_support=min_support, verbose=0, use_colnames=False, low_memory=False)
        print("min_support={:g} apriori low_memory=0 {:.3g}".format(min_support, time() - tick))
    tick = time()
    res2 = apriori(df, min_support=min_support, verbose=0, use_colnames=False, low_memory=True)
    print("min_support={:g} apriori low_memory=1 {:.3g}".format(min_support, time() - tick))
    tick = time()
    res3 = fpgrowth(df, min_support=min_support, verbose=0, use_colnames=False)
    print("min_support={:g} fpgrowth {:.3g}".format(min_support, time() - tick))
    if not no_crash:
        assert res1.equals(res2)
    # Replace frozensets by sorted lists so that sorting by itemsets is meaningful
    res2['itemsets'] = res2['itemsets'].apply(sorted)
    res2 = res2.sort_values(by='itemsets').reset_index(drop=True)
    res3['itemsets'] = res3['itemsets'].apply(sorted)
    res3 = res3.sort_values(by='itemsets').reset_index(drop=True)
    assert res2.equals(res3), pd.concat([res2, res3], axis=1)

for min_support in [0.05, 0.02, 0.01, 0.005]:
    bench(df, min_support)

Here timings are made on the whole functions. I just pushed a commit to improve checks on input dataframe. Results are:

kosarak-50k.dat
                                       master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0   4.66   4.66    0.74
min_support=0.05 apriori low_memory=1   4.64   5.28    1.34
min_support=0.05 fpgrowth               4.96   4.92    1.07
min_support=0.02 apriori low_memory=0   5.08   5.10    1.16
min_support=0.02 apriori low_memory=1   5.04   5.67    1.76
min_support=0.02 fpgrowth               4.98   5.02    1.09
min_support=0.01 apriori low_memory=0   9.60  10.03    5.84
min_support=0.01 apriori low_memory=1   9.60   6.41    2.46
min_support=0.01 fpgrowth               5.27   5.22    1.23
min_support=0.005 apriori low_memory=0 memory memory  memory
min_support=0.005 apriori low_memory=1 72.71   9.16    5.28
min_support=0.005 fpgrowth              5.50   5.59    1.63

kosarak-100k.dat
                                        master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0   11.62   11.74   1.81
min_support=0.05 apriori low_memory=1   11.57   13.00   3.22
min_support=0.05 fpgrowth               12.56   12.31   2.50
min_support=0.02 apriori low_memory=0   12.81   12.98   2.99
min_support=0.02 apriori low_memory=1   13.78   14.24   4.36
min_support=0.02 fpgrowth               14.22   12.52   2.64
min_support=0.01 apriori low_memory=0   28.24   25.38  15.50
min_support=0.01 apriori low_memory=1   27.03   15.88   6.03
min_support=0.01 fpgrowth               12.95   12.80   2.98
min_support=0.005 apriori low_memory=0 memory  memory  memory
min_support=0.005 apriori low_memory=1 220.58   23.18  11.90
min_support=0.005 fpgrowth              13.93   14.34   3.66

@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

I forgot to mention that 6bd97d4 only improves timings for boolean dataframes. Users should be advised to prefer them over integer dataframes, or there could be an option to disable the input check, which takes most of the time.
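Until such an option exists, a user could sidestep most of the validation cost by converting an integer one-hot dataframe to bool up front (a small illustrative sketch, not mlxtend code):

```python
import numpy as np
import pandas as pd

# An integer 0/1 one-hot frame, as some encoders produce.
rng = np.random.default_rng(42)
df_int = pd.DataFrame(rng.integers(0, 2, size=(1000, 20)))

# Converting once to bool means the input check only has to confirm
# the dtype instead of scanning every value for 0/1.
df_bool = df_int.astype(bool)
print(df_bool.dtypes.unique())  # [dtype('bool')]
```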

@rasbt
Owner

rasbt commented Nov 3, 2019

I agree with you. Maybe we could display a warning if people use integer arrays, to subtly nudge them towards using bool arrays. It could even take the form of a deprecation warning, and then we could phase out integer arrays in the next versions.
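Such a nudge could look roughly like this (purely a sketch of the idea; check_bool_input is a hypothetical helper, not an existing mlxtend function):

```python
import warnings
import numpy as np
import pandas as pd

def check_bool_input(df):
    """Hypothetical helper: warn when the input dataframe is not boolean."""
    if not all(dtype == bool for dtype in df.dtypes):
        warnings.warn(
            "Passing non-boolean dataframes is deprecated; "
            "convert with df.astype(bool) for better performance.",
            DeprecationWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_bool_input(pd.DataFrame(np.ones((3, 3), dtype=int)))  # warns
    check_bool_input(pd.DataFrame(np.ones((3, 3), dtype=bool)))  # silent
```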

@rasbt
Owner

rasbt commented Nov 3, 2019

Was also just running the benchmarks (the 2nd set you posted, kosarak-50k.dat) using the old and the improved version(s).

                                       master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0    5.09    5.14    5.06
min_support=0.05 apriori low_memory=1    4.90    4.87    5.02
min_support=0.05 fpgrowth                5.78    5.32    5.53
min_support=0.02 apriori low_memory=0    6.44    6.05    6.15
min_support=0.02 apriori low_memory=1    6.22    6.28    6.16
min_support=0.02 fpgrowth                5.70    5.85    5.50
min_support=0.01 apriori low_memory=0   19.7    19.8    19.3
min_support=0.01 apriori low_memory=1   19.0    19.4    19.2
min_support=0.01 fpgrowth                5.53    5.90    5.59
min_support=0.005 apriori low_memory=0 205     202     207
min_support=0.005 apriori low_memory=1 199     199     203
min_support=0.005 fpgrowth               6.23    6.00    6.02

Hm, for some reason I don't get the same improvements you got. I was running that on an MBP. I will run the same code on my workstation later (running some other stuff there now) to see what's going on.

@rasbt
Owner

rasbt commented Nov 3, 2019

Please ignore the numbers above ... somehow I had a clobbered environment. The improvements look great -- probably mostly due to the improved input checking, but still, I think we should merge the whole PR. That's really great. Thanks! Could you add a changelog entry though?

PS: my updated benchmark results are shown below

                                               MBP                  Workstation (Ubuntu)
                                       master 3056e4a 6bd97d4   master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0    5.09    5.12    1.04     6.46    6.48    1.28
min_support=0.05 apriori low_memory=1    4.90    6.52    2.19     6.27    7.06    1.86
min_support=0.05 fpgrowth                5.78    5.64    1.29     6.73    6.76    1.54
min_support=0.02 apriori low_memory=0    6.44    6.02    2.01     7.50    7.50    2.30
min_support=0.02 apriori low_memory=1    6.22    7.03    2.70     7.48    7.61    2.41
min_support=0.02 fpgrowth                5.70    5.70    1.42     6.84    6.88    1.69
min_support=0.01 apriori low_memory=0   19.7    19.4    15.1     19.1    19.1    13.9
min_support=0.01 apriori low_memory=1   19.0     8.43    4.13    19.1     8.43    3.24
min_support=0.01 fpgrowth                5.53    5.57    1.53     7.01    7.04    1.84
min_support=0.005 apriori low_memory=0 205     207     204      175     174     169
min_support=0.005 apriori low_memory=1 199      13.0     9.29   175      11.4     6.23
min_support=0.005 fpgrowth               6.23    5.66    1.82     7.50    7.55    2.35

Replace 0/1 by False/True in docstrings of apriori, fpgrowth and fpmax
to promote usage of boolean arrays.
@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

Changelog entry added, please let me know if this is not clear.

About your results: they look consistent with mine. There is a constant difference between 3056e4a and 6bd97d4 which depends only on the input dataframe size, for instance 4.2s on your workstation. The MBP is more volatile, like my laptop.

Between master and 3056e4a, the major difference is with low_memory=1: it runs much faster, except when itemsets are very small.

@rasbt
Owner

rasbt commented Nov 3, 2019

Looks good, thanks! I guess it is good to merge then?

@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

Okay for me.

@rasbt
Owner

rasbt commented Nov 3, 2019

Great! I can open an issue regarding warnings and encouraging users to use boolean-type inputs, and revisit this another day then :)

@rasbt rasbt merged commit 2f928cb into rasbt:master Nov 3, 2019
@dbarbier dbarbier deleted the db/perf branch November 3, 2019 22:12
@rasbt rasbt mentioned this pull request Jan 29, 2020