
Improve performance of apriori #619

Merged
rasbt merged 6 commits into rasbt:master from dbarbier:db/perf on Nov 3, 2019

Conversation

@dbarbier
Contributor

dbarbier commented Nov 2, 2019

Description

This PR improves performance of apriori for medium to large datasets, in particular with low_memory=True. Situation is less clear for very small datasets.

Related issues or pull requests

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

old_combination is sorted, thus its max() is its last element.
Since items_types_in_previous_step is a NumPy array, we can find all
valid elements with a single call, which shortens the inner loop.
Let generate_new_combinations return ints instead of tuples,
and collect them with np.fromiter.
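The idea behind this commit can be sketched roughly as follows (a minimal illustration with made-up data; the variable names only loosely mirror mlxtend's internals):

```python
import numpy as np

# Combinations are kept sorted, so the maximum is the last element:
# no scan through the tuple is needed.
old_combination = (1, 4, 7)
items_in_previous_step = np.array([0, 2, 3, 5, 7, 8, 9])

max_item = old_combination[-1]  # == max(old_combination)

# One vectorized comparison selects all valid extension candidates,
# replacing a Python-level inner loop over every item.
valid_items = items_in_previous_step[items_in_previous_step > max_item]

# Yielding flat ints instead of tuples lets the caller collect
# results cheaply with np.fromiter.
def new_items():
    for item in valid_items:
        yield item

collected = np.fromiter(new_items(), dtype=int)
print(collected)  # [8 9]
```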

This is slower with low_memory=True; that will be fixed by the next commit.
The verbose output has to be modified, since we now loop over valid
combinations only.

Performance is now equivalent to or better than the version with low_memory=False.

Adjust test_fpbase.py output.
@dbarbier
Contributor Author

dbarbier commented Nov 2, 2019

Here is the benchmark script I used to compare master and this branch:

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from time import time

np.random.seed(42)
x = np.random.randint(0, high=5, size=100_000)
df = pd.DataFrame((x/4).astype(bool).reshape(-1, 100))
ds = df.to_sparse()

def bench(df, min_support, low_memory):
    tick = time()
    res = apriori(df, min_support=min_support, verbose=0, use_colnames=False, low_memory=low_memory)
    print("min_support=", min_support, "low_memory=", low_memory, "sparse=", isinstance(df, pd.core.sparse.frame.SparseDataFrame), time() - tick)
    return res

for min_support in [0.8, 0.7, 0.6, 0.5]:
    for d in [df, ds]:
        for low_memory in [True, False]:
            if low_memory:
                low = bench(d, min_support, low_memory)
            else:
                assert low.equals(bench(d, min_support, low_memory))

Results:

dataframe 40x10                 master (fe0e22a)  db/perf (3056e4a2)
  
min_support= 0.8 low_memory=1 sparse=0   0.001493   0.001509
min_support= 0.8 low_memory=0 sparse=0   0.001202   0.001275
min_support= 0.8 low_memory=1 sparse=1   0.007267   0.007668
min_support= 0.8 low_memory=0 sparse=1   0.003154   0.003353

min_support= 0.7 low_memory=1 sparse=0   0.002396   0.002653
min_support= 0.7 low_memory=0 sparse=0   0.002384   0.002260
min_support= 0.7 low_memory=1 sparse=1   0.021657   0.014279
min_support= 0.7 low_memory=0 sparse=1   0.004684   0.005507

min_support= 0.6 low_memory=1 sparse=0   0.003882   0.003697
min_support= 0.6 low_memory=0 sparse=0   0.003035   0.003104
min_support= 0.6 low_memory=1 sparse=1   0.039360   0.030811
min_support= 0.6 low_memory=0 sparse=1   0.005814   0.006533

min_support= 0.5 low_memory=1 sparse=0   0.005517   0.006016
min_support= 0.5 low_memory=0 sparse=0   0.003736   0.004259
min_support= 0.5 low_memory=1 sparse=1   0.074175   0.075586
min_support= 0.5 low_memory=0 sparse=1   0.007868   0.007929

dataframe 40x100                master (fe0e22a)  db/perf (3056e4a2)

min_support= 0.8 low_memory=1 sparse=0    0.0334    0.0057
min_support= 0.8 low_memory=0 sparse=0    0.0076    0.0052
min_support= 0.8 low_memory=1 sparse=1    0.9123    0.1024
min_support= 0.8 low_memory=0 sparse=1    0.0197    0.0167

min_support= 0.7 low_memory=1 sparse=0    0.7691    0.0574
min_support= 0.7 low_memory=0 sparse=0    0.1541    0.0525
min_support= 0.7 low_memory=1 sparse=1   23.1043    1.3285
min_support= 0.7 low_memory=0 sparse=1    0.2057    0.1165

min_support= 0.6 low_memory=1 sparse=0   15.2329    1.0777
min_support= 0.6 low_memory=0 sparse=0    3.0321    0.9738
min_support= 0.6 low_memory=1 sparse=1  443.7326   27.6753
min_support= 0.6 low_memory=0 sparse=1    4.3635    2.4218

min_support= 0.5 low_memory=1 sparse=0  394.4081   32.4573
min_support= 0.5 low_memory=0 sparse=0   83.9022   28.8984
min_support= 0.5 low_memory=1 sparse=1 unfinished 834.6909
min_support= 0.5 low_memory=0 sparse=1  130.8865   78.8501

dataframe 1000x100              master (fe0e22a)  db/perf (3056e4a2)

min_support= 0.8 low_memory=1 sparse=0      0.02191      0.00567
min_support= 0.8 low_memory=0 sparse=0      0.00732      0.00714
min_support= 0.8 low_memory=1 sparse=1      0.46083      0.06502
min_support= 0.8 low_memory=0 sparse=1      0.03289      0.03401

min_support= 0.7 low_memory=1 sparse=0      0.08458      0.01312
min_support= 0.7 low_memory=0 sparse=0      0.02404      0.02052
min_support= 0.7 low_memory=1 sparse=1      1.90439      0.11286
min_support= 0.7 low_memory=0 sparse=1      0.09701      0.09027

min_support= 0.6 low_memory=1 sparse=0      2.84277      0.34374
min_support= 0.6 low_memory=0 sparse=0      0.87047      0.64033
min_support= 0.6 low_memory=1 sparse=1     66.11381      4.17402
min_support= 0.6 low_memory=0 sparse=1      3.75937      3.54072

min_support= 0.5 low_memory=1 sparse=0     58.83895      7.19620
min_support= 0.5 low_memory=0 sparse=0 memory error memory error
min_support= 0.5 low_memory=1 sparse=1   unfinished    100.69009
min_support= 0.5 low_memory=0 sparse=1 memory error memory error

Results with the smallest dataset are not really relevant IMO: times are too small, and it is unlikely that one would use a sparse matrix or the low_memory option there. They are included anyway to show that there is no dramatic regression.

@coveralls

coveralls commented Nov 2, 2019

Coverage Status

Coverage increased (+0.007%) to 92.354% when pulling a59cceb on dbarbier:db/perf into fe0e22a on rasbt:master.

@rasbt
Owner

rasbt commented Nov 3, 2019

Thanks for the PR. The benchmarks look pretty convincing. And I agree with you: it is unlikely that someone would apply this to small data frames, and even then the performance differences are negligible, because either version finishes in negligible time.

If all columns are boolean, there is nothing to check.

In apriori.py, call valid_input_check.
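The fast path this commit describes might look roughly like this (a hedged sketch only; the actual signature and body of valid_input_check in apriori.py may differ):

```python
import numpy as np
import pandas as pd

def valid_input_check(df):
    # Sketch of the idea above: if every column is already boolean,
    # there is nothing to check and we can return immediately.
    if all(dtype == bool for dtype in df.dtypes):
        return
    # Otherwise fall back to scanning every value for 0/1 validity,
    # which is the expensive part for large integer dataframes.
    if not np.isin(df.values, (0, 1)).all():
        raise ValueError("DataFrame must contain only 0/1 or boolean values")

valid_input_check(pd.DataFrame(np.eye(3, dtype=bool)))  # fast path, no scan
valid_input_check(pd.DataFrame(np.eye(3, dtype=int)))   # falls back to the scan
```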
@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

I found #549 when looking for other benchmarks, and ran the same tests with this script:

import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder   
from time import time

with open("kosarak-50k.dat", "rt") as f:
    data = f.readlines() 

dataset = [list(map(int, f.split())) for f in data]    

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_) 

def bench(df, min_support, no_crash=False):
    if not no_crash:
        tick = time()
        res1 = apriori(df, min_support=min_support, verbose=0, use_colnames=False, low_memory=False)
        print("min_support={:g} apriori low_memory=0 {:.3g}".format(min_support, time() - tick))
    tick = time()
    res2 = apriori(df, min_support=min_support, verbose=0, use_colnames=False, low_memory=True)
    print("min_support={:g} apriori low_memory=1 {:.3g}".format(min_support, time() - tick))
    tick = time()
    res3 = fpgrowth(df, min_support=min_support, verbose=0, use_colnames=False)
    print("min_support={:g} fpgrowth {:.3g}".format(min_support, time() - tick))
    if not no_crash:
        assert res1.equals(res2)
    # Replace frozensets by sorted lists so that sorting by itemsets is meaningful
    res2['itemsets'] = res2['itemsets'].apply(sorted)
    res2 = res2.sort_values(by='itemsets').reset_index(drop=True)
    res3['itemsets'] = res3['itemsets'].apply(sorted)
    res3 = res3.sort_values(by='itemsets').reset_index(drop=True)
    assert res2.equals(res3), pd.concat([res2, res3], axis=1)

for min_support in [0.05, 0.02, 0.01, 0.005]:
    bench(df, min_support)

Here timings are made on the whole functions. I just pushed a commit to improve checks on input dataframe. Results are:

kosarak-50k.dat
                                       master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0   4.66   4.66    0.74
min_support=0.05 apriori low_memory=1   4.64   5.28    1.34
min_support=0.05 fpgrowth               4.96   4.92    1.07
min_support=0.02 apriori low_memory=0   5.08   5.10    1.16
min_support=0.02 apriori low_memory=1   5.04   5.67    1.76
min_support=0.02 fpgrowth               4.98   5.02    1.09
min_support=0.01 apriori low_memory=0   9.60  10.03    5.84
min_support=0.01 apriori low_memory=1   9.60   6.41    2.46
min_support=0.01 fpgrowth               5.27   5.22    1.23
min_support=0.005 apriori low_memory=0 memory memory  memory
min_support=0.005 apriori low_memory=1 72.71   9.16    5.28
min_support=0.005 fpgrowth              5.50   5.59    1.63

kosarak-100k.dat
                                        master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0   11.62   11.74   1.81
min_support=0.05 apriori low_memory=1   11.57   13.00   3.22
min_support=0.05 fpgrowth               12.56   12.31   2.50
min_support=0.02 apriori low_memory=0   12.81   12.98   2.99
min_support=0.02 apriori low_memory=1   13.78   14.24   4.36
min_support=0.02 fpgrowth               14.22   12.52   2.64
min_support=0.01 apriori low_memory=0   28.24   25.38  15.50
min_support=0.01 apriori low_memory=1   27.03   15.88   6.03
min_support=0.01 fpgrowth               12.95   12.80   2.98
min_support=0.005 apriori low_memory=0 memory  memory  memory
min_support=0.005 apriori low_memory=1 220.58   23.18  11.90
min_support=0.005 fpgrowth              13.93   14.34   3.66

@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

I forgot to mention that 6bd97d4 only improves timings for boolean dataframes. Users should be advised to prefer them over integer dataframes, or there could be an option to disable the input check, which takes most of the time.
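Until such an option exists, a user could sidestep most of the validation cost by converting an integer one-hot dataframe to bool up front (a small illustrative sketch, not mlxtend code):

```python
import numpy as np
import pandas as pd

# An integer 0/1 one-hot frame, as some encoders produce.
rng = np.random.default_rng(42)
df_int = pd.DataFrame(rng.integers(0, 2, size=(1000, 20)))

# Converting once to bool means the input check only has to confirm
# the dtype instead of scanning every value for 0/1.
df_bool = df_int.astype(bool)
print(df_bool.dtypes.unique())  # [dtype('bool')]
```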

@rasbt
Owner

rasbt commented Nov 3, 2019

I agree with you. Maybe we could display a warning if people use integer arrays, to subtly nudge them towards using bool arrays. It could even take the form of a deprecation warning, and then we could phase out integer arrays in the next versions.
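Such a nudge could look roughly like this (purely a sketch of the idea; check_bool_input is a hypothetical helper, not an existing mlxtend function):

```python
import warnings
import numpy as np
import pandas as pd

def check_bool_input(df):
    """Hypothetical helper: warn when the input dataframe is not boolean."""
    if not all(dtype == bool for dtype in df.dtypes):
        warnings.warn(
            "Passing non-boolean dataframes is deprecated; "
            "convert with df.astype(bool) for better performance.",
            DeprecationWarning,
        )

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_bool_input(pd.DataFrame(np.ones((3, 3), dtype=int)))  # warns
    check_bool_input(pd.DataFrame(np.ones((3, 3), dtype=bool)))  # silent
```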

@rasbt
Owner

rasbt commented Nov 3, 2019

Was also just running the benchmarks (the 2nd set you posted, kosarak-50k.dat) using the old and the improved version(s).

                                       master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0    5.09    5.14    5.06
min_support=0.05 apriori low_memory=1    4.90    4.87    5.02
min_support=0.05 fpgrowth                5.78    5.32    5.53
min_support=0.02 apriori low_memory=0    6.44    6.05    6.15
min_support=0.02 apriori low_memory=1    6.22    6.28    6.16
min_support=0.02 fpgrowth                5.70    5.85    5.50
min_support=0.01 apriori low_memory=0   19.7    19.8    19.3
min_support=0.01 apriori low_memory=1   19.0    19.4    19.2
min_support=0.01 fpgrowth                5.53    5.90    5.59
min_support=0.005 apriori low_memory=0 205     202     207
min_support=0.005 apriori low_memory=1 199     199     203
min_support=0.005 fpgrowth               6.23    6.00    6.02

Hm, for some reason I don't get the same improvements you got. I was running that on an MBP. I will run the same code on my workstation later (running some other stuff there now) to see what's going on.

@rasbt
Owner

rasbt commented Nov 3, 2019

Please ignore the numbers above ... somehow I had a clobbered environment. The improvements look great -- probably mostly due to the improved input checking, but still, I think we should merge the whole PR. That's really great. Thanks! Could you add a changelog entry though?

PS: my updated benchmark results are shown below

                                               MBP                  Workstation (Ubuntu)
                                       master 3056e4a 6bd97d4   master 3056e4a 6bd97d4
min_support=0.05 apriori low_memory=0    5.09    5.12    1.04     6.46    6.48    1.28
min_support=0.05 apriori low_memory=1    4.90    6.52    2.19     6.27    7.06    1.86
min_support=0.05 fpgrowth                5.78    5.64    1.29     6.73    6.76    1.54
min_support=0.02 apriori low_memory=0    6.44    6.02    2.01     7.50    7.50    2.30
min_support=0.02 apriori low_memory=1    6.22    7.03    2.70     7.48    7.61    2.41
min_support=0.02 fpgrowth                5.70    5.70    1.42     6.84    6.88    1.69
min_support=0.01 apriori low_memory=0   19.7    19.4    15.1     19.1    19.1    13.9
min_support=0.01 apriori low_memory=1   19.0     8.43    4.13    19.1     8.43    3.24
min_support=0.01 fpgrowth                5.53    5.57    1.53     7.01    7.04    1.84
min_support=0.005 apriori low_memory=0 205     207     204      175     174     169
min_support=0.005 apriori low_memory=1 199      13.0     9.29   175      11.4     6.23
min_support=0.005 fpgrowth               6.23    5.66    1.82     7.50    7.55    2.35

Replace 0/1 by False/True in docstrings of apriori, fpgrowth and fpmax
to promote usage of boolean arrays.
@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

Changelog entry added, please let me know if this is not clear.

About your results: they look consistent with mine. There is a constant difference between 3056e4a and 6bd97d4 which depends only on the input dataframe size, for instance 4.2s on your workstation. The MBP is more volatile, like my laptop.

Between master and 3056e4a, the major difference is with low_memory=1: it runs much faster, except when itemsets are very small.

@rasbt
Owner

rasbt commented Nov 3, 2019

Looks good, thanks! I guess it is good to merge then?

@dbarbier
Contributor Author

dbarbier commented Nov 3, 2019

Okay for me.

@rasbt
Owner

rasbt commented Nov 3, 2019

Great! I can open an issue regarding warnings and encouraging users to use boolean-type inputs, and revisit this another day then :)

@rasbt rasbt merged commit 2f928cb into rasbt:master Nov 3, 2019
@dbarbier dbarbier deleted the db/perf branch November 3, 2019 22:12
@rasbt rasbt mentioned this pull request Jan 29, 2020