
let fpgrowth and fpmax work directly with sparse input #622

Merged: 3 commits into rasbt:master on Nov 6, 2019

Conversation

dbarbier (Contributor) commented Nov 6, 2019

Description

If transactions are stored in a sparse matrix, they were previously converted
into a dense NumPy array in setup_fptree before processing.

This commit builds the FPTree directly from sparse input, which
may save memory and processing time. Both fpgrowth and
fpmax benefit. Nothing changes after the FPTree has been built.
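The idea can be sketched as follows (a hypothetical illustration, not the actual mlxtend code): with a scipy.sparse CSR matrix, the items of a transaction are exactly the column indices of the non-zero entries of the corresponding row, and these can be read directly from the `indptr`/`indices` arrays without densifying.

```python
# Hypothetical sketch (not the actual mlxtend implementation): reading
# each transaction's items directly from a CSR matrix, without
# converting it to a dense array first.
import numpy as np
from scipy.sparse import csr_matrix

transactions = csr_matrix(np.array([
    [1, 0, 1, 0],   # transaction 0 contains items 0 and 2
    [0, 1, 1, 1],   # transaction 1 contains items 1, 2 and 3
], dtype=bool))

for i in range(transactions.shape[0]):
    # Column indices of the non-zero entries of row i: these are the
    # items of transaction i, and this is where an FPTree insertion
    # would start from.
    items = transactions.indices[transactions.indptr[i]:transactions.indptr[i + 1]]
    print(i, items.tolist())
```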

Related issues or pull requests

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

coveralls commented Nov 6, 2019

Coverage Status

Coverage increased (+0.01%) to 92.348% when pulling 573882f on dbarbier:db/sparse-fpgrowth into fa643e2 on rasbt:master.

dbarbier (Contributor, Author) commented Nov 6, 2019

I used the following script with sparse=True and sparse=False. Apriori (with the low_memory=True option) is used to check that results are unchanged. Results below are from a single run, just to give a rough idea of processing times.

This PR only modifies the initial creation of the FPTree, so the difference between master and this PR does not depend on min_support (about 11 s in my case). For much lower min_support values the gain is less impressive: for instance, fpgrowth runs in 23 s instead of 34 s with min_support=5e-4.

Benchmark script:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder
from time import time

with open("kosarak-100k.dat", "rt") as f:
    data = f.readlines()

sparse = False

dataset = [list(map(int, f.split())) for f in data]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
if sparse:
    df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
    print("sparsity=", df.sparse.density)
else:
    df = pd.DataFrame(te_ary, columns=te.columns_)
df.columns = ["c" + str(i) for i in df.columns]

def bench(df, min_support, no_crash=False):
    kwargs = {'verbose': 0, 'use_colnames': True, 'max_len': 2}
    if not no_crash:
        tick = time()
        res1 = apriori(df, min_support=min_support, low_memory=False, **kwargs)
        print("min_support={:g} apriori low_memory=0 {:.3g}".format(min_support, time() - tick))
    tick = time()
    res2 = apriori(df, min_support=min_support, low_memory=True, **kwargs)
    print("min_support={:g} apriori low_memory=1 {:.3g}".format(min_support, time() - tick))
    print("result={:d}".format(len(res2)))
    tick = time()
    res3 = fpgrowth(df, min_support=min_support, **kwargs)
    print("min_support={:g} fpgrowth {:.3g}".format(min_support, time() - tick))
    tick = time()
    res4 = fpmax(df, min_support=min_support, **kwargs)
    print("min_support={:g} fpmax {:.3g}".format(min_support, time() - tick))
    # Check that all results are equal
    if not no_crash:
        assert res1.equals(res2)
    # Replace frozensets by sorted lists so that sorting is meaningful
    res2['itemsets'] = res2['itemsets'].apply(sorted)
    res2 = res2.sort_values(by='itemsets').reset_index(drop=True)
    res3['itemsets'] = res3['itemsets'].apply(sorted)
    res3 = res3.sort_values(by='itemsets').reset_index(drop=True)
    assert res2.equals(res3), pd.concat([res2, res3], axis=1)
    # fpmax gives a different output; since the result dataframe is
    # very small, it is printed to manually check results
    # print("fpmax output:", res4)

for min_support in [0.05, 0.02, 0.01]:
    bench(df, min_support, True)
```

With sparse=False, results are similar, as expected.

master (times in seconds):

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 2.4     | 2.5      | 2.5   |
| 0.02        | 2.59    | 2.65     | 2.65  |
| 0.01        | 2.81    | 2.87     | 2.87  |

PR:

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 2.43    | 2.47     | 2.53  |
| 0.02        | 2.73    | 2.67     | 2.63  |
| 0.01        | 2.96    | 2.9      | 2.94  |

With sparse=True, there is a clear improvement. Memory usage is also much lower with this PR.

master (times in seconds):

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 1.36    | 11.7     | 11.7  |
| 0.02        | 0.642   | 11.9     | 11.9  |
| 0.01        | 0.694   | 12.2     | 12.3  |

PR:

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 1.39    | 0.946    | 2.3   |
| 0.02        | 0.622   | 1.05     | 2.39  |
| 0.01        | 0.695   | 1.34     | 2.7   |

In a212806 we assumed that all stored values in the sparse DataFrame are non-null, which should normally be true. To guard against the corner case where it is not, we now call itemsets.eliminate_zeros(). The alternative would be to replace

```python
nonnull = itemsets.indices[itemsets.indptr[i]:itemsets.indptr[i+1]]
```

with

```python
values = itemsets.data[itemsets.indptr[i]:itemsets.indptr[i+1]]
nonnull = itemsets.indices[itemsets.indptr[i] + values.nonzero()[0]]
```

but it is slower.
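The corner case can be reproduced with a small sketch (plain scipy.sparse usage, independent of the mlxtend code): a CSR matrix may carry explicitly stored zeros, and slicing `.indices` alone would report them as items until `eliminate_zeros()` drops them.

```python
# Sketch of the corner case: a CSR matrix can store explicit zeros, so
# slicing .indices alone would wrongly report a zero entry as an item.
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[1, 2, 0]]))
m.data[1] = 0          # turn a stored value into an explicit zero
print(m.indices[m.indptr[0]:m.indptr[1]])  # [0 1] -- column 1 is a false positive

m.eliminate_zeros()    # drop explicitly stored zeros in place
print(m.indices[m.indptr[0]:m.indptr[1]])  # [0] -- only truly non-zero columns remain
```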

Add a test case.
pep8speaks commented Nov 6, 2019

Hello @dbarbier! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-06 21:09:49 UTC

rasbt (Owner) commented Nov 6, 2019

Thanks so much again for another great PR. The improvements are quite substantial with sparse=True. Many thanks!

rasbt merged commit 23296d8 into rasbt:master on Nov 6, 2019
dbarbier deleted the db/sparse-fpgrowth branch on Nov 7, 2019
rasbt mentioned this pull request on Jan 29, 2020