
let fpgrowth and fpmax work directly with sparse input #622

Merged: 3 commits into rasbt:master on Nov 6, 2019

Conversation

dbarbier (Contributor) commented Nov 6, 2019

Description

If transactions are stored in a sparse matrix, they were previously converted
into a dense NumPy array in setup_fptree before processing.

This commit builds the FPTree directly from sparse input, which
may save memory and processing time. Both fpgrowth and
fpmax benefit. Nothing changes after the FPTree has been built.
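The idea can be sketched as follows (a hypothetical illustration, not the actual mlxtend code): with a scipy.sparse CSR matrix, the items of a transaction are exactly the column indices of the non-zero entries of the corresponding row, and these can be read directly from the `indptr`/`indices` arrays without densifying.

```python
# Hypothetical sketch (not the actual mlxtend implementation): reading
# each transaction's items directly from a CSR matrix, without
# converting it to a dense array first.
import numpy as np
from scipy.sparse import csr_matrix

transactions = csr_matrix(np.array([
    [1, 0, 1, 0],   # transaction 0 contains items 0 and 2
    [0, 1, 1, 1],   # transaction 1 contains items 1, 2 and 3
], dtype=bool))

for i in range(transactions.shape[0]):
    # Column indices of the non-zero entries of row i: these are the
    # items of transaction i, and this is where an FPTree insertion
    # would start from.
    items = transactions.indices[transactions.indptr[i]:transactions.indptr[i + 1]]
    print(i, items.tolist())
```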

Related issues or pull requests

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

coveralls commented Nov 6, 2019

Coverage Status

Coverage increased (+0.01%) to 92.348% when pulling 573882f on dbarbier:db/sparse-fpgrowth into fa643e2 on rasbt:master.

dbarbier (Contributor, Author) commented Nov 6, 2019

I used the following script with sparse=True and sparse=False. Apriori (with the low_memory=True option) is used to check that results are unchanged. Results below are from a single run, just to give a rough idea of processing times.

This PR only modifies the initial creation of the FPTree, so the difference between master and this PR does not depend on min_support (about 11 s in my case). For much lower min_support values the gain is less impressive: for instance, fpgrowth runs in 23 s instead of 34 s with min_support=5e-4.

Benchmark script:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder
from time import time

with open("kosarak-100k.dat", "rt") as f:
    data = f.readlines()

sparse = False

dataset = [list(map(int, f.split())) for f in data]

te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
if sparse:
    df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
    print("sparsity=", df.sparse.density)
else:
    df = pd.DataFrame(te_ary, columns=te.columns_)
df.columns = ["c" + str(i) for i in df.columns]

def bench(df, min_support, no_crash=False):
    kwargs = {'verbose': 0, 'use_colnames': True, 'max_len': 2}
    if not no_crash:
        tick = time()
        res1 = apriori(df, min_support=min_support, low_memory=False, **kwargs)
        print("min_support={:g} apriori low_memory=0 {:.3g}".format(min_support, time() - tick))
    tick = time()
    res2 = apriori(df, min_support=min_support, low_memory=True, **kwargs)
    print("min_support={:g} apriori low_memory=1 {:.3g}".format(min_support, time() - tick))
    print("result={:d}".format(len(res2)))
    tick = time()
    res3 = fpgrowth(df, min_support=min_support, **kwargs)
    print("min_support={:g} fpgrowth {:.3g}".format(min_support, time() - tick))
    tick = time()
    res4 = fpmax(df, min_support=min_support, **kwargs)
    print("min_support={:g} fpmax {:.3g}".format(min_support, time() - tick))
    # Check that all results are equal
    if not no_crash:
        assert res1.equals(res2)
    # Replace frozensets by sorted lists so that sorting is meaningful
    res2['itemsets'] = res2['itemsets'].apply(sorted)
    res2 = res2.sort_values(by='itemsets').reset_index(drop=True)
    res3['itemsets'] = res3['itemsets'].apply(sorted)
    res3 = res3.sort_values(by='itemsets').reset_index(drop=True)
    assert res2.equals(res3), pd.concat([res2, res3], axis=1)
    # fpmax gives a different output; since the result dataframe is
    # very small, it is printed to manually check results
    # print("fpmax output:", res4)

for min_support in [0.05, 0.02, 0.01]:
    bench(df, min_support, True)
```

With sparse=False, results are similar, as expected.

master (times in seconds):

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 2.4     | 2.5      | 2.5   |
| 0.02        | 2.59    | 2.65     | 2.65  |
| 0.01        | 2.81    | 2.87     | 2.87  |

PR:

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 2.43    | 2.47     | 2.53  |
| 0.02        | 2.73    | 2.67     | 2.63  |
| 0.01        | 2.96    | 2.9      | 2.94  |

With sparse=True, there is a clear improvement. Memory usage is also much lower with this PR.

master (times in seconds):

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 1.36    | 11.7     | 11.7  |
| 0.02        | 0.642   | 11.9     | 11.9  |
| 0.01        | 0.694   | 12.2     | 12.3  |

PR:

| min_support | apriori | fpgrowth | fpmax |
|-------------|---------|----------|-------|
| 0.05        | 1.39    | 0.946    | 2.3   |
| 0.02        | 0.622   | 1.05     | 2.39  |
| 0.01        | 0.695   | 1.34     | 2.7   |

In a212806 we assumed that all stored values in the sparse DataFrame are non-null, which should normally be true. To guard against the corner case where it is not, we now call itemsets.eliminate_zeros(). The alternative would be to replace

```python
nonnull = itemsets.indices[itemsets.indptr[i]:itemsets.indptr[i+1]]
```

with

```python
values = itemsets.data[itemsets.indptr[i]:itemsets.indptr[i+1]]
nonnull = itemsets.indices[itemsets.indptr[i] + values.nonzero()[0]]
```

but it is slower.
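The corner case can be reproduced with a small sketch (plain scipy.sparse usage, independent of the mlxtend code): a CSR matrix may carry explicitly stored zeros, and slicing `.indices` alone would report them as items until `eliminate_zeros()` drops them.

```python
# Sketch of the corner case: a CSR matrix can store explicit zeros, so
# slicing .indices alone would wrongly report a zero entry as an item.
import numpy as np
from scipy.sparse import csr_matrix

m = csr_matrix(np.array([[1, 2, 0]]))
m.data[1] = 0          # turn a stored value into an explicit zero
print(m.indices[m.indptr[0]:m.indptr[1]])  # [0 1] -- column 1 is a false positive

m.eliminate_zeros()    # drop explicitly stored zeros in place
print(m.indices[m.indptr[0]:m.indptr[1]])  # [0] -- only truly non-zero columns remain
```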

Add a test case.
pep8speaks commented Nov 6, 2019

Hello @dbarbier! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-11-06 21:09:49 UTC

rasbt (Owner) commented Nov 6, 2019

Thanks so much again for another great PR. The improvements are quite substantial with sparse=True. Many thanks!

rasbt merged commit 23296d8 into rasbt:master on Nov 6, 2019
dbarbier deleted the db/sparse-fpgrowth branch on Nov 7, 2019
rasbt mentioned this pull request on Jan 29, 2020