let fpgrowth and fpmax work directly with sparse input #622
Previously, if transactions were stored in a sparse matrix, it was first converted into a dense NumPy array in setup_fptree before processing. This commit allows the FPTree to be built directly from sparse input, which may save memory and processing time. Both fpgrowth and fpmax use this code path. Nothing changes once the FPTree has been built.
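As a rough illustration of what building from sparse input involves, the sketch below reads item supports and per-transaction item lists straight from the CSR arrays, without densifying. It is not the code from this PR: the helper name `ordered_transactions` and the toy data are hypothetical, and it assumes a one-hot matrix with no explicitly stored zeros.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical one-hot data: 4 transactions over 4 items.
dense = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 0],
], dtype=bool)
sparse = csr_matrix(dense)


def ordered_transactions(m, min_support):
    """Return item supports and, per row, the frequent items ordered by
    descending support (the order in which an FP-tree inserts them)."""
    n_rows, n_cols = m.shape
    # Every stored entry counts as one occurrence of its column's item
    # (assumption: the matrix holds no explicitly stored zeros).
    counts = np.bincount(m.indices, minlength=n_cols)
    support = counts / n_rows
    rows = []
    for i in range(n_rows):
        # Column indices of the stored entries of row i.
        items = m.indices[m.indptr[i]:m.indptr[i + 1]]
        frequent = [int(j) for j in items if support[j] >= min_support]
        # Most frequent first, ties broken by column index.
        frequent.sort(key=lambda j: (-support[j], j))
        rows.append(frequent)
    return support, rows


support, rows = ordered_transactions(sparse, min_support=0.5)
```

Only the stored entries are visited, so the work scales with the number of non-zero cells rather than with rows times columns.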
I used the following script with
This PR only modifies the initial creation of the FPTree, so the difference between master and this PR does not depend on
Benchmark script:
import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, fpmax
from mlxtend.preprocessing import TransactionEncoder
from time import time
with open("kosarak-100k.dat", "rt") as f:
    data = f.readlines()
sparse = False
dataset = [list(map(int, f.split())) for f in data]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
if sparse:
    df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
    print("sparsity=", df.sparse.density)
else:
    df = pd.DataFrame(te_ary, columns=te.columns_)
df.columns = ["c"+str(i) for i in df.columns]
def bench(df, min_support, no_crash=False):
    kwargs = {'verbose': 0, 'use_colnames': True, 'max_len': 2}
    if not no_crash:
        tick = time()
        res1 = apriori(df, min_support=min_support, low_memory=False, **kwargs)
        print("min_support={:g} apriori low_memory=0 {:.3g}".format(min_support, time() - tick))
    tick = time()
    res2 = apriori(df, min_support=min_support, low_memory=True, **kwargs)
    print("min_support={:g} apriori low_memory=1 {:.3g}".format(min_support, time() - tick))
    print("result={:d}".format(len(res2)))
    tick = time()
    res3 = fpgrowth(df, min_support=min_support, **kwargs)
    print("min_support={:g} fpgrowth {:.3g}".format(min_support, time() - tick))
    tick = time()
    res4 = fpmax(df, min_support=min_support, **kwargs)
    print("min_support={:g} fpmax {:.3g}".format(min_support, time() - tick))
    tick = time()
    # check that all results are equal
    if not no_crash:
        assert res1.equals(res2)
    # Replace frozenset by a sorted list so that sort is meaningful
    res2['itemsets'] = res2['itemsets'].apply(sorted).sort_values()
    res2 = res2.sort_values(by='itemsets').reset_index(drop=True)
    res3['itemsets'] = res3['itemsets'].apply(sorted).sort_values()
    res3 = res3.sort_values(by='itemsets').reset_index(drop=True)
    assert res2.equals(res3), pd.concat([res2, res3], axis=1)
    # fpmax gives a different output; since result dataframe is
    # very small, it is printed to manually check results
    #print("fpmax output:", res4)
for min_support in [0.05, 0.02, 0.01]:
    bench(df, min_support, True)
[Benchmark results: with master vs. with this PR]
In a212806 we assumed that all values stored in a sparse DataFrame are non-null. This should always be true, but a sparse matrix can also hold explicitly stored zeros. To handle this corner case, we now call itemsets.eliminate_zeros(). The alternative would be to replace
nonnull = itemsets.indices[itemsets.indptr[i]:itemsets.indptr[i+1]]
by
values = itemsets.data[itemsets.indptr[i]:itemsets.indptr[i+1]]
nonnull = itemsets.indices[itemsets.indptr[i] + values.nonzero()[0]]
but it is slower. Add a test case.
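The corner case can be reproduced on a small matrix. The sketch below assumes itemsets is a SciPy CSR matrix and uses hypothetical toy data and helper names (`row_items`, `row_items_masked`); it compares plain index slicing, the masked alternative quoted above, and slicing after eliminate_zeros().

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 2x4 matrix in CSR form. Row 0 stores three entries,
# one of which (column 2) is an explicitly stored zero.
data = np.array([1, 0, 1, 1])
indices = np.array([0, 2, 3, 1])
indptr = np.array([0, 3, 4])
itemsets = csr_matrix((data, indices, indptr), shape=(2, 4))


def row_items(m, i):
    """Column indices of all stored entries in row i (zeros included)."""
    return m.indices[m.indptr[i]:m.indptr[i + 1]]


def row_items_masked(m, i):
    """Column indices of the stored entries in row i whose value is non-zero."""
    values = m.data[m.indptr[i]:m.indptr[i + 1]]
    return m.indices[m.indptr[i] + values.nonzero()[0]]


naive_row0 = row_items(itemsets, 0).tolist()          # picks up column 2
masked_row0 = row_items_masked(itemsets, 0).tolist()  # filters it per row

# Drop explicitly stored zeros once, in place; plain slicing is then safe.
itemsets.eliminate_zeros()
clean_row0 = row_items(itemsets, 0).tolist()
```

Calling eliminate_zeros() once up front keeps the per-row loop down to a plain slice, whereas the masked variant adds a nonzero() call and an extra indexing step for every row.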
Thanks so much again for another great PR. The improvements are quite substantial with
Related issues or pull requests
Pull Request Checklist
./docs/sources/CHANGELOG.md file (if applicable)
./mlxtend/*/tests directories (if applicable)
mlxtend/docs/sources/ (if applicable)
PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
flake8 ./mlxtend