[WIP] Implement apriori-gen as in original paper #646
Conversation
Grrr, the UI for draft PRs is terrible; I forgot to change the PR type to draft :-/
Force-pushed from 9266c65 to 16a2136
Here are some benchmark results; data must first be downloaded from http://fimi.uantwerpen.be/data/ and put inside a `data` directory. Benchmark script:

```python
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd
import numpy as np
import gzip
import os
from time import time
import signal
from contextlib import contextmanager


@contextmanager
def timeout(seconds):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``seconds``.
    signal.alarm(seconds)
    try:
        yield
    except TimeoutError:
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)


def raise_timeout(signum, frame):
    raise TimeoutError


files = [
    # "chess.dat.gz",
    "connect.dat.gz",
    "mushroom.dat.gz",
    "pumsb.dat.gz",
    "pumsb_star.dat.gz",
    # "T10I4D100K.dat.gz",  these 3 files are too large
    # "T40I10D100K.dat.gz",
    # "kosarak.dat.gz",
]

# Modify these 2 variables
sparse = False
low_memory = True

for filename in files:
    opener = gzip.open if filename.endswith(".gz") else open
    with opener(os.path.join("data", filename)) as f:
        data = f.readlines()
    dataset = [list(map(int, line.split())) for line in data]
    items = np.unique([item for itemset in dataset for item in itemset])
    print(f"{filename} contains {len(dataset)} transactions and {len(items)} items")
    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
    if sparse:
        try:
            df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
        except AttributeError:
            # pandas < 0.25
            df = pd.SparseDataFrame(te_ary, columns=te.columns_,
                                    default_fill_value=False)
    else:
        df = pd.DataFrame(te_ary, columns=te.columns_)
    df.columns = ["c" + str(i) for i in df.columns]
    for min_support in [0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005]:
        tick = time()
        with timeout(120):
            print(apriori(df, min_support=min_support, verbose=1,
                          use_colnames=False, low_memory=low_memory))
        print(f"\nmin_support={min_support} time: {time() - tick}\n")
```

Some commits have either an asterisk (*) or a plus sign (+), as well as (F) or (T).
With a dense DataFrame:

- connect.dat.gz: 67557 transactions and 129 items
- mushroom.dat.gz: 8124 transactions and 119 items
- pumsb.dat.gz: 49046 transactions and 2113 items
- pumsb_star.dat.gz: 49046 transactions and 2088 items

With a sparse DataFrame:

- connect.dat.gz: 67557 transactions and 129 items
- mushroom.dat.gz: 8124 transactions and 119 items
- pumsb.dat.gz: 49046 transactions and 2113 items
- pumsb_star.dat.gz: 49046 transactions and 2088 items
More tests should be run, maybe with kosarak.dat; once the best option is decided, I will remove/squash/rearrange commits and update docstrings. IMHO the best option is the current head.
Okay, I ran some tests with kosarak.dat and `df = pd.DataFrame({col: df[col] for col in df.columns})`; (F) and (T) still refer to the same variants as above.

- first 1k lines of kosarak.dat.gz: 1000 transactions and 3259 items, sparse=False
- first 10k lines of kosarak.dat.gz: 10000 transactions and 10094 items, sparse=False
- first 50k lines of kosarak.dat.gz: 50000 transactions and 18936 items, sparse=False
- first 100k lines of kosarak.dat.gz: 100000 transactions and 23496 items, sparse=False

With all these results, it is not clear to me whether 1fabe25 is such a good idea; an alternative is to make users aware of the importance of storage order, so they can check which one is best adapted to their use case.
Sorry, I have been busy with grading and was then traveling over Xmas. The work in this PR is amazing, thanks a lot! Regarding the row vs. column format, as you suggest, maybe it would be better to note this prominently in the docstring but have users decide what type of format they provide via the DataFrame.
I just saw the unit tests failing. I don't know why this happened on AppVeyor, but it seems that switching installations from pip to conda helped. Maybe something changed in the backend. Regarding Travis CI, some discrepancies occurred after things got switched to scikit-learn 0.22 in the "latest" version. I addressed this as well, in #652. I can do a rebase of this PR if you like, or you can do it yourself if you prefer -- wanted to ask before I start messing around here ;)
Force-pushed from f75d5e8 to 5a13f09
I just rebased but won't be able to take care of failures for the next 24h; feel free to push fixes.
All good. Based on the error logs, these are "natural" failures due to the WIP status, not issues with the CI.
Force-pushed from 5a13f09 to a00ee50
There were indeed some bugs; because of these fixes, timings may be slightly different, so I will rerun benchmarks in a few days.
Force-pushed from bcc64c5 to 4c43f5c
The apriori-gen function described in section 2.1.1 of the Apriori paper has two steps. First, the join step looks for itemsets with the same prefix and creates new candidates by appending all pair combinations to this prefix. Here is the pseudocode copied from the paper:

```
select p.1, p.2, ..., p.k-1, q.k-1
from p in L(k-1), q in L(k-1)
where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1
```

The rationale is that if a sequence q with the same prefix as p does not belong to L(k-1), the itemset p + (q.k-1,) cannot be frequent. Before this commit, we were considering p + (q.k-1,) for any q.k-1 > p.k-1. The second step of the apriori-gen function, called the prune step, will be implemented in a distinct commit. See discussion in rasbt#644.
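For illustration, here is a minimal Python sketch of this join step; the function name `join_step` and the representation of L(k-1) as a lexicographically sorted list of tuples are assumptions for the sketch, not the PR's actual code:

```python
def join_step(L_prev):
    """Sketch of the apriori-gen join step.

    ``L_prev`` is a lexicographically sorted list of (k-1)-tuples of
    frequent itemsets; returns the candidate k-tuples.
    """
    candidates = []
    for i, p in enumerate(L_prev):
        for q in L_prev[i + 1:]:
            # In sorted order, all itemsets sharing p's (k-2)-prefix are
            # contiguous, so we can stop at the first mismatch.
            if p[:-1] != q[:-1]:
                break
            # Prefixes match and the list is sorted, hence p[-1] < q[-1].
            candidates.append(p + (q[-1],))
    return candidates


# Example from the paper: L3 = {123, 124, 134, 135, 234}
print(join_step([(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]))
# -> [(1, 2, 3, 4), (1, 3, 4, 5)]
```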
What do you think about adding the benchmark script(s) to the repository? It may be useful to have this as a reference in case of future modifications to the codebase.
Sorry, not sure why this PR was closed. I must have hit some keyboard combination -- never happened before. This was certainly not intentional.
The apriori-gen function described in section 2.1.1 of the Apriori paper has two steps; the first step was implemented in the previous commit. The second step, called the prune step, takes candidates c from the first step and checks that every (k-1)-tuple built by removing any single element from c is in L(k-1). As NumPy arrays are not hashable, we cannot use set() for itemset lookup, so we define a very simple prefix-tree class.
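Again as an illustration only, a sketch of the prune step; unlike the commit, which adds a small prefix-tree class, this sketch gets away with a plain set() because it represents itemsets as hashable tuples:

```python
def prune_step(candidates, L_prev):
    """Sketch of the apriori-gen prune step.

    Keep candidate ``c`` only if every (k-1)-subset obtained by
    dropping a single element of ``c`` is itself frequent, i.e. is
    in ``L_prev``.
    """
    frequent = set(L_prev)  # tuples are hashable; NumPy arrays are not
    return [
        c for c in candidates
        if all(c[:i] + c[i + 1:] in frequent for i in range(len(c)))
    ]


# Continuing the paper's example: (1, 3, 4, 5) is pruned because the
# subset (1, 4, 5) is not in L3.
L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(prune_step([(1, 2, 3, 4), (1, 3, 4, 5)], L3))
# -> [(1, 2, 3, 4)]
```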
Force-pushed from 4c43f5c to b731fd2
Thanks to previous optimizations, processing with low_memory=True is now as efficient as with low_memory=False, and allows processing much larger datasets. Removing the low_memory=False processing path makes the code simpler. The downside is that we do not know in advance the number of itemsets to process, so it is displayed afterwards; we now display the number of itemsets after the prune step. Note that commit 2f928cb introduced a bug: the number of processed combinations was multiplied by the itemset's length. Since vectorized operations are no longer performed on frequent itemsets, they are stored as lists of tuples.
This is now possible because tuples are hashable.
For unknown reasons, np.sum is slow on a very large boolean array.
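As an aside, here is a small hypothetical micro-benchmark to reproduce this observation; np.count_nonzero is a commonly used faster alternative on boolean input, though exact timings are machine-dependent:

```python
import numpy as np
from timeit import timeit

a = np.random.rand(50_000_000) < 0.5  # large boolean array

# np.sum promotes the booleans to an integer accumulator, whereas
# np.count_nonzero only counts the set entries.
print("np.sum:          ", timeit(lambda: a.sum(), number=10))
print("np.count_nonzero:", timeit(lambda: np.count_nonzero(a), number=10))
```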
Force-pushed from b731fd2 to eb80667
I rearranged the commits; they look good now IMHO. About the benchmark script, I do not know how to do that: there are many parameters (data files, sparse=True/False, column_major=True/False, and the list of min_support values, which may depend on the data files). Anyway, it has been committed. Should data files be copied into [...]?
Force-pushed from 58d34a2 to f228762
Here are more benchmarks. In these tables:

- T10I4D100K.dat.gz: 100000 transactions and 870 items, low_memory=True
- T40I10D100K.dat.gz: 100000 transactions and 942 items, low_memory=True (note that the timeout had been expanded to 300s when running commit eb80667, which explains values above 120s)
- kosarak-*k.dat.gz: low_memory=True
Some remarks: [...]
This is a work in progress.
Force-pushed from f228762 to 09e6e2f
Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submission deadlines in mid-Jan. In any case, I am really thankful for all the good work you put into this. This is really awesome. And I can take care of the automatic data downloads from here. Regarding the storage-order question: maybe that's something we could add to the apriori docs, i.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then.)
I am not an expert at this by any means, but I think sparse DataFrames are usually only memory efficient, not necessarily efficient when it comes to processing times.
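For reference, a small sketch of that trade-off (hypothetical setup, not from this PR; the numbers vary with the pandas version and the data density):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from timeit import timeit

values = np.random.rand(100_000, 100) < 0.01  # ~1% density
dense = pd.DataFrame(values)
sparse = pd.DataFrame.sparse.from_spmatrix(csr_matrix(values))

# Sparse wins on memory at this density...
print(dense.memory_usage(deep=True).sum())
print(sparse.memory_usage(deep=True).sum())

# ...but is not necessarily faster to process.
print(timeit(lambda: dense.sum().sum(), number=3))
print(timeit(lambda: sparse.sum().sum(), number=3))
```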
No worries, this issue is not trivial and requires careful thinking, please take your time. Other apriori implementations take as input a list of transactions; there are several optimizations which can then be performed: [...]

Here it is very likely that the user loaded her dataset as a list of transactions and called [...]. If the former, the optimizations mentioned above cannot be performed and this PR is almost done. If the latter, a lot more work is needed. [...]
Sorry, I do not understand your point; a dense pandas DataFrame can use either row-major or column-major storage, and it looks like this depends on which DataFrame constructor was called (2d array vs. dict). We could indeed add a cell to show how to convert the input DataFrame to speed up the apriori function.
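A minimal sketch of what such a cell could look like, assuming the dict-of-columns conversion used in the benchmarks above; this is an illustration, not the PR's code:

```python
import numpy as np
import pandas as pd

values = np.random.rand(1000, 50) < 0.5

# Built from a 2d array: pandas keeps the values in one consolidated
# block, whose memory layout follows the input array.
df_from_array = pd.DataFrame(values)

# Built from a dict: each column starts out as its own 1d array, which
# can favor column-wise access patterns.
df_from_dict = pd.DataFrame({j: values[:, j] for j in range(values.shape[1])})

# Converting an existing DataFrame via a dict comprehension, as in the
# benchmarks above:
df_converted = pd.DataFrame({col: df_from_array[col]
                             for col in df_from_array.columns})
```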
Sorry, still haven't had time to look into this more. Along with the new semester (lots of teaching) & 2 paper deadlines in January, there wasn't time for much else recently. I am currently making a 0.17.1 bugfix release with the recent changes -- someone from industry contacted me about this, because due to company firewalls, several people can only install it from PyPI (not from GitHub directly). Will revisit this PR soon though -- thanks for all the work on it so far!
Description
Implement apriori-gen as in the original Apriori paper.
This is a draft PR for discussion; different changes are proposed, and they should be benchmarked:

- Always use `low_memory=True` processing; thanks to previous optimizations, it is now as fast as `low_memory=False` and requires less memory
- Frequent itemsets are stored as lists of tuples instead of NumPy arrays
- Reworked the `_support` function, which was really slow on some test cases

Related issues or pull requests
Reported in #644.
Pull Request Checklist

- [ ] Added a note about the modification or contribution to the `./docs/sources/CHANGELOG.md` file (if applicable)
- [ ] Added appropriate unit test functions in the `./mlxtend/*/tests` directories (if applicable)
- [ ] Modified documentation in the appropriate location under `mlxtend/docs/sources/` (if applicable)
- [ ] Ran `PYTHONPATH='.' pytest ./mlxtend -sv` and made sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv`)
- [ ] Checked for style issues by running `flake8 ./mlxtend`
Checklist empty for now since this is a draft pull request.