Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Implement apriori-gen as in original paper #646

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

dbarbier
Copy link
Contributor

@dbarbier dbarbier commented Dec 18, 2019

Description

Implement apriori-gen as in original Apriori paper.
This is a draft PR for discussion; different changes are proposed, they should be benchmarked.

  • First commit implements join step
  • Second commit implements prune step
  • Third commit enforces low_memory=True processing; thanks to previous optimizations, it is now as fast as low_memory=False and requires less memory; frequent itemsets are stored as list of tuples instead of Numpy arrays
  • Fourth commit replaces trie implementation by set
  • Fifth commit replaces _support function, which was really slow on some test cases

Related issues or pull requests

Reported in #644.

Pull Request Checklist

  • Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • Checked for style issues by running flake8 ./mlxtend

Checklist empty for now since this is a draft pull request.

@dbarbier
Copy link
Contributor Author

Grrr, UI for drafts PR is terrible, I forgot to change PR type to draft :-/

@dbarbier dbarbier changed the title Implement apriori-gen as in original paper [WIP] Implement apriori-gen as in original paper Dec 18, 2019
@pep8speaks
Copy link

pep8speaks commented Dec 20, 2019

Hello @dbarbier! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-06 13:19:09 UTC

@dbarbier
Copy link
Contributor Author

dbarbier commented Dec 20, 2019

Here are some benchmark results; data must first be downloaded from http://fimi.uantwerpen.be/data/ and put inside a data subdirectory. Some runs take a very long time, thus I decided to set a timeout of 120s, in which case NA is reported in tables below, with the itemset size being processed. Script below only works on Linux, I am not sure how to implement timeout on Windows.

Benchmark script
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
import pandas as pd
import numpy as np
import gzip
import os
from time import time
import signal
from contextlib import contextmanager


@contextmanager
def timeout(time):
    # Register a function to raise a TimeoutError on the signal.
    signal.signal(signal.SIGALRM, raise_timeout)
    # Schedule the signal to be sent after ``time``.
    signal.alarm(time)

    try:
        yield
    except TimeoutError:
        pass
    finally:
        # Unregister the signal so it won't be triggered
        # if the timeout is not reached.
        signal.signal(signal.SIGALRM, signal.SIG_IGN)


def raise_timeout(signum, frame):
    raise TimeoutError


files = [
    #"chess.dat.gz", 
    "connect.dat.gz",
    "mushroom.dat.gz", 
    "pumsb.dat.gz",
    "pumsb_star.dat.gz",
    # "T10I4D100K.dat.gz",  these 3 files are too large 
    # "T40I10D100K.dat.gz",
    # "kosarak.dat.gz"
]

# Modify these 2 variables
sparse = False
low_memory = True

for filename in files:
    with gzip.open(os.path.join("data", filename)) if filename.endswith(
        ".gz" 
    ) else open(os.path.join("data", filename)) as f:
        data = f.readlines() 
 
    dataset = [list(map(int, line.split())) for line in data] 
    items = np.unique([item for itemset in dataset for item in itemset])
    print(f"{filename} contains {len(dataset)} transactions and {len(items)} items")

    te = TransactionEncoder()
    te_ary = te.fit(dataset).transform(dataset, sparse=sparse)
    if sparse:
        try:
            df = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)
        except AttributeError:
            # pandas < 0.25
            df = pd.SparseDataFrame(te_ary, columns=te.columns_, default_fill_value=False)
    else:
        df = pd.DataFrame(te_ary, columns=te.columns_)
    df.columns = ["c"+str(i) for i in df.columns]

    for min_support in [0.5, 0.3, 0.1, 0.05, 0.03, 0.01, 0.005]:
        tick = time()
        with timeout(120):
            print(apriori(df, min_support=min_support, verbose=1, use_colnames=False, low_memory=low_memory))
        print(f"\nmin_support={min_support} temps: {time() - tick}\n")

Some commits have either an asterisk (*) or a plus sign (+), as well as (F) or (T).

  • An asterisk tells that code has been modified to operate on smaller matrix blocks, because there would otherwise be memory errors on my machine. All such commits have the same block size, so their relative speed is still relevant.
  • A plus sign tells that 1 should be added to number between parentheses; the reason is that commit 4c82dcf displays the number of combinations after processing them, instead of before, so the last number is missing when timeout is reached.
  • F or T refers to low_memory argument (either False or True)

With dense DataFrame:

connect.dat.gz 67557 transactions and 129 items
min_support master*(F) master(T) 96dfd4d*(F) 4efeb8c*(F) b90a146*(F) 1fabe25*(F) 1fabe25(T) 58c95f1(F) 58c95f1(T) 4c82dcf+ 423f9e7+
0.5 NA(5) NA(4) NA(5) NA(5) NA(5) NA(6) NA(4) NA(7) NA(7) NA+(6) NA+(6)
0.3 NA(5) NA(4) NA(4) NA(5) NA(5) NA(6) NA(4) NA(6) NA(6) NA+(5) NA+(5)
0.1 NA(4) NA(4) NA(4) NA(4) NA(4) NA(5) NA(4) NA(5) NA(5) NA+(4) NA+(4)
0.05 NA(4) NA(4) NA(4) NA(4) NA(4) NA(4) NA(3) NA(5) NA(5) NA+(4) NA+(4)
0.03 NA(4) NA(4) NA(4) NA(4) NA(4) NA(4) NA(3) NA(4) NA(4) NA+(4) NA+(4)
0.01 NA(4) NA(3) NA(4) NA(4) NA(4) NA(4) NA(3) NA(4) NA(4) NA+(4) NA+(4)
0.005 NA(4) NA(3) NA(4) NA(4) NA(4) NA(4) NA(3) NA(4) NA(4) NA+(4) NA+(4)
mushroom.dat.gz 8124 transactions and 119 items
min_support master*(F) master(T) 96dfd4d*(F) 4efeb8c*(F) b90a146*(F) 1fabe25*(F) 1fabe25(T) 58c95f1(F) 58c95f1(T) 4c82dcf+ 423f9e7+
0.5 0.04 0.05 0.04 0.04 0.04 0.02 0.09 0.02 0.02 0.02 0.02
0.3 1.24 0.64 0.38 0.35 0.37 0.13 0.89 0.11 0.11 0.13 0.10
0.1 NA(7) NA(12) 107 103 109 35 89 32 33 38 25
0.05 NA(6) NA(7) NA(7) NA(7) NA(7) NA(7) NA(7) NA(9) NA(9) NA+(8) NA+(9)
0.03 NA(5) NA(6) NA(6) NA(6) NA(6) NA(7) NA(7) NA(7) NA(7) NA+(7) NA+(7)
0.01 NA(5) NA(5) NA(5) NA(5) NA(5) NA(6) NA(6) NA(6) NA(6) NA+(5) NA+(6)
0.005 NA(5) NA(5) NA(5) NA(5) NA(5) NA(5) NA(5) NA(5) NA(5) NA+(5) NA+(5)
pumsb.dat.gz 49046 transactions and 2113 items
min_support master*(F) master(T) 96dfd4d*(F) 4efeb8c*(F) b90a146*(F) 1fabe25*(F) 1fabe25(T) 58c95f1(F) 58c95f1(T) 4c82dcf+ 423f9e7+
0.5 NA(5) NA(3) NA(5) NA(5) NA(5) NA(6) NA(3) NA(6) NA(6) NA+(6) NA+(6)
0.3 NA(4) NA(3) NA(4) NA(4) NA(4) NA(5) NA(2) NA(5) NA(5) NA+(4) NA+(4)
0.1 NA(4) NA(3) NA(4) NA(4) NA(4) NA(4) NA(2) NA(4) NA(4) NA+(4) NA+(4)
0.05 NA(3) NA(3) NA(3) NA(3) NA(3) NA(4) NA(2) NA(4) NA(4) NA+(3) NA+(3)
0.03 NA(3) NA(3) NA(3) NA(3) NA(3) NA(4) NA(2) NA(4) NA(4) NA+(3) NA+(3)
0.01 NA(3) NA(3) NA(3) NA(3) NA(3) NA(3) NA(2) NA(3) NA(3) NA+(3) NA+(3)
0.005 NA(3) NA(2) NA(3) NA(3) NA(3) NA(3) NA(2) NA(3) NA(3) NA+(3) NA+(3)
pumsb_star.dat.gz 49046 transactions and 2088 items
min_support master*(F) master(T) 96dfd4d*(F) 4efeb8c*(F) b90a146*(F) 1fabe25*(F) 1fabe25(T) 58c95f1(F) 58c95f1(T) 4c82dcf+ 423f9e7
0.5 3.76 9.40 0.76 0.73 0.59 0.46 53 0.34 0.37 0.34 0.35
0.3 NA(5) NA(4) NA(7) NA(7) NA(7) 66 NA(3) 35 36 41 31
0.1 NA(4) NA(3) NA(4) NA(4) NA(4) NA(5) NA(2) NA(5) NA(5) NA+(5) NA+(5)
0.05 NA(4) NA(3) NA(4) NA(4) NA(4) NA(4) NA(2) NA(5) NA(5) NA+(4) NA+(4)
0.03 NA(3) NA(3) NA(4) NA(3) NA(4) NA(4) NA(2) NA(4) NA(4) NA+(4) NA+(4)
0.01 NA(3) NA(3) NA(3) NA(3) NA(3) NA(4) NA(2) NA(4) NA(4) NA+(3) NA+(3)
0.005 NA(3) NA(2) NA(3) NA(3) NA(3) NA(3) NA(2) NA(3) NA(3) NA+(3) NA+(3)

With sparse DataFrame

connect.dat.gz 67557 transactions and 129 items
min_support master(T) 58c95f1(T) 4c82dcf+ 423f9e7+
0.5 NA(4) NA(4) NA+(4) NA+(4)
0.3 NA(4) NA(4) NA+(3) NA+(3)
0.1 NA(3) NA(3) NA+(3) NA+(3)
0.05 NA(3) NA(3) NA+(3) NA+(3)
0.03 NA(3) NA(3) NA+(3) NA+(3)
0.01 NA(3) NA(3) NA+(3) NA+(3)
0.005 NA(3) NA(3) NA+(3) NA+(3)
mushroom.dat.gz 8124 transactions and 119 items
min_support master(T) 58c95f1(T) 4c82dcf+ 423f9e7+
0.5 0.18 0.18 0.07 0.06
0.3 3.07 3.19 0.92 0.84
0.1 NA(6) NA(6) NA+(7) NA+(8)
0.05 NA(5) NA(5) NA+(5) NA+(5)
0.03 NA(5) NA(5) NA+(5) NA+(5)
0.01 NA(4) NA(4) NA+(4) NA+(4)
0.005 NA(4) NA(4) NA+(4) NA+(4)
pumsb.dat.gz 49046 transactions and 2113 items
min_support master(T) 58c95f1(T) 4c82dcf+ 423f9e7+
0.5 NA(4) NA(4) NA+(3) NA+(3)
0.3 NA(3) NA(3) NA+(3) NA+(3)
0.1 NA(3) NA(3) NA+(2) NA+(2)
0.05 NA(3) NA(3) NA+(2) NA+(2)
0.03 NA(3) NA(2) NA+(2) NA+(2)
0.01 NA(2) NA(2) NA+(2) NA+(2)
0.005 NA(2) NA(2) NA+(2) NA+(2)
pumsb_star.dat.gz 49046 transactions and 2088 items
min_support master(T) 58c95f1(T) 4c82dcf+ 423f9e7+
0.5 2.97 3.08 1.16 1.16
0.3 NA(5) NA(5) NA+(5) NA+(5)
0.1 NA(3) NA(3) NA+(3) NA+(3)
0.05 NA(3) NA(3) NA+(3) NA+(3)
0.03 NA(3) NA(3) NA+(3) NA+(3)
0.01 NA(3) NA(3) NA+(2) NA+(2)
0.005 NA(2) NA(2) NA+(2) NA+(2)

@dbarbier
Copy link
Contributor Author

More tests should be run, maybe with kosarak.dat; when the best option is decided, I will remove/squash/rearrange commits and update docstrings. IMHO the best option is with current head.

@dbarbier
Copy link
Contributor Author

dbarbier commented Dec 20, 2019

Okay, I ran some tests with kosarak.dat and sparse=False. For small inputs, there is a noticeable overhead, which is caused by np.asfortranarray. All these tests had been run with a DataFrame wih row major storage format (order='C' in Numpy). In tables below, an asterisk means that run is performed with a DataFrame already in column storage format, so that np.asfortranarry becomes a noop; this is achieved by adding

df = pd.DataFrame({col: df[col] for col in df.columns})

(F) and (T) still refers to low_memory=False and low_memory=True arguments respectively.

1k first lines of kosarak.dat.gz 1000 transactions and 3259 items, sparse=False
min_support master(F) master(T) master*(F) master*(T) 423f9e7 423f9e7*
0.5 0.01 0.01 0.01 0.01 0.01 0.01
0.3 0.01 0.01 0.01 0.02 0.01 0.01
0.1 0.01 0.01 0.01 0.02 0.01 0.01
0.05 0.02 0.02 0.02 0.04 0.02 0.02
0.03 0.02 0.02 0.02 0.04 0.02 0.02
0.01 0.11 0.04 0.06 0.10 0.07 0.07
0.005 3.58 0.39 1.48 0.68 0.83 0.89
10k first lines of kosarak.dat.gz 10000 transactions and 10094 items, sparse=False
min_support master(F) master(T) master*(F) master*(T) 423f9e7 423f9e7*
0.5 0.10 0.15 0.10 0.22 0.33 0.09
0.3 0.10 0.17 0.10 0.47 0.33 0.09
0.1 0.10 0.19 0.10 0.70 0.33 0.09
0.05 0.11 0.20 0.11 1.24 0.33 0.09
0.03 0.13 0.22 0.11 1.77 0.33 0.09
0.01 0.69 0.30 0.24 4.93 0.37 0.13
0.005 8.43 0.54 2.16 11.4 0.61 0.40
50k first lines of kosarak.dat.gz 50000 transactions and 18936 items, sparse=False
min_support master(F) master(T) master*(F) master*(T) 423f9e7 423f9e7*
0.5 0.78 1.00 1.17 1.90 2.70 0.86
0.3 0.78 1.20 0.74 4.23 2.71 0.86
0.1 0.78 1.37 1.17 6.35 2.71 0.86
0.05 0.82 1.70 0.75 11.7 2.71 0.87
0.03 0.90 1.92 1.21 16.9 2.71 0.87
0.01 5.02 2.95 1.51 46.4 2.77 0.92
0.005 51.3 4.96 12.8 107 3.12 1.30
100k first lines of kosarak.dat.gz 100000 transactions and 23496 items, sparse=False
min_support master(F) master(T) master*(F) master*(T) 423f9e7 423f9e7*
0.5 - 2.83 - 4.79 7.51 2.89
0.3 - 3.02 - 10.7 6.43 2.89
0.1 - 3.41 - 16.1 6.43 2.90
0.05 - 4.44 - 29.9 6.43 2.90
0.03 - 4.75 - 43.0 6.43 2.90
0.01 - 7.46 - 116 6.50 2.97
0.005 - 13.8 - NA(2) 7.03 3.52

With all these resuls, it is not clear to me whether 1fabe25 is such a good idea, an alternative is to make users aware of the importance of storage order, and they have to check which one is best adapted to their use case.

@rasbt
Copy link
Owner

rasbt commented Dec 28, 2019

Sorry, I have been busy with grading and was then traveling over Xmas. The work in this PR is amazing, thanks a lot! Regarding the row vs column format, like you suggest, maybe it would be better to add this prominently in the docstring but have users decide what type of format they provide via the DataFrame.

@rasbt
Copy link
Owner

rasbt commented Dec 28, 2019

I just see the unit tests failing. I don't know why this happens to Appveyor, but it seems that switching installations from pip to conda helped. Maybe something got changed in the backend.

Regarding Travis CI, some discrepancies occurred after things got switched to scikit-learn 0.22 in the "latest" version. I addressed this now as well. This has been done in #652. I can do a rebase of this PR if you like, or if you can do it yourself if you prefer -- wanted to ask before I started messing around here ;)

@dbarbier
Copy link
Contributor Author

I just rebased but won't be able to take care of failures during 24h, feel free to push fixes.

@rasbt
Copy link
Owner

rasbt commented Dec 29, 2019

All good. Based on the error logs, these are "natural" ones due to WIP and not issues with the CI.

@dbarbier
Copy link
Contributor Author

There were indeed some bugs; because of these fixes, timings may be slightly different, I will rerun benchmarks in few days.

@dbarbier dbarbier force-pushed the db/apriori-gen branch 2 times, most recently from bcc64c5 to 4c43f5c Compare December 30, 2019 14:07
The apriori-gen function described in section 2.1.1 of Apriori paper
has two steps; first, the join step looks for itemsets with the same
prefix, and creates new candidates by appending all pairs combinations
to this prefix. Here is pseudocode copied from paper:

  select p.1, p.2, ..., p.k-1, q.k-1
  from p in L(k-1), q in L(k-1)
  where p.1 = q.1, ..., p.k-2 = q.k-2, p.k-1 < q.k-1

The reason is that if a sequence q with the same prefix as p does not
belong to L(k-1), itemset p+(q.k-1,) cannot be frequent.
Before this commit, we were considering p+(q.k-1,) for any q.k-1 > p.k-1.

The second step of apriori-gen function is called prune step, it will
be implemented in a distinct commit.

See discussion in rasbt#644.
@rasbt
Copy link
Owner

rasbt commented Jan 2, 2020

What do you think about adding a/the benchmark script(s) to mlxtend/frequent_patterns/tests/benchmark.py (not test_benchmark.py so it doesn't get executed by default when running pytest?

It may be useful to have this as a reference in case of future modifications to the codebase.

@rasbt rasbt closed this Jan 2, 2020
@rasbt rasbt reopened this Jan 2, 2020
@rasbt
Copy link
Owner

rasbt commented Jan 2, 2020

*sry, not sure why this PR was closed. I must have hit some keyboard combination -- never happened before. This was certainly not intentional.

The apriori-gen function described in section 2.1.1 of Apriori paper
has two steps; the first step had been implemented in previous commit.

The second step of apriori-gen function is called prune step, it takes
candidates c from first step and check that all (k-1) tuples built by
removing any single element from c is in L(k-1).

As Numpy arrays are not hashable, we cannot use set() for itemset lookup,
and define a very simple prefix tree class.
Thanks to previous optimizations, processing with low_memory=True is
now as efficient as with low_memory=False, and allows to process much
larger datasets.

Removing processing with low_memory=False makes code simpler.

The downside is that we do not know in advance the number of itemsets to
process, thus it is displayed afterwards.  We now display the number of
itemsets after prune step.
Note that commit 2f928cb introduced a bug, the number of processing
combinations was multiplied by itemset's length.

Since vectorized operations are no more performed on frequent itemsets,
they are stored as list of tuples.
This is now possible because tuples are hashable.
For unknbown reasons, np.sum is slow on a very large boolean array.
@dbarbier
Copy link
Contributor Author

dbarbier commented Jan 3, 2020

I rearranged commits, they look good now IMHO. About benchmark script, I do not know how to do that, there are many parameters: data files, sparse=True/False, column_major=True/False, and list of min_support argument (which may depend on data files). Anyway, it has been committed.

Should data files be copied into mlxtend/data/data with a Python function to load them? Here are their size (in bytes):

  chess.dat.gz         14K
  connect.dat.gz      362K
  kosarak-1k.dat.gz    14K
  kosarak-10k.dat.gz  126K
  kosarak-50k.dat.gz  616K
  kosarak-100k.dat.gz 1,3M
  kosarak-200k.dat.gz 2,5M
  kosarak.dat.gz       13M
  mushroom.dat.gz      34K
  pumsb.dat.gz        1,3M
  pumsb_star.dat.gz   1,2M
  T10I4D100K.dat.gz   1,4M
  T40I10D100K.dat.gz  4,8M

@dbarbier
Copy link
Contributor Author

dbarbier commented Jan 3, 2020

Here are more benchmarks. In these tables, s= means sparse variable, c= is col_major and T/F is True/False.

T10I4D100K.dat.gz 100000 transactions and 870 items, low_memory=True
min_support 0.05 0.03 0.01 0.005 0.003 0.001
master (s=F,c=T) 0.26 0.92 4.0 7.2 14.5 37.5
master (s=T,c=T) 0.06 0.24 3.9 8.9 29.3 NA(3)
eb80667(s=F,c=T) 0.01 0.05 2.0 4.8 7.3 11.8
eb80667(s=T,c=T) 0.05 0.44 11.3 23.5 33.5 57.2
T40I10D100K.dat.gz 100000 transactions and 942 items, low_memory=True

Note that timeout had been expanded to 300s when running commit eb80667, which explains values above 120s.

min_support 0.1 0.05 0.03 0.01 0.005 0.003 0.001
master (s=F,c=T) 2.0 7.2 15.0 NA(2) NA(2) NA(2) NA(2)
master (s=T,c=T) 0.67 5.4 13.3 NA(2) NA(2) NA(2) NA(2)
eb80667(s=F,c=T) 0.01 1.3 3.5 20.7 176 NA(6) NA(2)
eb80667(s=T,c=T) 1.32 14.0 32.9 232 NA(2) NA(2) NA(2)
kosarak-*k.dat.gz, low_memory=True
min_support 0.5 0.3 0.1 0.05 0.03 0.01 0.005 0.003 0.001
kosarak-1k.dat.gz
eb80667(s=F,c=T) 0.005 0.006 0.007 0.008 0.01 0.27 0.31 NA(11) NA(2)
eb80667(s=F,c=F) 0.009 0.009 0.01 0.01 0.01 0.05 0.60 NA(9) NA(2)
eb80667(s=T,c=T) 0.08 0.08 0.09 0.09 0.09 0.14 0.92 NA(8) NA(2)
kosarak-10k.dat.gz
eb80667(s=F,c=T) 0.02 0.02 0.02 0.02 0.02 0.04 0.14 0.58 NA(5)
eb80667(s=F,c=F) 0.26 0.26 0.26 0.26 0.27 0.36 1.02 4.4 NA(4)
eb80667(s=T,c=T) 0.26 0.26 0.26 0.27 0.28 0.36 0.80 2.4 NA(4)
kosarak-50k.dat.gz
eb80667(s=F,c=T) 0.09 0.09 0.09 0.09 0.09 0.12 0.33 1.06 61.1
eb80667(s=F,c=F) 2.0 2.0 2.0 2.0 2.0 2.6 6.7 20.1 NA(1)
eb80667(s=T,c=T) 0.49 0.49 0.50 0.51 0.53 0.78 2.1 5.9 NA(5)
kosarak-100k.dat.gz
eb80667(s=F,c=T) 0.20 0.20 0.20 0.20 0.20 0.24 0.60 1.85 67.9
eb80667(s=F,c=F) 4.9 4.9 4.9 5.0 5.1 6.4 15.7 - -
eb80667(s=T,c=T) 0.62 0.62 0.62 0.65 0.70 1.15 3.6 10.2 NA(9)
kosarak-200k.dat.gz
eb80667(s=F,c=T) 0.51 0.51 0.51 0.51 0.51 0.60 1.3 3.8 123
eb80667(s=T,c=T) 0.80 0.80 0.80 0.85 0.92 1.83 6.3 18.8 NA(5)

@dbarbier
Copy link
Contributor Author

dbarbier commented Jan 5, 2020

Some remarks:

  • Processing times are much higher with sparse dataframes. I did not really investigate this issue, memory usage for dense array is now very low (except for the array itself), so using sparse dataframe may get deprecated. Also note that TransactionEncoder uses a CSR format, which is logical, but it has to be converted to CSC for apriori, and this conversion takes time and memory.
  • Implementation does not use hash-trees. The reason is that current API expects a DataFrame, and with hash-trees it seems more logical to directly work on a list of transactions.

This is a work in progress.
@rasbt
Copy link
Owner

rasbt commented Jan 7, 2020

Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submissions deadlines mid Jan.

In any case, I am really thankful for all the good work you put into this. This is really awesome. And I can take care of the automatic data downloads from here.

regarding

but it has to be converted to CSC for apriori, and this conversion takes time and memory.

Maybe that's something we could add to the apriori docs. I.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then).

Processing times are much higher with sparse dataframes. I did not really investigate this issue, memory usage for dense array is now very low (except for the array itself), so using sparse dataframe may get deprecated.

I am not an expert at this by any means, but I think they are usually only memory efficient but not necessarily efficient when it comes to processing times.

@dbarbier
Copy link
Contributor Author

dbarbier commented Jan 7, 2020

Sorry for the sparse responses, I have been traveling over the holidays and am currently working on two manuscripts with submissions deadlines mid Jan.

No worries, this issue is not trivial and requires careful thinking, please take your time.

Other apriori implementations takes as input a list of transactions; there are several optimizations which can then be performed:

  • use a sparse representation of this list of transactions (Apriori paper suggests hash-trees, some other authors prefer prefix trees) which speeds up itemset counts (supports)
  • remove transactions during processing when they can no more contain frequent itemsets (for instance if their number of items is less than next_max_itemset)

Here it is very likely that user loaded her dataset as a list of transactions and called TransactionEncoder to convert it into a pandas DataFrame in order to call apriori function, thus it does not make sense to me to let apriori convert its argument internally back into a list of transactions. IMHO the first design decision to consider is: do you want to keep current input as a pandas DataFrame, or can it be changed to a list of transactions?

If the former, optimizations mentioned above cannot be performed and this PR is almost done. If the latter, a lot more work is needed.

[...]

but it has to be converted to CSC for apriori, and this conversion takes time and memory.

Maybe that's something we could add to the apriori docs. I.e., adding it as a cell to the notebooks for each example. What do you think? (I could take care of this then).
[...]

Sorry I do not understand your point; dense pandas DataFrame can use either row major or column major storage, and it looks like this depends on which DataFrame constructor had been called(2d array vs dict). We could indeed add a cell to show how to convert input DataFrame to speed up apriori function.
But as far as I can tell, this is different for sparse DataFrame, user has no control on internal storage, and all we can do is to call df.to_coo().tocsc(). If input argument could be passed as a Numpy array (or scipy sparse matrix), that would be different.

@rasbt
Copy link
Owner

rasbt commented Jan 29, 2020

Sorry, still haven't had time to look into this more. Along with the new semester (lots of teaching) & 2 paper deadlines in January, there wasn't time for much else recently. I am currently making a 0.17.1 bugfix release with the recent changes -- someone from industry contacted me about this, because due to company firewalls, several people can only install it from PyPI (not from GitHub directly). Will revisit this PR soon though -- thanks for all the work on it so far!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants