FPGrowth and Apriori disagreement with minsup=0 #574

Closed
harenbergsd opened this issue Aug 5, 2019 · 4 comments

@harenbergsd
Contributor

I noticed this while fixing another bug. FPGrowth and Apriori disagree when min_support=0. I am not sure what the answer should be.

Suppose your input itemsets are [['a'], ['b']].

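import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

itemsets = [['a'], ['b']]
te = TransactionEncoder()
te_ary = te.fit(itemsets).transform(itemsets)  # fit on the same transactions that are transformed
df = pd.DataFrame(te_ary, columns=te.columns_)
print(fpgrowth(df, min_support=0))
print(apriori(df, min_support=0))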

This produces:

   support itemsets
0      0.5      (a)
1      0.5      (b)
   support itemsets
0      0.5      (a)
1      0.5      (b)
2      0.0   (b, a)

We should make it consistent, but which one? The second is correct in a theoretical sense (every possible subset of the items appears at least 0 times), but the first is probably more useful in a practical sense.
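
To make the two readings concrete, here is a minimal pure-Python sketch (not mlxtend code; the helper name `all_itemset_supports` is made up for illustration). It enumerates every non-empty subset of the items, computes each subset's support by brute force, and then compares a `>= 0` threshold against keeping only itemsets that actually occur.

```python
from itertools import combinations


def all_itemset_supports(transactions):
    """Brute-force support of every non-empty subset of the items (hypothetical helper)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    supports = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            # An itemset is supported by a transaction that contains all of its items.
            count = sum(1 for t in transactions if set(combo) <= set(t))
            supports[combo] = count / n
    return supports


transactions = [['a'], ['b']]
supports = all_itemset_supports(transactions)

# "Theoretical" reading: every subset has support >= 0, so all of them qualify.
at_least_zero = {s: v for s, v in supports.items() if v >= 0}

# "Practical" reading: only itemsets that occur in at least one transaction.
observed_only = {s: v for s, v in supports.items() if v > 0}
```

With these two transactions, `at_least_zero` has three entries, including `('a', 'b')` with support 0.0, while `observed_only` has just `('a',)` and `('b',)`.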

@rasbt
Owner

rasbt commented Aug 5, 2019

Good catch! I agree with your assessment regarding practically useful & theoretically correct. So, setting min_support to 0 will result in all possible itemsets, which is probably never useful in practice. However, it would be technically correct ...

The two options I have in mind are

a) Disallow min_support=0 values via

if min_support <= 0.:
    raise ValueError('`min_support` must be a positive number within the interval (0, 1]')

b) Allow 0 values in fpmax and fpgrowth with the behavior of returning all subsets -- who knows, maybe it is useful for people who just want to quickly get the number of total itemsets (possible combinations) although it may not be an efficient way for doing that ...

I would lean toward a). Would be curious to hear what you think.
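
As a rough, standalone sketch of what option a) could look like: the function name `check_min_support` below is hypothetical and not part of mlxtend. It mirrors only the `<= 0` check from the snippet above and does not enforce the upper bound of the interval.

```python
def check_min_support(min_support):
    """Reject non-positive thresholds; hypothetical helper, not mlxtend's API."""
    if min_support <= 0.0:
        raise ValueError(
            '`min_support` must be a positive number within the interval (0, 1]'
        )
    return min_support


# A zero threshold is rejected, a positive one passes through unchanged.
try:
    check_min_support(0)
    zero_rejected = False
except ValueError:
    zero_rejected = True

valid_threshold = check_min_support(0.5)
```

Raising early keeps both algorithms consistent, since neither would ever be asked to enumerate itemsets with zero support.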

@harenbergsd
Contributor Author

I guess using min_support=0 allows you to get the support of all the itemsets in your data, which may be useful, sort of, sometimes, maybe...

But in the rare case someone wants that, they could just use a small value, e.g., 1/nrows. The only thing 0 gives you over 1/nrows is the itemsets that never occur in the data, which doesn't seem useful to have.

Yeah, I am good with option (a). I can't see any reason they would need to do 0 over just doing 1/nrows.
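
A quick brute-force check of the 1/nrows point, as a pure-Python sketch (the transactions and the helper name `itemset_supports` are made up for illustration, and this is not how FPGrowth or Apriori compute supports internally):

```python
from itertools import combinations


def itemset_supports(transactions):
    """Support of every non-empty subset of the items (hypothetical helper)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    return {
        combo: sum(1 for t in transactions if set(combo) <= set(t)) / n
        for k in range(1, len(items) + 1)
        for combo in combinations(items, k)
    }


transactions = [['a', 'b'], ['a'], ['b', 'c']]
nrows = len(transactions)
supports = itemset_supports(transactions)

# A threshold of 1/nrows keeps every itemset that occurs in at least one row.
kept = {s: v for s, v in supports.items() if v >= 1 / nrows}

# These are the extra itemsets that only a threshold of 0 would add.
never_observed = sorted(s for s, v in supports.items() if v == 0)
```

Here `kept` holds the five itemsets that appear in some transaction, and `never_observed` lists the two that do not, `('a', 'b', 'c')` and `('a', 'c')`.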

@rasbt
Owner

rasbt commented Aug 6, 2019

Sounds good, thanks for the feedback! I just added it to the existing PR at #573

@rasbt
Owner

rasbt commented Aug 20, 2019

Should be fine now after #573

@rasbt rasbt closed this as completed Aug 20, 2019