Input dataset format #12

Closed
benpowis opened this issue Aug 29, 2019 · 4 comments

Comments

@benpowis

Hi there - love your work on this package! I have a question regarding input datasets: in your example the input is a list of tuples, but is it possible to work with dataframes too? What are the restrictions around input data?

Many thanks,
Ben

@tommyod
Owner

tommyod commented Aug 29, 2019

No problem at all.

import pandas as pd

# Original data
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]

# Convert to pandas.DataFrame
df = pd.DataFrame(transactions)

# Convert back to list of tuples
transactions_from_df = [tuple(row) for row in df.values.tolist()]

# They are equal, so this assertion passes
assert transactions == transactions_from_df

A list of lists will also work; it doesn't have to be a list of tuples.

@benpowis
Author

Thank you @tommyod this looks great - how would you suggest dealing with NaN values? When feeding my df directly to apriori() I get the error:
TypeError: object of type 'int' has no len()

I can use your code above to transform into a list, but in my data I have a couple of baskets which are huge, leading to many 'nan' values in the lists. Will these have an adverse effect on the results?
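
A likely cause of the TypeError quoted above (an assumption, since the full traceback is not shown): iterating over a DataFrame yields its column labels rather than its rows, so code that expects a sequence of transactions receives integers such as 0, 1, 2 instead. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up example data, not taken from the issue author's dataset
df = pd.DataFrame([("eggs", "bacon", "soup"),
                   ("eggs", "bacon", "apple")])

# Iterating a DataFrame walks over its column labels, not its rows
labels = list(df)

# Rows have to be requested explicitly, here as plain tuples
rows = list(df.itertuples(index=False, name=None))
```

Here `labels` holds the integer column labels, while `rows` holds one tuple per transaction, which is the shape the list-of-tuples example uses.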

@tommyod
Owner

tommyod commented Aug 29, 2019

NaN likely represents nothing, so convert ('bread', nan, 'milk', nan) to ('bread', 'milk'). It really depends on your problem at hand. Each tuple should represent a transaction, and having "none-tokens" in a transaction is a no-no. The values in the tuples should be strings.
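
One way to apply this advice straight from a DataFrame is sketched below. It assumes each row is one transaction and that missing cells (None or NaN) only pad out shorter baskets; the data and variable names are made up for illustration.

```python
import pandas as pd

# Made-up baskets of different sizes, padded with missing values
df = pd.DataFrame([("bread", None, "milk", None),
                   ("eggs", "bacon", None, None)])

# Build one tuple per row, skipping missing cells and casting items to str
transactions = [
    tuple(str(item) for item in row if pd.notna(item))
    for row in df.itertuples(index=False, name=None)
]
```

Filtering with `pd.notna` covers both None and NaN, so the result contains only real items regardless of how the missing cells were stored.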

@benpowis
Author

benpowis commented Aug 29, 2019

Cool, thank you - in case this helps anyone else in the future, here is the method I used to remove NaNs from lists of varying sizes:

from math import isnan

# transactions_from_df is the list built from the DataFrame earlier in the thread
for i, transaction in enumerate(transactions_from_df):
    transactions_from_df[i] = [
        x for x in transaction
        # drop float values, but only if they are NaN
        if not (isinstance(x, float) and isnan(x))
    ]

@tommyod tommyod closed this as completed Mar 19, 2020