Input dataset format #12

benpowis · 2019-08-29T11:36:13Z

Hi there - love your work on this package! I have a question regarding input datasets: in your example the input is a list of tuples, but is it possible to work with dataframes too? What are the restrictions around input data?

Many thanks,
Ben

tommyod · 2019-08-29T12:01:07Z

No problem at all.

import pandas as pd

# Original data
transactions = [('eggs', 'bacon', 'soup'),
                ('eggs', 'bacon', 'apple'),
                ('soup', 'bacon', 'banana')]

# Convert to pandas.DataFrame
df = pd.DataFrame(transactions)

# Convert back to list of tuples
transactions_from_df = [tuple(row) for row in df.values.tolist()]

# They are equal, so this evaluates to True
assert transactions == transactions_from_df

A list of lists will also work; it doesn't have to be a list of tuples.
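
For context on why the explicit conversion is needed: iterating over a pandas DataFrame yields its column labels, not its rows, so a DataFrame is not interchangeable with a list of transactions. A minimal sketch using the same example data (the names `column_labels` and `rows` are only for illustration):

```python
import pandas as pd

transactions = [("eggs", "bacon", "soup"),
                ("eggs", "bacon", "apple"),
                ("soup", "bacon", "banana")]
df = pd.DataFrame(transactions)

# Iterating over a DataFrame yields column labels (here the default 0, 1, 2)
column_labels = list(df)

# Iterating row-wise yields one plain tuple per transaction
rows = list(df.itertuples(index=False, name=None))
```

Here `itertuples(index=False, name=None)` is an alternative to `df.values.tolist()` that returns plain tuples directly.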

benpowis · 2019-08-29T12:13:56Z

Thank you @tommyod this looks great - how would you suggest dealing with NaN values? When feeding my df directly to apriori() I get the error:
TypeError: object of type 'int' has no len()

I can use your code above to transform the df into a list, but my data has a couple of huge baskets, which leads to many 'nan' values in the lists. Will these have an adverse effect on the results?

tommyod · 2019-08-29T12:28:35Z

NaN likely represents nothing, so convert ('bread', nan, 'milk', nan) to ('bread', 'milk'). It really depends on your problem at hand. Each tuple should represent a transaction, and having "none-tokens" in a transaction is a no-no. The values in the tuples should be strings.
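
As a rough sketch of that conversion, assuming the transactions start out as rows of a DataFrame where shorter baskets are padded with missing values (the data below is made up for illustration):

```python
import pandas as pd

# Hypothetical data: baskets of different sizes, padded with missing values
df = pd.DataFrame([
    ("bread", "milk", None),
    ("eggs", None, None),
    ("bread", "eggs", "milk"),
])

# Keep only the non-missing values in each row; each row becomes one transaction
transactions = [
    tuple(item for item in row if pd.notna(item))
    for row in df.itertuples(index=False, name=None)
]
```

`pd.notna` treats both `None` and `nan` as missing, so either kind of padding is dropped and only the string items remain in each tuple.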

benpowis · 2019-08-29T12:52:45Z

Cool, thank you - should this help anyone else in the future, here is the method I used to remove nans from lists of varying sizes:

from math import isnan

# Drop NaN entries from every transaction, leaving all other values untouched
for i, transaction in enumerate(transactions_from_df):
    transactions_from_df[i] = [
        x for x in transaction
        # drop float values, but only if they are nan
        if not (isinstance(x, float) and isnan(x))
    ]

tommyod closed this as completed Mar 19, 2020
