fillna() does not work when value parameter is a list #3435

Closed
ijmcf opened this issue Apr 23, 2013 · 16 comments · Fixed by #3585

@ijmcf

ijmcf commented Apr 23, 2013

Should raise on a passed list to value

The results from the fillna() method are very strange when the value parameter is given a list.

For example, using a simple example DataFrame:

import numpy
import pandas

df = pandas.DataFrame({'A': [numpy.nan, 1, 2], 'B': [10, numpy.nan, 12], 'C': [[20, 21, 22], [23, 24, 25], numpy.nan]})
df
A B C
0 NaN 10 [20, 21, 22]
1 1 NaN [23, 24, 25]
2 2 12 NaN

df.fillna(value=[100, 101, 102])
A B C
0 100 10 [20, 21, 22]
1 1 101 [23, 24, 25]
2 2 12 102

So it appears the values in the list are used to fill the 'holes' in order, if the list has the same length as the number of holes. But if the list is shorter than the number of holes, the behavior changes to using only the first value in the list:

df.fillna(value=[100, 101])
A B C
0 100 10 [20, 21, 22]
1 1 100 [23, 24, 25]
2 2 12 100

If the list is longer than the number of holes, you get something even more odd:

df.fillna(value=[100, 101, 102, 103])
A B C
0 100 10 [20, 21, 22]
1 1 100 [23, 24, 25]
2 2 12 102

If you provide a dict that specifies the fill values by column, the values from the list are used within that column only:

df.fillna(value={'C': [100, 101]})
A B C
0 NaN 10 [20, 21, 22]
1 1 NaN [23, 24, 25]
2 2 12 100

Since it's not always practical to know the number of NaN values a priori, or to customize the length of the value list to match it, this is problematic. Furthermore, some desired values get over-interpreted and cannot be used:

For example, if you want to actually replace all NaN instances in a single column with the same list (either empty or non-empty), I can't figure out how to do it:

df.fillna(value={'C': [[100,101]]})
A B C
0 NaN 10 [20, 21, 22]
1 1 NaN [23, 24, 25]
2 2 12 100

Indeed, if you specify the empty list nothing is filled:

df.fillna(value={'C': list()})
A B C
0 NaN 10 [20, 21, 22]
1 1 NaN [23, 24, 25]
2 2 12 NaN

But a dict works fine:

df.fillna(value={'C': {0: 1}})
A B C
0 NaN 10 [20, 21, 22]
1 1 NaN [23, 24, 25]
2 2 12 {0: 1}

df.fillna(value={'C': dict()})
A B C
0 NaN 10 [20, 21, 22]
1 1 NaN [23, 24, 25]
2 2 12 {}

So it appears that fillna() is making a lot of decisions about how the fill values should be applied, and certain desired outcomes can't be achieved because it's being too 'clever'.

@jreback
Contributor

jreback commented Apr 23, 2013

lists are not allowed (for the reasons you show), should raise on this (only scalar or dict are valid)
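
For readers on a current pandas release, the rule stated above can be checked with a short sketch. It assumes a pandas version that already includes the fix referenced in this issue (#3585); the wording of the error message differs between versions, so only the exception type is checked. The frame here is a reduced version of the one in the report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, 1, 2], "B": [10, np.nan, 12]})

# A list passed as `value` is rejected outright instead of being
# interpreted position by position.
try:
    df.fillna(value=[100, 101])
    raised = False
except TypeError:
    raised = True

# A scalar fills every hole; a dict fills per column.
filled_scalar = df.fillna(0)
filled_dict = df.fillna({"A": -1, "B": -2})
```

With a dict, only the columns named as keys are touched, which is the per-column control the report was after for scalar fill values.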

@jreback
Contributor

jreback commented Apr 23, 2013

FYI keeping lists in a frame, while allowed, is not efficient at all. What exactly are you trying to accomplish?

@ijmcf
Author

ijmcf commented Apr 23, 2013

Good question. I am creating a DataFrame containing a number of key elements of information on a daily process - some of those elements are singular (floats, integers, strings), but some are multiple - and the number of elements can vary day by day from 0 to n. I'm storing those elements currently as lists.

For example, something like the dummy data frame I used in the notes on the Issue.

If you have any suggestions for alternative approaches, I'd be glad to hear them.

Thanks
Iain

On Tuesday, April 23, 2013 at 12:50 PM, jreback wrote:

FYI keeping lists in a frame, while allowed, is not efficient at all. What exactly are you trying to accomplish?


@jreback
Contributor

jreback commented Apr 23, 2013

I would use multiple df's in this case, maybe indexed by a common element
(and then wrap a class around it to manage it)

for your singular elements it looks like a single df is good
for the multiple ones

use another frame that is indexed 0..n (could be along index or columns whatever makes sense)

when you are mixing hierarchical and non-hierarchical (singular) data, it is better to use different objects
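
A minimal sketch of the two-frame layout suggested above. The column names (`day`, `value`), the day labels, and the data are made up for illustration, and the wrapping class is left out.

```python
import pandas as pd

# One row per day for the single-valued elements.
daily = pd.DataFrame(
    {"A": [1.5, 2.0, 0.5], "B": [10, 11, 12]},
    index=pd.Index(["d1", "d2", "d3"], name="day"),
)

# One row per (day, element) for the multi-valued elements.
# A day with zero elements simply has no rows here, so no NaN or
# empty-list placeholder is needed.
items = pd.DataFrame(
    {"day": ["d1", "d1", "d1", "d2", "d2"], "value": [20, 21, 22, 23, 24]}
)

# Per-day element counts, aligned to the daily frame; days without
# any rows in `items` get a count of 0.
counts = items.groupby("day")["value"].size().reindex(daily.index, fill_value=0)

# Recover the list for one day only when it is actually needed.
d1_values = items.loc[items["day"] == "d1", "value"].tolist()
d3_values = items.loc[items["day"] == "d3", "value"].tolist()
```

In this layout the "0 to n elements per day" case never produces a NaN in a list-valued cell, so the fillna() question does not come up for the multi-valued data.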

@jreback
Contributor

jreback commented May 13, 2013

closed by #3585

@jreback jreback closed this as completed May 13, 2013
@ariddell

Is there any alternative here? I frequently see R dataframes that contain lists. Sometimes one needs a little unnormalized data to be associated with a record.

@jreback
Contributor

jreback commented Sep 10, 2013

can you give an example of input and output?

@jtratner
Contributor

Could you use a tuple?

@ariddell

Just for the record, here's no less an authority than Trevor Hastie cramming
data structures inside a data frame in R.

> library(lars)
Loaded lars 1.2

> data(diabetes)
> str(diabetes)
'data.frame':   442 obs. of  3 variables:
$ x : AsIs [1:442, 1:10] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr  "age" "sex" "bmi" "map" ...
$ y : num  151 75 141 206 135 97 138 63 110 310 ...
$ x2: AsIs [1:442, 1:64] 0.038075.... -0.00188.... 0.085298.... -0.08906.... 0.005383.... ...
..- attr(*, ".Names")= chr  "age" "age" "age" "age" ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr  "1" "2" "3" "4" ...
.. ..$ : chr  "age" "sex" "bmi" "map" ...

Here's my more modest example:

In [3]: df = pd.DataFrame.from_records([dict(id=10, languages=('en','de')), dict(id=11)])

In [4]: df
Out[4]: 
   id languages
0  10  (en, de)
1  11       NaN

In [7]: # doesn't work

In [8]: df.fillna(tuple())
Out[8]: 
   id languages
0  10  (en, de)
1  11       NaN

In [9]: # doesn't work either

In [10]: df.fillna([])
Out[10]: 
   id languages
0  10  (en, de)
1  11       NaN

In [11]: # best I can do

In [12]: df.fillna(set())
Out[12]: 
   id languages
0  10  (en, de)
1  11        ()

I'm using a release version of pandas -- but I gather the list and tuple will raise exceptions.

@jreback
Contributor

jreback commented Sep 10, 2013

in an object column (e.g. strings) this is easy and natural

my hesitation is that if you did this in a float column, it would convert to an object dtype;
that's the real issue

aside from 'accidentally' putting in a list (when you don't mean it)
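
The dtype concern can be seen without fillna() at all. The sketch below uses `Series.apply` purely as a stand-in for "put a list where the NaN was"; it is not what fillna() does internally.

```python
import numpy as np
import pandas as pd

floats = pd.Series([1.0, np.nan, 3.0])

# Replacing the NaN with a Python list forces every element to be
# stored as a generic Python object.
with_list = floats.apply(lambda x: [] if pd.isna(x) else x)

before = floats.dtype    # float64
after = with_list.dtype  # object
```

Once the column is object dtype, the numeric values are kept as boxed Python objects and the fast float code paths no longer apply to that column.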

@cpcloud
Member

cpcloud commented Sep 10, 2013

That data set is a nice example of how not to structure your data. Using I() to stuff things in a data.frame just seems like a terrible idea.

@ariddell

I like my example of putting lists or tuples. They are perfectly valid NumPy object arrays. A string with comma delimiters just isn't a general option -- what if the underlying strings contain commas?

Now that I think about it -- what is the workaround? I can't do this:

In [10]: df = pd.DataFrame.from_records([dict(id=10, languages=('en','de')), dict(id=11)])

In [11]: df.languages[pd.isnull(df.languages)] = tuple()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

I suppose json.dumps() and json.loads() is probably the way to go?
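
A rough sketch of that JSON round trip, using only the standard library. The records mirror the example above, plus one made-up entry whose values contain commas.

```python
import json

records = [
    {"id": 10, "languages": ("en", "de")},
    {"id": 11},  # no languages recorded
    {"id": 12, "languages": ("a,b", "c")},  # values that contain commas
]

# Encode every record's languages as a JSON string; a missing entry
# becomes the JSON text for an empty list.
encoded = [json.dumps(list(r.get("languages", ()))) for r in records]

# Decode back to Python lists when the values are needed.
decoded = [json.loads(s) for s in encoded]
```

The encoded values are plain strings, so they sit in an ordinary string column and embedded commas are not a problem. One caveat: tuples come back as lists after decoding.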

@BrenBarn

Is there an actual solution to this? What are you supposed to do if you actually want a DataFrame/Series whose values are lists, and you want to replace NaN values with an empty list?

@jreback
Contributor

jreback commented Apr 27, 2014

@BrenBarn you are welcome to open an issue to support this, would be ok. But as you know supporting lists in a frame is problematic at best (e.g. setting is pretty much impossible), so this has very limited uses, and I would never recommend using it.
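
For the narrow case asked about (an object column of lists where NaN should become an empty list), one commonly used workaround is an element-wise `apply` rather than fillna(). This is a sketch of that workaround, not something proposed by the maintainers in this thread, and the caveats about list-valued columns raised above still apply.

```python
import pandas as pd

df = pd.DataFrame.from_records(
    [dict(id=10, languages=["en", "de"]), dict(id=11)]
)

# Keep existing lists and replace anything else (here: NaN) with a
# new empty list. The lambda builds a fresh list for each row, so
# rows do not share one mutable object.
df["languages"] = df["languages"].apply(
    lambda v: v if isinstance(v, list) else []
)
```

The `isinstance` check is used instead of a null test because null checks on list-valued cells are awkward: the value being tested may itself be a sequence.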

@Pranjalya

The dict doesn't work now. :-(

@rohetoric

Sorry but why is this issue closed? What is the solution here?
