ERR: better error message on invalid return from .apply #13820

mhabets · 2016-07-27T15:37:16Z

Code Sample

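import numpy as np
import pandas as pd

def apply_list(row):
    # Return a plain Python list for each row
    return [2*row['A'], 4*row['C'], 3*row['B']]

df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
df['E'] = pd.Timestamp('20130102')

# On pandas 0.18.1 this line raises the ValueError shown below
df['L'] = df.apply(apply_list, axis=1)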

(Wrong) Error Output

ValueError: Shape of passed values is (6, 3), indices imply (6, 7)

Expected Result & Use case

Expected: a Series containing one list per row. In my real case each list is the result of a matrix multiplication computed from values in other columns (see the complete example in the next section).
Note: this works when the DataFrame has no datetime column, but fails when it has one. Without the datetime column I get:

0  [0.513057023122, -0.473155481431, -4.51039058299]  
1    [-0.331758428452, 3.92166465759, 2.75920806524]  
2    [-2.07257568656, 1.22070341071, 0.809676040678]  
3       [3.38079201699, 2.77074189984, 1.4351938036]  
4      [2.27838740024, 3.04253558763, 3.57504651688]  
5     [1.55497385554, 8.01021203173, -2.22893240905]

Complete example with matrix multiplication

import numpy as np
import pandas as pd
from numpy import pi, mat, cos, sin

def rotation(lon_rad, lat_rad):
    c_lat = cos(lat_rad); s_lat = sin(lat_rad)
    c_lon = cos(lon_rad); s_lon = sin(lon_rad)
    R = mat([[-s_lon, -s_lat*c_lon, c_lat*c_lon],
             [ c_lon, -s_lat*s_lon, c_lat*s_lon],
             [     0,        c_lat,       s_lat]])
    return R.T

def row_rotation(row):
    # Call the rotation() helper defined above; .tolist() converts the matrix to nested lists
    return rotation(row['A']*pi/180.0, row['B']*pi/180.0).tolist()

df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
# If I don't add the line below, I get what I want
df['E'] = pd.Timestamp('20130102')

df['M'] = df.apply(row_rotation, axis=1)

It works as I want if there is no datetime column in df:

          A         B         C         D  \
0  0.498796  0.209267 -0.551132 -0.066941   
1 -0.576161 -0.160802  0.895587 -0.355315   
2 -0.604028  0.959712 -1.388824 -0.356734   
3 -0.351012  0.242736  0.339068 -0.531425   
4 -1.355900  0.058885  1.458610  1.891502   
5 -0.560297  0.750459  1.288340  0.904650   

                                                   M  
0  [[-0.008705521998618918, 0.9999621062253967, 0...  
1  [[0.010055737814803208, 0.9999494397903326, 0....  
2  [[0.010542084651556644, 0.9999444306816251, 0....  
3  [[0.006126282255403784, 0.9999812341567851, 0....  
4  [[0.023662708414555287, 0.99971999891494, 0.0]...  
5  [[0.00977887456308289, 0.9999521856630343, 0.0...

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-92-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 22.0.5
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.1
xlsxwriter: 0.8.9
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

TomAugspurger · 2016-07-27T16:50:42Z

Mind explaining your actual use-case a bit? In general Series / DataFrames don't work very well with nested data, though this could change in the future.

As a workaround you can achieve this by calling your function on each element of df.itertuples(), and wrapping the output in a new pd.Series call. It'll probably be faster anyway.
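A minimal sketch of that workaround, assuming the same columns as the original code sample. The list comprehension and the attribute-style access on the named tuples are illustrative choices, not something spelled out in the comment above:

```python
import numpy as np
import pandas as pd

def apply_list(row):
    # `row` is a namedtuple produced by itertuples(), so columns are attributes
    return [2 * row.A, 4 * row.C, 3 * row.B]

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
df["E"] = pd.Timestamp("20130102")

# Call the function once per row and wrap the results in a new Series,
# reusing the frame's index so the assignment aligns row by row
df["L"] = pd.Series(
    [apply_list(row) for row in df.itertuples(index=False)],
    index=df.index,
)
```

Because the lists are built outside of `.apply`, pandas never tries to expand them into columns, so the datetime column does not matter here.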

jreback · 2016-07-27T21:23:23Z

yeah this is completely non-idiomatic

you can do this:

In [6]: def apply_list (row):
   ...:      return pd.Series([2*row['A'], 4*row['C'], 3*row['B']], index=list('ACB'))
   ...: 
   ...: 
   ...: 
   ...: df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
   ...: df['etime'] = pd.Timestamp('20130102')
   ...: 
   ...: df.apply(apply_list, axis=1)
   ...: 
   ...: 
Out[6]: 
          A         C         B
0  0.519831 -0.869646 -0.293619
1 -3.323751  3.305479  2.619803
2  0.515751 -1.192321 -4.504805
3  0.319642  3.488812  4.289383
4  2.328899 -3.196405  0.980369
5  0.254868  3.926900 -4.392621

pandas expects to be able to coerce non-same-sized output according to the original size if it's not labeled. So I would expect your original example to fail. I suppose it could raise a better error message though.

It's just not reasonable to return arbitrary output and have pandas automatically assign labels to it.

mhabets · 2016-07-28T08:54:09Z

Thank you both for your answers.
My issue description was not clear enough, so I added my use case and some comments.
In short:

I would like to keep the list in a single column or a Series because it is the result of a matrix multiplication.
This works fine with a DataFrame that has no datetime column, but it fails once such a column is present.

So I am not sure the corrected issue title matches my case. @jreback could you double check with the new elements I added?

A workaround I found is to convert the list to a string, which forces it to stay in a single column:

def apply_list (row):
     return str([2*row['A'], 4*row['C'], 3*row['B']])
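If the lists need to be recovered later, one variation on this string workaround is to serialize with JSON instead of `str`. This is only a sketch under assumptions: the helper name `apply_list_json` and the use of the `json` module are illustrative and do not come from the thread.

```python
import json

import numpy as np
import pandas as pd

def apply_list_json(row):
    values = [2 * row["A"], 4 * row["C"], 3 * row["B"]]
    # Convert to plain Python floats so the values are JSON serializable
    return json.dumps([float(v) for v in values])

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))
df["E"] = pd.Timestamp("20130102")

# Each row yields a single string, so the result is a Series of strings
df["L"] = df.apply(apply_list_json, axis=1)

# Decode the strings back into Python lists when they are needed
decoded = df["L"].map(json.loads)
```

The string column stays a scalar per row, so it sidesteps the shape check, at the cost of an extra decoding step.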

mroeschke · 2019-10-09T05:10:41Z

This looks to work on master/0.25.1 now. Could use a regression test.

In [28]: df
Out[28]:
          A         B         C         D          E                                                  L
0  0.169594  0.347718  0.936113 -0.900721 2013-01-02  [0.3391883065385159, 3.7444520669189956, 1.043...
1 -0.712949  0.048832 -0.123305  0.151001 2013-01-02  [-1.4258988792277114, -0.4932202558102186, 0.1...
2  0.190819 -0.589087  0.681428 -0.260025 2013-01-02  [0.38163802383994455, 2.7257105668473915, -1.7...
3 -2.494228  0.216944 -1.817156  1.918589 2013-01-02  [-4.988455103347858, -7.268622296521761, 0.650...
4 -0.576449 -0.766641 -0.024570  0.015905 2013-01-02  [-1.152898189915345, -0.09828048570455293, -2....
5 -1.850767 -0.900046 -0.211328  0.853045 2013-01-02  [-3.701534292166406, -0.8453125545428867, -2.7...

In [29]: pd.__version__
Out[29]: '0.25.1'
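A regression test along these lines might look like the sketch below. It is not the test that was eventually merged; the test name and the fixed input values are assumptions chosen to make the expected lists easy to check by hand, and it assumes a pandas version where the behaviour shown above holds (0.25.1 or later).

```python
import pandas as pd

def apply_list(row):
    return [2 * row["A"], 4 * row["C"], 3 * row["B"]]

def test_apply_list_with_datetime_column():
    # Hypothetical test name; fixed values keep the expected output deterministic
    df = pd.DataFrame(
        {"A": [1.0, 2.0], "B": [3.0, 4.0], "C": [5.0, 6.0], "D": [7.0, 8.0]}
    )
    df["E"] = pd.Timestamp("20130102")

    result = df.apply(apply_list, axis=1)

    # The lists should stay intact as the elements of a single Series
    assert isinstance(result, pd.Series)
    assert result.tolist() == [[2.0, 20.0, 9.0], [4.0, 24.0, 12.0]]

test_apply_list_with_datetime_column()
```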

TomAugspurger added the Bug label Jul 27, 2016

jreback added the Error Reporting and Difficulty Intermediate labels and removed the Bug label Jul 27, 2016

jreback changed the title ~~Wrong shape error with apply function on a DF with a datetime column~~ ERR: better error message on invalid return from .apply Jul 27, 2016

jreback added the Usage Question label Jul 27, 2016

mroeschke added the good first issue and Needs Tests labels and removed the Difficulty Intermediate, Error Reporting, and Usage Question labels Oct 9, 2019

mroeschke mentioned this issue Jan 7, 2020

TST: Add tests for fixed issues #30769

Merged

simonjayhawkins added this to the 1.0 milestone Jan 7, 2020

mroeschke closed this as completed in #30769 Jan 7, 2020
