ERR: better error message on invalid return from .apply #13820

Closed
mhabets opened this issue Jul 27, 2016 · 4 comments · Fixed by #30769
Labels: good first issue, Needs Tests (unit test(s) needed to prevent regressions)

mhabets commented Jul 27, 2016

Code Sample

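import numpy as np
import pandas as pd

def apply_list(row):
    # Return a plain list (not a Series) for each row
    return [2*row['A'], 4*row['C'], 3*row['B']]

df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
# The datetime column is what triggers the error
df['E'] = pd.Timestamp('20130102')

df['L'] = df.apply(apply_list, axis=1)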

(Wrong) Error Output

ValueError: Shape of passed values is (6, 3), indices imply (6, 7)

Expected Result & Use case

Expected: a Series containing one list per row. In my real case each list is the result of a matrix multiplication computed from values in other columns (see the complete example in the next section).
Note: this works with a DataFrame that has no datetime column, but fails once a datetime column is present. Output without the datetime column:

0  [0.513057023122, -0.473155481431, -4.51039058299]  
1    [-0.331758428452, 3.92166465759, 2.75920806524]  
2    [-2.07257568656, 1.22070341071, 0.809676040678]  
3       [3.38079201699, 2.77074189984, 1.4351938036]  
4      [2.27838740024, 3.04253558763, 3.57504651688]  
5     [1.55497385554, 8.01021203173, -2.22893240905]  

Complete example with matrix multiplication

import numpy as np
import pandas as pd
from numpy import pi, mat, cos, sin

def rotation(lon_rad, lat_rad):
    c_lat = cos(lat_rad); s_lat = sin(lat_rad)
    c_lon = cos(lon_rad); s_lon = sin(lon_rad)
    R = mat([[-s_lon, -s_lat*c_lon, c_lat*c_lon],
             [ c_lon, -s_lat*s_lon, c_lat*s_lon],
             [     0,        c_lat,       s_lat]])
    return R.T

def row_rotation(row):
    # Convert degrees to radians, then return the 3x3 matrix as a nested list
    return rotation(row['A']*pi/180.0, row['B']*pi/180).tolist()

df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
# If I don't add the line below, I get what I want
df['E'] = pd.Timestamp('20130102')

df['M'] = df.apply(row_rotation, axis=1)

It works as I want if there is no datetime column in df:

          A         B         C         D  \
0  0.498796  0.209267 -0.551132 -0.066941   
1 -0.576161 -0.160802  0.895587 -0.355315   
2 -0.604028  0.959712 -1.388824 -0.356734   
3 -0.351012  0.242736  0.339068 -0.531425   
4 -1.355900  0.058885  1.458610  1.891502   
5 -0.560297  0.750459  1.288340  0.904650   

                                                   M  
0  [[-0.008705521998618918, 0.9999621062253967, 0...  
1  [[0.010055737814803208, 0.9999494397903326, 0....  
2  [[0.010542084651556644, 0.9999444306816251, 0....  
3  [[0.006126282255403784, 0.9999812341567851, 0....  
4  [[0.023662708414555287, 0.99971999891494, 0.0]...  
5  [[0.00977887456308289, 0.9999521856630343, 0.0...  

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-92-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 22.0.5
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.1
xlsxwriter: 0.8.9
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext)
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

TomAugspurger (Contributor) commented

Mind explaining your actual use-case a bit? In general Series / DataFrames don't work very well with nested data, though this could change in the future.

As a workaround you can achieve this by calling your function on each element of df.itertuples(), and wrapping the output in a new pd.Series call. It'll probably be faster anyway.
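
A minimal sketch of the workaround described above, assuming fixed input data in place of the random values from the report. The attribute access (`row.A`) relies on `itertuples()` yielding namedtuples; the function body is otherwise the reporter's.

```python
import numpy as np
import pandas as pd


def apply_list(row):
    # Each row from itertuples() is a namedtuple, so columns are attributes
    return [2 * row.A, 4 * row.C, 3 * row.B]


# Fixed data (an assumption for reproducibility; the report used random values)
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4), columns=list("ABCD"))
df["E"] = pd.Timestamp("20130102")

# Call the function on every row, then wrap the per-row lists in a new Series
df["L"] = pd.Series(
    [apply_list(row) for row in df.itertuples(index=False)],
    index=df.index,
)
```

Passing `index=df.index` keeps the new Series aligned with the frame even when the index is not the default range.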

jreback (Contributor) commented Jul 27, 2016

yeah this is completely non-idiomatic

you can do this:

In [6]: def apply_list(row):
   ...:     return pd.Series([2*row['A'], 4*row['C'], 3*row['B']], index=list('ACB'))
   ...: 
   ...: df = pd.DataFrame(np.random.randn(6,4), columns=list('ABCD'))
   ...: df['etime'] = pd.Timestamp('20130102')
   ...: 
   ...: df.apply(apply_list, axis=1)
Out[6]: 
          A         C         B
0  0.519831 -0.869646 -0.293619
1 -3.323751  3.305479  2.619803
2  0.515751 -1.192321 -4.504805
3  0.319642  3.488812  4.289383
4  2.328899 -3.196405  0.980369
5  0.254868  3.926900 -4.392621

pandas expects to be able to coerce output whose size differs from the input back to the original shape if it's not labeled, so I would expect your original example to fail. I suppose it could raise a better error message, though.

It's just not reasonable to return arbitrary output and have pandas automatically assign labels to it.
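
For readers on newer releases: a hedged sketch, assuming pandas 0.23 or later, where `DataFrame.apply` accepts a `result_type` argument. That argument did not exist in the 0.18.1 version reported above. It lets the caller state explicitly whether list-like results stay as one list per row or are expanded into columns. The fixed input data is an assumption for reproducibility.

```python
import numpy as np
import pandas as pd


def apply_list(row):
    return [2 * row["A"], 4 * row["C"], 3 * row["B"]]


# Fixed data (an assumption; the report used random values)
df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4), columns=list("ABCD"))
df["etime"] = pd.Timestamp("20130102")

# "reduce": keep each returned list as a single element of a Series
as_lists = df.apply(apply_list, axis=1, result_type="reduce")

# "expand": turn each returned list into columns (labeled 0, 1, 2 here)
as_columns = df.apply(apply_list, axis=1, result_type="expand")
```

Returning a labeled `pd.Series` from the function, as in the session above, remains the way to control the resulting column names.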

jreback added the Error Reporting (incorrect or improved errors from pandas) and Difficulty Intermediate labels and removed the Bug label on Jul 27, 2016
jreback changed the title from "Wrong shape error with apply function on a DF with a datetime column" to "ERR: better error message on invalid return from .apply" on Jul 27, 2016

mhabets (Author) commented Jul 28, 2016

Thank you both for your answers.
My issue description was not clear enough, so I added my use case and some comments.
In short:

  • I would like to keep the list in a single column or a Series because it is the result of a matrix multiplication.
  • It works fine with a DataFrame without a datetime column, but it fails as soon as the DataFrame has one.

So I am not sure the corrected issue title matches my case. @jreback, could you double-check with the new elements I added?

A workaround I found is to convert the list to a string, which forces it to stay in a single column:

def apply_list(row):
    return str([2*row['A'], 4*row['C'], 3*row['B']])

mroeschke (Member) commented

This looks to work on master/0.25.1 now. Could use a regression test.

In [28]: df
Out[28]:
          A         B         C         D          E                                                  L
0  0.169594  0.347718  0.936113 -0.900721 2013-01-02  [0.3391883065385159, 3.7444520669189956, 1.043...
1 -0.712949  0.048832 -0.123305  0.151001 2013-01-02  [-1.4258988792277114, -0.4932202558102186, 0.1...
2  0.190819 -0.589087  0.681428 -0.260025 2013-01-02  [0.38163802383994455, 2.7257105668473915, -1.7...
3 -2.494228  0.216944 -1.817156  1.918589 2013-01-02  [-4.988455103347858, -7.268622296521761, 0.650...
4 -0.576449 -0.766641 -0.024570  0.015905 2013-01-02  [-1.152898189915345, -0.09828048570455293, -2....
5 -1.850767 -0.900046 -0.211328  0.853045 2013-01-02  [-3.701534292166406, -0.8453125545428867, -2.7...

In [29]: pd.__version__
Out[29]: '0.25.1'
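
A regression test along these lines might look like the sketch below. It is not the test that was merged in #30769. The test name and the fixed input data are assumptions chosen for illustration, and it presumes a pandas version where the original example works (0.25.1 or later, per the comment above).

```python
import numpy as np
import pandas as pd


def apply_list(row):
    return [2 * row["A"], 4 * row["C"], 3 * row["B"]]


def test_apply_returning_list_with_datetime_column():
    # Hypothetical test name; fixed data keeps the expected values deterministic
    df = pd.DataFrame(np.arange(24, dtype=float).reshape(6, 4), columns=list("ABCD"))
    df["E"] = pd.Timestamp("20130102")

    result = df.apply(apply_list, axis=1)

    # One list per row, computed independently of DataFrame.apply
    expected = [[2 * a, 4 * c, 3 * b] for a, b, c in zip(df["A"], df["B"], df["C"])]
    assert isinstance(result, pd.Series)
    assert result.tolist() == expected

    # Assigning the result back as a single column must also succeed
    df["L"] = result
    assert list(df.columns) == list("ABCDE") + ["L"]
```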

mroeschke added the good first issue and Needs Tests (unit test(s) needed to prevent regressions) labels and removed the Difficulty Intermediate, Error Reporting (incorrect or improved errors from pandas), and Usage Question labels on Oct 9, 2019
simonjayhawkins added this to the 1.0 milestone on Jan 7, 2020