Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame.replace is broken in 0.12-dev #4115

Closed
vfilimonov opened this issue Jul 3, 2013 · 18 comments · Fixed by #4117
Closed

BUG: DataFrame.replace is broken in 0.12-dev #4115

vfilimonov opened this issue Jul 3, 2013 · 18 comments · Fixed by #4117
Labels
Milestone

Comments

@vfilimonov
Copy link
Contributor

the DataFrame.replace seems to be broken in 0.12 development version

this code:

df = pd.DataFrame({'Type':['Q','T','Q','Q','T'], 'tmp':2})
print df
print df.replace({'Type': {'Q': 0, 'T': 1}})

Return correct result in 0.11.0:

  Type  tmp
0    Q    2
1    T    2
2    Q    2
3    Q    2
4    T    2
5    T    2
  Type  tmp
0    0    2
1    1    2
2    0    2
3    0    2
4    1    2
5    1    2

and in 0.12.0.dev-795004d it does not work:


  Type  tmp
0    Q    2
1    T    2
2    Q    2
3    Q    2
4    T    2
5    T    2
   Type  tmp
0     0    2
1     1    2
2     0    2
3     1    2
4     0    2
5     1    2

Series.replace it works well, so the workaround for this bug is the following:

df['Type'] = df['Type'].replace({'Q': 0, 'T': 1})
@jreback
Copy link
Contributor

jreback commented Jul 3, 2013

@cpcloud take a look?

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

sure

while i'm fixing you can also do

df.replace({'Type': {'Q': 0}}).replace({'Type': {'T': 1}})

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

ah i remember this being a bit tricky. was the last change i made before submitting the original pr. @vfilimonov thanks for the report

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

it would be cool to look at the pandas devs "reaction time" (time to submit a pr after a bug report)...i wonder if the gh api has such information

@vfilimonov
Copy link
Contributor Author

@cpcloud, thank you very much for taking care!

Aside: the variant with Series seems to be much faster if there are a lot of columns of different types. When the number of columns is not big the difference is not that large.

N = 1999999
df = pd.DataFrame({'Type':['T' if x>0.5 else 'Q' for x in np.random.rand(N)],
                   'tmp1':np.random.rand(N),'tmp2':np.random.rand(N),
                   'tmp3':'tmp','tmp4':np.random.rand(N),
                   'tmp5':np.random.rand(N),'tmp6':2,
                   'tmp7':np.random.rand(N),'tmp8':3,
                   'tmp9':pd.Timestamp('2000-01-01'),'tmp10':np.random.rand(N)})
df2 = df.copy()

%time df['Type']=df['Type'].replace({'Q': 0, 'T': 1})
%time df2=df2.replace({'Type': {'Q': 0}}).replace({'Type': {'T': 1}})

CPU times: user 0.16 s, sys: 0.01 s, total: 0.17 s
Wall time: 0.17 s
CPU times: user 0.73 s, sys: 0.18 s, total: 0.91 s
Wall time: 0.91 s

@ghost ghost assigned cpcloud Jul 3, 2013
@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

series is doing something a bit different...it uses a lot of cythonized internals to do the work at the expense of being slightly less general, e.g., it doesn't accept regex. this will most likely change in 0.13 when series becomes an ndframe since all of the replace stuff is happening at the Block level

@vfilimonov
Copy link
Contributor Author

The reaction time is really incredible. It is the second time when I submit a report and you guys fix the bug in 30 minutes! Thanks a lot

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

@vfilimonov thanks for the kind words! we appreciate it.

@jtratner
Copy link
Contributor

jtratner commented Jul 3, 2013

@cpcloud honestly seems like it would be pretty easy to change Series internals... Only issue is that it (might be) a bit slower...

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

i agree w/ u. just commenting on why it's a bit faster. it uses masking via cython whereas i'm using np.vectorize. a while back @jreback and i compared timings using pandas cythonized vectorize replacement and it was actually slower than vectorize, but not by much.

@jtratner
Copy link
Contributor

jtratner commented Jul 3, 2013

@cpcloud if pandas dropped the vectorize, it could use nearly the same functions for arithmetic for Series and Frame (and probably even for Panel).

@jtratner
Copy link
Contributor

jtratner commented Jul 3, 2013

@cpcloud that said, having code in 3 places is not a big deal...performance is more important.

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

there's are only 3 places where vectorize is used in pandas

./pandas/core/internals.py:
  830 :         f = np.vectorize(re_replacer, otypes=[self.dtype])
./pandas/io/pytables.py:
 3383 :         f = np.vectorize(lambda x: x.encode(encoding), otypes=[np.object])
 3402 :         f = np.vectorize(lambda x: x.decode(encoding),otypes=[np.object])

are you referring to use of things like map_infer_types?

@jreback
Copy link
Contributor

jreback commented Jul 3, 2013

keep in kind this is all going to change once series refactor
replace is a bit tricky but the underlying methods will all be the same (as what frame uses now)

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

yes we should wait for the motherload

@jreback
Copy link
Contributor

jreback commented Jul 3, 2013

@cpcloud the last 2 are because I saw u using it!
and didn't feel like dealing with strings/Unicode in cython

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

ha! str and unicode are super annoying in Cython when it comes to py2/3 issues and using char*.

@cpcloud
Copy link
Member

cpcloud commented Jul 3, 2013

which is really just because using char* for unicode is asking for trouble

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants