BUG: DataFrame.replace is broken in 0.12-dev #4115

vfilimonov · 2013-07-03T15:35:54Z

the DataFrame.replace seems to be broken in 0.12 development version

this code:

df = pd.DataFrame({'Type':['Q','T','Q','Q','T'], 'tmp':2})
print df
print df.replace({'Type': {'Q': 0, 'T': 1}})

Return correct result in 0.11.0:

  Type  tmp
0    Q    2
1    T    2
2    Q    2
3    Q    2
4    T    2
5    T    2
  Type  tmp
0    0    2
1    1    2
2    0    2
3    0    2
4    1    2
5    1    2

and in 0.12.0.dev-795004d it does not work:


  Type  tmp
0    Q    2
1    T    2
2    Q    2
3    Q    2
4    T    2
5    T    2
   Type  tmp
0     0    2
1     1    2
2     0    2
3     1    2
4     0    2
5     1    2

Series.replace it works well, so the workaround for this bug is the following:

df['Type'] = df['Type'].replace({'Q': 0, 'T': 1})

The text was updated successfully, but these errors were encountered:

jreback · 2013-07-03T15:41:04Z

@cpcloud take a look?

cpcloud · 2013-07-03T15:46:24Z

sure

while i'm fixing you can also do

df.replace({'Type': {'Q': 0}}).replace({'Type': {'T': 1}})

cpcloud · 2013-07-03T15:48:51Z

ah i remember this being a bit tricky. was the last change i made before submitting the original pr. @vfilimonov thanks for the report

cpcloud · 2013-07-03T16:11:33Z

it would be cool to look at the pandas devs "reaction time" (time to submit a pr after a bug report)...i wonder if the gh api has such information

vfilimonov · 2013-07-03T16:11:53Z

@cpcloud, thank you very much for taking care!

Aside: the variant with Series seems to be much faster if there are a lot of columns of different types. When the number of columns is not big the difference is not that large.

N = 1999999
df = pd.DataFrame({'Type':['T' if x>0.5 else 'Q' for x in np.random.rand(N)],
                   'tmp1':np.random.rand(N),'tmp2':np.random.rand(N),
                   'tmp3':'tmp','tmp4':np.random.rand(N),
                   'tmp5':np.random.rand(N),'tmp6':2,
                   'tmp7':np.random.rand(N),'tmp8':3,
                   'tmp9':pd.Timestamp('2000-01-01'),'tmp10':np.random.rand(N)})
df2 = df.copy()

%time df['Type']=df['Type'].replace({'Q': 0, 'T': 1})
%time df2=df2.replace({'Type': {'Q': 0}}).replace({'Type': {'T': 1}})

CPU times: user 0.16 s, sys: 0.01 s, total: 0.17 s
Wall time: 0.17 s
CPU times: user 0.73 s, sys: 0.18 s, total: 0.91 s
Wall time: 0.91 s

cpcloud · 2013-07-03T16:13:42Z

series is doing something a bit different...it uses a lot of cythonized internals to do the work at the expense of being slightly less general, e.g., it doesn't accept regex. this will most likely change in 0.13 when series becomes an ndframe since all of the replace stuff is happening at the Block level

vfilimonov · 2013-07-03T16:14:02Z

The reaction time is really incredible. It is the second time when I submit a report and you guys fix the bug in 30 minutes! Thanks a lot

cpcloud · 2013-07-03T16:14:58Z

@vfilimonov thanks for the kind words! we appreciate it.

jtratner · 2013-07-03T21:48:08Z

@cpcloud honestly seems like it would be pretty easy to change Series internals... Only issue is that it (might be) a bit slower...

cpcloud · 2013-07-03T22:23:40Z

i agree w/ u. just commenting on why it's a bit faster. it uses masking via cython whereas i'm using np.vectorize. a while back @jreback and i compared timings using pandas cythonized vectorize replacement and it was actually slower than vectorize, but not by much.

jtratner · 2013-07-03T22:25:06Z

@cpcloud if pandas dropped the vectorize, it could use nearly the same functions for arithmetic for Series and Frame (and probably even for Panel).

jtratner · 2013-07-03T22:26:01Z

@cpcloud that said, having code in 3 places is not a big deal...performance is more important.

cpcloud · 2013-07-03T22:27:45Z

there's are only 3 places where vectorize is used in pandas

./pandas/core/internals.py:
  830 :         f = np.vectorize(re_replacer, otypes=[self.dtype])
./pandas/io/pytables.py:
 3383 :         f = np.vectorize(lambda x: x.encode(encoding), otypes=[np.object])
 3402 :         f = np.vectorize(lambda x: x.decode(encoding),otypes=[np.object])

are you referring to use of things like map_infer_types?

jreback · 2013-07-03T22:29:06Z

keep in kind this is all going to change once series refactor
replace is a bit tricky but the underlying methods will all be the same (as what frame uses now)

cpcloud · 2013-07-03T22:29:58Z

yes we should wait for the motherload

jreback · 2013-07-03T22:30:48Z

@cpcloud the last 2 are because I saw u using it!
and didn't feel like dealing with strings/Unicode in cython

cpcloud · 2013-07-03T22:38:48Z

ha! str and unicode are super annoying in Cython when it comes to py2/3 issues and using char*.

cpcloud · 2013-07-03T22:39:18Z

which is really just because using char* for unicode is asking for trouble

cpcloud mentioned this issue Jul 3, 2013

BUG: fix replace bug when a nested dict was passed #4117

Merged

ghost assigned cpcloud Jul 3, 2013

cpcloud closed this as completed in #4117 Jul 3, 2013

wesm unassigned cpcloud Oct 12, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: DataFrame.replace is broken in 0.12-dev #4115

BUG: DataFrame.replace is broken in 0.12-dev #4115

vfilimonov commented Jul 3, 2013

jreback commented Jul 3, 2013

cpcloud commented Jul 3, 2013

cpcloud commented Jul 3, 2013

cpcloud commented Jul 3, 2013

vfilimonov commented Jul 3, 2013

cpcloud commented Jul 3, 2013

vfilimonov commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jtratner commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jtratner commented Jul 3, 2013

jtratner commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jreback commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jreback commented Jul 3, 2013

cpcloud commented Jul 3, 2013

cpcloud commented Jul 3, 2013

BUG: DataFrame.replace is broken in 0.12-dev #4115

BUG: DataFrame.replace is broken in 0.12-dev #4115

Comments

vfilimonov commented Jul 3, 2013

jreback commented Jul 3, 2013

cpcloud commented Jul 3, 2013

cpcloud commented Jul 3, 2013

cpcloud commented Jul 3, 2013

vfilimonov commented Jul 3, 2013

cpcloud commented Jul 3, 2013

vfilimonov commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jtratner commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jtratner commented Jul 3, 2013

jtratner commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jreback commented Jul 3, 2013

cpcloud commented Jul 3, 2013

jreback commented Jul 3, 2013

cpcloud commented Jul 3, 2013

cpcloud commented Jul 3, 2013