## Reindexing and altering labels

In [2]:
import numpy as np
import pandas as pd

``reindex()`` is the __fundamental data alignment__ method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. __To reindex means to conform the data to match a given set of labels along a particular axis.__ This accomplishes several things:

1. __Reorders the existing data to match a new set of labels__
2. __Inserts missing value (NA) markers__ in label locations where no data for that label existed
3. If specified, __fill data for missing labels__ using logic (highly relevant to working with time series data)

Here is a simple example:

In [3]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    0.601915
b   -0.341331
c    0.124636
d   -0.175199
e    1.031125
dtype: float64

In [4]:
s.reindex(['e', 'b', 'f', 'd'])

e    1.031125
b   -0.341331
f         NaN
d   -0.175199
dtype: float64

With a DataFrame, you can simultaneously reindex the index and columns:

In [6]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

        one       two     three
a -0.483223 -0.359918       NaN
b -0.320420 -0.380172 -0.161395
c -1.061522  0.750475  1.971402
d       NaN  0.666355  0.508566


In [7]:
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])

      three       two       one
c  1.971402  0.750475 -1.061522
f       NaN       NaN       NaN
b -0.161395 -0.380172 -0.320420


You may also use reindex with an axis keyword:

In [8]:
df.reindex(['c', 'f', 'b'], axis='index')

        one       two     three
c -1.061522  0.750475  1.971402
f       NaN       NaN       NaN
b -0.320420 -0.380172 -0.161395


Note that __the ``Index`` objects containing the actual axis labels can be shared between objects__. So if we have a Series and a DataFrame, the following can be done:

In [14]:
s = pd.Series(np.random.randn(4), index=list('abcd'))
s

a   -0.652953
b   -0.738724
c   -1.134679
d    0.006017
dtype: float64

In [15]:
rs = s.reindex(df.index)

In [16]:
rs

a   -0.652953
b   -0.738724
c   -1.134679
d    0.006017
dtype: float64

In [17]:
rs.index == df.index

array([ True,  True,  True,  True])

In [18]:
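rs.index is df.index

False

In the run above the labels compare equal element by element, but the identity check returns ``False``: the reindexed Series’s index is not the same Python object as the DataFrame’s index here. Whether the object itself is reused can depend on the pandas version, so compare indexes with ``rs.index.equals(df.index)`` rather than with ``is``.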
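To make the difference between equal labels and a shared object concrete, here is a minimal sketch. The names `frame`, `series` and `aligned` are illustrative and not part of the original notebook; the identity result is printed rather than assumed, since it is not guaranteed.

```python
import pandas as pd

# Illustrative data (not from the original notebook).
frame = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
series = pd.Series([10.0, 20.0, 30.0, 40.0], index=['c', 'a', 'd', 'b'])

# Conform the Series to the DataFrame's row labels.
aligned = series.reindex(frame.index)

# Label equality: this is the check to rely on.
same_labels = aligned.index.equals(frame.index)

# Object identity: may or may not hold, so it is only reported here.
same_object = aligned.index is frame.index

print(same_labels)
print(same_object)
```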

New in version 0.21.0.

DataFrame.reindex() also supports an “axis-style” calling convention, where you specify a single labels argument and the axis it applies to.

In [20]:
df.reindex(['c', 'f', 'b'], axis='index')

        one       two     three
c -1.061522  0.750475  1.971402
f       NaN       NaN       NaN
b -0.320420 -0.380172 -0.161395


In [21]:
df.reindex(['three', 'two', 'one'], axis='columns')

      three       two       one
a       NaN -0.359918 -0.483223
b -0.161395 -0.380172 -0.320420
c  1.971402  0.750475 -1.061522
d  0.508566  0.666355       NaN


__Note__: When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing ninja: __many operations are faster on pre-aligned data__. Adding two unaligned DataFrames internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily optimized), but when CPU cycles matter, sprinkling a few explicit reindex calls here and there can have an impact.
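As a rough sketch of what aligning up front looks like, the example below aligns two frames once and then operates on the aligned copies. The frames `left` and `right` are made up for illustration, and no timing is claimed; the point is only that the explicit and implicit routes give the same result.

```python
import pandas as pd

# Two frames whose row labels only partly overlap (illustrative data).
left = pd.DataFrame({'x': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'x': [10.0, 20.0, 30.0]}, index=['c', 'a', 'd'])

# Implicit route: the addition aligns both operands internally.
implicit = left + right

# Explicit route: align once, then reuse the aligned frames
# for as many operations as needed.
left_al, right_al = left.align(right, join='outer')
explicit = left_al + right_al

print(explicit)
```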

### Reindexing to align with another object

You may wish to take an object and reindex its axes to be labeled the same as another object:

In [23]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan], 'B': [np.nan, 2., 3., np.nan, 6.]})
df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.], 'B': [np.nan, np.nan, 3., 4., 6., 8.]})
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [24]:
df2

     A    B
0  5.0  NaN
1  2.0  NaN
2  4.0  3.0
3  NaN  4.0
4  3.0  6.0
5  7.0  8.0


In [25]:
df3

     A
e  2.0
d  1.0
c  1.0
b  3.0
a  NaN


In [27]:
df2.reindex_like(df3)

    A
e NaN
d NaN
c NaN
b NaN
a NaN


In [28]:
df

        one       two     three
a -0.483223 -0.359918       NaN
b -0.320420 -0.380172 -0.161395
c -1.061522  0.750475  1.971402
d       NaN  0.666355  0.508566


In [30]:
df3 = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(3), index=['a', 'b', 'c'])})
df3

        one       two
a -0.630136 -0.481235
b  0.134179  0.903050
c -0.620753 -1.243463


In [31]:
df.reindex_like(df3)

        one       two
a -0.483223 -0.359918
b -0.320420 -0.380172
c -1.061522  0.750475


### Aligning objects with each other with ``align``

#### The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):

1. join='outer': take the union of the indexes (default)
2. join='left': use the calling object’s index
3. join='right': use the passed object’s index
4. join='inner': intersect the indexes

#### It returns a tuple with both of the reindexed Series

In [32]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [33]:
s1 = s[:4]
s2 = s[1:]

In [34]:
s1

a    0.678457
b    0.004071
c   -0.340283
d    0.002467
dtype: float64

In [35]:
s2

b    0.004071
c   -0.340283
d    0.002467
e    1.387496
dtype: float64

In [36]:
s1.align(s2)

(a    0.678457
 b    0.004071
 c   -0.340283
 d    0.002467
 e         NaN
 dtype: float64, a         NaN
 b    0.004071
 c   -0.340283
 d    0.002467
 e    1.387496
 dtype: float64)

In [37]:
s1.align(s2, join='inner')

(b    0.004071
 c   -0.340283
 d    0.002467
 dtype: float64, b    0.004071
 c   -0.340283
 d    0.002467
 dtype: float64)

In [38]:
s1.align(s2, join='left')

(a    0.678457
 b    0.004071
 c   -0.340283
 d    0.002467
 dtype: float64, a         NaN
 b    0.004071
 c   -0.340283
 d    0.002467
 dtype: float64)

In [39]:
s1.align(s2, join='right')

(b    0.004071
 c   -0.340283
 d    0.002467
 e         NaN
 dtype: float64, b    0.004071
 c   -0.340283
 d    0.002467
 e    1.387496
 dtype: float64)

__For DataFrames__, the join method will be applied __to both the index and the columns by default__:

In [40]:
df2 = df3.copy()

In [41]:
df.align(df3)

(        one     three       two
 a -0.483223       NaN -0.359918
 b -0.320420 -0.161395 -0.380172
 c -1.061522  1.971402  0.750475
 d       NaN  0.508566  0.666355,         one  three       two
 a -0.630136    NaN -0.481235
 b  0.134179    NaN  0.903050
 c -0.620753    NaN -1.243463
 d       NaN    NaN       NaN)
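If only one axis should be aligned, `align()` also accepts an `axis` argument for DataFrames. The following is a minimal sketch with made-up frames (`frame` and `other` are illustrative names) that aligns the rows and leaves each frame's columns untouched.

```python
import pandas as pd

# Illustrative frames with different rows and different columns.
frame = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]},
                     index=['a', 'b', 'c'])
other = pd.DataFrame({'two': [7.0, 8.0], 'three': [9.0, 10.0]},
                     index=['b', 'c'])

# axis=0 restricts the alignment to the row labels;
# join='inner' keeps only labels present in both frames.
frame_rows, other_rows = frame.align(other, join='inner', axis=0)

print(frame_rows)
print(other_rows)
```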

If you __pass a Series to DataFrame.align()__, you can choose to align both objects __either on the DataFrame’s index or columns using the axis argument__:

In [42]:
df.align(df.iloc[0], axis=1)

(        one       two     three
 a -0.483223 -0.359918       NaN
 b -0.320420 -0.380172 -0.161395
 c -1.061522  0.750475  1.971402
 d       NaN  0.666355  0.508566, one     -0.483223
 two     -0.359918
 three         NaN
 Name: a, dtype: float64)
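For completeness, the same call can target the DataFrame's index instead of its columns. This is a small sketch with illustrative data (`frame` and `row_values` are not from the original notebook), using the default outer join.

```python
import pandas as pd

# Illustrative data: the Series is labeled like rows, not columns.
frame = pd.DataFrame({'one': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])
row_values = pd.Series([10.0, 20.0, 30.0], index=['b', 'c', 'd'])

# axis=0 aligns the Series against the DataFrame's row labels.
frame_al, series_al = frame.align(row_values, axis=0)

print(frame_al)
print(series_al)
```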

### Filling while reindexing

reindex() takes an optional parameter method which is a filling method chosen from the following table:

| Method | Action |
|---|---|
| pad / ffill | Fill values forward |
| bfill / backfill | Fill values backward |
| nearest | Fill from the nearest index value |

In [43]:
rng = pd.date_range('1/3/2000', periods=8)

In [44]:
ts = pd.Series(np.random.randn(8), index=rng)

In [45]:
ts2 = ts.iloc[[0, 3, 6]]  # select by position explicitly; the index holds dates, not integers

In [46]:
ts

2000-01-03    1.090728
2000-01-04    0.818119
2000-01-05   -0.228301
2000-01-06    0.056388
2000-01-07    0.448287
2000-01-08   -0.597665
2000-01-09    1.288116
2000-01-10   -0.700284
Freq: D, dtype: float64

In [47]:
ts2

2000-01-03    1.090728
2000-01-06    0.056388
2000-01-09    1.288116
dtype: float64

In [48]:
ts.reindex(ts2)

1.090728   NaN
0.056388   NaN
1.288116   NaN
dtype: float64

In [49]:
ts.reindex(ts2.index)

2000-01-03    1.090728
2000-01-06    0.056388
2000-01-09    1.288116
dtype: float64

In [52]:
ts2.reindex(ts.index)

2000-01-03    1.090728
2000-01-04         NaN
2000-01-05         NaN
2000-01-06    0.056388
2000-01-07         NaN
2000-01-08         NaN
2000-01-09    1.288116
2000-01-10         NaN
Freq: D, dtype: float64

In [53]:
ts2.reindex(ts.index, method='ffill')

2000-01-03    1.090728
2000-01-04    1.090728
2000-01-05    1.090728
2000-01-06    0.056388
2000-01-07    0.056388
2000-01-08    0.056388
2000-01-09    1.288116
2000-01-10    1.288116
Freq: D, dtype: float64

In [54]:
ts2.reindex(ts.index, method='bfill')

2000-01-03    1.090728
2000-01-04    0.056388
2000-01-05    0.056388
2000-01-06    0.056388
2000-01-07    1.288116
2000-01-08    1.288116
2000-01-09    1.288116
2000-01-10         NaN
Freq: D, dtype: float64

In [55]:
ts2.reindex(ts.index, method='nearest')

2000-01-03    1.090728
2000-01-04    1.090728
2000-01-05    0.056388
2000-01-06    0.056388
2000-01-07    0.056388
2000-01-08    1.288116
2000-01-09    1.288116
2000-01-10    1.288116
Freq: D, dtype: float64

These methods require that the indexes are ordered increasing or decreasing.

#### Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate. In recent pandas versions ``fillna(method='ffill')`` is deprecated in favor of ``ffill()``:

In [57]:
ts2.reindex(ts.index).ffill()

2000-01-03    1.090728
2000-01-04    1.090728
2000-01-05    1.090728
2000-01-06    0.056388
2000-01-07    0.056388
2000-01-08    0.056388
2000-01-09    1.288116
2000-01-10    1.288116
Freq: D, dtype: float64

#### When a fill ``method`` is passed, ``reindex()`` will raise a ValueError if the index is not monotonically increasing or decreasing. ``fillna()`` and ``interpolate()`` will not perform any checks on the order of the index.
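The difference in checking can be seen in a short sketch. The Series `unordered` is made up for illustration, and `ffill()` stands in for the fill step after a plain reindex.

```python
import pandas as pd

# An index that is neither increasing nor decreasing (illustrative data).
unordered = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

# Filling during reindex requires a monotonic index.
raised = False
try:
    unordered.reindex([1, 2, 3, 4], method='ffill')
except ValueError as exc:
    raised = True
    print(exc)

# Reindexing first and filling afterwards performs no such check;
# the fill simply follows the order of the new labels.
filled = unordered.reindex([1, 2, 3, 4]).ffill()
print(filled)
```

Note that the second route fills by position in the new index, so the result is only meaningful if that order is the one you intend.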

### Limits on filling while reindexing

__The limit and tolerance arguments__ provide additional control over filling while reindexing. __Limit specifies the maximum count of consecutive matches__:

In [60]:
ts2.reindex(ts.index, method='ffill', limit=1)

2000-01-03    1.090728
2000-01-04    1.090728
2000-01-05         NaN
2000-01-06    0.056388
2000-01-07    0.056388
2000-01-08         NaN
2000-01-09    1.288116
2000-01-10    1.288116
Freq: D, dtype: float64

#### In contrast, tolerance specifies the maximum distance between the index and indexer values:

In [59]:
ts2.reindex(ts.index, method='ffill', tolerance='1 day')

2000-01-03    1.090728
2000-01-04    1.090728
2000-01-05         NaN
2000-01-06    0.056388
2000-01-07    0.056388
2000-01-08         NaN
2000-01-09    1.288116
2000-01-10    1.288116
Freq: D, dtype: float64

#### Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
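A minimal sketch of that coercion, assuming a daily DatetimeIndex like the one used above (the values here are illustrative, not the random data from the notebook): passing the string `'1 day'` and passing an explicit `pd.Timedelta` should give the same result.

```python
import numpy as np
import pandas as pd

# Daily index with simple, deterministic values (illustrative data).
rng = pd.date_range('2000-01-03', periods=8)
ts = pd.Series(np.arange(8.0), index=rng)
ts2 = ts.iloc[[0, 3, 6]]

# Tolerance given as a string ...
by_string = ts2.reindex(ts.index, method='ffill', tolerance='1 day')

# ... and as an explicit Timedelta object.
by_timedelta = ts2.reindex(ts.index, method='ffill',
                           tolerance=pd.Timedelta(days=1))

print(by_string)
```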

### Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [61]:
df

        one       two     three
a -0.483223 -0.359918       NaN
b -0.320420 -0.380172 -0.161395
c -1.061522  0.750475  1.971402
d       NaN  0.666355  0.508566


In [62]:
df.drop(['a', 'b'], axis=0)

        one       two     three
c -1.061522  0.750475  1.971402
d       NaN  0.666355  0.508566


In [64]:
df.drop(['two', 'three'], axis=1)

        one
a -0.483223
b -0.320420
c -1.061522
d       NaN


Note that the following also works, but is a bit less obvious / clean:

In [65]:
df.reindex(df.index.difference(['a', 'd']))

        one       two     three
b -0.320420 -0.380172 -0.161395
c -1.061522  0.750475  1.971402
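The equivalence of the two approaches can be checked directly. This sketch uses a small made-up frame (`frame` is an illustrative name) whose index is already sorted.

```python
import pandas as pd

# Illustrative frame with a sorted index.
frame = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0]},
                     index=['a', 'b', 'c', 'd'])

# Direct: remove the labels.
dropped = frame.drop(['a', 'd'], axis=0)

# Indirect: compute the labels to keep, then reindex to them.
kept_labels = frame.index.difference(['a', 'd'])
reindexed = frame.reindex(kept_labels)

print(dropped.equals(reindexed))
```

One caveat: `Index.difference()` sorts its result by default, so on an unsorted index the reindex route can change the row order while `drop()` preserves it.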


### Renaming / mapping labels

#### The ``rename()`` method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [66]:
s

a    0.678457
b    0.004071
c   -0.340283
d    0.002467
e    1.387496
dtype: float64

In [67]:
s.rename(str.upper)

A    0.678457
B    0.004071
C   -0.340283
D    0.002467
E    1.387496
dtype: float64

If you __pass a function, it must return a value__ when called with any of the labels (and __must produce a set of unique values__). A dict or Series can also be used:

In [68]:
df.rename(columns={'one': 'foo', 'two': 'bar'}, index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

             foo       bar     three
apple  -0.483223 -0.359918       NaN
banana -0.320420 -0.380172 -0.161395
c      -1.061522  0.750475  1.971402
durian       NaN  0.666355  0.508566


If the mapping doesn’t include a column/index label, it isn’t renamed. __Note that extra labels in the mapping don’t throw an error.__
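A short sketch of that behavior with an illustrative frame: the unknown key `'missing'` is silently ignored. The second call uses the `errors='raise'` option, which exists only in more recent pandas releases than the ones mentioned in this section, so treat that part as version-dependent.

```python
import pandas as pd

# Illustrative frame (not from the original notebook).
frame = pd.DataFrame({'one': [1, 2], 'two': [3, 4]})

# 'missing' is not a column; by default it is skipped without error.
renamed = frame.rename(columns={'one': 'foo', 'missing': 'bar'})
print(list(renamed.columns))

# Opt in to strict checking (requires a pandas version that
# supports the errors keyword on rename).
strict_failed = False
try:
    frame.rename(columns={'missing': 'bar'}, errors='raise')
except KeyError:
    strict_failed = True
print(strict_failed)
```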

New in version 0.21.0.

__DataFrame.rename() also supports an “axis-style”__ calling convention, where you specify a single mapper and the axis to apply that mapping to.

In [69]:
df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')

        foo       bar     three
a -0.483223 -0.359918       NaN
b -0.320420 -0.380172 -0.161395
c -1.061522  0.750475  1.971402
d       NaN  0.666355  0.508566


In [70]:
df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')

             one       two     three
apple  -0.483223 -0.359918       NaN
banana -0.320420 -0.380172 -0.161395
c      -1.061522  0.750475  1.971402
durian       NaN  0.666355  0.508566


The __rename() method also provides an ``inplace`` named parameter that is by default False__ and copies the underlying data. Pass inplace=True to rename the data in place.
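The two modes can be contrasted in a few lines. This is a sketch on an illustrative frame; note that the in-place call returns `None`, so its result should not be assigned and reused.

```python
import pandas as pd

# Illustrative frame (not from the original notebook).
frame = pd.DataFrame({'one': [1, 2]})

# Default (inplace=False): a new object is returned, the original is untouched.
renamed_copy = frame.rename(columns={'one': 'foo'})
columns_after_copy = list(frame.columns)

# inplace=True: the original is modified and the call returns None.
result = frame.rename(columns={'one': 'foo'}, inplace=True)
columns_after_inplace = list(frame.columns)

print(columns_after_copy, columns_after_inplace, result)
```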

New in version 0.18.0.

#### Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [71]:
s.rename('scalar-name')

a    0.678457
b    0.004071
c   -0.340283
d    0.002467
e    1.387496
Name: scalar-name, dtype: float64

In [72]:
s

a    0.678457
b    0.004071
c   -0.340283
d    0.002467
e    1.387496
dtype: float64

New in version 0.24.0.

#### The method rename_axis() allows specific names of a MultiIndex to be changed (as opposed to the labels).

In [73]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                'y': [10, 20, 30, 40, 50, 60]},
                    index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
                   names=['let', 'num']))

In [74]:
df

         x   y
let num
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60


In [77]:
df.rename_axis(index={'let': 'abc'})

On pandas versions older than 0.24.0 this call fails with ``TypeError: rename_axis() got an unexpected keyword argument 'index'``. On 0.24.0 or later the expected output is:

         x   y
abc num       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [78]:
df.rename_axis(index=str.upper)

On pandas versions older than 0.24.0 this call fails with ``TypeError: rename_axis() got an unexpected keyword argument 'index'``. On 0.24.0 or later the expected output is:

         x   y
LET NUM       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60
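To confirm that only the level names change and the labels stay as they are, the sketch below rebuilds the same MultiIndex frame and inspects `index.names`. It assumes pandas 0.24.0 or later, since the `index` keyword of `rename_axis()` is not available before that; the variable names are illustrative.

```python
import pandas as pd

# Same structure as the frame above.
frame = pd.DataFrame(
    {'x': [1, 2, 3, 4, 5, 6], 'y': [10, 20, 30, 40, 50, 60]},
    index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
                                     names=['let', 'num']))

# Rename one level via a mapping; unmapped names are kept.
by_mapping = frame.rename_axis(index={'let': 'abc'})

# Rename every level via a function.
by_function = frame.rename_axis(index=str.upper)

print(list(by_mapping.index.names))
print(list(by_function.index.names))
```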