## Reindexing and altering labels

In [1]:
import numpy as np
import pandas as pd

``reindex()`` is the __fundamental data alignment__ method in pandas. It is used to implement nearly all other features relying on label-alignment functionality. __To reindex means to conform the data to match a given set of labels along a particular axis.__ This accomplishes several things:

1. __Reorders the existing data to match a new set of labels__
2. __Inserts missing value (NA) markers__ in label locations where no data for that label existed
3. If specified, __fill data for missing labels__ using logic (highly relevant to working with time series data)

Here is a simple example:

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a    2.496729
b    0.069824
c    0.144166
d   -0.199828
e    0.016900
dtype: float64

In [3]:
s.reindex(['e', 'b', 'f', 'd'])

e    0.016900
b    0.069824
f         NaN
d   -0.199828
dtype: float64

With a DataFrame, you can simultaneously reindex the index and columns:

In [4]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
df

Unnamed: 0,one,two,three
a,-0.076037,-0.95942,
b,-0.402473,-0.211883,0.676317
c,0.128989,-2.00799,-0.586647
d,,-1.180769,-1.326337


In [5]:
df.reindex(index=['c', 'f', 'b'], columns=['three', 'two', 'one'])

Unnamed: 0,three,two,one
c,-0.586647,-2.00799,0.128989
f,,,
b,0.676317,-0.211883,-0.402473


You may also use reindex with an axis keyword:

In [6]:
df.reindex(['c', 'f', 'b'], axis='index')

Unnamed: 0,one,two,three
c,0.128989,-2.00799,-0.586647
f,,,
b,-0.402473,-0.211883,0.676317


Note that __the ``Index`` objects containing the actual axis labels can be shared between objects__. So if we have a Series and a DataFrame, the following can be done:

In [7]:
s = pd.Series(np.random.randn(4), index=list('abcd'))
s

a   -0.815751
b   -0.809489
c    0.301807
d    0.761103
dtype: float64

In [8]:
rs = s.reindex(df.index)

In [9]:
rs

a   -0.815751
b   -0.809489
c    0.301807
d    0.761103
dtype: float64

In [10]:
rs.index == df.index

array([ True,  True,  True,  True])

In [11]:
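rs.index is df.index

False

The two indexes hold the same labels, so the element-wise comparison above is all True. Whether the reindexed Series’s index is also the same Python object as the DataFrame’s index depends on the pandas version: the upstream documentation describes the index as shared, but the identity check in this run returned False, so do not rely on it.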

New in version 0.21.0.

DataFrame.reindex() also supports an “axis-style” calling convention, where you specify a single labels argument and the axis it applies to.

In [12]:
df.reindex(['c', 'f', 'b'], axis='index')

Unnamed: 0,one,two,three
c,0.128989,-2.00799,-0.586647
f,,,
b,-0.402473,-0.211883,0.676317


In [13]:
df.reindex(['three', 'two', 'one'], axis='columns')

Unnamed: 0,three,two,one
a,,-0.95942,-0.076037
b,0.676317,-0.211883,-0.402473
c,-0.586647,-2.00799,0.128989
d,-1.326337,-1.180769,


__Note__: When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing ninja: __many operations are faster on pre-aligned data__. Adding two unaligned DataFrames internally triggers a reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily optimized), but when CPU cycles matter, a few explicit reindex calls in the right places can have an impact.
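
The note above can be illustrated with a minimal sketch. The frames `left` and `right` are made-up data, not part of the notebook; the point is only that an explicit `reindex` to a shared set of labels gives the same result as the implicit alignment performed by `+`.

```python
import numpy as np
import pandas as pd

# Two hypothetical frames whose row labels only partly overlap.
left = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=["a", "b", "c"])
right = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=["b", "c", "d"])

# Implicit alignment: pandas conforms both operands to the union of labels.
implicit = left + right

# Explicit alignment: conform both frames once, then operate on aligned data.
labels = left.index.union(right.index)
left_aligned = left.reindex(labels)
right_aligned = right.reindex(labels)
explicit = left_aligned + right_aligned

print(explicit)
```

If the same frames take part in many operations, aligning them once up front means later operations no longer need to conform the labels each time.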

### Reindexing to align with another object

You may wish to take an object and reindex its axes to be labeled the same as another object:

In [14]:
df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan], 'B': [np.nan, 2., 3., np.nan, 6.]})
df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.], 'B': [np.nan, np.nan, 3., 4., 6., 8.]})
df3 = pd.DataFrame([2, 1, 1, 3, np.nan], columns=['A'], index=list('edcba'))

In [15]:
df2

Unnamed: 0,A,B
0,5.0,
1,2.0,
2,4.0,3.0
3,,4.0
4,3.0,6.0
5,7.0,8.0


In [16]:
df3

Unnamed: 0,A
e,2.0
d,1.0
c,1.0
b,3.0
a,


In [17]:
df2.reindex_like(df3)

Unnamed: 0,A
e,
d,
c,
b,
a,


In [18]:
df

Unnamed: 0,one,two,three
a,-0.076037,-0.95942,
b,-0.402473,-0.211883,0.676317
c,0.128989,-2.00799,-0.586647
d,,-1.180769,-1.326337


In [19]:
df3 = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(3), index=['a', 'b', 'c'])})
df3

Unnamed: 0,one,two
a,-0.237808,0.093898
b,-0.543743,-0.742616
c,-0.701779,0.458126


In [20]:
df.reindex_like(df3)

Unnamed: 0,one,two
a,-0.076037,-0.95942
b,-0.402473,-0.211883
c,0.128989,-2.00799


### Aligning objects with each other with ``align``

#### The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):

1. join='outer': take the union of the indexes (default)
2. join='left': use the calling object’s index
3. join='right': use the passed object’s index
4. join='inner': intersect the indexes

#### It returns a tuple with both of the reindexed Series:

In [21]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])

In [22]:
s1 = s[:4]
s2 = s[1:]

In [23]:
s1

a    0.447339
b    0.257893
c    1.251237
d   -0.059313
dtype: float64

In [24]:
s2

b    0.257893
c    1.251237
d   -0.059313
e    0.367377
dtype: float64

In [25]:
s1.align(s2)

(a    0.447339
 b    0.257893
 c    1.251237
 d   -0.059313
 e         NaN
 dtype: float64, a         NaN
 b    0.257893
 c    1.251237
 d   -0.059313
 e    0.367377
 dtype: float64)

In [26]:
s1.align(s2, join='inner')

(b    0.257893
 c    1.251237
 d   -0.059313
 dtype: float64, b    0.257893
 c    1.251237
 d   -0.059313
 dtype: float64)

In [27]:
s1.align(s2, join='left')

(a    0.447339
 b    0.257893
 c    1.251237
 d   -0.059313
 dtype: float64, a         NaN
 b    0.257893
 c    1.251237
 d   -0.059313
 dtype: float64)

In [28]:
s1.align(s2, join='right')

(b    0.257893
 c    1.251237
 d   -0.059313
 e         NaN
 dtype: float64, b    0.257893
 c    1.251237
 d   -0.059313
 e    0.367377
 dtype: float64)

__For DataFrames__, the join method will be applied __to both the index and the columns by default__:

In [29]:
df2 = df3.copy()

In [30]:
df.align(df3)

(        one     three       two
 a -0.076037       NaN -0.959420
 b -0.402473  0.676317 -0.211883
 c  0.128989 -0.586647 -2.007990
 d       NaN -1.326337 -1.180769,         one  three       two
 a -0.237808    NaN  0.093898
 b -0.543743    NaN -0.742616
 c -0.701779    NaN  0.458126
 d       NaN    NaN       NaN)

If you __pass a Series to DataFrame.align()__, you can choose to align both objects __either on the DataFrame’s index or columns using the axis argument__:

In [31]:
df.align(df.iloc[0], axis=1)

(        one       two     three
 a -0.076037 -0.959420       NaN
 b -0.402473 -0.211883  0.676317
 c  0.128989 -2.007990 -0.586647
 d       NaN -1.180769 -1.326337, one     -0.076037
 two     -0.959420
 three         NaN
 Name: a, dtype: float64)

### Filling while reindexing

reindex() takes an optional parameter method which is a filling method chosen from the following table:

Method	-  Action
1. pad / ffill	-  Fill values forward
2. bfill / backfill	 -  Fill values backward
3. nearest	-  Fill from the nearest index value

In [32]:
rng = pd.date_range('1/3/2000', periods=8)

In [33]:
ts = pd.Series(np.random.randn(8), index=rng)

In [34]:
ts2 = ts.iloc[[0, 3, 6]]  # select by position explicitly; ts[[0, 3, 6]] relies on deprecated positional fallback

In [35]:
ts

2000-01-03    1.861953
2000-01-04    0.231965
2000-01-05   -1.716378
2000-01-06    0.103358
2000-01-07   -1.731932
2000-01-08    0.311691
2000-01-09    0.981686
2000-01-10    0.660765
Freq: D, dtype: float64

In [36]:
ts2

2000-01-03    1.861953
2000-01-06    0.103358
2000-01-09    0.981686
dtype: float64

In [37]:
ts.reindex(ts2)

1.861953   NaN
0.103358   NaN
0.981686   NaN
dtype: float64

In [38]:
ts.reindex(ts2.index)

2000-01-03    1.861953
2000-01-06    0.103358
2000-01-09    0.981686
dtype: float64

In [39]:
ts2.reindex(ts.index)

2000-01-03    1.861953
2000-01-04         NaN
2000-01-05         NaN
2000-01-06    0.103358
2000-01-07         NaN
2000-01-08         NaN
2000-01-09    0.981686
2000-01-10         NaN
Freq: D, dtype: float64

In [40]:
ts2.reindex(ts.index, method='ffill')

2000-01-03    1.861953
2000-01-04    1.861953
2000-01-05    1.861953
2000-01-06    0.103358
2000-01-07    0.103358
2000-01-08    0.103358
2000-01-09    0.981686
2000-01-10    0.981686
Freq: D, dtype: float64

In [41]:
ts2.reindex(ts.index, method='bfill')

2000-01-03    1.861953
2000-01-04    0.103358
2000-01-05    0.103358
2000-01-06    0.103358
2000-01-07    0.981686
2000-01-08    0.981686
2000-01-09    0.981686
2000-01-10         NaN
Freq: D, dtype: float64

In [42]:
ts2.reindex(ts.index, method='nearest')

2000-01-03    1.861953
2000-01-04    1.861953
2000-01-05    0.103358
2000-01-06    0.103358
2000-01-07    0.103358
2000-01-08    0.981686
2000-01-09    0.981686
2000-01-10    0.981686
Freq: D, dtype: float64

These methods require that the indexes are ordered increasing or decreasing.

#### Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate:

In [43]:
ts2.reindex(ts.index).ffill()  # same as fillna(method='ffill'), which is deprecated in recent pandas

2000-01-03    1.861953
2000-01-04    1.861953
2000-01-05    1.861953
2000-01-06    0.103358
2000-01-07    0.103358
2000-01-08    0.103358
2000-01-09    0.981686
2000-01-10    0.981686
Freq: D, dtype: float64

#### ``reindex()`` with a fill method will raise a ValueError if the index is not monotonically increasing or decreasing. ``fillna()`` and ``interpolate()`` will not perform any checks on the order of the index.
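
A small sketch of that difference, using a hypothetical Series `unordered` whose index is neither increasing nor decreasing:

```python
import pandas as pd

# Hypothetical Series with a non-monotonic index.
unordered = pd.Series([1.0, 2.0, 3.0], index=[3, 1, 2])

# Filling during reindex needs an ordered index, so this raises ValueError.
try:
    unordered.reindex([1, 2, 3, 4], method="ffill")
    raised = False
except ValueError as exc:
    raised = True
    print("reindex refused to fill:", exc)

# Sorting the index first makes the forward fill well defined.
filled = unordered.sort_index().reindex([1, 2, 3, 4], method="ffill")

# Reindexing without a method and filling afterwards performs no order check.
no_check = unordered.reindex([1, 2, 3, 4]).ffill()

print(filled)
```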

### Limits on filling while reindexing

__The limit and tolerance arguments__ provide additional control over filling while reindexing. __Limit specifies the maximum count of consecutive matches__:

In [44]:
ts2.reindex(ts.index, method='ffill', limit=1)

2000-01-03    1.861953
2000-01-04    1.861953
2000-01-05         NaN
2000-01-06    0.103358
2000-01-07    0.103358
2000-01-08         NaN
2000-01-09    0.981686
2000-01-10    0.981686
Freq: D, dtype: float64

#### In contrast, tolerance specifies the maximum distance between the index and indexer values:

In [45]:
ts2.reindex(ts.index, method='ffill', tolerance='1 day')

2000-01-03    1.861953
2000-01-04    1.861953
2000-01-05         NaN
2000-01-06    0.103358
2000-01-07    0.103358
2000-01-08         NaN
2000-01-09    0.981686
2000-01-10    0.981686
Freq: D, dtype: float64

#### Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
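
To make the coercion concrete, the sketch below passes the tolerance once as a string and once as an explicit `pd.Timedelta`. The names `dates` and `sparse` are illustrative stand-ins for `ts.index` and `ts2`.

```python
import pandas as pd

dates = pd.date_range("2000-01-03", periods=8)
# Keep every third day, mirroring the ts2 example above.
sparse = pd.Series([1.0, 2.0, 3.0], index=dates[[0, 3, 6]])

# The string is converted to a Timedelta internally ...
by_string = sparse.reindex(dates, method="ffill", tolerance="1 day")
# ... so passing the Timedelta directly gives the same result.
by_timedelta = sparse.reindex(dates, method="ffill", tolerance=pd.Timedelta(days=1))

print(by_string)
```

Labels more than one day past the last observation stay NaN in both results.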

### Dropping labels from an axis

A method closely related to reindex is the drop() function. It removes a set of labels from an axis:

In [46]:
df

Unnamed: 0,one,two,three
a,-0.076037,-0.95942,
b,-0.402473,-0.211883,0.676317
c,0.128989,-2.00799,-0.586647
d,,-1.180769,-1.326337


In [47]:
df.drop(['a', 'b'], axis=0)

Unnamed: 0,one,two,three
c,0.128989,-2.00799,-0.586647
d,,-1.180769,-1.326337


In [48]:
df.drop(['two', 'three'], axis=1)

Unnamed: 0,one
a,-0.076037
b,-0.402473
c,0.128989
d,


Note that the following also works, but is a bit less obvious / clean:

In [49]:
df.reindex(df.index.difference(['a', 'd']))

Unnamed: 0,one,two,three
b,-0.402473,-0.211883,0.676317
c,0.128989,-2.00799,-0.586647


### Renaming / mapping labels

#### The ``rename()`` method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [50]:
s

a    0.447339
b    0.257893
c    1.251237
d   -0.059313
e    0.367377
dtype: float64

In [51]:
s.rename(str.upper)

A    0.447339
B    0.257893
C    1.251237
D   -0.059313
E    0.367377
dtype: float64

If you __pass a function, it must return a value__ when called with any of the labels (and __must produce a set of unique values__). A dict or Series can also be used:

In [52]:
df.rename(columns={'one': 'foo', 'two': 'bar'}, index={'a': 'apple', 'b': 'banana', 'd': 'durian'})

Unnamed: 0,foo,bar,three
apple,-0.076037,-0.95942,
banana,-0.402473,-0.211883,0.676317
c,0.128989,-2.00799,-0.586647
durian,,-1.180769,-1.326337


If the mapping doesn’t include a column/index label, it isn’t renamed. __Note that extra labels in the mapping don’t throw an error.__
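
A short sketch of this behaviour on a made-up frame. The `errors="raise"` variant assumes pandas 0.25 or later, where that keyword is available.

```python
import pandas as pd

frame = pd.DataFrame({"one": [1, 2], "two": [3, 4]})

# 'missing' is not a column; by default it is silently skipped.
renamed = frame.rename(columns={"one": "foo", "missing": "ignored"})
print(list(renamed.columns))

# Opt in to strict checking (assumes pandas >= 0.25).
try:
    frame.rename(columns={"missing": "ignored"}, errors="raise")
    strict_failed = False
except KeyError:
    strict_failed = True
```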

New in version 0.21.0.

__DataFrame.rename() also supports an “axis-style”__ calling convention, where you specify a single mapper and the axis to apply that mapping to.

In [53]:
df.rename({'one': 'foo', 'two': 'bar'}, axis='columns')

Unnamed: 0,foo,bar,three
a,-0.076037,-0.95942,
b,-0.402473,-0.211883,0.676317
c,0.128989,-2.00799,-0.586647
d,,-1.180769,-1.326337


In [54]:
df.rename({'a': 'apple', 'b': 'banana', 'd': 'durian'}, axis='index')

Unnamed: 0,one,two,three
apple,-0.076037,-0.95942,
banana,-0.402473,-0.211883,0.676317
c,0.128989,-2.00799,-0.586647
durian,,-1.180769,-1.326337


The __rename() method also provides an ``inplace`` named parameter that is by default False__ and copies the underlying data. Pass inplace=True to rename the data in place.

New in version 0.18.0.

#### Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.

In [55]:
s.rename('scalar-name')

a    0.447339
b    0.257893
c    1.251237
d   -0.059313
e    0.367377
Name: scalar-name, dtype: float64

In [56]:
s

a    0.447339
b    0.257893
c    1.251237
d   -0.059313
e    0.367377
dtype: float64

New in version 0.24.0.

#### The method rename_axis() allows specific names of a MultiIndex to be changed (as opposed to the labels).

In [57]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                'y': [10, 20, 30, 40, 50, 60]},
                    index=pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]],
                   names=['let', 'num']))

In [58]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,x,y
let,num,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,10
a,2,2,20
b,1,3,30
b,2,4,40
c,1,5,50
c,2,6,60


In [59]:
df.rename_axis(index={'let': 'abc'})

On pandas versions older than 0.24.0 this call fails with ``TypeError: rename_axis() got an unexpected keyword argument 'index'``. On 0.24.0 or later it returns:

         x   y
abc num       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

In [None]:
df.rename_axis(index=str.upper)

         x   y
LET NUM       
a   1    1  10
    2    2  20
b   1    3  30
    2    4  40
c   1    5  50
    2    6  60

## Iteration

The behavior of basic __iteration over pandas objects depends on the type__. When iterating over a __Series, it is regarded as array-like__, and basic iteration produces the values. __DataFrames follow the dict-like convention of iterating over the “keys” of the objects__.

In short, basic iteration (for i in object) produces:

#### 1. Series: values
#### 2. DataFrame: column labels

Thus, for example, iterating over a DataFrame gives you the column names:

In [60]:
df = pd.DataFrame({'col1': np.random.randn(3), 'col2': np.random.randn(3)}, index=['a', 'b', 'c'])

In [61]:
for col in df:
    print(col)

col1
col2


In [62]:
for col in df:
    print(df[col])

a    1.032768
b    0.916797
c    0.547213
Name: col1, dtype: float64
a   -0.790261
b   -0.204644
c   -1.300533
Name: col2, dtype: float64


Pandas objects also have the __dict-like ``items()`` method to iterate over the (key, value) pairs__.

In [63]:
for pair in df.items():  # each pair is a (column label, Series) tuple
    print(pair)

('col1', a    1.032768
b    0.916797
c    0.547213
Name: col1, dtype: float64)
('col2', a   -0.790261
b   -0.204644
c   -1.300533
Name: col2, dtype: float64)


__To iterate over the rows of a DataFrame__, you can use the following methods:

__``iterrows():``__ Iterate over the rows of a DataFrame as __(index, Series) pairs__. This converts the rows to Series objects, which can __change the dtypes and has some performance implications.__

__``itertuples():``__ Iterate over the rows of a DataFrame __as namedtuples of the values__. This is __a lot faster than iterrows()__, and is in most cases __preferable__ to use to iterate over the values of a DataFrame.

>__Warning:__

Iterating through pandas objects is generally __slow__. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:

1. __Look for a vectorized solution:__ many operations can be performed using built-in methods or __NumPy functions, (boolean) indexing__, …
2. When you have a function that __cannot work__ on the full DataFrame/Series at once, it is better to use __``apply()`` instead of iterating over the values__.
3. If you need to do iterative manipulations on the values but __performance is important__, consider writing the inner loop with cython or numba.
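
The three options can be compared on a toy problem. This is only a sketch with made-up columns `price` and `qty`; all three approaches produce the same numbers, with the vectorized form usually being the one to prefer.

```python
import pandas as pd

frame = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# 1. Vectorized: operate on whole columns at once.
total_vectorized = frame["price"] * frame["qty"]

# 2. apply(): call a function once per row when no vectorized form exists.
total_apply = frame.apply(lambda row: row["price"] * row["qty"], axis=1)

# 3. Explicit loop: the fallback, shown here with itertuples().
total_loop = pd.Series(
    [row.price * row.qty for row in frame.itertuples()],
    index=frame.index,
)

print(total_vectorized.tolist())
```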

>__Warning:__

You should never modify something you are iterating over. This is not guaranteed to work in all cases. __Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!__

For example, in the following case setting the value has no effect:

In [64]:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

In [65]:
df

Unnamed: 0,a,b
0,1,a
1,2,b
2,3,c


In [66]:
for index, row in df.iterrows():
    row['a'] = 10

In [67]:
df

Unnamed: 0,a,b
0,1,a
1,2,b
2,3,c
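
To actually change the data, assign through the DataFrame itself instead of through the row yielded by the iterator. A minimal sketch on a frame shaped like the one above (the name `frame` is illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})

# Writing to the row yielded by iterrows() changes a copy only.
for idx, row in frame.iterrows():
    row["a"] = 10
unchanged = frame["a"].tolist()

# Assign to the whole column ...
frame["a"] = 10
# ... or to selected rows with a boolean mask and .loc.
frame.loc[frame["b"] == "b", "a"] = 20

print(frame)
```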


### items

Consistent with the __dict-like interface__, items() __iterates through key-value pairs__:

#### 1. Series: (index, scalar value) pairs
#### 2. DataFrame: (column, Series) pairs
For example:

In [68]:
for label, ser in df.items():
    print(label)
    print(ser)

a
0    1
1    2
2    3
Name: a, dtype: int64
b
0    a
1    b
2    c
Name: b, dtype: object


### iterrows

iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator __yielding each index value along with a Series containing the data in each row__:

In [69]:
for row_index, row in df.iterrows():
    print(row_index, row, sep='\n')

0
a    1
b    a
Name: 0, dtype: object
1
a    2
b    b
Name: 1, dtype: object
2
a    3
b    c
Name: 2, dtype: object


>__Note__

Because __iterrows()__ returns a Series for each row, __it does not preserve dtypes across the rows__ (dtypes are preserved across columns for DataFrames). 

For example,

In [70]:
df_orig = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])

In [71]:
df_orig.dtypes

int        int64
float    float64
dtype: object

In [72]:
row = next(df_orig.iterrows())[1]

In [73]:
row

int      1.0
float    1.5
Name: 0, dtype: float64

All values in row, returned as a Series, are now __upcasted to floats__, including the original integer value in column ``int``:

In [74]:
row['int'].dtype

dtype('float64')

In [75]:
df_orig['int'].dtype

dtype('int64')

#### To preserve dtypes while iterating over the rows, it is better to use ``itertuples()``, which returns namedtuples of the values and is generally much faster than ``iterrows()``.

For instance, a contrived way to __transpose the DataFrame__ would be:

In [76]:
df2 = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
df2

Unnamed: 0,x,y
0,1,4
1,2,5
2,3,6


In [77]:
df2.T

Unnamed: 0,0,1,2
x,1,2,3
y,4,5,6


In [78]:
df2_t = pd.DataFrame({idx: values for idx, values in df2.iterrows()})

In [79]:
df2_t

Unnamed: 0,0,1,2
x,1,2,3
y,4,5,6


### itertuples

The itertuples() method will return an iterator __yielding a namedtuple for each row__ in the DataFrame. The first element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.

In [80]:
for row in df.itertuples():
    print(row)

Pandas(Index=0, a=1, b='a')
Pandas(Index=1, a=2, b='b')
Pandas(Index=2, a=3, b='c')


#### This method does not convert the row to a Series object; it merely returns the values inside a namedtuple. Therefore, itertuples() preserves the data type of the values and is generally faster than iterrows().

>__Note__ 

__The column names will be renamed to positional names__ if they are invalid Python identifiers, repeated, or start with an underscore. With a large number of columns (>255), regular tuples are returned.
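
The renaming rule is easiest to see on a frame with awkward column names. The columns below are invented for illustration; `index=False, name=None` is the documented way to get plain tuples instead.

```python
import pandas as pd

frame = pd.DataFrame({"valid_name": [1], "my col": [2], "_hidden": [3]})

# Field 0 is the index; invalid names are replaced by their position.
row = next(frame.itertuples())
fields = row._fields
print(fields)

# Plain tuples without the index avoid the naming question altogether.
plain = next(frame.itertuples(index=False, name=None))
print(plain)
```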

## .dt accessor

#### Series has an accessor to succinctly return datetime like properties for the values of the Series, if it is a datetime/period like Series. This will return a Series, indexed like the existing Series.

In [81]:
s = pd.Series(pd.date_range('2019-08-17 13:17:33', periods=7))  # ISO format avoids day/month ambiguity

In [82]:
s

0   2019-08-17 13:17:33
1   2019-08-18 13:17:33
2   2019-08-19 13:17:33
3   2019-08-20 13:17:33
4   2019-08-21 13:17:33
5   2019-08-22 13:17:33
6   2019-08-23 13:17:33
dtype: datetime64[ns]

In [83]:
s.dt.hour

0    13
1    13
2    13
3    13
4    13
5    13
6    13
dtype: int64

In [84]:
s.dt.date

0    2019-08-17
1    2019-08-18
2    2019-08-19
3    2019-08-20
4    2019-08-21
5    2019-08-22
6    2019-08-23
dtype: object

In [85]:
s.dt.day

0    17
1    18
2    19
3    20
4    21
5    22
6    23
dtype: int64

In [86]:
s.dt.second

0    33
1    33
2    33
3    33
4    33
5    33
6    33
dtype: int64

This enables nice expressions like this:

In [87]:
s[s.dt.day == 18]

1   2019-08-18 13:17:33
dtype: datetime64[ns]

You can easily produce tz-aware transformations:

In [None]:
stz = s.dt.tz_localize('US/Eastern')

In [None]:
stz

You can also chain these types of operations:

In [None]:
s.dt.tz_localize('UTC').dt.tz_convert('US/Eastern')

You can also __format datetime values as strings with ``Series.dt.strftime()``__ which supports the same format as the standard strftime().

In [None]:
# DatetimeIndex
s = pd.Series(pd.date_range('20130101', periods=4))

In [None]:
s

In [None]:
s.dt.strftime('%Y/%m/%d')

In [None]:
# PeriodIndex
s = pd.Series(pd.period_range('20130101', periods=4))

In [None]:
s

In [None]:
s.dt.strftime('%Y/%m/%d')

#### The .dt accessor works for period and timedelta dtypes.

In [None]:
# period
s = pd.Series(pd.period_range('20130101', periods=4, freq='D'))

In [None]:
s

In [None]:
s.dt.year

In [None]:
s.dt.day

In [None]:
# timedelta
s = pd.Series(pd.timedelta_range('1 days 00:00:03', periods=4, freq='s'))

In [None]:
s

In [None]:
s.dt.days

In [None]:
s.dt.seconds

In [None]:
s.dt.components

>Note: Series.dt raises an error if you access it on values that are not datetime-like. The upstream note describes a TypeError; recent pandas versions raise an AttributeError.
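
A minimal sketch of that failure and the usual remedy, converting with `pd.to_datetime` first. The Series `raw` is made up, and both exception types are caught because the type raised may differ between pandas versions.

```python
import pandas as pd

# Dates stored as plain strings are not datetime-like.
raw = pd.Series(["2019-08-17", "2019-08-18"])

try:
    raw.dt.day
    failed = False
except (AttributeError, TypeError) as exc:
    failed = True
    print("cannot use .dt:", exc)

# Convert first, then the accessor is available.
parsed = pd.to_datetime(raw)
days = parsed.dt.day.tolist()
print(days)
```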

## Vectorized string methods

__Series is equipped with a set of string processing methods__ that make it easy to operate on each element of the array. Perhaps most importantly, these methods __exclude missing/NA values automatically__. These are accessed via the Series’s str attribute and generally have names matching the equivalent (scalar) built-in string methods. For example:

In [None]:
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])

In [None]:
s.str.lower()

__Powerful pattern-matching methods__ are provided as well, but note that pattern-matching generally __uses regular expressions__ by default.
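
The regular-expression default matters whenever the pattern contains special characters. The sketch below uses `str.contains` on invented data; `regex=False` switches to a literal match, and `na=False` is passed so that missing values come back as False.

```python
import numpy as np
import pandas as pd

names = pd.Series(["a.b", "axb", np.nan, "A.B"])

# Missing values are passed through untouched by string methods.
lowered = names.str.lower()

# As a regular expression, '.' matches any single character.
regex_match = names.str.contains("a.b", na=False)

# As a literal, only the exact substring 'a.b' matches.
literal_match = names.str.contains("a.b", regex=False, na=False)

print(regex_match.tolist(), literal_match.tolist())
```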

## Sorting

Pandas supports three kinds of sorting:

1. __sorting by index labels__
2. __sorting by column values__
3. __sorting by a combination of both__

### By index

The __Series.sort_index()__ and __DataFrame.sort_index()__ methods are used to sort a pandas object by its index levels.

In [None]:
df = pd.DataFrame({
        'one': pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
        'two': pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
        'three': pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})
    

In [None]:
unsorted_df = df.reindex(index=['a', 'd', 'c', 'b'], columns=['three', 'two', 'one'])

In [None]:
unsorted_df

In [None]:
unsorted_df.sort_index()

In [None]:
unsorted_df.sort_index(ascending=False)

In [None]:
unsorted_df.sort_index(axis=1)

In [None]:
unsorted_df['three'].sort_index()

### By values

The __Series.sort_values()__ method is used to sort a Series by its values. The __DataFrame.sort_values()__ method is used to sort a DataFrame by its column or row values. The __optional ``by`` parameter to DataFrame.sort_values()__ may be used to specify one or more columns to use to determine the sorted order.

In [None]:
df1 = pd.DataFrame({'one': [2, 1, 1, 1], 'two': [1, 3, 2, 4], 'three': [5, 4, 3, 2]})

In [None]:
df1.sort_values(by='two')

The by parameter can take a list of column names, e.g.:

In [None]:
df1[['one', 'two', 'three']].sort_values(by=['one', 'two'])

These methods have __special treatment of NA values__ via the __``na_position``__ argument:

In [None]:
s[2] = np.nan

In [None]:
s.sort_values()

In [None]:
s.sort_values(ascending=False)

In [None]:
s.sort_values(na_position='first')

### By indexes and values

New in version 0.23.0.

#### Strings passed as the by parameter to ``DataFrame.sort_values()`` may refer to either columns or index level names.

In [None]:
idx = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('a', 2), ('b', 2), ('b', 1), ('b', 1)])

In [None]:
idx

In [None]:
idx2 = idx.copy()
idx2.names = ['first', 'second']
idx2

In [None]:
df_multi = pd.DataFrame({'A': np.arange(6, 0, -1)}, index=idx)

In [None]:
df_multi

In [None]:
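# Name the levels on the frame's own index, so the result does not depend on whether the frame shares the idx object
df_multi.index.names = ['first', 'second']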

In [None]:
df_multi

Sort by ‘second’ (index) and ‘A’ (column):

In [None]:
df_multi.sort_values(by=['second', 'A'])

> #### Note: 
If a string matches both a column name and an index level name then a warning is issued and the column takes precedence. This will result in an ambiguity error in a future version.

### searchsorted

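Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted(). As with NumPy, the values must be sorted in ascending order (or a ``sorter`` must be passed) for the returned positions to be meaningful.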

In [None]:
ser = pd.Series([1, 2, 3, 0])

In [None]:
ser.searchsorted([0, 3, 2, 1])

In [None]:
ser2 = pd.Series([0, 1, 2, 3])

In [None]:
ser2.searchsorted([0, 3, 2, 1])

In [None]:
ser3 = pd.Series([3, 0, 1, 2])

In [None]:
ser3.searchsorted([0, 3, 2, 1])

In [None]:
ser.searchsorted([0, 4])

In [None]:
ser.searchsorted([1, 3], side='right')

In [None]:
ser.searchsorted([1, 3], side='left')

In [None]:
ser = pd.Series([3, 1, 2])

In [None]:
ser.searchsorted([0, 3], sorter=np.argsort(ser))

In [None]:
ser = pd.Series([3, 1, 2, 2, 0])

In [None]:
ser.searchsorted([0, 3], sorter=np.argsort(ser))
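
Because the cells above have no recorded output, here is a compact sketch with the expected positions. The Series names are illustrative. Each returned position is the index at which the value would be inserted to keep the data sorted.

```python
import numpy as np
import pandas as pd

sorted_ser = pd.Series([1, 2, 3])

# side='left' (default): insert before equal values.
left_pos = sorted_ser.searchsorted([0, 3])                  # [0, 2]
# side='right': insert after equal values.
right_pos = sorted_ser.searchsorted([1, 3], side="right")   # [1, 3]

# For unsorted data, pass the permutation that would sort it.
unsorted_ser = pd.Series([3, 1, 2])
order = np.argsort(unsorted_ser.to_numpy())
sorter_pos = unsorted_ser.searchsorted([0, 3], sorter=order)  # [0, 2]

print(left_pos, right_pos, sorter_pos)
```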

### smallest / largest values

Series has the __nsmallest()__ and __nlargest()__ methods which return the smallest or largest n values. __For a large Series this can be much faster than sorting the entire Series and calling head(n) on the result.__

In [None]:
s = pd.Series(np.random.permutation(10))

In [None]:
s

In [None]:
s.sort_values()

In [None]:
s.nsmallest(3)

In [None]:
s.nlargest(3)

DataFrame also has the nlargest and nsmallest methods.

In [None]:
df = pd.DataFrame({'a': [-2, 1, 1, 10, 8, 11, -1], 'b': list('fbdceaf'), 'c': [5.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})

In [None]:
df.nlargest(3, 'a')

In [None]:
df.nlargest(3, 'c')

In [None]:
df.nlargest(5, ['a', 'c'])  # ties in column 'a' are broken by comparing the values in column 'c'

In [None]:
df = pd.DataFrame({'a': [-2, 8, 1, 10, 8, 11, -1], 'b': list('fbdceaf'), 'c': [5.0, 2.0, 4.0, 3.2, np.nan, 3.0, 4.0]})

In [None]:
df.nlargest(5, ['a', 'c'])  # for the 2nd 8 in column 'a' the corresponding element in column 'c' is NaN, so it was ignored

In [None]:
df.nlargest(5, ['a', 'b'])

In [None]:
df.nsmallest(3, 'a')

In [None]:
df.nsmallest(5, ['a', 'c'])

### Sorting by a MultiIndex column

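You must be explicit about sorting when the column is a MultiIndex, and fully specify all levels in ``by``.

In [88]:
# Rebuild the three-column frame from "By values"; the MultiIndex below needs exactly three columns
df1 = pd.DataFrame({'one': [2, 1, 1, 1], 'two': [1, 3, 2, 4], 'three': [5, 4, 3, 2]})
df1.columns = pd.MultiIndex.from_tuples([('a', 'one'), ('a', 'two'), ('b', 'three')])

In [89]:
df1.sort_values(by=('a', 'two'))

If ``df1`` still refers to a two-column frame (such as the ``A``/``B`` frame created earlier), the column assignment fails with ``ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements`` and the sort then raises ``KeyError: ('a', 'two')``.
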

## Copying

__The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they are immutable)__ and returns a new object. Note that it is seldom necessary to copy objects. For example, there are only a handful of ways to alter a DataFrame in-place:

1. Inserting, deleting, or modifying a column.
2. Assigning to the index or columns attributes.
3. For homogeneous data, directly modifying the values via the values attribute or advanced indexing.

To be clear, __no pandas method has the side effect of modifying your data__; almost every method returns a new object, leaving the original object untouched. If the data is modified, it is because you did so explicitly.
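
The difference between binding a second name and calling `copy()` can be sketched as follows. The names are illustrative; the modifications use `.loc`, one of the explicit in-place routes listed above.

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2, 3]})

alias = original             # a second name for the same object
duplicate = original.copy()  # an independent copy of the data

duplicate.loc[0, "a"] = 100  # touches only the copy
alias.loc[1, "a"] = 200      # touches the original as well

print(original["a"].tolist(), duplicate["a"].tolist())
```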

## dtypes

#### For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for ``float, int, bool, timedelta64[ns] and datetime64[ns]`` (note that NumPy does not support timezone-aware datetimes).

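pandas and third-party libraries extend NumPy’s type system in a few places, for example with timezone-aware datetime, categorical, period, interval, sparse, and nullable integer dtypes.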

#### Pandas uses the object dtype for storing strings.

Finally, arbitrary objects may be stored using the object dtype, but should be avoided to the extent possible (for performance and interoperability with other libraries and methods).

#### A convenient __dtypes__ attribute for DataFrame __returns a Series with the data type of each column.__

In [91]:
dft = pd.DataFrame({'A': np.random.rand(3),
                        'B': 1,
                        'C': 'foo',
                        'D': pd.Timestamp('20010102'),
                        'E': pd.Series([1.0] * 3).astype('float32'),
                        'F': False,
                        'G': pd.Series([1] * 3, dtype='int8')})

In [92]:
dft

Unnamed: 0,A,B,C,D,E,F,G
0,0.586424,1,foo,2001-01-02,1.0,False,1
1,0.130134,1,foo,2001-01-02,1.0,False,1
2,0.574744,1,foo,2001-01-02,1.0,False,1


In [93]:
dft.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

#### On a Series object, use the dtype attribute.

In [94]:
dft['A'].dtype

dtype('float64')

If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be __chosen to accommodate all of the data types (object is the most general).__

In [95]:
# these ints are coerced to floats
pd.Series([1, 2, 3, 4, 5, 6.])

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

In [96]:
# string data forces an ``object`` dtype
pd.Series([1, 2, 3, 6., 'foo'])

0      1
1      2
2      3
3      6
4    foo
dtype: object

#### The number of columns of each type in a DataFrame can be found by calling ``DataFrame.dtypes.value_counts()``.

In [97]:
dft.dtypes.value_counts()

datetime64[ns]    1
bool              1
float64           1
int8              1
int64             1
float32           1
object            1
dtype: int64

Numeric dtypes will propagate and can coexist in DataFrames. __If a dtype is passed either directly via the dtype keyword, a passed ndarray, or a passed Series, then it will be preserved in DataFrame operations.__ Furthermore, different numeric dtypes will NOT be combined. 

The following example will give you a taste.

In [98]:
df1 = pd.DataFrame(np.random.randn(8, 1), columns=['A'], dtype='float32')

In [99]:
df1

Unnamed: 0,A
0,-0.734983
1,-0.210326
2,-0.591916
3,0.348375
4,0.152316
5,-0.882287
6,-1.008109
7,0.067841


In [100]:
df1.dtypes

A    float32
dtype: object

In [101]:
df2 = pd.DataFrame({'A': pd.Series(np.random.randn(8), dtype='float16'),
                        'B': pd.Series(np.random.randn(8)),
                        'C': pd.Series(np.array(np.random.randn(8), dtype='uint8')),
                        'D': pd.Series(np.random.randn(8), dtype='uint8'),
                        'E': pd.Series(np.random.randn(8), dtype='int16')})

In [102]:
df2

Unnamed: 0,A,B,C,D,E
0,-1.960938,0.968477,255,0,0
1,1.480469,-0.544675,0,0,0
2,-0.790527,0.332688,0,0,0
3,0.241089,0.701401,0,0,0
4,0.321289,-0.502171,0,0,0
5,1.055664,-0.191218,0,0,0
6,0.937988,0.568605,0,0,-2
7,0.478516,0.356362,0,0,0


In [103]:
df2.dtypes

A    float16
B    float64
C      uint8
D      uint8
E      int16
dtype: object

### defaults

By default __integer types are int64__ and __float types are float64, regardless of platform (32-bit or 64-bit)__. The following will all result in int64 dtypes

In [104]:
pd.DataFrame([1, 2], columns=['a']).dtypes

a    int64
dtype: object

In [105]:
pd.DataFrame({'a': [1, 2]}).dtypes

a    int64
dtype: object

In [106]:
pd.DataFrame({'a': 1}, index=list(range(2))).dtypes

a    int64
dtype: object

#### Note that NumPy will choose platform-dependent types when creating arrays. The following WILL result in int32 on a 32-bit platform (and on 64-bit Windows with NumPy versions before 2.0):

In [107]:
frame = pd.DataFrame(np.array([1, 2]))

In [189]:
frame.dtypes

0    int32
dtype: object

In [192]:
frame = pd.DataFrame(np.random.randint(0, 5, size=(7,2)), columns=list('AB'))
frame

Unnamed: 0,A,B
0,0,2
1,4,1
2,2,1
3,1,1
4,1,0
5,2,0
6,0,2


In [194]:
frame.dtypes

A    int32
B    int32
dtype: object

In [198]:
frame['C'] = 2  # pandas default dtype for an explicitly created column
frame.dtypes

A    int32
B    int32
C    int64
dtype: object
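If the platform-dependent default is a concern, one option is to state the dtype explicitly. The sketch below assumes int64 is the dtype you want; `frame64` and `frame_any` are illustrative names.

```python
import numpy as np
import pandas as pd

# Passing dtype explicitly avoids relying on NumPy's platform default.
arr = np.array([1, 2], dtype='int64')
frame64 = pd.DataFrame(arr, columns=['a'])
print(frame64.dtypes)

# An existing frame can be converted after the fact with astype().
frame_any = pd.DataFrame(np.array([1, 2]), columns=['a']).astype('int64')
print(frame_any.dtypes)
```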

### upcasting

Types can potentially be __upcasted when combined with other types__, meaning they are promoted from the current type (e.g. int to float).

In [109]:
df2.dtypes

A    float16
B    float64
C      uint8
D      uint8
E      int16
dtype: object

In [110]:
df3 = df1.reindex_like(df2).fillna(value=0.0) + df2

In [111]:
df3

Unnamed: 0,A,B,C,D,E
0,-2.695921,0.968477,255.0,0.0,0.0
1,1.270143,-0.544675,0.0,0.0,0.0
2,-1.382443,0.332688,0.0,0.0,0.0
3,0.589463,0.701401,0.0,0.0,0.0
4,0.473605,-0.502171,0.0,0.0,0.0
5,0.173377,-0.191218,0.0,0.0,0.0
6,-0.070121,0.568605,0.0,0.0,-2.0
7,0.546357,0.356362,0.0,0.0,0.0


In [112]:
df3.dtypes

A    float32
B    float64
C    float64
D    float64
E    float64
dtype: object

__DataFrame.to_numpy() will return the lowest common denominator of the dtypes__, meaning the dtype that can __accommodate ALL of the types in the resulting homogeneous dtyped NumPy array__. This can force some upcasting.

In [113]:
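df3.to_numpy().dtype  # expected: dtype('float64')

AttributeError: 'DataFrame' object has no attribute 'to_numpy'

``to_numpy()`` was added in pandas 0.24, so older versions raise the AttributeError shown above; there, ``df3.values.dtype`` gives the same information.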
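As a rough illustration of the common-dtype rule, assuming a pandas version that provides `to_numpy()` (0.24 or later), the frames `nums` and `mixed` below are invented for the example:

```python
import numpy as np
import pandas as pd

nums = pd.DataFrame({
    'i': np.array([1, 2], dtype='int64'),
    'f': np.array([1.5, 2.5], dtype='float32'),
})
# int64 and float32 both fit in float64, so that becomes the common dtype.
print(nums.to_numpy().dtype)

# Adding a string column leaves object as the only dtype that fits everything.
mixed = nums.assign(s=['x', 'y'])
print(mixed.to_numpy().dtype)
```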

### astype

You can use the astype() method to explicitly convert dtypes from one to another. By default it returns a copy, even if the dtype is unchanged (pass copy=False to change this behavior). In addition, it raises an exception if the astype operation is invalid.

__Upcasting is always according to the numpy rules__. If two different dtypes are involved in an operation, then the more general one will be used as the result of the operation.
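Two behaviours of astype() are easy to check on a tiny Series: float-to-integer conversion truncates, and an impossible conversion raises. This is a minimal sketch with made-up values, not an exhaustive description of astype().

```python
import pandas as pd

floats = pd.Series([1.9, -1.9, 3.0])
# Float-to-integer conversion drops the fractional part (truncates toward zero).
ints = floats.astype('int16')
print(ints.tolist())

# An invalid conversion raises instead of silently producing bad data.
raised = False
try:
    pd.Series(['1', 'x']).astype('int64')
except ValueError as exc:
    raised = True
    print('conversion failed:', exc)
```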

In [114]:
df3

Unnamed: 0,A,B,C,D,E
0,-2.695921,0.968477,255.0,0.0,0.0
1,1.270143,-0.544675,0.0,0.0,0.0
2,-1.382443,0.332688,0.0,0.0,0.0
3,0.589463,0.701401,0.0,0.0,0.0
4,0.473605,-0.502171,0.0,0.0,0.0
5,0.173377,-0.191218,0.0,0.0,0.0
6,-0.070121,0.568605,0.0,0.0,-2.0
7,0.546357,0.356362,0.0,0.0,0.0


In [115]:
df3.dtypes

A    float32
B    float64
C    float64
D    float64
E    float64
dtype: object

In [116]:
df3.astype('float32').dtypes

A    float32
B    float32
C    float32
D    float32
E    float32
dtype: object

In [117]:
df3.astype('int16').dtypes

A    int16
B    int16
C    int16
D    int16
E    int16
dtype: object

In [118]:
dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [119]:
dft

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


In [120]:
dft.dtypes

a    int64
b    int64
c    int64
dtype: object

In [121]:
dft[['a', 'b']] = dft[['a', 'b']].astype(np.uint8)

In [122]:
dft

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


In [123]:
dft.dtypes

a    uint8
b    uint8
c    int64
dtype: object

New in version 0.19.0.

Convert certain columns to a specific dtype by passing a dict to astype().

In [124]:
dft1 = pd.DataFrame({'a': [1, 0, 1], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [125]:
dft1.astype({'a': bool, 'c': np.float32})  # builtin bool; the np.bool alias is missing in some NumPy releases

Unnamed: 0,a,b,c
0,True,4,7.0
1,False,5,8.0
2,True,6,9.0


In [126]:
dft1 = dft1.astype({'a': 'bool', 'c': np.float32})

>#### Note: 

#### When trying to convert a subset of columns to a specified type using astype() and loc, upcasting occurs.

loc tries to fit what we are assigning into the current dtypes, while [] overwrites them, taking the dtype from the right-hand side. Therefore the following piece of code produces an unintended result: the columns stay int64.

In [127]:
dft = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

In [128]:
dft.loc[:, ['a', 'b']].astype(np.uint8).dtypes

a    uint8
b    uint8
dtype: object

In [129]:
dft.loc[:, ['a', 'b']] = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [130]:
dft.dtypes

a    int64
b    int64
c    int64
dtype: object

In [131]:
dft = dft.loc[:, ['a', 'b']].astype(np.uint8)

In [132]:
dft.dtypes

a    uint8
b    uint8
dtype: object
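Two ways to convert only some columns while keeping the rest of the frame are sketched below. `dft_demo`, `via_brackets` and `via_dict` are illustrative names; the dict form of astype() is the one described earlier in this section.

```python
import numpy as np
import pandas as pd

dft_demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})

# Option 1: assign through [] so the right-hand side's dtype is taken over.
via_brackets = dft_demo.copy()
via_brackets[['a', 'b']] = via_brackets[['a', 'b']].astype(np.uint8)

# Option 2: let astype() handle the column subset with a dict.
via_dict = dft_demo.astype({'a': np.uint8, 'b': np.uint8})

print(via_brackets.dtypes)
print(via_dict.dtypes)
```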

### object conversion

pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases where the data is already of the correct type, but stored in an object array, the __``DataFrame.infer_objects()`` and ``Series.infer_objects()`` methods can be used to soft convert to the correct type__.

In [133]:
import datetime

df = pd.DataFrame([[1, 2], ['a', 'b'], [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)]])

In [134]:
df

Unnamed: 0,0,1
0,1,2
1,a,b
2,2016-03-02 00:00:00,2016-03-02 00:00:00


In [135]:
df = df.T

In [136]:
df

Unnamed: 0,0,1,2
0,1,a,2016-03-02 00:00:00
1,2,b,2016-03-02 00:00:00


In [137]:
df.dtypes

0    object
1    object
2    object
dtype: object

In [138]:
df.infer_objects()

Unnamed: 0,0,1,2
0,1,a,2016-03-02
1,2,b,2016-03-02


Because the data was transposed, the original inference stored all columns as object, which infer_objects will correct.

In [139]:
df.infer_objects().dtypes

0             int64
1            object
2    datetime64[ns]
dtype: object
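The "soft" part of the conversion can be seen on a single Series: values that are already integers get an integer dtype, while strings that merely look like numbers are not parsed. The names `boxed` and `texty` are made up for this sketch.

```python
import pandas as pd

# Integers stored in an object column, e.g. after a transpose.
boxed = pd.Series([1, 2, 3], dtype='object')
print(boxed.dtype)

# infer_objects() recognises that every element is already an int.
inferred = boxed.infer_objects()
print(inferred.dtype)

# Strings are not parsed: the values are str, so the result is not numeric.
texty = pd.Series(['1', '2', '3'], dtype='object')
print(pd.api.types.is_numeric_dtype(texty.infer_objects()))
```

For the string case, to_numeric() is the tool to reach for instead.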

__to_numeric()__ (conversion to numeric dtypes)

In [140]:
m = ['1.1', 2, 3]

In [141]:
for x in m:
    print(x,'-->',type(x))

1.1 --> <class 'str'>
2 --> <class 'int'>
3 --> <class 'int'>


In [142]:
pd.to_numeric(m)

array([1.1, 2. , 3. ])

__to_datetime()__ (conversion to datetime objects)

In [143]:
import datetime

m = ['2016-07-09', datetime.datetime(2016, 3, 2)]
pd.to_datetime(m)

DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)

__to_timedelta()__ (conversion to timedelta objects)

In [144]:
m = ['5us', pd.Timedelta('1day')]

In [145]:
pd.to_timedelta(m)

TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)

To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements that cannot be converted to desired dtype or object. __By default, ``errors='raise'``__, meaning that any errors encountered will be raised during the conversion process. However, __if ``errors='coerce'``, these errors will be ignored and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric)__. This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but occasionally has non-conforming elements intermixed that you want to represent as missing:

In [146]:
import datetime

m = ['apple', datetime.datetime(2016, 3, 2), pd.Timedelta('1day')]
pd.to_datetime(m, errors='coerce')

DatetimeIndex(['NaT', '2016-03-02', 'NaT'], dtype='datetime64[ns]', freq=None)

In [147]:
# Without errors='coerce', this would raise
# ValueError: Unable to parse string "apple" at position 0
pd.to_numeric(m, errors='coerce')

array([nan, nan, nan])

In [148]:
m = ['apple', 2, 3]

pd.to_numeric(m, errors='coerce')

array([nan,  2.,  3.])

In [149]:
m = ['apple', datetime.datetime(2016, 3, 2), pd.Timedelta('1day')]
pd.to_timedelta(m, errors='coerce')

TimedeltaIndex([NaT, NaT, '1 days'], dtype='timedelta64[ns]', freq=None)
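A typical use of errors='coerce' is cleaning a mostly numeric column and then counting what could not be parsed. The sample values below are invented.

```python
import pandas as pd

raw = pd.Series(['1', '2', 'n/a', '4.5'])   # made-up sample data
clean = pd.to_numeric(raw, errors='coerce')

print(clean)
print('unparseable entries:', int(clean.isna().sum()))
```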

The errors parameter has a third option, __errors='ignore', which simply returns the passed-in data unchanged__ if it encounters any errors during conversion to the desired data type:

In [150]:
import datetime

m = ['apple', datetime.datetime(2016, 3, 2)]
pd.to_datetime(m, errors='ignore')

array(['apple', datetime.datetime(2016, 3, 2, 0, 0)], dtype=object)

In [151]:
m = ['apple', 2, 3]
pd.to_numeric(m, errors='ignore')

array(['apple', 2, 3], dtype=object)

In [152]:
m = ['apple', pd.Timedelta('1day')]
pd.to_timedelta(m, errors='ignore')

array(['apple', Timedelta('1 days 00:00:00')], dtype=object)
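Recent pandas releases (2.2 onwards, to my knowledge) deprecate errors='ignore' in favour of catching the exception yourself. A minimal sketch of that pattern follows; `to_numeric_or_original` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

def to_numeric_or_original(values):
    """Hypothetical helper: convert if possible, else return the input unchanged."""
    try:
        return pd.to_numeric(values)
    except (ValueError, TypeError):
        return values

good = ['1', 2, 3]
bad = ['apple', 2, 3]

print(to_numeric_or_original(good))   # converted to a numeric array
print(to_numeric_or_original(bad))    # returned as-is
```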

In addition to object conversion, __to_numeric() provides another argument downcast__, which gives the option of downcasting the newly (or already) numeric data to a smaller dtype, __which can conserve memory__:

In [153]:
m = ['1', 2, 3]
pd.to_numeric(m, downcast='integer')   # smallest signed int dtype

array([1, 2, 3], dtype=int8)

In [154]:
pd.to_numeric(m)

array([1, 2, 3], dtype=int64)

In [155]:
pd.to_numeric(m, downcast='signed')    # same as 'integer'

array([1, 2, 3], dtype=int8)

In [156]:
pd.to_numeric(m, downcast='int32')    # not a valid option: a concrete dtype name raises ValueError

ValueError: invalid downcasting method provided

In [157]:
pd.to_numeric(m, downcast='unsigned')  # smallest unsigned int dtype

array([1, 2, 3], dtype=uint8)

In [158]:
pd.to_numeric(m, downcast='float')     # smallest float dtype

array([1., 2., 3.], dtype=float32)
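To make the memory claim concrete, the sketch below compares the byte size of a Series before and after downcasting. The data (the integers 0 to 999) is arbitrary.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(1000), dtype='int64')
small = pd.to_numeric(values, downcast='integer')

# 999 does not fit in int8 (max 127) but fits in int16 (max 32767).
print(values.dtype, values.nbytes)   # int64 8000
print(small.dtype, small.nbytes)     # int16 2000
```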

#### As these methods apply only to one-dimensional arrays, lists, or scalars, they cannot be used directly on multi-dimensional objects such as DataFrames. However, with apply(), we can “apply” the function over each column efficiently:

In [None]:
df = pd.DataFrame([['2016-07-09', datetime.datetime(2016, 3, 2)]] * 2, dtype='O')

In [None]:
df

#### Calling pd.to_datetime(df) directly fails here with AttributeError: 'int' object has no attribute 'lower'

In [None]:
df.apply(pd.to_datetime)

In [None]:
df = pd.DataFrame([['1.1', 2, 3]] * 2, dtype='O')
df

In [None]:
df.apply(pd.to_numeric)

In [None]:
df = pd.DataFrame([['5us', pd.Timedelta('1day')]] * 2, dtype='O')

In [None]:
df

In [None]:
df.apply(pd.to_timedelta)
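Keyword arguments given to apply() are passed on to the applied function, so the errors option can be combined with column-wise conversion. The frame `obj_df` below is invented for the example.

```python
import pandas as pd

obj_df = pd.DataFrame([['1.1', 2, 'x'], ['2.2', 3, '4']], dtype='O')

# Extra keyword arguments to apply() are forwarded to pd.to_numeric.
converted = obj_df.apply(pd.to_numeric, errors='coerce')

print(converted)
print(converted.dtypes)
```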

### gotchas

#### Performing selection operations on integer-type data can easily upcast the data to floating point. The dtype of the input data will be preserved in cases where NaNs are not introduced.

In [160]:
df3 = pd.DataFrame([[np.random.randint(0, 5, 8)], [np.random.randint(0, 5, 8)]], columns=list('A'))

In [161]:
df3

Unnamed: 0,A
0,"[1, 4, 2, 1, 0, 3, 0, 3]"
1,"[4, 1, 2, 4, 1, 2, 3, 0]"


In [200]:
df3 = pd.DataFrame(np.random.randint(0, 3, size=(7,2)), columns=list('AB'), index=np.arange(7), dtype='int32')
df3

Unnamed: 0,A,B
0,1,1
1,0,2
2,0,2
3,1,0
4,2,2
5,2,2
6,2,0


In [201]:
df3['C'] = 1  # explicitly assigned column has the int64 data type

In [202]:
df3.dtypes

A    int32
B    int32
C    int64
dtype: object

In [203]:
casted = df3[df3 > 0]

In [204]:
casted

Unnamed: 0,A,B,C
0,1.0,1.0,1
1,,2.0,1
2,,2.0,1
3,1.0,,1
4,2.0,2.0,1
5,2.0,2.0,1
6,2.0,,1


In [205]:
casted.dtypes

A    float64
B    float64
C      int64
dtype: object

#### While float dtypes are unchanged.

In [210]:
dfa = df3.copy()

In [214]:
dfa = dfa.astype('float32')

In [215]:
casted = dfa[dfa > 0]

In [216]:
casted.dtypes

A    float32
B    float32
C    float32
dtype: object
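If the integer dtype matters after such a selection, one possible way back is to fill the NaNs and cast again. This sketch assumes 0 is an acceptable replacement for the masked-out entries; the frame `ints` is made up.

```python
import pandas as pd

ints = pd.DataFrame({'A': [1, 0, 2], 'B': [0, 3, 4]})

# Zeros fail the condition and become NaN, so both columns turn into float64.
masked = ints[ints > 0]
print(masked.dtypes)

# Assumption: 0 is an acceptable stand-in for the masked-out entries.
restored = masked.fillna(0).astype('int64')
print(restored.dtypes)
```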

## Selecting columns based on dtype

#### The select_dtypes() method implements subsetting of columns based on their dtype.

In [217]:
df = pd.DataFrame({'string': list('abc'),
                      'int64': list(range(1, 4)),
                      'uint8': np.arange(3, 6).astype('u1'),
                      'float64': np.arange(4.0, 7.0),
                      'bool1': [True, False, True],
                      'bool2': [False, True, False],
                      'dates': pd.date_range('now', periods=3),
                      'category': pd.Series(list("ABC")).astype('category')})

In [218]:
df['tdeltas'] = df.dates.diff()

In [219]:
df['uint64'] = np.arange(3, 6).astype('u8')

In [220]:
df['other_dates'] = pd.date_range('20130101', periods=3)

In [221]:
df['tz_aware_dates'] = pd.date_range('20130101', periods=3, tz='US/Eastern')

In [222]:
df

Unnamed: 0,string,int64,uint8,float64,bool1,bool2,dates,category,tdeltas,uint64,other_dates,tz_aware_dates
0,a,1,3,4.0,True,False,2019-08-17 22:29:30.173647,A,NaT,3,2013-01-01,2013-01-01 00:00:00-05:00
1,b,2,4,5.0,False,True,2019-08-18 22:29:30.173647,B,1 days,4,2013-01-02,2013-01-02 00:00:00-05:00
2,c,3,5,6.0,True,False,2019-08-19 22:29:30.173647,C,1 days,5,2013-01-03,2013-01-03 00:00:00-05:00


In [223]:
df.dtypes

string                                object
int64                                  int64
uint8                                  uint8
float64                              float64
bool1                                   bool
bool2                                   bool
dates                         datetime64[ns]
category                            category
tdeltas                      timedelta64[ns]
uint64                                uint64
other_dates                   datetime64[ns]
tz_aware_dates    datetime64[ns, US/Eastern]
dtype: object

__select_dtypes() has two parameters ``include`` and ``exclude``__ that allow you to say “give me the columns with these dtypes” (include) and/or “give me the columns without these dtypes” (exclude).

For example, to select bool columns:

In [227]:
df.select_dtypes(include=[bool])

Unnamed: 0,bool1,bool2
0,True,False
1,False,True
2,True,False


You can also pass the name of a dtype in the __NumPy dtype hierarchy:__

In [229]:
df.select_dtypes(include=['bool'])

Unnamed: 0,bool1,bool2
0,True,False
1,False,True
2,True,False


select_dtypes() __also works with generic dtypes.__

For example, to select all numeric and boolean columns while excluding unsigned integers:

In [230]:
df.select_dtypes(include=['number', 'bool'], exclude=['unsignedinteger'])

Unnamed: 0,int64,float64,bool1,bool2,tdeltas
0,1,4.0,True,False,NaT
1,2,5.0,False,True,1 days
2,3,6.0,True,False,1 days


To select string columns you must use the object dtype:

In [231]:
df.select_dtypes(include=['object'])

Unnamed: 0,string
0,a
1,b
2,c


To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a tree of child dtypes:

In [233]:
def subdtypes(dtype):
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]

All NumPy dtypes are subclasses of numpy.generic:

In [235]:
subdtypes(np.generic)

[numpy.generic,
 [[numpy.number,
   [[numpy.integer,
     [[numpy.signedinteger,
       [numpy.int8,
        numpy.int16,
        numpy.int32,
        numpy.int32,
        numpy.int64,
        numpy.timedelta64]],
      [numpy.unsignedinteger,
       [numpy.uint8,
        numpy.uint16,
        numpy.uint32,
        numpy.uint32,
        numpy.uint64]]]],
    [numpy.inexact,
     [[numpy.floating,
       [numpy.float16, numpy.float32, numpy.float64, numpy.float64]],
      [numpy.complexfloating,
       [numpy.complex64, numpy.complex128, numpy.complex128]]]]]],
  [numpy.flexible,
   [[numpy.character, [numpy.bytes_, numpy.str_]],
    [numpy.void, [numpy.record]]]],
  numpy.bool_,
  numpy.datetime64,
  numpy.object_]]
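For checking a single relationship in this hierarchy without printing the whole tree, NumPy's `np.issubdtype` can be used. A few checks that follow from the tree above:

```python
import numpy as np

int_is_integer = np.issubdtype(np.int32, np.integer)
float_is_number = np.issubdtype(np.float64, np.number)
# bool_ sits directly under generic, so it is not a "number".
bool_is_number = np.issubdtype(np.bool_, np.number)

print(int_is_integer, float_is_number, bool_is_number)   # True True False
```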

>#### Note

##### pandas also defines the types category and datetime64[ns, tz], which are not integrated into the normal NumPy hierarchy and won’t show up with the above function.
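These pandas-specific types can still be selected by name with select_dtypes(), using the strings 'category' and 'datetimetz'. The frame `ext` below is invented for the sketch, and UTC is chosen only to keep the example simple.

```python
import pandas as pd

ext = pd.DataFrame({
    'cat': pd.Series(['A', 'B']).astype('category'),
    'when': pd.date_range('20130101', periods=2, tz='UTC'),
    'n': [1, 2],
})

cat_cols = list(ext.select_dtypes(include=['category']).columns)
tz_cols = list(ext.select_dtypes(include=['datetimetz']).columns)

print(cat_cols)
print(tz_cols)
```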