# 5 New Features in pandas 1.0 You Should Know About

pandas 1.0 was released on January 29, 2020.
Although the version jumped from 0.25 to 1.0, there aren't any drastic changes of the kind some pandas users expected. The version bump mainly reflects the maturity of the data processing library.

While there aren't many groundbreaking changes, there are a few that you should know about. 

Let's start with a change in the deprecation policy:
 - deprecations will be introduced in minor releases (e.g. 1.1.0),
 - deprecations and API-breaking changes will be enforced in major releases (e.g. 2.0.0).

Should we upgrade or stay with the current pandas version? The new deprecation policy makes this question easier to answer. It also seems that we can expect more frequent major releases in the future.

## Setup

In [1]:
import os
import platform
import random
from platform import python_version

import jupyterlab
import numpy as np
import pandas as pd

print("System")
print("os name: %s" % os.name)
print("system: %s" % platform.system())
print("release: %s" % platform.release())
print()
print("Python")
print("version: %s" % python_version())
print()
print("Python Packages")
print("jupyterlab==%s" % jupyterlab.__version__)
print("pandas==%s" % pd.__version__)
print("numpy==%s" % np.__version__)

System
os name: posix
system: Darwin
release: 19.2.0

Python
version: 3.8.0

Python Packages
jupyterlab==1.2.4
pandas==1.0.0
numpy==1.18.0


## 1. Dynamic window size with rolling functions

Rolling window functions are very useful when working with time-series data (e.g. calculating a moving average).
Previous versions of pandas required us to pass a fixed window size, e.g. to calculate a moving average over 3 periods.
With pandas 1.0 we can bypass this requirement, as the example below shows.
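
For contrast, here is a minimal sketch of the fixed-size approach. The sample values are made up for this illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration.
s = pd.Series([1, 2, 3, 10, 2, 3])

# Fixed-size window: every mean is taken over exactly 3 consecutive values.
# The first two rows have fewer than 3 values available, so they are NaN.
fixed_mean = s.rolling(3).mean()
print(fixed_mean)
```

Every window here has the same length, no matter what the data looks like. That is the restriction a custom indexer lifts.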

Let's calculate a moving average over a window that keeps expanding until it reaches a value greater than or equal to 10, and then starts over.
First, we create a DataFrame that contains 3 values greater than or equal to 10.

In [2]:
df = pd.DataFrame({'col1': [1, 2, 3, 10, 2, 3, 11, 2, 3, 12, 1, 2]})
df

Unnamed: 0,col1
0,1
1,2
2,3
3,10
4,2
5,3
6,11
7,2
8,3
9,12
10,1
11,2


The window should keep expanding until it reaches a value greater than or equal to 10.

In [3]:
use_expanding =  (df.col1 >= 10).tolist()
use_expanding

[False,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False,
 True,
 False,
 False]

For dynamically sized windows, we need to implement a custom indexer that inherits from the pandas BaseIndexer class.
BaseIndexer has a get_window_bounds method, which calculates the start and end of each window. Keyword arguments passed to the constructor (here window_size and use_expanding) are stored as attributes on the indexer.

In [8]:
from pandas.api.indexers import BaseIndexer

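class CustomIndexer(BaseIndexer):

    def get_window_bounds(self, num_values, min_periods, center, closed):
        # start[i] and end[i] delimit the half-open window [start, end) for row i
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        start_i = 0  # first row of the window that is currently expanding
        for i in range(num_values):
            if self.use_expanding[i]:
                # flagged row: close the window here and restart after it
                start[i] = start_i
                start_i = end[i] = i + 1
            else:
                # regular row: keep expanding from start_i up to and including row i
                start[i] = start_i
                end[i] = i + self.window_size
        print('start', start)
        print('end', end)
        return start, end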


indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)

We pass the indexer to the rolling function and calculate the mean of each window.
We can also observe the start and end indices of each window.

In [50]:
df.rolling(indexer).mean()

start [ 0  0  0  0  4  4  4  7  7  7 10 10]
end [ 1  2  3  4  5  6  7  8  9 10 11 12]


Unnamed: 0,col1
0,1.0
1,1.5
2,2.0
3,4.0
4,2.0
5,2.5
6,5.333333
7,2.0
8,2.5
9,5.666667
10,1.0
11,1.5
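
A note on newer pandas versions: as far as I know, from pandas 1.5 on get_window_bounds also receives a `step` argument, and the four-argument signature shown earlier may be rejected. The sketch below assumes such a newer version (on older releases, keep the four-argument signature). The class name ResettingIndexer is hypothetical, and the loop is simplified because window_size is 1 in our example. It also cross-checks the result by averaging the same slices with NumPy for the 12-value column defined earlier.

```python
import numpy as np
import pandas as pd
from pandas.api.indexers import BaseIndexer


class ResettingIndexer(BaseIndexer):
    """Expanding window that restarts after each flagged row (hypothetical name)."""

    # The extra `step` parameter is an assumption about newer pandas versions.
    def get_window_bounds(self, num_values=0, min_periods=None,
                          center=None, closed=None, step=None):
        start = np.empty(num_values, dtype=np.int64)
        end = np.empty(num_values, dtype=np.int64)
        start_i = 0
        for i in range(num_values):
            start[i] = start_i
            end[i] = i + 1
            if self.use_expanding[i]:
                start_i = i + 1  # the next window starts after the flagged row
        return start, end


values = [1, 2, 3, 10, 2, 3, 11, 2, 3, 12, 1, 2]
df = pd.DataFrame({"col1": values})
flags = (df.col1 >= 10).tolist()

indexer = ResettingIndexer(window_size=1, use_expanding=flags)
rolled = df.rolling(indexer).mean()

# Cross-check: average the same half-open slices [start, end) with NumPy.
start, end = indexer.get_window_bounds(num_values=len(df))
expected = [np.mean(values[s:e]) for s, e in zip(start, end)]
print(np.allclose(rolled.col1, expected))
```

The cross-check makes the meaning of the two arrays explicit: row i is aggregated over values[start[i]:end[i]].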


## 2. Faster Rolling apply

pandas uses Cython as the default execution engine for rolling apply.
In pandas 1.0, we can specify Numba as the execution engine and get a decent speedup.

There are a few things to note:
- the Numba dependency needs to be installed: pip install numba,
- the first run of a function with the Numba engine is slow because Numba has to compile it. However, rolling objects cache the compiled function, so subsequent calls are fast,
- the Numba engine pays off with a larger number of data points (e.g. 1+ million),
- the raw argument needs to be set to True, which means that the function receives NumPy arrays instead of pandas Series to achieve better performance.
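
The notes above can be sketched as follows. The fallback to Cython when Numba is missing and the function name add_five_to_sum are my own additions for illustration, and the data is kept small so the snippet runs quickly.

```python
import numpy as np
import pandas as pd

# Fall back to the default Cython engine when Numba is not importable.
try:
    import numba  # noqa: F401
    engine = "numba"
except ImportError:
    engine = "cython"


def add_five_to_sum(x):
    # With raw=True, x is a NumPy array rather than a pandas Series.
    return np.sum(x) + 5


s = pd.Series(np.arange(1_000, dtype="float64"))
applied = s.rolling(10).apply(add_five_to_sum, engine=engine, raw=True)

# For a function this simple, the built-in rolling sum is an alternative
# that avoids apply altogether.
builtin = s.rolling(10).sum() + 5
print(engine, np.allclose(applied, builtin, equal_nan=True))
```

Comparing against the built-in aggregation is a cheap way to confirm that both engines compute the same thing before timing them.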

Let's create a DataFrame with 1 million values.

In [12]:
df = pd.DataFrame({"col1": pd.Series(range(1_000_000))})
df.head()

Unnamed: 0,col1
0,0
1,1
2,2
3,3
4,4


some_function calculates the sum of values and adds 5. 

In [13]:
def some_function(x):
    return np.sum(x) + 5

Let's measure execution time with the Cython execution engine.

In [14]:
%%timeit

df.col1.rolling(100).apply(some_function, engine='cython', raw=True)

4.03 s ± 76.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


With the Cython engine, the calculation took 4.03 seconds. Is Numba faster? Let's try it.

In [16]:
%%timeit

df.col1.rolling(100).apply(some_function, engine='numba', raw=True)

500 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


We see that Numba is about 8 times faster (500 ms vs. 4.03 s) in this toy example.

## 3. New NA value

pandas 1.0 introduces a new experimental pd.NA value to represent scalar missing values.

I know what you are thinking: yet another null value? Aren't there NaN, None and NaT already?

The goal of pd.NA is to provide consistency across data types. It is currently used by the Int64, boolean and the new string data types.

Let's create a Series of integers with None.

In [20]:
s = pd.Series([3, 6, 9, None], dtype="Int64")
s

0       3
1       6
2       9
3    <NA>
dtype: Int64
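
To see what the nullable dtype changes, here is a small sketch comparing it with the default type inference for the same values.

```python
import pandas as pd

# Default inference: the missing value forces a float dtype and shows up as NaN.
s_float = pd.Series([3, 6, 9, None])

# Nullable integer dtype: the values stay integers and the gap becomes <NA>.
s_int = pd.Series([3, 6, 9, None], dtype="Int64")

print(s_float.dtype, s_int.dtype)
print(s_int[3] is pd.NA)
```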

What surprised me is that NA == NA produces NA, while np.nan == np.nan produces False.

In [21]:
s.loc[3] == s.loc[3]

<NA>

In [22]:
np.nan == np.nan

False
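
Since a comparison with NA does not give a plain True or False, a short sketch of how to work with it may help. It uses only pd.NA and pd.isna. The variable name ambiguous is just for illustration.

```python
import pandas as pd

# Because NA == NA is itself NA, use pd.isna() to test for a missing value.
print(pd.isna(pd.NA))

# NA follows three-valued logic: the result is known only when the other
# operand decides it on its own.
print(pd.NA | True)
print(pd.NA & False)
print(pd.NA & True)

# NA has no truth value, so using it in an `if` raises a TypeError.
try:
    bool(pd.NA)
    ambiguous = False
except TypeError:
    ambiguous = True
print(ambiguous)
```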

## 4. New String type

pandas 1.0 finally has a dedicated (experimental) string type.
Before 1.0, strings were stored as objects, so we couldn't be sure whether a Series contained only strings or a mix of strings and other data types, as I demonstrate below.

In [27]:
s = pd.Series(['an', 'ban', 'pet', 'podgan', None])
s

0        an
1       ban
2       pet
3    podgan
4      None
dtype: object

Storing strings as objects becomes a problem when we unintentionally mix them with integers or floats: the data type stays object.

In [51]:
s = pd.Series(['an', 'ban', 5, 'pet', 5.0, 'podgan', None])
s

0        an
1       ban
2         5
3       pet
4         5
5    podgan
6      None
dtype: object

To try the new string dtype, we need to set dtype='string'. pandas would raise an exception if we tried to put an integer or a float into such a Series. Great improvement!

In [52]:
s = pd.Series(['an', 'ban', 'pet', 'podgan', None], dtype='string')
s

0        an
1       ban
2       pet
3    podgan
4      <NA>
dtype: string
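
A minimal sketch of that protection, assuming that assigning a non-string value is rejected. The exact exception type may differ between pandas versions, so the snippet catches both likely candidates.

```python
import pandas as pd

s = pd.Series(['an', 'ban', 'pet'], dtype='string')

# Assigning a non-string value into a string Series is rejected. The exact
# exception type is an assumption here, so both likely candidates are caught.
try:
    s[0] = 5
    rejected = False
except (TypeError, ValueError) as exc:
    rejected = True
    print(type(exc).__name__, exc)

print(rejected, s.tolist())
```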

## 5. Ignore index on a sorted DataFrame

When we sort a DataFrame by a certain column, the index labels move with their rows, so the index is no longer in order.
Sometimes we don't want that.
In pandas 1.0, the sort_values function takes an ignore_index argument, which relabels the sorted rows from 0 to n - 1.

In [47]:
df = pd.DataFrame({"col1": [1, 3, 5, 2, 3, 7, 1, 2]})

In [48]:
df.sort_values('col1')

Unnamed: 0,col1
0,1
6,1
3,2
7,2
1,3
4,3
2,5
5,7


In [49]:
df.sort_values('col1', ignore_index=True)

Unnamed: 0,col1
0,1
1,1
2,2
3,2
4,3
5,3
6,5
7,7
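
For comparison, the same result can be obtained with reset_index, which also works on earlier versions. The variable names new_way and old_way are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 3, 5, 2, 3, 7, 1, 2]})

# pandas 1.0 and later: relabel the rows while sorting.
new_way = df.sort_values("col1", ignore_index=True)

# Equivalent two-step idiom that also works on earlier versions.
old_way = df.sort_values("col1").reset_index(drop=True)

print(new_way.equals(old_way))
```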


## Conclusion

These were the 5 most interesting features of pandas 1.0, in my opinion.
In the long term, the new NA value for missing values could bring a lot of clarity to pandas, e.g. on how functions handle missing values and whether they skip them or not.
To learn more about new features in pandas 1.0 read [What’s new in 1.0.0](https://pandas.pydata.org/docs/whatsnew/v1.0.0.html).

Did you enjoy the post? Let me know in the comments below.