## Options and settings
- From https://pandas.pydata.org/docs/user_guide/options.html
#### Overview
- pandas has an options API to configure and customize global behavior related to DataFrame display, data behavior and more.

- Options have a full “dotted-style”, case-insensitive name (e.g. display.max_rows). You can get/set options directly as attributes of the top-level options attribute:

In [23]:
import pandas as pd
import numpy as np

In [2]:
pd.options.display.max_rows

60

In [3]:
pd.options.display.max_rows = 999
pd.options.display.max_rows

999

- The API is composed of 5 relevant functions, available directly from the pandas namespace:
    - get_option() / set_option() - get/set the value of a single option.
    - reset_option() - reset one or more options to their default value.
    - describe_option() - print the descriptions of one or more options.
    - option_context() - execute a codeblock with a set of options that revert to prior settings after execution.

- **Note**: Developers can check out pandas/core/config_init.py for more information.

- All of the functions above accept a regexp pattern (re.search style) as an argument, to match an unambiguous substring:

In [4]:
pd.get_option("display.chop_threshold")

In [5]:
pd.set_option("display.chop_threshold", 2)
pd.get_option("display.chop_threshold")

2

In [6]:
pd.set_option("chop", 4)
pd.get_option("display.chop_threshold")

4

- The following will not work because it matches multiple option names, e.g. display.max_colwidth, display.max_rows, display.max_columns:

In [7]:
try:
    pd.get_option("max")
except Exception as e:
    print(e)

Pattern matched multiple keys


- **Warning**: Using this form of shorthand may cause your code to break if new options with similar names are added in future versions.
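
The ambiguity described above can be illustrated with plain `re.search`. This is only a sketch of the matching idea, not pandas's own implementation: `OPTION_NAMES` is a hand-picked subset of the option names mentioned in this section, and `matching_options` is a hypothetical helper.

```python
import re

# A hand-picked subset of option names mentioned in the text above;
# pandas itself defines many more.
OPTION_NAMES = [
    "display.chop_threshold",
    "display.max_colwidth",
    "display.max_rows",
    "display.max_columns",
]


def matching_options(pattern):
    """Return the option names that `pattern` matches, re.search style."""
    return [name for name in OPTION_NAMES if re.search(pattern, name, re.I)]


print(matching_options("chop"))              # a single match: usable as shorthand
print(matching_options("max"))               # several matches: ambiguous
print(matching_options("display.max_rows"))  # the full name matches only itself here
```

Any new option whose name contains "chop" would turn the first call into an ambiguous one, which is why spelling out the full dotted name is the safer habit in code that has to keep working across versions.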

#### Available options
- You can get a list of available options and their descriptions with describe_option(). When called with no argument describe_option() will print out the descriptions for all available options.

In [8]:
pd.describe_option()

compute.use_bottleneck : bool
    Use the bottleneck library to accelerate if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]
compute.use_numba : bool
    Use the numba engine option for select operations if it is installed,
    the default is False
    Valid values: False,True
    [default: False] [currently: False]
compute.use_numexpr : bool
    Use the numexpr library to accelerate computation if it is installed,
    the default is True
    Valid values: False,True
    [default: True] [currently: True]
display.chop_threshold : float or None
    if set to a float value, all float values smaller than the given threshold
    will be displayed as exactly 0 by repr and friends.
    [default: None] [currently: 4]
display.colheader_justify : 'left'/'right'
    Controls the justification of column headers. used by DataFrameFormatter.
    [default: right] [currently: right]
display.date_dayfirst : boolean
    When True, prints and p

#### Getting and setting options
- As described above, get_option() and set_option() are available from the pandas namespace. To change an option, call set_option('option regex', new_value).

In [9]:
pd.get_option("mode.sim_interactive")

False

In [11]:
pd.set_option("mode.sim_interactive", True)
pd.get_option("mode.sim_interactive")

True

- **Note**: The option 'mode.sim_interactive' is mostly used for debugging purposes.

- You can use reset_option() to revert to a setting’s default value:

In [13]:
pd.reset_option("display.max_rows")
pd.get_option("display.max_rows")

60

In [15]:
pd.set_option("display.max_rows", 999)
pd.get_option("display.max_rows")


999

In [16]:
pd.reset_option("display.max_rows")
pd.get_option("display.max_rows")

60

- It’s also possible to reset multiple options at once (using a regex):

In [17]:
pd.reset_option("^display")

- The option_context() context manager is exposed through the top-level API, allowing you to execute code with given option values. Option values are restored automatically when you exit the with block:

In [18]:
with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
    print(pd.get_option("display.max_rows"))
    print(pd.get_option("display.max_columns"))

10
5


In [19]:
print(pd.get_option("display.max_rows"))

60


In [20]:
print(pd.get_option("display.max_columns"))

20
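
Because option_context() is a context manager, the previous value should also come back when the block exits through an exception. A minimal sketch to check this, assuming only that pandas is installed (the `RuntimeError` is simulated for illustration):

```python
import pandas as pd

before = pd.get_option("display.max_rows")

try:
    with pd.option_context("display.max_rows", 10):
        inside = pd.get_option("display.max_rows")  # temporarily 10
        raise RuntimeError("simulated failure inside the block")
except RuntimeError:
    pass

after = pd.get_option("display.max_rows")
print(before, inside, after)
```

This makes option_context() a reasonable choice inside functions that should not leak display settings to their callers.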


#### Setting startup options in Python/IPython environment
- Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas more efficient. To do this, create a .py or .ipy script in the startup directory of the desired profile. An example where the startup folder is in a default IPython profile can be found at:

`$IPYTHONDIR/profile_default/startup`
- More information can be found in the IPython documentation. An example startup script for pandas is displayed below:
```
import pandas as pd

pd.set_option("display.max_rows", 999)
pd.set_option("display.precision", 5)
```

#### Frequently used options
- The following demonstrates the more frequently used display options.

- display.max_rows and display.max_columns set the maximum number of rows and columns displayed when a frame is pretty-printed. Truncated lines are replaced by an ellipsis.

In [24]:
df = pd.DataFrame(np.random.randn(7, 2))
pd.set_option("display.max_rows", 7)
df

Unnamed: 0,0,1
0,-1.41881,0.219813
1,0.894406,2.235602
2,-0.08626,-0.65912
3,-0.457991,1.079203
4,0.733295,-0.866521
5,-0.818501,0.129374
6,-1.920624,0.489486


In [25]:
pd.set_option("display.max_rows", 5)
df

Unnamed: 0,0,1
0,-1.418810,0.219813
1,0.894406,2.235602
...,...,...
5,-0.818501,0.129374
6,-1.920624,0.489486


In [26]:
pd.reset_option("display.max_rows")

- Once display.max_rows is exceeded, the display.min_rows option determines how many rows are shown in the truncated repr.

In [27]:
pd.set_option("display.max_rows", 8)

In [28]:
pd.set_option("display.min_rows", 4)

In [29]:
# below max_rows -> all rows shown
df = pd.DataFrame(np.random.randn(7, 2))
df

Unnamed: 0,0,1
0,-0.892951,-0.942119
1,0.883813,-0.516302
2,0.100339,1.049957
3,1.019165,0.474998
4,-0.545581,0.298873
5,1.45422,1.568611
6,0.911065,-0.076305


In [30]:
# above max_rows -> only min_rows (4) rows shown
df = pd.DataFrame(np.random.randn(9, 2))
df

Unnamed: 0,0,1
0,0.083025,0.345263
1,-0.716125,0.432014
...,...,...
7,0.078505,-0.863045
8,-1.410607,0.373723


In [31]:
pd.reset_option("display.max_rows")

In [32]:
pd.reset_option("display.min_rows")
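
The interaction between max_rows and min_rows can also be checked programmatically. The sketch below counts the data rows in the text repr; `shown_rows` is a hypothetical helper written for this note, and it assumes an integer index so that data rows can be recognized by their leading digit.

```python
import pandas as pd


def shown_rows(frame):
    """Count the data rows that appear in the text repr of `frame`."""
    count = 0
    for line in repr(frame).splitlines():
        tokens = line.split()
        # Data rows start with the integer index label; the header,
        # the ellipsis row and the dimensions footer do not.
        if tokens and tokens[0].isdigit():
            count += 1
    return count


small = pd.DataFrame({"a": range(7), "b": range(7)})
large = pd.DataFrame({"a": range(9), "b": range(9)})

with pd.option_context("display.max_rows", 8, "display.min_rows", 4):
    n_small = shown_rows(small)  # 7 rows fit within max_rows: nothing is truncated
    n_large = shown_rows(large)  # 9 rows exceed max_rows: only min_rows rows remain

print(n_small, n_large)
```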

- display.expand_frame_repr allows the representation of a DataFrame to stretch across pages, wrapped over all the columns.

In [33]:
df = pd.DataFrame(np.random.randn(5, 10))
pd.set_option("expand_frame_repr", True)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-1.165463,2.405511,1.440528,-1.05888,-0.972618,-0.311289,0.414952,0.461477,0.954443,-0.046303
1,0.56,-0.525217,-1.689576,0.258014,0.591623,0.321571,0.120592,-0.28438,1.083424,-0.630289
2,-0.06899,0.185447,1.420214,-1.096215,-0.456867,1.252336,-1.204527,0.845049,-0.218032,1.050561
3,1.666696,0.124911,-0.418924,-0.194222,1.87733,0.814498,0.370914,-0.200904,-0.868371,1.172553
4,0.450305,-0.017505,1.0441,-0.887876,-1.345452,0.211718,0.65765,0.850819,0.389997,2.097916


In [34]:
pd.set_option("expand_frame_repr", False)
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,-1.165463,2.405511,1.440528,-1.05888,-0.972618,-0.311289,0.414952,0.461477,0.954443,-0.046303
1,0.56,-0.525217,-1.689576,0.258014,0.591623,0.321571,0.120592,-0.28438,1.083424,-0.630289
2,-0.06899,0.185447,1.420214,-1.096215,-0.456867,1.252336,-1.204527,0.845049,-0.218032,1.050561
3,1.666696,0.124911,-0.418924,-0.194222,1.87733,0.814498,0.370914,-0.200904,-0.868371,1.172553
4,0.450305,-0.017505,1.0441,-0.887876,-1.345452,0.211718,0.65765,0.850819,0.389997,2.097916


In [35]:
pd.reset_option("expand_frame_repr")

- display.large_repr displays a DataFrame that exceeds max_columns or max_rows as a truncated frame or summary.

In [36]:
df = pd.DataFrame(np.random.randn(10, 10))
pd.set_option("display.max_rows", 5)
pd.set_option("large_repr", "truncate")
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.956541,-1.421720,-1.846848,0.049126,1.108337,0.594252,-0.797085,0.144054,0.543875,0.811759
1,-1.449604,0.032663,0.126410,0.843034,-0.198385,-0.582485,-0.707214,-0.458690,-0.158021,-0.846513
...,...,...,...,...,...,...,...,...,...,...
8,0.452494,0.767900,-0.556128,-0.102928,-1.992757,-0.351006,0.425198,0.230334,-0.556987,-1.007726
9,0.301072,-0.910749,-1.006256,-0.430117,2.719589,-1.192059,0.627744,-0.321896,-1.054234,1.846205


In [37]:
pd.set_option("large_repr", "info")
df

In [38]:
pd.reset_option("large_repr")

In [39]:
pd.reset_option("display.max_rows")

- display.max_colwidth sets the maximum width of columns. Cells of this length or longer will be truncated with an ellipsis.

In [40]:
df = pd.DataFrame(
    np.array(
        [
            ["foo", "bar", "bim", "uncomfortably long string"],
            ["horse", "cow", "banana", "apple"],
        ]
    )
)
pd.set_option("max_colwidth", 40)
df


Unnamed: 0,0,1,2,3
0,foo,bar,bim,uncomfortably long string
1,horse,cow,banana,apple


In [41]:
pd.set_option("max_colwidth", 6)
df

Unnamed: 0,0,1,2,3
0,foo,bar,bim,un...
1,horse,cow,ba...,apple


In [42]:
pd.reset_option("max_colwidth")

- display.max_info_columns sets a threshold for the number of columns displayed when calling info().

In [43]:
df = pd.DataFrame(np.random.randn(10, 10))
pd.set_option("max_info_columns", 11)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       10 non-null     float64
 1   1       10 non-null     float64
 2   2       10 non-null     float64
 3   3       10 non-null     float64
 4   4       10 non-null     float64
 5   5       10 non-null     float64
 6   6       10 non-null     float64
 7   7       10 non-null     float64
 8   8       10 non-null     float64
 9   9       10 non-null     float64
dtypes: float64(10)
memory usage: 932.0 bytes


In [44]:
pd.set_option("max_info_columns", 5)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Columns: 10 entries, 0 to 9
dtypes: float64(10)
memory usage: 932.0 bytes


In [45]:
pd.reset_option("max_info_columns")

- display.max_info_rows: info() will usually show null-counts for each column. For a large DataFrame, this can be quite slow. max_info_rows and max_info_cols limit this null check to the specified rows and columns respectively. The info() keyword argument show_counts=True will override this.

In [46]:
df = pd.DataFrame(np.random.choice([0, 1, np.nan], size=(10, 10)))
df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,1.0,,1.0,0.0,0.0,0.0,,0.0,,
1,0.0,0.0,0.0,0.0,,1.0,0.0,0.0,,1.0
2,,0.0,0.0,,1.0,0.0,,0.0,0.0,
3,0.0,0.0,0.0,,,1.0,0.0,0.0,0.0,
4,,0.0,,,0.0,1.0,,0.0,1.0,
5,,1.0,0.0,0.0,1.0,0.0,,,1.0,
6,1.0,0.0,0.0,,0.0,,1.0,0.0,0.0,
7,1.0,0.0,1.0,,1.0,0.0,,0.0,0.0,
8,1.0,0.0,,1.0,1.0,,,,,1.0
9,,0.0,0.0,0.0,0.0,,1.0,,,0.0


In [47]:
pd.set_option("max_info_rows", 11)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       6 non-null      float64
 1   1       9 non-null      float64
 2   2       8 non-null      float64
 3   3       5 non-null      float64
 4   4       8 non-null      float64
 5   5       7 non-null      float64
 6   6       4 non-null      float64
 7   7       7 non-null      float64
 8   8       6 non-null      float64
 9   9       3 non-null      float64
dtypes: float64(10)
memory usage: 932.0 bytes


In [48]:
pd.set_option("max_info_rows", 5)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
 #   Column  Dtype  
---  ------  -----  
 0   0       float64
 1   1       float64
 2   2       float64
 3   3       float64
 4   4       float64
 5   5       float64
 6   6       float64
 7   7       float64
 8   8       float64
 9   9       float64
dtypes: float64(10)
memory usage: 932.0 bytes


In [49]:
pd.reset_option("max_info_rows")

- display.precision sets the output display precision in terms of decimal places.

In [50]:
df = pd.DataFrame(np.random.randn(5, 5))
pd.set_option("display.precision", 7)
df

Unnamed: 0,0,1,2,3,4
0,0.4918806,0.9787373,-0.1019045,-0.3706899,-0.5364794
1,1.6367906,-0.9271002,-0.3449427,0.270177,0.6623656
2,0.2386898,0.404493,0.5760232,-0.1726571,-2.1650113
3,-0.3888557,-2.6312682,0.3140012,-0.0792154,0.1011701
4,-0.9750703,1.033209,-0.5068128,-0.222984,-2.4617529


In [51]:
pd.set_option("display.precision", 4)
df

Unnamed: 0,0,1,2,3,4
0,0.4919,0.9787,-0.1019,-0.3707,-0.5365
1,1.6368,-0.9271,-0.3449,0.2702,0.6624
2,0.2387,0.4045,0.576,-0.1727,-2.165
3,-0.3889,-2.6313,0.314,-0.0792,0.1012
4,-0.9751,1.0332,-0.5068,-0.223,-2.4618


- display.chop_threshold sets the rounding threshold to zero when displaying a Series or DataFrame. This setting does not change the precision at which the number is stored.

In [52]:
df = pd.DataFrame(np.random.randn(6, 6))
pd.set_option("chop_threshold", 0)
df

Unnamed: 0,0,1,2,3,4,5
0,-1.0821,0.9194,-0.6762,1.1658,0.9404,-0.0962
1,0.4862,0.8577,-1.3034,1.4159,2.3017,-0.2921
2,1.8785,0.0211,0.1108,0.0648,1.2811,-0.1875
3,-2.0213,0.4576,0.9917,0.0077,0.1266,-1.6913
4,0.7665,-0.7889,1.3337,0.0981,-0.105,-0.1647
5,-0.1225,-0.6952,1.1362,0.3446,0.5566,0.0465


In [53]:
pd.set_option("chop_threshold", 0.5)
df

Unnamed: 0,0,1,2,3,4,5
0,-1.0821,0.9194,-0.6762,1.1658,0.9404,0.0
1,0.0,0.8577,-1.3034,1.4159,2.3017,0.0
2,1.8785,0.0,0.0,0.0,1.2811,0.0
3,-2.0213,0.0,0.9917,0.0,0.0,-1.6913
4,0.7665,-0.7889,1.3337,0.0,0.0,0.0
5,0.0,-0.6952,1.1362,0.0,0.5566,0.0


In [54]:
pd.reset_option("chop_threshold")

- display.colheader_justify controls the justification of the headers. The options are 'right' and 'left'.

In [55]:
df = pd.DataFrame(
    np.array([np.random.randn(6), np.random.randint(1, 9, 6) * 0.1, np.zeros(6)]).T,
    columns=["A", "B", "C"],
    dtype="float",
)
pd.set_option("colheader_justify", "right")
df

Unnamed: 0,A,B,C
0,0.4858,0.8,0.0
1,0.5151,0.5,0.0
2,-0.6092,0.6,0.0
3,0.3047,0.8,0.0
4,-0.6208,0.1,0.0
5,-0.6921,0.2,0.0


In [56]:
pd.set_option("colheader_justify", "left")
df

Unnamed: 0,A,B,C
0,0.4858,0.8,0.0
1,0.5151,0.5,0.0
2,-0.6092,0.6,0.0
3,0.3047,0.8,0.0
4,-0.6208,0.1,0.0
5,-0.6921,0.2,0.0


In [57]:
pd.reset_option("colheader_justify")

#### Number formatting
- pandas also allows you to set how numbers are displayed in the console. This option is not set through the set_option API.

- Use the set_eng_float_format function to alter the floating-point formatting of pandas objects to produce a particular format.

In [58]:
pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
s / 1.0e3

a     -1.206m
b   -719.562u
c    202.092u
d     -1.134m
e     -1.351m
dtype: float64

In [59]:
s / 1.0e6

a     -1.206u
b   -719.562n
c    202.092n
d     -1.134u
e     -1.351u
dtype: float64

- Use round() to specifically control rounding of an individual DataFrame.
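
As a small illustration of the round() point above, the sketch below rounds one DataFrame without touching any global display option. The column names and values are made up for the example.

```python
import pandas as pd

df_r = pd.DataFrame({"a": [1.23456, 2.34567], "b": [0.000123, 9.87654]})

rounded = df_r.round(2)                    # same number of decimals everywhere
per_column = df_r.round({"a": 1, "b": 3})  # decimals chosen per column

print(rounded)
print(per_column)
print(df_r)  # round() returns a new object; the original values are unchanged
```

Unlike display.precision, this changes the values in the returned DataFrame rather than only how they are printed.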
- **Warning**: Enabling display.unicode.east_asian_width (described below) will affect the performance of printing DataFrame and Series (about 2 times slower). Use it only when it is actually required.
- Some East Asian countries use Unicode characters whose width corresponds to two Latin characters. If a DataFrame or Series contains these characters, the default output mode may not align them properly.

In [60]:
df = pd.DataFrame({"国籍": ["UK", "日本"], "名前": ["Alice", "しのぶ"]})
df

Unnamed: 0,国籍,名前
0,UK,Alice
1,日本,しのぶ


- Enabling display.unicode.east_asian_width allows pandas to check each character’s “East Asian Width” property. These characters can be aligned properly by setting this option to True. However, this will result in longer render times than the standard len function.

In [61]:
pd.set_option("display.unicode.east_asian_width", True)
df

Unnamed: 0,国籍,名前
0,UK,Alice
1,日本,しのぶ


- In addition, Unicode characters whose width is “ambiguous” can either be 1 or 2 characters wide depending on the terminal setting or encoding. The option display.unicode.ambiguous_as_wide can be used to handle the ambiguity.

- By default, an “ambiguous” character’s width, such as “¡” (inverted exclamation) in the example below, is taken to be 1.

In [62]:
df = pd.DataFrame({"a": ["xxx", "¡¡"], "b": ["yyy", "¡¡"]})
df

Unnamed: 0,a,b
0,xxx,yyy
1,¡¡,¡¡


- Enabling display.unicode.ambiguous_as_wide makes pandas interpret these characters’ widths to be 2. (Note that this option will only be effective when display.unicode.east_asian_width is enabled.)

- However, setting this option incorrectly for your terminal will cause these characters to be aligned incorrectly:

In [63]:
pd.set_option("display.unicode.ambiguous_as_wide", True)
df

Unnamed: 0,a,b
0,xxx,yyy
1,¡¡,¡¡
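
The “East Asian Width” property mentioned above is available from Python's standard `unicodedata` module. The following is a rough sketch of how a width-aware length could be computed from it; `display_width` is a hypothetical helper for illustration and is not how pandas itself is implemented.

```python
import unicodedata


def display_width(text, ambiguous_as_wide=False):
    """Estimate how many terminal cells `text` occupies."""
    width = 0
    for ch in text:
        kind = unicodedata.east_asian_width(ch)
        if kind in ("W", "F"):
            width += 2  # wide or fullwidth characters take two cells
        elif kind == "A" and ambiguous_as_wide:
            width += 2  # ambiguous characters, treated as wide on request
        else:
            width += 1
    return width


print(len("日本"), display_width("日本"))
print(display_width("¡¡"), display_width("¡¡", ambiguous_as_wide=True))
```

The per-character lookup is extra work compared with a plain len() call, which is consistent with the slower rendering noted above.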


#### Table schema display
- DataFrame and Series can publish a Table Schema representation. This is disabled by default and can be enabled globally with the display.html.table_schema option:
`pd.set_option("display.html.table_schema", True)`
- Only 'display.max_rows' rows are serialized and published.

### Enhancing performance
- From https://pandas.pydata.org/docs/user_guide/enhancingperf.html
- In this part of the tutorial, we will investigate how to speed up certain functions operating on pandas DataFrame using Cython, Numba and pandas.eval(). Generally, using Cython and Numba can offer a larger speedup than using pandas.eval() but will require a lot more code.
- **Note**: In addition to following the steps in this tutorial, users interested in enhancing performance are highly encouraged to install the recommended dependencies for pandas. These dependencies are often not installed by default, but will offer speed improvements if present.
#### Cython (writing C extensions for pandas)
- For many use cases writing pandas in pure Python and NumPy is sufficient. In some computationally heavy applications, however, it can be possible to achieve sizable speed-ups by offloading work to Cython.

- This tutorial assumes you have refactored as much as possible in Python, for example by trying to remove for-loops and making use of NumPy vectorization. It’s always worth optimising in Python first.

- This tutorial walks through a “typical” process of cythonizing a slow computation. We use an example from the Cython documentation but in the context of pandas. Our final cythonized solution is around 100 times faster than the pure Python solution.

#### Pure Python
- We have a DataFrame to which we want to apply a function row-wise.

In [64]:
df = pd.DataFrame(
    {
        "a": np.random.randn(1000),
        "b": np.random.randn(1000),
        "N": np.random.randint(100, 1000, (1000)),
        "x": "x",
    }
)
df

Unnamed: 0,a,b,N,x
0,22.699m,-125.598m,137,x
1,1.573,360.089m,980,x
2,1.446,47.228m,715,x
3,-90.639m,213.699m,574,x
4,1.315,256.140m,759,x
...,...,...,...,...
995,232.979m,72.869m,102,x
996,-299.586m,408.607m,440,x
997,-36.579m,751.297m,809,x
998,-1.523,325.623m,894,x


- Here’s the function in pure Python:

In [65]:
def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx
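
In the spirit of optimising in Python first, the inner loop of the function above can be replaced by a single NumPy expression before any Cython is involved. This is a sketch for comparison only; `integrate_f_numpy` is a name introduced here and is not part of the original tutorial.

```python
import numpy as np


def f(x):
    return x * (x - 1)


def integrate_f(a, b, N):
    """Left Riemann sum of f on [a, b] with N steps, one Python call per step."""
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx


def integrate_f_numpy(a, b, N):
    """Same sum, with f evaluated on all N sample points in one NumPy call."""
    dx = (b - a) / N
    x = a + np.arange(N) * dx
    return f(x).sum() * dx


print(integrate_f(0.0, 1.0, 500), integrate_f_numpy(0.0, 1.0, 500))
```

The two results can differ in the last few digits because the additions happen in a different order, so they should be compared with a tolerance rather than with `==`.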


- We achieve our result by using DataFrame.apply() (row-wise):

In [66]:
%timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)

60.1 ms ± 5.12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


- Let’s take a look and see where the time is spent during this operation using the prun ipython magic function:

In [68]:
# most time consuming 4 calls
# %prun -l 4 df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)  # noqa E999


- By far the majority of time is spent inside either integrate_f or f, hence we’ll concentrate our efforts on cythonizing these two functions.

#### Plain Cython
- First we're going to need to import the Cython magic function to IPython:

In [70]:
%load_ext Cython

- Now, let’s simply copy our functions over to Cython:

In [71]:
%%cython
def f_plain(x):
    return x * (x - 1)
def integrate_f_plain(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx

In [72]:
%timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)


57.7 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


- This is expected to improve performance over the pure Python approach by about one-third, although in the run above the gain was much smaller (57.7 ms versus 60.1 ms).

#### Declaring C types
- We can annotate the function variables and return types as well as use cdef and cpdef to improve performance:

In [73]:
%%cython
cdef double f_typed(double x) except? -2:
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx

In [74]:
%timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)


7.29 ms ± 974 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


- Annotating the functions with C types yields a large performance improvement over the original Python implementation: about eight times in the run above (7.29 ms versus 60.1 ms).

#### Using ndarray
- When re-profiling, time is spent creating a Series from each row, and calling __getitem__ from both the index and the series (three times for each row). These Python function calls are expensive and can be improved by passing an np.ndarray.

In [75]:
%%cython
cimport numpy as np
import numpy as np
cdef double f_typed(double x) except? -2:
    return x * (x - 1)
cpdef double integrate_f_typed(double a, double b, int N):
    cdef int i
    cdef double s, dx
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx
cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
                                           np.ndarray col_N):
    assert (col_a.dtype == np.float64
            and col_b.dtype == np.float64 and col_N.dtype == np.dtype(int))
    cdef Py_ssize_t i, n = len(col_N)
    assert (len(col_a) == len(col_b) == n)
    cdef np.ndarray[double] res = np.empty(n)
    for i in range(len(col_a)):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res

- This implementation creates an uninitialized result array with np.empty and fills it with the result of integrate_f_typed applied over each row. Looping over an ndarray is faster in Cython than looping over a Series object.

- Since apply_integrate_f is typed to accept an np.ndarray, Series.to_numpy() calls are needed to utilize this function.

In [79]:
# %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())


- Performance has improved from the prior implementation by almost ten times.

#### Disabling compiler directives
- The majority of the time is now spent in apply_integrate_f. Disabling Cython’s boundscheck and wraparound checks can yield more performance.

In [80]:
# %prun -l 4 apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())


In [81]:
%%cython
cimport cython
cimport numpy as np
import numpy as np
cdef np.float64_t f_typed(np.float64_t x) except? -2:
    return x * (x - 1)
cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
    cdef np.int64_t i
    cdef np.float64_t s = 0.0, dx
    dx = (b - a) / N
    for i in range(N):
        s += f_typed(a + i * dx)
    return s * dx
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
    np.ndarray[np.float64_t] col_a,
    np.ndarray[np.float64_t] col_b,
    np.ndarray[np.int64_t] col_N
):
    cdef np.int64_t i, n = len(col_N)
    assert len(col_a) == len(col_b) == n
    cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
    for i in range(n):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res

In [83]:
# %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())


#### Numba (JIT compilation)
- An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with Numba.

- Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran, by decorating your function with @jit.

- Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.

- **Note**: The @jit compilation will add overhead to the runtime of the function, so performance benefits may not be realized especially when using small data sets. Consider caching your function to avoid compilation overhead each time your function is run.

- Numba can be used in 2 ways with pandas:
    - Specify the engine="numba" keyword in select pandas methods
    - Define your own Python function decorated with @jit and pass the underlying NumPy array of Series or DataFrame (using Series.to_numpy()) into the function

#### pandas Numba Engine
- If Numba is installed, one can specify engine="numba" in select pandas methods to execute the method using Numba. Methods that support engine="numba" will also have an engine_kwargs keyword that accepts a dictionary that allows one to specify "nogil", "nopython" and "parallel" keys with boolean values to pass into the @jit decorator. If engine_kwargs is not specified, it defaults to {"nogil": False, "nopython": True, "parallel": False} unless otherwise specified.

- **Note**: In terms of performance, the first time a function is run using the Numba engine will be slow as Numba will have some function compilation overhead. However, the JIT compiled functions are cached, and subsequent calls will be fast. In general, the Numba engine is performant with a larger amount of data points (e.g. 1+ million).

In [84]:
data = pd.Series(range(1_000_000))  # noqa: E225
roll = data.rolling(10)
def f(x):
    return np.sum(x) + 5

# Run the first time, compilation time will affect performance
%timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)

4.76 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)


In [85]:
# Function is cached and performance will improve
%timeit roll.apply(f, engine='numba', raw=True)

109 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [86]:
%timeit roll.apply(f, engine='cython', raw=True)

2.59 s ± 72.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


- If your compute hardware contains multiple CPUs, the largest performance gain can be realized by setting parallel to True to leverage more than 1 CPU. Internally, pandas leverages numba to parallelize computations over the columns of a DataFrame; therefore, this performance benefit is only realized for a DataFrame with a large number of columns.

In [87]:
import numba

In [88]:
numba.set_num_threads(1)
df = pd.DataFrame(np.random.randn(10_000, 100))
roll = df.rolling(100)
%timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})


18.7 ms ± 814 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [89]:
numba.set_num_threads(2)
%timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})


9.96 ms ± 332 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### Custom Function Examples
- A custom Python function decorated with @jit can be used with pandas objects by passing their NumPy array representations with Series.to_numpy().

In [90]:

@numba.jit
def f_plain(x):
    return x * (x - 1)


@numba.jit
def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx


@numba.jit
def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    result = np.empty(n, dtype="float64")
    assert len(col_a) == len(col_b) == n
    for i in range(n):
        result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return result


def compute_numba(df):
    result = apply_integrate_f_numba(
        df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy()
    )
    return pd.Series(result, index=df.index, name="result")

In [93]:
# %timeit compute_numba(df)


- In this example, using Numba was faster than Cython.

- Numba can also be used to write vectorized functions that do not require the user to explicitly loop over the observations of a vector; a vectorized function will be applied to each row automatically. Consider the following example of doubling each observation:

In [94]:
def double_every_value_nonumba(x):
    return x * 2


@numba.vectorize
def double_every_value_withnumba(x):  # noqa E501
    return x * 2

In [96]:
# Custom function without numba
# %timeit df["col1_doubled"] = df["a"].apply(double_every_value_nonumba)  # noqa E501

# Standard implementation (faster than a custom function)
# %timeit df["col1_doubled"] = df["a"] * 2

# Custom function with numba
# %timeit df["col1_doubled"] = double_every_value_withnumba(df["a"].to_numpy())


#### Caveats
- Numba is best at accelerating functions that apply numerical functions to NumPy arrays. If you try to @jit a function that contains unsupported Python or NumPy code, compilation will revert to object mode, which will most likely not speed up your function. If you would prefer that Numba throw an error if it cannot compile a function in a way that speeds up your code, pass Numba the argument nopython=True (e.g. @jit(nopython=True)). For more on troubleshooting Numba modes, see the Numba troubleshooting page.

- Using parallel=True (e.g. @jit(parallel=True)) may result in a SIGABRT if the threading layer leads to unsafe behavior. You can first specify a safe threading layer before running a JIT function with parallel=True.

- Generally, if you encounter a segfault (SIGSEGV) while using Numba, please report the issue to the Numba issue tracker.

#### Expression evaluation via eval()
- The top-level function pandas.eval() implements performant expression evaluation of Series and DataFrame objects. Expression evaluation allows operations to be expressed as strings and can potentially provide a performance improvement by evaluating arithmetic and boolean expressions all at once for large DataFrames.

- **Note**: You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact, eval() is many orders of magnitude slower for smaller expressions or objects than plain Python. A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.
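- A minimal sketch of the idea above, assuming two hypothetical frames df1 and df2 (not part of the original notes) that sit above the 10,000-row rule of thumb. It only checks that the string form and the plain pandas form agree; whether eval() is actually faster depends on the machine and on whether numexpr is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, sized above the 10,000-row rule of thumb.
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.standard_normal((20_000, 10)))
df2 = pd.DataFrame(rng.standard_normal((20_000, 10)))

# Plain pandas: each operator produces an intermediate DataFrame.
plain = df1 + 2 * df2

# pandas.eval(): the whole expression is handed over as one string.
# The names are passed explicitly through local_dict in this sketch.
evaluated = pd.eval("df1 + 2 * df2", local_dict={"df1": df1, "df2": df2})

same_values = bool(np.allclose(plain, evaluated))
print(same_values)
```

- Wrapping both forms in %timeit on your own data is the most reliable way to decide whether the string form is worth it.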

#### Supported syntax
- These operations are supported by pandas.eval():
    - Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
    - Comparison operations, including chained comparisons, e.g., 2 < df < df2
    - Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
    - list and tuple literals, e.g., [1, 2] or (1, 2)
    - Attribute access, e.g., df.a
    - Subscript expressions, e.g., df[0]
    - Simple variable evaluation, e.g., pd.eval("df") (this is not very useful)
    - Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs, arctan2 and log10.
- The following Python syntax is not allowed:
    - Expressions
        - Function calls other than math functions.
        - is/is not operations
        - if expressions
        - lambda expressions
        - list/set/dict comprehensions
        - Literal dict and set expressions
        - yield expressions
        - Generator expressions
        - Boolean expressions consisting of only scalar values
    - Statements
        - Neither simple nor compound statements are allowed. This includes for, while, and if.
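- A short sketch of the two lists above, using a hypothetical Series s. It evaluates two supported expressions and then tries two disallowed ones. The exact exception class is not assumed here, so the loop catches Exception and records the class name.

```python
import pandas as pd

s = pd.Series([-1, 1, 2, 5])
names = {"s": s}  # names made visible to the expressions

# Supported: arithmetic and a chained comparison.
arith = pd.eval("s * 2 + 1", local_dict=names)
chained = pd.eval("0 < s < 3", local_dict=names)

# Not allowed: an `is` comparison and an `if` expression.
errors = []
for expr in ["s is s", "s if True else s"]:
    try:
        pd.eval(expr, local_dict=names)
    except Exception as exc:  # exception class may differ between versions
        errors.append(type(exc).__name__)

print(arith.tolist(), chained.tolist(), errors)
```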
#### Local variables
- You must explicitly reference any local variable that you want to use in an expression by placing the @ character in front of the name. This mechanism is the same for both DataFrame.query() and DataFrame.eval(). For example,

In [97]:
df = pd.DataFrame(np.random.randn(5, 2), columns=list("ab"))
newcol = np.random.randn(len(df))
df.eval("b + @newcol")

0      -1.018
1       1.775
2      -3.568
3   -816.830m
4      -1.410
dtype: float64

In [98]:
df.query("b < @newcol")

Unnamed: 0,a,b
2,-2.55,-2.434
4,-1.547,-721.786m


- If you don’t prefix the local variable with @, pandas will raise an exception telling you the variable is undefined.
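- A minimal sketch of that failure, with a hypothetical frame and a local variable named limit. The sketch catches NameError on the assumption that the "undefined variable" error pandas raises is a subclass of it.

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.5, 2.5, 1.0]})
limit = 1.5  # local variable, not a column

message = None
try:
    # Without @, pandas looks for a column or index level named "limit".
    df.query("b < limit")
except NameError as exc:
    message = str(exc)

# With @, the local variable is used.
filtered = df.query("b < @limit")

print(message)
print(filtered.index.tolist())
```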

- When using DataFrame.eval() and DataFrame.query(), this allows you to have a local variable and a DataFrame column with the same name in an expression.

In [99]:
a = np.random.randn()
df.query("@a < a")

Unnamed: 0,a,b
1,1.300,1.233
3,816.659m,-336.873m


In [100]:
df.loc[a < df["a"]]  # same as the previous expression

Unnamed: 0,a,b
1,1.300,1.233
3,816.659m,-336.873m


- **Warning**: pandas.eval() will raise an exception if you use the @ prefix, because the prefix isn’t defined in that context.

In [102]:
a, b = 1, 2
try:
    pd.eval("@a + b")
except Exception as e:
    print(e)

The '@' prefix is not allowed in top-level eval calls.
please refer to your variables by name without the '@' prefix.


In [103]:
pd.eval("a + b")

np.int64(3)

#### pandas.eval() parsers
- There are two different expression syntax parsers.

- The default 'pandas' parser allows a more intuitive syntax for expressing query-like operations (comparisons, conjunctions and disjunctions). In particular, the precedence of the & and | operators is made equal to the precedence of the corresponding boolean operations and and or.

- For example, the conjunction in the following cell can be written without parentheses. Alternatively, you can use the 'python' parser to enforce strict Python semantics.

In [104]:
nrows, ncols = 20000, 100
df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"
x = pd.eval(expr, parser="python")
expr_no_parens = "df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0"
y = pd.eval(expr_no_parens, parser="pandas")

np.all(x == y)


np.True_

- The same expression can be “anded” together with the word and as well:

In [105]:
expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"
x = pd.eval(expr, parser="python")
expr_with_ands = "df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0"
y = pd.eval(expr_with_ands, parser="pandas")
np.all(x == y)

np.True_

- The and and or operators here have the same precedence that they would in Python.
#### pandas.eval() engines
- There are two different expression engines.

- The 'numexpr' engine is the more performant engine and can yield performance improvements compared to standard Python syntax for large DataFrames. This engine requires the optional dependency numexpr to be installed.

- The 'python' engine is generally not useful except for testing other evaluation engines against it. You will achieve no performance benefits using eval() with engine='python' and may incur a performance hit.

In [106]:
%timeit df1 + df2 + df3 + df4

20.9 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [107]:
%timeit pd.eval("df1 + df2 + df3 + df4", engine="python")

20.7 ms ± 244 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### The DataFrame.eval() method
- In addition to the top level pandas.eval() function you can also evaluate an expression in the “context” of a DataFrame.

In [108]:
df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])
df.eval("a + b")

0    275.278m
1      -3.251
2    761.283m
3      -1.816
4      -1.635
dtype: float64

- Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with the added benefit that you don’t have to prefix the name of the DataFrame to the column(s) you’re interested in evaluating.

- In addition, you can perform assignment of columns within an expression. This allows for formulaic evaluation. The assignment target can be a new column name or an existing column name, and it must be a valid Python identifier.

In [109]:
df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
df = df.eval("c = a + b")
df = df.eval("d = a + b + c")
df = df.eval("a = 1")
df

Unnamed: 0,a,b,c,d
0,1,5,5,10
1,1,6,7,14
2,1,7,9,18
3,1,8,11,22
4,1,9,13,26


- A copy of the DataFrame with the new or modified columns is returned, and the original frame is unchanged.

In [110]:
df

Unnamed: 0,a,b,c,d
0,1,5,5,10
1,1,6,7,14
2,1,7,9,18
3,1,8,11,22
4,1,9,13,26


In [111]:
df.eval("e = a - c")

Unnamed: 0,a,b,c,d,e
0,1,5,5,10,-4
1,1,6,7,14,-6
2,1,7,9,18,-8
3,1,8,11,22,-10
4,1,9,13,26,-12


In [112]:
df

Unnamed: 0,a,b,c,d
0,1,5,5,10
1,1,6,7,14
2,1,7,9,18
3,1,8,11,22
4,1,9,13,26
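- The cells above rely on the default behaviour of returning a new frame. As a hedged aside, DataFrame.eval() also accepts an inplace keyword (not mentioned in these notes). The sketch below uses a fresh hypothetical frame to contrast the two modes.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)})

# Default: a new frame comes back and df keeps its two columns.
with_c = df.eval("c = a + b")
columns_after_default = list(df.columns)

# inplace=True: df itself gains the column and the call returns None.
returned = df.eval("d = a * b", inplace=True)
columns_after_inplace = list(df.columns)

print(columns_after_default, columns_after_inplace, returned)
```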


- Multiple column assignments can be performed by using a multi-line string.

In [113]:
df.eval(
    """
c = a + b
d = a + b + c
a = 1""",
)

Unnamed: 0,a,b,c,d
0,1,5,6,12
1,1,6,7,14
2,1,7,8,16
3,1,8,9,18
4,1,9,10,20


- The equivalent in standard Python would be

In [114]:
df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
df["c"] = df["a"] + df["b"]
df["d"] = df["a"] + df["b"] + df["c"]
df["a"] = 1
df


Unnamed: 0,a,b,c,d
0,1,5,5,10
1,1,6,7,14
2,1,7,9,18
3,1,8,11,22
4,1,9,13,26


#### eval() performance comparison
- pandas.eval() works well with expressions containing large arrays.

In [115]:
nrows, ncols = 20000, 100
df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]


- DataFrame arithmetic:

In [116]:
%timeit df1 + df2 + df3 + df4

20.9 ms ± 534 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [117]:
%timeit pd.eval("df1 + df2 + df3 + df4")

8.25 ms ± 649 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


- DataFrame comparison:

In [118]:
%timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)


15.8 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [119]:
%timeit pd.eval("(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)")


6.85 ms ± 535 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


- DataFrame arithmetic with unaligned axes.

In [120]:
s = pd.Series(np.random.randn(50))
%timeit df1 + df2 + df3 + df4 + s

32.1 ms ± 577 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [121]:
%timeit pd.eval("df1 + df2 + df3 + df4 + s")

8.56 ms ± 539 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


- **Note**: Operations such as
```
1 and 2  # would parse to 1 & 2, but should evaluate to 2
3 or 4  # would parse to 3 | 4, but should evaluate to 3
~1  # this is okay, but slower when using eval
```
- should be performed in Python. An exception will be raised if you try to perform any boolean/bitwise operations with scalar operands that are not of type bool or np.bool_.
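- A small sketch of the note above: plain Python handles the scalar expressions, while the same text passed to pandas.eval() is rejected. The exception class is deliberately not assumed, so the sketch catches Exception.

```python
import pandas as pd

# Plain Python returns one of the operands.
python_and = 1 and 2  # evaluates to 2
python_or = 3 or 4    # evaluates to 3

# The same scalar-only boolean expression is rejected by pandas.eval().
try:
    pd.eval("1 and 2")
    rejected = False
except Exception as exc:  # exception class is not assumed here
    rejected = True
    reason = str(exc)

print(python_and, python_or, rejected)
```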

- Here is a plot showing the running time of pandas.eval() as a function of the size of the frame involved in the computation. The two lines are two different engines.

- You will only see the performance benefits of using the numexpr engine with pandas.eval() if your DataFrame has more than approximately 100,000 rows.

- This plot was created using a DataFrame with 3 columns each containing floating point values generated using numpy.random.randn().

#### Expression evaluation limitations with numexpr
- Expressions that would result in an object dtype or involve datetime operations because of NaT must be evaluated in Python space, but part of an expression can still be evaluated with numexpr. For example:

In [122]:
df = pd.DataFrame(
    {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
)
df

Unnamed: 0,strings,nums
0,c,0
1,c,0
2,c,0
3,b,1
4,b,1
5,b,1
6,a,2
7,a,2
8,a,2


In [123]:
df.query("strings == 'a' and nums == 1")

Unnamed: 0,strings,nums


- The numeric part of the comparison (nums == 1) will be evaluated by numexpr and the object part of the comparison (strings == 'a') will be evaluated by Python.

### Scaling to large datasets
- From https://pandas.pydata.org/docs/user_guide/scale.html
- pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.

- This document provides a few recommendations for scaling your analysis to larger datasets. It’s a complement to Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.

#### Load less data
- Suppose our raw dataset on disk has many columns.

In [124]:
def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
    index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
    n = len(index)
    state = np.random.RandomState(seed)
    columns = {
        "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
        "id": state.poisson(1000, size=n),
        "x": state.rand(n) * 2 - 1,
        "y": state.rand(n) * 2 - 1,
    }
    df = pd.DataFrame(columns, index=index, columns=sorted(columns))
    if df.index[-1] == end:
        df = df.iloc[:-1]
    return df

timeseries = [
    make_timeseries(freq="1min", seed=i).rename(columns=lambda x: f"{x}_{i}")
    for i in range(10)
]

ts_wide = pd.concat(timeseries, axis=1)

ts_wide.head()

Unnamed: 0_level_0,id_0,name_0,x_0,y_0,id_1,name_1,x_1,y_1,id_2,name_2,...,x_7,y_7,id_8,name_8,x_8,y_8,id_9,name_9,x_9,y_9
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2000-01-01 00:00:00,977,Alice,-821.225m,906.222m,975,Bob,-288.451m,-215.082m,1047,Alice,...,-371.775m,697.468m,1048,Alice,403.201m,-756.503m,1025,Charlie,-957.208m,-757.508m
2000-01-01 00:01:00,1018,Bob,-219.182m,350.855m,1032,Alice,919.521m,-338.915m,1043,Bob,...,-570.205m,-473.155m,1037,Bob,-690.994m,-623.366m,981,Alice,-414.445m,-100.298m
2000-01-01 00:02:00,927,Alice,660.908m,-798.511m,967,Alice,628.664m,763.875m,963,Alice,...,-690.044m,-912.261m,987,Bob,656.727m,579.849m,923,Charlie,-325.838m,581.859m
2000-01-01 00:03:00,997,Bob,-852.458m,735.260m,1021,Bob,995.494m,514.133m,952,Charlie,...,-397.596m,248.303m,1013,Bob,-132.701m,-173.416m,1042,Bob,992.033m,-686.692m
2000-01-01 00:04:00,965,Bob,717.283m,393.391m,1011,Bob,-143.403m,-282.985m,973,Charlie,...,574.683m,-764.567m,1010,Charlie,-741.446m,-886.785m,964,Charlie,-924.556m,-184.161m


In [None]:
ts_wide.to_parquet("data/timeseries_wide.parquet")

- To load the columns we want, we have two options. Option 1 loads in all the data and then filters to what we need.

In [126]:
columns = ["id_0", "name_0", "x_0", "y_0"]
pd.read_parquet("data/timeseries_wide.parquet")[columns]

Unnamed: 0_level_0,id_0,name_0,x_0,y_0
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,977,Alice,-821.225m,906.222m
2000-01-01 00:01:00,1018,Bob,-219.182m,350.855m
2000-01-01 00:02:00,927,Alice,660.908m,-798.511m
2000-01-01 00:03:00,997,Bob,-852.458m,735.260m
2000-01-01 00:04:00,965,Bob,717.283m,393.391m
...,...,...,...,...
2000-12-30 23:56:00,1037,Bob,-814.321m,612.836m
2000-12-30 23:57:00,980,Bob,232.195m,-618.828m
2000-12-30 23:58:00,965,Alice,-231.131m,26.310m
2000-12-30 23:59:00,984,Alice,942.819m,853.128m


- Option 2 only loads the columns we request.

In [128]:
pd.read_parquet("data/timeseries_wide.parquet", columns=columns)


Unnamed: 0_level_0,id_0,name_0,x_0,y_0
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,977,Alice,-821.225m,906.222m
2000-01-01 00:01:00,1018,Bob,-219.182m,350.855m
2000-01-01 00:02:00,927,Alice,660.908m,-798.511m
2000-01-01 00:03:00,997,Bob,-852.458m,735.260m
2000-01-01 00:04:00,965,Bob,717.283m,393.391m
...,...,...,...,...
2000-12-30 23:56:00,1037,Bob,-814.321m,612.836m
2000-12-30 23:57:00,980,Bob,232.195m,-618.828m
2000-12-30 23:58:00,965,Alice,-231.131m,26.310m
2000-12-30 23:59:00,984,Alice,942.819m,853.128m


- If we were to measure the memory usage of the two calls, we’d see that specifying columns uses about 1/10th the memory in this case.

- With pandas.read_csv(), you can specify usecols to limit the columns read into memory. Not all file formats that can be read by pandas provide an option to read a subset of columns.
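- A minimal sketch of usecols, reading from an in-memory string so that it runs without any file on disk. The tiny CSV text and its column names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a wide file on disk.
csv_text = "id,name,x,y\n1,Alice,0.1,0.2\n2,Bob,0.3,0.4\n"

# Only the requested columns are parsed into the resulting frame.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["id", "x"])

print(list(subset.columns))
```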

#### Use efficient datatypes
- The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.

In [129]:
ts = make_timeseries(freq="30s", seed=0)
ts.to_parquet("data/timeseries.parquet")
ts = pd.read_parquet("data/timeseries.parquet")
ts

Unnamed: 0_level_0,id,name,x,y
timestamp,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2000-01-01 00:00:00,1041,Alice,889.987m,281.011m
2000-01-01 00:00:30,988,Bob,-455.299m,488.153m
2000-01-01 00:01:00,1018,Alice,96.061m,580.473m
2000-01-01 00:01:30,992,Bob,142.482m,41.665m
2000-01-01 00:02:00,960,Bob,-36.235m,802.159m
...,...,...,...,...
2000-12-30 23:58:00,1022,Alice,266.191m,875.579m
2000-12-30 23:58:30,974,Alice,-9.826m,413.686m
2000-12-30 23:59:00,1028,Charlie,307.108m,-656.789m
2000-12-30 23:59:30,1002,Alice,202.602m,541.335m


- Now, let’s inspect the data types and memory usage to see where we should focus our attention.

In [130]:
ts.dtypes

id        int32
name     object
x       float64
y       float64
dtype: object

In [131]:
ts.memory_usage(deep=True)  # memory usage in bytes

Index     8409608
id        4204804
name     56766826
x         8409608
y         8409608
dtype: int64

- The name column is taking up much more memory than any other. It has just a few unique values, so it’s a good candidate for converting to a pandas.Categorical. With a pandas.Categorical, we store each unique name once and use space-efficient integers to know which specific name is used in each row.

In [132]:
ts2 = ts.copy()
ts2["name"] = ts2["name"].astype("category")
ts2.memory_usage(deep=True)


Index    8409608
id       4204804
name     1051471
x        8409608
y        8409608
dtype: int64

- We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().

In [133]:
ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")
ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")
ts2.dtypes

id        uint16
name    category
x        float32
y        float32
dtype: object

In [134]:
ts2.memory_usage(deep=True)

Index    8409608
id       2102402
name     1051471
x        4204804
y        4204804
dtype: int64

In [135]:
reduction = ts2.memory_usage(deep=True).sum() / ts.memory_usage(deep=True).sum()
print(f"{reduction:0.2f}")


0.23


- In all, we’ve reduced the in-memory footprint of this dataset to about 23% of its original size, a little under a quarter.

#### Use chunking
- Some workloads can be achieved with chunking by splitting a large problem into a bunch of small problems. For example, converting an individual CSV file into a Parquet file and repeating that for each file in a directory. As long as each chunk fits in memory, you can work with datasets that are much larger than memory.

- **Note**: Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks. For more complicated workflows, you’re better off using other libraries.

- Suppose we have an even larger “logical dataset” on disk that’s a directory of parquet files. Each file in the directory represents a different year of the entire dataset.

In [136]:
import pathlib
N = 12
starts = [f"20{i:>02d}-01-01" for i in range(N)]
ends = [f"20{i:>02d}-12-13" for i in range(N)]
pathlib.Path("data/timeseries").mkdir(exist_ok=True)
for i, (start, end) in enumerate(zip(starts, ends)):
    ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
    ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")

- Now we’ll implement an out-of-core pandas.Series.value_counts(). The peak memory usage of this workflow is the single largest chunk, plus a small series storing the unique value counts up to this point. As long as each individual file fits in memory, this will work for arbitrary-sized datasets.

In [137]:
%%time
files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
counts = pd.Series(dtype=int)
for path in files:
    df = pd.read_parquet(path)
    counts = counts.add(df["name"].value_counts(), fill_value=0)
counts.astype(int)

CPU times: total: 1 s
Wall time: 1.07 s


name
Alice      1994645
Bob        1993692
Charlie    1994875
dtype: int64

- Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
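- As a sketch of that option, the same running value_counts() pattern can be applied to a single CSV by passing chunksize, which makes read_csv() yield DataFrames piece by piece. The CSV text below is made up and held in memory so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a large file on disk.
rows = ["Alice", "Bob", "Alice", "Charlie", "Bob", "Alice"]
csv_text = "name,x\n" + "".join(f"{name},{i}\n" for i, name in enumerate(rows))

counts = pd.Series(dtype="int64")
# With chunksize set, read_csv returns an iterator of DataFrames.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    counts = counts.add(chunk["name"].value_counts(), fill_value=0)
counts = counts.astype("int64")

print(counts.to_dict())
```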

- Manual chunking is a reasonable option for workflows that don’t require very sophisticated operations. Some operations, like pandas.DataFrame.groupby(), are much harder to do chunkwise. In these cases, you may be better off switching to a different library that implements these out-of-core algorithms for you.
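- To illustrate why groupby is harder, here is a hedged sketch of a chunkwise grouped mean. It works only because a mean can be rebuilt from per-chunk sums and counts; aggregations that do not decompose this way (a median, for example) need a different approach. The data and column names are made up.

```python
import io

import pandas as pd

# Hypothetical CSV content; groups are split across chunks on purpose.
csv_text = "name,x\nAlice,1\nBob,2\nAlice,3\nBob,6\nAlice,5\n"

totals = None
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # Per-chunk partial aggregates: sum and count for each group.
    part = chunk.groupby("name")["x"].agg(["sum", "count"])
    # Groups missing from a chunk contribute 0 to both columns.
    totals = part if totals is None else totals.add(part, fill_value=0)

chunked_mean = totals["sum"] / totals["count"]

print(chunked_mean.to_dict())
```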

#### Use Other Libraries
- There are other libraries which provide similar APIs to pandas and work nicely with pandas DataFrame, and can give you the ability to scale your large dataset processing and analytics by parallel runtime, distributed memory, clustering, etc. You can find more information in the ecosystem page.

### Sparse data structures
- From https://pandas.pydata.org/docs/user_guide/sparse.html
- pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0” sense. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.

In [138]:
arr = np.random.randn(10)
arr[2:-2] = np.nan
ts = pd.Series(pd.arrays.SparseArray(arr))
ts

0    696.890m
1      -1.397
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8     55.749m
9    993.277m
dtype: Sparse[float64, nan]

- Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren’t actually stored, only the non-nan elements are. Those non-nan elements have a float64 dtype.

- The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:

In [139]:
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
sdf = df.astype(pd.SparseDtype("float", np.nan))
sdf.head()

Unnamed: 0,0,1,2,3
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,


In [140]:
sdf.dtypes

0    Sparse[float64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
3    Sparse[float64, nan]
dtype: object

In [141]:
sdf.sparse.density

np.float64(0.0002)

- As you can see, the density (% of values that have not been “compressed”) is extremely low. This sparse object takes up much less memory on disk (pickled) and in the Python interpreter.

In [142]:
'dense : {:0.2f} kB'.format(df.memory_usage().sum() / 1e3)

'dense : 320.13 kB'

In [143]:
'sparse: {:0.2f} kB'.format(sdf.memory_usage().sum() / 1e3)

'sparse: 0.23 kB'

- Functionally, their behavior should be nearly identical to their dense counterparts.
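- A small sketch of that claim, comparing a hypothetical dense Series with its sparse counterpart: a reduction gives the same answer, and converting back yields the original values.

```python
import numpy as np
import pandas as pd

values = np.array([1.0, np.nan, np.nan, 2.0, np.nan, np.nan, np.nan, np.nan])
dense = pd.Series(values)
sparse = pd.Series(pd.arrays.SparseArray(values))

# The same reduction on both representations.
dense_sum = dense.sum()
sparse_sum = sparse.sum()

# Two of the eight values are actually stored.
density = sparse.sparse.density

# Converting back gives a Series equal to the dense original.
round_trip_equal = sparse.sparse.to_dense().equals(dense)

print(dense_sum, sparse_sum, density, round_trip_equal)
```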
#### SparseArray
- arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on extension arrays). It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:

In [144]:
arr = np.random.randn(10)
arr[2:5] = np.nan
arr[7:8] = np.nan
sparr = pd.arrays.SparseArray(arr)
sparr

[0.2950153851526463, 0.19492931235777924, nan, nan, nan, 0.16794675839328033, -0.7857965624047829, nan, 0.48435540288516066, -0.5822005610498053]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)

- A sparse array can be converted to a regular (dense) ndarray with numpy.asarray()

In [145]:
np.asarray(sparr)

array([ 0.29501539,  0.19492931,         nan,         nan,         nan,
        0.16794676, -0.78579656,         nan,  0.4843554 , -0.58220056])

#### SparseDtype
- The SparseArray.dtype property stores two pieces of information
    - The dtype of the non-sparse values
    - The scalar fill value

In [146]:
sparr.dtype

Sparse[float64, nan]

- A SparseDtype may be constructed by passing only a dtype

In [147]:
pd.SparseDtype(np.dtype('datetime64[ns]'))

Sparse[datetime64[ns], np.datetime64('NaT')]

- in which case a default fill value will be used (for NumPy dtypes this is often the “missing” value for that dtype). To override this default an explicit fill value may be passed instead

In [148]:
pd.SparseDtype(np.dtype('datetime64[ns]'),
               fill_value=pd.Timestamp('2017-01-01'))

Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
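- As a further sketch, the default fill value can be inspected through the fill_value attribute. The dtypes below are chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Default fill values picked by SparseDtype for a few NumPy dtypes.
int_fill = pd.SparseDtype("int64").fill_value
float_fill = pd.SparseDtype("float64").fill_value
bool_fill = pd.SparseDtype("bool").fill_value

# An explicit fill value overrides the default.
custom = pd.SparseDtype("float64", fill_value=0.0)

print(int_fill, float_fill, bool_fill, custom.fill_value)
```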

- Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places

In [149]:
pd.array([1, 0, 0, 2], dtype='Sparse[int]')

[1, 0, 0, 2]
Fill: 0
IntIndex
Indices: array([0, 3], dtype=int32)

#### Sparse accessor
- pandas provides a .sparse accessor, similar to .str for string data, .cat for categorical data, and .dt for datetime-like data. This namespace provides attributes and methods that are specific to sparse data.

In [150]:
s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
s.sparse.density

0.5

In [151]:
s.sparse.fill_value

0

- This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series with sparse data from a scipy COO matrix.

- A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.
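- A minimal sketch of the "only on data with SparseDtype" restriction: the accessor raises AttributeError on a dense Series and works once the data has been converted. The Series is made up for illustration.

```python
import pandas as pd

dense = pd.Series([0, 0, 1, 2])

message = None
try:
    dense.sparse  # dense data has no sparse accessor
except AttributeError as exc:
    message = str(exc)

# After conversion, the accessor is available.
sparse = dense.astype("Sparse[int]")
density = sparse.sparse.density

print(message)
print(density)
```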

#### Sparse calculation
- You can apply NumPy ufuncs to arrays.SparseArray and get an arrays.SparseArray as a result.

In [152]:
arr = pd.arrays.SparseArray([1., np.nan, np.nan, -2., np.nan])
np.abs(arr)

[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)

- The ufunc is also applied to fill_value. This is needed to get the correct dense result.

In [153]:
arr = pd.arrays.SparseArray([1., -1, -1, -2., -1], fill_value=-1)
np.abs(arr)

[1, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([3], dtype=int32)

In [154]:
np.abs(arr).to_dense()

array([1., 1., 1., 2., 1.])

#### Conversion
- To convert data from sparse to dense, use the .sparse accessors

In [155]:
sdf.sparse.to_dense()

Unnamed: 0,0,1,2,3
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,
...,...,...,...,...
9995,,,,
9996,,,,
9997,,,,
9998,-938.899m,995.457m,845.394m,584.624m


- From dense to sparse, use DataFrame.astype() with a SparseDtype.

In [156]:
dense = pd.DataFrame({"A": [1, 0, 0, 1]})
dtype = pd.SparseDtype(int, fill_value=0)
dense.astype(dtype)

Unnamed: 0,A
0,1
1,0
2,0
3,1


#### Interaction with scipy.sparse
- Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.

In [157]:
from scipy.sparse import csr_matrix

In [158]:
arr = np.random.random(size=(1000, 5))
arr[arr < .9] = 0
sp_arr = csr_matrix(arr)
sp_arr

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 495 stored elements and shape (1000, 5)>

In [159]:
sdf = pd.DataFrame.sparse.from_spmatrix(sp_arr)
sdf.head()

Unnamed: 0,0,1,2,3,4
0,0,0,0,983.490m,0
1,0,0,0,0,0
2,0,0,0,0,0
3,0,0,0,0,990.420m
4,0,0,0,0,0


In [160]:
sdf.dtypes

0    Sparse[float64, 0]
1    Sparse[float64, 0]
2    Sparse[float64, 0]
3    Sparse[float64, 0]
4    Sparse[float64, 0]
dtype: object

- All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as needed. To convert back to sparse SciPy matrix in COO format, you can use the DataFrame.sparse.to_coo() method:

In [161]:
sdf.sparse.to_coo()

<COOrdinate sparse matrix of dtype 'float64'
	with 495 stored elements and shape (1000, 5)>

- Series.sparse.to_coo() is implemented for transforming a Series with sparse values indexed by a MultiIndex to a scipy.sparse.coo_matrix.

- The method requires a MultiIndex with two or more levels.

In [162]:
s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples(
    [
        (1, 2, "a", 0),
        (1, 2, "a", 1),
        (1, 1, "b", 0),
        (1, 1, "b", 1),
        (2, 1, "b", 0),
        (2, 1, "b", 1),
    ],
    names=["A", "B", "C", "D"],
)
ss = s.astype('Sparse')
ss

A  B  C  D
1  2  a  0    3.000
         1      NaN
   1  b  0    1.000
         1    3.000
2  1  b  0      NaN
         1      NaN
dtype: Sparse[float64, nan]

- In the example below, we transform the Series to a sparse representation of a 2-d array by specifying that the first and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns. We also specify that the column and row labels should be sorted in the final sparse representation.

In [163]:
A, rows, columns = ss.sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
)
A

<COOrdinate sparse matrix of dtype 'float64'
	with 3 stored elements and shape (3, 4)>

In [164]:
A.todense()

matrix([[0., 0., 1., 3.],
        [3., 0., 0., 0.],
        [0., 0., 0., 0.]])

In [165]:
rows

[(1, 1), (1, 2), (2, 1)]

In [166]:
columns

[('a', 0), ('a', 1), ('b', 0), ('b', 1)]

- Specifying different row and column labels (and not sorting them) yields a different sparse matrix:

In [167]:
A, rows, columns = ss.sparse.to_coo(
    row_levels=["A", "B", "C"], column_levels=["D"], sort_labels=False
)
A

<COOrdinate sparse matrix of dtype 'float64'
	with 3 stored elements and shape (3, 2)>

In [168]:
A.todense()

matrix([[3., 0.],
        [1., 3.],
        [0., 0.]])

In [169]:
rows

[(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]

In [170]:
columns

[(0,), (1,)]

- A convenience method Series.sparse.from_coo() is implemented for creating a Series with sparse values from a scipy.sparse.coo_matrix.

In [171]:
from scipy import sparse

A = sparse.coo_matrix(([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4))
A

<COOrdinate sparse matrix of dtype 'float64'
	with 3 stored elements and shape (3, 4)>

In [172]:
A.todense()

matrix([[0., 0., 1., 2.],
        [3., 0., 0., 0.],
        [0., 0., 0., 0.]])

- The default behaviour (with dense_index=False) simply returns a Series containing only the non-null entries.

In [173]:
ss = pd.Series.sparse.from_coo(A)
ss

0  2    1.000
   3    2.000
1  0    3.000
dtype: Sparse[float64, nan]

- Specifying dense_index=True will result in an index that is the Cartesian product of the row and columns coordinates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False) if the sparse matrix is large (and sparse) enough.

In [174]:
ss_dense = pd.Series.sparse.from_coo(A, dense_index=True)
ss_dense

1  0    3.000
   2      NaN
   3      NaN
0  0      NaN
   2    1.000
   3    2.000
   0      NaN
   2    1.000
   3    2.000
dtype: Sparse[float64, nan]

### Frequently Asked Questions (FAQ)
- From https://pandas.pydata.org/docs/user_guide/gotchas.html
#### DataFrame memory usage
- The memory usage of a DataFrame (including the index) is shown when calling info(). A configuration option, display.memory_usage (see the list of options), specifies if the DataFrame memory usage will be displayed when invoking the info() method.

- For example, the memory usage of the DataFrame below is shown when calling info():

In [175]:
dtypes = [
    "int64",
    "float64",
    "datetime64[ns]",
    "timedelta64[ns]",
    "complex128",
    "object",
    "bool",
]
n = 5000
data = {t: np.random.randint(100, size=n).astype(t) for t in dtypes}
df = pd.DataFrame(data)
df["categorical"] = df["object"].astype("category")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype          
---  ------           --------------  -----          
 0   int64            5000 non-null   int64          
 1   float64          5000 non-null   float64        
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128     
 5   object           5000 non-null   object         
 6   bool             5000 non-null   bool           
 7   categorical      5000 non-null   category       
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 288.2+ KB


- The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used by values in columns with dtype=object.

- Passing memory_usage='deep' will enable a more accurate memory usage report, accounting for the full usage of the contained objects. This is optional as it can be expensive to do this deeper introspection.

In [176]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype          
---  ------           --------------  -----          
 0   int64            5000 non-null   int64          
 1   float64          5000 non-null   float64        
 2   datetime64[ns]   5000 non-null   datetime64[ns] 
 3   timedelta64[ns]  5000 non-null   timedelta64[ns]
 4   complex128       5000 non-null   complex128     
 5   object           5000 non-null   object         
 6   bool             5000 non-null   bool           
 7   categorical      5000 non-null   category       
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
memory usage: 424.9 KB


- By default the display option is set to True but can be explicitly overridden by passing the memory_usage argument when invoking info().

- The memory usage of each column can be found by calling the memory_usage() method, which returns a Series indexed by column name with the memory usage of each column in bytes. For the DataFrame above, the per-column and total memory usage are:

In [177]:
df.memory_usage()

Index                132
int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

In [178]:
# total memory usage of dataframe
df.memory_usage().sum()

np.int64(295100)

- By default the memory usage of the DataFrame index is included in the returned Series. It can be suppressed by passing the index=False argument:

In [179]:
df.memory_usage(index=False)

int64              40000
float64            40000
datetime64[ns]     40000
timedelta64[ns]    40000
complex128         80000
object             40000
bool                5000
categorical         9968
dtype: int64

- The memory usage displayed by the info() method utilizes the memory_usage() method to determine the memory usage of a DataFrame while also formatting the output in human-readable units (base-2 representation; i.e. 1KB = 1024 bytes).
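
- To make the unit conversion concrete, the sketch below redoes the arithmetic for the total shown above and compares shallow and deep accounting on a small, made-up frame (df_small and its columns are illustrative assumptions):

```python
import pandas as pd

# 295100 bytes is the df.memory_usage().sum() total shown above.
print(f"{295100 / 1024:.1f} KB")  # 288.2 KB, the figure reported by df.info()

# Hypothetical frame with one object column holding Python strings.
df_small = pd.DataFrame(
    {
        "n": range(1000),
        "s": pd.Series(["pandas"] * 1000, dtype=object),
    }
)

shallow = df_small.memory_usage().sum()        # counts only the object pointers
deep = df_small.memory_usage(deep=True).sum()  # also counts the string objects
print(deep > shallow)
```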

- See also Categorical Memory Usage.

#### Using if/truth statements with pandas
- pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens in an if-statement or when using the boolean operations: and, or, and not. It is not clear what the result of the following code should be:

In [181]:
# if pd.Series([False, True, False]):
#     pass

- Should it be True because it’s not zero-length, or False because there are False values? It is unclear, so instead, pandas raises a ValueError.
- You need to explicitly choose what you want to do with the Series or DataFrame, e.g. use any(), all() or the empty property. Alternatively, you might want to check whether the pandas object is None:

In [182]:
if pd.Series([False, True, False]) is not None:
    print("I was not None")


I was not None


- Below is how to check if any of the values are True:

In [183]:
if pd.Series([False, True, False]).any():
    print("I am any")

I am any
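
- A minimal sketch of the explicit choices side by side, along with the error raised by the ambiguous form (variable names are illustrative):

```python
import pandas as pd

s = pd.Series([False, True, False])

try:
    bool(s)  # the same conversion an if-statement would perform
    raised = False
except ValueError:
    raised = True

print(raised)         # True: the truth value of a Series is ambiguous
print(bool(s.any()))  # True: at least one element is True
print(bool(s.all()))  # False: not every element is True
print(s.empty)        # False: the Series has three elements
```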


#### Bitwise boolean
- Comparison operators like == and != work element-wise: comparing a Series to a scalar returns a boolean Series with one result per element.

In [184]:
s = pd.Series(range(5))
s == 4

0    False
1    False
2    False
3    False
4     True
dtype: bool
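
- When several element-wise conditions need to be combined, the operators &, | and ~ play the role of and, or and not. The following is a small sketch (the names mask and either are illustrative); each comparison is parenthesized because the bitwise operators bind more tightly than comparisons:

```python
import pandas as pd

s = pd.Series(range(5))

mask = (s > 1) & (s != 4)    # element-wise "and"
either = (s < 1) | (s == 4)  # element-wise "or"

print(s[mask].tolist())    # [2, 3]
print(s[either].tolist())  # [0, 4]
print(s[~mask].tolist())   # [0, 1, 4]
```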

#### Using the in operator
- Using the Python in operator on a Series tests for membership in the index, not membership among the values.

In [185]:
s = pd.Series(range(5), index=list("abcde"))
2 in s

False

In [186]:
'b' in s

True

- If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series are dict-like. To test for membership in the values, use the method isin():

In [187]:
s.isin([2])

a    False
b    False
c     True
d    False
e    False
dtype: bool

In [188]:
s.isin([2]).any()

np.True_

- For DataFrame, likewise, in applies to the column axis, testing for membership in the list of column names.
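
- A short sketch of the DataFrame case, using a hypothetical two-column frame named df_cols:

```python
import pandas as pd

df_cols = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

print("a" in df_cols)  # True: "a" is a column name
print(1 in df_cols)    # False: 1 is a value, not a column name

# To test the values instead, reduce the element-wise isin() result.
print(bool(df_cols.isin([1]).any().any()))  # True: 1 appears among the values
```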

#### Mutating with User Defined Function (UDF) methods
- This section applies to pandas methods that take a UDF. In particular, the methods DataFrame.apply(), DataFrame.aggregate(), DataFrame.transform(), and DataFrame.filter().

- It is a general rule in programming that one should not mutate a container while it is being iterated over. Mutation will invalidate the iterator, causing unexpected behavior. Consider the example:

In [189]:
values = [0, 1, 2, 3, 4, 5]
n_removed = 0
for k, value in enumerate(values):
    idx = k - n_removed
    if value % 2 == 1:
        del values[idx]
        n_removed += 1
    else:
        values[idx] = value + 1
values

[1, 4, 5]

- One probably would have expected that the result would be [1, 3, 5]. When using a pandas method that takes a UDF, internally pandas is often iterating over the DataFrame or other pandas object. Therefore, if the UDF mutates (changes) the DataFrame, unexpected behavior can arise.

- Here is a similar example with DataFrame.apply():

In [192]:
def f(s):
    s.pop("a")
    return s


df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

try:
    df.apply(f, axis="columns")
except KeyError:  # catch only the error that the printed label names
    print("KeyError")

KeyError


- To resolve this issue, one can make a copy so that the mutation does not apply to the container being iterated over.

In [193]:
values = [0, 1, 2, 3, 4, 5]
n_removed = 0
for k, value in enumerate(values.copy()):
    idx = k - n_removed
    if value % 2 == 1:
        del values[idx]
        n_removed += 1
    else:
        values[idx] = value + 1
values

[1, 3, 5]

In [194]:
def f(s):
    s = s.copy()
    s.pop("a")
    return s

df = pd.DataFrame({"a": [1, 2, 3], 'b': [4, 5, 6]})
df.apply(f, axis="columns")


   b
0  4
1  5
2  6
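
- As an alternative sketch, the UDF can avoid mutation altogether by using a method that returns a new object. Here Series.drop() stands in for pop(); the names df_udf and result are illustrative:

```python
import pandas as pd

df_udf = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Series.drop returns a new Series, so the row handed to the UDF is untouched.
result = df_udf.apply(lambda row: row.drop("a"), axis="columns")

print(result.columns.tolist())  # ['b']
print(df_udf.columns.tolist())  # ['a', 'b']
```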


#### Missing value representation for NumPy types
- np.nan is used as the NA representation for NumPy types. Because NumPy and Python in general lack NA (missing) support from the ground up, NA could have been represented with:
    - A masked array solution: an array of data and an array of boolean values indicating whether a value is there or is missing.
    - Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes.

- The special value np.nan (Not-A-Number) was chosen as the NA value for NumPy types, and there are API functions like DataFrame.isna() and DataFrame.notna() which can be used across the dtypes to detect NA values. However, this choice has a downside of coercing missing integer data as float types as shown in Support for integer NA.
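
- A brief sketch of isna() and notna() working across dtypes. The frame df_na and its column names are made up for illustration:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame(
    {
        "floats": [1.0, np.nan],
        "objects": pd.Series(["x", None], dtype=object),
        "times": [pd.Timestamp("2024-01-01"), pd.NaT],
    }
)

# The second row is missing in every column, whatever the dtype.
print(df_na.isna())
print(df_na.notna())
```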

#### NA type promotions for NumPy types
- When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this table:

| Typeclass | Promotion dtype for storing NAs |
| ------ | ------- |
| floating | no change |
| object | no change |
| integer | cast to float64 |
| boolean | cast to object |
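
- The promotions in the table can be checked with a small sketch that reindexes NumPy-backed Series onto a label that does not exist (the names idx, new_idx, ints, bools and floats are illustrative):

```python
import pandas as pd

idx = ["a", "b"]
new_idx = ["a", "b", "c"]  # "c" is missing, so reindex() introduces an NA

ints = pd.Series([1, 2], index=idx)
bools = pd.Series([True, False], index=idx)
floats = pd.Series([1.5, 2.5], index=idx)

print(ints.reindex(new_idx).dtype)    # float64
print(bools.reindex(new_idx).dtype)   # object
print(floats.reindex(new_idx).dtype)  # float64
```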

#### Support for integer NA
- In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is the ability to represent NAs in integer arrays. For example:

In [195]:
s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))
s

a    1
b    2
c    3
d    4
e    5
dtype: int64

In [196]:
s.dtype

dtype('int64')

In [197]:
s2 = s.reindex(["a", "b", "c", "f", "u"])
s2

a    1.000
b    2.000
c    3.000
f      NaN
u      NaN
dtype: float64

In [198]:
s2.dtype

dtype('float64')

- This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”.

- If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes provided by pandas or pyarrow:
    - Int8Dtype
    - Int16Dtype
    - Int32Dtype
    - Int64Dtype
    - ArrowDtype

In [199]:
s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())
s_int

a    1
b    2
c    3
d    4
e    5
dtype: Int64

In [200]:
s_int.dtype

Int64Dtype()

In [201]:
s2_int = s_int.reindex(["a", "b", "c", "f", "u"])
s2_int

a       1
b       2
c       3
f    <NA>
u    <NA>
dtype: Int64

In [202]:
s2_int.dtype

Int64Dtype()

In [203]:
s_int_pa = pd.Series([1, 2, None], dtype="int64[pyarrow]")
s_int_pa

0       1
1       2
2    <NA>
dtype: int64[pyarrow]

#### Why not make NumPy like R?
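- Many people have suggested that NumPy should simply emulate the NA support present in the more domain-specific statistical programming language R. Part of the reason it does not is the NumPy type hierarchy, which spans many dtypes: signed and unsigned integers of several widths, floating-point and complex numbers, booleans, and more.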

- The R language, by contrast, only has a handful of built-in data types: integer, numeric (floating-point), character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.

- However, R NA semantics are now available by using masked NumPy types such as Int64Dtype or PyArrow types (ArrowDtype).
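
- As a small illustration of those semantics with the masked Int64 dtype (the variable names are illustrative), a missing value stays missing through arithmetic and comparisons instead of forcing a cast to float:

```python
import pandas as pd

s_na = pd.Series([1, None, 3], dtype="Int64")

shifted = s_na + 1   # arithmetic keeps the Int64 dtype; <NA> stays <NA>
positive = s_na > 0  # comparison yields the nullable "boolean" dtype

print(shifted.dtype)              # Int64
print(positive.dtype)             # boolean
print(positive.isna().tolist())   # [False, True, False]
print(s_na.sum())                 # 4, because NA is skipped by default
```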

#### Differences with NumPy
- For Series and DataFrame objects, var() normalizes by N-1 to produce unbiased estimates of the population variance, while NumPy’s numpy.var() normalizes by N, which measures the variance of the sample. Note that cov() normalizes by N-1 in both pandas and NumPy.
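
- The difference comes down to the ddof (delta degrees of freedom) argument, which both libraries accept. A quick sketch on four made-up values:

```python
import numpy as np
import pandas as pd

data = [1.0, 2.0, 3.0, 4.0]
s_var = pd.Series(data)

print(s_var.var())           # divides by N-1 (ddof=1 is the pandas default)
print(np.var(data))          # 1.25, divides by N (ddof=0 is the NumPy default)
print(np.var(data, ddof=1))  # matches s_var.var()
print(s_var.var(ddof=0))     # 1.25, matches np.var(data)
```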

#### Thread-safety
- pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where the data copying occurs.

- See this link for more information.
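
- One possible way to follow the locking recommendation is sketched below. It assumes a single module-level lock guarding every copy of the shared frame; the names shared, copy_lock, worker and results are hypothetical:

```python
import threading

import pandas as pd

shared = pd.DataFrame({"a": range(1000)})
copy_lock = threading.Lock()  # assumed: one lock guarding all copies of `shared`
results = []


def worker():
    # Hold the lock only while copying; work on the private copy afterwards.
    with copy_lock:
        local = shared.copy()
    results.append(len(local))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```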

#### Byte-ordering issues
- Occasionally you may have to deal with data that were created on a machine with a different byte order than the one on which you are running Python. A common symptom of this issue is an error like:
`Traceback
    ...
ValueError: Big-endian buffer not supported on little-endian compiler`
- To deal with this issue you should convert the underlying NumPy array to the native system byte order before passing it to Series or DataFrame constructors using something similar to the following:

In [204]:
x = np.array(list(range(10)), ">i4")  # big endian

newx = x.byteswap().view(x.dtype.newbyteorder())  # force native byteorder

s = pd.Series(newx)
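
- As an alternative sketch, the array can be cast to its native-order dtype with astype(), which does not depend on knowing the byte order of the current machine. The names native and s_native are illustrative:

```python
import numpy as np
import pandas as pd

x = np.array(list(range(10)), ">i4")  # big endian

# Cast to the same dtype in native ("=") byte order; values are preserved.
native = x.astype(x.dtype.newbyteorder("="))

s_native = pd.Series(native)
print(native.dtype.isnative)
print(s_native.tolist())
```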