Intel® Scalable Dataframe Compiler (Intel® SDC) extends the capabilities of `Numba*`_ to compile a subset of `Pandas*`_ into native code. Being an integral part of `Numba*`_, it allows combining regular `NumPy*`_ code with `Pandas*`_ operations.
As in `Numba*`_, the compilation is controlled by the regular @njit decorator and the respective compilation directives which control its behavior.
The code below illustrates a typical workflow that Intel® SDC is intended to compile:
.. literalinclude:: ../../examples/basic_workflow.py
   :language: python
   :lines: 27-
   :caption: Example 1: Compiling Basic Pandas* Workflow
   :name: ex_getting_started_basic_workflow
The workflow typically starts with reading data from a file (or multiple files) into a dataframe (or multiple dataframes), followed by transformations of dataframes and/or individual columns, cleaning the data, grouping and binning, and finally feeding the cleaned data into a machine learning algorithm for training or inference.
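Such a workflow can be sketched with plain `Pandas*`_ as follows (the column names and thresholds here are hypothetical; with Intel® SDC the function would additionally be decorated with @njit):

```python
import numpy as np
import pandas as pd

# With Intel SDC this function would be decorated with @njit from numba.
def preprocess(df):
    # Clean the data: drop missing values and filter outliers
    df = df.dropna()
    df = df[df['value'] < df['value'].quantile(0.99)].copy()
    # Transform an individual column
    df['log_value'] = np.log1p(df['value'])
    return df

# In a real workflow the dataframe would be read from a file,
# e.g. df = pd.read_csv('data.csv'); here we construct it inline.
df = pd.DataFrame({'value': [1.0, 2.0, None, 3.0, 1000.0]})
cleaned = preprocess(df)
print(len(cleaned))
```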
We also recommend reading A ~5 minute guide to Numba for getting started with `Numba*`_.
You can use conda and pip package managers to install Intel® SDC into your `Python*`_ environment.
Intel SDC is available on the Anaconda Cloud intel/label/beta channel. The distribution includes Intel SDC for Python 3.6 and 3.7 on Windows and Linux platforms.
Intel SDC conda package can be installed using the steps below:
.. code-block:: console

    > conda create -n sdc_env python=<3.7 or 3.6> pyarrow=4.0.1 pandas=1.3.4 -c anaconda -c conda-forge
    > conda activate sdc_env
    > conda install sdc -c intel/label/beta -c intel -c defaults -c conda-forge --override-channels
Intel SDC wheel package can be installed using the steps below:
.. code-block:: console

    > conda create -n sdc_env python=<3.7 or 3.6> pip pyarrow=4.0.1 pandas=1.3.4 -c anaconda -c conda-forge
    > conda activate sdc_env
    > pip install --index-url https://pypi.anaconda.org/intel/label/beta/simple --extra-index-url https://pypi.anaconda.org/intel/simple --extra-index-url https://pypi.org/simple sdc
Experienced users can also build Intel SDC from sources for Linux* and for Windows*.
The code below illustrates a typical ML workflow consisting of data pre-processing and prediction stages. Intel® SDC is intended to compile the pre-processing stage, which includes reading the dataset from a csv file, filtering data, and computing the Pearson correlation. The prediction based on gradient boosting regression is made using the scikit-learn module.
.. literalinclude:: ../../examples/basic_usage_nyse_predict.py
   :language: python
   :lines: 27-
   :caption: Typical usage of Intel® SDC in combination with scikit-learn
   :name: ex_getting_started_basic_usage_nyse_predict
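The pre-processing stage of such a workflow can be sketched as below (a minimal sketch with hypothetical column names; with Intel® SDC the function would be decorated with @njit, and the returned data would then be fed into a scikit-learn regressor):

```python
import numpy as np
import pandas as pd

# With Intel SDC this pre-processing function would be @njit-compiled.
def preprocess(df):
    # Filter the data, then compute the Pearson correlation
    # between two columns of interest.
    df = df[df['volume'] > 0]
    return df['open'].corr(df['close'])

# In a real workflow the dataframe would come from pd.read_csv(...).
df = pd.DataFrame({
    'open':   [1.0, 2.0, 3.0, 4.0],
    'close':  [1.1, 2.1, 2.9, 4.2],
    'volume': [100, 0, 150, 200],
})
r = preprocess(df)
print("Pearson correlation:", r)
```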
Not all Python code can be compiled with Intel® SDC. Not all `Pandas*`_ and `NumPy*`_ APIs are currently supported, and not all valid Python code can be compiled by the underlying Numba compiler.
To be successfully compiled, the code must use only the supported subset of the `Pandas*`_ API and only the subset of `Python*`_ supported by `Numba*`_ (e.g. it must be type-stable).
Example of currently unsupported code:
.. code-block:: python

    if flag:
        a = 1.0
    else:
        a = np.ones(10)
    return a  # Type of a cannot be inferred
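A type-stable rewrite assigns a value of the same type on both branches, so the compiler can infer a single type for the variable (a minimal sketch):

```python
import numpy as np

def make_array(flag):
    # Both branches produce the same type (a float64 array),
    # so the type of `a` can be inferred.
    if flag:
        a = np.ones(1)
    else:
        a = np.ones(10)
    return a

print(make_array(True).shape, make_array(False).shape)
```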
:ref:`SDC API reference<apireference>`
:ref:`More info on SDC compilation process and supported Python features<compilation>`
Let's consider measuring the performance of the Series.max() method.
.. code-block:: python

    from numba import njit

    @njit
    def series_max(s):
        return s.max()
- First, recall that Intel® SDC is based on Numba. Therefore, execution time may consist of the following:
- Numba has to compile your function the first time it is called, which takes time.
- Boxing and unboxing convert Python objects into native values, and vice-versa. They occur at the boundaries of calling a `Numba*`_ function from the Python interpreter. E.g. boxing and unboxing apply to `Pandas*`_ types like :ref:`Series <pandas.Series>` and :ref:`DataFrame <pandas.DataFrame>`.
- The execution of the function itself.
A really common mistake when measuring performance is to not account for the above behaviour and to time code once with a simple timer, thereby including the time taken to compile your function in the measured execution time.
A good way to measure the impact Numba JIT has on your code is to time execution using the timeit module functions.
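For instance, timeit runs the function many times, so a one-off compilation cost can be excluded by warming the function up first (a plain-Python sketch; with Numba, the function would be decorated with @njit and the first call would trigger compilation):

```python
import timeit

import numpy as np
import pandas as pd

# With Numba this function would be decorated with @njit.
def series_max(s):
    return s.max()

s = pd.Series(np.random.ranf(size=100000))

# Warm-up call: with a @njit-decorated function this is where
# compilation would happen, so it is excluded from the timing.
series_max(s)

# timeit reports the total over `number` repetitions; divide to
# get the average per-call time.
elapsed = timeit.timeit(lambda: series_max(s), number=100)
print("Average execution time:", elapsed / 100)
```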
Intel® SDC also recommends eliminating the impact of compilation and boxing/unboxing by measuring the time inside the Numba JIT code.
Example of measuring performance:
.. code-block:: python

    import time

    import numpy as np
    import pandas as pd
    from numba import njit

    @njit
    def perf_series_max(s):                   # <-- unboxing
        start_time = time.time()              # <-- time inside Numba JIT code
        res = s.max()
        finish_time = time.time()             # <-- time inside Numba JIT code
        return finish_time - start_time, res  # <-- boxing

    s = pd.Series(np.random.ranf(size=100000))
    exec_time, res = perf_series_max(s)
    print("Execution time in JIT code: ", exec_time)
See also `Numba*`_ documentation How to measure the performance of Numba?
See also Intel® SDC repository performance tests.
If you get poor performance, you need to consider several possible reasons, among which are compilation overheads, overheads related to converting Python objects to native structures and back, the amount of parallelism in the compiled code, the extent to which the code is "static", and many other factors. See more details in the Intel® SDC documentation :ref:`Getting Performance With Intel® SDC <performance>`.
You also need to consider the limitations of the particular function. See more details in the Intel® SDC documentation for that function: :ref:`apireference`.
See also `Numba*`_ documentation Performance Tips and The compiled code is too slow.
Build instructions for Linux*: https://github.com/IntelPython/sdc#building-intel-sdc-from-source-on-linux

Build instructions for Windows*: https://github.com/IntelPython/sdc#building-intel-sdc-from-source-on-windows