Intel® Scalable Dataframe Compiler (Intel® SDC) extends the capabilities of `Numba*`_ to compile a subset of `Pandas*`_ into native code. Being an integral part of `Numba*`_, it allows combining regular `NumPy*`_ code with `Pandas*`_ operations.
As in `Numba*`_, the compilation is controlled by the regular @njit decorator and the respective compilation directives which control its behavior.
The code below illustrates a typical workflow that Intel® SDC is intended to compile:
.. literalinclude:: ../../examples/basic_workflow.py
   :language: python
   :lines: 27-
   :caption: Example 1: Compiling Basic Pandas* Workflow
   :name: ex_getting_started_basic_workflow
The workflow typically starts with reading data from a file (or multiple files) into a dataframe (or multiple dataframes), followed by transformations of dataframes and/or individual columns, cleaning the data, grouping and binning, and finally feeding the cleaned data into a machine learning algorithm for training or inference.
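Such a workflow can be sketched with plain `Pandas*`_ as follows (the column names and thresholds here are hypothetical; with Intel® SDC the function would additionally be decorated with @njit):

```python
import numpy as np
import pandas as pd

# With Intel SDC this function would be decorated with @njit from numba.
def preprocess(df):
    # Clean the data: drop missing values and filter outliers
    df = df.dropna()
    df = df[df['value'] < df['value'].quantile(0.99)].copy()
    # Transform an individual column
    df['log_value'] = np.log1p(df['value'])
    return df

# In a real workflow the dataframe would be read from a file,
# e.g. df = pd.read_csv('data.csv'); here we construct it inline.
df = pd.DataFrame({'value': [1.0, 2.0, None, 3.0, 1000.0]})
cleaned = preprocess(df)
print(len(cleaned))
```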
We also recommend reading A ~5 minute guide to Numba for getting started with `Numba*`_.
You can use conda and pip package managers to install Intel® SDC into your `Python*`_ environment.
Intel SDC is available on the Anaconda Cloud intel/label/beta channel. The distribution includes Intel SDC for Python 3.6 and 3.7 on Windows and Linux platforms.
Intel SDC conda package can be installed using the steps below:
.. code-block:: console

    > conda create -n sdc_env python=<3.7 or 3.6> pyarrow=4.0.1 pandas=1.3.4 -c anaconda -c conda-forge
    > conda activate sdc_env
    > conda install sdc -c intel/label/beta -c intel -c defaults -c conda-forge --override-channels
Intel SDC wheel package can be installed using the steps below:
.. code-block:: console

    > conda create -n sdc_env python=<3.7 or 3.6> pip pyarrow=4.0.1 pandas=1.3.4 -c anaconda -c conda-forge
    > conda activate sdc_env
    > pip install --index-url https://pypi.anaconda.org/intel/label/beta/simple --extra-index-url https://pypi.anaconda.org/intel/simple --extra-index-url https://pypi.org/simple sdc
Experienced users can also build Intel SDC from sources for Linux* and for Windows*.
The code below illustrates a typical ML workflow consisting of data pre-processing and prediction stages. Intel® SDC is intended to compile the pre-processing stage, which includes reading the dataset from a csv file, filtering data, and computing the Pearson correlation. The prediction based on gradient boosting regression is made using the scikit-learn module.
.. literalinclude:: ../../examples/basic_usage_nyse_predict.py
   :language: python
   :lines: 27-
   :caption: Typical usage of Intel® SDC in combination with scikit-learn
   :name: ex_getting_started_basic_usage_nyse_predict
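The pre-processing stage of such a workflow can be sketched as below (a minimal sketch with hypothetical column names; with Intel® SDC the function would be decorated with @njit, and the returned data would then be fed into a scikit-learn regressor):

```python
import numpy as np
import pandas as pd

# With Intel SDC this pre-processing function would be @njit-compiled.
def preprocess(df):
    # Filter the data, then compute the Pearson correlation
    # between two columns of interest.
    df = df[df['volume'] > 0]
    return df['open'].corr(df['close'])

# In a real workflow the dataframe would come from pd.read_csv(...).
df = pd.DataFrame({
    'open':   [1.0, 2.0, 3.0, 4.0],
    'close':  [1.1, 2.1, 2.9, 4.2],
    'volume': [100, 0, 150, 200],
})
r = preprocess(df)
print("Pearson correlation:", r)
```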
Not all Python code can be compiled with Intel® SDC. Not all `Pandas*`_ and `NumPy*`_ APIs are currently supported, and not all valid Python code can be compiled by the underlying Numba compiler.
To be successfully compiled, the code must use only the supported subset of the `Pandas*`_ API and only the subset of `Python*`_ supported by `Numba*`_ (e.g. it must be type-stable).
Example of currently unsupported code:
.. code-block:: python

    if flag:
        a = 1.0
    else:
        a = np.ones(10)
    return a  # Type of a cannot be inferred
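A type-stable rewrite assigns a value of the same type on both branches, so the compiler can infer a single type for the variable (a minimal sketch):

```python
import numpy as np

def make_array(flag):
    # Both branches produce the same type (a float64 array),
    # so the type of `a` can be inferred.
    if flag:
        a = np.ones(1)
    else:
        a = np.ones(10)
    return a

print(make_array(True).shape, make_array(False).shape)
```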
:ref:`SDC API reference<apireference>`
:ref:`More info on SDC compilation process and supported Python features<compilation>`
Let's consider measuring the performance of the Series.max() method.
.. code-block:: python

    from numba import njit

    @njit
    def series_max(s):
        return s.max()
- First, recall that Intel® SDC is based on Numba. Therefore, execution time may consist of the following:
- Numba has to compile your function the first time it is called, which takes time.
- Boxing and unboxing convert Python objects into native values, and vice-versa. They occur at the boundaries of calling a `Numba*`_ function from the Python interpreter. E.g. boxing and unboxing apply to `Pandas*`_ types like :ref:`Series <pandas.Series>` and :ref:`DataFrame <pandas.DataFrame>`.
- The execution of the function itself.
A really common mistake when measuring performance is to not account for the above behaviour and to time code once with a simple timer, thereby including the time taken to compile your function in the measured execution time.
A good way to measure the impact Numba JIT has on your code is to time execution using the timeit module functions.
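For instance, timeit runs the function many times, so a one-off compilation cost can be excluded by warming the function up first (a plain-Python sketch; with Numba, the function would be decorated with @njit and the first call would trigger compilation):

```python
import timeit

import numpy as np
import pandas as pd

# With Numba this function would be decorated with @njit.
def series_max(s):
    return s.max()

s = pd.Series(np.random.ranf(size=100000))

# Warm-up call: with a @njit-decorated function this is where
# compilation would happen, so it is excluded from the timing.
series_max(s)

# timeit reports the total over `number` repetitions; divide to
# get the average per-call time.
elapsed = timeit.timeit(lambda: series_max(s), number=100)
print("Average execution time:", elapsed / 100)
```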
Intel® SDC also recommends eliminating the impact of compilation and boxing/unboxing by measuring the time inside the Numba JIT code.
Example of measuring performance:
.. code-block:: python

    import time

    import numpy as np
    import pandas as pd
    from numba import njit

    @njit
    def perf_series_max(s):                   # <-- unboxing
        start_time = time.time()              # <-- time inside Numba JIT code
        res = s.max()
        finish_time = time.time()             # <-- time inside Numba JIT code
        return finish_time - start_time, res  # <-- boxing

    s = pd.Series(np.random.ranf(size=100000))
    exec_time, res = perf_series_max(s)
    print("Execution time in JIT code: ", exec_time)
See also `Numba*`_ documentation How to measure the performance of Numba?
See also Intel® SDC repository performance tests.
If you get poor performance, you need to consider several possible reasons, among which are compilation overheads, overheads related to converting Python objects to native structures and back, the amount of parallelism in the compiled code, the extent to which the code is "static", and many other factors. See more details in the Intel® SDC documentation :ref:`Getting Performance With Intel® SDC <performance>`.
You also need to consider the limitations of the particular function. See more details in the Intel® SDC documentation for that function: :ref:`apireference`.
See also `Numba*`_ documentation Performance Tips and The compiled code is too slow.
Build instructions for Linux*: https://github.com/IntelPython/sdc#building-intel-sdc-from-source-on-linux

Build instructions for Windows*: https://github.com/IntelPython/sdc#building-intel-sdc-from-source-on-windows