cuDF - GPU DataFrame Library

 cuDF - GPU DataFrames


The RAPIDS cuDF library is a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The RAPIDS GPU DataFrame provides a pandas-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.

NOTE: For the latest stable README.md ensure you are on the master branch.

Quick Start

Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. This provides a ready-to-run Docker container with example notebooks and data, showcasing how you can use cuDF.

Install cuDF

Conda

It is easy to install cuDF using conda. You can get a minimal conda installation with Miniconda or get the full installation with Anaconda.

Install and update cuDF using the conda command:

# CUDA 9.2
conda install -c nvidia -c rapidsai -c numba -c conda-forge -c defaults cudf=0.5

# CUDA 10.0
conda install -c nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c numba -c conda-forge -c defaults cudf=0.5

Note: This conda installation only applies to Linux and Python versions 3.6/3.7.

Pip

It is easy to install cuDF using pip. You must specify the CUDA version to ensure you install the right package.

# CUDA 9.2
pip install cudf-cuda92

# CUDA 10.0
pip install cudf-cuda100
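
The mapping from CUDA version to pip package name can be sketched as a small shell helper. This is only an illustration: the function name cudf_pip_package is hypothetical, and it covers just the two packages listed above.

```shell
# Hypothetical helper: print the cuDF pip package name for a CUDA version.
# Only the two versions listed in this README are handled.
cudf_pip_package() {
  case "$1" in
    9.2)  echo "cudf-cuda92" ;;
    10.0) echo "cudf-cuda100" ;;
    *)    echo "unsupported CUDA version: $1" >&2; return 1 ;;
  esac
}

# Show which package would be installed for CUDA 10.0.
cudf_pip_package 10.0
```

With the helper defined, the install step could be written as pip install "$(cudf_pip_package 9.2)".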

Development Setup

The following instructions are for developers and contributors to cuDF OSS development. These instructions are tested on Ubuntu 16.04 and 18.04. Use them to build cuDF from source and contribute to its development. Other operating systems may be compatible, but are not currently tested.

Get libcudf Dependencies

Compiler requirements:

  • gcc version 5.4+
  • nvcc version 9.2+
  • cmake version 3.12.4+

CUDA/GPU requirements:

  • CUDA 9.2+
  • NVIDIA driver 396.44+
  • Pascal architecture or better

You can obtain CUDA from https://developer.nvidia.com/cuda-downloads
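
One way to compare an installed tool version against the minimums above is sketched below. It assumes a sort that supports -V (version sort, as in GNU coreutils); the helper names version_ge and check_tool are hypothetical and not part of cuDF, and the version numbers passed in are made-up examples.

```shell
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME FOUND MINIMUM: report whether FOUND satisfies MINIMUM.
check_tool() {
  if version_ge "$2" "$3"; then
    echo "$1 $2: ok (>= $3)"
  else
    echo "$1 $2: too old (need >= $3)"
  fi
}

# Example calls with made-up installed versions and the minimums listed above.
check_tool gcc 7.3.0 5.4
check_tool cmake 3.10.2 3.12.4

# Check the real gcc if it is installed.
if command -v gcc >/dev/null 2>&1; then
  check_tool gcc "$(gcc -dumpversion)" 5.4
fi
```

The same function can be pointed at the nvcc, cmake, and driver versions once they have been extracted from each tool's own version output.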

Since cmake will download and build Apache Arrow, you may need to install Boost C++ (version 1.58+) before running cmake:

# Install Boost C++ for Ubuntu 16.04/18.04
$ sudo apt-get install libboost-all-dev

or

# Install Boost C++ for Conda
$ conda install -c conda-forge boost

Script to build cuDF from source

Build from Source

To install cuDF from source, ensure the dependencies are met and follow the steps below:

  • Clone the repository and submodules
CUDF_HOME=$(pwd)/cudf
git clone https://github.com/rapidsai/cudf.git $CUDF_HOME
cd $CUDF_HOME
git submodule update --init --remote --recursive
  • Create the conda development environment cudf_dev
# create the conda environment (assuming in base `cudf` directory)
conda env create --name cudf_dev --file conda/environments/cudf_dev.yml
# activate the environment
source activate cudf_dev
  • Build and install libcudf. CMake depends on the nvcc executable being on your path or defined in $CUDACXX.
$ cd $CUDF_HOME/cpp                                                       # navigate to C/C++ CUDA source root directory
$ mkdir build                                                             # make a build directory
$ cd build                                                                # enter the build directory

# CMake options:
# -DCMAKE_INSTALL_PREFIX set to the install path for your libraries or $CONDA_PREFIX if you're using Anaconda, i.e. -DCMAKE_INSTALL_PREFIX=/install/path or -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
# -DCMAKE_CXX11_ABI set to ON or OFF depending on the ABI version you want, defaults to OFF. When turned ON, ABI compatibility for C++11 is used. When OFF, pre-C++11 ABI compatibility is used.
$ cmake .. -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DCMAKE_CXX11_ABI=OFF     # configure cmake ...

$ make -j                                                                 # compile the libraries librmm.so, libcudf.so ... '-j' with no number lets make run an unlimited number of parallel jobs; use e.g. '-j$(nproc)' to cap it at the number of available processing units
$ make install                                                            # install the libraries librmm.so, libcudf.so to the CMAKE_INSTALL_PREFIX
  • To run tests (Optional):
$ make test
  • Build, install, and test cffi bindings:
$ make python_cffi                                  # build CFFI bindings for librmm.so, libcudf.so
$ make install_python                               # build & install CFFI python bindings. Depends on cffi package from PyPi or Conda
$ cd python && py.test -v                           # optional, run python tests on low-level python bindings
  • Build the cudf python package, in the python folder:
$ cd $CUDF_HOME/python
$ python setup.py build_ext --inplace
  • You will also need the following environment variables, including $CUDA_HOME.
export NUMBAPRO_NVVM=$CUDA_HOME/nvvm/lib64/libnvvm.so
export NUMBAPRO_LIBDEVICE=$CUDA_HOME/nvvm/libdevice
  • To run Python tests (Optional):
$ py.test -v                                        # run python tests on cudf python bindings
  • Finally, install the Python package to your Python path:
$ python setup.py install                           # install cudf python bindings

Done! You are ready to develop for the cuDF OSS project.
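
The environment variables from the build steps above can be set in one place, for example in a shell profile. The sketch below assumes CUDA is installed under /usr/local/cuda when $CUDA_HOME is not already set; that default path is an assumption, so adjust it to your system.

```shell
# Assumption: fall back to /usr/local/cuda when CUDA_HOME is not already set.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
export CUDA_HOME

# Variables named in the build steps, derived from CUDA_HOME.
export NUMBAPRO_NVVM="$CUDA_HOME/nvvm/lib64/libnvvm.so"
export NUMBAPRO_LIBDEVICE="$CUDA_HOME/nvvm/libdevice"

# Warn (without failing) when a path does not exist on this machine.
for p in "$NUMBAPRO_NVVM" "$NUMBAPRO_LIBDEVICE"; do
  [ -e "$p" ] || echo "warning: $p not found; check CUDA_HOME" >&2
done
```

Exporting the variables matters because the Python process that runs the tests only sees variables that are part of the shell's exported environment.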

Debugging cuDF

Building Debug mode from source

Follow the above instructions to build from source and add -DCMAKE_BUILD_TYPE=Debug to the cmake step.

For example:

$ cmake .. -DCMAKE_INSTALL_PREFIX=/install/path -DCMAKE_BUILD_TYPE=Debug     # configure cmake ... use -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX if you're using Anaconda

This builds libcudf in Debug mode which enables some assert safety checks and includes symbols in the library for debugging.

All other steps for installing libcudf into your environment are the same.
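
If you switch between build types often, the configure command can be assembled by a small helper. This is a sketch only: the function name configure_cmd is hypothetical, and the Release default is an assumption based on the standard CMake build types rather than something this README specifies.

```shell
# configure_cmd [BUILD_TYPE] [PREFIX]: print the cmake configure command.
# Assumptions: BUILD_TYPE defaults to Release; PREFIX defaults to $CONDA_PREFIX,
# or to the placeholder /install/path when no conda environment is active.
configure_cmd() {
  build_type="${1:-Release}"
  prefix="${2:-${CONDA_PREFIX:-/install/path}}"
  echo "cmake .. -DCMAKE_INSTALL_PREFIX=${prefix} -DCMAKE_BUILD_TYPE=${build_type}"
}

# Print the Debug configure command for an explicit install prefix.
configure_cmd Debug /install/path
```

The helper only prints the command, so you can review it before running it from the build directory.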

Debugging with cuda-gdb and cuda-memcheck

When you have a debug build of libcudf installed, you can debug it with cuda-gdb and cuda-memcheck.

If you are debugging a Python script, simply run the following:

cuda-gdb

cuda-gdb -ex r --args python <program_name>.py <program_arguments>

cuda-memcheck

cuda-memcheck python <program_name>.py <program_arguments>

Automated Build in Docker Container

A Dockerfile is provided with a preconfigured conda environment for building and installing cuDF from source, based on the master branch.

Prerequisites

  • Install nvidia-docker2 for Docker + GPU support
  • Verify NVIDIA driver is 396.44 or higher
  • Ensure CUDA 9.2+ is installed

Usage

From the cudf project root, run the following to build with defaults:

$ docker build --tag cudf .

After the image is built, run the container:

$ docker run --runtime=nvidia -it cudf bash

Activate the conda environment cudf to use the newly built cuDF and libcudf libraries:

root@3f689ba9c842:/# source activate cudf
(cudf) root@3f689ba9c842:/# python -c "import cudf"
(cudf) root@3f689ba9c842:/#

Customizing the Build

Several build arguments are available to customize the build process of the container. These are specified by using the Docker build-arg flag. Below is a list of the available arguments and their purpose:

Build Argument   Default Value  Other Value(s)  Purpose
---------------  -------------  --------------  -------
CUDA_VERSION     9.2            10.0            set CUDA version
LINUX_VERSION    ubuntu16.04    ubuntu18.04     set Ubuntu version
CC & CXX         5              7               set gcc/g++ version; NOTE: gcc7 requires Ubuntu 18.04
CUDF_REPO        This repo      Forks of cuDF   set git URL to use for git clone
CUDF_BRANCH      master         Any branch name set git branch to checkout of CUDF_REPO
NUMBA_VERSION    newest         >=0.40.0        set numba version
NUMPY_VERSION    newest         >=1.14.3        set numpy version
PANDAS_VERSION   newest         >=0.23.4        set pandas version
PYARROW_VERSION  0.12.0         Not supported   set pyarrow version
CMAKE_VERSION    newest         >=3.12          set cmake version
CYTHON_VERSION   0.29           Not supported   set Cython version
PYTHON_VERSION   3.6            3.7             set python version
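
As an illustration of the build arguments above, the sketch below assembles a docker build command for CUDA 10.0 on Ubuntu 18.04 with gcc 7. It assumes that CC and CXX are passed as two separate build arguments, which the table implies but does not spell out. The command is only printed, not executed.

```shell
# Values chosen from the "Other Value(s)" column of the table above.
cuda_version="10.0"
linux_version="ubuntu18.04"
gcc_version="7"   # gcc 7 requires Ubuntu 18.04, per the table

# Assumption: CC and CXX are separate --build-arg entries.
build_cmd="docker build --build-arg CUDA_VERSION=${cuda_version} --build-arg LINUX_VERSION=${linux_version} --build-arg CC=${gcc_version} --build-arg CXX=${gcc_version} --tag cudf ."

# Print the command for review before running it from the cudf project root.
echo "$build_cmd"
```

Arguments that are left out keep the default values listed in the table.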

Open GPU Data Science

The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow is supported.