The RAPIDS cuDF library is a GPU DataFrame manipulation library based on Apache Arrow that accelerates loading, filtering, and manipulation of data for model training data preparation. The RAPIDS GPU DataFrame provides a pandas-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.
Please see the Demo Docker Repository, choosing a tag based on the NVIDIA CUDA version you’re running. It provides a ready-to-run Docker container with example notebooks and data that showcase how you can use cuDF.
You can install and update cuDF using the conda command:
conda install -c numba -c conda-forge -c rapidsai -c defaults cudf=0.2.0
Note: This conda installation only applies to Linux and Python versions 3.5/3.6.
You can create and activate a development environment using the conda command:
conda env create --name cudf --file conda_environments/testing_py35.yml
source activate cudf
For cudf development, use conda_environments/dev_py35.yml in the above conda env create command instead.
pip support is coming soon; please use conda for the time being.
The following instructions for from-source builds and development are tested on Ubuntu 16.04 and 18.04. Other operating systems may be compatible, but are not currently supported.
Get libgdf Dependencies
- CUDA 9.2+
- NVIDIA driver 396.44+
- Pascal architecture or better
You can obtain CUDA from https://developer.nvidia.com/cuda-downloads
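As a rough illustration of the minimums listed above, the following Python sketch compares dotted version strings against them. The helper names (`parse_version`, `meets_minimum`, `check_all`) and the `MINIMUMS` table are hypothetical and not part of cuDF or libgdf, and the compute capability entry assumes that "Pascal" corresponds to capability 6.0. The version strings you pass in would come from your own system, for example from `nvcc --version` or `nvidia-smi`.

```python
# Hypothetical helper for checking the libgdf prerequisites; not part of cuDF.

# Minimum versions taken from the dependency list above.
# The compute capability entry assumes Pascal corresponds to 6.0.
MINIMUMS = {
    "cuda": "9.2",
    "driver": "396.44",
    "compute_capability": "6.0",
}


def parse_version(text):
    """Turn a dotted version string such as '396.44' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(component, found):
    """Return True if the found version is at least the listed minimum."""
    return parse_version(found) >= parse_version(MINIMUMS[component])


def check_all(found_versions):
    """Map each component name to True/False for the versions supplied."""
    return {name: meets_minimum(name, version)
            for name, version in found_versions.items()}


if __name__ == "__main__":
    # Example values only; substitute the versions reported on your machine.
    example = {"cuda": "10.0", "driver": "390.30", "compute_capability": "6.1"}
    for name, ok in check_all(example).items():
        print(name, "OK" if ok else "below minimum")
```

Comparing tuples of integers rather than raw strings avoids the usual pitfall where "10.0" sorts before "9.2" lexicographically.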
Since cmake will download and build Apache Arrow (version 0.7.1 or 0.8+), you may need to install Boost C++ (version 1.58+) before running cmake:
# Install Boost C++ for Ubuntu 16.04/18.04
$ sudo apt-get install libboost-all-dev
# Install Boost C++ for Conda
$ conda install -c conda-forge boost
Build from Source
To install cuDF from source, ensure the dependencies are met and follow the steps below:
- Clone the repository
git clone --recurse-submodules https://github.com/rapidsai/cudf.git
cd cudf
- Create the conda development environment cudf as detailed above
- Build and install libgdf
source activate cudf
mkdir -p libgdf/build
cd libgdf/build
cmake .. -DHASH_JOIN=ON -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX
make -j install
make copy_python
python setup.py install
- Build and install cudf from the root of the repository
cd ../..
python setup.py install
Automated Build in Docker Container
A Dockerfile is provided with a preconfigured conda environment for building and installing cuDF from source, based on the master branch.
- Install nvidia-docker2 for Docker + GPU support
- Verify NVIDIA driver is
- Ensure CUDA 9.2+ is installed
From the cudf project root, run the following to build with defaults:
docker build -t cudf .
After the image is built, run the container:
docker run --runtime=nvidia -it cudf bash
Activate the conda environment cudf to use the newly built cuDF and libgdf libraries:
root@3f689ba9c842:/# source activate cudf
(cudf) root@3f689ba9c842:/# python -c "import cudf"
(cudf) root@3f689ba9c842:/#
Customizing the Build
Several build arguments are available to customize the build process of the container. These are specified with the Docker build-arg flag. Below is a list of the available arguments and their purpose:
| Build Argument | Default Value | Other Value(s) | Purpose |
| | 9.2 | 10.0 | set CUDA version |
| | ubuntu16.04 | ubuntu18.04 | set Ubuntu version |
| | 5 | 7 | set gcc/g++ version; NOTE: gcc7 requires Ubuntu 18.04 |
| | This repo | Forks of cuDF | set git URL to use for |
| | master | Any branch name | set git branch to checkout of |
| | 0.40.0 | Not supported | set numba version |
| | 1.14.3 | Not supported | set numpy version |
| | 0.20.3 | Not supported | set pandas version |
| | 0.10.0 | 0.8.0+ | set pyarrow version |
| | 3.5 | 3.6 | set python version |
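Build arguments are passed to `docker build` as repeated `--build-arg NAME=VALUE` pairs. The Python sketch below assembles such a command line. The function name `docker_build_command` is hypothetical, and the argument names `CUDA_VERSION` and `LINUX_VERSION` are assumed for illustration only, so check the Dockerfile for the names it actually declares.

```python
import shlex


def docker_build_command(build_args, tag="cudf", context="."):
    """Return the argv list for `docker build` with the given build args.

    build_args maps argument names to values. The names used in the example
    below are assumptions; consult the Dockerfile for the real ones.
    """
    command = ["docker", "build"]
    for name, value in build_args.items():
        command += ["--build-arg", "{}={}".format(name, value)]
    command += ["-t", tag, context]
    return command


if __name__ == "__main__":
    # Assumed argument names, using the alternative values from the table.
    args = {"CUDA_VERSION": "10.0", "LINUX_VERSION": "ubuntu18.04"}
    # Print a shell-quoted command that can be pasted into a terminal.
    print(" ".join(shlex.quote(part) for part in docker_build_command(args)))
```

Returning a list rather than a single string keeps each argument separate, so the result can also be handed directly to `subprocess.run` without shell quoting concerns.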
This project uses py.test.
In the source root directory and with the development conda environment activated, run:
py.test --cache-clear --ignore=libgdf
libgdf tests require a GPU and CUDA. CUDA can be installed locally or through the cudatoolkit conda package. For more details on the requirements needed to run these tests, see the libgdf README.
libgdf has two testing frameworks, py.test and GoogleTest:
# Run py.test command inside the /libgdf folder
py.test
# Run GoogleTest command inside the /libgdf/build folder after cmake
make -j test
The RAPIDS suite of open source software libraries aims to enable execution of end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
Apache Arrow on GPU
The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for the high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, only a subset of the features in Apache Arrow is supported.
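To make the term "columnar" concrete, the pure-Python sketch below stores a small table column by column, with each column holding a contiguous values buffer plus a validity bitmap that marks nulls, which is the general idea behind the Arrow layout. It is a simplified illustration only: it does not use Arrow or cuDF, it runs on the CPU, and the names `Column`, `from_rows`, and `is_valid` are hypothetical.

```python
from array import array


class Column:
    """A simplified column: a values buffer plus a validity bitmap."""

    def __init__(self, values, typecode="d"):
        # Nulls get a placeholder of 0 in the values buffer.
        self.values = array(typecode, [0 if v is None else v for v in values])
        # One bit per row, least-significant bit first; 1 means "valid".
        self.validity = bytearray((len(values) + 7) // 8)
        for i, v in enumerate(values):
            if v is not None:
                self.validity[i // 8] |= 1 << (i % 8)

    def is_valid(self, i):
        """Return True if row i holds a value rather than a null."""
        return bool((self.validity[i // 8] >> (i % 8)) & 1)

    def sum(self):
        """Sum the valid entries by scanning one contiguous buffer."""
        return sum(v for i, v in enumerate(self.values) if self.is_valid(i))


def from_rows(rows, names):
    """Pivot row-oriented records into a dict of named columns."""
    return {name: Column([row[j] for row in rows])
            for j, name in enumerate(names)}


if __name__ == "__main__":
    rows = [(1.0, 10.0), (None, 20.0), (3.0, None)]
    table = from_rows(rows, ["a", "b"])
    print(table["a"].sum(), table["b"].sum())
```

Because every column lives in its own contiguous buffer, an operation on one column never touches the others, which is the property that suits this layout to bulk parallel processing.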