<a id="introduction"></a>
## Introduction to cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames in RAPIDS.

**Table of Contents**

* [Introduction to cuDF](#introduction)
* [Setup](#setup)
* [cuDF DataFrame Basics](#basics)
* [Input/Output](#io)
* [cuDF API](#cudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai-nightly:0.8-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub - rapidsai/rapidsai-nightly](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on an NVIDIA Tesla V100 GPU. Your system may differ, so you may need to modify the code or install additional packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Tue Jun 11 05:14:09 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    58W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0    41W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   

Next, let's see what CUDA version we have:

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


<a id="basics"></a>
## cuDF DataFrame Basics

As we showed in the previous tutorial, cuDF DataFrames are tabular data structures that reside on the GPU. We interface with them in the same way we interface with Pandas DataFrames, which reside on the CPU, with a few deviations.

In the next several sections, we'll show how to create and manipulate cuDF DataFrames. For more information on using cuDF DataFrames, check out the documentation: https://rapidsai.github.io/projects/cudf/en/latest/index.html

#### Creating a cudf.DataFrame using lists

There are several ways to create a cuDF DataFrame. The easiest of these is to instantiate an empty cuDF DataFrame and then use Python list objects or NumPy arrays to create columns. Below, we import the cuDF library and create an empty cuDF DataFrame.

In [3]:
import cudf; print('cuDF Version:', cudf.__version__)


df = cudf.DataFrame()
print(df)

cuDF Version: 0.7.2+0.g3ebd286.dirty
Empty DataFrame
Columns: []
Index: []


Next, we can create two columns named `key` and `value` by using the bracket notation with the cuDF DataFrame and storing either a list of Python values or a NumPy array into that column.

In [4]:
import numpy as np; print('NumPy Version:', np.__version__)

# here we create two columns named "key" and "value"
df['key'] = [0, 1, 2, 3, 4]
df['value'] = np.arange(10, 15)
print(df)

NumPy Version: 1.16.2
   key  value
0    0     10
1    1     11
2    2     12
3    3     13
4    4     14


#### Creating a cudf.DataFrame using a list of tuples or a dictionary

Another way we can create a cuDF DataFrame is by providing a mapping of column names to column values, either via a list of tuples or by using a dictionary. In the examples below, we create a list of two-value tuples; the first value is the name of the column (for example, `id` or `timestamp`) and the second value is a list of Python objects or a NumPy array. Note that we don't have to constrain the data stored in our cuDF DataFrames to common data types like integers or floats; we can also use other data types such as datetimes or strings. We'll investigate how such data types behave on the GPU a bit later.

In [5]:
from datetime import datetime, timedelta


ids = np.arange(5)
t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
timestamps = [t0 + timedelta(seconds=x) for x in range(5)]
timestamps_np = np.array(timestamps, dtype='datetime64')

In [6]:
df = cudf.DataFrame([('id', ids), ('timestamp', timestamps_np)])
print(df)

   id               timestamp
0   0 2018-10-07T12:00:00.000
1   1 2018-10-07T12:00:01.000
2   2 2018-10-07T12:00:02.000
3   3 2018-10-07T12:00:03.000
4   4 2018-10-07T12:00:04.000


Alternatively, we can create a dictionary of key-value pairs, where each key represents a column name and each value holds the data that belongs in that column.

In [7]:
df = cudf.DataFrame({'id': ids, 'timestamp': timestamps_np})
print(df)

   id               timestamp
0   0 2018-10-07T12:00:00.000
1   1 2018-10-07T12:00:01.000
2   2 2018-10-07T12:00:02.000
3   3 2018-10-07T12:00:03.000
4   4 2018-10-07T12:00:04.000


#### Creating a cudf.DataFrame from a Pandas DataFrame

Pandas DataFrames are first-class citizens within cuDF: we can create a cuDF DataFrame from a Pandas DataFrame and vice versa.

In [8]:
import pandas as pd

pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

     a    b
0    0  0.0
1    1  0.1
2    2  0.2
3    3  NaN
4    4  0.4
5    5  0.5
6    6  0.6
7    7  0.7
8    8  0.8
9    9  0.9
10  10  1.0
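
In the output above, the `None` we passed for column `b` appears as `NaN`. The following is a minimal sketch, assuming only that Pandas is installed, of how Pandas stores such a missing value in a float column. The sample values are illustrative and not part of the original notebook.

```python
import pandas as pd

# A None inside a list of floats is stored as NaN in a float64 column.
s = pd.Series([0.0, 0.1, None, 0.4])

print(s.dtype)         # float64
print(s.isna().sum())  # 1
```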


We can use the `cudf.from_pandas` or `cudf.DataFrame.from_pandas` functions to create a cuDF DataFrame from a Pandas DataFrame.

In [9]:
df = cudf.from_pandas(pandas_df)
# df = cudf.DataFrame.from_pandas(pandas_df)  # alternative
print(df)

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4
5  5  0.5
6  6  0.6
7  7  0.7
8  8  0.8
9  9  0.9
[1 more rows]


#### Inspecting a cuDF DataFrame

There are several ways to inspect a cuDF DataFrame. The first method is to enter the cuDF DataFrame directly into the REPL. This shows us information about the type of the object, and metadata such as the number of rows or columns.

In [10]:
df

<cudf.DataFrame ncols=2 nrows=11 >

A second way to inspect a cuDF DataFrame is to pass the object to Python's `print` function, which shows the rows and columns of the dataframe.

In [11]:
print(df)

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4
5  5  0.5
6  6  0.6
7  7  0.7
8  8  0.8
9  9  0.9
[1 more rows]


For very large dataframes, we often want to see only the first few rows. We can use the `head` method of a cuDF DataFrame to view the first N rows; called without arguments, as below, it shows five rows.

In [12]:
print(df.head())

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4


#### Columns

cuDF DataFrames store metadata such as information about columns or data types. We can access the columns of a cuDF DataFrame using the `.columns` attribute.

In [13]:
print(df.columns)

Index(['a', 'b'], dtype='object')


We can modify the columns of a cuDF DataFrame by modifying the `columns` attribute. We can do this by setting that attribute equal to a list of strings representing the new columns.

In [14]:
df.columns = ['c', 'd']
print(df.columns)

Index(['c', 'd'], dtype='object')


#### Data types

We can also inspect the data types of the columns of a cuDF DataFrame using the `dtypes` attribute.

In [15]:
print(df.dtypes)

c      int64
d    float64
dtype: object


We can modify the data types of the columns of a cuDF DataFrame by passing in a cuDF Series with a modified data type. Be warned that type conversions can silently lose information - for example, converting a float to an integer discards the fractional part.

In [16]:
df['c'] = df['c'].astype(np.float32)
df['d'] = df['d'].astype(np.int32)
print(df.dtypes)
print(df)

c    float32
d      int32
dtype: object
     c  d
0  0.0  0
1  1.0  0
2  2.0  0
3  3.0   
4  4.0  0
5  5.0  0
6  6.0  0
7  7.0  0
8  8.0  0
9  9.0  0
[1 more rows]
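
The truncation visible in column `d` above can be reproduced on the CPU. Here is a minimal sketch that uses NumPy's `astype` as a stand-in (an assumption made so the example runs without a GPU); the sample values are illustrative.

```python
import numpy as np

values = np.array([0.0, 0.1, 0.5, 0.9, 1.0])

# Casting float -> int drops the fractional part and raises no warning.
as_int = values.astype(np.int32)
print(as_int)  # [0 0 0 0 1]

# Casting back does not recover the original values.
round_trip = as_int.astype(np.float64)
print(np.array_equal(values, round_trip))  # False
```

If the fractional part matters, it is safer to keep the column as a float, or to round explicitly before converting.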


#### Series

cuDF DataFrames are composed of rows and columns. Each column is represented using an object of type `cudf.dataframe.series.Series`. For example, if we subset a cuDF DataFrame using just one column we will be returned an object of type `cudf.dataframe.series.Series`.

In [17]:
print(type(df['c']))
print(df['c'])

<class 'cudf.dataframe.series.Series'>
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
[1 more rows]
Name: c, dtype: float32


#### Index

The `index` attribute holds the row labels of a cuDF DataFrame.


In [18]:
df.index

GenericIndex([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int64)

#### Converting a cuDF DataFrame to a Pandas DataFrame

We can convert a cuDF DataFrame back to a Pandas DataFrame using the `to_pandas` method.

In [19]:
pandas_df = df.to_pandas()
print(type(pandas_df))
print(pandas_df)

<class 'pandas.core.frame.DataFrame'>
       c  d
0    0.0  0
1    1.0  0
2    2.0  0
3    3.0 -1
4    4.0  0
5    5.0  0
6    6.0  0
7    7.0  0
8    8.0  0
9    9.0  0
10  10.0  1


#### Converting a cuDF DataFrame to a NumPy Array

Often we want to work with NumPy arrays. We can convert a cuDF DataFrame to a NumPy array by first converting it to a Pandas DataFrame using the `to_pandas` method followed by accessing the `values` attribute of the Pandas DataFrame.

In [20]:
numpy_array = df.to_pandas().values
print(type(numpy_array))
print(numpy_array)

<class 'numpy.ndarray'>
[[ 0.  0.]
 [ 1.  0.]
 [ 2.  0.]
 [ 3. -1.]
 [ 4.  0.]
 [ 5.  0.]
 [ 6.  0.]
 [ 7.  0.]
 [ 8.  0.]
 [ 9.  0.]
 [10.  1.]]
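
Notice that both columns come back as floats, even though `d` was an integer column. A NumPy array has a single data type, so Pandas picks one that can hold every column. The sketch below illustrates this with a small, hypothetical Pandas DataFrame; it assumes only Pandas and NumPy and does not touch the GPU.

```python
import numpy as np
import pandas as pd

# Two columns with different dtypes, similar to `c` and `d` above.
pdf = pd.DataFrame({
    'c': np.arange(3, dtype=np.float32),
    'd': np.arange(3, dtype=np.int32),
})

arr = pdf.values

print(pdf.dtypes.tolist())  # one dtype per column
print(arr.dtype)            # a single floating-point dtype for the whole array
print(arr.shape)            # (3, 2)
```

To keep the original integer type, we can convert one column at a time, for example `pdf['d'].values`.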


#### Converting a cuDF DataFrame to Other Data Formats

We can also convert a cuDF DataFrame to other data formats. 

For more information, see the documentation: https://docs.rapids.ai/api/cudf/stable/

<a id="io"></a>
## Input/Output

Before we process data and use it in machine learning models, we need to be able to load it into memory and write it after we're done using it. There are several ways to do this using cuDF.

#### Writing and Loading CSV Files

At this time, there is no direct way to use cuDF to write a CSV file. However, we can convert the cuDF DataFrame to a Pandas DataFrame and then write that to a CSV file.

In [21]:
df.to_pandas().to_csv('./dataset.csv', index=False)
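
The `index=False` argument keeps the row index out of the file. As a minimal sketch of why that matters, the example below uses a small, hypothetical Pandas DataFrame and writes to a string instead of a file, so it runs without a GPU.

```python
import io
import pandas as pd

pdf = pd.DataFrame({'c': [0.0, 1.0], 'd': [0, -1]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = pdf.to_csv()                # index written as an unnamed first column
without_index = pdf.to_csv(index=False)  # only the data columns

print(with_index.splitlines()[0])     # ,c,d
print(without_index.splitlines()[0])  # c,d

# Reading the index-free text back yields the same two columns.
restored = pd.read_csv(io.StringIO(without_index))
print(list(restored.columns))         # ['c', 'd']
```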

Perhaps one of the most common ways to create cuDF DataFrames is by loading a table that is stored as a file on disk. cuDF provides a lot of functionality for reading in a variety of different data formats. Below, we show how easy it is to read in a CSV file:

In [22]:
df = cudf.read_csv('./dataset.csv')
print(df)

     c   d
0  0.0   0
1  1.0   0
2  2.0   0
3  3.0  -1
4  4.0   0
5  5.0   0
6  6.0   0
7  7.0   0
8  8.0   0
9  9.0   0
[1 more rows]


CSV files come in many flavors and cuDF tries to be as flexible as possible, mirroring the Pandas API wherever possible. For more information on possible parameters for working with files, see the cuDF IO documentation: 

https://rapidsai.github.io/projects/cudf/en/latest/api.html#cudf.io.csv.read_csv
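
As a rough illustration of the kind of parameters involved, the sketch below reads a headerless, semicolon-separated table with the Pandas `read_csv` function. It is an assumption that your cuDF version accepts the same `sep`, `header`, `names`, and `dtype` arguments, so check the documentation linked above before relying on them. The input text is made up for this example.

```python
import io
import pandas as pd

# No header row, semicolon-separated values.
raw = "0.0;0\n1.0;0\n3.0;-1\n"

pdf = pd.read_csv(
    io.StringIO(raw),
    sep=';',                               # field separator
    header=None,                           # the data has no header line
    names=['c', 'd'],                      # column names to assign
    dtype={'c': 'float64', 'd': 'int64'},  # explicit column types
)

print(pdf.dtypes)
print(pdf)
```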

<a id="cudfapi"></a>
## cuDF API

The cuDF API is pleasantly simple and mirrors the Pandas API as closely as possible. In this section, we will explore the cuDF API and show how to perform common data manipulation operations.

#### Selecting Rows or Columns

We can select rows from a cuDF DataFrame using slicing syntax. 

In [23]:
print(df[0:5])

     c   d
0  0.0   0
1  1.0   0
2  2.0   0
3  3.0  -1
4  4.0   0


We can select a single column from a cuDF DataFrame using either bracket notation or attribute access.

In [24]:
print(df['c'])
# print(df.c)  # alternative

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
[1 more rows]
Name: c, dtype: float64


We can also select multiple columns by passing in a list of column names.

In [25]:
print(df[['c', 'd']])

     c   d
0  0.0   0
1  1.0   0
2  2.0   0
3  3.0  -1
4  4.0   0
5  5.0   0
6  6.0   0
7  7.0   0
8  8.0   0
9  9.0   0
[1 more rows]


We can select specific rows and columns together with `.loc`, combining a row slice with a list of column names.

In [26]:
print(df.loc[0:5, ['c']])
# print(df.loc[0:5, ['c', 'd']])  # to select multiple columns, pass in multiple column names

     c
0  0.0
1  1.0
2  2.0
3  3.0
4  4.0
5  5.0


#### Missing Data (coming soon!)

#### Boolean Indexing (coming soon!)

#### Statistical Operations (coming soon!)

#### Applymap Operations (coming soon!)

#### Histogramming (coming soon!)

#### Merges (coming soon!)

#### Concatenations (coming soon!)

#### Groupbys (coming soon!)

#### One Hot Encoding (coming soon!)

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames in RAPIDS.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)