<a id="introduction"></a>
## Introduction to cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames in RAPIDS.

**Table of Contents**

* [Introduction to cuDF](#introduction)
* [Setup](#setup)
* [cuDF Series Basics](#series)
* [cuDF DataFrame Basics](#dataframes)
* [Input/Output](#io)
* [cuDF API](#cudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `nvcr.io/nvidia/rapidsai/rapidsai:0.8-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [NGC - rapidsai/rapidsai](https://ngc.nvidia.com/catalog/containers/nvidia:rapidsai:rapidsai)
* `rapidsai/rapidsai-nightly:0.9-cuda10.0-devel-ubuntu18.04-gcc7-py3.7` from [DockerHub - rapidsai/rapidsai-nightly](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on an NVIDIA Tesla V100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Wed Jul 17 08:35:24 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla V100-SXM2...  On   | 00000000:06:00.0 Off |                    0 |
| N/A   37C    P0    58W / 300W |   5219MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:07:00.0 Off |                    0 |
| N/A   37C    P0    57W / 300W |    726MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:0A:00.0 Off |                    0 |
| N/A   

Next, let's see what CUDA version we have:

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130


<a id="series"></a>
## cuDF Series Basics

First, let's load the cuDF library.

In [3]:
import cudf; print('cuDF Version:', cudf.__version__)

cuDF Version: 0.8.0+0.g8fa7bd3.dirty


There are two main data structures in cuDF: a `Series` object and a `DataFrame` object. Multiple `Series` objects are used as columns for a `DataFrame`. We'll first explore the `Series` class and build upon that foundation to later introduce how to work with objects of type `DataFrame`.

We can create a `Series` object using the `cudf.Series` class.

In [4]:
column = cudf.Series([10, 11, 12, 13])
column

<cudf.Series nrows=4 >

We see from the output that `column` is an object of type `cudf.Series` and has 4 rows.

Another way to inspect a `Series` is to use the Python `print` function.

In [5]:
print(column)

0    10
1    11
2    12
3    13
dtype: int64


We see that our `Series` object has four rows with values 10, 11, 12, and 13. We also see that the type of this data is `int64`. There are several ways to represent data using cuDF. The most common formats are `int8`, `int32`, `int64`, `float32`, and `float64`.

We also see a column of values on the left-hand side with values 0, 1, 2, 3. These values represent the index of the `Series`.

In [6]:
print(column.index)

RangeIndex(start=0, stop=4)


We can create a new column with a different index by using the `set_index` method.

In [7]:
new_column = column.set_index([5, 6, 7, 8]) 
print(new_column)

5    10
6    11
7    12
8    13
dtype: int64


Indexes are useful for operations like joins and groupbys.
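
To make the role of an index in a join concrete, here is a minimal plain-Python sketch (no GPU needed). It models each labeled column as a dictionary from index label to value, which is only a stand-in for illustration and not how cuDF stores data. The second column and its labels are made up for the example.

```python
# Plain-Python stand-ins for two labeled columns: index label -> value.
left = dict(zip([5, 6, 7, 8], [10, 11, 12, 13]))
right = dict(zip([7, 8, 9], [0.7, 0.8, 0.9]))  # hypothetical second column

# An inner join on the index keeps only labels present in both columns.
shared_labels = sorted(left.keys() & right.keys())
joined = {label: (left[label], right[label]) for label in shared_labels}

print(joined)  # {7: (12, 0.7), 8: (13, 0.8)}
```

Rows are matched by label rather than by position, which is why changing the index changes how two objects line up.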

<a id="dataframes"></a>
## cuDF DataFrame Basics

As we showed in the previous tutorial, cuDF DataFrames are a tabular structure of data that reside on the GPU. We interface with these cuDF DataFrames in the same way we interface with Pandas DataFrames that reside on the CPU - with a few deviations.

In the next several sections, we'll show how to create and manipulate cuDF DataFrames. For more information on using cuDF DataFrames, check out the documentation: https://rapidsai.github.io/projects/cudf/en/latest/index.html

#### Creating a cuDF DataFrame using lists

There are several ways to create a cuDF DataFrame. The easiest of these is to instantiate an empty cuDF DataFrame and then use Python list objects or NumPy arrays to create columns. Below, we create an empty cuDF DataFrame.

In [8]:
df = cudf.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []


Next, we can create two columns named `key` and `value` by using the bracket notation with the cuDF DataFrame and storing either a list of Python values or a NumPy array into that column.

In [9]:
import numpy as np; print('NumPy Version:', np.__version__)


# here we create two columns named "key" and "value"
df['key'] = [0, 1, 2, 3, 4]
df['value'] = np.arange(10, 15)
print(df)

NumPy Version: 1.16.2
   key  value
0    0     10
1    1     11
2    2     12
3    3     13
4    4     14


#### Creating a cuDF DataFrame using a list of tuples or a dictionary

Another way we can create a cuDF DataFrame is by providing a mapping of column names to column values, either via a list of tuples or by using a dictionary. In the examples below, we create a list of two-value tuples; the first value is the name of the column - for example, `id` or `timestamp` - and the second value is a list of Python objects or a NumPy array. Note that we don't have to constrain the data stored in our cuDF DataFrames to common data types like integers or floats - we can use more exotic data types such as datetimes or strings. We'll investigate how such data types behave on the GPU a bit later.

In [10]:
from datetime import datetime, timedelta


ids = np.arange(5)
t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
timestamps = [t0 + timedelta(seconds=x) for x in range(5)]
timestamps_np = np.array(timestamps, dtype='datetime64')

In [11]:
df = cudf.DataFrame([('id', ids), ('timestamp', timestamps_np)])
print(df)

   id               timestamp
0   0 2018-10-07T12:00:00.000
1   1 2018-10-07T12:00:01.000
2   2 2018-10-07T12:00:02.000
3   3 2018-10-07T12:00:03.000
4   4 2018-10-07T12:00:04.000


Alternatively, we can create a dictionary of key-value pairs, where each key in the dictionary represents a column name and each value associated with the key represents the values that belong in that column.

In [12]:
df = cudf.DataFrame({'id': ids, 'timestamp': timestamps_np})
print(df)

   id               timestamp
0   0 2018-10-07T12:00:00.000
1   1 2018-10-07T12:00:01.000
2   2 2018-10-07T12:00:02.000
3   3 2018-10-07T12:00:03.000
4   4 2018-10-07T12:00:04.000
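
The two construction styles above describe the same mapping from column names to column values. The following plain-Python sketch (standard library only, no cuDF) shows that equivalence and checks that every column has the same number of rows, which a tabular structure requires. Plain lists stand in for the NumPy arrays used in the cells above.

```python
from datetime import datetime, timedelta

t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
timestamps = [t0 + timedelta(seconds=x) for x in range(5)]

# The same column mapping written two ways.
as_pairs = [('id', list(range(5))), ('timestamp', timestamps)]
as_dict = {'id': list(range(5)), 'timestamp': timestamps}

# dict() turns the list of (name, values) tuples into the dictionary form.
assert dict(as_pairs) == as_dict

# Every column must have the same number of rows.
row_counts = {name: len(values) for name, values in as_dict.items()}
print(row_counts)  # {'id': 5, 'timestamp': 5}
```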


#### Creating a cuDF DataFrame from a Pandas DataFrame

Pandas DataFrames are first-class citizens within cuDF - this means that we can create a cuDF DataFrame from a Pandas DataFrame and vice versa.

In [13]:
import pandas as pd; print('Pandas Version:', pd.__version__)


pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

Pandas Version: 0.23.4
     a    b
0    0  0.0
1    1  0.1
2    2  0.2
3    3  NaN
4    4  0.4
5    5  0.5
6    6  0.6
7    7  0.7
8    8  0.8
9    9  0.9
10  10  1.0


We can use the `cudf.from_pandas` or `cudf.DataFrame.from_pandas` functions to create a cuDF DataFrame from a Pandas DataFrame.

In [14]:
df = cudf.from_pandas(pandas_df)
# df = cudf.DataFrame.from_pandas(pandas_df)  # alternative
print(df)

   a    b
0  0  0.0
1  1  0.1
2  2  0.2
3  3     
4  4  0.4
5  5  0.5
6  6  0.6
7  7  0.7
8  8  0.8
9  9  0.9
[1 more rows]


#### Creating a cuDF DataFrame from cuDF Series

We can create a cuDF DataFrame from one or more cuDF Series objects by passing the Series objects in a dictionary mapping each Series object to a column name.

In [15]:
column1 = cudf.Series([1, 2, 3, 4])
column2 = cudf.Series([5, 6, 7, 8])
column3 = cudf.Series([9, 10, 11, 12])
df = cudf.DataFrame({'a': column1, 'b': column2, 'c': column3})
print(df)

   a  b   c
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12


#### Inspecting a cuDF DataFrame

There are several ways to inspect a cuDF DataFrame. The first method is to enter the cuDF DataFrame directly into the REPL. This shows us information about the type of the object, and metadata such as the number of rows or columns.

In [16]:
df = cudf.DataFrame({'a': np.arange(0, 100), 'b': np.arange(100, 0, -1)})

In [17]:
df

<cudf.DataFrame ncols=2 nrows=100 >

A second way to inspect a cuDF DataFrame is to pass the object to the Python `print` function. This shows the rows and columns of the dataframe.

In [18]:
print(df)

   a    b
0  0  100
1  1   99
2  2   98
3  3   97
4  4   96
5  5   95
6  6   94
7  7   93
8  8   92
9  9   91
[90 more rows]


For very large dataframes, we often want to see only the first few rows. We can use the `head` method of a cuDF DataFrame to view the first N rows; when called without an argument, it returns the first five.

In [19]:
print(df.head())

   a    b
0  0  100
1  1   99
2  2   98
3  3   97
4  4   96


#### Columns

cuDF DataFrames store metadata such as information about columns or data types. We can access the columns of a cuDF DataFrame using the `.columns` attribute.

In [20]:
print(df.columns)

Index(['a', 'b'], dtype='object')


We can modify the columns of a cuDF DataFrame by modifying the `columns` attribute. We can do this by setting that attribute equal to a list of strings representing the new columns.

In [21]:
df.columns = ['c', 'd']
print(df.columns)

Index(['c', 'd'], dtype='object')


#### Data Types

We can also inspect the data types of the columns of a cuDF DataFrame using the `dtypes` attribute.

In [22]:
print(df.dtypes)

c    int64
d    int64
dtype: object


We can modify the data types of the columns of a cuDF DataFrame by passing in a cuDF Series with a modified data type. Be warned that type conversions can silently lose information - for example, converting a float to an integer discards the fractional part, and converting a large integer to `float32` can lose precision.

In [23]:
df['c'] = df['c'].astype(np.float32)
df['d'] = df['d'].astype(np.int32)
print(df.dtypes)

c    float32
d      int32
dtype: object
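
The kinds of silent errors mentioned above can be reproduced with the Python standard library alone. This is a sketch of the underlying number formats, not of cuDF itself: `to_float32` is a helper defined here for illustration, and the assumption is that `astype` follows the same IEEE 754 and truncation rules as NumPy-style casts.

```python
import struct

def to_float32(value):
    """Round-trip a Python number through a 4-byte IEEE 754 float."""
    return struct.unpack('f', struct.pack('f', float(value)))[0]

# float -> int drops the fractional part without raising an error.
print(int(2.9))   # 2
print(int(-2.9))  # -2

# int -> float32 loses precision above 2**24.
print(to_float32(16_777_217))  # 16777216.0

# Narrow integer types have small ranges; int8 covers -128..127.
bits = 8
print(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)  # -128 127
```

None of these conversions raises an exception, so it is worth checking value ranges before narrowing a column's type.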


#### Series

cuDF DataFrames are composed of rows and columns. Each column is represented using an object of type `Series`. For example, if we subset a cuDF DataFrame using just one column we will be returned an object of type `cudf.dataframe.series.Series`.

In [24]:
print(type(df['c']))
print(df['c'])

<class 'cudf.dataframe.series.Series'>
0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
[90 more rows]
Name: c, dtype: float32


#### Index

Like `Series` objects, each `DataFrame` has an index attribute.

In [25]:
df.index

RangeIndex(start=0, stop=100)

We can use the index values to subset the `DataFrame`.

In [26]:
print(df[df.index == 2])

     c   d
2  2.0  98


#### Converting a cuDF DataFrame to a Pandas DataFrame

We can convert a cuDF DataFrame back to a Pandas DataFrame using the `to_pandas` method.

In [27]:
pandas_df = df.to_pandas()
print(type(pandas_df))

<class 'pandas.core.frame.DataFrame'>


#### Converting a cuDF DataFrame to a NumPy Array

Often we want to work with NumPy arrays. We can convert a cuDF DataFrame to a NumPy array by first converting it to a Pandas DataFrame using the `to_pandas` method followed by accessing the `values` attribute of the Pandas DataFrame.

In [28]:
numpy_array = df.to_pandas().values
print(type(numpy_array))

<class 'numpy.ndarray'>


#### Converting a cuDF DataFrame to Other Data Formats

We can also convert a cuDF DataFrame to other data formats. 

For more information, see the documentation: https://docs.rapids.ai/api/cudf/stable/

<a id="io"></a>
## Input/Output

Before we process data and use it in machine learning models, we need to be able to load it into memory and write it after we're done using it. There are several ways to do this using cuDF.

#### Writing and Loading CSV Files

At this time, there is no direct way to use cuDF to write to CSV. However, we can convert the cuDF DataFrame to a Pandas DataFrame and then write it to a CSV file.

In [29]:
df.to_pandas().to_csv('./dataset.csv', index=False)

Perhaps one of the most common ways to create cuDF DataFrames is by loading a table that is stored as a file on disk. cuDF provides a lot of functionality for reading in a variety of different data formats. Below, we show how easy it is to read in a CSV file:

In [30]:
df = cudf.read_csv('./dataset.csv')
print(df)

     c    d
0  0.0  100
1  1.0   99
2  2.0   98
3  3.0   97
4  4.0   96
5  5.0   95
6  6.0   94
7  7.0   93
8  8.0   92
9  9.0   91
[90 more rows]


CSV files come in many flavors and cuDF tries to be as flexible as possible, mirroring the Pandas API wherever possible. For more information on possible parameters for working with files, see the cuDF IO documentation: 

https://rapidsai.github.io/projects/cudf/en/latest/api.html#cudf.io.csv.read_csv
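
A CSV file stores only text, so a reader has to infer or be told the type of each column. The sketch below uses Python's standard `csv` module and an in-memory buffer instead of cuDF and a file on disk; the three-row table is a cut-down stand-in for the dataframe written above, and the types are applied by hand where `cudf.read_csv` would infer them.

```python
import csv
import io

columns = {'c': [0.0, 1.0, 2.0], 'd': [100, 99, 98]}

# Write a header row followed by one line per row, with no index column.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(columns.keys())
writer.writerows(zip(*columns.values()))

# Read it back: every field arrives as text and needs a type applied.
buffer.seek(0)
rows = [dict(row) for row in csv.DictReader(buffer)]
print(rows[0])  # {'c': '0.0', 'd': '100'}

restored = {'c': [float(r['c']) for r in rows],
            'd': [int(r['d']) for r in rows]}
assert restored == columns
```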

<a id="cudfapi"></a>
## cuDF API

The cuDF API is pleasantly simple and mirrors the Pandas API as closely as possible. In this section, we will explore the cuDF API and show how to perform common data manipulation operations.

#### Selecting Rows or Columns

We can select rows from a cuDF DataFrame using slicing syntax. 

In [31]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32)})

In [32]:
print(df[0:5])

     a      b
0  0.0  100.0
1  1.0   99.0
2  2.0   98.0
3  3.0   97.0
4  4.0   96.0


There are several ways to select a column from a cuDF DataFrame.

In [33]:
print(df['a'])
# print(df.a)  # alternative

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
[90 more rows]
Name: a, dtype: float32


We can also select multiple columns by passing in a list of column names.

In [34]:
print(df[['a', 'b']])

     a      b
0  0.0  100.0
1  1.0   99.0
2  2.0   98.0
3  3.0   97.0
4  4.0   96.0
5  5.0   95.0
6  6.0   94.0
7  7.0   93.0
8  8.0   92.0
9  9.0   91.0
[90 more rows]


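We can select specific rows and columns using the `loc` indexer, passing in a row slice as well as a list of column names. Note that, unlike the slicing syntax above, a `loc` slice includes its end label: `0:5` returns rows 0 through 5.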

In [35]:
print(df.loc[0:5, ['a']])
# print(df.loc[0:5, ['a', 'b']])  # to select multiple columns, pass in multiple column names

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
Name: a, dtype: float32
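
The outputs above show that `df[0:5]` returns five rows while `df.loc[0:5, ...]` returns six. A plain-Python sketch of the two slicing rules, using lists as stand-ins for the index and column `a`:

```python
labels = list(range(100))           # default index 0..99
a = [float(i) for i in range(100)]  # column 'a'

# Positional slicing, as in df[0:5]: the end position is excluded.
positional = a[0:5]

# Label-based slicing, as in df.loc[0:5]: the end label is included.
label_based = [value for label, value in zip(labels, a) if 0 <= label <= 5]

print(len(positional), len(label_based))  # 5 6
```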


#### Defining New Columns

We often want to define new columns from existing columns.

In [36]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

In [37]:
df['d'] = np.arange(200, 300).astype(np.float32)

print(df)

     a      b      c      d
0  0.0  100.0  100.0  200.0
1  1.0   99.0  101.0  201.0
2  2.0   98.0  102.0  202.0
3  3.0   97.0  103.0  203.0
4  4.0   96.0  104.0  204.0
5  5.0   95.0  105.0  205.0
6  6.0   94.0  106.0  206.0
7  7.0   93.0  107.0  207.0
8  8.0   92.0  108.0  208.0
9  9.0   91.0  109.0  209.0
[90 more rows]


In [38]:
data = np.arange(300, 400).astype(np.float32)
df.add_column('e', data)

print(df)

     a      b      c      d      e
0  0.0  100.0  100.0  200.0  300.0
1  1.0   99.0  101.0  201.0  301.0
2  2.0   98.0  102.0  202.0  302.0
3  3.0   97.0  103.0  203.0  303.0
4  4.0   96.0  104.0  204.0  304.0
5  5.0   95.0  105.0  205.0  305.0
6  6.0   94.0  106.0  206.0  306.0
7  7.0   93.0  107.0  207.0  307.0
8  8.0   92.0  108.0  208.0  308.0
9  9.0   91.0  109.0  209.0  309.0
[90 more rows]
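
The cells above add columns from fresh NumPy arrays. A column can also be derived from columns that already exist. The sketch below shows the element-wise idea in plain Python, with a dictionary of lists standing in for the dataframe; the column names `a_plus_b` and `a_squared` are made up for illustration. With cuDF, one would expect to write this as column arithmetic such as `df['a'] + df['b']`, assuming Pandas-style operators, which this notebook does not demonstrate.

```python
columns = {
    'a': [float(i) for i in range(100)],
    'b': [float(i) for i in range(100, 0, -1)],
}

# A derived column is computed row by row from existing columns.
columns['a_plus_b'] = [x + y for x, y in zip(columns['a'], columns['b'])]
columns['a_squared'] = [x * x for x in columns['a']]

print(columns['a_plus_b'][:3])   # [100.0, 100.0, 100.0]
print(columns['a_squared'][:3])  # [0.0, 1.0, 4.0]
```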


#### Dropping Columns

Alternatively, we may want to remove columns from our `DataFrame`. We can do so using the `drop_column` method. Note that this method removes a column in-place - meaning that the `DataFrame` we act on will be modified.

In [39]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

In [40]:
df.drop_column('a')
print(df)

       b      c
0  100.0  100.0
1   99.0  101.0
2   98.0  102.0
3   97.0  103.0
4   96.0  104.0
5   95.0  105.0
6   94.0  106.0
7   93.0  107.0
8   92.0  108.0
9   91.0  109.0
[90 more rows]


If we want to remove a column without modifying the original DataFrame, we can use the `drop` method. This method will return a new DataFrame without that column (or columns).

In [41]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

In [42]:
new_df = df.drop('a')

print('Original DataFrame:')
print(df)
print(79 * '-')
print('New DataFrame:')
print(new_df)

Original DataFrame:
     a      b      c
0  0.0  100.0  100.0
1  1.0   99.0  101.0
2  2.0   98.0  102.0
3  3.0   97.0  103.0
4  4.0   96.0  104.0
5  5.0   95.0  105.0
6  6.0   94.0  106.0
7  7.0   93.0  107.0
8  8.0   92.0  108.0
9  9.0   91.0  109.0
[90 more rows]
-------------------------------------------------------------------------------
New DataFrame:
       b      c
0  100.0  100.0
1   99.0  101.0
2   98.0  102.0
3   97.0  103.0
4   96.0  104.0
5   95.0  105.0
6   94.0  106.0
7   93.0  107.0
8   92.0  108.0
9   91.0  109.0
[90 more rows]


We can also pass in a list of column names to drop.

In [43]:
new_df = df.drop(['a', 'b'])

print('Original DataFrame:')
print(df)
print(79 * '-')
print('New DataFrame:')
print(new_df)

Original DataFrame:
     a      b      c
0  0.0  100.0  100.0
1  1.0   99.0  101.0
2  2.0   98.0  102.0
3  3.0   97.0  103.0
4  4.0   96.0  104.0
5  5.0   95.0  105.0
6  6.0   94.0  106.0
7  7.0   93.0  107.0
8  8.0   92.0  108.0
9  9.0   91.0  109.0
[90 more rows]
-------------------------------------------------------------------------------
New DataFrame:
       c
0  100.0
1  101.0
2  102.0
3  103.0
4  104.0
5  105.0
6  106.0
7  107.0
8  108.0
9  109.0
[90 more rows]
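
The difference between removing a column in place and returning a new object can be sketched in plain Python, with a dictionary of lists standing in for the dataframe. `drop_columns` is a helper written for this illustration, not a cuDF function.

```python
def drop_columns(columns, names):
    """Return a new column mapping without the given names (like `drop`)."""
    to_drop = {names} if isinstance(names, str) else set(names)
    return {name: values for name, values in columns.items()
            if name not in to_drop}

original = {'a': [0.0, 1.0], 'b': [100.0, 99.0], 'c': [100.0, 101.0]}

# Non-mutating: the original keeps all three columns.
trimmed = drop_columns(original, ['a', 'b'])
print(list(original), list(trimmed))  # ['a', 'b', 'c'] ['c']

# In-place, like `drop_column`: the original itself changes.
del original['a']
print(list(original))  # ['b', 'c']
```

The non-mutating form is usually safer in notebooks, where cells may be re-run out of order.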


#### Missing Data

Sometimes data is not as clean as we would like it - often there are wrong values or values that are missing entirely. cuDF DataFrames can represent missing values using the Python `None` keyword.

In [44]:
df = cudf.DataFrame({'a': [0, None, 2, 3, 4, 5, 6, 7, 8, None, 10],
                     'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 
                     'c': [0.0, 0.1, None, None, 0.4, 0.5, None, 0.7, 0.8, 0.9, 1.0]})
print(df)

   a    b    c
0  0  0.0  0.0
1     0.1  0.1
2  2  0.2     
3  3          
4  4  0.4  0.4
5  5  0.5  0.5
6  6  0.6     
7  7  0.7  0.7
8  8  0.8  0.8
9     0.9  0.9
[1 more rows]


We can also fill in these missing values with another value using the `fillna` method. Both `Series` and `DataFrame` objects implement this method.

In [45]:
df['c'] = df['c'].fillna(999)
print(df)

   a    b      c
0  0  0.0    0.0
1     0.1    0.1
2  2  0.2  999.0
3  3       999.0
4  4  0.4    0.4
5  5  0.5    0.5
6  6  0.6  999.0
7  7  0.7    0.7
8  8  0.8    0.8
9     0.9    0.9
[1 more rows]


In [46]:
new_df = df.fillna(-1)
print(new_df)

    a     b      c
0   0   0.0    0.0
1  -1   0.1    0.1
2   2   0.2  999.0
3   3  -1.0  999.0
4   4   0.4    0.4
5   5   0.5    0.5
6   6   0.6  999.0
7   7   0.7    0.7
8   8   0.8    0.8
9  -1   0.9    0.9
[1 more rows]
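
What `fillna` does to a single column can be sketched in a few lines of plain Python. The `fillna` function here is a stand-in written for illustration, with `None` marking the missing entries as in the cell that builds the dataframe.

```python
def fillna(values, fill_value):
    """Replace missing entries (None) with fill_value; keep the rest."""
    return [fill_value if v is None else v for v in values]

c = [0.0, 0.1, None, None, 0.4, 0.5, None, 0.7, 0.8, 0.9, 1.0]
filled = fillna(c, 999.0)

print(filled[:4])          # [0.0, 0.1, 999.0, 999.0]
print(c.count(None))       # 3
print(filled.count(None))  # 0
```

Once filled, a sentinel such as 999 or -1 is indistinguishable from a real measurement, so it should be a value that cannot occur in the data.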


#### Boolean Indexing

We previously saw how we can select certain rows from our dataset by using the bracket `[]` notation. However, we may want to select rows based on certain criteria - this is called boolean indexing. We can combine the indexing notation with an array of boolean values to select only the rows that meet these criteria.

In [47]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

In [48]:
mask = df['a'] == 3
df[mask]

<cudf.DataFrame ncols=4 nrows=25 >
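
A plain-Python sketch of the mask mechanics, using lists in place of the GPU columns. The second condition (even values of `c`) is added here for illustration; with cuDF, one would expect to combine masks with the `&` operator, assuming Pandas-style semantics.

```python
a = [value for value in (0, 1, 2, 3) for _ in range(25)]  # like np.repeat
c = list(range(100))

# A mask holds one boolean per row.
mask = [value == 3 for value in a]

# Indexing with the mask keeps only rows where it is True.
selected_c = [row for row, keep in zip(c, mask) if keep]
print(len(selected_c), selected_c[:3])  # 25 [75, 76, 77]

# Masks can be combined element-wise before selecting.
combined = [m and (value % 2 == 0) for m, value in zip(mask, c)]
print(sum(combined))  # 12
```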

#### Sorting Data

Data is often not sorted before we start to work with it. Sorting data is very useful for optimizing operations like joins and aggregations, especially when the data is distributed.

We can sort data in cuDF using the `sort_values` method and passing in which column we want to sort by. 

In [49]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

   a  b  c    d
0  0  1  0  100
1  0  1  1   99
2  0  0  2   98
3  0  1  3   97
4  0  0  4   96


In [50]:
print(df.sort_values('d').head())

   a  b   c  d
99  3  1  99  1
98  3  0  98  2
97  3  1  97  3
96  3  1  96  4
95  3  1  95  5


We can also specify if the column we're sorting should be sorted in ascending or descending order by using the `ascending` argument and passing in `True` or `False`.

In [51]:
print(df.sort_values('c', ascending=False).head())

   a  b   c  d
99  3  1  99  1
98  3  0  98  2
97  3  1  97  3
96  3  1  96  4
95  3  1  95  5


We can sort by multiple columns by passing in a list of column names. 

In [52]:
print(df.sort_values(['a', 'b']).head())

   a  b  c   d
2  0  0  2  98
4  0  0  4  96
6  0  0  6  94
7  0  0  7  93
8  0  0  8  92


We can also specify which of those columns should be sorted in ascending or descending order by passing in a list of boolean values, where each boolean value maps to each column, respectively.

In [53]:
print('Sort with all columns specified descending:')
print(df.sort_values(['a', 'b'], ascending=False).head())
print(79 * '-')
print('Sort with both a descending and b ascending:')
print(df.sort_values(['a', 'b'], ascending=[False, True]).head())



Sort with all columns specified descending:
   a  b   c   d
78  3  1  78  22
82  3  1  82  18
84  3  1  84  16
85  3  1  85  15
88  3  1  88  12
-------------------------------------------------------------------------------
Sort with both a descending and b ascending:
   a  b   c   d
75  3  0  75  25
76  3  0  76  24
77  3  0  77  23
79  3  0  79  21
80  3  0  80  20
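
The mixed-direction sort (`a` descending, `b` ascending) can be sketched with Python's built-in `sorted`. The five rows are a hand-picked sample rather than the random data above, and negating a key only works for numeric columns. Python's sort is stable, so ties keep their input order; this sketch makes no claim about how cuDF orders ties.

```python
rows = [
    {'a': 3, 'b': 1, 'c': 78},
    {'a': 3, 'b': 0, 'c': 75},
    {'a': 0, 'b': 0, 'c': 2},
    {'a': 3, 'b': 0, 'c': 76},
    {'a': 0, 'b': 1, 'c': 0},
]

# ascending=[False, True]: negate 'a' to reverse it, leave 'b' as is.
ordered = sorted(rows, key=lambda row: (-row['a'], row['b']))
print([(r['a'], r['b'], r['c']) for r in ordered])
# [(3, 0, 75), (3, 0, 76), (3, 1, 78), (0, 0, 2), (0, 1, 0)]
```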


#### Statistical Operations

There are several statistical operations we can use to aggregate our data in meaningful ways. These can be applied to both `Series` and `DataFrame` objects.

In [54]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

In [55]:
df['a'].sum()

150

In [56]:
print(df.sum())

a     150
b      52
c    4950
d    5050
dtype: int64
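
The deterministic columns make the sums above easy to verify by hand. This plain-Python sketch rebuilds columns `a`, `c`, and `d` as lists (column `b` is random, so it is left out) and also shows a few other common reductions.

```python
a = [value for value in (0, 1, 2, 3) for _ in range(25)]
c = list(range(0, 100))
d = list(range(100, 0, -1))

print(sum(a), sum(c), sum(d))  # 150 4950 5050

# Other common reductions over a column.
print(min(c), max(c), sum(c) / len(c))  # 0 99 49.5
```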


#### Applymap Operations

While cuDF allows us to define new columns in interesting ways, we often want to work with more complex functions. We can define a function and use the `applymap` method to apply this function to each value in a column in element-wise fashion. While the below example is simple, it can be very easily extended to more complex workflows.

In [57]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

In [58]:
def add_ten_to_x(x):
    return x + 10

print(df['c'].applymap(add_ten_to_x))

0    10
1    11
2    12
3    13
4    14
5    15
6    16
7    17
8    18
9    19
[90 more rows]
Name: c, dtype: int32


#### Histogramming

We can access the value counts of a column using the `value_counts` method. Note that this is typically used with columns representing discrete data, i.e. integers, strings, categoricals, etc. We may not be as interested in the value counts of continuous numerical data, e.g. how often the value 2.1 appears. The results of the `value_counts` method can be used with Python plotting libraries like Matplotlib or Seaborn to generate visualizations such as histograms.

In [59]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

In [60]:
result = df['a'].value_counts()
print(result)

0    25
1    25
2    25
3    25
dtype: int64
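
Counting values is the same operation that `collections.Counter` performs in the standard library. The sketch below reproduces the counts above for column `a` and prints a crude text histogram in place of a plotting library; the scale of one `#` per five rows is arbitrary.

```python
from collections import Counter

a = [value for value in (0, 1, 2, 3) for _ in range(25)]
counts = Counter(a)
print(sorted(counts.items()))  # [(0, 25), (1, 25), (2, 25), (3, 25)]

# A crude text histogram from the counts.
for value, count in sorted(counts.items()):
    print(value, '#' * (count // 5))
```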


#### Concatenations

In everyday data science, we typically work with multiple sources of data and wish to combine these data into a single more meaningful representation. These operations are often called concatenations and joins. We can concatenate two or more dataframes together row-wise or column-wise by passing in a list of the dataframes to be concatenated into the `cudf.concat` function and specifying the axis along which to concatenate these dataframes.

If we want to concatenate the dataframes row-wise, we can specify `axis=0`. To concatenate column-wise, we can specify `axis=1`.

In [61]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})

In [62]:
df = cudf.concat([df1, df2], axis=0)
df

<cudf.DataFrame ncols=4 nrows=200 >

In [63]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'e': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'f': np.random.randint(2, size=100).astype(np.int32), 
                      'g': np.arange(0, 100).astype(np.int32), 
                      'h': np.arange(100, 0, -1).astype(np.int32)})

In [64]:
df = cudf.concat([df1, df2], axis=1)
df

<cudf.DataFrame ncols=8 nrows=100 >
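
The effect of `axis` on the resulting shape can be sketched in plain Python, with dictionaries of lists standing in for dataframes. `concat_rows` and `concat_columns` are helpers written for this illustration; they assume matching column names (row-wise) or distinct column names and equal row counts (column-wise).

```python
def concat_rows(frames):
    """axis=0: stack rows; frames share the same column names."""
    names = list(frames[0])
    return {name: [v for frame in frames for v in frame[name]]
            for name in names}

def concat_columns(frames):
    """axis=1: place columns side by side; frames share the same row count."""
    combined = {}
    for frame in frames:
        combined.update(frame)
    return combined

df1 = {'a': list(range(100)), 'b': list(range(100))}
df2 = {'a': list(range(100)), 'b': list(range(100))}
df3 = {'e': list(range(100)), 'f': list(range(100))}

tall = concat_rows([df1, df2])
wide = concat_columns([df1, df3])
print(len(tall), len(tall['a']))  # 2 200
print(len(wide), len(wide['a']))  # 4 100
```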

#### Joins / Merges

Multiple dataframes can be joined together using a single (or multiple) column(s). There are two syntaxes for performing joins:

* One can use the `DataFrame.merge` method and pass in another dataframe to join, or
* One can use the `cudf.merge` function and pass in which dataframes to join.

Both syntaxes also accept an additional keyword argument `on`, a list of column names that specifies which columns the dataframes should be joined on. If this keyword is not specified, cuDF will by default join using the column names that appear in both dataframes.

In [65]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'e': np.arange(0, 100).astype(np.int32), 
                      'f': np.arange(100, 0, -1).astype(np.int32)})

In [66]:
df = df1.merge(df2)
print(df.head())

   a  b  c    d  e    f
0  0  1  0  100  1   99
1  0  0  1   99  0  100
2  0  1  2   98  1   99
3  0  1  3   97  1   99
4  0  1  4   96  1   99


In [67]:
df = df1.merge(df2, on=['a'])
print(df.head())

   a  b_x   c   d  b_y   e   f
0  2    1  64  36    0  64  36
1  2    0  65  35    0  64  36
2  2    1  66  34    0  64  36
3  2    1  67  33    0  64  36
4  2    1  68  32    0  64  36


In [68]:
df = df1.merge(df2, on=['a', 'b'])
print(df.head())

   a  b  c    d  e    f
0  0  1  0  100  1   99
1  0  0  1   99  0  100
2  0  1  2   98  1   99
3  0  1  3   97  1   99
4  0  1  4   96  1   99


In [69]:
df = cudf.merge(df1, df2)
print(df.head())

   a  b  c    d  e    f
0  0  1  0  100  1   99
1  0  0  1   99  0  100
2  0  1  2   98  1   99
3  0  1  3   97  1   99
4  0  1  4   96  1   99


In [70]:
df = cudf.merge(df1, df2, on=['a'])
print(df.head())

   a  b_x   c   d  b_y   e   f
0  2    1  64  36    0  64  36
1  2    0  65  35    0  64  36
2  2    1  66  34    0  64  36
3  2    1  67  33    0  64  36
4  2    1  68  32    0  64  36


In [71]:
df = cudf.merge(df1, df2, on=['a', 'b'])
print(df.head())

   a  b  c    d  e    f
0  0  1  0  100  1   99
1  0  0  1   99  0  100
2  0  1  2   98  1   99
3  0  1  3   97  1   99
4  0  1  4   96  1   99
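
The outputs above show two effects of the `on` argument: joining on `a` alone produces `b_x` and `b_y` columns, and every left row pairs with every right row that shares its key. The nested-loop sketch below reproduces both in plain Python. It is an illustration of inner-join semantics only, not of cuDF's GPU implementation, and it uses a deterministic `b` (`i % 2`) instead of the random column so the row counts are reproducible.

```python
def inner_join(left, right, on):
    """Inner-join two lists of row dicts on the key columns in `on`."""
    joined = []
    for l_row in left:
        for r_row in right:
            if all(l_row[k] == r_row[k] for k in on):
                row = {k: l_row[k] for k in on}
                # Non-key columns present on both sides get a suffix.
                for name, value in l_row.items():
                    if name not in on:
                        row[name + '_x' if name in r_row else name] = value
                for name, value in r_row.items():
                    if name not in on:
                        row[name + '_y' if name in l_row else name] = value
                joined.append(row)
    return joined

left = [{'a': i // 25, 'b': i % 2, 'c': i} for i in range(100)]
right = [{'a': i // 25, 'b': i % 2, 'e': i} for i in range(100)]

on_a = inner_join(left, right, on=['a'])
print(len(on_a), sorted(on_a[0]))  # 2500 ['a', 'b_x', 'b_y', 'c', 'e']

on_ab = inner_join(left, right, on=['a', 'b'])
print(len(on_ab), sorted(on_ab[0]))  # 1252 ['a', 'b', 'c', 'e']
```

With 25 rows per value of `a` on each side, joining on `a` alone yields 4 x 25 x 25 = 2500 rows, which is why joins on low-cardinality keys can grow quickly.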


#### Groupbys

A useful operation when working with datasets is to group the data using a specific key and aggregate the values mapping to those keys. For example, we might want to aggregate multiple temperature measurements taken during a day from a specific sensor and average those measurements to find the average daily temperature at a specific geolocation.

cuDF allows us to perform such an operation using the `groupby` method. This will create an object of type `cudf.groupby.groupby.Groupby` that we can operate on using aggregation functions such as `sum`, `var`, or complex aggregation functions defined by the user.

We can also specify multiple columns to group on by passing a list of column names to the `groupby` method.

In [72]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

   a  b  c    d
0  0  0  0  100
1  0  1  1   99
2  0  0  2   98
3  0  1  3   97
4  0  0  4   96


In [73]:
grouped_df = df.groupby('a')
grouped_df

<cudf.groupby.groupby.Groupby at 0x7fe5b664fda0>

In [74]:
aggregation = grouped_df.sum()
print(aggregation)

    b     c     d
a
0  13   300  2200
1  19   925  1575
2  15  1550   950
3   9  2175   325


In [75]:
aggregation = df.groupby(['a', 'b']).sum().to_pandas()
print(aggregation)

        c     d
a b            
0 0   143  1057
  1   157  1143
1 0   245   355
  1   680  1220
2 0   635   365
  1   915   585
3 0  1408   192
  1   767   133
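
A group-by followed by `sum` amounts to keeping one running total per key. The plain-Python sketch below reproduces the `c` and `d` totals from the single-key aggregation above, using lists for the deterministic columns (column `b` is random and therefore omitted).

```python
from collections import defaultdict

a = [value for value in (0, 1, 2, 3) for _ in range(25)]
c = list(range(0, 100))
d = list(range(100, 0, -1))

# Accumulate one running total per key and per value column.
totals = defaultdict(lambda: {'c': 0, 'd': 0})
for key, c_value, d_value in zip(a, c, d):
    totals[key]['c'] += c_value
    totals[key]['d'] += d_value

for key in sorted(totals):
    print(key, totals[key]['c'], totals[key]['d'])
# 0 300 2200
# 1 925 1575
# 2 1550 950
# 3 2175 325
```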


#### One Hot Encoding

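Data scientists often work with discrete data such as integers or categories. This data can also be represented using a One Hot Encoding format, in which each category gets its own column containing 1 where a row belongs to that category and 0 elsewhere.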

cuDF allows us to convert this discrete data to a One Hot Encoding format using the `one_hot_encoding` method. We can pass this method the column name to convert, a prefix to prepend to each newly created column, and the categories of data to create new columns for. We can pass in all the categories in the discrete data or a subset - cuDF will flexibly handle both and only create new columns for the categories specified.

In [76]:
categories = [0, 1, 2, 3]
df = cudf.DataFrame({'a': np.repeat(categories, 25).astype(np.int32), 
                     'b': np.arange(0, 100).astype(np.int32), 
                     'c': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

   a  b    c
0  0  0  100
1  0  1   99
2  0  2   98
3  0  3   97
4  0  4   96


In [77]:
result = df.one_hot_encoding('a', prefix='a_', cats=categories)
print(result.head())
print(result.tail())

   a  b    c  a__0  a__1  a__2  a__3
0  0  0  100   1.0   0.0   0.0   0.0
1  0  1   99   1.0   0.0   0.0   0.0
2  0  2   98   1.0   0.0   0.0   0.0
3  0  3   97   1.0   0.0   0.0   0.0
4  0  4   96   1.0   0.0   0.0   0.0
   a   b  c  a__0  a__1  a__2  a__3
95  3  95  5   0.0   0.0   0.0   1.0
96  3  96  4   0.0   0.0   0.0   1.0
97  3  97  3   0.0   0.0   0.0   1.0
98  3  98  2   0.0   0.0   0.0   1.0
99  3  99  1   0.0   0.0   0.0   1.0


In [78]:
result = df.one_hot_encoding('a', prefix='a_', cats=[0, 1, 2])
print(result.head())
print(result.tail())

   a  b    c  a__0  a__1  a__2
0  0  0  100   1.0   0.0   0.0
1  0  1   99   1.0   0.0   0.0
2  0  2   98   1.0   0.0   0.0
3  0  3   97   1.0   0.0   0.0
4  0  4   96   1.0   0.0   0.0
   a   b  c  a__0  a__1  a__2
95  3  95  5   0.0   0.0   0.0
96  3  96  4   0.0   0.0   0.0
97  3  97  3   0.0   0.0   0.0
98  3  98  2   0.0   0.0   0.0
99  3  99  1   0.0   0.0   0.0
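
The encoding itself is simple enough to sketch in plain Python. The `one_hot` function below is written for illustration and is not the cuDF method; it joins the prefix and category with an underscore to match the column names in the outputs above.

```python
def one_hot(values, prefix, cats):
    """Return {column_name: [1.0 or 0.0 per row]} for each category in cats."""
    return {f'{prefix}_{cat}': [1.0 if v == cat else 0.0 for v in values]
            for cat in cats}

a = [value for value in (0, 1, 2, 3) for _ in range(25)]

encoded = one_hot(a, prefix='a_', cats=[0, 1, 2])
print(list(encoded))  # ['a__0', 'a__1', 'a__2']

# Row 0 has category 0; row 99 has category 3, which was not requested.
print([column[0] for column in encoded.values()])   # [1.0, 0.0, 0.0]
print([column[99] for column in encoded.values()])  # [0.0, 0.0, 0.0]
```

Rows whose category is left out of `cats` end up with zeros in every new column, as the last five rows of the final output show.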


<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames in RAPIDS.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)