<a id="introduction"></a>
## Introduction to cuDF
#### By Paul Hendricks
-------

In this notebook, we will show how to work with cuDF DataFrames in RAPIDS.

**Table of Contents**

* [Introduction to cuDF](#introduction)
* [Setup](#setup)
* [cuDF Series Basics](#series)
* [cuDF DataFrame Basics](#dataframes)
* [Input/Output](#io)
* [cuDF API](#cudfapi)
* [Conclusion](#conclusion)

<a id="setup"></a>
## Setup

This notebook was tested using the following Docker containers:

* `rapidsai/rapidsai-dev-nightly:0.12-cuda10.0-runtime-ubuntu16.04-py3.6` container from [DockerHub](https://hub.docker.com/r/rapidsai/rapidsai-nightly)

This notebook was run on the NVIDIA GV100 GPU. Please be aware that your system may be different and you may need to modify the code or install packages to run the examples below.

If you think you have found a bug or an error, please file an issue here: https://github.com/rapidsai/notebooks-contrib/issues

Before we begin, let's check out our hardware setup by running the `nvidia-smi` command.

In [1]:
!nvidia-smi

Tue Aug  1 02:50:03 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            On   | 00000000:3B:00.0 Off |                    0 |
| N/A   49C    P0    29W /  70W |   5970MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:5E:00.0 Off |                    0 |
| N/A   58C    P0    29W /  70W |   2836MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:AF:00.0 Off |                    0 |
| N/A   51C    P0    29W /  70W |   2836MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:D8:00.0 Off |                    0 |
| N/A   47C    P0    28W /  70W |   2836MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
+-----------------------------------------------------------------------------+


Next, let's see what CUDA version we have:

In [2]:
!nvcc --version

/usr/bin/sh: 1: nvcc: not found


<a id="series"></a>
## cuDF Series Basics

First, let's load the cuDF library.

In [3]:
import cudf; print('cuDF Version:', cudf.__version__)

--------------------------------------------------------------------------------

  CuPy may not function correctly because multiple CuPy packages are installed
  in your environment:

    cupy, cupy-cuda11x

  Follow these steps to resolve this issue:

    1. For all packages listed above, run the following command to remove all
       existing CuPy installations:

         $ pip uninstall <package_name>

      If you previously installed CuPy via conda, also run the following:

         $ conda uninstall cupy

    2. Install the appropriate CuPy package.
       Refer to the Installation Guide for detailed instructions.

         https://docs.cupy.dev/en/stable/install.html

--------------------------------------------------------------------------------



ImportError: /opt/conda/lib/python3.9/site-packages/cudf/_lib/../libcudf.so: undefined symbol: _ZN5arrow5fieldESsSt10shared_ptrINS_8DataTypeEEbS0_IKNS_16KeyValueMetadataEE

There are two main data structures in cuDF: a `Series` object and a `DataFrame` object. Multiple `Series` objects are used as columns for a `DataFrame`. We'll first explore the `Series` class and build upon that foundation to later introduce how to work with objects of type `DataFrame`.

We can create a `Series` object using the `cudf.Series` class.

In [4]:
column = cudf.Series([10, 11, 12, 13])
column

NameError: name 'cudf' is not defined

We see from the output that `column` is an object of type `cudf.Series` and has 4 rows.

Another way to inspect a `Series` is to use the Python `print` function.

In [5]:
print(column)

NameError: name 'column' is not defined

We see that our `Series` object has four rows with values 10, 11, 12, and 13. We also see that the type of this data is `int64`. cuDF supports several data types; the most common are `int8`, `int32`, `int64`, `float32`, and `float64`.

We also see a column of values on the left-hand side with values 0, 1, 2, 3. These values represent the index of the `Series`.

In [6]:
print(column.index)

NameError: name 'column' is not defined

We can change the index of the `Series` by setting the `index` property.

In [7]:
column.index = [5, 6, 7, 8] 
column

NameError: name 'column' is not defined

Indexes are useful for operations like joins and groupbys.
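
The `Series` steps above can be sketched end to end. This is a minimal sketch, not the notebook's own output: the alias `xdf` is an illustrative name, and the block assumes it is acceptable to fall back to pandas when cuDF cannot be imported, relying on the API parity described in this notebook.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas,
# which exposes the same calls used below. `xdf` is an illustrative alias.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

# Build a Series from a Python list; integer lists are stored as int64.
column = xdf.Series([10, 11, 12, 13])
print(column.dtype)

# Replace the default 0..3 index with custom labels.
column.index = [5, 6, 7, 8]

# Look up a value by its index label rather than by position.
value_at_6 = int(column.loc[6])
print(value_at_6)
```

After the index is replaced, `column.loc[6]` refers to the second row, because `loc` works with labels rather than positions.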

<a id="dataframes"></a>
## cuDF DataFrame Basics

As we showed in the previous tutorial, cuDF DataFrames are tabular data structures that reside on the GPU. We interface with cuDF DataFrames in the same way we interface with Pandas DataFrames that reside on the CPU - with a few deviations.

In the next several sections, we'll show how to create and manipulate cuDF DataFrames. For more information on using cuDF DataFrames, check out the documentation: https://docs.rapids.ai/api/cudf/stable/

#### Creating a cuDF DataFrame using lists

There are several ways to create a cuDF DataFrame. The easiest of these is to instantiate an empty cuDF DataFrame and then use Python list objects or NumPy arrays to create columns. Below, we create an empty cuDF DataFrame.

In [8]:
df = cudf.DataFrame()
print(df)

NameError: name 'cudf' is not defined

Next, we can create two columns named `key` and `value` by using the bracket notation with the cuDF DataFrame and storing either a list of Python values or a NumPy array into that column.

In [9]:
import numpy as np; print('NumPy Version:', np.__version__)


# here we create two columns named "key" and "value"
df['key'] = [0, 1, 2, 3, 4]
df['value'] = np.arange(10, 15)
print(df)

NumPy Version: 1.24.4


NameError: name 'df' is not defined

#### Creating a cuDF DataFrame using a list of tuples or a dictionary

Another way we can create a cuDF DataFrame is by providing a mapping of column names to column values, either via a list of tuples or by using a dictionary. In the examples below, each column pairs a name - for example, `ids` or `timestamp` - with a list of Python objects or a NumPy array. Note that we don't have to constrain the data stored in our cuDF DataFrames to common data types like integers or floats - we can use more exotic data types such as datetimes or strings. We'll investigate how such data types behave on the GPU a bit later.

In [10]:
from datetime import datetime, timedelta


ids = np.arange(5)
t0 = datetime.strptime('2018-10-07 12:00:00', '%Y-%m-%d %H:%M:%S')
timestamps = [t0 + timedelta(seconds=x) for x in range(5)]
timestamps_np = np.array(timestamps, dtype='datetime64')

In [11]:
df = cudf.DataFrame()
df['ids'] = ids
df['timestamp'] = timestamps_np
print(df)

NameError: name 'cudf' is not defined

Alternatively, we can create a dictionary of key-value pairs, where each key in the dictionary represents a column name and each value associated with the key represents the values that belong in that column.

In [12]:
df = cudf.DataFrame({'id': ids, 'timestamp': timestamps_np})
print(df)

NameError: name 'cudf' is not defined

#### Creating a cuDF DataFrame from a Pandas DataFrame

Pandas DataFrames are first-class citizens within cuDF - this means that we can create a cuDF DataFrame from a Pandas DataFrame and vice versa.

In [13]:
import pandas as pd; print('Pandas Version:', pd.__version__)


pandas_df = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                          'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]})
print(pandas_df)

Pandas Version: 1.5.3
     a    b
0    0  0.0
1    1  0.1
2    2  0.2
3    3  NaN
4    4  0.4
5    5  0.5
6    6  0.6
7    7  0.7
8    8  0.8
9    9  0.9
10  10  1.0


We can use the `cudf.from_pandas` or `cudf.DataFrame.from_pandas` functions to create a cuDF DataFrame from a Pandas DataFrame.

In [14]:
df = cudf.from_pandas(pandas_df)
# df = cudf.DataFrame.from_pandas(pandas_df)  # alternative
print(df)

NameError: name 'cudf' is not defined

#### Creating a cuDF DataFrame from cuDF Series

We can create a cuDF DataFrame from one or more cuDF Series objects by passing the Series objects in a dictionary mapping each Series object to a column name.

In [15]:
column1 = cudf.Series([1, 2, 3, 4])
column2 = cudf.Series([5, 6, 7, 8])
column3 = cudf.Series([9, 10, 11, 12])
df = cudf.DataFrame({'a': column1, 'b': column2, 'c': column3})
print(df)

NameError: name 'cudf' is not defined

#### Inspecting a cuDF DataFrame

There are several ways to inspect a cuDF DataFrame. The first method is to enter the cuDF DataFrame directly into the REPL. This shows us information about the type of the object, and metadata such as the number of rows or columns.

In [16]:
df = cudf.DataFrame({'a': np.arange(0, 100), 'b': np.arange(100, 0, -1)})

NameError: name 'cudf' is not defined

In [17]:
df

NameError: name 'df' is not defined

A second way to inspect a cuDF DataFrame is to pass the object to the Python `print` function. This shows the rows and columns of the dataframe.

In [18]:
print(df)

NameError: name 'df' is not defined

For very large dataframes, we often want to see just the first few rows. We can use the `head` method of a cuDF DataFrame to view the first N rows (five by default).

In [19]:
print(df.head())

NameError: name 'df' is not defined

#### Columns

cuDF DataFrames store metadata such as information about columns or data types. We can access the columns of a cuDF DataFrame using the `.columns` attribute.

In [20]:
print(df.columns)

NameError: name 'df' is not defined

We can modify the columns of a cuDF DataFrame by modifying the `columns` attribute. We can do this by setting that attribute equal to a list of strings representing the new columns.

In [21]:
df.columns = ['c', 'd']
print(df.columns)

NameError: name 'df' is not defined

#### Data Types

We can also inspect the data types of the columns of a cuDF DataFrame using the `dtypes` attribute.

In [22]:
print(df.dtypes)

NameError: name 'df' is not defined

We can modify the data types of the columns of a cuDF DataFrame by passing in a cuDF Series with a modified data type. Be warned that silent errors may be introduced by lossy type conversions - for example, changing a float to an integer or vice versa.

In [23]:
df['c'] = df['c'].astype(np.float32)
df['d'] = df['d'].astype(np.int32)
print(df.dtypes)

NameError: name 'df' is not defined
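
To make the warning about silent conversion errors concrete, here is a small sketch of a float-to-integer cast. The `xdf` alias, the `to_list` helper, and the sample values are illustrative assumptions; the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf
import numpy as np


def to_list(series):
    """Return the values as a plain Python list (copying to host for cuDF)."""
    if hasattr(series, "to_pandas"):
        return series.to_pandas().tolist()
    return series.tolist()


prices = xdf.Series([1.9, 2.5, -1.9])

# Casting to an integer type drops the fractional part without any warning.
as_int = prices.astype(np.int32)
print(to_list(as_int))

# Casting back to float does not restore the lost information.
round_trip = as_int.astype(np.float64)
print(to_list(round_trip))
```

No exception is raised at any point, which is why it is worth checking `dtypes` before and after a conversion.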

#### Series

cuDF DataFrames are composed of rows and columns. Each column is represented using an object of type `Series`. For example, if we select just one column from a cuDF DataFrame, we get back an object of type `cudf.Series`.

In [24]:
print(type(df['c']))
print(df['c'])

NameError: name 'df' is not defined

#### Index

Like `Series` objects, each `DataFrame` has an index attribute.

In [25]:
df.index

NameError: name 'df' is not defined

We can use the index values to subset the `DataFrame`.

In [26]:
print(df[df.index == 2])

NameError: name 'df' is not defined

#### Converting a cuDF DataFrame to a Pandas DataFrame

We can convert a cuDF DataFrame back to a Pandas DataFrame using the `to_pandas` method.

In [27]:
pandas_df = df.to_pandas()
print(type(pandas_df))

NameError: name 'df' is not defined

#### Converting a cuDF DataFrame to a NumPy Array

Often we want to work with NumPy arrays. We can convert a cuDF DataFrame to a NumPy array by first converting it to a Pandas DataFrame using the `to_pandas` method followed by accessing the `values` attribute of the Pandas DataFrame.

In [28]:
numpy_array = df.to_pandas().values
print(type(numpy_array))

NameError: name 'df' is not defined

#### Converting a cuDF DataFrame to Other Data Formats

We can also convert a cuDF DataFrame to other data formats. 

For more information, see the documentation: https://docs.rapids.ai/api/cudf/stable/

<a id="io"></a>
## Input/Output

Before we process data and use it in machine learning models, we need to be able to load it into memory and write it after we're done using it. There are several ways to do this using cuDF.

#### Writing and Loading CSV Files

We can write a cuDF DataFrame to a CSV file using the `to_csv()` method.

In [29]:
df.to_csv('./dataset.csv', index=False)

NameError: name 'df' is not defined

Perhaps one of the most common ways to create cuDF DataFrames is by loading a table that is stored as a file on disk. cuDF provides a lot of functionality for reading in a variety of different data formats. Below, we show how easy it is to read in a CSV file:

In [30]:
df = cudf.read_csv('./dataset.csv')
print(df)

NameError: name 'cudf' is not defined
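
The write-then-read round trip can be sketched as follows. This is only an illustration: it writes to a temporary directory rather than `./dataset.csv`, the `xdf` alias and `to_list` helper are hypothetical names, and the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf
import os
import tempfile


def to_list(series):
    """Return the values as a plain Python list (copying to host for cuDF)."""
    if hasattr(series, "to_pandas"):
        return series.to_pandas().tolist()
    return series.tolist()


df = xdf.DataFrame({'a': [0, 1, 2], 'b': [0.5, 1.5, 2.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'dataset.csv')
    # index=False keeps the row index out of the file.
    df.to_csv(path, index=False)
    restored = xdf.read_csv(path)

print(restored)
```

Passing `index=False` matters here: without it the index is written as an extra unnamed column, which then shows up when the file is read back.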

CSV files come in many flavors and cuDF tries to be as flexible as possible, mirroring the Pandas API wherever possible. For more information on possible parameters for working with files, see the cuDF IO documentation: 

https://docs.rapids.ai/api/cudf/stable/api.html?highlight=read_csv#cudf.io.csv.read_csv

<a id="cudfapi"></a>
## cuDF API

The cuDF API is pleasantly simple and mirrors the Pandas API as closely as possible. In this section, we will explore the cuDF API and show how to perform common data manipulation operations.

#### Selecting Rows or Columns

We can select rows from a cuDF DataFrame using slicing syntax. 

In [31]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32)})

NameError: name 'cudf' is not defined

In [32]:
print(df[0:5])

NameError: name 'df' is not defined

There are several ways to select a column from a cuDF DataFrame.

In [33]:
print(df['a'])
# print(df.a)  # alternative

NameError: name 'df' is not defined

We can also select multiple columns by passing in a list of column names.

In [34]:
print(df[['a', 'b']])

NameError: name 'df' is not defined

We can select specific rows and columns using the slicing syntax as well as passing in a list of column names.

In [35]:
print(df.loc[0:5, ['a']])
# print(df.loc[0:5, ['a', 'b']])  # to select multiple columns, pass in multiple column names

NameError: name 'df' is not defined
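
One detail worth checking when mixing the two selection styles is how the end of the slice is treated. The sketch below assumes cuDF follows the pandas convention here (bracket slicing is positional and excludes the end, `loc` slicing is label-based and includes it); the `xdf` alias is illustrative and the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf
import numpy as np

df = xdf.DataFrame({'a': np.arange(10), 'b': np.arange(10, 0, -1)})

# Bracket slicing is positional: rows 0, 1, 2, 3, 4.
first_rows = df[0:5]

# loc slicing uses index labels and includes the end label: rows 0 through 5.
by_label = df.loc[0:5, ['a']]

print(len(first_rows), len(by_label))
```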

#### Defining New Columns

We often want to define new columns from existing columns.

In [36]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

NameError: name 'cudf' is not defined

In [37]:
df['d'] = np.arange(200, 300).astype(np.float32)

print(df)

NameError: name 'df' is not defined

In [38]:
data = np.arange(300, 400).astype(np.float32)
# df.add_column('e', data)
df['e'] = data
print(df)

NameError: name 'df' is not defined

#### Dropping Columns

Alternatively, we may want to remove columns from our `DataFrame`. We can do so using the `drop` method with `inplace=True`. Note that this removes a column in-place - meaning that the `DataFrame` we act on will be modified.

In [39]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

NameError: name 'cudf' is not defined

In [40]:
# df.drop_column('a')
# inplace=True modifies df itself instead of returning a new DataFrame
df.drop(['a'], axis=1, inplace=True)
print(df)

NameError: name 'df' is not defined

If we want to remove a column without modifying the original DataFrame, we can call the `drop` method with its default `inplace=False`. This returns a new DataFrame without that column (or columns).

In [41]:
df = cudf.DataFrame({'a': np.arange(0, 100).astype(np.float32), 
                     'b': np.arange(100, 0, -1).astype(np.float32), 
                     'c': np.arange(100, 200).astype(np.float32)})

NameError: name 'cudf' is not defined

In [42]:
# new_df = df.drop('a')
new_df = df.drop(['a'], axis=1)
print('Original DataFrame:')
print(df)
print(79 * '-')
print('New DataFrame:')
print(new_df)

NameError: name 'df' is not defined

We can also pass in a list of column names to drop.

In [43]:
# new_df = df.drop(['a', 'b'])
new_df = df.drop(['a', 'b'], axis=1)

print('Original DataFrame:')
print(df)
print(79 * '-')
print('New DataFrame:')
print(new_df)

NameError: name 'df' is not defined
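
The difference between dropping in place and returning a new DataFrame can be sketched on a tiny frame. This assumes the `inplace` keyword of `drop` as documented for pandas, which cuDF's `drop` also accepts; the `xdf` alias is illustrative and the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

df = xdf.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Default behaviour: df is untouched and a new DataFrame is returned.
new_df = df.drop(['a'], axis=1)
columns_after_copy = list(df.columns)

# inplace=True: df itself loses the column and nothing useful is returned.
df.drop(['b'], axis=1, inplace=True)
columns_after_inplace = list(df.columns)

print(list(new_df.columns), columns_after_copy, columns_after_inplace)
```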

#### Missing Data

Sometimes data is not as clean as we would like it - often there are wrong values or values that are missing entirely. cuDF DataFrames can represent missing values using the Python `None` object.

In [44]:
df = cudf.DataFrame({'a': [0, None, 2, 3, 4, 5, 6, 7, 8, None, 10],
                     'b': [0.0, 0.1, 0.2, None, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], 
                     'c': [0.0, 0.1, None, None, 0.4, 0.5, None, 0.7, 0.8, 0.9, 1.0]})
print(df)

NameError: name 'cudf' is not defined

We can also fill in these missing values with another value using the `fillna` method. Both `Series` and `DataFrame` objects implement this method.

In [45]:
df['c'] = df['c'].fillna(999)
print(df)

NameError: name 'df' is not defined

In [46]:
new_df = df.fillna(-1)
print(new_df)

NameError: name 'df' is not defined
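
A small sketch of both `fillna` variants, with a count of missing values before and after. The `xdf` alias and the sample data are illustrative, and the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

# None marks a missing value in each column.
df = xdf.DataFrame({'a': [0, None, 2], 'c': [0.0, None, 1.0]})
missing_before = int(df['a'].isna().sum())

# Series.fillna fills a single column ...
filled_c = df['c'].fillna(999)

# ... while DataFrame.fillna fills every column and returns a new DataFrame.
filled_df = df.fillna(-1)
missing_after = int(filled_df['a'].isna().sum())

print(missing_before, missing_after)
```

Neither call modifies `df` itself, so the result has to be assigned, as the cells above do with `df['c'] = ...` and `new_df = ...`.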

#### Boolean Indexing

We previously saw how we can select certain rows from our dataset by using the bracket `[]` notation. However, we may want to select rows based on certain criteria - this is called boolean indexing. We can combine the indexing notation with an array of boolean values to select only the rows that meet these criteria.

In [47]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

NameError: name 'cudf' is not defined

In [48]:
mask = df['a'] == 3
df[mask]

NameError: name 'df' is not defined
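
Masks can also be combined before indexing. The sketch below is an illustration on a smaller frame, assuming the element-wise `&` operator on boolean Series as in pandas; the `xdf` alias is hypothetical and the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf
import numpy as np

df = xdf.DataFrame({'a': np.repeat([0, 1, 2, 3], 3),  # 0,0,0,1,1,1,...
                    'c': np.arange(12)})              # 0..11

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds tighter than == and >.
mask = (df['a'] == 3) & (df['c'] > 9)
selected = df[mask]

print(len(selected))
```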

#### Sorting Data

Data is often not sorted before we start to work with it. Sorting data is very useful for optimizing operations like joins and aggregations, especially when the data is distributed.

We can sort data in cuDF using the `sort_values` method and passing in which column we want to sort by. 

In [49]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

NameError: name 'cudf' is not defined

In [50]:
print(df.sort_values('d').head())

NameError: name 'df' is not defined

We can also specify if the column we're sorting should be sorted in ascending or descending order by using the `ascending` argument and passing in `True` or `False`.

In [51]:
print(df.sort_values('c', ascending=False).head())

NameError: name 'df' is not defined

We can sort by multiple columns by passing in a list of column names. 

In [52]:
print(df.sort_values(['a', 'b']).head())

NameError: name 'df' is not defined

We can also specify which of those columns should be sorted in ascending or descending order by passing in a list of boolean values, where each boolean value maps to each column, respectively.

In [53]:
print('Sort with all columns specified descending:')
print(df.sort_values(['a', 'b'], ascending=False).head())
print(79 * '-')
print('Sort with both a descending and b ascending:')
print(df.sort_values(['a', 'b'], ascending=[False, True]).head())

Sort with all columns specified descending:


NameError: name 'df' is not defined
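
Because the cell above produced no output, here is a sketch of the mixed-direction sort on four rows small enough to check by hand. The `xdf` alias, the `to_list` helper, and the data are illustrative; the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf


def to_list(series):
    """Return the values as a plain Python list (copying to host for cuDF)."""
    if hasattr(series, "to_pandas"):
        return series.to_pandas().tolist()
    return series.tolist()


df = xdf.DataFrame({'a': [1, 0, 1, 0], 'b': [0, 1, 1, 0]})

# Sort by a descending first, then break ties by b ascending.
ordered = df.sort_values(['a', 'b'], ascending=[False, True])

print(to_list(ordered['a']), to_list(ordered['b']))
```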

#### Statistical Operations

There are several statistical operations we can use to aggregate our data in meaningful ways. These can be applied to both `Series` and `DataFrame` objects.

In [54]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

NameError: name 'cudf' is not defined

In [55]:
df['a'].sum()

NameError: name 'df' is not defined

In [56]:
print(df.sum())

NameError: name 'df' is not defined

#### Series Apply Operations

While cuDF allows us to define new columns in interesting ways, we often want to work with more complex functions. We can define a function and use the `apply` method to apply this function to each value in a column in element-wise fashion. While the below example is simple, it can be very easily extended to more complex workflows.

In [57]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

NameError: name 'cudf' is not defined

In [58]:
def add_ten_to_x(x):
    return x + 10

print(df['c'].apply(add_ten_to_x))

NameError: name 'df' is not defined

#### Histogramming

We can access the value counts of a column using the `value_counts` method. Note that this is typically used with columns representing discrete data, i.e. integers, strings, categoricals, etc. We are usually less interested in the value counts of continuous numerical data, e.g. how often the value 2.1 appears. The results of the `value_counts` method can be used with Python plotting libraries like Matplotlib or Seaborn to generate visualizations such as histograms.

In [59]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})

NameError: name 'cudf' is not defined

In [60]:
result = df['a'].value_counts()
print(result)

NameError: name 'df' is not defined

#### Concatenations

In everyday data science, we typically work with multiple sources of data and wish to combine these data into a single more meaningful representation. These operations are often called concatenations and joins. We can concatenate two or more dataframes together row-wise or column-wise by passing in a list of the dataframes to be concatenated into the `cudf.concat` function and specifying the axis along which to concatenate these dataframes.

If we want to concatenate the dataframes row-wise, we can specify `axis=0`. To concatenate column-wise, we can specify `axis=1`.

In [61]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})

NameError: name 'cudf' is not defined

In [62]:
df = cudf.concat([df1, df2], axis=0)
df

NameError: name 'cudf' is not defined

In [63]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'e': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'f': np.random.randint(2, size=100).astype(np.int32), 
                      'g': np.arange(0, 100).astype(np.int32), 
                      'h': np.arange(100, 0, -1).astype(np.int32)})

NameError: name 'cudf' is not defined

In [64]:
df = cudf.concat([df1, df2], axis=1)
df

NameError: name 'cudf' is not defined
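
The effect of the `axis` argument is easiest to see from the resulting shapes. This is a minimal sketch on two-row frames; the `xdf` alias and frame names are illustrative, and the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

df1 = xdf.DataFrame({'a': [1, 2], 'b': [3, 4]})
df2 = xdf.DataFrame({'a': [5, 6], 'b': [7, 8]})
df3 = xdf.DataFrame({'c': [9, 10], 'd': [11, 12]})

# axis=0 stacks rows: same columns, twice as many rows.
rows = xdf.concat([df1, df2], axis=0)

# axis=1 places frames side by side, matching rows on the index.
cols = xdf.concat([df1, df3], axis=1)

print(rows.shape, cols.shape)
```

With `axis=0` the original index labels are kept, so `rows` has the labels 0, 1, 0, 1; passing `ignore_index=True` to `concat` renumbers them.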

#### Joins / Merges

Multiple dataframes can be joined together using a single (or multiple) column(s). There are two syntaxes for performing joins:

* One can use the `DataFrame.merge` method and pass in another dataframe to join, or
* One can use the `cudf.merge` function and pass in which dataframes to join.

Both syntaxes also accept a list of column names through the keyword argument `on`, which specifies the columns the dataframes should be joined on. If this keyword is not specified, cuDF will by default join using the column names that appear in both dataframes.

In [65]:
df1 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'c': np.arange(0, 100).astype(np.int32), 
                      'd': np.arange(100, 0, -1).astype(np.int32)})
df2 = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                      'b': np.random.randint(2, size=100).astype(np.int32), 
                      'e': np.arange(0, 100).astype(np.int32), 
                      'f': np.arange(100, 0, -1).astype(np.int32)})

NameError: name 'cudf' is not defined

In [66]:
df = df1.merge(df2)
print(df.head())

NameError: name 'df1' is not defined

In [67]:
df = df1.merge(df2, on=['a'])
print(df.head())

NameError: name 'df1' is not defined

In [68]:
df = df1.merge(df2, on=['a', 'b'])
print(df.head())

NameError: name 'df1' is not defined

In [69]:
df = cudf.merge(df1, df2)
print(df.head())

NameError: name 'cudf' is not defined

In [70]:
df = cudf.merge(df1, df2, on=['a'])
print(df.head())

NameError: name 'cudf' is not defined

In [71]:
df = cudf.merge(df1, df2, on=['a', 'b'])
print(df.head())

NameError: name 'cudf' is not defined
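
How the choice of `on` changes the result can be sketched with three-row frames. The sketch assumes the pandas defaults that cuDF mirrors (an inner join, with `_x` and `_y` suffixes for overlapping non-key columns); the `xdf` alias and the data are illustrative, and the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

left = xdf.DataFrame({'a': [0, 1, 2], 'b': [0, 1, 0], 'c': [10, 11, 12]})
right = xdf.DataFrame({'a': [0, 1, 3], 'b': [0, 0, 1], 'e': [20, 21, 22]})

# No `on`: join on every shared column (a and b). Only (a=0, b=0) matches.
on_shared = left.merge(right)

# on=['a']: join on a only. b now appears once per side, with suffixes.
on_a = left.merge(right, on=['a'])

print(len(on_shared), len(on_a))
print(sorted(on_a.columns))
```

The row order of a merge result is not something to rely on, so the sketch only compares row counts and column names.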

#### Groupbys

A useful operation when working with datasets is to group the data using a specific key and aggregate the values mapping to those keys. For example, we might want to aggregate multiple temperature measurements taken during a day from a specific sensor and average those measurements to find the average daily temperature at a specific geolocation.

cuDF allows us to perform such an operation using the `groupby` method. This creates a groupby object (a `DataFrameGroupBy` in recent cuDF releases) that we can operate on using aggregation functions such as `sum`, `var`, or complex aggregation functions defined by the user.

We can also specify multiple columns to group on by passing a list of column names to the `groupby` method.

In [72]:
df = cudf.DataFrame({'a': np.repeat([0, 1, 2, 3], 25).astype(np.int32), 
                     'b': np.random.randint(2, size=100).astype(np.int32), 
                     'c': np.arange(0, 100).astype(np.int32), 
                     'd': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

NameError: name 'cudf' is not defined

In [73]:
grouped_df = df.groupby('a')
grouped_df

NameError: name 'df' is not defined

In [74]:
aggregation = grouped_df.sum()
print(aggregation)

NameError: name 'grouped_df' is not defined

In [75]:
aggregation = df.groupby(['a', 'b']).sum().to_pandas()
print(aggregation)

NameError: name 'df' is not defined
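
The temperature example from the start of this section can be sketched directly. The column names `sensor` and `temp`, the readings, the `xdf` alias, and the `to_list` helper are all made up for illustration; the block falls back to pandas when cuDF cannot be imported.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf


def to_list(series):
    """Return the values as a plain Python list (copying to host for cuDF)."""
    if hasattr(series, "to_pandas"):
        return series.to_pandas().tolist()
    return series.tolist()


# Hypothetical readings: two from sensor 0, three from sensor 1.
df = xdf.DataFrame({'sensor': [0, 0, 1, 1, 1],
                    'temp': [20.0, 22.0, 30.0, 31.0, 35.0]})

# Group by sensor, average each group, and sort by the group key so the
# output order does not depend on the library's default.
daily_mean = df.groupby('sensor').mean().sort_index()

print(to_list(daily_mean['temp']))
```

The explicit `sort_index()` is a deliberate choice: it avoids relying on whether the groupby result comes back ordered by key.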

#### One Hot Encoding

Data scientists often work with discrete data such as integers or categories. This data can also be represented in a one-hot encoding format.

cuDF allows us to convert this discrete data to a one-hot encoding format using the `cudf.get_dummies` function. We can pass this function the DataFrame, the name of the column to convert, a prefix to prepend to each newly created column, and the categories of data to create new columns for. We can pass in all the categories in the discrete data or a subset - cuDF will flexibly handle both and only create new columns for the categories specified.

In [76]:
categories = [0, 1, 2, 3]
df = cudf.DataFrame({'a': np.repeat(categories, 25).astype(np.int32), 
                     'b': np.arange(0, 100).astype(np.int32), 
                     'c': np.arange(100, 0, -1).astype(np.int32)})
print(df.head())

NameError: name 'cudf' is not defined

In [77]:
# assumption: recent cuDF releases expect `cats` as a dict of column name -> categories
result = cudf.get_dummies(df, columns=['a'], prefix='a_', cats={'a': categories})
print(result.head())
print(result.tail())

NameError: name 'cudf' is not defined

In [78]:
# assumption: recent cuDF releases expect `cats` as a dict of column name -> categories
result = cudf.get_dummies(df, columns=['a'], prefix='a_', cats={'a': [0, 1, 2]})
print(result.head())
print(result.tail())

NameError: name 'cudf' is not defined
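
What the encoded frame looks like can be sketched on four rows. The sketch leaves out the cuDF-specific `cats` argument so that the same call also runs on pandas, which is the fallback when cuDF cannot be imported; the `xdf` alias and the data are illustrative.

```python
# Assumption: use cuDF when it imports cleanly, otherwise fall back to pandas.
try:
    import cudf as xdf
except Exception:
    import pandas as xdf

df = xdf.DataFrame({'a': [0, 1, 2, 3], 'b': [10, 11, 12, 13]})

# Column a is replaced by one indicator column per distinct value,
# named <prefix>_<value>; column b is carried over unchanged.
encoded = xdf.get_dummies(df, columns=['a'], prefix='a')

print(sorted(encoded.columns))

# Each indicator column marks exactly one of the four rows.
hits_for_2 = int(encoded['a_2'].astype('int32').sum())
print(hits_for_2)
```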

<a id="conclusion"></a>
## Conclusion

In this notebook, we showed how to work with cuDF DataFrames in RAPIDS.

To learn more about RAPIDS, be sure to check out: 

* [Open Source Website](http://rapids.ai)
* [GitHub](https://github.com/rapidsai/)
* [Press Release](https://nvidianews.nvidia.com/news/nvidia-introduces-rapids-open-source-gpu-acceleration-platform-for-large-scale-data-analytics-and-machine-learning)
* [NVIDIA Blog](https://blogs.nvidia.com/blog/2018/10/10/rapids-data-science-open-source-community/)
* [Developer Blog](https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/)
* [NVIDIA Data Science Webpage](https://www.nvidia.com/en-us/deep-learning-ai/solutions/data-science/)