# RAPIDS cuDF

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ritchieng/deep-learning-wizard/blob/master/docs/machine_learning/gpu/rapids_cudf.ipynb)

## Environment Setup

### Check Version

#### Python Version

In [1]:
# Check Python Version
!python --version

Python 3.8.16


#### Ubuntu Version

In [2]:
# Check Ubuntu Version
!lsb_release -a

No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.6 LTS
Release:	18.04
Codename:	bionic


#### Check CUDA Version

In [3]:
# Check CUDA compiler (nvcc) version and install location
!nvcc -V && which nvcc

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
/usr/local/cuda/bin/nvcc


#### Check GPU Version

In [4]:
# Check GPU
!nvidia-smi

Wed Jan  4 19:14:22 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    29W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
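
The checks above shell out from notebook cells. A minimal sketch of gathering the same information from Python, using only the standard library, might look like the following. The helper names `tool_output` and `collect_environment` are hypothetical, and the sketch assumes a missing tool should be reported as `None` instead of raising an error.

```python
import platform
import shutil
import subprocess


def tool_output(command):
    """Return a command's stripped output, or None if the tool is not on PATH."""
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(command, capture_output=True, text=True, check=False)
    return (result.stdout or result.stderr).strip()


def collect_environment():
    """Gather the Python, CUDA compiler, and GPU details into one dictionary."""
    return {
        "python": platform.python_version(),
        "nvcc": tool_output(["nvcc", "-V"]),
        "gpu": tool_output(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]
        ),
    }


for name, value in collect_environment().items():
    print(f"{name}: {value if value is not None else 'not found'}")
```

On a machine without the CUDA toolkit or an NVIDIA driver, the `nvcc` and `gpu` entries are simply reported as not found.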

### Setup
This setup script:

1. Checks to make sure that the GPU is RAPIDS compatible
1. Installs the **current stable version** of RAPIDSAI's core libraries using pip, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. xgboost

**This will complete in about 3-4 minutes**

Please use the [RAPIDS Conda Colab Template notebook](https://colab.research.google.com/drive/1TAAi_szMfWqRfHVfjGSqnGVLr_ztzUM9) if you need to install any of the RAPIDS extended libraries, such as:
- cuSpatial
- cuSignal
- cuxFilter
- cuCIM

OR
- nightly versions of any library 

In [5]:
# This gets the RAPIDS-Colab install files and checks your GPU. Run this and the next cell only.
# Please read the output of this cell. If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 328, done.
remote: Counting objects: 100% (157/157), done.
remote: Compressing objects: 100% (102/102), done.
remote: Total 328 (delta 92), reused 98 (delta 55), pack-reused 171
Receiving objects: 100% (328/328), 94.64 KiB | 18.93 MiB/s, done.
Resolving deltas: 100% (154/154), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pynvml
  Downloading pynvml-11.4.1-py3-none-any.whl (46 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.0/47.0 KB 6.1 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.4.1
***********************************************************************
Woo! Your instance has the right kind of GPU, a Tesla T4!
We will now install RAPIDS via pip!  Please stand by, should be quick...
***********************************************************************

Looking in indexes: https://pypi.org/

## Critical Imports

In [6]:
# Critical imports
import cudf
import cuml
import os
import numpy as np
import pandas as pd

## Creating

### Create a Series of integers

In [7]:
gdf = cudf.Series([1, 2, 3, 4, 5, 6])
print(gdf)
print(type(gdf))

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
<class 'cudf.core.series.Series'>


### Create a Series of floats

In [8]:
gdf = cudf.Series([1., 2., 3., 4., 5., 6.])
print(gdf)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64


### Create a Series of strings


In [9]:
gdf = cudf.Series(['a', 'b', 'c'])
print(gdf)

0    a
1    b
2    c
dtype: object


### Create a 3 Column DataFrame
- Consisting of dates, integers and floats

In [10]:
# Import
import datetime as dt

# Using a dictionary of key-value pairs
# Each key in the dictionary represents a column
# The key is the column's name
# The value is a list of the values in that column
gdf = cudf.DataFrame({
    # Create 10 business dates from 1st January 2019 via pandas
    'dates': pd.date_range('1/1/2019', periods=10, freq='B'),
    # Integers
    'integers': [i for i in range(10)],
    # Floats
    'floats': [float(i) for i in range(10)]
})

# Print dataframe
print(gdf)

       dates  integers  floats
0 2019-01-01         0     0.0
1 2019-01-02         1     1.0
2 2019-01-03         2     2.0
3 2019-01-04         3     3.0
4 2019-01-07         4     4.0
5 2019-01-08         5     5.0
6 2019-01-09         6     6.0
7 2019-01-10         7     7.0
8 2019-01-11         8     8.0
9 2019-01-14         9     9.0
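
To confirm what each dictionary key became, one option is to inspect the resulting column types. The sketch below is only an illustration: it assumes a fallback to pandas when cuDF cannot be imported (for example on a machine without a GPU), and the alias `xdf` and the variable `frame` are hypothetical names used only here.

```python
import pandas as pd

try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    xdf = pd  # assumption: fall back to pandas so the sketch runs without a GPU

frame = xdf.DataFrame({
    'dates': pd.date_range('1/1/2019', periods=10, freq='B'),
    'integers': list(range(10)),
    'floats': [float(i) for i in range(10)],
})

# Each dictionary key became one column with its own dtype
print(frame.dtypes)

# shape is (number of rows, number of columns)
print(frame.shape)
```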


### Create a 2 Column DataFrame
- Consisting of integers and strings

In [11]:
# Using a dictionary
# Each key in the dictionary represents a column
# The key is the column's name
# The value is a list of the values in that column
gdf = cudf.DataFrame({
    'integers': [1, 2, 3, 4],
    'string': ['a', 'b', 'c', 'd']
})

print(gdf)

   integers string
0         1      a
1         2      b
2         3      c
3         4      d


### Create a 2 Column DataFrame with Pandas Bridge
- Consisting of integers and string category
- Convert every string column to type `category` so that the filtering functions work intuitively (for now)

In [12]:
# Create pandas dataframe
pandas_df = pd.DataFrame({
    'integers': [1, 2, 3, 4], 
    'strings': ['a', 'b', 'c', 'd']
})

# Convert string column to category format
pandas_df['strings'] = pandas_df['strings'].astype('category')

# Bridge from pandas to cudf
gdf = cudf.DataFrame.from_pandas(pandas_df)

# Print dataframe
print(gdf)

   integers strings
0         1       a
1         2       b
2         3       c
3         4       d


## Viewing

### Printing Column Names

In [13]:
gdf.columns

Index(['integers', 'strings'], dtype='object')

### Viewing Top of DataFrame

In [14]:
num_of_rows_to_view = 2 
print(gdf.head(num_of_rows_to_view))

   integers strings
0         1       a
1         2       b


### Viewing Bottom of DataFrame

In [15]:
num_of_rows_to_view = 3 
print(gdf.tail(num_of_rows_to_view))

   integers strings
1         2       b
2         3       c
3         4       d


## Filtering

### Method 1: Query

#### Filtering Integers/Floats by Column Values
- This only works for floats and integers, not for strings

In [16]:
# DO NOT RUN
# TOFIX: `cffi` package version mismatch error
print(gdf.query('integers == 1'))

   integers strings
0         1       a


#### Filtering Strings by Column Values
- `query` only works for floats and integers, not for strings, so this returns an error

In [17]:
print(gdf.query('strings == a'))

KeyError: ignored

### Method 2: Simple Columns

#### Filtering Strings by Column Values


In [18]:
# Filtering based on the string column
print(gdf[gdf.strings == 'b'])

   integers strings
1         2       b


#### Filtering Integers/Floats by Column Values

In [19]:
# Filtering based on the integer column
print(gdf[gdf.integers == 2])

   integers strings
1         2       b
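
Boolean masks like the ones above can also be combined. The following is a minimal sketch, assuming plain (non-categorical) string values and a fallback to pandas when cuDF cannot be imported; the names `frame`, `both`, and `either` are illustrative only.

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # assumption: fall back to pandas without a GPU

frame = xdf.DataFrame({
    'integers': [1, 2, 3, 4],
    'strings': ['a', 'b', 'c', 'd'],
})

# Combine conditions with & (and) or | (or). The parentheses are required
# because & and | bind more tightly than comparison operators.
both = frame[(frame.integers >= 2) & (frame.strings != 'd')]
either = frame[(frame.integers == 1) | (frame.strings == 'd')]

print(both)    # rows where integers is 2 or 3
print(either)  # rows where integers is 1 or strings is 'd'
```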


### Method 2: Simple Rows

#### Filtering by Row Numbers

In [20]:
# Select rows 0 and 1 (the slice end is exclusive, so the row at index 2 is left out)
print(gdf[0:2])

   integers strings
0         1       a
1         2       b


### Method 3: loc[rows, columns]

In [21]:
# The syntax is loc[rows, columns], which lets you choose rows and columns by label
# Unlike plain slicing, loc includes the end label, so 0:2 returns the first 3 rows of the integers column
print(gdf.loc[0:2, ['integers']])

   integers
0         1
1         2
2         3
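
The inclusive end label of `loc` contrasts with position-based slicing. A minimal sketch of the difference, assuming the default integer index and a fallback to pandas when cuDF cannot be imported (the names `frame`, `by_label`, and `by_position` are illustrative):

```python
try:
    import cudf as xdf  # GPU DataFrame library used in this notebook
except ImportError:
    import pandas as xdf  # assumption: fall back to pandas without a GPU

frame = xdf.DataFrame({
    'integers': [1, 2, 3, 4],
    'strings': ['a', 'b', 'c', 'd'],
})

# loc slices by label and includes the end label: labels 0, 1 and 2
by_label = frame.loc[0:2, ['integers']]

# iloc slices by position and excludes the end position: positions 0 and 1
by_position = frame.iloc[0:2]

print(len(by_label))     # 3 rows
print(len(by_position))  # 2 rows
```

With the default index the labels and positions coincide, so the only visible difference is whether the end of the slice is included.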
