<a href="https://colab.research.google.com/github/quaneh/tutorials-portfolio/blob/main/NVIDIA_RAPIDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Leveraging GPU-Accelerated Computing with NVIDIA RAPIDS

So, what is RAPIDS? From NVIDIA's own website, RAPIDS is "an open-source suite of GPU-accelerated data science and AI libraries with APIs that match the most popular open-source data tools."

In other words, RAPIDS lets data scientists and AI engineers drastically speed up their work without having to significantly change their workflow.

In this notebook, I'll provide a brief tutorial and carry out some benchmarking using some simple data science workflows.

# Setup

Google Colab allows us to use GPUs for free, and by cloning the rapidsai-csp-utils repo we can simplify the installation of RAPIDS and its associated libraries.

In [None]:
# Check the CUDA toolkit version (12.2 in this run)
!nvcc --version
!nvidia-smi # GPU vibe check - If this line fails, change your runtime type to T4 GPU in the toolbar on the top left of the screen.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Mon Sep 30 15:43:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                      

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 511, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (151/151), done.
remote: Total 511 (delta 159), reused 124 (delta 91), pack-reused 269 (from 1)
Receiving objects: 100% (511/511), 163.95 KiB | 818.00 KiB/s, done.
Resolving deltas: 100% (261/261), done.
Collecting pynvml
  Downloading pynvml-11.5.3-py3-none-any.whl.metadata (8.8 kB)
Downloading pynvml-11.5.3-py3-none-any.whl (53 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 2.9 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.3
Installing RAPIDS remaining 24.4.* libraries
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cuml-cu12==24.4.*
  Downloading https://pypi.nvidia.com/cuml-cu12/cuml_cu12-24.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1200.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 GB 1.1 

# Importing Libraries and Setting Up our Dataset

We'll kick this off by importing libraries and creating a synthetic dataset.
* cudf is equivalent to pandas
* cuml is equivalent to Scikit-Learn
* cupy is equivalent to numpy

We'll use scikit-learn to create our data, and initialise our cuDF DataFrame by converting the pandas DataFrame. <br/> The dataset is built for a binary classification problem, and we'll also split the data into train and test sets at this point.
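To make the "equivalent to" mapping above a little more concrete, here is a minimal sketch of code that runs unchanged on either backend. The `xdf` alias, the `backend` variable, and the fallback to pandas are assumptions added for illustration (so the snippet also runs on a machine without a GPU). They are not part of RAPIDS itself.

```python
# Use cuDF when it can be imported, otherwise fall back to pandas.
# The alias `xdf` is a hypothetical name chosen for this sketch.
try:
    import cudf as xdf
    backend = "cudf"
except Exception:  # cuDF missing, or no usable GPU
    import pandas as xdf
    backend = "pandas"

# The DataFrame construction and groupby calls are identical for both backends.
df = xdf.DataFrame({
    "group": ["a", "b", "a", "b"],
    "value": [1.0, 2.0, 3.0, 4.0],
})
means = df.groupby("group")["value"].mean().sort_index()

# Bring the result back to the host as a plain dict for printing.
if backend == "cudf":
    means_dict = means.to_pandas().to_dict()
else:
    means_dict = means.to_dict()

print(backend, means_dict)
```

The only backend-specific line is the final conversion back to host memory, which is typical of where the two libraries diverge in practice.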

In [None]:
import cudf
import cuml
import cupy as cp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from cuml.preprocessing import StandardScaler as cuStandardScaler


# Final checks that everything is ok with our GPU
print(f"CUDA available: {cp.cuda.is_available()}")
print(f"CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
print(f"Number of GPU devices: {cp.cuda.runtime.getDeviceCount()}")

CUDA available: True
CUDA version: 12020
Number of GPU devices: 1
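
The `CUDA version: 12020` line is an integer encoding rather than a typo. The CUDA runtime reports its version as `1000 * major + 10 * minor`, so `12020` corresponds to CUDA 12.2, which matches the `nvcc` output from the setup step. A minimal sketch of the decoding follows. The helper name `decode_cuda_version` is hypothetical and not part of CuPy.

```python
def decode_cuda_version(raw: int) -> str:
    """Turn the integer reported by the CUDA runtime into 'major.minor'."""
    major = raw // 1000          # e.g. 12020 // 1000 -> 12
    minor = (raw % 1000) // 10   # e.g. (12020 % 1000) // 10 -> 2
    return f"{major}.{minor}"

version = decode_cuda_version(12020)
print(version)
```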


In [None]:
# Generate a large synthetic dataset
n_samples = 1_000_000
n_features = 100
n_classes = 2

X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, random_state=42)

# Create pandas DataFrame
df_cpu = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(n_features)])
df_cpu['target'] = y

# Create cuDF DataFrame
df_gpu = cudf.DataFrame(df_cpu)

print(f"CPU DataFrame shape: {df_cpu.shape}")
print(f"GPU DataFrame shape: {df_gpu.shape}")

# Split the data into train and test sets
X_train_cpu, X_test_cpu, y_train_cpu, y_test_cpu = train_test_split(
    df_cpu.drop('target', axis=1), df_cpu['target'], test_size=0.2, random_state=42
)

X_train_gpu, X_test_gpu, y_train_gpu, y_test_gpu = train_test_split(
    df_gpu.drop('target', axis=1), df_gpu['target'], test_size=0.2, random_state=42
)

print(f"Training set shape: {X_train_cpu.shape}")
print(f"Test set shape: {X_test_cpu.shape}")

CPU DataFrame shape: (1000000, 101)
GPU DataFrame shape: (1000000, 101)
Training set shape: (800000, 100)
Test set shape: (200000, 100)


# First Benchmarking

We'll do some initial pre-processing of our datasets to get a first idea of the kind of performance benefits we can expect with RAPIDS.

Here's what we'll do in our pre-processing:

1.   We'll artificially create some missing values, as these often exist in real datasets, and handle them by imputing the mean value for the feature.
2.   We'll add a categorical feature, and then encode this feature using one-hot encoding.
3.   We will add some interaction features.
4.   We'll scale our numerical features to ensure that certain features don't dominate the dataset and hide the importance of others.
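
Step 1 deserves a closer look, because the two common ways of injecting missing values do not blank the same number of rows. Drawing 100,000 row indices with `np.random.choice` samples with replacement by default, so duplicate draws mean fewer than 100,000 distinct rows are hit. Flagging each row independently with probability 0.1 blanks roughly 10% of rows. A minimal NumPy-only sketch of the difference follows. The seed and variable names are illustrative assumptions, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_draws = 1_000_000, 100_000

# Approach 1: sample row indices with replacement (the default for choice).
sampled_idx = rng.choice(n_rows, size=n_draws)
distinct_rows_hit = np.unique(sampled_idx).size

# Expected number of distinct rows hit when drawing with replacement.
expected_distinct = n_rows * (1 - (1 - 1 / n_rows) ** n_draws)

# Approach 2: flag each row independently with probability 0.1.
bernoulli_mask = rng.random(n_rows) < 0.1
rows_flagged = int(bernoulli_mask.sum())

print(f"distinct rows hit by index sampling: {distinct_rows_hit}")
print(f"expected for index sampling:         {expected_distinct:.0f}")
print(f"rows flagged by per-row mask:        {rows_flagged}")
```

Index sampling lands on about 95,000 distinct rows, while the per-row mask flags about 100,000. The gap is small, but it is worth remembering when comparing CPU and GPU results that were produced with different masking approaches.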


In [None]:
from sklearn.preprocessing import StandardScaler
from cuml.preprocessing import StandardScaler as cuStandardScaler
import time

def preprocess_data_cpu(df):
    start_time = time.time()

    # Add some missing values
    df.loc[np.random.choice(df.index, 100000), 'feature_0'] = np.nan

    # Handle missing values
    # Assign the result back instead of using inplace=True on a column selection
    df['feature_0'] = df['feature_0'].fillna(df['feature_0'].mean())

    # Create a categorical feature
    df['cat_feature'] = pd.qcut(df['feature_1'], q=5, labels=['A', 'B', 'C', 'D', 'E'])

    # Encode categorical variable
    df = pd.get_dummies(df, columns=['cat_feature'], dtype=float)

    # Create interaction features
    df['interaction_1'] = df['feature_2'] * df['feature_3']
    df['interaction_2'] = df['feature_4'] + df['feature_5']

    # Scale numerical features
    scaler = StandardScaler()
    numerical_columns = [f'feature_{i}' for i in range(100)]  # Original numerical feature columns
    df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

    end_time = time.time()
    return df, end_time - start_time

def preprocess_data_gpu(df):
    start_time = time.time()

    # Add some missing values
    df['feature_0'] = df['feature_0'].mask(cudf.Series(cp.random.choice([True, False], len(df), p=[0.1, 0.9])))

    # Handle missing values
    df['feature_0'] = df['feature_0'].fillna(df['feature_0'].mean())

    # Create a categorical feature
    df['cat_feature'] = cudf.cut(df['feature_1'], bins=5, labels=['A', 'B', 'C', 'D', 'E'])

    # Encode categorical variable
    df = cudf.get_dummies(df, columns=['cat_feature'], dtype=float)

    # Create interaction features
    df['interaction_1'] = df['feature_2'] * df['feature_3']
    df['interaction_2'] = df['feature_4'] + df['feature_5']

    # Scale numerical features
    scaler = cuStandardScaler()
    numerical_columns = [f'feature_{i}' for i in range(100)]  # Original numerical feature columns
    df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

    end_time = time.time()
    return df, end_time - start_time

# Preprocess data on CPU
df_cpu_preprocessed, cpu_time = preprocess_data_cpu(df_cpu.copy())
print(f"CPU preprocessing time: {cpu_time:.2f} seconds")

# Preprocess data on GPU
df_gpu_preprocessed, gpu_time = preprocess_data_gpu(df_gpu.copy())
print(f"GPU preprocessing time: {gpu_time:.2f} seconds")

speedup = cpu_time / gpu_time
print(f"Speedup factor: {speedup:.2f}x")

CPU preprocessing time: 31.57 seconds
GPU preprocessing time: 1.44 seconds
Speedup factor: 21.90x


Even with relatively simple operations, the speed-up is significant. In this run, RAPIDS was roughly 22x faster!
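
A single timed run like the one above can be noisy, and the first GPU call in a session may include one-time start-up costs. One way to get a steadier number is to do an untimed warm-up run and then take the median of several timed runs. The sketch below uses only the standard library. The `benchmark` helper and the `busy_work` stand-in workload are hypothetical names introduced for illustration.

```python
import statistics
import time

def benchmark(fn, *args, warmup=1, repeats=3, **kwargs):
    """Return the median wall-clock time (seconds) of fn(*args, **kwargs)."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # untimed warm-up run
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def busy_work(n):
    """Stand-in workload so the sketch runs anywhere."""
    return sum(i * i for i in range(n))

median_s = benchmark(busy_work, 100_000)
print(f"median time: {median_s:.4f} seconds")
```

In this notebook, the same helper could wrap a call such as `lambda: preprocess_data_gpu(df_gpu.copy())`, with the caveat that the measured time would then also include the copy.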

NOTE:
Did you notice our first GOTCHA when using RAPIDS?
The cuDF library does not map 1-to-1 with pandas. When creating the categorical feature, we can see that cuDF (at least in the version used here) does not have a qcut function. Instead, I used the cut function, which takes slightly different input params: with bins=5 it produces five equal-width bins, whereas qcut with q=5 produces five equal-frequency (quantile) bins. Watch out and don't get caught out by this small difference like I did!
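
To see why the `cut` versus `qcut` difference matters, the sketch below compares the two on skewed data using pandas only, so it runs without a GPU. It also shows one possible workaround: compute the quantile edges yourself and pass them to `cut`. The data, seed, and variable names are assumptions for illustration, and whether the same workaround behaves identically with `cudf.cut` has not been verified here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.exponential(size=10_000))  # deliberately skewed data
labels = ["A", "B", "C", "D", "E"]

# Equal-width bins: most skewed values pile into the first bin.
equal_width = pd.cut(s, bins=5, labels=labels)
width_counts = equal_width.value_counts().sort_index()

# Equal-frequency bins: each bin holds about the same number of rows.
equal_freq = pd.qcut(s, q=5, labels=labels)
freq_counts = equal_freq.value_counts().sort_index()

# Workaround sketch: build quantile edges by hand and pass them to cut.
edges = s.quantile([0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).to_numpy()
emulated = pd.cut(s, bins=edges, labels=labels, include_lowest=True)
agreement = (emulated.astype(str) == equal_freq.astype(str)).mean()

print("cut counts: ", width_counts.to_dict())
print("qcut counts:", freq_counts.to_dict())
print(f"emulated qcut agreement: {agreement:.3f}")
```

Because the two functions assign rows to different categories, the one-hot columns produced on the CPU and GPU paths above are not directly comparable.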


In [None]:
# Verify results
cpu_sum = df_cpu_preprocessed.sum().sum()
gpu_sum = df_gpu_preprocessed.sum().sum()
print(f"CPU sum: {cpu_sum:.2f}")
print(f"GPU sum: {gpu_sum:.2f}")
print(f"Relative difference: {abs(cpu_sum - gpu_sum) / abs(cpu_sum):.2e}")

NameError: name 'df_cpu_preprocessed' is not defined

In [None]:
from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as cuRandomForestClassifier
from sklearn.metrics import accuracy_score
import time

def train_evaluate_rf_cpu(X_train, y_train, X_test, y_test):
    rf_cpu = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

    start_time = time.time()
    rf_cpu.fit(X_train, y_train)
    train_time = time.time() - start_time

    start_time = time.time()
    y_pred = rf_cpu.predict(X_test)
    inference_time = time.time() - start_time

    accuracy = accuracy_score(y_test, y_pred)

    return train_time, inference_time, accuracy

def train_evaluate_rf_gpu(X_train, y_train, X_test, y_test):
    rf_gpu = cuRandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

    start_time = time.time()
    rf_gpu.fit(X_train, y_train)
    train_time = time.time() - start_time

    start_time = time.time()
    y_pred = rf_gpu.predict(X_test)
    inference_time = time.time() - start_time

    accuracy = accuracy_score(y_test.to_numpy(), y_pred.to_numpy())

    return train_time, inference_time, accuracy

# CPU Random Forest
cpu_train_time, cpu_inference_time, cpu_accuracy = train_evaluate_rf_cpu(
    X_train_cpu, y_train_cpu, X_test_cpu, y_test_cpu
)

print(f"CPU Training time: {cpu_train_time:.2f} seconds")
print(f"CPU Inference time: {cpu_inference_time:.2f} seconds")
print(f"CPU Accuracy: {cpu_accuracy:.4f}")

# GPU Random Forest
gpu_train_time, gpu_inference_time, gpu_accuracy = train_evaluate_rf_gpu(
    X_train_gpu, y_train_gpu, X_test_gpu, y_test_gpu
)

print(f"\nGPU Training time: {gpu_train_time:.2f} seconds")
print(f"GPU Inference time: {gpu_inference_time:.2f} seconds")
print(f"GPU Accuracy: {gpu_accuracy:.4f}")

print(f"\nTraining speedup factor: {cpu_train_time / gpu_train_time:.2f}x")
print(f"Inference speedup factor: {cpu_inference_time / gpu_inference_time:.2f}x")
print(f"Accuracy difference (GPU - CPU): {gpu_accuracy - cpu_accuracy:.4f}")

CPU Training time: 1271.22 seconds
CPU Inference time: 1.18 seconds
CPU Accuracy: 0.9753





GPU Training time: 6.84 seconds
GPU Inference time: 0.07 seconds
GPU Accuracy: 0.9714

Training speedup factor: 185.91x
Inference speedup factor: 17.83x
Accuracy difference (GPU - CPU): -0.0040


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import time

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())  # build flag only; the next line lists actual GPU devices
print("GPU devices:", tf.config.list_physical_devices('GPU'))

def create_model():
    model = Sequential([
        Dense(64, activation='relu', input_shape=(100,)),
        Dense(32, activation='relu'),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
    return model

def train_evaluate_nn(X_train, y_train, X_test, y_test, device):
    with tf.device(device):
        model = create_model()

        start_time = time.time()
        history = model.fit(X_train, y_train, epochs=10, batch_size=1024, validation_split=0.2, verbose=0)
        train_time = time.time() - start_time

        start_time = time.time()
        loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
        inference_time = time.time() - start_time

    return train_time, inference_time, accuracy, history

# Train on CPU
cpu_train_time, cpu_inference_time, cpu_accuracy, cpu_history = train_evaluate_nn(
    X_train_cpu.values, y_train_cpu.values, X_test_cpu.values, y_test_cpu.values, '/CPU:0'
)

print(f"CPU Training time: {cpu_train_time:.2f} seconds")
print(f"CPU Inference time: {cpu_inference_time:.2f} seconds")
print(f"CPU Accuracy: {cpu_accuracy:.4f}")

# Train on GPU
gpu_train_time, gpu_inference_time, gpu_accuracy, gpu_history = train_evaluate_nn(
    X_train_gpu.values.get(), y_train_gpu.values.get(), X_test_gpu.values.get(), y_test_gpu.values.get(), '/GPU:0'
)

print(f"\nGPU Training time: {gpu_train_time:.2f} seconds")
print(f"GPU Inference time: {gpu_inference_time:.2f} seconds")
print(f"GPU Accuracy: {gpu_accuracy:.4f}")

print(f"\nTraining speedup factor: {cpu_train_time / gpu_train_time:.2f}x")
print(f"Inference speedup factor: {cpu_inference_time / gpu_inference_time:.2f}x")
print(f"Accuracy difference (GPU - CPU): {gpu_accuracy - cpu_accuracy:.4f}")

# Plot training history
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(cpu_history.history['accuracy'], label='CPU Training')
plt.plot(cpu_history.history['val_accuracy'], label='CPU Validation')
plt.plot(gpu_history.history['accuracy'], label='GPU Training')
plt.plot(gpu_history.history['val_accuracy'], label='GPU Validation')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(cpu_history.history['loss'], label='CPU Training')
plt.plot(cpu_history.history['val_loss'], label='CPU Validation')
plt.plot(gpu_history.history['loss'], label='GPU Training')
plt.plot(gpu_history.history['val_loss'], label='GPU Validation')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()

plt.tight_layout()
plt.show()