<a href="https://colab.research.google.com/github/quaneh/tutorials-portfolio/blob/main/NVIDIA_RAPIDS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is RAPIDS?

So, what is RAPIDS? From NVIDIA's own website, RAPIDS is:

> **"an open-source suite of GPU-accelerated data science and AI libraries with APIs that match the most popular open-source data tools."**

In other words, RAPIDS allows data scientists and AI engineers to **drastically speed up their work** without having to completely change their workflow. 🚀

In this notebook, I’ll provide a **brief tutorial** and carry out some **benchmarking** using simple data science workflows. Let’s dive in! 📊


# Setup

Google Colab allows us to use **GPUs for free**, which is fantastic for our data science projects! 💻✨

To simplify the installation of **RAPIDS** and other associated libraries, we can clone the **rapidsai-csp-utils** repository. This will make getting everything set up a breeze! 🚀


In [None]:
#cuda version 12.2
!nvcc --version
!nvidia-smi # GPU vibe check - If this line fails, change your runtime type to T4 GPU in the toolbar on the top left of the screen.

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Thu Oct 24 13:33:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P8              10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                      

In [None]:
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py

Cloning into 'rapidsai-csp-utils'...
remote: Enumerating objects: 535, done.
remote: Counting objects: 100% (266/266), done.
remote: Compressing objects: 100% (172/172), done.
remote: Total 535 (delta 174), reused 129 (delta 94), pack-reused 269 (from 1)
Receiving objects: 100% (535/535), 172.39 KiB | 602.00 KiB/s, done.
Resolving deltas: 100% (276/276), done.
Collecting pynvml
  Downloading pynvml-11.5.3-py3-none-any.whl.metadata (8.8 kB)
Downloading pynvml-11.5.3-py3-none-any.whl (53 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 53.1/53.1 kB 2.7 MB/s eta 0:00:00
Installing collected packages: pynvml
Successfully installed pynvml-11.5.3
Installing RAPIDS remaining 24.10.* libraries
Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com
Collecting cuml-cu12==24.10.*
  Downloading https://pypi.nvidia.com/cuml-cu12/cuml_cu12-24.10.0-cp310-cp310-manylinux_2_28_x86_64.whl (567.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 567.7/567.7 MB 2.7 MB/s eta 0:00:0

# Importing Libraries and Setting Up Our Dataset

We'll kick things off by **importing libraries** and creating a **synthetic dataset**. Here’s a quick overview of what we’ll be using:

- **cudf**: Equivalent to **pandas**
- **cuml**: Equivalent to **Scikit-Learn**
- **cupy**: Equivalent to **NumPy**

We’ll use **Scikit-learn** to generate our data, and then initialize our **cuDF** dataframe by converting the **pandas** dataframe.

🌟 We’ve created a dataset for a classification problem, and at this point, we’ll also split the data into **train** and **test** sets.
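To make the "equivalent to" mapping a bit more concrete, here is a minimal sketch (not part of the original notebook) of the same DataFrame code running against either library. The alias `xdf`, the `backend` variable, and the toy column names are made up for illustration, and the fallback to pandas is only there so the snippet also runs on a machine without RAPIDS.

```python
# Pick a DataFrame backend: cuDF when it imports cleanly, pandas otherwise.
# `xdf` and `backend` are hypothetical names used only in this sketch.
try:
    import cudf as xdf
    backend = "cudf"
except Exception:  # RAPIDS missing or no usable GPU
    import pandas as xdf
    backend = "pandas"

# The calls below are written once and are valid for both libraries.
df = xdf.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [10.0, 20.0, 30.0, 40.0],
})
df["product"] = df["feature_a"] * df["feature_b"]
product_mean = float(df["product"].mean())

print(f"Backend: {backend}, mean of product column: {product_mean}")
```

For simple operations like these the code is identical; the differences tend to show up in less common functions.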


In [None]:
import cudf
import cuml
import cupy as cp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from cuml.preprocessing import StandardScaler as cuStandardScaler


# Final checks that everything is ok with our GPU
print(f"CUDA available: {cp.cuda.is_available()}")
print(f"CUDA version: {cp.cuda.runtime.runtimeGetVersion()}")
print(f"Number of GPU devices: {cp.cuda.runtime.getDeviceCount()}")

CUDA available: True
CUDA version: 12020
Number of GPU devices: 1


So, here's the deal. This code is whipping up a huge fake dataset—like a million rows, 100 features per row—just for some good ol' classification fun. It then converts it into two types of DataFrames: one for CPUs (using Pandas) and one for GPUs (using cuDF). Why? Well, CPUs are cool and all, but GPUs are like the speed demons of data processing. 🚀

After that, the code splits the data into training and testing sets, making sure everything’s ready for some serious model training later on. It even prints the sizes to make sure nothing's gone wonky.

In short: We’re making a dataset, prepping it for both CPU and GPU magic, and getting it all ready to rock for machine learning!

In [None]:
# Generate a large synthetic dataset
n_samples = 1_000_000
n_features = 100
n_classes = 2

X, y = make_classification(n_samples=n_samples, n_features=n_features, n_classes=n_classes, random_state=42)

# Create pandas DataFrame
df_cpu = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(n_features)])
df_cpu['target'] = y

# Create cuDF DataFrame
df_gpu = cudf.DataFrame(df_cpu)

print(f"CPU DataFrame shape: {df_cpu.shape}")
print(f"GPU DataFrame shape: {df_gpu.shape}")

# Split the data into train and test sets
X_train_cpu, X_test_cpu, y_train_cpu, y_test_cpu = train_test_split(
    df_cpu.drop('target', axis=1), df_cpu['target'], test_size=0.2, random_state=42
)

X_train_gpu, X_test_gpu, y_train_gpu, y_test_gpu = train_test_split(
    df_gpu.drop('target', axis=1), df_gpu['target'], test_size=0.2, random_state=42
)

print(f"Training set shape: {X_train_cpu.shape}")
print(f"Test set shape: {X_test_cpu.shape}")

CPU DataFrame shape: (1000000, 101)
GPU DataFrame shape: (1000000, 101)
Training set shape: (800000, 100)
Test set shape: (200000, 100)


# First Benchmarking (Pre-Processing)

Alright, this code is all about **preprocessing data** on both the **CPU** and **GPU**—and then showing off how fast the GPU is! 🚀

### Here's what happens:

1. **Missing Values**:
   - We mess up the data a bit by randomly adding some missing values to `feature_0`.
   - Then, we clean it up by filling those missing values with the **mean** of the column. Easy fix!

2. **Categorical Feature**:
   - We turn one of our numerical features (`feature_1`) into a categorical one, splitting it into 5 bins and labeling them from 'A' to 'E'. So now we have a new column called `cat_feature`!
   - We then use one-hot encoding to turn this categorical feature into multiple dummy variables. Each letter gets its own column—because why not? 😎

3. **Interaction Features**:
   - To get fancy, we create two new features by combining existing ones:
     - **`interaction_1`**: Multiplying `feature_2` and `feature_3`.
     - **`interaction_2`**: Adding `feature_4` and `feature_5`.

4. **Scaling**:
   - Finally, we scale the original 100 numerical features (not the new interaction or dummy columns) so they're on the same level, using **`StandardScaler`** on the CPU and **`cuStandardScaler`** on the GPU.

5. **Timing**:
   - We time how long this whole preprocessing dance takes on both CPU and GPU. 🕒
   - Then, we flex by showing how much **faster** the GPU is—by calculating the **speedup factor**!

### The result:
- CPU processing time is printed.
- GPU processing time is printed.
- Then, we calculate the **speedup factor** to show how much quicker the GPU did the job!

In short: We're cleaning up data, making some fun features, scaling it, and proving the GPU is a beast. 💪🔥
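The timing and speedup steps can be sketched as a small reusable helper. This is an illustration under assumptions, not the notebook's own code: `time_call`, `speedup_factor`, and the `warmup`, `repeats`, and `sync` parameters are hypothetical names. The warm-up run and the optional `sync` hook are there because a first GPU call often includes one-off setup work and GPU calls can return before the work has finished; on a GPU you might pass something like `cp.cuda.Stream.null.synchronize` as `sync`.

```python
import time


def time_call(fn, *args, warmup=1, repeats=3, sync=None):
    """Run fn(*args) a few times and return (last result, best wall time in seconds)."""
    for _ in range(warmup):
        fn(*args)  # untimed run, so one-off setup costs are not measured
    if sync is not None:
        sync()
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        if sync is not None:
            sync()  # wait for any asynchronous work before stopping the clock
        best = min(best, time.perf_counter() - start)
    return result, best


def speedup_factor(cpu_seconds, gpu_seconds):
    """Values above 1 mean the GPU was faster; below 1 mean it was slower."""
    return cpu_seconds / gpu_seconds


# Usage on a plain Python function (no GPU needed for the sketch itself)
total, seconds = time_call(sum, range(1_000_000))
print(f"sum = {total}, best of 3 runs: {seconds:.4f} seconds")
```

Taking the best of several runs is one common choice; reporting the median would be just as reasonable.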



In [None]:
from sklearn.preprocessing import StandardScaler
from cuml.preprocessing import StandardScaler as cuStandardScaler
import time

def preprocess_data_cpu(df):
    start_time = time.time()

    # Add some missing values
    df.loc[np.random.choice(df.index, 100000), 'feature_0'] = np.nan

    # Handle missing values
    df['feature_0'] = df['feature_0'].fillna(df['feature_0'].mean())

    # Create a categorical feature
    df['cat_feature'] = pd.qcut(df['feature_1'], q=5, labels=['A', 'B', 'C', 'D', 'E'])

    # Encode categorical variable
    df = pd.get_dummies(df, columns=['cat_feature'], dtype=float)

    # Create interaction features
    df['interaction_1'] = df['feature_2'] * df['feature_3']
    df['interaction_2'] = df['feature_4'] + df['feature_5']

    # Scale numerical features
    scaler = StandardScaler()
    numerical_columns = [f'feature_{i}' for i in range(100)]  # Original numerical feature columns
    df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

    end_time = time.time()
    return df, end_time - start_time

def preprocess_data_gpu(df):
    start_time = time.time()

    # Add some missing values
    df['feature_0'] = df['feature_0'].mask(cudf.Series(cp.random.choice([True, False], len(df), p=[0.1, 0.9])))

    # Handle missing values
    df['feature_0'] = df['feature_0'].fillna(df['feature_0'].mean())

    # Create a categorical feature
    df['cat_feature'] = cudf.cut(df['feature_1'], bins=5, labels=['A', 'B', 'C', 'D', 'E'])

    # Encode categorical variable
    df = cudf.get_dummies(df, columns=['cat_feature'], dtype=float)

    # Create interaction features
    df['interaction_1'] = df['feature_2'] * df['feature_3']
    df['interaction_2'] = df['feature_4'] + df['feature_5']

    # Scale numerical features
    scaler = cuStandardScaler()
    numerical_columns = [f'feature_{i}' for i in range(100)]  # Original numerical feature columns
    df[numerical_columns] = scaler.fit_transform(df[numerical_columns])

    end_time = time.time()
    return df, end_time - start_time

# Preprocess data on CPU
df_cpu_preprocessed, cpu_time = preprocess_data_cpu(df_cpu.copy())
print(f"CPU preprocessing time: {cpu_time:.2f} seconds")

# Preprocess data on GPU
df_gpu_preprocessed, gpu_time = preprocess_data_gpu(df_gpu.copy())
print(f"GPU preprocessing time: {gpu_time:.2f} seconds")

speedup = cpu_time / gpu_time
print(f"Speedup factor: {speedup:.2f}x")

CPU preprocessing time: 12.13 seconds
GPU preprocessing time: 19.20 seconds
Speedup factor: 0.63x


# Performance Insights

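In the run recorded above, the speedup factor came out at **0.63x**: the GPU pipeline took 19.20 seconds against 12.13 seconds on the CPU, so RAPIDS was actually slower here, not faster. A first GPU call typically pays one-off costs such as CUDA context creation and kernel compilation, so it's worth re-running the cell a couple of times before drawing conclusions. 🚀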

---

### Note: Watch Out for This GOTCHA! ⚠️

Did you notice our first **GOTCHA** when using RAPIDS?

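The **cuDF** library mirrors most of the **pandas** API, but the mapping isn't perfectly 1-1. When creating the categorical feature, I couldn't find a `qcut` function in cuDF, so the GPU version uses `cut` instead. The two aren't interchangeable: `qcut` builds quantile-based bins with roughly equal counts, while `cut` with `bins=5` builds five equal-width bins, so the CPU and GPU pipelines end up with different categories.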

**So, keep an eye out!** Don’t get caught out by this small difference like I did!
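Here's a small pandas-only sketch of that difference, plus one possible workaround: compute the quantile edges yourself and hand them to `cut`. The toy series and variable names are made up for illustration. Whether the same trick carries over to `cudf.cut` unchanged is an assumption I haven't verified here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=10_000), name="feature_1")
labels = ["A", "B", "C", "D", "E"]

# Equal-width bins: the value range is split into 5 intervals of the same width
equal_width = pd.cut(s, bins=5, labels=labels)

# Quantile bins: each bin receives (roughly) the same number of rows
quantile = pd.qcut(s, q=5, labels=labels)

# Workaround sketch: compute the quantile edges explicitly, then use cut
edges = s.quantile([0.0, 0.2, 0.4, 0.6, 0.8, 1.0]).to_numpy()
emulated = pd.cut(s, bins=edges, labels=labels, include_lowest=True)

width_counts = equal_width.value_counts().sort_index()
quantile_counts = quantile.value_counts().sort_index()
matches_qcut = bool((emulated.astype(str) == quantile.astype(str)).all())

print("cut counts: ", width_counts.to_dict())
print("qcut counts:", quantile_counts.to_dict())
print("cut with quantile edges matches qcut:", matches_qcut)
```

On bell-shaped data like this, `cut` puts most rows in the middle bins while `qcut` spreads them evenly.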

In [None]:
# Verify results
cpu_sum = df_cpu_preprocessed.sum().sum()
gpu_sum = df_gpu_preprocessed.sum().sum()
print(f"CPU sum: {cpu_sum:.2f}")
print(f"GPU sum: {gpu_sum:.2f}")
print(f"Relative difference: {abs(cpu_sum - gpu_sum) / cpu_sum:.2e}")

CPU sum: 1499087.77
GPU sum: 1499087.77
Relative difference: 1.55e-16
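A single grand total is a fairly loose check, because differences between columns can cancel out. A stricter option is to compare column by column within a tolerance. The sketch below uses two tiny pandas frames and a hypothetical helper, `compare_frames`; for a cuDF frame you would first bring it to the host with `.to_pandas()`.

```python
import numpy as np
import pandas as pd


def compare_frames(left, right, rtol=1e-6, atol=1e-8):
    """Return the columns of `left` that are missing from `right` or differ beyond tolerance."""
    mismatched = []
    for col in left.columns:
        if col not in right.columns:
            mismatched.append(col)
            continue
        a = left[col].to_numpy(dtype=float)
        b = right[col].to_numpy(dtype=float)
        if not np.allclose(a, b, rtol=rtol, atol=atol, equal_nan=True):
            mismatched.append(col)
    return mismatched


# Two toy frames with the same grand total but different row values in "x"
frame_a = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [0.0, 1.0, 0.0]})
frame_b = pd.DataFrame({"x": [3.0, 2.0, 1.0], "y": [0.0, 1.0, 0.0]})

same_total = bool(frame_a.sum().sum() == frame_b.sum().sum())
mismatched = compare_frames(frame_a, frame_b)

print("Same grand total:", same_total)
print("Columns that differ:", mismatched)
```

In this notebook the CPU and GPU pipelines mask different random rows and bin `feature_1` differently, so a column-wise comparison would be expected to flag those columns even though the totals agree.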


# Random Forest

Right, let’s take a gander at this code! We’re putting the **Random Forest** model through its paces, comparing how it performs on the **CPU** against the **GPU**. It’s like a friendly duel between two titans! 🌟

### Here’s the lowdown:

1. **Imports**:
   - We’re bringing in the **RandomForestClassifier** from both **scikit-learn** and **cuML** (the GPU version). This way, we can train our model on either platform without breaking a sweat.
   - We also import **accuracy_score** to see how well our models are doing, and we’re using **time** to keep track of how long everything takes.

2. **Training & Evaluating on CPU**:
   - The `train_evaluate_rf_cpu` function is where the magic happens for the CPU.
     - We whip up a Random Forest model with 100 trees and a max depth of 10—nothing too fancy, but it gets the job done.
     - We time how long it takes to fit the model to the training data, then measure how long it takes to predict on the test data.
     - Finally, we calculate the accuracy of our predictions. Nice and simple! 🍀

3. **Training & Evaluating on GPU**:
   - The `train_evaluate_rf_gpu` function does the same thing, but on the GPU—this is where things get a bit more exciting!
     - Same model setup here, but we’re taking advantage of the GPU’s power to speed things up.
     - We time the training and inference just like before, and check the accuracy as well.

4. **Results**:
   - After training on both CPU and GPU, we print out the training time, inference time, and accuracy for each.
   - We also calculate how much quicker the GPU was compared to the CPU using the **speedup factor**. And let me tell you, the GPU usually takes the cake! 🏁
   - Lastly, we check the accuracy difference to see if the GPU’s speed came at the expense of precision.

### The fun takeaway:
- We’re training a Random Forest model, putting the CPU and GPU head-to-head, and tracking who comes out on top in terms of speed and accuracy. Plus, we get to show off some lovely results at the end! 😎

In [None]:
from sklearn.ensemble import RandomForestClassifier
from cuml.ensemble import RandomForestClassifier as cuRandomForestClassifier
from sklearn.metrics import accuracy_score
import time

def train_evaluate_rf_cpu(X_train, y_train, X_test, y_test):
    rf_cpu = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

    start_time = time.time()
    rf_cpu.fit(X_train, y_train)
    train_time = time.time() - start_time

    start_time = time.time()
    y_pred = rf_cpu.predict(X_test)
    inference_time = time.time() - start_time

    accuracy = accuracy_score(y_test, y_pred)

    return train_time, inference_time, accuracy

def train_evaluate_rf_gpu(X_train, y_train, X_test, y_test):
    rf_gpu = cuRandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)

    start_time = time.time()
    rf_gpu.fit(X_train, y_train)
    train_time = time.time() - start_time

    start_time = time.time()
    y_pred = rf_gpu.predict(X_test)
    inference_time = time.time() - start_time

    accuracy = accuracy_score(y_test.to_numpy(), y_pred.to_numpy())

    return train_time, inference_time, accuracy

# CPU Random Forest
cpu_train_time, cpu_inference_time, cpu_accuracy = train_evaluate_rf_cpu(
    X_train_cpu, y_train_cpu, X_test_cpu, y_test_cpu
)

print(f"CPU Training time: {cpu_train_time:.2f} seconds")
print(f"CPU Inference time: {cpu_inference_time:.2f} seconds")
print(f"CPU Accuracy: {cpu_accuracy:.4f}")

# GPU Random Forest
gpu_train_time, gpu_inference_time, gpu_accuracy = train_evaluate_rf_gpu(
    X_train_gpu, y_train_gpu, X_test_gpu, y_test_gpu
)

print(f"\nGPU Training time: {gpu_train_time:.2f} seconds")
print(f"GPU Inference time: {gpu_inference_time:.2f} seconds")
print(f"GPU Accuracy: {gpu_accuracy:.4f}")

print(f"\nTraining speedup factor: {cpu_train_time / gpu_train_time:.2f}x")
print(f"Inference speedup factor: {cpu_inference_time / gpu_inference_time:.2f}x")
print(f"Accuracy difference (GPU - CPU): {gpu_accuracy - cpu_accuracy:.4f}")

# Neural Network

Alright, so this code is all about training a neural network on both the **CPU** and the **GPU** and seeing how fast and accurate each one is. We’re basically having a friendly race between these two processing giants! 🏎️💨

### What’s going on:

1. **Imports**:
   - We’re using **TensorFlow** to build and train our neural network. TensorFlow is the secret sauce for making neural networks run smoothly. 🍲
   - We’re also checking if we can use a GPU because, hey, faster is better, right?

2. **Neural Network Setup**:
   - A simple neural network is created with 3 hidden layers (64, 32, and 16 neurons). All of them use the **ReLU activation**, and the final layer uses **sigmoid** because we’re doing binary classification.
   - We compile the model with the **Adam optimizer** and **binary crossentropy** as the loss function. Pretty standard stuff! 👍

3. **Training & Evaluation**:
   - The function `train_evaluate_nn` does all the work:
     - It trains the model on either the CPU or the GPU.
     - It measures **how long** training and inference (prediction) take.
     - It also tracks how accurate the model is.
   - We’re running the model on the **CPU first**, printing out the training time, inference time, and accuracy. Then we do the same thing on the **GPU**.

4. **Speedup Comparison**:
   - After training both models, we calculate the **speedup factor**: how much faster the GPU is compared to the CPU. Spoiler: The GPU is usually way quicker. 🚀
   - We also check the **accuracy difference** between the two models, just to make sure using the GPU didn’t mess anything up.

5. **Plotting**:
   - Finally, we plot the training and validation accuracy, as well as the loss, for both the CPU and GPU models. This gives us a nice visual of how each model is learning over time. 📊

### The fun takeaway:
- We train a neural network, compare the CPU and GPU results, and then show off how fast the GPU really is. In the end, we get some cool plots to make it look like we totally know what we’re doing. 😎
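To get a feel for how small this network is, the layer sizes described above (100 inputs, then 64, 32, 16, and 1 output) can be turned into a parameter count with a few lines of plain Python. The helper name `dense_param_count` is made up for this sketch; each fully connected layer contributes one weight per input-output pair plus one bias per output.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


layer_sizes = [100, 64, 32, 16, 1]  # input width followed by each Dense layer
total_params = dense_param_count(layer_sizes)
print(f"Trainable parameters: {total_params}")  # 9089
```

That should be the same total `model.summary()` reports for the model built in the next cell.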


In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
import time

print("TensorFlow version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())  # build flag only; the next line lists the GPUs actually visible
print("GPU devices:", tf.config.list_physical_devices('GPU'))

def create_model():
    model = Sequential([
        Dense(64, activation='relu', input_shape=(100,)),
        Dense(32, activation='relu'),
        Dense(16, activation='relu'),
        Dense(1, activation='sigmoid')
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
    return model

def train_evaluate_nn(X_train, y_train, X_test, y_test, device):
    with tf.device(device):
        model = create_model()

        start_time = time.time()
        history = model.fit(X_train, y_train, epochs=10, batch_size=1024, validation_split=0.2, verbose=0)
        train_time = time.time() - start_time

        start_time = time.time()
        loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
        inference_time = time.time() - start_time

    return train_time, inference_time, accuracy, history

# Train on CPU
cpu_train_time, cpu_inference_time, cpu_accuracy, cpu_history = train_evaluate_nn(
    X_train_cpu.values, y_train_cpu.values, X_test_cpu.values, y_test_cpu.values, '/CPU:0'
)

print(f"CPU Training time: {cpu_train_time:.2f} seconds")
print(f"CPU Inference time: {cpu_inference_time:.2f} seconds")
print(f"CPU Accuracy: {cpu_accuracy:.4f}")

# Train on GPU
gpu_train_time, gpu_inference_time, gpu_accuracy, gpu_history = train_evaluate_nn(
    X_train_gpu.values.get(), y_train_gpu.values.get(), X_test_gpu.values.get(), y_test_gpu.values.get(), '/GPU:0'
)

print(f"\nGPU Training time: {gpu_train_time:.2f} seconds")
print(f"GPU Inference time: {gpu_inference_time:.2f} seconds")
print(f"GPU Accuracy: {gpu_accuracy:.4f}")

print(f"\nTraining speedup factor: {cpu_train_time / gpu_train_time:.2f}x")
print(f"Inference speedup factor: {cpu_inference_time / gpu_inference_time:.2f}x")
print(f"Accuracy difference (GPU - CPU): {gpu_accuracy - cpu_accuracy:.4f}")

# Plot training history
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(cpu_history.history['accuracy'], label='CPU Training')
plt.plot(cpu_history.history['val_accuracy'], label='CPU Validation')
plt.plot(gpu_history.history['accuracy'], label='GPU Training')
plt.plot(gpu_history.history['val_accuracy'], label='GPU Validation')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(cpu_history.history['loss'], label='CPU Training')
plt.plot(cpu_history.history['val_loss'], label='CPU Validation')
plt.plot(gpu_history.history['loss'], label='GPU Training')
plt.plot(gpu_history.history['val_loss'], label='GPU Validation')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()

plt.tight_layout()
plt.show()