
# Accelerating Data Science Workflows with RAPIDS and cuDF

This notebook demonstrates how to transition from traditional CPU-based data processing to GPU-accelerated workflows using **RAPIDS**, an open-source suite of libraries developed by NVIDIA.

## Overview

GPU acceleration can substantially reduce processing time for large datasets and compute-heavy operations. RAPIDS provides libraries such as **cuDF**, a GPU DataFrame library with a pandas-like API. Its `cudf.pandas` accelerator mode lets existing pandas code run on the GPU with minimal code changes.

## Key Steps in the Notebook

### 1. Setting Up RAPIDS
The notebook begins by setting up RAPIDS. This involves loading the `cudf.pandas` extension with `%load_ext cudf.pandas`, which runs supported pandas operations on the GPU and falls back to CPU pandas for the rest.
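
Outside a notebook, the accelerator can also be enabled from Python code. The sketch below is a minimal illustration rather than part of the notebook: the helper name `try_enable_cudf_pandas` is hypothetical, and it assumes `cudf.pandas.install()` is called before pandas is imported. When cuDF is not available the helper returns `False`, and ordinary CPU pandas is used.

```python
def try_enable_cudf_pandas() -> bool:
    """Try to switch on the cudf.pandas accelerator; return True on success."""
    try:
        import cudf.pandas  # only importable when RAPIDS cuDF is installed
    except Exception:
        # cuDF is missing, or it could not initialize (for example, no usable GPU).
        return False
    cudf.pandas.install()  # should run before `import pandas`
    return True


accelerated = try_enable_cudf_pandas()
print("cudf.pandas enabled:", accelerated)
```

Returning a flag instead of raising keeps the rest of a script runnable on machines without a GPU, at the cost of silently using the CPU path.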

### 2. Data Loading and Preparation
- **Data Ingestion**: Data is loaded into pandas DataFrames from CSV files.
- **Dataset Expansion**: The dataset is scaled to simulate large workloads by duplicating rows to reach a target size of 1 million rows.
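
The expansion logic can be sketched without pandas. The example below is an illustration under stated assumptions: `ceil_div` and `expand_rows` are hypothetical helper names, a plain list stands in for a DataFrame, and 891 is an assumed source row count used only to show the arithmetic.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division without floating-point rounding."""
    return -(-numerator // denominator)


def expand_rows(rows: list, target_rows: int) -> list:
    """Repeat `rows` until at least `target_rows` exist, then truncate."""
    repeats = ceil_div(target_rows, len(rows))
    return (rows * repeats)[:target_rows]


# 891 is an assumed source size, used here only for illustration.
print("repeats needed:", ceil_div(1_000_000, 891))
print("expanded length:", len(expand_rows(["a", "b", "c"], 10)))
```

Rounding the repeat count up and then truncating guarantees exactly the target number of rows, even when the target is not a multiple of the source length.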

### 3. GPU-Accelerated Data Manipulation
Using cuDF, common data manipulation tasks such as filtering, grouping, and merging are performed:
- **Filtering**: Subset rows based on conditions.
- **Grouping**: Aggregate data, e.g., calculating averages by group.
- **Merging**: Combine multiple DataFrames with additional metadata. 
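
The three operations above can be sketched on a tiny hand-made table. The data and the `class_info` lookup table are invented for illustration. The code uses only the standard pandas API, so it should behave the same with or without the `cudf.pandas` accelerator loaded.

```python
import pandas as pd

# Hypothetical sample data; the values are made up for this sketch.
passengers = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
    "Age": [38.0, 54.0, 27.0, 22.0, 35.0, 4.0],
})
class_info = pd.DataFrame({
    "Pclass": [1, 2, 3],
    "ClassName": ["First", "Second", "Third"],
})

# Filtering: keep rows that satisfy a condition.
adults = passengers[passengers["Age"] >= 18]

# Grouping: mean survival rate per passenger class.
survival_by_class = passengers.groupby("Pclass", as_index=False)["Survived"].mean()

# Merging: attach descriptive metadata to the aggregated result.
labeled = survival_by_class.merge(class_info, on="Pclass", how="left")

print(len(adults))
print(labeled)
```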

### 4. Performance Profiling and Benchmarking
The notebook compares the execution time of key operations on the CPU and GPU:
- **Profiling**: The `%%cudf.pandas.profile` cell magic reports which operations ran on the GPU, which fell back to the CPU, and how long each took.
- **Benchmarking**: Cell magics such as `%%time` and `%%timeit`, or explicit `time.perf_counter()` calls, measure CPU and GPU runtimes so they can be compared. Speedups depend on the operation and the data size.
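
A small timing helper makes CPU and GPU measurements easier to compare outside of cell magics. This is a minimal sketch, not the notebook's own method: `time_ms` and `speedup` are hypothetical names, and taking the best of several runs is one common convention among several.

```python
import time


def time_ms(fn, repeats: int = 5) -> float:
    """Run `fn` several times and return the best wall-clock time in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        elapsed = (time.perf_counter() - start) * 1000
        best = min(best, elapsed)
    return best


def speedup(cpu_ms: float, gpu_ms: float) -> float:
    """Return how many times faster the GPU run was than the CPU run."""
    return cpu_ms / gpu_ms


# Stand-in workload; in the notebook this would be a DataFrame operation.
baseline_ms = time_ms(lambda: sum(range(100_000)))
print(f"Best of 5 runs: {baseline_ms:.4f} ms")
```

Keeping print statements and other I/O outside the timed callable avoids counting console overhead as part of the operation.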

### 5. Verifying GPU Utilization
Checks are performed to confirm that operations are leveraging GPU resources effectively:
- **Type Checks**: Ensure arrays and DataFrames are processed on the GPU.
- **Execution Paths**: Verify whether the cuDF accelerator is active or falling back to pandas.
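
One way to check whether the accelerator is active is to inspect the string representation of the imported `pandas` module. The sketch below assumes the accelerated module's `repr` contains the text `ModuleAccelerator`, as described in the RAPIDS FAQ; that may differ between versions. The helper names `accelerator_active` and `describe_pandas` are hypothetical.

```python
import importlib


def accelerator_active(module) -> bool:
    """Return True if `module` looks like a cudf.pandas-accelerated module."""
    # Assumption: the accelerated module's repr mentions "ModuleAccelerator".
    return "ModuleAccelerator" in repr(module)


def describe_pandas() -> str:
    """Report which pandas execution path the current session would use."""
    try:
        pd = importlib.import_module("pandas")
    except ImportError:
        return "pandas is not installed"
    if accelerator_active(pd):
        return "cudf.pandas accelerator is active"
    return "plain CPU pandas is in use"


print(describe_pandas())
```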

## Conclusion

By using RAPIDS and cuDF, this notebook shows how data manipulation workflows on large datasets can run faster on the GPU while keeping familiar pandas syntax. The measured speedup depends on the operation and the data size, so the CPU and GPU timings recorded in the notebook should be compared for each workload rather than assumed.


In [None]:
import pandas as pd
import time

# -------------------------
# Step 1: Load Data with CPU pandas
# -------------------------
train_df = pd.read_csv('./Titanic-Train-Dataset.csv') 
test_df = pd.read_csv('./Titanic-Test-Dataset.csv') 

# -------------------------
# Step 2: Expand DataFrames to ~1,000,000 Rows (if needed)
# -------------------------
target_rows = 1_000_000
# Use ceiling division to determine repeats
repeats_train = -(-target_rows // len(train_df))
repeats_test  = -(-target_rows // len(test_df))

# Concatenate copies to reach the desired number of rows
train_df = pd.concat([train_df] * repeats_train, ignore_index=True).head(target_rows)
test_df  = pd.concat([test_df] * repeats_test, ignore_index=True).head(target_rows)

# Keep both DataFrames in a list so later steps can iterate over them
combine = [train_df, test_df]

print("Before:", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

# -------------------------
# Step 3: Time the Operation Using Python's Time Module
# -------------------------
start_time = time.perf_counter()

# Drop the 'Ticket' and 'Cabin' columns using CPU pandas
train_df_trunc = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df_trunc  = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df_trunc, test_df_trunc]

end_time = time.perf_counter()

# Convert the elapsed time to milliseconds
elapsed_ms = (end_time - start_time) * 1000

print("After:", train_df_trunc.shape, test_df_trunc.shape, combine[0].shape, combine[1].shape)
print(f"Elapsed time: {elapsed_ms:.4f} milliseconds")

# -------------------------
# Step 4: Save final elapsed time for comparison later
# -------------------------
final_elapsed_time_ms = elapsed_ms  # reuse the measurement computed above


In [None]:
# ------------------------------------------------------------------------------------
# The following line is a Jupyter Notebook magic command that loads the cuDF pandas
# extension. This extension allows you to work with GPU-accelerated DataFrames using
# a pandas-like API.
#
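# Note:
#   - This command is only valid in an IPython or Jupyter environment.
#   - In a standard Python script the "%load_ext" syntax raises a SyntaxError;
#     run the script with "python -m cudf.pandas script.py" instead.
#   - RAPIDS recommends loading the extension before pandas is imported. The first
#     cell imports pandas earlier to record a CPU baseline, so the cells below
#     import pandas again after this line.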
# ------------------------------------------------------------------------------------

%load_ext cudf.pandas


In [None]:
"""
This cell loads the Titanic training and testing datasets through the pandas API
and concatenates them vertically for further analysis. With the cudf.pandas
extension loaded in the previous cell, supported operations are intended to run
on the GPU.

Note:
    - The `cupy` module is imported so that GPU array operations are available
      if needed. It is not used in this cell.
"""

# Import the pandas library for data manipulation and analysis.
import pandas as pd

# Import the cupy library for GPU-accelerated numerical computations.
import cupy as cp

# -------------------------
# Step 1: Load Datasets
# -------------------------
# Load the Titanic training dataset from the CSV file.
train = pd.read_csv('./Titanic-Train-Dataset.csv')

# Load the Titanic testing dataset from the CSV file.
test = pd.read_csv('./Titanic-Test-Dataset.csv')

# -------------------------
# Step 2: Concatenate DataFrames
# -------------------------
# Concatenate the training and testing DataFrames vertically.
# The axis=0 parameter indicates vertical concatenation.
concat = pd.concat([train, test], axis=0)

# Optionally, print the shape of the concatenated DataFrame to verify the result.
print("Concatenated DataFrame shape:", concat.shape)


In [None]:
"""
Expand the train and test DataFrames to exactly 1,000,000 rows each by repeating
their rows. Ceiling division gives the number of repeats needed, the repeated copies
are concatenated, and the result is truncated to the target number of rows.
Finally, the expanded DataFrames are placed in a list for further processing.

Expected shapes (the column counts depend on the original CSV files):
    - train_df: (1000000, N_train), e.g. (1000000, 12)
    - test_df:  (1000000, N_test),  e.g. (1000000, 11)
"""

import pandas as pd

# Define the target number of rows for the expanded DataFrames.
TARGET_ROWS = 1_000_000

# -------------------------
# Expand the Training DataFrame
# -------------------------
# Calculate the number of repeats required using ceiling division.
# The formula: -(-TARGET_ROWS // len(train)) ensures that even if TARGET_ROWS is not an
# exact multiple of len(train), the result rounds up.
repeats = -(-TARGET_ROWS // len(train))

# Concatenate copies of the training DataFrame to reach or exceed TARGET_ROWS.
# The ignore_index=True parameter resets the index in the concatenated DataFrame.
# Finally, .head(TARGET_ROWS) ensures that the resulting DataFrame contains exactly TARGET_ROWS rows.
train_df = pd.concat([train] * repeats, ignore_index=True).head(TARGET_ROWS)

# Print the shape of the expanded training DataFrame.
# The column count depends on the original CSV file.
print("Training DataFrame shape:", train_df.shape)  # e.g., (1000000, 12)

# -------------------------
# Expand the Testing DataFrame
# -------------------------
# Calculate the number of repeats needed for the test DataFrame using the same ceiling division.
repeats = -(-TARGET_ROWS // len(test))

# Expand the test DataFrame in the same manner.
test_df = pd.concat([test] * repeats, ignore_index=True).head(TARGET_ROWS)

# Print the shape of the expanded testing DataFrame.
print("Testing DataFrame shape:", test_df.shape)  # e.g., (1000000, 11)

# -------------------------
# Combine the Expanded DataFrames
# -------------------------
# Place the expanded training and testing DataFrames into a list for further processing.
combine = [train_df, test_df]

# Print the shapes of the combined DataFrames for verification.
# Note: The shapes may vary if the original 'train' and 'test' DataFrames have different numbers
#       of columns. For example:
#       - combine[0].shape might be (1000000, 12)
#       - combine[1].shape might be (1000000, 11)
print("Combined DataFrame shapes:")
print("Train part shape:", combine[0].shape)
print("Test part shape:", combine[1].shape)


In [None]:
"""
Print the column names of the expanded training and testing DataFrames.

This snippet assumes that the DataFrames `train_df` and `test_df` have been
previously defined (for example, by loading CSV data and expanding them as shown
in previous code snippets). The output will display the column labels for each
DataFrame, helping to verify that the data has been loaded and processed correctly.
"""

# Print the column names for the training DataFrame.
print("Training DataFrame columns:", train_df.columns)

# Print the column names for the testing DataFrame.
print("Testing DataFrame columns:", test_df.columns)



In [None]:
%%cudf.pandas.profile
# ------------------------------------------------------------------------------------
# This cell profiles a sequence of DataFrame operations on 'train_df' using the
# cuDF pandas extension. The operations include:
#
#   1. Selecting the 'Pclass' and 'Survived' columns from the DataFrame.
#   2. Grouping the data by 'Pclass'. The parameter `as_index=False` ensures that
#      'Pclass' remains a column in the resulting DataFrame rather than becoming the index.
#   3. Calculating the mean for each group (i.e., the mean 'Survived' rate per 'Pclass').
#   4. Sorting the resulting DataFrame by the 'Survived' column in descending order.
#
# These operations allow you to quickly understand how survival rates vary across
# different passenger classes on the Titanic.
# ------------------------------------------------------------------------------------

# Mean survival rate per passenger class, highest first (steps 1-4 above).
train_df[["Pclass", "Survived"]].groupby(
    ["Pclass"], as_index=False
).mean().sort_values(
    by="Survived", ascending=False
)


In [None]:
"""
This cell compares GPU and CPU execution times for the same DataFrame operation.
It assumes that:
    - A CPU runtime (in milliseconds) was measured in the first cell and stored in
      'final_elapsed_time_ms'.
    - The DataFrames 'train_df' and 'test_df' and the list 'combine' have been defined.

The operation drops the 'Ticket' and 'Cabin' columns from both the training and
testing DataFrames with the cudf.pandas accelerator enabled. The cell times this
operation and then prints a table comparing the CPU and GPU runtimes.
"""

import time

# Assume the CPU runtime (in milliseconds) has been measured in a previous cell or block.
cpu_elapsed_ms = final_elapsed_time_ms  # Previously recorded CPU runtime

# -------------------------
# GPU-Accelerated Operation Timing
# -------------------------
# Print the shapes before timing so console output is not counted in the GPU
# runtime (the CPU baseline was also timed without its print statements).
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

# Record the start time for the GPU operation.
start_time = time.perf_counter()

# Drop the 'Ticket' and 'Cabin' columns from the training DataFrame.
train_df_trunc = train_df.drop(['Ticket', 'Cabin'], axis=1)

# Drop the 'Ticket' and 'Cabin' columns from the testing DataFrame.
test_df_trunc = test_df.drop(['Ticket', 'Cabin'], axis=1)

# Update the combined list to include the truncated DataFrames.
combine = [train_df_trunc, test_df_trunc]

# Record the end time for the GPU operation.
end_time = time.perf_counter()

# Print the shapes of the DataFrames after dropping the specified columns.
print("After", train_df_trunc.shape, test_df_trunc.shape, combine[0].shape, combine[1].shape)

# Calculate the GPU runtime in milliseconds.
gpu_elapsed_ms = (end_time - start_time) * 1000

# Print the GPU runtime with four decimal precision.
print(f"GPU runtime: {gpu_elapsed_ms:.4f} milliseconds")

# -------------------------
# Compare the CPU and GPU runtimes and display the results in a table
# -------------------------
# Calculate the time savings by subtracting the GPU runtime from the CPU runtime.
time_savings_ms = cpu_elapsed_ms - gpu_elapsed_ms

# Print a table to compare the CPU and GPU runtimes.
print("\nTime Performance Comparison:")
print("+----------------+---------------+")
print("| Metric         | Time (ms)     |")
print("+----------------+---------------+")
print(f"| CPU Runtime    | {cpu_elapsed_ms:13.4f} |")
print(f"| GPU Runtime    | {gpu_elapsed_ms:13.4f} |")
print("+----------------+---------------+")
print(f"| Savings        | {time_savings_ms:13.4f} |")
print("+----------------+---------------+")


In [None]:
"""
Display DataFrame Information

This snippet prints summary information about the 'train_df' DataFrame.
The information includes:
    - The number of non-null entries per column.
    - The data type of each column.
    - Memory usage details.
    
This is useful for quickly assessing the structure and integrity of the dataset.

Assumptions:
    - The 'train_df' DataFrame has already been loaded (for example, via pd.read_csv()).
"""

# Print detailed information about the train_df DataFrame.
train_df.info()
