# Tips for Data Scientists to Get Started with GPU Acceleration

## Introduction
This notebook showcases functionality that matters to data scientists and shows how RAPIDS accelerates their workflows with its suite of GPU libraries and frameworks.

## Data We'll be Using
We'll be exploring and augmenting the Titanic passenger demographic dataset from Kaggle to showcase how you can apply these functions to your real-world data. The dataset used for this notebook can be downloaded from Kaggle and consists of two files:
- [train](https://www.kaggle.com/code/startupsci/titanic-data-science-solutions/input?select=train.csv) dataset
- [test](https://www.kaggle.com/code/startupsci/titanic-data-science-solutions/input?select=test.csv) dataset

You will need to accept the terms of the competition before you can download the data. Once you do, download both files and place them in the same folder as this notebook before continuing.


## Hello World: exploring cuDF and GPU Acceleration for pandas

%load_ext cudf.pandas loads the cuDF extension for pandas, which enables GPU-accelerated DataFrames.

In [18]:
%load_ext cudf.pandas

The cudf.pandas extension is already loaded. To reload it, use:
  %reload_ext cudf.pandas


Import the libraries, read the Titanic data, and concatenate the train and test sets.

In [19]:
import pandas as pd
import cupy as cp

train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
concat = pd.concat([train, test], axis=0)

Scale up the dataset to demonstrate the advantage of GPU acceleration. The original Titanic dataset is too small to show a difference, so we replicate its rows to simulate a dataset with 1 million rows.

In [20]:
target_rows = 1_000_000
repeats = -(-target_rows // len(train))  # Ceiling division
train_df = pd.concat([train] * repeats, ignore_index=True).head(target_rows)
print(train_df.shape)  # (1000000, 12)

repeats = -(-target_rows // len(test))  # Ceiling division
test_df = pd.concat([test] * repeats, ignore_index=True).head(target_rows)
print(test_df.shape)  # (1000000, 11)

combine = [train_df, test_df]

(1000000, 12)
(1000000, 11)
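The `-(-a // b)` idiom in the cell above is integer ceiling division: floor division rounds toward negative infinity, so negating before and after the division rounds up instead. The sketch below checks the arithmetic on its own. The row counts of 891 (train) and 418 (test) are assumptions based on the standard Kaggle files; the notebook itself does not print them, and `ceil_div` is a hypothetical helper name.

```python
import math

def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b using only integer arithmetic."""
    return -(-a // b)

target_rows = 1_000_000
# Assumed row counts for the standard Kaggle train/test files.
for name, n_rows in {"train": 891, "test": 418}.items():
    repeats = ceil_div(target_rows, n_rows)
    # The idiom agrees with math.ceil on the true quotient.
    assert repeats == math.ceil(target_rows / n_rows)
    # `repeats` copies always cover the target; .head(target_rows) trims the surplus.
    surplus = repeats * n_rows - target_rows
    print(name, repeats, surplus)
# prints:
# train 1123 593
# test 2393 274
```

Rounding up rather than down is what guarantees that the concatenated frame has at least `target_rows` rows before `.head(target_rows)` trims it to exactly one million.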


The cudf.pandas extension runs familiar pandas operations such as filtering, grouping, and merging on the GPU without requiring code changes or rewrites.

In [21]:
filtered_df = train_df[(train_df['Age'] > 30) & (train_df['Fare'] > 50)]
grouped_df = train_df.groupby('Embarked')[['Fare', 'Age']].mean()
additional_info = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'VIP_Status': ['No', 'Yes', 'No']
})
merged_df = train_df.merge(additional_info, on='PassengerId', how='left')

## Tracking Performance: CPU and GPU Runtime Metrics

The %%cudf.pandas.profile cell magic profiles the calls executed on the CPU and the GPU and reports the time each one took. Any operation that the profile lists under CPU fell back to pandas, which points to places where GPU acceleration was not used.


In [22]:
%%cudf.pandas.profile
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

   Pclass  Survived
0       1  0.629592
1       2  0.472810
2       3  0.242378


We can use IPython's %%time and %%timeit cell magics to benchmark specific code blocks by measuring their execution time. Because this environment has cudf.pandas enabled, and there is currently no simple way to turn it off within a running session, we can only show GPU-accelerated runtimes here. We run both examples from the blog and report the GPU measurements. To see the difference, rerun the notebook without loading the cudf.pandas extension.

In [23]:
%%time

print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]

print("After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)

Before (1000000, 12) (1000000, 11) (1000000, 12) (1000000, 11)
After (1000000, 10) (1000000, 9) (1000000, 10) (1000000, 9)
CPU times: user 4.19 ms, sys: 12.8 ms, total: 17 ms
Wall time: 16.3 ms
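Since the notebook can only show GPU timings, one way to collect comparable numbers is to run the same timing helper in two separate sessions, one with cudf.pandas loaded and one without. The following is a minimal sketch using only the standard library. `time_per_loop` and `workload` are hypothetical names, the 7 runs of 10 loops are chosen to resemble a typical %%timeit report, and the standard deviation here is the sample standard deviation, which may differ slightly from what IPython reports.

```python
import statistics
import timeit

def time_per_loop(fn, repeat=7, number=10):
    """Return (mean, std dev) in seconds per call over `repeat` runs of `number` loops."""
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.stdev(per_loop)

def workload():
    # Stand-in workload so the sketch runs anywhere. In the notebook this would be
    # a pandas call such as train_df.groupby('Pclass')['Survived'].mean().
    return sum(i * i for i in range(10_000))

mean_s, std_s = time_per_loop(workload)
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms per loop")
```

Running the same script twice, once under plain pandas and once with the accelerator enabled, would give two figures that can be compared directly.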


In [24]:
%%timeit

for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])

36.8 ms ± 372 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
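The pattern passed to `str.extract` above captures the first run of letters that follows a space and is immediately followed by a period, which in this dataset is the passenger's title. The sketch below reproduces that logic with the standard `re` module so it can be checked without pandas. The sample names are written in the style of the dataset's Name column and `extract_title` is a hypothetical helper, not part of the notebook.

```python
import re

# Same pattern as in the notebook cell: space, letters (captured), literal period.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first letters-only token that follows a space and precedes a period."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

sample_names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
]
for name in sample_names:
    print(f"{name!r} -> {extract_title(name)}")
# prints:
# 'Braund, Mr. Owen Harris' -> Mr
# 'Heikkinen, Miss. Laina' -> Miss
# 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)' -> Mrs
```

Where this helper returns `None` for a name with no match, `str.extract` with `expand=False` would leave a missing value in the resulting Series.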


## Verifying GPU Utilization

Create a CuPy array (cupy.ndarray) of zeros

In [25]:
guess_ages = cp.zeros((2,3))
guess_ages

array([[0., 0., 0.],
       [0., 0., 0.]])

You can check whether an array lives on the CPU or the GPU by calling type() on it to distinguish NumPy arrays from CuPy arrays. If the output is numpy.ndarray, the data is being processed on the CPU. If the output is cupy.ndarray, the data is being processed on the GPU.

In [26]:
type(guess_ages)

cupy.ndarray
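The same type check can be wrapped in a small helper for use in scripts. This is a sketch under the assumption that the object is a plain NumPy or CuPy array, since proxy or wrapper types may report a different module. `array_backend` is a hypothetical name, and the loop tolerates a machine where either library or a GPU is missing.

```python
import importlib

def array_backend(obj):
    """Classify obj as 'gpu', 'cpu', or 'unknown' from the package that defines its type."""
    root_package = type(obj).__module__.split(".")[0]
    return {"cupy": "gpu", "numpy": "cpu"}.get(root_package, "unknown")

# Try each library if it is available; report and continue when it is not.
for lib_name in ("numpy", "cupy"):
    try:
        lib = importlib.import_module(lib_name)
        print(lib_name, "->", array_backend(lib.zeros((2, 3))))
    except Exception as exc:  # missing package, missing CUDA driver, etc.
        print(lib_name, "-> skipped:", type(exc).__name__)
```

In this notebook, `array_backend(guess_ages)` would be expected to return `'gpu'`, matching the `cupy.ndarray` output above.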

Printing the pandas module confirms whether cudf.pandas is active. When it is, the output names the fast path (cuDF) and the slow path (pandas) that the accelerator dispatches to.

In [27]:
print(pd)

<module 'pandas' (ModuleAccelerator(fast=cudf, slow=pandas))>


Commands like df.info() can be used to inspect the structure of a cuDF DataFrame. The class name on the first line of the output confirms that the DataFrame is backed by cuDF.

In [28]:
train_df.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 11 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   PassengerId  1000000 non-null  int64
 1   Survived     1000000 non-null  int64
 2   Pclass       1000000 non-null  int64
 3   Name         1000000 non-null  object
 4   Sex          1000000 non-null  object
 5   Age          801349 non-null   float64
 6   SibSp        1000000 non-null  int64
 7   Parch        1000000 non-null  int64
 8   Fare         1000000 non-null  float64
 9   Embarked     997755 non-null   object
 10  Title        1000000 non-null  object
dtypes: float64(2), int64(5), object(4)
memory usage: 102.7+ MB
