# cuML's accelerator mode (cuml.accel)

<img src="https://raw.githubusercontent.com/rapidsai-community/tutorial/refs/heads/main/images/cuml-accel-exec-flow.png" style="float: right; margin-left: 5px; width: 300px;">

cuML is a Python GPU library for accelerating machine learning models using a scikit-learn-like API.

cuML now has an accelerator mode (`cuml.accel`) that brings GPU acceleration to existing workflows
with zero code changes.

`cuml.accel` supports workflows built on:

- scikit-learn
- umap-learn (UMAP)
- hdbscan (HDBSCAN)


Estimators that are implemented in cuML will be dispatched to run on the GPU where possible, and fall back to the CPU 
library otherwise. 

If a model is constructed on the GPU and then a method is called that is not implemented in `cuML`, `cuml.accel` will 
reconstruct the model on the CPU and gracefully fall back to the equivalent scikit-learn function instead.


**Attribution:** This section of the tutorial is based on the `cuml.accel` [quickstart notebook](https://colab.research.google.com/github/rapidsai-community/showcase/blob/main/getting_started_tutorials/cuml_sklearn_colab_demo.ipynb) from the RAPIDS documentation.

This notebook is a brief introduction to `cuml.accel`. With classical machine learning, there is a wide range of interesting 
problems we can explore. In this tutorial we'll examine three of the more popular use cases: classification, clustering, 
and dimensionality reduction.

### Data 

If you are running this locally and you followed the steps in notebook [0.Welcome_and_Setup.ipynb](https://github.com/rapidsai-community/tutorial/blob/main/0.Welcome_and_Setup.ipynb), 
you should have the `data/` folder ready to go. 

#### Google Colab Instructions

In the next step we download a script that will allow you to get the data for this notebook session.

In [None]:
# colab: uncomment next line to get the data setup script
#! wget https://raw.githubusercontent.com/rapidsai-community/tutorial/refs/heads/main/data_setup.py

In [None]:
# colab: uncomment next line to download the Covertype and HAR data sets
#! python data_setup.py --cover-type --har

In [None]:
# Verify that you are running with an NVIDIA GPU
! nvidia-smi  # this should display information about available GPUs

# Classification

Let's load a dataset and see how we can use scikit-learn to classify that data. For this example we'll use the Covertype dataset, 
which contains a number of features that can be used to predict forest cover type, such as elevation, aspect, slope, and soil type.

More information on this dataset can be found at [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/31/covertype).

In [1]:
import pandas as pd

In [2]:
columns = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
           'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
           'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3',
           'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6',
           'Soil_Type7', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12', 'Soil_Type13',
           'Soil_Type14', 'Soil_Type15', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20',
           'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27',
           'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34',
           'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Cover_Type']

data = pd.read_csv("data/cover_forest_type.csv", names=columns, header=None)

In [None]:
data.shape

In [None]:
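# Load the accelerator before importing scikit-learn so its estimators can be dispatched to the GPU
%load_ext cuml.accel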

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

We separate the target variable (`Cover_Type`) from the rest of the data. This is what we will aim to predict 
with our classification model. We then split the dataset into training and test sets using scikit-learn's `train_test_split` function.

In [5]:
X, y = data.drop("Cover_Type", axis=1), data["Cover_Type"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
%%time
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    max_features=1.0,
    n_jobs=-1,
)
clf.fit(X_train, y_train)

For reference, training this same model on the CPU takes more than two minutes.  

Using cuML we're able to train this random forest model in just seconds instead of minutes. Note that 
cuML's implementation of `RandomForestClassifier` doesn't use the `n_jobs` parameter the way scikit-learn does, but it still
accepts it, so the same code runs under the accelerator with zero changes.

Let's take a look at the classification report, which also includes the overall accuracy. 

In [None]:
y_pred = clf.predict(X_test)
cr = classification_report(y_test, y_pred)
print(cr)
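
To make the report's columns concrete, here is a minimal sketch that computes precision, recall, F1, and support by hand. The labels are made up for illustration (they are not taken from the Covertype data), and `per_class_metrics` is a hypothetical helper written for this example rather than a scikit-learn function.

```python
# Toy labels, invented for illustration only
y_true = [1, 1, 1, 2, 2, 2, 2, 3]
y_pred = [1, 1, 2, 2, 2, 2, 3, 3]


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1, support) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed members of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, tp + fn


report = {label: per_class_metrics(y_true, y_pred, label) for label in sorted(set(y_true))}
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

for label, (precision, recall, f1, support) in report.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
print(f"accuracy={accuracy:.2f}")
```

Precision answers "of the samples predicted as this class, how many were right?", recall answers "of the samples that truly belong to this class, how many were found?", and support is the number of true samples per class.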

With a model that trains in just seconds, we can perform hyperparameter optimization using a method like grid search and 
have results in minutes instead of hours.

You can try something like this (it will take more than 10 minutes):

```python
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False]
}

grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)
```
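
To see why this search takes a while even on a GPU, it helps to count the model fits it implies. The sketch below uses only the standard library and repeats the grid from above; the one extra fit comes from `GridSearchCV`'s default `refit=True`, which retrains the best candidate on the full training set.

```python
from itertools import product
from math import prod

# Same grid as in the GridSearchCV example
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'bootstrap': [True, False],
}
cv = 5

# Grid search tries every combination of the listed values
n_candidates = prod(len(values) for values in param_grid.values())
cv_fits = n_candidates * cv  # one fit per candidate per fold
total_fits = cv_fits + 1     # plus the final refit of the best candidate

# The candidates themselves can be enumerated with itertools.product
candidates = [dict(zip(param_grid, combo)) for combo in product(*param_grid.values())]

print(n_candidates, cv_fits, total_fits)
print(candidates[0])
```

With 432 candidates and 5 folds, the search trains well over two thousand forests, so per-model training time dominates the total cost.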

Being able to train a model in such a short time allows us to iterate quickly on the hyperparameter configuration and find a 
model that performs better.

For example, let's see what happens with a different `max_depth`.

**Exercise:** Train the `RandomForestClassifier` with a different set of values and analyze the results. 

<details>
  <summary>Solution (click dropdown) </summary>
  <p>

```python
# to run this type it in a code cell
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=30,
    max_features=1.0,
    n_jobs=-1,
)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```
  </p>
</details>


In [None]:
# your solution here

# Clustering

Clustering is an important data science workflow because it helps uncover hidden patterns and structures within data without requiring labeled outcomes. In practice, with high dimensional data it can be difficult to discern whether the clusters we've chosen are good or not. One way to determine the quality of our clustering is with sklearn's [silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), which we'll examine shortly.

HDBSCAN is a popular density-based clustering algorithm that is highly flexible. We'll load a toy sklearn dataset to illustrate how HDBSCAN can be accelerated with `cuml.accel`.

In [11]:
import hdbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

In [12]:
N = 20000
K = 100

X, y = make_blobs(
    n_samples=N,
    n_features=K,
    centers=5,
    cluster_std=[3, 1, 2, 1.5, 0.5],
    random_state=42,
)


In [13]:
clusterer = hdbscan.HDBSCAN()

In [None]:
%%time
clusterer.fit(X)

In [None]:
print(silhouette_score(X, clusterer.labels_))
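
As a rough guide to reading that number: for each sample, the silhouette coefficient compares `a`, the mean distance to the other members of its own cluster, with `b`, the smallest mean distance to the members of any other cluster, as `(b - a) / max(a, b)`. The sketch below works through this on four made-up 1-D points; `toy_silhouette` is a hypothetical helper that assumes every cluster has at least two members and is not a replacement for `sklearn.metrics.silhouette_score`.

```python
from statistics import fmean

# Two tight, well-separated toy clusters on a line (made-up values)
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]


def toy_silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points using absolute distance."""
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        a = fmean([abs(p - q) for j, q in enumerate(points) if labels[j] == lab and j != i])
        # b: smallest mean distance to the members of any other cluster
        b = min(
            fmean([abs(p - q) for j, q in enumerate(points) if labels[j] == other])
            for other in set(labels) - {lab}
        )
        scores.append((b - a) / max(a, b))
    return fmean(scores)


score = toy_silhouette(points, labels)
print(round(score, 4))
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest samples assigned to the wrong cluster. Keep in mind that HDBSCAN labels noise points as `-1`, and the silhouette score treats that label as just another cluster.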

It's important to note that on real-world datasets, the silhouette score produced by the GPU and CPU implementations of HDBSCAN will often have slight differences. The cuML implementation of HDBSCAN should provide equivalent results, but it is normal for the actual clusters to vary slightly when dealing with complex datasets.
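
When comparing a CPU run with a GPU run, it is safer to compare the partitions than the raw label arrays, because the integer assigned to each cluster is arbitrary. The sketch below is one possible way to do that with a plain Rand index (the fraction of sample pairs on which two labelings agree). The label arrays are invented for illustration and `rand_index` is a hypothetical helper; scikit-learn's `adjusted_rand_score` provides a chance-corrected variant of the same idea.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of sample pairs that both labelings treat the same way."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Made-up labelings standing in for two runs of the same clustering
cpu_labels = [0, 0, 0, 1, 1, 2, 2, -1]
gpu_renamed = [2, 2, 2, 0, 0, 1, 1, -1]  # same clusters, different integer names
gpu_shifted = [2, 2, 0, 0, 0, 1, 1, -1]  # one sample assigned to a neighboring cluster

same_arrays = cpu_labels == gpu_renamed
ri_renamed = rand_index(cpu_labels, gpu_renamed)
ri_shifted = rand_index(cpu_labels, gpu_shifted)

print(same_arrays, ri_renamed, round(ri_shifted, 3))
```

The renamed labeling scores a perfect 1.0 even though the arrays are not equal element by element, while moving a single sample lowers the score only slightly.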

Lastly, let's take a look at how we can use cuML's accelerator mode for a third popular machine learning task: dimensionality reduction. 

# Dimensionality Reduction

UMAP is a popular dimensionality reduction technique that is used for both data visualization and as preprocessing for downstream modeling due to its ability to balance preserving both local and global structure of high-dimensional data. To learn more about how it works, visit the [UMAP documentation](https://umap-learn.readthedocs.io/en/latest/).

To explore how cuML can accelerate UMAP, let's load in another dataset from UCI. We'll use the Human Activity Recognition (HAR) dataset, which was created from recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors.

In [16]:
X_train = pd.read_csv(
    "data/UCI_HAR_Dataset/train/X_train.txt", sep=r"\s+", header=None
)
y_train = pd.read_csv(
    "data/UCI_HAR_Dataset/train/y_train.txt", sep=r"\s+", header=None
)
X_test = pd.read_csv(
    "data/UCI_HAR_Dataset/test/X_test.txt", sep=r"\s+", header=None
)
y_test = pd.read_csv(
    "data/UCI_HAR_Dataset/test/y_test.txt", sep=r"\s+", header=None
)
labels = pd.read_csv(
    "data/UCI_HAR_Dataset/activity_labels.txt", sep=r"\s+", header=None
)

In [None]:
X_train.shape

Let's take a look at the activity labels to better understand the data we're working with. We can see that the recorded activities are grouped into 6 different classes.

In [None]:
labels

In [19]:
from sklearn.preprocessing import StandardScaler

# Scale the data before applying UMAP
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
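
Scaling matters here because UMAP works with distances between samples, so a feature with a large numeric range would otherwise dominate. As a minimal sketch of what standardization does to one column, the example below subtracts the mean and divides by the population standard deviation, which is the same convention `StandardScaler` uses. The `standardize` helper and the sample column are hypothetical, and the helper assumes the column is not constant.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation (divides by n)
    return [(x - mu) / sigma for x in column]


column = [1.0, 2.0, 3.0, 4.0]  # made-up feature values
scaled = standardize(column)

print([round(x, 4) for x in scaled])
print(round(fmean(scaled), 4), round(pstdev(scaled), 4))
```

After scaling, every feature contributes on a comparable scale to the distances UMAP uses to build its neighbor graph.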

Let's run UMAP with some basic parameters and explore a lower-dimensionality projection of this dataset.

In [None]:
import umap
umap_model = umap.UMAP(
    n_neighbors=15, n_components=2, random_state=42, min_dist=0.0
)

In [None]:
%%time

# Fit UMAP model to the data
X_train_umap = umap_model.fit_transform(X_train_scaled)

It's often quite interesting to visualize the resulting projection of the embeddings created by UMAP. In this case, let's take a look at the now 2-dimensional dataset.

In [None]:
import matplotlib.pyplot as plt

# Plot the UMAP result
plt.figure(figsize=(10, 8))
plt.scatter(
    X_train_umap[:, 0],
    X_train_umap[:, 1],
    c=y_train.values.ravel(),
    cmap="Spectral",
    s=10,
)
plt.colorbar(label="Activity")
plt.title("UMAP projection of the UCI HAR dataset")
plt.xlabel("UMAP Component 1")
plt.ylabel("UMAP Component 2")
plt.show()


It's interesting to see how our different categories are grouped in relation to one another.

We can look at the trustworthiness score to better understand how well the structure of the original dataset was preserved by our 2D projection.

In [None]:
from sklearn.manifold import trustworthiness
trustworthiness(X_train, X_train_umap, n_neighbors=15)
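
For intuition about what this score measures: trustworthiness penalizes points that appear among a sample's `k` nearest neighbors in the embedding but are not among its `k` nearest neighbors in the original space, and it scales the result so that 1.0 means no such intrusions. The sketch below is a simplified pure-Python re-implementation on a tiny made-up dataset. `toy_trustworthiness` and `neighbor_ranks` are hypothetical helpers, distance ties are broken by index, and `k` is assumed to be smaller than half the number of samples; use `sklearn.manifold.trustworthiness` for real data.

```python
import math


def neighbor_ranks(points, i):
    """Map every other point j to its rank by distance from point i (1 = nearest)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: (math.dist(points[i], points[j]), j))
    return {j: rank for rank, j in enumerate(others, start=1)}


def toy_trustworthiness(X, X_emb, k):
    n = len(X)
    penalty = 0
    for i in range(n):
        rank_in = neighbor_ranks(X, i)       # ranks in the original space
        rank_out = neighbor_ranks(X_emb, i)  # ranks in the embedding
        for j, r in rank_out.items():
            if r <= k:  # j is one of i's k nearest neighbors in the embedding
                penalty += max(0, rank_in[j] - k)
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty


# Six made-up points on a line in 2-D, embedded into 1-D in two ways
X = [(float(x), 0.0) for x in range(6)]
faithful = [(float(x),) for x in range(6)]                      # keeps the order
scrambled = [(0.0,), (3.0,), (1.0,), (4.0,), (2.0,), (5.0,)]    # breaks neighborhoods

t_faithful = toy_trustworthiness(X, faithful, k=2)
t_scrambled = toy_trustworthiness(X, scrambled, k=2)
print(t_faithful, round(t_scrambled, 3))
```

An embedding that keeps every neighborhood intact scores exactly 1.0, while one that pulls distant points together scores lower.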


## Conclusion

In this notebook, we learned how `cuml.accel` works with familiar libraries by providing GPU acceleration with zero code changes:

- Fast `RandomForestClassifier` training for classification.
- Fast data normalization with `StandardScaler`.
- Efficient `UMAP` dimensionality reduction.
- Efficient clustering with `HDBSCAN`.
- A high trustworthiness score for the projection, at reduced compute time.


For more information on getting started with `cuml.accel`, check out [RAPIDS.ai](https://rapids.ai/cuml-accel/) or the [cuML Docs](https://docs.rapids.ai/api/cuml/stable/).

