# Install Package

In [1]:
!pip install \
    --extra-index-url=https://pypi.nvidia.com \
    "cuml-cu12==25.8.*"

Looking in indexes: https://pypi.org/simple, https://pypi.nvidia.com


# Import Dataset

The <b>Forest CoverType dataset</b> is a widely used, real-world dataset from the U.S. Forest Service designed for machine learning applications. It contains data describing various physical and ecological characteristics of forested areas in the Roosevelt National Forest, Colorado. Each data point represents a plot of land, described by environmental features, with the goal of predicting the <b>forest cover type</b> — the dominant tree species or group present at that site.

* <b>Samples</b>: 581,012 forest plots (rows)
* <b>Features</b>: 54 attributes per plot
  * 10 continuous features (e.g., elevation, slope, aspect, distances to hydrology/road/fire points, hillshade values)
  * 44 binary features (one-hot encoded: 4 wilderness areas, 40 soil types)
* <b>Target</b>: Forest cover type (integer 1–7), each value representing a different dominant tree species or group (e.g., Spruce/Fir, Lodgepole Pine, Aspen)
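
As a quick sanity check on the feature breakdown above, the sketch below rebuilds the expected column names and counts them by group. It assumes the naming pattern `Wilderness_Area1..4` and `Soil_Type1..40`; `feature_group` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter

# The 10 continuous columns of the Covertype feature table.
CONTINUOUS = [
    "Elevation", "Aspect", "Slope",
    "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
    "Horizontal_Distance_To_Roadways",
    "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
    "Horizontal_Distance_To_Fire_Points",
]


def feature_group(column):
    """Classify a column name (hypothetical helper, not part of any library)."""
    if column.startswith("Wilderness_Area"):
        return "wilderness"
    if column.startswith("Soil_Type"):
        return "soil"
    return "continuous"


# Assumed naming pattern: Wilderness_Area1..4 and Soil_Type1..40.
columns = (
    CONTINUOUS
    + [f"Wilderness_Area{i}" for i in range(1, 5)]
    + [f"Soil_Type{i}" for i in range(1, 41)]
)

counts = Counter(feature_group(c) for c in columns)
print(dict(counts))
print("total features:", len(columns))
```

Once the dataset is loaded, applying the same helper to the real column names should give the same three counts, provided the naming assumption holds.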

In [2]:
%load_ext cudf.pandas

from ucimlrepo import fetch_ucirepo

# Fetch the Covertype dataset (UCI repository id 31)
covertype = fetch_ucirepo(id=31)

# Data (as pandas DataFrames)
X = covertype.data.features  # Independent variables
y = covertype.data.targets   # Dependent variable

In [3]:
X.head()

,Elevation,Aspect,Slope,Horizontal_Distance_To_Hydrology,Vertical_Distance_To_Hydrology,Horizontal_Distance_To_Roadways,Hillshade_9am,Hillshade_Noon,Hillshade_3pm,Horizontal_Distance_To_Fire_Points,...,Soil_Type34,Soil_Type35,Soil_Type36,Soil_Type37,Soil_Type38,Soil_Type39,Soil_Type40,Wilderness_Area2,Wilderness_Area3,Wilderness_Area4
0,2596,51,3,258,0,510,221,232,148,6279,...,0,0,0,0,0,0,0,0,0,0
1,2590,56,2,212,-6,390,220,235,151,6225,...,0,0,0,0,0,0,0,0,0,0
2,2804,139,9,268,65,3180,234,238,135,6121,...,0,0,0,0,0,0,0,0,0,0
3,2785,155,18,242,118,3090,238,238,122,6211,...,0,0,0,0,0,0,0,0,0,0
4,2595,45,2,153,-1,391,220,234,150,6172,...,0,0,0,0,0,0,0,0,0,0


In [4]:
X.info()

<class 'cudf.core.dataframe.DataFrame'>
RangeIndex: 581012 entries, 0 to 581011
Data columns (total 54 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Elevation                           581012 non-null  int64
 1   Aspect                              581012 non-null  int64
 2   Slope                               581012 non-null  int64
 3   Horizontal_Distance_To_Hydrology    581012 non-null  int64
 4   Vertical_Distance_To_Hydrology      581012 non-null  int64
 5   Horizontal_Distance_To_Roadways     581012 non-null  int64
 6   Hillshade_9am                       581012 non-null  int64
 7   Hillshade_Noon                      581012 non-null  int64
 8   Hillshade_3pm                       581012 non-null  int64
 9   Horizontal_Distance_To_Fire_Points  581012 non-null  int64
 10  Wilderness_Area1                    581012 non-null  int64
 11  Soil_Type1                          581012 non-nul

# Machine Learning

Machine learning workflows often involve computationally intensive operations, particularly when working with large datasets. Traditional CPU-based libraries like scikit-learn are widely used but may struggle to deliver optimal performance at scale. NVIDIA's cuML, a GPU-accelerated machine learning library, addresses this challenge by leveraging the power of GPUs to significantly accelerate common machine learning algorithms while maintaining compatibility with scikit-learn.
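
In a notebook, the acceleration is switched on with the `%load_ext cuml.accel` magic. Outside a notebook, cuML's documentation describes a programmatic entry point, `cuml.accel.install()`. The sketch below wraps it in a hypothetical helper, `enable_gpu_acceleration`, that falls back to plain CPU scikit-learn when cuML is not installed. Treat it as an illustration under that assumption rather than a recommended setup.

```python
def enable_gpu_acceleration():
    """Try to turn on cuML's scikit-learn acceleration; return True on success.

    Hypothetical helper: falls back to CPU scikit-learn when cuML is missing.
    """
    try:
        import cuml.accel
    except ImportError:
        return False
    # Call this before importing scikit-learn so the acceleration can apply.
    cuml.accel.install()
    return True


accelerated = enable_gpu_acceleration()
print("GPU acceleration enabled:", accelerated)
```

Any `from sklearn... import ...` statements would come after this call, and the rest of the script stays unchanged.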

In this guide, we explore how to use cuML for key machine learning algorithms, comparing them to their scikit-learn counterparts. The algorithms covered include:

* <b>Random Forest Classifier</b>: An ensemble learning method that builds multiple decision trees and combines their outputs for robust and accurate classification.
* <b>Logistic Regression</b>: A linear model for binary or multi-class classification, useful for tasks where the target variable is categorical.
* <b>K-Nearest Neighbors Classifier</b>: A non-parametric algorithm that assigns a class to a data point based on the majority class of its nearest neighbors.
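
To make the "majority class" idea concrete, here is a minimal pure-Python sketch of a k-nearest-neighbors vote on made-up data. The function names are hypothetical. This is not how scikit-learn or cuML implement the algorithm.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; ties go to the label seen first."""
    return Counter(labels).most_common(1)[0][0]


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by the majority label of its k nearest training points."""
    # Squared Euclidean distance is enough for ranking neighbors.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(point, query)), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return majority_vote([label for _, label in nearest])


# Toy data (made up for illustration): two features, two cover types.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = [1, 1, 1, 2, 2]

print(knn_predict(points, labels, query=(0.15, 0.15)))
print(knn_predict(points, labels, query=(0.95, 1.0)))

# A hard vote over per-tree predictions works the same way.
print(majority_vote([2, 1, 2, 2, 7]))
```

Note that scikit-learn's random forest averages the trees' class probabilities, which is a soft variant of this hard vote.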

We start with classification models that predict forest cover type from features such as elevation, aspect, slope, and soil type.

In [5]:
%load_ext cuml.accel

from sklearn.model_selection import train_test_split, GridSearchCV 
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

Split the data into training and test sets (80/20), stratified by cover type.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y)
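
The `stratify` argument keeps the class proportions the same in both splits. The sketch below shows this on made-up imbalanced labels. The `random_state` argument is an addition here so the toy example is repeatable; the split above does not set it.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made up): 80 samples of class 1, 20 of class 2.
y_toy = np.array([1] * 80 + [2] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# random_state is an assumption added for repeatability.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Both splits keep the 4:1 ratio of the full label set. This matters when some classes are much rarer than others.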

Standardize the independent variables (zero mean, unit variance), fitting the scaler on the training set only.

In [7]:
# Scale
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
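
As a small illustration of what the scaler does, the sketch below compares `StandardScaler` with the manual formula z = (x - mean) / std on made-up values. The statistics come from the training rows only and are reused for the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy elevation and slope values (made up for illustration).
train = np.array([[2500.0, 3.0], [2700.0, 9.0], [2900.0, 18.0]])
test = np.array([[2600.0, 6.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training statistics

# Manual equivalent: z = (x - mean) / std, using the population std (ddof=0).
mean = train.mean(axis=0)
std = train.std(axis=0)
manual_test = (test - mean) / std

print(test_scaled)
print(np.allclose(test_scaled, manual_test))
```

Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.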

### Random Forest Classifier

Fit a Random Forest model on the training set and make predictions on the test set.

In [9]:
%%cuml.accel.profile

# fit Random Forest model
rfc = RandomForestClassifier(n_estimators = 100, max_depth = 5, min_samples_split = 5, 
                             min_samples_leaf = 4, max_features = 1.0, bootstrap = True)
rfc.fit(X_train, np.array(y_train).ravel())

# predict using Random Forest model
y_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [12]:
%%cuml.accel.line_profile

# fit Random Forest model
rfc = RandomForestClassifier(n_estimators = 100, max_depth = 5, min_samples_split = 5, 
                             min_samples_leaf = 4, max_features = 1.0, bootstrap = True)
rfc.fit(X_train, np.array(y_train).ravel())

# predict using Random Forest model
y_pred = rfc.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [13]:
acc

0.7057649114050412
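
Overall accuracy can hide poor performance on rare classes. The notebook imports `classification_report`, which gives per-class precision and recall. The sketch below uses made-up labels, not the model's predictions, to show how the two views can differ.

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy labels (made up): class 1 is common, class 7 is rare.
y_true = [1, 1, 1, 1, 2, 2, 2, 7]
y_hat = [1, 1, 1, 1, 2, 2, 1, 1]

toy_acc = accuracy_score(y_true, y_hat)
print(f"accuracy: {toy_acc:.3f}")

# zero_division=0 avoids warnings for classes that are never predicted.
print(classification_report(y_true, y_hat, zero_division=0))

# The same report as a dictionary, keyed by the class label as a string.
report = classification_report(y_true, y_hat, zero_division=0, output_dict=True)
print("recall for class 7:", report["7"]["recall"])
```

Here the accuracy looks reasonable even though the rare class is never predicted correctly. Running the report on `y_test` and `y_pred` would show whether the same happens for the less common cover types.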

Search for better hyperparameters with GridSearchCV, then make predictions with the best model found.

In [16]:
%%cuml.accel.profile

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', 1.0]
}

grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)
grid_search.fit(X_train, np.array(y_train).ravel())

y_pred = grid_search.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [17]:
%%cuml.accel.line_profile

# Define the parameter grid to search over
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2', 1.0]
}

grid_search = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=3)
grid_search.fit(X_train, np.array(y_train).ravel())

y_pred = grid_search.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [18]:
acc

0.9386504651342908
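
To get a sense of the cost of this search, the sketch below counts the candidate combinations in the grid with scikit-learn's `ParameterGrid`. The fit count assumes the 3-fold cross-validation used above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2", 1.0],
}
cv_folds = 3

n_candidates = len(ParameterGrid(param_grid))
n_fits = n_candidates * cv_folds  # one fit per candidate per fold
print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training set using the best combination, which is available afterwards as `grid_search.best_params_`.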

### Logistic Regression

Fit a Logistic Regression model on the training set and make predictions on the test set.

In [19]:
%%cuml.accel.profile

lr = LogisticRegression(max_iter = 500)
lr.fit(X_train, np.array(y_train).ravel())  # flatten the single-column target to 1-D

y_pred = lr.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [20]:
%%cuml.accel.line_profile

lr = LogisticRegression(max_iter = 500)
lr.fit(X_train, np.array(y_train).ravel())  # flatten the single-column target to 1-D

y_pred = lr.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [21]:
acc

0.7243186492603462
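
For a multi-class target such as cover type, a common formulation of logistic regression passes linear scores through a softmax to obtain class probabilities. The sketch below shows only that prediction step, with made-up weights. It is not the solver that scikit-learn or cuML use, and the weights are not taken from the fitted model.

```python
import numpy as np


def softmax(scores):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Made-up weights for 2 scaled features and 3 classes (illustration only).
W = np.array([[2.0, -1.0], [-1.0, 2.0], [0.0, 0.0]])  # shape (classes, features)
b = np.array([0.0, 0.0, 0.5])

x = np.array([[1.5, -0.5], [-1.0, 1.0]])  # two samples
probs = softmax(x @ W.T + b)
pred = probs.argmax(axis=1)  # index of the most probable class per sample

print(probs.round(3))
print(pred)
```

Because the scores are linear in the features, the decision boundaries are linear as well. That may be one reason this model trails the tree-based and neighbor-based models on this dataset.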

### K-Nearest Neighbors

Fit a K-Nearest Neighbors model on the training set and make predictions on the test set.

In [22]:
%%cuml.accel.profile

# fit KNN model
knn = KNeighborsClassifier()
knn.fit(X_train, np.array(y_train).ravel())  # flatten the single-column target to 1-D

# predict using KNN model
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [23]:
%%cuml.accel.line_profile

# fit KNN model
knn = KNeighborsClassifier()
knn.fit(X_train, np.array(y_train).ravel())  # flatten the single-column target to 1-D

# predict using KNN model
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)

In [24]:
acc

0.9288486527886544
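
To compare the models side by side, the sketch below collects the test-set accuracies printed above and ranks them. The numbers come from a single unseeded train/test split, so small differences should be read with caution.

```python
# Test-set accuracies reported by the runs above.
results = {
    "Random Forest (fixed hyperparameters)": 0.7057649114050412,
    "Random Forest (grid search)": 0.9386504651342908,
    "Logistic Regression": 0.7243186492603462,
    "K-Nearest Neighbors": 0.9288486527886544,
}

# Sort from highest to lowest accuracy.
ranked = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, score in ranked:
    print(f"{name:<40} {score:.4f}")

best_model = ranked[0][0]
print("best:", best_model)
```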