# Environment Setup

This setup is intended for Google Colab. Expect the runtime to restart several times during installation.

## Environment Sanity Check

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4, or P100.

In [None]:
!nvidia-smi

Wed May  3 03:31:25 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   54C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
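
The manual check above can also be scripted. The following is a minimal sketch, assuming `nvidia-smi` is on the path. The `SUPPORTED_GPUS` tuple and both helper names are made up for this example and simply mirror the GPU models named in the text.

```python
import shutil
import subprocess

# GPU models named in the text above. The tuple is an assumption made for
# this sketch, not a list taken from the RAPIDS install scripts.
SUPPORTED_GPUS = ("T4", "P4", "P100")


def is_supported_gpu(name: str) -> bool:
    """Return True if the GPU name contains one of the supported model tokens."""
    tokens = name.replace("-", " ").split()
    return any(token in SUPPORTED_GPUS for token in tokens)


def query_gpu_names() -> list[str]:
    """Ask nvidia-smi for GPU names; return an empty list if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


for gpu in query_gpu_names():
    status = "supported" if is_supported_gpu(gpu) else "not in the supported list"
    print(f"{gpu}: {status}")
```

Matching on whole tokens rather than substrings keeps a name such as `Tesla P40` from being mistaken for a P4.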

## Setup:
This notebook was built on RAPIDS 0.13 stable and is based on this [DataCamp Tutorial](https://www.datacamp.com/community/tutorials/xgboost-in-python). It has also been tested and works on 0.19 stable.

The setup script:
1. Updates gcc in Colab
1. Installs Conda
1. Installs the current stable version of the RAPIDS libraries, as well as some external libraries, including:
   1. cuDF
   1. cuML
   1. cuGraph
   1. cuSpatial
   1. cuSignal
   1. BlazingSQL
   1. xgboost
1. Copies the RAPIDS .so files into the current working directory, a necessary workaround for RAPIDS+Colab integration.


In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this cell and the next one only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS-compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/env-check.py

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
Traceback (most recent call last):
  File "/content/rapidsai-csp-utils/colab/env-check.py", line 28, in <module>
    if ('K80' not in gpu_name):
TypeError: a bytes-like object is required, not 'str'
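
The traceback above is the error Python raises when a `str` is tested for membership in a `bytes` object, so one plausible cause is that the GPU name arrived as `bytes`. The sketch below reproduces that failure and shows one way around it. The helper `as_text` and the sample name are hypothetical and not part of `env-check.py`.

```python
def as_text(value) -> str:
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode("utf-8", errors="replace")
    return str(value)


gpu_name = b"Tesla T4"  # stand-in for a device name returned as bytes

# Reproduce the failure: a str-in-bytes membership test raises TypeError.
try:
    "K80" not in gpu_name
except TypeError as exc:
    print(f"TypeError: {exc}")

# After normalizing to str, the same check runs without error.
print("K80" not in as_text(gpu_name))
```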


In [None]:
# This will update the Colab environment and restart the kernel.  Don't run the next cell until you see the session crash.
!bash rapidsai-csp-utils/colab/update_gcc.sh
import os
os._exit(00)

Updating your Colab environment.  This will restart your kernel.  Don't Panic!

In [None]:
# This will install CondaColab.  This will restart your kernel one last time.  Run this cell by itself and only run the next cell once you see the session crash.
import condacolab
condacolab.install()

✨🍰✨ Everything looks OK!


In [None]:
# you can now run the rest of the cells as normal
import condacolab
condacolab.check()

✨🍰✨ Everything looks OK!


In [None]:
# Installing RAPIDS is now 'python rapidsai-csp-utils/colab/install_rapids.py <release> <packages>'
# The <release> options are 'stable' and 'nightly'.  Leaving it blank or adding any other words will default to stable.
### BELOW SETUP CODE TAKES ABOUT 20 MINUTES TO RUN
!python rapidsai-csp-utils/colab/install_rapids.py stable
import os
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
os.environ['CONDA_PREFIX'] = '/usr/local'

Found existing installation: cffi 1.15.0
Uninstalling cffi-1.15.0:
  Successfully uninstalled cffi-1.15.0
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cffi==1.15.0
  Using cached cffi-1.15.0-cp310-cp310-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (446 kB)
Installing collected packages: cffi
Successfully installed cffi-1.15.0
Installing RAPIDS Stable 22.12
Starting the RAPIDS install on Colab.  This will take about 15 minutes.
Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): ...working... done
failed with initial frozen solve. Retrying with flexible solve.
failed

SpecsConfigurationConflictError: Requested specs conflict with configured specs.
  requested specs:
    - cudatoolkit=11.2
    - dask-sql
    - gcsfs
    - llvmlite
    - mamba==1.4.1
    - opens

In [None]:
!pip install cudf-cu11 cuml-cu11 --extra-index-url=https://pypi.nvidia.com

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://pypi.nvidia.com
Collecting cupy-cuda11x<12.0.0a0,>=9.5.0
  Using cached cupy_cuda11x-11.6.0-cp310-cp310-manylinux1_x86_64.whl (91.2 MB)
Installing collected packages: cupy-cuda11x
Successfully installed cupy-cuda11x-11.6.0

In [None]:
import cudf
import pandas as pd
import cuml
import time

import pynvml
import numpy as np
import xgboost as xgb

from IPython.display import clear_output

# Additional imports for this project
import matplotlib.pyplot as plt
import seaborn as sns
import json
import requests
from copy import deepcopy
from cuml.metrics import mean_absolute_error
from cuml.metrics.regression import r2_score

sns.set_style('darkgrid')

import matplotlib as mpl
mpl.rcParams['agg.path.chunksize'] = 10000

pd.set_option('display.max_rows', 100)

# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

# Set the current working directory
import os
os.chdir('/content/drive/My Drive/')


Mounted at /content/drive/


# Data Import
The data is expected to be preprocessed already; this section loads the cleaned CSV directly.

In [None]:
%cd Advanced_Python_Project/

/content/drive/.shortcut-targets-by-id/1Im2gVDttCXVmzHLr3uVgCYZV0hhnWdo_/Advanced_Python_Project


In [None]:
clean_df = cudf.read_csv('./Data/clean_data.csv', nrows=5_000_001)

In [None]:
clean_df.shape

(5000001, 84)

In [None]:
train_df = clean_df[clean_df['year'].isin([2009, 2010, 2011, 2012, 2013, 2014])]
test_df = clean_df[clean_df['year'].isin([2015])]

In [None]:
X_train, y_train = train_df.drop('fare_amount', axis=1), train_df['fare_amount']
X_test, y_test = test_df.drop('fare_amount', axis=1), test_df['fare_amount']

In [None]:
X_test.shape

(344627, 83)

In [None]:
X_train.shape

(4655374, 83)
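
The split above is temporal: 2009 to 2014 for training and 2015 held out for testing. As a rough illustration, the same idea is sketched below with pandas on a toy frame, since cuDF follows the pandas API for these calls. The toy values and the `trip_distance` column are made up. Unlike the cells above, this sketch treats every row outside the test year as training data.

```python
import pandas as pd

# Toy stand-in for clean_df; the values are made up for illustration.
toy_df = pd.DataFrame({
    "year": [2009, 2011, 2014, 2015, 2015, 2012],
    "trip_distance": [1.2, 3.4, 0.8, 5.1, 2.2, 7.9],
    "fare_amount": [6.5, 12.0, 5.0, 18.5, 9.0, 24.0],
})

TEST_YEAR = 2015

# Rows from the test year are held out; all other rows are used for training.
is_test = toy_df["year"] == TEST_YEAR
train_df = toy_df[~is_test]
test_df = toy_df[is_test]

# The split should cover every row exactly once.
assert len(train_df) + len(test_df) == len(toy_df)

# Separate the target column from the features.
X_train, y_train = train_df.drop(columns="fare_amount"), train_df["fare_amount"]
X_test, y_test = test_df.drop(columns="fare_amount"), test_df["fare_amount"]

print(X_train.shape, X_test.shape)
```

The same row-count check holds for the shapes reported above: 4,655,374 training rows plus 344,627 test rows equals the 5,000,001 rows loaded.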

# Models

In [None]:
# DataFrame that collects the results of each model
res_df = pd.DataFrame(columns=['model', 'train_mae', 'test_mae', 'train_r2', 'test_r2', 'train_time'])

## Linear Regression

In [None]:
# Train and test linear regression model
start_time = time.time()

# Create linear regression model
lr = cuml.LinearRegression()

# Train model
lr.fit(X_train, y_train)

# Predict on train and test data
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

# Calculate metrics
train_mae = mean_absolute_error(y_train.to_cupy().get(), y_train_pred.to_cupy().get()).get()
test_mae = mean_absolute_error(y_test.to_cupy().get(), y_test_pred.to_cupy().get()).get()
train_r2 = r2_score(y_train.to_cupy().get(), y_train_pred.to_cupy().get())
test_r2 = r2_score(y_test.to_cupy().get(), y_test_pred.to_cupy().get())

print(f"Linear Regression: Train MAE: {train_mae:.4f}, Test MAE: {test_mae:.4f}, Train R^2: {train_r2:.4f}, Test R^2: {test_r2:.4f}")
print(f"Linear Regression: Train time: {time.time() - start_time:.2f} seconds")

# Store results in a dataframe
res_df = pd.concat([res_df, pd.DataFrame({'model': ['linear_regression'], 'train_mae': [train_mae], 'test_mae': [test_mae], 'train_r2': [train_r2], 'test_r2': [test_r2], 'train_time': [time.time() - start_time]})], ignore_index=True)


Linear Regression: Train MAE: 2.0467, Test MAE: 2.4124, Train R^2: 0.7412, Test R^2: 0.7624
Linear Regression: Train time: 1.91 seconds
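
Each model cell in this notebook repeats the same metric and bookkeeping steps. One way to factor them out is sketched below. It uses NumPy rather than `cuml.metrics` so that it runs without a GPU, and the helper names `mae`, `r2` and `result_row` are hypothetical.

```python
import numpy as np
import pandas as pd


def mae(y_true, y_pred) -> float:
    """Mean absolute error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r2(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


def result_row(model, y_train, train_pred, y_test, test_pred, train_time) -> dict:
    """Build one row with the same columns as res_df."""
    return {
        "model": model,
        "train_mae": mae(y_train, train_pred),
        "test_mae": mae(y_test, test_pred),
        "train_r2": r2(y_train, train_pred),
        "test_r2": r2(y_test, test_pred),
        "train_time": train_time,
    }


# Toy targets and predictions, made up for illustration.
rows = [result_row("toy_model", [1.0, 2.0, 3.0], [1.0, 2.0, 4.0],
                   [2.0, 4.0], [2.5, 3.5], train_time=0.1)]
toy_results = pd.DataFrame(rows)
print(toy_results)
```

Collecting plain dictionaries and building the DataFrame once at the end avoids the repeated `pd.concat` calls and keeps the metric columns as floats.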


## Random Forest

In [None]:
# Train and test random forest model

start_time = time.time()
# Create random forest model
rf = cuml.RandomForestRegressor(n_estimators=100, max_depth=10)

# Train model
rf.fit(X_train, y_train)

# Predict on train and test data
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)

# Calculate metrics
train_mae = mean_absolute_error(y_train.to_cupy().get(), y_train_pred.to_cupy().get()).get()
test_mae = mean_absolute_error(y_test.to_cupy().get(), y_test_pred.to_cupy().get()).get()
train_r2 = r2_score(y_train.to_cupy().get(), y_train_pred.to_cupy().get())
test_r2 = r2_score(y_test.to_cupy().get(), y_test_pred.to_cupy().get())

print(f"Random Forest: Train MAE: {train_mae:.4f}, Test MAE: {test_mae:.4f}, Train R^2: {train_r2:.4f}, Test R^2: {test_r2:.4f}")
print(f"Random Forest: Train time: {time.time() - start_time:.2f} seconds")

# Store results in the dataframe
res_df = pd.concat([res_df, pd.DataFrame({'model': ['random_forest'], 'train_mae': [train_mae], 'test_mae': [test_mae], 'train_r2': [train_r2], 'test_r2': [test_r2], 'train_time': [time.time() - start_time]})], ignore_index=True)




Random Forest: Train MAE: 1.7146, Test MAE: 1.9156, Train R^2: 0.8308, Test R^2: 0.8377
Random Forest: Train time: 462.58 seconds
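
In the Linear Regression and Random Forest cells, `start_time` is set before fitting and read only after prediction and metric computation, so the reported train time covers all three steps. If fit time alone is of interest, a small timing context manager is one option. The sketch below uses stand-in workloads instead of real models, and the name `timed` is hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, timings: dict):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


timings = {}

with timed("fit", timings):
    total = sum(i * i for i in range(100_000))  # stand-in for model.fit(...)

with timed("predict", timings):
    shifted = [i + 1 for i in range(10_000)]  # stand-in for model.predict(...)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```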


## XGBoost

In [None]:
# Train and test XGBoost model

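# Convert data to DMatrix format
dtrain = xgb.DMatrix(X_train.astype(float), y_train)
dtest = xgb.DMatrix(X_test.astype(float), y_test)

start_time = time.time()

# Train the model
# Note: 'eta' and 'learning_rate' are aliases for the same setting, so only one
# of them should be set. 'num_round' is a command-line option rather than a
# booster parameter; in the Python API the number of rounds is set by
# num_boost_round below.
bst = xgb.train(
    {
        'objective': 'reg:squarederror',
        'max_depth': 9,
        'eta': 0.1,
        'subsample': 0.5,
        'num_round': 100,
        'tree_method': 'gpu_hist',
        'learning_rate': 0.3,
        'colsample_bytree': 1,
        'alpha': .01,
    },
    dtrain,
    num_boost_round=100,
    evals=[(dtrain, 'train')],
)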

# Predict on train and test data
y_train_pred = bst.predict(dtrain)
y_test_pred = bst.predict(dtest)

# Calculate metrics
train_mae = mean_absolute_error(y_train, y_train_pred).get()
test_mae = mean_absolute_error(y_test, y_test_pred).get()
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

clear_output()

print(f"XGBoost: Train MAE: {train_mae:.4f}, Test MAE: {test_mae:.4f}, Train R^2: {train_r2:.4f}, Test R^2: {test_r2:.4f}")
print(f"XGBoost: Train time: {time.time() - start_time:.2f} seconds")

# Store results in the dataframe
res_df = pd.concat([res_df, pd.DataFrame({'model': ['xgboost'], 'train_mae': [train_mae], 'test_mae': [test_mae], 'train_r2': [train_r2], 'test_r2': [test_r2], 'train_time': [time.time() - start_time]})], ignore_index=True)

XGBoost: Train MAE: 1.6308, Test MAE: 1.8610, Train R^2: 0.8497, Test R^2: 0.8468
XGBoost: Train time: 9.94 seconds
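
The parameter dictionary in the XGBoost cell sets both `eta` and `learning_rate`, which the XGBoost documentation lists as aliases for the same setting. A small check like the one below can flag such pairs before training. The helper `find_alias_conflicts` is hypothetical and not part of XGBoost, and the alias map covers only a few documented pairs.

```python
# A few alias pairs listed in the XGBoost parameter documentation.
ALIASES = {
    "eta": "learning_rate",
    "gamma": "min_split_loss",
    "lambda": "reg_lambda",
    "alpha": "reg_alpha",
}


def find_alias_conflicts(params: dict) -> list[tuple[str, str]]:
    """Return (name, alias) pairs where both spellings appear in params."""
    return [
        (name, alias)
        for name, alias in ALIASES.items()
        if name in params and alias in params
    ]


# Same keys and values as the training cell in this notebook.
params = {
    "objective": "reg:squarederror",
    "max_depth": 9,
    "eta": 0.1,
    "subsample": 0.5,
    "num_round": 100,
    "tree_method": "gpu_hist",
    "learning_rate": 0.3,
    "colsample_bytree": 1,
    "alpha": 0.01,
}

print(find_alias_conflicts(params))
```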


## Write Results

In [None]:
print(res_df)

               model          train_mae            test_mae  train_r2  \
0  linear_regression  2.046650726506634  2.4124180990611235  0.741221   
1      random_forest  1.714643345554274   1.915618847291574  0.830829   
2            xgboost  1.630783107229689  1.8609794561602693  0.849680   

    test_r2  train_time  
0  0.762399    1.907229  
1  0.837721  462.579982  
2  0.846826    9.941125  
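
The MAE columns above print at full precision while the R² columns are rounded, which typically indicates that the MAE columns are stored with `object` dtype. The sketch below rebuilds part of the table from the printed values, casts the metric columns to floats, and ranks the models by test MAE. The explicit `object` dtype is an assumption used to mimic that situation.

```python
import pandas as pd

# Values copied from the printed results table above.
results = pd.DataFrame({
    "model": ["linear_regression", "random_forest", "xgboost"],
    "test_mae": pd.Series(
        [2.4124180990611235, 1.915618847291574, 1.8609794561602693],
        dtype=object,  # assumption: mimics the suspected object dtype
    ),
    "test_r2": [0.762399, 0.837721, 0.846826],
})

# Cast the metric columns to float so that rounding and sorting behave numerically.
metric_cols = ["test_mae", "test_r2"]
results = results.astype({col: float for col in metric_cols})

# Rank the models from lowest to highest test MAE.
ranked = results.sort_values("test_mae").reset_index(drop=True)
print(ranked.round(4))
```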


In [None]:
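# The working directory is already Advanced_Python_Project/ (see the %cd cell above),
# so the path is relative to it. This assumes results/ sits alongside Data/.
res_df.to_csv('results/results_gpu_5mil.csv', index=False)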