# Random Forest

## Spark

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1280px-Apache_Spark_logo.svg.png" width="400">

**Hardware**: 20 nodes, r5.2xlarge (8 vCPUs, 64 GiB RAM each)

# Load data

In [4]:
# Stop a session left over from a previous run (only valid if `spark` already exists)
spark.stop()

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

# The RAPIDS jars must be on both the driver and the executor classpath
rapids_jars = ":".join([
    "/tmp/lib/cudf-22.06.0-cuda11.jar",
    "/tmp/lib/rapids-4-spark_2.12-22.06.0.jar",
    "/tmp/lib/rapids-4-spark-ml_2.12-22.08.0-SNAPSHOT.jar",
])

spark = (SparkSession
         .builder
         .appName("pyspark-rf-benchmark-gpu")
#         .master("local[1]")
#         .master("spark://ecs-python2:7077")
#         .master("spark://172.17.0.2:7077")
         .master("spark://master:7077")
         .config("spark.driver.extraClassPath", rapids_jars)
         .config("spark.executor.extraClassPath", rapids_jars)
         .config('spark.plugins','com.nvidia.spark.SQLPlugin')
         .config('spark.executor.memory', '16G')
         .config('spark.driver.memory', '16G')
         .config('spark.driver.maxResultSize', '16G')
         .getOrCreate())

#spark.conf.set('spark.rapids.sql.enabled','true')

#print(spark.conf.get('spark.driver.extraClassPath'))
#print(spark.conf.get('spark.executor.extraClassPath'))
#print(spark.conf.get('spark.rapids.sql.enabled'))

#sc = spark.sparkContext
spark

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/30 14:29:19 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/12/30 14:29:20 WARN RapidsPluginUtils: RAPIDS Accelerator 22.06.0 using cudf 22.06.0.
22/12/30 14:29:20 WARN RapidsPluginUtils: RAPIDS Accelerator is enabled, to disable GPU support set `spark.rapids.sql.enabled` to false.


In [2]:
#import s3fs
import functools
from pyspark.sql.types import (StructType, StructField, DoubleType,
                               TimestampType, StringType)
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

In [3]:
# Schema for reading the raw CSV files (inferSchema in read.csv is quite slow).
# It is not used below: the Parquet file loaded in the next cell carries its own schema.
schema = StructType([
    StructField('VendorID', DoubleType()),
    StructField('tpep_pickup_datetime', TimestampType()),
    StructField('tpep_dropoff_datetime', TimestampType()),
    StructField('passenger_count', DoubleType()),
    StructField('trip_distance', DoubleType()),
    StructField('RateCodeID', DoubleType()),
    StructField('store_and_fwd_flag', StringType()),
    #StructField('PULocationID', DoubleType()),
    #StructField('DOLocationID', DoubleType()),
    StructField('pickup_longitude', DoubleType()),
    StructField('pickup_latitude', DoubleType()), 
    StructField('dropoff_longitude', DoubleType()), 
    StructField('dropoff_latitude', DoubleType()),
    StructField('payment_type', DoubleType()),
    StructField('fare_amount', DoubleType()),
    StructField('extra', DoubleType()),
    StructField('mta_tax', DoubleType()),
    StructField('tip_amount', DoubleType()),
    StructField('tolls_amount', DoubleType()),
    StructField('improvement_surcharge', DoubleType()),
    StructField('total_amount', DoubleType()),
    StructField('congestion_surcharge', DoubleType()),
])

In [4]:
#path = "/rapids/notebooks/host/dataset/nyc-taxi/yellow_tripdata_2015.parquet"
path = "/home/cloud/dataset/nyc-taxi/yellow_tripdata_2015.parquet"
df = spark.read.parquet(path)

22/12/30 14:31:40 ERROR TaskSchedulerImpl: Lost executor 1 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/12/30 14:32:28 ERROR TaskSchedulerImpl: Lost executor 3 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/12/30 14:33:16 ERROR TaskSchedulerImpl: Lost executor 4 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/12/30 14:34:04 ERROR TaskSchedulerImpl: Lost executor 5 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.


In [6]:
%%time
taxi=df
print(f"{taxi.count(): }")




 146112989
CPU times: user 4.89 ms, sys: 115 µs, total: 5 ms
Wall time: 2.91 s


# Feature engineering

In [7]:
# Time-based features; F.dayofweek returns 1 = Sunday ... 7 = Saturday
taxi = taxi.withColumn('pickup_weekday', F.dayofweek(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_hour', F.hour(taxi.tpep_pickup_datetime).cast(DoubleType()))
taxi = taxi.withColumn('pickup_minute', F.minute(taxi.tpep_pickup_datetime).cast(DoubleType()))
# Hour within the week: weekday * 24 + hour
taxi = taxi.withColumn('pickup_week_hour', ((taxi.pickup_weekday * 24) + taxi.pickup_hour).cast(DoubleType()))
# Encode the flag as 1 for 'Y' and 0 for anything else (including nulls)
taxi = taxi.withColumn('store_and_fwd_flag', F.when(taxi.store_and_fwd_flag == 'Y', 1).otherwise(0))
# Spark ML expects "label" column for dependent variable
taxi = taxi.withColumn('label', taxi.total_amount)
# Replace nulls in numeric columns with -1
taxi = taxi.fillna(-1)
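
To make the `pickup_week_hour` encoding concrete, here is a minimal pure-Python sketch of the same arithmetic for a single timestamp. It assumes Spark's `dayofweek` numbering (1 = Sunday through 7 = Saturday). The helper names `spark_dayofweek` and `pickup_week_hour` are hypothetical and are not part of this notebook or of PySpark.

```python
from datetime import datetime


def spark_dayofweek(ts: datetime) -> int:
    """Mimic Spark's dayofweek numbering: 1 = Sunday, ..., 7 = Saturday."""
    # isoweekday() returns 1 = Monday, ..., 7 = Sunday
    return ts.isoweekday() % 7 + 1


def pickup_week_hour(ts: datetime) -> float:
    """Same arithmetic as the pickup_week_hour column: weekday * 24 + hour."""
    return float(spark_dayofweek(ts) * 24 + ts.hour)


# 2015-01-01 13:45 was a Thursday, so the Spark-style weekday is 5
example = datetime(2015, 1, 1, 13, 45)
print(spark_dayofweek(example), pickup_week_hour(example))
```

Because the weekday starts at 1 rather than 0, the feature ranges from 24 (Sunday, hour 0) to 191 (Saturday, hour 23).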

In [8]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.pipeline import Pipeline

features = ['pickup_weekday', 'pickup_hour', 'pickup_minute',
            'pickup_week_hour', 'passenger_count', 'VendorID', 
            'RateCodeID', 'store_and_fwd_flag', 'pickup_longitude', 'pickup_latitude', 
            'dropoff_longitude', 'dropoff_latitude']

assembler = VectorAssembler(
    inputCols=features,
    outputCol='features',
)

pipeline = Pipeline(stages=[assembler])
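
Conceptually, `VectorAssembler` concatenates the listed input columns, in order, into one numeric vector per row. The sketch below illustrates that idea on a plain Python dict. It is not Spark's implementation: the helper `assemble` is hypothetical and the sample row values are made up for illustration.

```python
from typing import Dict, List

# Same column order as the `features` list passed to VectorAssembler
FEATURES = ['pickup_weekday', 'pickup_hour', 'pickup_minute',
            'pickup_week_hour', 'passenger_count', 'VendorID',
            'RateCodeID', 'store_and_fwd_flag', 'pickup_longitude',
            'pickup_latitude', 'dropoff_longitude', 'dropoff_latitude']


def assemble(row: Dict[str, float], input_cols: List[str]) -> List[float]:
    """Concatenate the selected columns of one row into a flat list of floats."""
    return [float(row[col]) for col in input_cols]


# Made-up sample row; columns that are not features (here `label`) are ignored
row = {
    'pickup_weekday': 5, 'pickup_hour': 13, 'pickup_minute': 45,
    'pickup_week_hour': 133, 'passenger_count': 2, 'VendorID': 1,
    'RateCodeID': 1, 'store_and_fwd_flag': 0,
    'pickup_longitude': -73.99, 'pickup_latitude': 40.75,
    'dropoff_longitude': -73.98, 'dropoff_latitude': 40.76,
    'label': 12.5,
}

vector = assemble(row, FEATURES)
print(len(vector), vector[:4])
```

The position of each value in the vector follows the order of `inputCols`, which is how feature indices reported by a fitted model can be mapped back to column names.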

In [9]:
%%time
assembler_fitted = pipeline.fit(taxi)
X = assembler_fitted.transform(taxi)

# cache() is lazy; count() forces the assembled DataFrame to be computed and cached
X.cache()
X.count()



22/12/30 14:36:30 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.




22/12/30 14:36:34 ERROR TaskSchedulerImpl: Lost executor 12 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/12/30 14:37:15 ERROR TaskSchedulerImpl: Lost executor 13 on 10.200.0.14: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.




22/12/30 14:37:28 ERROR TaskSchedulerImpl: Lost executor 14 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.




22/12/30 14:38:48 ERROR TaskSchedulerImpl: Lost executor 16 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/12/30 14:38:48 ERROR TaskSchedulerImpl: Lost executor 15 on 10.200.0.14: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.




CPU times: user 81.9 ms, sys: 30.7 ms, total: 113 ms
Wall time: 3min 28s


146112989

# Train random forest!

In [10]:
from pyspark.ml.regression import RandomForestRegressor
rf = RandomForestRegressor(numTrees=100, maxDepth=10, seed=42)

In [None]:
%%time
fitted = rf.fit(X)



22/12/30 14:41:07 ERROR TaskSchedulerImpl: Lost executor 18 on 10.200.0.14: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/12/30 14:41:07 ERROR TaskSchedulerImpl: Lost executor 17 on 10.200.0.12: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.




In [None]:
spark.stop()