# Welcome to the Profiling Tool for the RAPIDS Accelerator for Apache Spark
To run the tool, enter the DBFS path that holds your Spark GPU event logs, then select "Run all" to execute the notebook. When the notebook completes, the output tables described below appear under the corresponding cells.

## GPU Job Tuning Recommendations
This section lists general suggestions for tuning your applications to run optimally on GPUs.

## Per-App Profile
The profiler output includes information about the application, data sources, executors, SQL stages, Spark properties, and key application metrics at the job and stage levels.

In [0]:
import pandas as pd

In [0]:
%sh java -Xmx10g -cp /databricks/driver/rapids-4-spark-tools_2.12-22.12.0.jar:/databricks/driver/spark-3.3.1-bin-hadoop3/jars/* com.nvidia.spark.rapids.tool.profiling.ProfileMain --csv --auto-tuner /dbfs/FileStore/logs/

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/01/10 09:39:34 INFO Profiler: Threadpool size is 1
23/01/10 09:39:34 INFO ApplicationInfo: Parsing Event Log: file:/dbfs/FileStore/logs
23/01/10 09:39:34 WARN ApplicationInfo: ClassNotFoundException: DBCEventLoggingListenerMetadata
23/01/10 09:39:44 WARN ApplicationInfo: ClassNotFoundException: DBCEventLoggingListenerMetadata
23/01/10 09:39:44 INFO ApplicationInfo: Total number of events parsed: 65285 for file:/dbfs/FileStore/logs
23/01/10 09:39:44 WARN ApplicationInfo: Application End Time is unknown, estimating based on job and sql end times!
23/01/10 09:39:44 INFO Profiler: Took 10572ms to process file:/dbfs/FileStore/logs
23/01/10 09:39:45 INFO ToolTextFileWriter: Application Information CSV: output location: rapids_4_spark_profile/app-20230109071630-0000/application_information.csv
23/01/10 09:39:45 INFO ToolTextFileWriter: Application Log Path Mapping CSV: output location: rapids_4_spark_profile/a
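
The `%sh` cell above hard-codes the event log location. As a minimal sketch, the same command line could be assembled in Python so that the log path becomes a parameter. The helper name `build_profiler_command` and its `heap` argument are hypothetical; the jar locations, main class, and flags are copied from the cell above and may differ on your cluster.

```python
import shlex

# Defaults copied from the %sh cell in this notebook; adjust to your cluster.
TOOLS_JAR = "/databricks/driver/rapids-4-spark-tools_2.12-22.12.0.jar"
SPARK_JARS = "/databricks/driver/spark-3.3.1-bin-hadoop3/jars/*"
MAIN_CLASS = "com.nvidia.spark.rapids.tool.profiling.ProfileMain"


def build_profiler_command(log_path, heap="10g"):
    """Return the argument list for the profiling tool (hypothetical helper)."""
    return [
        "java",
        f"-Xmx{heap}",
        "-cp",
        f"{TOOLS_JAR}:{SPARK_JARS}",
        MAIN_CLASS,
        "--csv",
        "--auto-tuner",
        log_path,
    ]


# Print the command for inspection instead of running it.
cmd = build_profiler_command("/dbfs/FileStore/logs/")
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the tool without a shell; the JVM itself expands the `jars/*` classpath wildcard.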

In [0]:
import os

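# Collect the ID and name of every profiled application (one output directory per app).
app_frames = []
for x in os.scandir("rapids_4_spark_profile/"):
  if not x.is_dir():
    continue
  tmp_df = pd.read_csv(os.path.join(x.path, "application_information.csv"))
  app_frames.append(tmp_df[['appId', 'appName']])

# DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
app_df = pd.concat(app_frames, ignore_index=True) if app_frames else pd.DataFrame(columns=['appId', 'appName'])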
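
The loop above reads only `application_information.csv`. Because the tool was run with `--csv`, each application directory may hold further CSV files. The following is a rough sketch, using only the Python standard library, that loads every CSV found in one application directory. The function name `load_profile_csvs` is hypothetical, and no file names beyond `application_information.csv` are assumed.

```python
import csv
import os


def load_profile_csvs(app_dir):
    """Map each CSV file name in app_dir to its rows (a list of dicts)."""
    tables = {}
    for entry in sorted(os.scandir(app_dir), key=lambda e: e.name):
        if entry.is_file() and entry.name.endswith(".csv"):
            with open(entry.path, newline="") as f:
                tables[entry.name] = list(csv.DictReader(f))
    return tables


if __name__ == "__main__":
    root = "rapids_4_spark_profile"
    # Only runs where the profiler output directory already exists.
    if os.path.isdir(root):
        for app in os.scandir(root):
            if app.is_dir():
                for name, rows in load_profile_csvs(app.path).items():
                    print(app.name, name, len(rows), "rows")
```

Each value is a plain list of dictionaries keyed by column name, so it can be handed to `pd.DataFrame(rows)` when a DataFrame is preferred.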

## GPU Job Tuning Recommendations

In [0]:
app_list = app_df["appId"].tolist()
rows = []

for app in app_list:
  recommendations_started = False
  recommendations_str = ""
  # Keep everything that follows the "Recommended Configuration" header in profile.log.
  with open("rapids_4_spark_profile/" + app + "/profile.log") as app_file:
    for line in app_file:
      if recommendations_started:
        recommendations_str += line
      if "### D. Recommended Configuration ###" in line:
        recommendations_started = True
  rows.append({'app': app, 'recommendations': recommendations_str})

# Build the frame once; DataFrame.append was removed in pandas 2.0.
app_recommendations = pd.DataFrame(rows, columns=['app', 'recommendations'])
display(app_recommendations)

app,recommendations
app-20230109071630-0000,"Cannot recommend properties. See Comments. Comments: - java.io.FileNotFoundException: File worker_info.yaml does not exist - 'spark.executor.memory' should be set to at least 2GB/core. - 'spark.executor.instances' should be set to (gpuCount * numWorkers). - 'spark.task.resource.gpu.amount' should be set to Max(1, (numCores / gpuCount)). - 'spark.rapids.sql.concurrentGpuTasks' should be set to Max(4, (gpuMemory / 8G)). - 'spark.rapids.memory.pinnedPool.size' should be set to 2048m. - 'spark.sql.adaptive.enabled' should be enabled for better performance."
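
Each comment in the recommendations text names a Spark property in single quotes. As a small illustration, the sketch below pulls those property names out with a regular expression so they can be reviewed as a list. The function name `extract_properties` and the pattern are assumptions based only on the sample output shown above, not on a documented output format.

```python
import re

# Matches quoted property names such as 'spark.executor.memory'.
PROPERTY_PATTERN = re.compile(r"'(spark\.[A-Za-z0-9_.]+)'")


def extract_properties(recommendations):
    """Return the quoted Spark property names in order, without duplicates."""
    seen = []
    for name in PROPERTY_PATTERN.findall(recommendations):
        if name not in seen:
            seen.append(name)
    return seen


# Two comments copied from the sample output above.
sample = (
    "- 'spark.executor.memory' should be set to at least 2GB/core. "
    "- 'spark.sql.adaptive.enabled' should be enabled for better performance."
)
print(extract_properties(sample))
```

Applied to the `recommendations` column, for example with `app_recommendations['recommendations'].map(extract_properties)`, this would yield one list of property names per application.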


## Per-App Profile

In [0]:
for x in os.scandir("rapids_4_spark_profile/"):
  print("APPLICATION ID = " + str(x))
  # Use a context manager so each log file is closed after it is printed.
  with open(x.path + "/profile.log") as log:
    print(log.read())

APPLICATION ID = <DirEntry 'app-20230109071630-0000'>
### A. Information Collected ###
Application Information:
+--------+----------------+-----------------------+---------+-------------+-------+--------+-----------+------------+-------------+
|appIndex|appName         |appId                  |sparkUser|startTime    |endTime|duration|durationStr|sparkVersion|pluginEnabled|
+--------+----------------+-----------------------+---------+-------------+-------+--------+-----------+------------+-------------+
|1       |Databricks Shell|app-20230109071630-0000|root     |1673248584969|       |1212660 |20 min     |            |false        |
+--------+----------------+-----------------------+---------+-------------+-------+--------+-----------+------------+-------------+

Application Log Path Mapping:
+--------+----------------+-----------------------+-------------------------+
|appIndex|appName         |appId                  |eventLogPath             |
+--------+----------------+--------------