# Welcome to the Qualification Tool for the RAPIDS Accelerator for Apache Spark
To run the tool, enter a log path that points to the DBFS location of your Spark CPU event logs, then select "Run all" to execute the notebook. When the notebook completes, the output tables appear below.

## Summary Output
The report represents the entire app execution, including unsupported operators and non-SQL operations.  By default, the applications and queries are sorted in descending order by the following fields:
- Recommendation;
- Estimated GPU Speed-up;
- Estimated GPU Time Saved; and
- End Time.
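
The ordering described above can be reproduced on a loaded summary table with pandas. The snippet below is a minimal sketch, not the tool's own implementation: the rows are hypothetical, and it sorts only by those of the four listed fields that exist as columns in the frame, since an exported CSV may not contain all of them.

```python
import pandas as pd

# Hypothetical rows for illustration only; real values come from the tool's summary CSV.
summary = pd.DataFrame({
    "App ID": ["app-a", "app-b", "app-c"],
    "Estimated GPU Speedup": [1.05, 2.40, 2.40],
    "Estimated GPU Time Saved": [65682.41, 120000.0, 300000.0],
})

# Fields named in the text above; keep only the ones present in this frame.
sort_fields = ["Recommendation", "Estimated GPU Speedup", "Estimated GPU Time Saved", "End Time"]
present = [c for c in sort_fields if c in summary.columns]

# Descending sort on every available field, in the listed priority order.
ranked = summary.sort_values(by=present, ascending=False).reset_index(drop=True)
print(ranked)
```

Sorting "Recommendation" as plain text is an assumption; if its labels do not sort in the intended order, an ordered categorical column would be needed.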

## Stages Output
For each stage used in SQL operations, the Qualification tool generates the following information:
1. App ID
1. Stage ID
1. Average Speedup Factor: the average estimated speed-up of all the operators in the given stage.
1. Stage Task Duration: amount of time spent in tasks of SQL DataFrame operations for the given stage.
1. Unsupported Task Duration: sum of task durations for the unsupported operators. For more details, see Supported Operators.
1. Stage Estimated: True if the tool had to estimate the stage duration, False otherwise.

## Execs Output
The Qualification tool generates a report of the “Exec” in the “SparkPlan” or “Executor Nodes” along with the estimated acceleration on the GPU. Please refer to the Supported Operators guide for more details on limitations on UDFs and unsupported operators.
1. App ID
1. SQL ID
1. Exec Name: for example, Filter or HashAggregate
1. Expression Name
1. Task Speedup Factor: the average acceleration of the operators, computed as the original CPU duration of the operator divided by its GPU duration. The tool uses historical queries and benchmarks to estimate how much each individual operator would accelerate on the GPU.
1. Exec Duration: wall-clock time measured from when the operator starts until it completes.
1. SQL Node Id
1. Exec Is Supported: whether the Exec is supported by RAPIDS or not. Please refer to the Supported Operators section.
1. Exec Stages: an array of stage IDs
1. Exec Children
1. Exec Children Node Ids
1. Exec Should Remove: whether the operator is removed from the migrated plan.
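
The Task Speedup Factor definition above is a simple ratio, which a short sketch can make concrete. The function name `task_speedup_factor` and the durations are hypothetical and are not part of the Qualification tool; the tool derives its factors from benchmarks rather than from measured GPU runs.

```python
def task_speedup_factor(cpu_duration_ms: float, gpu_duration_ms: float) -> float:
    """Speed-up of one operator: CPU duration divided by GPU duration."""
    if gpu_duration_ms <= 0:
        raise ValueError("gpu_duration_ms must be positive")
    return cpu_duration_ms / gpu_duration_ms


# Hypothetical durations, not taken from any event log.
factor = task_speedup_factor(cpu_duration_ms=900.0, gpu_duration_ms=300.0)

# Inverting the ratio gives the GPU duration implied by a known factor.
estimated_gpu_ms = 900.0 / factor
print(factor, estimated_gpu_ms)
```

A factor of 1.0 therefore means no expected change in duration.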

In [0]:
import pandas as pd

In [0]:
%sh wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.12.0/rapids-4-spark-tools_2.12-22.12.0.jar

--2023-01-10 09:02:34--  https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark-tools_2.12/22.12.0/rapids-4-spark-tools_2.12-22.12.0.jar
Resolving repo1.maven.org (repo1.maven.org)... 151.101.20.209
Connecting to repo1.maven.org (repo1.maven.org)|151.101.20.209|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2011685 (1.9M) [application/java-archive]
Saving to: ‘rapids-4-spark-tools_2.12-22.12.0.jar’

In [0]:
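import shlex
import subprocess

# Assumption: OUTPUT_DIR is not defined elsewhere in this notebook; "/tmp" is a placeholder.
OUTPUT_DIR = "/tmp"

dbutils.widgets.text("log_path", "")
eventlog_string = dbutils.widgets.get("log_path")

# Build the Qualification tool command; the classpath expects the tools jar at /tmp/rapids-4-spark-tools.jar.
q_command_string = "java -Xmx10g -cp /tmp/rapids-4-spark-tools.jar:/databricks/jars/* com.nvidia.spark.rapids.tool.qualification.QualificationMain -o {} ".format(OUTPUT_DIR) + eventlog_string
args = shlex.split(q_command_string)
cmd_out = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

# Stop the notebook and surface stderr if the tool exits with a non-zero status.
if cmd_out.returncode != 0:
  dbutils.notebook.exit("Qualification Tool failed with stderr:" + cmd_out.stderr)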

In [0]:
%sh wget https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz

--2023-01-10 09:11:35--  https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz
Resolving dlcdn.apache.org (dlcdn.apache.org)... 151.101.2.132, 2a04:4e42::644
Connecting to dlcdn.apache.org (dlcdn.apache.org)|151.101.2.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 299350810 (285M) [application/x-gzip]
Saving to: ‘spark-3.3.1-bin-hadoop3.tgz’

In [0]:
%sh tar zxvf spark-3.3.1-bin-hadoop3.tgz

spark-3.3.1-bin-hadoop3/
spark-3.3.1-bin-hadoop3/LICENSE
tar: spark-3.3.1-bin-hadoop3/LICENSE: Cannot change ownership to uid 110302528, gid 1000: Invalid argument
spark-3.3.1-bin-hadoop3/NOTICE
tar: spark-3.3.1-bin-hadoop3/NOTICE: Cannot change ownership to uid 110302528, gid 1000: Invalid argument
spark-3.3.1-bin-hadoop3/R/
spark-3.3.1-bin-hadoop3/R/lib/
spark-3.3.1-bin-hadoop3/R/lib/SparkR/
spark-3.3.1-bin-hadoop3/R/lib/SparkR/DESCRIPTION
tar: spark-3.3.1-bin-hadoop3/R/lib/SparkR/DESCRIPTION: Cannot change ownership to uid 110302528, gid 1000: Invalid argument
spark-3.3.1-bin-hadoop3/R/lib/SparkR/INDEX
tar: spark-3.3.1-bin-hadoop3/R/lib/SparkR/INDEX: Cannot change ownership to uid 110302528, gid 1000: Invalid argument
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/Rd.rds
tar: spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/Rd.rds: Cannot change ownership to uid 110302528, gid 1000: Invalid argument
spark-3.3.1-bin-hadoop3/R/lib/SparkR/Meta/features.rd

In [0]:
%sh java -Xmx10g -cp /databricks/driver/rapids-4-spark-tools_2.12-22.12.0.jar:/databricks/driver/spark-3.3.1-bin-hadoop3/jars/* com.nvidia.spark.rapids.tool.qualification.QualificationMain /dbfs/FileStore/logs/

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
23/01/10 09:14:45 INFO Qualification: Threadpool size is 1
23/01/10 09:14:45 INFO QualificationAppInfo: Parsing Event Log: file:/dbfs/FileStore/logs
23/01/10 09:14:45 WARN QualificationAppInfo: ClassNotFoundException: DBCEventLoggingListenerMetadata
Qual Tool Progress 0% [>                                                          ] (0 succeeded + 0 failed + 0 N/A) / 1
23/01/10 09:14:56 WARN QualificationAppInfo: ClassNotFoundException: DBCEventLoggingListenerMetadata
23/01/10 09:14:56 INFO QualificationAppInfo: Total number of events parsed: 65285 for file:/dbfs/FileStore/logs
23/01/10 09:14:56 INFO QualificationAppInfo: file:/dbfs/FileStore/logs has App: app-20230109071630-0000
23/01/10 09:14:56 WARN QualificationAppInfo: Application End Time is unknown for app-20230109071630-0000, estimating based on job and sql end times!
23/01/10 09:14:56 INFO Qualification: Took 11650ms to process file:/dbfs/FileStore

In [0]:
%sh ls 

azure
conf
eventlogs
ganglia
hadoop_accessed_config.lst
logs
preload_class.lst
rapids-4-spark-tools_2.12-22.12.0.jar
rapids_4_spark_qualification_output
spark-3.3.1-bin-hadoop3
spark-3.3.1-bin-hadoop3.tgz


## Summary Output

In [0]:
summary_output=pd.read_csv("rapids_4_spark_qualification_output/rapids_4_spark_qualification_output.csv")
display(summary_output)

App Name,App ID,Recommendation,Estimated GPU Speedup,Estimated GPU Duration,Estimated GPU Time Saved,SQL DF Duration,SQL Dataframe Task Duration,App Duration,GPU Opportunity,Executor CPU Time Percent,SQL Ids with Failures,Unsupported Read File Formats and Types,Unsupported Write Data Format,Complex Types,Nested Complex Types,Potential Problems,Longest SQL Duration,NONSQL Task Duration Plus Overhead,Unsupported Task Duration,Supported SQL DF Task Duration,Task Speedup Factor,App Duration Estimated,Unsupported Execs,Unsupported Expressions,Cluster Tags
Databricks Shell,app-20230109071630-0000,Not Recommended,1.05,1146977.58,65682.41,127396,752167,1212660,124855,59.67,,,,,,,70353,239675,14998,737169,2.11,True,PhotonShuffleMapStage;AdaptiveSparkPlan;PhotonLocalLimit;CollectLimit;PhotonShuffleExchangeSink;PhotonShuffleExchangeSource;PhotonProject;HashAggregate;ShowTables;ShowNamespaces;LocalTableScan;PhotonSort;CommandResult;PhotonResultStage;Execute CreateViewCommand;PhotonScan parquet ;ColumnarToRow,finalmerge_count,ClusterId -> 0108-090539-pg4k5ml9;Name -> 8721196619973675-2699ba82-ab92-479e-8ab1-102a1d7b07a6-worker;ClusterName -> saurava@nvidia.com's Cluster;Creator -> saurava@nvidia.com;Vendor -> Databricks
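
As a sanity check on the row above, the speed-up and time-saved columns can be compared against App Duration and Estimated GPU Duration. This is a sketch under an assumption: that the speed-up is the ratio of the two durations and the time saved is their difference. That relationship is inferred from the numbers in this row, not taken from the tool's documentation.

```python
# Values copied from the summary row above.
app_duration = 1212660
estimated_gpu_duration = 1146977.58

# Assumed relationships (inferred from the data, not confirmed by the tool's docs).
speedup = app_duration / estimated_gpu_duration
time_saved = app_duration - estimated_gpu_duration

print(f"speedup    = {speedup:.4f}")
print(f"time saved = {time_saved:.2f}")
```

The results land within one unit in the last reported digit of the CSV values (1.05 and 65682.41), which is consistent with the tool truncating rather than rounding, although that is also an assumption.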


## Stages Output

In [0]:
stages_output=pd.read_csv("rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_stages.csv")
display(stages_output)

App ID,Stage ID,Average Speedup Factor,Stage Task Duration,Unsupported Task Duration,Stage Estimated
app-20230109071630-0000,18,1.7,1809,1356,False
app-20230109071630-0000,16,1.0,5758,5756,False
app-20230109071630-0000,17,2.5,0,0,True
app-20230109071630-0000,30,1.5,1681,1120,False
app-20230109071630-0000,28,1.0,172,172,False
app-20230109071630-0000,29,1.0,0,0,True
app-20230109071630-0000,23,1.0,229,226,True
app-20230109071630-0000,25,1.0,222,226,True
app-20230109071630-0000,26,1.0,211,226,True
app-20230109071630-0000,22,1.0,250,226,True
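
One way to read the stages table is to derive how much of each stage's task time falls on supported operators. The sketch below uses four rows copied from the output above; the derived column names `Supported Task Duration` and `Supported Share` are hypothetical and are not produced by the tool. The difference is clipped at zero because estimated stages can report an unsupported duration larger than the stage duration.

```python
import io

import pandas as pd

# Four rows copied from the stages output above.
csv_text = """App ID,Stage ID,Average Speedup Factor,Stage Task Duration,Unsupported Task Duration,Stage Estimated
app-20230109071630-0000,18,1.7,1809,1356,False
app-20230109071630-0000,16,1.0,5758,5756,False
app-20230109071630-0000,30,1.5,1681,1120,False
app-20230109071630-0000,25,1.0,222,226,True
"""
stages = pd.read_csv(io.StringIO(csv_text))

# Time on supported operators; clip at 0 for estimated stages where unsupported > total.
supported = (stages["Stage Task Duration"] - stages["Unsupported Task Duration"]).clip(lower=0)
stages["Supported Task Duration"] = supported
stages["Supported Share"] = (supported / stages["Stage Task Duration"]).round(3)

print(stages[["Stage ID", "Supported Task Duration", "Supported Share"]])
```

Stages with a share near zero, such as stage 16 here, would see little benefit from GPU acceleration regardless of their speed-up factor.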


## Execs Output

In [0]:
execs_output=pd.read_csv("rapids_4_spark_qualification_output/rapids_4_spark_qualification_output_execs.csv")
display(execs_output)

App ID,SQL ID,Exec Name,Expression Name,Task Speedup Factor,Exec Duration,SQL Node Id,Exec Is Supported,Exec Stages,Exec Children,Exec Children Node Ids,Exec Should Remove
app-20230109071630-0000,10,HashAggregate,,4.5,0,3,True,15,,,False
app-20230109071630-0000,17,LocalTableScan,,1.0,0,0,False,,,,False
app-20230109071630-0000,12,AdaptiveSparkPlan,,1.0,0,0,False,,,,False
app-20230109071630-0000,12,PhotonSort,,1.0,0,8,False,21,,,False
app-20230109071630-0000,12,TakeOrderedAndProject,,3.0,0,1,True,,,,False
app-20230109071630-0000,11,Window,,3.0,0,4,True,,,,False
app-20230109071630-0000,11,PhotonResultStage,,1.0,0,7,False,18,,,False
app-20230109071630-0000,16,WholeStageCodegen (3),WholeStageCodegen (3),1.0,1330,1,False,41,HashAggregate,2,False
app-20230109071630-0000,12,PhotonScan parquet,,1.0,0,14,False,19,,,False
app-20230109071630-0000,16,HashAggregate,,1.0,0,2,False,41,,,False
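
To see which operators block acceleration, the execs table can be filtered on `Exec Is Supported`. The sketch below is one possible approach, using five rows copied from the output above; in the notebook, the same filter could be applied directly to `execs_output`.

```python
import io

import pandas as pd

# Five rows copied from the execs output above.
csv_text = """App ID,SQL ID,Exec Name,Expression Name,Task Speedup Factor,Exec Duration,SQL Node Id,Exec Is Supported,Exec Stages,Exec Children,Exec Children Node Ids,Exec Should Remove
app-20230109071630-0000,10,HashAggregate,,4.5,0,3,True,15,,,False
app-20230109071630-0000,17,LocalTableScan,,1.0,0,0,False,,,,False
app-20230109071630-0000,12,PhotonSort,,1.0,0,8,False,21,,,False
app-20230109071630-0000,11,Window,,3.0,0,4,True,,,,False
app-20230109071630-0000,16,HashAggregate,,1.0,0,2,False,41,,,False
"""
execs = pd.read_csv(io.StringIO(csv_text))

# Keep only the execs the tool marked as unsupported.
unsupported = execs[~execs["Exec Is Supported"].astype(bool)]

# Count how often each unsupported exec name occurs.
unsupported_counts = unsupported["Exec Name"].value_counts()
print(unsupported_counts)
```

Note that the same exec name can appear as both supported and unsupported, as HashAggregate does here for SQL IDs 10 and 16, so the filter has to run per row and not per name.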
