**Debugging**

Some signs and symptoms of problems in Spark jobs.

**Common Spark issues**

**Spark Jobs Not Starting**

During the process of setting up the cluster, something was likely configured incorrectly, and now the node that runs the driver cannot talk to the executors.

This is most likely a cluster-level, machine, or configuration issue. Another possibility is that the application requested more resources per executor than the cluster manager currently has free, in which case the driver will wait forever for executors to be launched.

Ensure that the machines can communicate with one another on the ports you expect. Ideally, open up all ports between the worker nodes unless you have more stringent security constraints.
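One way to test the connectivity point above is a small TCP probe run from the driver machine. This is a minimal sketch using only the Python standard library; the helper names `can_connect` and `unreachable_ports` are made up for illustration, and the hosts and ports to probe depend entirely on your cluster setup.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_ports(hosts, ports, timeout: float = 2.0):
    """Return the (host, port) pairs that could not be reached."""
    return [(h, p) for h in hosts for p in ports if not can_connect(h, p, timeout)]


# Hypothetical usage (host names are placeholders; 7077 is the default
# standalone master port and 4040 the default application UI port):
# print(unreachable_ports(["worker-1", "worker-2"], [7077, 4040]))
```

An empty result only shows that the ports accept TCP connections. It does not prove that the Spark processes behind them are healthy.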

**Errors During Execution**

**Signs**


*  One Spark job runs successfully on the cluster but the next one fails
*  A step in a multistep query fails
*  A scheduled job that ran yesterday is failing today
*  Difficult to parse error message


**Potential treatments**

Check whether the data exists and is in the expected format. This can change over time, or an upstream change may have unintended consequences for the application.

Read through the stack trace to find clues about which components are involved.

If a job executes tasks for some time and then fails, the cause could be the input data itself: the schema might be specified incorrectly, or a particular row might not conform to the expected schema.

You will see the task marked as “failed” in the Spark UI, and you can view the logs on that machine to understand what it was doing when it failed. Try adding more logging inside your code to figure out which data record was being processed.
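The advice about logging the record being processed can be sketched in plain Python. The two-field `id,amount` layout and the function names below are assumptions made for illustration. In a real job, a function like `parse_record` would run inside whatever per-row transformation you already use, and the message would end up in the executor logs.

```python
import logging

logger = logging.getLogger("record_parser")


def parse_record(line: str):
    """Parse an 'id,amount' line into (int, float); return None if it is malformed."""
    try:
        record_id, amount = line.split(",")
        return int(record_id), float(amount)
    except ValueError:
        # Log the raw record so the log shows exactly which input caused the failure.
        logger.warning("Skipping malformed record: %r", line)
        return None


def parse_all(lines):
    """Parse every line and keep only the records that conformed to the layout."""
    parsed = (parse_record(line) for line in lines)
    return [row for row in parsed if row is not None]
```

Skipping bad records is a design choice made for this sketch. Re-raising after logging is equally valid when a malformed row should fail the job.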


**Slow Tasks or Stragglers**

**Sign**

Slow tasks are usually caused by work not being evenly distributed across your machines (“skew”), or by one of your machines being slower than the others.

Scaling up the number of machines given to the Spark application doesn’t really help: some tasks still take much longer than others.

In the Spark metrics, certain executors are reading and writing much more data than others.

Slow tasks are often called “stragglers.” There are many reasons they may occur, but most often the source of the issue is that the data is partitioned unevenly across the DataFrame’s partitions. When this happens, some executors might need to work on a much larger amount of data than others.

Example: in a group-by-key operation, one of the keys has much more data than the others. In this case, when we look at the Spark UI, we might see that the shuffle data for some nodes is much larger than for others.
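To make the skew described above concrete, here is a minimal plain-Python sketch that assigns keys to partitions by hash and compares the largest partition to the median one. It uses Python's built-in `hash`, not Spark's partitioner, and the function names are hypothetical. It only illustrates why one hot key produces one oversized partition.

```python
from collections import Counter
from statistics import median


def partition_sizes(keys, num_partitions):
    """Count records per partition when each key goes to hash(key) % num_partitions."""
    counts = Counter(hash(k) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the median partition (sizes must be non-empty)."""
    mid = median(sizes)
    return max(sizes) / mid if mid else float("inf")


# Key 0 is "hot": it owns 97 of the 100 records.
keys = [0] * 97 + [1, 2, 3]
sizes = partition_sizes(keys, num_partitions=4)
print(sizes, skew_ratio(sizes))
```

A ratio close to 1.0 suggests balanced partitions. A large ratio is the same imbalance you would look for in the per-task shuffle sizes in the Spark UI.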


**Treatment**

*  Try increasing the number of partitions to have less data per partition.
*  Try repartitioning by another combination of columns
*  Try increasing the memory allocated to executors if possible.
*  Monitor the executor that is having trouble and see whether it is the same machine across jobs. You might have an unhealthy executor or machine in the cluster, for example, one whose disk is nearly full.
*  Check whether the slow tasks are associated with a join or an aggregation.


Note: Stragglers can be one of the most difficult issues to debug, simply because there are so many possible causes. However, in all likelihood, the cause will be some kind of data skew, so begin by checking the Spark UI for imbalanced amounts of data across tasks.

**Slow Aggregations**

**Sign**
* Tasks slow down during a groupBy call. The data in the job may simply have some skewed keys.

**Treatment**

*  Increase the number of partitions prior to the aggregation.
*  Repartition the data.
*  Increase executor memory.
*  Work only on the data that is needed: filter rows and select columns before aggregating.
*  Ensure null values are represented correctly (using Spark’s concept of null) and not as some default value like " " or "EMPTY".
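The point about null representation can be illustrated in plain Python. When missing values are stored as a placeholder string, every such row shares one key and piles into a single large group. Real nulls can be recognized and skipped early. The placeholder set and function names below are assumptions for this sketch, and values are assumed to be hashable scalars.

```python
from collections import Counter

# Assumed placeholder spellings that stand in for "missing" in the raw data.
PLACEHOLDERS = {"", " ", "EMPTY"}


def normalize_null(value):
    """Map placeholder strings to None so missing values are real nulls."""
    return None if value in PLACEHOLDERS else value


def count_by_key(values):
    """Count occurrences per key, skipping nulls instead of grouping them."""
    normalized = map(normalize_null, values)
    return Counter(v for v in normalized if v is not None)


print(count_by_key(["a", " ", "EMPTY", "a", "b", ""]))
```

Without the normalization step, the three placeholder rows would show up as their own keys and take part in the grouping like any other value.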


**Slow Joins**

*  Joins and aggregations both shuffle data across nodes. Experimenting with different join orderings can really help speed up jobs, especially if some of those joins filter out a large amount of data; do those first.

*  Partitioning a dataset prior to joining can be very helpful for reducing data movement across the cluster.

*  Slow joins can also be caused by data skew. Increasing the size of the executors can help.

*  Use only the required data in joins: filter rows and select columns before joining.

*  Ensure that null values are handled correctly.
*  If you know that one of the tables you are joining is small, you can try to force a broadcast. You can also use Spark’s statistics collection commands to analyze the table so that the optimizer knows its size.

    

```
-- Enable histograms first if you want them collected with the column statistics
SET spark.sql.statistics.histogram.enabled = true;

ANALYZE TABLE table_name COMPUTE STATISTICS;
ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column1, column2;
ANALYZE TABLE sales COMPUTE STATISTICS FOR ALL COLUMNS;

-- Inspect the collected statistics
DESCRIBE EXTENDED table_name;
```

Collected statistics are most useful for:

*  Join-heavy queries
*  Large Delta tables
*  Cost-based optimization (CBO)
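To show why broadcasting a small table helps, here is a conceptual sketch of a broadcast hash join in plain Python. It is not Spark's implementation or API, and the function name and row format (lists of dicts) are assumptions. The idea is that the small side becomes an in-memory lookup table that every worker can hold, so the large side is scanned in place and never shuffled.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key` using a lookup built from the small side."""
    # Build phase: index the small table by join key (this is what gets "broadcast").
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: stream the large table and look up matches locally.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [{"id": 1, "amt": 10}, {"id": 2, "amt": 5}, {"id": 1, "amt": 7}]
customers = [{"id": 1, "name": "a"}]
print(broadcast_hash_join(orders, customers, "id"))
```

The lookup table has to fit in memory on every worker, which is why this strategy only suits a genuinely small table.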


**Driver OutOfMemoryError or Driver Unresponsive**

*  Collecting too much data back to the driver makes it run out of memory. The code might have tried to collect an overly large dataset to the driver node using operations such as collect.

*  You might be using a broadcast join where the data to be broadcast is too big.

*  Use Spark’s maximum broadcast join configuration to better control the size it will broadcast.


spark.sql.autoBroadcastJoinThreshold: the maximum size (in bytes) of a table that Spark will automatically broadcast to all executors during a join.

SET spark.sql.autoBroadcastJoinThreshold = 50MB;

Keep broadcast tables smaller than executor memory × a safety margin. The broadcast data is collected on the driver first, so it has to fit in driver memory as well.
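The threshold rule can be sketched as a small size check in plain Python. This is an illustration under stated assumptions, not Spark's own logic: the helper names are hypothetical, units are treated as binary (1 MB = 1024 × 1024 bytes), the default is taken to be 10 MB, and a negative threshold is taken to mean automatic broadcasting is disabled.

```python
UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}


def to_bytes(size: str) -> int:
    """Convert a size string such as '50MB' to bytes; bare numbers are already bytes."""
    size = size.strip().upper()
    # Check two-letter units before the bare "B" suffix.
    for unit in ("KB", "MB", "GB", "B"):
        if size.endswith(unit):
            return int(float(size[: -len(unit)]) * UNITS[unit])
    return int(size)


def would_auto_broadcast(table_bytes: int, threshold: str = "10MB") -> bool:
    """True if a table of this estimated size is within the broadcast threshold."""
    limit = to_bytes(threshold)
    return limit >= 0 and table_bytes <= limit


print(to_bytes("50MB"))                                # 52428800
print(would_auto_broadcast(20 * 1024**2))              # False with the 10MB default
print(would_auto_broadcast(20 * 1024**2, "50MB"))      # True
```

The size Spark compares against the threshold is an estimate from table statistics, so stale or missing statistics can lead to a table being broadcast when it is larger than expected.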



**Serialization Errors**

*  The data you are trying to share cannot be serialized.
*  Verify that you are actually registering your classes (for example, with Kryo) so that they are indeed serialized.
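For Python code, a quick local check can catch unserializable objects before a job is submitted. The sketch below uses the standard library's `pickle` as a stand-in. PySpark ships functions with cloudpickle, which accepts more than plain `pickle` does (lambdas, for example), so treat this as a rough approximation. The names `is_picklable` and `Handler` are hypothetical.

```python
import pickle
import threading


def is_picklable(obj) -> bool:
    """Return True if the object survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


class Handler:
    """Example of a class that cannot be shipped: it holds a thread lock."""

    def __init__(self):
        self.lock = threading.Lock()


print(is_picklable({"a": 1}))   # plain data serializes fine
print(is_picklable(Handler()))  # the lock attribute makes this fail
```

Objects that wrap locks, sockets, or open connections usually need to be created on the worker side rather than shared from the driver.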