<a href="https://colab.research.google.com/github/zwelshman/healthcare-data-analysis-in-python/blob/main/PySpark/debugging_in_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Debugging in PySpark

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark

In [2]:
from pyspark.sql import functions as F

def generate_fake_data():
  """
  Simple function that generates fake data
  and returns as a Spark DataFrame
  """
  df = (
      spark.createDataFrame(
          [
              ("id_001", "2020-01-01", 52),
              ("id_002", "2021-06-23", 63),
              ("id_003", "2020-05-01", 16)
          ],
          ['person_id', 'date', 'age']
      )
      .withColumn('date', F.to_date(F.col('date')))
  )

  return df

df = generate_fake_data()

In [3]:
display(df)

person_id,date,age
id_001,2020-01-01,52
id_002,2021-06-23,63
id_003,2020-05-01,16


In [4]:
def add_days(df, date_col, num_days):
  """
  Function to create a new column with the number of days
  added to an original date column
  """

  df = (df
      .withColumn(f'DATE_PLUS{str(num_days)}',
                  F.date_add(df[f'{date_col}'], num_days))
  )

  return df

df_days = add_days(df, 'date', 10)

display(df_days)

person_id,date,age,DATE_PLUS10
id_001,2020-01-01,52,2020-01-11
id_002,2021-06-23,63,2021-07-03
id_003,2020-05-01,16,2020-05-11


## Debugging: Simple example

There are two ways to enter debug mode:

- Create a cell beneath the one with the error and run `%debug`, or
- Turn the debugger on automatically by running `%pdb on` at the top of the notebook, where pdb stands for the Python debugger.

You might already know what the problem is in the code below, but with more complex code and functions the cause might not be as obvious.

In [5]:
df_days_error = add_days(df, 'date_1', 10)

display(df_days_error)

AnalysisException: Cannot resolve column name "date_1" among (person_id, date, age)

The `AnalysisException` message is verbose enough to show where the issue is; however, we will enter debug mode to inspect it further.
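Because the exception message lists the columns that do exist, the same check can be made before Spark is called at all. The block below is a minimal sketch, assuming a hypothetical helper named `check_column` (it is not part of PySpark) that works on the plain list of names returned by `df.columns`. A hard-coded list stands in for `df.columns` here so the snippet runs without a Spark session.

```python
import difflib


def check_column(columns, name):
    """
    Hypothetical helper (not part of PySpark): return `name` if it is
    in `columns`, otherwise raise a readable error with close matches.
    `columns` is a plain list of strings, e.g. df.columns.
    """
    if name in columns:
        return name
    # Suggest similarly spelled column names, if there are any
    suggestions = difflib.get_close_matches(name, columns, n=3)
    hint = f" Did you mean {suggestions}?" if suggestions else ""
    raise ValueError(f"Column '{name}' not found among {columns}.{hint}")


# Stand-in for df.columns from the example DataFrame
columns = ["person_id", "date", "age"]

try:
    check_column(columns, "date_1")
except ValueError as err:
    message = str(err)
    print(message)
```

Calling such a helper at the top of `add_days()` would surface the typo in plain Python, before the JVM is involved. Whether that extra check is worth it depends on how often column names are passed in as strings.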

- We will use `%debug` to enter debug mode and look at the `converted` variable, which contains the exception.
- Then we will move up the stack trace using `u` (meaning up in the Python debugger syntax), followed by the return key, until we reach the line of code where the exception started (three jumps in this case).
- Using the pdb interactive shell, look at the Spark DataFrame called `df` that was passed into the function.
- The docstring of the `add_days()` function can be viewed using `!help(add_days)`.
- Using `a` to list the arguments passed into `add_days()`, we can see that `date_col` is `'date_1'`, which triggers the error.
- Set the `date_col` variable to `'date'` using `date_col = 'date'`, then rerun the function, like so: `add_days(df, 'date', 20)`.
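The `u` and `a` commands used in these steps can be approximated in plain Python by walking the traceback and reading the local variables of the frame of interest. The block below is a rough sketch of that idea, not how ipdb is implemented. `add_days_stub` and `find_frame_locals` are hypothetical names, and the stub only mimics the failure of `add_days()` so the snippet runs without Spark.

```python
import sys
import traceback


def add_days_stub(columns, date_col, num_days):
    """Stand-in for add_days(): fails in a similar way on an unknown column."""
    if date_col not in columns:
        raise KeyError(f'Cannot resolve column name "{date_col}"')
    return f"DATE_PLUS{num_days}"


def find_frame_locals(tb, func_name):
    """Return a copy of the locals of the first frame running func_name."""
    # walk_tb yields (frame, line number) from the outermost call inwards
    for frame, _lineno in traceback.walk_tb(tb):
        if frame.f_code.co_name == func_name:
            return dict(frame.f_locals)
    return None


try:
    add_days_stub(["person_id", "date", "age"], "date_1", 10)
except KeyError:
    tb = sys.exc_info()[2]
    # Roughly what moving to the function's frame and typing `a` shows
    frame_locals = find_frame_locals(tb, "add_days_stub")

print(frame_locals["date_col"])
```

In an interactive session the debugger is usually more convenient. This pattern is mainly useful for logging the arguments of a failing call in non-interactive jobs.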

In [6]:
%debug

> /content/spark-3.1.1-bin-hadoop3.2/python/pyspark/sql/utils.py(117)deco()
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

ipdb> u 3
> <ipython-input-4-9c63975990bf>(9)add_days()
      7   df = (df
      8       .withColumn(f'DATE_PLUS{str(num_days)}',
----> 9                   F.date_add(df[f'{date_col}'], num_days))
     10   )
     11


sys.settrace() should not be used when the debugger is being used.
This may cause the debugger to stop working correctly.
If this is needed, please check: 
http://pydev.blogspot.com/2007/06/why-cant-pydev-debugger-work-with.html
to see how to restore the debug tracing back correctly.
Call Location:
  File "/usr/lib/python3.10/bdb.py", line 347, in set_continue
    sys.settrace(None)



In [None]:
%pdb on