<a href="https://colab.research.google.com/github/zwelshman/healthcare-data-analysis-in-python/blob/main/PySpark/debugging_in_pyspark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Debugging in PySpark

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # Property used to format output tables better
spark

In [None]:
from pyspark.sql import functions as F

def generate_fake_data():
  """
  Simple function that generates fake data
  and returns as a Spark DataFrame
  """
  df = (
      spark.createDataFrame(
          [
              ("id_001", "2020-01-01", 52),
              ("id_002", "2021-06-23", 63),
              ("id_003", "2020-05-01", 16)
          ],
          ['person_id', 'date', 'age']
      )
      .withColumn('date', F.to_date(F.col('date')))
  )

  return df

df = generate_fake_data()

In [None]:
display(df)

In [None]:
def add_days(df, date_col, num_days):
  """
  Function to create a new column with the number of days
  added to an original date column
  """
  return (
      df
      .withColumn(f'DATE_PLUS{num_days}',
                  F.date_add(df[date_col], num_days))
  )

df_days = add_days(df, 'date', 10)

display(df_days)
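
As a rough cross-check of what `F.date_add` should produce for the three rows above, the same arithmetic can be sketched in plain Python with the standard library's `datetime` module. This is only an illustration and does not use Spark; the names `rows` and `shifted` are hypothetical.

```python
from datetime import date, timedelta

# Same person_id and date values as the fake data above
rows = [
    ("id_001", date(2020, 1, 1)),
    ("id_002", date(2021, 6, 23)),
    ("id_003", date(2020, 5, 1)),
]

num_days = 10

# Add num_days to each date, mirroring what the DATE_PLUS10 column should hold
shifted = [(person_id, d, d + timedelta(days=num_days)) for person_id, d in rows]

for person_id, original, plus_days in shifted:
    print(person_id, original, plus_days)
```

If the `DATE_PLUS10` column in `df_days` disagrees with these values, that is a hint to look more closely at the input dates or the column types.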

## Debugging: Simple example

There are two ways to enter debug mode:

- Create a cell beneath the one that raised the error and run `%debug`, or
- At the top of a notebook, turn the debugger on automatically with `%pdb on`, where pdb stands for the Python debugger.
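
Both magics start a post-mortem debugging session on the most recent exception. Outside a notebook, a broadly similar session can be started with `pdb.post_mortem` from the standard library. The sketch below is a minimal illustration in plain Python rather than PySpark; `lookup_column` and `run_lookup` are hypothetical names, and the `debug` flag is off by default so the cell runs without prompting for input.

```python
import pdb
import sys


def lookup_column(columns, name):
    """Hypothetical stand-in for a lookup that fails on an unknown column."""
    return columns[name]


def run_lookup(debug=False):
    """Run the lookup; on failure, optionally open the post-mortem debugger."""
    try:
        return lookup_column({"date": "2020-01-01"}, "date_1")
    except KeyError:
        if debug:
            # Opens pdb at the frame where the exception was raised
            pdb.post_mortem(sys.exc_info()[2])
        return None


result = run_lookup(debug=False)
print(result)
```

Calling `run_lookup(debug=True)` would drop into the debugger prompt at the failing line, which is roughly what `%debug` does after an error in a notebook cell.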

You might already know what the problem is in the code below, but with more complex code and functions it might not be as obvious.

In [None]:
df_days_error = add_days(df, 'date_1', 10)

display(df_days_error)

The `AnalysisException` is verbose enough to show where the issue is; however, we will enter debug mode to inspect it.

- We will use `%debug` to enter debug mode and look at the `converted` variable, which contains the exception.
- Then we will jump up the stack trace with `u` (the pdb command for "up"), pressing the return key after each jump, until we see the line of code where the exception started (three jumps in this case).
- Using the pdb command `a` (args), we can see the arguments passed to the current function.
- Then we will create an interactive shell using `interact` to inspect objects as they were before the error occurred.
- Because the interactive shell is opened in a frame from before the error occurred, we have access to objects such as `df`, and we can call help on the function like so: `help(add_days)`.
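
The idea behind `u` and `a` can be sketched without the debugger by walking the traceback of a caught exception and reading each frame's local variables. This is a plain Python illustration, not the PySpark call stack: `add_days_stub`, `failing_call` and `frames_from_exception` are hypothetical names, and a `KeyError` stands in for the `AnalysisException`.

```python
def add_days_stub(columns, date_col, num_days):
    """Hypothetical stand-in for add_days that fails on an unknown column."""
    if date_col not in columns:
        raise KeyError(f"cannot resolve '{date_col}' given input columns: {columns}")
    return columns + [f"DATE_PLUS{num_days}"]


def failing_call():
    columns = ["person_id", "date", "age"]
    return add_days_stub(columns, "date_1", 10)


def frames_from_exception(exc):
    """Return (function name, local variables) per frame, outermost first."""
    frames = []
    tb = exc.__traceback__
    while tb is not None:
        frame = tb.tb_frame
        frames.append((frame.f_code.co_name, dict(frame.f_locals)))
        tb = tb.tb_next
    return frames


try:
    failing_call()
except KeyError as exc:
    frames = frames_from_exception(exc)

# The last entry is the frame that raised; earlier entries are its callers
for name, local_vars in frames[-2:]:
    print(name, local_vars)
```

The debugger starts at the last frame in this list, and each `u` moves one entry towards the start. The locals of a function's frame include its arguments, which is the information `a` prints.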

In [None]:
%debug

In [None]:
%pdb on