# **Databricks PySpark: handling null values in filters**

**isNull()**
*   isNull() returns all the rows in which the column it is applied to contains a null value.

In [50]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark=SparkSession.builder.config("spark.driver.host",'localhost').getOrCreate()

In [51]:
spark

In [52]:
# Sample data with null values
data = [(1, "Alice", 25),
        (2, "Bob", None),
        (3, "Charlie", 22),
        (4, "David", 28),
        (5, None, 35)]



In [53]:
# Define schema for the DataFrame
schema = ["id", "name", "age"]

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)

# Show the original DataFrame
print("Original DataFrame:")
df.show()


Original DataFrame:
+---+-------+----+
| id|   name| age|
+---+-------+----+
|  1|  Alice|  25|
|  2|    Bob|null|
|  3|Charlie|  22|
|  4|  David|  28|
|  5|   null|  35|
+---+-------+----+



In [54]:
from pyspark.sql.functions import col


# Check for null values in the 'name' column
df_with_nulls = df.filter(col("name").isNull())

# Show the DataFrame with null values in the 'name' column
print("\nRows with null values in the 'name' column:")
df_with_nulls.show()

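# Stop the Spark session.
# Note: after this call, any action on df (for example show()) fails
# until a new session is created and the DataFrame is rebuilt.
spark.stop()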


Rows with null values in the 'name' column:
+---+----+---+
| id|name|age|
+---+----+---+
|  5|null| 35|
+---+----+---+



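**isNotNull()**
*   isNotNull() returns all the rows in which the column it is applied to does not contain a null value.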

In [55]:
df_notNull = df.filter(df['name'].isNotNull())

In [56]:
df_notNull.show()

Py4JJavaError: An error occurred while calling o246.showString.
: java.lang.IllegalStateException: SparkContext has been shutdown
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2255)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2284)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2303)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:530)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:483)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4177)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3161)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4167)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:526)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4165)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4165)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3161)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3382)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:284)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:323)
	at sun.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:748)

The error above is not caused by isNotNull(). The Spark session was stopped with spark.stop() at the end of cell 54, so show() has no running SparkContext to execute on. The filter itself is valid; it only needs a live session and a DataFrame created in that session.


In [57]:
spark.stop()

### **na.drop()**

**Drops a row if any or all of its columns contain a null value.**

syntax -\
**df_drop=df.na.drop("parameter")**
*   parameter - how to decide whether a row is dropped; the default is any
    *  **any** - drop a row if any of its columns contains a null value.
       *  df.na.drop("any") - drops every row that has at least one null value.
    *  **all** - drop a row only if all of its columns contain null values.
       *  df.na.drop("all") - drops only the rows in which every column is null.
    *  **subset=['col1','col3']** - restricts the null check to the listed columns: a row is dropped if any column in the subset (col1 or col3 in this case) contains a null value.

In [58]:
spark= SparkSession.builder.config("spark.driver.host","localhost").getOrCreate()

In [66]:
data = [("Alice", 25, None),
        ("Bob", None, 30),
        ("Charlie", 35, 40),
        (None,23,89)]

columns = ["name", "age", "score"]
df_new = spark.createDataFrame(data, columns)

In [67]:
df_new.show()

+-------+----+-----+
|   name| age|score|
+-------+----+-----+
|  Alice|  25| null|
|    Bob|null|   30|
|Charlie|  35|   40|
|   null|  23|   89|
+-------+----+-----+



In [68]:
df_drop=df_new.na.drop(subset=['age','score'])

In [69]:
df_drop.show()

+-------+---+-----+
|   name|age|score|
+-------+---+-----+
|Charlie| 35|   40|
|   null| 23|   89|
+-------+---+-----+



### **na.fill()**

Replaces null values with the value passed to the function.
*  syntax - 
   * df1=df.na.fill(value="dummy val",subset=["col1","col2"])
   * value - the value that replaces null
   * subset - the columns in which to look for null values
* example
  * df.na.fill(value=0)
    * Because the value is an integer, only null values in numeric columns (such as int columns) are replaced.
  * df.na.fill(value="dummy")
    * Because the value is a string, only null values in string columns are replaced.

In [77]:
data = [("Alice", 25, None),
        ("Bob", None, 30),
        ("Charlie", 35, 40),
        (None,52,76)]

columns = ["name", "age", "score"]
df3 = spark.createDataFrame(data, columns)

In [81]:
df_fill_string = df3.na.fill(value="replace Null")
df_fill_string.show()

+------------+----+-----+
|        name| age|score|
+------------+----+-----+
|       Alice|  25| null|
|         Bob|null|   30|
|     Charlie|  35|   40|
|replace Null|  52|   76|
+------------+----+-----+



As the output above shows, when no subset is given, all columns are considered by default. Because the value is a string, only the null values in string columns are replaced.

In [80]:
df_fill_int = df3.na.fill(value=420)
df_fill_int.show()

+-------+---+-----+
|   name|age|score|
+-------+---+-----+
|  Alice| 25|  420|
|    Bob|420|   30|
|Charlie| 35|   40|
|   null| 52|   76|
+-------+---+-----+



As the output above shows, when no subset is given, all columns are considered by default. Because the value is an integer, only the null values in integer columns are replaced.

**Note -**
*  If a subset is given, only those columns are searched for null values, and a null is replaced only where the type of the value matches the column's datatype.