# PySpark SQL `where()` Function: SQL-like Filtering Made Easy

## Introduction to the `where()` Function

The `where()` method on a PySpark DataFrame filters rows based on a condition, much like the SQL `WHERE` clause. It is an alias for `filter()`: the two behave identically, and `where()` simply reads more like SQL.


## Basic Syntax

```
DataFrame.where(condition)
```

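The condition can be either a SQL expression string or a boolean `Column` expression. Rows for which the condition evaluates to true are kept; rows for which it evaluates to false or null are dropped.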

## Why Use `where()`?

Although `where()` and `filter()` are interchangeable, some developers prefer `where()` for SQL-like syntax consistency when writing queries or when transitioning from SQL-based data environments.


## Practical Examples

### 1. Basic SQL-like Filtering

**Scenario**: You have a DataFrame with employee salary data, and you want to select employees who earn more than $3,000.

**Code Example**:


```python
df = spark.createDataFrame([
    (1, "John", 2500),
    (2, "Jane", 4000),
    (3, "Tom", 3500)
], ["EMPLOYEE_ID", "NAME", "SALARY"])

# SQL-like filtering: where SALARY > 3000
df.where("SALARY > 3000").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          2|Jane|  4000|
|          3| Tom|  3500|
+-----------+----+------+
```



### 2. Filtering Using Multiple Conditions

**Scenario**: You want to filter employees whose salary is strictly between $3,000 and $4,000 (both bounds excluded).

**Code Example**:

```python
# SQL-like filtering: where SALARY is greater than 3000 and less than 4000
df.where("SALARY > 3000 AND SALARY < 4000").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          3| Tom|  3500|
+-----------+----+------+
```



### 3. String-Based Filtering

**Scenario**: You want to filter employees whose names start with the letter "J."

**Code Example**:

```python
# SQL-like filtering: where NAME starts with 'J'
df.where("NAME LIKE 'J%'").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          1|John|  2500|
|          2|Jane|  4000|
+-----------+----+------+
```



### 4. Filtering Using Multiple Columns

**Scenario**: You want to filter employees whose salary is greater than $3,000 and whose name starts with "T."

**Code Example**:

```python
# SQL-like filtering: where SALARY > 3000 and NAME starts with 'T'
df.where("SALARY > 3000 AND NAME LIKE 'T%'").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          3| Tom|  3500|
+-----------+----+------+
```



### 5. Filtering Null and Not Null Values

**Scenario**: You want to select rows depending on whether the `SALARY` column is null or not null.

**Code Example**:

```python
df_with_null = spark.createDataFrame([
    (1, "John", 2500),
    (2, "Jane", None),
    (3, "Tom", 3500)
], ["EMPLOYEE_ID", "NAME", "SALARY"])

# SQL-like filtering: where SALARY is not null
df_with_null.where("SALARY IS NOT NULL").show()

# SQL-like filtering: where SALARY is null
df_with_null.where("SALARY IS NULL").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          1|John|  2500|
|          3| Tom|  3500|
+-----------+----+------+

+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          2|Jane|  null|
+-----------+----+------+
```



### 6. Combining SQL-like Filtering with Expressions

**Scenario**: You want to filter employees whose salary, after a 10% increase, will exceed $3,500.

**Code Example**:

```python
# SQL-like filtering: where (SALARY * 1.1) > 3500
df.where("(SALARY * 1.1) > 3500").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          2|Jane|  4000|
|          3| Tom|  3500|
+-----------+----+------+
```



### 7. Using `where()` with PySpark Expressions

**Scenario**: You prefer to use PySpark column expressions in your filtering logic.

**Code Example**:

```python
# Using PySpark column expressions in where()
from pyspark.sql.functions import col

df.where((col("SALARY") > 3000) & (col("SALARY") < 4000)).show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          3| Tom|  3500|
+-----------+----+------+
```

