# PySpark SQL filter() Function: How to Filter Rows

# Introduction to the `filter()` Function

The `filter()` function in PySpark is used to filter rows from a DataFrame based on a specified condition. It’s similar to the `WHERE` clause in SQL and allows you to apply logical conditions to remove rows that don’t match the criteria.


## Basic Syntax:

```
DataFrame.filter(condition)
```

Here, `condition` is either a `Column` expression of boolean type (for example `df.SALARY > 3000`) or a string containing a SQL expression (for example `"SALARY > 3000"`).


## How It Works

The `filter()` function evaluates the condition for each row and returns a new DataFrame containing only the rows for which the condition is true. Rows for which the condition is false or null are dropped. The original DataFrame is not modified.


# Practical Examples

### 1. Basic Filtering: Simple Condition

**Scenario**: You have a DataFrame with employee salary data, and you want to keep only the employees earning more than $3,000.

**Code Example**:

```python
from pyspark.sql import SparkSession

# Reuse the active session (e.g. in a notebook) or start a new one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "John", 2500),
    (2, "Jane", 4000),
    (3, "Tom", 3500)
], ["EMPLOYEE_ID", "NAME", "SALARY"])

# Keep employees with SALARY > 3000
df.filter(df.SALARY > 3000).show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          2|Jane|  4000|
|          3| Tom|  3500|
+-----------+----+------+
```



### 2. Filtering Using Multiple Conditions

**Scenario**: You want to filter employees whose salary is greater than $3,000 and less than $4,000.

**Code Example**:

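```python
# Keep employees with salary strictly between 3000 and 4000 (both bounds excluded).
# Each comparison needs its own parentheses because & binds tighter than > and <.
df.filter((df.SALARY > 3000) & (df.SALARY < 4000)).show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          3| Tom|  3500|
+-----------+----+------+
```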



### 3. Filtering Using String Conditions

**Scenario**: You want to keep only the employees whose names start with the letter "J".

**Code Example**:

```python
# Keep employees whose names start with 'J'
df.filter(df.NAME.startswith("J")).show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          1|John|  2500|
|          2|Jane|  4000|
+-----------+----+------+
```



### 4. Filtering Using `isin()` Function

**Scenario**: You have a list of employee names, and you want to keep only the rows where the employee name is in this list.

**Code Example**:

```python
# Keep employees whose names are in the given list
df.filter(df.NAME.isin("John", "Tom")).show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          1|John|  2500|
|          3| Tom|  3500|
+-----------+----+------+
```



### 5. Filtering Null Values

**Scenario**: You want to select the rows where the `SALARY` column is not null, or the rows where it is null.

**Code Example**:

```python
df_with_null = spark.createDataFrame([
    (1, "John", 2500),
    (2, "Jane", None),
    (3, "Tom", 3500)
], ["EMPLOYEE_ID", "NAME", "SALARY"])

# Keep rows where SALARY is not null
df_with_null.filter(df_with_null.SALARY.isNotNull()).show()

# Keep rows where SALARY is null
df_with_null.filter(df_with_null.SALARY.isNull()).show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          1|John|  2500|
|          3| Tom|  3500|
+-----------+----+------+

+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          2|Jane|  null|
+-----------+----+------+
```



### 6. Filtering Using SQL-like Expressions

**Scenario**: You prefer to write SQL-like expressions within your filter statement.

**Code Example**:

```python
# Use SQL-like syntax in filter()
df.filter("SALARY > 3000 AND SALARY < 4000").show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          3| Tom|  3500|
+-----------+----+------+
```



### 7. Filtering with Complex Expressions

**Scenario**: You want to filter employees based on multiple conditions, such as a specific name and a salary range.

**Code Example**:

```python
# Complex filtering: employee name is 'Tom' and salary > 3000
df.filter((df.NAME == "Tom") & (df.SALARY > 3000)).show()
```

```
+-----------+----+------+
|EMPLOYEE_ID|NAME|SALARY|
+-----------+----+------+
|          3| Tom|  3500|
+-----------+----+------+
```

