# PySpark SQL between() Function: Checking Values in a Range

## Introduction to the `between()` Function

The `between()` function in PySpark, a method of `Column`, checks whether a column's value falls within a specific range. It mirrors the SQL `BETWEEN` clause: you supply a lower and an upper bound, and the expression evaluates to true when the value lies within that range, with both bounds included. It is commonly used for filtering numeric columns, dates, or other ordered data types.


## Basic Syntax:

```
DataFrame.filter(column.between(lower_bound, upper_bound))
```

### Parameters:

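- **`column`**: The column whose values are checked.
- **`lower_bound`**: The lower boundary of the range (inclusive). It can be a literal value or another column.
- **`upper_bound`**: The upper boundary of the range (inclusive). It can be a literal value or another column.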


## Why Use `between()`?

- It simplifies the process of filtering data based on a range. 
- Instead of writing multiple conditions (e.g., `>=` and `<=`), you can use `between()` for a more readable and concise approach.
- It is especially useful for filtering numeric values, dates, or any data types that follow a natural order.


## Practical Examples

### 1. Filtering Rows Based on a Numeric Range

**Scenario**: You have a DataFrame with sales data, and you want to filter rows where sales fall between 300 and 600.

**Code Example**:

```python
df = spark.createDataFrame([
    ("Product A", 500),
    ("Product B", 300),
    ("Product C", 700),
    ("Product D", 600)
], ["product_name", "sales"])

# Filter rows where sales are between 300 and 600 (both bounds included)
df.filter(df.sales.between(300, 600)).show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product A|  500|
|   Product B|  300|
|   Product D|  600|
+------------+-----+
```



### 2. Using `between()` for Date Ranges

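**Scenario**: You have a DataFrame with transaction dates, and you want to filter rows where the date falls between two specific dates. In this example the dates are stored as `yyyy-MM-dd` strings, so `between()` performs a string comparison, which matches chronological order for this format.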

**Code Example**:

```python
from pyspark.sql import functions as F

df_dates = spark.createDataFrame([
    ("Transaction A", "2024-01-01"),
    ("Transaction B", "2024-06-15"),
    ("Transaction C", "2024-10-01")
], ["transaction", "transaction_date"])

# Filter rows where the transaction_date is between two dates
df_dates.filter(F.col("transaction_date").between("2024-01-01", "2024-09-30")).show()
```

```
+-------------+----------------+
|  transaction|transaction_date|
+-------------+----------------+
|Transaction A|      2024-01-01|
|Transaction B|      2024-06-15|
+-------------+----------------+
```



### 3. Combining `between()` with Other Filters

**Scenario**: You want to filter rows where the sales are between 300 and 600 and the product name starts with "Product A".

**Code Example**:

```python
# Combine between() with another condition
df.filter(df.sales.between(300, 600) & df.product_name.startswith("Product A")).show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product A|  500|
+------------+-----+
```



### 4. Using `between()` for Conditional Aggregations

**Scenario**: You want to calculate the total sales only for products whose sales are between 300 and 600.

**Code Example**:

```python
from pyspark.sql import functions as F

# Aggregate sales for products within a specific sales range.
# F.sum is used instead of importing sum directly, which would shadow Python's built-in sum().
df.filter(df.sales.between(300, 600)).agg(F.sum("sales").alias("total_sales")).show()
```

```
+-----------+
|total_sales|
+-----------+
|       1400|
+-----------+
```



### 5. Applying `between()` to String Values

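**Scenario**: You want to check whether string values (such as product codes) fall between two values. Strings are compared lexicographically, character by character, rather than by any numeric meaning they may carry.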

**Code Example**:

```python
df_codes = spark.createDataFrame([
    ("Product A", "A001"),
    ("Product B", "B002"),
    ("Product C", "C003")
], ["product_name", "product_code"])

# Filter rows where product_code is between 'A001' and 'B999'
df_codes.filter(df_codes.product_code.between("A001", "B999")).show()
```

```
+------------+------------+
|product_name|product_code|
+------------+------------+
|   Product A|        A001|
|   Product B|        B002|
+------------+------------+
```



### 6. Handling Null Values with `between()`

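**Scenario**: You have a DataFrame with null values, and you want to see how `between()` treats them. A null input makes `between()` evaluate to null, and `filter()` keeps only rows where the condition is true, so rows with a null `sales` value are dropped.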

**Code Example**:

```python
df_with_nulls = spark.createDataFrame([
    ("Product A", 500),
    ("Product B", None),
    ("Product C", 600)
], ["product_name", "sales"])

# The row with null sales is excluded because the condition evaluates to null, not true
df_with_nulls.filter(df_with_nulls.sales.between(300, 600)).show()
```

```
+------------+-----+
|product_name|sales|
+------------+-----+
|   Product A|  500|
|   Product C|  600|
+------------+-----+
```

