# updated

Got it! I'll now structure the **Employee Data Processing with PySpark** project into a **Jupyter Notebook (`.ipynb`) format**, incorporating all the **Spark SQL functions** you requested.  

### **🚀 Project: Employee Data Processing with PySpark**
We'll cover the following:  
✅ **Data Ingestion** (CSV/Parquet)  
✅ **Data Cleaning & Transformation**  
✅ **Data Aggregation**  
✅ **Feature Engineering**  
✅ **Performance Optimization (Spark Config Tuning)**  
✅ **Unit Testing for Data Quality**  

---
### **📌 1. Setup PySpark in Jupyter Notebook**
```python
# Install PySpark if not installed (Only needed once)
!pip install pyspark
```

```python
# Import Required Libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
```

```python
# Create Spark Session
spark = SparkSession.builder \
    .appName("Employee Data Processing") \
    .config("spark.executor.memory", "2g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()
```

---
### **📌 2. Load Employee Data**
```python
# Define schema
schema = StructType([
    StructField("First Name", StringType(), True),
    StructField("Gender", StringType(), True),
    StructField("Start Date", StringType(), True),
    StructField("Last Login Time", StringType(), True),
    StructField("Salary", IntegerType(), True),
    StructField("Bonus %", DoubleType(), True),
    StructField("Senior Management", BooleanType(), True),
    StructField("Team", StringType(), True)
])

# Load Data
file_path = "employee_data.csv"  # Change path if needed
df = spark.read.csv(file_path, header=True, schema=schema)
```

---
### **📌 3. Data Cleaning & Transformation**
```python
# Convert date columns to DateType
df = df.withColumn("Start Date", to_date(df["Start Date"], "yyyy-MM-dd")) \
       .withColumn("Last Login Time", to_timestamp(df["Last Login Time"], "yyyy-MM-dd HH:mm:ss"))

# Handle missing values
df = df.fillna({"Salary": 0, "Bonus %": 0.0, "Senior Management": False, "Team": "Unknown"})
```

---
### **📌 4. Feature Engineering**
```python
# Calculate total earnings (Salary + Bonus)
df = df.withColumn("Total Earnings", col("Salary") + (col("Salary") * col("Bonus %") / 100))

# Add months to Start Date (Example: Probation end date)
df = df.withColumn("Probation End Date", add_months(col("Start Date"), 3))

# Combine First Name and Team into an array column
# (the dataset has no last-name column, so this is not a true full name)
df = df.withColumn("Full Name", array("First Name", "Team"))
```
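
As a quick sanity check of the Total Earnings expression, the same arithmetic can be worked through in plain Python. The helper name `total_earnings` and the sample figures are made up for illustration and are not part of the dataset.

```python
def total_earnings(salary: float, bonus_pct: float) -> float:
    """Mirror the Spark expression: Salary + Salary * Bonus % / 100."""
    return salary + salary * bonus_pct / 100


# A 60,000 salary with a 10% bonus adds 6,000 on top of the base salary.
example = total_earnings(60000, 10.0)
print(example)

# Rows whose Salary and Bonus % were filled with 0 by fillna end up with 0 earnings.
filled_row = total_earnings(0, 0.0)
print(filled_row)
```

Because missing salaries are replaced with 0 during cleaning, such rows show 0 total earnings rather than null, which is worth keeping in mind when averaging.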

---
### **📌 5. Data Aggregation**
```python
# Average salary per team
df_avg_salary = df.groupBy("Team").agg(avg("Salary").alias("Avg Salary"))

# Collect list of employees per team
df_team_members = df.groupBy("Team").agg(collect_list("First Name").alias("Team Members"))

# Collect set of unique employees per team
df_unique_team_members = df.groupBy("Team").agg(collect_set("First Name").alias("Unique Team Members"))
```

---
### **📌 6. Performance Optimization**
```python
# Cache the DataFrame to optimize repeated operations
df.cache()

# Reduce shuffle partitions for small data
spark.conf.set("spark.sql.shuffle.partitions", "2")
```

---
### **📌 7. Unit Testing for Data Quality**
```python
# Check for duplicate records
df_duplicates = df.groupBy(df.columns).count().filter(col("count") > 1)
assert df_duplicates.count() == 0, "Duplicates found in data!"

# Check for null values in key columns
assert df.filter(col("First Name").isNull()).count() == 0, "Null values found in First Name!"
assert df.filter(col("Salary").isNull()).count() == 0, "Null values found in Salary!"

# Check if salary is non-negative
assert df.filter(col("Salary") < 0).count() == 0, "Negative salaries found!"
```

---
### **📌 8. Save Processed Data**
```python
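# Replace spaces in column names first; older Spark versions reject them when writing Parquet
df.toDF(*[c.replace(" ", "_") for c in df.columns]).write.mode("overwrite").parquet("processed_employee_data.parquet")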
```

---
### **✅ Next Steps**
- Run each cell step by step in your **Jupyter Notebook** (`.ipynb`).
- Modify the **file path** for input/output data as needed.
- Experiment with different **Spark SQL functions** for practice.

---
This project exercises PySpark functions that data engineers commonly use, covering **data ingestion, transformation, feature engineering, aggregation, and optimization**.

Let me know if you need any refinements! 🚀

### **JIRA Ticket: Employee Data Processing with PySpark**  

**📌 Project:** Employee Data Processing with PySpark  
**💼 Epic:** Data Pipeline Development  
**📅 Created Date:** [Today's Date]  
**👤 Assignee:** [Your Name]  
**📌 Priority:** High  
**🔁 Sprint:** [Sprint Name]  

---

## **🎯 Summary**  
Build a **PySpark-based data processing pipeline** to handle **employee data**. This pipeline should cover **data ingestion, transformation, feature engineering, aggregation, performance optimization, and unit testing** in a **Jupyter Notebook (`.ipynb`)** environment.  

---

## **📝 Description**  

### **Task Breakdown:**  

### **1️⃣ Data Ingestion**  
- Load employee data from **CSV/Parquet**.  
- Define **schema** using `StructType`.  
- Read the data into a **Spark DataFrame**.  

### **2️⃣ Data Cleaning & Transformation**  
- Convert **Start Date** and **Last Login Time** to proper date formats.  
- Handle **missing values** (`fillna` for Salary, Bonus %, etc.).  
- Remove **duplicates**.  

### **3️⃣ Feature Engineering**  
- Calculate **Total Earnings** (`Salary + Bonus`).  
- Add a **Probation End Date** (`Start Date` + 3 months).  
- Generate **Full Name** as an **array column**.  

### **4️⃣ Data Aggregation**  
- Calculate **average salary per team**.  
- Collect **team members** using `collect_list()`.  
- Collect **unique team members** using `collect_set()`.  

### **5️⃣ Performance Optimization**  
- **Cache** DataFrame to avoid recomputation.  
- Reduce **shuffle partitions** for small datasets (`spark.sql.shuffle.partitions`).  

### **6️⃣ Unit Testing for Data Quality**  
- Check for **duplicate records**.  
- Check for **null values** in key columns.  
- Ensure **salary is non-negative**.  

### **7️⃣ Save Processed Data**  
- Write processed data to **Parquet format**.  

---

## **📌 Acceptance Criteria**  
✅ **Employee data is loaded successfully into PySpark**.  
✅ **Data cleaning and transformations are correctly applied**.  
✅ **Feature engineering steps are implemented**.  
✅ **Aggregations provide expected results**.  
✅ **Performance optimizations are applied correctly**.  
✅ **All unit tests pass successfully**.  
✅ **Processed data is saved in the correct format**.  

---

## **⏳ Estimated Time**  
**5-6 hours**  

---

## **🔗 Dependencies**  
- PySpark environment setup  
- Employee data file (CSV/Parquet)  

---

## **🛠️ Labels**  
- `PySpark`  
- `Data Engineering`  
- `ETL`  
- `Transformation`  
- `Optimization`  

---

## **📌 JIRA Workflow Status**  
**To Do** → In Progress → Code Review → Done  

---

This JIRA ticket will help you **implement and track** your PySpark data pipeline. 🚀 Let me know if you need any modifications! 💡

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Initialize Spark session
spark = SparkSession.builder.appName("PySparkExamples").getOrCreate()

# Define Schema
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, BinaryType, FloatType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Salary", FloatType(), True),
    StructField("City", StringType(), True),
    StructField("JoinDate", StringType(), True),  # Date stored as string
    StructField("Skills1", ArrayType(StringType()), True),
    StructField("Skills2", ArrayType(StringType()), True),
    StructField("Scores", ArrayType(IntegerType()), True),
    StructField("Phone", StringType(), True),
    StructField("Email", StringType(), True),
    StructField("Height", FloatType(), True),
    StructField("Weight", FloatType(), True),
    StructField("BinaryData", BinaryType(), True),
    StructField("ID", IntegerType(), True)
])

# Sample Data
data = [
    ("Alice", 30, 60000.50, "New York", "2020-01-15", ["Python", "SQL"], ["SQL", "Java"], [90, 85, 88], None, "alice@example.com", 165.2, 60.5, b"AliceBinary", 1),
    ("Bob", 25, 50000.75, "London", "2019-07-21", ["Java", "C++"], ["Python", "C++"], [78, 80, 82], "1234567890", None, 170.4, 68.2, b"BobBinary", 2),
    ("Charlie", 35, 75000.00, "San Francisco", "2018-05-10", ["JavaScript", "Scala"], ["Scala", "Rust"], [95, 92, 89], "0987654321", "charlie@example.com", 180.3, 75.1, b"CharlieBinary", 3)
]

# Create DataFrame
df = spark.createDataFrame(data, schema=schema)
df.show(truncate=False)

# # 🔹 Numeric Functions
# df.select("Name", "Age", abs(df["Age"]).alias("AbsoluteAge")).show()
# df.select("Name", "Salary", ceil(df["Salary"]).alias("CeilingSalary")).show()
# df.select("Name", "Age", cbrt(df["Age"]).alias("CubeRootAge")).show()
# df.select("Name", "Age", acos(df["Age"]/100).alias("AcosAge")).show()
# df.select("Name", "Age", asin(df["Age"]/100).alias("AsinAge")).show()
# df.select("Name", "Age", atan(df["Age"]/100).alias("AtanAge")).show()
# df.select("Name", "Height", "Weight", atan2(df["Height"], df["Weight"]).alias("Atan2HeightWeight")).show()

# # 🔹 Array Functions
# df.select("Name", "Skills1", "Skills2", array_intersect(df["Skills1"], df["Skills2"]).alias("CommonSkills")).show()
# df.select("Name", "Scores", array_max(df["Scores"]).alias("MaxScore")).show()
# df.select("Name", "Scores", array_min(df["Scores"]).alias("MinScore")).show()
# df.select("Name", "Skills1", array_join(df["Skills1"], ", ").alias("SkillsAsString")).show()
# df.select("Name", "Skills1", array_repeat(df["Skills1"], 2).alias("RepeatedSkills")).show()
# df.select("Name", "Skills1", array_sort(df["Skills1"]).alias("SortedSkills")).show()
# df.select("Name", "Skills1", "Skills2", arrays_zip(df["Skills1"], df["Skills2"]).alias("ZippedSkills")).show()

# # 🔹 String Functions
# df.select("Name", ascii(df["Name"]).alias("ASCII_FirstChar")).show()
# df.select("Name", base64(df["BinaryData"]).alias("Base64Encoded")).show()

# # 🔹 Bitwise & Binary Functions
# df.select("Name", "ID", bin(df["ID"]).alias("BinaryRepresentation")).show()
# df.select("Name", "ID", bitwise_not(df["ID"]).alias("BitwiseNotID")).show()

# # 🔹 Date Functions
# df.select("Name", "JoinDate", add_months(df["JoinDate"], 3).alias("DateAfter3Months")).show()

# # 🔹 Aggregation Functions
# df.groupBy("City").agg(avg(df["Salary"]).alias("AverageSalary")).show()
# df.groupBy("City").agg(collect_list(df["Name"]).alias("PeopleInCity")).show()
# df.groupBy("City").agg(collect_set(df["Name"]).alias("UniquePeopleInCity")).show()



+-------+---+--------+-------------+----------+-------------------+-------------+------------+----------+-------------------+------+------+----------------------------------------+---+
|Name   |Age|Salary  |City         |JoinDate  |Skills1            |Skills2      |Scores      |Phone     |Email              |Height|Weight|BinaryData                              |ID |
+-------+---+--------+-------------+----------+-------------------+-------------+------------+----------+-------------------+------+------+----------------------------------------+---+
|Alice  |30 |60000.5 |New York     |2020-01-15|[Python, SQL]      |[SQL, Java]  |[90, 85, 88]|null      |alice@example.com  |165.2 |60.5  |[41 6C 69 63 65 42 69 6E 61 72 79]      |1  |
|Bob    |25 |50000.75|London       |2019-07-21|[Java, C++]        |[Python, C++]|[78, 80, 82]|1234567890|null               |170.4 |68.2  |[42 6F 62 42 69 6E 61 72 79]            |2  |
|Charlie|35 |75000.0 |San Francisco|2018-05-10|[JavaScript, Scala]|[Scala, 

                                                                                

In [2]:
df.show()

+-------+---+--------+-------------+----------+-------------------+-------------+------------+----------+-------------------+------+------+--------------------+---+
|   Name|Age|  Salary|         City|  JoinDate|            Skills1|      Skills2|      Scores|     Phone|              Email|Height|Weight|          BinaryData| ID|
+-------+---+--------+-------------+----------+-------------------+-------------+------------+----------+-------------------+------+------+--------------------+---+
|  Alice| 30| 60000.5|     New York|2020-01-15|      [Python, SQL]|  [SQL, Java]|[90, 85, 88]|      null|  alice@example.com| 165.2|  60.5|[41 6C 69 63 65 4...|  1|
|    Bob| 25|50000.75|       London|2019-07-21|        [Java, C++]|[Python, C++]|[78, 80, 82]|1234567890|               null| 170.4|  68.2|[42 6F 62 42 69 6...|  2|
|Charlie| 35| 75000.0|San Francisco|2018-05-10|[JavaScript, Scala]|[Scala, Rust]|[95, 92, 89]|0987654321|charlie@example.com| 180.3|  75.1|[43 68 61 72 6C 6...|  3|
+-------+-

In [None]:
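# add_months needs a date column; df["2012-12-12"] would look up a nonexistent column named "2012-12-12"
df.select("Name", "JoinDate", add_months(df["JoinDate"], 3).alias("DateAfter3Months")).show()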


# Jira Ticket

code 

### **Project: Employee Data Processing with PySpark**

This project will help you prepare for your PySpark interview by applying common functions, transformations, and data processing techniques. The dataset will be sourced from an open-source repository, and the project will cover tasks like data cleaning, transformations, aggregations, and saving the processed data to AWS S3.

---

### **Project Structure**

1. **Objective:** Load, clean, and transform employee data.
2. **Dataset:** Use a publicly available employee dataset (e.g., from [Kaggle Employee Data](https://www.kaggle.com/datasets)).
3. **Tools:** PySpark for data processing, AWS S3 for storage.

---

### **Step 1: Setup PySpark and Download Dataset**
Make sure you have the necessary libraries installed.

#### Install PySpark:
```bash
pip install pyspark
```

Download the dataset from Kaggle or any open-source platform. Here, let's assume you're using an employee CSV file, saved as "employee_data.csv" to match the code in Step 3, with the following columns:
- `Name`
- `Age`
- `JoinDate`
- `Department`
- `Skills`
- `Salary`

---

### **Step 2: Initialize PySpark Session**
Create a Spark session to load and process data.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder \
    .appName("EmployeeDataProcessing") \
    .getOrCreate()
```

---

### **Step 3: Load the Dataset**
Use PySpark's `csv` reader to load the dataset.

```python
# Load the CSV dataset
df = spark.read.option("header", "true").csv("employee_data.csv", inferSchema=True)

# Show the top 5 rows of the dataset
df.show(5)
```

---

### **Step 4: Data Cleaning and Transformation**

#### 4.1. **Handling Missing Values**
Replace missing values for `Department` with "Unknown" and `Salary` with 0.

```python
df = df.fillna({"Department": "Unknown", "Salary": 0})
```

#### 4.2. **Convert Data Types**
Ensure that columns have the correct data types.

```python
df = df.withColumn("Age", df["Age"].cast("int"))
df = df.withColumn("Salary", df["Salary"].cast("double"))
df = df.withColumn("JoinDate", to_date(df["JoinDate"], "yyyy-MM-dd"))
```

#### 4.3. **Feature Engineering**
Use PySpark functions to add new columns and perform calculations.

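- Add 3 months to `JoinDate`
- Calculate new `Salary` after a 10% increment
- Split `Skills` into an array and extract the first skill (a CSV column is read as a plain string, so it must be split before array functions work)

```python
df = df.withColumn("DateAfter3Months", add_months(df["JoinDate"], 3))
df = df.withColumn("NewSalary", df["Salary"] * 1.1)
# Assumption: skills are stored as a ";"-separated string such as "Python;SQL"; adjust the delimiter to your file
df = df.withColumn("Skills", split(df["Skills"], ";"))
df = df.withColumn("FirstSkill", df["Skills"].getItem(0))
```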

---

### **Step 5: Data Analysis with PySpark Functions**

#### 5.1. **Find the Average Salary per Department**
```python
df.groupBy("Department").agg(avg("Salary").alias("AvgSalary")).show()
```

#### 5.2. **Find Employees Who Know "Python"**
```python
df.filter(array_contains(df["Skills"], "Python")).show()
```

#### 5.3. **Find the Youngest and Oldest Employees**
```python
df.select(min("Age").alias("Youngest"), max("Age").alias("Oldest")).show()
```

#### 5.4. **List Unique Departments**
```python
df.select(collect_set("Department")).show()
```

---

### **Step 6: Save Processed Data to AWS S3 using Multipart Upload**

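Spark writes one Parquet file per DataFrame partition, so a large dataset lands in S3 as several smaller files. Multipart upload of large files is handled internally by the S3 connector (for example, S3A), so no explicit multipart call is needed in the code.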

```python
output_path = "s3://your-bucket-name/processed_employees/"
df.write.mode("overwrite").parquet(output_path)
```

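#### Before Saving to S3
Before you run the write above, make sure your AWS credentials are available through either environment variables or an IAM role. On plain Apache Spark (outside EMR or Glue), the path typically needs the `s3a://` scheme and the `hadoop-aws` package on the classpath.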

### **Step 7: Combine Multiple Files into One (Optional)**
If the data has been split across multiple files and you want to combine them, you can use the following:

```python
# Read two files and merge
df1 = spark.read.parquet("s3://your-bucket-name/file1.parquet")
df2 = spark.read.parquet("s3://your-bucket-name/file2.parquet")

# Merge them into a single DataFrame (unionByName matches columns by name; union matches by position)
merged_df = df1.unionByName(df2)

# Write the merged data to S3
merged_df.write.mode("overwrite").parquet("s3://your-bucket-name/merged_employees.parquet")
```

---

### **Step 8: Example of Interview Questions Based on Functions Used**

| Function | What It Does | Example | Frequency (1 = rare, 5 = very common) |
|----------|--------------|---------|-----------------|
| `abs()` | Returns the absolute value of a column. | `df.select("Age", abs(df["Age"])).show()` | 3 |
| `add_months()` | Adds months to a date. | `df.select(add_months(df["JoinDate"], 3).alias("DateAfter3Months")).show()` | 4 |
| `array_contains()` | Checks if an array contains a value. | `df.filter(array_contains(df["Skills"], "Python")).show()` | 4 |
| `avg()` | Calculates the average of a numeric column. | `df.groupBy("Department").agg(avg("Salary").alias("AvgSalary")).show()` | 5 |
| `collect_list()` | Collects the values of a column into a list. | `df.groupBy("Department").agg(collect_list("Name")).show()` | 3 |
| `min()`, `max()` | Finds the minimum or maximum value. | `df.select(min("Age"), max("Age")).show()` | 4 |
| `coalesce()` | Returns the first non-null value. | `df.select(coalesce(df["Phone"], df["Email"]).alias("PreferredContact")).show()` | 3 |

---

### **Conclusion**
This project demonstrates how to:
- Use **PySpark SQL functions** for data cleaning, transformation, and analysis.
- Perform **feature engineering** using built-in functions like `add_months`, `abs`, `array_contains`.
- **Aggregate** data using functions like `avg`, `min`, `max`.
- Save the data to **AWS S3** in Parquet format (the S3 connector handles multipart upload of large files).

### **Next Steps:**
- Implement additional features like error handling and logging.
- Set up a job scheduler (e.g., Apache Airflow) for automated execution.
- Extend the project to handle larger datasets and optimize the performance.

This project can be used as a base for interview preparation and hands-on experience. Let me know if you need more help or specific instructions!

Unittest

For your data pipeline project, I will provide sample **unit test** code for **PySpark** using Python's built-in `unittest` framework, with `pytest` as the test runner. This assumes the pipeline applies some data transformations (such as filtering, aggregating, or joining data).

### Step-by-Step Unit Test Setup:
- We will simulate a **data pipeline** that processes a DataFrame.
- I'll create a few example transformations, then create unit tests to validate that the expected results are returned.

Let's assume a simple pipeline:
1. **Data loading**: A sample CSV file or data is loaded into a DataFrame.
2. **Data Transformation**: A few transformations are applied (filter, aggregation, etc.).
3. **Unit Test**: Validate that the transformation works correctly using assertions.

Here is how we can create the test code:

### Example Data Pipeline Code
Let's assume the pipeline contains the following transformations:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

class DataPipeline:
    def __init__(self, spark_session: SparkSession):
        self.spark = spark_session

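    def load_data(self, data):
        """ Load (name, age, salary) tuples into a DataFrame with named columns """
        # Without explicit names Spark would call the columns _1, _2, _3 and col("age") would fail
        return self.spark.createDataFrame(data, ["name", "age", "salary"])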

    def filter_data(self, df):
        """ Filter out rows where 'age' is less than 30 """
        return df.filter(col("age") >= 30)

    def calculate_avg_salary(self, df):
        """ Calculate the average salary """
        return df.agg(avg("salary").alias("avg_salary")).collect()[0]["avg_salary"]

    def process_pipeline(self, data):
        """ Full pipeline: load, filter and calculate average salary """
        df = self.load_data(data)
        df = self.filter_data(df)
        return self.calculate_avg_salary(df)
```

### Unit Test Code for Data Pipeline

```python
import unittest
from pyspark.sql import SparkSession
from data_pipeline import DataPipeline  # Assuming the pipeline code is in a file named `data_pipeline.py`

class DataPipelineTests(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        """ Set up the Spark session before any tests """
        cls.spark = SparkSession.builder.master("local[1]").appName("DataPipelineTest").getOrCreate()
        cls.pipeline = DataPipeline(cls.spark)

    def test_load_data(self):
        """ Test data loading into DataFrame """
        test_data = [("John", 35, 1000), ("Anna", 28, 1500)]
        df = self.pipeline.load_data(test_data)

        # Check if DataFrame is loaded correctly
        self.assertEqual(df.count(), 2)

    def test_filter_data(self):
        """ Test filtering data where 'age' >= 30 """
        test_data = [("John", 35, 1000), ("Anna", 28, 1500)]
        df = self.pipeline.load_data(test_data)
        filtered_df = self.pipeline.filter_data(df)

        # Verify that 'Anna' is filtered out (she's under 30)
        self.assertEqual(filtered_df.count(), 1)
        self.assertEqual(filtered_df.collect()[0]["name"], "John")

    def test_calculate_avg_salary(self):
        """ Test calculating the average salary """
        test_data = [("John", 35, 1000), ("Anna", 28, 1500)]
        df = self.pipeline.load_data(test_data)
        filtered_df = self.pipeline.filter_data(df)
        avg_salary = self.pipeline.calculate_avg_salary(filtered_df)

        # The only record left is John, so the average salary is 1000
        self.assertEqual(avg_salary, 1000)

    def test_process_pipeline(self):
        """ Test the entire data pipeline process """
        test_data = [("John", 35, 1000), ("Anna", 28, 1500)]
        avg_salary = self.pipeline.process_pipeline(test_data)

        # The average salary after filtering 'Anna' should be 1000
        self.assertEqual(avg_salary, 1000)

    @classmethod
    def tearDownClass(cls):
        """ Stop the Spark session after tests """
        cls.spark.stop()

if __name__ == '__main__':
    unittest.main()
```

### Explanation:
- **setUpClass()**: Initializes the `SparkSession` before running any tests. This method is run once for the entire test class.
- **test_load_data()**: Tests if the data is loaded correctly into a DataFrame.
- **test_filter_data()**: Verifies that the `filter_data()` method correctly filters out rows where `age` is less than 30.
- **test_calculate_avg_salary()**: Tests the logic of calculating the average salary after the filtering.
- **test_process_pipeline()**: Validates the entire pipeline by testing `process_pipeline()`.
- **tearDownClass()**: Stops the Spark session after all tests have been executed.

### Running the Tests:
You can run the tests using the following command in your terminal:

```bash
pytest test_data_pipeline.py
```

This will run the unit tests and give you a summary of the test results.

### Key Considerations:
- Make sure to have a **test-friendly dataset** (mock data) so the tests remain isolated and deterministic.
- You can use additional **assertions** like `assertEqual()`, `assertGreater()`, or `assertTrue()` to validate expected behaviors.
- The **SparkSession** is only created once for the entire test class to save time.

This is a simple example of testing your data pipeline, but you can expand on it to include more complex logic, error handling, and edge cases.