# Benchmarking Data Processing Frameworks: Pandas, Polars, and Spark

## 1. Definitions and Package Details

### 1.1. Pandas

![image.png](attachment:image.png)

Pandas is a data manipulation and analysis library for Python. It provides two core data structures, the DataFrame and the Series, along with tools for reading and writing data from various sources, which makes it a go-to package for data wrangling and exploration. Its API covers data cleaning, transformation, aggregation, and basic plotting, so it is a standard tool for data scientists and analysts.

### 1.2. Polars

![image.png](attachment:image.png)

Polars is a DataFrame library for Python, written in Rust and designed for working with large datasets. It runs operations in parallel across CPU cores and supports both eager and lazy execution. Its DataFrame API will feel familiar to Pandas users, although it is built around expressions rather than index-based operations, so code does not carry over one-to-one. Polars covers the usual data processing tasks such as filtering, aggregating, joining, and transforming DataFrames, which makes it a compelling option for those seeking both speed and ease of use.

### 1.3. PySpark (Apache Spark)

![image-3.png](attachment:image-3.png)

PySpark is the Python API for Apache Spark, a distributed data processing framework. It lets users run large-scale data processing, machine learning, and graph processing tasks across a cluster. Because the work is distributed, PySpark is suited to datasets and computations that do not fit on a single machine. With its APIs for Spark SQL, Spark Streaming, and MLlib, PySpark lets data engineers and data scientists build scalable data applications.

## 2. Benchmarking

In this section, I run the same tasks on the same datasets with each framework and record the elapsed time, to see which package performs best on each task.

In [1]:
import sys
sys.path.append('../scripts')
from functions import measure_read_csv_time
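
The helper `measure_read_csv_time` lives in `../scripts/functions.py`, which is not shown here. As a rough idea of what such a helper could look like, the sketch below times a single CSV read per package. The reader functions, the `READERS` mapping, and the optional `readers` parameter are assumptions made for illustration, not the actual contents of that script. Each library is imported inside its own reader so that a missing package only fails when that package is requested.

```python
import time

# Hypothetical reader functions; the real helper in scripts/functions.py may differ.
def _read_with_pandas(path):
    import pandas as pd
    return pd.read_csv(path)

def _read_with_polars(path):
    import polars as pl
    return pl.read_csv(path)

def _read_with_pyspark(path):
    from pyspark.sql import SparkSession
    # Note: the first call also pays for starting the Spark session.
    spark = SparkSession.builder.appName("csv-benchmark").getOrCreate()
    df = spark.read.csv(path, header=True, inferSchema=True)
    df.count()  # Spark is lazy, so force the file to be read in full
    return df

# Assumed mapping from the package labels used in the notebook to readers.
READERS = {
    "Pandas": _read_with_pandas,
    "Polars": _read_with_polars,
    "PySpark": _read_with_pyspark,
}

def measure_read_csv_time(package, csv_file_path, readers=READERS):
    """Return (package, elapsed seconds) for one CSV read."""
    if package not in readers:
        raise ValueError(f"Unknown package: {package}")
    start = time.perf_counter()
    readers[package](csv_file_path)
    elapsed = time.perf_counter() - start
    return package, elapsed
```

In this sketch the PySpark timing includes session startup on the first call, which can dominate on small files; creating the session before the timed region would give a fairer comparison.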

In [None]:
import pandas as pd

# Create a DataFrame to store benchmark results (one row per package and task)
benchmark_df = pd.DataFrame(columns=['PackageName', 'Task', 'Time'])

### 2.1. Reading CSV Files

In [None]:
csv_file_path = '../data/employees.csv'

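# Run the benchmark for each package and collect one result row per run
task = 'Reading CSV'
rows = []
for package in ["Pandas", "Polars", "PySpark"]:
    package_name, elapsed_time = measure_read_csv_time(package, csv_file_path)
    rows.append({'PackageName': package_name, 'Task': task, 'Time': elapsed_time})

# DataFrame.append was removed in pandas 2.0, so add the rows with pd.concat
benchmark_df = pd.concat([benchmark_df, pd.DataFrame(rows)], ignore_index=True)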
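
The loop above times each package once. A single run can be skewed by cold file caches or one-off startup costs, so one option is to repeat each measurement and report the minimum or median. The sketch below shows that idea using only the standard library; `time_repeated` and its `repeats` parameter are hypothetical names, not part of the notebook's helper script.

```python
import statistics
import time

def time_repeated(func, repeats=5):
    """Call func `repeats` times and summarise the elapsed times in seconds."""
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        runs.append(time.perf_counter() - start)
    return {
        "min": min(runs),                    # best case, least affected by noise
        "median": statistics.median(runs),   # typical case
        "runs": runs,                        # raw timings for later inspection
    }
```

It could be used as `time_repeated(lambda: measure_read_csv_time("Pandas", csv_file_path))`, with the `median` value stored in `benchmark_df` instead of a single timing.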

### 2.2. Aggregation

### 2.3. Appending Two Identical Datasets

### 2.4. Joining Two Tables

### 2.5. Writing CSV Files

## 3. Visualization

## 4. Conclusion