# **PySpark vs Pandas Execution in the Same Script**

This notebook runs **PySpark** and **Pandas** side by side in the same script to show where each one executes.  

- **PySpark DataFrame**: Executes on a **Spark standalone cluster** whose master listens at `spark://localhost:7077`; the data is split into partitions that executors process in parallel.  
- **Pandas DataFrame**: Runs **locally**, holding all data in the memory of a single Python process.  
- **Spark Session**: Created with `SparkSession.builder`; it is the entry point for building DataFrames and submitting work to the cluster.  
- **Execution Flow**:
  1. Connect to Spark.
  2. Create a Spark DataFrame and display it.
  3. Create a Pandas DataFrame and print it.
  4. Stop the Spark session to release resources.
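
Step 1 assumes that a standalone master is already listening at `spark://localhost:7077`; if none is running, the connection does not succeed. The following is a minimal sketch of one way to make that step portable, not part of the original notebook. The environment variable `SPARK_MASTER_URL` and the helpers `resolve_master` and `build_session` are hypothetical names introduced for this illustration, and `local[*]` is Spark's built-in local mode.

```python
import os

# Spark's local mode: runs in-process with one worker thread per core.
DEFAULT_MASTER = "local[*]"


def resolve_master(env=None):
    """Return the Spark master URL to connect to.

    Reads SPARK_MASTER_URL, a variable name invented for this sketch
    (Spark does not read it itself), and falls back to local mode when
    the variable is unset or blank.
    """
    env = os.environ if env is None else env
    url = env.get("SPARK_MASTER_URL", "").strip()
    return url or DEFAULT_MASTER


def build_session(app_name="PySpark-Test", env=None):
    """Create (or reuse) a SparkSession against the resolved master."""
    # Imported lazily so resolve_master() works without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .master(resolve_master(env))
        .getOrCreate()
    )
```

Under these assumptions, setting `SPARK_MASTER_URL=spark://localhost:7077` before starting the notebook would target the cluster, while leaving it unset would run the same code in local mode.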

⚡ **Use PySpark when the data is too large for one machine's memory or benefits from parallel processing; use Pandas for small data and quick analysis.**


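```python
import findspark
findspark.init()  # locate the Spark installation and add PySpark to sys.path

import pandas as pd
from pyspark.sql import SparkSession

# Connect to the Spark standalone master (it must already be running).
spark = (
    SparkSession.builder
    .appName("PySpark-Test")
    .master("spark://localhost:7077")
    .getOrCreate()
)

# Spark DataFrame: defined on the driver, evaluated by the cluster's executors.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.show()

# Pandas DataFrame: lives entirely in this Python process's memory.
df2 = pd.DataFrame([(1, "Alice"), (2, "Bob")], columns=["id", "name"])
print(df2)

# Release the resources held by this session.
spark.stop()
```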

```text
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

   id   name
0   1  Alice
1   2    Bob
```
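
The cell above builds the two DataFrames independently. To actually hand data from one library to the other, PySpark provides `DataFrame.toPandas()` and `SparkSession.createDataFrame()`. The sketch below is an illustration rather than part of the original notebook: the helper names `to_pandas_capped` and `to_spark` and the default row cap are assumptions chosen for the example. The cap is there because `toPandas()` collects every remaining row into the driver's memory.

```python
def to_pandas_capped(spark_df, max_rows=10_000):
    """Collect at most max_rows rows of a Spark DataFrame into Pandas.

    max_rows is an illustrative safety cap, not a Spark setting.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is lazy and evaluated by Spark; toPandas() then pulls the
    # remaining rows into the driver process as a pandas.DataFrame.
    return spark_df.limit(max_rows).toPandas()


def to_spark(spark, pandas_df):
    """Distribute a Pandas DataFrame as a Spark DataFrame."""
    # createDataFrame() accepts a pandas.DataFrame and infers the schema.
    return spark.createDataFrame(pandas_df)
```

With the `spark` session, `df`, and `df2` from the cell above, `to_pandas_capped(df)` would return a Pandas DataFrame with the `id` and `name` columns, and `to_spark(spark, df2)` would return the Spark equivalent. Both calls need a live session, so they have to run before `spark.stop()`.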
