# 📊 Pandas vs. PySpark: Basic Function Comparison

This notebook compares **Pandas** and **PySpark** side by side for common data manipulation tasks, using the same dataset in both. It shows how each operation translates between small-scale, in-memory processing (Pandas) and distributed big-data processing (PySpark).

---

## 📁 Sample Dataset

We'll use the following dataset, saved as `sample.csv`:

```
name,age,salary
Alice,25,50000
Bob,30,60000
Charlie,35,70000
David,40,80000
Eva,45,90000
```
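
To follow along, the file has to exist in the working directory. One way to create it is with Python's standard `csv` module, as in the minimal sketch below. The helper name `write_sample_csv` and the `ROWS` constant are made up for illustration and are not part of the original notebook.

```python
import csv

# Rows copied from the sample dataset shown above.
ROWS = [
    ("Alice", 25, 50000),
    ("Bob", 30, 60000),
    ("Charlie", 35, 70000),
    ("David", 40, 80000),
    ("Eva", 45, 90000),
]


def write_sample_csv(path="sample.csv"):
    """Write the header line and the five sample rows to `path` (hypothetical helper)."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "age", "salary"])
        writer.writerows(ROWS)
    return path


if __name__ == "__main__":
    write_sample_csv()
```

Running this once should produce a `sample.csv` that both `pd.read_csv("sample.csv")` and `spark.read.csv("sample.csv", header=True, inferSchema=True)` can read.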

---

## 🔹 1. Load CSV File

###  Pandas

```python
import pandas as pd

df = pd.read_csv("sample.csv")
print(df)
```

###  PySpark

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BasicComparison").getOrCreate()
df = spark.read.csv("sample.csv", header=True, inferSchema=True)
df.show()
```

---

## 🔹 2. Show Schema / Info

###  Pandas

```python
df.info()
```

###  PySpark

```python
df.printSchema()
```

---

## 🔹 3. Basic Statistics

###  Pandas

```python
df.describe()
```

###  PySpark

```python
df.describe().show()
```

---

## 🔹 4. Filter Rows (age > 30)

###  Pandas

```python
df[df['age'] > 30]
```

###  PySpark

```python
df.filter(df.age > 30).show()
```

---

## 🔹 5. Select Specific Columns

###  Pandas

```python
df[['name', 'salary']]
```

###  PySpark

```python
df.select("name", "salary").show()
```

---

## 🔹 6. GroupBy and Aggregate (Average Salary by Age)

###  Pandas

```python
# Select salary explicitly so the non-numeric `name` column is not averaged
df.groupby("age")["salary"].mean()
```

###  PySpark

```python
from pyspark.sql.functions import avg

df.groupBy("age").agg(avg("salary")).show()
```

---

## 🔹 7. Add a New Column (bonus = salary * 0.1)

###  Pandas

```python
df["bonus"] = df["salary"] * 0.1
df.head()
```

###  PySpark

```python
from pyspark.sql.functions import col

df = df.withColumn("bonus", col("salary") * 0.1)
df.show()
```

---

## 🔹 8. Sorting by Salary (Descending)

###  Pandas

```python
df.sort_values("salary", ascending=False)
```

###  PySpark

```python
df.orderBy("salary", ascending=False).show()
```

---

##  Summary Table

| Operation | Pandas | PySpark |
| --- | --- | --- |
| Read CSV | `pd.read_csv()` | `spark.read.csv()` |
| Show Schema | `df.info()` | `df.printSchema()` |
| Basic Statistics | `df.describe()` | `df.describe().show()` |
| Filter | `df[df['age'] > 30]` | `df.filter(df.age > 30)` |
| Select Columns | `df[['name', 'salary']]` | `df.select("name", "salary")` |
| GroupBy & Aggregate | `df.groupby("age")["salary"].mean()` | `df.groupBy("age").agg(avg("salary"))` |
| Add Column | `df["new"] = ...` | `df.withColumn()` |
| Sort | `df.sort_values()` | `df.orderBy()` |

---

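## Spark Session Notes

Create the session and check the application name:

```python
import findspark
findspark.init()
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BasicComparison").getOrCreate()
print(spark.sparkContext.appName)
```

```
BasicComparison
```

After `spark.stop()`, the application name is still readable, because it is stored on the Python side of the context:

```python
spark.stop()
print(spark.sparkContext.appName)
```

```
BasicComparison
```

Properties that query the underlying JVM context no longer work once the session is stopped. Here `applicationId` raises an error, so the lines after it never run:

```python
print("App Name:", spark.sparkContext.appName)
print("App ID:", spark.sparkContext.applicationId)
print("Master:", spark.sparkContext.master)
print("Spark UI:", spark.sparkContext.uiWebUrl)
```

```
App Name: BasicComparison
AttributeError: 'NoneType' object has no attribute 'sc'
```

To inspect these values, read them before calling `spark.stop()`.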