# Data Analysis with PySpark in Jupyter Notebook

In this lab, you'll learn how to:
- Set up a Jupyter Notebook for PySpark.
- Load data from various formats (JSON, CSV, Parquet, Avro).
- Perform SQL queries and data analysis using PySpark.

## Step 1: Set Up PySpark in Jupyter Notebook

In [None]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("DataProcessingLab") \
    .getOrCreate()

# Check Spark version
print(f"Spark Version: {spark.version}")

## Step 2: Load Data from Different Formats
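
The cells in this step assume that `employees.json` and `employees.csv` already exist in the notebook's working directory. If they do not, the following is a minimal sketch for generating stand-in files using only the standard library. The records are hypothetical; only the column names (`name`, `department`, `salary`, `join_date`) come from the queries later in this lab. The JSON file is written as one object per line, which is the layout `spark.read.json` expects by default.

```python
import csv
import json
from pathlib import Path

# Hypothetical records; only the column names come from this lab's queries.
SAMPLE_EMPLOYEES = [
    {"name": "Alice", "department": "Engineering", "salary": 72000, "join_date": "2020-03-15"},
    {"name": "Bob", "department": "Marketing", "salary": 51000, "join_date": "2021-06-01"},
    {"name": "Carol", "department": "Engineering", "salary": 58000, "join_date": "2022-01-10"},
    {"name": "Dan", "department": "Sales", "salary": 49000, "join_date": "2019-09-23"},
]


def write_sample_files(directory="."):
    """Create employees.json and employees.csv unless they already exist."""
    base = Path(directory)
    json_path = base / "employees.json"
    csv_path = base / "employees.csv"

    if not json_path.exists():
        # JSON Lines: one complete JSON object per line
        with json_path.open("w", encoding="utf-8") as f:
            for record in SAMPLE_EMPLOYEES:
                f.write(json.dumps(record) + "\n")

    if not csv_path.exists():
        # Header row first, so header=True picks up the column names
        with csv_path.open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(SAMPLE_EMPLOYEES[0]))
            writer.writeheader()
            writer.writerows(SAMPLE_EMPLOYEES)

    return json_path, csv_path


json_path, csv_path = write_sample_files()
print(json_path, csv_path)
```

Existing files are left untouched, so running this next to real data should be harmless.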

### 1. Reading JSON Data

In [None]:
# Load JSON data
json_df = spark.read.json("employees.json")
json_df.show()
json_df.printSchema()

### 2. Reading CSV Data

In [None]:
# Load CSV data
csv_df = spark.read.csv("employees.csv", header=True, inferSchema=True)
csv_df.show()
csv_df.printSchema()

### 3. Reading Parquet Data

In [None]:
# Save CSV DataFrame as Parquet and read it back
csv_df.write.mode("overwrite").parquet("employees.parquet")
parquet_df = spark.read.parquet("employees.parquet")
parquet_df.show()
parquet_df.printSchema()

### 4. Reading Avro Data

In [None]:
# Save DataFrame as Avro and read it back. Avro support lives in the external
# spark-avro module, which the session created in Step 1 does not load by itself.
parquet_df.write.mode("overwrite").format("avro").save("employees.avro")
avro_df = spark.read.format("avro").load("employees.avro")
avro_df.show()
avro_df.printSchema()
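
One common way to make the `avro` format available is to pass the spark-avro Maven coordinate through the `spark.jars.packages` setting when the session is first created. The sketch below only builds that coordinate string. The helper name `spark_avro_coordinate` is hypothetical, the Scala suffix `2.12` is an assumption that depends on how your Spark distribution was built, and `3.5.0` is just an example version; the artifact version should match the value printed by `spark.version`.

```python
def spark_avro_coordinate(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the spark-avro module for a given Spark version."""
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


# Example version only; substitute the version reported by spark.version.
coordinate = spark_avro_coordinate("3.5.0")
print(coordinate)

# The coordinate has to be supplied before the first session starts, for example:
# spark = (SparkSession.builder
#          .appName("DataProcessingLab")
#          .config("spark.jars.packages", coordinate)
#          .getOrCreate())
```

If a session is already running in the notebook, `getOrCreate()` returns it unchanged, so the kernel usually needs a restart before the setting takes effect.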

## Step 3: Create Temporary Views and Run SQL Queries

In [None]:
# Create temporary views
json_df.createOrReplaceTempView("employees_json")
csv_df.createOrReplaceTempView("employees_csv")
parquet_df.createOrReplaceTempView("employees_parquet")

# Run SQL query
result = spark.sql("SELECT name, department, salary FROM employees_json WHERE salary > 55000")
result.show()

## Step 4: Perform Data Analysis Using DataFrame API

In [None]:
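# Filter employees who joined after January 1, 2021
# (ISO yyyy-MM-dd strings sort in date order, so this also works if join_date is a string column)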
recent_employees = json_df.filter(json_df.join_date > "2021-01-01")
recent_employees.show()
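
If `join_date` is loaded as a string column, the filter above is a text comparison. That gives the right answer only because ISO `yyyy-MM-dd` strings sort in the same order as the dates they represent. The plain-Python sketch below illustrates the point with hypothetical values.

```python
from datetime import date

cutoff = "2021-01-01"
join_dates = ["2020-11-30", "2021-01-01", "2021-06-15", "2022-03-01"]  # hypothetical values

# Text comparison, as a string column would be compared
by_string = [d for d in join_dates if d > cutoff]

# True calendar comparison
by_date = [d for d in join_dates if date.fromisoformat(d) > date.fromisoformat(cutoff)]

print(by_string == by_date)

# With a month-first layout the text order no longer matches the calendar:
# December 31, 2020 sorts after January 1, 2021 when compared as strings.
print("12/31/2020" > "01/01/2021")
```

For data in a non-ISO layout, converting the column first (for example with `pyspark.sql.functions.to_date` and an explicit format) is the safer route.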

In [None]:
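from pyspark.sql import functions as F

# Calculate average salary by department. The alias replaces the default column
# name "avg(salary)", which some Spark versions reject when writing Parquet.
avg_salary = csv_df.groupBy("department").agg(F.avg("salary").alias("avg_salary"))
avg_salary.show()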

In [None]:
# Sort employees by salary in descending order
sorted_employees = parquet_df.orderBy(parquet_df.salary.desc())
sorted_employees.show()

## Step 5: Save Results

In [None]:
# Save as CSV (Spark writes a directory of part files; header=True keeps the column names)
result.write.mode("overwrite").csv("filtered_employees.csv", header=True)

# Save as JSON
avg_salary.write.mode("overwrite").json("average_salary.json")

# Save as Parquet
avg_salary.write.mode("overwrite").parquet("average_salary.parquet")
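
Each of these paths ends up as a directory rather than a single file: Spark writes one `part-*` file per partition, alongside marker files such as `_SUCCESS`. The sketch below lists the data files in such a directory. The helper name `list_part_files` is hypothetical, and it returns an empty list if the directory has not been written yet.

```python
from pathlib import Path


def list_part_files(output_dir):
    """Return the part files under a Spark output directory, skipping marker files."""
    directory = Path(output_dir)
    if not directory.is_dir():
        return []
    # Data files start with "part-"; _SUCCESS and hidden .crc files do not.
    return sorted(p for p in directory.iterdir() if p.name.startswith("part-"))


for part in list_part_files("filtered_employees.csv"):
    print(part.name, part.stat().st_size, "bytes")
```

To read the results back, point `spark.read` at the directory itself rather than at an individual part file.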

## Lab Summary

In this lab, you've learned how to:
- Set up PySpark in Jupyter Notebook.
- Read data from different file formats.
- Perform SQL queries and DataFrame operations.
- Save results in different formats.