# PySpark Interview Q&A with Practical Examples

This notebook contains 56 PySpark interview questions, each with a concise, interview-style answer and an example code snippet.
All examples assume the following DataFrames have already been loaded in the notebook session:
- `emp_df` (Employees)
- `dept_df` (Departments)
- `orders_df` (Orders)
- `sales_df` (Sales)


### 1. What is PySpark?
**Answer:** PySpark is the Python API for Apache Spark. It enables distributed data processing through Spark’s RDD and DataFrame abstractions; the typed Dataset API is available only in Scala and Java.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("InterviewDemo").getOrCreate()
emp_df.show(5)

### 2. Explain RDD.
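**Answer:** An RDD (Resilient Distributed Dataset) is Spark’s low-level, immutable, partitioned collection of records. Transformations such as `map` and `filter` are lazy and only record how to derive a new RDD; actions such as `collect` and `count` trigger the computation. Fault tolerance comes from lineage: a lost partition is recomputed from the recorded transformations rather than restored from a replica.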

In [None]:
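# Every DataFrame exposes its underlying RDD of Row objects.
rdd = orders_df.rdd
# map() is a lazy transformation; collect() is the action that runs it
# and pulls every result to the driver, so use it only on small data.
amounts = rdd.map(lambda r: r['amount']).collect()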
print(amounts)
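
To make the lineage idea concrete, here is a minimal pure-Python sketch. `ToyRDD` and `load_amounts` are hypothetical names invented for this illustration, and the class is only a toy model, not how Spark implements RDDs. It shows that transformations merely record steps and that an action replays those steps from the source, which is also how a lost result can be rebuilt.

```python
# Toy model only: illustrates lazy transformations and lineage replay.
# ToyRDD is a hypothetical class, NOT part of PySpark.
class ToyRDD:
    def __init__(self, source, lineage=()):
        self._source = source            # callable that re-reads the base data
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    def map(self, fn):
        # Transformation: record the step, compute nothing.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: record the step, compute nothing.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: read the source and replay every recorded step in order.
        data = iter(self._source())
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


reads = []  # tracks how many times the source was actually read


def load_amounts():
    # Hypothetical stand-in for reading order amounts from storage.
    reads.append(1)
    return [120.0, 35.5, 80.0]


base = ToyRDD(load_amounts)
large_doubled = base.filter(lambda a: a > 50).map(lambda a: a * 2)
reads_before_action = len(reads)       # still 0: nothing has run yet

result = large_doubled.collect()       # first action reads the source
recomputed = large_doubled.collect()   # replaying lineage rebuilds the result
print(result, recomputed, len(reads))
```

Real RDDs apply the same principle per partition: Spark keeps the chain of transformations and recomputes only the partitions that were lost.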

### 3. What is a DataFrame?
**Answer:** A DataFrame is a distributed collection of rows organized into named, typed columns under a schema. Its queries are planned by the Catalyst optimizer, and it supports SQL-like operations such as select, filter, join, and group-by.

In [None]:
orders_df.printSchema()
orders_df.show(5)

### 4. Difference between RDD and DataFrame.
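**Answer:** A DataFrame carries a schema and its queries are optimized by Catalyst; an RDD is a lower-level collection of arbitrary Python objects with no schema and no query optimization. Prefer DataFrames for structured data and drop to RDDs only when you need fine-grained control over the computation.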

In [None]:
# The DataFrame and its underlying RDD expose the same rows, so the counts match.
print("DF rows:", orders_df.count())
print("RDD rows:", orders_df.rdd.count())

### 5. What is SparkSession?
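**Answer:** SparkSession is the unified entry point to Spark, introduced in Spark 2.0. It subsumes the older SQLContext and HiveContext and exposes the underlying SparkContext as `spark.sparkContext`, so one object gives access to DataFrame, SQL, and RDD functionality.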

In [None]:
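from pyspark.sql import SparkSession
# getOrCreate() returns the already-running session if there is one.
spark = SparkSession.builder.getOrCreate()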
print("Spark version:", spark.version)