This is a **very large request** (100 methods × *what/why/how/scenario/code/interview angle*).
To keep it **interview-ready and actually usable**, I'll deliver this in **4 structured parts (25 methods each)**.

Below is **PART 1: Core PySpark DataFrame & Column APIs (Most Asked in Interviews)**
(These alone cover a large share of commonly asked interview questions)

---

# ✅ PySpark Top 100 Methods: **PART 1 (1–25)**

**Category: DataFrame Creation, Selection, Filtering, Transformation**

---

## 1️⃣ `SparkSession.builder.getOrCreate()`

### **What**

Creates or retrieves the active Spark session.

### **Why**

Entry point for **all PySpark operations** (DataFrame, SQL, streaming).

### **How**

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("InterviewPrep") \
    .getOrCreate()

### **When / Scenario**

* First line in **every PySpark job**
* Required for DataFrame & SQL APIs

### **Interview Tip**

> SparkSession (Spark 2.0+) **unifies SQLContext and HiveContext** and wraps SparkContext, which is still available as `spark.sparkContext`

---

## 2️⃣ `spark.read`

### **What**

Entry point to read external data.

### **Why**

Supports multiple formats (CSV, JSON, Parquet, ORC, and Avro via the external `spark-avro` module).

### **How**

In [None]:
df = spark.read.csv("data.csv", header=True, inferSchema=True)

### **Scenario**

* Data ingestion layer
* Batch pipelines

---

## 3️⃣ `spark.read.format()`

### **What**

Explicitly defines file format.

### **Why**

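Needed for sources without a dedicated shortcut method (for example Delta or Avro), and handy when the format is a runtime parameter. Parquet and JDBC can also be read with `spark.read.parquet()` and `spark.read.jdbc()`.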

### **How**

In [None]:
df = spark.read.format("parquet").load("s3://bucket/data")

### **Interview Angle**

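> `format(...).load(...)` is equivalent to `.csv()` / `.json()`; it is often preferred in **production pipelines** because the format can be passed in as configuration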

---

## 4️⃣ `spark.read.option()`

### **What**

Sets read-time options.

### **Why**

Controls parsing behavior (delimiter, header, encoding). A schema is supplied separately with `.schema()`.

### **How**

In [None]:
df = spark.read.option("delimiter", "|").csv("data.txt")

---

## 5️⃣ `df.show()`

### **What**

Displays rows.

### **Why**

Quick debugging & inspection.

### **How**

In [None]:
df.show(5, truncate=False)

### **Interview Trick**

> `show()` **is an action**, so it triggers execution of the plan

---

## 6️⃣ `df.printSchema()`

### **What**

Displays schema tree.

### **Why**

Critical for **debugging type issues**.

### **How**

In [None]:
df.printSchema()

---

## 7️⃣ `df.schema`

### **What**

Returns schema object.

### **Why**

Used for **programmatic schema validation**.

### **How**

In [None]:
df.schema

---

## 8️⃣ `df.select()`

### **What**

Selects columns.

### **Why**

Column pruning: reading only the needed columns reduces I/O, especially with columnar formats such as Parquet.

### **How**

In [None]:
df.select("id", "salary").show()

---

## 9️⃣ `df.selectExpr()`

### **What**

Select using SQL expressions.

### **Why**

Cleaner transformations without `withColumn`.

### **How**

In [None]:
df.selectExpr("salary * 1.1 as new_salary")

### **Interview**

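> Not faster than `select()` / `withColumn()`, since Catalyst compiles both to the same plan; the benefit is **concise SQL syntax for simple derived columns**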

---

## 🔟 `df.withColumn()`

### **What**

Adds or replaces a column.

### **Why**

Core transformation API.

### **How**

In [None]:
from pyspark.sql.functions import col
df = df.withColumn("bonus", col("salary") * 0.1)

### **Interview Warning**

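> Chained `withColumn()` calls do **not** create extra stages; each adds a projection that Catalyst normally collapses. Calling it many times (for example in a loop) bloats the logical plan and slows analysis, so prefer a single `select()`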

---

## 1️⃣1️⃣ `df.withColumnRenamed()`

### **What**

Renames a column.

### **How**

In [None]:
df = df.withColumnRenamed("sal", "salary")

---

## 1️⃣2️⃣ `df.drop()`

### **What**

Removes columns.

### **Why**

Reduce memory & shuffle size.

### **How**

In [None]:
df.drop("temp_col")

---

## 1️⃣3️⃣ `df.filter()` / `df.where()`

### **What**

Row-level filtering.

### **Why**

Filters can be pushed down to the data source (predicate pushdown) when the source supports it, for example Parquet or JDBC.

### **How**

In [None]:
df.filter(col("salary") > 50000)

### **Interview**

> `filter()` and `where()` are **identical**

---

## 1️⃣4️⃣ `df.distinct()`

### **What**

Removes duplicate rows.

### **Why**

Data deduplication.

### **How**

In [None]:
df.distinct()

### **Cost**

⚠️ Triggers **shuffle**

---

## 1️⃣5️⃣ `df.dropDuplicates()`

### **What**

Removes duplicates based on a subset of columns; which of the duplicate rows is kept is not guaranteed.

### **How**

In [None]:
df.dropDuplicates(["id"])

### **Why**

More control than `distinct()`

---

## 1️⃣6️⃣ `df.orderBy()` / `df.sort()`

### **What**

Sorts rows.

### **How**

In [None]:
df.orderBy(col("salary").desc())

### **Interview**

> A global sort is a **wide transformation** (it triggers a shuffle)

---

## 1️⃣7️⃣ `df.limit()`

### **What**

Restricts number of rows.

### **Why**

Quick previews / debugging. Note that `limit()` is not a random sample.

### **How**

In [None]:
df.limit(10)

---

## 1️⃣8️⃣ `df.sample()`

### **What**

Random sampling.

### **How**

In [None]:
df.sample(fraction=0.1, seed=42)

### **Scenario**

* Model training
* Data analysis

---

## 1️⃣9️⃣ `df.count()`

### **What**

Counts rows.

### **Why**

Validation & metrics.

### **Interview**

> An action that runs a full job over the data; can be very expensive on large datasets

---

## 2️⃣0️⃣ `df.collect()`

### **What**

Brings data to driver.

### **Why**

Debugging small datasets.

### **Danger**

❌ **OOM risk** on the driver if the result is large

---

## 2️⃣1️⃣ `df.take()`

### **What**

Returns first N rows.

### **How**

In [None]:
df.take(5)

---

## 2️⃣2️⃣ `df.head()`

### **What**

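With no argument, returns the first row as a single `Row` (or `None` if the DataFrame is empty); `head(n)` returns a list of rows, like `take(n)`.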

---

## 2️⃣3️⃣ `df.columns`

### **What**

List of column names.

### **How**

In [None]:
df.columns

---

## 2️⃣4️⃣ `df.dtypes`

### **What**

Column names with data types.

---

## 2️⃣5️⃣ `df.describe()`

### **What**

Basic statistics (count, mean, stddev, min, max) for numeric and string columns.

### **How**

In [None]:
df.describe("salary").show()

---

# 🔥 Interview Coverage Achieved (Part 1)

✔ Data ingestion
✔ Column operations
✔ Filtering
✔ Performance implications
✔ Lazy vs action operations

---

## 👉 Next Parts (coming next)

**PART 2:** Aggregations, GroupBy, Joins (26–50)
**PART 3:** Functions, Window, UDF, Date/JSON (51–75)
**PART 4:** Performance, Partitioning, Caching, Writing, Spark Internals (76–100)

If you want, I can:

* Map **each method → SQL equivalent**
* Add **real interview questions per method**
* Add **Spark execution plan & DAG explanations**

Just say **"Continue with Part 2"**