In [2]:
import pyspark



# ✅ PySpark Top 100 Methods: **PART 1 (1–25)**


**Category: DataFrame Creation, Selection, Filtering, Transformation**




## 1️⃣ `SparkSession.builder.getOrCreate()`

### **What**

Creates or retrieves the active Spark session.

### **Why**

Entry point for **all PySpark operations** (DataFrame, SQL, streaming).

### **How**

In [1]:
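from pyspark.sql import SparkSession

# Build a local session, or return the one that already exists.
spark = (
    SparkSession.builder
    .master("local[*]")   # use every local core
    .appName("MyApp")     # same effect as .config("spark.app.name", "MyApp")
    # Pin the driver to loopback; helps when hostname resolution is unreliable.
    .config("spark.driver.host", "127.0.0.1")
    .config("spark.driver.bindAddress", "127.0.0.1")
    # Fixed ports are optional. 4040 is also the Spark UI default, so the UI
    # falls back to the next free port when the driver claims it.
    .config("spark.driver.port", 4040)
    .config("spark.blockManager.port", 4041)
    .getOrCreate()
)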


### **When / Scenario**

* First line in **every PySpark job**
* Required for DataFrame & SQL APIs

### **Interview Tip**

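> Since Spark 2.0, SparkSession is the unified entry point that **replaced SQLContext and HiveContext**; it wraps a SparkContext, which is still reachable as `spark.sparkContext`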



## 2️⃣ `spark.read`

### **What**

Entry point to read external data.

### **Why**

Supports multiple formats (CSV, JSON, Parquet, ORC, and Avro via the external `spark-avro` package).

### **How**

In [2]:
import pathlib

In [3]:

filepath = str(pathlib.Path().cwd().parent / 'data' / 'Spotify_Artists.csv')
filepath

'D:\\shra1\\github\\pyspark-practice\\data\\Spotify_Artists.csv'

In [4]:
df = spark.read.csv(filepath,
                    header=True,
                    inferSchema=True)

In [5]:
df.show(5)

+---------+--------+----------+---------+
|artist_id|    name|     genre|  country|
+---------+--------+----------+---------+
|        1|Artist_1|Electronic|   France|
|        2|Artist_2|Electronic|Australia|
|        3|Artist_3|      Jazz|   France|
|        4|Artist_4| Classical|Australia|
|        5|Artist_5|   Hip-Hop|      USA|
+---------+--------+----------+---------+
only showing top 5 rows


### **Scenario**

* Data ingestion layer
* Batch pipelines




## 3️⃣ `spark.read.format()`

### **What**

Explicitly defines file format.

### **Why**

Needed for sources that have no dedicated reader method (for example Delta); it also works for built-in sources such as Parquet and JDBC.

### **How**

In [7]:
# df = spark.read.format("parquet").load("s3://bucket/data")

### **Interview Angle**

> Often preferred over `.csv()` / `.json()` in **production pipelines**, because the format is a plain string that can come from configuration




## 4️⃣ `spark.read.option()`

### **What**

Sets read-time options.

### **Why**

Controls parsing behavior (delimiter, header, schema inference, encoding).

### **How**

In [None]:
# df = spark.read.option("delimiter", "|").csv("data.txt")



## 5️⃣ `df.show()`

### **What**

Displays rows.

### **Why**

Quick debugging & inspection.

### **How**

In [6]:
df.show(5, truncate=False)

+---------+--------+----------+---------+
|artist_id|name    |genre     |country  |
+---------+--------+----------+---------+
|1        |Artist_1|Electronic|France   |
|2        |Artist_2|Electronic|Australia|
|3        |Artist_3|Jazz      |France   |
|4        |Artist_4|Classical |Australia|
|5        |Artist_5|Hip-Hop   |USA      |
+---------+--------+----------+---------+
only showing top 5 rows


### **Interview Trick**

> `show()` is an **action**: it triggers a Spark job, unlike lazy transformations such as `select()`




## 6️⃣ `df.printSchema()`

### **What**

Displays schema tree.

### **Why**

Critical for **debugging type issues**.

### **How**

In [7]:
df.printSchema()

root
 |-- artist_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- country: string (nullable = true)





## 7️⃣ `df.schema`

### **What**

Returns schema object.

### **Why**

Used for **programmatic schema validation**.

### **How**

In [8]:
df.schema

StructType([StructField('artist_id', IntegerType(), True), StructField('name', StringType(), True), StructField('genre', StringType(), True), StructField('country', StringType(), True)])



## 8️⃣ `df.select()`

### **What**

Selects columns.

### **Why**

Column pruning → performance optimization.

### **How**

In [9]:
df.select("name", "country").show(5)

+--------+---------+
|    name|  country|
+--------+---------+
|Artist_1|   France|
|Artist_2|Australia|
|Artist_3|   France|
|Artist_4|Australia|
|Artist_5|      USA|
+--------+---------+
only showing top 5 rows




## 9️⃣ `df.selectExpr()`

### **What**

Select using SQL expressions.

### **Why**

Cleaner transformations without `withColumn`.

### **How**

In [10]:
df.selectExpr("country as nation").show(5)

+---------+
|   nation|
+---------+
|   France|
|Australia|
|   France|
|Australia|
|      USA|
+---------+
only showing top 5 rows


### **Interview**

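> Quicker to write for **simple derived columns**; it builds the same plan as the equivalent `select()` with column expressions, so it is not faster at runtime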




## 🔟 `df.withColumn()`

### **What**

Adds or replaces a column.

### **Why**

Core transformation API.

### **How**

In [12]:
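from pyspark.sql.functions import col, concat_ws

# `+` on columns is arithmetic, not string concatenation; concat_ws joins strings.
df = df.withColumn("id_with_name", concat_ws("_", col("artist_id").cast("string"), col("name")))
df.show(5)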


### **Interview Warning**

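> Chaining `withColumn()` does **not** add stages (it is a narrow transformation), but every call adds a projection to the plan; in long chains or loops the plan grows large and analysis slows down. Prefer a single `select()` that adds all columns at once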

---

## 1️⃣1️⃣ `df.withColumnRenamed()`

### **What**

Renames a column.

### **How**

In [None]:
df_renamed = df.withColumnRenamed("country", "nation")  # no-op if the column does not exist

---

## 1️⃣2️⃣ `df.drop()`

### **What**

Removes columns.

### **Why**

Reduce memory & shuffle size.

### **How**

In [None]:
df.drop("id_with_name")  # returns a new DataFrame; unknown column names are ignored

---

## 1️⃣3️⃣ `df.filter()` / `df.where()`

### **What**

Row-level filtering.

### **Why**

Cuts the data down early; with sources such as Parquet or JDBC, Spark can push the predicate down to the scan.

### **How**

In [None]:
df.filter(col("country") == "France")  # keep only artists from France

### **Interview**

> `filter()` and `where()` are **identical**

---

## 1️⃣4️⃣ `df.distinct()`

### **What**

Removes duplicate rows.

### **Why**

Data deduplication.

### **How**

In [None]:
df.distinct()

### **Cost**

⚠️ Triggers a **shuffle**

---

## 1️⃣5️⃣ `df.dropDuplicates()`

### **What**

Removes duplicates based on columns.

### **How**

In [None]:
df.dropDuplicates(["genre", "country"])  # one row per (genre, country) pair

### **Why**

More control than `distinct()`: only the listed columns are compared, and one arbitrary row is kept per group.

---

## 1️⃣6️⃣ `df.orderBy()` / `df.sort()`

### **What**

Sorts rows.

### **How**

In [None]:
df.orderBy(col("artist_id").desc())  # highest artist_id first

### **Interview**

> A global sort is a **wide transformation**: rows are range-partitioned across the cluster, which requires a **shuffle**

---

## 1️⃣7️⃣ `df.limit()`

### **What**

Restricts number of rows.

### **Why**

Sampling / debugging.

### **How**

In [None]:
df.limit(10)

---

## 1️⃣8️⃣ `df.sample()`

### **What**

Random sampling.

### **How**

In [None]:
df.sample(fraction=0.1, seed=42)

### **Scenario**

* Model training
* Data analysis

---

## 1️⃣9️⃣ `df.count()`

### **What**

Counts rows.

### **Why**

Validation & metrics.

### **Interview**

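> `count()` is an **action** that runs a job over the whole dataset, so it can be very expensive on large or heavily transformed DataFrames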

---

## 2️⃣0️⃣ `df.collect()`

### **What**

Brings data to driver.

### **Why**

Debugging small datasets.

### **Danger**

❌ **OOM risk** on the driver when the dataset is large

---

## 2️⃣1️⃣ `df.take()`

### **What**

Returns first N rows.

### **How**

In [None]:
df.take(5)

---

## 2️⃣2️⃣ `df.head()`

### **What**

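`head()` returns the first row as a single `Row` (or `None` if the DataFrame is empty), whereas `take(1)` returns a one-element list. `head(n)` returns a list, like `take(n)`.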

---

## 2️⃣3️⃣ `df.columns`

### **What**

List of column names.

### **How**

In [None]:
df.columns

---

## 2️⃣4️⃣ `df.dtypes`

### **What**

Column names with data types.

---

## 2️⃣5️⃣ `df.describe()`

### **What**

Basic statistics.

### **How**

In [None]:
df.describe("artist_id").show()  # count, mean, stddev, min, max

---

# 🔥 Interview Coverage Achieved (Part 1)

* ✔ Data ingestion
* ✔ Column operations
* ✔ Filtering
* ✔ Performance implications
* ✔ Lazy vs action operations

---

## 👉 Next Parts (coming next)

* **PART 2:** Aggregations, GroupBy, Joins (26–50)
* **PART 3:** Functions, Window, UDF, Date/JSON (51–75)
* **PART 4:** Performance, Partitioning, Caching, Writing, Spark Internals (76–100)
