In [170]:
import pyspark



# ✅ PySpark Top 100 Methods: **PART 1 (1–25)**


**Category: DataFrame Creation, Selection, Filtering, Transformation**




## 1️⃣ `SparkSession.builder.getOrCreate()`

### **What**

Creates a new Spark session, or returns the one that is already active.

### **Why**

Entry point for **all PySpark operations** (DataFrame, SQL, streaming).

### **How**

In [171]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.master("local[*]") \
.appName("MyApp") \
.getOrCreate()


### **When / Scenario**

* First line in **every PySpark job**
* Required for DataFrame & SQL APIs

### **Interview Tip**

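> SparkSession **unifies SQLContext and HiveContext** as the single entry point since Spark 2.0; the underlying `SparkContext` is still available as `spark.sparkContext`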



## 2️⃣ `spark.read`

### **What**

Returns a `DataFrameReader`, the entry point for loading external data.

### **Why**

Supports multiple formats out of the box (CSV, JSON, Parquet, ORC, text, JDBC); Avro needs the external `spark-avro` package.

### **How**

In [172]:
import pathlib

In [173]:

filepath = str(pathlib.Path().cwd().parent / 'data' / 'Spotify_Artists.csv')
filepath

'd:\\shra1\\github\\pyspark-practice\\data\\Spotify_Artists.csv'

In [174]:
df = spark.read.csv(filepath,
                    header=True,
                    inferSchema=True)

In [175]:
df.show(5)

+---------+--------+----------+---------+
|artist_id|    name|     genre|  country|
+---------+--------+----------+---------+
|        1|Artist_1|Electronic|   France|
|        2|Artist_2|Electronic|Australia|
|        3|Artist_3|      Jazz|   France|
|        4|Artist_4| Classical|Australia|
|        5|Artist_5|   Hip-Hop|      USA|
+---------+--------+----------+---------+
only showing top 5 rows


### **Scenario**

* Data ingestion layer
* Batch pipelines




## 3️⃣ `spark.read.format()`

### **What**

Explicitly defines file format.

### **Why**

Needed for sources without a shortcut method (Delta, Avro, other third-party connectors); built-in formats such as Parquet and JDBC work either way.

### **How**

In [176]:
# df = spark.read.format("parquet").load("s3://bucket/data")

### **Interview Angle**

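> `.csv()` / `.json()` are shorthand for `.format(...).load()`; the generic form is often preferred in **production pipelines** because the format can come from configuration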




## 4️⃣ `spark.read.option()`

### **What**

Sets read-time options.

### **Why**

Controls parsing behavior (header, delimiter, encoding, schema inference). An explicit schema is passed with `.schema()`, not as an option.

### **How**

In [177]:
# df = spark.read.option("delimiter", "|").csv("data.txt")



## 5️⃣ `df.show()`

### **What**

Prints the first `n` rows (20 by default) as a text table, truncating long values.

### **Why**

Quick debugging & inspection.

### **How**

In [178]:
df.show(100)

+---------+---------+----------+-----------+
|artist_id|     name|     genre|    country|
+---------+---------+----------+-----------+
|        1| Artist_1|Electronic|     France|
|        2| Artist_2|Electronic|  Australia|
|        3| Artist_3|      Jazz|     France|
|        4| Artist_4| Classical|  Australia|
|        5| Artist_5|   Hip-Hop|        USA|
|        6| Artist_6|      Jazz|  Australia|
|        7| Artist_7|      Jazz|South Korea|
|        8| Artist_8|      Rock|    Germany|
|        9| Artist_9|      Jazz|      Japan|
|       10|Artist_10|      Rock|    Germany|
|       11|Artist_11|      Rock|      Japan|
|       12|Artist_12| Classical|     France|
|       13|Artist_13|Electronic|    Germany|
|       14|Artist_14| Classical|        USA|
|       15|Artist_15|      Jazz|  Australia|
|       16|Artist_16|   Hip-Hop|        USA|
|       17|Artist_17|       Pop|     France|
|       18|Artist_18|       Pop|     Canada|
|       19|Artist_19|Electronic|South Korea|
|       20

### **Interview Trick**

> `show()` **is an action**: it triggers a job, but only computes enough rows to fill the display




## 6️⃣ `df.printSchema()`

### **What**

Displays schema tree.

### **Why**

Critical for **debugging type issues**.

### **How**

In [179]:
df.printSchema()

root
 |-- artist_id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- country: string (nullable = true)





## 7️⃣ `df.schema`

### **What**

Returns the schema as a `StructType` object.

### **Why**

Used for **programmatic schema validation**.

### **How**

In [180]:
df.schema

StructType([StructField('artist_id', IntegerType(), True), StructField('name', StringType(), True), StructField('genre', StringType(), True), StructField('country', StringType(), True)])



## 8️⃣ `df.select()`

### **What**

Selects columns.

### **Why**

Column pruning → performance optimization.

### **How**

In [181]:
df.select("name", "country").show(5)

+--------+---------+
|    name|  country|
+--------+---------+
|Artist_1|   France|
|Artist_2|Australia|
|Artist_3|   France|
|Artist_4|Australia|
|Artist_5|      USA|
+--------+---------+
only showing top 5 rows




## 9️⃣ `df.selectExpr()`

### **What**

Select using SQL expressions.

### **Why**

Cleaner transformations without `withColumn`.

### **How**

In [182]:
df.selectExpr("country as nation").show(5)

+---------+
|   nation|
+---------+
|   France|
|Australia|
|   France|
|Australia|
|      USA|
+---------+
only showing top 5 rows


### **Interview**

> Same execution plan as `select(expr(...))`: more concise for **simple derived columns**, not faster




## 🔟 `df.withColumn()`

### **What**

Adds or replaces a column.

### **Why**

Core transformation API.

### **How**

In [183]:
df.show(5)

+---------+--------+----------+---------+
|artist_id|    name|     genre|  country|
+---------+--------+----------+---------+
|        1|Artist_1|Electronic|   France|
|        2|Artist_2|Electronic|Australia|
|        3|Artist_3|      Jazz|   France|
|        4|Artist_4| Classical|Australia|
|        5|Artist_5|   Hip-Hop|      USA|
+---------+--------+----------+---------+
only showing top 5 rows


In [184]:
from pyspark.sql.functions import col, concat, lit

df = df.withColumn("id_with_name", concat(col("artist_id"), lit("_"), col("name")))
df.show(5)

+---------+--------+----------+---------+------------+
|artist_id|    name|     genre|  country|id_with_name|
+---------+--------+----------+---------+------------+
|        1|Artist_1|Electronic|   France|  1_Artist_1|
|        2|Artist_2|Electronic|Australia|  2_Artist_2|
|        3|Artist_3|      Jazz|   France|  3_Artist_3|
|        4|Artist_4| Classical|Australia|  4_Artist_4|
|        5|Artist_5|   Hip-Hop|      USA|  5_Artist_5|
+---------+--------+----------+---------+------------+
only showing top 5 rows


### **Interview Warning**

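> Each `withColumn()` call adds a projection to the logical plan, not a DAG stage. The optimizer usually collapses them, but long chains slow down plan analysis, so prefer **one `select()`** when adding many columns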

---

## 1️⃣1️⃣ `df.withColumnRenamed()`

### **What**

Renames a column; it is a no-op if the column does not exist.

### **How**

In [185]:
df = df.withColumnRenamed("artist_id", "id")
df.show(5)

+---+--------+----------+---------+------------+
| id|    name|     genre|  country|id_with_name|
+---+--------+----------+---------+------------+
|  1|Artist_1|Electronic|   France|  1_Artist_1|
|  2|Artist_2|Electronic|Australia|  2_Artist_2|
|  3|Artist_3|      Jazz|   France|  3_Artist_3|
|  4|Artist_4| Classical|Australia|  4_Artist_4|
|  5|Artist_5|   Hip-Hop|      USA|  5_Artist_5|
+---+--------+----------+---------+------------+
only showing top 5 rows


---

## 1️⃣2️⃣ `df.drop()`

### **What**

Removes columns.

### **Why**

Reduces memory use and shuffle size.

### **How**

In [186]:
df = df.drop("id_with_name")

---

## 1️⃣3️⃣ `df.filter()` / `df.where()`

### **What**

Row-level filtering.

### **Why**

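Filtering early reduces the data processed downstream; with sources such as Parquet, ORC, and JDBC, Spark can push the predicate down to the reader.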

### **How**

In [187]:
df.filter(col("country") == "Australia").show(10)

+---+---------+----------+---------+
| id|     name|     genre|  country|
+---+---------+----------+---------+
|  2| Artist_2|Electronic|Australia|
|  4| Artist_4| Classical|Australia|
|  6| Artist_6|      Jazz|Australia|
| 15|Artist_15|      Jazz|Australia|
| 29|Artist_29| Classical|Australia|
| 31|Artist_31|      Jazz|Australia|
| 34|Artist_34|Electronic|Australia|
| 41|Artist_41|   Hip-Hop|Australia|
+---+---------+----------+---------+



In [188]:
df.where(col("country") == "Australia").show(10)

+---+---------+----------+---------+
| id|     name|     genre|  country|
+---+---------+----------+---------+
|  2| Artist_2|Electronic|Australia|
|  4| Artist_4| Classical|Australia|
|  6| Artist_6|      Jazz|Australia|
| 15|Artist_15|      Jazz|Australia|
| 29|Artist_29| Classical|Australia|
| 31|Artist_31|      Jazz|Australia|
| 34|Artist_34|Electronic|Australia|
| 41|Artist_41|   Hip-Hop|Australia|
+---+---------+----------+---------+



### **Interview**

> `where()` is an alias for `filter()`; the two are **identical**

---

## 1️⃣4️⃣ `df.distinct()`

### **What**

Removes duplicate rows (compares all columns).

### **Why**

Data deduplication.

### **How**

In [189]:
df.select(df.name, df.country).show(5)

+--------+---------+
|    name|  country|
+--------+---------+
|Artist_1|   France|
|Artist_2|Australia|
|Artist_3|   France|
|Artist_4|Australia|
|Artist_5|      USA|
+--------+---------+
only showing top 5 rows


In [190]:
df.select("country").distinct().show()

+-----------+
|    country|
+-----------+
|    Germany|
|     France|
|        USA|
|South Korea|
|         UK|
|     Canada|
|      Japan|
|  Australia|
+-----------+



### **Cost**

⚠️ Triggers a **shuffle**

---

## 1️⃣5️⃣ `df.dropDuplicates()`

### **What**

Removes duplicates based on a subset of columns (all columns if none are given).

### **How**

In [191]:
df.dropDuplicates(["id"])  # returns a new DataFrame; df itself is unchanged

DataFrame[id: int, name: string, genre: string, country: string]

In [192]:
df.show(5)

+---+--------+----------+---------+
| id|    name|     genre|  country|
+---+--------+----------+---------+
|  1|Artist_1|Electronic|   France|
|  2|Artist_2|Electronic|Australia|
|  3|Artist_3|      Jazz|   France|
|  4|Artist_4| Classical|Australia|
|  5|Artist_5|   Hip-Hop|      USA|
+---+--------+----------+---------+
only showing top 5 rows


### **Why**

More control than `distinct()`: you choose the key columns. Which row survives for each key is not guaranteed.

---

## 1️⃣6️⃣ `df.orderBy()` / `df.sort()`

### **What**

Sorts rows; `sort()` is an alias for `orderBy()`.

### **How**

In [193]:
df.orderBy(df.country.desc()).show(10)

+---+---------+----------+-----------+
| id|     name|     genre|    country|
+---+---------+----------+-----------+
|  5| Artist_5|   Hip-Hop|        USA|
| 14|Artist_14| Classical|        USA|
| 16|Artist_16|   Hip-Hop|        USA|
| 40|Artist_40|Electronic|        USA|
| 46|Artist_46|      Rock|        USA|
| 20|Artist_20|       Pop|         UK|
| 21|Artist_21|      Jazz|         UK|
| 39|Artist_39| Classical|         UK|
| 45|Artist_45|Electronic|         UK|
| 42|Artist_42| Classical|South Korea|
+---+---------+----------+-----------+
only showing top 10 rows


In [194]:
df.orderBy("country", ascending=False).show(10)

+---+---------+----------+-----------+
| id|     name|     genre|    country|
+---+---------+----------+-----------+
|  5| Artist_5|   Hip-Hop|        USA|
| 14|Artist_14| Classical|        USA|
| 16|Artist_16|   Hip-Hop|        USA|
| 40|Artist_40|Electronic|        USA|
| 46|Artist_46|      Rock|        USA|
| 20|Artist_20|       Pop|         UK|
| 21|Artist_21|      Jazz|         UK|
| 39|Artist_39| Classical|         UK|
| 45|Artist_45|Electronic|         UK|
| 42|Artist_42| Classical|South Korea|
+---+---------+----------+-----------+
only showing top 10 rows


### **Interview**

> Sorting is a **wide transformation**: it shuffles data across partitions

---

## 1️⃣7️⃣ `df.limit()`

### **What**

Restricts number of rows.

### **Why**

Previewing / debugging. Returns a new DataFrame (a transformation), unlike `take()`; the rows are not a random sample.

### **How**

In [195]:
df.limit(1).show()

+---+--------+----------+-------+
| id|    name|     genre|country|
+---+--------+----------+-------+
|  1|Artist_1|Electronic| France|
+---+--------+----------+-------+





## 1️⃣8️⃣ `df.sample()`

### **What**

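Random sampling, without replacement by default. `fraction` is a per-row probability, so the number of rows returned is approximate.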

### **How**

In [196]:
df.sample(fraction=0.2, seed=1).show()

+---+---------+----------+-----------+
| id|     name|     genre|    country|
+---+---------+----------+-----------+
|  3| Artist_3|      Jazz|     France|
|  4| Artist_4| Classical|  Australia|
|  8| Artist_8|      Rock|    Germany|
| 12|Artist_12| Classical|     France|
| 19|Artist_19|Electronic|South Korea|
| 21|Artist_21|      Jazz|         UK|
| 25|Artist_25| Classical|     Canada|
| 29|Artist_29| Classical|  Australia|
| 46|Artist_46|      Rock|        USA|
+---+---------+----------+-----------+



### **Scenario**

* Model training
* Data analysis





## 1️⃣9️⃣ `df.count()`

### **What**

Counts rows.

### **Why**

Validation & metrics.

### **Interview**

> An action that scans the whole dataset: expensive on large data, so avoid calling it repeatedly



In [197]:
df.count()

50



## 2️⃣0️⃣ `df.collect()`

### **What**

Brings **all** rows to the driver as a list of `Row` objects.

### **Why**

Debugging small datasets.

### **Danger**

❌ **OOM risk** on the driver for large DataFrames



In [198]:
df.limit(5).collect()

[Row(id=1, name='Artist_1', genre='Electronic', country='France'),
 Row(id=2, name='Artist_2', genre='Electronic', country='Australia'),
 Row(id=3, name='Artist_3', genre='Jazz', country='France'),
 Row(id=4, name='Artist_4', genre='Classical', country='Australia'),
 Row(id=5, name='Artist_5', genre='Hip-Hop', country='USA')]



## 2️⃣1️⃣ `df.take()`

### **What**

Returns the first N rows as a list of `Row` objects.

### **How**

In [199]:
df.take(5)

[Row(id=1, name='Artist_1', genre='Electronic', country='France'),
 Row(id=2, name='Artist_2', genre='Electronic', country='Australia'),
 Row(id=3, name='Artist_3', genre='Jazz', country='France'),
 Row(id=4, name='Artist_4', genre='Classical', country='Australia'),
 Row(id=5, name='Artist_5', genre='Hip-Hop', country='USA')]



## 2️⃣2️⃣ `df.head()`

### **What**

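`head(n)` is the same as `take(n)`. With no argument, `head()` returns a single `Row` (or `None` if the DataFrame is empty), whereas `take(1)` returns a one-element list.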





## 2️⃣3️⃣ `df.columns`

### **What**

List of column names.

### **How**

In [200]:
df.columns

['id', 'name', 'genre', 'country']



## 2️⃣4️⃣ `df.dtypes`

### **What**

Column names with data types.



In [201]:
df.dtypes

[('id', 'int'), ('name', 'string'), ('genre', 'string'), ('country', 'string')]



## 2️⃣5️⃣ `df.describe()`

### **What**

Basic statistics (count, mean, stddev, min, max) for numeric and string columns.

### **How**

In [202]:
df.describe("id").show()

+-------+------------------+
|summary|                id|
+-------+------------------+
|  count|                50|
|   mean|              25.5|
| stddev|14.577379737113251|
|    min|                 1|
|    max|                50|
+-------+------------------+




# 🔥 Interview Coverage Achieved (Part 1)

* ✔ Data ingestion
* ✔ Column operations
* ✔ Filtering
* ✔ Performance implications
* ✔ Lazy vs action operations





## 👉 Next Parts (coming next)

**PART 2:** Aggregations, GroupBy, Joins (26–50)

**PART 3:** Functions, Window, UDF, Date/JSON (51–75)

**PART 4:** Performance, Partitioning, Caching, Writing, Spark Internals (76–100)
