# Advanced Row Manipulation in PySpark
This notebook walks through common row manipulation techniques in PySpark: filtering, sorting, deduplication, sampling, custom row-wise operations, retrieving distinct rows, and window functions. It also shows how to rename columns.

## Topics Covered
1. Filtering Rows
2. Sorting Rows
3. Removing Duplicate Rows
4. Random Sampling
5. Applying Custom Row-wise Operations
6. Renaming Columns
7. Retrieving Unique Rows
8. Using Window Functions
9. Summary and Best Practices

In [1]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.appName("Row Manipulation").getOrCreate()

# Create a sample DataFrame
data = [(1, "Alice", 30), (2, "Bob", 25), (3, "Alice", 28), (4, "David", 35), (5, "Eve", 29), (6, "Bob", 22)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()


+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 30|
|  2|  Bob| 25|
|  3|Alice| 28|
|  4|David| 35|
|  5|  Eve| 29|
|  6|  Bob| 22|
+---+-----+---+



## 1. Filtering Rows
We can filter rows with the `filter()` method. `where()` is an alias of `filter()`, so the two are interchangeable.

In [2]:
# Filter rows where age is greater than 25
filtered_df = df.filter(col("age") > 25)
filtered_df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 30|
|  3|Alice| 28|
|  4|David| 35|
|  5|  Eve| 29|
+---+-----+---+



## 2. Sorting Rows
Sort rows with `orderBy()` (or its alias `sort()`). The order is ascending by default; call `.desc()` on a column to reverse it.

In [3]:
# Sort DataFrame by age in descending order
sorted_df = df.orderBy(col("age").desc())
sorted_df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  4|David| 35|
|  1|Alice| 30|
|  5|  Eve| 29|
|  3|Alice| 28|
|  2|  Bob| 25|
|  6|  Bob| 22|
+---+-----+---+



## 3. Removing Duplicate Rows
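Use `dropDuplicates()` to remove duplicates based on specified columns. Spark keeps one row per distinct value of those columns, but which row survives is not guaranteed, so the `id` and `age` values in the output below may differ between runs or cluster configurations.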

In [4]:
# Remove duplicate rows based on the 'name' column
dedup_df = df.dropDuplicates(["name"])
dedup_df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 30|
|  2|  Bob| 25|
|  4|David| 35|
|  5|  Eve| 29|
+---+-----+---+



## 4. Random Sampling
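The `sample()` method selects a random subset of rows. `fraction` is the probability of including each row, not an exact proportion, so the number of rows returned varies; in the run below, a fraction of 0.5 returned 2 of the 6 rows.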

In [5]:
# Include each row with probability 0.5; the seed makes the draw repeatable for the same data and partitioning
sampled_df = df.sample(fraction=0.5, seed=42)
sampled_df.show()

+---+----+---+
| id|name|age|
+---+----+---+
|  5| Eve| 29|
|  6| Bob| 22|
+---+----+---+



## 5. Applying Custom Row-wise Operations
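For row-wise logic that built-in column expressions cannot express, you can drop down to `rdd.map()`. This path bypasses the Catalyst optimizer and passes every row through Python, so prefer column expressions when they are sufficient.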

In [6]:
# Apply a custom function using RDD map
rdd = df.rdd.map(lambda row: (row['name'], row['age'] + 5))
result_df = rdd.toDF(["name", "age_plus_5"])
result_df.show()

+-----+----------+
| name|age_plus_5|
+-----+----------+
|Alice|        35|
|  Bob|        30|
|Alice|        33|
|David|        40|
|  Eve|        34|
|  Bob|        27|
+-----+----------+



## 6. Renaming Columns
The `withColumnRenamed()` method renames a single column and returns a new DataFrame. If the DataFrame has no column with the old name, the call is a no-op.

In [7]:
# Rename the 'age' column to 'user_age'
renamed_df = df.withColumnRenamed("age", "user_age")
renamed_df.show()

+---+-----+--------+
| id| name|user_age|
+---+-----+--------+
|  1|Alice|      30|
|  2|  Bob|      25|
|  3|Alice|      28|
|  4|David|      35|
|  5|  Eve|      29|
|  6|  Bob|      22|
+---+-----+--------+



## 7. Retrieving Unique Rows
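Use `distinct()` to get unique rows based on all columns. Every row in the sample data has a different `id`, so all six rows are returned; the row order of the result is not guaranteed.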

In [8]:
# Get distinct rows
unique_df = df.distinct()
unique_df.show()

+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 30|
|  2|  Bob| 25|
|  3|Alice| 28|
|  4|David| 35|
|  5|  Eve| 29|
|  6|  Bob| 22|
+---+-----+---+



## 8. Using Window Functions
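Window functions compute a value for each row from a group of related rows (the window) without collapsing the group into one row. Here, `row_number()` numbers the rows within each `name` partition in order of increasing `age`.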

In [9]:
# Define a window specification and apply row_number()
window_spec = Window.partitionBy("name").orderBy("age")
windowed_df = df.withColumn("row_num", row_number().over(window_spec))
windowed_df.show()

+---+-----+---+-------+
| id| name|age|row_num|
+---+-----+---+-------+
|  3|Alice| 28|      1|
|  1|Alice| 30|      2|
|  6|  Bob| 22|      1|
|  2|  Bob| 25|      2|
|  4|David| 35|      1|
|  5|  Eve| 29|      1|
+---+-----+---+-------+



## 9. Summary and Best Practices
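- Use built-in DataFrame methods and column functions for common row operations; they are optimized by Spark.
- Convert to an RDD only when the logic cannot be expressed with column expressions, since `rdd.map()` bypasses the optimizer.
- Use window functions for per-group calculations such as ranking, where every input row must be kept.
- Remember that `dropDuplicates()` on a subset of columns and `sample()` do not guarantee which rows, or exactly how many rows, are returned.
- Monitor jobs in the Spark UI to find bottlenecks such as expensive shuffles from sorting, deduplication, and window operations.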