# Advanced Row Manipulation in PySpark
This notebook walks through common row manipulation techniques in PySpark: filtering, sorting, deduplication, sampling, custom row-wise operations, and window functions.

## Topics Covered
1. Filtering Rows
2. Sorting Rows
3. Removing Duplicate Rows
4. Random Sampling
5. Applying Custom Row-wise Operations
6. Renaming Columns
7. Retrieving Unique Rows
8. Using Window Functions
9. Summary and Best Practices

In [None]:
# Import necessary libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, row_number
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.appName("Row Manipulation").getOrCreate()

# Create a sample DataFrame
data = [(1, "Alice", 30), (2, "Bob", 25), (3, "Alice", 28), (4, "David", 35), (5, "Eve", 29), (6, "Bob", 22)]
columns = ["id", "name", "age"]
df = spark.createDataFrame(data, columns)
df.show()

## 1. Filtering Rows
Filter rows with `filter()` or its alias `where()`. Both accept either a Column expression or a SQL expression string.

In [None]:
# Filter rows where age is greater than 25
filtered_df = df.filter(col("age") > 25)
filtered_df.show()

## 2. Sorting Rows
Sort rows with `orderBy()` (an alias of `sort()`). The default order is ascending.

In [None]:
# Sort DataFrame by age in descending order
sorted_df = df.orderBy(col("age").desc())
sorted_df.show()

## 3. Removing Duplicate Rows
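Use `dropDuplicates()` to keep one row per distinct combination of the specified columns. Which row survives within each group is not guaranteed, so the `id` and `age` values shown for a duplicated name may differ between runs.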

In [None]:
# Keep one row per distinct 'name' (the surviving row is arbitrary)
dedup_df = df.dropDuplicates(["name"])
dedup_df.show()

## 4. Random Sampling
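Use `sample()` to select a random subset of rows. The `fraction` argument is a target, not an exact proportion, so the number of rows returned is only approximately `fraction` times the row count.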

In [None]:
# Sample roughly 50% of the rows; the exact count is not guaranteed
sampled_df = df.sample(fraction=0.5, seed=42)
sampled_df.show()

## 5. Applying Custom Row-wise Operations
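For row-wise logic that built-in column functions cannot express, you can drop down to `rdd.map()`. It runs Python code on every row and bypasses Spark's query optimizer, so prefer column expressions where they suffice.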

In [None]:
# Apply a custom function using RDD map
rdd = df.rdd.map(lambda row: (row['name'], row['age'] + 5))
result_df = rdd.toDF(["name", "age_plus_5"])
result_df.show()

## 6. Renaming Columns
Rename a column with `withColumnRenamed()`. It returns a new DataFrame and is a no-op if the source column does not exist.

In [None]:
# Rename the 'age' column to 'user_age'
renamed_df = df.withColumnRenamed("age", "user_age")
renamed_df.show()

## 7. Retrieving Unique Rows
Use `distinct()` to get unique rows based on all columns. In the sample data every `id` is unique, so this returns all six rows.

In [None]:
# Get distinct rows
unique_df = df.distinct()
unique_df.show()

## 8. Using Window Functions
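Window functions compute a value for each row from a group of related rows (the window) without collapsing the group as `groupBy()` would. Here, `row_number()` numbers the rows within each `name` partition in order of `age`.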

In [None]:
# Define a window specification and apply row_number()
window_spec = Window.partitionBy("name").orderBy("age")
windowed_df = df.withColumn("row_num", row_number().over(window_spec))
windowed_df.show()

## 9. Summary and Best Practices
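- Prefer built-in DataFrame functions for common row operations.
- Fall back to the RDD API only when built-in functions cannot express the logic; it bypasses the query optimizer and is usually slower.
- Use window functions for per-group rankings and comparisons between neighboring rows.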
- Monitor performance using Spark UI to avoid bottlenecks.