# PySpark Tutorial: Data Processing and Analysis

## Overview
This notebook covers fundamental PySpark concepts and operations for distributed data processing. PySpark is the Python API for Apache Spark, a powerful distributed computing framework designed for processing large-scale datasets.

### Key Topics Covered:
1. **SparkSession Initialization** - Entry point for Spark functionality
2. **Data Loading** - Reading Parquet files
3. **DataFrame Inspection** - Schema, columns, and statistics
4. **Data Selection** - Choosing specific columns
5. **Data Sorting** - Ordering data by criteria
6. **Data Filtering** - Subsetting data based on conditions
7. **Data Cleaning** - Handling missing values
8. **Feature Engineering** - Creating new columns from existing data
9. **Column Renaming** - Restructuring DataFrame columns

---

In [None]:
# Install the PySpark package (the %pip magic installs into the notebook kernel's environment)
%pip install pyspark



In [None]:
# Initialize SparkSession - the entry point for Spark functionality
from pyspark.sql import SparkSession

# Create or get an existing SparkSession with the name 'SparkApp'
spark = SparkSession.builder.appName("SparkApp").getOrCreate()
spark

In [None]:
# Load data from a Parquet file into a Spark DataFrame
# Parquet is a columnar storage format optimized for analytical queries
spark_df = spark.read.parquet("/content/yellow_tripdata_2025-01.parquet")

In [None]:
# Count the total number of rows in the DataFrame
# Action: This triggers Spark to compute and return the result
spark_df.count()

3475226

In [None]:
# Display the first 20 rows of the DataFrame
# This is useful for a quick visual inspection of the data
spark_df.show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       1| 2025-01-01 00:18:38|  2025-01-01 00:26:59|              1|          1.6|         1|                 N|         229|    

## 1. SparkSession Initialization

**What is SparkSession?**
- The entry point for all Spark functionality in PySpark
- Represents the connection to a Spark cluster
- Enables you to read data, run SQL queries, and perform distributed computations

**Key Properties:**
- `appName`: Name of your Spark application (useful for tracking in cluster UIs)
- `getOrCreate()` returns the active session if one exists; otherwise it creates a new one
- Can be reused throughout your script

**Why it's important:**
All data processing operations in Spark depend on an active SparkSession instance.

## 2. Reading Data with Parquet Files

**What is Parquet?**
- An open-source columnar storage format maintained as an Apache project (Apache Parquet)
- Optimized for analytical queries and data warehousing
- Stores data in a compressed format, reducing storage space
- Faster than row-based formats like CSV for analytical operations

**Why use Parquet?**
- Compresses well, because values within a column tend to be similar (the actual savings depend on the data and the codec)
- Efficient column-oriented queries (only reads needed columns)
- Preserves data types and schema information
- Ideal for big data analytics

**The `spark.read.parquet()` method:**
- Reads Parquet files into a Spark DataFrame
- Reads the schema stored in the file's metadata, so no inference pass over the data is needed
- Returns a distributed DataFrame object

## 3. DataFrame Row Counting

**The `count()` Method:**
- Returns the total number of rows in the DataFrame
- This is an **Action** in Spark - it triggers actual computation
- Useful for understanding dataset size and volume

**Actions vs Transformations:**
- **Actions**: Execute computations and return results (count, show, collect, write)
- **Transformations**: Create new DataFrames from existing ones (select, filter, sort) - lazy evaluation

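**Performance Note:**
- For very large datasets, counting can take significant time
- In general Spark has to visit every partition to get an exact count; for Parquet it can often rely on row counts stored in file metadata instead of decoding column data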

## 4. Viewing DataFrame Data

**The `show()` Method:**
- Displays the first 20 rows of the DataFrame (by default)
- Prints the rows as a text table and returns `None`
- Useful for quick data inspection and verification
- Computes roughly as many rows as it needs to print, and truncates strings longer than 20 characters by default

**Parameters:**
- `n`: Number of rows to display (default: 20)
- `truncate`: `True` (default) cuts strings longer than 20 characters; `False` shows full values; an integer sets a custom maximum width
- `vertical`: Display rows vertically instead of horizontally

## 5. Inspecting DataFrame Columns

**The `columns` Property:**
- Returns a Python list of all column names in the DataFrame
- Useful for understanding what fields are available
- Helps with programmatic column access and validation
- Returns column names in the order they appear in the schema

## 6. Understanding DataFrame Schema

**The `printSchema()` Method:**
- Displays the schema in a tree-like format
- Shows column names, data types, and nullable properties
- Essential for understanding data structure and types
- Helps identify potential data type mismatches

**What's in the Schema:**
- Column name
- Data type (String, Integer, Double, Boolean, etc.)
- Nullable flag (can column contain NULL values?)
- Nested structures for complex data types

## 7. Accessing Schema as an Object

**The `schema` Property:**
- Returns a `StructType` object representing the DataFrame schema
- Allows programmatic access to schema details
- Useful when you need to manipulate or inspect schema in code
- Can be serialized and restored for schema management (for example with `schema.json()` and `StructType.fromJson()`)

**Schema Inspection Benefits:**
- Extract field information programmatically
- Validate data types before processing
- Generate schema-aware code dynamically

## 8. Statistical Summary of Data

**The `describe()` Method:**
- Computes descriptive statistics for numeric and string columns
- Returns count, mean, standard deviation, min, and max (mean and stddev are NULL for string columns)
- Returns a new DataFrame, so chain `.show()` to display the results

**Statistics Provided:**
- **count**: Non-null values in each column
- **mean**: Average value
- **stddev**: Standard deviation (data spread)
- **min**: Minimum value
- **max**: Maximum value

**Use Cases:**
- Quick data quality checks
- Identify outliers and data ranges
- Understand distribution of numerical data
- Detect potential data entry errors

In [None]:
# Get all column names as a list
spark_df.columns

['VendorID',
 'tpep_pickup_datetime',
 'tpep_dropoff_datetime',
 'passenger_count',
 'trip_distance',
 'RatecodeID',
 'store_and_fwd_flag',
 'PULocationID',
 'DOLocationID',
 'payment_type',
 'fare_amount',
 'extra',
 'mta_tax',
 'tip_amount',
 'tolls_amount',
 'improvement_surcharge',
 'total_amount',
 'congestion_surcharge',
 'Airport_fee',
 'cbd_congestion_fee']

In [None]:
# Print the schema of the DataFrame in a tree format
# Shows all column names, data types, and nullable properties
spark_df.printSchema()

root
 |-- VendorID: integer (nullable = true)
 |-- tpep_pickup_datetime: timestamp_ntz (nullable = true)
 |-- tpep_dropoff_datetime: timestamp_ntz (nullable = true)
 |-- passenger_count: long (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- RatecodeID: long (nullable = true)
 |-- store_and_fwd_flag: string (nullable = true)
 |-- PULocationID: integer (nullable = true)
 |-- DOLocationID: integer (nullable = true)
 |-- payment_type: long (nullable = true)
 |-- fare_amount: double (nullable = true)
 |-- extra: double (nullable = true)
 |-- mta_tax: double (nullable = true)
 |-- tip_amount: double (nullable = true)
 |-- tolls_amount: double (nullable = true)
 |-- improvement_surcharge: double (nullable = true)
 |-- total_amount: double (nullable = true)
 |-- congestion_surcharge: double (nullable = true)
 |-- Airport_fee: double (nullable = true)
 |-- cbd_congestion_fee: double (nullable = true)



In [None]:
# Get the schema as a StructType object for programmatic access
spark_df.schema

StructType([StructField('VendorID', IntegerType(), True), StructField('tpep_pickup_datetime', TimestampNTZType(), True), StructField('tpep_dropoff_datetime', TimestampNTZType(), True), StructField('passenger_count', LongType(), True), StructField('trip_distance', DoubleType(), True), StructField('RatecodeID', LongType(), True), StructField('store_and_fwd_flag', StringType(), True), StructField('PULocationID', IntegerType(), True), StructField('DOLocationID', IntegerType(), True), StructField('payment_type', LongType(), True), StructField('fare_amount', DoubleType(), True), StructField('extra', DoubleType(), True), StructField('mta_tax', DoubleType(), True), StructField('tip_amount', DoubleType(), True), StructField('tolls_amount', DoubleType(), True), StructField('improvement_surcharge', DoubleType(), True), StructField('total_amount', DoubleType(), True), StructField('congestion_surcharge', DoubleType(), True), StructField('Airport_fee', DoubleType(), True), StructField('cbd_congestio

In [None]:
# Display descriptive statistics (count, mean, stddev, min, max) for numeric and string columns
spark_df.describe().show()

+-------+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+---------------------+------------------+--------------------+-------------------+-------------------+
|summary|           VendorID|   passenger_count|     trip_distance|        RatecodeID|store_and_fwd_flag|     PULocationID|      DOLocationID|      payment_type|       fare_amount|             extra|            mta_tax|        tip_amount|      tolls_amount|improvement_surcharge|      total_amount|congestion_surcharge|        Airport_fee| cbd_congestion_fee|
+-------+-------------------+------------------+------------------+------------------+------------------+-----------------+------------------+------------------+------------------+------------------+-------------------+------------------+------------------+---------------------+-

---

## Data Transformation Operations

### 9. Column Selection (Projection)

**The `select()` Method:**
- Selects specific columns from a DataFrame
- Returns a new DataFrame with only chosen columns
- Reduces memory footprint by removing unnecessary data
- Can be used with column names or Column objects

**Syntax:**
- `df.select(['col1', 'col2', ...])` - using list of strings
- `df.select('col1', 'col2')` - using variable arguments
- `df.select(col('col1'), col('col2'))` - using Column objects

**Benefits:**
- Improves query performance (only processes needed columns)
- Simplifies data by removing irrelevant fields
- Foundation for feature engineering

In [None]:
# Select only specific columns: fare_amount and passenger_count
# This creates a new DataFrame with just these two columns
spark_df.select(['fare_amount','passenger_count']).show()

+-----------+---------------+
|fare_amount|passenger_count|
+-----------+---------------+
|       10.0|              1|
|        5.1|              1|
|        5.1|              1|
|        7.2|              3|
|        5.8|              3|
|       19.1|              2|
|        4.4|              0|
|       12.1|              0|
|       19.1|              0|
|       11.4|              1|
|       11.4|              1|
|        5.8|              1|
|       14.2|              3|
|        7.9|              1|
|       26.1|              1|
|       17.7|              3|
|       16.3|              1|
|       -7.2|              1|
|        7.2|              1|
|       15.6|              2|
+-----------+---------------+
only showing top 20 rows


In [None]:
# Sort the DataFrame by a single column in ascending order (default)
# The sort() method is a transformation that creates a new sorted DataFrame
spark_df.sort('fare_amount').show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       2| 2025-01-07 19:12:25|  2025-01-07 19:14:04|              1|          0.1|         5|                 N|         226|    

---

### 10. Sorting Data

**The `sort()` Method:**
- Arranges rows based on one or more column values
- Returns a new sorted DataFrame
- Default order is ascending (smallest to largest)
- Can sort by multiple columns with different orders

**Parameters:**
- Column name or list of column names
- `ascending`: Boolean or list of booleans for sort order
- Multiple sort criteria applied in order

**Use Cases:**
- Ranking data by values
- Finding top N or bottom N records
- Organizing data for presentation
- Preparing data for time-series analysis

In [None]:
# Sort by multiple columns in descending order
# Sorts first by fare_amount (descending), then by passenger_count (descending)
spark_df.sort(['fare_amount',"passenger_count"], ascending = [False,False]).show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       1| 2025-01-20 12:07:18|  2025-01-20 12:12:42|              1|          1.6|         1|                 N|         138|    

---

### 11. Filtering Data (Selection)

**The `filter()` Method:**
- Selects rows that satisfy a condition
- Returns a new DataFrame with only matching rows
- Critical for data cleaning and subsetting
- Conditions can use SQL syntax or PySpark expressions

**Filter Syntax Options:**
1. **SQL String**: `df.filter('column > 100')`
2. **PySpark Expression**: `df.filter(df['column'] > 100)`
3. **Column Object**: `df.filter(col('column') > 100)`

**Combining Conditions:**
- AND operator: `&` (not `and`)
- OR operator: `|` (not `or`)
- NOT operator: `~` (not `not`)
- Wrap each condition in parentheses when combining, because `&` and `|` bind more tightly than comparison operators such as `>` and `==`

**Performance Considerations:**
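- Filter early to reduce the amount of data that later steps must process
- Highly selective filters help most, since fewer rows flow through the rest of the plan
- Spark's optimizer can push filters down to the data source; with Parquet this can let it skip data that cannot match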
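
Why the parentheses matter comes down to Python operator precedence, which can be seen with plain integers standing in for column values (the variable names here are illustrative only):

```python
# Plain Python integers stand in for column values here
fee, passengers = 1, 5

# Intended: (fee > 0) AND (passengers > 2)
with_parens = (fee > 0) & (passengers > 2)

# Without parentheses Python evaluates `0 & passengers` first,
# leaving the chained comparison `fee > 0 > 2`
without_parens = fee > 0 & passengers > 2

print(with_parens, without_parens)
```

With Column objects the unparenthesized form typically fails with an error instead of returning a wrong answer, because the chained comparison tries to convert a Column to a Python boolean.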

In [None]:
# Filter for rows where Airport_fee is greater than 0
# Uses SQL string syntax for the filter condition
spark_df.filter('Airport_fee >0').show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       1| 2025-01-01 00:51:41|  2025-01-01 01:06:26|              1|          7.2|         1|                 N|         132|    

In [None]:
# Filter for rides where pickup time is after a specific datetime
# Uses PySpark expression syntax with column reference
spark_df.filter(spark_df['tpep_pickup_datetime']>'2025-01-01 00:11:59').show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       1| 2025-01-01 00:18:38|  2025-01-01 00:26:59|              1|          1.6|         1|                 N|         229|    

In [None]:
# Filter for rows that meet BOTH conditions using AND operator (&)
# Airport_fee must be greater than 0 AND pickup time must be after specified time
# Note: Must use & (bitwise AND) not 'and', and wrap conditions in parentheses
spark_df.filter((spark_df['Airport_fee']>0) & (spark_df['tpep_pickup_datetime']> '2025-01-01 00:11:59')).show()

+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|VendorID|tpep_pickup_datetime|tpep_dropoff_datetime|passenger_count|trip_distance|RatecodeID|store_and_fwd_flag|PULocationID|DOLocationID|payment_type|fare_amount|extra|mta_tax|tip_amount|tolls_amount|improvement_surcharge|total_amount|congestion_surcharge|Airport_fee|cbd_congestion_fee|
+--------+--------------------+---------------------+---------------+-------------+----------+------------------+------------+------------+------------+-----------+-----+-------+----------+------------+---------------------+------------+--------------------+-----------+------------------+
|       1| 2025-01-01 00:51:41|  2025-01-01 01:06:26|              1|          7.2|         1|                 N|         132|    

In [None]:
# Combine multiple operations: filter, select, and display
# Step 1: Filter for rides with more than 2 passengers
# Step 2: Select only VendorId, passenger_count, and total_amount columns
#         (column names are case-insensitive by default, so 'VendorId' resolves to VendorID)
# Step 3: Display the results
spark_df.filter(spark_df['passenger_count']>2)\
.select(['VendorId','passenger_count','total_amount'])\
.show()

+--------+---------------+------------+
|VendorId|passenger_count|total_amount|
+--------+---------------+------------+
|       2|              3|         9.7|
|       2|              3|         8.3|
|       2|              3|        19.2|
|       1|              3|        22.7|
|       1|              3|       15.45|
|       2|              3|       30.13|
|       2|              4|       50.76|
|       2|              4|       29.76|
|       2|              4|       16.13|
|       1|              4|       37.95|
|       2|              9|      111.32|
|       2|              4|        23.4|
|       1|              4|        12.2|
|       2|              4|       36.48|
|       1|              3|       12.29|
|       2|              4|       28.08|
|       1|              4|        16.4|
|       2|              3|       56.38|
|       2|              3|        34.8|
|       1|              4|       17.15|
+--------+---------------+------------+
only showing top 20 rows


In [None]:
# Challenge: Write a query combining sort, select, and filter
# Find non-airport rides with exactly 1 passenger, sort and show relevant columns
# Step 1: Filter for Airport_fee == 0 (non-airport) AND passenger_count == 1
# Step 2: Sort by Airport_fee (every remaining row has Airport_fee == 0, so this only demonstrates chaining)
# Step 3: Select only passenger_count and Airport_fee columns
spark_df.filter((spark_df['Airport_fee']==0) & (spark_df['passenger_count']==1))\
.sort('Airport_fee')\
.select('passenger_count','Airport_fee')\
.show()

+---------------+-----------+
|passenger_count|Airport_fee|
+---------------+-----------+
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
|              1|        0.0|
+---------------+-----------+
only showing top 20 rows


In [None]:
# Challenge: Select only trip distance and total_amount columns
spark_df.select(['trip_distance', 'total_amount']).show()

+-------------+------------+
|trip_distance|total_amount|
+-------------+------------+
|          1.6|        18.0|
|          0.5|       12.12|
|          0.6|        12.1|
|         0.52|         9.7|
|         0.66|         8.3|
|         2.63|        24.1|
|          0.4|       11.75|
|          1.6|        19.1|
|          2.8|        27.1|
|         1.71|        16.4|
|         2.29|        16.4|
|         0.56|       12.96|
|         1.99|        19.2|
|          1.1|        12.9|
|          3.2|        38.9|
|          2.5|        22.7|
|          1.9|       25.55|
|         0.71|       -8.54|
|         0.71|        12.2|
|          1.2|        20.6|
+-------------+------------+
only showing top 20 rows


In [None]:
# Challenge: Sort the resulting dataframe by trip distance in descending order
# Shows longest trips first
spark_df.select(['trip_distance', 'total_amount'])\
.sort("trip_distance", ascending=False).show()

+-------------+------------+
|trip_distance|total_amount|
+-------------+------------+
|    276423.57|         5.0|
|    276099.95|       13.88|
|    222167.49|       35.94|
|    206137.99|       29.64|
|    202771.63|       14.85|
|    189687.43|       17.45|
|    181139.99|       11.08|
|    168079.57|        4.49|
|    167452.94|        5.45|
|    164959.95|       18.05|
|    158925.09|       14.82|
|    156037.94|       33.49|
|    143712.27|       12.51|
|    135116.83|       36.52|
|    134033.15|       22.89|
|    124083.23|        15.0|
|    121799.97|       31.88|
|    121555.16|        8.46|
|    118927.12|        3.68|
|    118435.89|       19.59|
+-------------+------------+
only showing top 20 rows


In [None]:
# Combined challenge problem: Filter, Select, Sort
# Find all non-airport solo rides and show them sorted by distance (longest first)
spark_df.filter((spark_df['Airport_fee']==0) & (spark_df['passenger_count']==1))\
.select('trip_distance', 'total_amount')\
.sort('trip_distance', ascending=False)\
.show()

+-------------+------------+
|trip_distance|total_amount|
+-------------+------------+
|      44730.3|        45.0|
|      44684.1|       54.94|
|      33588.9|        30.0|
|      11187.2|       61.94|
|      2001.95|       19.35|
|      1847.61|       43.37|
|      1472.37|       20.58|
|        265.9|      139.14|
|       255.33|     2506.71|
|       206.45|      243.32|
|        199.3|        19.0|
|       188.88|      1311.7|
|        181.9|       39.94|
|       150.11|      501.75|
|        148.3|      549.91|
|       122.77|        64.7|
|       119.66|      794.82|
|       114.25|       396.0|
|       105.24|       361.2|
|       104.21|      249.53|
+-------------+------------+
only showing top 20 rows


In [None]:
# Import necessary functions for missing value detection
from pyspark.sql.functions import col, isnull

---

## Data Cleaning and Missing Values

### 12. Detecting and Handling Missing Values

**Missing Data in DataFrames:**
- Missing values are represented as `NULL` in Spark and appear as `None` once rows are collected into Python
- Can skew analysis and cause errors in computations
- Must be identified and handled appropriately

**The `isnull()` Function:**
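- Returns a boolean Column that is true where the value is NULL
- Often combined with `filter()` and `count()` to count missing values
- Can be used to create masks for data cleaning
- Does not match floating-point `NaN`, which Spark treats as a regular value (use `isnan()` for that)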

**Common Missing Value Strategies:**
1. **Removal**: Delete rows with NULL values (loses data)
2. **Imputation**: Fill with default, mean, median, or forward-filled values
3. **Domain-specific**: Use business rules to fill values
4. **Keep separate**: Mark and analyze NULL values separately

**The `fillna()` Method:**
- Replaces NULL values with specified values
- Can use a dictionary to fill different columns differently
- Returns a new DataFrame with filled values
- Useful for imputation strategies

In [None]:
# Count the number of NULL/missing values in the 'fare_amount' column
# isnull() returns True for NULL values, count() counts them
spark_df.filter(isnull(col('fare_amount'))).count()

0

In [None]:
# Count the number of NULL/missing values in the 'passenger_count' column
spark_df.filter(isnull(col('passenger_count'))).count()

540149

In [None]:
# Fill missing values in 'passenger_count' column with default value of 1
# This assumes if passenger count is missing, there was 1 passenger
# Returns a new DataFrame with filled values
df = spark_df.fillna({'passenger_count':1})

In [None]:
# Verify that missing values in 'passenger_count' have been filled
# After fillna(), the count should be 0
df.filter(isnull(col('passenger_count'))).count()

0

In [None]:
# Import functions needed for feature engineering
# unix_timestamp(): Convert datetime to seconds since epoch
# round(): Round numerical values to specified decimal places
# Note: this import shadows Python's built-in round() for the rest of the notebook
from pyspark.sql.functions import unix_timestamp, round

---

### 13. Feature Engineering - Creating New Columns

**What is Feature Engineering?**
- Process of creating new features (columns) from existing data
- Transforms raw data into meaningful features for analysis
- Critical step in data preprocessing and machine learning

**The `withColumn()` Method:**
- Adds a new column to a DataFrame or modifies existing ones
- Returns a new DataFrame with the added/modified column
- Accepts column name and expression
- Can chain multiple `withColumn()` calls

**Common PySpark Functions:**
- `unix_timestamp()`: Converts datetime to Unix timestamp (seconds since 1970)
- `round()`: Rounds numerical values to specified decimal places
- String functions: concat, substring, length, upper, lower
- Math functions: abs, sqrt, pow, etc.
- Date functions: year, month, dayofmonth, hour, minute

**Example: Trip Duration Calculation**
- Extract pickup and dropoff times
- Convert to Unix timestamps
- Calculate difference in seconds
- Convert to minutes and round
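
The same arithmetic can be worked by hand for a single row in plain Python. The timestamps are copied from the first row of the sample output; `Decimal` with `ROUND_HALF_UP` is used here to mimic the HALF_UP rounding mode that Spark's `round()` documents, which is an illustration choice rather than anything the notebook does.

```python
from datetime import datetime
from decimal import ROUND_HALF_UP, Decimal

# Timestamps copied from the first row of the sample output
pickup = datetime(2025, 1, 1, 0, 18, 38)
dropoff = datetime(2025, 1, 1, 0, 26, 59)

# Same steps as the Spark expression, applied to a single row
seconds = int((dropoff - pickup).total_seconds())  # difference in seconds
minutes = Decimal(seconds) / Decimal(60)           # convert to minutes

# Round to one decimal place, with ties rounded up as Spark's round() does
duration = minutes.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
print(seconds, minutes, duration)
```

The trip lasts 501 seconds, or 8.35 minutes, which rounds to 8.4 and matches the first `trip_duration` value the notebook prints.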

In [None]:
# Feature Engineering: Create a new column 'trip_duration' in minutes
# Step 1: Convert dropoff time to Unix timestamp (seconds)
# Step 2: Convert pickup time to Unix timestamp (seconds)
# Step 3: Calculate difference in seconds
# Step 4: Divide by 60 to convert to minutes
# Step 5: Round to 1 decimal place
# withColumn() creates a new DataFrame with the added column
df1 = df.withColumn(
    'trip_duration',
    round((unix_timestamp('tpep_dropoff_datetime') - unix_timestamp('tpep_pickup_datetime')) / 60, 1)
)
df1.select('trip_duration').show()

+-------------+
|trip_duration|
+-------------+
|          8.4|
|          2.6|
|          2.0|
|          5.6|
|          3.5|
|         20.0|
|          1.5|
|         12.4|
|         19.7|
|          9.6|
|          7.6|
|          3.4|
|         13.0|
|          6.7|
|         34.1|
|         18.3|
|         16.9|
|          5.6|
|          5.6|
|         17.0|
+-------------+
only showing top 20 rows


In [None]:
# Data Quality Check: Count negative trip durations
# Negative values indicate data issues (dropoff before pickup)
# This helps identify potential data quality problems
df1.filter(df1['trip_duration'] < 0).count()

117

### Data Quality Analysis - Negative Trip Duration Check

**What does this check tell us?**

The check above found 117 trips with a negative `trip_duration`.

If the count of negative `trip_duration` values is **zero**:
- ✓ All trip durations are positive or zero
- ✓ Every drop-off time is at or after its pick-up time
- ✓ This particular check finds no problem (other data quality issues may still exist)

If there are **negative values**:
- ✗ Implies data quality issues
- ✗ Drop-off time recorded before pick-up time (impossible for valid trips)
- ✗ Might indicate:
  - Incorrect timestamp recording
  - System clock issues during data capture
  - Data entry errors
  - Records that need investigation and possible removal

**How to Handle Negative Durations:**
1. Filter them out for analysis (treat as invalid records)
2. Investigate the source data for systematic issues
3. Apply business logic (e.g., if duration < some threshold, mark as suspicious)
4. Consider removing outliers and anomalies

In [None]:
# Rename columns for clarity and consistency
# Step 1: Select specific columns
# Step 2: Rename them using withColumnsRenamed()
#   - 'tpep_pickup_datetime' → 'pu_datetime' (shorter, clearer)
#   - 'tpep_dropoff_datetime' → 'do_datetime' (shorter, clearer)
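#   - 'fare_amount' → 'ride-amount' (alternative naming; the hyphen means the name needs
#     backticks inside SQL expression strings, so 'ride_amount' would be easier to work with)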
# This creates a new DataFrame with renamed columns
df2 = df1.select('tpep_pickup_datetime','tpep_dropoff_datetime','fare_amount')\
.withColumnsRenamed({'tpep_pickup_datetime':'pu_datetime','tpep_dropoff_datetime':'do_datetime', 'fare_amount':'ride-amount'})

df2.show()

+-------------------+-------------------+-----------+
|        pu_datetime|        do_datetime|ride-amount|
+-------------------+-------------------+-----------+
|2025-01-01 00:18:38|2025-01-01 00:26:59|       10.0|
|2025-01-01 00:32:40|2025-01-01 00:35:13|        5.1|
|2025-01-01 00:44:04|2025-01-01 00:46:01|        5.1|
|2025-01-01 00:14:27|2025-01-01 00:20:01|        7.2|
|2025-01-01 00:21:34|2025-01-01 00:25:06|        5.8|
|2025-01-01 00:48:24|2025-01-01 01:08:26|       19.1|
|2025-01-01 00:14:47|2025-01-01 00:16:15|        4.4|
|2025-01-01 00:39:27|2025-01-01 00:51:51|       12.1|
|2025-01-01 00:53:43|2025-01-01 01:13:23|       19.1|
|2025-01-01 00:00:02|2025-01-01 00:09:36|       11.4|
|2025-01-01 00:20:28|2025-01-01 00:28:04|       11.4|
|2025-01-01 00:33:58|2025-01-01 00:37:23|        5.8|
|2025-01-01 00:42:40|2025-01-01 00:55:38|       14.2|
|2025-01-01 00:30:07|2025-01-01 00:36:48|        7.9|
|2025-01-01 00:39:55|2025-01-01 01:13:59|       26.1|
|2025-01-01 00:16:54|2025-01

---

### 14. Renaming Columns

**The `withColumnsRenamed()` Method:**
- Renames one or more columns in a DataFrame (available from Spark 3.4; earlier versions offer `withColumnRenamed()` for one column at a time)
- Takes a dictionary mapping old names to new names
- Useful for:
  - Standardizing column naming conventions
  - Simplifying long or unclear column names
  - Preparing data for downstream analysis
  - Making data more readable and accessible
- Returns a new DataFrame with renamed columns

**Use Cases:**
- Shortening verbose column names
- Converting naming conventions (snake_case to camelCase, etc.)
- Making column names more domain-friendly
- Standardizing across multiple data sources

---

## Summary of Key PySpark Concepts

### Transformations (Lazy) vs Actions (Eager)

**Transformations** - Create new DataFrames, don't execute immediately:
- `select()` - Choose columns
- `filter()` - Subset rows
- `sort()` - Order rows
- `withColumn()` - Add/modify columns
- `fillna()` - Replace NULL values
- `withColumnsRenamed()` - Rename columns

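**Actions** - Execute computations and return results:
- `show()` - Display data
- `count()` - Count rows
- `collect()` - Return all rows to the driver (can exhaust driver memory on large data)
- `first()` - Get first row
- `write` - Save data, for example `df.write.parquet(path)`

`describe()` returns a new DataFrame of summary statistics, so it still needs an action such as `show()` to display them.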

### Performance Tips

1. **Filter early** - Reduce data size before other operations
2. **Select only needed columns** - Reduces memory usage
3. **Combine operations** - Spark optimizes operation chains
4. **Use appropriate data types** - Affects performance and storage
5. **Handle missing values** - Prevents errors in calculations
6. **Monitor partitioning** - Affects parallel processing efficiency

### Best Practices

1. Always inspect data first (schema, sample rows, statistics)
2. Check for and handle missing values
3. Use meaningful column names
4. Validate transformations with small samples before full data
5. Document your data processing logic
6. Keep transformations modular and readable
7. Test edge cases and data quality issues