# EX5-BATCH: More advanced RDD API programming

Your assignment: complete the `TODO`s and include the **output of each cell**.

### Download Bike Trip Data (Feb 2025)

In [16]:
!wget -np https://s3.amazonaws.com/tripdata/202502-citibike-tripdata.zip -P data/
![ -e "data/202502-citibike-tripdata_1.csv" ] || (cd data/ && unzip 202502-citibike-tripdata.zip)

Connecting to s3.amazonaws.com (52.217.123.56:443)
wget: can't open 'data/202502-citibike-tripdata.zip': File exists


### The data is split across three files; let us look at one (header plus the first two rows)

In [18]:
!head -3 data/202502-citibike-tripdata_1.csv

ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
C1F868EC9F7E49A5,electric_bike,2025-02-06 16:54:02.517,2025-02-06 17:00:48.166,Perry St & Bleecker St,5922.07,Watts St & Greenwich St,5578.02,40.73535398,-74.00483091,40.72405549,-74.00965965,member
668DDE0CFA929D5A,electric_bike,2025-02-14 10:09:49.035,2025-02-14 10:21:57.856,Dock 72 Way & Market St,4804.02,Spruce St & Nassau St,5137.10,40.69985,-73.97141,40.71146364,-74.00552427,member


### **Dataset Description**
The dataset contains **bike trip records** with the following columns:

| Column Name            | Description |
|------------------------|-------------|
| `ride_id`             | Unique trip identifier |
| `rideable_type`       | Type of bike used (e.g., docked, electric) |
| `started_at`          | Start timestamp of the trip |
| `ended_at`            | End timestamp of the trip |
| `start_station_name`  | Name of the start station |
| `start_station_id`    | ID of the start station |
| `end_station_name`    | Name of the end station |
| `end_station_id`      | ID of the end station |
| `start_lat`          | Latitude of the start location |
| `start_lng`          | Longitude of the start location |
| `end_lat`            | Latitude of the end location |
| `end_lng`            | Longitude of the end location |
| `member_casual`       | User type (`member` for subscribers, `casual` for non-subscribers) |

### Step 1: Load and Preprocess the Data
1. Start a **PySpark session (or SparkContext)**.
2. Load the dataset as an **RDD**.
3. **Remove the header** and filter out malformed rows.
4. `#TODO` Do the same for each of the three files and combine the resulting RDDs with the [Spark `union` transformation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.union.html).

In [38]:
from pyspark import SparkContext

try:
    sc.stop()
except NameError:
    print("SparkContext not defined")

# Initialize Spark Context
sc = SparkContext(appName="EX5-BIGDATA", master="local[*]") # local execution
# sc = SparkContext(appName="EX5-BIGDATA", master="spark://spark:7077") # cluster execution

# Load data
file_path = "data/202502-citibike-tripdata_1.csv"
raw_rdd = sc.textFile(file_path)

# Remove header
header = raw_rdd.first()
data_rdd = raw_rdd.filter(lambda row: row != header)

# Split CSV rows into lists
rdd = data_rdd.map(lambda row: row.split(","))

# Filter out malformed rows (should have 13 columns)
valid_rdd = rdd.filter(lambda cols: len(cols) == 13)

valid_rdd.take(2)

[['C1F868EC9F7E49A5',
  'electric_bike',
  '2025-02-06 16:54:02.517',
  '2025-02-06 17:00:48.166',
  'Perry St & Bleecker St',
  '5922.07',
  'Watts St & Greenwich St',
  '5578.02',
  '40.73535398',
  '-74.00483091',
  '40.72405549',
  '-74.00965965',
  'member'],
 ['668DDE0CFA929D5A',
  'electric_bike',
  '2025-02-14 10:09:49.035',
  '2025-02-14 10:21:57.856',
  'Dock 72 Way & Market St',
  '4804.02',
  'Spruce St & Nassau St',
  '5137.10',
  '40.69985',
  '-73.97141',
  '40.71146364',
  '-74.00552427',
  'member']]
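
The loading steps in the cell above can be wrapped in a function and applied to every file. The following is a minimal sketch, not a reference solution: `parse_line`, `is_valid`, `load_trip_file`, and `load_all_trips` are hypothetical helper names, `sc` is assumed to be the `SparkContext` created above, and the list of file paths is left to the caller. It uses `SparkContext.union`, which combines a list of RDDs in one call; chaining `RDD.union` pairwise would work as well.

```python
EXPECTED_COLUMNS = 13  # column count taken from the CSV header shown earlier


def parse_line(line):
    """Split one CSV line into fields (naive split, as in the cell above)."""
    return line.split(",")


def is_valid(cols):
    """Keep only rows that have the expected number of columns."""
    return len(cols) == EXPECTED_COLUMNS


def load_trip_file(sc, path):
    """Load one CSV file, drop its header line, and keep well-formed rows."""
    raw = sc.textFile(path)
    header = raw.first()
    return raw.filter(lambda row: row != header).map(parse_line).filter(is_valid)


def load_all_trips(sc, paths):
    """Load every file in `paths` and combine the results into one RDD."""
    return sc.union([load_trip_file(sc, p) for p in paths])
```

Calling `load_all_trips(sc, paths)` with the three CSV paths would return a single RDD. The header is removed per file because each file is assumed to start with its own header line.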

### Step 2: RDD Partitioning
1. Check the **initial number of partitions**.
2. Repartition the data for better performance (change the number at will).
3. See what happens in the Spark UI.

In [47]:
# check initial partitions
initial_partitions = valid_rdd.getNumPartitions()
print(f"Initial Partitions: {initial_partitions}")

# change the number of partitions (this will trigger a full shuffle, to reorganize data)
partitioned_rdd = valid_rdd.repartition(10)

Initial Partitions: 6


In [67]:
print(partitioned_rdd.toDebugString().decode("utf-8"))

(10) MapPartitionsRDD[30] at coalesce at NativeMethodAccessorImpl.java:0 []
 |   CoalescedRDD[29] at coalesce at NativeMethodAccessorImpl.java:0 []
 |   ShuffledRDD[28] at coalesce at NativeMethodAccessorImpl.java:0 []
 +-(6) MapPartitionsRDD[27] at coalesce at NativeMethodAccessorImpl.java:0 []
    |  PythonRDD[26] at RDD at PythonRDD.scala:53 []
    |  data/202502-citibike-tripdata_1.csv MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0 []
    |  data/202502-citibike-tripdata_1.csv HadoopRDD[0] at textFile at NativeMethodAccessorImpl.java:0 []
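
The lineage above mentions `coalesce` rather than `repartition` because `repartition(n)` is implemented as `coalesce(n, shuffle=True)`. To see how evenly the rows are spread after the shuffle, the partition sizes can be inspected. A small sketch, where `partition_sizes` is a hypothetical helper name:

```python
def partition_sizes(rdd):
    """Return the number of elements in each partition of `rdd`.

    glom() turns every partition into a single list, so mapping len over
    the result yields one count per partition.
    """
    return rdd.glom().map(len).collect()
```

For example, `partition_sizes(partitioned_rdd)` would return a list of 10 counts. Note that `glom()` materializes each partition as a list in memory, so this is meant for inspection rather than for production jobs.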


### Step 3: Get the top-3 most popular starting stations
1. Compute this information and collect it to the driver (tip: the [PySpark RDD sortBy](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.sortBy.html) function works, but the [Reduce Action](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduce.html) can be more efficient; do not confuse it with the [ReduceByKey Transformation](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.reduceByKey.html)).
2. Broadcast this information.
3. Use the broadcast to append a new value to each RDD item: `starting_station_top3`, with values `yes` or `no`.
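
One possible way to approach these three steps is sketched below. It is an illustration under stated assumptions, not the expected solution: `merge_top3`, `top3_start_stations`, and `tag_top3` are hypothetical names, the station name is assumed to be column 4 (`start_station_name`), and rows with an empty station name are not treated specially. The idea is to count trips per station with `reduceByKey`, then use the `reduce` action to merge small candidate lists, which avoids sorting the full set of stations.

```python
import heapq


def merge_top3(a, b):
    """Merge two lists of (station, count) pairs, keeping the 3 largest counts."""
    return heapq.nlargest(3, a + b, key=lambda kv: kv[1])


def top3_start_stations(rdd):
    """Return up to three (station, count) pairs, most popular first."""
    counts = rdd.map(lambda cols: (cols[4], 1)).reduceByKey(lambda x, y: x + y)
    # Wrap each pair in a one-element list so reduce can merge lists of pairs.
    return counts.map(lambda kv: [kv]).reduce(merge_top3)


def tag_top3(sc, rdd, top3):
    """Append 'yes' or 'no' to each row, using a broadcast set of station names."""
    names = sc.broadcast({name for name, _ in top3})
    return rdd.map(lambda cols: cols + ["yes" if cols[4] in names.value else "no"])
```

A call such as `tag_top3(sc, valid_rdd, top3_start_stations(valid_rdd))` would yield rows with 14 fields. Broadcasting a set rather than a list keeps the per-row membership test cheap.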

In [27]:
# TODO

### Step 4: Use Accumulators for Data Statistics
1. Generate:
   - Total trips
   - Trips with missing data
   - Trips by casual riders vs. members

In [26]:
# Accumulators for statistics
total_trips = sc.accumulator(0)
invalid_trips = sc.accumulator(0)
casual_trips = sc.accumulator(0)
member_trips = sc.accumulator(0)

# TODO ...
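
A sketch of how accumulators such as the four above might be updated is shown next. The helper name `make_trip_counter` is hypothetical, "missing data" is assumed to mean a wrong column count or an empty field, and the rider type is assumed to be the last of 13 columns. The counting function is meant to be passed to the `foreach` action, because accumulator updates made inside transformations can be applied more than once if a task is re-executed.

```python
def make_trip_counter(total, invalid, casual, member, expected_columns=13):
    """Build a function that updates the given accumulators for one row."""

    def count_trip(cols):
        total.add(1)
        # Assumption: a row is "invalid" if it is malformed or has an empty field.
        if len(cols) != expected_columns or any(field == "" for field in cols):
            invalid.add(1)
        # Rider type is only read when the row has the expected layout.
        rider = cols[-1] if len(cols) == expected_columns else None
        if rider == "casual":
            casual.add(1)
        elif rider == "member":
            member.add(1)

    return count_trip
```

For instance, `rdd.foreach(make_trip_counter(total_trips, invalid_trips, casual_trips, member_trips))` followed by reading `total_trips.value` on the driver. Running it on `rdd` rather than `valid_rdd` lets malformed rows show up in the invalid count.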

### Step 5: Other Insights
1. Average trip duration for members vs. casual riders.
2. Peak riding hours, i.e., the hour of the day in which the most people are riding bikes.

Tip: use `datetime` to parse date strings and calculate durations, among other date manipulations. An example is shown below:

```
from datetime import datetime

start_str = '2025-02-06 16:54:02.517'
end_str = '2025-02-06 17:00:48.166'
# %f parses the fractional seconds (milliseconds in this dataset)
start_time = datetime.strptime(start_str, "%Y-%m-%d %H:%M:%S.%f")
end_time = datetime.strptime(end_str, "%Y-%m-%d %H:%M:%S.%f")
duration = (end_time - start_time).total_seconds() / 60  # Convert to minutes
```
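
Building on the tip above, the two insights could be computed along the following lines. This is a sketch under assumptions: all function names are hypothetical, timestamps are assumed to sit in columns 2 and 3 and the rider type in column 12, a timestamp without fractional seconds is accepted as a fallback, and "peak hour" is approximated by the hour in which the most trips start.

```python
from datetime import datetime
from operator import add

# Assumption: timestamps usually carry fractional seconds; the second format
# is a fallback for values without them.
TS_FORMATS = ("%Y-%m-%d %H:%M:%S.%f", "%Y-%m-%d %H:%M:%S")


def parse_ts(text):
    """Parse a timestamp string, trying each known format in turn."""
    for fmt in TS_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {text!r}")


def duration_minutes(cols):
    """Trip duration in minutes, from started_at (col 2) and ended_at (col 3)."""
    return (parse_ts(cols[3]) - parse_ts(cols[2])).total_seconds() / 60


def start_hour(cols):
    """Hour of the day (0-23) in which the trip started."""
    return parse_ts(cols[2]).hour


def average_duration_by_rider(rdd):
    """Return {rider_type: average duration in minutes}."""
    sums = rdd.map(lambda c: (c[12], (duration_minutes(c), 1))).reduceByKey(
        lambda a, b: (a[0] + b[0], a[1] + b[1])
    )
    return sums.mapValues(lambda s: s[0] / s[1]).collectAsMap()


def peak_start_hour(rdd):
    """Return (hour, trip_count) for the hour with the most trip starts."""
    counts = rdd.map(lambda c: (start_hour(c), 1)).reduceByKey(add)
    return counts.reduce(lambda a, b: a if a[1] >= b[1] else b)
```

Summing `(duration, 1)` pairs and dividing at the end avoids averaging averages, which would give wrong results when partitions hold different numbers of trips.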

In [28]:
# TODO