### Handle time-series data in PySpark

Time Series Data: Data that is indexed or organized by time.

#### Common Operations:
- **Date Parsing:** Convert string data into date or timestamp format.
- **Resampling:** Aggregate data based on a specific time interval (hourly, daily, monthly).
- **Time-based Filtering:** Filter data by a date or time range.
- **Lagging/Leading:** Compare values with those from previous or subsequent time periods.
- **Rolling/Aggregating:** Calculate statistics over a moving window, such as moving averages.


In [0]:
from pyspark.sql import functions as F

In [0]:
# Sample time-series data
data = [
    ('2024-01-01 08:00:00', 10.5),
    ('2024-01-01 09:00:00', 12.3),
    ('2024-01-01 10:00:00', 15.6),
    ('2024-01-01 11:00:00', 11.4),
    ('2024-01-01 12:00:00', 14.7),
    ('2024-01-01 13:00:00', 13.5),
]

# Columns
columns = ['timestamp', 'value']

# Creating DataFrame
df = spark.createDataFrame(data, columns)
df.show()

+-------------------+-----+
|          timestamp|value|
+-------------------+-----+
|2024-01-01 08:00:00| 10.5|
|2024-01-01 09:00:00| 12.3|
|2024-01-01 10:00:00| 15.6|
|2024-01-01 11:00:00| 11.4|
|2024-01-01 12:00:00| 14.7|
|2024-01-01 13:00:00| 13.5|
+-------------------+-----+



In [0]:
# Convert the 'timestamp' column from string to timestamp type
df = df.withColumn("timestamp", F.col("timestamp").cast("timestamp"))
df.show()

+-------------------+-----+
|          timestamp|value|
+-------------------+-----+
|2024-01-01 08:00:00| 10.5|
|2024-01-01 09:00:00| 12.3|
|2024-01-01 10:00:00| 15.6|
|2024-01-01 11:00:00| 11.4|
|2024-01-01 12:00:00| 14.7|
|2024-01-01 13:00:00| 13.5|
+-------------------+-----+



#### Common Time-Series Operations in PySpark:

**1. Resampling (Aggregation by Time Intervals):**

Suppose you want to aggregate data on an hourly basis.

In [0]:
# Group rows into one-hour windows and calculate the average value per window
hourly_avg = df.groupBy(F.window("timestamp", "1 hour")).agg(F.avg("value").alias("avg_value"))

hourly_avg.show(truncate=False)

+------------------------------------------+---------+
|window                                    |avg_value|
+------------------------------------------+---------+
|{2024-01-01 08:00:00, 2024-01-01 09:00:00}|10.5     |
|{2024-01-01 09:00:00, 2024-01-01 10:00:00}|12.3     |
|{2024-01-01 10:00:00, 2024-01-01 11:00:00}|15.6     |
|{2024-01-01 11:00:00, 2024-01-01 12:00:00}|11.4     |
|{2024-01-01 12:00:00, 2024-01-01 13:00:00}|14.7     |
|{2024-01-01 13:00:00, 2024-01-01 14:00:00}|13.5     |
+------------------------------------------+---------+



**2. Filtering by Date Range:**

To keep only the rows between two timestamps (both bounds inclusive):

In [0]:
start_date = '2024-01-01 09:00:00'
end_date = '2024-01-01 12:00:00'

filtered_df = df.filter((F.col("timestamp") >= start_date) & (F.col("timestamp") <= end_date))
filtered_df.show()

+-------------------+-----+
|          timestamp|value|
+-------------------+-----+
|2024-01-01 09:00:00| 12.3|
|2024-01-01 10:00:00| 15.6|
|2024-01-01 11:00:00| 11.4|
|2024-01-01 12:00:00| 14.7|
+-------------------+-----+



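**3. Lag and Lead Operations:**

To calculate the difference between the current and previous values, first fetch the previous value with `F.lag`. The first row has no predecessor, so its `prev_value` is null.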

In [0]:
from pyspark.sql.window import Window

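# Order rows by time. With no partitionBy, Spark moves all rows into a single
# partition, which is fine for small data but does not scale to large datasets.
window_specs = Window.orderBy("timestamp")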

df_with_lag = df.withColumn("prev_value", F.lag("value", 1).over(window_specs))
df_with_lag.show()

+-------------------+-----+----------+
|          timestamp|value|prev_value|
+-------------------+-----+----------+
|2024-01-01 08:00:00| 10.5|      null|
|2024-01-01 09:00:00| 12.3|      10.5|
|2024-01-01 10:00:00| 15.6|      12.3|
|2024-01-01 11:00:00| 11.4|      15.6|
|2024-01-01 12:00:00| 14.7|      11.4|
|2024-01-01 13:00:00| 13.5|      14.7|
+-------------------+-----+----------+



In [0]:
df_with_lag = df_with_lag.withColumn("value_diff", F.round(F.col("value") - F.col("prev_value"), 2))
df_with_lag.show()

+-------------------+-----+----------+----------+
|          timestamp|value|prev_value|value_diff|
+-------------------+-----+----------+----------+
|2024-01-01 08:00:00| 10.5|      null|      null|
|2024-01-01 09:00:00| 12.3|      10.5|       1.8|
|2024-01-01 10:00:00| 15.6|      12.3|       3.3|
|2024-01-01 11:00:00| 11.4|      15.6|      -4.2|
|2024-01-01 12:00:00| 14.7|      11.4|       3.3|
|2024-01-01 13:00:00| 13.5|      14.7|      -1.2|
+-------------------+-----+----------+----------+



**4. Rolling Average:**

To compute a moving average over the current row and the two rows before it:

In [0]:
window_specs = Window.orderBy("timestamp").rowsBetween(-2, 0)

rolling_avg_df = df.withColumn("rolling_avg", F.avg("value").over(window_specs))
rolling_avg_df.show()

+-------------------+-----+------------------+
|          timestamp|value|       rolling_avg|
+-------------------+-----+------------------+
|2024-01-01 08:00:00| 10.5|              10.5|
|2024-01-01 09:00:00| 12.3|              11.4|
|2024-01-01 10:00:00| 15.6|12.799999999999999|
|2024-01-01 11:00:00| 11.4|              13.1|
|2024-01-01 12:00:00| 14.7|              13.9|
|2024-01-01 13:00:00| 13.5|13.200000000000001|
+-------------------+-----+------------------+



- `.rowsBetween(-2, 0)`: This is the key part of the rolling average. It defines the frame of rows, relative to the current row, that feeds the calculation.
- `-2`: The frame starts two rows before the current row.
- `0`: The frame ends at the current row.
- Therefore, for each row, the window includes the current row and the two rows immediately preceding it. Near the start of the data, where fewer than two preceding rows exist, the frame simply contains the rows that are available.

**Detailed Explanation of the rolling_avg Calculation:**

- Row 1 (08:00:00): Since there are no preceding rows, the rolling average is simply the value of the first row (10.5).
- Row 2 (09:00:00): The rolling average is the average of the first two rows: (10.5 + 12.3) / 2 = 11.4.
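- Row 3 (10:00:00): The rolling average is the average of the first three rows: (10.5 + 12.3 + 15.6) / 3 = 12.8. The output shows 12.799999999999999 because of floating-point arithmetic; wrapping the average in `F.round(..., 2)` tidies the display.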
- Row 4 (11:00:00): The rolling average is the average of rows 2, 3, and 4: (12.3 + 15.6 + 11.4) / 3 = 13.1.
- Row 5 (12:00:00): The rolling average is the average of rows 3, 4, and 5: (15.6 + 11.4 + 14.7) / 3 = 13.9.
- Row 6 (13:00:00): The rolling average is the average of rows 4, 5, and 6: (11.4 + 14.7 + 13.5) / 3 = 13.2.
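The arithmetic above can be double-checked without Spark. The following plain-Python sketch mimics a trailing three-row frame over the same six values; `trailing_average` is a hypothetical helper written for this check, not a PySpark function.

```python
# The six sample values, in timestamp order.
values = [10.5, 12.3, 15.6, 11.4, 14.7, 13.5]


def trailing_average(series, size=3):
    """Average each element with up to (size - 1) elements before it."""
    result = []
    for i in range(len(series)):
        # Mirrors rowsBetween(-(size - 1), 0): the frame shrinks near the start.
        frame = series[max(0, i - size + 1): i + 1]
        result.append(round(sum(frame) / len(frame), 2))
    return result


rolling = trailing_average(values)
print(rolling)
```

Rounding to two decimals hides the floating-point noise seen in the Spark output, so the printed list lines up with the row-by-row explanation.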