Problem Statement:

Amazon Web Services (AWS) is powered by fleets of servers. Senior management has requested data-driven solutions to optimize server usage.

Write a query that calculates the total time that the fleet of servers was running. The output should be in units of full days.

Assumptions:

Each server might start and stop several times.
The total time in which the server fleet is running can be calculated as the sum of each server's uptime.

In [0]:
from pyspark.sql.functions import col, to_timestamp

# Create a list of data
data = [
    (1, "start", "08/02/2022 10:00:00"),
    (1, "stop", "08/04/2022 10:00:00"),
    (1, "stop", "08/13/2022 19:00:00"),
    (1, "start", "08/13/2022 10:00:00"),
    (3, "stop", "08/19/2022 10:00:00"),
    (3, "start", "08/18/2022 10:00:00"),
    (5, "stop", "08/19/2022 10:00:00"),
    (4, "stop", "08/19/2022 14:00:00"),
    (4, "start", "08/16/2022 10:00:00"),
    (3, "stop", "08/14/2022 10:00:00"),
    (3, "start", "08/06/2022 10:00:00"),
    (2, "stop", "08/24/2022 10:00:00"),
    (2, "start", "08/17/2022 10:00:00"),
    (5, "start", "08/14/2022 21:00:00"),
]

# Define schema
columns = ["server_id", "session_status", "status_time"]

# Create DataFrame
df = spark.createDataFrame(data, schema=columns)

# Convert 'status_time' to timestamp
df = df.withColumn(
    "status_time", to_timestamp(col("status_time"), "MM/dd/yyyy HH:mm:ss")
)

# Sort the DataFrame by 'server_id' and 'status_time'
df = df.sort("server_id", "status_time")

# Display the DataFrame
df.display()

server_id,session_status,status_time
1,start,2022-08-02T10:00:00.000+0000
1,stop,2022-08-04T10:00:00.000+0000
1,start,2022-08-13T10:00:00.000+0000
1,stop,2022-08-13T19:00:00.000+0000
2,start,2022-08-17T10:00:00.000+0000
2,stop,2022-08-24T10:00:00.000+0000
3,start,2022-08-06T10:00:00.000+0000
3,stop,2022-08-14T10:00:00.000+0000
3,start,2022-08-18T10:00:00.000+0000
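3,stop,2022-08-19T10:00:00.000+0000
4,start,2022-08-16T10:00:00.000+0000
4,stop,2022-08-19T14:00:00.000+0000
5,start,2022-08-14T21:00:00.000+0000
5,stop,2022-08-19T10:00:00.000+0000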


In [0]:
df.createOrReplaceTempView("server_utilization")

In [0]:
%sql
WITH running_time AS (
  SELECT
    server_id,
    session_status,
    status_time AS start_time,
    LEAD(status_time) OVER (
      PARTITION BY server_id
      ORDER BY
        status_time
    ) AS stop_time
  FROM
    server_utilization
)
SELECT
  CAST(
    SUM(
      (
        UNIX_TIMESTAMP(stop_time) - UNIX_TIMESTAMP(start_time)
      ) / 86400
    ) AS INT
  ) AS total_uptime_days
FROM
  running_time
WHERE
  session_status = 'start'
  AND stop_time IS NOT NULL;

total_uptime_days
26
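
As a sanity check on the SQL result, the same pairing logic can be reproduced without Spark. The sketch below is plain Python over the sample events; the function name total_uptime_days and the FMT constant are illustrative choices, not part of the original notebook. It pairs each 'start' with the next event of the same server, which is what LEAD does inside each server_id partition.

```python
from datetime import datetime
from itertools import groupby

# Same sample events as the DataFrame above: (server_id, session_status, status_time)
events = [
    (1, "start", "08/02/2022 10:00:00"),
    (1, "stop", "08/04/2022 10:00:00"),
    (1, "stop", "08/13/2022 19:00:00"),
    (1, "start", "08/13/2022 10:00:00"),
    (3, "stop", "08/19/2022 10:00:00"),
    (3, "start", "08/18/2022 10:00:00"),
    (5, "stop", "08/19/2022 10:00:00"),
    (4, "stop", "08/19/2022 14:00:00"),
    (4, "start", "08/16/2022 10:00:00"),
    (3, "stop", "08/14/2022 10:00:00"),
    (3, "start", "08/06/2022 10:00:00"),
    (2, "stop", "08/24/2022 10:00:00"),
    (2, "start", "08/17/2022 10:00:00"),
    (5, "start", "08/14/2022 21:00:00"),
]

FMT = "%m/%d/%Y %H:%M:%S"  # matches MM/dd/yyyy HH:mm:ss in the Spark code


def total_uptime_days(events):
    """Pair each 'start' with the next event of the same server and sum the gaps."""
    # Sort by (server_id, timestamp), i.e. PARTITION BY server_id ORDER BY status_time
    parsed = sorted(
        (sid, datetime.strptime(ts, FMT), status) for sid, status, ts in events
    )
    total_seconds = 0.0
    for _, group in groupby(parsed, key=lambda e: e[0]):
        rows = list(group)
        # zip(rows, rows[1:]) mimics LEAD: the last row of a server has no successor
        for (_, t0, status), (_, t1, _) in zip(rows, rows[1:]):
            if status == "start":
                total_seconds += (t1 - t0).total_seconds()
    # Convert seconds to days, then truncate to full days
    return int(total_seconds / 86400)


print(total_uptime_days(events))  # 26
```

The sessions add up to 26 days and 2 hours, so truncating to full days gives 26, in line with the query output.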


In [0]:
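# Import only what is needed; aliasing sum avoids shadowing Python's built-in sum
from pyspark.sql.functions import col, lead, sum as spark_sum, to_timestamp, unix_timestamp
from pyspark.sql.window import Window

# No-op if status_time is already a timestamp (as it is after the first cell); kept as a safeguard
df = df.withColumn("status_time", to_timestamp("status_time"))

# Window over each server's events in chronological order
window_spec = Window.partitionBy("server_id").orderBy("status_time")

# Pair every event with the next event of the same server, then keep only
# 'start' rows that have a following event (used as stop_time)
running_time = df.withColumn("stop_time", lead("status_time").over(window_spec)).filter(
    (col("session_status") == "start") & (col("stop_time").isNotNull())
)

# Sum the session lengths in fractional days, then truncate to full days
total_uptime_days = running_time.select(
    spark_sum((unix_timestamp("stop_time") - unix_timestamp("status_time")) / 86400)
    .cast("int")
    .alias("total_uptime_days")
)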

# Show the result
total_uptime_days.display()

total_uptime_days
26
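
The order of summing and truncating matters for the "full days" requirement. Both solutions sum the fractional days first and cast to an integer once at the end. The plain-Python sketch below illustrates the difference; the session lengths in hours are derived by hand from the sample data, and the variable names are illustrative.

```python
# Session lengths in hours, derived from the sample data:
# server 1: 48 and 9, server 2: 168, server 3: 192 and 24, server 4: 76, server 5: 109
session_hours = [48, 9, 168, 192, 24, 76, 109]

# What the solutions do: add up all sessions, then truncate once
sum_then_truncate = int(sum(session_hours) / 24)

# Alternative reading: truncate every session to full days before adding
truncate_then_sum = sum(hours // 24 for hours in session_hours)

print(sum_then_truncate, truncate_then_sum)  # 26 25
```

Truncating per session discards the partial days of servers 1, 4, and 5 (26 hours in total), which is why it comes out one day lower.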


Explanation:

Spark Session: The code uses the spark session object that the notebook environment already provides; it does not create one.
Sample DataFrame: A sample DataFrame (df) is created for demonstration purposes. Replace this with your actual DataFrame loading method.
Convert to Timestamp: Converts the status_time column to a timestamp type so that time differences can be computed.
Window Specification: Defines a window that partitions by server_id and orders by status_time.
Create running_time DataFrame:
  Uses lead to fetch the next status_time of the same server as stop_time.
  Filters to keep only rows where session_status is 'start' and stop_time is not null.
Calculate Total Uptime Days:
  Computes the difference between stop_time and status_time in seconds, divides by 86400 to convert it to days, and sums the results.
  Casts the sum to an integer, which truncates it to full days (26.08 becomes 26 for the sample data).
Show the Result: Displays the total uptime days.
Adjust the code to fit your actual DataFrame structure and data types.
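
One thing worth checking when adapting this to real data: the lead-based pairing treats whatever event follows a 'start' as its stop, so it assumes each server's events strictly alternate between start and stop. The following is a minimal plain-Python sketch of such a check, assuming the same (server_id, session_status, status_time) tuples as the sample data; the function name servers_with_unpaired_events and the second data set are hypothetical.

```python
from collections import defaultdict
from datetime import datetime

FMT = "%m/%d/%Y %H:%M:%S"


def servers_with_unpaired_events(events):
    """Return ids of servers whose events are not start, stop, start, stop, ..."""
    by_server = defaultdict(list)
    for server_id, status, ts in events:
        by_server[server_id].append((datetime.strptime(ts, FMT), status))

    bad = []
    for server_id, rows in sorted(by_server.items()):
        # Statuses in chronological order for this server
        statuses = [status for _, status in sorted(rows)]
        # A clean history is an even-length run of start/stop pairs
        expected = ["start", "stop"] * (len(statuses) // 2)
        if statuses != expected:
            bad.append(server_id)
    return bad


# Hypothetical input: server 2 logs two starts in a row
sample = [
    (1, "start", "08/02/2022 10:00:00"),
    (1, "stop", "08/04/2022 10:00:00"),
    (2, "start", "08/05/2022 10:00:00"),
    (2, "start", "08/06/2022 10:00:00"),
    (2, "stop", "08/07/2022 10:00:00"),
]
print(servers_with_unpaired_events(sample))  # [2]
```

With input like server 2 above, the lead-based query would count the gap between the two consecutive starts as uptime, so flagged servers may need cleaning before the total is trusted.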