# GMU DAEN 690 Spark Pipeline

## Setup
Load the `.env` file, if one exists, along with the application's config file. **Do NOT** commit your `.env` file to version control, because it may contain secrets or other sensitive information. Use `.env.template` as the starting point for your personal `.env` file. As the name suggests, an environment file should be specific to the environment the application runs in.

`tomllib` reads the application's general configuration: settings that typically don't change between environments and that take the place of otherwise hard-coded values. More sophisticated libraries, such as `pydantic` or `dynaconf`, exist for merging TOML configuration, `.env` file variables, and the user's environment variables together.

Rerun this cell whenever either of these files changes.

In [3]:
import os
from pathlib import Path
from dotenv import load_dotenv
load_dotenv()

import tomllib

with Path("config.toml").open("rb") as f:
    config = tomllib.load(f)

config

{'data': {'source': 'data/ISS_HAL_SOPs.csv'}}
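
The setup notes above mention libraries that merge TOML settings with environment variables. As a rough standard-library sketch of that idea (not how `pydantic` or `dynaconf` actually work), the helper below overlays environment variables onto a loaded `config` dictionary. The `APP_` prefix, the double-underscore separator, and the name `overlay_env` are assumptions made for illustration; nothing in this project defines them.

```python
import copy
import os
from collections.abc import Mapping


def overlay_env(config: dict, environ: Mapping[str, str] = os.environ,
                prefix: str = "APP_", sep: str = "__") -> dict:
    """Return a copy of config with matching environment variables applied.

    A variable such as APP_DATA__SOURCE (hypothetical naming scheme)
    overrides config["data"]["source"]. Overridden values stay strings.
    """
    merged = copy.deepcopy(config)  # never mutate the caller's config
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue
        # "APP_DATA__SOURCE" -> ["data", "source"]
        keys = name[len(prefix):].lower().split(sep)
        node = merged
        for key in keys[:-1]:
            child = node.get(key)
            if not isinstance(child, dict):
                # Create (or replace) intermediate tables as needed.
                child = node[key] = {}
            node = child
        node[keys[-1]] = value
    return merged


# Example with an explicit mapping instead of the real environment.
base = {"data": {"source": "data/ISS_HAL_SOPs.csv"}}
fake_env = {"APP_DATA__SOURCE": "data/other.csv", "HOME": "/home/user"}
print(overlay_env(base, fake_env))
```

Passing the mapping explicitly keeps the helper easy to test; in the notebook it could be called as `overlay_env(config)` after `load_dotenv()` has populated `os.environ`.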

Import the dependencies as usual. This happens after the configuration step above in case we want imports to be config-dependent.

In [4]:
import httpx
import io
import pandas as pd
import pyspark as ps
from pyspark import SparkContext, SparkFiles
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

Create a Spark session, setting the app name and execution mode. Here, `local[*]` uses all available cores on the local machine.

In [5]:
spark = (
    SparkSession
        .builder
        .master("local[*]")
        .appName("ISS Procedures")
        .getOrCreate()
    )

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


23/03/31 11:12:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [19]:
# Way overthought this, but ultimately, not really compatible with a notebook project.
# match config['data']['source'].split(":")[0]:
#     case "http"|"https":
#         with httpx.Client() as client:
#             r = client.get(config['data']['source'])
#             csv = io.BytesIO(r.content)
#     case "file":
#         with Path(config['data']['source']).open("rb") as f:
#             csv = io.BytesIO(f.read())
#     case _:
#         csv = config['data']['source']

<_io.BytesIO at 0x11b1ed300>
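
The commented-out dispatch above keys on the text before the first `:`. A possibly more robust variant, sketched here with only the standard library, parses the scheme with `urllib.parse.urlsplit` and converts `file:` URIs to a local path before anything is opened. `classify_source` is a hypothetical helper, not part of the notebook; it only decides how a source should be read and performs no I/O. Windows drive letters (such as `C:\...`) are not handled.

```python
from urllib.parse import unquote, urlsplit


def classify_source(source: str) -> tuple[str, str]:
    """Return (kind, location), where kind is 'http', 'file', or 'path'."""
    parts = urlsplit(source)
    if parts.scheme in ("http", "https"):
        # Remote resource: keep the full URL for an HTTP client.
        return "http", source
    if parts.scheme == "file":
        # file:///tmp/a.csv -> /tmp/a.csv
        return "file", unquote(parts.path)
    # No recognized scheme: treat it as a plain path and pass it through.
    return "path", source


for src in ("https://example.com/sops.csv",
            "file:///tmp/ISS_HAL_SOPs.csv",
            "data/ISS_HAL_SOPs.csv"):
    print(classify_source(src))
```

The value in `config.toml` shown earlier, `data/ISS_HAL_SOPs.csv`, falls into the last branch, which matches how the next cell hands the path straight to Spark.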

In [9]:
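# Read the CSV named in config.toml: the first row is the header,
# and Spark infers the column types.
source_df = spark.read.format("csv") \
    .options(header='True', inferSchema='True') \
    .load(config['data']['source'])

# Normalize column names to lowercase.
source_df = source_df.toDF(*[c.lower() for c in source_df.columns])

# Preview the first rows as a pandas DataFrame.
source_df.toPandas().head(20)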

| | procedure type | procedure name | procedure end goal | procedure file number | step number | actor | trigger (what) | trigger (how) | trigger (where) | decision (what) | ... | decision (where) | action (what) | action (how) | action (where) | waiting (what) | waiting (how) | waiting (where) | verification (what) | verification (how) | verification (where) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Manual Manipulation of Items | Reconfigure HAL for EVA | Configure the habitable airlock for EVA by rem... | HAL_1_0.pdf | 1.0 | | | | | | ... | | Stow monitors against the wall | | | | | | | | |
| 1 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 2.0 | | | | | | ... | | Stow the keyboards against the wall | | | | | | | | |
| 2 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 3.0 | | | | | | ... | | Remove the seat cushion | | | | | | | | |
| 3 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 4.0 | | | | | | ... | | Fold the chair backs forward | | | | | | | | |
| 4 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 5.0 | | | | | | ... | | Detach crew hygiene kit | | from the aft transfer port hatches | | | | | | |
| 5 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 6.0 | | | | | | ... | | Stow the crew hygiene kits | | in Lockers SA-1 and PA-1 | | | | | | |
| 6 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 7.0 | | | | | | ... | | Remove hatch cargo nets | | from lockers SA-1 and PA-1 | | | | | | |
| 7 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 8.0 | | | | | | ... | | Secure hatch cargo nets | to 3 of the 4 D-rings | at the starboard and port hatch openings | | | | | | |
| 8 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 9.0 | | | | | | ... | | Remove IVA Common Tool Kit | | from PM-5 | | | | | | |
| 9 | Manual Manipulation of Items | Reconfigure HAL for EVA | | HAL_1_0.pdf | 10.0 | | | | | | ... | | Temp Stow IVA Common Tool Kit | | behind the Port Hatch Opening | | | | | | |


In [10]:
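# Unpivot the 16 attribute columns into (label, text) rows with stack(),
# keeping the procedure name, then drop rows whose text is null.
df = source_df \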
    .select(
        'procedure name',
        expr("""
            stack(
                16,
                'actor', actor,
                'trigger (what)', `trigger (what)`,
                'trigger (how)', `trigger (how)`,
                'trigger (where)', `trigger (where)`,
                'decision (what)', `decision (what)`,
                'decision (how)', `decision (how)`,
                'decision (where)', `decision (where)`,
                'action (what)', `action (what)`,
                'action (how)', `action (how)`,
                'action (where)', `action (where)`,
                'waiting (what)', `waiting (what)`,
                'waiting (how)', `waiting (how)`,
                'waiting (where)', `waiting (where)`,
                'verification (what)', `verification (what)`,
                'verification (how)', `verification (how)`,
                'verification (where)', `verification (where)`
            ) as (label, text)
        """)
    ) \
    .filter("text is not null")

df.toPandas().head(10)

23/03/31 11:14:34 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


| | procedure name | label | text |
|---|---|---|---|
| 0 | Reconfigure HAL for EVA | action (what) | Stow monitors against the wall |
| 1 | Reconfigure HAL for EVA | action (what) | Stow the keyboards against the wall |
| 2 | Reconfigure HAL for EVA | action (what) | Remove the seat cushion |
| 3 | Reconfigure HAL for EVA | action (what) | Fold the chair backs forward |
| 4 | Reconfigure HAL for EVA | action (what) | Detach crew hygiene kit |
| 5 | Reconfigure HAL for EVA | action (where) | from the aft transfer port hatches |
| 6 | Reconfigure HAL for EVA | action (what) | Stow the crew hygiene kits |
| 7 | Reconfigure HAL for EVA | action (where) | in Lockers SA-1 and PA-1 |
| 8 | Reconfigure HAL for EVA | action (what) | Remove hatch cargo nets |
| 9 | Reconfigure HAL for EVA | action (where) | from lockers SA-1 and PA-1 |
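
The `stack(16, ...)` expression in the cell above repeats every column name twice and hard-codes the pair count, so the two can drift apart when a column is added or removed. One way to avoid that, sketched here, is to generate the SQL string from a list of column names. `build_stack_expr` is a hypothetical helper (not a PySpark function); it only builds a string, which could then be passed to `expr(...)`. It assumes column names contain no single quotes or backticks.

```python
def build_stack_expr(columns: list[str],
                     label: str = "label", value: str = "text") -> str:
    """Build a Spark SQL stack() expression that unpivots the given columns."""
    if not columns:
        raise ValueError("at least one column is required")
    # Each column contributes a pair: its name as a string literal,
    # then a backtick-quoted reference to the column's value.
    pairs = ", ".join(f"'{c}', `{c}`" for c in columns)
    return f"stack({len(columns)}, {pairs}) as ({label}, {value})"


# Column names as they appear in the cell above.
groups = ["trigger", "decision", "action", "waiting", "verification"]
columns = ["actor"] + [f"{g} ({part})" for g in groups
                       for part in ("what", "how", "where")]

stack_sql = build_stack_expr(columns)
print(len(columns))
print(stack_sql)
```

With this in place, the select could read `source_df.select('procedure name', expr(stack_sql))`, and the pair count always matches the column list.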
