## Simple space mission analysis with PyStarburst and Galaxy

### Prepare the environment

First, run the following in your terminal:

`$ python -m venv venv`

`$ source venv/bin/activate`

In addition, you'll need a [Starburst Galaxy](https://www.starburst.io/platform/starburst-galaxy/start/) account with a sample catalog [set up](https://docs.starburst.io/starburst-galaxy/catalogs/sample.html), alongside a writeable catalog for storing the results.

In [73]:
# Install the library

%pip install https://starburstdata-downloads.s3.amazonaws.com/pystarburst/0.5.0/pystarburst-0.5.0-py3-none-any.whl

Collecting pystarburst==0.5.0
  Using cached https://starburstdata-downloads.s3.amazonaws.com/pystarburst/0.5.0/pystarburst-0.5.0-py3-none-any.whl (121 kB)
Note: you may need to restart the kernel to use updated packages.


In [74]:
# Import dependencies

from pystarburst import Session
from pystarburst import functions as f
from pystarburst.functions import col

import trino

# Define Connection Properties
# You can get the host and other information from the Partner Connect -> PyStarburst section in Galaxy

session_properties = {
    "user": "alex.breshears@starburstdata.com",
    "host": "monicamillersbdemo-sample.trino.galaxy.starburst.io",
    "port": 443,
    # Needed for https secured clusters
    "http_scheme": "https",
    # Set up authentication with a username and password, or any other supported authentication method
    # See docs: https://github.com/trinodb/trino-python-client#authentication-mechanisms
    # Replace the placeholder with your own password; avoid committing real credentials
    "auth": trino.auth.BasicAuthentication("alex.breshears@starburstdata.com/accountadmin", "<your-password>")
}

session = Session.builder.configs(session_properties).create()
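
Hard-coding a password in a notebook makes it easy to leak. A minimal sketch of one alternative is to read the connection settings from environment variables. The variable names (`GALAXY_HOST`, `GALAXY_USER`, `GALAXY_PASSWORD`, `GALAXY_PORT`) and the helper `build_session_properties` are hypothetical, chosen for illustration, and are not part of PyStarburst.

```python
import os


def build_session_properties(env=os.environ):
    """Assemble connection settings from environment variables.

    The variable names are assumptions made for this sketch.
    Returns (properties, password) so the password never sits in the dict.
    """
    required = ("GALAXY_HOST", "GALAXY_USER", "GALAXY_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")

    properties = {
        "host": env["GALAXY_HOST"],
        "user": env["GALAXY_USER"],
        # Default to 443, the port used in the cell above
        "port": int(env.get("GALAXY_PORT", "443")),
        "http_scheme": "https",
    }
    return properties, env["GALAXY_PASSWORD"]


# Possible usage with the imports from the cell above (not executed here):
# props, password = build_session_properties()
# props["auth"] = trino.auth.BasicAuthentication(props["user"] + "/accountadmin", password)
# session = Session.builder.configs(props).create()
```

The role suffix (`/accountadmin`) mirrors the username used in the cell above; adjust it to whatever role your account uses.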

In [75]:
# Validate connectivity to the cluster

session.sql("select 1 as b").collect()

[Row(b=1)]

In [76]:
#
# Let's now understand our datasets
#

session.sql("SHOW CATALOGS").collect()

# We'll use the "sample" catalog for this example

# List out schemas
session.sql("SHOW SCHEMAS FROM sample").collect()
session.sql("USE sample.demo").collect()

# List our tables
session.sql("SHOW TABLES").collect()

df_missions = session.table("sample.demo.missions")

print(df_missions.schema)
df_missions.show()

df_missions.describe().show()


StructType([StructField('id', IntegerType(), nullable=True), StructField('company_name', StringType(), nullable=True), StructField('location', StringType(), nullable=True), StructField('date', StringType(), nullable=True), StructField('detail', StringType(), nullable=True), StructField('status_rocket', StringType(), nullable=True), StructField('cost', DoubleType(), nullable=True), StructField('status_mission', StringType(), nullable=True)])
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"id"  |"company_name"  |"location"                                          |"date"                      |"detail"                                            |"status_rocket"  |"cost"  |"status_mission"  |
---------------------------------------------------------------------------------------------------------------------------------------------

In [77]:
#
# Some data cleanup is needed, and we only want to look at missions since the year 2000
#

import datetime

df_missions = session.table("sample.demo.missions")

# We can add arbitrary SQL expressions as needed
df_missions = df_missions.with_column("date", f.sql_expr("COALESCE(TRY(date_parse(\"date\", '%a %b %d, %Y %H:%i UTC')), NULL)"))

print(df_missions.schema)

df_missions = df_missions.filter(col("date") > f.lit(datetime.datetime.strptime("01/01/2000", "%m/%d/%Y")))
df_missions = df_missions.sort(col("date"), ascending=True)

df_missions.show()


StructType([StructField('id', IntegerType(), nullable=True), StructField('company_name', StringType(), nullable=True), StructField('location', StringType(), nullable=True), StructField('detail', StringType(), nullable=True), StructField('status_rocket', StringType(), nullable=True), StructField('cost', DoubleType(), nullable=True), StructField('status_mission', StringType(), nullable=True), StructField('date', TimestampNTZType(), nullable=True)])
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"id"  |"company_name"  |"location"                                         |"detail"                          |"status_rocket"  |"cost"  |"status_mission"  |"date"               |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|1213 
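
To see what the `date_parse` pattern in the cell above accepts, here is a rough plain-Python equivalent. It is only a sketch: Trino's `date_parse` uses MySQL-style specifiers (`%i` for minutes), whereas Python's `strptime` uses `%M`, and the sample strings are assumed for illustration rather than taken from the table. Returning `None` on a mismatch mimics what `TRY(...)` does in the SQL expression.

```python
from datetime import datetime

# Python spells minutes as %M; the SQL pattern in the cell above uses %i.
PY_FORMAT = "%a %b %d, %Y %H:%M UTC"


def parse_mission_date(raw):
    """Return a naive datetime, or None when the string does not match."""
    try:
        return datetime.strptime(raw, PY_FORMAT)
    except (ValueError, TypeError):
        # Like TRY(...) in SQL: swallow the parse error and yield a null
        return None


print(parse_mission_date("Fri Aug 07, 2020 05:12 UTC"))  # 2020-08-07 05:12:00
print(parse_mission_date("Fri Aug 07, 2020"))            # None (no time part)
```

Rows that come back as null are then dropped by the `date > 2000-01-01` filter, since a comparison against null is never true.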

In [78]:
#
# Next we'll do a basic aggregation for summarization
#

df_summarized = df_missions.group_by("company_name").agg(col("*"), "count").rename("count(1)", "num_missions")
df_summarized = df_summarized.sort(col("num_missions").desc())
df_summarized.show(n=100)

-----------------------------------
|"company_name"  |"num_missions"  |
-----------------------------------
|CASC            |179             |
|Arianespace     |166             |
|ULA             |140             |
|SpaceX          |100             |
|VKS RF          |97              |
|ISRO            |62              |
|Roscosmos       |54              |
|Northrop        |51              |
|MHI             |50              |
|Boeing          |46              |
|NASA            |40              |
|Sea Launch      |32              |
|Lockheed        |30              |
|ILS             |24              |
|Kosmotras       |20              |
|Eurockot        |13              |
|Rocket Lab      |13              |
|ExPace          |10              |
|Blue Origin     |7               |
|JAXA            |7               |
|Land Launch     |7               |
|ISAS            |5               |
|KCST            |4               |
|Exos            |4               |
|MITT            |3         
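
For intuition, the group-by, count, and descending sort above correspond to the following plain-Python sketch. The sample rows are made up for illustration and are not drawn from the `missions` table.

```python
from collections import Counter

# Hypothetical rows standing in for the filtered missions DataFrame
sample_rows = [
    {"company_name": "SpaceX"},
    {"company_name": "ULA"},
    {"company_name": "SpaceX"},
    {"company_name": "CASC"},
    {"company_name": "SpaceX"},
    {"company_name": "ULA"},
]


def count_missions(rows):
    """Count rows per company and return (company, count) pairs, largest first."""
    counts = Counter(row["company_name"] for row in rows)
    # most_common() sorts by count in descending order
    return counts.most_common()


for company, num_missions in count_missions(sample_rows):
    print(company, num_missions)
```

The difference in the notebook is that the counting runs inside the Starburst cluster, so only the small summary has to be returned to the client.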

In [79]:
#
# Finally, let's write the table to our data lake
#

session.sql("CREATE SCHEMA IF NOT EXISTS use_for_general_aws.pystarburst_mis_sum").collect()

session.sql("DROP TABLE IF EXISTS use_for_general_aws.pystarburst_mis_sum.missions_summary").collect()

df_summarized.write.save_as_table(
    "use_for_general_aws.pystarburst_mis_sum.missions_summary",
)

# show() prints the rows and returns None, so keep the DataFrame in its own variable
df_validation = session.table("use_for_general_aws.pystarburst_mis_sum.missions_summary")
df_validation.show()

-----------------------------------
|"company_name"  |"num_missions"  |
-----------------------------------
|SpaceX          |100             |
|ULA             |140             |
|JAXA            |7               |
|Northrop        |51              |
|IAI             |2               |
|Rocket Lab      |13              |
|MHI             |50              |
|ISA             |2               |
|Blue Origin     |7               |
|Exos            |4               |
-----------------------------------
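
Note that the rows read back from the table appear in a different order than the sorted summary: a table read without an `ORDER BY` does not guarantee row order. A minimal sketch of an order-insensitive check follows; `same_rows` is a hypothetical helper, and it assumes the rows have been converted to plain tuples (for example from the results of `.collect()`).

```python
from collections import Counter


def same_rows(expected, actual):
    """Compare two collections of rows while ignoring row order.

    Counter keeps duplicates, so repeated rows must match in number too.
    """
    return Counter(map(tuple, expected)) == Counter(map(tuple, actual))


# Illustrative values copied from the outputs above
summary = [("ULA", 140), ("SpaceX", 100), ("JAXA", 7)]
read_back = [("SpaceX", 100), ("ULA", 140), ("JAXA", 7)]

print(same_rows(summary, read_back))      # True
print(same_rows(summary, read_back[:2]))  # False
```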

