## Simple space mission analysis with PyStarburst and Galaxy

### Sign up for a Galaxy account and set up the sample catalog

You'll need a [Starburst Galaxy](https://www.starburst.io/platform/starburst-galaxy/start/) account with the sample catalog [set up](https://docs.starburst.io/starburst-galaxy/catalogs/sample.html), plus a writable catalog for storing the results (an object-store-based catalog is suggested).

In [None]:
# Install the library

%pip install pystarburst
%pip install pandas matplotlib

In [None]:
# Define Connection Properties
# You can get the host and other information from the Partner Connect -> PyStarburst section in Galaxy

import getpass

host = input("Host name: ")
username = input("User name: ")
password = getpass.getpass("Password: ")
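
If you rerun the notebook often, typing the connection details each time gets tedious. Here is a minimal sketch, assuming you are willing to export the values as environment variables: the helper returns an environment variable when it is set and falls back to a prompt otherwise. The helper `get_setting` and the variable names `GALAXY_HOST`, `GALAXY_USER`, and `GALAXY_PASSWORD` are hypothetical; neither Galaxy nor PyStarburst defines them.

```python
import getpass
import os


def get_setting(env_var, prompt, secret=False):
    """Return the value of env_var if it is set, otherwise prompt for it."""
    value = os.environ.get(env_var)
    if value:
        return value
    # Hide the typed value for secrets such as passwords
    return getpass.getpass(prompt) if secret else input(prompt)


# Example usage (hypothetical variable names, shown commented out):
# host = get_setting("GALAXY_HOST", "Host name: ")
# username = get_setting("GALAXY_USER", "User name: ")
# password = get_setting("GALAXY_PASSWORD", "Password: ", secret=True)
```

The resulting `host`, `username`, and `password` values can be used in the rest of the notebook exactly like the prompted ones.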

In [None]:
# Import dependencies

from pystarburst import Session
from pystarburst import functions as f
from pystarburst.functions import col

import trino

session_properties = {
    "host": host,
    "port": 443,
    # Needed for HTTPS-secured clusters
    "http_scheme": "https",
    # Set up authentication with a username and password, or any other supported authentication method
    # See docs: https://github.com/trinodb/trino-python-client#authentication-mechanisms
    "auth": trino.auth.BasicAuthentication(username, password)
}

session = Session.builder.configs(session_properties).create()

In [None]:
# Validate connectivity to the cluster

session.sql("select 1 as b").collect()

In [None]:
# Let's understand the data

df_missions = session.table("sample.demo.missions")

print(df_missions.schema)
df_missions.show()

In [None]:
#
# Some data cleanup is needed, and we only want to look at missions since the year 2000
#

from datetime import datetime

# We can add arbitrary SQL expressions as needed.
# TRY() yields NULL for values that don't match the pattern, so no COALESCE fallback is required.
df_missions = df_missions.with_column("date", f.sql_expr("TRY(date_parse(\"date\", '%a %b %d, %Y %H:%i UTC'))"))

print(df_missions.schema)

df_missions = df_missions\
    .filter(col("date") > datetime(2000, 1, 1))\
    .sort(col("date"), ascending=True)

df_missions.show()
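
To make the parsing step easier to reason about, here is a rough local sketch of what `TRY(date_parse(...))` followed by the date filter does, written with Python's standard library only. It runs nothing on the cluster. The helper `try_parse` and the sample strings are made up for illustration, and the Python pattern assumes English day and month names in the active locale.

```python
from datetime import datetime

# Python equivalent of the Trino pattern '%a %b %d, %Y %H:%i UTC'
# (Trino's %i means minutes, which Python spells %M)
PY_FORMAT = "%a %b %d, %Y %H:%M UTC"


def try_parse(raw):
    """Mimic TRY(date_parse(...)): return None instead of raising an error."""
    try:
        return datetime.strptime(raw, PY_FORMAT)
    except (ValueError, TypeError):
        return None


# Illustrative inputs: a well-formed value, one without a time, and a missing value
samples = ["Fri Aug 07, 2020 05:12 UTC", "Fri Aug 07, 2020", None]
parsed = [try_parse(s) for s in samples]

# Like the SQL filter, rows whose date could not be parsed are dropped
cutoff = datetime(2000, 1, 1)
kept = [d for d in parsed if d is not None and d > cutoff]
```

On the cluster the comparison against a NULL date evaluates to NULL, so rows that fail to parse are removed by the filter rather than raising an error.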

In [None]:
#
# Next we'll do a basic aggregation for summarization
#

df_summarized = df_missions\
    .group_by("company_name")\
    .count()\
    .rename("count", "num_missions")\
    .sort(col("num_missions").desc())
df_summarized.show(n=100)
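
For intuition, the same group, count, rename, and sort-descending pipeline can be sketched locally with `collections.Counter`. The company names below are invented placeholders, not rows from the sample catalog, and the PyStarburst version above still runs on the cluster rather than in the notebook process.

```python
from collections import Counter

# Made-up values standing in for the company_name column
company_names = ["Alpha", "Beta", "Alpha", "Gamma", "Beta", "Alpha"]

# Count rows per company (the group_by + count step)
counts = Counter(company_names)

# most_common() orders by count, largest first (the sort descending step);
# the dict keys mirror the renamed num_missions column
summary = [
    {"company_name": name, "num_missions": n}
    for name, n in counts.most_common()
]
```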

In [None]:
#
# Finally, let's write the table to our data lake
# (replace s3lakehouse with the name of your own writable catalog)
#

session.sql("CREATE SCHEMA IF NOT EXISTS s3lakehouse.pystarburst_mis_sum").collect()

session.sql("DROP TABLE IF EXISTS s3lakehouse.pystarburst_mis_sum.missions_summary").collect()

df_summarized.write.save_as_table(
    "s3lakehouse.pystarburst_mis_sum.missions_summary",
)

# show() prints the rows and returns None, so keep the DataFrame reference separate
df_validation = session.table("s3lakehouse.pystarburst_mis_sum.missions_summary")
df_validation.show()
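
Because the catalog name will differ between accounts, it can help to build the fully qualified names in one place instead of repeating the string. This is a small sketch under that assumption; `qualified_name` is a hypothetical helper, not part of PyStarburst, and it only accepts plain unquoted identifiers.

```python
import re

# Letters, digits, and underscores only; anything else would need SQL quoting
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_name(*parts):
    """Join catalog, schema, and table parts into a dotted name."""
    for part in parts:
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(parts)


CATALOG = "s3lakehouse"  # replace with your own writable catalog
SCHEMA = "pystarburst_mis_sum"

schema_name = qualified_name(CATALOG, SCHEMA)
table_name = qualified_name(CATALOG, SCHEMA, "missions_summary")
```

The resulting `schema_name` and `table_name` strings could then be passed to `session.sql(...)`, `save_as_table(...)`, and `session.table(...)` in place of the literals.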

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

def to_pandas_df(pystarburst_df):
    """Collect all rows to the client and wrap them in a pandas DataFrame."""
    return pd.DataFrame(pystarburst_df.collect())

df_validation_pd = to_pandas_df(session.table("s3lakehouse.pystarburst_mis_sum.missions_summary"))
df_validation_pd = df_validation_pd.sort_values('num_missions')

In [None]:
# Plot each company's share of missions as a pie chart
df_validation_pd.plot.pie(
    figsize=(20, 12),
    y='num_missions',
    labels=df_validation_pd['company_name'],
    legend=False,
)