## Simple space mission analysis with PyStarburst and Galaxy

### Prepare the environment

First, run the following in your terminal to create and activate a virtual environment:

`$ python -m venv venv`

`$ source venv/bin/activate`

In addition, you'll need a [Starburst Galaxy](https://www.starburst.io/platform/starburst-galaxy/start/) account with a sample catalog [set up](https://docs.starburst.io/starburst-galaxy/catalogs/sample.html), alongside a writeable catalog for storing the results.

In [None]:
# Install the library

%pip install https://starburstdata-downloads.s3.amazonaws.com/pystarburst/0.5.0/pystarburst-0.5.0-py3-none-any.whl

In [None]:
# Import dependencies

from pystarburst import Session
from pystarburst import functions as f
from pystarburst.functions import col

import trino

# Define Connection Properties
# You can get the host and other information from the Partner Connect -> PyStarburst section in Galaxy

session_properties = {
    "host": "yourcluster.trino.galaxy.starburst.io",
    "port": 443,
    # Needed for https secured clusters
    "http_scheme": "https",
    # Set up authentication with a username and password, or any other supported authentication method
    # See docs: https://github.com/trinodb/trino-python-client#authentication-mechanisms
    "auth": trino.auth.BasicAuthentication("youremail@domain.com/accountadmin", "password")
}

session = Session.builder.configs(session_properties).create()
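
Hard-coding credentials in a notebook is easy to leak. As a minimal sketch, the connection values could instead be read from environment variables. The variable names (`GALAXY_HOST`, `GALAXY_PORT`, `GALAXY_USER`, `GALAXY_PASSWORD`) and the helper `load_connection_settings` are hypothetical and not part of PyStarburst or Galaxy.

```python
import os


def load_connection_settings(env=os.environ):
    """Collect connection values from environment variables.

    The variable names used here are assumptions made for this sketch.
    """
    required = ("GALAXY_HOST", "GALAXY_USER", "GALAXY_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["GALAXY_HOST"],
        # Default to 443, the port used in the cell above
        "port": int(env.get("GALAXY_PORT", "443")),
        "http_scheme": "https",
        "user": env["GALAXY_USER"],
        "password": env["GALAXY_PASSWORD"],
    }
```

The returned `user` and `password` values would then be passed to `trino.auth.BasicAuthentication(...)` when building `session_properties`, with `host`, `port`, and `http_scheme` copied over unchanged.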

In [None]:
# Validate connectivity to the cluster

session.sql("select 1 as b").collect()

In [None]:
# Let's understand the data

df_missions = session.table("sample.demo.missions")

print(df_missions.schema)
df_missions.show()

In [None]:
#
# Some data cleanup is needed, and we only want to look at missions since the year 2000
#

from datetime import datetime

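# We can add arbitrary SQL expressions as needed.
# date_parse converts the text column to a timestamp; TRY yields NULL for rows that fail to parse
# (so wrapping the result in COALESCE(..., NULL) is unnecessary).
df_missions = df_missions.with_column("date", f.sql_expr("TRY(date_parse(\"date\", '%a %b %d, %Y %H:%i UTC'))"))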

print(df_missions.schema)

df_missions = df_missions\
    .filter(col("date") > datetime(2000, 1, 1))\
    .sort(col("date"), ascending=True)

df_missions.show()
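
To see what the `date_parse` pattern in the cell above accepts, it can help to try the same format locally. The sketch below mirrors it with Python's `datetime.strptime`, where Trino's `%i` (minutes) becomes `%M`, and returns `None` on failure in the way `TRY` yields NULL. The helper name and the sample strings are hypothetical and are not taken from the dataset.

```python
from datetime import datetime
from typing import Optional

# strptime counterpart of the Trino pattern '%a %b %d, %Y %H:%i UTC'
MISSION_DATE_FORMAT = "%a %b %d, %Y %H:%M UTC"


def parse_mission_date(text: str) -> Optional[datetime]:
    """Parse a mission date string, returning None when it does not match."""
    try:
        return datetime.strptime(text, MISSION_DATE_FORMAT)
    except ValueError:
        # Comparable to TRY(...) producing NULL in the SQL expression
        return None


# Hypothetical sample values: one with a time component, one without
for sample in ["Fri Aug 07, 2020 05:12 UTC", "Thu Aug 29, 2019"]:
    print(sample, "->", parse_mission_date(sample))
```

Rows that parse to NULL are dropped by the subsequent `filter`, because a comparison with NULL is never true.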


In [None]:
#
# Next we'll do a basic aggregation for summarization
#

df_summarized = df_missions\
    .group_by("company_name")\
    .count()\
    .rename("count", "num_missions")\
    .sort(col("num_missions").desc())
df_summarized.show(n=100)
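
For intuition, the aggregation above (group by company, count, sort by count descending) can be sketched in plain Python with `collections.Counter`. The rows below are made up for illustration and do not come from `sample.demo.missions`.

```python
from collections import Counter

# Hypothetical rows standing in for the filtered missions table
rows = [
    {"company_name": "SpaceX"},
    {"company_name": "CASC"},
    {"company_name": "SpaceX"},
    {"company_name": "Roscosmos"},
    {"company_name": "CASC"},
    {"company_name": "SpaceX"},
]

# group_by("company_name").count() corresponds to counting occurrences per key
counts = Counter(row["company_name"] for row in rows)

# sort(col("num_missions").desc()) corresponds to ordering by count, largest first
summary = counts.most_common()

for company_name, num_missions in summary:
    print(company_name, num_missions)
```

The difference in practice is that the PyStarburst version runs on the cluster, so only the summarized rows are sent back to the notebook.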

In [None]:
#
# Finally, let's write the table to our data lake.
# Replace use_for_general_aws below with the name of your own writeable catalog.
#

session.sql("CREATE SCHEMA IF NOT EXISTS use_for_general_aws.pystarburst_mis_sum").collect()

session.sql("DROP TABLE IF EXISTS use_for_general_aws.pystarburst_mis_sum.missions_summary").collect()

df_summarized.write.save_as_table(
    "use_for_general_aws.pystarburst_mis_sum.missions_summary",
)

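# Read the table back to validate the write (show() prints the rows and returns nothing)
df_validation = session.table("use_for_general_aws.pystarburst_mis_sum.missions_summary")
df_validation.show()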
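
The cell above repeats the same `catalog.schema.table` string several times. One possible way to keep the names in one place is a small helper that assembles them. `qualified_name` is a hypothetical helper written for this sketch, not a PyStarburst function, and its identifier check is deliberately conservative (letters, digits, and underscores only).

```python
import re

# Conservative pattern for unquoted identifiers (an assumption of this sketch)
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_name(catalog, schema, table=None):
    """Join catalog, schema, and optionally table into a dotted name."""
    parts = [catalog, schema] + ([table] if table else [])
    for part in parts:
        if not _IDENTIFIER.match(part):
            raise ValueError(f"Unsupported identifier: {part!r}")
    return ".".join(parts)


CATALOG = "use_for_general_aws"   # replace with your own writeable catalog
SCHEMA = "pystarburst_mis_sum"

schema_name = qualified_name(CATALOG, SCHEMA)
table_name = qualified_name(CATALOG, SCHEMA, "missions_summary")
print(schema_name)
print(table_name)
```

These strings could then be interpolated into the `CREATE SCHEMA` and `DROP TABLE` statements and passed to `save_as_table` and `session.table`.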