
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your aws configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (eg: %glue_version 2.0).                               |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %sql                        |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer.                      |

In [3]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.dataframe import DataFrame
from pyspark.sql.functions import (
    col,
    collect_list,
    count,
    desc,
    expr,
    lag,
    max,
    min,
    round,
    sum,
)
from pyspark.sql.types import (
    StringType,
    StructField,
    StructType,
    TimestampType,
)

from pyspark.sql.window import Window
from awsglue.dynamicframe import DynamicFrame

import boto3

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
client = boto3.client('s3')


SESSION_SCHEMA = StructType(
    [
        StructField("userid", StringType(), False),
        StructField("timestamp", TimestampType(), True),
        StructField("artistid", StringType(), True),
        StructField("artistname", StringType(), True),
        StructField("trackid", StringType(), True),
        StructField("trackname", StringType(), True),
    ]
)

S3_PATH="s3://lastfm-dataset/user-session-track.tsv"
BUCKET="lastfm-dataset"




In [35]:

def read_session_data(spark) -> DataFrame:
    """
    Read the tab-separated session data from S3 and drop the artistid and trackid columns.
    :param spark: SparkSession
    :return: Spark DataFrame
    """
    data = (
        spark.read.format("csv")
        .option("header", "false")
        .option("delimiter", "\t")
        .schema(SESSION_SCHEMA)
        .load(S3_PATH)
    )
    cols_to_drop = ("artistid", "trackid")
    return data.drop(*cols_to_drop).cache()




def create_users_and_distinct_songs_count(df: DataFrame) -> DataFrame:
    """
    Create a list of user IDs, along with the number of distinct songs each user has played.
    :param df: Spark DataFrame
    :return: DataFrame
    """
    df1 = df.select("userid", "artistname", "trackname").dropDuplicates()
    df2 = (
        df1.groupBy("userid")
        .agg(count("*").alias("DistinctTrackCount"))
        .orderBy(desc("DistinctTrackCount"))
    )
    return df2


def create_popular_songs(df: DataFrame, limit=100) -> DataFrame:
    """
    Create a list of the most popular songs (artist and title) in the dataset, with the number of
    times each was played. The top 100 are returned by default.
    :param df: Spark DataFrame
    :param limit: int
    :return: DataFrame
    """
    df1 = (
        df.groupBy("artistname", "trackname")
        .agg(count("*").alias("CountPlayed"))
        .orderBy(desc("CountPlayed"))
        .limit(limit)
    )
    return df1


def create_session_ids_for_all_users(
    df: DataFrame, session_cutoff: int
) -> DataFrame:
    """
    Assigns a 'sessionID' to every play, per user, in timestamp order. A new session starts when the
    time between successive played tracks exceeds session_cutoff minutes.
    :param df: Spark DataFrame
    :param session_cutoff: int
    :return: Spark DataFrame
    """
    w1 = Window.partitionBy("userid").orderBy("timestamp")

    df1 = (
        df.withColumn("pretimestamp", lag("timestamp").over(w1))
        .withColumn(
            "delta_mins",
            round(
                (
                    col("timestamp").cast("long")
                    - col("pretimestamp").cast("long")
                )
                / 60
            ),
        )
        .withColumn(
            "sessionflag",
            expr(
                f"CASE WHEN delta_mins > {session_cutoff} OR delta_mins IS NULL THEN 1 ELSE 0 END"
            ),
        )
        .withColumn("sessionID", sum("sessionflag").over(w1))
    )
    return df1


def compute_top_n_longest_sessions(df: DataFrame, limit: int) -> DataFrame:
    """
    Calculates the length of each session for each user from difference of timestamps
    of first and last songs of each session. Returns top n longest sessions, where n is determined
    by the limit argument.
    :param df: Spark DataFrame
    :param limit: int
    :return: Spark DataFrame
    """
    df1 = (
        df.groupBy("userid", "sessionID")
        .agg(
            min("timestamp").alias("session_start_ts"),
            max("timestamp").alias("session_end_ts"),
        )
        .withColumn(
            "session_length(hrs)",
            round(
                (
                    col("session_end_ts").cast("long")
                    - col("session_start_ts").cast("long")
                )
                / 3600
            ),
        )
        .orderBy(desc("session_length(hrs)"))
        .limit(limit)
    )
    return df1


def longest_sessions_with_tracklist(
    df: DataFrame, session_cutoff: int = 20, limit: int = 10
) -> DataFrame:
    """
    Creates a DataFrame of the top `limit` longest sessions (by elapsed time), with the following
    information about each session: userid, sessionID, session length in hours, and the list of
    tracks played in the session. An existing session ends if the time between successive
    tracks exceeds session_cutoff minutes.
    Note: collect_list does not guarantee that the tracks are listed in order of play.
    :param df: Spark DataFrame
    :param session_cutoff:int, Default:20
    :param limit: int, Default:10
    :return: DataFrame
    """
    df1 = create_session_ids_for_all_users(df, session_cutoff)
    df2 = compute_top_n_longest_sessions(df1, limit)
    df3 = (
        df1.join(df2, ["userid", "sessionID"])
        .select("userid", "sessionID", "trackname", "session_length(hrs)")
        .groupBy("userid", "sessionID", "session_length(hrs)")
        .agg(collect_list("trackname").alias("tracklist"))
        .orderBy(desc("session_length(hrs)"))
    )
    return df3


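def rename_s3_results_key(source_key_prefix, dest_key):
    # List only the objects under the given prefix rather than the whole bucket
    # (a single list_objects_v2 call returns at most 1,000 keys).
    response = client.list_objects_v2(Bucket=BUCKET, Prefix=source_key_prefix)
    # "Contents" is absent from the response when no object matches the prefix.
    keys = [obj["Key"] for obj in response.get("Contents", [])]
    if not keys:
        raise FileNotFoundError(f"No object found under s3://{BUCKET}/{source_key_prefix}")
    # Copy the first match to its final name, then delete the original.
    client.copy_object(Bucket=BUCKET, CopySource={"Bucket": BUCKET, "Key": keys[0]}, Key=dest_key)
    client.delete_object(Bucket=BUCKET, Key=keys[0])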
    

def write_ddf_to_s3(df: DataFrame, name: str):
    dyf = DynamicFrame.fromDF(df.repartition(1), glueContext, name)
    sink = glueContext.write_dynamic_frame.from_options(
        frame=dyf,
        # use s3a as it seems to prevent creating '_$folder$' in S3
        connection_type="s3a",
        format="glueparquet",
        connection_options={"path": f"s3a://{BUCKET}/results/{name}/", "partitionKeys": []},
        transformation_ctx=f"{name}_sink",
    )
    source_key_prefix = f"results/{name}/run-"
    dest_key = f"results/{name}/{name}.parquet"
    rename_s3_results_key(source_key_prefix, dest_key)
    return sink





In [5]:

df = read_session_data(spark)
df.printSchema()

df.show(5)

root
 |-- userid: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- artistname: string (nullable = true)
 |-- trackname: string (nullable = true)

+-----------+-------------------+----------+--------------------+
|     userid|          timestamp|artistname|           trackname|
+-----------+-------------------+----------+--------------------+
|user_000001|2009-05-04 23:08:57| Deep Dish|Fuck Me Im Famous...|
|user_000001|2009-05-04 13:54:10|  坂本龍一|Composition 0919 ...|
|user_000001|2009-05-04 13:52:04|  坂本龍一|Mc2 (Live_2009_4_15)|
|user_000001|2009-05-04 13:42:52|  坂本龍一|Hibari (Live_2009...|
|user_000001|2009-05-04 13:42:11|  坂本龍一|Mc1 (Live_2009_4_15)|
+-----------+-------------------+----------+--------------------+
only showing top 5 rows


In [36]:


songs_per_user = create_users_and_distinct_songs_count(df)
songs_per_user.show()




+-----------+------------------+
|     userid|DistinctTrackCount|
+-----------+------------------+
|user_000691|             63636|
|user_000861|             50230|
|user_000681|             43241|
|user_000800|             39542|
|user_000427|             35934|
|user_000774|             34620|
|user_000702|             31342|
|user_000345|             26055|
|user_000882|             24990|
|user_000783|             24569|
|user_000451|             23513|
|user_000692|             22392|
|user_000910|             22311|
|user_000162|             22143|
|user_000313|             20355|
|user_000031|             19864|
|user_000870|             19847|
|user_000896|             19836|
|user_000483|             19479|
|user_000210|             19269|
+-----------+------------------+
only showing top 20 rows
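
The counting logic behind `create_users_and_distinct_songs_count` can be illustrated without Spark. The sketch below is a plain-Python approximation, not the notebook's implementation: the function name `distinct_track_counts` and the sample rows are hypothetical. It treats a song as an (artistname, trackname) pair, so the same title by two different artists counts twice.

```python
from collections import defaultdict


def distinct_track_counts(plays):
    """plays: iterable of (userid, artistname, trackname) tuples."""
    tracks_by_user = defaultdict(set)
    for userid, artistname, trackname in plays:
        # A set keeps each (artist, track) pair once, like dropDuplicates().
        tracks_by_user[userid].add((artistname, trackname))
    counts = [(userid, len(tracks)) for userid, tracks in tracks_by_user.items()]
    # Sort by count, largest first, like orderBy(desc("DistinctTrackCount")).
    return sorted(counts, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample rows, not taken from the dataset.
sample_plays = [
    ("user_a", "Artist 1", "Song X"),
    ("user_a", "Artist 1", "Song X"),  # repeat play, counted once
    ("user_a", "Artist 2", "Song X"),  # same title, different artist
    ("user_b", "Artist 1", "Song X"),
]

print(distinct_track_counts(sample_plays))
```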


In [37]:
popular_songs = create_popular_songs(df)
popular_songs.show()


+-------------------+--------------------+-----------+
|         artistname|           trackname|CountPlayed|
+-------------------+--------------------+-----------+
| The Postal Service|  Such Great Heights|       3992|
|       Joy Division|Love Will Tear Us...|       3663|
|          Radiohead|        Karma Police|       3534|
|               Muse|Supermassive Blac...|       3483|
|Death Cab For Cutie|     Soul Meets Body|       3479|
|          The Knife|          Heartbeats|       3156|
|               Muse|           Starlight|       3060|
|        Arcade Fire|    Rebellion (Lies)|       3048|
|     Britney Spears|          Gimme More|       3004|
|        The Killers| When You Were Young|       2998|
|           Interpol|                Evil|       2989|
|         Kanye West|       Love Lockdown|       2950|
|     Massive Attack|            Teardrop|       2948|
|Death Cab For Cutie|I Will Follow You...|       2947|
|               Muse| Time Is Running Out|       2945|
|         

In [38]:
df_sessions = longest_sessions_with_tracklist(df)
df_sessions.show()

+-----------+---------+-------------------+--------------------+
|     userid|sessionID|session_length(hrs)|           tracklist|
+-----------+---------+-------------------+--------------------+
|user_000949|      149|              354.0|[Chained To You, ...|
|user_000997|       18|              353.0|[Unentitled State...|
|user_000949|      553|              309.0|[White Daisy Pass...|
|user_000544|       75|              252.0|[Finally Woken, O...|
|user_000949|      137|              212.0|[Neighborhood #2 ...|
|user_000949|      187|              187.0|[Disco Science, H...|
|user_000949|      123|              187.0|[Excuse Me Miss A...|
|user_000544|       55|              181.0|[La Murga, Breath...|
|user_000250|     1258|              174.0|[Lazarus Heart, S...|
|user_000949|      150|              170.0|[Y-Control, Banqu...|
+-----------+---------+-------------------+--------------------+
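
The session IDs above come from a lag, flag, and running-sum pattern: compute the gap to the previous play, flag the play when the gap exceeds the cutoff (or when there is no previous play), and take the cumulative sum of the flags. A minimal plain-Python sketch of that idea for a single user follows. The function name `assign_session_ids` and the sample timestamps are hypothetical, and Python's `round` breaks ties differently from Spark's, so gaps of exactly half a minute may be classified differently.

```python
from datetime import datetime


def assign_session_ids(timestamps, session_cutoff=20):
    """Return (timestamp, session_id) pairs for one user's plays."""
    result = []
    session_id = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None:
            # First play: no previous timestamp, so a session starts here.
            flag = 1
        else:
            delta_mins = round((ts - previous).total_seconds() / 60)
            flag = 1 if delta_mins > session_cutoff else 0
        # Running sum of the flags gives the session ID.
        session_id += flag
        result.append((ts, session_id))
        previous = ts
    return result


# Hypothetical play times: gaps of 10, 35, 5 and 40 minutes.
sample_times = [
    datetime(2009, 5, 4, 13, 0),
    datetime(2009, 5, 4, 13, 10),
    datetime(2009, 5, 4, 13, 45),
    datetime(2009, 5, 4, 13, 50),
    datetime(2009, 5, 4, 14, 30),
]

for ts, sid in assign_session_ids(sample_times):
    print(ts, sid)
```

With the default 20-minute cutoff, the 35-minute and 40-minute gaps each open a new session, so the five plays fall into sessions 1, 1, 2, 2 and 3.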


In [1]:

write_ddf_to_s3(popular_songs, "popular_songs")
write_ddf_to_s3(df_sessions, "df_sessions")
write_ddf_to_s3(songs_per_user, "distinct_songs")
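
The rename step depends on finding the part file that Glue wrote under `results/<name>/run-`. A single `list_objects_v2` call returns at most 1,000 keys, so on a larger bucket it may be safer to page through the listing with a prefix filter. The sketch below shows one way to do that with boto3's `get_paginator`. The helper name `find_keys_with_prefix`, the fake client, the bucket name, and the object keys are all hypothetical stand-ins so that the example runs without AWS access.

```python
def find_keys_with_prefix(s3_client, bucket, prefix):
    """Return every key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys


class _FakePaginator:
    """Hypothetical stand-in that mimics a paginator's paginate() call."""

    def __init__(self, pages):
        self._pages = pages

    def paginate(self, **kwargs):
        prefix = kwargs.get("Prefix", "")
        for page in self._pages:
            contents = [o for o in page if o["Key"].startswith(prefix)]
            yield {"Contents": contents} if contents else {}


class FakeS3Client:
    """Hypothetical stand-in for boto3.client('s3'), for local runs only."""

    def __init__(self, pages):
        self._pages = pages

    def get_paginator(self, operation_name):
        assert operation_name == "list_objects_v2"
        return _FakePaginator(self._pages)


# Hypothetical object keys spread over two listing pages.
fake_client = FakeS3Client(
    [
        [{"Key": "results/df_sessions/run-0001-part-r-00000"}],
        [{"Key": "results/popular_songs/run-0001-part-r-00000"}],
    ]
)

print(find_keys_with_prefix(fake_client, "example-bucket", "results/popular_songs/run-"))
```

In the notebook, the real `client` created with `boto3.client('s3')` would be passed in place of `fake_client`.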
