#Notebook Description
**Author**: Slawomir Drzymala

**Description:**   
This notebook reads data from the curated layer of the data lake and prepares a dataset for further analysis of the songs played on a given radio station.

#Create notebook parameters
* radio_name - name of the radio station used to build the dataset, default: RMFFM
* song_minium_play_treshold - play-count threshold; only songs played more than this many times on the station's playlist are kept, default: 0 - take all

In [0]:
dbutils.widgets.dropdown("radio_name", 'RMFFM', ['Antyradio', 'Eska', 'RMFFM', 'ZET'])
dbutils.widgets.text("song_minium_play_treshold", "0")

#Get notebook parameters and assign to local variables

In [0]:
current_radio_name = dbutils.widgets.get("radio_name")
current_song_mininum_play_treshold = int(dbutils.widgets.get("song_minium_play_treshold"))
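
Because the threshold arrives from a text widget, a plain `int(...)` fails with a generic error on bad input and silently accepts negative numbers. The helper below is only a sketch of a stricter conversion; `parse_play_threshold` is a hypothetical name that does not exist in the notebook, and the validation rules are an assumption.

```python
def parse_play_threshold(raw: str) -> int:
    """Convert the widget text into a non-negative integer (hypothetical helper)."""
    try:
        value = int(raw.strip())
    except ValueError as exc:
        raise ValueError(
            f"song_minium_play_treshold must be an integer, got {raw!r}"
        ) from exc
    if value < 0:
        raise ValueError(
            f"song_minium_play_treshold must not be negative, got {value}"
        )
    return value


# In the notebook the argument would come from dbutils.widgets.get(...)
threshold = parse_play_threshold("0")
print(threshold)
```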

#Set up connection to data lake on Azure

In [0]:
# see https://docs.databricks.com/_static/notebooks/data-import/azure-data-lake-store.html
# see https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started
# Do not hard-code the storage account key in the notebook; read it from a secret scope instead.
# The scope and key names below are placeholders (assumptions), replace them with your own.
spark.conf.set(
  "fs.azure.account.key.sdsalearnsthnew.dfs.core.windows.net",
  dbutils.secrets.get(scope="<secret-scope>", key="<storage-account-key-name>")
)


#Read all files from all years for the given radio name from the Azure data lake

**Things to be noticed:**   
* **file path** - combines the partition parameter with wildcards so that all matching files are read
* **.option("basePath", basePath)** - required so that the partition values are derived automatically from the path
* **encoding** - sets UTF-8 explicitly so that all characters are read correctly; Parquet already stores strings as UTF-8, so this option mainly matters for text formats such as CSV or JSON
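
To make the role of `basePath` more concrete, here is a rough plain-Python sketch of how partition values can be read from the `key=value` directory names that follow the base path. It is not Spark's implementation (Spark also infers column types, for example), `partition_values` is a hypothetical helper, and the file name `part-0000.parquet` is made up for illustration.

```python
def partition_values(base_path: str, file_path: str) -> dict:
    """Collect key=value directory names found between base_path and the file name."""
    if not file_path.startswith(base_path):
        raise ValueError("file_path must be located under base_path")
    relative = file_path[len(base_path):]
    values = {}
    # The last path segment is the file itself, so it is skipped
    for segment in relative.split("/")[:-1]:
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


base_path = "abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/curated-initial/"
# Hypothetical file name, used only to show the directory layout
example_file = base_path + "radio_name=RMFFM/year=2020/part-0000.parquet"
discovered = partition_values(base_path, example_file)
print(discovered)
```

If the base path pointed at the `radio_name=RMFFM/` directory instead, only `year` would be found in this sketch, which mirrors why the notebook sets `basePath` to the root of the partitioned dataset.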

In [0]:
# read all files from all years for the selected radio station
base_path = "abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/curated-initial/"
file_path = f"{base_path}radio_name={current_radio_name}/year=*/*.parquet"
df_playlist = spark.read.option("basePath", base_path) \
                        .option('encoding', 'UTF-8') \
                        .parquet(file_path, multiLine=True)


#Check schema of the created data frame

**Things to be noticed:**   
* **radio_name and year** - these columns were retrieved automatically from the path because they were used for partitioning

In [0]:
df_playlist.printSchema()

#Show number of rows in dataframe

In [0]:
df_playlist.count()

#Display 5 sample rows

**Things to be noticed:**   
* **display** - a Databricks built-in function that can visualize many kinds of objects, including Spark and pandas dataframes

In [0]:
display(
  df_playlist.head(5)
)

datetime,artist,title,radio_name,year
2020-12-31T00:01:00.000+0000,Calvin Harris / Rihanna,This Is What You Came For,RMFFM,2020
2020-12-31T00:07:00.000+0000,Nickelback,Lullaby,RMFFM,2020
2020-12-31T00:10:00.000+0000,Robert Gawliński,Nie Stało Się Nic,RMFFM,2020
2020-12-31T00:13:00.000+0000,Tiësto,The Business,RMFFM,2020
2020-12-31T00:15:00.000+0000,Andrzej Piaseczny / Robert Chojnacki,Prawie Do Nieba,RMFFM,2020


# Manipulate dataframe, add basic new columns
**New columns**
* **date** - date extracted from the existing datetime column
* **played** - indicator that the given song was played on the given radio at the given datetime; the data comes from the actual playlist, so every row represents a song that was played

**Things to be noticed:**   
* **select** - similar to SQL SELECT, picks the columns for further use
* **withColumn** - adds a new column to an existing dataframe
* **lit(x)** - creates a column holding the same constant value in every row; passing the bare value to withColumn without lit() raises an exception
* **attribute names** - columns can be referenced in several ways; here we use dataframe["columnname"] when the column exists in the original dataframe and just "columnname" when the column has been derived. There are more options, though
* **replace dataframe** - note that the result is assigned back to the same name, so the original dataframe is "overwritten"
* **list of available functions** - all available functions are listed here: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html
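
As a way to reason about the transformation without a Spark cluster, the sketch below applies the same steps to a single row represented as a Python dict. It is only an analogue of the select / withColumn chain, and `add_basic_columns` is a hypothetical helper that is not part of the notebook.

```python
from datetime import datetime


def add_basic_columns(row: dict) -> dict:
    """Plain-Python analogue of the select / withColumn chain for one row."""
    return {
        "radio_name": row["radio_name"],
        "date": row["datetime"].date(),  # analogue of to_date(datetime)
        "artist": row["artist"],
        "title": row["title"],
        "played": 1,  # analogue of lit(1): the same constant for every row
    }


sample_row = {
    "datetime": datetime(2020, 12, 31, 0, 7),
    "artist": "Nickelback",
    "title": "Lullaby",
    "radio_name": "RMFFM",
    "year": 2020,
}
transformed = add_basic_columns(sample_row)
print(transformed)
```

The `year` column is dropped and the column order changes, just as in the final select of the Spark version.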

In [0]:
from pyspark.sql.functions import input_file_name
from pyspark.sql.functions import lit, split, reverse, regexp_replace, count, concat_ws, coalesce
from pyspark.sql.functions import year, date_format, hour, to_date

# all functions -> https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html

# keep the needed columns, derive date and played, then reorder the columns
df_playlist = df_playlist.select(
                            df_playlist["radio_name"],
                            df_playlist["artist"],
                            df_playlist["datetime"],
                            df_playlist["title"]
                          ) \
                         .withColumn("date", to_date(df_playlist["datetime"])) \
                         .withColumn("played", lit(1)) \
                         .select(
                            df_playlist["radio_name"],
                            "date",
                            df_playlist["artist"],
                            df_playlist["title"],
                            "played"
                          )


#Display 5 sample rows to check the dataframe again

In [0]:
display(
  df_playlist.head(5)
)

radio_name,date,artist,title,played
RMFFM,2020-12-31,Calvin Harris / Rihanna,This Is What You Came For,1
RMFFM,2020-12-31,Nickelback,Lullaby,1
RMFFM,2020-12-31,Robert Gawliński,Nie Stało Się Nic,1
RMFFM,2020-12-31,Tiësto,The Business,1
RMFFM,2020-12-31,Andrzej Piaseczny / Robert Chojnacki,Prawie Do Nieba,1


# Create dataframe with playlist statistics for all songs

**Idea**   
We are going to prepare a dataframe that holds statistics for all songs collected so far (subject to the play threshold described below) across all collected dates. The current dataframe only contains the songs that were actually played on the radio, so we add rows for every song that was not played on the station on a given day. To get there we create some intermediate dataframes:

**New (or changed) dataframes**
* **df_unique_radio_and_day** - list of all available days for the given radio
* **df_unique_songs** - list of all unique songs and their artists; to limit the amount of data, only songs played more than song_minium_play_treshold times over the entire period are kept (with the default of 0 every song is kept)
* **df_cross_radio_day_songs** - Cartesian product of df_unique_radio_and_day and df_unique_songs
* **df_playlist_statistics** - final dataframe with both the songs that were played and those that were not; df_cross_radio_day_songs is left-joined back to the original playlist dataframe and the played column is set from the result
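
The song filter can be illustrated in plain Python. This is a sketch of the idea only, assuming the comparison is strictly "greater than" as in the where clause of the notebook code; `unique_songs` is a hypothetical helper and the play list is made-up sample data.

```python
from collections import Counter


def unique_songs(plays, threshold):
    """Return (artist, title) pairs played more than `threshold` times."""
    counts = Counter(plays)
    return sorted(song for song, cnt in counts.items() if cnt > threshold)


# Made-up sample: one (artist, title) tuple per airing
sample_plays = [
    ("Abba", "Mamma Mia"),
    ("Abba", "Mamma Mia"),
    ("Nickelback", "Lullaby"),
]
all_songs = unique_songs(sample_plays, 0)
frequent_songs = unique_songs(sample_plays, 1)
print(all_songs)
print(frequent_songs)
```

With a threshold of 0 every song passes, because each song in the playlist was played at least once.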

**New (or changed) columns**
* **artist_and_title** - concatenation of the artist name and the song title
* **month_name** - month name derived from the date of each row
* **played** - after the Cartesian product is joined back to the original playlist, every row without a match in the playlist gets played = 0, meaning that the song was not played on that radio on that day

**Things to be noticed:**   
* **distinct** - removes all duplicates and keeps only unique rows
* **groupby** - equivalent of SQL GROUP BY, groups the data by one or more columns
* **agg** - applies one or more aggregate functions to the grouped data; note that the count used here is the aggregate function from pyspark.sql.functions, not the dataframe's count() method
* **where** - equivalent of the SQL WHERE clause, used to filter data; here a Python f-string inserts the threshold value into the condition
* **crossJoin** - creates the Cartesian product of two dataframes
* **join** - joins two dataframes; note that we can specify multiple join conditions as well as the join type
* **concat_ws** - concatenates multiple columns with a specified delimiter
* **date_format** - converts a date or datetime column to a given format, or simply extracts a part of the date
* **coalesce** - equivalent of the SQL COALESCE function, returns the first non-null value among its arguments; here we also need lit() to provide the static fallback value
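
The combination of crossJoin, left join and coalesce can be hard to picture, so here is a small plain-Python sketch of the same idea on made-up data. It is an illustration under simplifying assumptions (dates as strings, no month_name column), not the Spark code itself, and `playlist_statistics` is a hypothetical helper.

```python
from itertools import product


def playlist_statistics(plays):
    """plays: list of (radio_name, date, artist, title) tuples, one per airing."""
    days = sorted({(radio, day) for radio, day, _, _ in plays})
    songs = sorted({(artist, title) for _, _, artist, title in plays})
    stats = []
    # Analogue of crossJoin: every (radio, day) paired with every song
    for (radio, day), (artist, title) in product(days, songs):
        matches = [p for p in plays if p == (radio, day, artist, title)]
        # Analogue of the left join: one output row per match,
        # or a single row with played = 0 when there is no match (coalesce)
        for _ in matches or [None]:
            stats.append((radio, day, f"{artist} - {title}", 1 if matches else 0))
    return stats


sample_plays = [
    ("RMFFM", "2020-12-30", "Abba", "Mamma Mia"),
    ("RMFFM", "2020-12-31", "Abba", "Mamma Mia"),
    ("RMFFM", "2020-12-31", "Abba", "Mamma Mia"),
    ("RMFFM", "2020-12-31", "Nickelback", "Lullaby"),
]
stats = playlist_statistics(sample_plays)
for row in stats:
    print(row)
```

Note that a song played twice on the same day produces two rows with played = 1, because a left join keeps every matching row from the playlist.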

In [0]:
from pyspark.sql.functions import input_file_name
from pyspark.sql.functions import lit, split, reverse, regexp_replace, count, concat_ws, coalesce
from pyspark.sql.functions import year, date_format, hour, to_date

df_unique_radio_and_day = df_playlist.select(df_playlist["radio_name"], df_playlist["date"]) \
                                     .distinct()

df_unique_songs = df_playlist.select(df_playlist["artist"], df_playlist["title"]) \
                                     .groupby(df_playlist["artist"], df_playlist["title"]) \
                                     .agg(count(lit(1)).alias("cnt")) \
                                     .where(f"cnt > {current_song_mininum_play_treshold}") \
                                     .select(df_playlist["artist"], df_playlist["title"])

df_cross_radio_day_songs = df_unique_radio_and_day.crossJoin(df_unique_songs)

df_playlist_statistics = df_cross_radio_day_songs.join(df_playlist, \
                                                       (df_playlist["radio_name"] == df_cross_radio_day_songs["radio_name"]) \
                                                       & (df_playlist["date"] == df_cross_radio_day_songs["date"]) \
                                                       & (df_playlist["artist"] == df_cross_radio_day_songs["artist"]) \
                                                       & (df_playlist["title"] == df_cross_radio_day_songs["title"]), \
                                                       how="left" \
                                                  ) \
                                                  .withColumn("artist_and_title", concat_ws(" - ", df_cross_radio_day_songs["artist"], df_cross_radio_day_songs["title"])) \
                                                  .withColumn("month_name", date_format(df_cross_radio_day_songs["date"], "MMMM")) \
                                                  .withColumn("played", coalesce(df_playlist["played"], lit(0))) \
                                                  .select( \
                                                      df_cross_radio_day_songs["radio_name"], \
                                                      "month_name",
                                                      "artist_and_title",
                                                      "played" \
                                                  )


#Check number of rows in the new dataset with all statistics

In [0]:
df_playlist_statistics.count()

#Display 5 sample rows to check the new dataframe

In [0]:
display(
  df_playlist_statistics.head(5)
)

radio_name,month_name,artist_and_title,played
RMFFM,January,#razemrobimydobro - Razem,0
RMFFM,January,2+1 - Chodź Pomaluj Mój Świat,0
RMFFM,January,Abba - Mamma Mia,1
RMFFM,January,Abc - The Look Of Love,0
RMFFM,January,Ac/dc - Thunderstruck,0


#Export new dataframe to the analytics layer

In [0]:
output_file = "abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/analytics/playlist_statistics.parquet"
df_playlist_statistics.write.mode('overwrite') \
                 .parquet(output_file)

#Finish notebook execution and report success

**Things to be noticed:**   
* **dbutils.notebook.exit** - this command returns a value from the notebook, which is useful when a job executes many notebooks or when one notebook runs another
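
A calling notebook could use the returned value to decide whether to continue. The sketch below is one possible pattern, not something taken from this notebook: `run_notebooks` is a hypothetical helper, and the runner is passed in as a parameter so the logic can be tried outside Databricks. On Databricks the runner would typically be `dbutils.notebook.run`, which takes a path, a timeout in seconds and a dictionary of arguments and returns the exit value as a string.

```python
from typing import Callable, Dict, List


def run_notebooks(
    run_notebook: Callable[[str, int, Dict[str, str]], str],
    notebook_paths: List[str],
    arguments: Dict[str, str],
    timeout_seconds: int = 3600,
) -> None:
    """Run notebooks in order and stop at the first one that does not report success."""
    for path in notebook_paths:
        result = run_notebook(path, timeout_seconds, arguments)
        if result != "success":
            raise RuntimeError(f"Notebook {path} returned {result!r}")


# Stand-in runner for local experiments; the notebook paths are made up
def fake_runner(path: str, timeout_seconds: int, arguments: Dict[str, str]) -> str:
    return "failed" if path == "/demo/broken" else "success"


run_notebooks(fake_runner, ["/demo/prepare", "/demo/analyse"], {"radio_name": "RMFFM"})
print("all notebooks reported success")
```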

In [0]:
dbutils.notebook.exit("success")

success