#Notebook Description
**Author**: Slawomir Drzymala

**Description:**   
This notebook reads data from the raw layer of the data lake, prepares a dataset that can be used in further analysis, and writes it to the curated layer.

#Set up connection to data lake on Azure

**Things to be noticed:**   
* **sensitive data alert** - storing keys or any other sensitive data in a notebook is not recommended; it is done here only to keep the demo code simple. For real work please use Azure Key Vault or Databricks secrets.
* **multiple ways to connect to Azure data lake** - we can authenticate with the storage account access key or with a service principal, and we can also mount the storage account so that it is visible from many notebooks. Please see the links below for more details.
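
As a hedged sketch of the recommendation above, the access key can be read from a Databricks secret scope instead of being pasted into the notebook. The scope name `demo-scope` and the secret name `storage-account-key` are hypothetical placeholders. `spark` and `dbutils` are the objects Databricks provides in a notebook; they are passed in as parameters so the helper has no hidden dependencies.

```python
def set_account_key_from_secret(spark, dbutils, storage_account, scope, key_name):
    """Configure access-key authentication for ADLS Gen2 without hard-coding the key."""
    # dbutils.secrets.get returns the secret value stored under the given scope and key.
    access_key = dbutils.secrets.get(scope=scope, key=key_name)
    conf_name = f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"
    spark.conf.set(conf_name, access_key)
    return conf_name


# In a Databricks notebook (scope and secret names are assumptions):
# set_account_key_from_secret(spark, dbutils, "sdsalearnsthnew", "demo-scope", "storage-account-key")
```

The helper returns only the configuration name, never the key itself, so nothing sensitive ends up in the cell output.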

In [0]:
# see https://docs.databricks.com/_static/notebooks/data-import/azure-data-lake-store.html
# see https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/azure-datalake-gen2-get-started
spark.conf.set(
  "fs.azure.account.key.sdsalearnsthnew.dfs.core.windows.net", 
  "RJMELuc9ffZPf5D0gwcbxJp+hWTkQuW8lmWa1DRFSF59aDiatDsMJ6X/yC/dHZtB7kdGl3cJIrYry++6EnCb5g==" 
)

#Check connection to Azure data lake and list files in folder

**Things to be noticed:**   
* **display** - display is the magic Databricks function that can be used for visualization of many different objects including spark or pandas dataframes
* **dbutils.fs.ls** - dbutils.fs.ls is the Databricks utility function that can be used to list the files in the local environment or in the connected storage accounts, here the Azure data lake
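
The entries returned by `dbutils.fs.ls` expose `path`, `name` and `size` attributes, so the listing can also be processed in code rather than only displayed. The helper below is a minimal sketch; the name `summarize_listing` is made up for illustration.

```python
def summarize_listing(entries, suffix=".json"):
    """Return (file_count, total_size_in_bytes) for entries whose name ends with suffix."""
    matching = [entry for entry in entries if entry.name.endswith(suffix)]
    return len(matching), sum(entry.size for entry in matching)


# In a Databricks notebook, with file_path pointing at a folder in the data lake:
# n_files, total_bytes = summarize_listing(dbutils.fs.ls(file_path))
```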

In [0]:
file_path = "abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/"

display(
  dbutils.fs.ls(file_path)
)

path,name,size
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2010.json,ZET_2010.json,12440846
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2011.json,ZET_2011.json,12472076
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2012.json,ZET_2012.json,11404607
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2013.json,ZET_2013.json,12488122
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2014.json,ZET_2014.json,11862115
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2015.json,ZET_2015.json,12890685
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2016.json,ZET_2016.json,13407331
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2017.json,ZET_2017.json,12803766
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2018.json,ZET_2018.json,13939163
abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2019.json,ZET_2019.json,14007212


#Read files from Azure data lake

**Things to be noticed:**   
* **data_schema** - Databricks will try to infer the schema from the files themselves, but we can also specify a custom schema for the files that we want to load. An explicit schema makes sure that all of the data is typed as intended, and any column that is present in the file but missing from the schema is ignored during the load
* **wildcards in the path** - please note that we can use wildcards in any part of the path, which allows us to load many files or folders that match the provided path template
* **.json()** - we are going to read JSON files, so we use the json function; other file types need different functions
* **encoding** - setting the encoding makes sure that Unicode special characters are decoded correctly
* **spark dataframe vs pandas dataframe** - please note that the resulting dataframe is a Spark dataframe, not a pandas dataframe. The differences are described here: https://towardsdatascience.com/parallelize-pandas-dataframe-computations-w-spark-dataframe-bba4c924487c and here: https://medium.com/@chris_bour/6-differences-between-pandas-and-spark-dataframes-1380cec394d2
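
To make the wildcard behaviour concrete, the sketch below checks which paths match a template such as `raw-initial/*/*.json`. It is a simplified plain-Python model that matches one `/`-separated segment at a time, not Spark's actual path resolution, and the helper name `matches_path_template` is an assumption made for this example.

```python
from fnmatch import fnmatchcase


def matches_path_template(path, template):
    """Match path against template one '/'-separated segment at a time."""
    path_parts = path.split("/")
    template_parts = template.split("/")
    if len(path_parts) != len(template_parts):
        return False
    return all(fnmatchcase(p, t) for p, t in zip(path_parts, template_parts))


template = "raw-initial/*/*.json"
print(matches_path_template("raw-initial/zet/ZET_2010.json", template))  # True
print(matches_path_template("raw-initial/zet/ZET_2010.csv", template))   # False
print(matches_path_template("raw-initial/ZET_2010.json", template))      # False
```

The last call returns False because the template expects exactly one folder level between `raw-initial` and the file name.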

In [0]:
from pyspark.sql.types import StructType, StructField, TimestampType, StringType

data_schema = [
                StructField('datetime', TimestampType(), True), 
                StructField('artist', StringType(), True),
                StructField('title', StringType(), True)
 ]
final_struc = StructType(fields=data_schema)

# read all files from all radio stations
file_path = "abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/*/*.json"
df_playlist = spark.read.option('encoding', 'UTF-8').json(file_path, multiLine=True, schema=final_struc)


#Check schema of the created dataframe

**Things to be noticed:**   
* **df.printSchema()** - is the function of the dataframe that can be used to display the schema - columns and types - of the dataframe

In [0]:
df_playlist.printSchema()

#Check number of rows

**Things to be noticed:**   
* **df.count()** - is the function of the dataframe that we can use to get the number of rows in the dataframe; please note that Python's len(), which works on a pandas dataframe, won't work here

In [0]:
df_playlist.count()

#Display 5 sample rows

**Things to be noticed:**   
* **display** - display is the magic Databricks function that can be used for visualization of many different objects including spark or pandas dataframes

In [0]:
display(
  df_playlist.head(5)
)

datetime,artist,title
2019-12-31T00:00:00.000+0000,Peja/slums Attack,Szacunek Ludzi Ulicy (Explicit)
2019-12-31T00:05:00.000+0000,Tymek/tede,Rainman (Explicit)
2019-12-31T00:08:00.000+0000,Taconafide,Metallica 808 (Explicit)
2019-12-31T00:12:00.000+0000,Nautilus,Blat
2019-12-31T00:16:00.000+0000,Young Multi,Jeden Dzien (Explicit)


# Manipulate dataframe, add basic new columns
**New columns**
* **year** - year derived from the datetime time stamp of each row
* **radio_name** - the name of the radio that is derived from the path of the file
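
The derivation of `radio_name` can be followed step by step in plain Python. This is only an illustration of the string logic, under the assumption that file names follow the `<RADIO>_<YEAR>.json` pattern seen in the listing above. It is not the Spark code itself, and unlike a bare `".json"` pattern it escapes the dot and anchors the match to the end of the name.

```python
import re


def radio_name_from_path(file_path):
    """Take the last path segment, drop the '.json' suffix, keep the part before '_'."""
    file_name = file_path.split("/")[-1]       # like reverse(split(path, "/"))[0]
    stem = re.sub(r"\.json$", "", file_name)   # like regexp_replace, with the dot escaped
    return stem.split("_")[0]                  # like split(..., "_")[0]


path = "abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/raw-initial/zet/ZET_2010.json"
print(radio_name_from_path(path))  # ZET
```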

**Things to be noticed:**   
* **select** - similar to sql select, selects the columns for further use
* **input_file_name()** - is the special function that returns the path of the file that a particular row comes from
* **withColumn** - function withColumn can be used to add a new column to an existing dataframe
* **attribute names** - columns can be referenced in several ways; here we use the dataframe name with the column name, like dataframe["columnname"], when the column exists in the original dataframe, and only "columnname" when the column has been derived. There are more options, though
* **replace dataframe** - please also note that the dataframe is assigned back to the same name and will be "overwritten"
* **list of available functions** - the list of all available functions can be found here: https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html
* **lazy evaluation aka transformations vs actions** - in Databricks, or Spark to be precise, all transformations are evaluated lazily, which means that they are not executed until we call an action. That is why the command below "executes" in just a second, while .count() and other actions take longer: only they trigger the actual computation. Lazy evaluation also lets Spark optimize the whole chain of transformations before running it
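
Lazy evaluation can be illustrated with a plain-Python analogy. The generator below plays the role of a transformation and `list()` plays the role of an action. This is only an analogy built on Python generators, with made-up sample titles, and says nothing about how Spark plans or optimizes a query.

```python
executed = []


def to_upper(title):
    executed.append(title)  # record that real work happened
    return title.upper()


titles = ["blat", "rainman", "palac"]

# "Transformation": building the generator only describes the work.
pipeline = (to_upper(t) for t in titles)
print(len(executed))  # 0

# "Action": consuming the generator triggers the work.
result = list(pipeline)
print(len(executed))  # 3
```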

In [0]:
from pyspark.sql.functions import input_file_name
from pyspark.sql.functions import lit, split, reverse, regexp_replace, count, concat_ws
from pyspark.sql.functions import year, date_format, hour, to_date

# all functions -> https://spark.apache.org/docs/latest/api/python/_modules/pyspark/sql/functions.html

df_playlist = df_playlist.select(df_playlist["artist"], df_playlist["datetime"], df_playlist["title"]) \
                         .withColumn("year", year(df_playlist["datetime"])) \
                         .withColumn("radio_name", split(regexp_replace(reverse(split(input_file_name(), "/"))[0], r"\.json$", ""), "_")[0]) \
                         .select( \
                            "radio_name", \
                            "year", \
                            df_playlist["datetime"], \
                            df_playlist["artist"], df_playlist["title"] \
                          )

#Display 5 sample rows

In [0]:
display(
  df_playlist.head(5)
)

radio_name,year,datetime,artist,title
Eska,2019,2019-12-31T00:00:00.000+0000,Peja/slums Attack,Szacunek Ludzi Ulicy (Explicit)
Eska,2019,2019-12-31T00:05:00.000+0000,Tymek/tede,Rainman (Explicit)
Eska,2019,2019-12-31T00:08:00.000+0000,Taconafide,Metallica 808 (Explicit)
Eska,2019,2019-12-31T00:12:00.000+0000,Nautilus,Blat
Eska,2019,2019-12-31T00:16:00.000+0000,Young Multi,Jeden Dzien (Explicit)
Eska,2019,2019-12-31T00:19:00.000+0000,Kubi Producent/otsoch/schafter/planbe,9 Zyc (Explicit)
Eska,2019,2019-12-31T00:23:00.000+0000,Grubson,Naprawimy To (Explicit)
Eska,2019,2019-12-31T00:28:00.000+0000,Mata,Patointeligencja (Explicit)
Eska,2019,2019-12-31T00:32:00.000+0000,White 2115,Palac (Explicit)
Eska,2019,2019-12-31T00:34:00.000+0000,Keke,Wyjebane Tak Mocno (Explicit)


# Aggregate dataframe, count rows for each radio

**New dataframe**
* **df_stats** - dataframe with statistics about number of rows for each radio

**New columns**
* **cnt** - number of rows for each group, here for each radio

**Things to be noticed:**   
* **lit(x)** - the lit function creates a column that holds the same constant value in every row; please note that many functions expect a column, so providing the bare value without lit can raise an exception
* **groupBy** - equivalent of the SQL GROUP BY, used to group the data by one or many columns
* **agg** - function that applies one or more aggregate functions to each group; please note that the count used here is the aggregate function from pyspark.sql.functions, not the count() method of the dataframe
* **alias** - with alias we can rename a column
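
What the aggregation computes can be shown in plain Python: one counter per `radio_name`, incremented once for every row. The sample rows below are made up for illustration, and `collections.Counter` merely stands in for the grouped count.

```python
from collections import Counter

# Made-up sample rows; only radio_name matters for the count.
rows = [
    {"radio_name": "Eska", "title": "song A"},
    {"radio_name": "Eska", "title": "song B"},
    {"radio_name": "ZET", "title": "song C"},
]

cnt_by_radio = Counter(row["radio_name"] for row in rows)
print(dict(cnt_by_radio))  # {'Eska': 2, 'ZET': 1}
```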

In [0]:
df_stats = df_playlist.select("radio_name", "datetime", "title", "artist") \
               .groupBy("radio_name") \
               .agg(count(lit(1)).alias("cnt"))

#Visualizations in databricks notebook

**Things to be noticed:**   
* **display** - display is the magic Databricks function that can be used for visualization of many different objects including spark or pandas dataframes
* **visualizations** - these are the visualization tools built into the Databricks notebook, so no additional Python packages are needed. By default the display() function shows a grid with the data, but we can switch to a chart and change its layout using the controls below the chart or grid

In [0]:
display(
  df_stats
)

radio_name,cnt
Eska,1360400
Antyradio,991631
RMFFM,1180904
ZET,1081730


#Export new dataframe to the - curated - layer

**Things to be noticed:**   
* **write** - saves the data to the destination; please note that further on we use the .parquet() function to specify the format of the output files
* **partitionBy()** - partitionBy partitions the dataframe when it is saved to the data lake, which means that Databricks automatically writes the data into multiple directories according to the columns specified in the partitionBy function. Here the data is split into a /radio_name=.../year=... folder structure. Please also note that the columns from the partitionBy function are removed from the target files, because their values are stored in the directory names
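
The folder layout described above can be sketched with a small helper that builds the directory for one combination of partition values. The helper name `partition_directory` is hypothetical, and the names of the Parquet files inside each directory are chosen by Spark, so they are not modelled here.

```python
def partition_directory(base_path, partition_values):
    """Build the column=value directory path for one combination of partition values."""
    parts = [f"{column}={value}" for column, value in partition_values.items()]
    return base_path.rstrip("/") + "/" + "/".join(parts)


print(partition_directory("curated-initial/", {"radio_name": "Eska", "year": 2019}))
# curated-initial/radio_name=Eska/year=2019
```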

In [0]:
output_directory = "abfss://learnsthnew@sdsalearnsthnew.dfs.core.windows.net/curated-initial/"
df_playlist.write.mode('overwrite') \
                 .partitionBy("radio_name", "year") \
                 .parquet(output_directory)

#Finish notebook execution and report success

**Things to be noticed:**   
* **dbutils.notebook.exit** - this Databricks utility function can be used to return a value from the notebook, which is very useful when a job executes many notebooks or when one notebook executes another notebook
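
On the calling side, the returned value can be checked with `dbutils.notebook.run`, which returns the string that the child notebook passed to `dbutils.notebook.exit`. The sketch below assumes a hypothetical notebook path and a 600-second timeout; `dbutils` is passed in as a parameter so the helper can be read on its own.

```python
def run_notebook_and_check(dbutils, notebook_path, timeout_seconds=600, expected="success"):
    """Run a child notebook and fail loudly unless it exits with the expected value."""
    # The third argument is the map of parameters passed to the child notebook.
    result = dbutils.notebook.run(notebook_path, timeout_seconds, {})
    if result != expected:
        raise RuntimeError(f"{notebook_path} returned {result!r}, expected {expected!r}")
    return result


# In a calling notebook (the path below is a hypothetical placeholder):
# run_notebook_and_check(dbutils, "./prepare_curated_layer")
```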

In [0]:
dbutils.notebook.exit("success")