# Maps

In Spark, a map takes data as input and transforms each element
with whatever function you pass to it. Maps are like directions for the data,
telling Spark how each input should get to the output.
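
As a rough illustration of the idea, here is a minimal plain-Python sketch (no Spark involved) that applies one function to every element, first with an explicit loop and then with Python's built-in map(). The three-song list is a shortened stand-in for the log used later in this notebook, and the variable names lowered_with_loop and lowered_with_map are made up for this example.

```python
# Plain-Python analogy only: this does not use Spark.
log_of_songs = ["Despacito", "Nice for what", "Havana"]


def convert_song_to_lowercase(song):
    return song.lower()


# Explicit loop: apply the function to each input, one at a time.
lowered_with_loop = []
for song in log_of_songs:
    lowered_with_loop.append(convert_song_to_lowercase(song))

# Built-in map: same result, without writing the loop by hand.
lowered_with_map = list(map(convert_song_to_lowercase, log_of_songs))

print(lowered_with_map)   # ['despacito', 'nice for what', 'havana']
print(log_of_songs)       # the input list is unchanged
```

Spark's map step follows the same pattern, except that the work can be spread across the machines in a cluster.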


The first code cells locate the Spark installation and create a SparkContext.
With the SparkContext, you can load a dataset and parallelize it across a cluster.
(Since you are currently using Spark in local mode on a single machine,
the dataset isn't technically distributed yet.)

Run the code cells below to instantiate a SparkContext object
and then read the log_of_songs list into Spark.

In [7]:
# !pip install findspark
import findspark
findspark.init('D:/Spark/spark-3.2.1-bin-hadoop3.2')
import pyspark
findspark.find()

'D:/Spark/spark-3.2.1-bin-hadoop3.2'

In [10]:
sc = pyspark.SparkContext.getOrCreate()

log_of_songs = [
    "Despacito",
    "Nice for what",
    "No tears left to cry",
    "Despacito",
    "Havana",
    "In my feelings",
    "Nice for what",
    "despacito",
    "All the stars"
]

# parallelize the log_of_songs to use with Spark
distributed_song_log = sc.parallelize(log_of_songs)


This next code cell defines a function that converts a song title to lowercase.
It then shows an example that converts the word "Havana" to "havana".

In [12]:
def convert_song_to_lowercase(song):
    return song.lower()

convert_song_to_lowercase("Havana")

'havana'

The following code cells demonstrate how to apply this function using a map step.
The map step goes through each song in the list and applies the convert_song_to_lowercase()
function.

In [14]:
distributed_song_log.map(convert_song_to_lowercase).collect()

['despacito',
 'nice for what',
 'no tears left to cry',
 'despacito',
 'havana',
 'in my feelings',
 'nice for what',
 'despacito',
 'all the stars']

Note as well that Spark does not change the original dataset. The map step returns a new
RDD containing the transformed values and leaves the original untouched.
You can see this by running collect() on the original dataset.

In [16]:
distributed_song_log.collect()

['Despacito',
 'Nice for what',
 'No tears left to cry',
 'Despacito',
 'Havana',
 'In my feelings',
 'Nice for what',
 'despacito',
 'All the stars']

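You do not have to define a named function for the map step. You can also pass an
anonymous (lambda) function, which gives the same result.

In [17]: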
distributed_song_log.map(lambda song: song.lower()).collect()

['despacito',
 'nice for what',
 'no tears left to cry',
 'despacito',
 'havana',
 'in my feelings',
 'nice for what',
 'despacito',
 'all the stars']
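
The named function and the lambda above produce the same output because both are just callables that take one song and return its lowercase form. The sketch below checks that equivalence in plain Python (again without Spark), and also tries the built-in str.lower method as a third option. The shortened list and the names with_named_function, with_lambda, and with_builtin are illustrative assumptions, not part of the original notebook.

```python
# Plain-Python check that three ways of writing the function agree.
songs = ["Despacito", "Havana", "despacito"]


def convert_song_to_lowercase(song):
    return song.lower()


with_named_function = list(map(convert_song_to_lowercase, songs))
with_lambda = list(map(lambda song: song.lower(), songs))
with_builtin = list(map(str.lower, songs))  # built-in method used directly

print(with_named_function)  # ['despacito', 'havana', 'despacito']
print(with_named_function == with_lambda == with_builtin)  # True
```

Which form to use is mostly a matter of readability: a named function is easier to reuse and test, while a lambda keeps a short one-off transformation next to the map call.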