# map():
    Syntax: RDD.map(<function>, preservesPartitioning=False)
     
The map() transformation is the most basic of all transformations. It evaluates a named or
anonymous function for each element within a dataset partition. Many map() operations can
run asynchronously and in parallel because the function should not produce side effects,
maintain state, or attempt to communicate or synchronize with other map() operations. That is,
they are shared-nothing operations.

The preservesPartitioning argument is an optional Boolean argument intended for use
with RDDs that have a partitioner defined, typically a key/value pair RDD (as discussed later in
this chapter) whose records are grouped into partitions by a key hash or key range. Setting this
parameter to True indicates that the function leaves the keys unchanged, so the resulting RDD
retains the existing partitioning. The Spark scheduler can use this information to optimize
subsequent operations, such as joins based on the partitioned key.
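
As a minimal sketch of how this argument might be used, the snippet below changes only the values of a key/value pair RDD and compares the partitioner before and after. The sample pairs and the helper names `double_value` and `compare_partitioners` are hypothetical; `compare_partitioners` assumes it is given an existing SparkContext, and the exact comparison result may depend on the Spark version in use.

```python
def double_value(pair):
    """Change only the value of a (key, value) pair; the key is left untouched."""
    key, value = pair
    return (key, value * 2)


def compare_partitioners(sc):
    """Assumes `sc` is an existing SparkContext.

    Returns two Booleans: whether the default map() result and the
    preservesPartitioning=True result report the same partitioner as the parent.
    """
    # Hypothetical sample data, hash-partitioned by key into two partitions.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)]).partitionBy(2)

    default = pairs.map(double_value)
    preserved = pairs.map(double_value, preservesPartitioning=True)

    return (default.partitioner == pairs.partitioner,
            preserved.partitioner == pairs.partitioner)
```

Only set the argument to True when the function genuinely leaves the keys alone, as `double_value` does here; Spark does not verify that claim for you.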

Consider Figure 4.7, where the map() transformation evaluates a function for each input record
and emits a transformed output record. In this case, the split function takes a string and produces
a list, and each string element in the input data maps to a list element in the output. The result, in
this case, is a list of lists.
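
The list-of-lists result can be sketched as follows. The sample strings and the names `split_line` and `map_on_spark` are illustrative assumptions rather than the contents of Figure 4.7. `local_result` mimics the per-record behavior of map() on a plain Python list, and `map_on_spark` applies the same function to an RDD, assuming an existing SparkContext is passed in.

```python
def split_line(line):
    """Turn one input string into a list of words."""
    return line.split(' ')


# Hypothetical sample records (not taken from Figure 4.7).
sample = ["the quick brown", "fox jumps"]

# Plain-Python analogue of map(): exactly one output element per input element,
# so two input strings produce two lists.
local_result = [split_line(line) for line in sample]


def map_on_spark(sc, records):
    """Apply the same function with RDD.map(); `sc` is an existing SparkContext."""
    return sc.parallelize(records).map(split_line).collect()
```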



# flatMap():
    Syntax: RDD.flatMap(<function>, preservesPartitioning=False)

The flatMap() transformation is similar to the map() transformation in that it runs a function
against each record in the input dataset. However, flatMap() "flattens" the output, meaning it
removes a level of nesting. For example, flattening a list that contains lists of strings
results in a single list of strings. Figure 4.8 shows the effect of a flatMap() transformation
using the same anonymous (lambda) function as the map() operation shown in Figure 4.7. Notice
that instead of each string producing its own list object, all elements are flattened into one
list. In other words, flatMap(), in this case, produces one combined list as output, in contrast
to the list of lists in the map() example.
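
To make the contrast concrete, here is a small sketch under the same assumptions as before: the sample strings and the names `split_line` and `flat_map_on_spark` are hypothetical and are not taken from Figure 4.8. The plain-Python lines show the removal of one level of nesting, and `flat_map_on_spark` shows the corresponding RDD call for an existing SparkContext.

```python
from itertools import chain


def split_line(line):
    """Turn one input string into a list of words."""
    return line.split(' ')


# Hypothetical sample records (not taken from Figure 4.8).
sample = ["the quick brown", "fox jumps"]

# map-like result: one list per input string (a list of lists).
nested = [split_line(line) for line in sample]

# flatMap-like result: the per-record lists are concatenated into one list.
flattened = list(chain.from_iterable(nested))


def flat_map_on_spark(sc, records):
    """Apply the same function with RDD.flatMap(); `sc` is an existing SparkContext."""
    return sc.parallelize(records).flatMap(split_line).collect()
```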

# filter():
    Syntax: RDD.filter(<function>)

The filter() transformation evaluates a Boolean expression, usually expressed as an anonymous
function, against each element in the dataset. The Boolean value returned determines whether
the record is included in the resulting RDD. It is another common transformation, used to
remove records that are not required for intermediate processing or in the final output.
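
A minimal sketch of this behavior follows. The sample words, the length threshold, and the names `is_long_word` and `filter_on_spark` are assumptions made for illustration. The list comprehension mirrors what filter() does to each record, and `filter_on_spark` applies the same predicate to an RDD, given an existing SparkContext.

```python
def is_long_word(word, min_length=5):
    """Predicate: True keeps the record, False drops it."""
    return len(word) >= min_length


# Hypothetical sample records.
sample_words = ["spark", "rdd", "filter", "map"]

# Plain-Python analogue of filter(): keep only records where the predicate is True.
local_result = [word for word in sample_words if is_long_word(word)]


def filter_on_spark(sc, words):
    """Apply the same predicate with RDD.filter(); `sc` is an existing SparkContext."""
    return sc.parallelize(words).filter(is_long_word).collect()
```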

Listing 4.13 shows an example of using the map(), flatMap(), and filter() transformations
together. It uses flatMap() to split the input text into a combined list of words, map() to
convert each word to lowercase, and then filter() to return only words that are longer than
12 characters.

In [1]:
#Importing dependencies
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

In [2]:
sc = SparkContext("local")
spark = SparkSession(sc)

In [3]:
licenses = sc.textFile('file:///opt/Spark/licenses')
#Windows users can uncomment the line below instead
# licenses = sc.textFile("file:///C:\\Spark\\licenses")
words = licenses.flatMap(lambda x: x.split(' '))  # split each line into words and flatten
words.take(10)

['Copyright',
 '©',
 '2015',
 'The',
 'University',
 'of',
 'Tennessee.',
 'All',
 'rights',
 'reserved.']

In [9]:
lowercase = words.map(lambda x: x.lower())
 

In [12]:
lowercase.take(15)

['copyright',
 '©',
 '2015',
 'the',
 'university',
 'of',
 'tennessee.',
 'all',
 'rights',
 'reserved.',
 '',
 'redistribution',
 'and',
 'use',
 'in']

In [15]:
long_words = lowercase.filter(lambda x: len(x) > 12)  # keep words longer than 12 characters
long_words.take(10)

['redistribution',
 'modification,',
 'redistributions',
 'redistributions',
 'documentation',
 'distribution.',
 'merchantability',
 'consequential',
 'interruption)',
 'documentation']

There is a standard axiom in the world of Big Data programming: "Filter early, filter often." It
reflects the fact that there is no value in carrying records or fields through a process where
they are not needed. Both filter() and map() can be used to achieve this objective: filter()
drops unneeded records, and map() can drop unneeded fields. That said, in many cases Spark,
through its key runtime characteristic of lazy execution, attempts to optimize routines for you
even if you do not explicitly do this yourself.
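
The axiom can be illustrated with a short sketch. The record layout (name, country, age, notes), the sample rows, and the names `is_us`, `name_and_age`, and `filter_early_on_spark` are all hypothetical. The idea is to drop unneeded records with filter() first and then drop unneeded fields with map(), so that later steps handle less data.

```python
# Hypothetical records: (name, country, age, free-text notes).
records = [
    ("alice", "US", 34, "long free-text note"),
    ("bob", "DE", 41, "another long note"),
    ("carol", "US", 29, "yet another note"),
]


def is_us(record):
    """Keep only the records needed downstream."""
    return record[1] == "US"


def name_and_age(record):
    """Keep only the fields needed downstream."""
    return (record[0], record[2])


# Plain-Python analogue: filter the records, then project the fields.
local_result = [name_and_age(r) for r in records if is_us(r)]


def filter_early_on_spark(sc, rows):
    """Same pipeline on an RDD; `sc` is an existing SparkContext."""
    return sc.parallelize(rows).filter(is_us).map(name_and_age).collect()
```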