# groupBy():
                Syntax: RDD.groupBy(<function>, numPartitions=None)

The groupBy() transformation returns an RDD of items grouped by a specified function. The
<function> argument is an anonymous or named function that computes the grouping key for each
element. It can simply nominate part of the element as the key, or it can evaluate an expression
against the element, such as testing whether a numeric field in the data is odd or even.
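
To make the odd/even case concrete, here is a minimal plain-Python sketch of the grouping semantics. It does not use Spark: `group_by` is a hypothetical helper that only models how the key function decides which group each element joins, and the sample numbers are made up for illustration.

```python
from collections import defaultdict

def group_by(items, key_func):
    """Plain-Python model of groupBy(): bucket each item under key_func(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_func(item)].append(item)
    return dict(groups)

numbers = [1, 2, 3, 4, 5, 6]  # made-up sample data

# The key function evaluates an expression against each element.
parity_groups = group_by(numbers, lambda n: "even" if n % 2 == 0 else "odd")
print(parity_groups)
```

On an actual RDD, the same key function would be passed directly, as in `rdd.groupBy(lambda n: "even" if n % 2 == 0 else "odd")`.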
    
    
You can use the numPartitions argument to set the number of partitions in the resulting RDD. Spark assigns each group to a partition by hashing the key returned by the grouping function. For instance, if you want to group an RDD by the days in a week and process the days in parallel, specify numPartitions=7. Because keys are assigned by hash, this setting alone does not guarantee exactly one day per partition. You will see numPartitions specified in numerous Spark transformations, where its behavior is analogous.
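
The hashing step can be pictured with a simplified plain-Python model, assuming a partition index is computed as the key's hash modulo the partition count. The `partition_for` helper and the integer day numbers are hypothetical, chosen only because small integers hash to themselves in Python, which keeps the sketch deterministic.

```python
def partition_for(key, num_partitions):
    """Simplified model of hash partitioning: hash the key, then take the modulo."""
    return hash(key) % num_partitions

day_numbers = range(7)  # hypothetical keys: 0 = Monday ... 6 = Sunday

# With 7 partitions, these small integer keys each land in a distinct partition.
seven = {day: partition_for(day, 7) for day in day_numbers}

# With fewer partitions, several keys necessarily share a partition.
three = {day: partition_for(day, 3) for day in day_numbers}

print(seven)
print(three)
```

Keys that are not small integers, such as day names, can hash to the same partition index even when numPartitions matches the number of distinct keys, so it is safer not to assume one key per partition.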
    
    
Listing 4.15 demonstrates the use of the groupBy() function. Notice that each element of the RDD
returned by groupBy() pairs a key with an iterable object holding the values for that key; we will
look at how to handle this type of object later in this chapter.

In [3]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

In [4]:
sc = SparkContext('local')
spark = SparkSession(sc)

In [6]:
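# Read the license files, split each line into words, and drop empty strings
licenses = sc.textFile('file:///opt/Spark/licenses')
words = licenses.flatMap(lambda x: x.split(' ')).filter(lambda x: len(x) > 0)
# Group the words by their lowercased first letter
groupedbyfirstletter = words.groupBy(lambda x: x[0].lower())
# Fetch one (key, ResultIterable) pair
groupedbyfirstletter.take(1)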

[('s', <pyspark.resultiterable.ResultIterable at 0x7fbfa920c9d0>)]

# Consider Other Functions for Grouping Data
If your ultimate intention in using groupBy() is to aggregate values, such as when performing
a sum() or count() operation, you should opt for more efficient operators for this purpose
in Spark, including aggregateByKey() and reduceByKey(), which we will discuss
shortly. The groupBy() transformation does not perform any aggregation prior to shuffling
data, resulting in more data being shuffled. Furthermore, groupBy() requires that all values
for a given key fit into memory. The groupBy() transformation is useful in some cases, but
you should consider these factors before deciding to use this function.
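
The difference in shuffled data can be illustrated with a small plain-Python model. This is a sketch under stated assumptions, not Spark's implementation: the two partitions of (key, value) records are made up, and `group_then_sum` and `combine_then_sum` are hypothetical helpers that count how many records would cross the shuffle in each approach.

```python
from collections import defaultdict

# Hypothetical (key, value) records, already spread over two partitions.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def group_then_sum(parts):
    """Model of group-then-aggregate: every record crosses the shuffle."""
    shuffled = [record for part in parts for record in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

def combine_then_sum(parts):
    """Model of reduceByKey(): sum within each partition before shuffling."""
    shuffled = []
    for part in parts:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)

grouped_totals, grouped_shuffled = group_then_sum(partitions)
combined_totals, combined_shuffled = combine_then_sum(partitions)
print(grouped_totals, grouped_shuffled)
print(combined_totals, combined_shuffled)
```

Both approaches produce the same totals, but in this model the pre-aggregating version shuffles four records instead of seven. The gap widens as the number of values per key grows.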