### PySpark Groupby Explained with Example

* **Syntax**: `DataFrame.groupBy(*cols)`

Calling groupBy() on a PySpark DataFrame returns a GroupedData object, which provides the following aggregate functions:

count(), mean(), max(), min(), sum(), avg(), agg()

In [0]:
import pyspark
from pyspark.sql import SparkSession

# Name the session `spark`, because the cells below call spark.createDataFrame()
spark = SparkSession.builder \
     .master('local[*]') \
     .appName('SParkByExample') \
     .getOrCreate()


In [0]:

simpleData = [("James","Sales","NY",90000,34,10000),
    ("Michael","Sales","NY",86000,56,20000),
    ("Robert","Sales","CA",81000,30,23000),
    ("Maria","Finance","CA",90000,24,23000),
    ("Raman","Finance","CA",99000,40,24000),
    ("Scott","Finance","NY",83000,36,19000),
    ("Jen","Finance","NY",79000,53,15000),
    ("Jeff","Marketing","CA",80000,25,18000),
    ("Kumar","Marketing","NY",91000,50,21000)
  ]

schema = ["employee_name","department","state","salary","age","bonus"]
df = spark.createDataFrame(data=simpleData, schema = schema)
df.printSchema()
df.show(truncate=False)


root
 |-- employee_name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- state: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- age: long (nullable = true)
 |-- bonus: long (nullable = true)

+-------------+----------+-----+------+---+-----+
|employee_name|department|state|salary|age|bonus|
+-------------+----------+-----+------+---+-----+
|James        |Sales     |NY   |90000 |34 |10000|
|Michael      |Sales     |NY   |86000 |56 |20000|
|Robert       |Sales     |CA   |81000 |30 |23000|
|Maria        |Finance   |CA   |90000 |24 |23000|
|Raman        |Finance   |CA   |99000 |40 |24000|
|Scott        |Finance   |NY   |83000 |36 |19000|
|Jen          |Finance   |NY   |79000 |53 |15000|
|Jeff         |Marketing |CA   |80000 |25 |18000|
|Kumar        |Marketing |NY   |91000 |50 |21000|
+-------------+----------+-----+------+---+-----+



Let’s run groupBy() on the department column of the DataFrame and then compute the total salary for each department with the sum() function.

In [0]:
df.groupBy('department').sum('salary').show()

+----------+-----------+
|department|sum(salary)|
+----------+-----------+
|     Sales|     257000|
|   Finance|     351000|
| Marketing|     171000|
+----------+-----------+
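To make the grouping semantics concrete, here is a plain-Python sketch (not PySpark) that recomputes the same per-department totals. The names `rows` and `salary_by_dept` are illustrative only and are not part of the notebook.

```python
from collections import defaultdict

# (department, salary) pairs taken from simpleData above
rows = [
    ("Sales", 90000), ("Sales", 86000), ("Sales", 81000),
    ("Finance", 90000), ("Finance", 99000), ("Finance", 83000), ("Finance", 79000),
    ("Marketing", 80000), ("Marketing", 91000),
]

# groupBy('department') buckets rows by key; sum('salary') folds each bucket
salary_by_dept = defaultdict(int)
for department, salary in rows:
    salary_by_dept[department] += salary

print(dict(salary_by_dept))
```

Spark does the same bucketing and folding, but distributed across partitions, which is why the order of the result rows is not guaranteed.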



In [0]:
df.groupBy('department').count().show()
df.groupBy('department').min('salary').show()
df.groupBy('department').max('salary').show()
df.groupBy('department').avg('salary').show()
df.groupBy('department').mean('salary').show()

+----------+-----+
|department|count|
+----------+-----+
|     Sales|    3|
|   Finance|    4|
| Marketing|    2|
+----------+-----+

+----------+-----------+
|department|min(salary)|
+----------+-----------+
|     Sales|      81000|
|   Finance|      79000|
| Marketing|      80000|
+----------+-----------+

+----------+-----------+
|department|max(salary)|
+----------+-----------+
|     Sales|      90000|
|   Finance|      99000|
| Marketing|      91000|
+----------+-----------+

+----------+-----------------+
|department|      avg(salary)|
+----------+-----------------+
|     Sales|85666.66666666667|
|   Finance|          87750.0|
| Marketing|          85500.0|
+----------+-----------------+

+----------+-----------------+
|department|      avg(salary)|
+----------+-----------------+
|     Sales|85666.66666666667|
|   Finance|          87750.0|
| Marketing|          85500.0|
+----------+-----------------+



#### Using multiple columns

In [0]:
from pyspark.sql.functions import col

In [0]:
df.groupBy('department', 'state') \
    .sum('salary', 'bonus') \
    .orderBy(col('state').desc()) \
    .show(truncate=False) 

+----------+-----+-----------+----------+
|department|state|sum(salary)|sum(bonus)|
+----------+-----+-----------+----------+
|Sales     |NY   |176000     |30000     |
|Finance   |NY   |162000     |34000     |
|Marketing |NY   |91000      |21000     |
|Sales     |CA   |81000      |23000     |
|Finance   |CA   |189000     |47000     |
|Marketing |CA   |80000      |18000     |
+----------+-----+-----------+----------+
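Grouping by several columns amounts to using a composite key. The plain-Python sketch below (not PySpark) reproduces the totals above with a `(department, state)` tuple as the key; the variable names are illustrative assumptions.

```python
from collections import defaultdict

# (department, state, salary, bonus) taken from simpleData above
rows = [
    ("Sales", "NY", 90000, 10000), ("Sales", "NY", 86000, 20000),
    ("Sales", "CA", 81000, 23000), ("Finance", "CA", 90000, 23000),
    ("Finance", "CA", 99000, 24000), ("Finance", "NY", 83000, 19000),
    ("Finance", "NY", 79000, 15000), ("Marketing", "CA", 80000, 18000),
    ("Marketing", "NY", 91000, 21000),
]

# Each (department, state) pair accumulates [sum(salary), sum(bonus)]
sums = defaultdict(lambda: [0, 0])
for department, state, salary, bonus in rows:
    sums[(department, state)][0] += salary
    sums[(department, state)][1] += bonus

# Mirror orderBy(col('state').desc()): sort by the state part of the key
ordered = sorted(sums.items(), key=lambda kv: kv[0][1], reverse=True)
for key, (sum_salary, sum_bonus) in ordered:
    print(key, sum_salary, sum_bonus)
```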



#### Running more aggregates at a time

In [0]:
from pyspark.sql.functions import sum,avg,max

df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
         avg("salary").alias("avg_salary"), \
         sum("bonus").alias("sum_bonus"), \
         max("bonus").alias("max_bonus") \
    ) \
    .show(truncate=False)


+----------+----------+-----------------+---------+---------+
|department|sum_salary|avg_salary       |sum_bonus|max_bonus|
+----------+----------+-----------------+---------+---------+
|Sales     |257000    |85666.66666666667|53000    |23000    |
|Finance   |351000    |87750.0          |81000    |24000    |
|Marketing |171000    |85500.0          |39000    |21000    |
+----------+----------+-----------------+---------+---------+



#### Using filter on aggregate data

In [0]:
from pyspark.sql.functions import sum,avg,max
df.groupBy("department") \
    .agg(sum("salary").alias("sum_salary"), \
      avg("salary").alias("avg_salary"), \
      sum("bonus").alias("sum_bonus"), \
      max("bonus").alias("max_bonus")) \
    .where(col("sum_bonus") >= 50000) \
    .show(truncate=False)


+----------+----------+-----------------+---------+---------+
|department|sum_salary|avg_salary       |sum_bonus|max_bonus|
+----------+----------+-----------------+---------+---------+
|Sales     |257000    |85666.66666666667|53000    |23000    |
|Finance   |351000    |87750.0          |81000    |24000    |
+----------+----------+-----------------+---------+---------+
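Filtering after agg() plays the role of SQL's HAVING clause: the condition is evaluated on the aggregated rows, not on the input rows. A plain-Python sketch (not PySpark) of the same two steps follows; `aggregated` and `filtered` are illustrative names.

```python
from collections import defaultdict

# (department, salary, bonus) taken from simpleData above
rows = [
    ("Sales", 90000, 10000), ("Sales", 86000, 20000), ("Sales", 81000, 23000),
    ("Finance", 90000, 23000), ("Finance", 99000, 24000),
    ("Finance", 83000, 19000), ("Finance", 79000, 15000),
    ("Marketing", 80000, 18000), ("Marketing", 91000, 21000),
]

# Step 1: group the rows by department
groups = defaultdict(list)
for dept, salary, bonus in rows:
    groups[dept].append((salary, bonus))

# Step 2: compute several aggregates per group, as agg() does
aggregated = {}
for dept, members in groups.items():
    salaries = [s for s, _ in members]
    bonuses = [b for _, b in members]
    aggregated[dept] = {
        "sum_salary": sum(salaries),
        "avg_salary": sum(salaries) / len(salaries),
        "sum_bonus": sum(bonuses),
        "max_bonus": max(bonuses),
    }

# Step 3: filter on an aggregated column, as where(col("sum_bonus") >= 50000) does
filtered = {d: a for d, a in aggregated.items() if a["sum_bonus"] >= 50000}
print(filtered)
```

Marketing is dropped because its bonus total (39000) falls below the threshold, even though its individual rows were all part of the aggregation.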



### PySpark Aggregate Functions with Examples

* **approx_count_distinct**: approx_count_distinct() returns the approximate number of distinct items in a group.

* **avg (average)**: avg() returns the average of the values in the input column.

* **collect_list**: It returns all values from an input column, including duplicates.

* **collect_set**: It returns all values from an input column with duplicate values eliminated.

* **countDistinct**: It returns the number of distinct elements in a column.

* **count**: It returns the number of non-null elements in a column.

* **first**: It returns the first element in a column; when ignorenulls is set to True, it returns the first non-null element.

* **last**: It returns the last element in a column; when ignorenulls is set to True, it returns the last non-null element.

* **max**: It returns the maximum value in a column.

* **min**: It returns the minimum value in a column.

* **mean**: It returns the average of the values in a column. This is an alias for avg().

* **sum**: It returns the sum of all values in a column.

* **sumDistinct**: It returns the sum of all distinct values in a column.

In [0]:
# avg is included in the import list because it is used below
from pyspark.sql.functions import approx_count_distinct, avg, collect_list, collect_set, countDistinct, count, first, last, mean, max, min, sum, sumDistinct
simpleData = [("James", "Sales", 3000),
    ("Michael", "Sales", 4600),
    ("Robert", "Sales", 4100),
    ("Maria", "Finance", 3000),
    ("James", "Sales", 3000),
    ("Scott", "Finance", 3300),
    ("Jen", "Finance", 3900),
    ("Jeff", "Marketing", 3000),
    ("Kumar", "Marketing", 2000),
    ("Saif", "Sales", 4100)
  ]
schema = ["employee_name", "department", "salary"]
df = spark.createDataFrame(data=simpleData, schema = schema)
# df.printSchema()
# df.show(truncate=False)

print("approx_count_distinct:", df.select(approx_count_distinct('salary')).collect())

print('avg:', df.select(avg('salary')).collect()[0][0])

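# show() prints the table itself and returns None, so each print() that wraps show() emits "<label> None" after the table
print('collect_list:', df.select(collect_list('salary')).show(truncate=False))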

print('collect_set:', df.select(collect_set('salary')).show(truncate=False))

print('countDistinct:', df.select(countDistinct('salary')).collect()[0][0])

print('count:', df.select(count('salary')).collect()[0])

print('first :', df.select(first('salary')).show(truncate=False))

print('last :', df.select(last('salary')).show(truncate=False))

print('max: ', df.select(max('salary')).show(truncate=False))

print('min: ', df.select(min('salary')).show(truncate=False))

print('mean', df.select(mean('salary')).show(truncate=False))

print('sum :', df.select(sum('salary')).show(truncate=False))

print('sumDistintct:', df.select(sumDistinct('salary')).show(truncate=False))



approx_count_distinct: [Row(approx_count_distinct(salary)=6)]
avg: 3400.0
+------------------------------------------------------------+
|collect_list(salary)                                        |
+------------------------------------------------------------+
|[3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]|
+------------------------------------------------------------+

collect_list: None
+------------------------------------+
|collect_set(salary)                 |
+------------------------------------+
|[4600, 3000, 3900, 4100, 3300, 2000]|
+------------------------------------+

collect_set: None
countDistinct: 6
count: Row(count(salary)=10)
+-------------+
|first(salary)|
+-------------+
|3000         |
+-------------+

first : None
+------------+
|last(salary)|
+------------+
|4100        |
+------------+

last : None
+-----------+
|max(salary)|
+-----------+
|4600       |
+-----------+

max:  None
+-----------+
|min(salary)|
+-----------+
|2000       |
+-----------+



+--------------------+
|sum(DISTINCT salary)|
+--------------------+
|20900               |
+--------------------+

sumDistintct: None
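The difference between the plain and the distinct aggregates can be checked by hand. The sketch below is plain Python (not PySpark) over the same ten salary values; the `*_result` names are illustrative.

```python
# Salary column of simpleData above, in row order
salaries = [3000, 4600, 4100, 3000, 3000, 3300, 3900, 3000, 2000, 4100]

collect_list_result = list(salaries)         # keeps duplicates
collect_set_result = set(salaries)           # duplicates eliminated, order not guaranteed
count_result = len(salaries)                 # count
count_distinct_result = len(set(salaries))   # countDistinct
sum_result = sum(salaries)                   # sum
sum_distinct_result = sum(set(salaries))     # sumDistinct
avg_result = sum(salaries) / len(salaries)   # avg / mean

print(count_result, count_distinct_result, sum_result, sum_distinct_result, avg_result)
```

Here approx_count_distinct happens to match the exact distinct count of 6; on large data it trades exactness for speed, so the two can differ.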


### PySpark Join Types | Join Two DataFrames

In [0]:
emp = [(1,"Smith",-1,"2018","10","M",3000), \
    (2,"Rose",1,"2010","20","M",4000), \
    (3,"Williams",1,"2010","10","M",1000), \
    (4,"Jones",2,"2005","10","F",2000), \
    (5,"Brown",2,"2010","40","",-1), \
      (6,"Brown",2,"2010","50","",-1) \
  ]
empColumns = ["emp_id","name","superior_emp_id","year_joined", \
       "emp_dept_id","gender","salary"]

empDF = spark.createDataFrame(data=emp, schema=empColumns)
empDF.printSchema()
empDF.show(truncate=False)

dept = [("Finance",10), \
    ("Marketing",20), \
    ("Sales",30), \
    ("IT",40) \
  ]
deptColumns = ["dept_name","dept_id"]

deptDF = spark.createDataFrame(data=dept, schema = deptColumns)
deptDF.printSchema()
deptDF.show(truncate=False)

root
 |-- emp_id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- superior_emp_id: long (nullable = true)
 |-- year_joined: string (nullable = true)
 |-- emp_dept_id: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: long (nullable = true)

+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
|6     |Brown   |2              |2010       |50         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+

#### PySpark Inner Join DataFrame

In [0]:
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, how="inner").orderBy(col('emp_dept_id').asc()).show()

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|    name|superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|     1|   Smith|             -1|       2018|         10|     M|  3000|  Finance|     10|
|     3|Williams|              1|       2010|         10|     M|  1000|  Finance|     10|
|     4|   Jones|              2|       2005|         10|     F|  2000|  Finance|     10|
|     2|    Rose|              1|       2010|         20|     M|  4000|Marketing|     20|
|     5|   Brown|              2|       2010|         40|      |    -1|       IT|     40|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+



#### PySpark Full Outer Join

In [0]:
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, 'outer' ).orderBy(col('emp_id').asc()).show(truncate=False)
# The two commands below produce the same output as the one above

empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"full").orderBy(col('emp_id').asc()).show(truncate=False)
   
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"fullouter").orderBy(col('emp_id').asc()).show(truncate=False)

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|null  |null    |null           |null       |null       |null  |null  |Sales    |30     |
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+


#### PySpark Left and Right Outer Join

In [0]:
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "left").show(truncate=False)
empDF.join(deptDF,empDF.emp_dept_id ==  deptDF.dept_id,"leftouter").show(truncate=False)

empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, 'right').show(truncate=False)
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, 'rightouter').show(truncate=False)

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
|6     |Brown   |2              |2010       |50         |      |-1    |null     |null   |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
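The row counts of the inner, left, right, and full outer joins follow directly from which unmatched rows each type keeps. The following plain-Python sketch (not PySpark) models that rule on trimmed-down copies of the two tables. The `join` helper is hypothetical, and department ids are plain ints here for simplicity.

```python
# (emp_id, name, emp_dept_id)
emp = [(1, "Smith", 10), (2, "Rose", 20), (3, "Williams", 10),
       (4, "Jones", 10), (5, "Brown", 40), (6, "Brown", 50)]
# (dept_name, dept_id)
dept = [("Finance", 10), ("Marketing", 20), ("Sales", 30), ("IT", 40)]

def join(left, right, how):
    """Join emp-like rows to dept-like rows on emp_dept_id == dept_id."""
    result = []
    matched_right = set()
    for left_row in left:
        matches = [r for r in right if left_row[2] == r[1]]
        for right_row in matches:
            result.append(left_row + right_row)
            matched_right.add(right_row)
        # left and full joins keep unmatched left rows, padded with None
        if not matches and how in ("left", "full"):
            result.append(left_row + (None, None))
    # right and full joins keep unmatched right rows, padded with None
    if how in ("right", "full"):
        for right_row in right:
            if right_row not in matched_right:
                result.append((None, None, None) + right_row)
    return result

counts = {how: len(join(emp, dept, how)) for how in ("inner", "left", "right", "full")}
print(counts)
```

Employee 6 (department 50) has no matching department and Sales (department 30) has no employee, which accounts for the extra rows in the outer variants.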


#### Left Semi Join

A left semi join returns the rows from the left table that have a match in the right table based on the join condition, **but does not include columns from the right table**. It is similar to an inner join, except that an inner join returns the matching rows **including columns from both tables**.

In [0]:
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftsemi").show(truncate=False)

+------+--------+---------------+-----------+-----------+------+------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+--------+---------------+-----------+-----------+------+------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |
|3     |Williams|1              |2010       |10         |M     |1000  |
|4     |Jones   |2              |2005       |10         |F     |2000  |
|2     |Rose    |1              |2010       |20         |M     |4000  |
|5     |Brown   |2              |2010       |40         |      |-1    |
+------+--------+---------------+-----------+-----------+------+------+



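#### Left Anti Join

A left anti join is the complement of a left semi join: it returns only the rows from the left table that have no match in the right table, again with columns from the left table only.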

In [0]:
empDF.join(deptDF, empDF.emp_dept_id == deptDF.dept_id, "leftanti").show(truncate=False)

+------+-----+---------------+-----------+-----------+------+------+
|emp_id|name |superior_emp_id|year_joined|emp_dept_id|gender|salary|
+------+-----+---------------+-----------+-----------+------+------+
|6     |Brown|2              |2010       |50         |      |-1    |
+------+-----+---------------+-----------+-----------+------+------+
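Left semi and left anti joins behave like membership filters on the left table. A plain-Python sketch (not PySpark) of that idea follows, using a reduced set of columns and int department ids as simplifying assumptions.

```python
# (emp_id, name, emp_dept_id)
emp = [(1, "Smith", 10), (2, "Rose", 20), (3, "Williams", 10),
       (4, "Jones", 10), (5, "Brown", 40), (6, "Brown", 50)]
# Only the join key of the right table matters for semi and anti joins
dept_ids = {10, 20, 30, 40}

# leftsemi: keep left rows whose key exists on the right
left_semi = [e for e in emp if e[2] in dept_ids]
# leftanti: keep left rows whose key does not exist on the right
left_anti = [e for e in emp if e[2] not in dept_ids]

print(left_semi)
print(left_anti)
```

Every left row lands in exactly one of the two results, so together they partition the left table.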



#### PySpark Self Join
Although there is no dedicated self-join type, any of the join types explained above can be used to join a DataFrame to itself. The example below uses an inner self join.

In [0]:
empDF.alias('emp1').join(empDF.alias('emp2'), \
      col('emp1.superior_emp_id') == col('emp2.emp_id'), 'inner') \
      .select(col('emp1.emp_id'), col('emp1.name'), col('emp2.emp_id').alias('superior_emp_id'), \
          col('emp2.name').alias('superior_emp_name')) \
       .show(truncate=False)

+------+--------+---------------+-----------------+
|emp_id|name    |superior_emp_id|superior_emp_name|
+------+--------+---------------+-----------------+
|2     |Rose    |1              |Smith            |
|3     |Williams|1              |Smith            |
|4     |Jones   |2              |Rose             |
|5     |Brown   |2              |Rose             |
|6     |Brown   |2              |Rose             |
+------+--------+---------------+-----------------+
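A self join of this kind is effectively a lookup of each employee's superior in the same table. The plain-Python sketch below (not PySpark) reproduces the result above; `name_by_id` and `self_joined` are illustrative names.

```python
# (emp_id, name, superior_emp_id) taken from the emp data above
emp = [(1, "Smith", -1), (2, "Rose", 1), (3, "Williams", 1),
       (4, "Jones", 2), (5, "Brown", 2), (6, "Brown", 2)]

# The "emp2" side of the join: look up a name by emp_id
name_by_id = {emp_id: name for emp_id, name, _ in emp}

# Inner join: rows whose superior_emp_id matches no emp_id are dropped
self_joined = [
    (emp_id, name, superior_id, name_by_id[superior_id])
    for emp_id, name, superior_id in emp
    if superior_id in name_by_id
]

for row in self_joined:
    print(row)
```

Smith does not appear in the result because the superior id -1 matches no employee.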



#### Using SQL Expression

In [0]:
empDF.createOrReplaceTempView('EMP')
deptDF.createOrReplaceTempView('DEPT')

spark.sql('select * from EMP e, dept d where e.emp_dept_id == d.dept_id').show(truncate=False)
spark.sql('select * from EMP e inner join dept d on e.emp_dept_id == d.dept_id').show(truncate=False)

+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|emp_id|name    |superior_emp_id|year_joined|emp_dept_id|gender|salary|dept_name|dept_id|
+------+--------+---------------+-----------+-----------+------+------+---------+-------+
|1     |Smith   |-1             |2018       |10         |M     |3000  |Finance  |10     |
|3     |Williams|1              |2010       |10         |M     |1000  |Finance  |10     |
|4     |Jones   |2              |2005       |10         |F     |2000  |Finance  |10     |
|2     |Rose    |1              |2010       |20         |M     |4000  |Marketing|20     |
|5     |Brown   |2              |2010       |40         |      |-1    |IT       |40     |
+------+--------+---------------+-----------+-----------+------+------+---------+-------+


#### PySpark SQL Join on multiple DataFrames

In [0]:
# Chain join() calls to join more than two DataFrames (df1, df2, and df3 are placeholders):
# df1.join(df2, df1.id1 == df2.id2, 'inner') \
#     .join(df3, df1.id1 == df3.id3, 'inner')

### PySpark RDD Tutorial | Learn with Examples

In [0]:

# Create an RDD from a Python list with parallelize()
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = spark.sparkContext.parallelize(data)

# Create an RDD from an external data source
# rdd2 = spark.sparkContext.textFile("/path/textFile.txt")

# Create an RDD using sparkContext.wholeTextFiles().
# wholeTextFiles() returns a pair RDD whose key is the file path and whose value is the file content,
# so each file is read into the RDD as a single record.
rdd3 = spark.sparkContext.wholeTextFiles("/path/textFile.txt")




### Create empty RDD using sparkContext.emptyRDD

In [0]:

# Creates an empty RDD with no partitions; emptyRDD is a method and must be called
rdd = spark.sparkContext.emptyRDD()
# Note: the typed form emptyRDD[String] is Scala syntax and does not apply to PySpark


In [0]:

#Create empty RDD with partition
rdd2 = spark.sparkContext.parallelize([],10) #This creates 10 partitions


* **RDD Parallelize**: When we use the parallelize(), textFile(), or wholeTextFiles() methods of SparkContext to create an RDD, the data is automatically split into partitions based on resource availability. When run on a laptop, parallelize() typically creates as many partitions as there are cores available on the system.

* **getNumPartitions()**: This RDD method returns the number of partitions the dataset is split into. The syntax is:

`print("initial partition count:"+str(rdd.getNumPartitions()))`

`#Outputs: initial partition count:2`

### Repartition and Coalesce

PySpark provides two ways to change the number of partitions:
  1. repartition(): performs a full shuffle, redistributing data across all nodes. It is a very **expensive operation** because it moves data between all nodes in the cluster.
  2. coalesce(): reduces the number of partitions while moving as little data as possible. For example, if the data is in 4 partitions, coalesce(2) moves data from only 2 of them.
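The contrast between the two can be illustrated with a deliberately simplified plain-Python model (not PySpark, and not Spark's actual placement algorithm). Partitions are modelled as lists, and the helper functions `coalesce` and `repartition` are hypothetical stand-ins.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; elements stay with their partition-mates."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged

def repartition(partitions, n):
    """Redistribute every element individually (round-robin), like a full shuffle."""
    out = [[] for _ in range(n)]
    flat = [x for part in partitions for x in part]
    for i, x in enumerate(flat):
        out[i % n].append(x)
    return out

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

coalesced = coalesce(partitions, 2)
reshuffled = repartition(partitions, 3)
print(coalesced)
print(reshuffled)
```

In this model coalesce only concatenates existing partitions, whereas repartition touches every element. That mirrors why the full shuffle is the costlier of the two.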

In [0]:
data = [1,2,3,4,5,6,7,8,9,10,11,12]
rdd = spark.sparkContext.parallelize(data)
reparRdd = rdd.repartition(4)
print("re-partition count:"+str(reparRdd.getNumPartitions()))
#Outputs: re-partition count:4


re-partition count:4


### RDD Transformations with example

In [0]:
rdd = spark.sparkContext.textFile('/FileStore/tables/test.txt')


* **flatMap()**: a transformation that maps each element of a collection to zero or more elements and then flattens the results into a single output collection.

In [0]:
rdd2 = rdd.flatMap(lambda x: x.split(" "))
# rdd2.collect()

* **map()**: a transformation that applies a function to each element of an RDD. It produces a new RDD in which each element is the result of applying the specified function to the corresponding element of the original RDD.

In [0]:
rdd3 = rdd2.map(lambda x: (x,1))
# rdd3.collect()

* **reduceByKey()**: a transformation that is typically used on a **pair RDD** to perform a reduction on elements with the same key. It groups elements by their keys and applies the given function to reduce the values associated with each key.

In [0]:
rdd4 = rdd3.reduceByKey(lambda a,b:a+b)
# rdd4.collect()

* **sortByKey()**: is a transformation operation that is used to sort the elements of a pair RDD by their keys in ascending or descending order. It arranges the key-value pairs in a specified order based on the keys

In [0]:
rdd5=rdd4.map(lambda x:(x[1],x[0])).sortByKey(ascending=True)
# rdd5.collect()

* **filter()**: a transformation that keeps only the records of an RDD that satisfy a predicate

In [0]:
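# Keep only the (count, word) pairs whose word contains "an";
# a new name is used so that the earlier rdd4 is not overwritten
rdd6 = rdd5.filter(lambda x: 'an' in x[1])
rdd6.collect()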

Out[118]: [(18, 'Wonderland'), (27, 'anyone'), (27, 'anywhere'), (27, 'and')]
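The whole chain of transformations can be traced on a tiny input in plain Python (not PySpark). The two input lines below are hypothetical and stand in for the contents of the text file, which is not shown in the notebook.

```python
from collections import Counter
from itertools import chain

lines = ["an apple and an ant", "and then an end"]  # hypothetical sample input

# flatMap: split each line and flatten into one list of words
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map: pair each word with 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the values that share a key
counts = Counter()
for word, one in pairs:
    counts[word] += one

# map + sortByKey: swap to (count, word) and sort by the count only
by_count = sorted(((n, w) for w, n in counts.items()), key=lambda kv: kv[0])

# filter: keep pairs whose word contains "an"
with_an = [(n, w) for n, w in by_count if "an" in w]
print(with_an)
```

Each intermediate variable corresponds to one RDD in the cells above, which makes it easy to inspect what every step contributes.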

### RDD Actions with example

* **count()** - Returns the number of records in an RDD

* **first()** - Returns the first record

* **last()** - Not available as an RDD action in PySpark; for a small RDD, the last record can be read with `rdd.collect()[-1]`

* **max()** - Returns the maximum record

* **reduce()** - Reduces the records to a single value; it can be used to compute a count or a sum

* **take()** - Returns the first n records, where n is passed as the argument

* **collect()** - Returns all data from the RDD as a list. Be careful with this action on a huge RDD with millions or billions of records, as the driver may run out of memory

* **saveAsTextFile()** - Writes the RDD to a text file

In [0]:
# count(): number of records
print("count:", rdd5.count())

# first() and max()
rdd_f = spark.sparkContext.parallelize([1,2,3,4,5,6])
print("first_element", rdd_f.first())
print("max:" ,rdd_f.max())

# reduce(): combine all records with a binary function
def add(x,y):
    return x+y

print("reduce:",rdd_f.reduce(add))

# take(n): return the first n elements
print("take:", rdd_f.take(3))

# collect(): retrieve all elements from the RDD
print("collect:", rdd_f.collect())


count: 23
first_element 1
max: 6
reduce: 21
take: [1, 2, 3]
collect: [1, 2, 3, 4, 5, 6]

