Difference between `map` and `foreach`
- https://stackoverflow.com/questions/41388597/difference-between-rdd-foreach-and-rdd-map

*Map is a transformation*

When you call `map`, Spark applies a function to each element of the RDD and returns a new RDD, on which further transformations or actions can be called.

*Foreach is an action*

`foreach` applies a function to each element but does not return a value. This is particularly useful when you need to perform some calculation on an RDD and send the result somewhere else, for example writing it to a database or calling a REST API for each element in the RDD.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import *

In [4]:
spark = SparkSession.builder \
                    .appName('SparkByExamples.com') \
                    .getOrCreate()

In [68]:
data = [
    ('James','Smith','M',30),
    ('Anna','Rose','F',41),
    ('Robert','Williams','M',62), 
]

In [69]:
columns = ["firstname","lastname","gender","salary"]

In [70]:
df = spark.createDataFrame(data=data, schema=columns)

In [71]:
df.show()

+---------+--------+------+------+
|firstname|lastname|gender|salary|
+---------+--------+------+------+
|    James|   Smith|     M|    30|
|     Anna|    Rose|     F|    41|
|   Robert|Williams|     M|    62|
+---------+--------+------+------+



In [13]:
(
    df.withColumn("fullname", F.concat_ws(",", F.col("firstname"), F.col("lastname")))
      .withColumn("new_salary", 2 * F.col("salary"))
      .show()
)

+---------+--------+------+------+---------------+----------+
|firstname|lastname|gender|salary|       fullname|new_salary|
+---------+--------+------+------+---------------+----------+
|    James|   Smith|     M|    30|    James,Smith|        60|
|     Anna|    Rose|     F|    41|      Anna,Rose|        82|
|   Robert|Williams|     M|    62|Robert,Williams|       124|
+---------+--------+------+------+---------------+----------+



In [14]:
rdd=df.rdd.map(lambda x: 
    (x[0]+","+x[1],x[2],x[3]*2)
    )  

In [17]:
df1 = rdd.toDF(["name","gender","new_salary"])
df1.show()

+---------------+------+----------+
|           name|gender|new_salary|
+---------------+------+----------+
|    James,Smith|     M|        60|
|      Anna,Rose|     F|        82|
|Robert,Williams|     M|       124|
+---------------+------+----------+



In [19]:
rdd2=df.rdd.map(lambda x: 
    (x["firstname"]+", "+x["lastname"],x["gender"],x["salary"]*2)
    )
df2 = rdd2.toDF(["name","gender","new_salary"])
df2.show()

+----------------+------+----------+
|            name|gender|new_salary|
+----------------+------+----------+
|    James, Smith|     M|        60|
|      Anna, Rose|     F|        82|
|Robert, Williams|     M|       124|
+----------------+------+----------+



In [20]:
def f(x):
    # x is a Row; its attribute names match the DataFrame columns (all lowercase)
    firstName = x.firstname
    lastName = x.lastname
    name = firstName + ", " + lastName
    gender = x.gender.lower()
    salary = x.salary * 2
    return (name, gender, salary)

In [72]:
df.rdd.map(f).collect()

[('James, Smith', 'm', 60),
 ('Anna, Rose', 'f', 82),
 ('Robert, Williams', 'm', 124)]

In [38]:
import pandas as pd
# In Spark 3.x the Arrow setting is named spark.sql.execution.arrow.pyspark.enabled
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    print(row['firstname'], row['gender'])

James M
Anna F
Robert M


  PyArrow >= 0.15.1 must be installed; however, it was not found.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.


In [80]:
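def write_double(x):
    # Runs on the executors, so this print does not show up in the output below
    print(x)
    with open(f"/tmp/foreach-{x}.txt", "w") as out:
        out.write(f"x={x}:  {2*x}")

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# foreach is an action: it calls write_double on every element and returns None
rdd.foreach(write_double)

# Read the files back on the driver; this relies on the driver and the
# executors sharing /tmp, as they do when Spark runs in local mode
for x in rdd.toLocalIterator():
    print(x)
    fn = f"/tmp/foreach-{x}.txt"
    print(f"file {fn}")
    with open(fn) as fh:
        print(fh.read())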
    

1
file /tmp/foreach-1.txt
x=1:  2
2
file /tmp/foreach-2.txt
x=2:  4
3
file /tmp/foreach-3.txt
x=3:  6
4
file /tmp/foreach-4.txt
x=4:  8
5
file /tmp/foreach-5.txt
x=5:  10
