# Objective
This lab shows how to create and work with core Apache Spark constructs. It covers some of the essential transformations and actions in PySpark Core. Because the lab runs in Google Colab, nothing needs to be installed locally.


## Tutorials
First, install the dependencies in the Google Colab environment. The cell below uses pip to install PySpark (the pip package bundles Apache Spark) and findspark.

In [2]:
# install pyspark using pip
!pip install --ignore-installed -q pyspark
# install findspark using pip
!pip install --ignore-installed -q findspark


In [3]:
from pyspark.sql import SparkSession
import collections
spark = SparkSession.builder.master("local").appName("Colab Demo for Actions and Transformations").config('spark.ui.port', '4050').getOrCreate()


## Transformations

Transformations are operations on an RDD, Dataset, or DataFrame that return a new structure of the same kind rather than modifying the original.
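
Spark evaluates transformations lazily: nothing is computed until an action such as `collect` asks for a result. The sketch below is only a plain-Python analogy, assuming no Spark session at all. It uses Python's lazy built-in `map` to stand in for a transformation and `list(...)` to stand in for an action; the names `double` and `calls` are hypothetical helpers for the illustration.

```python
# Plain-Python analogy (no Spark involved): a lazy pipeline does no work
# until something consumes it, much like a transformation before an action.
calls = []  # records every element the function actually processes


def double(x):
    calls.append(x)
    return x * 2


data = [1, 2, 3]

# "Transformation": builds the lazy pipeline; double() has not run yet.
pipeline = map(double, data)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(pipeline)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

In PySpark the same pattern appears as `rdd.map(...)` returning immediately and `collect()` triggering the actual job.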

### `map`

`map` applies a function to each element of an RDD (Resilient Distributed Dataset) and returns a new RDD with exactly one output element per input element. The examples below show common ways to use it in PySpark:

Basic Usage:

In [None]:
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Multiply each element by 2
multiplied = rdd.map(lambda x: x * 2)
print(multiplied.collect())  # [2, 4, 6, 8, 10]


[2, 4, 6, 8, 10]


Working with Tuples:

In [None]:
data_tuples = [(1, 'a'), (2, 'b'), (3, 'c')]
rdd_tuples = spark.sparkContext.parallelize(data_tuples)

# Add 1 to the number in each tuple
added = rdd_tuples.map(lambda x: (x[0] + 1, x[1]))
print(added.collect())  # [(2, 'a'), (3, 'b'), (4, 'c')]


[(2, 'a'), (3, 'b'), (4, 'c')]


String Manipulation:

In [None]:
data_strings = ["Hello", "PySpark", "World"]
rdd_strings = spark.sparkContext.parallelize(data_strings)

# Convert each string to uppercase
uppercase = rdd_strings.map(lambda x: x.upper())
print(uppercase.collect())  # ['HELLO', 'PYSPARK', 'WORLD']


['HELLO', 'PYSPARK', 'WORLD']


Using Multiple Operations:

In [None]:
numbers = [1, 2, 3, 4, 5]
rdd_numbers = spark.sparkContext.parallelize(numbers)

# Perform a series of operations: square, then add 2
result = rdd_numbers.map(lambda x: x**2).map(lambda y: y + 2)
print(result.collect())  # [3, 6, 11, 18, 27]


[3, 6, 11, 18, 27]
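
Chaining two `map` calls produces the same values as a single `map` with the composed function. The check below is a plain-Python sketch that assumes no Spark session; `square` and `add_two` are hypothetical helper names, and Python's built-in `map` stands in for the RDD method.

```python
numbers = [1, 2, 3, 4, 5]


def square(x):
    return x ** 2


def add_two(y):
    return y + 2


# Two chained steps, mirroring rdd.map(square).map(add_two)
chained = list(map(add_two, map(square, numbers)))

# One step with the composed function, mirroring a single rdd.map(...)
composed = list(map(lambda x: add_two(square(x)), numbers))

print(chained == composed)
```

Splitting the work into separate steps is mostly a readability choice, since both forms compute the same values.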


Using a Custom Function:

In [None]:
def square_and_add_one(n):
    return n**2 + 1

numbers = [1, 2, 3, 4, 5]
rdd_numbers = spark.sparkContext.parallelize(numbers)

result = rdd_numbers.map(square_and_add_one)
print(result.collect())  # [2, 5, 10, 17, 26]


[2, 5, 10, 17, 26]


Inline Lambda Functions:

In [None]:
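x = [1, 2, 3, 4, 5]
# Build the RDD from x rather than reusing the rdd from an earlier cell
rdd_x = spark.sparkContext.parallelize(x)
squared_rdd = rdd_x.map(lambda n: n * n)
print(squared_rdd.collect())  # [1, 4, 9, 16, 25]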

[1, 4, 9, 16, 25]


### `flatMap`

The `flatMap` transformation applies a function that returns zero or more values for each element of the source RDD, then flattens the results so that you get a single RDD of values rather than an RDD of lists or arrays. Here are some examples using `flatMap` in PySpark:

Basic Text Processing:

Let's say you have an RDD of sentences and you want to split them into words:


In [4]:
sentences = ["Hello world", "I am learning PySpark"]
rdd_sentences = spark.sparkContext.parallelize(sentences)
words = rdd_sentences.flatMap(lambda x: x.split(" "))
print(words.collect())  # ['Hello', 'world', 'I', 'am', 'learning', 'PySpark']


['Hello', 'world', 'I', 'am', 'learning', 'PySpark']
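
To see why the flattening matters, it helps to compare what a one-to-one `map` and a `flatMap` would each produce for the same split function. The sketch below is a plain-Python model that assumes no Spark session: a list comprehension stands in for `map`, and `itertools.chain.from_iterable` stands in for the flattening step. `split_words` is a hypothetical helper name.

```python
from itertools import chain

sentences = ["Hello world", "I am learning PySpark"]


def split_words(sentence):
    return sentence.split(" ")


# map-like: one output per input, so each sentence yields its own list
mapped = [split_words(s) for s in sentences]

# flatMap-like: same function, but the per-sentence lists are concatenated
flat_mapped = list(chain.from_iterable(split_words(s) for s in sentences))

print(mapped)
print(flat_mapped)
```

The map-like result keeps two nested lists (one per sentence), while the flatMap-like result is a single list of six words, matching the output of the cell above.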
