<a href="https://colab.research.google.com/github/trashpanda-ai/Cloud-Computing-and-Big-Data-Applications/blob/main/PySpark_RDD_Operations.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a ><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/Logo_INSA_Lyon_%282014%29.svg/langfr-2560px-Logo_INSA_Lyon_%282014%29.svg.png"  width="200" align="left"> </a>
<div style="text-align: right"> <h3><span style="color:gray"> Cloud Computing and Big Data Applications </span> </h3> </div>

<br>
<br>
<br>


<h1><center>PySpark RDD Operations</center></h1>
<h2><center> <span style="font-weight:normal"><font color='#e42618'> Basic Operations as Building Blocks for Larger Machine Learning Applications</font>  </span></center></h2>


<h3><center><font color='gray'>KEVIN KANAAN and JONAS GOTTAL</font></center></h3>





### Installation and PySpark setup

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark

In [47]:
# Set the PySpark environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"


import findspark
findspark.init()

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("RDD-Ops").getOrCreate()
spark.conf.set("spark.sql.repl.eagerEval.enabled", True) # format output tables
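
The setup above depends on `JAVA_HOME` and `SPARK_HOME` pointing at real directories. A minimal sketch of a sanity check that could run before `findspark.init()`; the helper name `missing_dirs` is hypothetical and not part of the original notebook:

```python
import os

def missing_dirs(env, names=("JAVA_HOME", "SPARK_HOME")):
    """Return the variable names that are unset or do not point to an existing directory."""
    return [name for name in names if not os.path.isdir(env.get(name, ""))]

# Report anything that looks wrong in the current environment
problems = missing_dirs(os.environ)
if problems:
    print("Check these variables:", problems)
```

An empty list means both paths exist; it does not prove that the Java or Spark versions are compatible.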


### Let's create RDDs

In [48]:
list_of_numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = spark.sparkContext.parallelize(list_of_numbers)

In [49]:
# Collect action: Retrieve all elements of the RDD
rdd.collect()

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

In [51]:
# Create an RDD from a list of (course name, ECTS credits) pairs for this semester
data = [("Foundation of data engineering", 6), ("TCS", 2), ("TCS", 2), ("Cloud Computing for Distributed Big Data Applications", 6), ("Projet de Recherche", 10)]
rdd = spark.sparkContext.parallelize(data)

In [53]:
# Collect action: Retrieve all courses of the RDD
print("All courses and credits of the rdd: ", rdd.collect())

All courses and credits of the rdd:  [('Foundation of data engineering', 6), ('TCS', 2), ('TCS', 2), ('Cloud Computing for Distributed Big Data Applications', 6), ('Projet de Recherche', 10)]


### The RDD Operations: Actions

In [54]:
# Count action: Count the number of elements in the RDD
count = rdd.count()
print("The total number of elements in rdd: ", count)

The total number of elements in rdd:  5


In [55]:
# First action: Retrieve the first element of the RDD
first_element = rdd.first()
print("The first element of the rdd: ", first_element)

The first element of the rdd:  ('Foundation of data engineering', 6)


In [56]:
# Take action: Retrieve the first n elements of the RDD (here n = 3)
taken_elements = rdd.take(3)
print("The first three elements of the rdd: ", taken_elements)

The first three elements of the rdd:  [('Foundation of data engineering', 6), ('TCS', 2), ('TCS', 2)]


### The RDD Operations: Transformations

In [57]:
# Map transformation: Convert name to lowercase
mapped_rdd = rdd.map(lambda x: (x[0].lower(), x[1]))

In [58]:
result = mapped_rdd.collect()
print("rdd with lowercase names: ", result)

rdd with lowercase names:  [('foundation of data engineering', 6), ('tcs', 2), ('tcs', 2), ('cloud computing for distributed big data applications', 6), ('projet de recherche', 10)]


In [59]:
# Filter transformation: Filter records where the ECTS is greater than 2
filtered_rdd = rdd.filter(lambda x: x[1] > 2)
filtered_rdd.collect()

[('Foundation of data engineering', 6),
 ('Cloud Computing for Distributed Big Data Applications', 6),
 ('Projet de Recherche', 10)]
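
The lambdas passed to `map` and `filter` above are ordinary Python functions, so the same logic can be written as named functions and checked locally without a Spark session. A minimal sketch; the function names are illustrative and not part of the original notebook:

```python
def lowercase_name(record):
    """Return (name, ects) with the course name converted to lowercase."""
    name, ects = record
    return (name.lower(), ects)

def has_more_than_two_ects(record):
    """Keep only records whose ECTS value is greater than 2."""
    return record[1] > 2

# Local check with Python's built-in map and filter (no Spark involved)
sample = [("TCS", 2), ("Projet de Recherche", 10)]
print(list(map(lowercase_name, sample)))
print(list(filter(has_more_than_two_ects, sample)))
```

With the session from this notebook, the same functions could be passed as `rdd.map(lowercase_name)` and `rdd.filter(has_more_than_two_ects)`.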

In [60]:
# ReduceByKey transformation: Sum the ECTS per course name (the key of each pair)
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
reduced_rdd.collect()

[('Foundation of data engineering', 6),
 ('Projet de Recherche', 10),
 ('TCS', 4),
 ('Cloud Computing for Distributed Big Data Applications', 6)]
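
To make the result above easier to follow: `reduceByKey` merges all values that share a key using the given function, which should be associative and commutative because values are combined in no fixed order. The following is a single-machine imitation in plain Python, not how Spark implements it; `reduce_by_key_local` is a hypothetical helper:

```python
from functools import reduce

def reduce_by_key_local(pairs, func):
    """Group (key, value) pairs by key, then fold each group's values with func."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}

data = [
    ("Foundation of data engineering", 6),
    ("TCS", 2),
    ("TCS", 2),
    ("Cloud Computing for Distributed Big Data Applications", 6),
    ("Projet de Recherche", 10),
]
totals = reduce_by_key_local(data, lambda x, y: x + y)
print(totals["TCS"])
```

The two `TCS` entries collapse into one total, matching the `('TCS', 4)` pair in the Spark output; the order of keys in that output is not something to rely on.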

In [61]:
# SortBy transformation: Sort the RDD by ECTS in descending order
sorted_rdd = rdd.sortBy(lambda x: x[1], ascending=False)
sorted_rdd.collect()

[('Projet de Recherche', 10),
 ('Foundation of data engineering', 6),
 ('Cloud Computing for Distributed Big Data Applications', 6),
 ('TCS', 2),
 ('TCS', 2)]
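
In the sorted output above, the two 6-credit courses tie on the sort key. One way to make the order among ties explicit is a key function that returns a tuple. The sketch below checks such a key with Python's built-in `sorted`; the function name is illustrative:

```python
def by_ects_desc_then_name(record):
    """Sort key: higher ECTS first, then course name alphabetically."""
    name, ects = record
    return (-ects, name)

data = [
    ("Foundation of data engineering", 6),
    ("TCS", 2),
    ("TCS", 2),
    ("Cloud Computing for Distributed Big Data Applications", 6),
    ("Projet de Recherche", 10),
]
ordered = sorted(data, key=by_ects_desc_then_name)
print(ordered)
```

The same key could be passed to Spark as `rdd.sortBy(by_ects_desc_then_name)`, leaving `ascending` at its default because the negated ECTS value already produces descending order.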

### Save an RDD to a text file and read it back

In [64]:
# Save action: Write the RDD as text (Spark creates a directory named "output.txt" containing part files)
rdd.saveAsTextFile("output.txt")
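
Because `saveAsTextFile` writes to a directory, re-running the cell typically fails once `output.txt` already exists. A minimal sketch of a cleanup helper that could run before saving, assuming the output lives on the local filesystem; `remove_output_dir` is a hypothetical name:

```python
import os
import shutil
import tempfile

def remove_output_dir(path):
    """Delete a previous output directory or file; return True if something was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    if os.path.exists(path):
        os.remove(path)
        return True
    return False

# Demonstration on a throwaway directory rather than real Spark output
demo = os.path.join(tempfile.mkdtemp(), "output.txt")
os.makedirs(demo)
print(remove_output_dir(demo))  # something was there and got removed
print(remove_output_dir(demo))  # nothing left to remove
```

In the notebook this would be called as `remove_output_dir("output.txt")` just before the save.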

In [65]:
# Create an RDD from the saved text; each element is read back as a string
rdd_text = spark.sparkContext.textFile("output.txt")
rdd_text.collect()

["('Foundation of data engineering', 6)",
 "('TCS', 2)",
 "('TCS', 2)",
 "('Cloud Computing for Distributed Big Data Applications', 6)",
 "('Projet de Recherche', 10)"]
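
As the output above shows, the records come back as strings rather than tuples. Since each line is the Python representation of a tuple of literals, one possible way to recover the original pairs is `ast.literal_eval`. A minimal sketch; `parse_record` is a hypothetical helper:

```python
import ast

def parse_record(line):
    """Turn a saved line such as "('TCS', 2)" back into the tuple ('TCS', 2)."""
    return ast.literal_eval(line)

lines = ["('Foundation of data engineering', 6)", "('TCS', 2)"]
records = [parse_record(line) for line in lines]
print(records)
```

With the session still running, the same function could be applied as `rdd_text.map(parse_record)`. `ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for this purpose.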

### Shut down Spark Session

In [None]:
spark.stop()