<a href="https://colab.research.google.com/github/subho99/Computational-Data-Science/blob/main/SubhajitBasistha_M5_AST_04_PySpark_Transform_and_Actions_C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Certification Program in Computational Data Science
## A program by IISc and TalentSprint
### Assignment 4: PySpark Transform and Actions

## Learning Objectives

At the end of the experiment, you will be able to:

* Perform RDD (Resilient Distributed Datasets) operations including:
        
  1.   Transformations
  2.   Actions

* Obtain an overview of shuffle operations
* Implement an RDD-based model


## Information

**Overview about Spark, PySpark and Apache Spark in simple language**

**Spark:** A distributed computing framework for processing big data.

**PySpark:** The Python API for Spark.

**Apache Spark:** An open-source cluster-computing framework built around speed, ease of use, and streaming analytics.

* PySpark lets data scientists work with Resilient Distributed Datasets (RDDs) and DataFrames from Python. It can also be used with Spark's machine learning algorithms.

### ***Spark RDD is a major concept in Apache Spark***

**Resilient Distributed Datasets:**

**Resilient:** because RDDs are immutable (they cannot be modified once created) and fault tolerant: a lost partition can be recomputed from its lineage.

**Distributed:** because the data is distributed across the nodes of a cluster.

**Dataset:**      because it holds data.

**Why RDD?**

* Apache Spark lets you treat your input files almost like any other variable, which you cannot do in Hadoop MapReduce.
* RDDs are automatically distributed across the network by means of Partitions.

RDDs are divided into smaller chunks called partitions, and when you execute an action, one task is launched per partition. More partitions therefore allow more parallelism, up to the number of cores available.

Spark automatically decides the number of partitions that an RDD has to be divided into, but you can also specify the number of partitions when creating an RDD. These partitions of an RDD are distributed across all the nodes in the network.

**Difference between Dataframe and RDD (Resilient Distributed Datasets):**

**Dataframe:**
* Carries a schema, which Spark can infer automatically from the dataset.
* Usually performs aggregations faster than RDDs, because its queries pass through Spark's optimizer, and it provides a simpler API for aggregation operations.

**RDD:**
* Has no schema, so we need to define the structure of the data manually.
* Is usually slower than DataFrames for simple operations such as grouping the data.


**Creating an RDD**

**There are three ways to create an RDD in Spark:**
1. Parallelizing an existing collection in the driver program.

  The key point to note in a parallelized collection is the number of partitions the dataset is divided into. Spark runs one task for each partition. Typically, you want two to four partitions for each CPU in the cluster. Normally, Spark sets the number of partitions automatically based on the cluster, but you can also set it manually.

2. Referencing a dataset in an external storage system (e.g. HDFS, HBase, shared file system).
  
  In Spark, a distributed dataset can be created from any data source supported by Hadoop, including the local file system, HDFS, Cassandra, HBase, etc. Here the data is loaded from the external dataset. In PySpark, common loaders are:

  * spark.read.csv(path): loads a CSV file and returns the result as a DataFrame.

  * spark.read.json(path): loads a JSON file (one object per line) and returns the result as a DataFrame.

  * sc.textFile(path): loads a text file and returns an RDD of strings, one per line.

3. Creating an RDD from an existing RDD.

  A transformation produces a new RDD from an existing one without modifying the original. This is how an RDD is created from an already existing RDD, and it is one of the differences between Apache Spark and Hadoop MapReduce.

**Actions/Transformations**

There are two types of operations that you can perform on an RDD:
* Transformations
* Actions

**Transformation** applies a function to an RDD and creates a new RDD. It does not modify the RDD that the function is applied to, and the new RDD keeps a pointer to its parent RDD.

When you call a transformation, Spark does not execute it immediately. Instead, it records it in a lineage. A lineage keeps track of all the transformations that have to be applied to that RDD, including where the data has to be read from.


**Action** triggers the actual computation and either returns the result to the driver (for example, to display it) or saves it to some location. You can also print the RDD lineage information by using the command:

"filtered.toDebugString()" -> (*filtered* is the RDD here).

![img](https://cdn.iisc.talentsprint.com/CDS/Images/Pyspark_RDD.JPG)
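The lazy behaviour described above can be illustrated without Spark. The following is only a plain-Python analogy, assuming that generators stand in for transformations and `list()` stands in for an action; the names `double`, `calls`, and `source` are made up for this sketch.

```python
# Plain-Python analogy (no Spark needed): generators are lazy, like RDD transformations.
calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return 2 * x


source = [1, 2, 3, 4]

# "Transformations": building the pipeline does not call double() yet.
mapped = (double(x) for x in source)
filtered = (y for y in mapped if y > 4)
calls_before_action = list(calls)

# "Action": consuming the pipeline forces the work to happen.
result = list(filtered)
calls_after_action = list(calls)

print(calls_before_action)  # []
print(result)               # [6, 8]
print(calls_after_action)   # [1, 2, 3, 4]
```

In Spark the recorded chain of transformations plays the role of the unevaluated generators, and an action such as collect() or count() plays the role of `list()`.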

### Setup Steps:

In [1]:
#@title Please enter your registration id to start: { run: "auto", display-mode: "form" }
Id = "2236624" #@param {type:"string"}

In [2]:
#@title Please enter your password (your registered phone number) to continue: { run: "auto", display-mode: "form" }
password = "8240187807" #@param {type:"string"}

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M5_AST_04_PySpark_Transform_and_Actions_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    ipython.magic("sx wget https://cdn.iisc.talentsprint.com/CDS/Datasets/Spark_Text.txt")
    ipython.magic("sx wget https://cdn.iisc.talentsprint.com/CDS/Datasets/google_books.csv")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword()}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      # Read the notebook file and close it again once it has been encoded
      with open(notebook + ".ipynb", "rb") as f:
        file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://cds.iisc.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: return submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


 **Install PySpark**

In [4]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.4.1.tar.gz (310.8 MB)
     310.8/310.8 MB 2.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.4.1-py2.py3-none-any.whl size=311285397 sha256=caf10170a8d1f9bde079737fcb5b0dfe42ee3c70d85e0bb1e619edf3653fc5ec
  Stored in directory: /root/.cache/pip/wheels/0d/77/a3/ff2f74cc9ab41f8f594dabf0579c2a7c6de920d584206e0834
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.4.1


**Creating Spark Session**

A Spark session is the unified entry point of a Spark application, introduced in Spark 2.0. Instead of separate contexts (such as SparkContext and SQLContext), everything is encapsulated in a single SparkSession.

In [5]:
# Start spark session
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf  # User Defined Functions
from pyspark.sql.types import StringType
spark = SparkSession.builder.appName('Rdd').getOrCreate()
spark

In [6]:
# Accessing sparkContext from sparkSession instance.
sc = spark.sparkContext

### Spark Python Transformations

**map()** - A map transformation is useful when we need to transform an RDD by applying a function to each element.

In [7]:
# Return a new RDD by applying a function to each element of this RDD.
rdd = sc.parallelize(["b", "a", "c"])
sorted(rdd.map(lambda x: (x, 1)).collect())

[('a', 1), ('b', 1), ('c', 1)]

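**take()** - Take the first num elements of the RDD. Note that take() is an action rather than a transformation; it is shown here because it is used to inspect the results of transformations.

It works by first scanning one partition and then using the results from that partition to estimate the number of additional partitions needed to satisfy the limit.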

In [8]:
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2) #take()

[2, 3]

In [9]:
sc.parallelize(range(100), 100).filter(lambda x: x > 90).take(3) #take()

[91, 92, 93]

**flatMap()** - The flatMap transformation returns a new RDD by first applying a function to all elements of this RDD and then flattening the results. This flattening is the main difference between the flatMap and map transformations: map produces exactly one output element per input element, while flatMap can produce zero or more.

In [10]:
s0 = sc.parallelize([3,4,5])
s0.flatMap(lambda x: [x, x*x]).collect()

[3, 9, 4, 16, 5, 25]

Compare the same function using map()

In [11]:
sc.parallelize([3,4,5]).map(lambda x: [x,  x*x]).collect()

[[3, 9], [4, 16], [5, 25]]
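The difference between the two outputs above can be reproduced in plain Python. This is a sketch that assumes list comprehensions and `itertools.chain` as stand-ins for map() and flatMap(); it does not use Spark, and `square_pair` and `keep_even` are hypothetical helper names.

```python
from itertools import chain


def square_pair(x):
    """Return the value and its square as a two-element list."""
    return [x, x * x]


def keep_even(x):
    """Return a one-element list for even numbers and an empty list otherwise."""
    return [x] if x % 2 == 0 else []


data = [3, 4, 5]

# map-like: exactly one output element (here a list) per input element
mapped = [square_pair(x) for x in data]

# flatMap-like: apply the function, then flatten one level
flat_mapped = list(chain.from_iterable(square_pair(x) for x in data))

# flatMap-like with a function that may return nothing for some inputs
evens_only = list(chain.from_iterable(keep_even(x) for x in data))

print(mapped)       # [[3, 9], [4, 16], [5, 25]]
print(flat_mapped)  # [3, 9, 4, 16, 5, 25]
print(evens_only)   # [4]
```

The last line shows that a flatMap-style operation does not have to preserve the number of elements, whereas a map-style operation always does.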

**filter()** - The filter transformation returns a new RDD formed by selecting those elements of the source for which the given function returns true.

In [12]:
rdd = sc.parallelize([1, 2, 3, 4, 5])
rdd.filter(lambda x: x % 2 == 0).collect() # Return a new RDD containing only the elements that satisfy a predicate.

[2, 4]

**groupByKey()** - We can apply the groupByKey transformation to an RDD of (key, value) pairs. It groups the values for each key in the original RDD and creates a new pair in which each key maps to its collected group of values.

In [13]:
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
x.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()

[('b', [1]), ('a', [1, 1])]

**reduceByKey()** - Merge the values for each key using an associative and commutative reduce function.

In [14]:
from operator import add
rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
sorted(rdd.reduceByKey(add).collect())

[('a', 2), ('b', 1)]

**mapPartitions()** - Similar to map, but the function runs once per partition (block) of the RDD and receives an iterator over that partition's elements.

In [16]:
wordsList = ['cat', 'elephant', 'rat', 'rat', 'cat']

wordsRDD = sc.parallelize(wordsList, 4) # number of partitions - 4

print(wordsRDD.collect())

itemsRDD = wordsRDD.mapPartitions(lambda iterator: [','.join(iterator)])
# mapPartitions() calls the function once per partition; the 4th partition holds two words, which are joined into 'rat,cat'.
print(itemsRDD.collect())

['cat', 'elephant', 'rat', 'rat', 'cat']
['cat', 'elephant', 'rat', 'rat,cat']


In [17]:
L = range(1,10)

parallel = sc.parallelize(L, 3) # number of partitions - 3

def f(iterator):
  yield sum(iterator)

parallel.mapPartitions(f).collect()

# The result is [6, 15, 24] because mapPartitions() calls f once for each of the 3 partitions. Partition 1: 1+2+3 = 6, Partition 2: 4+5+6 = 15, Partition 3: 7+8+9 = 24


[6, 15, 24]

In [18]:
rdd = sc.parallelize([1, 2, 3, 4], 2) # number of partitions - 2

def f(iterator):
  yield sum(iterator)

rdd.mapPartitions(f).collect()

# Results [3, 7], partition 1 : 1+2 = 3, partition 2 : 3+4 =7

[3, 7]
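The three mapPartitions() results above can be modelled in plain Python. This is a minimal sketch, not Spark's implementation: `split_into_partitions` and `map_partitions` are hypothetical helpers, and the even slicing rule is an assumption chosen because it reproduces the outputs shown in this notebook.

```python
def split_into_partitions(items, num_partitions):
    """Slice a collection into contiguous partitions (assumed slicing rule)."""
    items = list(items)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


def map_partitions(partitions, func):
    """Call func once per partition with an iterator and concatenate the results."""
    out = []
    for part in partitions:
        out.extend(func(iter(part)))
    return out


def partition_sum(iterator):
    yield sum(iterator)


sums = map_partitions(split_into_partitions(range(1, 10), 3), partition_sum)

words = ['cat', 'elephant', 'rat', 'rat', 'cat']
joined = map_partitions(
    split_into_partitions(words, 4),
    lambda iterator: [','.join(iterator)],
)

print(sums)    # [6, 15, 24]
print(joined)  # ['cat', 'elephant', 'rat', 'rat,cat']
```

The key point is that the function sees a whole partition at once, so the number of output elements depends on the number of partitions rather than on the number of input elements.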

**mapPartitionsWithIndex()** - Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

In [19]:
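rdd = sc.parallelize([1, 2, 3, 4], 4) # number of partitions - 4
def f(splitIndex, iterator): yield splitIndex # emit only the index of each partition
rdd.mapPartitionsWithIndex(f).sum() # partition indexes 0 + 1 + 2 + 3 = 6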

6

### Spark Python Actions

**Creating an RDD to explain "RDD actions with Examples"**

In [20]:
data=[("Z", 1),("A", 20),("B", 30),("C", 40),("B", 30),("B", 60)]

inputRDD = spark.sparkContext.parallelize(data)

listRdd = spark.sparkContext.parallelize([1,2,3,4,5,3,2])

from operator import add

After creating the two RDDs above, we use them as needed to demonstrate the RDD actions.

**first()** – Return the first element in the dataset.

In [21]:
#first
print("first :  "+str(listRdd.first()))
print("first :  "+str(inputRDD.first()))

first :  1
first :  ('Z', 1)


**take()** – Return the first num elements of the dataset.

In [22]:
#take()
print("take : "+str(listRdd.take(2)))

take : [1, 2]


**takeSample()** – Return a fixed-size random sample of the dataset as a list.

In [23]:
print("takeSample : "+str(listRdd.takeSample(False, 3))) # sample 3 elements without replacement from [1,2,3,4,5,3,2]

takeSample : [5, 3, 2]


**takeOrdered()** – Return the first num (smallest) elements of the dataset in ascending order. It is the counterpart of the top() action, which returns the largest elements.

In [24]:
print("takeOrdered : "+ str(listRdd.takeOrdered(2)))

takeOrdered : [1, 2]


**collect()** - Return the complete dataset to the driver as a list. Use it only when the result is small enough to fit in the driver's memory.

In [25]:
#Collect
data = listRdd.collect()
print(data)

[1, 2, 3, 4, 5, 3, 2]


**count()** – Return the count of elements in the dataset.

In [26]:
print("Count : "+str(listRdd.count()))

Count : 7


**countByValue()** – Return a dictionary in which each key is a unique value in the dataset and each value is the number of times that value occurs.

In [27]:
print("countByValue :  "+str(listRdd.countByValue()))

countByValue :  defaultdict(<class 'int'>, {1: 1, 2: 2, 3: 2, 4: 1, 5: 1})


**reduce()** – Reduces the elements of the dataset using the specified commutative and associative binary operator.

In [28]:
redRes=listRdd.reduce(add)
print(redRes)

20


**top()** – Return the top n elements of the dataset in descending order. Tuples are compared element by element, so ('Z', 1) ranks above ('C', 40).

In [29]:
print("top : "+str(listRdd.top(2)))
print("top : "+str(inputRDD.top(2)))

top : [5, 4]
top : [('Z', 1), ('C', 40)]


**fold()** - Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value."

In [30]:
foldRes=listRdd.fold(0, add)
print(foldRes)

20
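Why the zero value has to be neutral can be seen in a small plain-Python model of fold(). This is a sketch under the assumption that the zero value is applied once per partition and once more when the per-partition results are merged; `fold_model` and the two-partition layout are hypothetical and only illustrate the idea.

```python
from functools import reduce
from operator import add


def fold_model(partitions, zero, op):
    """Fold each partition starting from zero, then fold the partial results."""
    per_partition = [reduce(op, part, zero) for part in partitions]
    return reduce(op, per_partition, zero)


# The same seven numbers as listRdd, split into two assumed partitions
partitions = [[1, 2, 3], [4, 5, 3, 2]]

neutral = fold_model(partitions, 0, add)      # 0 is neutral for addition
non_neutral = fold_model(partitions, 1, add)  # 1 is added three times

print(neutral)      # 20
print(non_neutral)  # 23
```

With a non-neutral zero value the result depends on the number of partitions, which is why fold() should be given 0 for addition, 1 for multiplication, an empty list for concatenation, and so on.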


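**foldByKey()** - Similar to fold(), but applied to the values of each key: it merges the values for each key using an associative function and a neutral zero value of the same type as the values. Unlike fold(), it returns a new RDD, so it is a transformation rather than an action.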

In [31]:
inputRDD.foldByKey(0, add).collect()

[('C', 40), ('Z', 1), ('A', 20), ('B', 120)]

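**reduceByKey()** - Merge the values for each key using an associative and commutative reduce function. It returns a new RDD of (key, value) pairs, so it is a transformation rather than an action.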

In [32]:
sorted(inputRDD.reduceByKey(add).collect())

[('A', 20), ('B', 120), ('C', 40), ('Z', 1)]

**combineByKey()** - Generic function to combine the elements for each key using a custom set of aggregation functions.

In [33]:
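# combineByKey(createCombiner, mergeValue, mergeCombiners):
# str turns the first value of a key into a string, and concat appends later values
# (mergeValue) and merges the per-partition strings (mergeCombiners).
# A separate name is used so that operator.add imported earlier is not overwritten.
def concat(A, B):
  return A + str(B)
sorted(inputRDD.combineByKey(str, concat, concat).collect())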

[('A', '20'), ('B', '303060'), ('C', '40'), ('Z', '1')]
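The roles of the three functions passed to combineByKey() may be easier to see with a numeric aggregation. The following plain-Python model computes a per-key average; it is a sketch rather than Spark's implementation, and `combine_by_key` and the two-partition split of the data are assumptions made for illustration.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Model of combineByKey: combine within partitions, then across them."""
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            if key in acc:
                acc[key] = merge_value(acc[key], value)  # key seen before in this partition
            else:
                acc[key] = create_combiner(value)        # first value for this key
        partials.append(acc)

    merged = {}
    for acc in partials:
        for key, combiner in acc.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


# Same pairs as inputRDD, split into two assumed partitions
partitions = [
    [("Z", 1), ("A", 20), ("B", 30)],
    [("C", 40), ("B", 30), ("B", 60)],
]

# The combiner is a (sum, count) pair
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {key: total / count for key, (total, count) in sum_count.items()}

print(sorted(averages.items()))
# [('A', 20.0), ('B', 40.0), ('C', 40.0), ('Z', 1.0)]
```

Unlike reduceByKey(), the combiner type here (a pair) differs from the value type (a number), which is the main reason to reach for combineByKey().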

### PySpark User Defined Functions

* A PySpark UDF (User Defined Function) is a way to create a reusable function in Spark.

* Once a UDF is created, it can be reused on multiple DataFrames and, after registering it, in SQL.

* The default return type of udf() is StringType.

Create a DataFrame with two columns, "Seqno" and "Name":

In [34]:
columns = ["Seqno","Name"]
data = [("1", "john jones"),
    ("2", "tracey smith"),
    ("3", "amy sanders")]

df = spark.createDataFrame(data=data,schema=columns)

df.show(truncate=False)

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+



Applying UDF

In [35]:
# creating a udf using lambda
convertUDF = udf(lambda z: z.upper())
df.select(col("Seqno"), convertUDF(col("Name")).alias("Name") ).show(truncate=False)

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |JOHN JONES  |
|2    |TRACEY SMITH|
|3    |AMY SANDERS |
+-----+------------+



#### **Shuffle Operations**


Shuffling is the mechanism PySpark uses to redistribute data across different executors and even across machines. A shuffle is triggered by certain transformations on RDDs, such as groupByKey(), reduceByKey(), and join().

Spark also supports transformations with wide dependencies, such as groupByKey and reduceByKey. In these dependencies, the data required to compute the records in a single partition can reside in many partitions of the parent dataset.

To perform these transformations, all of the tuples with the same key must end up in the same partition, processed by the same task. To satisfy this requirement, Spark performs a shuffle, which transfers data around the cluster and results in a new stage with a new set of partitions.

For example, consider the following code:

**sc.textFile("someFile.txt").map(mapFunc).flatMap(flatMapFunc).filter(filterFunc).count()**

It runs a single action, count, which depends on a sequence of three transformations on a dataset derived from a text file. This code runs in a single stage because none of the outputs of these three transformations depend on data that comes from different partitions than their inputs.
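In contrast to the single-stage pipeline above, a by-key aggregation needs a shuffle. The sketch below models that step in plain Python, assuming hash partitioning with Python's built-in `hash` as a stand-in for Spark's own hash function; `shuffle_by_key` and the sample pairs are hypothetical.

```python
from collections import defaultdict


def shuffle_by_key(partitions, num_output_partitions):
    """Route every (key, value) pair to the output partition chosen by its key."""
    output = [[] for _ in range(num_output_partitions)]
    for part in partitions:
        for key, value in part:
            output[hash(key) % num_output_partitions].append((key, value))
    return output


# Pairs with the same key start out in different input partitions
input_partitions = [
    [("a", 1), ("b", 1)],
    [("a", 1), ("c", 1)],
    [("b", 1), ("a", 1)],
]

shuffled = shuffle_by_key(input_partitions, 2)

# After the shuffle, each key lives in exactly one partition,
# so every partition can be reduced independently by one task.
key_locations = defaultdict(set)
counts = {}
for index, part in enumerate(shuffled):
    for key, value in part:
        key_locations[key].add(index)
        counts[key] = counts.get(key, 0) + value

print(sorted(counts.items()))  # [('a', 3), ('b', 2), ('c', 1)]
```

Which partition a given key lands in can vary between runs, but the guarantee that matters is that all pairs with the same key end up together.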

**Below is an example that implements an RDD-based model to count the words in a file**

To implement the RDD-based model, we use the text file (**Spark_Text.txt**), which contains five paragraphs of notes on Apache Spark.

We perform RDD transformations and actions on the file to count its words.

In [36]:
rdd = sc.textFile("Spark_Text.txt")

In [37]:
# Lowercase each line and split it into words; the map transformation applies this function to every line.
def Func(lines):
    lines = lines.lower()
    lines = lines.split()
    return lines
rdd1 = rdd.map(Func)

In [38]:
rdd1.take(1)

[['apache',
  'spark',
  'lets',
  'you',
  'treat',
  'your',
  'input',
  'files',
  'almost',
  'like',
  'any',
  'other',
  'variable,',
  'which',
  'you',
  'cannot',
  'do',
  'in',
  'hadoop',
  'mapreduce.']]

In [39]:
# To get flat output, we apply the flatMap transformation, which flattens the per-line word lists into a single RDD of words:
rdd2 = rdd.flatMap(Func)
rdd2.take(3)


['apache', 'spark', 'lets']

In [40]:
rdd3 = rdd2.filter(lambda x:x!= '')
rdd3.take(7)  # Check the first 7 elements of rdd3 with the take action.

['apache', 'spark', 'lets', 'you', 'treat', 'your', 'input']

In [41]:
rdd3_mapped = rdd3.map(lambda x: (x,1))
rdd3_grouped = rdd3_mapped.groupByKey()

In [42]:
# Sum the counts per word, swap each pair to (count, word), and sort by count in descending order.
rdd3_mapped.reduceByKey(lambda x,y: x+y).map(lambda x:(x[1],x[0])).sortByKey(False).take(200)

[(13, 'the'),
 (7, 'a'),
 (6, 'of'),
 (6, 'you'),
 (5, 'rdd'),
 (4, 'that'),
 (4, 'to'),
 (4, 'it'),
 (3, 'are'),
 (3, 'when'),
 (3, 'an'),
 (3, 'spark'),
 (3, 'rdds'),
 (3, 'number'),
 (3, 'be'),
 (3, 'partitions'),
 (3, 'has'),
 (2, 'in'),
 (2, 'partitions,'),
 (2, 'execute'),
 (2, 'is'),
 (2, 'more'),
 (2, 'rdd.'),
 (2, 'new'),
 (2, 'rdd,'),
 (2, 'keeps'),
 (2, 'which'),
 (2, 'automatically'),
 (2, 'distributed'),
 (2, 'across'),
 (2, 'divided'),
 (2, 'and'),
 (2, 'some'),
 (2, 'all'),
 (2, 'function'),
 (2, 'on'),
 (2, 'creates'),
 (2, 'does'),
 (2, 'not'),
 (1, 'treat'),
 (1, 'input'),
 (1, 'files'),
 (1, 'like'),
 (1, 'other'),
 (1, 'variable,'),
 (1, 'cannot'),
 (1, 'do'),
 (1, 'hadoop'),
 (1, 'network'),
 (1, 'means'),
 (1, 'into'),
 (1, 'task'),
 (1, 'means,'),
 (1, 'decides'),
 (1, 'but'),
 (1, 'creating'),
 (1, 'these'),
 (1, 'nodes'),
 (1, 'network.'),
 (1, 'applies'),
 (1, 'on.(remember'),
 (1, 'also,'),
 (1, 'pointer'),
 (1, 'parent'),
 (1, 'call'),
 (1, 'transformation,'

**The example below applies Spark transformations in Python to a CSV file.**

We will use the CSV file (**google_books.csv**) to work with Spark transformations.

This data was acquired from the Google Books store. Google API was used to acquire the data. Nine features were gathered for each book in the data set.

In [43]:
book_names = sc.textFile("google_books.csv")
# Create a new RDD called "rows" by splitting every line of the book_names RDD on commas.
# Note: a plain split also breaks quoted fields that contain commas.
rows = book_names.map(lambda line: line.split(","))

In [44]:
for row in rows.take(rows.count()):
  print(row[1])

authors
['Wendelin Van Draanen']


['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
"['James Dickey'
"['Katherine M. Keyes'
['Martina Handler']
['BookCaps Study Guides Staff']
"['Richard M. Lerner'
"['David I. Durham'

['Ruth Roy Harris']

"['Gale
"['David E. Keyes'
['United States. Dept. of Commerce']
['Jessica Keyes']
['Jessica Keyes']

['United States. Bureau of Mines']
['Elizabeth Walsh']
 1997-98"


['Association of Collegiate Schools of Architecture']
 1941-1945"
['New Mexico. Bureau of Mines and Mineral Resources']

['Maureen Callahan']

 Keyed to Soderquist"
['William T. Coyle']
 Keyed to Smiddy and Cunningham"
['Phil Geusz']
"['Joseph M. Flora'
['Jeana L. Magyar-Moe']
"['Teresa L. Scheid'
['Timothy P Melchert']
['Jessica Keyes']
['Charles Don Keyes']
['Ralph Keyes']
"['Katherine M. Keyes'
 carbon dioxide and carbon monoxide"

"['Graeme Simsion'
['Janet Clark']
['Greg Keyes']
['Marian Keyes']
['Marian Keyes']
"['R. Regi

In [45]:
for row in rows.take(10):
  print(row[1])

authors
['Wendelin Van Draanen']


['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
['Jessica Keyes']
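Some of the author values printed by the full listing are cut off (for example, values that begin with a double quote) because line.split(",") also splits on commas inside quoted fields. A possible remedy is to parse each line with Python's `csv` module. The sketch below uses a made-up line, not a row from google_books.csv, and assumes that quoted fields do not span multiple lines.

```python
import csv

# Hypothetical CSV line: the second field is quoted and contains a comma
line = "Example Title,\"['First Author', 'Second Author']\",en"

# Naive split: the quoted author list is cut into two pieces
naive_fields = line.split(",")

# csv.reader respects the quotes and keeps the author list in one field
parsed_fields = next(csv.reader([line]))

print(len(naive_fields))   # 4
print(len(parsed_fields))  # 3
print(parsed_fields[1])    # ['First Author', 'Second Author']
```

The same parser could be applied per line inside a map() call, for example `book_names.map(lambda line: next(csv.reader([line])))`, although loading the file as a DataFrame with spark.read.csv is usually the more robust choice.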


In [46]:
# filter() - Creating a new RDD by returning only the elements that satisfy the search filter.
rows.filter(lambda line: "Inward Journey" in line).collect()

[['Inward Journey',
  '',
  'en',
  "['Medical']",
  '',
  'NOT_MATURE',
  'Open Court Publishing Company',
  '1983-01',
  '133']]

In [47]:
# groupByKey() - The following groups the publishers by book title. It operates on (key, value) pairs.
rows = book_names.map(lambda line: line.split(","))
titleToPublisher = rows.map(lambda n: (str(n[0]),str(n[6]) )).groupByKey()
titleToPublisher.map(lambda x : {x[0]: list(x[1])}).take(5)

[{'title': ['publisher']},
 {'The Boston Directory ...': ['']},
 {"The CIO's Guide to Oracle Products and Solutions": ['CRC Press']},
 {'Implementing the IT Balanced Scorecard': ['CRC Press',
   'CRC Press',
   'CRC Press']},
 {'Social Software Engineering': ['CRC Press']}]

### Please answer the questions below to complete the experiment:




In [48]:
# @title Select the False Statement: { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "flatMap function always produces a single value as output for each input value" #@param ["", "Map transforms an RDD of length N into another RDD of length N","flatMap function always produces a single value as output for each input value","The reduce operation shuffles and reduces the output obtained from the map"]

In [49]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good, But Not Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [50]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "Perfect for practice" #@param {type:"string"}


In [51]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [52]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [53]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [54]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 6281
Date of submission:  06 Aug 2023
Time of submission:  15:45:34
View your submissions: https://cds.iisc.talentsprint.com/notebook_submissions
