### Resilient Distributed Datasets (RDDs)

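Resilient distributed datasets (RDDs) are Spark's core abstraction for working with data.
RDDs support two types of operations: transformations, which define a new RDD from an existing one, and actions, which run a computation on an RDD and return a result to the driver program.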
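
One way to picture the difference: in Spark, a transformation only records how to derive a new RDD, and nothing is computed until an action asks for a result. The sketch below imitates that behavior with plain Python generators. It is an analogy only, not Spark code; `tag_word` and the `calls` list are hypothetical names introduced for illustration.

```python
# Plain-Python analogy (not Spark): a lazy map object stands in for a
# transformation, and consuming it plays the role of an action.
calls = []  # records every time the mapping function actually runs


def tag_word(word):
    """Hypothetical helper: pair a word with the count 1."""
    calls.append(word)
    return (word, 1)


words = ["hello", "hello", "world"]

# "Transformation": builds a lazy pipeline; tag_word has not run yet.
pairs = map(tag_word, words)
calls_before_action = len(calls)

# "Action": materializing the result forces the computation.
result = list(pairs)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action)
print(result)
```

In the notebook cells of this section, flatMap, map, and reduceByKey are the transformations, and collect() is the action that triggers the work.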


In [1]:
import findspark

In [2]:
findspark.init()

In [3]:
import pyspark

In [4]:
sc = pyspark.SparkContext(appName="myAppName")

In [5]:
text_file = sc.parallelize(["hello","hello world"])

In [6]:
counts = text_file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)

In [7]:
for line in counts.collect():
    print(line)

('hello', 2)
('world', 1)


Cell 5 defines a base RDD, text_file, by parallelizing an existing Python list. Cell 6 defines counts as the result of three transformations: flatMap splits each line into words, map pairs each word with the count 1, and reduceByKey sums the counts for each word. Cell 7 prints all elements of counts by calling collect(). collect() returns the entire RDD to the driver program as a Python list, so use it only when the data are expected to fit in memory. For more RDD APIs, see the [RDD programming guide](http://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds).
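
To see what each transformation in the word-count pipeline produces, the steps can be traced on the same input in plain Python. This is a minimal single-machine sketch of the logic, not how Spark executes it: Spark distributes the work across partitions, and the variable names below (`lines`, `words`, `pairs`, `add`) are assumptions chosen for illustration.

```python
from itertools import chain

lines = ["hello", "hello world"]

# flatMap step: split every line, then flatten the per-line lists into one list.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# map step: turn each word into a (key, value) pair.
pairs = [(word, 1) for word in words]

# reduceByKey step: merge the values of equal keys with the same function
# the notebook passes to reduceByKey.
add = lambda a, b: a + b
counts = {}
for word, n in pairs:
    counts[word] = add(counts[word], n) if word in counts else n

print(words)
print(pairs)
print(counts)
```

The final dictionary holds the same word totals as the pairs printed by the notebook.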