# Spark and Python

This Jupyter Notebook uses the *pyspark* library. It serves as reference code for the Big Data field and involves Amazon Web Services.

### Creating a SparkContext

Importing *SparkContext* from the *pyspark* library:

In [1]:
from pyspark import SparkContext

### Instantiating SparkContext object

SparkContext represents the connection to a Spark cluster, and can be used to create an RDD and broadcast variables on that cluster.

Only one SparkContext can be active at a time, because of the way Spark is built.

In [2]:
sc = SparkContext()

## Basic Operations

The 'hello world' example here is simply reading a text file.
___

First, create a text file using the Jupyter Notebook cell magic for this: %%writefile <file_name>.txt

However, any text file can be read.

%%writefile text_file_example.txt
first line
second line
third line
fourth line

### Creating an RDD (Resilient Distributed Dataset)

Loading the text file using the **textFile** method of the SparkContext created above. This method reads a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and returns it as an RDD of Strings.

In [3]:
# create an RDD from the text file (one element per line)
textFileObject = sc.textFile('text_file_example.txt')

Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.

*textFileObject* is the RDD.

*sc* is the SparkContext that connects to a Spark cluster.

### Actions

An RDD has just been created with the textFile method, and operations such as counting the rows can now be performed on it.

RDDs have actions, which return values, and transformations, which return pointers to new RDDs. Here are a few actions:

#### Counting the number of elements in the RDD. Here, each line of the text file is an element:

In [4]:
textFileObject.count()

4

#### Grabbing the first line, i.e. the first element:

In [5]:
textFileObject.first()

'first line'

### Transformations

Transformations handle more complicated tasks. For example, the filter transformation returns a new RDD containing a subset of the items in the file.

#### Creating a sample transformation using the filter() method:

*This method (just like [Python's own filter function](https://www.w3schools.com/python/ref_func_filter.asp)) will only return elements that satisfy the condition.* 

Looking for lines that contain the word 'second'. *In this case, only one line should match.*

In [6]:
second_find = textFileObject.filter(lambda line: 'second' in line)

Note how fast that was to run! That is because RDDs are lazily evaluated: the transformation instructions don't actually execute until an action is performed. Transformations are therefore a kind of recipe.
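Python's built-in filter function, mentioned above, is lazy in a similar way, so the idea can be tried without a Spark cluster. The sketch below is only a plain-Python analogy, not Spark code: the list `lines` and the helper `contains_second` are made up for illustration, and `calls` simply records when the condition is actually evaluated.

```python
# Plain-Python analogy (no Spark needed): built-in filter() is lazy as well.
lines = ['first line', 'second line', 'third line', 'fourth line']
calls = []  # records every line the condition actually inspects


def contains_second(line):
    # Hypothetical helper: logs the call, then applies the same test as In [6].
    calls.append(line)
    return 'second' in line


lazy_result = filter(contains_second, lines)  # builds an iterator, runs nothing yet
calls_before = len(calls)                     # 0: the condition has not run

matches = list(lazy_result)                   # consuming the iterator does the work
calls_after = len(calls)                      # 4: every line was inspected

print(calls_before, calls_after, matches)     # 0 4 ['second line']
```

Building the filter costs almost nothing. The work happens only when the result is consumed, which plays the role that an action plays for an RDD.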

#### Taking a look at what *second_find* is:

In [7]:
second_find

PythonRDD[4] at RDD at PythonRDD.scala:53

The RDD holds a recipe of instructions to follow, but it doesn't actually execute them until an action is requested.
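To make the recipe idea concrete, here is a minimal toy model in plain Python. `LazyDataset` is a hypothetical class written for this note. It is not part of pyspark and says nothing about how Spark is implemented internally. It only mimics the observable behaviour: a transformation records a step, and an action runs the recorded steps.

```python
class LazyDataset:
    """Hypothetical toy class (not pyspark): stores a recipe of steps
    and applies it only when an action is called."""

    def __init__(self, items, recipe=()):
        self._items = items
        self._recipe = tuple(recipe)  # recorded conditions, not yet executed

    def filter(self, condition):
        # Transformation: return a new dataset with one more recorded step.
        return LazyDataset(self._items, self._recipe + (condition,))

    def collect(self):
        # Action: only now is the recipe applied to the data.
        result = list(self._items)
        for condition in self._recipe:
            result = [item for item in result if condition(item)]
        return result

    def count(self):
        # Action: runs the recipe and counts the surviving elements.
        return len(self.collect())


lines = LazyDataset(['first line', 'second line', 'third line', 'fourth line'])
second_only = lines.filter(lambda line: 'second' in line)  # nothing runs here

print(second_only.collect())  # ['second line']
print(second_only.count())    # 1
```

As in this notebook, calling `filter` returns a new object immediately, and the original dataset is left unchanged.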

#### Performing an action on the transformation:

In [8]:
second_find.collect()

['second line']

#### Performing another action on the transformation:

In [9]:
second_find.count()

1

Notice how transformations display no output and are not run until an action is called.