In [None]:
"""
reference videos:
    https://www.youtube.com/watch?v=5dARTeE6OpU
    https://www.youtube.com/watch?v=e5ol7oyKV0A


reference blogs:
    https://www.edureka.co/blog/pyspark-rdd/
    

"""

In [None]:

"""

Clustered computing: Collection of resources of multiple machines

Parallel computing: Simultaneous computation

Distributed computing: Collection of nodes (networked computers) that run in parallel

Batch processing: Breaking the job into small pieces and running them on individual machines

Real-time processing: Immediate processing of data

"""

In [None]:
"""
Features of Apache Spark framework

Distributed cluster computing framework

Efficient in-memory computations for large data sets

Lightning fast data processing framework

Provides support for Java, Scala, Python, R and SQL
"""

In [None]:
"""
Apache Spark as a managed cloud service => Google Dataproc / Amazon Elastic MapReduce (EMR)
"""

In [None]:
"""
Spark modes of deployment

    Local mode: Single machine such as your laptop
        Local mode is convenient for testing, debugging and demonstration
        
    Cluster mode: Set of pre-defined machines
        Good for production
        
    Workflow: Local -> clusters
    
    No code change necessary
"""

In [None]:
"""
Spark is a platform for cluster computing. Spark lets you spread data and computations over 
clusters with multiple nodes (think of each node as a separate computer). Splitting up your 
data makes it easier to work with very large datasets because each node only works with a 
small amount of data.

As each node works on its own subset of the total data, it also carries out a part of the total 
calculations required, so that both data processing and computation are performed in parallel 
over the nodes in the cluster. Parallel computation can make certain types of programming tasks 
much faster, especially those that can be split into independent pieces.

However, with greater computing power comes greater complexity.

Deciding whether or not Spark is the best solution for your problem takes some experience, but you 
can consider questions like:

    Is my data too big to work with on a single machine?
    
    Can my calculations be easily parallelized?
"""

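The idea that each node works on its own subset and the partial results are then combined can be sketched without Spark at all. The following is a minimal plain-Python illustration, assuming worker threads stand in for cluster nodes; the helper names `chunk` and `partial_sum` are made up for this sketch and are not part of any Spark API.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(data, n_chunks):
    """Split data into n_chunks contiguous slices of near-equal size."""
    size, extra = divmod(len(data), n_chunks)
    slices, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        slices.append(data[start:end])
        start = end
    return slices


def partial_sum(values):
    """Work done by one 'node': sum of squares over its own subset."""
    return sum(v * v for v in values)


data = list(range(1, 101))
slices = chunk(data, n_chunks=4)

# Each slice is processed independently, then the partial results are combined
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, slices))

total = sum(partials)
print(partials, total)
```

The calculation parallelizes easily because the partial sums do not depend on each other, which is the kind of problem the two questions above are meant to identify.
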
In [None]:
"""
Using Spark in Python

The first step in using Spark is connecting to a cluster.

In practice, the cluster will be hosted on a remote machine that's connected to all other nodes. 
There will be one computer, called the master, that manages splitting up the data and the computations. 

The master is connected to the rest of the computers in the cluster, which are called workers. The master 
sends the workers data and calculations to run, and they send their results back to the master.

When you're just getting started with Spark it's simpler to just run a cluster locally. 

Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor 
takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to.

An object holding all these attributes can be created with the SparkConf() constructor. Take a look at 
the documentation for all the details!

http://spark.apache.org/docs/2.1.0/api/python/pyspark.html

"""

In [None]:
"""
SparkContext is the entry point to Apache Spark functionality. The most important step 
of any Spark driver application is to create a SparkContext. It lets your Spark 
application access the Spark cluster through a resource manager (for example YARN or Mesos).
"""

from pyspark import SparkContext

# Create a SparkContext

sc = SparkContext.getOrCreate()


# Verify SparkContext
print(sc)


# Print Spark version
print(sc.version)

# Print the Python version used by Spark
print(sc.pythonVer)

# Print the master URL the context is connected to
print(sc.master)

"""
You'll probably notice that code takes longer to run than you might expect. This is because Spark is some 
serious software. It takes more time to start up than you might be used to. You may also find that running 
simpler computations might take longer than expected. That's because all the optimizations that Spark has 
under its hood are designed for complicated operations with big data sets. That means that for simple or 
small problems Spark may actually perform worse than some other solutions!
"""

In [None]:
"""
Using DataFrames

Spark's core data structure is the Resilient Distributed Dataset (RDD). This is a low level object 
that lets Spark work its magic by splitting data across multiple nodes in the cluster. However, RDDs 
are hard to work with directly, so in this course you'll be using the Spark DataFrame abstraction built 
on top of RDDs.


The Spark DataFrame was designed to behave a lot like a SQL table (a table with variables in the columns 
and observations in the rows). Not only are they easier to understand, DataFrames are also more optimized 
for complicated operations than RDDs.

When you start modifying and combining columns and rows of data, there are many ways to arrive at the 
same result, but some often take much longer than others. When using RDDs, it's up to the data scientist 
to figure out the right way to optimize the query, but the DataFrame implementation has much of this 
optimization built in!

To start working with Spark DataFrames, you first have to create a SparkSession object from your SparkContext. 
You can think of the SparkContext as your connection to the cluster and the SparkSession as your interface 
with that connection.

"""

In [None]:
# Create a SparkSession
"""
SparkSession is the entry point to Spark SQL. It is one of the very first objects you 
create while developing a Spark SQL application. As a Spark developer, you create a 
SparkSession using the SparkSession.builder method (that gives you access to Builder 
API that you use to configure the session).
"""

from pyspark.sql import SparkSession

spark = (SparkSession.builder
                  .appName("Apache PySpark Intro")
                  .getOrCreate())

# Calling getOrCreate() again returns the session that already exists
spark = SparkSession.builder.getOrCreate()

# Print spark
print(spark)

In [None]:
"""
Your SparkSession has an attribute called catalog which gives access to the tables and other metadata 
known to the session. This attribute has a few methods for extracting different pieces of information.

One of the most useful is the .listTables() method, which returns a list with one entry per table 
registered in the catalog; each entry carries the table's name along with other details.

Catalog is a new API in Spark 2.0 that allows us to interact with the metadata of Spark SQL. This is a much 
better interface to metadata compared to earlier versions of Spark.
"""

# Print the tables in the catalog
print(spark.catalog.listTables())

In [None]:
"""
Sometimes it makes sense to take a Spark table and work with it locally using a tool like pandas. 
Spark DataFrames make that easy with the .toPandas() method. Calling this method on a Spark DataFrame 
returns the corresponding pandas DataFrame.
"""

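# spark_df is assumed to be an existing Spark DataFrame.
# toPandas() collects all of its rows to the driver, so use it on small results only.
pandas_df = spark_df.toPandas()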

In [None]:
"""
Put some Spark in your data

Put a pandas DataFrame into a Spark cluster! The SparkSession class has a method for this as well.
The .createDataFrame() method takes a pandas DataFrame and returns a Spark DataFrame.

The output of this method is stored locally, not in the SparkSession catalog. This means that you can 
use all the Spark DataFrame methods on it, but you can't access the data in other contexts.

For example, a SQL query (using the .sql() method) that references your DataFrame will throw an error. 
To access the data in this way, you have to save it as a temporary table.

You can do this using the .createTempView() Spark DataFrame method, which takes as its only argument 
the name of the temporary table you'd like to register. This method registers the DataFrame as a table in 
the catalog, but as this table is temporary, it can only be accessed from the specific SparkSession used to 
create the Spark DataFrame.

There is also the method .createOrReplaceTempView(). This safely creates a new temporary table if nothing 
was there before, or updates an existing table if one was already defined. You'll use this method to avoid 
running into problems with duplicate tables.
"""

import pandas as pd
import numpy as np

# Create pd_temp
pd_temp = pd.DataFrame(np.random.random(10))

# Create spark_temp from pd_temp
spark_temp = spark.createDataFrame(pd_temp)

# Examine the tables in the catalog
print(spark.catalog.listTables())

# Add spark_temp to the catalog
spark_temp.createOrReplaceTempView("temp")

# Examine the tables in the catalog again
print(spark.catalog.listTables())

In [None]:
rdd = sc.parallelize([1,2,3,4,5])

In [None]:
# A string is an iterable of characters, so this creates an RDD of single characters
rdd = sc.parallelize("Hello world")

In [1]:
rdd = sc.textFile("./data/data.txt")

In [None]:
# parallelize() takes the number of partitions as numSlices (minPartitions belongs to textFile())
rdd = sc.parallelize(range(10), numSlices = 6)

In [None]:
rdd = sc.textFile("./data/data.txt", minPartitions = 6)

In [None]:
# Check the number of partitions in rdd

print("Number of partitions in rdd is", rdd.getNumPartitions())

In [None]:
"""
Transformations create new RDDs
Actions perform computation on the RDDs
"""

# Map Transformation
rdd = sc.parallelize([1,2,3,4])
rdd_map = rdd.map(lambda x: x * x)


# Filter Transformation
rdd = sc.parallelize([1,2,3,4])
rdd_filter = rdd.filter(lambda x: x > 2)

# flatMap Transformation
rdd = sc.parallelize(["hello world", "how are you"])
rdd_flatmap = rdd.flatMap(lambda x: x.split(" "))


# Union Transformation
input_rdd = sc.textFile("logs.txt")
error_rdd = input_rdd.filter(lambda x: "error" in x.split())
warnings_rdd = input_rdd.filter(lambda x: "warnings" in x.split())
combined_rdd = error_rdd.union(warnings_rdd)


In [None]:
# RDD actions: operations that return a value after running a computation on the RDD

# collect() returns all the elements of the dataset as a list
rdd_map.collect()


# take(N) returns a list with the first N elements of the dataset
rdd_map.take(2)

# first() returns the first element of the RDD
rdd_map.first()


# count() returns the number of elements in the RDD
rdd_map.count()
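
The split between transformations and actions means that work is only done when an action asks for a result. The following plain-Python analogy uses the lazy built-ins map() and filter() in place of RDD transformations and list() in place of collect(). It is an illustration of the idea, not Spark code, and the `calls` list exists only to record when work happens.

```python
calls = []


def square(x):
    calls.append(x)  # record that the function really ran
    return x * x


data = [1, 2, 3, 4]

# Like rdd.map(...) and rdd.filter(...): these only describe the computation
lazy_squares = map(square, data)
lazy_big = filter(lambda v: v > 2, lazy_squares)
work_before_action = len(calls)

# Like collect(): this is the point where the computation runs
result = list(lazy_big)
work_after_action = len(calls)

print(work_before_action, work_after_action, result)
```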

In [None]:
"""
Introduction to pair RDDs in PySpark

    Real life datasets are usually key/value pairs
    Each row is a key and maps to one or more values
    PairRDD is a special data structure to work with this kind of dataset
    PairRDD: Key is the identifier and value is the data
    
Creating pair RDDs

    Two common ways to create pair RDDs
        From a list of key-value tuples
        From a regular RDD

"""

In [None]:
# Get the data into key/value form for paired RDD

# Method 1
my_tuple = [('Sam', 23), ('Mary', 34), ('Peter', 25)]

pairRDD_tuple = sc.parallelize(my_tuple)



# Method 2
my_list = ['Sam 23', 'Mary 34', 'Peter 25']

regularRDD = sc.parallelize(my_list)

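# Split each string once, then keep the name as key and the age (converted to int) as value
pairRDD_RDD = regularRDD.map(lambda s: s.split(' ')).map(lambda p: (p[0], int(p[1])))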

In [None]:
# reduceByKey() transformation

    # reduceByKey() transformation combines values with the same key
    # It runs parallel operations for each key in the dataset
    # It is a transformation and not an action

regularRDD = sc.parallelize([("Messi", 23), ("Ronaldo", 34),
                                ("Neymar", 22), ("Messi", 24)])

pairRDD_reducebykey = regularRDD.reduceByKey(lambda x,y : x + y)

pairRDD_reducebykey.collect()
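
What reduceByKey() computes can be emulated in plain Python. The sketch below assumes the player data above is split over two hypothetical partitions. Values are first combined inside each partition and the per-partition results are then merged by key. The helper `reduce_by_key` is made up for this illustration and is not a Spark function.

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key using func (plain-Python stand-in)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


# Two hypothetical partitions of the player data
partition_1 = [("Messi", 23), ("Ronaldo", 34)]
partition_2 = [("Neymar", 22), ("Messi", 24)]


def add(x, y):
    return x + y


# Step 1: combine values within each partition
local_1 = reduce_by_key(partition_1, add)
local_2 = reduce_by_key(partition_2, add)

# Step 2: merge the per-partition results by key
totals = reduce_by_key(list(local_1.items()) + list(local_2.items()), add)
print(sorted(totals.items()))
```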

In [None]:
# sortByKey() transformation

    # sortByKey() operation orders pairRDD by key
    # It returns an RDD sorted by key in ascending or descending order

pairRDD_reducebykey_rev = pairRDD_reducebykey.map(lambda x: (x[1], x[0]))

pairRDD_reducebykey_rev.sortByKey(ascending=False).collect()

In [None]:
# groupByKey() transformation

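    # groupByKey() groups all the values with the same key in the pairRDD

# Illustrative sample data (assumed): (region, airport code) pairs
airports = [("US", "JFK"), ("UK", "LHR"), ("FR", "CDG"), ("US", "SFO")]
regularRDD = sc.parallelize(airports)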

pairRDD_group = regularRDD.groupByKey().collect()

for cont, air in pairRDD_group:
    print(cont, list(air))
    
    

In [None]:
# join() transformation

    # join() transformation joins the two pairRDDs based on their key
    
RDD1 = sc.parallelize([("Messi", 34),("Ronaldo", 32),("Neymar", 24)])

RDD2 = sc.parallelize([("Ronaldo", 80),("Neymar", 120),("Messi", 100)])    

RDD1.join(RDD2).collect()
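
The shape of the join() result, one (key, (left_value, right_value)) tuple per matching key, can be reproduced in plain Python. This is a sketch of an inner join over lists of pairs; `inner_join` is a made-up helper, and the result is sorted only to make the printed order predictable.

```python
from collections import defaultdict


def inner_join(left, right):
    """Pair up values from left and right that share a key."""
    right_values = defaultdict(list)
    for key, value in right:
        right_values[key].append(value)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_values.get(key, [])]


left = [("Messi", 34), ("Ronaldo", 32), ("Neymar", 24)]
right = [("Ronaldo", 80), ("Neymar", 120), ("Messi", 100)]

joined = sorted(inner_join(left, right))
print(joined)
```

Keys that appear on only one side produce no output row, which is the behaviour of an inner join.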

In [None]:
# reduce() action

    # reduce(func) action is used for aggregating the elements of a regularRDD
    # The function should be commutative and associative

x = [1,3,4,6]
RDD = sc.parallelize(x)
RDD.reduce(lambda x, y : x + y)
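
The requirement that the function be commutative and associative follows from the way the data is partitioned: each partition is reduced on its own and the partial results are then reduced again. The plain-Python sketch below emulates that with two hypothetical ways of splitting the same list. Addition gives the same answer either way, while subtraction does not. `reduce_partitions` is a made-up helper.

```python
from functools import reduce


def reduce_partitions(partitions, func):
    """Reduce each partition first, then reduce the partial results."""
    partials = [reduce(func, part) for part in partitions]
    return reduce(func, partials)


# The same elements [1, 3, 4, 6], split in two different ways
split_a = [[1, 3], [4, 6]]
split_b = [[1], [3, 4, 6]]


def add(x, y):
    return x + y


def sub(x, y):
    return x - y


add_a = reduce_partitions(split_a, add)
add_b = reduce_partitions(split_b, add)
sub_a = reduce_partitions(split_a, sub)
sub_b = reduce_partitions(split_b, sub)

print(add_a, add_b)  # identical results
print(sub_a, sub_b)  # results depend on the split
```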

In [None]:
# saveAsTextFile() action

    # saveAsTextFile() action saves RDD into a text file inside a directory with 
    # each partition as a separate file

RDD.saveAsTextFile("tempFile")

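# coalesce(1) merges the RDD into one partition, so the output directory holds a single part file.
# A separate path is used because saveAsTextFile() fails if the target directory already exists.

RDD.coalesce(1).saveAsTextFile("tempFile_single")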


In [None]:
# Action Operations on pair RDDs

    # RDD actions available for PySpark pairRDDs
    # PairRDD actions leverage the key-value data


# countByKey() action

    # countByKey() is only available for RDDs of type (K, V)
    # countByKey() action counts the number of elements for each key

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])

for key, val in rdd.countByKey().items():
    print(key, val)


    
# collectAsMap() action

    # collectAsMap() returns the key-value pairs in the RDD as a dictionary
    
# Example of collectAsMap() on a simple tuple:

sc.parallelize([(1, 2), (3, 4)]).collectAsMap()
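
The difference between the two actions shows up when keys repeat. The plain-Python sketch below emulates both on a made-up list of pairs: counting keys ignores the values, while building a dictionary keeps only one value per key. In Spark, which value survives depends on the order in which the pairs are collected.

```python
from collections import Counter

pairs = [("a", 1), ("b", 1), ("a", 5)]

# Like countByKey(): number of elements per key, values are ignored
counts = Counter(key for key, _ in pairs)

# Like collectAsMap(): a dictionary holds one value per key,
# so for a repeated key the pair seen last overwrites the earlier one
as_map = dict(pairs)

print(dict(counts))
print(as_map)
```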

In [None]:
# RDD to DataFrame

# Create a list of tuples
sample_list = [('Mona',20), ('Jennifer',34), ('John',20), ('Jim',26)]

# Create an RDD from the list
rdd = sc.parallelize(sample_list)

# Create a PySpark DataFrame
names_df = spark.createDataFrame(rdd, schema=['Name', 'Age'])

# Check the type of names_df
print("The type of names_df is", type(names_df))


In [None]:
data_path = './data'
file_path = f'{data_path}/location_temp.csv'

sdf1 = (spark.read
       .format("csv")
       .option("header", "true")
       .load(file_path))

In [None]:
sdf1.head(5)

In [None]:
sdf1.show(5)

In [None]:
sdf1.count()

In [None]:
file_path = f'{data_path}/utilization.csv'

In [None]:
# CSV File does not have headers

sdf2 = (spark.read
        .format("csv")
        .option("header", "false")
        .option("inferSchema","true")
        .load(file_path))

In [None]:
sdf2.show(5)

In [None]:
sdf2.count()

In [None]:
sdf2 = (sdf2.withColumnRenamed("_c0", "event_datetime")
            .withColumnRenamed("_c1", "server_id")
            .withColumnRenamed("_c2", "cpu_utilization")
            .withColumnRenamed("_c3", "free_memory")
            .withColumnRenamed("_c4", "session_count"))

In [None]:
sdf2.show(5)

In [None]:
sdf3_json_file_path = f'{data_path}/location_temp.json'
sdf1.write.json(sdf3_json_file_path)

In [None]:
sdf4_json_file_path = f'{data_path}/utilization.json'
sdf2.write.json(sdf4_json_file_path)

In [None]:
!ls './data'

In [None]:
sdf3 = (spark.read
            .format("json")
            .load(sdf3_json_file_path))

In [None]:
sdf3.head(5)

In [None]:
sdf3.show(5)

In [None]:
sdf4 = (spark.read
            .format("json")
            .load(sdf4_json_file_path))

In [None]:
! ls './data/utilization.json'

In [None]:
sdf4.columns

In [None]:
sdf4.describe()

In [None]:
sdf4.describe().show()

In [None]:
sdf4.printSchema()

In [None]:
sdf4_sample = sdf4.sample(withReplacement=False, fraction=0.1)

In [None]:
sdf4_sort = sdf4_sample.sort('event_datetime')

In [None]:
sdf3.filter(sdf3["location_id"]=="loc0").show(3)

In [None]:
sdf3.filter(sdf3["location_id"]=="loc0").count()

In [None]:
sdf3.filter("location_id = 'loc1'").show(3)

In [None]:
sdf3.groupBy("location_id").count().show(3)

In [None]:
sdf3.orderBy("location_id").show(3)

In [None]:
(sdf3.groupby('location_id')
    .agg({'temp_celcius': 'mean'})
    .show(3))

In [None]:
(sdf3.groupby('location_id')
    .agg({'temp_celcius': 'max'})
    .show(3))

In [None]:
(sdf3.groupBy("location_id")
     .agg({'temp_celcius': 'mean'})
     .orderBy("location_id")
     .show(3))

In [None]:
sdf3.write.csv('./data/sdf3.csv')

In [None]:
! ls './data'

In [None]:
!ls './data/sdf3.csv'

In [None]:
# Part-file names contain a run-specific ID, so match them with a wildcard
! head ./data/sdf3.csv/part-*.csv

In [None]:
sdf3.write.json('./data/sdf3.json')

In [None]:
! ls './data'

In [None]:
!ls './data/sdf3.json'

In [None]:
# Part-file names contain a run-specific ID, so match them with a wildcard
! head ./data/sdf3.json/part-*.json