# The 3 V's of Big Data

The following are considered the three Vs of Big Data:

- Volume
- Velocity
- Variety

# Understanding SparkContext

A SparkContext is the entry point to Spark functionality, much like a key is to a car. When a Spark application runs, a driver program starts; it contains the main function, and the SparkContext is initialized there. In the PySpark shell, a SparkContext is created automatically and exposed as the variable `sc`, so you don't have to create one yourself. Outside the shell (for example, in a standalone script or a notebook), you create it explicitly from a `SparkConf`, as the following cell does.

In [1]:
from pyspark import SparkConf, SparkContext
# Create a SparkConf object to configure the SparkContext
conf = SparkConf().setAppName("YourAppName").setMaster("local[*]")

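# Create a SparkContext with the configured SparkConf object.
# getOrCreate() reuses an already-running context instead of raising an
# error when the cell is executed more than once.
sc = SparkContext.getOrCreate(conf=conf)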


# Print the version of SparkContext
print("The version of Spark Context in the PySpark shell is", sc.version)

# Print the Python version of SparkContext
print("The Python version of Spark Context in the PySpark shell is", sc.pythonVer)

# Print the master of SparkContext
print("The master of Spark Context in the PySpark shell is", sc.master)

The version of Spark Context in the PySpark shell is 3.5.0
The Python version of Spark Context in the PySpark shell is 3.8
The master of Spark Context in the PySpark shell is local[*]


# Interactive Use of PySpark

Spark ships with an interactive Python shell, the PySpark shell, in which PySpark is already available. The shell is useful for basic testing and debugging, and it is quite powerful. The easiest way to demonstrate this is with an exercise: here you'll load a simple collection of the numbers from 1 to 100 into the PySpark shell.

Note that we do not create a SparkContext object here, because the PySpark shell automatically creates one named `sc`.

In [2]:
# Create a Python range of numbers from 1 to 100 (the end of range() is exclusive)
numb = range(1, 101)

# Load the list into PySpark  
spark_data = sc.parallelize(numb)
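
To build some intuition for what it means to turn a local collection into a partitioned one, here is a minimal pure-Python sketch that splits the numbers 1 to 100 into contiguous chunks. The helper `split_into_partitions` and the choice of 4 partitions are assumptions made for illustration; this is not how Spark itself slices data, and nothing here runs on a cluster.

```python
def split_into_partitions(data, num_partitions):
    """Hypothetical helper: split a sequence into contiguous, near-equal chunks."""
    items = list(data)
    n = len(items)
    return [
        items[i * n // num_partitions:(i + 1) * n // num_partitions]
        for i in range(num_partitions)
    ]


# range(1, 101) covers 1 through 100 because the end value is exclusive
numbers = range(1, 101)

# Assumed partition count, chosen only for this illustration
partitions = split_into_partitions(numbers, 4)
partition_sizes = [len(p) for p in partitions]

print("Partition sizes:", partition_sizes)
print("First and last element:", partitions[0][0], partitions[-1][-1])
```

Each chunk could then be processed independently, which is the idea behind working on a collection in parallel.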

# Loading data in PySpark shell

In PySpark, we express computations as operations on distributed collections that are automatically parallelized across the cluster. In the previous exercise, you loaded a local collection as a parallelized collection; in this exercise, you'll load data from a local file in the PySpark shell.

In [5]:
# Load a local file into PySpark shell
lines = sc.textFile("dataset/Complete_Shakespeare.txt")
lines

dataset/Complete_Shakespeare.txt MapPartitionsRDD[6] at textFile at NativeMethodAccessorImpl.java:0

# Use of lambda with map()

The `map()` function in Python applies a given function to each item of an iterable (list, tuple, etc.). In Python 3 it returns a lazy iterator rather than a list, so the result is wrapped in `list()` when a list is needed. The general syntax is `map(fun, iter)`. We can also use lambda functions with `map()`.

In [6]:
my_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Print my_list in the console
print("Input list is", my_list)

# Square all numbers in my_list
squared_list_lambda = list(map(lambda x: x ** 2, my_list))

# Print the result of the map function
print("The squared numbers are", squared_list_lambda)

Input list is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
The squared numbers are [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
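
As a small supplementary sketch (the variable names are made up for illustration), the following shows two properties of `map()` in Python 3 that are easy to miss: the result is an iterator that can be consumed only once, and `map()` accepts several iterables when the function takes several arguments.

```python
numbers = [1, 2, 3, 4, 5]

# map() returns a lazy iterator, not a list
squares = map(lambda x: x ** 2, numbers)
is_list = isinstance(squares, list)

# The first pass consumes the iterator; a second pass finds nothing left
first_pass = list(squares)
second_pass = list(squares)

# With two iterables, the lambda receives one item from each
pair_sums = list(map(lambda a, b: a + b, [1, 2, 3], [10, 20, 30]))

print("Is the map result a list?", is_list)
print("First pass:", first_pass)
print("Second pass:", second_pass)
print("Pairwise sums:", pair_sums)
```

Converting to a list right away, as the cell above does, avoids the surprise of an exhausted iterator.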


# Use of lambda with filter()

Another function used extensively in Python is `filter()`. It takes a function and an iterable (such as a list) as arguments and keeps only the items for which the function returns a true value. Like `map()`, it returns a lazy iterator in Python 3, and it can be used with a lambda function.

In [7]:
my_list2 = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# Print my_list2 in the console
print("Input list is:", my_list2)

# Filter numbers divisible by 10
filtered_list = list(filter(lambda x: (x%10 == 0), my_list2))

# Print the numbers divisible by 10
print("Numbers divisible by 10 are:", filtered_list)

Input list is: [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]
Numbers divisible by 10 are: [10, 40, 60, 80]
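
For comparison, here is a short sketch (with illustrative variable names that are not part of the exercise) showing that a `filter()` call with a lambda gives the same result as a list comprehension with a condition, and that passing `None` as the function keeps only truthy items.

```python
values = [10, 21, 31, 40, 51, 60, 72, 80, 93, 101]

# filter() with a lambda: keep the even numbers
evens_filter = list(filter(lambda x: x % 2 == 0, values))

# The equivalent list comprehension
evens_comprehension = [x for x in values if x % 2 == 0]

# With None as the function, filter() drops falsy items
truthy_items = list(filter(None, [0, 1, "", "spark", None, [], [2]]))

print("Even numbers (filter):", evens_filter)
print("Even numbers (comprehension):", evens_comprehension)
print("Truthy items:", truthy_items)
```

Which form to use is largely a matter of style; the `filter()` form mirrors the way filtering with a lambda is written on Spark collections.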
