# Chapter 1

### Big Data


- Volume: Size of the data
- Variety: Different sources and formats
- Velocity: Speed of the data
- Clustered computing: Pooling the resources of multiple machines
- Parallel computing: Simultaneous computation on a single computer
- Distributed computing: A collection of nodes (networked computers) that run in parallel
- Batch processing: Breaking a job into small pieces and running them on individual machines
- Real-time processing: Immediate processing of data
- Big Data processing systems
    - Hadoop/MapReduce: Scalable, fault-tolerant framework written in Java (batch processing)
    - Apache Spark: General-purpose, fast cluster computing system (both batch and real-time data processing)
    - Note: Apache Spark is now generally preferred over Hadoop/MapReduce
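
The batch-processing idea above (split a job into pieces, run the pieces independently, combine the results) can be sketched in plain Python. This is only an illustration on one machine: the threads stand in for separate machines, and `split_into_chunks`, `process_chunk`, and the chunk size are hypothetical names and values chosen for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, chunk_size):
    """Break the full job into smaller pieces."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def process_chunk(chunk):
    """Work done independently on one piece (here: sum of squares)."""
    return sum(x * x for x in chunk)


data = list(range(1, 11))            # 1..10
chunks = split_into_chunks(data, 3)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]

# Each worker thread plays the role of one machine in a cluster
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(process_chunk, chunks))

total = sum(partial_results)  # Combine the partial results
print(partial_results)  # [14, 77, 194, 100]
print(total)            # 385
```

Frameworks such as Hadoop/MapReduce and Spark follow the same split, process, and combine pattern, but distribute the pieces over real machines and handle failures for you.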

### Spark

- General purpose data processing engine designed for big data.
- Written in Scala
- Spark is a platform for cluster computing.
- Spark lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer).
- Very large datasets are split into smaller pieces, so each node works with only a small amount of data.
- Data processing and computation are performed in parallel over the nodes in the cluster.
- However, with greater computing power comes greater complexity.
- Can be used for analytics, data integration, machine learning, and stream processing.
- Master and worker:
    - Master:
        - Connected to the rest of the computers in the cluster, which are called workers
        - Sends the workers data and calculations to run
    - Worker:
        - Sends its results back to the master
- Spark's core data structure is the Resilient Distributed Dataset (RDD)
- Spark DataFrames, an abstraction built on top of RDDs, are easier to work with than raw RDDs, and operations on DataFrames are automatically optimized.
- Spark DataFrames are immutable: every transformation returns a new DataFrame, so you must assign the result to a variable.
- You start from an entry point: `SparkSession` (DataFrame API) or `SparkContext` (RDD API)
- Two modes:
    - Local mode: a single computer
    - Cluster mode: a cluster of computers
    - You typically develop in local mode first and then deploy in cluster mode (no code change is required)
- Spark shell:
    - Interactive environment for running Spark jobs
    - Allows interacting with data on disk or in memory

### Lambda function

```
# General form: func_name = lambda inputs: return_expression

add = lambda a, b: a + b
add(3, 6)  # 9
```

### Map

```
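#### Core Python use case ####
# map(func_name, some_list)

items = [1, 2, 3, 4]
list(map(lambda x: x + 2, items))  # [3, 4, 5, 6]

#### pandas DataFrame application ####
import pandas as pd

# Illustrative data so the examples below run as written
df = pd.DataFrame({"name": ["James", "Jane"], "col": [1, 2]})

# Method 1: apply a function to every element of a column
df["col"].apply(lambda x: x + 1)
# Method 2: map each value through a lookup dictionary
genders = {"James": "Male", "Jane": "Female"}
df["gender"] = df["name"].map(genders)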
```

### Filter

```
# filter(boolean_func, some_list)

items = [1, 2, 3, 4]
list(filter(lambda x: x % 2 != 0, items))  # [1, 3]
```
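
The `list(...)` wrappers in the `map` and `filter` examples are needed because, in Python 3, both functions return lazy iterators rather than lists. A minimal sketch of chaining the two, reusing the same `items` list (the other variable names are illustrative):

```python
items = [1, 2, 3, 4]

# filter and map return lazy iterators; nothing is computed yet
odd_items = filter(lambda x: x % 2 != 0, items)
odd_plus_two = map(lambda x: x + 2, odd_items)

# list() consumes the iterators and materializes the result
result = list(odd_plus_two)
print(result)  # [3, 5]

# An equivalent list comprehension
same_result = [x + 2 for x in items if x % 2 != 0]
print(same_result)  # [3, 5]
```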

### Spark Context

```
from pyspark import SparkConf, SparkContext

# Create a SparkConf object to configure the SparkContext
conf = SparkConf().setAppName("YourAppName").setMaster("local[*]")

# Create a SparkContext with the configured SparkConf object
sc = SparkContext(conf=conf)


print(sc)            # Verify that the SparkContext was created
print(sc.version)    # Spark version
print(sc.pythonVer)  # Python version
print(sc.master)     # Master URL, e.g. local[*] in local mode

# Loading data
rdd = sc.parallelize([1, 2, 3, 4, 5])  # From a Python collection
rdd2 = sc.textFile("test.txt")         # From a text file, one element per line
```