# Chapter 1

### Spark

- Spark is a general-purpose data processing engine designed for big data.
- It is a platform for cluster computing.
- Spark lets you spread data and computations over a cluster with multiple nodes (think of each node as a separate computer).
- Very large datasets are split into smaller pieces, so each node works with only a small amount of data.
- Data processing and computation are performed in parallel across the nodes in the cluster.
- However, with greater computing power comes greater complexity.
- Spark can be used for analytics, data integration, machine learning, and stream processing.
- Master and workers:
    - Master:
        - Is connected to the rest of the computers in the cluster, which are called workers.
        - Sends the workers data and calculations to run.
    - Worker:
        - Workers run the calculations and send their results back to the master.
- Spark's core data structure is the Resilient Distributed Dataset (RDD).
- Rather than working with RDDs directly, it is usually easier to use the Spark DataFrame abstraction built on top of them; operations on DataFrames are automatically optimized.
- You start working with Spark through a `SparkSession` (the entry point for DataFrames) or a `SparkContext` (the entry point for RDDs).
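
The split, compute, and combine pattern in the master and worker bullets can be imitated on a single machine without Spark. The sketch below is only an analogy and not how Spark is implemented: threads from Python's standard `concurrent.futures` module stand in for workers, and the helper names `split_into_partitions` and `partial_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into roughly equal chunks, one per worker (hypothetical helper)."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum(partition):
    """Work done by one 'worker': it only sees its own small partition."""
    return sum(x * x for x in partition)


data = list(range(1, 11))                    # stands in for a very large dataset
partitions = split_into_partitions(data, 3)  # the 'master' splits the data

# The 'master' hands one partition to each 'worker' and collects the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(partial_sum, partitions))

total = sum(partial_results)                 # the 'master' combines the partial results
print(partitions)
print(partial_results, total)
```

Each worker returns only a small partial result, so the master never has to process the full dataset itself.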

### SparkSession

```python
from pyspark.sql import SparkSession

# Create a SparkSession, or reuse the active one if it already exists
spark = (
    SparkSession.builder
    .appName("Python Spark K-means example")
    .getOrCreate()
)

# Print the tables in the catalog
print(spark.catalog.listTables())

# Load a CSV file into a DataFrame ("file.csv" is a placeholder path)
df = spark.read.csv("file.csv", header=True, inferSchema=True)

# Show the first 20 rows of the DataFrame (the default for show())
df.show()

# Print the schema of the DataFrame
df.printSchema()

# Perform transformations on the DataFrame as needed,
# for example filtering rows or computing aggregations

# Stop the SparkSession to release its resources
spark.stop()
```

### SparkContext

```python
from pyspark import SparkConf, SparkContext

# Create a SparkConf object to configure the SparkContext;
# "local[*]" runs Spark locally with one worker thread per CPU core
conf = SparkConf().setAppName("YourAppName").setMaster("local[*]")

# Create a SparkContext with the configured SparkConf object
sc = SparkContext(conf=conf)

# Verify the SparkContext
print(sc)

# Print the Spark version
print(sc.version)

# Stop the SparkContext when finished
sc.stop()
```