#### Setting up the Spark environment

* Apache Spark is an open-source, distributed computing framework designed for fast big data processing. It simplifies working with massive datasets by providing easy-to-use APIs, in-memory computation, and support for multiple languages like Python, Scala, Java, and R.

### What is Apache Spark?

Definition: A cluster computing framework for large-scale data processing.

Purpose: Handles big data workloads faster than traditional systems like Hadoop MapReduce.

Key Feature: In-memory computation, which reduces disk I/O and speeds up processing.

### Spark Architecture (Step by Step)
**Driver Program**: The main application that defines transformations and actions on data.   
Coordinates tasks across the cluster.

**Cluster Manager**: Allocates resources across machines (examples: YARN, Mesos, or Spark's built-in standalone manager).

**Executors**: Worker processes running on cluster nodes.
Execute tasks assigned by the driver.

**Resilient Distributed Dataset (RDD)**: Core data structure in Spark.   
Immutable, distributed collection of objects that can be processed in parallel.
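
To make the roles above concrete, here is a minimal pure-Python sketch of the idea, not Spark's actual API. The helper names (`split_into_partitions`, `run_on_partition`, `parallel_map`) are hypothetical, and a thread pool on one machine stands in for executors on cluster nodes. It only illustrates how a "driver" can split data into immutable partitions, hand each one to a worker, and collect a new collection without modifying the original.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Split records into immutable partitions (tuples), round-robin style."""
    return tuple(
        tuple(records[i::num_partitions]) for i in range(num_partitions)
    )


def run_on_partition(partition, func):
    """What one 'executor' does: apply func to every record in its partition."""
    return tuple(func(record) for record in partition)


def parallel_map(partitions, func, max_workers=2):
    """What the 'driver' does: send each partition to a worker, gather results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input partitions.
        return tuple(pool.map(lambda p: run_on_partition(p, func), partitions))


partitions = split_into_partitions([1, 2, 3, 4, 5, 6], num_partitions=3)
doubled = parallel_map(partitions, lambda x: x * 2)

# The original partitions are unchanged; the map produced a new collection.
collected = sorted(value for part in doubled for value in part)
print(collected)
```

In real Spark, the partitions live on different machines and the cluster manager decides where the executors run, but the pattern of producing a new dataset from an unchanged one is the same.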

#### Start a Spark Session
* Every PySpark program begins with a SparkSession (the entry point to Spark).
* appName: Name of your application.
* getOrCreate(): Creates a new session or reuses an existing one.

In [0]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MySparkApp").getOrCreate()

#### Create a Simple Dataset
You can create a DataFrame directly from a Python list of tuples and a list of column names.

In [0]:
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')]
columns = ['id', 'name']
dataframe = spark.createDataFrame(data, columns)
dataframe.show()

#### Perform Basic Operations


In [0]:
# Select a specific column
dataframe.select('name').show()

In [0]:
# Filter rows: keep only those where id is greater than 2
dataframe.filter(dataframe.id > 2).show()

In [0]:
# Stop the Spark session when you are done; uncomment the next line to release its resources.
# spark.stop()