# Lesson 2 - Setting Up PySpark

**Objective:** To learn how to install and configure PySpark in different environments, understand the primary entry points (`SparkSession`, `SparkContext`), write and execute a basic PySpark script, and get familiar with fundamental configuration settings.

---

**1. Installing PySpark (Local, Cloud, Databricks)**

Getting PySpark ready involves installing the necessary software and potentially configuring environment variables. The method varies depending on where you intend to run Spark.

**a) Local Installation (on your personal machine)**

This is ideal for learning, development, and testing on smaller datasets. Spark will run in "local mode," using the cores on your single machine.

*   **Prerequisites:**
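    *   **Python:** You need a compatible Python 3 installation; the minimum supported version depends on the PySpark release. Check with `python --version` (or `python3 --version`).
    *   **Java Development Kit (JDK):** Spark runs on the Java Virtual Machine (JVM), so a JDK must be installed. The supported Java versions depend on the Spark release: Spark 3.x runs on Java 8 and 11 (and Java 17 from Spark 3.3), while Spark 4.x requires Java 17 or later. You can download a JDK from Oracle or use an OpenJDK build.
        *   *Verification:* Check with `java -version`.
        *   *Environment Variable:* Set the `JAVA_HOME` environment variable to point to your JDK installation directory. Spark can fall back to the `java` found on your `PATH`, but setting `JAVA_HOME` explicitly avoids picking up the wrong JDK. The process differs for Windows, macOS, and Linux.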
            *   Example (Linux/macOS - add to `.bashrc` or `.zshrc`): `export JAVA_HOME=/path/to/your/jdk`
            *   Example (Windows): Set via System Properties -> Environment Variables.
*   **Installation using `pip`:** This is the simplest way to get PySpark. It bundles the necessary Spark components.
    ```bash
    pip install pyspark
    ```
    *   *Optional:* You might also want `findspark` if you have a separate Spark installation and want Python to find it easily, but `pip install pyspark` usually suffices for basic local use.
*   **Verification:** Open your terminal or command prompt and run the PySpark interactive shell:
    ```bash
    pyspark
    ```
    If it launches successfully, you'll see the Spark logo and a `spark` variable available in the shell. Type `quit()` to exit.
*   **Pros:** Easy setup, free, great for learning fundamentals.
*   **Cons:** Limited by your machine's resources (CPU, RAM), not suitable for large-scale data processing.
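
The prerequisites above can be checked from Python before launching Spark. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and the report keys are illustrative assumptions, and the script only checks that the pieces are present, not that their versions are compatible with each other.

```python
import importlib.util
import os
import shutil
import sys


def check_prerequisites():
    """Collect basic facts about the local PySpark prerequisites."""
    java_home = os.environ.get("JAVA_HOME")
    return {
        # True when the interpreter is Python 3
        "python_ok": sys.version_info.major == 3,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        # Path of the `java` launcher found on PATH, or None if it is missing
        "java_on_path": shutil.which("java"),
        # JAVA_HOME is reported only if it points at an existing directory
        "java_home": java_home if java_home and os.path.isdir(java_home) else None,
        # True when the pyspark package is importable from this environment
        "pyspark_installed": importlib.util.find_spec("pyspark") is not None,
        # Path of the `spark-submit` launcher on PATH, or None if it is missing
        "spark_submit": shutil.which("spark-submit"),
    }


if __name__ == "__main__":
    for name, value in check_prerequisites().items():
        print(f"{name}: {value}")
```

A value of `None` or `False` in the report points at the step that still needs attention, for example a missing JDK or a `pip install pyspark` that went into a different Python environment.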

**b) Cloud Installation (Managed Services)**

Cloud providers offer managed Spark services that handle cluster creation, management, and scaling.

*   **Examples:**
    *   **AWS:** Elastic MapReduce (EMR)
    *   **Google Cloud:** Dataproc
    *   **Microsoft Azure:** HDInsight, Azure Synapse Analytics, Azure Databricks
*   **Process:**
    1.  Log in to your cloud provider's console.
    2.  Navigate to the relevant managed Spark service (e.g., EMR, Dataproc).
    3.  Configure and launch a cluster: Choose instance types (master/worker nodes), number of nodes, Spark version, etc. PySpark is typically pre-installed.
    4.  Connect to the cluster: This varies. Options often include:
        *   SSH into the master node and use `pyspark` shell or `spark-submit`.
        *   Use web-based notebook interfaces provided by the service (like JupyterHub, Zeppelin).
        *   Submit jobs programmatically via APIs or SDKs.
*   **Pros:** Scalable (pay for what you use), managed infrastructure (less Ops overhead), integrated with other cloud services.
*   **Cons:** Cost involved, can have a steeper initial learning curve for cluster configuration, potential vendor lock-in.
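
When you connect to the master node and use `spark-submit`, the command usually targets the cluster's resource manager rather than `local[*]`. The sketch below assembles such a command in Python; it assumes a YARN-managed cluster (the default on EMR and Dataproc), and the helper name `build_submit_command`, the script name, and the application name are hypothetical.

```python
import shlex


def build_submit_command(script, app_name, conf=None, master="yarn", deploy_mode="cluster"):
    """Return the spark-submit argument list for running `script` on a cluster."""
    command = [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--name", app_name,
    ]
    # Each Spark property becomes its own `--conf key=value` pair
    for key, value in sorted((conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    command.append(script)
    return command


command = build_submit_command(
    "my_job.py",                 # hypothetical script name
    app_name="nightly-report",   # hypothetical application name
    conf={"spark.executor.memory": "2g"},
)
# Show the command as it would be typed in a shell
print(shlex.join(command))
```

On the master node the list could be passed to `subprocess.run(command, check=True)`. Building the arguments as a list avoids shell-quoting problems when values contain spaces or brackets.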

**c) Databricks**

Databricks is a popular, commercial unified analytics platform built by the original creators of Apache Spark. It provides an optimized Spark environment with a collaborative notebook interface.

*   **Setup:**
    1.  Sign up for a Databricks account (a free Community Edition is offered for learning; full workspaces run on a cloud provider such as AWS, Azure, or GCP).
    2.  Log in to your Databricks workspace.
    3.  Create a Cluster: Use the UI to specify Spark version, node types, autoscaling options, etc. Start the cluster.
    4.  Create a Notebook: Create a new notebook (Python is a common choice).
    5.  Attach the notebook to your running cluster.
*   **PySpark Access:** In a Databricks notebook attached to a cluster, PySpark (specifically the `SparkSession` object named `spark`) is **already initialized and available** for you to use directly. There's no separate installation needed within the notebook environment.
*   **Pros:** Extremely easy setup, optimized Spark runtime, collaborative environment, integrated features (MLflow, Delta Lake).
*   **Cons:** Commercial product (cost beyond free tier), platform-specific features might lead to lock-in.

**Comparison Table:**

| Feature             | Local Installation           | Cloud Managed Service (EMR, Dataproc) | Databricks                  |
| :------------------ | :------------------------- | :------------------------------------ | :-------------------------- |
| **Ease of Setup**   | Moderate (JDK, pip)        | Moderate to Complex (Cloud config)  | Easy (Web UI)               |
| **Scalability**     | Low (Single Machine)       | High (Cluster)                        | High (Optimized Cluster)    |
| **Cost**            | Free (Hardware cost only)  | Pay-as-you-go (Compute, Storage)    | Pay-as-you-go (DBUs, Cloud) |
| **Management**      | Manual                     | Partially Managed                     | Fully Managed               |
| **Primary Use Case**| Learning, Small Dev/Test   | Production Workloads, Large Data    | Dev, Production, Collaboration |
| **PySpark Access**  | Install via `pip`, run shell/script | Pre-installed on cluster nodes      | Pre-configured in notebooks |

---

**2. SparkSession and SparkContext**

These are the primary entry points for interacting with Spark.

*   **SparkContext (`sc`)**
    *   **Role:** The original (Spark 1.x) main entry point to Spark functionality. It represents the connection to a Spark cluster and is used to create RDDs, accumulators, and broadcast variables.
    *   **Access:** In older Spark versions or when working directly with RDDs, you would create a `SparkContext`. In modern Spark, the `SparkSession` manages the `SparkContext`. You can access the underlying `SparkContext` from a `SparkSession` using `spark.sparkContext`.
    *   **Key Functions:** Creating RDDs (`parallelize`, `textFile`), accessing cluster information.

*   **SparkSession (`spark`)**
    *   **Role:** Introduced in Spark 2.0 as a **unified entry point** for Spark functionality. It subsumes `SQLContext` and `HiveContext` from earlier versions and is the entry point for Structured Streaming (`spark.readStream`); the older DStream API still uses its own `StreamingContext`. It's the preferred way to interact with Spark now.
    *   **Functionality:**
        *   Creates DataFrames (the typed Dataset API is available only in Scala and Java).
        *   Reads data from various sources (JSON, CSV, Parquet, JDBC, etc.).
        *   Executes SQL queries (`spark.sql(...)`).
        *   Provides access to configuration settings.
        *   Manages the underlying `SparkContext`.
    *   **Instantiation (in a standalone script):** You typically create a `SparkSession` using the builder pattern.
        ```python
        from pyspark.sql import SparkSession

        # Create a SparkSession
        spark = (
            SparkSession.builder
            .appName("MyFirstApp")  # Optional: set the application name
            .master("local[*]")     # Optional: run locally using all available cores
            .getOrCreate()
        )

        # Now you can use 'spark' to create DataFrames, etc.
        print(f"SparkSession available. Spark version: {spark.version}")

        # Access the underlying SparkContext if needed
        sc = spark.sparkContext
        print(f"SparkContext available: {sc.appName}, Master: {sc.master}")

        # Don't forget to stop the session when done in a script
        spark.stop()
        ```
    *   **`getOrCreate()`:** This method either gets an existing `SparkSession` or creates a new one if none exists. This prevents issues if multiple parts of your code try to create a session.
    *   **In Interactive Shells/Databricks:** The `pyspark` shell and Databricks notebooks typically create a `SparkSession` instance named `spark` for you automatically.
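
The RDD functions listed under `SparkContext` can be sketched as follows. This is a minimal illustration that takes an existing context as its argument (for example `spark.sparkContext`), so it works the same way in a script, the `pyspark` shell, or a notebook; the helper name `squares_via_rdd` is hypothetical.

```python
def squares_via_rdd(sc, numbers):
    """Distribute `numbers` as an RDD, square each element, and collect the result."""
    rdd = sc.parallelize(numbers)        # create an RDD from a local Python list
    squared = rdd.map(lambda n: n * n)   # lazy transformation; no job runs yet
    return squared.collect()             # action: runs the job and returns a list
```

With a running session, `squares_via_rdd(spark.sparkContext, [1, 2, 3])` returns `[1, 4, 9]`. Only `collect()` triggers a job; `parallelize` and `map` merely describe the computation.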

---

**3. First PySpark Script: Hello World**

Let's write a simple script that creates a DataFrame and prints its content. This verifies your setup and shows basic DataFrame manipulation.

*   **Objective:** Create a DataFrame with names and ages, then display it.

*   **Code (`hello_spark.py`):**
    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import Row # Often used to create Rows for DataFrames

    # 1. Create a SparkSession
    # Use builder pattern; master("local[*]") runs locally using all cores
    # appName is a name for your application shown in the Spark UI
    spark = SparkSession.builder \
        .appName("HelloWorld") \
        .master("local[*]") \
        .getOrCreate()

    print(f"SparkSession created. Spark version: {spark.version}")

    # 2. Create Sample Data
    # Using a list of Row objects
    data = [
        Row(name="Alice", age=30),
        Row(name="Bob", age=25),
        Row(name="Charlie", age=35)
    ]

    # 3. Create a DataFrame
    # Spark can infer the schema from the Row objects
    df = spark.createDataFrame(data)

    # 4. Show DataFrame Content and Schema
    print("DataFrame Content:")
    df.show()

    print("DataFrame Schema:")
    df.printSchema()

    # 5. Perform a simple operation (e.g., select a column)
    print("Selecting only the 'name' column:")
    df.select("name").show()

    # 6. Stop the SparkSession (important in scripts)
    spark.stop()
    print("SparkSession stopped.")
    ```

*   **Running the Script:**
    *   Save the code above as `hello_spark.py`.
    *   Open your terminal or command prompt.
    *   Make sure your `JAVA_HOME` is set correctly.
    *   Use `spark-submit` (a tool included with Spark/PySpark) to run the script:
        ```bash
        spark-submit hello_spark.py
        ```
        If `spark-submit` isn't in your PATH, look for it in your Python environment's `site-packages/pyspark/bin` directory or in the `bin` directory of a separate Spark installation. With a `pip`-installed PySpark you can also run the script directly with `python hello_spark.py`.

*   **Expected Output:** (the schema shows `age` as `long` because Spark infers Python `int` values as 64-bit integers)
    ```
    SparkSession created. Spark version: <your_spark_version>
    DataFrame Content:
    +-------+---+
    |   name|age|
    +-------+---+
    |  Alice| 30|
    |    Bob| 25|
    |Charlie| 35|
    +-------+---+

    DataFrame Schema:
    root
     |-- name: string (nullable = true)
     |-- age: long (nullable = true)

    Selecting only the 'name' column:
    +-------+
    |   name|
    +-------+
    |  Alice|
    |    Bob|
    |Charlie|
    +-------+

    SparkSession stopped.
    ```
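
If the inferred `long` type for `age` is not what you want, you can pass an explicit schema instead of relying on inference. The sketch below uses a DDL-style schema string, which `createDataFrame` accepts; the helper name `make_people_df` and the constant names are hypothetical, and the function expects an already created `SparkSession`.

```python
# Plain tuples in column order; Row objects are not needed with an explicit schema
PEOPLE = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]

# DDL-style schema string: each column name is followed by its Spark SQL type
PEOPLE_SCHEMA = "name STRING, age INT"


def make_people_df(spark):
    """Build the example DataFrame with `age` declared as a 32-bit integer."""
    return spark.createDataFrame(PEOPLE, schema=PEOPLE_SCHEMA)
```

In `hello_spark.py`, replacing `spark.createDataFrame(data)` with `make_people_df(spark)` should make `printSchema()` report `age` as `integer` rather than `long`. Declaring the schema also skips the inference pass over the data.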

---

**4. Configuration Basics**

You can control various Spark properties related to performance, resource allocation, and behavior.

*   **Why Configure?**
    *   Tune performance (e.g., memory allocation, parallelism).
    *   Specify cluster manager details (if not running locally).
    *   Set application-specific parameters.
    *   Configure connectors to external systems (databases, message queues).

*   **How to Set Configuration:**
    1.  **Using `SparkSession.builder.config()`:** Set options when creating the session.
        ```python
        from pyspark.sql import SparkSession

        spark = (
            SparkSession.builder
            .appName("ConfigExample")
            .master("local[2]")  # Use 2 local cores
            # Driver memory; has no effect if the driver JVM is already running (e.g. under spark-submit)
            .config("spark.driver.memory", "1g")
            .config("spark.sql.shuffle.partitions", "5")  # Default partitions for SQL shuffles
            .getOrCreate()
        )
        ```
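    2.  **Using `spark-submit` command-line options:** Set or override configurations when the application is launched. Quote the master URL so that shells such as zsh do not treat the brackets as a glob pattern.
        ```bash
        spark-submit \
          --master "local[4]" \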
          --driver-memory 2g \
          --conf spark.executor.memory=1g \
          --conf spark.sql.shuffle.partitions=10 \
          my_spark_app.py
        ```
    3.  **Using the `spark-defaults.conf` file:** Place this file in Spark's `conf` directory to provide default settings for all applications. Each line holds a property name and its value separated by whitespace (e.g., `spark.driver.memory 1g`).
    4.  **Runtime Configuration:** You can also set *some* SQL-related configurations after the session is created:
        ```python
        spark.conf.set("spark.sql.shuffle.partitions", "8")
        current_partitions = spark.conf.get("spark.sql.shuffle.partitions")
        print(f"Current shuffle partitions: {current_partitions}")  # Prints: Current shuffle partitions: 8
        ```

*   **Common Configurations (Examples):**
    *   `spark.app.name`: Your application name.
    *   `spark.master`: Cluster URL (e.g., `local[*]`, `yarn`, `spark://host:port`).
    *   `spark.driver.memory`: Memory for the driver process (where your main script runs).
    *   `spark.executor.memory`: Memory per executor process (worker processes).
    *   `spark.executor.cores`: Number of CPU cores per executor.
    *   `spark.sql.shuffle.partitions`: Default number of partitions to use when shuffling data for joins or aggregations. Tuning this is important for performance.

*   **Viewing Configuration:**
    ```python
    # Get all configuration settings
    all_conf = spark.sparkContext.getConf().getAll()
    print(all_conf)

    # Get a specific setting; pass a default because get() raises an error for a key that was never set
    driver_mem = spark.conf.get("spark.driver.memory", "not set")
    print(f"Driver Memory: {driver_mem}")
    ```
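
When the same property is set in more than one place, Spark applies a fixed order: values set in application code (such as the builder's `.config()`) win over `spark-submit` flags, which in turn win over `spark-defaults.conf`. The sketch below models that ordering with plain dictionaries. It does not call Spark; the function names and file contents are hypothetical, and the parser handles only the simple whitespace-separated form of the defaults file.

```python
def parse_defaults(text):
    """Parse spark-defaults.conf style text: one `key value` pair per line."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and comments
            continue
        parts = line.split(None, 1)           # split on the first run of whitespace
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    return settings


def effective_conf(defaults_text, submit_conf, builder_conf):
    """Merge the three sources, lowest precedence first."""
    merged = parse_defaults(defaults_text)
    merged.update(submit_conf)   # spark-submit flags override the defaults file
    merged.update(builder_conf)  # values set in code override both
    return merged


# Hypothetical spark-defaults.conf contents
defaults = """
# cluster-wide defaults
spark.driver.memory            1g
spark.sql.shuffle.partitions   200
"""

conf = effective_conf(
    defaults,
    submit_conf={"spark.sql.shuffle.partitions": "10"},
    builder_conf={"spark.sql.shuffle.partitions": "5", "spark.app.name": "ConfigExample"},
)
print(conf["spark.sql.shuffle.partitions"])  # prints 5: the value set in code wins
```

The ordering only tells you which value is chosen. Launch-time properties such as `spark.driver.memory` still have to be supplied before the driver JVM starts, so for those the command line or the defaults file is the reliable place.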

---

**Summary of Lesson 2:**

*   PySpark can be installed **locally** (via `pip`, needs JDK), on **cloud platforms** (using managed services like EMR, Dataproc), or via **Databricks**.
*   The **`SparkSession`** (`spark`) is the modern, unified entry point for PySpark applications, used for creating DataFrames, running SQL, and accessing configuration. It manages the underlying **`SparkContext`** (`sc`).
*   We wrote and ran a basic **"Hello World"** script using `spark-submit`, demonstrating `SparkSession` creation, DataFrame creation (`spark.createDataFrame`), and basic inspection methods (`show`, `printSchema`).
*   Spark behavior can be tuned using **configuration properties**, set via the `SparkSession` builder, `spark-submit` options, or configuration files. Basic settings control application name, master URL, and memory allocation.

This lesson equips you with the practical steps to start using PySpark in your chosen environment.