# Spark Local Installation Guide

This notebook provides two methods for setting up Apache Spark in a local environment (e.g., Google Colab, local Jupyter notebook, or similar platforms). 

**Apache Spark** is an open-source unified analytics engine for large-scale data processing. **PySpark** is the Python API for Apache Spark, allowing you to perform real-time, large-scale data processing in a distributed environment using Python.

## Prerequisites
- Python environment (Jupyter, Colab, etc.)
- Internet connection (for downloading packages/binaries)
- Basic understanding of Python

## Two Installation Methods

1. **Method 1: Manual Installation with Binaries** - Download and extract Spark binaries manually (useful when you need a specific Spark version or have pre-downloaded files)
2. **Method 2: Simple pip Installation** - Install PySpark directly via pip (recommended for quick setup and most use cases)

---

## Method 1: Manual Installation with Binaries

This method involves downloading the Spark binaries and setting up the environment manually. This is useful when:
- You need a specific version of Spark
- You want to upload pre-downloaded binaries to avoid long download times
- You're working in an environment like Google Colab with ephemeral storage

### Step 1: Clean Previous Installation (Optional)

In [None]:
# Reset environment variables to ensure clean installation
# This removes any existing Spark environment variables
import os, shutil, glob

for var in ["SPARK_HOME", "PYSPARK_SUBMIT_ARGS", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]:
    os.environ.pop(var, None)

# Optional: remove any old spark folders (for Google Colab)
for p in glob.glob("/content/spark-*"):
    try:
        shutil.rmtree(p)
    except Exception:
        pass

print("Environment cleaned successfully!")

### Step 2: Install Java

Spark runs on the JVM, so it requires Java. We'll install OpenJDK 11, a free and open-source implementation of Java that Spark 3.5 supports.

**Note:** This command is for Debian/Ubuntu-based Linux systems (like Google Colab). On Windows or macOS, you'll need to install a JDK manually (Spark 3.5 supports Java 8, 11, and 17).
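
Before installing, it can help to check whether a Java runtime is already available. The following is a minimal sketch using only the Python standard library; the helper name `java_version_banner` is made up for this example and is not part of Spark or PySpark.

```python
import shutil
import subprocess


def java_version_banner():
    """Return the text printed by `java -version`, or None if Java is not on PATH."""
    java_path = shutil.which("java")
    if java_path is None:
        return None
    # `java -version` writes its banner to stderr rather than stdout.
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    return (result.stderr or result.stdout).strip()


banner = java_version_banner()
if banner:
    print(banner)
else:
    print("No Java runtime found on PATH; install one before starting Spark.")
```

If a banner is printed, the installation cell below can be skipped, provided the reported version is one that your Spark release supports.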

In [None]:
# Install Java (OpenJDK 11) - This is required for Spark to run
# The -qq flag suppresses output, > /dev/null redirects output to null
!apt-get install openjdk-11-jdk-headless -qq > /dev/null

print("Java installed successfully!")
!java -version

### Step 3: Download and Extract Spark Binaries

There are two options here:

**Option A:** Download directly from Apache servers (can take a while)
- Uncomment the `!wget` line below to download

**Option B:** Upload pre-downloaded binaries
- Download `spark-3.5.1-bin-hadoop3.tgz` to your computer from https://archive.apache.org/dist/spark/spark-3.5.1/
- In Google Colab: Click the folder icon on the left sidebar
- Upload the `.tgz` file to the `/content/` directory
- Then run the extraction command below

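We're using **Spark 3.5.1**, a stable release in the Spark 3.5 line. Newer maintenance releases have been published since; to use one, change the version number consistently in the URL, the file name, and the commands below.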

In [None]:
# Option A: Download Spark 3.5.1 from Apache (uncomment the line below if needed)
# This can take several minutes depending on your internet connection
# !wget -q https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz

# Extract Spark 3.5.1 binaries
# This assumes the .tgz file is in the current directory
!tar xf spark-3.5.1-bin-hadoop3.tgz

print("Spark binaries extracted successfully!")
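
The `tar` command above fails with a terse message if the archive is missing. As an alternative, here is a minimal pure-Python sketch that checks for the file first and reports where the binaries ended up. The helper name `extract_spark` is hypothetical, and the sketch assumes the archive contains a single top-level folder.

```python
import os
import tarfile


def extract_spark(archive_path, dest_dir="."):
    """Extract a Spark .tgz archive and return the absolute path of its top-level folder.

    Hypothetical helper for illustration; assumes a single top-level folder
    inside the archive.
    """
    if not os.path.isfile(archive_path):
        raise FileNotFoundError(f"Archive not found: {archive_path}")
    with tarfile.open(archive_path, "r:gz") as archive:
        # The first path component of the first member is the top-level folder name.
        top_level = archive.getnames()[0].split("/")[0]
        archive.extractall(dest_dir)
    return os.path.abspath(os.path.join(dest_dir, top_level))


archive_name = "spark-3.5.1-bin-hadoop3.tgz"
if os.path.isfile(archive_name):
    print(f"Extracted to: {extract_spark(archive_name)}")
else:
    print(f"{archive_name} is not in {os.getcwd()}; download or upload it first.")
```

The returned path is the directory that `SPARK_HOME` should point to when you want Spark to run from these binaries.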

### Step 4: Install PySpark and Findspark Python Packages

- **PySpark:** Python API for Apache Spark
- **Findspark:** Helper library to locate Spark installation and initialize it

We install these via pip to get the Python bindings for Spark.

In [None]:
# Install PySpark and Findspark packages
# -q flag makes the installation quiet (less verbose output)
# PySpark is pinned to 3.5.1 so the Python package matches the binaries from Step 3
!pip install -q pyspark==3.5.1 findspark

print("PySpark and Findspark installed successfully!")
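
To confirm which versions pip actually installed (useful when comparing against the version of the Spark binaries), you can query the package metadata. This is a minimal sketch using `importlib.metadata` from the standard library (Python 3.8+); the helper name `installed_version` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("pyspark", "findspark"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```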

### Step 5: Initialize Spark Session

A **SparkSession** is the entry point to using Spark functionality. It allows you to:
- Create DataFrames
- Read and write data
- Execute SQL queries
- Manage Spark configuration

Setting `.master("local[*]")` runs Spark in local mode on this machine, with one worker thread per available CPU core. To cap parallelism, pass a number instead of `*`, for example `local[2]` for two threads.
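
The master URL format can be sketched as follows. The helper `local_master` is hypothetical and only builds the string you would pass to `.master(...)`. Note that `os.cpu_count()` is Python's view of the machine; the JVM may report a different core count in some container setups.

```python
import os


def local_master(threads=None):
    """Build a local-mode master URL: "local[*]" for all cores, "local[N]" for N threads."""
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return f"local[{threads}]"


print(f"Logical CPU cores reported by Python: {os.cpu_count()}")
print(local_master())   # local[*]
print(local_master(2))  # local[2]
```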

In [None]:
# Point Python at the Spark binaries extracted in Step 3
# This assumes the archive was extracted into the current working directory
import os
import findspark

os.environ["SPARK_HOME"] = os.path.abspath("spark-3.5.1-bin-hadoop3")
findspark.init()  # adds the Python libraries under SPARK_HOME to sys.path

# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create a Spark Session
# - master("local[*]"): Run Spark locally with all available CPU cores
# - appName(): Give your Spark application a name (useful for monitoring)
# - getOrCreate(): Get existing session or create new one if none exists

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("LocalSparkSetup") \
    .getOrCreate()

# Verify installation by printing Spark version
print("✅ Spark installation successful!")
print(f"Spark version: {spark.version}")

# You can also access SparkContext (lower-level API) through the session
sc = spark.sparkContext
print(f"SparkContext available: {sc is not None}")

### Step 6: Test the Installation

Let's test if Spark is working correctly by creating a simple DataFrame and performing a basic operation.

In [None]:
# Create a simple test DataFrame
test_data = [
    ("Alice", 25, "Engineer"),
    ("Bob", 30, "Data Scientist"),
    ("Charlie", 35, "Manager")
]

columns = ["Name", "Age", "Job"]

# Create DataFrame from test data
df = spark.createDataFrame(data=test_data, schema=columns)

# Display the DataFrame
print("Test DataFrame:")
df.show()

# Perform a simple operation - filter ages > 25
print("\nFiltered DataFrame (Age > 25):")
df.filter(df.Age > 25).show()

print("\n✅ All tests passed! Spark is working correctly.")

In [None]:
# Troubleshooting: If Method 1 didn't work, run this cell to reset everything

import os, shutil, glob

# Remove all Spark environment variables
for var in ["SPARK_HOME", "PYSPARK_SUBMIT_ARGS", "PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"]:
    os.environ.pop(var, None)

# Remove any old spark folders
for p in glob.glob("/content/spark-*"):
    try:
        shutil.rmtree(p)
        print(f"Removed: {p}")
    except Exception as e:
        print(f"Could not remove {p}: {e}")

print("\n✅ Environment reset complete! Try running the installation steps again.")

---

## Method 2: Simple pip Installation (Recommended)

This is the **simplest and quickest method** for setting up Spark. It's ideal when:
- You want a quick setup without dealing with binaries
- You're okay with whichever PySpark version pip installs (the latest release by default, or one you pin with `pyspark==<version>`)
- You don't need a specific Spark distribution

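This method only requires installing the PySpark package via pip, which bundles the Spark JARs, so there are no separate binaries to download. Spark still needs a Java runtime on the machine: Google Colab typically ships with one, but on other systems install Java first (see Step 2 of Method 1).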

### Step 1: Install PySpark Package

In [None]:
# Install PySpark via pip
# This downloads PySpark and its Python dependencies (such as py4j); it does not install Java
!pip install -q pyspark

# Optional: Install findspark if you need it for more complex setups
# !pip install -q findspark

print("✅ PySpark installed successfully!")

### Step 2: Create and Initialize Spark Session

That's it! With PySpark installed, you can now directly create a Spark session and start using Spark.

In [None]:
# Import SparkSession
from pyspark.sql import SparkSession

# Create Spark Session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("PipSparkSetup") \
    .getOrCreate()

# Verify installation
print("✅ Spark session created successfully!")
print(f"Spark version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")

# Get SparkContext for RDD operations
sc = spark.sparkContext
print(f"Application ID: {sc.applicationId}")

### Step 3: Test Your Installation

Run a quick test to ensure everything is working properly.

In [None]:
# Test 1: Create a DataFrame
data = [("Spark", 2014), ("Python", 1991), ("Scala", 2003), ("Java", 1995)]
columns = ["Language", "Year"]

df = spark.createDataFrame(data, columns)

print("Test 1: DataFrame Creation")
df.show()

# Test 2: Perform transformations
print("\nTest 2: Filter languages created after 2000")
df.filter(df.Year > 2000).show()

# Test 3: Use RDD (low-level API)
print("\nTest 3: RDD Operations")
rdd = sc.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x**2)
print(f"Original: {rdd.collect()}")
print(f"Squared: {squared.collect()}")

print("\n✅ All tests passed! Spark is fully functional.")

---

## Summary and Next Steps

### Which Method Should You Use?

| Method | Pros | Cons | Best For |
|--------|------|------|----------|
| **Method 1** | • Specific Spark version control<br>• Can work offline with pre-downloaded binaries<br>• Full Spark distribution | • More complex setup<br>• Manual Java installation and environment setup<br>• Larger download size | • When you need a specific Spark version<br>• When you work offline or with pre-downloaded binaries<br>• When you need the full distribution (e.g., the standalone cluster scripts) |
| **Method 2** | • Quick and simple<br>• No separate binary download<br>• Easy to maintain | • Installs the latest PyPI release unless you pin a version<br>• Still needs a Java runtime on the machine<br>• Does not include the tools to set up a standalone cluster | • Quick prototyping<br>• Learning PySpark<br>• Most data analysis tasks |

### Key Concepts to Remember

1. **SparkSession**: The entry point for all Spark functionality
2. **SparkContext**: Lower-level API accessed via `spark.sparkContext`
3. **Local Mode**: Running Spark on a single machine (what we're doing here)
4. **Distributed Mode**: Running Spark on a cluster (covered in advanced topics)

### Common Issues and Solutions

1. **Java Not Found Error**
   - Solution: Make sure Java is installed and on your PATH (Spark needs it with both methods)
   - Check with: `!java -version`

2. **Module Not Found: pyspark**
   - Solution: Run `!pip install pyspark`
   
3. **Port Already in Use**
   - Solution: Restart your kernel/runtime and try again

4. **Out of Memory Errors**
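   - Solution: Reduce data size, or raise the driver memory by adding `.config("spark.driver.memory", "4g")` to the builder before the first session is created (in local mode, all work runs inside the driver JVM)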
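
For the port issue, Spark's web UI uses port 4040 by default and moves to successive ports (4041, 4042, ...) when that one is taken, so a busy 4040 usually means another Spark application is still running. Here is a minimal sketch for checking this with the standard library; the helper name `port_is_free` is made up for this example.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use
        return sock.connect_ex((host, port)) != 0


if port_is_free(4040):
    print("Port 4040 is free.")
else:
    print("Port 4040 is in use; another Spark session may still be running.")
```

If the port is in use, calling `spark.stop()` in the notebook that owns the session, or restarting that runtime, releases it.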

### Additional Resources

- [PySpark Official Documentation](https://spark.apache.org/docs/latest/api/python/)
- [PySpark by Examples](https://sparkbyexamples.com/pyspark-tutorial/)
- [Spark SQL Guide](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [PySpark Cheat Sheet](https://sparkbyexamples.com/pyspark/pyspark-sql-cheat-sheet/)

### Stopping Spark Session

When you're done working with Spark, it's good practice to stop the session:

In [None]:
# Stop the Spark session when you're done
# This releases resources and cleans up
spark.stop()

print("✅ Spark session stopped successfully!")