# PySpark Windows Environment Diagnostic & Fix

This notebook helps diagnose and fix PySpark issues that are specific to Windows.

## Step 1: Check Current Environment

In [None]:
import sys
import os

print("Python Version:", sys.version)
print("Python Executable:", sys.executable)

# Check PySpark version
try:
    import pyspark
    print("PySpark Version:", pyspark.__version__)
except Exception as e:
    print("PySpark Error:", e)

# Check PyArrow (often the culprit on Windows)
try:
    import pyarrow
    print("PyArrow Version:", pyarrow.__version__)
except ImportError:
    print("⚠️ PyArrow NOT installed")

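# Check Java (`java -version` prints its report to stderr, not stdout)
import subprocess
try:
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    print("\nJava Version:")
    print(result.stderr.split('\n')[0])
except FileNotFoundError:
    # subprocess.run raises FileNotFoundError when the executable is not on PATH
    print("⚠️ Java NOT found in PATH")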
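
The cell above prints the raw version line, which still has to be read by eye. As a minimal sketch, the major version could be extracted from that line so it can be compared against what your Spark release supports. The helper name `parse_java_major` is hypothetical, and the two sample strings are illustrative inputs rather than output captured from this environment.

```python
import re

def parse_java_major(version_output):
    """Return the Java major version parsed from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Legacy numbering: "1.8.0_381" means Java 8.
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first

print(parse_java_major('openjdk version "17.0.9" 2023-10-17'))  # 17
print(parse_java_major('java version "1.8.0_381"'))             # 8
```

In the notebook, the first line of `result.stderr` from the Java check would be passed in place of the sample strings.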

## Step 2: Install Required Packages

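PyArrow is not needed for basic DataFrame operations, but PySpark uses it for Arrow-accelerated `toPandas()` and for pandas UDFs, and a missing or mismatched install is often the culprit on Windows.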

In [None]:
# Install or upgrade pyarrow.
# %pip installs into the environment of the running kernel, which `!pip` does not guarantee.
%pip install --upgrade pyarrow
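
Before and after installing, it can help to confirm which versions the kernel actually sees. The sketch below reads package metadata through the standard library's `importlib.metadata` instead of importing each package; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

for name in ("pyspark", "pyarrow", "pandas"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If a package was just upgraded, restart the kernel before relying on the newly installed version.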

## Step 3: Configure Spark for Windows

Set these environment variables before creating the Spark session, because they are read when the session starts.

In [None]:
import os
import sys

# Critical: Set Python executables
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

# Expected by the pandas API on Spark when PyArrow >= 2.0 is installed (affects timezone handling).
# Note: this does not disable Arrow optimization.
os.environ['PYARROW_IGNORE_TIMEZONE'] = '1'

print("✅ Environment configured")
print(f"PYSPARK_PYTHON: {os.environ['PYSPARK_PYTHON']}")
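
A quick sanity check on the interpreter variables can catch a stale path or a driver/worker mismatch before Spark starts. This is a minimal sketch; the helper name `python_env_problems` is hypothetical, and it only inspects the two variables set in the cell above.

```python
import os
import sys

def python_env_problems(env):
    """Return a list of problems with the PySpark interpreter variables in `env`."""
    problems = []
    for var in ("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON"):
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isfile(path):
            problems.append(f"{var} points to a missing file: {path}")
    worker = env.get("PYSPARK_PYTHON")
    driver = env.get("PYSPARK_DRIVER_PYTHON")
    # normcase makes the comparison case-insensitive on Windows
    if worker and driver and os.path.normcase(worker) != os.path.normcase(driver):
        problems.append("driver and worker use different interpreters")
    return problems

print("Current interpreter:", sys.executable)
print(python_env_problems(os.environ) or "interpreter variables look consistent")
```

An empty list means both variables are set, point to existing files, and name the same interpreter.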

## Step 4: Create Spark Session with Windows-Optimized Config

In [None]:
from pyspark.sql import SparkSession

# Create Spark session with Windows-friendly settings
spark = (
    SparkSession.builder
    .appName("WindowsTest")
    .master("local[1]")  # one worker thread keeps failures easy to trace
    .config("spark.driver.memory", "2g")
    .config("spark.sql.shuffle.partitions", "1")  # the test data is tiny
    .config("spark.ui.enabled", "false")  # do not start the web UI
    .config("spark.sql.adaptive.enabled", "false")
    .config("spark.python.worker.reuse", "false")  # start a fresh Python worker per task
    .getOrCreate()
)

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Spark Master: {spark.sparkContext.master}")
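
`getOrCreate()` returns an already running session if one exists, in which case some of the builder settings may not take effect. As a rough sketch, the live values could be compared with the intended ones. The helper `conf_mismatches` and the `EXPECTED_CONF` dictionary are hypothetical names, and the sketch assumes the lookup callable returns string values, as `spark.conf.get` does.

```python
EXPECTED_CONF = {
    "spark.sql.shuffle.partitions": "1",
    "spark.sql.adaptive.enabled": "false",
    "spark.python.worker.reuse": "false",
}

def conf_mismatches(expected, get_value):
    """Compare expected settings with live values.

    `get_value(key, default)` is any lookup callable, for example `spark.conf.get`.
    Returns {key: (expected, actual)} for every setting that differs.
    """
    mismatches = {}
    for key, want in expected.items():
        actual = get_value(key, None)
        if actual != want:
            mismatches[key] = (want, actual)
    return mismatches

# Stand-in for a live session so the sketch runs without Spark;
# in the notebook, pass spark.conf.get instead of live.get.
live = {"spark.sql.shuffle.partitions": "200", "spark.sql.adaptive.enabled": "false"}
print(conf_mismatches(EXPECTED_CONF, live.get))
```

If the result is not empty, stopping the existing session with `spark.stop()` and rerunning the cell above is the usual remedy.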

## Step 5: Test Basic Operations

In [None]:
from pyspark.sql import Row

# Test 1: Simple data creation
print("Test 1: Creating simple DataFrame...")
try:
    simple_data = [(1, "Alice"), (2, "Bob"), (3, "Charlie")]
    df = spark.createDataFrame(simple_data, ["id", "name"])
    print("✅ DataFrame created successfully")
    
    # Test show()
    df.show()
    print("✅ show() works!")
    
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
# Test 2: Row-based creation (what the ETL uses)
print("\nTest 2: Creating DataFrame with Row objects...")
try:
    from datetime import date
    
    row_data = [
        Row(id=1, name="Alice", date=date(2024, 1, 1)),
        Row(id=2, name="Bob", date=date(2024, 1, 2)),
    ]
    df2 = spark.createDataFrame(row_data)
    print("✅ Row-based DataFrame created")
    
    df2.show()
    print("✅ Row-based show() works!")
    
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
# Test 3: Operations
print("\nTest 3: Testing transformations...")
try:
    from pyspark.sql.functions import col, upper
    
    df3 = df.withColumn("upper_name", upper(col("name")))
    df3.show()
    print("✅ Transformations work!")
    
except Exception as e:
    print(f"❌ Error: {e}")

In [None]:
# Test 4: Simple UDF
print("\nTest 4: Testing simple UDF...")
try:
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType
    
    def add_prefix(name):
        return f"Mr. {name}"
    
    add_prefix_udf = udf(add_prefix, StringType())
    
    df4 = df.withColumn("prefixed", add_prefix_udf(col("name")))
    df4.show()
    print("✅ UDF works!")
    
except Exception as e:
    print(f"❌ UDF Error: {e}")
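
The tests above all repeat the same try/except pattern and only print their outcome. One possible refactoring, sketched here with a hypothetical helper `run_check`, records each result in a dictionary so the pass/fail picture can be read in one place when working through the recommendations at the end.

```python
results = {}

def run_check(name, check):
    """Run `check()` and record True/False under `name` instead of only printing."""
    try:
        check()
    except Exception as exc:  # broad on purpose: any failure marks the check as failed
        results[name] = False
        print(f"FAIL {name}: {type(exc).__name__}: {exc}")
    else:
        results[name] = True
        print(f"PASS {name}")
    return results[name]

# Stand-in checks so the sketch runs without Spark; in the notebook the
# callable would wrap a real call such as lambda: df.show().
run_check("arithmetic", lambda: 1 + 1)
run_check("division", lambda: 1 / 0)
print(results)
```

Printing the exception type alongside the message helps distinguish a Python-side error from a Java-side error.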

## Step 6: Write Test (Critical for ETL)

In [None]:
import tempfile
import os

print("Test 5: Testing write operations...")

# Create the temp directory outside the try blocks so every test below can use it
temp_dir = tempfile.mkdtemp()

# Test CSV write
try:
    csv_path = os.path.join(temp_dir, "test_csv")
    
    df.coalesce(1).write.csv(csv_path, mode="overwrite", header=True)
    print("✅ CSV write works!")
    
except Exception as e:
    print(f"❌ CSV write failed: {e}")

# Test Parquet write
try:
    parquet_path = os.path.join(temp_dir, "test_parquet")
    
    df.coalesce(1).write.parquet(parquet_path, mode="overwrite")
    print("✅ Parquet write works!")
    
except Exception as e:
    print(f"❌ Parquet write failed: {e}")

# Test pandas conversion (workaround)
try:
    import pandas  # fail early with a clear ImportError if pandas is missing
    
    pandas_df = df.toPandas()
    csv_file = os.path.join(temp_dir, "pandas_test.csv")
    pandas_df.to_csv(csv_file, index=False)
    print("✅ Pandas conversion and write works!")
    
except Exception as e:
    print(f"❌ Pandas approach failed: {e}")
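
A write that raises no exception can still be worth inspecting. Spark writes each output as a directory containing `part-*` data files plus a `_SUCCESS` marker, so a small helper can summarize what landed on disk. The name `spark_output_summary` is hypothetical, and the demo directory below is a stand-in built by hand, not real Spark output.

```python
import os
import tempfile

def spark_output_summary(output_dir):
    """Summarize a Spark output directory: data part files and the _SUCCESS marker."""
    names = sorted(os.listdir(output_dir))
    return {
        "part_files": [n for n in names if n.startswith("part-")],
        "success_marker": "_SUCCESS" in names,
    }

# Stand-in directory so the sketch runs without Spark; in the notebook,
# call spark_output_summary(csv_path) or spark_output_summary(parquet_path).
demo_dir = tempfile.mkdtemp()
for name in ("part-00000-demo.csv", "_SUCCESS", "._SUCCESS.crc"):
    open(os.path.join(demo_dir, name), "w").close()
print(spark_output_summary(demo_dir))
```

With `coalesce(1)`, as used in the tests above, a single part file is the expected result.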

## Summary & Recommendations

Based on which tests passed/failed above:

### If All Tests Pass:
✅ Your environment is fixed! The ETL notebook should now work.

### If show() Fails:
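- Reinstall pyarrow: run `pip uninstall -y pyarrow`, then `pip install pyarrow` (as separate commands, since Windows PowerShell 5.1 does not support `&&`)
- Check the Java version against your Spark release (Spark 3.3 to 3.5 run on Java 8, 11, or 17; Spark 4.x requires Java 17 or later)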

### If UDF Fails:
- Use `local[1]` instead of `local[*]`
- Set `spark.python.worker.reuse` to `false`

### If Write Fails:
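- Use the pandas workaround: `df.toPandas().to_csv(path, index=False)`. It collects all rows to the driver, so it only suits data that fits in memory.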
- This is what we implemented in the local notebook
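
One frequent reason for write failures on Windows, which the tests above do not check directly, is a missing `winutils.exe`: Spark's Hadoop file layer looks for it under `%HADOOP_HOME%\bin`. The sketch below assumes that is the relevant cause in your setup; the helper name `hadoop_home_problems` is hypothetical.

```python
import os

def hadoop_home_problems(env):
    """Return problems with HADOOP_HOME that commonly break Spark writes on Windows."""
    home = env.get("HADOOP_HOME")
    if not home:
        return ["HADOOP_HOME is not set"]
    winutils = os.path.join(home, "bin", "winutils.exe")
    if not os.path.isfile(winutils):
        return [f"winutils.exe not found at {winutils}"]
    return []

if os.name == "nt":
    print(hadoop_home_problems(os.environ) or "HADOOP_HOME looks usable")
else:
    print("Not running on Windows; winutils.exe is not needed")
```

If the check reports a problem, the pandas workaround above still applies, because it bypasses the Hadoop writer entirely.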

In [None]:
# Cleanup
import shutil

spark.stop()
print("✅ Spark session stopped")
shutil.rmtree(temp_dir, ignore_errors=True)  # remove the files written in Step 6