# PySpark PCA Example

This notebook demonstrates how to use Principal Component Analysis (PCA) in PySpark to reduce the dimensionality of a dataset.

## Step 1: Set Up Spark Session
We first create a Spark session, the entry point for working with DataFrames in PySpark.

In [None]:
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder \
    .appName("PCA Example") \
    .getOrCreate()

## Step 2: Load and Prepare Data
Create a small dataset of four rows and three numerical features to run PCA on.

In [None]:
from pyspark.sql import Row

# Create a sample dataset
data = [
    Row(feature1=1.0, feature2=2.0, feature3=3.0),
    Row(feature1=4.0, feature2=5.0, feature3=6.0),
    Row(feature1=7.0, feature2=8.0, feature3=9.0),
    Row(feature1=10.0, feature2=11.0, feature3=12.0)
]

# Convert the list to a DataFrame
df = spark.createDataFrame(data)
df.show()
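
Before fitting PCA, it can help to look at the covariance structure of the sample data. The sketch below is plain Python (no Spark required) and uses the same four rows as the DataFrame; the helper name `sample_covariance` is ours for illustration and is not part of PySpark.

```python
# Plain-Python sketch; `sample_covariance` is a hypothetical helper, not a PySpark API.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]

def sample_covariance(rows):
    """Return the sample covariance matrix (divides by n - 1) as nested lists."""
    n = len(rows)
    d = len(rows[0])
    # Column means, then deviations of every row from those means.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

cov = sample_covariance(rows)
for line in cov:
    print(line)
```

Every entry of this matrix is the same value because each feature changes by the same amount from row to row. The three features are therefore perfectly correlated, and a single direction accounts for all the variance in this toy dataset.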

## Step 3: Assemble Features
PCA expects its input as a single vector column, so we combine the three feature columns into one using `VectorAssembler`.

In [None]:
from pyspark.ml.feature import VectorAssembler

# Assemble features into a single vector column
assembler = VectorAssembler(inputCols=["feature1", "feature2", "feature3"], outputCol="features")
dataset = assembler.transform(df)
dataset.show()

## Step 4: Apply PCA
Fit a PCA model with `k=2` and use it to project each 3-dimensional feature vector onto the top 2 principal components.

In [None]:
from pyspark.ml.feature import PCA

# Apply PCA
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(dataset)

# Transform the dataset
result = model.transform(dataset)
result.select("features", "pcaFeatures").show()
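
To make the `pcaFeatures` column less of a black box, the following plain-Python sketch projects each row onto a single direction. It rests on two assumptions. First, that the leading principal component of this dataset is the unit vector along (1, 1, 1), which follows from every feature changing by the same amount between rows; Spark may report the same direction with the opposite sign. Second, that the projection is applied to the raw vectors without subtracting the mean, which is our understanding of `PCAModel.transform` and worth verifying on your Spark version. The helper `project` is illustrative only.

```python
import math

# Same four rows as the sample DataFrame.
rows = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]

# Assumed leading principal component: the unit vector along (1, 1, 1).
pc1 = [1.0 / math.sqrt(3.0)] * 3

def project(row, component):
    """Dot product of a row with a unit-length component vector."""
    return sum(x * c for x, c in zip(row, component))

# No mean-centering here, mirroring the assumption stated above.
scores = [project(r, pc1) for r in rows]
for r, s in zip(rows, scores):
    print(r, "->", round(s, 4))
```

Consecutive scores differ by a constant step, reflecting the fact that the rows are evenly spaced along one line. If the first entry of `pcaFeatures` in your output has the opposite sign, the two results still describe the same axis.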

## Step 5: Inspect PCA Results
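`model.pc` holds the principal components as a matrix with one column per component: here 3 rows (one per input feature) by 2 columns. `model.explainedVariance` gives the proportion of variance explained by each component.

In [None]:
# Print the principal components (one column per component)
print("Principal Components:")
print(model.pc.toArray())
# Print the proportion of variance explained by each component
print("Explained Variance:", model.explainedVariance)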
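
As a cross-check on the printed matrix, the leading principal component can be approximated without Spark by power iteration on the sample covariance matrix. This is a minimal sketch of that idea and not a description of how Spark computes PCA internally; `power_iteration` is an illustrative helper, and the rows are the same four used throughout.

```python
import math

rows = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
    [10.0, 11.0, 12.0],
]

# Sample covariance matrix of the rows (divides by n - 1).
n, d = len(rows), len(rows[0])
means = [sum(r[j] for r in rows) / n for j in range(d)]
centered = [[r[j] - means[j] for j in range(d)] for r in rows]
cov = [
    [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
    for i in range(d)
]

def power_iteration(matrix, steps=50):
    """Approximate the dominant eigenvector and eigenvalue of a symmetric matrix."""
    v = [1.0] + [0.0] * (len(matrix) - 1)  # arbitrary non-zero start vector
    for _ in range(steps):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m * x for m, x in zip(row, v)) for row in matrix]
    eigenvalue = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return v, eigenvalue

pc1, top_eigenvalue = power_iteration(cov)
total_variance = sum(cov[i][i] for i in range(d))
print("first component:", [round(x, 4) for x in pc1])
print("share of variance:", round(top_eigenvalue / total_variance, 4))
```

For this dataset the first component comes out with three equal entries and a variance share of essentially 1, so the second column of `model.pc` describes a direction along which the data barely varies. Spark may return the first component with the opposite sign, which is still the same axis.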

## Step 6: Stop the Spark Session
Release resources by stopping the Spark session.

In [None]:
# Stop the Spark session
spark.stop()