In [3]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.stat import Correlation

# Step 1: Create Spark session
spark = SparkSession.builder.appName("IrisMultivariateAnalysis").getOrCreate()

# Step 2: Load Iris dataset (keeps the file's own headers, e.g. SepalLengthCm)
df = spark.read.csv("Iris.csv", header=True, inferSchema=True)

# Step 3: Rename columns to standard format
df_renamed = df.selectExpr(
    "SepalLengthCm as sepal_length",
    "SepalWidthCm as sepal_width",
    "PetalLengthCm as petal_length",
    "PetalWidthCm as petal_width",
    "Species as species"
)

# Step 4: Assemble features into a vector
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
df_vector = assembler.transform(df_renamed)

# Step 5: Descriptive Stats
print("=== Descriptive Statistics ===")
df_vector.select(feature_cols).describe().show()

# Step 6: Correlation Matrix
print("=== Correlation Matrix ===")
correlation_matrix = Correlation.corr(df_vector, "features").head()[0]
print(correlation_matrix)

# Step 7: PCA
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(df_vector)
df_pca = pca_model.transform(df_vector)

print("=== PCA Result (first 5 rows) ===")
df_pca.select("pca_features").show(5, truncate=False)

# Step 8: Stop Spark session
spark.stop()


=== Descriptive Statistics ===
+-------+------------------+-------------------+------------------+------------------+
|summary|      sepal_length|        sepal_width|      petal_length|       petal_width|
+-------+------------------+-------------------+------------------+------------------+
|  count|               150|                150|               150|               150|
|   mean| 5.843333333333335| 3.0540000000000007|3.7586666666666693|1.1986666666666672|
| stddev|0.8280661279778637|0.43359431136217375| 1.764420419952262|0.7631607417008414|
|    min|               4.3|                2.0|               1.0|               0.1|
|    max|               7.9|                4.4|               6.9|               2.5|
+-------+------------------+-------------------+------------------+------------------+

=== Correlation Matrix ===
DenseMatrix([[ 1.        , -0.10936925,  0.87175416,  0.81795363],
             [-0.10936925,  1.        , -0.4205161 , -0.35654409],
             [ 0.8717541

Viva-Ready Explanation of Multivariate Analysis
| Method | Description |
|---|---|
| Descriptive Stats | Shows count, mean, stddev, min, and max for each variable |
| Correlation Matrix | Measures the linear relationship between each pair of features (+1 = perfect positive, -1 = perfect negative, 0 = no linear relationship) |
| PCA | Reduces dimensionality while retaining as much variance as possible; useful for visualization and for reducing computation |
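
To make the three methods concrete, here is a minimal NumPy sketch of the same calculations on a handful of rows. The six rows are illustrative values in the order sepal_length, sepal_width, petal_length, petal_width, and the variable names are chosen for this example only. It is meant to show the arithmetic, not how Spark computes these results internally.

```python
import numpy as np

# Illustrative rows: sepal_length, sepal_width, petal_length, petal_width.
X = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.2, 3.4, 5.4, 2.3],
    [5.9, 3.0, 5.1, 1.8],
])

# Descriptive statistics per column (sample standard deviation, ddof=1).
mean = X.mean(axis=0)
stddev = X.std(axis=0, ddof=1)

# Pearson correlation: covariance divided by the product of the
# two standard deviations, for every pair of columns.
centered = X - mean
cov = centered.T @ centered / (len(X) - 1)
corr = cov / np.outer(stddev, stddev)

# PCA with k=2: keep the two eigenvectors of the covariance matrix
# that have the largest eigenvalues (eigh returns ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1][:2]
components = eigenvectors[:, order]            # shape (4, 2)
explained_ratio = eigenvalues[order] / eigenvalues.sum()

# Project each 4-dimensional row onto the two components.
# Centering the rows first is a modelling choice; it is skipped here.
projected = X @ components

print(corr.round(3))
print(explained_ratio.round(3))
```

The `explained_ratio` values show how much of the total variance each retained component carries, which is a useful number to quote when justifying `k=2`.
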
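📌 Sample iris.csv Format (you can use this in Colab or HDFS). This file already has the renamed headers, so the code above, which reads "Iris.csv" with headers such as SepalLengthCm, would need the path changed and Step 3 skipped: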

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
6.2,3.4,5.4,2.3,virginica
...
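
Because the code cell and the sample file use different header styles, one option is a small helper that accepts either. The sketch below is an assumption-laden example: `KAGGLE_TO_SNAKE`, `rename_exprs`, and `load_iris` are hypothetical names introduced here, and `load_iris` expects an existing `SparkSession` to be passed in.

```python
# Hypothetical mapping from the headers used in the code cell
# to the snake_case names used in the sample file.
KAGGLE_TO_SNAKE = {
    "SepalLengthCm": "sepal_length",
    "SepalWidthCm": "sepal_width",
    "PetalLengthCm": "petal_length",
    "PetalWidthCm": "petal_width",
    "Species": "species",
}


def rename_exprs(columns):
    """Build selectExpr strings that yield snake_case columns for either header style."""
    exprs = []
    for source, target in KAGGLE_TO_SNAKE.items():
        if source in columns:
            exprs.append(f"{source} as {target}")   # rename needed
        elif target in columns:
            exprs.append(target)                     # already renamed
        else:
            raise ValueError(f"missing column: {source} or {target}")
    return exprs


def load_iris(spark, path):
    """Read the CSV and return a DataFrame with the five snake_case columns."""
    df = spark.read.csv(path, header=True, inferSchema=True)
    return df.selectExpr(*rename_exprs(df.columns))
```

With this in place, `load_iris(spark, "Iris.csv")` and `load_iris(spark, "iris.csv")` would both produce the schema that the feature-assembly step expects, and any extra columns in the file are dropped.
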

🚀 What Makes This “Big Data”?

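    Spark: The same code can process much larger Iris-like datasets whose rows are distributed across a cluster. Iris itself has only 150 rows, so the "big data" part is the tooling, not the dataset.

    Scalable: You can scale it to millions of rows without changing the code.

    Parallel: The per-row work behind the descriptive statistics, correlation matrix, and PCA is spread across the nodes of the cluster.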
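
The reason such statistics parallelize well can be sketched in plain NumPy: each partition is reduced to a small summary, and the summaries are merged. This is an illustration of the idea only, not Spark's internal implementation; the function names and the random data are hypothetical.

```python
from functools import reduce

import numpy as np


def partial_stats(rows):
    """Summarise one partition: row count, column sums, sum of outer products."""
    rows = np.asarray(rows, dtype=float)
    return len(rows), rows.sum(axis=0), rows.T @ rows


def merge(a, b):
    """Combine two partition summaries; the order of merging does not matter."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def covariance(stats):
    """Sample covariance matrix recovered from the merged summary."""
    n, total, outer = stats
    mean = total / n
    return (outer - n * np.outer(mean, mean)) / (n - 1)


# Hypothetical data split into three "partitions", as a cluster might hold it.
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 4))
partitions = np.array_split(data, 3)

# Each partition is summarised independently, then the summaries are merged.
merged = reduce(merge, (partial_stats(p) for p in partitions))
cov = covariance(merged)

# The merged result matches a single-machine computation over all rows.
print(np.allclose(cov, np.cov(data, rowvar=False)))
```

Each summary has a fixed size that depends only on the number of features, so adding more rows adds work per partition but not more data to merge. The covariance matrix obtained this way is also the input to both the correlation matrix and PCA.
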