In [1]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MatrixMultiplication").setMaster("local")
sc = SparkContext(conf=conf)

# Sample matrix A (2x3)
A = [
    (0, 0, 1), (0, 1, 2), (0, 2, 3),
    (1, 0, 4), (1, 1, 5), (1, 2, 6)
]

# Sample matrix B (3x2)
B = [
    (0, 0, 7), (0, 1, 8),
    (1, 0, 9), (1, 1, 10),
    (2, 0, 11), (2, 1, 12)
]

# Convert to RDDs
rdd_A = sc.parallelize(A)  # (i, k, A_ik)
rdd_B = sc.parallelize(B)  # (k, j, B_kj)

# Map step: emit intermediate keys with tags
mapped_A = rdd_A.flatMap(lambda x: [((x[0], j), ('A', x[1], x[2])) for j in range(2)])  # 2 is number of columns in B
mapped_B = rdd_B.flatMap(lambda x: [((i, x[1]), ('B', x[0], x[2])) for i in range(2)])  # 2 is number of rows in A

# Combine and group by key
joined = mapped_A.union(mapped_B).groupByKey()

# Reduce step: multiply matching entries and sum
def multiply(values):
    A_vals = {k: v for tag, k, v in values if tag == 'A'}
    B_vals = {k: v for tag, k, v in values if tag == 'B'}
    total = sum(A_vals.get(k, 0) * B_vals.get(k, 0) for k in set(A_vals) & set(B_vals))
    return total

result = joined.mapValues(multiply)

# Display result as (i, j, C_ij)
for ((i, j), val) in result.collect():
    print(f"C[{i}][{j}] = {val}")

sc.stop()


C[0][0] = 58
C[1][1] = 154
C[0][1] = 64
C[1][0] = 139


This notebook multiplies two matrices in PySpark using a MapReduce-style pattern. It mimics Hadoop-style processing with the RDD operations flatMap, union, groupByKey, and mapValues.
🔢 Example

Given:

Matrix A (2x3):

A = [[1, 2, 3],
     [4, 5, 6]]

Matrix B (3x2):

B = [[7, 8],
     [9, 10],
     [11, 12]]

We want to compute Matrix C = A × B, which will be a 2x2 matrix.
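
As a sanity check on the expected result, the product can be computed directly in plain Python with no Spark involved. This is only a reference sketch; the names A_dense, B_dense and C_ref are illustrative and do not appear in the notebook.

```python
# Dense reference computation of C = A x B (plain Python, no Spark).
A_dense = [[1, 2, 3],
           [4, 5, 6]]
B_dense = [[7, 8],
           [9, 10],
           [11, 12]]

# C[i][j] is the dot product of row i of A and column j of B.
C_ref = [
    [sum(A_dense[i][k] * B_dense[k][j] for k in range(len(B_dense)))
     for j in range(len(B_dense[0]))]
    for i in range(len(A_dense))
]

print(C_ref)  # [[58, 64], [139, 154]]
```

These four values match the C[i][j] lines printed by the Spark job.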
💡 Step-by-Step Explanation
✅ 1. RDD Initialization

rdd_A = sc.parallelize(A)  # Format: (i, k, A[i][k])
rdd_B = sc.parallelize(B)  # Format: (k, j, B[k][j])

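sc.parallelize() distributes the two driver-side lists across partitions as Resilient Distributed Datasets (RDDs), Spark's core abstraction for distributed collections. Each element is a coordinate triple (row, column, value), so only the entries that are listed take part in the computation.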
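
The notebook writes the coordinate triples out by hand. If the matrix starts as a nested list, a small helper could produce the same format. The function name to_triples is an assumption made for this sketch, not something the notebook defines.

```python
def to_triples(matrix):
    """Flatten a dense nested list into (row, col, value) triples."""
    return [(r, c, v)
            for r, row in enumerate(matrix)
            for c, v in enumerate(row)]

# Same layout as the hand-written list A in the notebook.
A_triples = to_triples([[1, 2, 3],
                        [4, 5, 6]])
print(A_triples)
```

The resulting list can be passed to sc.parallelize() exactly like the literal A above.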
✅ 2. Map Phase (flatMap)

mapped_A = rdd_A.flatMap(lambda x: [((x[0], j), ('A', x[1], x[2])) for j in range(2)])
mapped_B = rdd_B.flatMap(lambda x: [((i, x[1]), ('B', x[0], x[2])) for i in range(2)])

What This Does:

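    Each matrix entry is re-keyed by the output cell(s) it contributes to, so that matching entries can be grouped later.

    Every entry A[i][k] is replicated once per column j of B and keyed by (i, j).

    Every entry B[k][j] is replicated once per row i of A and keyed by (i, j).

    The 'A'/'B' tag records which matrix a value came from. The hard-coded range(2) is the number of columns of B (for A) and the number of rows of A (for B) in this example.

    This corresponds to the Map step in MapReduce.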

Big Data Concept:

flatMap() is a transformation that maps each input element to zero or more output elements. It is lazy and runs independently on each partition, so no data moves between workers at this stage.
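
To see what the map step emits without starting Spark, the lambda for A can be emulated with a list comprehension. This is a local sketch only; N_COLS_B and emit_for_a are names invented here to stand in for the hard-coded 2 and the lambda.

```python
A = [(0, 0, 1), (0, 1, 2), (0, 2, 3),
     (1, 0, 4), (1, 1, 5), (1, 2, 6)]
N_COLS_B = 2  # assumed name for the hard-coded 2 in range(2)

def emit_for_a(entry):
    """Emit one ((i, j), ('A', k, value)) pair per column j of B."""
    i, k, value = entry
    return [((i, j), ('A', k, value)) for j in range(N_COLS_B)]

# Flatten the per-entry lists, as flatMap would.
mapped_a_local = [pair for entry in A for pair in emit_for_a(entry)]

print(emit_for_a((0, 1, 2)))  # [((0, 0), ('A', 1, 2)), ((0, 1), ('A', 1, 2))]
print(len(mapped_a_local))    # 12 records from 6 entries
```

Each of the 6 entries of A is emitted twice, once for each column of B, which is where the replication cost of this approach comes from.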
✅ 3. Union & Grouping (shuffle phase)

joined = mapped_A.union(mapped_B).groupByKey()

    union() merges the two RDDs.

    groupByKey() groups all emitted values for a particular cell (i, j) of result matrix C.

    All the values needed to compute one cell C[i][j] are now together.

Big Data Concept:

groupByKey() triggers a shuffle, the counterpart of Hadoop's shuffle phase: records with the same key are moved across the cluster so that they end up in the same partition.
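
The effect of union() followed by groupByKey() can be imitated locally with a dictionary of lists. This sketch shows only the grouping logic, not how Spark partitions or transfers data, and the variable names are illustrative.

```python
from collections import defaultdict

A = [(0, 0, 1), (0, 1, 2), (0, 2, 3), (1, 0, 4), (1, 1, 5), (1, 2, 6)]
B = [(0, 0, 7), (0, 1, 8), (1, 0, 9), (1, 1, 10), (2, 0, 11), (2, 1, 12)]

# Map step for both matrices, concatenated (the local stand-in for union()).
mapped = [((i, j), ('A', k, v)) for i, k, v in A for j in range(2)]
mapped += [((i, j), ('B', k, v)) for k, j, v in B for i in range(2)]

# Local stand-in for groupByKey(): collect all tagged values per (i, j).
groups = defaultdict(list)
for key, tagged in mapped:
    groups[key].append(tagged)

print(sorted(groups))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(groups[(0, 0)])   # row 0 of A and column 0 of B, tagged
```

Every key ends up with six values: the three entries of one row of A and the three entries of one column of B.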
✅ 4. Reduce Phase (multiplication)

def multiply(values):
    A_vals = {k: v for tag, k, v in values if tag == 'A'}
    B_vals = {k: v for tag, k, v in values if tag == 'B'}
    total = sum(A_vals.get(k, 0) * B_vals.get(k, 0) for k in set(A_vals) & set(B_vals))
    return total

result = joined.mapValues(multiply)

    We now compute the dot product of the i-th row of A and j-th column of B.

    Only matching indices (k) are multiplied and summed.

Big Data Concept:

    mapValues() applies a function to the value of each key-value pair and leaves the key unchanged. Applied to the grouped values, it plays the role of the Reduce step here.

    Each output cell is computed independently, so Spark can process the groups in parallel across its workers.
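
The reduce logic can be checked by hand on a single group. The sketch below repeats multiply() in slightly condensed form and feeds it the values for cell (0, 0); the list group_00 is written out manually for illustration.

```python
def multiply(values):
    # Split the tagged values into {k: value} lookups, one per matrix.
    A_vals = {k: v for tag, k, v in values if tag == 'A'}
    B_vals = {k: v for tag, k, v in values if tag == 'B'}
    # Multiply only where both matrices have an entry for the same k.
    return sum(A_vals[k] * B_vals[k] for k in set(A_vals) & set(B_vals))

group_00 = [('A', 0, 1), ('A', 1, 2), ('A', 2, 3),
            ('B', 0, 7), ('B', 1, 9), ('B', 2, 11)]

print(multiply(group_00))  # 1*7 + 2*9 + 3*11 = 58
```

Note that multiply() reads its argument twice, so it needs something re-iterable such as a list. A one-shot iterator would be exhausted by the first comprehension and the result would be 0.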

✅ 5. Collecting the Result

for ((i, j), val) in result.collect():
    print(f"C[{i}][{j}] = {val}")

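    collect() brings the results from all partitions back to the driver program. The pairs are not sorted by (i, j), as the output above shows.

    Use it only when the final result is small enough to fit in driver memory (like our 2x2 matrix).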
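
Because the collected pairs arrive unsorted, it can be convenient to place them into a dense matrix on the driver. A minimal sketch, assuming the result is small and using the four pairs printed above as input (the name collected is illustrative):

```python
# The (key, value) pairs in the order the notebook printed them.
collected = [((0, 0), 58), ((1, 1), 154), ((0, 1), 64), ((1, 0), 139)]

# Infer the shape from the largest indices seen.
n_rows = 1 + max(i for (i, _), _ in collected)
n_cols = 1 + max(j for (_, j), _ in collected)

# Cells with no collected pair would stay 0.
C = [[0] * n_cols for _ in range(n_rows)]
for (i, j), val in collected:
    C[i][j] = val

for row in C:
    print(row)
```

In the notebook, collected would simply be result.collect().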

🧠 Viva-Ready Big Data Concepts Used
PySpark Function	Big Data Equivalent	Purpose
parallelize()	Input Splitting	Distributes a driver-side collection across partitions
flatMap()	Map Phase	Emits intermediate key-value pairs
union()	Merge Mappers' Output	Combines mapped datasets
groupByKey()	Shuffle/Sort	Groups values by key for reduction
mapValues()	Reduce Phase	Computes final values per key
collect()	Output	Gathers result to single machine
✅ Why This is a Good Big Data Example

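    Every stage (map, group, reduce) can run in parallel across distributed nodes.

    Works on coordinate triples, so if the input is read from distributed storage instead of parallelize(), no single machine has to hold a whole matrix.

    Uses the MapReduce paradigm naturally for a mathematical problem.

    Extends to larger matrices once the hard-coded range(2) values are replaced by the real dimensions, although every entry is replicated once per output row or column, so the shuffle volume grows quickly.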
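
To get a feel for how this scheme scales, one can count the intermediate records the map step emits before the shuffle. The helper below is a back-of-the-envelope sketch; the function name and parameters are assumptions introduced here, and it counts records only, not bytes or time.

```python
def shuffle_records(nnz_a, nnz_b, n_rows_a, n_cols_b):
    """Count map-step outputs: each entry of A is emitted once per
    column of B, and each entry of B once per row of A."""
    return nnz_a * n_cols_b + nnz_b * n_rows_a

# The 2x3 by 3x2 example from this notebook: 6*2 + 6*2 records.
print(shuffle_records(nnz_a=6, nnz_b=6, n_rows_a=2, n_cols_b=2))

# Two dense 1000x1000 matrices: 10**6 entries each, replicated 1000 times.
print(shuffle_records(10**6, 10**6, 1000, 1000))
```

For the notebook's example this gives 24 records; for two dense 1000x1000 matrices it gives 2,000,000,000, which is why larger jobs usually need a less replication-heavy formulation.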