<!-- use this command in cmd - spark-shell -->

In [1]:
import findspark
findspark.init()

In [2]:
import pyspark


In [3]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("MatrixMultiplication") \
    .getOrCreate()

# Sample matrices
matrix1 = [
    (0, 0, 2),
    (0, 1, 3),
    (1, 0, 4),
    (1, 1, 5)
]

matrix2 = [
    (0, 0, 6),
    (0, 1, 7),
    (1, 0, 8),
    (1, 1, 9)
]

# Create RDDs from the matrices
matrix1_rdd = spark.sparkContext.parallelize(matrix1)
matrix2_rdd = spark.sparkContext.parallelize(matrix2)

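# Perform matrix multiplication in map-reduce style.
# Note: matrix2 below is the local Python list captured by the lambda, not matrix2_rdd.
# Map: for each entry (i, k, a) of matrix1, emit ((i, j), a * b) for every entry (k, j, b) of matrix2.
# Reduce: sum the partial products that share the same (i, j) key.
result_rdd = matrix1_rdd.flatMap(
    lambda x: [((x[0], y[1]), x[2] * y[2]) for y in matrix2 if x[1] == y[0]]
).reduceByKey(lambda x, y: x + y)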


# Convert RDD to DataFrame
result_df = spark.createDataFrame(result_rdd.map(lambda x: (x[0][0], x[0][1], x[1])), ["row", "col", "result"])

# Display the result
result_df.show()

# Stop SparkSession
spark.stop()


+---+---+------+
|row|col|result|
+---+---+------+
|  0|  1|    41|
|  0|  0|    36|
|  1|  0|    64|
|  1|  1|    73|
+---+---+------+



In [None]:
from pyspark.sql import SparkSession: This line imports the SparkSession class from the pyspark.sql module. SparkSession is the entry point to Spark SQL functionality and allows the creation and management of DataFrame objects.
spark = SparkSession.builder.appName("MatrixMultiplication").getOrCreate(): This code returns the existing SparkSession if one is already running, and otherwise creates a new one. The appName method sets the application name to "MatrixMultiplication".
matrix1 and matrix2: These variables hold sample matrices represented as lists of tuples. Each tuple contains three elements: row index, column index, and value.
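matrix1_rdd and matrix2_rdd: These variables hold RDDs (Resilient Distributed Datasets) created from the sample matrices with spark.sparkContext.parallelize. RDDs are Spark's core low-level distributed data structure. Note that matrix2_rdd is created but not used by the multiplication step, which reads the local matrix2 list instead.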
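result_rdd: This variable holds the result of the matrix multiplication, computed in map-reduce style. flatMap runs over matrix1_rdd only: for each entry (i, k, value) it scans the local matrix2 list for entries whose row index equals k, and emits a pair keyed by (i, j) whose value is the product of the two entries. reduceByKey then sums the partial products for each (i, j) key, which gives one entry of the product matrix per key.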
result_df: This variable holds the DataFrame built from the resulting RDD with createDataFrame. Each ((row, col), value) pair in the RDD is first mapped to a flat tuple of three elements: row index, column index, and result. The column names are given as "row", "col", and "result".
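result_df.show(): This line displays the DataFrame containing the result of the matrix multiplication. The row order is not guaranteed, because reduceByKey does not sort by key; in the output above, (0, 1) appears before (0, 0).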
spark.stop(): This line stops the SparkSession, releasing the resources associated with it.
In summary, this code demonstrates how to perform matrix multiplication using PySpark RDDs and then convert the result into a DataFrame for easy visualization and further analysis.
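As a sanity check on the output table, the same map and reduce logic can be sketched in plain Python without Spark. This is only an illustration: multiply_coo is a hypothetical helper name that does not appear in the notebook, and it assumes both matrices use the same (row, col, value) tuple layout as matrix1 and matrix2.

```python
from collections import defaultdict

# Same sample matrices as in the notebook, as (row, col, value) tuples.
matrix1 = [(0, 0, 2), (0, 1, 3), (1, 0, 4), (1, 1, 5)]
matrix2 = [(0, 0, 6), (0, 1, 7), (1, 0, 8), (1, 1, 9)]


def multiply_coo(a, b):
    """Hypothetical helper: multiply two matrices given as (row, col, value) tuples."""
    totals = defaultdict(int)
    for i, k, a_val in a:
        for k2, j, b_val in b:
            if k == k2:  # column index of a must match row index of b
                # "Map" emits a_val * b_val under key (i, j); "reduce" sums per key.
                totals[(i, j)] += a_val * b_val
    return dict(totals)


expected = multiply_coo(matrix1, matrix2)
print(sorted(expected.items()))
# [((0, 0), 36), ((0, 1), 41), ((1, 0), 64), ((1, 1), 73)]
```

The four values match the "result" column shown by result_df.show(), up to row order.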