Parallel Matrix Multiplication on Titan Supercomputer & Local Machine
This project implements and benchmarks three hybrid MPI + OpenMP matrix-multiplication programs written in C.
The goal was to evaluate runtime scalability across:
- Local machine: Apple Silicon M3 MacBook Pro
- Titan Supercomputer: SB compute nodes (multi-CPU, multicore)
- Matrix sizes up to 6000 × 6000
The analysis compares how performance changes as the number of nodes, the number of cores per node, and the OpenMP parallelization strategy are varied.
mmmpiOMP.c (version 1):
- MPI handles data distribution (scatter + broadcast + gather)
- OpenMP parallelizes the innermost dot-product loop (a minimal sketch follows below)
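The version-1 pattern can be pictured as follows: rank 0 scatters rows of A, broadcasts B, and gathers the result, while OpenMP threads split each dot product. This is a minimal sketch, not the project's actual source; it assumes square N × N matrices with N divisible by the number of ranks, and identifiers such as rowsPerRank, Aloc, and Cloc are illustrative.

```c
/* Sketch only: MPI scatters rows of A, broadcasts B, gathers C;
 * OpenMP parallelizes the innermost dot-product loop (version-1 style).
 * Assumes N is divisible by the number of MPI ranks. */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rowsPerRank = N / size;
    double *A = NULL, *C = NULL;
    double *B    = malloc((size_t)N * N * sizeof(double));
    double *Aloc = malloc((size_t)rowsPerRank * N * sizeof(double));
    double *Cloc = malloc((size_t)rowsPerRank * N * sizeof(double));

    if (rank == 0) {
        A = malloc((size_t)N * N * sizeof(double));
        C = malloc((size_t)N * N * sizeof(double));
        for (long i = 0; i < (long)N * N; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    /* MPI data distribution: scatter A by rows, broadcast all of B */
    MPI_Scatter(A, rowsPerRank * N, MPI_DOUBLE,
                Aloc, rowsPerRank * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank computes its block of rows; OpenMP threads share the
     * innermost dot product via a reduction. */
    for (int i = 0; i < rowsPerRank; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            #pragma omp parallel for reduction(+:sum)
            for (int k = 0; k < N; k++)
                sum += Aloc[i * N + k] * B[k * N + j];
            Cloc[i * N + j] = sum;
        }
    }

    /* Collect the distributed rows of C back on rank 0 */
    MPI_Gather(Cloc, rowsPerRank * N, MPI_DOUBLE,
               C, rowsPerRank * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(B); free(Aloc); free(Cloc);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}
```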
mmmpiOMP2.c (version 2):
- More efficient OpenMP region placement (one possible placement is sketched below)
- Reduced thread-scheduling overhead
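The README does not show exactly how the placement changed, so the kernel below is only one plausible interpretation: the OpenMP region wraps the column loop instead of the innermost k loop, so threads are forked once per row of the local block rather than once per output element. The MPI distribution is unchanged from the version-1 sketch, and the function and variable names are illustrative.

```c
/* Plausible version-2 style kernel (assumption, not the actual source):
 * the parallel region moves out of the innermost loop, cutting the
 * number of fork/join points from rowsPerRank*n to rowsPerRank and
 * removing the per-element reduction. */
void local_matmul_v2(const double *Aloc, const double *B, double *Cloc,
                     int rowsPerRank, int n)
{
    for (int i = 0; i < rowsPerRank; i++) {
        #pragma omp parallel for schedule(static)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += Aloc[i * n + k] * B[k * n + j];
            Cloc[i * n + j] = sum;
        }
    }
}
```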
mmmpiOMP3.c (version 3):
- OpenMP parallelizes at the row level (sketched below)
- Maximizes concurrency for large matrices
- Best performance across all Titan runs
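Row-level parallelism maps onto the outermost loop of each rank's local block: every thread owns whole rows of the result and the parallel region is entered only once per local multiply. The kernel below is a sketch of that placement, with the same caveat that the names are illustrative rather than the project's actual identifiers.

```c
/* Sketch of the version-3 placement: OpenMP splits the local rows
 * across threads, so the parallel region is entered once and each
 * thread writes disjoint rows of Cloc (no reduction needed). */
void local_matmul_v3(const double *Aloc, const double *B, double *Cloc,
                     int rowsPerRank, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < rowsPerRank; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += Aloc[i * n + k] * B[k * n + j];
            Cloc[i * n + j] = sum;
        }
    }
}
```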
Topics covered:
- Hybrid MPI + OpenMP programming
- Distributed-memory row partitioning
- Shared-memory multithreading inside compute kernels
- Cache and NUMA behavior in HPC nodes
- Strong scaling across nodes and cores
- SLURM job scheduling and cluster execution
Local machine (Apple Silicon M3 MacBook Pro):
- High per-core performance
- Unified memory
- Excellent for small-matrix benchmarks
Titan compute nodes:
- Dual Intel Xeon CPUs
- 40+ cores per node
- DDR4 memory subsystem
- Ideal for large distributed workloads
Matrix sizes tested:
- 1000 × 1000
- 2000 × 2000
- 4000 × 4000
- 6000 × 6000
Nodes × Cores per Node:
- 1×1, 1×2, 1×4, 1×8, 1×16
- 8×8, 8×16
- 16×8, 16×16
Used with:
- 8 nodes × 16 cores
- 16 nodes × 16 cores
The M3 laptop outperformed Titan on smaller matrices (≈1000×1000) due to:
- Lower MPI overhead
- Faster unified memory
- Higher per-core performance
- Titan’s communication startup cost dominating small workloads
OpenMP variant scaling:
- OMP2 → Minimal improvement beyond ~8 cores
- OMP3 → Truly scalable; best performance for large matrices
Compilation:
mpiicx -O2 -qopenmp mmmpiOMP.c -o mmmpiOMP
mpiicx -O2 -qopenmp mmmpiOMP2.c -o mmmpiOMP2
mpiicx -O2 -qopenmp mmmpiOMP3.c -o mmmpiOMP3

Example (8 nodes × 16 cores per node):
sbatch --partition=sb -n 8 --ntasks-per-node=1 -c 16 \
    mmmpi.bat mmmpiOMP3 6000 1

Program Arguments:
- Executable name
- Matrix size
- Transpose flag (1 = enabled; a cache-layout sketch follows below)
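The arguments above do not say how the transpose flag is implemented; a common motivation is to multiply against a pre-transposed copy of B so the inner dot product reads both operands with unit stride, which is friendlier to caches on large matrices. The sketch below illustrates that idea only; the function name, memory layout, and OpenMP placement are assumptions, not the project's code.

```c
#include <stdlib.h>

/* Assumed illustration of the transpose optimization: copy B into Bt
 * (row-major transpose) so both operands of the dot product are read
 * with unit stride, improving cache behavior for large n. */
void matmul_transposed(const double *A, const double *B, double *C, int n)
{
    double *Bt = malloc((size_t)n * n * sizeof(double));

    /* Bt[j][k] = B[k][j]: one O(n^2) pass before the O(n^3) multiply */
    for (int k = 0; k < n; k++)
        for (int j = 0; j < n; j++)
            Bt[j * n + k] = B[k * n + j];

    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * Bt[j * n + k];  /* both unit-stride */
            C[i * n + j] = sum;
        }

    free(Bt);
}
```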
Deliverables:
- Parallel C source code
- Full performance analysis report
- Charts & benchmark tables
- Final PowerPoint summary presentation
Skills demonstrated:
- Ability to write optimized parallel C
- Deep understanding of MPI communication
- Efficient use of OpenMP multithreading
- Real HPC experience on a multi-node cluster
- Strong-scaling and performance interpretation
- Memory hierarchy & cache-aware reasoning
- Clean benchmarking and reproducible experiments