A comprehensive distributed computing project demonstrating both traditional HPC and modern Big Data processing approaches through machine learning on bioinformatics datasets.
This project implements two distinct distributed computing paradigms:
- Task 1: Traditional Mini-HPC cluster using MPI for distributed machine learning
- Task 2: Hybrid HPC-Big Data cluster using Docker Swarm and Apache Spark
Both approaches are evaluated on real-world bioinformatics datasets, specifically gene expression analysis for leukemia classification using the Golub dataset.
Task 1: Mini-HPC Cluster (MPI)

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Master    │     │  Worker 1   │     │  Worker 2   │
│    Node     │◄───►│    Node     │◄───►│    Node     │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └───────────────────┴───────────────────┘
                  MPI Communication
```

Task 2: Hybrid HPC-Big Data Cluster (Docker Swarm + Spark)

```
┌──────────────────────────────────────────────────────┐
│                     Docker Swarm                      │
├──────────────┬─────────────────┬─────────────────────┤
│    Spark     │  Spark Worker 1 │   Spark Worker 2    │
│    Master    │                 │                     │
│   + Jupyter  │                 │                     │
└──────────────┴─────────────────┴─────────────────────┘
```
Golub Leukemia Dataset: Gene expression data for Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) classification
- Training samples: 38 samples across 7,129 genes
- Test samples: 34 samples
- Challenge: High-dimensional feature space with small sample size (see the feature-selection sketch below)
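With only 38 training samples spread over 7,129 genes, aggressive feature selection is needed before any model can be fit reliably. The sketch below is illustrative only, using synthetic data of the same shape (the actual scripts load the Golub CSVs instead):

```python
# Illustrative only: synthetic data with the Golub training dimensions
# (38 samples x 7,129 genes), showing the univariate feature-selection step.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(42)
X_train = rng.normal(size=(38, 7129))   # stand-in for the expression matrix
y_train = rng.integers(0, 2, size=38)   # 0 = ALL, 1 = AML

selector = SelectKBest(score_func=f_classif, k=500)
X_reduced = selector.fit_transform(X_train, y_train)
print(X_reduced.shape)                  # (38, 500)
```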
MNIST Digits Dataset: Used for initial MPI testing and validation
- Samples: 1,797 digit images (8x8 pixels)
- Classes: 10 digits (0-9)
- Purpose: Validate MPI implementation before bioinformatics analysis (a minimal sanity check is sketched below)
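Before the bioinformatics run, it is worth confirming that mpi4py can actually move data between the VMs. The following is a hypothetical sanity check (not a file from this repo) that scatters the digits dataset and verifies the shard sizes:

```python
# Scatter the sklearn digits dataset across ranks and verify that every rank
# agrees on the total sample count (1,797).
from mpi4py import MPI
import numpy as np
from sklearn.datasets import load_digits

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    X, _ = load_digits(return_X_y=True)            # 1,797 samples, 8x8 images flattened to 64 features
    chunks = np.array_split(X, size)
else:
    chunks = None

local_X = comm.scatter(chunks, root=0)             # each rank receives its own shard
total = comm.allreduce(len(local_X), op=MPI.SUM)   # same value on every rank
print(f"rank {rank}: {len(local_X)} local samples, {total} total")
```

Run it the same way as the main scripts, e.g. `mpirun -np 4 python mpi_check.py` (the filename here is arbitrary).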
Cluster Requirements

- Hardware: 3 Virtual Machines (1 master + 2 workers)
- OS: Ubuntu 20.04+ or CentOS 7+
- Memory: Minimum 4GB per VM
- Network: Configured for inter-VM communication
Install Dependencies
```bash
# Update system
sudo apt update && sudo apt upgrade -y

# Install OpenMPI
sudo apt install openmpi-bin openmpi-common libopenmpi-dev -y

# Install Python dependencies
pip install mpi4py numpy pandas scikit-learn
```

Configure Passwordless SSH
```bash
# Generate SSH key on master
ssh-keygen -t rsa -b 4096

# Copy to all nodes
ssh-copy-id user@worker1
ssh-copy-id user@worker2
```

Create Hostfile
```
# /etc/openmpi/hostfile
master-node slots=1
worker1-node slots=2
worker2-node slots=2
```

Initialize Docker Swarm
```bash
# On master node
docker swarm init --advertise-addr <master-ip>

# Join workers (run on worker nodes)
docker swarm join --token <token> <master-ip>:2377
```

Deploy Spark Cluster
```bash
# Deploy the stack
docker stack deploy -c spark-stack.yml spark-cluster

# Verify deployment
docker service ls
```

MNIST Digit Classification
```bash
mpirun -np 4 --hostfile hostfile python distributed_mnist.py
```

Gene Expression Analysis
```bash
mpirun -np 4 --hostfile hostfile python distributed_gene_analysis.py
```

Access Jupyter Notebook
```
# Navigate to http://master-ip:8888
# Open distributed_gene_expression_analysis.py
```

Monitor Spark Jobs

```
# Spark UI: http://master-ip:8080
# Job monitoring and resource utilization
```

Task 1 Results (MPI)

- Training Time: 0.0398 seconds (average across processes)
- Test Accuracy: 58.82%
- Ensemble Method: Majority voting across distributed models
- Scalability: Linear speedup with additional processes
Task 2 Results (Spark)

- Training Time: ~15-20 seconds (including overhead)
- Test Accuracy: ~65-70% (improved with advanced feature selection)
- Features: Automated feature selection pipeline (7,129 → 500 features)
- Advantage: Better handling of large-scale data processing
| Metric | MPI | Spark |
|---|---|---|
| Setup Complexity | Low | Medium |
| Training Time | 0.04s | 15-20s |
| Accuracy | 58.82% | 65-70% |
| Scalability | Excellent | Good |
| Fault Tolerance | Limited | High |
| Big Data Support | Limited | Excellent |
MPI Implementation (Task 1), sketched in the example after this list:

- Data Distribution: Manual data partitioning across processes
- Model Training: Independent RandomForest models per process (50 estimators)
- Feature Selection: SelectKBest with f_classif (top 500 features)
- Aggregation: Majority voting ensemble
- Communication: Point-to-point and collective MPI operations
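A minimal sketch of that pattern is shown below. It is not the repo's distributed_gene_analysis.py; the data loading is replaced by synthetic placeholders, but the structure matches the points above: rank 0 runs SelectKBest, scatters the training shards, every process trains its own 50-tree RandomForest, and rank 0 combines the test predictions by majority vote.

```python
from mpi4py import MPI
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Placeholder arrays; the real script builds these from the Golub CSVs.
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(38, 7129))
    y_train = rng.integers(0, 2, size=38)
    X_test = rng.normal(size=(34, 7129))
    selector = SelectKBest(f_classif, k=500).fit(X_train, y_train)
    X_train, X_test = selector.transform(X_train), selector.transform(X_test)
    train_chunks = list(zip(np.array_split(X_train, size),
                            np.array_split(y_train, size)))
else:
    train_chunks, X_test = None, None

# Collective operations: scatter the training shards, broadcast the test matrix.
local_X, local_y = comm.scatter(train_chunks, root=0)
X_test = comm.bcast(X_test, root=0)

# Each process trains an independent model on its own shard.
model = RandomForestClassifier(n_estimators=50, random_state=rank)
model.fit(local_X, local_y)
local_pred = model.predict(X_test)

# Gather the per-process predictions on rank 0 and take a majority vote.
all_preds = comm.gather(local_pred, root=0)
if rank == 0:
    votes = np.stack(all_preds)                        # shape: (n_processes, n_test)
    ensemble = (votes.mean(axis=0) >= 0.5).astype(int)
    print("first ensemble predictions:", ensemble[:10])
```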
Spark Implementation (Task 2), sketched in the example after this list:

- Data Processing: Distributed DataFrames with automatic partitioning
- Feature Engineering: MLlib pipeline with ChiSqSelector (1,000 → 500 features)
- Model Training: Distributed RandomForest (20 trees, maxDepth=10)
- Fault Tolerance: Automatic recovery and lineage tracking
- Optimization: Checkpointing and caching for performance
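A hypothetical sketch of that pipeline is shown below. It is not the repo's distributed_gene_expression_analysis.py; a small synthetic non-negative matrix stands in for the preprocessed expression data, but the MLlib stages and parameters follow the list above (ChiSqSelector keeping 500 features, a 20-tree RandomForest with maxDepth=10, and caching of the input DataFrame).

```python
import random
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, ChiSqSelector
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("gene-expression-sketch").getOrCreate()

# Synthetic stand-in: 38 rows with 1,000 non-negative "gene" columns plus a
# binary label column (0 = ALL, 1 = AML).
n_genes = 1000
cols = [f"g{i}" for i in range(n_genes)] + ["label"]
rows = [tuple(random.random() for _ in range(n_genes)) + (random.randint(0, 1),)
        for _ in range(38)]
df = spark.createDataFrame(rows, cols).cache()   # caching avoids recomputing the input

assembler = VectorAssembler(inputCols=cols[:-1], outputCol="features")
selector = ChiSqSelector(numTopFeatures=500, featuresCol="features",
                         labelCol="label", outputCol="selected")
rf = RandomForestClassifier(featuresCol="selected", labelCol="label",
                            numTrees=20, maxDepth=10)

model = Pipeline(stages=[assembler, selector, rf]).fit(df)
model.transform(df).select("label", "prediction").show(5)
```

In the real job the caching and checkpointing noted above matter more, since the data is read from CSV files rather than generated in memory.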
```
project_hpc_hybrid_cluster/
├── README.md                                  # This file
├── hostfile                                   # MPI hostfile configuration
├── distributed_mnist.py                       # MPI MNIST classification
├── distributed_gene_analysis.py               # MPI gene expression analysis
├── spark-stack.yml                            # Docker Swarm Spark deployment
├── distributed_gene_expression_analysis.py    # Spark gene analysis
├── bioinfo_data/                              # Dataset directory
│   ├── data_set_ALL_AML_train.csv             # Training data
│   ├── data_set_ALL_AML_independent.csv       # Test data
│   └── actual.csv                             # Labels
├── screenshots/                               # Setup and results screenshots
└── Final_Report.pdf                           # Comprehensive analysis report
```
- HPC Fundamentals: Understanding of distributed computing paradigms
- MPI Programming: Hands-on experience with message passing interface
- Big Data Processing: Apache Spark ecosystem and MLlib
- Container Orchestration: Docker Swarm cluster management
- Bioinformatics ML: Real-world gene expression analysis
- Performance Analysis: Comparative evaluation of different approaches
- OpenMPI Documentation
- Apache Spark MLlib Guide
- Golub et al. (1999) - Molecular Classification of Cancer
- Docker Swarm Documentation
This project is licensed under the MIT License - see the LICENSE file for details.
Built with ❤️ for distributed computing education