Mini-HPC and Hybrid HPC-Big Data Clusters

A comprehensive distributed computing project demonstrating both traditional HPC and modern Big Data processing approaches through machine learning on bioinformatics datasets.

🎯 Project Overview

This project implements two distinct distributed computing paradigms:

  • Task 1: Traditional Mini-HPC cluster using MPI for distributed machine learning
  • Task 2: Hybrid HPC-Big Data cluster using Docker Swarm and Apache Spark

Both approaches are evaluated on real-world bioinformatics datasets, specifically gene expression analysis for leukemia classification using the Golub dataset.

๐Ÿ—๏ธ Architecture

Task 1: Mini-HPC Cluster

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Master    │    │  Worker 1   │    │  Worker 2   │
│    Node     │────│    Node     │────│    Node     │
│             │    │             │    │             │
└─────────────┘    └─────────────┘    └─────────────┘
        │                   │                   │
        └───────────────────┼───────────────────┘
                    MPI Communication

Task 2: Docker Swarm + Spark Cluster

┌─────────────────────────────────────────────────────┐
│                    Docker Swarm                     │
├─────────────┬─────────────────┬─────────────────────┤
│ Spark       │ Spark Worker 1  │ Spark Worker 2      │
│ Master      │                 │                     │
│ + Jupyter   │                 │                     │
└─────────────┴─────────────────┴─────────────────────┘

📊 Datasets

Golub Leukemia Dataset: Gene expression data for Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML) classification

  • Training set: 38 samples × 7,129 genes
  • Test set: 34 samples
  • Challenge: high-dimensional feature space paired with a very small sample size

Digits Dataset (MNIST-style): used for initial MPI testing and validation

  • Samples: 1,797 digit images (8×8 pixels)
  • Classes: 10 digits (0-9)
  • Purpose: validate the MPI implementation before the bioinformatics analysis
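Given the 1,797-sample, 8×8 shape, this appears to be the digits dataset bundled with scikit-learn; assuming that loader, a quick local sanity check before any MPI run looks like:

```python
from sklearn.datasets import load_digits

# Load the bundled 8x8 digits dataset (1,797 samples, 64 flattened pixels)
digits = load_digits()

print(digits.data.shape)         # (1797, 64)
print(len(set(digits.target)))   # 10 classes: digits 0-9
```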

🚀 Quick Start

Prerequisites

  • Hardware: 3 Virtual Machines (1 master + 2 workers)
  • OS: Ubuntu 20.04+ or CentOS 7+
  • Memory: Minimum 4GB per VM
  • Network: Configured for inter-VM communication

Task 1: MPI Cluster Setup

Install Dependencies

# Update system
sudo apt update && sudo apt upgrade -y

# Install OpenMPI
sudo apt install openmpi-bin openmpi-common libopenmpi-dev -y

# Install Python dependencies
pip install mpi4py numpy pandas scikit-learn

Configure Passwordless SSH

# Generate SSH key on master
ssh-keygen -t rsa -b 4096

# Copy to all nodes
ssh-copy-id user@worker1
ssh-copy-id user@worker2

Create Hostfile

# hostfile (referenced later via mpirun --hostfile)
master-node slots=1
worker1-node slots=2
worker2-node slots=2

Task 2: Docker Swarm + Spark Setup

Initialize Docker Swarm

# On master node
docker swarm init --advertise-addr <master-ip>

# Join workers (run on worker nodes)
docker swarm join --token <token> <master-ip>:2377

Deploy Spark Cluster

# Deploy the stack
docker stack deploy -c spark-stack.yml spark-cluster

# Verify deployment
docker service ls
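The spark-stack.yml referenced above is not reproduced in this README; a minimal Swarm stack of that shape might look like the following sketch (the bitnami/spark image, port numbers, and service names are illustrative assumptions, and the Jupyter service is omitted):

```yaml
version: "3.8"
services:
  spark-master:
    image: bitnami/spark:3          # illustrative image choice
    environment:
      - SPARK_MODE=master
    ports:
      - "8080:8080"                 # Spark master web UI
      - "7077:7077"                 # Spark master RPC port
    deploy:
      placement:
        constraints: [node.role == manager]   # pin master to the Swarm manager
  spark-worker:
    image: bitnami/spark:3
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    deploy:
      replicas: 2                   # one worker per Swarm worker node
```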

🎮 Usage

Running MPI Distributed Analysis

MNIST Digit Classification

mpirun -np 4 --hostfile hostfile python distributed_mnist.py

Gene Expression Analysis

mpirun -np 4 --hostfile hostfile python distributed_gene_analysis.py

Running Spark Distributed Analysis

Access Jupyter Notebook

# Navigate to http://master-ip:8888
# Open distributed_gene_expression_analysis.py

Monitor Spark Jobs

# Spark UI: http://master-ip:8080
# Job monitoring and resource utilization

📈 Performance Results

MPI Implementation Results

  • Training Time: 0.0398 seconds (averaged across processes)
  • Test Accuracy: 58.82%
  • Ensemble Method: majority voting across distributed models
  • Scalability: near-linear speedup as processes are added

Spark Implementation Results

  • Training Time: ~15-20 seconds (including overhead)
  • Test Accuracy: ~65-70% (improved with advanced feature selection)
  • Features: Automated feature selection pipeline (7,129 → 500 features)
  • Advantage: Better handling of large-scale data processing

Performance Comparison

| Metric           | MPI       | Spark     |
|------------------|-----------|-----------|
| Setup Complexity | Low       | Medium    |
| Training Time    | 0.04 s    | 15-20 s   |
| Accuracy         | 58.82%    | 65-70%    |
| Scalability      | Excellent | Good      |
| Fault Tolerance  | Limited   | High      |
| Big Data Support | Limited   | Excellent |

🔬 Technical Implementation

MPI Approach

  • Data Distribution: Manual data partitioning across processes
  • Model Training: Independent RandomForest models per process (50 estimators)
  • Feature Selection: SelectKBest with f_classif (top 500 features)
  • Aggregation: Majority voting ensemble
  • Communication: Point-to-point and collective MPI operations
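The aggregation step in the bullets above, majority voting over per-process predictions, can be sketched locally with NumPy (the array stands in for predictions gathered from the MPI processes; the data and names are illustrative, not the project's actual code):

```python
import numpy as np

# Predictions from 4 distributed models on 6 test samples:
# rows = MPI processes, columns = samples, values = class labels.
per_process_preds = np.array([
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
])

def majority_vote(preds: np.ndarray) -> np.ndarray:
    """Per-sample majority vote: the most frequent label in each column wins."""
    return np.apply_along_axis(
        lambda col: np.bincount(col).argmax(), axis=0, arr=preds
    )

ensemble = majority_vote(per_process_preds)
print(ensemble)  # [0 1 1 0 1 0]
```

In the real run, the root rank would collect each rank's predictions with a collective gather before voting.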

Spark Approach

  • Data Processing: Distributed DataFrames with automatic partitioning
  • Feature Engineering: MLlib pipeline with ChiSqSelector (1000 → 500 features)
  • Model Training: Distributed RandomForest (20 trees, maxDepth=10)
  • Fault Tolerance: Automatic recovery and lineage tracking
  • Optimization: Checkpointing and caching for performance
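The chi-square selection step above keeps the k features with the highest chi-square statistic against the labels; it can be illustrated locally with scikit-learn's equivalent selector (an illustrative stand-in on synthetic data, not the Spark MLlib ChiSqSelector used in the project):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X = rng.random((38, 1000))           # non-negative expression-like features
y = rng.integers(0, 2, size=38)      # binary ALL/AML-style labels

# Keep the 500 features with the highest chi-square score vs. the labels
selector = SelectKBest(chi2, k=500)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # (38, 500)
```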

๐Ÿ“ Project Structure

project_hpc_hybrid_cluster/
├── README.md                                # This file
├── hostfile                                 # MPI hostfile configuration
├── distributed_mnist.py                     # MPI MNIST classification
├── distributed_gene_analysis.py             # MPI gene expression analysis
├── spark-stack.yml                          # Docker Swarm Spark deployment
├── distributed_gene_expression_analysis.py  # Spark gene analysis
├── bioinfo_data/                            # Dataset directory
│   ├── data_set_ALL_AML_train.csv           # Training data
│   ├── data_set_ALL_AML_independent.csv     # Test data
│   └── actual.csv                           # Labels
├── screenshots/                             # Setup and results screenshots
└── Final_Report.pdf                         # Comprehensive analysis report

🎯 Key Learning Outcomes

  • HPC Fundamentals: Understanding of distributed computing paradigms
  • MPI Programming: Hands-on experience with message passing interface
  • Big Data Processing: Apache Spark ecosystem and MLlib
  • Container Orchestration: Docker Swarm cluster management
  • Bioinformatics ML: Real-world gene expression analysis
  • Performance Analysis: Comparative evaluation of different approaches

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


Built with โค๏ธ for distributed computing education
