The Message Passing Interface (MPI) is designed primarily for distributed-memory machines, though it is also useful on shared-memory systems. It was first standardized in 1994.

The MPI paradigm:
    Processes can access local memory only but can communicate using messages.

    Messages are copied from the local memory of one process to the local memory of another, which is orders of magnitude slower than accessing local memory.

Latency and bandwidth (InfiniBand):
    Latency: < 0.6 µs (maximum cable length 15 m).
    Bandwidth: 400 Gbit/s.
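
As a rough illustration of these figures, the sketch below estimates the time to move one message with the common first-order model time = latency + size / bandwidth. The model and the name `transfer_time` are assumptions made for illustration; real transfers also depend on protocol overheads and network contention.

```python
# First-order cost model (an assumption, not a measurement):
#   time = latency + message_size / bandwidth
LATENCY_S = 0.6e-6        # 0.6 microseconds, the InfiniBand latency quoted above
BANDWIDTH_BIT_S = 400e9   # 400 Gbit/s, the InfiniBand bandwidth quoted above


def transfer_time(n_bytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BIT_S):
    """Estimated seconds needed to move n_bytes between two nodes."""
    return latency + (8 * n_bytes) / bandwidth


for n_bytes in (8, 1024, 1024 ** 2):
    microseconds = transfer_time(n_bytes) * 1e6
    print(f"{n_bytes:>8} bytes: {microseconds:.3f} microseconds")
```

In this model small messages are dominated by latency and large ones by bandwidth, which is one reason to prefer a few large messages over many small ones.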

Advantages of MPI:
1. Message passing in shared memory:
    On both distributed- and shared-memory systems, MPI lets the programmer manage data locality by specifying exactly where the data goes. On a shared-memory system, passing a message often just means copying to another part of shared memory, so you don't pay a network cost.
2. Debugging/overwriting advantage:
    MPI communication is explicit (MPI_Send, MPI_Recv, etc.), which makes it easier to trace where data comes from and where it goes, and makes debugging less painful than with OpenMP.
3. MPI is a library, not a language:
    MPI provides functions you call inside your program; it is not a new programming language.
    The standard defines bindings for C and Fortran (C++ programs can call the C bindings), and mpi4py provides Python bindings.
    There are open-source MPI implementations (such as MPICH, Open MPI, and the older LAM/MPI) as well as vendor-tuned versions optimized for supercomputers.
4. MPI is Portable:
    Since MPI is a standard, not just an implementation, you can move MPI programs between systems pretty easily.
5. Processes, not Threads:
    MPI programs are made up of processes that talk via messages.
    Processes can technically all run on the same machine, but that’s inefficient — you usually want them distributed across nodes in a cluster.

## Communicator:
In MPI, a communicator is an object that defines a group of processes that can communicate with each other.
The default communicator, MPI.COMM_WORLD, contains all processes of the program.

## Rank:
- rank is the unique ID (an integer) of the process within COMM_WORLD.
- It starts at 0 and goes up to size - 1.

- Example: If you run the program with 4 processes, the ranks will be: 0, 1, 2, 3.

## size:
- size is the total number of processes in COMM_WORLD.

- This tells each process how many processes are running in total.

- Example: If you start the program with 4 processes, each will see size == 4.



# print_rank_size.py

In [None]:
from mpi4py import MPI

# MPI.COMM_WORLD is the default communicator 
comm=MPI.COMM_WORLD

# rank is the unique ID of the process within COMM_WORLD. It starts from 0 to size-1
rank=comm.Get_rank()
size=comm.Get_size()

print("Hello World!")
print(f"rank={rank}, size={size}")

# Broadcast in mpi4py

from mpi4py import MPI
import sys

def main(argv):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    print("Hello world!")
    print(f"rank={rank}; size={size}")

    comm.Barrier()  # Every process stops here and waits until all processes arrive; then they continue together.

    if rank == 0:
        n = (rank + 1) * 4711
    else:
        n = None

    n = comm.bcast(n, root=0)  # broadcast from rank 0 to remaining processes.

    print(f"Rank {rank} received = {n}")

if __name__ == "__main__":
    main(sys.argv)
    MPI.Finalize()  # Ends MPI for this process (mpi4py would also do this automatically at exit)


# output:
(base) [sy37tovi@mlogin01 mpi]$ mpiexec -n 5 python Bcast.py
Hello world!
rank=3; size=5
Hello world!
rank=1; size=5
Hello world!
Hello world!
rank=4; size=5
rank=2; size=5
Hello world!
rank=0; size=5
Rank 0 received = 4711
Rank 2 received = 4711
Rank 4 received = 4711
Rank 3 received = 4711
Rank 1 received = 4711

# Conventions present in mpi4py
The mpi4py library uses lower-case method names (comm.bcast) to communicate generic Python objects (convenient and high level, but slower because the objects are pickled) and
upper-case names (comm.Bcast) for buffer-like objects such as NumPy arrays (low level, potentially faster).


Example:
from mpi4py import MPI
import numpy as np

def main():
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    root = 0

    # Using comm.bcast (Python object; an int is an object in Python)
    if rank == root:
        value = 42
    else:
        value = None
    value = comm.bcast(value, root=root)
    print(f"Rank {rank}: Received value using comm.bcast: {value}")

    # Using comm.Bcast (numpy array)
    if rank == root:
        value_np = np.array([42], dtype='i')
    else:
        value_np = np.empty(1, dtype='i')
    # [value_np, MPI.INT] is a buffer specification in mpi4py's lower-level interface:
    # the array plus the MPI datatype (here a 32-bit integer), which tells MPI how to
    # interpret the raw bytes. Passing it gives MPI direct access to the array's memory.
    comm.Bcast([value_np, MPI.INT], root=root)
    print(f"Rank {rank}: Received value using comm.Bcast: {value_np[0]}")

if __name__ == "__main__":
    main()

# MPI Reduction

An MPI reduction is an operation where values from all ranks are combined into a single result, usually at one “root” rank, using an operation like sum, max, min, product, logical AND/OR, etc.

Common MPI reduction operations
MPI.SUM → sum of values

MPI.PROD → product of values

MPI.MAX → maximum value

MPI.MIN → minimum value

MPI.LAND → logical AND

MPI.LOR → logical OR
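
To make the semantics concrete without launching MPI, the sketch below folds the values n = rank + 1 of five hypothetical ranks with plain Python operators. It is only a model of what a reduction computes; the names `contributions`, `OPS` and `reduce_model` are illustrative and not part of mpi4py.

```python
from functools import reduce
import operator

# One contribution per rank: rank r holds n = r + 1 (five hypothetical ranks)
contributions = [rank + 1 for rank in range(5)]

# Plain-Python stand-ins for the MPI reduction operations listed above
OPS = {
    "SUM": operator.add,
    "PROD": operator.mul,
    "MAX": max,
    "MIN": min,
    "LAND": lambda a, b: bool(a) and bool(b),
    "LOR": lambda a, b: bool(a) or bool(b),
}


def reduce_model(values, op_name):
    """Combine the per-rank values pairwise, as a reduction does."""
    return reduce(OPS[op_name], values)


for name in OPS:
    print(f"{name}: {reduce_model(contributions, name)}")
```

With reduce, only the root would end up holding the combined value; with allreduce, every rank would.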

In [None]:
from mpi4py import MPI
import sys

def main(argv):
        comm=MPI.COMM_WORLD
        rank=comm.Get_rank()
        size=comm.Get_size()
        print("Hello world!")
        print(f"rank={rank} of size={size}")
        comm.Barrier()

        n=rank+1
        prod=comm.reduce(n,op=MPI.PROD,root=0)

        if rank==0:
                print(f"received={prod}")

if __name__=="__main__":
        main(sys.argv)
        MPI.Finalize()

Hello world!
rank=2 of size=5
Hello world!
rank=3 of size=5
Hello world!
rank=4 of size=5
Hello world!
rank=0 of size=5
Hello world!
rank=1 of size=5
received=120

allreduce is similar to reduce, but every rank gets the result instead of only the root.

In [None]:
from mpi4py import MPI
import sys

def main(argv):
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    print("Hello world!")
    print(f"rank={rank}; size={size}")
    n = rank + 1
    comm.Barrier()
    # MPI_Reduce: only the root (rank 0) receives the sum
    total = comm.reduce(n, op=MPI.SUM, root=0)
    if rank == 0:
        print(f"received reduce={total}")
    comm.Barrier()
    # MPI_Allreduce: every rank receives the sum
    total_all = comm.allreduce(n, op=MPI.SUM)
    print(f"Received allreduce={total_all}")

if __name__ == "__main__":
    main(sys.argv)
    MPI.Finalize()

(base) [sy37tovi@mlogin01 mpi]$ mpiexec -n 5 python all_reduce.py
Hello world!
rank=3; size=5
Hello world!
rank=1; size=5
Hello world!
rank=0; size=5
Hello world!
rank=4; size=5
Hello world!
rank=2; size=5
received reduce=15
Received allreduce=15
Received allreduce=15
Received allreduce=15
Received allreduce=15
Received allreduce=15

# Measuring Walltime

In [None]:
from mpi4py import MPI

comm=MPI.COMM_WORLD

comm.Barrier()
starttime=MPI.Wtime()

rank=comm.Get_rank()
size=comm.Get_size()

print("Hello world!")
print(f"rank={rank},size={size}")

comm.Barrier()
endtime=MPI.Wtime()

elapsed_time=endtime-starttime

# total_time avoids shadowing Python's built-in sum()
total_time=comm.reduce(elapsed_time,op=MPI.SUM,root=0)

if rank==0:
        print(f"Sum of elapsed time: {total_time}")

(base) [sy37tovi@mlogin01 mpi]$ mpiexec -n 5 python mpi.py
Hello world!
rank=2,size=5
Hello world!
rank=3,size=5
Hello world!
rank=4,size=5
Hello world!
rank=0,size=5
Hello world!
rank=1,size=5
Sum of elapsed time: 0.00014458100000000002

# MPI_Send
MPI_Send lets one rank (process) send a message (data buffer) to another rank.

It’s a blocking send, meaning the sending process might wait until the message data has been copied out (depending on MPI implementation).

You specify:

    - The data buffer to send

    - The destination rank (which process to send to)

    - A message tag (an integer to help match sends and receives)

    - The communicator (usually MPI.COMM_WORLD)

# comm.send:
Sends buf from the calling process to dest. The tag is an integer ID from 0 to 32767 (the standard guarantees an upper bound of at least 32767; implementations may allow larger tags).

Python:
comm.Send(buf, dest=dest, tag=tag)

# comm.recv:

MPI_Recv lets one rank receive a message from another rank.

It is blocking — the receiving process waits until the message arrives.

You specify:

A buffer to store the incoming data

The source rank (which process to receive from; or MPI.ANY_SOURCE to accept from anyone)

The message tag to match the send

# What happens during send?

# Handshaking with the receiver
MPI might first contact the destination process to ensure it is ready to receive (MPI_Recv posted).

For synchronous sends (like MPI_Ssend), your process waits until the receiver actually starts receiving.


# Copying data into MPI’s internal buffer (if available)
If the receiver isn’t ready, MPI might copy your send buffer into an internal temporary buffer so you can reuse your array.

If there’s no buffer space available (common for large messages), you wait until the receiver is ready.

# Actual network transfer
For large messages, MPI may require the data to be completely transferred before returning.

This can involve:

Writing to shared memory (if sender/receiver are on the same node)

Sending packets across network hardware (InfiniBand, Ethernet, etc.)

Waiting for acknowledgments from the destination


While you’re “waiting” in MPI_Send, your process can’t do anything else — MPI is busy completing enough of the send operation to make your buffer safe, which may involve:

Waiting for the receiver

Copying to MPI buffers

Fully transmitting data

In [None]:
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data0 = "Hello from rank 0"
    comm.send(data0, dest=2, tag=11)  # send string to rank 2
elif rank == 1:
    data1 = "Hello from rank 0"  # rank 1 sends the same text, as the output below shows
    comm.send(data1, dest=2, tag=12)  # send string to rank 2
elif rank == 2:
    # ANY_SOURCE accepts any sender; the tags tell the two messages apart
    data0 = comm.recv(source=MPI.ANY_SOURCE, tag=11)  # message sent by rank 0
    data1 = comm.recv(source=MPI.ANY_SOURCE, tag=12)  # message sent by rank 1
    print(f"Rank1 received: {data0, data1}")  # printed by rank 2, despite the label


(base) [sy37tovi@mlogin01 mpi]$ mpiexec -n 5 python mpi_send.py
Rank1 received: ('Hello from rank 0', 'Hello from rank 0')

# Difference between MPI_Send and MPI_Isend

# Blocking MPI_Send:

The function doesn’t return control to your program until the operation is complete enough for MPI to guarantee safety.

Your process is “stuck” inside the MPI library during that time — it’s not running your code, not computing, not doing anything else.

# MPI_Send — Standard Mode Send
Mode: Standard send (MPI chooses between buffered or synchronous internally).

Behavior:
May complete before the matching receive is posted if MPI can copy the message into an internal buffer.

May also block until the receiver starts receiving, if no buffer is available or the message is large.

When send returns: The send buffer is safe to reuse, but no guarantee the receiver has started reading it yet.

# MPI_Ssend — Synchronous Mode Send
Mode: Synchronous send (always handshake with the receiver).

Behavior:
Never completes until the matching receive has started.

Requires a rendezvous protocol: sender and receiver exchange a handshake before data transfer.

When send returns: The send buffer is safe to reuse and you are guaranteed the receiver has entered the receive call for that message.

Here is how MPI recognizes that the receive has started:

When a process calls Recv, it tells MPI “I’m ready to receive a message from this source.”

MPI matches this receive with the corresponding Ssend waiting on the sender side.

Once the receive is posted, MPI allows the Ssend to proceed and complete.

| Feature                    | `MPI_Send` (Standard)                     | `MPI_Ssend` (Synchronous)                   |
| -------------------------- | ----------------------------------------- | ------------------------------------------- |
| Completion condition       | Buffer copied or receiver started receive | Receiver has started receive (handshake)    |
| May return before receive? | Yes (if buffered)                         | No                                          |
| Guarantee about receiver?  | None                                      | Yes — receiver has posted the matching recv |
| Potential to block long?   | Yes, but depends on buffers               | Yes — will block until recv starts          |
| Typical use                | General communication                     | Debugging ordering or avoiding buffering    |


# Non-blocking MPI_Isend:

Non-blocking: Initiates the send but returns immediately, even before data is copied into MPI’s internal buffer (if any).

You cannot reuse or modify the send buffer until you call MPI_Wait or MPI_Test and confirm completion.

Gives you overlap of computation and communication — you can start sending and do other work before ensuring completion.

Still uses the standard send mode internally, so the same buffering/synchronous behavior applies — the difference is that progress happens in the background.

| Feature                     | `MPI_Send` (Blocking)                           | `MPI_Isend` (Non-blocking)                               |
| --------------------------- | ----------------------------------------------- | -------------------------------------------------------- |
| Returns immediately?        | No (waits until buffer safe to reuse)           | Yes (but you must wait/test later before reusing buffer) |
| Mode under the hood         | Standard (impl chooses buffered or synchronous) | Standard (same)                                          |
| Can overlap compute & comm? | No                                              | Yes                                                      |
| Need `MPI_Wait`/`MPI_Test`? | No                                              | Yes                                                      |



# Deadlock:
A deadlock in MPI (and in programming in general) happens when two or more processes are each waiting for the other to do something, so nobody can move forward — the program just gets stuck forever.

| Time          | Rank 0        | Rank 1        |
| ------------- | ------------- | ------------- |
| t0            | Send → wait   | Send → wait   |
| t1 (deadlock) | waiting…      | waiting…      |
| t2            | still waiting | still waiting |

Here’s what happens (assuming the sends are not buffered, as with Ssend or large messages):

Rank 0 starts Send to Rank 1 — it waits for Rank 1 to post a matching Recv.

Rank 1 starts Send to Rank 0 — it waits for Rank 0 to post a matching Recv.

Both are waiting for each other to post a receive… but no one ever does.
→ Deadlock.


# (b) Producing a deadlock with MPI.Ssend
If every process tries to send first using Ssend before posting a recv, all will block waiting for the receiver to post a matching receive - but no one ever does.


In [None]:
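from mpi4py import MPI
import numpy as np

comm=MPI.COMM_WORLD
rank=comm.Get_rank()
size=comm.Get_size()

# Upper-case Ssend/Recv need buffer objects, so the values live in NumPy arrays
send_data=np.array([rank], dtype='i')
recv_data=np.empty(1, dtype='i')
next_rank=(rank+1)%size
prev_rank=(rank-1)%size

# Every rank blocks in Ssend and never reaches its Recv, so the program hangs
comm.Ssend(send_data,dest=next_rank)
comm.Recv(recv_data,source=prev_rank)

print(f"Rank {rank} received {recv_data[0]}")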

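# (c) Avoiding the deadlock
Two common fixes: break the symmetry so that the ranks do not all block in the same call, or use a non-blocking call (see (d)).

Posting the Recv first on every rank does not help: Recv is blocking too, so all ranks would wait in Recv and nobody would send. Instead, let even ranks send first and odd ranks receive first (this needs at least two ranks):

In [None]:
if rank % 2 == 0:
    comm.Ssend(send_data, dest=next_rank)
    comm.Recv(recv_data, source=prev_rank)
else:
    comm.Recv(recv_data, source=prev_rank)
    comm.Ssend(send_data, dest=next_rank)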
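
Which orderings of blocking calls complete can be checked without MPI using a toy model. The sketch below assumes purely synchronous (Ssend-like) matching: a send finishes only when its peer is currently sitting at the matching receive. The helpers `completes` and `ring` are hypothetical names for this illustration and do not call mpi4py.

```python
def completes(programs):
    """Return True if every rank finishes, False if the run deadlocks.

    programs[r] is the ordered list of blocking operations of rank r, each
    ("send", peer) or ("recv", peer). A send completes only when the peer's
    current operation is the matching recv (synchronous matching).
    """
    pc = [0] * len(programs)  # index of the next operation of each rank
    progressed = True
    while progressed:
        progressed = False
        for src, prog in enumerate(programs):
            if pc[src] == len(prog):
                continue  # this rank has already finished
            kind, dst = prog[pc[src]]
            if kind != "send" or pc[dst] == len(programs[dst]):
                continue
            if programs[dst][pc[dst]] == ("recv", src):
                pc[src] += 1  # the send completes
                pc[dst] += 1  # the matching receive completes
                progressed = True
    return all(pc[r] == len(programs[r]) for r in range(len(programs)))


def ring(size, send_first):
    """Ring exchange: rank r sends to r+1 and receives from r-1."""
    programs = []
    for rank in range(size):
        send = ("send", (rank + 1) % size)
        recv = ("recv", (rank - 1) % size)
        programs.append([send, recv] if send_first(rank) else [recv, send])
    return programs


all_send_first = completes(ring(4, lambda r: True))
all_recv_first = completes(ring(4, lambda r: False))
even_send_first = completes(ring(4, lambda r: r % 2 == 0))
print(all_send_first, all_recv_first, even_send_first)
```

In this model only the mixed ordering finishes: when every rank starts with the same blocking call, no send ever meets a waiting receive.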

# (d) MPI.Isend and MPI.Recv
MPI.Isend → Immediate (non-blocking) send:

Returns immediately after initiating the send, before the receiver is ready.

The send completes later; you must call .Wait() or .Test() on the returned request to ensure completion.

Often used to overlap computation with communication.

In [None]:
req = comm.Isend(send_data, dest=next_rank)
comm.Recv(recv_data, source=prev_rank)
req.Wait()

# (e) Isend + Recv vs Send + Irecv
Isend + Recv:

Send is non-blocking, receive is blocking.

Good if you want to start sending early and only wait after receiving.

Send + Irecv:

Send is blocking, receive is non-blocking.

Good if you want to post a receive early and process data when it arrives without blocking.

Key difference: Which direction of communication is allowed to progress without waiting — sending or receiving.



Why MPI_Irecv is useful

1. Overlap computation & communication
You don’t have to wait at the recv call — computation can happen while data is moving.

2. Avoid deadlocks
If two processes both call a blocking MPI_Send first and the messages are not buffered, each waits for the other to post a receive, and the program deadlocks.
With MPI_Isend/MPI_Irecv, both can start the exchange without blocking.

# MPI_Sendrecv
MPI_Sendrecv is a combined blocking operation that sends a message to one process and simultaneously receives a message from another (or the same) process.

It avoids deadlocks that can happen when you call MPI_Send and MPI_Recv separately, especially when two processes are sending and receiving data from each other at the same time.

# MPI_Alltoall
MPI_Alltoall is a collective communication operation where every process sends data to every other process, including itself, and receives data from every other process as well.

It’s like a “full exchange” of data between all ranks in the communicator.

Imagine you have N processes (like workers in a group). Each worker has N pieces of information, one piece meant for each worker (including themselves).

When MPI_Alltoall runs, each worker sends the piece meant for worker 0 to worker 0, the piece meant for worker 1 to worker 1, and so on — all at once.

At the end, every worker holds one piece from every worker, including the piece it kept for itself.
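
The exchange amounts to a transpose of the "who holds what for whom" table. The sketch below models it in plain Python for three hypothetical ranks; `alltoall_model` is an illustrative name, not an mpi4py function. In mpi4py the actual exchange of Python objects is done with `comm.alltoall(sendobj)`, where `sendobj` holds one item per rank.

```python
def alltoall_model(send):
    """Model of an all-to-all exchange.

    send[i][j] is the piece rank i holds for rank j.
    Returns recv, where recv[j][i] is the piece rank j received from rank i.
    """
    size = len(send)
    if any(len(row) != size for row in send):
        raise ValueError("each rank needs exactly one piece per rank")
    return [[send[i][j] for i in range(size)] for j in range(size)]


size = 3
# Label every piece with its sender and its destination
send = [[f"{src}->{dst}" for dst in range(size)] for src in range(size)]
recv = alltoall_model(send)

for rank, pieces in enumerate(recv):
    print(f"rank {rank} received {pieces}")
```

Applying the model twice returns the original table, which is a quick way to see that it is a transpose.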


# Splitting Communicators:

Splitting a communicator means:

    Start with a big group of processes (oldcomm, often MPI.COMM_WORLD).

    Assign each process a color (integer).

    Same color → goes into the same subgroup.

    Different color → goes into different subgroups.

Inside each subgroup, order processes by key (smallest key = rank 0 in the new subgroup).

The result is a new communicator for each subgroup.
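
The grouping rule can be sketched in plain Python. The model below takes a (color, key) pair for each old rank and returns, per color, the old ranks in their new-rank order; ties in key are broken by the old rank, as the MPI standard specifies. `split_model` and the six-rank setup are assumptions for illustration. In mpi4py the real call is `comm.Split(color, key)`.

```python
def split_model(assignments):
    """assignments[old_rank] = (color, key).

    Returns {color: [old ranks listed in new-rank order]}.
    """
    groups = {}
    for old_rank, (color, key) in enumerate(assignments):
        groups.setdefault(color, []).append((key, old_rank))
    # Sort by key; equal keys fall back to the old rank
    return {
        color: [old_rank for _, old_rank in sorted(members)]
        for color, members in groups.items()
    }


# Six hypothetical ranks: color = parity of the rank, key = -rank (reverses the order)
assignments = [(rank % 2, -rank) for rank in range(6)]
subgroups = split_model(assignments)

for color, members in sorted(subgroups.items()):
    for new_rank, old_rank in enumerate(members):
        print(f"color {color}: old rank {old_rank} -> new rank {new_rank}")
```

Here the even ranks form one subgroup and the odd ranks another, and the negative keys make the highest old rank become rank 0 in each new subgroup.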


# MPI_Gather

MPI_Gather collects data from all processes in a communicator into a single process called the root.

Each process sends its local data to the root process.

The root process receives all the data and stores it in a buffer (usually an array or list).

# MPI_Scatter

MPI_Scatter splits data held by the root process into chunks and distributes one chunk to each process in the communicator, including the root itself.
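
Scatter and gather are often used as a pair: the root hands out chunks, every rank works on its own chunk, and the root collects the results. The sketch below models that round trip in plain Python for four hypothetical ranks; `scatter_model` and `gather_model` are illustrative names only. In mpi4py the corresponding calls for Python objects are `comm.scatter(sendobj, root=0)` and `comm.gather(sendobj, root=0)`.

```python
def scatter_model(chunks):
    """The root holds one chunk per rank; rank r ends up with chunks[r]."""
    return {rank: chunk for rank, chunk in enumerate(chunks)}


def gather_model(local_values):
    """The root collects one value per rank, ordered by rank."""
    return [local_values[rank] for rank in sorted(local_values)]


size = 4
data = list(range(8))  # data initially held by the root only
chunk_len = len(data) // size
chunks = [data[r * chunk_len:(r + 1) * chunk_len] for r in range(size)]

local = scatter_model(chunks)                                   # scatter
partial = {rank: sum(chunk) for rank, chunk in local.items()}   # local work
gathered = gather_model(partial)                                # gather
print(gathered)
```

The gathered list is ordered by rank, so the root can rely on position i holding the result of rank i.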