### MPI Point-to-Point Messaging

Point-to-point communication is the most basic operation in an MPI program: it sends a buffer of data from one node to another.

| Operation | Syntax |
| :--- | ---: |
| Blocking send: | `MPI_Send(buffer,count,type,dest,tag,comm)` |
| Non-blocking send: | `MPI_Isend(buffer,count,type,dest,tag,comm,request)` |
| Blocking receive: | `MPI_Recv(buffer,count,type,source,tag,comm,status)` |
| Non-blocking receive: | `MPI_Irecv(buffer,count,type,source,tag,comm,request)` |

`buffer` is the address in the program's (application's) address space of the data to be sent or received. In most cases, this is simply the name of the variable being sent or received (or its address, for a scalar).

`count` indicates the number of data elements of a particular type to be sent (or, for a receive, the maximum number to accept).

`type` is the MPI datatype of each element. For reasons of portability, MPI predefines its elementary data types, such as `MPI_INT` and `MPI_DOUBLE`.

The `Send/Recv` operations are __synchronous__.
  * The sender does not return until the buffer has been transferred.
  * The receiver does not return until the message is received.
  
`Isend/Irecv` operations are __asynchronous__. They return immediately.
  * The sender must keep the buffer intact (cannot reuse or destroy it) until the send actually happens at a later time.
  * The receiver provides a buffer that will be filled at a later time. It must check the request for completion (for example, with `MPI_Wait` or `MPI_Test`).
  * If the sender or receiver overwrites or deallocates the buffer before the operation completes, the behavior is undefined.
   
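Asynchronous messaging is more challenging in MPI than in Ray. In Ray, the system took possession of the buffers by placing them in the key/value store. In MPI, the application keeps ownership of the buffers and must manage their lifetimes itself.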

### Synchronous Sends and Deadlock

With blocking send and receive, we know how to build a deadlock.  Let's try it in [examples/mpi/nodeadlock.c](examples/mpi/nodeadlock.c). We'll do this with four nodes where the processes are sends and the resources on which we are waiting are receives.  Recall our deadlock picture with 3 nodes.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/Wait-for_graph_example.svg/2880px-Wait-for_graph_example.svg.png" width=256 />

OK, that didn't work, why not?

The MPI messaging system is running in something called standard mode. The gotcha here is:

> MPI_Send will not return until you can use the send buffer. It may or may not block (it is allowed to buffer, either on the sender or receiver side, or to wait for the matching receive).

This creates a real hazard. __Why?__ _discussion in HTML comments_


Buffering may work sometimes, which has the effect of turning potentially synchronous calls into asynchronous calls. This adds queues that avoid deadlocks, as we did in the Ray examples. The bad thing that happens:
* Build an MPI program and test it. It runs fine.
* Deploy it to a new environment. It deadlocks.


Buffering may be used when available and not at other times. It depends on the available memory and the configuration of the cluster. Conditions under which buffering might stop working:
* Many senders or receivers
* Increased message size

Essentially, when your program scales up it breaks.

OK, let's make the deadlock happen in [examples/mpi/deadlock.c](examples/mpi/deadlock.c).

The secret here is:

> MPI_Ssend: Send a message and block until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.

Interesting, it's not fully synchronous. The receiver has not gotten the entire message yet. But, if the receiver is single threaded, it's thread safe. What could happen if the receiver is itself a parallel program?

### Eliminating Deadlock

Paired sends and receives: [examples/mpi/passitforward.c](examples/mpi/passitforward.c)

<img src=./images/pairedsr.png width=512>

This is the two-phase, deadlock-free protocol we talked about in Ray.

The canonical form of deadlock avoidance is the [Banker's Algorithm](https://en.wikipedia.org/wiki/Banker%27s_algorithm) that performs admission based on the current allocated set of resources. It requires knowledge of all system resources and the max resources needed by each process.

### Best Practice

Develop a deadlock-free MPI program by:
* Implementing code with `MPI_Ssend` to reveal deadlocks in the messaging protocol.
* Deploying code with `MPI_Send` so that the system can optimize performance at runtime.