## MPI-parallel matrix-vector product in DOLFINx/Python

In this notebook we implement matrix-vector multiplication with MPI. 

1. For both matrices and vectors, we construct index maps and use them to distribute the values. We first create an IndexMap that describes the distribution scheme we want, and then use it to create objects (such as vectors and matrices) with that distribution.

2. Once the IndexMap and the distributed objects are defined, we show some basic arithmetic with them, such as vector addition and the matrix-vector product.
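
To make the first step concrete, here is a minimal serial sketch (no MPI involved) of how contiguous ownership ranges follow from the local sizes through an exclusive prefix sum. The helper name `ownership_ranges` is hypothetical, and the sizes 50, 150 and 33 are the ones used on three ranks later in this notebook.

```python
from itertools import accumulate


def ownership_ranges(local_sizes):
    """Return the half-open global range (start, end) owned by each rank."""
    # Exclusive prefix sum: the offset of rank r is the total size of ranks < r
    offsets = [0] + list(accumulate(local_sizes))[:-1]
    return [(off, off + n) for off, n in zip(offsets, local_sizes)]


ranges = ownership_ranges([50, 150, 33])
print(ranges)  # [(0, 50), (50, 200), (200, 233)]
```

In the MPI version each rank only knows its own local size, so the same offsets would come from an exclusive scan over the communicator rather than from a local list.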

In [49]:
import logging
import ipyparallel as ipp

# create a cluster
cluster = ipp.Cluster(engines="mpi", n=3, log_level=logging.WARNING)
rc = cluster.start_and_connect_sync(activate=True)


In [50]:
%%px
# Find out rank, size
from mpi4py import MPI

comm = MPI.COMM_WORLD
world_rank = comm.Get_rank()
world_size = comm.Get_size()
print(f"I am rank {world_rank} / {world_size}")

[stdout:0] I am rank 0 / 3


[stdout:1] I am rank 1 / 3


[stdout:2] I am rank 2 / 3


We want to create an overlapping, i.e. ghosted, index map for our vectors and matrices (see the [DOLFINx implementation](https://github.com/FEniCS/dolfinx/blob/160ed13eb476df99944072aec70bd46a6fcb9bcf/cpp/dolfinx/common/IndexMap.cpp)).

What we need is:
- MPI communicator
- **source ranks** - ranks that own indices ghosted by the caller
- **destination ranks** - ranks that ghost indices owned by the caller
- local size of the index map (number of owned entries)
- global indices of ghost entries
- owner rank of each ghost entry on global communicator
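
The definitions of source and destination ranks above can be checked with a small serial sketch. It assumes we can see the ghost owners of every rank at once, which a real MPI program cannot do without communication; the function name `neighbour_ranks` and the three-rank layout are hypothetical and chosen only for illustration.

```python
def neighbour_ranks(owners_per_rank):
    """owners_per_rank[r] lists the owner rank of every ghost held by rank r."""
    size = len(owners_per_rank)
    # Source ranks of r: the distinct ranks that own the ghosts held by r
    sources = [sorted(set(owners)) for owners in owners_per_rank]
    # Destination ranks of r: every rank q that has r among its sources
    destinations = [[q for q in range(size) if r in sources[q]] for r in range(size)]
    return sources, destinations


# Assumed layout: rank 0 ghosts entries owned by rank 1,
# ranks 1 and 2 ghost entries owned by rank 0.
owners_per_rank = [[1] * 150, [0] * 50, [0] * 5]
sources, destinations = neighbour_ranks(owners_per_rank)
for r in range(3):
    print(f"rank {r}: sources={sources[r]}, destinations={destinations[r]}")
```

Each rank can compute its sources locally from its own owner list. The destinations are the part that needs communication, since a rank does not know by itself who ghosts its entries.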

In [4]:
def is_sorted(lst):
    return all(a <= b for a, b in zip(lst, lst[1:]))

In [56]:
%%px
import numpy as np


class IndexMap:
    def __init__(self, comm, local_size, ghosts, owners):
        # ghosts: global indices of the ghost entries held by this rank
        # owners: owner rank of each ghost entry, in the same order as ghosts
        # fixme: perhaps hide these and use special methods to access
        assert len(ghosts) == len(owners)
        
        self.comm = comm
        self.forward_comm = None
        self.reverse_comm = None
        self.ghosts = ghosts
        self.owners = owners
        self.local_size = local_size
        
        # Get global size (sum of the local sizes over all ranks)
        self.global_size = 0
        self.compute_size_global()
        
        # Get the owned global range from the offset given by an exclusive scan
        self.local_range = None
        self.compute_local_range()
        
        # Get sources and destinations
        self.sources = None
        self.destinations = None
        self.compute_src_dest()
        
    def compute_src_dest(self):
        # dolfinx computes src as the sorted unique owners and
        # dest = sort(MPI.compute_graph_edges_nbx(self.comm, src))
        
        # Unique owner ranks of our ghosts. In the graph built below this rank
        # has one edge *to* each of them, hence the name dest_ranks.
        dest_ranks = np.unique(self.owners)
        print(f"{self.comm.rank=}, {dest_ranks=}")
        self.forward_comm = self.comm.Create_dist_graph([self.comm.rank], [len(dest_ranks)], dest_ranks.tolist(), reorder=False)
        src, dest, _ = self.forward_comm.Get_dist_neighbors()
        
        # Same neighbourhood with every edge reversed
        self.reverse_comm = self.comm.Create_dist_graph_adjacent(dest, src, reorder=False)
        src2, dest2, _ = self.reverse_comm.Get_dist_neighbors()
        
        print(f"{src=}, {src2=}")
        print(f"{dest=}, {dest2=}")
        
        # In forward_comm, `dest` holds the ranks that own our ghosts (our source
        # ranks) and `src` holds the ranks that ghost our indices (our
        # destination ranks), so the two are swapped on assignment.
        self.sources = dest
        self.destinations = src
        
    def local_to_global(self, local_ind):
        # Assumes the dolfinx layout: owned entries first, then the ghosts
        if local_ind < self.local_size:
            return self.local_range[0] + local_ind
        return self.ghosts[local_ind - self.local_size]
    
    def compute_size_global(self):
        # Pass the scalar itself: on Python lists MPI.SUM acts as `+`
        # and would concatenate the lists instead of adding the sizes.
        self.global_size = self.comm.allreduce(self.local_size, op=MPI.SUM)
        
    def compute_local_range(self):
        # Exclusive scan: sum of the local sizes of all lower ranks.
        # mpi4py returns None on rank 0, which has no lower ranks.
        offset = self.comm.exscan(self.local_size, op=MPI.SUM)
        if offset is None:
            self.local_range = (0, self.local_size)
        else: 
            print(f"{offset=}")
            self.local_range = (offset, offset + self.local_size)


In [57]:
%%px
import numpy as np
if comm.rank == 0:
    IM = IndexMap(comm, 50, range(50, 200), list(np.ones(150, dtype=np.int32)))
elif comm.rank == 1:
    IM = IndexMap(comm, 150, range(50), list(np.zeros(50, dtype=np.int32)))
else:
    IM = IndexMap(comm, 33, range(5), list(np.zeros(5, dtype=np.int32)))    
    #raise RuntimeError("Too many processors!")

[stdout:0] self.comm.rank=0, dest_ranks=array([1], dtype=int32)
src=[1, 2], src2=[1]
dest=[1], dest2=[1, 2]


[stdout:2] offset=200
self.comm.rank=2, dest_ranks=array([0], dtype=int32)
src=[], src2=[0]
dest=[0], dest2=[]


[stdout:1] offset=50
self.comm.rank=1, dest_ranks=array([0], dtype=int32)
src=[0], src2=[0]
dest=[0], dest2=[0]


In [58]:
%%px
# Let's check if everything makes sense
print(IM.comm.rank)
print(IM.local_size)
print(IM.local_range)
print(IM.sources, IM.destinations)

[stdout:0] 0
50
(0, 50)
[1] [1, 2]


[stdout:1] 1
150
(50, 200)
[0] [0]


[stdout:2] 2
33
(200, 233)
[0] []




Now we want to use this index map to build DOLFINx objects (vectors and matrices).
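
Independently of the DOLFINx API, the role an index map plays in a matrix-vector product can be sketched in plain Python. This is a serial stand-in under stated assumptions, not the DOLFINx implementation: the three "ranks" are loop iterations, the matrix is an assumed 1D Laplacian stencil distributed by rows, and the ghost update is replaced by a direct read from a reference vector. All names (`row`, `x_global`, `ghosts_per_rank`) are hypothetical.

```python
N = 7                    # global size
local_sizes = [2, 3, 2]  # rows owned by each simulated rank
offsets = [0, 2, 5]      # exclusive prefix sum of local_sizes


def row(i):
    """Nonzeros of row i of the stencil [-1, 2, -1] as {column: value}."""
    return {j: (2.0 if j == i else -1.0) for j in (i - 1, i, i + 1) if 0 <= j < N}


x_global = [float(i * i) for i in range(N)]  # reference input vector

y_parts, ghosts_per_rank = [], []
for off, n in zip(offsets, local_sizes):
    owned = range(off, off + n)
    # Ghosts: columns touched by the owned rows that this rank does not own
    ghosts = sorted({j for i in owned for j in row(i)} - set(owned))
    ghosts_per_rank.append(ghosts)
    # Local vector layout: owned entries first, then the ghosts
    local_to_global = list(owned) + ghosts
    global_to_local = {g: l for l, g in enumerate(local_to_global)}
    # Stand-in for the ghost update: with MPI these values come from the owners
    x_local = [x_global[g] for g in local_to_global]
    # The local product only reads values stored on this rank
    y_parts.append([sum(v * x_local[global_to_local[j]] for j, v in row(i).items())
                    for i in owned])

# Concatenating the owned parts reproduces the global product
y = [v for part in y_parts for v in part]
y_ref = [sum(v * x_global[j] for j, v in row(i).items()) for i in range(N)]
assert y == y_ref
print(ghosts_per_rank)
print(y)
```

The point of the sketch is that each rank needs the ghost values of `x` before it can compute its rows of `y`, which is why the vector has to be built on the same index map as the matrix columns.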



In [59]:
import dolfinx as dfx



In [28]:
# Stop cluster
cluster.stop_cluster_sync()