# Bayesian Variational Federated Learning and Unlearning

![net](https://drive.google.com/uc?export=view&id=15ItguZ2h0i5rfpGCt16JAganXhOMd74H)

## Setting

We consider a system with a set $\mathcal{K}=\{1,\ldots,K\}$ of $K$ agents connected by a communication network, which is described by the undirected graph $\mathcal{G}=\{\mathcal{K},\mathcal{E}\}$ with adjacency matrix $A$, where $\mathcal{E}$ denotes the set of edges. The local data set $\mathcal{D}_{k}=\{z_{k,n} \}_{n=1}^{N_k}$ of agent $k\in\mathcal{K}$ contains $N_k$ data points, and the associated training loss for model parameter $\theta$ is defined as

$$L_k(\theta)=\frac{1}{N_k}\sum_{n=1}^{N_k} \ell_k(z_{k,n}|\theta)$$

for some loss function $\ell_k(z|\theta)$. We also denote as $\mathcal{D}=\bigcup_{k=1}^K\mathcal{D}_k$ the global data set.
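As an illustration of the definition above, the following minimal sketch evaluates the local training loss $L_k(\theta)$ as an empirical average. The squared-error loss, the scalar parameter, the toy data, and the helper names `local_loss` and `squared_loss` are assumptions made for the example only; they are not prescribed by the setting.

```python
import numpy as np

def local_loss(data, theta, loss_fn):
    """Empirical average L_k(theta) = (1/N_k) * sum_n loss_fn(z_{k,n}, theta)."""
    return float(np.mean([loss_fn(z, theta) for z in data]))

def squared_loss(z, theta):
    # Hypothetical choice of ell_k: squared error for a scalar mean parameter.
    return 0.5 * (z - theta) ** 2

# Toy local data sets D_1 and D_2 for K = 2 agents (illustrative values).
datasets = [np.array([0.0, 2.0]), np.array([1.0, 1.0, 4.0])]

# Local training losses L_1(theta) and L_2(theta) evaluated at theta = 1.
losses = [local_loss(d, 1.0, squared_loss) for d in datasets]
```

Because each $L_k(\theta)$ is normalized by $N_k$, agents with different amounts of data contribute losses on the same scale.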

## Federated Bayesian learning

The agents collectively aim at obtaining the variational distribution $q(\theta)$ on the model parameter space that minimizes the *global free energy*

$$\min_{q(\theta)\in \mathcal{Q}} \bigg\{ F(q(\theta))= \sum_{k=1}^K \mathbb{E}_{\theta\sim q(\theta)} [L_k(\theta)] +\alpha\cdot \mathbb{D}\big( q(\theta) \big\|p_0 (\theta)\big) \bigg\}$$

where $\alpha > 0$ is a temperature parameter, $\mathbb{D}\left(\cdot\|\cdot\right)$ denotes the Kullback–Leibler (KL) divergence, and $p_0(\theta)$ is a prior distribution. The variational posterior is constrained to lie in a set $\mathcal{Q}$ of distributions. When no constraints are imposed on the set $\mathcal{Q}$, the optimal solution is given by the global generalized posterior distribution

$$q^*(\theta|\mathcal{D})=\frac{1}{Z}\cdot \tilde{q}^*(\theta|\mathcal{D})$$

where

$$\quad\tilde{q}^*(\theta|\mathcal{D})=p_0 (\theta) \exp \left(-\frac{1}{\alpha}\sum_{k=1}^K L_k(\theta)\right)$$

which coincides with the conventional posterior $p\big(\theta|\mathcal{D}\big)$ when $\alpha=1$ and the loss function is given by the log-loss $\ell_k(z|\theta)=-\log p(z|\theta)$.
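To make the unconstrained solution concrete, the sketch below evaluates the generalized posterior $q^*(\theta|\mathcal{D})$ numerically on a one-dimensional grid. Everything specific here is an assumption for illustration: a scalar $\theta$, a standard Gaussian prior, a squared-error loss, toy data, and the helper name `generalized_posterior`. It is not the variational procedure itself, only a direct evaluation of the closed-form expression above.

```python
import numpy as np

def generalized_posterior(grid, log_prior, local_losses, alpha):
    """Normalized q*(theta | D) on a uniform 1-D grid of theta values."""
    # Sum of local training losses L_k(theta), evaluated on the whole grid.
    total_loss = sum(L(grid) for L in local_losses)
    # log of the unnormalized posterior: log p_0(theta) - (1/alpha) * sum_k L_k(theta).
    log_q = log_prior(grid) - total_loss / alpha
    log_q -= log_q.max()  # subtract the maximum for numerical stability
    q = np.exp(log_q)
    dx = grid[1] - grid[0]
    return q / (q.sum() * dx)  # divide by a Riemann-sum estimate of Z

grid = np.linspace(-5.0, 5.0, 2001)

# Assumed prior: standard Gaussian, up to an additive constant in the log domain.
log_prior = lambda th: -0.5 * th ** 2

# Assumed local losses: average squared error over each toy data set.
datasets = [np.array([0.0, 2.0]), np.array([1.0, 1.0, 4.0])]
local_losses = [
    lambda th, d=d: np.mean(0.5 * (d[:, None] - th[None, :]) ** 2, axis=0)
    for d in datasets
]

q = generalized_posterior(grid, log_prior, local_losses, alpha=1.0)
```

Under these assumptions the result is Gaussian, so the grid estimate can be checked against the analytical mean and variance. Larger values of $\alpha$ flatten the data term and pull the solution toward the prior.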

## Federated Bayesian unlearning

Assume that the system has obtained a, possibly suboptimal, solution $q(\theta|\mathcal{D})\in\mathcal{Q}$ to the global free energy minimization problem. In federated machine unlearning, we wish to remove from this distribution the information about the data set $\mathcal{D}_k \subset\mathcal{D}$ of some agent $k$. Ideally, we would do so by solving the problem from scratch using the data set $\mathcal{D}_{-k}=\mathcal{D}\setminus \mathcal{D}_k$ to obtain a variational posterior $q(\theta|\mathcal{D}_{-k})$, but this may be costly in terms of computation and convergence time. In federated Bayesian unlearning, the goal is to devise decentralized protocols that are more efficient than training from scratch.
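As a point of reference, consider the idealized case in which $\mathcal{Q}$ is unconstrained and the exact generalized posterior is available. Applying the expression for $\tilde{q}^*(\theta|\mathcal{D})$ to the reduced data set $\mathcal{D}_{-k}$ gives

$$\tilde{q}^*(\theta|\mathcal{D}_{-k})=p_0 (\theta) \exp \left(-\frac{1}{\alpha}\sum_{k'\neq k} L_{k'}(\theta)\right)=\tilde{q}^*(\theta|\mathcal{D})\cdot \exp \left(\frac{1}{\alpha} L_k(\theta)\right)$$

so that, in this special case, removing the contribution of agent $k$ amounts to dividing out its local factor $\exp\left(-L_k(\theta)/\alpha\right)$ and renormalizing. This identity is only a sketch of the ideal behavior: it does not hold in general when $\mathcal{Q}$ is constrained or when $q(\theta|\mathcal{D})$ is suboptimal.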

## Communication protocol

We consider a baseline gossip-based communication protocol based on a random walk on the graph. Accordingly, at any time slot $i=1,2,\ldots,$ a node $k^{(i)}$ is scheduled to carry out local computing, and to (possibly) communicate with one of its neighbors when local computation is completed. Specifically, we adopt the Metropolis-Hastings (MH) scheduling, whereby node $k^{(i)}$ chooses node $j$ uniformly at random from the set of neighbors $\mathcal{N}_{k^{(i)}}$. Then, for the given selected node $j$, node $k^{(i)}$ sets

$$k^{(i+1)}= 
\begin{cases}
j& \text{with probability }\min\left(1, \frac{\deg\left(k^{(i)}\right)}{\deg(j)} \right)\\
k^{(i)} & \text{otherwise}
\end{cases}$$

Therefore, in the next time slot $i+1$, either the same node $k^{(i)}$ is scheduled, or a neighbor $j$ of node $k^{(i)}$ is scheduled.
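The scheduling rule above can be sketched as a short simulation. The adjacency-list representation, the star-shaped toy graph, and the function names `mh_next_node` and `simulate_schedule` are assumptions introduced for this example; only the proposal and acceptance steps follow the rule stated above.

```python
import random

def mh_next_node(current, neighbors, rng):
    """One MH scheduling step: propose a uniform neighbor, then accept or stay."""
    candidate = rng.choice(neighbors[current])
    # Acceptance probability min(1, deg(current) / deg(candidate)).
    accept_prob = min(1.0, len(neighbors[current]) / len(neighbors[candidate]))
    return candidate if rng.random() < accept_prob else current

def simulate_schedule(start, neighbors, num_slots, seed=0):
    """Return the sequence of scheduled nodes k^(1), ..., k^(num_slots)."""
    rng = random.Random(seed)
    schedule = [start]
    for _ in range(num_slots - 1):
        schedule.append(mh_next_node(schedule[-1], neighbors, rng))
    return schedule

# Hypothetical star graph: node 1 is connected to nodes 2, 3 and 4.
neighbors = {1: [2, 3, 4], 2: [1], 3: [1], 4: [1]}
schedule = simulate_schedule(start=1, neighbors=neighbors, num_slots=20000)

# Fraction of time slots in which each node is scheduled.
frequencies = {k: schedule.count(k) / len(schedule) for k in neighbors}
```

With this acceptance rule the random walk has a uniform stationary distribution on a connected graph, so over a long run each node should be scheduled roughly equally often, regardless of its degree.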

# References

- J. Gong, O. Simeone, and J. Kang, “Bayesian Variational Federated Learning and Unlearning in Decentralized Networks,” in 2021 IEEE 22nd International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2021, pp. 216–220. [[Paper](https://ieeexplore.ieee.org/abstract/document/9593225)]