# Asynchronous Federated Unlearning

In the conventional context of synchronous federated learning, as most of the clients participate in the time-consuming retraining process, the overall performance — as measured by the wall-clock time of converging to a target accuracy — will be severely affected. If we wish to improve the overall performance even with the possibility of retraining, we may consider minimizing the number of clients that participate in the retraining process. Towards this objective, we should consider operating the FL training session in an asynchronous fashion: the server does not need to wait for all its selected clients to report their model updates, and proceeds with its aggregation process as soon as the model update from a minimum number of clients arrives. It has been shown in the literature (e.g., PORT and FedBuff) that the performance of such an asynchronous paradigm is far superior to synchronous FL, especially in cases where clients are heterogeneous in their training capabilities.

In asynchronous FL, different clients progress at different speeds naturally in their local training. Intuitively, if we allow some clients to move forward towards global convergence while retraining only a small subset of the clients as data samples are erased, the inherent overhead of retraining can be substantially mitigated. A simple yet effective mechanism is to divide all clients into a small number of clusters, and aggregate client updates within the confines of each cluster only. The immediate effect of such clustered aggregation is that, if any client requests its data samples to be erased, only clients within the same cluster need to be retrained, while other unaffected clusters may continue with normal FL training.
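
To make the containment effect concrete, the following is a minimal sketch rather than the paper's implementation. The helper names `assign_random_clusters` and `clients_to_retrain` are hypothetical, and clients are assumed to be dealt into equally sized clusters after a random shuffle.

```python
import random


def assign_random_clusters(client_ids, num_clusters, seed=0):
    """Shuffle the clients, then deal them round-robin into clusters (assumed scheme)."""
    rng = random.Random(seed)
    shuffled = list(client_ids)
    rng.shuffle(shuffled)
    return {cid: i % num_clusters for i, cid in enumerate(shuffled)}


def clients_to_retrain(cluster_of, erasing_clients):
    """Return every client that shares a cluster with a client erasing its data."""
    affected = {cluster_of[cid] for cid in erasing_clients}
    return sorted(cid for cid, c in cluster_of.items() if c in affected)


cluster_of = assign_random_clusters(range(100), num_clusters=5)
retrain = clients_to_retrain(cluster_of, erasing_clients=[7])
print(len(retrain))  # 20: only client 7's cluster is retrained, not all 100 clients
```

With 100 clients in 5 equal clusters, a single erasure request touches 20 clients, while the remaining 80 continue training.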

![1](https://drive.google.com/uc?export=view&id=1ypBh27I9JWMcTwqvTk_TV67L_rBX4gi1)

## Random Clustered Aggregation

Random clustered aggregation is an approach that is orthogonal to retraining algorithms. In the specific scenario of asynchronous FL, faster clients tend to go through more rounds of communication, as they are selected and their updates are aggregated more frequently than those of slower clients. It has been shown that when clients are heterogeneous in their training speeds, asynchronous FL outperforms synchronous FL by a substantial margin. Yet the effects of local data samples from faster clients also tend to propagate more quickly than those from slower clients, which makes federated unlearning more complex and challenging. To the best of our knowledge, mechanisms for mitigating the exorbitant cost of retraining from scratch in the context of asynchronous FL have not yet been explored in the literature.

![1](https://drive.google.com/uc?export=view&id=1qsU6s81bd6ZAFPKr380VKuUUQOUcnkVm)

In contrast to existing work that sought to replace the naïve mechanism of retraining from scratch with approximation algorithms, we propose to take a decidedly different approach that is orthogonal to approximation algorithms. Intuitively, if we can reduce the ripple effect of contributions from any particular client in the FL training process, the number of clients that must participate in the retraining process will be substantially reduced. With all existing works in the literature, as soon as a client’s contributions have propagated throughout the entire pool of participating clients, all clients will need to participate in the retraining process when federated unlearning commences.

**The need for asynchrony.** It turns out, however, that potential benefits from such a clustered aggregation mechanism come with a caveat: it only works effectively in the specific context of asynchronous FL, where client contributions are aggregated asynchronously without waiting for the slower clients. In the event that clients within one of the clusters need to roll back to a previous round and start their federated unlearning process, all the clients within the other clusters can proceed with their federated learning process normally. This phenomenon, where some clusters move forward in time while some other clusters roll back at the same time, is only feasible with asynchronous FL.

**Criteria for terminating the training session.** When is the right time for clustered aggregation to terminate, and for the server to begin aggregating across the clusters? In our random clustered aggregation mechanism, we consider two thresholds: one for the required validation accuracy, and one for the standard deviation across a recent history of such validation accuracies. Once the highest validation accuracy across clusters is higher than the first threshold, and the standard deviation of recent accuracies is lower than the second, the training session terminates, and the server then performs its final round of aggregation to produce the converged global model. Due to non-i.i.d. data distributions across the clients’ local data, such an aggregation process is inevitably less efficient at converging to the best possible accuracy as quickly as possible. It is therefore critically important to design a suitable clustering algorithm to minimize such a drag on training efficiency in a federated learning session, which we elaborate on in the next section.
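
The two-threshold test can be sketched as follows. This is an illustration under assumptions: the function name `should_terminate`, the window length of 5, and the choice to measure the standard deviation over the best cluster's recent history are not specified in the text above.

```python
from statistics import pstdev


def should_terminate(cluster_histories, acc_threshold, std_threshold, window=5):
    """cluster_histories maps a cluster id to its validation accuracies, oldest first."""
    # Pick the cluster whose latest validation accuracy is highest.
    best = max(cluster_histories.values(), key=lambda h: h[-1])
    recent = best[-window:]
    if len(recent) < window:
        return False  # not enough history to judge stability
    return recent[-1] > acc_threshold and pstdev(recent) < std_threshold


stable = {0: [0.80, 0.88, 0.90, 0.905, 0.91, 0.91, 0.912],
          1: [0.30, 0.40, 0.50, 0.55, 0.60, 0.62, 0.65]}
rising = {0: [0.50, 0.70, 0.85, 0.90, 0.92]}
print(should_terminate(stable, acc_threshold=0.9, std_threshold=0.01))  # True
print(should_terminate(rising, acc_threshold=0.9, std_threshold=0.01))  # False
```

The second call fails the stability test even though the latest accuracy exceeds the first threshold, because the recent accuracies are still changing quickly.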

**Proof-of-concept experimental evaluations.** Though suitable strategies for distributing clients across clusters are open to further investigation, we wish to first conduct some preliminary experiments to verify that clustered aggregation can indeed improve training performance in asynchronous FL, with some of the clients erasing their data. For the sake of simplicity, though random clustered aggregation can work with any retraining algorithm, we retrain from scratch when clients request erasing their data.

In our proof-of-concept experiments, we use the MNIST dataset to train the LeNet-5 model, with 100 clients randomly assigned to 5 clusters. We select 30 clients in each communication round, and 15 clients as the minimum number of clients to be aggregated asynchronously. The data distribution of local datasets is non-i.i.d., sampled with the symmetric Dirichlet distribution with a concentration of 5. To simulate the heterogeneity across clients, we use a heavy-tailed Pareto distribution to simulate the asynchronous clients’ training time, with its positive parameter $\alpha = 1$.
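
The statistical part of this setup could be simulated roughly as below, using only the Python standard library. This is a sketch under assumptions: the Dirichlet draw is interpreted as each client's mix over the 10 MNIST classes, and the Pareto scale (minimum training time of 1) is arbitrary. Neither detail is stated above.

```python
import random


def symmetric_dirichlet(concentration, dim, rng):
    """Sample a symmetric Dirichlet vector by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(dim)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
NUM_CLIENTS, NUM_CLASSES = 100, 10

# Assumed interpretation: one class-proportion vector per client.
label_mix = [symmetric_dirichlet(5.0, NUM_CLASSES, rng) for _ in range(NUM_CLIENTS)]

# Heavy-tailed training times with shape alpha = 1; values are >= 1 by construction.
train_time = [rng.paretovariate(1.0) for _ in range(NUM_CLIENTS)]

print(max(train_time) / min(train_time))  # typically large: a few clients are very slow
```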

![4](https://drive.google.com/uc?export=view&id=17_AD0lYrKKFUSivMzNaXOiefdqQQLvQR)

With 5 clusters, we proceed to compare the training performance of random clustered aggregation with two baseline algorithms: FedBuff as our asynchronous FL baseline, and Federated Averaging (FedAvg) as our synchronous baseline. Our results are shown in Fig. 4, where we shade the range of validation accuracies within each of the clusters. These results confirm that synchronous FL with FedAvg was much slower than asynchronous FL with FedBuff before clients requested, at the same wall-clock time (around 140 seconds), to erase their data. Once data erasure occurred, we observed a substantial reduction in global validation accuracy with the asynchronous FL baseline. This clearly showed the ripple effect of the erased data, in that all affected clients needed to be retrained from scratch.

In stark contrast to both synchronous and asynchronous baselines, only the validation accuracies of some clusters experienced a substantial reduction with random clustered aggregation, while other clusters were allowed to continue with their training process normally. As a result, global accuracies were not affected until convergence. After the criteria for terminating the training session were satisfied, the server produced the global model in the final round, which occurred at 360 seconds with a final accuracy of 90%, and outperformed both baselines. Further evaluations have shown that the number of clusters had no material effect on the training performance.

## KNOT

### Optimizing Client-Cluster Assignment

First, we wish to discuss what types of clients should be clustered together. In asynchronous FL, clients are likely heterogeneous with respect to their local resources, leading to different completion times in a communication round. Naturally, if we assign faster clients to some clusters and slow clients to others, clusters with faster clients may converge faster over more communication rounds. This leads us to consider local training times of the clients as the first factor that affects client-cluster assignment.

We consider assigning clients $\{C_k\}_{k \in \mathcal{K}}$ to clusters $\{L_n\}_{n \in \mathcal{N}}$, where $\mathcal{K} = \{1,2,...,K\}$ and $\mathcal{N} = \{1,2,...,N\}$ are the corresponding indexing sets, and $K$ and $N$ correspond to the total number of clients and clusters, respectively. We define the training time $T_k$ as the wall-clock time that elapses after the server sends the model and until it receives a parameter update from client $C_k$. A client with more computational resources is likely to have a smaller $T_k$. In our optimization problem, we prefer to cluster the clients based on their training times.

The next factor that is likely to influence the convergence speed in a training session is model disparity. Because data distributions are non-i.i.d., we use model disparity in our formulation to help assign clients with similar data to the same cluster. The model disparity $S_k$ measures how much client $C_k$’s data diverges from the global model. We introduce the cosine similarity $\Theta (u, v) = \frac{u \cdot v}{\Vert u \Vert \Vert v \Vert}$, which is the cosine of the angle between two vectors $u$ and $v$. Starting with a random initial model with model parameters $\omega_0$, we perform one training round on each client $C_k$ to obtain its local update parameter $\Delta_k^1$. These local updates are aggregated to form a global model with parameter $\omega_1$. Intuitively, the vector $\omega_0 - \omega_1$ represents the consensus of all clients, while $\Delta_k^1$ reflects the influence of the individual client. Thus, a small angle between these two vectors implies that client $C_k$ is a "good" client whose data is congruent with the overall global model. Now, we can define the model disparity as $S_k = \frac{1-\Theta(\omega_0 - \omega_1, \Delta_k^1)}{2}$, where we transform $\Theta$ such that $S_k \in [0,1], \forall k \in \mathcal{K}$. Note that the smaller $S_k$ is, the larger the cosine similarity, and the more representative client $C_k$ is of all clients.
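
The definition of $S_k$ translates directly into a few lines of code. In this minimal sketch, models and updates are flattened into plain lists of floats, the function names are illustrative, and both vectors are assumed to be non-zero and to use the same sign convention.

```python
import math


def cosine_similarity(u, v):
    """Theta(u, v) = (u . v) / (||u|| ||v||); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def model_disparity(w0, w1, delta_k):
    """S_k = (1 - Theta(w0 - w1, delta_k)) / 2, which lies in [0, 1]."""
    consensus = [a - b for a, b in zip(w0, w1)]
    return (1.0 - cosine_similarity(consensus, delta_k)) / 2.0


w0, w1 = [1.0, 1.0], [0.0, 0.0]
print(round(model_disparity(w0, w1, [2.0, 2.0]), 6))    # 0.0: update aligned with the consensus
print(round(model_disparity(w0, w1, [1.0, -1.0]), 6))   # 0.5: update orthogonal to the consensus
print(round(model_disparity(w0, w1, [-1.0, -1.0]), 6))  # 1.0: update opposed to the consensus
```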

In short, we aim to distribute the clients based on how “good” each one is; namely, a good client should have a relatively short training time and a low model disparity. To achieve this goal, we set a target training time $\widetilde{T_n}$ and a target model disparity $\widetilde{S_n}$ for each cluster as anchor points, and assign clients based on their $T_k$ and $S_k$ values. Suppose the training times of all clients span the range from $s_*$ to $s^*$. We divide the ranges evenly by letting $\widetilde{T_n} = s_* + \frac{s^* - s_*}{N-1} \cdot (n-1)$ and $\widetilde{S_n} = \frac{n}{N}$. Next, to numerically represent the difference between client $C_k$ and cluster $L_n$, we define the match rating as $d_{kn} = \Vert [a(\widetilde{T_n}-T_k), b(\widetilde{S_n}-S_k)] \Vert_2$, which is a weighted $\ell_2$ norm of the difference vector, scaled by hyper-parameters $a$ and $b$.
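
As an illustration, the anchors and match ratings can be computed as below. The sketch reads the target training times as evenly spaced between the smallest and largest observed training time and assumes $N \geq 2$. The function names and the default weights `a = b = 1` are hypothetical.

```python
import math


def cluster_targets(times, num_clusters):
    """Evenly spaced target training times, and target disparities n / N."""
    lo, hi = min(times), max(times)
    step = (hi - lo) / (num_clusters - 1)  # requires num_clusters >= 2
    t_targets = [lo + step * (n - 1) for n in range(1, num_clusters + 1)]
    s_targets = [n / num_clusters for n in range(1, num_clusters + 1)]
    return t_targets, s_targets


def match_ratings(times, disparities, num_clusters, a=1.0, b=1.0):
    """d[k][n] = || [a (T~_n - T_k), b (S~_n - S_k)] ||_2 for every client-cluster pair."""
    t_targets, s_targets = cluster_targets(times, num_clusters)
    return [
        [math.hypot(a * (t_n - t_k), b * (s_n - s_k))
         for t_n, s_n in zip(t_targets, s_targets)]
        for t_k, s_k in zip(times, disparities)
    ]


d = match_ratings(times=[1.0, 2.0, 5.0], disparities=[0.1, 0.4, 0.9], num_clusters=3)
closest = [min(range(3), key=lambda n: row[n]) for row in d]
print(closest[0], closest[2])  # 0 2: the fastest client matches the first cluster, the slowest the last
```

Because training times and disparities live on different scales, the weights `a` and `b` decide which factor dominates the rating.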

The match rating allows us to tell whether a client is suitable for a cluster or not. For an arbitrary client $C_k$, $d_{k1} \leq d_{k2}$ means that cluster $L_1$ is a better match for the client than cluster $L_2$ is, and we should assign it to $L_1$. At first glance, one may think that it suffices to choose the cluster that results in the lowest match rating for each client. However, our problem is more subtle because we need to ensure that each cluster has sufficient client data to train a decent cluster model that can contribute appropriately to the overall model after aggregation. Ideally, the resulting clusters should have sizes that are most beneficial for their effective convergence in the training session. For instance, suppose the difference between $d_{k1}$ and $d_{k2}$ is insignificant, and that $L_1$ is large while $L_2$ only has a few clients. In this case, it might be better in practice to assign the client to cluster $L_2$ so that when data erasure is requested, only a smaller fraction of the clients needs to participate in retraining. For this reason, the match rating is not the only criterion for our assignment. All decisions are interdependent, which makes our problem more complicated and subject to more constraints.

### Formulating the Optimization Problem

To begin formulating our problem, we represent a client-cluster assignment using a vector $x \in \{ 0,1 \}^{KN}, x = (x_{11},...,x_{KN})$, where $x_{kn} = 1$ if client $C_k$ is assigned to cluster $L_n$ and 0 otherwise. Then the vector $f = (d_{11}x_{11},...,d_{kn}x_{kn},...,d_{KN}x_{KN})$ provides a full description of our assignment, where the match rating of each pair is multiplied by the value that indicates whether the pair is assigned.

For our optimal clustering mechanism in KNOT, we aim to minimize the deviation of clients in the same cluster by keeping all $d_{kn}$ values of the assigned client-cluster pairs to be as small as possible. Note that we do not seek to minimize the average match ratings of all assigned pairs. Instead, we emphasize reducing extreme values while keeping all coordinates as small as possible. More specifically, we want to minimize the largest value of vector $f$, then minimize the second largest value, and so on. This objective leads us to formulate our problem as a lexicographical minimization problem. We present the basic definitions in the following: 
- *Definition 1*: For $\alpha \in \mathbb{R}^n$, let $\langle \alpha \rangle = (\widetilde{\alpha_1}, \widetilde{\alpha_2}, ..., \widetilde{\alpha_n})$ be $\alpha$ sorted in non-increasing order.
- *Definition 2*: For $\alpha \in \mathbb{R}^n, \beta \in \mathbb{R}^n$, we say that $\alpha$ is lexicographically smaller than $\beta$, denoted $\alpha \prec \beta$, if $\exists n_0 \in \{ 1,...,n \}$ such that $\widetilde{\alpha_{n_0}} < \widetilde{\beta_{n_0}}$ and $m < n_0 \Rightarrow \widetilde{\alpha_{m}} = \widetilde{\beta_{m}}$. In addition, $\alpha$ is lexicographically no greater than $\beta$ if $\alpha \prec \beta$ or the two vectors have the same sorted entries. That is, $\alpha \preceq \beta$ if $\alpha \prec \beta$ or $\widetilde{\alpha_{i}} = \widetilde{\beta_{i}}, \forall i \in \{ 1,2,...,n \}$.
- *Definition 3*: Given a collection $F$ of real vectors of the same length, we say that $\alpha_0 \in F$ is the lexicographical minimum of $F$ if $\alpha_0 \preceq \alpha, \forall \alpha \in F$. For a function $f: \mathbb{R}^n \rightarrow \mathbb{R}^n$ with $n \in \mathbb{N}$, $\text{lexmin}_x f = x_*$ if $f(x_*) \preceq f(x), \forall x$.
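
These definitions map onto Python's built-in list comparison, which is itself lexicographic. The helper names below are illustrative only.

```python
def sorted_desc(alpha):
    """Definition 1: the entries of alpha in non-increasing order."""
    return sorted(alpha, reverse=True)


def lex_smaller(alpha, beta):
    """Definition 2: alpha is lexicographically smaller than beta."""
    return sorted_desc(alpha) < sorted_desc(beta)


def lex_no_greater(alpha, beta):
    """Definition 2: alpha is smaller than beta, or both have the same sorted entries."""
    return sorted_desc(alpha) <= sorted_desc(beta)


def lex_min(collection):
    """Definition 3: the lexicographical minimum of a collection of equal-length vectors."""
    return min(collection, key=sorted_desc)


alpha, beta = [2, 2, 2], [3, 0, 0]
print(lex_smaller(alpha, beta))  # True: the largest entry 2 < 3, although sum(alpha) > sum(beta)
print(lex_min([beta, alpha]))    # [2, 2, 2]
```

The example shows why this ordering differs from minimizing the average: `alpha` has the larger sum but the smaller worst-case entry.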

Intuitively, the process of finding $\text{lexmin}_x f$ amounts to looking for a vector $x$ that minimizes the largest coordinate in the resulting vector $f(x)$, then minimizing the second largest value while keeping the largest coordinate unchanged. This procedure continues until the smallest value is minimal. Thus, the match rating of each client is kept to the minimum, as we desire. We can now state the problem as:
$$\text{lexmin}_x f = (d_{11}x_{11},...,d_{kn}x_{kn},...,d_{KN}x_{KN})$$
such that
$$\sum_{n=1}^{N} x_{kn} \leq c_1, \forall k \in \mathcal{K}$$
$$\sum_{n=1}^{N} x_{kn} \geq 1, \forall k \in \mathcal{K}$$
$$\sum_{k=1}^{K} x_{kn} \geq c_2, \forall n \in \mathcal{N}$$
$$\sum_{k=1}^{K} x_{kn} \leq c_3, \forall n \in \mathcal{N}$$
$$x_{kn} \in \{ 0,1 \}, \forall n \in \mathcal{N}, \forall k \in \mathcal{K}$$
where the lexicographical minimum of function $f$ corresponds to the desired client-cluster assignment. The first two constraints require each client to be assigned to at least one and at most $c_1$ clusters, and the next two keep the size of each cluster between $c_2$ and $c_3$.
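
For intuition, a tiny instance can be solved by exhaustive search. The sketch below is not the method used in KNOT and only scales to a handful of clients. It assumes each client joins exactly one cluster and that every cluster must hold between `c2` and `c3` clients. The function name is hypothetical.

```python
from itertools import product


def lexmin_assignment(d, c2, c3):
    """Brute-force the lexicographically minimal assignment for ratings d[k][n]."""
    K, N = len(d), len(d[0])
    best, best_key = None, None
    for choice in product(range(N), repeat=K):  # choice[k] = cluster of client k
        sizes = [choice.count(n) for n in range(N)]
        if any(s < c2 or s > c3 for s in sizes):
            continue  # violates the cluster-size constraints
        # f has K*N coordinates; unassigned pairs contribute 0.
        f = [d[k][n] if choice[k] == n else 0.0 for k in range(K) for n in range(N)]
        key = sorted(f, reverse=True)
        if best_key is None or key < best_key:
            best, best_key = choice, key
    return best


d = [[0.1, 0.9], [0.2, 0.8], [0.3, 0.35], [0.9, 0.1]]
print(lexmin_assignment(d, c2=1, c3=3))  # (0, 0, 0, 1): every client gets its best cluster
print(lexmin_assignment(d, c2=2, c3=2))  # (0, 0, 1, 1): size limits move client 2
```

Tightening the size limits changes the answer for the third client, whose two ratings are nearly equal, which is the interdependence between decisions that motivates the constrained formulation.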

### Transforming into an LP Problem

- **Separable convex objective.** A separable convex function is one that can be represented as a sum of multiple convex functions. To replace the original lexicographical objective function with a separable convex one, for any arbitrary vector $\alpha \in \mathbb{R}^n$, consider:
$$\phi(\alpha) = \sum_{m=1}^{n} n^{\alpha_m}$$

In particular, the exponential function is convex everywhere, and the function $\phi (\alpha)$ preserves $\preceq$ when $\alpha$ is integer-valued, as stated in the lemma:
  - *Lemma*: $\forall \alpha, \beta \in \mathbb{Z}^n, \alpha \preceq \beta \Leftrightarrow \phi (\alpha) \leq \phi(\beta).$

We will transform each match rating $d_{kn}$ in the objective function to an integer-valued $D_{kn}$ by scaling and rounding. Let $d^* = \max_{k \in \mathcal{K}, n \in \mathcal{N}} d_{kn}$ and $d_* = \min_{k \in \mathcal{K}, n \in \mathcal{N}} d_{kn}$, and define $D_{kn} = \lceil \frac{d_{kn}-d_*}{d^* - d_*} \cdot 100 \rceil$. From our experiments, we observed that $D_{kn}$ is sufficient to classify clients despite its reduced precision. In our problem, the vector has length $n = K \cdot N$, and its entries are $D_{kn} x_{kn}$. Since $\alpha \preceq \beta$ is equivalent to $\phi (\alpha) \leq \phi(\beta)$, we simplify the objective function to:
$$\min_x \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{N}} (KN)^{D_{kn}x_{kn}}.$$
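
The scaling step and the function $\phi$ can be checked numerically on small inputs. The following sketch uses illustrative function names and assumes the match ratings are not all equal, so that $d^* > d_*$. The exhaustive check only covers short non-negative integer vectors and is not a proof of the lemma.

```python
import math
from itertools import product


def scale_ratings(d):
    """D_kn = ceil((d_kn - d_min) / (d_max - d_min) * 100); assumes d_max > d_min."""
    flat = [v for row in d for v in row]
    lo, hi = min(flat), max(flat)
    return [[math.ceil((v - lo) / (hi - lo) * 100) for v in row] for row in d]


def phi(alpha):
    """phi(alpha) = sum over m of n ** alpha_m, where n is the length of alpha."""
    n = len(alpha)
    return sum(n ** a for a in alpha)


def lemma_holds(max_value, length):
    """Check that lexicographic order and phi agree on all small integer vectors."""
    vectors = list(product(range(max_value + 1), repeat=length))
    return all(
        (sorted(u, reverse=True) <= sorted(v, reverse=True)) == (phi(u) <= phi(v))
        for u in vectors for v in vectors
    )


print(phi([2, 2, 2]), phi([3, 0, 0]))  # 27 29: the order matches [2, 2, 2] being lexicographically smaller
print(lemma_holds(max_value=3, length=3))  # True
```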

### Totally unimodular matrix

Let $M_{ij}$ refer to the entry on the $i$-th row and the $j$-th column of a matrix $M$. Let $A_1 \in \mathbb{Z}^{K \times KN}$ be the client indicator matrix and consider each column as representing a client-cluster pair. For each column $j$, $A_{1, ij} = 1$ indicates that client $C_i$ is in the client-cluster pair $j$; otherwise, $A_{1, ij} = 0$. Similarly, $A_2 \in \mathbb{Z}^{N \times KN}$ is the cluster indicator matrix, where $A_{2, ij}$ indicates whether or not cluster $L_i$ is in the client-cluster pair $j$. Now, we can rewrite our problem as follows:
$$\min_x \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{N}} (KN)^{D_{kn}x_{kn}}$$
such that
$$Ax \leq b$$
$$0 \leq x \leq 1, x \text{ integer}$$
where
$$A=\begin{bmatrix} A_1 \\ -A_1 \\ A_2 \\ -A_2 \end{bmatrix}; b=\begin{bmatrix} c_1 \\ -1 \\ c_3 \\ -c_2 \end{bmatrix}$$

A matrix $M \in \mathbb{R}^{m \times n}$ is totally unimodular (TU) if every square sub-matrix $M^{\prime}$ of $M$ has $\det(M^{\prime}) \in \{ -1, 0, 1 \}$. This is an important property because it ensures that the inverse of any non-singular square sub-matrix is integral, and thus all extreme points of the feasible region are integral. It is also known that a matrix $M$ with all entries in $\{-1, 0, 1\}$ is TU if every subset of rows $R \subseteq \{ 1,2,...,m \}$ can be partitioned into $\mathcal{I}_1$ and $\mathcal{I}_2$ such that for each column $j$,
$$\left| \sum_{i \in \mathcal{I}_1} M_{ij} - \sum_{i \in \mathcal{I}_2} M_{ij} \right| \leq 1$$

Since matrix $A$ contains only entries in $\{-1, 0, 1\}$, it suffices to provide a partition strategy for any subset of rows of $A$ such that the above inequality is satisfied.

Let $R \subseteq \{ 1,...,2K+2N \}$ be given; we partition $R$ as follows. For each $k \in \{ 1,...,K \}$: if $k \in R$, put $k$ in $\mathcal{I}_1$, and also put $K + k$ in $\mathcal{I}_1$ if $K + k \in R$; otherwise, put $K+k$ in $\mathcal{I}_2$ if $K+k \in R$. This strategy ensures that, for each column $j$, $0 \leq \sum_{i \in \mathcal{I}_1, i \leq 2K} A_{ij} - \sum_{i \in \mathcal{I}_2, i \leq 2K} A_{ij} \leq 1$.

Next, for each $n \in \{ 1,...,N \}$: if $2K + n \in R$, put $2K + n$ in $\mathcal{I}_2$, and also put $2K+N+n$ in $\mathcal{I}_2$ if $2K + N + n \in R$; otherwise, put $2K+N+n$ in $\mathcal{I}_1$ if $2K + N + n \in R$. Similarly, we have $-1 \leq \sum_{i \in \mathcal{I}_1, i > 2K} A_{ij} - \sum_{i \in \mathcal{I}_2, i > 2K} A_{ij} \leq 0$. Adding up the two inequalities bounds the full difference between $-1$ and $1$ for each column $j$, which proves that matrix $A$ is TU.
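
For small $K$ and $N$, the claim can be checked numerically by enumerating every square sub-matrix. The brute-force sketch below complements the argument above but does not replace it, since its cost grows exponentially. The column ordering and all function names are assumptions made for illustration.

```python
from itertools import combinations


def det_int(m):
    """Integer determinant by Laplace expansion along the first row (tiny matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, v in enumerate(m[0]):
        if v:
            minor = [row[:j] + row[j + 1:] for row in m[1:]]
            total += (-1) ** j * v * det_int(minor)
    return total


def negate(rows):
    return [[-v for v in row] for row in rows]


def constraint_matrix(K, N):
    """Stack [A1; -A1; A2; -A2]; columns are the pairs (k, n), ordered by k then n."""
    pairs = [(k, n) for k in range(K) for n in range(N)]
    a1 = [[1 if k == i else 0 for k, _ in pairs] for i in range(K)]
    a2 = [[1 if n == i else 0 for _, n in pairs] for i in range(N)]
    return a1 + negate(a1) + a2 + negate(a2)


def is_totally_unimodular(m):
    """Check that every square sub-matrix has determinant in {-1, 0, 1}."""
    rows, cols = len(m), len(m[0])
    for size in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), size):
            for c in combinations(range(cols), size):
                sub = [[m[i][j] for j in c] for i in r]
                if det_int(sub) not in (-1, 0, 1):
                    return False
    return True


print(is_totally_unimodular(constraint_matrix(K=2, N=2)))  # True
print(is_totally_unimodular([[1, 1], [-1, 1]]))            # False: the determinant is 2
```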

### From an integer LP to an LP problem

The optimal solution for the above problem is the same as the $x$-coordinates of the optimal solution of the following LP problem:
$$\min_{x, \lambda} \sum_{k \in \mathcal{K}} \sum_{n \in \mathcal{N}} \left( \lambda_{kn}^0 + (KN)^{D_{kn}} \lambda_{kn}^1 \right)$$
such that
$$Ax \leq b$$
$$x_{kn} = \lambda_{kn}^1, \forall k \in \mathcal{K}, \forall n \in \mathcal{N}$$
$$\lambda_{kn}^0 + \lambda_{kn}^1 = 1, \forall k \in \mathcal{K}, \forall n \in \mathcal{N}$$
$$\lambda_{kn}^0, \lambda_{kn}^1, x_{kn} \in \mathbb{R}^{+}, \forall k \in \mathcal{K}, \forall n \in \mathcal{N}$$
where $\lambda_{kn}^0, \lambda_{kn}^1$ are newly introduced variables used in the $\lambda$-representation.
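
As a brief worked step, sketched here from the standard $\lambda$-representation of a separable convex function rather than quoted from the paper, each term $g_{kn}(x_{kn}) = (KN)^{D_{kn} x_{kn}}$ only needs to be correct at the two breakpoints $x_{kn} = 0$ and $x_{kn} = 1$, where $g_{kn}(0) = 1$ and $g_{kn}(1) = (KN)^{D_{kn}}$. Writing $x_{kn}$ as a convex combination of the breakpoints and replacing $g_{kn}$ by its chord gives:
$$x_{kn} = 0 \cdot \lambda_{kn}^0 + 1 \cdot \lambda_{kn}^1, \quad \lambda_{kn}^0 + \lambda_{kn}^1 = 1, \quad \lambda_{kn}^0, \lambda_{kn}^1 \geq 0$$
$$g_{kn}(0) \lambda_{kn}^0 + g_{kn}(1) \lambda_{kn}^1 = \lambda_{kn}^0 + (KN)^{D_{kn}} \lambda_{kn}^1$$

The right-hand side is linear in $\lambda$ and coincides with $g_{kn}$ whenever $x_{kn} \in \{0, 1\}$. Because the constraint matrix is TU, the LP has an integral optimal extreme point, so relaxing the integrality of $x$ loses nothing.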

## Experiments

![t1](https://drive.google.com/uc?export=view&id=1sefwki7Kyhi4QEr-sfMZKxl_DzPz2YOP)

![t2](https://drive.google.com/uc?export=view&id=1QCuEC_6h0LC2BEL3pjoBhamo8wNwX9tc)

![5](https://drive.google.com/uc?export=view&id=1_ZGgcjG9d5wQ1YqHgrYbtImkIX-oHfbl)

# References
- N. Su and B. Li, “Asynchronous Federated Unlearning,” Proceedings of International Conference on Computer Communications (INFOCOM), 2023. [[Paper](https://ningxinsu.github.io/projects/infocom23/)]