# Federated Unlearning with Rapid Retraining

The best mechanism so far for data erasure in FL is to perform retraining among all the data holders, so as to completely eliminate the contributions of data samples to be removed. The key problem is how to design a rapid retraining approach in FL while preserving model utility. 

Specifically, a distributed Newton-type model update algorithm is proposed. It follows the Quasi-Newton method and uses a first-order Taylor expansion to approximate the loss function that the local optimizer minimizes on the remaining dataset, i.e., the dataset that excludes the removed data samples.

To further reduce the cost of retraining, a diagonal empirical Fisher Information Matrix (FIM) is used to efficiently and accurately approximate the inverse Hessian, which avoids the expensive cost of computing it directly. To preserve model utility, a momentum technique is applied to the diagonal empirical FIM to alleviate the error introduced by the approximation and to make model convergence faster and more stable.

## Design

### Federated Unlearning Pipeline

- **Data Deletion**: Suppose that the $k_u$-th unlearned client sends several data deletion requests $\mathcal{U}_k=\{u_1,u_2,\ldots,u_{R_k}\}$ to the server before the start of the $t$-th round of training. Each such client then deletes the requested samples locally, and its remaining local dataset is denoted as $\mathcal{D}_k^u$.

- **Rapid Retraining**: All clients then execute the proposed unlearning algorithm (see below for more details) on the remaining local dataset $\mathcal{D}_k^u$ to achieve unlearning. Specifically, to eliminate the influence of the unlearned data samples on the FL model, we perform a *retrain* operation that makes the model forget the knowledge represented by these samples. In this context, the naive solution (i.e., retraining from scratch) is to apply mini-batch stochastic gradient descent (SGD) directly over the remaining local training dataset $\mathcal{D}_k^u$ (we use $\omega_t^{k_u}$ to denote the corresponding model parameter), i.e., for the $k$-th client, the loss function can be defined as follows:
$${F_{k_u}}(\omega ) = \frac{1}{{B - \Delta B_t}}\sum\limits_{({x_i},{y_i}) \in \mathcal{D}_k^u} {{\ell _i}(\omega )},$$
where $B$ is the mini-batch size and ${\Delta {B_t}}$ is the size of the subset removed from the $t$-th mini-batch. Thus, each client updates its local model by using: $\omega _{t + 1}^{{k_u}} = \omega _t^{{k_u}} - \frac{\eta }{{B - \Delta {B_t}}}\sum\limits_{({x_i},{y_i}) \in \mathcal{D}_k^u} {\nabla {\ell _i}(\omega _t^{{k_u}})} ,$ where $\eta $ is the learning rate. After executing the local unlearning step, all clients upload their model updates to the server, which aggregates them to obtain a new unlearned global model. Thus, we have the following execution step.

- **Aggregation**: Recall that, to achieve unlearning, both the unlearned clients and the remaining clients perform local retraining and upload their model updates to the server. The server then applies an aggregation rule to the updates from the unlearned clients and the other clients. With the classic aggregation rule (i.e., FedAvg), the aggregation is formally defined as follows:
$$\omega _{t+1}^u = \frac{1}{{{K_c}\sum {{p_{{k_c}}}} }}\sum\limits_{{k_c}} {{p_{{k_c}}}\omega _t^{{k_c}}}  + \frac{1}{{{K_u}\sum {{p_{{k_u}}}} }}\sum\limits_{{k_u}} {{p_{{k_u}}}\omega _t^{{k_u}}}.$$
Note that user-defined aggregation rules can also be used in this pipeline without affecting its other operations.
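The three pipeline steps can be sketched on a toy problem. The following is a minimal illustration, not the paper's implementation: it assumes a scalar model with squared loss $\ell_i(\omega) = \frac{1}{2}(\omega x_i - y_i)^2$, full-batch local steps, and the standard weighted FedAvg average (a simplification of the two-group formula above). All function names and data values are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Sample = Tuple[float, float]  # (x_i, y_i)


def delete_samples(dataset: Sequence[Sample], removed: Set[int]) -> List[Sample]:
    """Data Deletion: drop the requested indices, leaving the remaining set D_k^u."""
    return [s for i, s in enumerate(dataset) if i not in removed]


def local_sgd_step(w: float, batch: Sequence[Sample], lr: float) -> float:
    """Baseline retraining step: w - lr / |batch| * sum_i grad l_i(w).

    Assumes l_i(w) = 0.5 * (w * x_i - y_i)^2, so grad l_i(w) = (w * x_i - y_i) * x_i.
    """
    grad_sum = sum((w * x - y) * x for x, y in batch)
    return w - lr * grad_sum / len(batch)


def fedavg(models: Sequence[float], weights: Sequence[float]) -> float:
    """Aggregation: weighted average of the local models (standard FedAvg form)."""
    total = sum(weights)
    return sum(p * w for p, w in zip(weights, models)) / total


def unlearning_round(w_global: float, client_data: Sequence[Sequence[Sample]], lr: float) -> float:
    """One round: every client retrains on its remaining data, then the server aggregates."""
    local_models = [local_sgd_step(w_global, data, lr) for data in client_data]
    sizes = [len(data) for data in client_data]  # assumed weights p_k: remaining data sizes
    return fedavg(local_models, sizes)


# Hypothetical data: all samples follow y = 2x except the one client 0 asks to erase.
raw_data = [
    [(1.0, 2.0), (2.0, 4.0), (1.0, 10.0)],  # client 0 (unlearned client)
    [(1.0, 2.0), (3.0, 6.0)],               # client 1
]
deletion_requests = [{2}, set()]
remaining = [delete_samples(d, r) for d, r in zip(raw_data, deletion_requests)]

w = 0.0  # retraining from scratch starts from a fresh model
for _ in range(200):
    w = unlearning_round(w, remaining, lr=0.1)
```

After the erased sample is removed, the retrained model converges to the slope shared by the remaining data, which is the behavior the baseline is meant to guarantee at the cost of a full retraining run.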

### Federated Rapid Retraining

The baseline is time-consuming and resource-intensive, which is undesirable for real-world large FL systems. This motivates us to seek a time-saving and energy-efficient rapid retraining solution. To this end, we follow the Quasi-Newton methods and use a first-order Taylor approximation to derive an efficient method. Specifically, for the $k$-th client, expanding $\nabla {F_{{k_u}}}({\omega})$ to first order around $\omega^*$ and setting it to zero at $\omega^u$, we obtain:
$$\nabla {F_{{k_u}}}({\omega ^*}) + {H_{{k_u}}}({\omega ^*})({\omega ^{u}} - {\omega ^*}) \approx 0,$$
where ${\omega ^{u}} = \arg \min_{\omega} {F_{{k_u}}}(\omega )$ is the unique global minimum (i.e., the globally optimal unlearned model), ${H_{{k_u}}}({\omega ^*}) = {\nabla ^2}{F_{{k_u}}}({\omega ^*})$ is the Hessian matrix, and $\omega^*$ is a minimizer of $F_k(\omega)$. According to the first-order optimality condition, we have: $\omega _{t + 1}^{{k_u}} = \omega _t^{{k_u}} - \frac{1 }{{B - \Delta {B_t}}}H_{{k_u}}^{ - 1}{\Delta _{{k_u}}},$ where $\Delta_{k_u}$ is the local gradient, i.e., $\Delta_{k_u}  = \nabla {F_k}(\omega _t^{{k_u}},\mathcal{D}_k^u)$, and $H_{k_u} = {\nabla ^2}{F_k}(\omega _t^{{k_u}},\mathcal{D}_k^u)$. In this way, we can use a *Newton-type* update strategy to achieve the unlearning goal efficiently. However, computing the inverse-Hessian-vector product is still computationally expensive. Therefore, we further explore how to efficiently calculate the inverse Hessian matrix $H_{k_u}^{-1}$.

To address this problem, recent work uses the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm, which leverages the historical parameter-gradient pairs $\{ (\omega _t^k,\nabla F_k(\omega _t^k))\} _{t = 1}^T$ stored in the unlearned client to approximate $H_{k_u}^{-1}$ for each of the $T$ iterations. In particular, L-BFGS fits the Hessian matrix through the first $m$ historical parameter-gradient pairs without explicitly constructing and storing an approximation of the Hessian matrix or its inverse, and its time and space complexity is $\mathcal{O}(mn)$. Nevertheless, L-BFGS solves the Hessian approximation problem efficiently only when the model is small (i.e., generally fewer than $10^4$ model parameters), and it cannot be directly applied to large models (e.g., ResNet). Furthermore, if the server stores historical gradients and parameters, it exposes the clients to privacy disclosure risks.
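For context, the sketch below shows the textbook L-BFGS two-loop recursion, which is the alternative discussed above and not the method the paper adopts. It approximates $H^{-1}g$ from $m$ stored pairs $(s_j, y_j)$, where $s_j$ is a difference of consecutive parameters and $y_j$ the corresponding difference of gradients. The function name `lbfgs_direction` and the toy pairs are illustrative assumptions.

```python
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(grad: Vector, s_list: List[Vector], y_list: List[Vector]) -> Vector:
    """Two-loop recursion: approximate H^{-1} grad from m (s, y) pairs, oldest first.

    Each loop touches every stored pair once and does O(n) work per pair,
    so the cost is O(m * n) and no n-by-n matrix is ever formed.
    """
    rhos = [1.0 / dot(y, s) for s, y in zip(s_list, y_list)]
    q = list(grad)
    alphas = []  # filled from the newest pair to the oldest
    for s, y, rho in reversed(list(zip(s_list, y_list, rhos))):
        alpha = rho * dot(s, q)
        q = [qi - alpha * yi for qi, yi in zip(q, y)]
        alphas.append(alpha)

    # Initial inverse-Hessian guess: a scaled identity built from the newest pair.
    s_new, y_new = s_list[-1], y_list[-1]
    gamma = dot(s_new, y_new) / dot(y_new, y_new)
    r = [gamma * qi for qi in q]

    # Second loop runs from the oldest pair to the newest.
    for s, y, rho, alpha in zip(s_list, y_list, rhos, reversed(alphas)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r


# Toy quadratic F(w) = w_1^2 + 2 * w_2^2 with Hessian diag(2, 4).
# The pairs satisfy y = H s exactly, so the recursion recovers H^{-1} grad here.
s_pairs = [[1.0, 0.0], [0.0, 1.0]]
y_pairs = [[2.0, 0.0], [0.0, 4.0]]
gradient = [2.0, 4.0]  # gradient of F at w = (1, 1)
direction = lbfgs_direction(gradient, s_pairs, y_pairs)
```

The stored pairs are exactly the historical parameter-gradient information whose retention raises the storage and privacy concerns noted above.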

Motivated by the limitations mentioned above, we aim to answer the following question: *how to efficiently approximate the inverse Hessian matrix without utilizing the historical parameter-gradient pairs $\{ (\omega _t^k,\nabla F_k(\omega _t^k))\} _{t = 1}^T$ in FL?* In this paper, we propose a low-cost Hessian approximation method, i.e., a diagonal empirical Fisher Information Matrix (FIM)-based method, to efficiently approximate the Hessian matrix. The unlearning update rule can then be rewritten as follows:
$$\omega _{t + 1}^{{k_u}} = \omega _t^{{k_u}} - \frac{1 }{{B - \Delta {B_t}}}\Gamma _{{k_u}}^{ - 1}{\Delta _{{k_u}}},$$
where $\Gamma _{k_u}$ is the FIM. Then, we show that the outer product of the gradient is an asymptotically unbiased estimate of the Hessian matrix. We define the outer product matrix of the gradient at $\omega_t^{k_u}$ as follows:
$$H_t^{{k_u}} \approx \Gamma _t^{{k_u}} = \frac{1}{{B - \Delta B}}\sum\limits_{({x_i},{y_i}) \in \mathcal{D}_k^u} {\nabla {F_k}({x_i},\omega _t^{{k_u}})} \nabla {F_k}{({x_i},\omega _t^{{k_u}})^ \top },$$
where $\Gamma_{t}^{k_u}$ is the empirical expectation of the gradient outer product, used as an approximation to the Hessian $H_{t}^{k_u}$ at $\omega_t^{k_u}$. Since the cross-entropy loss we use is a negative log-likelihood, the expectation ${\mathbb{E}_{{x_i} \in \mathcal{D}_k^u}}[\nabla {F_{{k_u}}}({x_i},{\omega ^*})\nabla {F_{{k_u}}}{({x_i},{\omega ^*})^{\top}}]$, evaluated at the true parameter $\omega ^*$ (i.e., under the Softmax distribution of the local model), takes the form of the FIM. By the equivalence of the two standard definitions of the FIM, the same quantity can also be written as ${\mathbb{E}_{{x_i} \in \mathcal{D}_k^u}}[{\nabla ^2}{F_{{k_u}}}({x_i},{\omega ^*})]$. Thus, the lemma about the approximation error $\epsilon_t$ is as follows (proof omitted, see paper):

- **Lemma 1**: (Upper bound on $\epsilon_t$). Let $\epsilon_t$ be the approximation error of the FIM approximation of the Hessian matrix. Then, as $t \to \infty $, the following holds:

$$\epsilon_t = {\mathbb{E}_{(y|x,{\omega ^*})}}[H_t - \Gamma _t] \to 0.$$
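To make the diagonal approximation concrete, the sketch below computes the diagonal of the empirical FIM as the mean of element-wise squared per-sample gradients and uses it to precondition an update. This is one plausible reading only: the Adam-style moving averages with bias correction, the `damping` term, the step size `lr`, and all names are assumptions, and the paper's exact momentum rule may differ.

```python
from typing import Dict, List

Vector = List[float]


def diag_empirical_fim(grads: List[Vector]) -> Vector:
    """Diagonal of (1/n) * sum_i g_i g_i^T, i.e. the mean of element-wise squared gradients."""
    n, d = len(grads), len(grads[0])
    return [sum(g[j] * g[j] for g in grads) / n for j in range(d)]


def fim_momentum_step(
    w: Vector,
    grads: List[Vector],
    state: Dict,
    lr: float = 1.0,
    beta1: float = 0.9,
    beta2: float = 0.999,
    damping: float = 1e-8,  # assumed safeguard against division by tiny FIM entries
) -> Vector:
    """One preconditioned update w - lr * m_hat / (v_hat + damping).

    m is a moving average of the mean gradient and v a moving average of the
    diagonal empirical FIM (assumed Adam-style form, with bias correction).
    """
    n, d = len(grads), len(w)
    mean_grad = [sum(g[j] for g in grads) / n for j in range(d)]
    fim = diag_empirical_fim(grads)

    state["t"] += 1
    state["m"] = [beta1 * m + (1 - beta1) * g for m, g in zip(state["m"], mean_grad)]
    state["v"] = [beta2 * v + (1 - beta2) * f for v, f in zip(state["v"], fim)]

    m_hat = [m / (1 - beta1 ** state["t"]) for m in state["m"]]
    v_hat = [v / (1 - beta2 ** state["t"]) for v in state["v"]]
    return [wj - lr * mj / (vj + damping) for wj, mj, vj in zip(w, m_hat, v_hat)]


# Hypothetical per-sample gradients for a 2-parameter model on 2 remaining samples.
sample_grads = [[1.0, 2.0], [3.0, 0.0]]
fim_diag = diag_empirical_fim(sample_grads)

opt_state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
w_next = fim_momentum_step([0.0, 0.0], sample_grads, opt_state)
```

Only $d$ numbers are stored for the diagonal instead of a $d \times d$ matrix, and inverting it reduces to an element-wise division, which is what makes the Newton-type step affordable for large models.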

## Theoretical Analysis

### Convergence Analysis

- **Assumption 1**: (Bounded gradients). For any model parameter $\omega$ in the sequence $[{\omega _0},{\omega _1}, \ldots ,{\omega _t}, \ldots ]$, the norm of the gradient at every sample is bounded by a constant $\varepsilon_0^2$, i.e., $\forall \omega ,i$, we have: $||\nabla {f_i}(\omega )|| \leqslant {\varepsilon_0^2}$.

- **Assumption 2**: (Lipschitz continuity). We assume that the function $F:{\mathbb{R}^d} \to \mathbb{R}$ is $L$-Lipschitz continuous, i.e., $\forall \omega_1, \omega_2$, the following inequality holds:
$$|F({\omega _1}) - F({\omega _2})| \leqslant L||{\omega _1} - {\omega _2}||.$$

- **Assumption 3**: (Strong convexity and smoothness). $F(\omega)$ is $\mu$-strongly convex and $\rho $-smooth with positive coefficient $\mu$, i.e., $\forall {\omega _1},{\omega _2}$, the following inequalities hold:
$$F({\omega _1}) \geqslant F({\omega _2}) + \nabla F{({\omega _2})^ \top }({\omega _1} - {\omega _2}) + \frac{\mu }{2}||{\omega _1} - {\omega _2}|{|^2}.$$
$$F({\omega _1}) \leqslant F({\omega _2}) + \nabla F{({\omega _2})^ \top }({\omega _1} - {\omega _2}) + \frac{\rho }{2}||{\omega _1} - {\omega _2}|{|^2}.$$

- **Assumption 4**: The function $F:{\mathbb{R}^d} \to \mathbb{R}$ is twice continuously differentiable, $\rho$-smooth, and $\mu$-strongly convex with $\mu >0$, i.e.,
$$\mu I \preceq {\nabla ^2}F(\omega ) \preceq \rho I,$$
where $I \in \mathbb{R}^{d \times d}$ is the identity matrix and ${\nabla ^2}F(\omega )$ is the Hessian of $F$.

- **Theorem 1**: Suppose the objective function is strongly convex and smooth, and Assumption 2 holds. Then we have:
$$f(\omega _{t + 1}^u) - f({\omega ^*}) \leqslant  \varepsilon,$$
where $\varepsilon  =  - \frac{\mu }{{2{\rho ^2}}}\varepsilon _0^2$ (proof omitted, see paper).

### Complexity Analysis

- **Time**: Let $f(p)$ be the time complexity of forward propagation. The time complexity of one backpropagation step is then at most $5f(p)$, so the total complexity of computing the derivative for each training sample is $6f(p)$. Thus, the time complexity of the baseline per step is $6f(p)[{k_u}(B - \Delta B) + {k_c}B]$. If we use a block-diagonal structure with a block size of $b$ and iterate over $(B - \Delta B)$ samples in the diagonal estimation, the computational complexity of the proposed algorithm per step is $6f(p)[{k_u}(B - \Delta B) + {k_c}B] + k\mathcal{O}(b(B - \Delta B)d)$. Suppose the retraining process takes $T_b$ iterations for the baseline and $T_u$ iterations for the proposed algorithm. Then the running time ${\rm T}_b$ of the baseline is $6f(p)[{k_u}(B - \Delta B) + {k_c}B]{T_b}*{t_b}$, and the total running time ${\rm T}_u$ of the proposed algorithm is $\big[6f(p)[{k_u}(B - \Delta B) + {k_c}B] + k\mathcal{O}(b(B - \Delta B)d)\big]{T_u}*{t_u}$. Accordingly, the speed-up factor $v$ is defined as follows:
$$v = {(\frac{{{{\rm T}_u}}}{{{{\rm T}_b}}})^{ - 1}} = {[\frac{{{T_u}*{t_u}}}{{{T_b}*{t_b}}}(1 + \frac{{k\mathcal{O}(b(B - \Delta B)d)}}{{6f(p)[{k_u}(B - \Delta B) + {k_c}B]}})]^{ - 1}},$$
where $t_u$ and $t_b$ are the running times of one round of training of the proposed algorithm and the baseline, respectively. The cost of this estimation is one diagonal FIM (to compute $\mathrm{diag}(\omega)$), which is equivalent to one gradient backpropagation, i.e., $k\mathcal{O}(b(B-\Delta B)d) \approx 5kf(p)$. Thus, we have:
$$v \approx {[\frac{{{T_u}*{t_u}}}{{{T_b}*{t_b}}}(1 + \frac{{5k}}{{6[{k_u}(B - \Delta B) + {k_c}B]}})]^{ - 1}}.$$

- **Space**: In the proposed algorithm, we use the diagonal technique to approximate the Fisher information matrix and update it with the mini-batch gradients. Specifically, as we employ a block-diagonal structure with blocks of size $b$, it needs a memory of size $\mathcal{O}(db)$ to compute the estimated diagonal Fisher information. As shown in our experiments, this allows us to support fairly large models and sample sets.
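As a quick numerical reading of the approximate speed-up formula, the sketch below evaluates $v$ for made-up inputs. The parameter names mirror the symbols above, the values are purely illustrative and not taken from the paper, and $k$ is assumed to be the number of clients that compute the diagonal FIM.

```python
def speedup_factor(
    T_u: int, t_u: float,   # iterations and per-round time of the proposed algorithm
    T_b: int, t_b: float,   # iterations and per-round time of the baseline
    k: int,                 # assumed: number of clients computing the diagonal FIM
    k_u: int, k_c: int,     # numbers of unlearned clients and other clients
    B: int, delta_B: int,   # mini-batch size and number of removed samples per batch
) -> float:
    """Evaluate v ~= [ (T_u t_u)/(T_b t_b) * (1 + 5k / (6 [k_u (B - dB) + k_c B])) ]^{-1}."""
    baseline_units = 6 * (k_u * (B - delta_B) + k_c * B)
    fim_overhead = 5 * k / baseline_units
    time_ratio = (T_u * t_u) / (T_b * t_b)
    return 1.0 / (time_ratio * (1.0 + fim_overhead))


# Illustrative values only: the proposed method needs half as many rounds.
v = speedup_factor(T_u=50, t_u=1.0, T_b=100, t_b=1.0,
                   k=10, k_u=2, k_c=8, B=128, delta_B=2)
```

With these inputs the FIM overhead term is well below one percent, so the speed-up is governed almost entirely by the ratio of retraining rounds.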

## Experiments

### Setup
To evaluate the performance of our proposed design, we conduct extensive experiments on four representative public datasets. All experiments were developed using Python 3.7 and PyTorch 1.7, and executed on a server with an NVIDIA GeForce RTX2080 Ti GPU and an Intel Xeon Silver 4210 CPU. 

![i](https://drive.google.com/uc?export=view&id=1noFSKPhwdIXSnLIyDpHbTXCUiFTsquim)

- **Datasets**: In this paper, we adopt four real-world image datasets for evaluations, i.e., MNIST, Fashion-MNIST, CIFAR-10, and CelebA (a.k.a. Federated LEAF dataset). The datasets cover different attributes, dimensions, and numbers of categories, as shown in Table I, allowing us to effectively explore the unlearning utility of the proposed algorithm. To simulate the real environment settings of FL, we evenly distribute the four training datasets to all clients.

- **Models**: We use a simple CNN (2 convolutional layers followed by 1 fully connected layer) for classification on the MNIST and Fashion-MNIST datasets, the AlexNet model for classification on the CIFAR-10 dataset, and the ResNet-18 model for classification on the CelebA dataset. In particular, we perform a gender classification task on the CelebA dataset.

- **Hyperparameters**: In our design, we consider the cross-silo FL scenario. We set the number of clients $K=10$, proportion of client participation $q=1$, local epoch ${E_{local}}=1$, mini-batch size $B \in \{ 128,256,512,1024,2048\}$, learning rate $\eta  = 0.001$, deletion rate $r \in \{ 2\% ,1.5\% ,1\% ,0.5\% \}$, training rounds $T=200$, block size $b=3$, and momentum parameters $\beta_1=0.9$, $\beta_2=0.999$.

- **Federated Unlearning Pipeline**: First, we follow the above hyperparameter settings to run the training stage of the FL pipeline and use the obtained model $\omega^*$ for inference or unlearning. Second, if clients initiate data erasure requests, the server starts the unlearning stage after receiving all the requests, in order to forget the contribution of the erased data. Notably, we define the deletion rate $r$ as the percentage of client-side deleted data relative to all training data. Finally, the server reinitializes the global model $\omega^*$ and uses the Federated Rapid Retraining Algorithm or the baseline algorithm (see below) to perform retraining and obtain the new model $\omega^u$.

- **Baseline**: **Retraining from Scratch.** This method deletes the erased training samples and retrains the FL model from scratch, using the remaining dataset as the training dataset.

- **Evaluation Metrics**: First, we evaluate the efficiency of the proposed algorithm. The speed-up in running time of our retraining algorithm is defined as $v  = \frac{{{\mathrm{T}_b}}}{{{\mathrm{T}_u}}}$ and reported as a multiple ($\times$). Second, to fairly compare model utility, we report the performance achieved by a given model relative to the performance of the baseline model. To this end, we use the Symmetric Absolute Percentage Error (SAPE) defined as: $\varepsilon_s = \mathrm{SAPE}(Acc_{test}^*,Acc_{test}^u) = \frac{{|Acc_{test}^u - Acc_{test}^*|}}{{|Acc_{test}^*| + |Acc_{test}^u|}}$, where $Acc_{test}^*$ denotes the accuracy of the model $\omega^*$ obtained by the baseline algorithm on the test dataset $\mathcal{D}_{test}$, and $Acc_{test}^u$ denotes the accuracy of the model $\omega^u$ obtained by the proposed algorithm on the same dataset.
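Both metrics are simple enough to compute directly. The sketch below is a minimal illustration with hypothetical accuracy and timing values that are not results from the paper. The zero-denominator guard is an added assumption.

```python
def speedup(time_baseline: float, time_unlearn: float) -> float:
    """Speed-up v = T_b / T_u, read as 'v times faster than retraining from scratch'."""
    return time_baseline / time_unlearn


def sape(acc_baseline: float, acc_unlearned: float) -> float:
    """Symmetric Absolute Percentage Error between two test accuracies."""
    denom = abs(acc_baseline) + abs(acc_unlearned)
    if denom == 0.0:
        return 0.0  # assumed convention: two zero accuracies count as identical
    return abs(acc_unlearned - acc_baseline) / denom


# Hypothetical measurements, for illustration only.
v_example = speedup(time_baseline=100.0, time_unlearn=25.0)
eps_example = sape(acc_baseline=0.98, acc_unlearned=0.97)
```

SAPE is symmetric in its two arguments and equals zero when the unlearned model matches the baseline accuracy exactly, so smaller values indicate better preserved utility.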

### Results

![2](https://drive.google.com/uc?export=view&id=139k8M7P3-1xo45tfFTyqwZ_ArGpwnemp)

![3](https://drive.google.com/uc?export=view&id=1i5lrh0qKRnC3iV-YWVsRK2s8fgeuo3Nn)

![4](https://drive.google.com/uc?export=view&id=1BCelCi33GhhnFTofUPwgPSnkSpcl_FP6)

![5](https://drive.google.com/uc?export=view&id=1D_vxDaXMZ4qMCU4dHlVwCYpPW7toEmoJ)

# References

- Y. Liu, L. Xu, X. Yuan, C. Wang, and B. Li, “The Right to be Forgotten in Federated Learning: An Efficient Realization with Rapid Retraining,” May 2022. doi: 10.1109/infocom48880.2022.9796721. [[Paper](https://ieeexplore.ieee.org/document/9796721)]