# Subspace based Federated Unlearning

In each FL training round, local training on each client reduces that client's empirical loss. We argue that unlearning can be formulated as the inverse of learning: gradient ascent on the target client's data makes the model forget that data. However, the loss is unbounded, so the ascent gradient must be constrained to preserve the quality of the model after unlearning. The whole process can be regarded as a constrained optimization problem: maximize the empirical loss of the target client subject to a constraint on model performance. In this paper, we propose a Subspace-based Federated Unlearning method, dubbed SFU. SFU restricts the target client's gradient-ascent update to the orthogonal complement of the input space of the remaining clients, removing the target client's contribution from the final trained global model. In SFU, the server only needs the gradient provided by the target client and the representation matrices provided by the other clients; it never accesses any client's raw data. SFU can also be applied to a model at any training phase, regardless of the specifics of model training and aggregation, and it does not require the client or server to store additional historical gradients or data.

Specifically, SFU participants play one of three roles: the target client to be forgotten, the remaining clients, and the server. The target client performs gradient ascent locally, starting from the global model, and sends the gradient to the server. Each remaining client selects a certain amount of local data to build a representation matrix and sends it to the server. The server merges the representation matrices it receives and obtains the input subspace with Singular Value Decomposition (SVD). Finally, the server projects the target client's gradient onto the orthogonal complement of the input subspace and updates the global model with it. In addition, we design a differential privacy method to protect client privacy when the representation matrices are sent: each client adds random perturbation factors to each vector of its representation matrix to prevent possible privacy leaks, and these factors have no effect on the input space search or the final model. Empirical results show that SFU beats other SOTA baselines with a 1%-10% improvement on test sets.

## Design

### Setup

Suppose that there are $N$ clients, denoted as $C_1, ..., C_N$,  respectively. Client $C_i$ has a local dataset $\mathcal{D}^i$. The goal of traditional FL is to collaboratively learn a machine learning model $w$ over the dataset $\mathcal{D}\triangleq \bigcup_{i\in[N]}\mathcal{D}^i$ :
$$\mathcal{L}(w) = \sum_{i=1}^N \frac{|\mathcal{D}^i|}{|\mathcal{D}|}L_i(w),$$
$$w^* = \arg \min_{w} \mathcal{L}(w),$$
where $L_i(w) = \mathbb{E}_{(x,y)\sim \mathcal{D}^i} [\ell_i(w; (x, y))]$ is the empirical loss of $C_i$, which each client minimizes during federated training, and $w^*$ is the final model produced by the FL process.

Now we consider how to forget the contribution of the target client $C_I$. A natural idea is to increase the empirical risk $L_I(w)$ of the target client, which is equivalent to reversing the learning process. However, simply maximizing this loss can degrade the model's performance on the other clients. Federated unlearning must forget the contribution of $C_I$ while preserving overall model performance. Thus, the objective of federated unlearning is defined as:
$$\begin{aligned} \arg \max_{w}\; & L_I(w) = \mathbb{E}_{(x,y)\sim \mathcal{D}^I} [\ell_I(w; (x, y))] \\
\text{s.t.}\; & \mathcal{L}^{ul}(w) - \mathcal{L}^{ul}(w^*) \leq \delta, \end{aligned}$$
where $\delta$ is a small tolerance on the increase in empirical loss and $\mathcal{L}^{ul}(\cdot)$ is the empirical loss of the FL system after removing the target client:
$$\mathcal{L}^{ul}(w) = \sum_{i\in[N]\setminus\{I\}} \frac{|\mathcal{D}^i|}{|\mathcal{D}^{un}|}L_i(w),$$
where $\mathcal{D}^{un}\triangleq \bigcup_{i\in[N]\setminus\{I\}}\mathcal{D}^i$ is the remaining dataset after removing the target client.
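As a minimal sketch of how the weighted losses and the constraint could be evaluated, the snippet below assumes that per-client loss values are already available as plain numbers. The function names `weighted_loss` and `satisfies_constraint` are hypothetical and are not taken from the paper.

```python
import numpy as np

def weighted_loss(client_losses, client_sizes, exclude=None):
    """Size-weighted average of per-client losses.

    With exclude=None this is L(w); with exclude=I it is L^ul(w),
    the loss over the remaining clients only.
    """
    losses = np.asarray(client_losses, dtype=float)
    sizes = np.asarray(client_sizes, dtype=float)
    keep = np.ones(len(losses), dtype=bool)
    if exclude is not None:
        keep[exclude] = False  # drop the target client
    weights = sizes[keep] / sizes[keep].sum()
    return float(weights @ losses[keep])

def satisfies_constraint(losses_w, losses_w_star, client_sizes, target, delta):
    """Check L^ul(w) - L^ul(w*) <= delta for a candidate model w."""
    drift = (weighted_loss(losses_w, client_sizes, exclude=target)
             - weighted_loss(losses_w_star, client_sizes, exclude=target))
    return drift <= delta
```

A candidate model would be acceptable under this check when its loss on the target client has grown while `satisfies_constraint` still returns `True`.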

### Subspace-based Federated Unlearning (SFU)

![1](https://drive.google.com/uc?export=view&id=1akXxXsuBGrZk0Vk8HSJYjfUeZaNvhsLf)

We introduce a novel subspace-based federated unlearning framework, named SFU. The main insight of SFU is that we constrain the gradients generated by the target client's gradient ascent to the orthogonal complement of the input subspace of the other clients, which removes the contribution of the target client from the global model. As shown in Fig. 1, SFU participants play one of three roles: the target client to be forgotten, the remaining clients, and the server. The target client performs gradient ascent and uploads the gradient to the server. The other clients compute local representation matrices and upload them to the server. The server computes the input space of the other clients and performs the unlearning update of the global model. Next, we describe the tasks of the three participants in turn.

To find a model with a large empirical loss on the target client $C_I$, we can simply make several local passes of (mini-batch stochastic) gradient ascent on client $C_I$ and add these gradient updates to the global model.

Given a network layer with weights $W$ and an input $\textbf{x}$, we obtain the output $\textbf{y}_1$:
$$W\textbf{x} =  \textbf{y}_1.$$
When this model accepts a gradient update $\Delta w$, the output becomes:
$$(W + \Delta w) \textbf{x} =  \textbf{y}_2.$$
The difference between the two outputs is:
$$\Delta \textbf{y} =  \textbf{y}_2 - \textbf{y}_1 = (W + \Delta w) \textbf{x} - W\textbf{x} = \Delta w \textbf{x}.$$
When $\Delta \textbf{y} = 0$, the update does not change the output at all. This requires every row of $\Delta w$ to be orthogonal to the input $\textbf{x}$ and, for a whole dataset, to the subspace spanned by its inputs. Therefore, we can project the update gradient of the target client $C_I$ onto the orthogonal complement of the input gradient subspace of $\mathcal{D}^{un}$ to minimize the degradation of the global model's performance.
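The following NumPy sketch illustrates the argument for a single linear layer and a single input. The shapes and random values are illustrative assumptions only: removing the component of $\Delta w$ along $\textbf{x}$ leaves the layer output unchanged, while the unconstrained update does not.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))        # layer weights (toy shape)
x = rng.normal(size=6)             # one input vector
delta_w = rng.normal(size=(4, 6))  # an arbitrary update

# Remove from every row of delta_w its component along x.
x_hat = x / np.linalg.norm(x)
delta_w_perp = delta_w - np.outer(delta_w @ x_hat, x_hat)

y1 = W @ x                       # output before the update
y2 = (W + delta_w_perp) @ x      # output after the constrained update
y2_raw = (W + delta_w) @ x       # output after the unconstrained update

print(np.linalg.norm(y2 - y1), np.linalg.norm(y2_raw - y1))
```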

We first need to consider how to represent the input space of $\mathcal{D}^{un}$, the data of the other clients. For an individual network, we construct the gradient subspace in the following two steps:
- For each layer $l$ of the network, we first construct a representation matrix $\boldsymbol{R}^l =[\textbf{x}_{1}^l, \textbf{x}_{2}^l, ..., \textbf{x}_{n_s}^l ]$ by concatenating, as columns, the $n_s$ representations obtained from a forward pass of $n_s$ random samples from the current training dataset through the network.
- Next, we perform SVD on $\boldsymbol{R}^l =\boldsymbol{U}^l\boldsymbol{\Sigma}^l(\boldsymbol{V}^l)^T$ and take its rank-$k$ approximation $(\boldsymbol{R}^l)_k$ according to the following criterion for a given coefficient $\epsilon^l$:
$$||(\boldsymbol{R}^l)_k||_F^2 \geq \epsilon^l||\boldsymbol{R}^l||_F^2.$$
We take $S^l=span\{\boldsymbol{u}_{1}^l,\boldsymbol{u}_{2}^l, ...,\boldsymbol{u}_{k}^l\}$, spanned by the first $k$ vectors in $\boldsymbol{U}^l$, as the space of significant representations at layer $l$, since it contains the directions with the highest singular values in the representation.
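The two steps above can be sketched with NumPy as follows. The helper name `significant_subspace` is hypothetical, and choosing the smallest $k$ that meets the criterion is an assumption made here, since the text only states the inequality.

```python
import numpy as np

def significant_subspace(R, eps):
    """Return an orthonormal basis (columns) of the significant subspace of R.

    R   : representation matrix of one layer, shape (d, n_s)
    eps : energy threshold in (0, 1]
    Assumption: k is the smallest rank with ||R_k||_F^2 >= eps * ||R||_F^2.
    """
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    # ||R_k||_F^2 equals the sum of the k largest squared singular values.
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(energy, eps)) + 1, len(s))
    return U[:, :k]

# Usage with a toy rank-2 representation matrix (illustrative data only).
rng = np.random.default_rng(0)
R = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 20))
basis = significant_subspace(R, eps=0.999)
print(basis.shape)
```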

For FL scenarios, we need the data on each client to find the gradient subspace of $\mathcal{D}^{un}$. First, every client except the target client $C_I$ selects the same number of local samples, computes a representation matrix $\boldsymbol{R}_i^l$ for each layer $l$, and sends these matrices to the central server, which assembles the overall representation matrix.

To protect the privacy of the representation matrices, we design a differential privacy algorithm. Client $i$ applies a random factor $\lambda_i^l$ to its layer-$l$ representations to avoid leaking information about its data. This does not affect the subspace search, because scaling a vector by a nonzero factor does not change which directions are orthogonal to it. If
$$\Delta w \textbf{x} = 0,$$
we have
$$\Delta w (\lambda \textbf{x}) = 0.$$

The final set of representation matrices on the server is $\boldsymbol{R} = \left\{\boldsymbol{R}^1, \boldsymbol{R}^2, ..., \boldsymbol{R}^L\right\}$, where $\boldsymbol{R}^l = [\lambda_{i}^l \boldsymbol{R}_{i}^l]_{i \neq I}$ concatenates, column-wise, the scaled representation matrices received from the remaining clients.
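A small numerical check of the scaling argument, with illustrative shapes and a hypothetical range for the random factors: multiplying each representation vector by a nonzero factor leaves the column space unchanged, so an update orthogonal to the original vectors stays orthogonal to the scaled ones. The check covers the full column space only.

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(size=(6, 3))          # three representation vectors in R^6
lam = rng.uniform(0.5, 2.0, size=3)  # hypothetical nonzero random factors
R_scaled = R * lam                   # scales column j by lam[j]

def column_space_projector(M):
    """Orthogonal projector onto the column space of a full-rank M."""
    Q, _ = np.linalg.qr(M)
    return Q @ Q.T

P = column_space_projector(R)
P_scaled = column_space_projector(R_scaled)

g = rng.normal(size=(4, 6))  # an arbitrary update
g_perp = g - g @ P           # rows orthogonal to every column of R

print(np.abs(P - P_scaled).max(), np.abs(g_perp @ R_scaled).max())
```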

After several local passes of (mini-batch stochastic) gradient ascent, client $C_I$ sends the update gradient $g_{C_I}$ to the server. The server updates the global model $w$ after collecting the set of representation matrices $\boldsymbol{R}$ and the gradient $g_{C_I}$. The server first performs SVD on each matrix in $\boldsymbol{R}$ to get the set of input gradient subspaces $S = \left\{S^1, S^2, ..., S^L\right\}$. To achieve the goal of federated unlearning, we project $g_{C_I}$ onto $S$ to get the projection $\tilde{g}_{C_I}$. The residual $g_{C_I}-\tilde{g}_{C_I}$ is orthogonal to $S$, and the server updates the global model $w$ with it:
$$w = w - (g_{C_I}-\tilde{g}_{C_I}).$$

The updated model $w$ removes the contribution of the target client ${C_I}$ and maintains a similar performance to the original global model.
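The server-side step might look like the sketch below, which assumes each layer's gradient is a matrix of shape (outputs, inputs) and each subspace $S^l$ is given as an orthonormal basis stored column-wise. The function names are hypothetical, and the update follows the rule exactly as written in the text.

```python
import numpy as np

def project_out(g, basis):
    """Return g minus its projection onto span(basis).

    g     : layer gradient, shape (d_out, d_in)
    basis : orthonormal basis of S^l as columns, shape (d_in, k)
    """
    g_tilde = g @ basis @ basis.T  # projection of each row of g onto S^l
    return g - g_tilde

def server_update(weights, grads, bases):
    """Apply w^l <- w^l - (g^l - g_tilde^l) layer by layer."""
    return [w - project_out(g, B) for w, g, B in zip(weights, grads, bases)]
```

Because the applied update is orthogonal to every basis vector, inputs that lie in $S^l$ produce the same layer output before and after the update.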

After a global model is trained, a complete SFU process consists of three steps, as shown in Fig. 1:
- Every client except the target client $C_I$ selects the same number of samples, computes the representation matrix $\boldsymbol{R}_i^l$ for each layer $l$ of the network, and sends it to the server after applying the random factors $\lambda_i^l$.
- The target client $C_I$ performs several local passes of gradient ascent and sends the update gradient $g_{C_I}$ to the server.
- The server performs SVD on the set of representation matrices $\boldsymbol{R}$ to get the set of input gradient subspaces $S$, projects $g_{C_I}$ onto $S$, and removes the contribution of the target client $C_I$ by updating the global model $w$ with the component of $g_{C_I}$ orthogonal to $S$.
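The three steps can be traced end to end on a toy problem. Everything in this sketch is an assumption made for illustration: a single linear layer with squared loss, remaining-client inputs that lie in a 3-dimensional subspace, one scalar factor per client, and a sign convention in which the target client sends $g_{C_I} = -\eta \nabla L_I(w)$ so that the update rule $w \leftarrow w - (g_{C_I}-\tilde{g}_{C_I})$ increases the target loss.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_s = 8, 3, 20

# Toy data: remaining clients' inputs lie in a 3-dimensional subspace.
mix = rng.normal(size=(d_in, 3))
remaining_inputs = [mix @ rng.normal(size=(3, n_s)) for _ in range(2)]
target_X = rng.normal(size=(d_in, n_s))
target_T = rng.normal(size=(d_out, n_s))
W = rng.normal(size=(d_out, d_in))  # stand-in for the trained global model

def target_loss(weights):
    """Mean squared-error loss on the target client's data."""
    return 0.5 * np.mean(np.sum((weights @ target_X - target_T) ** 2, axis=0))

# Step 1: remaining clients send representation matrices scaled by a factor.
sent = [rng.uniform(0.5, 2.0) * X for X in remaining_inputs]

# Step 2: target client sends its ascent update (sign convention assumed).
lr = 0.1
grad = (W @ target_X - target_T) @ target_X.T / n_s
g = -lr * grad

# Step 3: server builds the subspace, projects, and updates.
R = np.concatenate(sent, axis=1)
U, s, _ = np.linalg.svd(R, full_matrices=False)
energy = np.cumsum(s ** 2) / np.sum(s ** 2)
eps = 1 - 1e-8  # threshold close to 1 because the toy data is exactly low rank
k = min(int(np.searchsorted(energy, eps)) + 1, len(s))
basis = U[:, :k]
g_tilde = g @ basis @ basis.T
W_new = W - (g - g_tilde)

print("subspace dimension:", k)
print("target loss before/after:", target_loss(W), target_loss(W_new))
```

In this toy setting the outputs on the remaining clients' inputs are preserved because their inputs lie entirely inside the recovered subspace; with real data and a lower threshold the preservation would only be approximate.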

## Experiments

![t](https://drive.google.com/uc?export=view&id=1UtKjXPNgH9WW4778stKPo33L6wBeZHRj)

![2](https://drive.google.com/uc?export=view&id=1DuU9jw_ikoH7NWUQ-31tRVCM8Hoz3jzs)

![4](https://drive.google.com/uc?export=view&id=1wU7IZQf67WgSHB8w-O4wX8waQYFhT7S8)

# References
- G. Li, L. Shen, Y. Sun, Y. Hu, H. Hu, and D. Tao, Subspace based Federated Unlearning. arXiv, 2023. doi: 10.48550/ARXIV.2302.12448. [[Paper](https://arxiv.org/abs/2302.12448)]