# Federated Unlearning for On-Device Recommendation

To unlearn target users quickly, FRU calibrates all users' historical model updates and aggregates the calibrated updates to reconstruct the FedRec model. FRU contains two parts: efficient on-device update storing and update revision for fast recovery. Efficient on-device update storing records each client's historical updates on its resource-constrained device. It has two components: a user-item mixed negative sampling method that reduces the number of updated parameters, and an importance-based update filter that dynamically discards less important historical updates. Using the stored historical updates, FRU rolls the FedRec back to the point at which the deleted user/client joined, and then calibrates the historical updates from that point onward to recover the FedRec.

We present the details of FRU.
For clarity, we use $\mathcal{M}(V)_{k}^{t}$ and $\mathcal{M}(U)_{k}^{t}$ to denote the updates of the global (or public) parameters (e.g., item embeddings) and the local (or private) parameters (e.g., user embeddings) of user $u_{k}$ at the $t$-th global round.
$\mathcal{M}(U)_{k}^{t}$ is not submitted to the central server for privacy reasons, while $\mathcal{M}(V)_{k}^{t}$ is uploaded to the server, which updates the global parameters by aggregating the clients' updates.

## Base Federated Recommenders

To demonstrate the generality of FRU, we choose the two most commonly used recommenders as our base models and train them with the most typical federated learning protocol.

Neural collaborative filtering (NCF) extends collaborative filtering (CF) by using an $L$-layer feedforward network (FFN) to model the complex relationship between users and items:
$$\hat{r_{ij}} = \sigma (h^\top FFN([\mathbf{u}_{i}, \mathbf{v}_{j}]))$$
where $\mathbf{u}_{i}$ and $\mathbf{v}_{j}$ are the user and item embeddings, respectively, and $[\cdot]$ is the concatenation operation.
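The scoring function above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the ReLU activation, the `(weights, bias)` layer layout, and the names `ncf_score`, `dense`, and `relu` are not taken from the source, which specifies only the concatenation, the FFN, the projection $h$, and the sigmoid $\sigma$.

```python
import math


def relu(vec):
    # Assumption: ReLU is used between FFN layers; the text does not name the activation.
    return [max(0.0, x) for x in vec]


def dense(weights, bias, vec):
    # One fully connected layer: weights is a list of rows, one row per output unit.
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def ncf_score(u, v, layers, h):
    """Predicted preference sigma(h^T FFN([u, v])) for one user-item pair."""
    x = list(u) + list(v)                      # concatenation [u_i, v_j]
    for weights, bias in layers:               # the L-layer feedforward network
        x = relu(dense(weights, bias, x))
    logit = sum(a * b for a, b in zip(h, x))   # projection with h
    return 1.0 / (1.0 + math.exp(-logit))      # sigmoid


# Toy usage: 1-d embeddings and a single hidden unit that sums its inputs.
toy_layers = [([[1.0, 1.0]], [0.0])]
toy_score = ncf_score([1.0], [2.0], toy_layers, [1.0])
```

In the federated setting described later, `u` would stay on the client while `v`, `layers`, and `h` would belong to the global parameters.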

As a graph-based recommender, LightGCN treats each user and item as a distinct node. The user-item interactions can be viewed as a bipartite graph. User and item embeddings are learned and updated by propagating their neighbors' embeddings:
$$\mathbf{u}_{i}^{l} = \sum\limits_{j\in \mathcal{N}_{u_{i}}}\frac{1}{\sqrt{\left| \mathcal{N}_{u_{i}} \right|} \sqrt{\left| \mathcal{N}_{v_{j}} \right|}}\mathbf{v}_{j}^{l-1}, \mathbf{v}_{j}^{l} = \sum\limits_{i\in \mathcal{N}_{v_{j}}}\frac{1}{\sqrt{\left| \mathcal{N}_{v_{j}} \right|} \sqrt{\left| \mathcal{N}_{u_{i}} \right|}}\mathbf{u}_{i}^{l-1}$$
where $l$ is the propagation layer in LightGCN. Note that under FedRec settings, the embeddings of neighbor users are not accessible. Therefore, each client can only use the local user-item interaction bipartite graph to update user and item embeddings.
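A minimal sketch of one propagation layer, restricted to whatever edges a client can see, is given below. The function name `propagate` and the dictionary layout are illustrative assumptions. Node degrees are computed from the supplied edges only, so on a single client's local graph every item has degree 1; whether a real system normalizes this way is also an assumption.

```python
import math


def propagate(user_emb, item_emb, edges):
    """One LightGCN layer on a bipartite graph given as (user, item) edges.

    user_emb / item_emb map node ids to embedding lists of equal length.
    Returns the layer-l embeddings computed from the layer-(l-1) inputs.
    """
    deg_u, deg_v = {}, {}
    for u, v in edges:
        deg_u[u] = deg_u.get(u, 0) + 1
        deg_v[v] = deg_v.get(v, 0) + 1

    dim = len(next(iter(user_emb.values())))
    new_u = {u: [0.0] * dim for u in user_emb}
    new_v = {v: [0.0] * dim for v in item_emb}
    for u, v in edges:
        # Symmetric normalization 1 / (sqrt(|N_u|) * sqrt(|N_v|)).
        w = 1.0 / (math.sqrt(deg_u[u]) * math.sqrt(deg_v[v]))
        for d in range(dim):
            new_u[u][d] += w * item_emb[v][d]
            new_v[v][d] += w * user_emb[u][d]
    return new_u, new_v


# Toy local graph: one user "k" who interacted with items 1..4.
local_edges = [("k", i) for i in range(1, 5)]
u1, v1 = propagate({"k": [2.0]}, {i: [1.0] for i in range(1, 6)}, local_edges)
```

Item 5 has no edge in the toy graph, so its propagated embedding is the zero vector; only the user's own interactions contribute.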

The federated learning protocol is as follows. The user embedding $\mathbf{u_{i}}$ is private; it is initialized and maintained at each client. The item embeddings $\mathbf {V} $ and other global model parameters are initialized at and sent from a central server. After receiving the item embeddings and other global parameters, each client combines them with its local user embedding $\mathbf{u_{i}}$ to form a local recommender model and then updates the model parameters on its local training data. After local updating, each client sends the updated global parameters back to the server and keeps the updated user embedding $\mathbf{u_{i}}$ locally. The server averages the uploaded global parameters to obtain the new global parameters.
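The server side of this protocol can be sketched as follows. The sketch works with per-item update vectors rather than full parameter copies, treats items a client did not touch as zero updates, and uses the hypothetical names `average_updates` and `apply_updates`; none of these details are fixed by the text above.

```python
def average_updates(client_updates):
    """Average per-item update vectors over all participating clients.

    client_updates: list of dicts {item_id: [float, ...]}, one dict per client.
    Assumption: an item missing from a client's dict counts as a zero update.
    """
    n = len(client_updates)
    agg = {}
    for upd in client_updates:
        for item, vec in upd.items():
            acc = agg.setdefault(item, [0.0] * len(vec))
            for d, x in enumerate(vec):
                acc[d] += x / n
    return agg


def apply_updates(item_emb, agg):
    """Return new global item embeddings: old embeddings plus the aggregated update."""
    new_emb = {i: list(v) for i, v in item_emb.items()}
    for item, vec in agg.items():
        new_emb[item] = [a + b for a, b in zip(new_emb[item], vec)]
    return new_emb


# Toy round with two clients; user embeddings never leave the clients.
global_items = {0: [0.0, 0.0], 1: [1.0, 1.0]}
uploads = [{0: [2.0, 0.0]}, {0: [0.0, 2.0], 1: [2.0, 2.0]}]
new_global = apply_updates(global_items, average_updates(uploads))
```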

## Efficient On-Device Update Storing

Compared with retraining from scratch, FRU speeds up the reconstruction process by accurately rolling back the FedRec and then calibrating the stored model updates. FRU relies on a log of historical model updates to avoid retraining from scratch. We store these updates on local devices, since storing all clients' historical updates centrally would incur unaffordable storage costs when the number of clients is large. The key question is therefore how to use the limited storage space on each client device efficiently.

Generally, the item embeddings dominate the size of a local recommender model in the FedRec. Based on this observation, we mainly focus on how to reduce the size of item embedding updates on each device. This paper proposes an importance-based update selection mechanism to store only important item embedding updates  and a user-item mixed semi-hard negative sampling method to reduce the number of negative instances at each client.

### Importance-based Update Selection

Instead of storing the updates of the whole item embedding table on each client device, the importance-based update selection mechanism stores only the embedding updates of the client's interacted items and sampled negative items, which are a tiny portion of the whole item set. FRU then further reduces the storage cost by ignoring insignificant updates. Intuitively, a client/user influences different items' embeddings to different degrees.
For example, some items are very popular, and their embeddings are updated by many users. In this case, a single user has only a very limited impact on these items. Likewise, when an item's embedding is already well trained, a user's update has a negligible effect. FRU does not store such small embedding updates, which reduces the storage cost on the device. Specifically, for a client/user $u_{k}$, FRU stores only the top $\alpha$ proportion of item embedding updates ranked by their significance $\lVert \mathcal{M} (\mathbf{v}_{i})_{k}^{t} \rVert$, where $\mathcal{M}(\mathbf{v}_{i})_{k}^{t}=\mathbf{v}_{i}^{t}-\mathbf{v}_{i}^{t-1}$. Here, $\mathbf{v}_{i}^{t-1}$ is the embedding of item $v_{i}$ received from the central server, and $\mathbf{v}_{i}^{t}$ is its embedding after updating on the local training data $\mathcal{D}_{k}$ of client $u_{k}$.
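The selection rule can be illustrated with a short sketch. It assumes the L2 norm as the significance measure and rounds the number of kept updates up with `math.ceil`; the function name `select_important_updates` and the rounding rule are illustrative choices, not details given in the text.

```python
import math


def update_norm(vec):
    # Significance of an update, taken here as its L2 norm (an assumption).
    return math.sqrt(sum(x * x for x in vec))


def select_important_updates(updates, alpha):
    """Keep the top-alpha fraction of item embedding updates by norm.

    updates: {item_id: update_vector} for the client's interacted and
             sampled negative items in one global round.
    alpha:   fraction in (0, 1] of updates to store.
    """
    if not updates:
        return {}
    keep = max(1, math.ceil(alpha * len(updates)))  # assumed rounding rule
    ranked = sorted(updates, key=lambda i: update_norm(updates[i]), reverse=True)
    return {i: updates[i] for i in ranked[:keep]}


# Toy round: four touched items, half of the updates are stored.
round_updates = {1: [3.0, 4.0], 2: [0.1, 0.0], 3: [0.0, 1.0], 4: [0.0, 0.0]}
stored = select_important_updates(round_updates, alpha=0.5)
```

In the toy round, the updates for items 1 and 3 have the largest norms and are the two that end up in the log.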

### User-item Mixed Semi-hard Negative Sampling

A client only needs to store the embedding updates of its associated items, which comprise the client's interacted items and sampled negative items. In this part, we propose a more efficient negative sampling method that reduces the number of negative samples at each client without compromising the model's performance, so that FRU can further reduce the storage space needed for the log.

For each user, the traditional negative sampling method randomly selects a small portion of the user's non-interacted items as negative samples. However, some sampled negative items may not be informative enough to optimize the recommendation model. For centralized recommender systems, recent studies show that hard negatives carry more useful information for model optimization and convergence, and propose two ways to sample them: based on user embeddings and based on an adversarial model. However, these centralized negative sampling methods cannot be directly adopted in FedRecs. The adversarial method adds considerable computation overhead, and each user's local data is highly sparse and biased. Sampling hard negatives based on user embeddings is also unreliable in FedRecs, because only a small portion of users are selected for training at each global epoch, so user embeddings are updated less frequently than in centralized recommenders.
Consequently, retrieving high-quality negative samples from the item pool remains challenging for FedRecs.

In this paper, we propose a user-item mixed semi-hard negative sampling strategy that chooses semi-hard negative samples from both the user side and the item side. Sampling from the user side works as in traditional centralized recommenders: we calculate the relevance between the user's embedding and each item's embedding in the candidate pool to choose hard negative samples for each client. As mentioned before, user embeddings are updated less frequently than in centralized recommenders, so they take a relatively long time to become informative. To compensate for the unreliability of user embeddings at the early stage, we also sample negative items from the item side. Specifically, we use the embedding centroid of the user's interacted items as a pseudo user embedding at the early training stage, since item embeddings are updated more frequently than user embeddings in FedRecs and therefore become reliable much earlier. We compute the centroid as the element-wise average of the item embeddings, and then choose hard negative items by calculating their relevance scores with the centroid vector.

We first select the top $2R$\% items as the candidate item pool using the user-item mixed sampling strategy (the top $R$\% from the user side and the top $R$\% from the item side, which may overlap). Then, we randomly select $N*\beta$ ($0 < \beta < 1$) negative samples from the candidate item pool to form the semi-hard negative samples, where $N$ is the original number of negative samples for each user in traditional negative sampling methods. We do not simply take the top negative items from the candidate pool (i.e., the hard negatives), to avoid sampling false negatives. The sampling mechanism can be formally described as follows:
$$\mathcal{V}_{k}^{u} = \mathop{argmax}\limits_{v_{i} \notin  \mathcal{D}_{k} \land \left| \mathcal{V}_{k}^{u} \right| = R\% * \left| \mathcal{V} \right|} R(\mathbf{u}_{k}^{t-1}, \mathbf{v}_{i}^{t-1})$$
$$\mathcal{V}_{k}^{v} = \mathop{argmax}\limits_{v_{i} \notin  \mathcal{D}_{k} \land \left| \mathcal{V}_{k}^{v} \right| = R\% * \left| \mathcal{V} \right|} R(\mathbf{v}_{k,cet}^{t-1}, \mathbf{v}_{i}^{t-1})$$
$$\mathcal{V}_{k}^{neg} = \mathop{Random}\limits_{\left| \mathcal{V}_{k}^{neg} \right| =N * \beta}(\mathcal{V}_{k}^{u} \cup \mathcal{V}_{k}^{v})$$

where $\mathbf{v}_{k,cet}^{t-1}$ is the average embedding of $u_{k}$'s interacted items at the beginning of epoch $t$, and $R(\cdot)$ is the relevance measure function, for which we use Euclidean distance. $N*\beta$ is the number of negative samples finally selected. The ablation study in the experiments shows that the proposed negative sampling method achieves comparable performance even with $\beta=0.5$. The sampling mechanism therefore significantly reduces the number of negative samples required at each client, and with it the storage space for item embedding updates.
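The three sampling equations can be sketched as below. Several details are assumptions made for illustration only: relevance is taken as the negative Euclidean distance (so that the most relevant items are the closest ones), pool sizes are rounded down with `int`, the user is assumed to have at least one interacted item, and all function names are hypothetical.

```python
import math
import random


def relevance(a, b):
    # Assumption: relevance = negative Euclidean distance, so closer means more relevant.
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    # Element-wise average of the interacted items' embeddings.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def top_candidates(query, item_emb, interacted, size):
    # The `size` non-interacted items most relevant to the query vector.
    pool = [i for i in item_emb if i not in interacted]
    pool.sort(key=lambda i: relevance(query, item_emb[i]), reverse=True)
    return set(pool[:size])


def sample_semi_hard_negatives(user_emb, item_emb, interacted,
                               r_percent, n_neg, beta, rng=random):
    """Draw N * beta negatives at random from the union of both candidate pools."""
    size = max(1, int(len(item_emb) * r_percent / 100))
    cet = centroid([item_emb[i] for i in interacted])
    user_side = top_candidates(user_emb, item_emb, interacted, size)
    item_side = top_candidates(cet, item_emb, interacted, size)
    candidates = sorted(user_side | item_side)
    k = min(len(candidates), int(n_neg * beta))
    return rng.sample(candidates, k)


# Toy catalogue of ten 1-d items; the user interacted with items 0 and 1.
toy_items = {i: [float(i)] for i in range(10)}
negatives = sample_semi_hard_negatives([9.0], toy_items, {0, 1},
                                       r_percent=30, n_neg=4, beta=0.5,
                                       rng=random.Random(0))
```

Drawing at random from the pool, rather than taking its top entries, is what makes the negatives semi-hard and lowers the risk of picking false negatives.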

### Storage Space Cost Analysis

Let $\left| \mathcal{V} \right|$ be the total number of items, and let $\left| \mathcal{V}_{k}^{pos} \right|$ and $\left| \mathcal{V}_{k}^{neg} \right|$ be the numbers of positive items and selected negative items for user $u_{k}$. Assume the FedRec runs for $B$ global epochs and $b$\% of users are selected for model training at each global epoch, so that a user is trained $B\times b\%$ times on average over the whole training process. Let $C$ be the cost of storing one item's embedding update. As mentioned previously, for a local recommender, the cost of storing model updates is dominated by the item embeddings, so we focus on the space cost of storing item embedding updates. Without our efficient on-device update storing method, each client keeps the updates of the whole item embedding table after training.
On average, the cost of storing update logs for a user $u_{k}$ is then $b\% \times B \times \left| \mathcal{V} \right| \times C$. With the proposed Importance-based Update Selection method, the storage space cost is reduced to $b\% \times B \times \alpha(\left| \mathcal{V}_{k}^{pos} \right| + \left| \mathcal{V}_{k}^{neg} \right|)  \times C$. With the proposed efficient negative sampling method, the storage space cost for each client is further reduced to $b\% \times B \times \alpha(\left| \mathcal{V}_{k}^{pos} \right| + \beta\left| \mathcal{V}_{k}^{neg} \right|)  \times C$. In most FedRecs, negative items are sampled at a fixed ratio to the positive items. Assuming the ratio of positive to negative items is $1:n$, the storage cost of FRU for each client is $b\% \times B \times \alpha(1+ \beta n)\left| \mathcal{V}_{k}^{pos} \right|  \times C$.

Take the setting of our experiment on Steam-200k as an example, where $B=200$, $b=10$, $\alpha=0.5$, and $\beta=0.5$. The average number of interacted items $\left| \mathcal{V}_{k}^{pos} \right|$ is about $30$, and the negative sample ratio $n$ is $4$ for NCF and $1$ for LightGCN. On average, the space cost of storing model updates on each client device is therefore about $900C$ for NCF and $450C$ for LightGCN. The total number of items is $5134$. As a result, each client pays only about $17.5\%$ extra space when using NCF (about $8.75\%$ when using LightGCN) compared with the FedRec without unlearning ability.
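The arithmetic behind these figures can be reproduced with a small helper. The function name `storage_cost` is hypothetical, costs are expressed in units of $C$, and the overhead ratio assumes that the item embedding table of a FedRec without unlearning costs $\left| \mathcal{V} \right| \times C$, which is the comparison the text implies.

```python
def storage_cost(B, b_percent, alpha, beta, n, pos_items):
    """Per-client log size in units of C: b% * B * alpha * (1 + beta * n) * |V_pos|."""
    return (b_percent / 100) * B * alpha * (1 + beta * n) * pos_items


TOTAL_ITEMS = 5134  # item count of Steam-200k as quoted in the text

ncf_cost = storage_cost(B=200, b_percent=10, alpha=0.5, beta=0.5, n=4, pos_items=30)
lightgcn_cost = storage_cost(B=200, b_percent=10, alpha=0.5, beta=0.5, n=1, pos_items=30)

# Overhead relative to an item embedding table of |V| * C (assumed baseline).
ncf_overhead = ncf_cost / TOTAL_ITEMS
lightgcn_overhead = lightgcn_cost / TOTAL_ITEMS
```

With these inputs, each user takes part in $20$ rounds on average and stores $45$ (NCF) or $22.5$ (LightGCN) item updates per round, which gives $900C$ and $450C$ and overheads of roughly $17.5\%$ and $8.8\%$.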

## Unlearning with Updates Revision

When a group of users leaves the FedRec service and requests that their information be forgotten at a certain time $t$, FRU first rolls the federated recommender model back to the initial state (i.e., $t=0$) or to the state when the first of these users joined the federated training process, and then calibrates all historical model updates on the remaining clients. The basic idea is to calibrate only the direction of the historical updates while keeping their length unchanged, since the direction is what guides the model to fit the training data, which has been polluted by the removed users. Our unlearning method is based on FedEraser, an unlearning method for classification tasks. FRU differs in two ways: it calibrates updates on top of the efficient on-device update storing, since FedRecs involve far more clients than federated classification tasks, and it also revises private parameters (i.e., user embeddings), which exist only in FedRec tasks. In what follows, we present how FRU performs unlearning.
To keep the presentation simple, in this part we use $\mathcal{M}(V)_{k}^{t}$ to denote the model updates at time $t$ that are stored with our efficient on-device update storing method. We use $\mathbf{V}$ to denote the global parameters, since the item embedding table dominates them.

### Search Calibrated Direction

At the $t$-th global training round, a group of users is selected to train the FedRec; we denote this group as $\mathcal{U}_{t}$. We first remove the target users $\mathcal{U}^{'}$ who request to forget their information. Then, each remaining user $u_{k} \in \mathcal{U}_{t} \setminus \mathcal{U}^{'}$ runs $\lambda * L$ local training epochs, starting from the unlearned global model $\bar{\mathbf{V}}_{t-1}$ obtained in the previous round and the local user embedding $\mathbf{u}_{k}^{t}$, to get new global parameter updates $\hat{\mathcal{M}}(V)^{t}_{k}$ and new user embedding updates $\hat{\mathcal{M}}(U)^{t}_{k}$. Note that the users selected to recover the FedRec at round $t$ are the same as in the original training, except for the removed users.
Here, $L$ is the original number of local training epochs, and $\lambda$ is a *speed-up factor*. Far fewer local training epochs are needed because this step only has to approximate the update direction that makes the model fit the remaining data. Note that the global model $\mathbf{V}_{0}$ is the initial global recommender model on the central server, which has not yet received any update from clients. Therefore, when $t = 1$, we can obtain the unlearned global model directly by aggregating the stored updates of the remaining users: $\bar{\mathbf{V}}_{1}= \mathbf{V}_{0} + agg(\mathcal{M}(V)_{k}^{1}|_{u_{k}\notin\mathcal{U}^{'}})$.

### Aggregate and Modify Updates

After the first step, each remaining client has computed new global parameter updates $\hat{\mathcal{M}}(V)^{t}_{k}$ and user embedding updates $\hat{\mathcal{M}}(U)^{t}_{k}$.
For the global parameter updates, clients first upload their newly computed updates $\hat{\mathcal{M}}(V)^{t}_{k}$ and the stored updates $\mathcal{M}(V)_{k}^{t}$ to the central server. The central server then aggregates these updates and combines the direction of the new updates with the length of the original updates to construct the calibrated update $\bar{\mathcal{M}}(V)^{t}$, since the direction of the original updates guides the model to fit the old training data:
$$\bar{\mathcal{M}}(V)^{t} = \left\| agg(\mathcal{M}(V)_{k}^{t})|_{u_{k}\notin\mathcal{U}^{'}} \right\| \frac{agg(\hat{\mathcal{M}}(V)^{t}_{k})|_{u_{k}\notin\mathcal{U}^{'}}}{\left\| agg(\hat{\mathcal{M}}(V)^{t}_{k})|_{u_{k}\notin\mathcal{U}^{'}} \right\|}$$
where $agg(\cdot)$ is the aggregation strategy.

Then, the global parameters can be recovered at the central server based on the above calibrated updates, as follows:
$$\bar{\mathbf{V}}_{t} = \bar{\mathbf{V}}_{t-1} + \bar{\mathcal{M}}(V)^{t}$$

For user embedding updates, we directly use $\hat{\mathcal{M}}(U)^{t}_{k}$ to update the user embedding, $\bar{\mathbf{u}}_{k}^{t} = \bar{\mathbf{u}}_{k}^{t-1} + \hat{\mathcal{M}}(U)^{t}_{k}$, because user embeddings are updated much less frequently than item embeddings and the previous user embedding updates may not be reliable.

FRU repeats the above process at each global round to obtain the unlearned model.
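The round-by-round recovery can be sketched as follows. This is a simplified illustration, not the authors' implementation: global updates are flattened into single vectors, $agg(\cdot)$ is taken to be a plain mean, the norm is computed over the whole flattened vector, a zero-norm new update yields a zero step, the first-round shortcut is omitted, and `local_train` is a hypothetical callback standing in for $\lambda * L$ epochs of on-device training.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def mean_vectors(vectors):
    # agg(.) taken as a plain element-wise mean (an assumption).
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def calibrate(stored_updates, new_updates):
    """Length of the aggregated stored update, direction of the aggregated new one."""
    old_agg = mean_vectors(stored_updates)
    new_agg = mean_vectors(new_updates)
    new_norm = l2_norm(new_agg)
    if new_norm == 0.0:  # assumption: no usable direction, so apply no update
        return [0.0] * len(new_agg)
    scale = l2_norm(old_agg) / new_norm
    return [scale * x for x in new_agg]


def unlearn(v0, user_emb, rounds, stored, removed, local_train, epochs):
    """Replay the training log without the removed users.

    rounds:  list of user-id lists, the participants U_t of each original round
    stored:  stored[t][user] is the logged global update M(V)_k^t (flat list)
    local_train(user, v, u, epochs) -> (global_update, user_update); hypothetical
    epochs:  lambda * L, the shortened local training budget
    """
    v = list(v0)
    users = {k: list(u) for k, u in user_emb.items() if k not in removed}
    for t, participants in enumerate(rounds):
        remaining = [k for k in participants if k not in removed]
        if not remaining:
            continue
        new_v, old_v = [], []
        for k in remaining:
            dv, du = local_train(k, v, users[k], epochs)
            new_v.append(dv)
            old_v.append(stored[t][k])
            # User embeddings take the newly computed update directly.
            users[k] = [a + b for a, b in zip(users[k], du)]
        step = calibrate(old_v, new_v)
        v = [a + b for a, b in zip(v, step)]
    return v, users
```

A caller would pass the rolled-back global parameters as `v0` and the per-round participant lists recorded during the original training, so that the same users are replayed minus the removed ones.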

### Time Complexity Analysis

One of the key advantages of FRU is that it accelerates the reconstruction of the FedRec compared with retraining from scratch. The speed-up ratio is mainly determined by the speed-up factor $\lambda$, which lets clients run fewer local training epochs. Note that local training dominates the time cost of the whole federated unlearning process. In our experiments, we set $\lambda=0.1$, so FRU can in theory achieve up to a $10$x speedup over retraining from scratch. Empirically, FRU is $7$x faster than retraining from scratch.

## Experiments

![t1](https://drive.google.com/uc?export=view&id=1btbimOGhU-pQC4ADDMwkuq7MnLc4rGZ_)

![1](https://drive.google.com/uc?export=view&id=1KgNujTaEjS473F9HvikGqXedrTL3MWej)

![2](https://drive.google.com/uc?export=view&id=1yJzvwFswgefgz6lQS8kraKaIJl-Ku24h)

![3](https://drive.google.com/uc?export=view&id=1AfaVygf8meDhosWkekayOskWOb92VPLx)

![t1](https://drive.google.com/uc?export=view&id=1fNSjUYuCcIalwcBO8avsvpO5zoUj-CfD)

# References
- W. Yuan, H. Yin, F. Wu, S. Zhang, T. He, and H. Wang, “Federated Unlearning for On-Device Recommendation,” in Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 2023, pp. 393–401. doi: 10.1145/3539597.3570463. [[Paper](https://dl.acm.org/doi/abs/10.1145/3539597.3570463)]