# Heterogeneous Federated Knowledge Graph Embedding Learning and Unlearning

In this paper, we consider the realistic challenges in federated KG embedding and propose a novel FL framework for heterogeneous KG embedding learning and unlearning, dubbed **FedLU**. To address the data heterogeneity of multi-source KGs, we propose mutual knowledge distillation, which transfers local knowledge to the global embedding and absorbs global knowledge back into the local ones. Furthermore, to achieve knowledge forgetting, we present an unlearning method that erases specific knowledge from local embeddings and propagates the erasure to the global embedding by reusing knowledge distillation.
To validate the effectiveness of the proposed framework, we construct three new datasets based on FB15k-237 and carry out extensive experiments with varying numbers of clients.

- We propose an FL framework for KG embedding learning. In particular, we design a mutual knowledge distillation method to cope with the drift between local optimization and global convergence caused by data heterogeneity.
- Based on cognitive neuroscience, we present a novel KG embedding unlearning method, which combines retroactive interference and passive decay to achieve knowledge forgetting.
- We conduct extensive experiments on newly constructed datasets with varying numbers of clients. Experimental results show that **FedLU** outperforms state-of-the-art methods in both link prediction and knowledge forgetting.

## Framework Design

![2](https://drive.google.com/uc?export=view&id=1ezX48tTwtJQYELAkcDab_DVzMowGDsHv)

We aim to train a KG embedding model based on the FL architecture composed of a central server and a set of clients $K$. Each client $k\in K$ has a local KG $G^k=\{(h,r,t)\,|\,h,t\in E^k, r\in R^k\}$,  where $E^k,R^k$ denote the entity and relation sets, respectively. 
Local KGs have overlapping entities (i.e., $\forall i, \exists j\neq i, E^i\cap E^j\neq\emptyset$), and the whole dataset is denoted by $G=\bigcup_{k\in K}G^k$. Furthermore, we do not assume the overlap of relations.

At each communication round $t$, the server first samples $K_t\subseteq K$ to collaborate, and then distributes the corresponding local avatar $\mathbf{P}^{k}\mathbf{E}_t$ of global embedding $\mathbf{E}_t$ to each sampled client $k\in K_t$. Next, the selected client $k$ updates its local embedding $\mathbf{E}^k_{t}$ assisted by $\mathbf{P}^{k}\mathbf{E}_t$ which is received from the server. At the end of communication round $t$, the server receives the uploaded local avatars of the global embedding and aggregates them into $\mathbf{E}_{t+1}$. The objective of federated KG embedding learning is to generate a global embedding which minimizes the average local loss:
$$\mathbf{E}=\arg \min_{\mathbf{E}}\sum_{k\in K}\frac{|G^k|}{|G|}\mathcal{L}(\mathbf{P}^{k}\mathbf{E};G^k),$$
where $\mathcal{L}(\cdot)$ is the self-adversarial negative sampling loss of the embedding $\mathbf{P}^{k}\mathbf{E}$ on the local KG $G^k$.
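
The communication round and the weighted objective above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, embeddings are stored as dictionaries from entity id to vector, and the aggregation rule (a per-entity mean over the clients that uploaded that entity) is an assumption, since the text does not spell out how the server aggregates.

```python
def local_avatar(global_emb, client_entities):
    """Select the part of the global embedding owned by one client (P^k E)."""
    return {e: list(global_emb[e]) for e in client_entities if e in global_emb}


def aggregate(global_emb, uploads):
    """Assumed aggregation: per-entity mean over the uploaded avatars.

    Entities that no sampled client uploaded keep their previous vector.
    """
    new_emb = {e: list(v) for e, v in global_emb.items()}
    for e in global_emb:
        vecs = [u[e] for u in uploads if e in u]
        if vecs:
            new_emb[e] = [sum(col) / len(vecs) for col in zip(*vecs)]
    return new_emb


def weighted_objective(local_losses, kg_sizes, total_size):
    """Sum of local losses weighted by |G^k| / |G|."""
    return sum(kg_sizes[k] / total_size * local_losses[k] for k in local_losses)
```

Here `total_size` stands for $|G|$ and is passed explicitly, because $|G|$ is the size of the union of the local KGs and need not equal the sum of the local sizes.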

After several rounds of communication and training, FedLU obtains the global embedding $\mathbf{E}$ and the local embeddings $\mathbf{E}^k,k\in K$. Later, client $k$ may need to forget a triplet set $G^k_{u}\subseteq G^k$, while its complementary set $G^k_{c}=G^k\setminus G^k_{u}$ is retained. The local objective of unlearning is to optimize
$$\mathbf{E}^{k-} = \arg \min_{\mathbf{E}^k}\Big(-\frac{|G^k_{u}|}{|G^k|}\mathcal{L}(\mathbf{E}^k;G^k_{u})+\frac{|G^k_{c}|}{|G^k|}\mathcal{L}(\mathbf{E}^k;G^k_{c})\Big).$$

Under the FL setting, a subset of clients $K_{u}\subseteq K$ have their own forgetting sets $\big\{G^{k}_{u}\big\}_{k\in K_{u}}$ and the complementary sets $\big\{G^{k}_{c}\big\}_{k\in K_{u}}$. The global objective of federated unlearning is to obtain a global embedding minimizing the average local loss with unlearning:
$$\mathbf{E}^{-}=\arg\min_{\mathbf{E}}\sum\limits_{k\in K_{u}}\frac{|G^k|}{|G|}\Big(-\frac{|G^k_{u}|}{|G^k|}\mathcal{L}(\mathbf{P}^{k}\mathbf{E};G^k_{u})+\frac{|G^k_{c}|}{|G^k|}\mathcal{L}(\mathbf{P}^{k}\mathbf{E};G^k_{c})\Big).$$
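
To make the sign convention of the two unlearning objectives concrete, the following sketch evaluates them from already computed loss values. The function names and the dictionary layout are hypothetical; the only facts taken from the text are the weights and the negative sign on the forgetting term, with $|G^k|=|G^k_u|+|G^k_c|$ because the two sets are complementary.

```python
def local_unlearning_objective(loss_forget, loss_retain, n_forget, n_retain):
    """-|G_u|/|G^k| * L(E; G_u) + |G_c|/|G^k| * L(E; G_c) for one client."""
    n = n_forget + n_retain
    return -(n_forget / n) * loss_forget + (n_retain / n) * loss_retain


def global_unlearning_objective(clients, total_size):
    """Sum of local unlearning objectives over K_u, weighted by |G^k| / |G|.

    `clients` is a list of dicts with keys
    'loss_forget', 'loss_retain', 'n_forget', 'n_retain'.
    """
    total = 0.0
    for c in clients:
        n_k = c["n_forget"] + c["n_retain"]
        total += (n_k / total_size) * local_unlearning_objective(
            c["loss_forget"], c["loss_retain"], c["n_forget"], c["n_retain"]
        )
    return total
```

Minimizing this quantity raises the loss on the forgetting set while keeping the loss on the retained set low.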

As shown in Figure 2, FedLU serves as a general federated KG embedding learning and unlearning framework for various KG embedding models.
In the learning module of FedLU, we transfer knowledge by mutual knowledge distillation instead of model replacement.
In the unlearning module, we design a two-step method combining retroactive interference and passive decay, which ensures exact forgetting and performance maintenance.

### Learning in FedLU

For FL algorithms such as FedAvg, clients accept the global model as the initial states of local models for each round of local training, and upload the updated model directly for global aggregation. The global model is expected to aggregate the knowledge on each client and obtain a balanced performance. However, FedNTD observes that the global convergence and local optimization in FL may interfere with each other, which we call *drift*. The global model may easily lose optimization details of local training after aggregation. Furthermore, the local model tends to forget external knowledge contained in the initial states during training.

Existing work suggests that global and local models should be separated and traded off against each other, so that the FL framework is compatible with both local optimization and global generalization. Inspired by this, in FedLU we maintain local and global embeddings in parallel that mutually reinforce each other but are not identical. Besides, identical global and local embeddings may be used to infer knowledge in a private client, resulting in data leakage. FedLU avoids this problem by keeping the embeddings separate and communicating through mutual distillation.

Let us first dive into how to refine local embeddings with the global embedding by knowledge distillation in FedLU. Given a triplet $(h,r,t)$, we compute its local score and global score, respectively:
$$\mathcal{S}^\text{local}_{(h,r,t)}=\mathcal{S}(h,r,t;\mathbf{E}^\text{local},\mathbf{R}^\text{local}),$$
$$\mathcal{S}^\text{global}_{(h,r,t)}=\mathcal{S}(h,r,t;\mathbf{E}^\text{global},\mathbf{R}^\text{local}),$$
where $\mathcal{S}(\cdot)$ is the scoring function of the underlying KG embedding model.
$\mathbf{E}^\text{local}$ and $\mathbf{E}^\text{global}$ are entity embeddings in the local and global models, respectively, and $\mathbf{R}^\text{local}$ is the relation embedding in the local model.
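
As an illustration of the two scores, the sketch below uses a TransE-style scoring function. This choice is an assumption made only for the example: the text does not fix $\mathcal{S}$, and the margin `gamma`, the function names, and the dictionary layout are hypothetical. The point it shows is that both scores share the local relation embedding and differ only in the entity table.

```python
import math


def transe_score(h, r, t, gamma=12.0):
    """Assumed TransE-style score: gamma - ||h + r - t||_2 (higher is more plausible)."""
    return gamma - math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def local_and_global_scores(triplet, ent_local, ent_global, rel_local, gamma=12.0):
    """Score one triplet with local entities and with global entities.

    Both scores use the same local relation embedding `rel_local`.
    """
    h, r, t = triplet
    s_local = transe_score(ent_local[h], rel_local[r], ent_local[t], gamma)
    s_global = transe_score(ent_global[h], rel_local[r], ent_global[t], gamma)
    return s_local, s_global
```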
To train each triplet $(h_i,r_i,t_i)\in G^k$, we generate its negative sample set $N(h_i,r_i,t_i)=\{(h_i,r_i,t_{i,j})\,|\,j=1,\dots,n\}$ with size $n$, s.t. $N(h_i,r_i,t_i)\cap G^k=\emptyset$.
We calculate the contrastive prediction loss of $(h_i,r_i,t_i)$ as follows:
$$\mathcal{L}^\text{predict}_{(h_i,r_i,t_i)}= -\log\Big(\sigma\big(\mathcal{S}^\text{local}_{(h_i,r_i,t_i)}\big)\Big) -\quad{\sum_{(h,r,t)\in N(h_i,r_i,t_i)}}\ \frac{1}{n}\log\Big(\sigma\big(-\mathcal{S}^\text{local}_{(h,r,t)}\big)\Big),$$
where $\sigma(\cdot)$ is the sigmoid activation function.
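
The prediction loss can be written directly from the formula. The sketch below assumes the scores have already been computed; `log_sigmoid` and `prediction_loss` are illustrative names, and the numerically stable form of $\log\sigma$ is an implementation choice rather than something prescribed by the text.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def prediction_loss(pos_score, neg_scores):
    """-log sigma(S_pos) - (1/n) * sum over negatives of log sigma(-S_neg)."""
    n = len(neg_scores)
    return -log_sigmoid(pos_score) - sum(log_sigmoid(-s) for s in neg_scores) / n
```

The loss falls when the positive triplet scores higher and the negatives score lower, which is the behaviour a link predictor needs.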

To avoid the inconsistency in the optimization direction of local and global embeddings, knowledge distillation is conducted along with sample prediction. The score of a KG embedding model on a triplet characterizes the probability of being predicted as positive. So, we transfer knowledge by distilling the distribution of local and global scores on samples and their negative sets. For a triplet $(h_i,r_i,t_i)$, we compute its distillation loss for the local embedding:
$$\mathcal{L}^\text{distill}_{(h_i,r_i,t_i)}=\text{KL}\Big(\mathcal{P}^\text{local}_{(h_i,r_i,\cdot)},\mathcal{P}^\text{global}_{(h_i,r_i,\cdot)}\Big),$$
where $\text{KL}(\cdot)$ is the Kullback-Leibler divergence. $\mathcal{P}^\text{local}_{(h_i,r_i,\cdot)}$ is the local score distribution of sample $(h_i,r_i,t_i)$, generated by combining $\mathcal{P}^\text{local}_{(h_i,r_i,t_i)}$ and $\mathcal{P}^\text{local}_{(h_i,r_i,t_{i,j})}$, which are computed as follows:
$$\mathcal{P}^\text{local}_{(h_i,r_i,t_i)}=\frac{\exp\big(\mathcal{S}^\text{local}_{(h_i,r_i,t_i)}\big)}{\sum_{(h,r,t)\in N(h_i,r_i,t_i)\cup\{(h_i,r_i,t_i)\}}\exp\big(\mathcal{S}^\text{local}_{(h,r,t)}\big)},$$
$$\mathcal{P}^\text{local}_{(h_i,r_i,t_{i,j})}=\frac{\exp\big(\mathcal{S}^\text{local}_{(h_i,r_i,t_{i,j})}\big)}{\sum_{(h,r,t)\in N(h_i,r_i,t_i)\cup\{(h_i,r_i,t_i)\}}\exp\big(\mathcal{S}^\text{local}_{(h,r,t)}\big)},$$
and $\mathcal{P}^\text{global}_{(h_i,r_i,\cdot)}$ is the global score distribution of $(h_i,r_i,t_i)$ generated by combining $\mathcal{P}^\text{global}_{(h_i,r_i,t_i)}$ and $\mathcal{P}^\text{global}_{(h_i,r_i,t_{i,j})}$, which can be calculated in a similar way.
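
A small sketch of the distillation loss follows. It builds the score distribution as a softmax over the positive triplet and its $n$ negatives and then compares the local and global distributions. Two points are assumptions of this example: the argument order is read as $\text{KL}(\mathcal{P}^\text{local}\,\|\,\mathcal{P}^\text{global})$, and the small `eps` clamp is added only to avoid division by zero. All function names are hypothetical.

```python
import math


def score_distribution(pos_score, neg_scores):
    """Softmax over the positive score followed by the negative scores."""
    scores = [pos_score] + list(neg_scores)
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(local_pos, local_negs, global_pos, global_negs):
    """Assumed direction: KL(P_local || P_global) for one triplet."""
    p_local = score_distribution(local_pos, local_negs)
    p_global = score_distribution(global_pos, global_negs)
    return kl_divergence(p_local, p_global)
```

The loss is zero when both models rank the positive and its negatives with the same relative confidence, and grows as the two distributions diverge.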

Finally, the local model of client $k$ is equipped with the two jointly optimized losses as follows:
$$\mathcal{L}^\text{local} = \sum_{(h_i,r_i,t_i)\in G^k}\Big(\mathcal{L}^\text{predict}_{(h_i,r_i,t_i)}+\mu_\text{distill}\,\mathcal{L}^\text{distill}_{(h_i,r_i,t_i)}\Big),$$
where $\mu_\text{distill}$ is the parameter to adjust the degree of distillation.

The joint loss $\mathcal{L}^\text{global}$ to optimize global embedding $\mathbf{E}^\text{global}$ can be computed similarly. 
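
The joint loss is a weighted sum over the triplets of a client. The sketch below assumes the per-triplet prediction and distillation terms are already available as lists; `joint_loss` is a hypothetical name and the default value of `mu_distill` is a placeholder, not a setting reported in the text.

```python
def joint_loss(predict_losses, distill_losses, mu_distill=1.0):
    """Sum over triplets of L_predict + mu_distill * L_distill."""
    if len(predict_losses) != len(distill_losses):
        raise ValueError("one prediction and one distillation term per triplet")
    return sum(p + mu_distill * d for p, d in zip(predict_losses, distill_losses))
```

Under the reading that the global loss is obtained by swapping the roles of the two embeddings, the same function would be called with the global-side prediction terms and the reversed distillation terms.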

### Unlearning in FedLU

To address the difficulty of federated KG embedding unlearning, we resort to forgetting theories from cognitive neuroscience. Two major theories explain forgetting in cognition, namely interference and decay. The interference theory posits that forgetting occurs when memories compete and interfere with one another. The decay theory holds that memory traces fade and are eventually lost if they are not retrieved and rehearsed. We propose a two-step unlearning method for federated KG embedding.
Inspired by the interference theory, FedLU first conducts a retroactive interference step with hard and soft confusions. Then, according to the decay theory, FedLU performs a passive decay step, which can recover the performance loss caused by interference while suppressing the activation of the forgotten knowledge.

For client $k$ to perform unlearning with its local KG $G^k$, $G^k$ is split into the forgetting set $G^k_u$ and the retaining set $G^k_c$. We first introduce the retroactive interference step conducted to the local embedding. To unlearn each triplet $(h_i,r_i,t_i)\in G^k_u$, we have its negative sample set $N(h_i,r_i,t_i)=\{(h_i,r_i,t_{i,j})\,|\,j=1,\dots,n\}$ with size $n$, s.t. $N(h_i,r_i,t_i)\cap G^k=\emptyset$. We first compute the hard confusion loss to optimize the local embedding by treating $(h_i,r_i,t_i)$ as negative:
$$\mathcal{L}^\text{hard}_{(h_i,r_i,t_i)}=-\log\Big(\sigma\big(-\mathcal{S}^\text{local}_{(h_i,r_i,t_i)}\big)\Big) -\quad{\sum_{(h,r,t)\in N(h_i,r_i,t_i)}}\ \frac{1}{n}\log\Big(\sigma\big(-\mathcal{S}^\text{local}_{(h,r,t)}\big)\Big).$$
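
The hard confusion loss differs from the prediction loss only in the sign applied to the score of the triplet to forget. A minimal sketch, with hypothetical function names and precomputed scores as input:

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def hard_confusion_loss(forget_score, neg_scores):
    """Treat the triplet to forget as a negative: -log sigma(-S_forget) plus the usual negative term."""
    n = len(neg_scores)
    return -log_sigmoid(-forget_score) - sum(log_sigmoid(-s) for s in neg_scores) / n
```

Minimizing this loss pushes the score of the forgotten triplet down, the opposite of what the prediction loss does for a positive triplet.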

Unfortunately, if we blindly optimize the triplets in the forgetting set as negatives, the embeddings are likely to be polluted. In fact, such unlearning may leave traces of the forgetting set, which creates privacy risks such as membership inference attacks. So, we design soft confusion, which reflects the distance between the score of a triplet in the forgetting set and the scores of its negative samples. Minimizing the soft confusion forces the scores of the triplet and its negatives closer, which helps not only to forget the triplet but also to prevent unlearning from overfitting. The soft confusion loss for the triplet $(h_i,r_i,t_i)$ is calculated as follows:
$$\mathcal{L}^\text{soft}_{(h_i,r_i,t_i)}=\sum_{(h,r,t)\in N(h_i,r_i,t_i)}\frac{1}{n}\,\Big|\Big|\,\mathcal{S}^\text{local}_{(h,r,t)}-\mathcal{S}^\text{local}_{(h_i,r_i,t_i)}\,\Big|\Big|_{L2}.$$
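
Because each score is a scalar, the L2 norm in the soft confusion loss reduces to an absolute difference. The sketch below relies on that observation; the function name is hypothetical.

```python
def soft_confusion_loss(forget_score, neg_scores):
    """Mean distance between the forgotten triplet's score and its negatives' scores."""
    n = len(neg_scores)
    return sum(abs(s - forget_score) for s in neg_scores) / n
```

The loss reaches zero when the forgotten triplet is scored exactly like its negative samples, so it can no longer be told apart from them by score alone.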

To perform the retroactive interference while keeping the relevance between local and global embeddings, we calculate the total interference loss combining the KL divergence loss as
$$\mathcal{L}^\text{interference}_{(h_i,r_i,t_i)}=\mathcal{L}^\text{hard}_{(h_i,r_i,t_i)}+\mu_\text{soft}\,\mathcal{L}^\text{soft}_{(h_i,r_i,t_i)}+\mu_\text{distill}\,\mathcal{L}^\text{distill}_{(h_i,r_i,t_i)}.$$

The interference loss of the global embedding is calculated similarly, by changing the local scores to the global scores and reversing the distillation.
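
Putting the three terms together, the following self-contained sketch computes the interference loss for one side and obtains the other side by swapping the arguments. It is written under stated assumptions: the KL direction is student to teacher, the defaults for `mu_soft` and `mu_distill` are placeholders, and all names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def kl(p, q, eps=1e-12):
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def interference_loss(own_pos, own_negs, other_pos, other_negs,
                      mu_soft=0.1, mu_distill=1.0):
    """hard + mu_soft * soft + mu_distill * distill for one triplet to forget.

    `own_*` are the scores of the embedding being optimized,
    `other_*` are the scores of the embedding acting as teacher.
    """
    n = len(own_negs)
    hard = -log_sigmoid(-own_pos) - sum(log_sigmoid(-s) for s in own_negs) / n
    soft = sum(abs(s - own_pos) for s in own_negs) / n
    distill = kl(softmax([own_pos] + list(own_negs)),
                 softmax([other_pos] + list(other_negs)))
    return hard + mu_soft * soft + mu_distill * distill
```

For the local embedding the call would be `interference_loss(local_pos, local_negs, global_pos, global_negs)`; for the global embedding the two pairs of arguments are swapped.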

After the retroactive interference step, the memories of the global and local embeddings on the forgetting set are erased. However, model performance can decrease significantly. On one hand, the optimization in retroactive interference is limited to the specific triplets to forget, which damages the generalization of the embedding model. On the other hand, unlearning a specific triplet affects the triplets associated with it. So, we perform a passive decay step afterwards by mutual knowledge distillation with $G^k_c$ as input. $\mathbf{E}^{k,\text{global}}$ and $\mathbf{E}^{k,\text{local}}$ are optimized alternately in batches, each acting as the teacher model for the other. Learning on the retaining set recovers the generalization of the model, while mutual distillation suppresses the activation of the forgotten triplets.
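
The alternation in the passive decay step can be sketched as a simple schedule. The example abstracts the actual gradient updates behind two callbacks, `step_local` and `step_global`, which are hypothetical names; alternating once per batch is one possible reading of "optimized alternately in batches".

```python
def passive_decay(retain_batches, step_local, step_global, epochs=1):
    """Alternate local and global updates over batches of the retaining set."""
    for _ in range(epochs):
        for batch in retain_batches:
            # Local embedding is the student, global embedding is the frozen teacher.
            step_local(batch)
            # Roles swap: global embedding is the student, local one is the teacher.
            step_global(batch)
```

In a real training loop each callback would compute the joint prediction and distillation loss on the batch and apply one optimizer step to its own embedding only.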

## Experiments

![3](https://drive.google.com/uc?export=view&id=1siQPtMEpvdWKEoCTFe4Ba67L92X8azPK)

![t](https://drive.google.com/uc?export=view&id=19Y9LAvI05dVgd7dpdayPAo5HoV8JzqbZ)

![tf](https://drive.google.com/uc?export=view&id=15HyayoTHQfuxPqj0teXB5XvZS9Ml0lw_)

# References
- X. Zhu, G. Li, and W. Hu, Heterogeneous Federated Knowledge Graph Embedding Learning and Unlearning. arXiv, 2023. doi: 10.48550/ARXIV.2302.02069 [[Paper](https://arxiv.org/abs/2302.02069)]
