# Machine Unlearning

This notebook gives you a brief introduction to machine unlearning.

## Background

Today, we put a lot of data into [artificial intelligence (AI)](https://en.wikipedia.org/wiki/Artificial_intelligence), especially for the training of [machine learning (ML)](https://en.wikipedia.org/wiki/Machine_learning) models. Some of this data, though, is personal and private in nature, such as your location, biometrics and medical records. It is natural that we want our data to be used and stored securely.

At the same time, like other types of systems, ML systems are prone to [attacks](https://en.wikipedia.org/wiki/Adversarial_machine_learning#Specific_attack_types), such as membership inference, property inference, model stealing and data poisoning. These attacks can lead to problems such as data leakage and wrong predictions. The same problems can also occur without an adversarial third party, for example because of poor data quality or system design. If you know something about ML, you'll know that ML models can sometimes "memorize" instead of "learn from" data, often as a result of [overfitting](https://en.wikipedia.org/wiki/Overfitting). That is dangerous, because attacks that extract information from a model become much easier to run against a model that has memorized its training data, and that data may well be leaked. Thus, it makes sense for us to want to delete some of the data, or erase some of the training, from a trained model.

Deleting data can be as easy as a single SQL statement in some systems, but it is not that straightforward in ML models. In an ML model, the connection between parameters and data is not explicit, so we cannot easily trace a single data point's influence through the training process. It is therefore difficult to remove the information relating to a single data point from a trained ML model. In other words, it is challenging to induce targeted "memory loss" in ML models, or to make a model "forget" that it was ever trained on specific data.

## Definition

The process of deleting certain data point(s) from ML systems is called **machine unlearning**, or simply **unlearning**, a term first proposed by [Cao and Yang](https://ieeexplore.ieee.org/document/7163042) in 2015.

The obvious solution is to retrain the ML model from scratch, using all the data points except the ones we want to erase. This ensures that the targeted data points are no longer in the ML system, but the process is very slow. Ideally, an unlearning paradigm should therefore be reasonably efficient.
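The retraining baseline can be sketched in a few lines. This is only a minimal illustration, not a method from the literature: the least-squares model stands in for any training procedure, and the names `train`, `unlearn_by_retraining` and `forget_idx`, as well as the synthetic data, are assumptions made for this example.

```python
import numpy as np

def train(X, y):
    """Fit a least-squares linear model (a stand-in for any training procedure)."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def unlearn_by_retraining(X, y, forget_idx):
    """Naive unlearning: drop the points to forget, then retrain from scratch."""
    keep = np.ones(len(X), dtype=bool)
    keep[forget_idx] = False
    return train(X[keep], y[keep])

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

forget_idx = [3, 17, 42]          # hypothetical points a user asked us to erase
w_full = train(X, y)              # model trained on everything
w_unlearned = unlearn_by_retraining(X, y, forget_idx)

print("trained on all data:   ", w_full)
print("retrained without them:", w_unlearned)
```

The retrained model never sees the forgotten points, so nothing about them can remain in its parameters. The cost is a full training run for every deletion request, which is cheap here but can be prohibitive for large models.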

We will talk about more aspects of unlearning definitions in a later notebook.

![ul](https://drive.google.com/uc?export=view&id=1YwGfL9V-YxwCMuRNXeQHyIN8kDssENMc)

## Reasons

### Security

As mentioned earlier, ML systems are prone to attacks. Unlearning can sometimes be used to fix a malfunctioning ML model. For example, a model poisoned with injected bad data can be fixed by unlearning that data. This is particularly important when the prediction outputs of the model have significant ramifications, as with ML models for healthcare, finance or military purposes.

### Privacy

Regulation of private data has become increasingly stringent in multiple jurisdictions. The [right to be forgotten](https://en.wikipedia.org/wiki/Right_to_be_forgotten) (RTBF; sometimes called the *right to erasure*) has been established by [GDPR](https://en.wikipedia.org/wiki/General_Data_Protection_Regulation) in the EU and the UK, and by [CCPA](https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act) in California. Under it, a consumer may request that data holders delete data relating to them. It is worth noting that in Hong Kong, [PDPO](https://www.elegislation.gov.hk/hk/cap486!en-zh-Hant-HK?INDEX_CS=N) does not require data holders to delete data purely on the basis of RTBF (**X v Privacy Commissioner for Personal Data** ([Appeal No. 15/2019](https://www.pcpd.org.hk/english/enforcement/decisions/files/AAB_15_2019.pdf)); [Deacons](https://www.deacons.com/2021/02/05/territorial-limitation-of-data-protection-law-and-the-right-to-be-forgotten/) represented Google LLC). RTBF also does not exist in [PIPL](https://en.wikipedia.org/wiki/Personal_Information_Protection_Law_of_the_People's_Republic_of_China) of China.

### Usability

User experience can be enhanced with unlearning. For example, a company operating a search engine and an online advertisement service may use ML with a user's search history to customize its ad service for that user. If the user lends their device to a friend and the friend searches for things that the user does not like, the company's ad service might push ads based on those searches to the user. Unlearning those searches would let the service stop tailoring ads to them.

### Fidelity

Bias in ML systems often stems from bias in the data. For example, when a set of facial images collected mostly from fair-skinned people is fed into an ML model, that model is likely to perform worse on dark-skinned people's facial images. Similarly, when features that need to be kept out of consideration are mistakenly included, the model may produce unwanted predictions. In such cases, one solution is to make the model unlearn certain data.

## Challenges

### Stochasticity of training

Training involves a large amount of randomness, which makes it difficult to keep track of a given data point's influence on the model. This is especially true for complex models such as deep neural networks (DNNs). The order in which the data is used for training may also be random. This stochasticity makes unlearning a specific data point challenging.
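The effect of random data order can be seen even in a tiny model. The sketch below is only an illustration under stated assumptions: it uses plain per-sample SGD on a linear model with synthetic data, and the function `sgd_train` and its parameters are hypothetical names introduced for this example. The initialization and the data are identical across runs, so the shuffling seed is the only source of randomness.

```python
import numpy as np

def sgd_train(X, y, shuffle_seed, epochs=5, lr=0.02):
    """Per-sample SGD on squared error; only the data order is random."""
    order_rng = np.random.default_rng(shuffle_seed)
    w = np.zeros(X.shape[1])  # identical initialization for every run
    for _ in range(epochs):
        for i in order_rng.permutation(len(X)):
            w -= lr * (X[i] @ w - y[i]) * X[i]
    return w

# Synthetic data, purely for illustration.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * data_rng.normal(size=100)

w_a = sgd_train(X, y, shuffle_seed=1)
w_b = sgd_train(X, y, shuffle_seed=2)        # same data, different order
w_a_again = sgd_train(X, y, shuffle_seed=1)  # same data, same order

print("order 1:      ", w_a)
print("order 2:      ", w_b)
print("order 1 again:", w_a_again)
```

The two shuffling orders end at different parameters even though the training set is the same, while repeating the same order reproduces the same parameters. In a DNN, random initialization, dropout and similar mechanisms add further sources of randomness on top of this.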

### Incrementality of training

The training procedure of a model is incremental. A data point's influence on the model depends on the data fed into the model before it, and in turn affects the influence of the data that comes after it. Hence, it is challenging to isolate a single data point's influence from the model as a whole.
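One way to see this is to try to "undo" a data point by subtracting the update it caused, and compare the result with a model that never saw the point. The following is a minimal sketch, assuming one pass of per-sample SGD on a linear model with synthetic data. The function `sgd_one_pass` and the variable names are hypothetical and chosen for this example only.

```python
import numpy as np

def sgd_one_pass(X, y, lr=0.02):
    """One pass of per-sample SGD on squared error, recording every update."""
    w = np.zeros(X.shape[1])
    steps = []
    for x_i, y_i in zip(X, y):
        step = -lr * (x_i @ w - y_i) * x_i  # depends on w, i.e. on all earlier points
        steps.append(step)
        w = w + step
    return w, steps

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.5 * rng.normal(size=50)

w_full, steps = sgd_one_pass(X, y)

# Naive attempt: subtract the update that the first data point produced.
w_naive = w_full - steps[0]

# Reference: a model that never saw the first data point.
w_retrained, _ = sgd_one_pass(X[1:], y[1:])

gap = np.linalg.norm(w_naive - w_retrained)
print("naive removal:", w_naive)
print("retrained:    ", w_retrained)
print("gap:          ", gap)
```

The gap is non-zero because every later update was computed from parameters that already contained the first point's contribution. Removing the recorded update alone does not remove that indirect influence.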

### Catastrophic unlearning

From past research, we know that an unlearnt model performs worse on the remaining data than a model retrained from scratch. This degradation of performance can be exponential as more data is unlearnt. Such a sudden drop in performance is called catastrophic unlearning. While how to prevent catastrophic unlearning naturally is still an open question, some loss functions have been designed to mitigate the problem.

## References

- T. T. Nguyen, T. T. Huynh, P. L. Nguyen, A. W.-C. Liew, H. Yin, and Q. V. H. Nguyen, “A Survey of Machine Unlearning,” arXiv:2209.02299, 2022. [[Paper](https://arxiv.org/abs/2209.02299)]

- Q.-V. Dang, “Right to Be Forgotten in the Age of Machine Learning,” in Advances in Digital Science, 2021, pp. 403–411. [[Paper](https://link.springer.com/chapter/10.1007/978-3-030-71782-7_35)]

- N. Pitropakis, E. Panaousis, T. Giannetsos, E. Anastasiadis, and G. Loukas, “A taxonomy and survey of attacks against machine learning,” Computer Science Review, vol. 34, p. 100199, 2019. [[Paper](https://www.sciencedirect.com/science/article/abs/pii/S1574013718303289)]

- R. Shwartz-Ziv and N. Tishby, “Opening the Black Box of Deep Neural Networks via Information,” arXiv:1703.00810, 2017. [[Paper](https://arxiv.org/abs/1703.00810)]

- Y. Cao and J. Yang, “Towards Making Systems Forget with Machine Unlearning,” in 2015 IEEE Symposium on Security and Privacy, 2015, pp. 463–480. [[Paper](https://ieeexplore.ieee.org/document/7163042)] [[Video](https://www.youtube.com/watch?v=sUgIS6a665k)]

- L. Bourtoule et al., “Machine Unlearning,” in 2021 IEEE Symposium on Security and Privacy (SP), 2021, pp. 141–159. [[Paper](https://ieeexplore.ieee.org/document/9519428)] [[Video](https://www.youtube.com/watch?v=xUnMkCB0Gns)]