To achieve machine unlearning without retraining on the filtered dataset, we can modify the weights of the neural network directly so that it forgets class 2. One way to do this is to move the output-layer weights for class 2 closer to those of class 0 (or any other class we want it to resemble). The simpler variant used in the code below sets the output-layer weights and bias for class 2 to zero.

In [19]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

# Load MNIST dataset
(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Normalize pixel values
X_train = X_train / 255.0
X_test = X_test / 255.0

# Convert labels to one-hot encoding
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Define and train a model on all classes
model = Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=5, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy on test data before unlearning class 2:", accuracy)
from sklearn.metrics import confusion_matrix
y_pred = model.predict(X_test)
conf_matrix = confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
print("Confusion matrix:")
print(conf_matrix)

# Define the class to forget
class_to_forget = 2

# Modify the output layer so the model forgets class 2.
# get_weights() returns [kernel, bias]: kernel has shape (128, 10), bias has shape (10,)
weights = model.layers[-1].get_weights()
weights[0][:, class_to_forget] = 0.0  # Zero the kernel column that feeds the class 2 output
weights[1][class_to_forget] = 0.0  # Zero the bias of the class 2 output
model.layers[-1].set_weights(weights)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print("Accuracy on test data after unlearning class 2:", accuracy)
y_pred = model.predict(X_test)
conf_matrix = confusion_matrix(y_test.argmax(axis=1), y_pred.argmax(axis=1))
print("Confusion matrix:")
print(conf_matrix)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy on test data before unlearning class 2: 0.9779999852180481
Confusion matrix:
[[ 972    0    1    1    1    1    2    1    1    0]
 [   1 1124    2    0    0    1    2    1    4    0]
 [  10    3 1010    1    0    0    2    4    2    0]
 [   2    0    3  998    0    4    0    1    2    0]
 [   2    0    5    0  954    1    3    2    2   13]
 [   4    2    0    5    1  869    4    2    3    2]
 [   5    3    3    1    6    3  934    0    3    0]
 [   2    2    9    1    3    0    0 1006    1    4]
 [  10    0    6    9    3    3    1    4  935    3]
 [   5    5    0    5    5    4    1    6    0  978]]
Accuracy on test data after unlearning class 2: 0.8913999795913696
Confusion matrix:
[[ 972    0    0    1    1    1    2    1    2    0]
 [   1 1125    0    1    0    1    2    1    4    0]
 [  84  240  131  291   14    0   44  110  116    2]
 [   2    0    0  999    0    4    0    3    2    0]
 [   3    0    0    0  955    1    5

In this code:
1. We define and train a neural network model on all classes of the MNIST dataset.
2. We identify the class we want to forget, which in this case is class 2.
3. We modify the weights of the output layer corresponding to class 2 to be zeros, effectively "forgetting" about class 2.
4. We evaluate the accuracy of the modified model on the test data to see how well it performs after unlearning class 2.
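
One detail of step 3 is worth a small sketch. Zeroing the kernel column and bias fixes the class 2 logit at exactly 0 for every input; it does not make it very negative. Whenever all the other logits are below 0, class 2 still wins the argmax. This is consistent with the second confusion matrix, where 131 class 2 images are still classified as 2. The logit values below are made up for illustration, and the large negative bias is one possible alternative, not something the original code does.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D array of logits
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for a single input (assumption, not taken from the trained model).
# Before any edit, class 2 has the largest logit.
logits = np.array([-3.0, -1.5, 4.0, -0.5, -2.0, -4.0, -1.0, -0.8, -0.6, -2.5])
class_to_forget = 2

# Zeroed kernel column and bias: the class 2 logit is always exactly 0
zeroed = logits.copy()
zeroed[class_to_forget] = 0.0

# Alternative (assumption): a very negative bias, so class 2 can never win
masked = logits.copy()
masked[class_to_forget] = -1e9

pred_zeroed = int(softmax(zeroed).argmax())
pred_masked = int(softmax(masked).argmax())
prob_masked = float(softmax(masked)[class_to_forget])

print("prediction with zeroed logit:", pred_zeroed)
print("prediction with masked logit:", pred_masked)
print("class 2 probability with masked logit:", prob_masked)
```

In the Keras model this alternative would correspond to setting `weights[1][class_to_forget]` to a large negative number instead of 0. Either way, only the output layer is touched.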

Although it worked, I don't think it is the correct mechanism, because zeroing the weights for the forget class is applied only to the output layer. What about the hidden layers of the network? They still hold weights that were learned from the forget class.
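
The concern about the hidden layers can be checked with a linear probe: keep the hidden layer fixed, fit a fresh "forget class vs. rest" readout on its activations, and see whether the class is still recoverable. The sketch below is a NumPy-only toy under loud assumptions: three Gaussian blobs stand in for MNIST, a fixed random ReLU layer stands in for the trained hidden layer, and the names `hidden` and `probe_accuracy` are hypothetical. It illustrates the idea; it is not a measurement on the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the dataset (assumption): three well-separated 2-D blobs
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((200, 2)) for c in centers])
y = np.repeat(np.arange(3), 200)
class_to_forget = 2

# Stand-in for the hidden layer (assumption): fixed random ReLU units
W_h = rng.standard_normal((2, 32))
b_h = rng.standard_normal(32)

def hidden(X):
    # Hidden activations plus a constant column that acts as the probe's bias
    H = np.maximum(X @ W_h + b_h, 0.0)
    return np.hstack([H, np.ones((len(H), 1))])

def probe_accuracy(H, is_forget):
    # Least-squares linear probe: target +1 for the forget class, -1 otherwise
    t = np.where(is_forget, 1.0, -1.0)
    w, *_ = np.linalg.lstsq(H, t, rcond=None)
    return float(((H @ w > 0) == is_forget).mean())

# Editing the output layer never changes W_h or b_h, so these activations are
# exactly what the network computes both before and after the edit.
H = hidden(X)
probe_acc = probe_accuracy(H, y == class_to_forget)
print("probe accuracy for 'forget class vs. rest' on hidden features:", probe_acc)
```

A high probe accuracy means the hidden representation still separates the forget class from the rest, so a new output layer could relearn it from those features. For the Keras model, the same check would use the activations of the `Dense(128)` layer in place of `hidden(X)`.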

I am thinking in terms of the progressive learning setup. 