## Task 1: Setup the dataset

- Load the MNIST dataset using the Hugging Face datasets library.
- Convert the image data into Numpy arrays and normalize pixel values to the range [0,1].
- Flatten each image into a vector of 784 features.
- Split the dataset into training and testing sets.
- Randomly select an initially labeled dataset of 200 samples from training samples.
- Generate an "Unlabeled Pool," the Initial Dataset excluding 200 samples.

In [59]:
import datasets
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt

In [60]:
data = datasets.load_dataset("mnist")

In [61]:
# convert the image data into a numpy array and normalize the values from 0 to 1
X = np.array(data['train']["image"]) / 255
y = np.array(data['train']["label"])


In [62]:
X = X.reshape(X.shape[0], -1)
X.shape

(60000, 784)

In [63]:
X_test = np.array(data['test']["image"]) / 255
y_test = np.array(data['test']["label"])

# flatten the test data
X_test = X_test.reshape(X_test.shape[0], -1)

In [64]:
# randomly select 200 samples from training dataset and create a labelled dataset
np.random.seed(45)   
idx = np.random.choice(X.shape[0], 200, replace=False)
X_train_labelled = X[idx]
y_train_labelled = y[idx]

In [65]:
# create a pool of unlabelled data
X_train_unlabelled = np.delete(X, idx, axis=0)
y_train_unlabelled = np.delete(y, idx, axis=0)

X_train_unlabelled.shape


(59800, 784)

## Task 2: Implement Random Sampling for Active Learning


- Train a Random Forest Classifier (you can use “from sklearn.ensemble import RandomForestClassifier”)  on the initial dataset of 200 samples.
- Implement an active learning loop for 20 iterations:
    - Randomly select a sample from the unlabeled pool.
    - Get the selected sample and its true label.
    - Add the sample and label to the labeled dataset.
    - Remove the selected sample and label from the pool.
    - Retrain the model on the updated dataset.
    - Check the model's accuracy on the test set.
    - Print accuracy after every iteration.


In [68]:
from sklearn.ensemble import RandomForestClassifier

# train a random forest classifier on the labelled data
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train_labelled, y_train_labelled)

# predict the unlabelled data
y_prediction = clf.predict(X_test)

# calculate the accuracy of the classifier
accuracy = np.mean(y_test == y_prediction)
accuracy

0.7832

In [69]:
# implementing active learning by for 20 iterations by randomly choosing the samples from the unlabelled pool.
task1_results = []
for i in range(20):
    idx_curr = np.random.choice(X_train_unlabelled.shape[0], 1, replace=False) # randomly select 1 sample
    
    # add the sample to the labelled dataset
    X_train_labelled = np.concatenate([X_train_labelled, X_train_unlabelled[idx_curr]])
    y_train_labelled = np.concatenate([y_train_labelled, y_train_unlabelled[idx_curr]])

    # remove the sample from the unlabelled dataset
    X_train_unlabelled = np.delete(X_train_unlabelled, idx_curr, axis=0)
    y_train_unlabelled = np.delete(y_train_unlabelled, idx_curr, axis=0)

    # retrain the classifier
    clf.fit(X_train_labelled, y_train_labelled)

    # predict the unlabelled data
    y_prediction= clf.predict(X_test)

    # calculate the accuracy of the classifier
    accuracy = np.mean(y_test == y_prediction)
    task1_results.append(accuracy)
    print(f"Iteration: {i} Accuracy: {accuracy}")


Iteration: 0 Accuracy: 0.7853
Iteration: 1 Accuracy: 0.7874
Iteration: 2 Accuracy: 0.7861
Iteration: 3 Accuracy: 0.7812
Iteration: 4 Accuracy: 0.7773
Iteration: 5 Accuracy: 0.7837
Iteration: 6 Accuracy: 0.7889
Iteration: 7 Accuracy: 0.7809
Iteration: 8 Accuracy: 0.7792
Iteration: 9 Accuracy: 0.7923
Iteration: 10 Accuracy: 0.8031
Iteration: 11 Accuracy: 0.7882
Iteration: 12 Accuracy: 0.7897
Iteration: 13 Accuracy: 0.7946
Iteration: 14 Accuracy: 0.8058
Iteration: 15 Accuracy: 0.7948
Iteration: 16 Accuracy: 0.7958
Iteration: 17 Accuracy: 0.7903
Iteration: 18 Accuracy: 0.7905
Iteration: 19 Accuracy: 0.801


## Task 3: Implement Uncertainty Sampling for Active Learning.

- Train a Random Forest Classifier (you can use “from sklearn.ensemble import RandomForestClassifier”)  on the initial dataset of 200 samples.
- Implement an active learning loop for 20 iterations:
    - Compute uncertainty (Label Entropy) for each sample in the unlabeled pool using entropy.
    - Select the sample with the highest uncertainty and query its true label.
    - Add the queried sample to the labelled dataset and remove it from the unlabelled pool.
    - Retrain the model and check the model's accuracy on the test set.
    - Print accuracy after every iteration


In [70]:
np.random.seed(45)   
idx = np.random.choice(X.shape[0], 200, replace=False)
X_train_labelled = X[idx]
y_train_labelled = y[idx]

X_train_unlabelled=np.delete(X,idx,axis=0)
y_train_unlabelled=np.delete(y,idx,axis=0)


In [71]:
X_train_unlabelled.shape
# Y_train_unlabelled.shape


(59800, 784)

In [72]:
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train_labelled,y_train_labelled)
y_prediction=clf.predict(X_test)
accuracy=np.mean(y_test==y_prediction)
accuracy

0.7777

In [86]:
def labelentropy(probability):
    max_entropy=float('-inf')
    epsilon=1e-10
    maxind=-1
    for i in range(len(probability)):
        ent=np.sum(probability[i]*np.log10(probability[i]+epsilon))
        if(ent>max_entropy):
            max_entropy=ent
            maxind=i
    return [maxind]

In [78]:
X_train_unlabelled[0].shape

(784,)

In [87]:
task2_results=[]
for i in range(20):
    probability=clf.predict_proba(X_train_unlabelled)
    idx_curr=labelentropy(probability)
    
    # add the sample to the labelled dataset
    X_train_labelled = np.concatenate([X_train_labelled, X_train_unlabelled[idx_curr]])
    y_train_labelled = np.concatenate([y_train_labelled, y_train_unlabelled[idx_curr]])

    # remove the sample from the unlabelled dataset
    X_train_unlabelled = np.delete(X_train_unlabelled, idx_curr, axis=0)
    y_train_unlabelled = np.delete(y_train_unlabelled, idx_curr, axis=0)

    # retrain the classifier
    clf.fit(X_train_labelled, y_train_labelled)

    # predict the unlabelled data
    y_prediction= clf.predict(X_test)

    # calculate the accuracy of the classifier
    accuracy = np.mean(y_test == y_prediction)
    task2_results.append(accuracy)
    print(f"Iteration: {i} Accuracy: {accuracy}")



Iteration: 0 Accuracy: 0.7699
Iteration: 1 Accuracy: 0.7744
Iteration: 2 Accuracy: 0.7807
Iteration: 3 Accuracy: 0.7801
Iteration: 4 Accuracy: 0.7611
Iteration: 5 Accuracy: 0.7688
Iteration: 6 Accuracy: 0.7837
Iteration: 7 Accuracy: 0.7731
Iteration: 8 Accuracy: 0.7772
Iteration: 9 Accuracy: 0.7812
Iteration: 10 Accuracy: 0.7809
Iteration: 11 Accuracy: 0.7691
Iteration: 12 Accuracy: 0.7674
Iteration: 13 Accuracy: 0.7763
Iteration: 14 Accuracy: 0.7677
Iteration: 15 Accuracy: 0.7733
Iteration: 16 Accuracy: 0.7766
Iteration: 17 Accuracy: 0.7874
Iteration: 18 Accuracy: 0.7742
Iteration: 19 Accuracy: 0.7797


## Task 4: Implement Query-by-Committee for Active Learning 

In [93]:
np.random.seed(45)   
idx = np.random.choice(X.shape[0], 200, replace=False)
X_train_labelled = X[idx]
y_train_labelled = y[idx]

X_train_unlabelled=np.delete(X,idx,axis=0)
y_train_unlabelled=np.delete(y,idx,axis=0)

In [94]:
clf=RandomForestClassifier(n_estimators=100)
clf.fit(X_train_labelled,y_train_labelled)
y_prediction=clf.predict(X_test)


In [95]:
clf1=RandomForestClassifier(n_estimators=100)
clf1.fit(X_train_labelled,y_train_labelled)
y_prediction=clf1.predict(X_test)


In [96]:
clf2=RandomForestClassifier(n_estimators=100)
clf2.fit(X_train_labelled,y_train_labelled)
y_prediction=clf2.predict(X_test)


In [97]:
clf3=RandomForestClassifier(n_estimators=100)
clf3.fit(X_train_labelled,y_train_labelled)
y_prediction=clf3.predict(X_test)


In [103]:
clf4=RandomForestClassifier(n_estimators=100)
clf4.fit(X_train_labelled,y_train_labelled)
y_prediction=clf4.predict(X_test)


In [105]:
print(len(X_train_unlabelled))

59800


In [115]:
def voteentropy(X_train_unlabelled, clf, clf1, clf2, clf3, clf4):
    # Stack the predictions from all classifiers into a 2D array
    predictions = np.array([
        clf.predict(X_train_unlabelled),
        clf1.predict(X_train_unlabelled),
        clf2.predict(X_train_unlabelled),
        clf3.predict(X_train_unlabelled),
        clf4.predict(X_train_unlabelled)
    ])

    # Transpose predictions to have shape (5, n_samples)
    predictions = predictions.T  # Now shape is (n_samples, 5)

    # Initialize an empty array for storing probabilities
    prob = np.zeros((predictions.shape[0], 10))

    # Add 0.2 for each prediction in the corresponding class
    for i in range(predictions.shape[0]):
        np.add.at(prob[i], predictions[i], 0.2)

    return prob


In [116]:
task4_results=[]
for i in range(20):
    probability=voteentropy(X_train_unlabelled,clf,clf1,clf2,clf3,clf4)
    idx_curr=labelentropy(probability)
    
    # add the sample to the labelled dataset
    X_train_labelled = np.concatenate([X_train_labelled, X_train_unlabelled[idx_curr]])
    y_train_labelled = np.concatenate([y_train_labelled, y_train_unlabelled[idx_curr]])

    # remove the sample from the unlabelled dataset
    X_train_unlabelled = np.delete(X_train_unlabelled, idx_curr, axis=0)
    y_train_unlabelled = np.delete(y_train_unlabelled, idx_curr, axis=0)

    # retrain the classifier
    clf.fit(X_train_labelled, y_train_labelled)

    # predict the unlabelled data
    y_prediction= clf.predict(X_test)

    # calculate the accuracy of the classifier
    accuracy = np.mean(y_test == y_prediction)
    task4_results.append(accuracy)
    print(f"Iteration: {i} Accuracy: {accuracy}")
    

Iteration: 0 Accuracy: 0.793
Iteration: 1 Accuracy: 0.7855
Iteration: 2 Accuracy: 0.7936
Iteration: 3 Accuracy: 0.7962
Iteration: 4 Accuracy: 0.7953
Iteration: 5 Accuracy: 0.7865
Iteration: 6 Accuracy: 0.7866
Iteration: 7 Accuracy: 0.8018
Iteration: 8 Accuracy: 0.789
Iteration: 9 Accuracy: 0.7871
Iteration: 10 Accuracy: 0.8024
Iteration: 11 Accuracy: 0.7936
Iteration: 12 Accuracy: 0.7929
Iteration: 13 Accuracy: 0.7934
Iteration: 14 Accuracy: 0.7932
Iteration: 15 Accuracy: 0.794
Iteration: 16 Accuracy: 0.8002
Iteration: 17 Accuracy: 0.807
Iteration: 18 Accuracy: 0.791
Iteration: 19 Accuracy: 0.798
