<a href="https://colab.research.google.com/github/tidaltamu/workshops/blob/main/special_topics/workshop1/code.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

What is active learning?


Active learning is an ML technique in which we start from a small portion of labelled data and then interactively and continuously label new data points to improve the performance of the model.

- Train = Labelled data points
- Pool = Unlabelled data points
- Oracle = Human annotator
- Uncertainty-based sampling = Getting human feedback when the model is uncertain (low confidence)



We create a model and train it on the labelled data (we will label only a small sample, but enough to reach high performance). Then we go through all the data points in the pool, identify the points that are most confusing for the classifier, have the oracle label them, and add them to the training data. We repeat this until the model's performance is high. Active learning is mainly used when the labelling cost is high, that is, when annotating all the data points is not feasible.
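
The loop described above can be sketched in a few lines. This is a minimal illustration, not the notebook's own implementation: it uses synthetic data from `make_blobs` in place of a real pool, a hypothetical budget of 10 queries, and simply looks up the true label where a human oracle would normally be asked. The names `X_demo`, `y_demo`, `train_idx`, `pool_idx` and `n_queries` are made up for this sketch.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Synthetic two-class data stands in for a real pool (assumption for illustration)
X_demo, y_demo = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

# Seed the labelled set with five points from each class
train_idx = list(np.where(y_demo == 0)[0][:5]) + list(np.where(y_demo == 1)[0][:5])
pool_idx = [i for i in range(len(X_demo)) if i not in train_idx]

n_queries = 10  # hypothetical labelling budget
for _ in range(n_queries):
    clf_demo = LinearSVC(max_iter=10000).fit(X_demo[train_idx], y_demo[train_idx])
    # Uncertainty: the smallest absolute decision value is closest to the boundary
    scores = np.abs(clf_demo.decision_function(X_demo[pool_idx]))
    query = pool_idx[int(np.argmin(scores))]
    # The oracle would label this point; here the label is already stored in y_demo
    pool_idx.remove(query)
    train_idx.append(query)

print(len(train_idx), len(pool_idx))  # 20 180
```

Each pass retrains on everything labelled so far, so the query always reflects the current model rather than the initial one.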

In [None]:
from sklearn.svm import SVC, LinearSVC
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import imageio as io
import os
from sklearn import datasets

In [None]:
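# The UCI file has no header row; the column names are assumed from the keys used later.
cols = ['sepallength', 'sepalwidth', 'petallength', 'petalwidth', 'class']
origdata = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", header=None, names=cols)
# origdata = pd.read_csv("Iris.csv")  # local copy, if one is available
origdata[:10]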

This dataset contains data about 3 species/subspecies of the Iris flower.

In [None]:
k1, k2 = 'petallength', 'petalwidth'
data = origdata[[k1, k2, 'class']].copy()
data[:10]

In [None]:
X = data[[k1, k2]]
y = data['class']
print('Classes:')
print(y.unique(), '\n\n\n')

# Map species names to integer labels without modifying `data` in place
y = y.map({'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2})

We plot the samples of versicolor and virginica on a 2D graph with versicolor in red and virginica in cyan.

In [None]:
plt.figure()
setosa = y == 0
versicolor = y == 1
virginica = y == 2

plt.scatter(X[k1][versicolor], X[k2][versicolor], c='r')
plt.scatter(X[k1][virginica], X[k2][virginica], c='c')
plt.xlabel(k1)
plt.ylabel(k2)
plt.show()

In [None]:
X1 = X[y != 0]
y1 = y[y != 0]
X1[:5]

In [None]:
X1 = X1.reset_index(drop=True)
y1 = y1.reset_index(drop=True)
y1 -= 1
print(y1.unique())
X1[:5]

In [None]:
fig = plt.figure()

plt.scatter(X1[k1][y1==0], X1[k2][y1==0], c='r')
plt.scatter(X1[k1][y1==1], X1[k2][y1==1], c='c')

plt.xlabel(k1)
plt.ylabel(k2)
fig.savefig('main.jpg', dpi=100)
plt.show() 

Linear SVM kernel
- A linear kernel is used when the data is linearly separable, that is, when the classes can be separated by a single line.
- The main idea is that, based on the labelled data (training data), the algorithm tries to find the optimal hyperplane that can be used to classify new data points. In two dimensions the hyperplane is simply a line.
- The training points closest to the boundary are called support vectors. Based on these support vectors, the algorithm tries to find the hyperplane that best separates the classes.

In [None]:
y1 = y1.astype(dtype=np.uint8)
clf0 = LinearSVC()
clf0.fit(X1, y1)
print(clf0.coef_)
print(clf0.intercept_)

- Formula for reference
- a*x + b*y + c = 0
- y = -(a*x + c)/b
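
As a quick check of the rearrangement above, here is a small sketch that computes y for a few x values and confirms the points lie on the line. The coefficients are hypothetical values chosen for illustration; in the notebook, a and b come from `coef_` and c from `intercept_`. The helper `boundary_y` is not part of the notebook.

```python
def boundary_y(a, b, c, x):
    """Return y on the line a*x + b*y + c = 0 for a given x (requires b != 0)."""
    if b == 0:
        raise ValueError("vertical line: y cannot be expressed as a function of x")
    return -(a * x + c) / b

# Hypothetical coefficients, chosen only for illustration
a, b, c = 2.0, 4.0, -12.0
xs = [0.0, 2.0, 4.0]
ys = [boundary_y(a, b, c, x) for x in xs]
print(ys)  # [3.0, 2.0, 1.0]
```

The formula breaks down when b is 0 (a vertical boundary), which is why the helper guards against it.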

In [None]:
xmin, xmax = X1[k1].min(), X1[k1].max()
ymin, ymax = X1[k2].min(), X1[k2].max()
stepx = (xmax - xmin)/99
stepy = (ymax - ymin)/99
a0, b0, c0 = clf0.coef_[0, 0], clf0.coef_[0, 1], clf0.intercept_[0]  # intercept_ is a 1-element array


lx0 = [xmin + stepx * i for i in range(100)]
ly0 = [-(a0*lx0[i] + c0)/b0 for i in range(100)]

plt.figure()

plt.scatter(X1[k1][y1==0], X1[k2][y1==0], c='r')
plt.scatter(X1[k1][y1==1], X1[k2][y1==1], c='c')

plt.plot(lx0, ly0, c='m')

plt.xlabel(k1)
plt.ylabel(k2)

plt.show()

Now, we split the dataset into two parts: pool (80%) and test (20%). We use a random state of 1.

In [None]:
X_pool, X_test, y_pool, y_test = train_test_split(X1, y1, test_size=0.2, random_state=1)
X_pool, X_test, y_pool, y_test = X_pool.reset_index(drop=True), X_test.reset_index(drop=True), y_pool.reset_index(drop=True), y_test.reset_index(drop=True)
# random_state=1 -> 5 iterations
# random_state=2 -> 20 iterations

For a two-class linear SVM, the decision function outputs positive values for one class (one side of the decision boundary), negative values for the other class (the other side), and zero on the decision boundary itself.

In [None]:
clf0.decision_function(X_pool.iloc[6:8])
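
The sign convention can also be checked on a tiny dataset. The sketch below uses made-up, clearly separated points (`X_toy`, `y_toy` are illustrative names, not part of the notebook) and prints each decision value next to the predicted class. For a binary scikit-learn linear classifier, a positive value corresponds to the second entry of `classes_`.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy, clearly separated 2-D data (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
                  [4.0, 4.0], [4.5, 3.5], [5.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf_toy = LinearSVC(max_iter=10000).fit(X_toy, y_toy)
scores = clf_toy.decision_function(X_toy)
preds = clf_toy.predict(X_toy)

# Positive score -> clf_toy.classes_[1]; otherwise -> clf_toy.classes_[0]
for s, p in zip(scores, preds):
    print(f"{s:+.3f} -> predicted class {p}")
```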


Thus, find_most_ambiguous returns the unlabelled point that is closest to the decision boundary.

For an SVM classifier, a data point is more ambiguous the closer it is to the decision boundary and less ambiguous the farther it is from the boundary, no matter which side of the decision boundary the point is on.
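
To see why the absolute decision value is a reasonable stand-in for closeness, note that for a linear boundary w·x + b = 0 the geometric distance of a point is |w·x + b| / ||w||. Since ||w|| is the same for every point, ranking by the absolute decision value gives the same order as ranking by distance. The weights and points below are hypothetical, picked only to make the arithmetic easy to follow.

```python
import math

# Hypothetical boundary w·x + b = 0, with made-up weights
w = (3.0, 4.0)
b = -10.0

def decision_value(point):
    return w[0] * point[0] + w[1] * point[1] + b

def distance_to_boundary(point):
    # Geometric distance = |decision value| / ||w||
    return abs(decision_value(point)) / math.hypot(*w)

points = [(2.0, 2.0), (0.0, 0.0), (2.0, 1.5), (6.0, 3.0)]
most_ambiguous = min(points, key=lambda p: abs(decision_value(p)))
print(most_ambiguous, distance_to_boundary(most_ambiguous))  # (2.0, 1.5) 0.4
```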

In [None]:
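def find_most_ambiguous(clf, unknown_indexes):
    # The absolute decision value is smallest for the point closest to the boundary.
    # Use the classifier passed in (clf), not the global clf0 fitted on all the data.
    scores = np.abs(clf.decision_function(X_pool.iloc[unknown_indexes]))
    ind = np.argmin(scores)
    return unknown_indexes[ind]

# unknown_indexes: indexes into X_pool of the points that are still unlabelled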

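plot_svm draws the unlabelled pool points as black dots and the labelled training points in red and cyan. The decision boundary of the current model is plotted in magenta. We also have the ideal decision boundary calculated earlier from the full dataset; this line is plotted as well (dashed green).
Finally, we plot the new_index point, that is, the most ambiguous point (yellow star).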

In [None]:
def plot_svm(clf, train_indexes, unknown_indexes, new_index = False, title = False, name = False):
    X_train = X_pool.iloc[train_indexes]
    y_train = y_pool.iloc[train_indexes]

    X_unk = X_pool.iloc[unknown_indexes]

    if new_index:
        X_new = X_pool.iloc[new_index]

    a, b, c = clf.coef_[0, 0], clf.coef_[0, 1], clf.intercept_[0]
    # Straight Line Formula
    # a*x + b*y + c = 0
    # y = -(a*x + c)/b

    lx = [xmin + stepx * i for i in range(100)]
    ly = [-(a*lx[i] + c)/b for i in range(100)]

    fig = plt.figure(figsize=(9,6))

    # plt.scatter(x[k1][setosa], x[k2][setosa], c='r')
    plt.scatter(X_unk[k1], X_unk[k2], c='k', marker = '.')
    plt.scatter(X_train[k1][y_train==0], X_train[k2][y_train==0], c='r', marker = 'o')
    plt.scatter(X_train[k1][y_train==1], X_train[k2][y_train==1], c='c', marker = 'o')
    

    plt.plot(lx, ly, c='m')
    plt.plot(lx0, ly0, '--', c='g')

    if new_index:
        plt.scatter(X_new[k1], X_new[k2], c='y', marker="*", s=125)

    if title:
        plt.title(title)
    
    plt.xlabel(k1)
    plt.ylabel(k2)

    if name:
        fig.set_size_inches((9,6))
        plt.savefig(name, dpi=100)

    plt.show()

In [None]:
train_indexes = list(range(10))
unknown_indexes = list(range(10, 80))
X_train = X_pool.iloc[train_indexes]
y_train = y_pool.iloc[train_indexes]
clf = LinearSVC()
clf.fit(X_train, y_train)

folder = "rs1it5/"  # matches random_state=1 and the 5-iteration run below
# folder = "rs2it20/"
# folder = "rs1it20/"

# Create the output folder if it does not exist yet
os.makedirs(folder, exist_ok=True)

# filenames = ["ActiveLearningTitleSlide2.jpg"] * 2
filenames = []
title = "Beginning"
name = folder + "rs1it5_0a.jpg"
# name = folder + "rs2it20_0a.jpg"
plot_svm(clf, train_indexes, unknown_indexes, False, title, name)

filenames.append(name)

n = find_most_ambiguous(clf, unknown_indexes)
unknown_indexes.remove(n)

title = "Iteration 0"
name = folder + ("rs1it5_0b.jpg")
# name = folder + ("rs2it20_0b.jpg")
filenames.append(name)
plot_svm(clf, train_indexes, unknown_indexes, n, title, name)

Next, we run the active learning algorithm for 5 iterations. In each iteration, we add the most ambiguous point to the training data, retrain the SVM, find the most ambiguous remaining point at this stage, and then plot all of this.

In [None]:
num = 5
# num = 20
t = []
for i in range(num):
    
    train_indexes.append(n)
    X_train = X_pool.iloc[train_indexes]
    y_train = y_pool.iloc[train_indexes]
    clf = LinearSVC()
    clf.fit(X_train, y_train)
    title, name = "Iteration "+str(i+1), folder + ("rs1it5_%d.jpg" % (i+1))
    # title, name = "Iteration "+str(i+1), folder + ("rs2it20_%d.jpg" % (i+1))

    n = find_most_ambiguous(clf, unknown_indexes)
    unknown_indexes.remove(n)
    plot_svm(clf, train_indexes, unknown_indexes, n, title, name)
    filenames.append(name)

This is how active learning can be used to build robust models while labelling fewer data points.
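
One way to check a claim like this is to score the model on held-out data after every query and compare against a baseline that picks points at random. The sketch below does that on synthetic data; it is an illustration under stated assumptions, not a result from this notebook. The data comes from `make_blobs`, and the names `Xp`, `Xt`, `yp`, `yt` and `run` are made up here. No particular outcome is claimed; the accuracies depend on the data and the seed.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic data (assumption for illustration), split into pool and test
X_syn, y_syn = make_blobs(n_samples=300, centers=2, cluster_std=2.5, random_state=0)
Xp, Xt, yp, yt = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

def run(strategy, n_queries=10, seed=0):
    rng = np.random.default_rng(seed)
    # Seed set: five labelled points per class
    train = list(np.where(yp == 0)[0][:5]) + list(np.where(yp == 1)[0][:5])
    pool = [i for i in range(len(Xp)) if i not in train]
    accuracies = []
    for _ in range(n_queries):
        model = LinearSVC(max_iter=10000).fit(Xp[train], yp[train])
        accuracies.append(model.score(Xt, yt))  # accuracy on the held-out test set
        if strategy == "uncertainty":
            pick = int(np.argmin(np.abs(model.decision_function(Xp[pool]))))
        else:  # random baseline
            pick = int(rng.integers(len(pool)))
        train.append(pool.pop(pick))
    return accuracies

acc_active = run("uncertainty")
acc_random = run("random")
print(acc_active[-1], acc_random[-1])
```

Plotting the two accuracy lists against the number of labelled points gives a learning curve for each strategy.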

In [None]:
filenames

In [None]:
images = []
# Include every frame; the earlier [2:] slice skipped two title slides that are no longer in the list
for filename in filenames:
    images.append(io.imread(filename))
io.mimsave('rs1it5.gif', images, duration = 1)
# io.mimsave('rs2it20.gif', images, duration = 1)
# io.mimsave('rs1it20.gif', images, duration = 1)
os.makedirs('rs1it5', exist_ok=True)
# os.makedirs('rs2it20', exist_ok=True)
os.listdir('rs1it5')

In [None]:
# with open('rs1it5.gif','rb') as f:
#     display(Image(data=f.read(), format='gif'))