# Multiclass Classifier (OneVsAll)

* The perceptron is a binary classifier, but the MNIST dataset contains 10 classes. How can we extend the idea to handle multiclass problems?
* Solution: combine multiple binary classifiers and devise a suitable scoring metric.
* Sklearn makes this extremely easy: we do not need to modify a single line of the code we wrote for the binary classifier.
* Sklearn does this by finding the unique elements in the label vector y_train and converting the labels with LabelBinarizer, so that each binary classifier is fit on its own binary label vector.
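
The idea of combining binary classifiers can be sketched by hand: train one binary perceptron per class and pick the class whose detector gives the largest score. This is an illustrative sketch only, not scikit-learn's internal implementation. It assumes the small load_digits dataset (8 x 8 images) as a stand-in for MNIST, and the names detectors, scores and y_pred are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron

# Small 10-class digits dataset used as a stand-in for MNIST (assumption).
X, y = load_digits(return_X_y=True)
classes = np.unique(y)

# Train one binary "k-detector" per class: +1 for class k, -1 for the rest.
detectors = []
for k in classes:
    y_k = np.where(y == k, 1, -1)
    detectors.append(Perceptron(random_state=1729).fit(X, y_k))

# Stack the raw scores, one column per detector, and take the argmax.
scores = np.column_stack([d.decision_function(X) for d in detectors])
y_pred = classes[np.argmax(scores, axis=1)]

print('Training accuracy of the hand-rolled one-vs-all model:',
      np.mean(y_pred == y))
```

The scoring metric here is simply the raw decision score of each detector; the detector that is most confident wins.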

In [None]:
from sklearn.linear_model import Perceptron
from sklearn.preprocessing import LabelBinarizer

In [None]:
clf = Perceptron(random_state=1729)

In [None]:
y_train_ovr = LabelBinarizer().fit_transform(y_train)
for i in range(10):
  print('{0}:{1}'.format(y_train[i],y_train_ovr[i]))

* y_train_ovr will be of size 60000 x 10.
* The first column is the binary label vector for the 0-detector, the next one for the 1-detector, and so on.
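
The column layout is easier to see on a tiny label vector. The sketch below uses a hypothetical three-class vector y_toy (not part of the notebook's data) and prints the class order next to the binarized matrix.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy label vector with three classes (hypothetical data).
y_toy = np.array([0, 1, 2, 1, 0])

lb = LabelBinarizer()
y_toy_ovr = lb.fit_transform(y_toy)

print(lb.classes_)   # column order of the binarized matrix
print(y_toy_ovr)     # one row per sample, one column per class
```

Each row contains a single 1, placed in the column of that sample's class.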

In [None]:
clf.fit(x_train, y_train)

* What actually happened internally is that the API automatically created 10 binary classifiers, converted the labels into a binary label matrix, and trained each classifier with its binarized labels.
* At inference time, the input is passed through all 10 classifiers, and the class whose classifier outputs the highest score is taken as the predicted class.
* To see it in action, let us execute the following lines of code.

In [None]:
print('Shape of weight matrix: {0} and bias vector: {1}'.format(clf.coef_.shape,clf.intercept_.shape))

* So it is a matrix of size 10 x 784, where each row holds the weights of a single binary classifier.
* An important difference to note is that no signum function is applied to the output of each perceptron; the raw scores are compared directly.
* The class whose perceptron outputs the maximum score for the input sample is taken as the predicted class.
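
As a rough check of this description, the scores can be recomputed from the weight matrix and bias vector and compared with what the fitted model returns. The sketch assumes the load_digits dataset as a small stand-in for MNIST (so the weight matrix is 10 x 64 rather than 10 x 784), and clf_demo, manual_scores and manual_pred are names introduced only for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron

X, y = load_digits(return_X_y=True)   # stand-in for MNIST (assumption)
clf_demo = Perceptron(random_state=1729).fit(X, y)

# One row of weights and one bias per class.
print(clf_demo.coef_.shape, clf_demo.intercept_.shape)

# Raw linear scores: no sign function is applied to them.
manual_scores = X @ clf_demo.coef_.T + clf_demo.intercept_

# The predicted class is the one whose detector scores highest.
manual_pred = clf_demo.classes_[np.argmax(manual_scores, axis=1)]

print(np.allclose(manual_scores, clf_demo.decision_function(X)))
print(np.array_equal(manual_pred, clf_demo.predict(X)))
```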

In [None]:
scores = clf.decision_function(x_train[6].reshape(1,-1))  # one sample as a (1, 784) row
print(scores)
print('The predicted class:', np.argmax(scores))

In [None]:
print('Predicted output:',clf.predict(x_train[0].reshape(1,-1)))

In [None]:
y_hat = clf.predict(x_train)

In [None]:
print(classification_report(y_train, y_hat))

Let us display the confusion matrix and relate it with the report above.

In [None]:
cm_display = ConfusionMatrixDisplay.from_predictions(y_train, y_hat, values_format='.5g')

* What insights can we infer from the above figure?
* For example, digit 2 is often confused with digit 3.
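
One way to read such an insight off the numbers rather than the figure is to zero out the diagonal of the confusion matrix and look for the largest remaining entry. The sketch below uses made-up labels and predictions (y_true_toy, y_pred_toy), not the MNIST results above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, just to illustrate the lookup.
y_true_toy = np.array([2, 2, 2, 2, 3, 3, 3, 0, 1])
y_pred_toy = np.array([2, 3, 3, 2, 3, 3, 2, 0, 1])

cm = confusion_matrix(y_true_toy, y_pred_toy)

# Ignore correct predictions on the diagonal, then find the largest error count.
errors = cm.copy()
np.fill_diagonal(errors, 0)
true_cls, pred_cls = (int(i) for i in
                      np.unravel_index(np.argmax(errors), errors.shape))

# Row/column indices equal the class labels here because the labels are 0..3.
print('Most frequent confusion: true {0} predicted as {1} ({2} times)'.format(
    true_cls, pred_cls, errors[true_cls, pred_cls]))
```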

# Making a Pipeline

* Let's create a pipeline to keep the code compact.
* Recall that the MNIST dataset is clean and hence doesn't require much preprocessing.
* The one preprocessing technique we may use is to scale the features to the range (0, 1).
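* Note that MinMaxScaler rescales each feature (pixel) separately using that feature's own minimum and maximum, so it is not the same as scaling down all pixel values by a single constant to lie between 0 and 1.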
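
As a small illustration of what MinMaxScaler does, and how it differs from dividing everything by one constant, consider the toy matrix below. The values in X_toy are hypothetical and chosen only so that the two columns have different ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: two features with very different ranges (hypothetical values).
X_toy = np.array([[0.0, 100.0],
                  [5.0, 150.0],
                  [10.0, 200.0]])

# MinMaxScaler rescales every column separately with (x - min) / (max - min).
X_minmax = MinMaxScaler().fit_transform(X_toy)

# Dividing by one global constant keeps the relative ranges of the columns.
X_global = X_toy / X_toy.max()

print(X_minmax)
print(X_global)
```

After MinMaxScaler, both columns span the full range from 0 to 1; after the global division, only the column containing the overall maximum reaches 1.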

In [None]:
estimators = [('std_scaler', MinMaxScaler()),('bin_clf',Perceptron())]
pipe = Pipeline(estimators)

In [None]:
pipe.fit(x_train,y_train_0)

In [None]:
y_hat_train_0 = pipe.predict(x_train)
cm_display = ConfusionMatrixDisplay.from_predictions(y_train_0, y_hat_train_0,values_format='.5g')
plt.show()
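
To make explicit what the pipeline does on fit and predict, the sketch below compares it with the same two steps written out by hand. It assumes the load_digits dataset as a small stand-in for MNIST and labels in {+1, -1} for a 0-detector; pipe_demo, scaler and clf_manual are names used only in this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_digits(return_X_y=True)   # stand-in for MNIST (assumption)
y_0 = np.where(y == 0, 1, -1)         # binary "0-detector" labels

# Pipeline version: scaling and classification in one object.
pipe_demo = Pipeline([('scaler', MinMaxScaler()),
                      ('clf', Perceptron(random_state=0))])
pipe_demo.fit(X, y_0)

# Manual version: the same two steps written out.
scaler = MinMaxScaler().fit(X)
clf_manual = Perceptron(random_state=0).fit(scaler.transform(X), y_0)

same = np.array_equal(pipe_demo.predict(X),
                      clf_manual.predict(scaler.transform(X)))
print('Pipeline and manual steps agree:', same)
```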

# Iteration vs Loss Curve

Another way to plot the iteration vs loss curve is to use the partial_fit method.

In [None]:
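iterations = 100
bin_clf1 = Perceptron(max_iter=1000, random_state=2094)
Loss_clf1 = []
for i in range(iterations):
  # Each partial_fit call performs one pass (epoch) over the training data.
  bin_clf1.partial_fit(x_train, y_train_0, classes=np.array([1,-1]))
  # Record the average hinge loss of the raw scores after this pass.
  y_hat_0 = bin_clf1.decision_function(x_train)
  Loss_clf1.append(hinge_loss(y_train_0, y_hat_0))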

In [None]:
plt.figure()
plt.plot(np.arange(iterations), Loss_clf1)
plt.grid(True)
plt.xlabel('Iterations')
plt.ylabel('Training Loss')
plt.show()
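
The quantity plotted on the vertical axis is the average hinge loss of the raw decision scores, that is, the mean of max(0, 1 - y * score) over the samples. The sketch below checks this formula against hinge_loss on a few hypothetical labels and scores (y_toy and scores_toy are not taken from the notebook).

```python
import numpy as np
from sklearn.metrics import hinge_loss

# Hypothetical labels in {-1, +1} and raw decision scores.
y_toy = np.array([1, -1, 1, -1])
scores_toy = np.array([2.0, -0.5, -0.3, 0.4])

# Average hinge loss: mean of max(0, 1 - y * score).
manual_loss = np.mean(np.maximum(0.0, 1.0 - y_toy * scores_toy))

print(manual_loss, hinge_loss(y_toy, scores_toy))
```

Note that the second sample is classified correctly (negative score for a negative label) but still contributes to the loss, because its margin is smaller than 1.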

# GridSearchCV

* So far we haven't done any hyperparameter tuning (HPT). We accepted the default learning rate of the Perceptron class.
* Now, let's search for a better learning rate using GridSearchCV.
* No matter what the learning rate is, the loss will never converge to zero, as the classes are not linearly separable.

In [None]:
scoring = make_scorer(hinge_loss, greater_is_better=False)
lr_grid = [1/2**n for n in range(1,6)]
bin_clf_gscv = GridSearchCV(Perceptron(), param_grid={'eta0': lr_grid}, scoring=scoring, cv=5)
bin_clf_gscv.fit(x_train, y_train_0)

In [None]:
pprint(bin_clf_gscv.cv_results_)

As you can see, the best learning rate is 0.125.
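
Rather than scanning the full cv_results_ dictionary, the winning setting can also be read directly from the fitted search object. The sketch below runs the same kind of search on a small synthetic problem (X_toy and y_toy are hypothetical, generated with make_classification), so the value it selects need not match the one found on MNIST.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.metrics import hinge_loss, make_scorer
from sklearn.model_selection import GridSearchCV

# Small synthetic binary problem (hypothetical stand-in for the 0-detector).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
y_toy = np.where(y_toy == 1, 1, -1)   # labels in {+1, -1}

scorer = make_scorer(hinge_loss, greater_is_better=False)
grid = [1 / 2**n for n in range(1, 6)]
search = GridSearchCV(Perceptron(random_state=0),
                      param_grid={'eta0': grid}, scoring=scorer, cv=5)
search.fit(X_toy, y_toy)

# best_params_ holds the winning setting; best_score_ is the negated loss.
print(search.best_params_)
print(search.best_score_)
```

Because the scorer is built with greater_is_better=False, the reported scores are negated losses, so the value closest to zero is the best one.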

In [None]:
iterations = 100
Loss = []
best_bin_clf = Perceptron(max_iter=1000, random_state=2094, eta0=0.125)
for i in range(iterations):
  # Train the same model for one more pass; re-creating it here would discard the learned weights.
  best_bin_clf.partial_fit(x_train, y_train_0, classes=np.array([1,-1]))
  y_hat_0 = best_bin_clf.decision_function(x_train)
  Loss.append(hinge_loss(y_train_0,y_hat_0))


In [None]:
plt.figure()
plt.plot(np.arange(iterations), Loss_clf1, label='eta0=1')
plt.plot(np.arange(iterations), Loss, label='eta0=0.125')
plt.grid(True)
plt.legend()
plt.xlabel('Iteration')
plt.ylabel('Training Loss')
plt.show()

Well, instead of instantiating a Perceptron with the new learning rate and retraining the model, we could simply get best_estimator_ from GridSearchCV as follows.

In [None]:
best_bin_clf = bin_clf_gscv.best_estimator_
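
This works because GridSearchCV, with its default refit=True, retrains the best configuration on the whole training set once the search is done. A minimal sketch on hypothetical synthetic data (X_toy, y_toy), assuming the default accuracy scoring for brevity:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron
from sklearn.model_selection import GridSearchCV

# Small synthetic binary problem (hypothetical data).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(Perceptron(random_state=0),
                      param_grid={'eta0': [0.5, 0.25, 0.125]}, cv=5)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already trained on all of X_toy.
best = search.best_estimator_
print(best.eta0 == search.best_params_['eta0'])

# Calling predict on the search object delegates to best_estimator_.
print(np.array_equal(search.predict(X_toy), best.predict(X_toy)))
```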

In [None]:
y_hat_train_0 = best_bin_clf.predict(x_train)
print(classification_report(y_train_0, y_hat_train_0))