# Multiple Kernel Learning

#### By Saurabh Mahindre - <a href="https://github.com/Saurabh7">github.com/Saurabh7</a>

This notebook is about multiple kernel learning in shogun. We will see how to construct a combined kernel, determine optimal kernel weights using MKL and use it for different types of [classification](http://en.wikipedia.org/wiki/Statistical_classification) and [novelty detection](http://en.wikipedia.org/wiki/Novelty_detection).

1. [Introduction](#Introduction)
2. [Mathematical formulation](#Mathematical-formulation-(skip-if-you-just-want-code-examples))
3. [Using a Combined kernel](#Using-a-Combined-kernel)
4. [Example: Toy Data](#Prediction-on-toy-data)
  1. [Generating Kernel weights](#Generating-Kernel-weights)
5. [Binary classification using MKL](#Binary-classification-using-MKL)
6. [MKL for knowledge discovery](#MKL-for-knowledge-discovery)
7. [Multiclass classification using MKL](#Multiclass-classification-using-MKL)
8. [One-class classification using MKL](#One-class-classification-using-MKL)

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
import shogun as sg

%matplotlib inline

SHOGUN_DATA_DIR=os.getenv('SHOGUN_DATA_DIR', '../../../data')

### Introduction

<em>Multiple kernel learning</em> (MKL) is about using a combined kernel i.e. a kernel consisting of a linear combination of arbitrary kernels over different domains. The coefficients or weights of the linear combination can be learned as well.

[Kernel based methods](http://en.wikipedia.org/wiki/Kernel_methods) such as support vector machines (SVMs) employ a so-called kernel function $k(x_{i},x_{j})$ which intuitively computes the similarity between two examples $x_{i}$ and $x_{j}$. <br>
Selecting the kernel function $k()$ and its parameters is an important issue in training. Kernels designed by humans usually capture one aspect of the data, so choosing a single kernel means selecting exactly one such aspect. Combining several aspects is often better than selecting one.

In shogun, [MKL](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKL.html) is the base class for multiple kernel learning. It supports classification ([binary](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKLClassification.html), [one-class](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKLOneClass.html), [multiclass](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKLMulticlass.html)) as well as [regression](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKLRegression.html).

### Mathematical formulation (skip if you just want code examples)

<br>An SVM is defined as:
$$f({\bf x})=\text{sign} \left(\sum_{i=1}^{N} \alpha_i k({\bf x}, {\bf x_i})+b\right)$$<br>
where ${\bf x_i},{i = 1,...,N}$ are labeled training examples ($y_i \in \{\pm 1\}$).

One could make a combination of kernels like:
$${\bf k}(x_i,x_j)=\sum_{k=1}^{K} \beta_k {\bf k_k}(x_i, x_j)$$
where $\beta_k \geq 0$ and $\sum_{k=1}^{K} \beta_k = 1$.
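As a rough illustration of the formula above, the following sketch builds a combined kernel from two base kernels in plain Python. The helper names (`gaussian_kernel`, `linear_kernel`, `combined_kernel`), the Gaussian parameterization and the example weights are assumptions made for this illustration; they are not part of shogun's API.

```python
import math

def gaussian_kernel(x, z, width=2.0):
    # exp(-||x - z||^2 / width); this parameterization is an assumption of the sketch
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / width)

def linear_kernel(x, z):
    # plain dot product
    return sum(a * b for a, b in zip(x, z))

def combined_kernel(x, z, kernels, betas):
    # weighted sum of base kernels; the weights must be non-negative and sum to one
    assert all(b >= 0 for b in betas)
    assert abs(sum(betas) - 1.0) < 1e-12
    return sum(b * k(x, z) for k, b in zip(kernels, betas))

x_i, x_j = [1.0, 0.0], [0.0, 1.0]
kernels = [gaussian_kernel, linear_kernel]
betas = [0.7, 0.3]  # hypothetical weights; MKL would learn these from data
value = combined_kernel(x_i, x_j, kernels, betas)
print(value)
```

Here the two points are orthogonal, so the linear kernel contributes nothing and the combined value comes entirely from the Gaussian term.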

In the multiple kernel learning problem for binary classification one is given $N$ data points $(x_i, y_i)$ with $y_i \in \{\pm 1\}$, where $x_i$ is translated via $K$ mappings $\phi_k(x) \rightarrow R^{D_k}$, $k=1,...,K$, from the input into $K$ feature spaces $(\phi_1(x_i),...,\phi_K(x_i))$, where $D_k$ denotes the dimensionality of the $k$-th feature space.

In MKL, $\alpha_i$, $\beta$ and the bias are determined by solving the following optimization program. For details see [1].

$$\mbox{min} \hspace{4mm} \gamma-\sum_{i=1}^N\alpha_i$$
$$\mbox{w.r.t.} \hspace{4mm} \gamma\in R, \alpha\in R^N$$
$$\mbox{s.t.} \hspace{4mm} {\bf 0}\leq\alpha\leq{\bf 1}C,\;\;\sum_{i=1}^N \alpha_i y_i=0$$
$$\frac{1}{2}\sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j {\bf k_k}(x_i, x_j) \leq \gamma, \hspace{4mm} \forall k=1,\ldots,K$$


Here $C$ is a pre-specified regularization parameter.
Within shogun this optimization problem is solved using [semi-infinite programming](http://en.wikipedia.org/wiki/Semi-infinite_programming). For 1-norm MKL one of the two approaches described in [1] is used.
The first approach (also called the wrapper algorithm) wraps around a single-kernel SVM, alternately solving for $\alpha$ and $\beta$. It uses a traditional SVM to generate new violated constraints, so any of the single-kernel SVMs contained in shogun can be used. In the MKL step, either a linear program is solved via [glpk](http://en.wikipedia.org/wiki/GNU_Linear_Programming_Kit) or cplex, or the step is solved analytically, or a Newton step is performed (for norms > 1).

The second approach, which is much faster but also more memory demanding, performs interleaved optimization and is integrated into the chunking-based [SVMlight](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1SVMLight.html).
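To make the constraints of the optimization program above more concrete, here is a small sketch that evaluates the quadratic term $\frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j k_k(x_i,x_j)$ for each kernel matrix, assuming the constraint carries the $k$-th kernel as in [1]. For a fixed $\alpha$, the smallest feasible $\gamma$ is the largest of these terms. The kernel matrices, labels and $\alpha$ values are toy numbers chosen for illustration, and `quadratic_term` is a hypothetical helper, not a shogun function.

```python
def quadratic_term(K, alpha, y):
    # 0.5 * sum_{i,j} alpha_i * alpha_j * y_i * y_j * K[i][j]
    n = len(alpha)
    return 0.5 * sum(alpha[i] * alpha[j] * y[i] * y[j] * K[i][j]
                     for i in range(n) for j in range(n))

# two toy 2x2 kernel matrices over the same two training points
K1 = [[1.0, 0.5], [0.5, 1.0]]
K2 = [[1.0, 0.0], [0.0, 1.0]]
y = [1, -1]
alpha = [0.5, 0.5]  # satisfies sum_i alpha_i y_i = 0 and 0 <= alpha_i <= C for C = 1

terms = [quadratic_term(K, alpha, y) for K in (K1, K2)]
gamma = max(terms)               # smallest gamma that satisfies every kernel's constraint
objective = gamma - sum(alpha)   # value of gamma - sum_i alpha_i for this alpha
print(terms, gamma, objective)
```

In this toy case the second kernel yields the larger quadratic term, so it alone determines $\gamma$ for the given $\alpha$.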



### Using a Combined kernel

Shogun provides an easy way to make a combination of kernels using the [CombinedKernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CombinedKernel.html) class, to which we can append any [kernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1Kernel.html) from the many options shogun provides. It is especially useful for combining kernels that work on different domains or that look at independent features, in which case [CombinedFeatures](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CombinedFeatures.html) is required. CombinedFeatures, in turn, combines a number of feature objects into a single CombinedFeatures object.

In [None]:
kernel = sg.create_kernel("CombinedKernel")

### Prediction on toy data

In order to see the prediction capabilities, let us generate some data using the [GMM](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGMM.html) class. The component means are chosen (see the [GMM notebook](http://www.shogun-toolbox.org/static/notebook/current/GMM.html)) so that the samples sufficiently cover the X-Y grid and are not too easy to classify.

In [None]:
num=30
num_components=4
means=np.zeros((num_components, 2))
means[0]=[-1,1]
means[1]=[2,-1.5]
means[2]=[-1,-3]
means[3]=[2,1]

covs=np.array([[1.0,0.0],[0.0,1.0]])

gmm=sg.GMM(num_components)
# give every component its own mean and the shared identity covariance
for i in range(num_components):
    gmm.set_nth_mean(means[i], i)
    gmm.set_nth_cov(covs, i)
gmm.set_coef(np.array([1.0,0.0,0.0,0.0]))
xntr=np.array([gmm.sample() for i in range(num)]).T
xnte=np.array([gmm.sample() for i in range(5000)]).T
gmm.set_coef(np.array([0.0,1.0,0.0,0.0]))
xntr1=np.array([gmm.sample() for i in range(num)]).T
xnte1=np.array([gmm.sample() for i in range(5000)]).T
gmm.set_coef(np.array([0.0,0.0,1.0,0.0]))
xptr=np.array([gmm.sample() for i in range(num)]).T
xpte=np.array([gmm.sample() for i in range(5000)]).T
gmm.set_coef(np.array([0.0,0.0,0.0,1.0]))
xptr1=np.array([gmm.sample() for i in range(num)]).T
xpte1=np.array([gmm.sample() for i in range(5000)]).T
traindata=np.concatenate((xntr,xntr1,xptr,xptr1), axis=1)
trainlab=np.concatenate((-np.ones(2*num), np.ones(2*num)))

testdata=np.concatenate((xnte,xnte1,xpte,xpte1), axis=1)
testlab=np.concatenate((-np.ones(10000), np.ones(10000)))

#convert to shogun features and generate labels for data
feats_train=sg.create_features(traindata)  
labels=sg.BinaryLabels(trainlab)         

In [None]:
_=plt.jet()
plt.figure(figsize=(18,5))
plt.subplot(121)
# plot train data
_=plt.scatter(traindata[0,:], traindata[1,:], c=trainlab, s=100)
plt.title('Toy data for classification')
plt.axis('equal')
colors=["blue","blue","red","red"]
# a tool for visualisation
from matplotlib.patches import Ellipse
def get_gaussian_ellipse_artist(mean, cov, nstd=1.96, color="red", linewidth=3):
    vals, vecs = np.linalg.eigh(cov)
    order = vals.argsort()[::-1]
    vals, vecs = vals[order], vecs[:, order]    
    theta = np.degrees(np.arctan2(*vecs[:, 0][::-1]))
    width, height = 2 * nstd * np.sqrt(vals)
    e = Ellipse(xy=mean, width=width, height=height, angle=theta, \
               edgecolor=color, fill=False, linewidth=linewidth)
    
    return e
for i in range(num_components):
    plt.gca().add_artist(get_gaussian_ellipse_artist(means[i], covs, color=colors[i]))

### Generating Kernel weights

Just to help us visualize, let's use two Gaussian kernels ([GaussianKernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1GaussianKernel.html)) with considerably different widths. As required in MKL, we need to append them to the combined kernel. To generate the optimal weights (i.e. the $\beta$s in the above equation), training of [MKL](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKLClassification.html) is required. This generates the weights as seen in this example.

In [None]:
width0=0.5
kernel0=sg.create_kernel("GaussianKernel", log_width=np.log(width0))

width1=25
kernel1=sg.create_kernel("GaussianKernel", log_width=np.log(width1))

#combine kernels
kernel.add("kernel_array", kernel0)   
kernel.add("kernel_array", kernel1)
kernel.init(feats_train, feats_train)

mkl = sg.create_machine("MKLClassification", mkl_norm=1, C1=1, C2=1, kernel=kernel, labels=labels)

#train to get weights
mkl.train()    

w=kernel.get_subkernel_weights()
print(w)
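For intuition about the two widths used above (0.5 and 25), the sketch below evaluates a Gaussian kernel at a few distances. It assumes the form $\exp(-d^2/\text{width})$, which is believed to match shogun's GaussianKernel but should be treated as an assumption; `gaussian_kernel_at_distance` is a hypothetical helper written only for this illustration.

```python
import math

def gaussian_kernel_at_distance(dist, width):
    # assumed parameterization: k = exp(-dist^2 / width)
    return math.exp(-dist ** 2 / width)

distances = [0.5, 1.0, 2.0, 4.0]
narrow = [gaussian_kernel_at_distance(d, 0.5) for d in distances]
wide = [gaussian_kernel_at_distance(d, 25.0) for d in distances]

for d, k_narrow, k_wide in zip(distances, narrow, wide):
    print("distance %.1f: width 0.5 -> %.4f, width 25 -> %.4f" % (d, k_narrow, k_wide))
```

Under this assumption the narrow kernel drops to almost zero within two units, so it only relates close neighbours, while the wide kernel still assigns substantial similarity to points four units apart.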

### Binary classification using MKL

Now with the data ready and training done, we can do the binary classification. The weights generated can be understood intuitively. We will see this by plotting the outputs of the individual subkernels alongside the output of the MKL classification. To apply the classifier to test features, we need to reinitialize the kernel with `kernel.init`, passing the test features. After that, it's just a matter of calling `mkl.apply` to generate outputs.

In [None]:
size=100
x1=np.linspace(-5, 5, size)
x2=np.linspace(-5, 5, size)
x, y=np.meshgrid(x1, x2)
#Generate X-Y grid test data
grid=sg.create_features(np.array((np.ravel(x), np.ravel(y))))

kernel0t=sg.create_kernel("GaussianKernel", log_width=np.log(width0))
kernel1t=sg.create_kernel("GaussianKernel", log_width=np.log(width1))

kernelt=sg.create_kernel("CombinedKernel")
kernelt.add("kernel_array", kernel0t)
kernelt.add("kernel_array", kernel1t)
#initialize with test grid
kernelt.init(feats_train, grid)

mkl.put("kernel", kernelt)
#prediction
grid_out=mkl.apply()    

z=grid_out.get_values().reshape((size, size))

plt.figure(figsize=(10,5))
plt.title("Classification using MKL")
c=plt.pcolor(x, y, z)
_=plt.contour(x, y, z, linewidths=1, colors='black')
_=plt.colorbar(c)


To justify the weights, let's train the two subkernels individually and compare them with the MKL classification output. Training an MKL classifier with a single kernel appended to a combined kernel makes little sense, since it amounts to ordinary single-kernel classification, but let's do it for comparison.

In [None]:
z=grid_out.get("labels").reshape((size, size))

# MKL
plt.figure(figsize=(20,5))
plt.subplot(131, title="Multiple Kernels combined")
c=plt.pcolor(x, y, z)
_=plt.contour(x, y, z, linewidths=1, colors='black')
_=plt.colorbar(c)

comb_ker0=sg.create_kernel("CombinedKernel")
comb_ker0.add("kernel_array", kernel0)
comb_ker0.init(feats_train, feats_train)
mkl.put("kernel", comb_ker0)
mkl.train()
comb_ker0t=sg.create_kernel("CombinedKernel")
comb_ker0t.add("kernel_array", kernel0)
comb_ker0t.init(feats_train, grid)
mkl.put("kernel",comb_ker0t)
out0=mkl.apply()

# subkernel 1
z=out0.get("labels").reshape((size, size)) 
plt.subplot(132, title="Kernel 1")
c=plt.pcolor(x, y, z)
_=plt.contour(x, y, z, linewidths=1, colors='black')
_=plt.colorbar(c)

comb_ker1=sg.create_kernel("CombinedKernel")
comb_ker1.add("kernel_array",kernel1)
comb_ker1.init(feats_train, feats_train)
mkl.put("kernel", comb_ker1)
mkl.train()
comb_ker1t=sg.create_kernel("CombinedKernel")
comb_ker1t.add("kernel_array", kernel1)
comb_ker1t.init(feats_train, grid)
mkl.put("kernel", comb_ker1t)
out1=mkl.apply()

# subkernel 2
z=out1.get("labels").reshape((size, size))  
plt.subplot(133, title="Kernel 2")
c=plt.pcolor(x, y, z)
_=plt.contour(x, y, z, linewidths=1, colors='black')
_=plt.colorbar(c)



As we can see, the multiple kernel output seems just about right. Kernel 1 overfits somewhat, while kernel 2 is not very accurate. The kernel weights are adjusted to get a refined output. Looking at the test errors of these subkernels gives more food for thought. Most of the time the MKL error is lower, as it incorporates aspects of both kernels: one of them is strict while the other is lenient, and MKL finds a balance between the two.

In [None]:
test_feats = sg.create_features(testdata)
test_labels = sg.BinaryLabels(testlab)
evaluator = sg.create_evaluation("ErrorRateMeasure")

# mkl was last trained on a single subkernel in the previous cell, so
# retrain it on the matching training kernel before each evaluation

# full combined kernel
kernel.init(feats_train, feats_train)
mkl.put("kernel", kernel)
mkl.train()
kernelt.init(feats_train, test_feats)
mkl.put("kernel", kernelt)
out = mkl.apply()
print("Test error is %2.2f%% :MKL" % (100*evaluator.evaluate(out, test_labels)))

# subkernel 1 only
comb_ker0.init(feats_train, feats_train)
mkl.put("kernel", comb_ker0)
mkl.train()
comb_ker0t.init(feats_train, test_feats)
mkl.put("kernel", comb_ker0t)
out = mkl.apply()
print("Test error is %2.2f%% :Subkernel1" % (100*evaluator.evaluate(out, test_labels)))

# subkernel 2 only
comb_ker1.init(feats_train, feats_train)
mkl.put("kernel", comb_ker1)
mkl.train()
comb_ker1t.init(feats_train, test_feats)
mkl.put("kernel", comb_ker1t)
out = mkl.apply()
print("Test error is %2.2f%% :Subkernel2" % (100*evaluator.evaluate(out, test_labels)))
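The error rate reported above is the fraction of test examples whose predicted label differs from the true one. A minimal sketch with made-up labels follows; the helper `error_rate` is hypothetical and only mirrors what an error-rate measure computes.

```python
def error_rate(predicted, truth):
    # fraction of positions where predicted and true labels disagree
    assert len(predicted) == len(truth)
    wrong = sum(1 for p, t in zip(predicted, truth) if p != t)
    return wrong / len(truth)

# made-up binary labels, purely for illustration
truth = [-1, -1, -1, 1, 1, 1, 1, -1]
predicted = [-1, 1, -1, 1, 1, -1, 1, -1]

rate = error_rate(predicted, truth)
print("Test error is %2.2f%%" % (100 * rate))
```

Two of the eight toy predictions are wrong, so the sketch reports an error of 25%.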


### MKL for knowledge discovery

MKL can recover information about the problem at hand. Let us see this with a binary classification problem. The task is to separate two concentric classes shaped like circles. By varying the distance between the boundary of the circles we can control the separability of the problem. Starting with an almost non-separable scenario, the data quickly becomes separable as the distance between the circles increases.

In [None]:
def circle(x, radius, neg):
    y=np.sqrt(np.square(radius)-np.square(x))
    if neg:
        return [x, -y]
    else:
        return [x, y]
        
def get_circle(radius):
    neg=False
    range0=np.linspace(-radius,radius,100)
    pos_a=np.array([circle(i, radius, neg) for i in range0]).T
    neg=True
    neg_a=np.array([circle(i, radius, neg) for i in range0]).T
    c=np.concatenate((neg_a,pos_a), axis=1)
    return c

def get_data(r1, r2):
    c1=get_circle(r1)
    c2=get_circle(r2)
    c=np.concatenate((c1, c2), axis=1)
    feats_tr=sg.create_features(c)
    return c, feats_tr

l=np.concatenate((-np.ones(200),np.ones(200)))
lab=sg.BinaryLabels(l)

#get two circles with radius 2 and 4
c, feats_tr=get_data(2,4)
c1, feats_tr1=get_data(2,3)
_=plt.gray()
plt.figure(figsize=(10,5))
plt.subplot(121)
plt.title("Circles with different separation")
p=plt.scatter(c[0,:], c[1,:], c=lab.get_labels())
plt.subplot(122)
q=plt.scatter(c1[0,:], c1[1,:], c=lab.get_labels())

These are the kinds of circles we want to distinguish between. We can try classification with a constant separation between the circles first.

In [None]:
def train_mkl(circles, feats_tr):
    #Four kernels with different widths 
    kernel0=sg.create_kernel("GaussianKernel", log_width=np.log(1)) 
    kernel1=sg.create_kernel("GaussianKernel", log_width=np.log(5))
    kernel2=sg.create_kernel("GaussianKernel", log_width=np.log(7))
    kernel3=sg.create_kernel("GaussianKernel", log_width=np.log(10))
    kernel = sg.create_kernel("CombinedKernel")
    kernel.add("kernel_array", kernel0)
    kernel.add("kernel_array", kernel1)
    kernel.add("kernel_array", kernel2)
    kernel.add("kernel_array", kernel3)
    
    kernel.init(feats_tr, feats_tr)
    mkl = sg.create_machine("MKLClassification", mkl_norm=1, C1=1, C2=2, kernel=kernel, labels=lab)
    
    mkl.train()
    
    w=kernel.get_subkernel_weights()
    return w, mkl

def test_mkl(mkl, grid):
    kernel0t=sg.create_kernel("GaussianKernel", log_width=np.log(1))
    kernel1t=sg.create_kernel("GaussianKernel", log_width=np.log(5))
    kernel2t=sg.create_kernel("GaussianKernel", log_width=np.log(7))
    kernel3t=sg.create_kernel("GaussianKernel", log_width=np.log(10))
    kernelt = sg.create_kernel("CombinedKernel")
    kernelt.add("kernel_array", kernel0t)
    kernelt.add("kernel_array", kernel1t)
    kernelt.add("kernel_array", kernel2t)
    kernelt.add("kernel_array", kernel3t)
    kernelt.init(feats_tr, grid)
    mkl.put("kernel", kernelt)
    out=mkl.apply()
    return out

size=50
x1=np.linspace(-10, 10, size)
x2=np.linspace(-10, 10, size)
x, y=np.meshgrid(x1, x2)
grid=sg.create_features(np.array((np.ravel(x), np.ravel(y))))


w, mkl=train_mkl(c, feats_tr)
print(w)
out=test_mkl(mkl,grid)

z=out.get_values().reshape((size, size))

plt.figure(figsize=(5,5))
c=plt.pcolor(x, y, z)
_=plt.contour(x, y, z, linewidths=1, colors='black')
plt.title('classification with constant separation')
_=plt.colorbar(c)

As we can see, the MKL classifier classifies them as expected. Now let's vary the separation and see how it affects the weights. The choice of the kernel width of the Gaussian kernel used for classification is expected to depend on the separation distance of the learning problem. An increased distance between the circles will correspond to a larger optimal kernel width. This effect should be visible in the results of MKL, where we use MKL-SVMs with four kernels of different widths (1, 5, 7, 10).

In [None]:
range1=np.linspace(5.5,7.5,50)
x=np.linspace(1.5,3.5,50)
temp=[]

for i in range1:
    #vary separation between circles
    c, feats=get_data(4,i) 
    w, mkl=train_mkl(c, feats)
    temp.append(w)
y=np.array([temp[i] for i in range(0,50)]).T


In [None]:
plt.figure(figsize=(20,5))
_=plt.plot(x, y[0,:], color='k', linewidth=2)
_=plt.plot(x, y[1,:], color='r', linewidth=2)
_=plt.plot(x, y[2,:], color='g', linewidth=2)
_=plt.plot(x, y[3,:], color='y', linewidth=2)
plt.title("Comparison between kernel widths and weights")
plt.ylabel("Weight")
plt.xlabel("Distance between circles")
_=plt.legend(["1","5","7","10"])
 

In the plot above we see the kernel weights obtained for the four kernels, one line per kernel. The course of the weights reflects the development of the learning problem: as long as the problem is difficult, the best separation is obtained with the smallest-width kernel. As the distance between the circles increases, the low-width kernel loses importance and kernels with larger widths obtain a larger weight in MKL.

### Multiclass classification using MKL

MKL can be used for multiclass classification using the [MKLMulticlass](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKLMulticlass.html) class. It is based on the GMNPSVM multiclass SVM. Its termination criterion is set by `set_mkl_epsilon(float64_t eps)` and the maximal number of MKL iterations is set by `set_max_num_mkliters(int32_t maxnum)`. The epsilon termination criterion is the L2 norm of the difference between the current MKL weights and their counterparts from the previous iteration. We set it to 0.001 as we want fairly accurate weights.
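The termination criterion described above can be sketched in a few lines. This is only an illustration of the idea under stated assumptions: the weight vectors are invented, `weights_converged` is a hypothetical helper, and whether shogun compares with a strict or non-strict inequality is not something this sketch claims to know.

```python
import math

def weights_converged(w_new, w_old, eps=0.001):
    # L2 norm of the change in kernel weights between two iterations
    change = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_new, w_old)))
    return change < eps

# hypothetical weight vectors from successive MKL iterations
history = [[0.5, 0.5], [0.62, 0.38], [0.651, 0.349], [0.6512, 0.3488]]
max_iters = 10  # stand-in for the maximal number of MKL iterations

stopped_at = None
for it in range(1, min(len(history), max_iters)):
    if weights_converged(history[it], history[it - 1]):
        stopped_at = it
        break
print(stopped_at)
```

With these numbers the weights move by less than 0.001 only between the last two vectors, so the loop stops at the third update.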

To see MKLMulticlass in action, let us compare it to the plain [GMNPSVM](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CGMNPSVM.html) example from the [KNN notebook](http://www.shogun-toolbox.org/static/notebook/current/KNN.html#Comparison-to-Multiclass-Support-Vector-Machines), just to see how MKL fares in object recognition. We use the [USPS digit recognition dataset](http://www.gaussianprocess.org/gpml/data/).

In [None]:
from scipy.io import loadmat

# reuse the data directory configured at the top of the notebook
mat  = loadmat(os.path.join(SHOGUN_DATA_DIR, 'multiclass', 'usps.mat'))
Xall = mat['data']
Yall = np.array(mat['label'].squeeze(), dtype=np.double)

# map from 1..10 to 0..9, since shogun
# requires multiclass labels to be
# 0, 1, ..., K-1
Yall = Yall - 1

np.random.seed(0)

subset = np.random.permutation(len(Yall))

#get first 1000 examples
Xtrain = Xall[:, subset[:1000]]
Ytrain = Yall[subset[:1000]]

Nsplit = 2
all_ks = range(1, 21)

print(Xall.shape)
print(Xtrain.shape)

Let's plot five of the examples to get a feel for the dataset.

In [None]:
def plot_example(dat, lab):
    for i in range(5):
        ax=plt.subplot(1,5,i+1)
        plt.title(int(lab[i]))
        ax.imshow(dat[:,i].reshape((16,16)), interpolation='nearest')
        ax.set_xticks([])
        ax.set_yticks([])
        
        
_=plt.figure(figsize=(17,6))
plt.gray()
plot_example(Xtrain, Ytrain)


We combine a [Gaussian kernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1GaussianKernel.html) and a [PolyKernel](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1CPolyKernel.html). For testing, we use examples that are not included in the training data.

This is just a demonstration, but it shows how MKL works behind the scenes. We have two kernels with significantly different properties. The Gaussian kernel defines a function space that is a lot larger than that of the linear kernel or the polynomial kernel. Because its width is low, the Gaussian kernel can represent increasingly complex relationships in the training data, but it needs enough data to do so. Here we train on only 1000 examples, which is few given that there are 10000 examples in total. We hope the polynomial kernel can counter this problem, since fitting a polynomial needs a lot less data than fitting the squared exponential. The kernel weights are printed below to add some insight.

In [None]:
# MKL training and output
labels = sg.MulticlassLabels(Ytrain)
feats  = sg.create_features(Xtrain)
#get test data from 5500 onwards
Xrem=Xall[:,subset[5500:]]
Yrem=Yall[subset[5500:]]

#test features not used in training
feats_rem = sg.create_features(Xrem)             
labels_rem = sg.MulticlassLabels(Yrem)

kernel = sg.create_kernel("CombinedKernel")
feats_train = sg.create_features("CombinedFeatures")
feats_test = sg.create_features("CombinedFeatures")

#append gaussian kernel
subkernel = sg.create_kernel("GaussianKernel", log_width=np.log(15))
feats_train.add("feature_array", feats)
feats_test.add("feature_array", feats_rem)
kernel.add("kernel_array", subkernel)

#append PolyKernel
feats  = sg.create_features(Xtrain)
subkernel = sg.create_kernel('PolyKernel', degree=10, c=2)
feats_train.add("feature_array", feats)
feats_test.add("feature_array", feats_rem)
kernel.add("kernel_array", subkernel)

kernel.init(feats_train, feats_train)

mkl = sg.create_machine("MKLMulticlass", C=1.2, kernel=kernel, 
                 labels=labels, mkl_eps=0.001, mkl_norm=1)

# set epsilon of SVM
mkl.get("machine").put("epsilon", 1e-2)

mkl.train()

#initialize with test features
kernel.init(feats_train, feats_test)     

out =  mkl.apply()
evaluator = sg.create_evaluation("MulticlassAccuracy")
accuracy = evaluator.evaluate(out, labels_rem)
print("Accuracy = %2.2f%%" % (100*accuracy))

idx=np.where(out.get("labels") != Yrem)[0]
Xbad=Xrem[:,idx]
Ybad=Yrem[idx]
_=plt.figure(figsize=(17,6))
plt.gray()
plot_example(Xbad, Ybad)

In [None]:
w=kernel.get_subkernel_weights()
print(w)

In [None]:
# Single kernel:PolyKernel
C=1

pk = sg.create_kernel('PolyKernel', degree=10, c=2) 

svm = sg.create_machine("GMNPSVM", C=C, kernel=pk, labels=labels)
_=svm.train(feats)
out=svm.apply(feats_rem)
evaluator = sg.create_evaluation("MulticlassAccuracy")
accuracy = evaluator.evaluate(out, labels_rem)

print("Accuracy = %2.2f%%" % (100*accuracy))

idx=np.where(out.get("labels") != Yrem)[0]
Xbad=Xrem[:,idx]
Ybad=Yrem[idx]
_=plt.figure(figsize=(17,6))
plt.gray()
plot_example(Xbad, Ybad)

In [None]:
#Single Kernel:Gaussian kernel
width=15
C=1

gk=sg.create_kernel("GaussianKernel", log_width=np.log(width))

svm=sg.create_machine("GMNPSVM", C=C, kernel=gk, labels=labels)
_=svm.train(feats)
out=svm.apply(feats_rem)
evaluator = sg.create_evaluation("MulticlassAccuracy")
accuracy = evaluator.evaluate(out, labels_rem)

print("Accuracy = %2.2f%%" % (100*accuracy))

idx=np.where(out.get("labels") != Yrem)[0]
Xbad=Xrem[:,idx]
Ybad=Yrem[idx]
_=plt.figure(figsize=(17,6))
plt.gray()
plot_example(Xbad, Ybad)

The misclassified examples are indeed pretty tough to predict. Judging by the accuracy, MKL seems to work a shade better in this case. One could try this out with more and different types of kernels too.

### One-class classification using MKL

[One-class classification](http://en.wikipedia.org/wiki/One-class_classification) can be done using MKL in shogun. This is demonstrated in the following simple example using [MKLOneClass](http://www.shogun-toolbox.org/doc/en/latest/classshogun_1_1MKLOneClass.html). We will see how abnormal data is detected. This is also known as novelty detection. Below we generate some toy data and initialize combined kernels and features.

In [None]:
X = -0.3 * np.random.randn(100,2)
traindata = np.r_[X + 2, X - 2].T

X = -0.3 * np.random.randn(20, 2)
testdata = np.r_[X + 2, X - 2].T

trainlab=np.concatenate((np.ones(99),-np.ones(1)))
#convert to shogun features and generate labels for data
feats=sg.create_features(traindata) 
labels=sg.BinaryLabels(trainlab)         

In [None]:
xx, yy = np.meshgrid(np.linspace(-5, 5, 500), np.linspace(-5, 5, 500))
grid=sg.create_features(np.array((np.ravel(xx), np.ravel(yy))))
#test features
feats_t=sg.create_features(testdata)   
x_out=(np.random.uniform(low=-4, high=4, size=(20, 2))).T
feats_out=sg.create_features(x_out)

kernel=sg.create_kernel("CombinedKernel")
feats_train=sg.create_features("CombinedFeatures")
feats_test=sg.create_features("CombinedFeatures")
feats_test_out=sg.create_features("CombinedFeatures")
feats_grid=sg.create_features("CombinedFeatures")

#append gaussian kernel
subkernel=sg.create_kernel("GaussianKernel", log_width=np.log(8))
feats_train.add("feature_array", feats)
feats_test.add("feature_array", feats_t)
feats_test_out.add("feature_array", feats_out)
feats_grid.add("feature_array", grid)
kernel.add("kernel_array", subkernel)

#append PolyKernel
feats  = sg.create_features(traindata)
subkernel = sg.create_kernel('PolyKernel', degree=10, c=3)
feats_train.add("feature_array", feats)
feats_test.add("feature_array", feats_t)
feats_test_out.add("feature_array", feats_out)
feats_grid.add("feature_array", grid)
kernel.add("kernel_array", subkernel)

kernel.init(feats_train, feats_train)

mkl = sg.create_machine("MKLOneClass", kernel=kernel, labels=labels, interleaved_optimization=False,
                 mkl_norm=1)

mkl.put("epsilon", 1e-2)
mkl.put('mkl_epsilon', 0.1)

Now that everything is initialized, let's see MKLOneClass in action by applying it to the test data and to the X-Y grid.

In [None]:
mkl.train()
print("Weights:")
w=kernel.get_subkernel_weights()
print(w)

#initialize with test features
kernel.init(feats_train, feats_test)     
normal_out =  mkl.apply()

#test on abnormally generated data
kernel.init(feats_train, feats_test_out)     
abnormal_out =  mkl.apply()

#test on X-Y grid
kernel.init(feats_train, feats_grid)
grid_out=mkl.apply()
z=grid_out.get_values().reshape((500,500))
z_lab=grid_out.get("labels").reshape((500,500))

a=abnormal_out.get("labels")
n=normal_out.get("labels")

#check for normal and abnormal classified data
idx=np.where(normal_out.get("labels") != 1)[0]
abnormal=testdata[:,idx]

idx=np.where(normal_out.get("labels") == 1)[0]
normal=testdata[:,idx]

plt.figure(figsize=(15,6))
pl =plt.subplot(121)
plt.title("One-class classification using MKL")
_=plt.pink()
c=plt.pcolor(xx, yy, z)
_=plt.contour(xx, yy, z_lab, linewidths=1, colors='black')
_=plt.colorbar(c)
p1=pl.scatter(traindata[0, :], traindata[1,:], cmap=plt.gray(), s=100)
p2=pl.scatter(normal[0,:], normal[1,:], c="red", s=100)
p3=pl.scatter(abnormal[0,:], abnormal[1,:], c="blue", s=100)
p4=pl.scatter(x_out[0,:], x_out[1,:], c=a, cmap=plt.jet(),  s=100)
_=pl.legend((p1, p2, p3), ["Training samples", "normal samples", "abnormal samples"], loc=2)


plt.subplot(122)
c=plt.pcolor(xx, yy, z)
plt.title("One-class classification output")
_=plt.gray()
_=plt.contour(xx, yy, z, linewidths=1, colors='black')
_=plt.colorbar(c)

MKL one-class classification gives you a bit more flexibility compared to normal classification. The kernel weights are expected to be more or less similar here, since the training data is neither overly complicated nor too easy, which means both the Gaussian and the polynomial kernel will be involved. If you don't know the nature of the training data and a lot of features are involved, you could easily use kernels with very different properties and benefit from their combination.

### References:

[1] Soeren Sonnenburg, Gunnar Raetsch, Christin Schaefer, and Bernhard Schoelkopf. Large Scale Multiple Kernel Learning. Journal of Machine Learning Research, 7:1531-1565, July 2006.

[2] F. R. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In C. E. Brodley, editor, Twenty-first international conference on Machine learning. ACM, 2004.

[3] Christoph H. Lampert. Kernel Methods for Object Recognition.