In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (8, 8)
plt.rcParams["font.size"] = 14

# Random Features and Kernels

In this lecture, we discuss how changing the feature space can lead to computational advantages or improved performance.

Suppose we have $N$ samples $\{x_i\}_{i=1\cdots N}$, each containing $P$ features. We arrange our samples into a matrix $X$ of size $N\times P$, such that each row is a sample:

$$X = \left(\begin{matrix} x_1  \\ x_2 \\ \vdots \\ x_N
\end{matrix}\right)$$

We now want to introduce a **feature map**, i.e. to construct a function $z_i = f(x_i) \in \mathbb{R}^D$ that maps the points to a new feature space of a different dimension $D$. We also define a matrix $Z$ of size $N \times D$, built row by row as before:

$$Z = \left(\begin{matrix} z_1  \\ z_2 \\ \vdots \\ z_N
\end{matrix}\right) = f(X)$$

We are going to discuss two different applications:

$\bullet$ In the first case, we consider $D$ small, i.e. we want to reduce the dimension of the feature space while approximately preserving the distances between points, i.e. the Gram matrix $X X^T \sim Z Z^T$. This is useful to reduce the computational time needed to estimate the Gram matrix, in applications such as k-Nearest Neighbor classification.

$\bullet$ In the second case, we consider $D$ large, i.e. we embed our points in a feature space of higher dimension. This is useful to improve e.g. classification performance, and in this case we want to choose the feature map in such a way that the Gram matrix satisfies $(Z Z^T)_{ij} \sim K(x_i, x_j)$ for a given **kernel** $K$.

# Random linear features - dimensional reduction
## Johnson-Lindenstrauss lemma and dimensional reduction.  Application to k-Nearest Neighbor classification

Briefly speaking, the JL lemma guarantees that some maps approximately preserve distances or inner products. More precisely:

> Given $0 < \varepsilon < 1$, a set $\{x_i\}_{i=1\cdots N}$ of $N$ points in $\mathbb{R}^P$, and a number $D > 8 \log(N) / \varepsilon^2$, there is a linear map $f: \mathbb{R}^P \to \mathbb{R}^D$ such that
>
>$$(1 - \varepsilon) \|x_i - x_j\|^2 \leq \|f(x_i) - f(x_j)\|^2 \leq (1 + \varepsilon) \|x_i - x_j\|^2$$
>
> for all $i, j \in \{1,\cdots,N\}$.
>
> Similarly, if $\|x_i\|^2 < 1$ for all $i$, then there is a linear map $f$ such that
>
>$$ |x_i \cdot x_j - f(x_i) \cdot f(x_j) |^2 < \varepsilon $$

Moreover, we actually know how to construct such maps. For instance, the map defined by a random $D \times P$ matrix $A$ with independent standard normal entries,

$$z = f(x) = \frac{1}{\sqrt{D}} A x,  \qquad A_{ij} \sim \mathcal{N} (0, 1),$$

satisfies the condition in the lemma with nonzero probability.
See e.g. http://ttic.uchicago.edu/~gregory/courses/LargeScaleLearning/lectures/jl.pdf for a simple proof.
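
As a quick sanity check, the sketch below draws one such random map and measures how well it preserves squared distances on synthetic Gaussian points. The sizes, the value of $\varepsilon$, the seed and all variable names (`X_demo`, `A_demo`, `pairwise_sq_dists`, ...) are illustrative assumptions, not part of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, arbitrary choice
n_points, n_dims, eps = 100, 1000, 0.5   # illustrative N, P and epsilon

# Smallest integer D strictly above the JL bound 8 log(N) / eps^2
n_proj = int(np.floor(8 * np.log(n_points) / eps**2)) + 1

X_demo = rng.standard_normal((n_points, n_dims))   # synthetic points, one per row
A_demo = rng.standard_normal((n_proj, n_dims))     # D x P matrix with N(0, 1) entries
Z_demo = X_demo @ A_demo.T / np.sqrt(n_proj)       # rows are z_i = A x_i / sqrt(D)

def pairwise_sq_dists(M):
    """Squared Euclidean distances between all pairs of distinct rows of M."""
    sq_norms = np.sum(M**2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2 * M @ M.T
    return d2[np.triu_indices(len(M), k=1)]

# Ratio ||f(x_i) - f(x_j)||^2 / ||x_i - x_j||^2 for every pair i < j
ratios = pairwise_sq_dists(Z_demo) / pairwise_sq_dists(X_demo)
frac_ok = np.mean(np.abs(ratios - 1) <= eps)
print("D = %d, distortion range = [%.3f, %.3f]" % (n_proj, ratios.min(), ratios.max()))
print("fraction of pairs within (1 +/- eps): %.4f" % frac_ok)
```

A single draw is not guaranteed to satisfy the bound for every pair, which is why the check reports a fraction instead of asserting success.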



The JL lemma can be used to speed up the calculation of the Gram matrix $X X^T$, which in the original feature space takes $O(N^2 P)$ operations.
Computing the random projections $z_i = A x_i/\sqrt{D}$ takes $O(N D P)$ operations, and computing $Z Z^T\sim X X^T$ takes a further $O(N^2 D)$. We thus obtain a speedup if 

$$N D \max(P, N) \ll N^2 P$$ 

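According to the JL lemma we can choose $D\sim \log(N)$, so the ratio between the reduced cost and the original one is of the order of 

$$D \max\left(\frac1P, \frac1N\right) \sim \frac{ 8\log N}{\varepsilon^2} \times \max\left(\frac1P, \frac1N\right),$$

and the projection pays off when this ratio is much smaller than 1.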
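
To get a feeling for these orders of magnitude, the estimate can be evaluated for a few sizes. This is a rough sketch: the helper names `jl_min_dim` and `cost_ratio`, the choice $\varepsilon = 0.5$ and the example sizes are assumptions made for illustration, and the constants hidden in the $O(\cdot)$ notation are ignored.

```python
import math

def jl_min_dim(n_samples, eps):
    """Smallest integer D with D > 8 log(N) / eps^2 (bound quoted in the lemma)."""
    return math.floor(8 * math.log(n_samples) / eps**2) + 1

def cost_ratio(n_samples, n_features, n_proj):
    """Order-of-magnitude ratio N D max(P, N) / (N^2 P) = D max(1/P, 1/N)."""
    return n_proj * max(1 / n_features, 1 / n_samples)

# (N, P) pairs: a small-P case and two hypothetical large-P cases
for n, p in [(1797, 64), (1797, 10_000), (100_000, 10_000)]:
    d = jl_min_dim(n, eps=0.5)
    print("N = %d, P = %d: D = %d, cost ratio = %.3g" % (n, p, d, cost_ratio(n, p, d)))
```

A ratio below 1 means the projection is expected to pay off. When $P$ is small, the bound on $D$ can exceed $P$ itself, in which case the lemma offers no guaranteed saving, although smaller values of $D$ may still work well in practice.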

We are now going to apply this method to the k-Nearest Neighbor classification that we studied in lecture 1. 

## Digits dataset

We are going to use the UCI ML hand-written digits dataset. It contains $N = 1797$ images, each made of $P = 8\times 8 = 64$ pixels, each pixel value being an integer between 0 and 16. The images represent handwritten digits, from 0 to 9.

In [None]:
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
n_samples, n_features = np.shape(X)
print("samples/features in data set: %d, %d" % X.shape)

# Do train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("samples/features in training set: %d, %d" % X_train.shape)

# Show an image
fig, axs = plt.subplots(1, 2)
axs[0].imshow(X_train[0, :].reshape((int(np.sqrt(n_features)), -1)), cmap="gray")

# Standardize data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Show an image after scaling
print("image label: %d" % y_train[0])
axs[1].imshow(X_train[0, :].reshape((int(np.sqrt(n_features)), -1)), cmap="gray")

## Dimensional reduction

Let's try the map mentioned above. We start with $N$ training samples, each containing $P$ features. We will generate $D$ new features by taking random linear combinations of the original features, with coefficients sampled from a standard Normal distribution. Our new set will then contain $N$ samples, each of dimension $D$ (instead of $P$).

By playing a bit with the number of features in the transformed space, $D$, we can see how the Gram matrix evolves.

In [None]:
# Generate new set of samples by mapping from P features to D
n_projections = 40
# A has shape (P, D): the transpose of the D x P matrix in the text, so that Z = X A
A = np.random.randn(n_features, n_projections) / np.sqrt(n_projections)
Z_train = X_train.dot(A)
print("samples/features in transformed set: %d, %d" % Z_train.shape)

# Plot Gram matrix ...
fig, axs = plt.subplots(1, 2, figsize = (12, 6))

# ... for the original set of samples ...
p0 = axs[0].imshow(X_train.dot(X_train.T), interpolation="nearest", vmin = -50, vmax = 50)
plt.colorbar(p0, ax=axs[0], shrink = 0.75)

# ... and for the transformed one.
p1 = axs[1].imshow(Z_train.dot(Z_train.T), interpolation="nearest", vmin = -50, vmax = 50)
plt.colorbar(p1, ax=axs[1], shrink = 0.75)
plt.tight_layout()
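
To complement the visual comparison, the agreement between the two Gram matrices can also be summarized by a single number. The sketch below uses the relative Frobenius error on synthetic standard-normal data standing in for a standardized `X_train`. The helper `gram_relative_error`, the data sizes and the list of $D$ values are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 64))    # stand-in for a standardized X_train

def gram_relative_error(X, n_projections, rng):
    """Relative Frobenius error between X X^T and Z Z^T for one random projection."""
    A = rng.standard_normal((X.shape[1], n_projections)) / np.sqrt(n_projections)
    Z = X @ A
    G = X @ X.T
    return np.linalg.norm(Z @ Z.T - G) / np.linalg.norm(G)

# One random draw per value of D
dims = [10, 40, 160, 640]
errors = [gram_relative_error(X_demo, d, rng) for d in dims]
for d, err in zip(dims, errors):
    print("D = %4d: relative error = %.3f" % (d, err))
```

On average the error is expected to shrink as $D$ grows, but each value comes from a single random draw, so small fluctuations between neighbouring values of $D$ are possible.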

Next, we want to see how the distances between points change as we vary the number of features in the transformed space.

**Exercise**: make a scatter plot of distances between samples in original and transformed space, for $D = 10$ and $D = 40$.

In [None]:
from scipy.spatial.distance import pdist

# Generate Z for a given value of D
def generate_random_features(X_train, n_projections):

# Compute distance matrix for both original and transformed sets of features
dists_orig =
dists_proj =

# Scatter plot of distances between samples in original/transformed spaces

In [None]:
# %load rks1.py
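
One possible way to approach the exercise is sketched below. It is not the content of `rks1.py`, which is not shown here. It runs on synthetic placeholder data so that it is self-contained; in the notebook, `X_demo` would be replaced by `X_train`. The extra `rng` argument, the dictionary `dists_by_dim` and the plotting details are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist

def generate_random_features(X, n_projections, rng):
    """Project the rows of X onto n_projections random Gaussian directions."""
    A = rng.standard_normal((X.shape[1], n_projections)) / np.sqrt(n_projections)
    return X @ A

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((100, 64))   # placeholder; use X_train in the notebook

dists_orig = pdist(X_demo)                # condensed vector of pairwise distances
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
dists_by_dim = {}
for ax, n_projections in zip(axs, [10, 40]):
    dists_proj = pdist(generate_random_features(X_demo, n_projections, rng))
    dists_by_dim[n_projections] = dists_proj
    ax.scatter(dists_orig, dists_proj, s=2, alpha=0.3)
    lims = [0, dists_orig.max()]
    ax.plot(lims, lims, "k--")            # perfect preservation falls on this line
    ax.set_xlabel("distance in original space")
    ax.set_ylabel("distance in transformed space")
    ax.set_title("D = %d" % n_projections)
plt.tight_layout()
```

The closer the cloud of points lies to the dashed diagonal, the better the distances are preserved.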

Now we can perform k-Nearest Neighbors classification in the original space and in the transformed space, and compare the two results.

In [None]:
from sklearn import neighbors

# Run the experiment for a range of values of k
ks = np.r_[np.arange(1, 10), np.arange(10, 150, 30)]
train_error_X = []
test_error_X = []
for k in ks:
    clf = neighbors.KNeighborsClassifier(k)
    clf.fit(X_train, y_train)
    train_error_X.append(1. - clf.score(X_train, y_train))
    test_error_X.append(1. - clf.score(X_test, y_test))
    print("k = %d; train error = %g, test error = %g" % (k, train_error_X[-1], test_error_X[-1]))

# Repeat the experiment using Z instead of X
Z_test = X_test.dot(A)
train_error_Z = []
test_error_Z = []
for k in ks:
    clf = neighbors.KNeighborsClassifier(k)
    clf.fit(Z_train, y_train)
    train_error_Z.append(1. - clf.score(Z_train, y_train))
    test_error_Z.append(1. - clf.score(Z_test, y_test))
    print("k = %d; train error = %g, test error = %g" % (k, train_error_Z[-1], test_error_Z[-1]))

# Plot error as a function of the degrees of freedom
plt.plot(len(y_train) / np.array(ks), train_error_X, "-o", label = "X_train")
plt.plot(len(y_train) / np.array(ks), test_error_X, "-o", label = "X_test")
plt.plot(len(y_train) / np.array(ks), train_error_Z, "-o", label = "Z_train")
plt.plot(len(y_train) / np.array(ks), test_error_Z, "-o", label = "Z_test")
plt.legend()
plt.xlabel(r"degrees of freedom $N / k$")
plt.ylabel("misclassification error")
plt.ylim((0., 0.15))

print("minimum test error orig./transf.: %g/%g" % (min(test_error_X), min(test_error_Z)))

How much did we gain or lose in terms of speed and accuracy when going from $P = 64$ to $D = 40$?
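
For the speed part of the question, one option is to time the Gram matrix computation directly. The sketch below does so on synthetic data with $P$ much larger than $D$, a regime where the difference should be easier to measure than with $P = 64$ and $D = 40$. The sizes, the helper `best_time` and the number of repeats are assumptions, and the timings depend on the machine and on the linear algebra library in use.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_projections = 1000, 2000, 100   # illustrative sizes
X_demo = rng.standard_normal((n_samples, n_features))

def best_time(func, repeats=3):
    """Smallest wall-clock time over a few runs of func()."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return min(times)

def gram_direct():
    """Gram matrix in the original space, O(N^2 P)."""
    return X_demo @ X_demo.T

def gram_projected():
    """Draw A, project, then compute the Gram matrix, O(N D P + N^2 D)."""
    A = rng.standard_normal((n_features, n_projections)) / np.sqrt(n_projections)
    Z = X_demo @ A
    return Z @ Z.T

t_direct = best_time(gram_direct)
t_projected = best_time(gram_projected)
print("direct: %.4f s, projected: %.4f s" % (t_direct, t_projected))
```

The projected timing includes drawing the random matrix, so it reflects the full cost of the reduced pipeline.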