# Extreme Learning Machine

Let's review the different ways we have optimized the same problem, classification of MNIST data, by comparing the objective function of each approach.

Linear Least Squares - 2 norm

\begin{gather}
E(\textbf{C}_{\text{obs}}, \textbf{W})=\Vert \textbf{WY} - \textbf{C}_{\text{obs}} \Vert^2
\end{gather}

Linear Classification Softmax - Cross Entropy

\begin{gather}
E(\textbf{C}_{\text{obs}}, \textbf{W})=-\frac{1}{n}\textbf{e}^\top_{n_c}(\textbf{C}_{\text{obs}} \odot \textbf{WY}) \textbf{e}_n + 
\frac{1}{n}\log(\textbf{e}^\top_{n_c}\exp(\textbf{WY}))\textbf{e}_n
\end{gather}

Single-layer feedforward neural network - regression

\begin{gather}
E(\textbf{C}_{\text{obs}}, \textbf{W}, \textbf{K})=\Vert \textbf{W}\sigma(\textbf{KY})-\textbf{C}_{\text{obs}} \Vert^2
\end{gather}

Single-layer feedforward neural network - classification

\begin{gather}
E(\textbf{C}_{\text{obs}}, \textbf{W}, \textbf{K})=-\frac{1}{n}\textbf{e}^\top_{n_c}(\textbf{C}_{\text{obs}} \odot \textbf{W}\sigma(\textbf{KY})) \textbf{e}_n + 
\frac{1}{n}\log(\textbf{e}^\top_{n_c}\exp(\textbf{W}\sigma(\textbf{KY})))\textbf{e}_n
\end{gather}
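The objectives above translate almost term by term into NumPy. The following is a minimal sketch, not the notebook's own code: the function names `least_squares` and `softmax_cross_entropy`, the seeded generator, and the one-hot labels are assumptions made for illustration. The log-sum-exp term is shifted by the column maximum for numerical stability, which leaves its value unchanged.

```python
import numpy as np

def least_squares(W, Y, C_obs):
    """Squared 2-norm (Frobenius) misfit ||W Y - C_obs||^2."""
    return np.linalg.norm(W @ Y - C_obs) ** 2

def softmax_cross_entropy(W, Y, C_obs):
    """Softmax cross-entropy objective, averaged over the n examples."""
    n = Y.shape[1]
    Z = W @ Y                                   # scores, shape (n_c, n)
    linear_term = -np.sum(C_obs * Z) / n        # -(1/n) e^T (C_obs ⊙ WY) e
    # log(e^T exp(Z)) per column, shifted by the column max for stability
    z_max = Z.max(axis=0)
    log_sum_exp = z_max + np.log(np.sum(np.exp(Z - z_max), axis=0))
    return linear_term + np.sum(log_sum_exp) / n

# Hypothetical toy data: n examples, n_f features, n_c one-hot classes
rng = np.random.default_rng(0)
n_c, n_f, n = 3, 2, 100
Y = rng.normal(size=(n_f, n))
labels = rng.integers(0, n_c, size=n)
C_obs = np.eye(n_c)[:, labels]                  # one-hot columns, (n_c, n)

W0 = np.zeros((n_c, n_f))
print(least_squares(W0, Y, C_obs))              # equals n for one-hot labels
print(softmax_cross_entropy(W0, Y, C_obs))      # equals log(n_c) for zero weights
```

The single-layer objectives reuse the same functions: pass $\sigma(\textbf{KY})$, for example `np.tanh(K @ Y)`, in place of `Y`, with `W` sized to match the number of hidden nodes.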

Comparing the single-layer networks with the linear models, the only difference is that the input feature matrix $\textbf{Y}$ is replaced by $\sigma(\textbf{KY})$. The matrix $\textbf{K}$ expands the features to a higher dimension, and the elementwise nonlinear activation $\sigma$ then raises the rank of the result; without $\sigma$, the product $\textbf{KY}$ can have no higher rank than $\textbf{Y}$ itself. Seen this way, the hidden layer raises the rank of the feature matrix so that the linear problem for $\textbf{W}$, which is often underdetermined, becomes better posed.

The idea of Extreme Learning Machines is to draw $\textbf{K}$ at random and keep it fixed. The remaining problem in $\textbf{W}$ is then exactly the linear regression or classification problem above, which is convex, so we can solve it with the existing methods instead of resorting to iterative training with backpropagation to find both $\textbf{W}$ and $\textbf{K}$. This removes the optimization over $\textbf{K}$ and lets us focus on $\textbf{W}$.
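A small numerical check can make the rank argument concrete. The sketch below assumes the same shapes as the cells in this notebook and uses a seeded generator for reproducibility (an assumption, not part of the original). It compares the numerical rank of $\textbf{KY}$ with that of $\tanh(\textbf{KY})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_f, n, m = 2, 100, 50

Y = rng.normal(size=(n_f, n))    # input features, (n_f, n)
K = rng.uniform(size=(m, n_f))   # random, fixed hidden weights, (m, n_f)

# A purely linear expansion cannot exceed the rank of Y (at most n_f)
rank_linear = np.linalg.matrix_rank(K @ Y)
# The elementwise nonlinearity breaks that low-rank structure
rank_nonlinear = np.linalg.matrix_rank(np.tanh(K @ Y))

print(rank_linear, rank_nonlinear)
```

The exact numerical rank after the activation depends on the random draw and on the tolerance used by `np.linalg.matrix_rank`, so it is best read as a qualitative comparison.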

In [33]:
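import numpy as np

n_c = 3    # number of output classes
n_f = 2    # number of input features
n = 100    # number of examples
m = 50     # number of hidden nodes

# Random features and random targets, used only to exercise the code
Y = np.random.normal(size = (n_f, n))
C = np.random.uniform(size = (n_c, n))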

In [67]:
def ELM(Y, C, m):
    """
    Implementation of the Extreme Learning Machine Regression. So easy!
    
    Parameters
        Y -- input feature matrix (n_f x n)
        C -- output label matrix (n_c x n)
        m -- number of hidden nodes
        
    Returns
        W -- weight matrix (n_c x m)
        S -- transformed feature matrix (m x n)
    """
    n_f, n = Y.shape
    n_c = C.shape[0]
    
    # Add bias vector
    b = np.ones((1, n))
    Y = np.vstack((Y, b))
    
    # Randomize the weights and biases from input layer to hidden layer
    K = np.random.uniform(size = (m, n_f + 1))
    # Elementwise activation function to increase rank
    S = np.tanh(K @ Y)
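    # Solve S.T @ W.T = C.T for W.T in the least-squares sense.
    # rcond=-1 keeps NumPy's legacy singular-value cutoff and avoids the
    # FutureWarning that NumPy 1.x raises when rcond is omitted
    W = np.linalg.lstsq(S.T, C.T, rcond=-1)[0]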
    
    return W.T, S

In [68]:
# Use 100 hidden nodes (equal to n), not the m = 50 defined earlier
W, S = ELM(Y, C, 100)



In [69]:
np.linalg.norm(W @ S - C)

0.0002656694159921141

In [70]:
# Plain linear least squares on the raw features (no bias row), for comparison
D = np.linalg.lstsq(Y.T, C.T, rcond=-1)[0]


In [71]:
np.linalg.norm(D.T @ Y - C)

10.040964175188037
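Note that both residuals above are measured on the training data, and the ELM call uses as many hidden nodes as there are examples, so $\textbf{S}$ is square and the solve can reproduce even these random targets almost exactly. To see how a fit carries over to new data, one option is to keep $\textbf{K}$ and apply it to held-out points. The following is a minimal sketch under that assumption: `elm_fit` and `elm_predict` are hypothetical helpers, not part of the notebook, and the held-out set is simply a second random draw.

```python
import numpy as np

def elm_fit(Y, C, m, rng):
    """Fit an ELM regressor; returns output weights W and the random K."""
    n_f, n = Y.shape
    Y_b = np.vstack((Y, np.ones((1, n))))      # append a bias row
    K = rng.uniform(size=(m, n_f + 1))         # random, fixed hidden weights
    S = np.tanh(K @ Y_b)                       # hidden features, (m, n)
    # rcond=None uses NumPy's current default singular-value cutoff
    W = np.linalg.lstsq(S.T, C.T, rcond=None)[0].T
    return W, K

def elm_predict(W, K, Y):
    """Apply a fitted ELM to new inputs using the stored K."""
    Y_b = np.vstack((Y, np.ones((1, Y.shape[1]))))
    return W @ np.tanh(K @ Y_b)

rng = np.random.default_rng(0)
n_c, n_f, n = 3, 2, 100

# Hypothetical training and held-out sets drawn from the same distributions
Y_train = rng.normal(size=(n_f, n))
C_train = rng.uniform(size=(n_c, n))
Y_test = rng.normal(size=(n_f, n))
C_test = rng.uniform(size=(n_c, n))

W, K = elm_fit(Y_train, C_train, m=100, rng=rng)
train_res = np.linalg.norm(elm_predict(W, K, Y_train) - C_train)
test_res = np.linalg.norm(elm_predict(W, K, Y_test) - C_test)
print(train_res, test_res)
```

Because the targets here are pure noise, there is nothing for the model to generalize, so the held-out residual should be expected to exceed the training residual. On real data such as MNIST, the same comparison indicates whether the chosen number of hidden nodes overfits.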