## Implementing the neural probabilistic language model

### FNN architecture

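The model is a feedforward neural network (FNN) over a vocabulary $V$, with the following settings:

* $n$: the context size, counting the word to predict, so the network sees the $n-1$ preceding words
* $m$: the number of features associated with each word (e.g. with $m = 100$, each word is represented by a vector of size 100)
* $C$: the word feature matrix, of size $|V|\times m$; $C(w)$ is the feature vector of word $w$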

$$y = b + Wx + U\tanh(d + Hx)$$

Where:

* $x = (C(w_{t-1}), C(w_{t-2}), \ldots, C(w_{t-n+1}))$ is the concatenation of the context word vectors, a vector of size $m\times(n-1)$
* $h$ is the number of hidden units
* $H$ is the weight matrix of the first dense layer, with $h$ rows and $m\times(n-1)$ columns
* $d$ is the bias of the first dense layer, a vector of size $h$
* $U$ is the weight matrix of the second dense layer, with $|V|$ rows and $h$ columns
* $W$ is a dense direct connection from $x$ to the output, with $|V|$ rows and $m\times(n-1)$ columns **(can be equal to zero)**
* $b$ is the output bias, a vector of size $|V|$
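The formula above can be sketched directly in NumPy. This is a minimal illustration, not the `nplm` implementation: the function name `nplm_forward`, the random initialisation and the sizes ($|V| = 5$, $m = 2$, $n = 5$, $h = 20$) are assumptions chosen for the example.

```python
import numpy as np

def nplm_forward(context_ids, C, H, d, U, b, W=None):
    """Compute y = b + W x + U tanh(d + H x) for one context (hypothetical helper)."""
    # x: concatenation of the feature vectors of the n-1 context words
    x = C[context_ids].reshape(-1)
    hidden = np.tanh(d + H @ x)
    y = b + U @ hidden
    if W is not None:  # optional direct connection from x to the output
        y = y + W @ x
    return y

rng = np.random.default_rng(0)
V, m, n, h = 5, 2, 5, 20            # illustrative sizes
C = rng.normal(size=(V, m))         # word feature matrix
H = rng.normal(size=(h, m * (n - 1)))
d = rng.normal(size=h)
U = rng.normal(size=(V, h))
b = rng.normal(size=V)

y = nplm_forward([2, 1, 4, 0], C, H, d, U, b)
print(y.shape)  # one unnormalised score per vocabulary word
```

Passing `W=None` corresponds to the case where the direct connection is set to zero.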


The total number of parameters, counting $b$, $W$, $U$, $C$, $d$ and $H$, is

$$|V|(1 + nm + h) + h(1 + (n - 1)m)$$

### Input data

For $n = 4$, the data set is a list of tuples of 4 word indices:

$$D = [(2, 10, 3, 5), (8, 30, 2, 20), \ldots]$$
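One common way to obtain tuples of this kind is to slide a window of length $n$ over a sequence of word indices. The sketch below assumes that construction; the helper name `make_ngrams` and the sample sequence are hypothetical and do not come from `nplm`.

```python
def make_ngrams(word_ids, n):
    """Return every window of n consecutive word indices (hypothetical helper)."""
    return [tuple(word_ids[i:i + n]) for i in range(len(word_ids) - n + 1)]

# Illustrative sequence of word indices
corpus = [2, 10, 3, 5, 8]
D = make_ngrams(corpus, n=4)
print(D)
```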

In [1]:
import numpy as np

## Creating the neural network


In [2]:
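np.random.seed(42)
from nplm import neurnetmodel as Neur

nb_features = 2   # m: size of each word vector
dict_size = 5     # |V|: vocabulary size
context_size = 5  # n: the network reads context_size - 1 = 4 words
h = 20            # number of hidden units

# Word projection, concatenation, tanh hidden layer, output layer (no direct W term)
N = Neur.Network([Neur.ProjectVectors(dict_size, nb_features),
                  Neur.ConcatProjections(),
                  Neur.Dense(nb_features * (context_size - 1), h),
                  Neur.Tanh(),
                  Neur.Dense(h, dict_size)])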

In [3]:
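# Each example is a one-hot matrix with one column per context word
X = np.array([[0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0]])
X = X.T                  # shape (dict_size, context_size - 1) = (5, 4)
X = np.array([X, X, X])  # batch of 3 identical examples, shape (3, 5, 4)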

In [4]:
X

array([[[0, 0, 0, 1],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0]],

       [[0, 0, 0, 1],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0]],

       [[0, 0, 0, 1],
        [0, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 0]]])
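Each column of an example is the one-hot encoding of one context word (here the words 2, 1, 4 and 0). Multiplying such a matrix by the word feature matrix selects the corresponding feature vectors, which is presumably what the projection layer does; that reading of `ProjectVectors` is an assumption, and the matrix `C` below is a random stand-in.

```python
import numpy as np

dict_size, nb_features = 5, 2
indices = np.array([2, 1, 4, 0])        # context words of one example

# One-hot columns, same layout as one slice of X above: shape (5, 4)
onehot = np.eye(dict_size, dtype=int)[indices].T

rng = np.random.default_rng(0)
C = rng.normal(size=(dict_size, nb_features))  # stand-in feature matrix

projected = C.T @ onehot                # shape (nb_features, 4)
lookup = C[indices].T                   # same result by direct row lookup
print(np.allclose(projected, lookup))
```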

In [5]:
N.forward(X)

array([[-3.60815078, -3.60815078, -3.60815078],
       [ 3.70375656,  3.70375656,  3.70375656],
       [-0.4153328 , -0.4153328 , -0.4153328 ],
       [ 2.4245358 ,  2.4245358 ,  2.4245358 ],
       [-0.38709004, -0.38709004, -0.38709004]])

In [6]:
# One-hot targets, one column per example: shape (dict_size, 3)
Y = np.array([[0, 0, 1, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1]])
Y = Y.T
# Append the loss layer so that forward() returns a scalar
N_a = Neur.Network([N, Neur.Ilogit_and_KL(Y)])

In [7]:
N_a.forward(X)

9.024379019014331
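The scalar above can be checked by hand under one assumption about `Ilogit_and_KL`: that it applies a softmax to each output column and sums the KL divergence to the one-hot targets over the batch. With one-hot targets the KL divergence reduces to the cross-entropy. The sketch below uses the scores printed by `N.forward(X)`; the function name `softmax_cross_entropy` is hypothetical.

```python
import numpy as np

def softmax_cross_entropy(logits, Y):
    """Sum over the batch of -log softmax(logits)[target]; columns are examples."""
    shifted = logits - logits.max(axis=0, keepdims=True)   # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    return -(Y * log_probs).sum()

# Scores printed by N.forward(X): the same column for all 3 examples
column = np.array([-3.60815078, 3.70375656, -0.4153328, 2.4245358, -0.38709004])
logits = np.tile(column[:, None], (1, 3))

# One-hot targets for the words 2, 1 and 4, one column per example
Y = np.eye(5)[[2, 1, 4]].T

loss = softmax_cross_entropy(logits, Y)
print(loss)
```

Under that assumption the result comes out close to the loss printed by `N_a.forward(X)`.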

In [8]:
gradient = N_a.backward(None)[0]
gradient[:20]

array([ 7.49916018e-01, -3.41331596e+00, -5.78717943e+00,  0.00000000e+00,
        4.36670815e+00, -3.66315494e+00,  2.40974064e+00,  4.43578581e+00,
        0.00000000e+00, -5.11369573e+00, -7.53203803e-05, -8.92457906e-05,
        1.60789008e-05, -1.83648317e-04,  2.72299418e-05, -6.30948772e-05,
       -5.77634105e-05,  2.72280326e-05,  2.14671423e-03,  2.54360385e-03])

In [9]:
print(len(gradient))

295
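The length of the gradient can be compared with the parameter count formula. The network built above has no direct connection $W$, so the $|V|(n-1)m$ term is dropped. The helper `nplm_param_count` below is a hypothetical name used only for this check.

```python
def nplm_param_count(V, m, n, h, direct=True):
    """Number of parameters of the model (hypothetical helper).

    Counts b (V), U (V*h), C (V*m), d (h), H (h*(n-1)*m),
    plus W (V*(n-1)*m) when the direct connection is used.
    """
    count = V * (1 + m + h) + h * (1 + (n - 1) * m)
    if direct:
        count += V * (n - 1) * m
    return count

# Settings of the notebook: dict_size=5, nb_features=2, context_size=5, h=20
print(nplm_param_count(5, 2, 5, 20, direct=False))
print(nplm_param_count(5, 2, 5, 20, direct=True))
```

Layer by layer, the count without $W$ is 10 for the word features, 180 for the first dense layer and 105 for the second.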


### Verify the gradient

In [10]:
np.random.seed(13)

In [11]:
for i in range(20):
    # Set a random parameter vector
    n = N_a.nb_params
    theta = np.random.random(n)
    N_a.set_params(theta)
    # Get a random direction
    d = np.random.random(n)
    d = d / np.linalg.norm(d)
    # Compare the analytic and finite-difference directional derivatives along d
    theor_deriv = np.dot(N_a.backward(None)[0], d)
    h = 1e-10
    Ftheta = N_a.forward(X)
    N_a.set_params(theta + h*d)
    Ftheta_plus_hd = N_a.forward(X)
    num_deriv = (Ftheta_plus_hd - Ftheta) / h
    print(np.linalg.norm(num_deriv - theor_deriv))

0.500292991360365
0.2976649653220803
0.08257019942768073
0.14523603725477274
0.048913994990833654
0.10963832982273058
0.04405780644131758
0.2169164561448086
0.027237449276053788
0.01493925806953722
0.22115014961248058
0.058307322615911095
0.1224352379263049
0.27541666263628917
0.3580144371413932
0.7340034589913387
0.541896530713164
0.10483467707414681
0.018977081812186186
0.0005117745557910169
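Differences of this size are large for a gradient check. Two things in the loop may contribute, although both are assumptions about how `nplm` works internally: the step `h = 1e-10` is small enough for floating-point round-off to matter in a forward difference, and `backward` is called before `forward` has been run with the new parameters. The self-contained sketch below does not use `nplm`; it checks the gradient of a small softmax model (the names `loss_and_grad` and `errors` are hypothetical) and compares a forward difference with a central difference for two step sizes.

```python
import numpy as np

def loss_and_grad(theta, x, k, num_classes):
    """Cross-entropy of a linear softmax model and its gradient w.r.t. theta."""
    W = theta.reshape(num_classes, x.size)
    z = W @ x
    z = z - z.max()                       # for numerical stability
    log_p = z - np.log(np.exp(z).sum())
    loss = -log_p[k]
    dz = np.exp(log_p)
    dz[k] -= 1.0                          # dL/dz = softmax(z) - onehot(k)
    return loss, np.outer(dz, x).ravel()

rng = np.random.default_rng(13)
x = rng.random(4)
theta = rng.random(3 * 4)
direction = rng.random(theta.size)
direction /= np.linalg.norm(direction)

def loss(t):
    return loss_and_grad(t, x, 1, 3)[0]

# Analytic directional derivative, evaluated at theta itself
analytic = loss_and_grad(theta, x, 1, 3)[1] @ direction

errors = {}
for eps in (1e-10, 1e-5):
    forward = (loss(theta + eps * direction) - loss(theta)) / eps
    central = (loss(theta + eps * direction) - loss(theta - eps * direction)) / (2 * eps)
    errors[eps] = (abs(forward - analytic), abs(central - analytic))
    print(eps, errors[eps])
```

A central difference with a moderate step is usually the more reliable reference, because its truncation error shrinks with the square of the step while the step stays well above round-off.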
