# Citation classifier excite


GIT repo: `git clone git@nopro.be:james/excite.git` 

An implementation of an LSTM that classifies regions of a citation string as author, article title, journal, volume, year, pages, DOI, notes and several other classes. There are 13 classes in total.

## Preparing Data regions

The following function takes annotated citations and records, for each one, the class of every tagged region together with the character offset at which that region starts. Since neural networks operate only on numbers, it also collects an alphabet of every character seen, so that each character can later be mapped to an integer index.


In [2]:
import lxml.etree as ET
from collections import Counter

def prepare_data(citefile):

    #a list of all citations in training data
    citations = []
    #letters is essentially our alphabet used to map alphanumerical chars to integers
    letters = set()
    #this is a map of example classes found in the training data
    classes = Counter()

    with open(citefile) as f:

        for line in f.readlines():
            #add citation element to beginning and end of each example line so that we can parse into xml doc
            root = ET.XML("<citation>" + line.replace("&","&amp;") + "</citation>")
            cite = ""
            regions = []

            #iterate over child elements of our citation doc
            for el in root.iterchildren():
                #count how often each class occurs in the training data
                classes[el.tag] += 1
                #record the class and the offset at which its region starts
                regions.append( (el.tag, len(cite)) )
                #guard against empty elements, whose text is None
                cite += (el.text or "").replace("&amp;","&")

            #letters is a set so we just union against our new citation string to get unique chars used
            letters |= set(cite)
            citations.append( (cite, regions) )

    return citations, sorted(classes.keys()), sorted(letters)
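
Each region is stored only as a `(class, start offset)` pair, so a per-character labelling has to be derived from it. The following is a minimal sketch of that expansion, not part of the original code: the helper `char_labels` and the tagged example line are hypothetical, and the standard-library `xml.etree.ElementTree` stands in for `lxml` so that the sketch runs without extra dependencies.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml in this sketch

def char_labels(cite, regions):
    """Expand (tag, start_offset) regions into one class label per character."""
    labels = []
    for i, (tag, start) in enumerate(regions):
        # a region runs until the next region starts, or to the end of the string
        end = regions[i + 1][1] if i + 1 < len(regions) else len(cite)
        labels.extend([tag] * (end - start))
    return labels

# hypothetical tagged line, made up for illustration (not taken from the data file)
line = "<author>A. Smith.</author><title>On things.</title><date>1999</date>"
root = ET.fromstring("<citation>" + line + "</citation>")

# same region bookkeeping as in prepare_data
cite, regions = "", []
for el in root:
    regions.append((el.tag, len(cite)))
    cite += el.text or ""

labels = char_labels(cite, regions)
print(cite)
print(regions)
print(labels[0], labels[9], labels[-1])
```

With this layout, `labels[i]` is the class of character `cite[i]`, which is the form a character-level classifier would need as its training target.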

We can use this function to build both training and test sets. Here we load the tagged training examples included with the project.

In [3]:
#extract training data from examples
citations, classes, alphabet = prepare_data("excite/data/citeseerx.tagged.txt")

# Find out how many classes there were in the training data as a sanity check
print("Total classes found: {}".format(len(classes)))
print ("Classes: ", classes)
    


Total classes found: 13
Classes:  ['author', 'booktitle', 'date', 'editor', 'institution', 'journal', 'location', 'note', 'pages', 'publisher', 'tech', 'title', 'volume']


## Architecture of Network

Now that we have an idea of the number of classes we can start to plan network architecture.

### Network input

We slide a 5-character context window over the citation, advancing one character per time step. We need a context window function, $c(s,t)$, which, given an input citation $s$ and an offset $t$, creates the context window used as network input, $x$.

For example, the first two windows look like this:

In [4]:
#given a citation string that looks like this
citation = "this is an example"

x_t_1 = list(citation[0:5])

print ("x for t=1: ", x_t_1)

x_t_2 = list(citation[1:6])

print ("x for t=2: ", x_t_2)

x for t=1:  ['t', 'h', 'i', 's', ' ']
x for t=2:  ['h', 'i', 's', ' ', 'i']
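
The slicing above can be wrapped into the context window function $c(s,t)$. This is a minimal sketch under two assumptions that the text does not specify: $t$ is 1-based (matching the `t=1`, `t=2` labels above), and windows that run past the end of the citation are right-padded with spaces. The name `context_window` is hypothetical.

```python
def context_window(s, t, width=5, pad=" "):
    """Return the `width` characters of s starting at 1-based offset t.

    Windows that run past the end of s are right-padded with `pad`
    (the padding character is an assumption of this sketch).
    """
    start = t - 1
    window = s[start:start + width]
    return list(window.ljust(width, pad))

citation = "this is an example"
print(context_window(citation, 1))
print(context_window(citation, 2))
print(context_window(citation, len(citation)))
```

Padding keeps every window the same width, so the network input has a fixed size even for the last characters of a citation.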


Neural networks operate only on numbers, so strings must be encoded before they can be passed in. We therefore use the alphabet collected above to map each character of $x$ to its integer index, which is a more RNN-friendly representation.

In [5]:
import numpy as np

def amap(letters):
    return np.array([ alphabet.index(x) for x in letters ])

print( "Encoded x for t=1", amap(x_t_1) )
print( "Encoded x for t=2", amap(x_t_2) )

Encoded x for t=1 [73 61 62 72  0]
Encoded x for t=2 [61 62 72  0 62]
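
Looking characters up with `alphabet.index` raises a `ValueError` for any character that did not occur in the training data, which can happen once test citations are encoded. One possible way around this, sketched here as an assumption rather than as part of the original design, is a dictionary lookup with one extra index reserved for unknown characters. The names `build_index` and `encode` and the toy alphabet are hypothetical.

```python
def build_index(alphabet):
    """Map each character to its position; reserve one extra index for unknowns."""
    char_to_index = {ch: i for i, ch in enumerate(alphabet)}
    unknown_index = len(alphabet)
    return char_to_index, unknown_index

def encode(letters, char_to_index, unknown_index):
    """Encode characters as integers, falling back to unknown_index."""
    return [char_to_index.get(ch, unknown_index) for ch in letters]

# toy alphabet built from a single string, for illustration only
toy_alphabet = sorted(set("this is an example"))
char_to_index, unknown_index = build_index(toy_alphabet)

print(encode(list("this "), char_to_index, unknown_index))
print(encode(list("xyz"), char_to_index, unknown_index))
```

The returned list can be wrapped in `np.array(...)` to match the output of `amap`. The dictionary also makes each lookup constant-time instead of a scan through the alphabet.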


    
### Network output
    

The output is a vector with one value per class, each between 0 and 1.

The aim is to use backpropagation through time to train the network so that, for each input, the output approaches a target vector that is zero everywhere except at the index of the correct class.
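
A target vector of this kind can be sketched as follows. The helper `one_hot` is a hypothetical name, and the class list is copied from the output printed earlier.

```python
def one_hot(label, classes):
    """Target vector: 1.0 at the index of `label`, 0.0 elsewhere."""
    target = [0.0] * len(classes)
    target[classes.index(label)] = 1.0
    return target

classes = ['author', 'booktitle', 'date', 'editor', 'institution', 'journal',
           'location', 'note', 'pages', 'publisher', 'tech', 'title', 'volume']

print(one_hot('date', classes))
```

The position of the single non-zero entry is the index of the class in the sorted class list, so `'date'` sets the third entry.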

In [6]:
import theano
import theano.tensor as T


#we know the input context window is 5 
n_in = 5
#the output is the number of classes (currently 13)
n_out = len(classes)

#hidden units somewhere between input and output - we'll try 10 for now 
n_hidden = 10

# we set up the various layers of the network 

#the input layer is just a vector - yes it is 5 wide but we don't care about this yet
L_in = T.vector('x_in')

# States is a vector of memory unit values
S_lstm = T.vector('S_lstm')

#output layer is a vector of values which will eventually be len(classes) wide
L_out = T.vector('y_out')

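# set up all the weights - these are matrices holding the weights of the connections between layers
# W[l,m] is the weight of connection from unit m to unit l

#allocate each matrix with small random values (the initialisation range is a placeholder assumption)
def init_weights(rows, cols):
    return np.random.uniform(-0.1, 0.1, (rows, cols)).astype(theano.config.floatX)

#weights for hidden layer internal connections 
W_hh = theano.shared(init_weights(n_hidden, n_hidden))
#weights for input to hidden connections
W_hi  = theano.shared(init_weights(n_hidden, n_in))
#weights for hidden to out
W_oh  = theano.shared(init_weights(n_out, n_hidden))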



## Forward Propagation

Each time the sequence advances (or each time $t$ goes up by 1) we have to forward propagate the input $x$ through the network and calculate output. Here we define what that looks like.

In [7]:

#input to the LSTM units: a zero vector with one entry per hidden unit
z_lstm = theano.shared( np.zeros(n_hidden, dtype=theano.config.floatX) )


def forward_propagate(x):
    
    # x should be an input vector n_in wide
    pass
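
To make the weight convention `W[l,m]` concrete, here is a minimal sketch of one forward step. It is deliberately a plain recurrent layer, not the LSTM this notebook is building towards, and it is written in plain Python rather than Theano so the arithmetic is visible. The `tanh` hidden activation, the logistic output activation, the helper names and the toy weights are all assumptions made for illustration.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v; W[l][m] weights unit m -> unit l."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward_step(x, h_prev, W_hi, W_hh, W_oh):
    """One time step of a plain recurrent layer (a stand-in, not an LSTM)."""
    # hidden net input: contribution of the new input plus the recycled hidden state
    z_h = [a + b for a, b in zip(matvec(W_hi, x), matvec(W_hh, h_prev))]
    h = [math.tanh(z) for z in z_h]
    # output values squashed into (0, 1), one per class
    y = [1.0 / (1.0 + math.exp(-z)) for z in matvec(W_oh, h)]
    return h, y

# toy sizes and weights, chosen only to keep the example small
n_in, n_hidden, n_out = 5, 3, 2
W_hi = [[0.1] * n_in for _ in range(n_hidden)]
W_hh = [[0.0] * n_hidden for _ in range(n_hidden)]
W_oh = [[0.5] * n_hidden for _ in range(n_out)]

h0 = [0.0] * n_hidden
h1, y1 = forward_step([1.0, 0.0, 0.0, 0.0, 0.0], h0, W_hi, W_hh, W_oh)
print(len(h1), len(y1))
```

The hidden state returned at one step is passed back in as `h_prev` at the next, which is what "recycled inputs" refers to in a recurrent network.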


## Single step

Now that we have the weights defined, we can define what a single step at time $t$ involves: forward propagating the new (and recycled) inputs and then backpropagating the error signals.

Here we define a Python function, `step`, that carries out these processes.


In [8]:

def step(x,y):
    # not implemented yet: forward propagate x, then backpropagate the error against y
    pass

Forget gate value:

$$y_{\varphi_j}(t)= f_{\varphi_j}(z_{\varphi_j}(t))$$

$$ z_{\varphi_j}(t)=\sum_m w_{\varphi_j m} y_m (t-1)$$

Initial cell state is zero 

$$ s_{c^v_j} (0) = 0 $$

For $t > 0$ it is calculated as:

$$s_{c^v_j} (t) =  y_{\varphi_j}(t)s_{c^v_j} (t-1) + y_{in_j}(t) g(z_{c^v_j}(t))  $$

Note that the input gate activation $y_{in_j}(t)$ and the squashed cell input $g(z_{c^v_j}(t))$ enter this value as a product.

The previous state value $s_{c^v_j}(t-1)$ remains inside the CEC (constant error carousel) and carries over into $s_{c^v_j}(t)$ provided that $y_{\varphi_j}(t) \approx 1$.
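
The state update can be checked numerically for a single memory cell. The sketch below is written in plain Python under stated assumptions: the gate function $f$ is taken to be the logistic sigmoid, $g$ is taken to be `tanh`, and the helper names and weight values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def net_input(weights, y_prev):
    """z(t) = sum_m w_m * y_m(t-1)"""
    return sum(w * y for w, y in zip(weights, y_prev))

def cell_state_step(s_prev, z_forget, z_in, z_cell):
    """s(t) = y_forget(t) * s(t-1) + y_in(t) * g(z_cell(t)) for one memory cell."""
    y_forget = sigmoid(z_forget)   # forget gate activation (sigmoid is an assumption)
    y_in = sigmoid(z_in)           # input gate activation (sigmoid is an assumption)
    g = math.tanh(z_cell)          # squashed cell input (tanh is an assumption)
    return y_forget * s_prev + y_in * g

# forget gate net input as a weighted sum of the previous outputs (toy values)
z_forget = net_input([5.0, 5.0], [1.0, 1.0])

s = 0.0  # initial cell state s(0) = 0
s = cell_state_step(s, z_forget, z_in=10.0, z_cell=1.0)

# forget gate open (close to 1), input gate closed: the state is carried over
s_kept = cell_state_step(s, z_forget=10.0, z_in=-10.0, z_cell=1.0)
# forget gate closed (close to 0), input gate closed: the state is reset
s_reset = cell_state_step(s, z_forget=-10.0, z_in=-10.0, z_cell=1.0)

print(s, s_kept, s_reset)
```

With the forget gate near 1 the state stays almost unchanged, and with the forget gate near 0 it collapses towards zero, which is the behaviour the equations above describe.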