&copy;Copyright for [Shuang Wu] [2017]<br>
Cited from the [Coursera] course named [Neural Networks for Machine Learning] from [University of Toronto]<br>
Learning notes<br>

# Learning to predict the next word

## Simple e.g. of relational info.

* ![img41](imgs/img41.jpg)

## Another way to express same info.

* Make set of propositions using 12 relationships:
    * son, daughter, nephew, niece, father, mother, uncle, aunt
    * brother, sister, husband, wife
* colin has-father james
* colin has-mother victoria
* james has-wife victoria
    * this follows from the two above
    
## A relational learning task

* Given a large set of triples that come from some family trees, figure out the regularities
    * obvious way to express the regularities is as symbolic rules
        * (x has-mother y) & (y has-husband z) => (x has-father z)
* Finding symbolic rules involves a difficult search through a very large discrete space of possibilities
* Can a NN capture the same knowledge by searching through a continuous space of weights?
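
To make the symbolic-rule alternative concrete, here is a minimal sketch that applies the single rule quoted above to a set of triples. The fact `victoria has-husband james` is an assumption added for illustration (the notes list the inverse, `james has-wife victoria`), and the function name `infer_fathers` is hypothetical.

```python
# Facts stored as (person1, relationship, person2) triples.
# "victoria has-husband james" is an assumed inverse of "james has-wife victoria".
facts = {
    ("colin", "has-mother", "victoria"),
    ("victoria", "has-husband", "james"),
}


def infer_fathers(triples):
    """Apply the rule (x has-mother y) & (y has-husband z) => (x has-father z)."""
    derived = set()
    for x, rel1, y in triples:
        if rel1 != "has-mother":
            continue
        for y2, rel2, z in triples:
            if rel2 == "has-husband" and y2 == y:
                derived.add((x, "has-father", z))
    return derived


derived_facts = infer_fathers(facts)
print(derived_facts)
```

Even this one rule needs a nested scan over the facts. Discovering which rules hold, rather than applying a known one, is the hard discrete search the notes refer to.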

## Str. of the NN

* ![img42](imgs/img42.jpg)
* ![img43](imgs/img43.jpg)

## What the net learns

* The 6 hidden units in the bottleneck connected to the input representation of person 1 learn to represent features of people that are useful for predicting the answer
    * Nationality, generation, branch of the family tree
* These features are only useful if the other bottlenecks use similar representations and the central layer learns how features predict other features. E.g.
    * input person is of generation 3 and
    * relationship requires answer to be one generation up implies
    * output person is of generation 2
    
## Another way to see that it works

* Train net on all but 4 of the triples that can be made using the 12 relationships
    * needs to sweep through the training set many times adjusting the weights slightly each time
* Test it on the 4 held-out cases
    * 3/4 correct
    * good for 24-way choice
    * On much bigger datasets can train on a much smaller fraction of the data
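
To see why 3/4 is good for a 24-way choice, the sketch below computes how likely it is to get at least 3 of the 4 held-out cases right by pure guessing. It assumes each guess is independent and uniform over the 24 people, which is a simplification introduced here, not something stated in the lecture.

```python
from fractions import Fraction
from math import comb


def prob_at_least(k, n, p):
    """Probability of at least k successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


chance_per_case = Fraction(1, 24)  # uniform guess among 24 people (assumption)
p_lucky = prob_at_least(3, 4, chance_per_case)
print(p_lucky, float(p_lucky))
```

Under these assumptions the probability is well below one in a thousand, so the result is very unlikely to be luck.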
    
## Large-scale e.g.

* For a database of millions of relational facts of the form (A R B)
    * Could train a net to discover feature vector representations of the terms that allow the third term to be predicted from the first 2
    * Could then use the trained net to find very unlikely triples, which are good candidates for errors in the database
* Instead of predicting the third term, could use all 3 terms as input and predict the probability that the fact is correct
    * To train such a net, need a good source of false facts
    

# A brief diversion into cognitive science

## The family trees e.g. tells us about these concepts

* There has been a long debate in cognitive science between two rival theories of what it means to have a concept:
    * Feature theory: Concept is a set of semantic features
        * Good for explaining similarities b/w concepts
        * Convenient: a concept is a vector of feature activities
    * Structuralist theory: The meaning of a concept lies in its relationships to other concepts
        * So conceptual knowledge is best expressed as a relational graph
        * Minsky used the limitations of perceptrons as evidence against feature vectors and in favor of relational graph representations
        
## Both sides are wrong

* These 2 theories need not be rivals. A NN can use vectors of semantic features to implement a relational graph
    * In the NN that learns family trees, no **explicit** inference is required to arrive at the intuitively obvious consequences of the facts that have been explicitly learned
    * The net can "intuit" the answer in a forward pass
* We may use explicit rules for conscious, deliberate reasoning, but do a lot of commonsense, analogical reasoning by just "seeing" the answer w/ no conscious intervening steps
    * Even when using explicit rules, need to just see which rules to apply
    
## Localist and distributed representations of concepts

* Obvious way to implement a relational graph in a NN is to treat a neuron as a node in the graph and a connection as a binary relationship. But this "localist" method will not work:
    * need many different types of relationship and the connections in a NN do not have discrete labels
    * need ternary relationships as well as binary ones. (e.g. A is b/w B and C)
* Right way to implement relational knowledge in a NN is still an open issue
    * But many neurons are probably used for each concept and each neuron is probably involved in many concepts. This is called a "distributed representation"

# Another diversion: Softmax output func.

## Problems w/ squared error

* Drawbacks for squared error measure
    * If desired output is 1 and the actual is 0.00000001, there is almost no gradient for a logistic unit to fix up the error
    * If trying to assign probabilities to mutually exclusive class labels, know that the outputs should sum to 1, but squared error deprives the network of this knowledge
* Different cost func that works better:
    * Force the outputs to represent a probability distribution across discrete alternatives
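
The first drawback can be checked numerically. This sketch assumes the squared error is written as E = 0.5 * (y - t)^2 for a single logistic unit (the factor 0.5 is a convention chosen here), so that dE/dz = (y - t) * y * (1 - y).

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error_grad(z, t):
    """dE/dz for E = 0.5 * (y - t)**2 with y = logistic(z)."""
    y = logistic(z)
    return (y - t) * y * (1.0 - y)


# Choose the logit so that the output is about 0.00000001 while the target is 1.
z = math.log(1e-8 / (1.0 - 1e-8))
grad = squared_error_grad(z, 1.0)
print(logistic(z), grad)
```

The error (y - t) is close to -1, but the factor y * (1 - y) is about 1e-8, so the gradient is of that tiny size as well.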
    
## Softmax

$$y_i=\frac{e^{z_i}}{\sum_{j\in group}e^{z_j}}$$

$$\frac{\partial y_i}{\partial z_i}=y_i(1-y_i)$$

* The output units in a softmax group use a non-local non-linearity:
    * ![img44](imgs/img44.jpg)
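
A minimal sketch of the softmax formula above in plain Python. Subtracting the largest logit before exponentiating is a standard numerical-stability trick added here. It does not change the result, because the common factor cancels in the ratio.

```python
import math


def softmax(logits):
    """y_i = exp(z_i) / sum_j exp(z_j), computed in a numerically stable way."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


y_before = softmax([2.0, 1.0, 0.1])
# Raising only the first logit changes every output: the non-linearity is non-local.
y_after = softmax([3.0, 1.0, 0.1])
print(y_before)
print(y_after)
```

The outputs always sum to 1, and increasing one logit lowers the probabilities of all the other units in the group.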
    
## Cross-entropy: the right cost func. to use w/ softmax

* The right cost func. is the negative log probability of the right answer
    * $$C = -\sum_j t_j \log y_j$$
        * $t_j$ is the target value
* C has a very big gradient when the target val. is 1 and the output is almost zero
    * $$\frac{\partial C}{\partial z_i} = \sum_j\frac{\partial C}{\partial y_j}\frac{\partial y_j}{\partial z_i}=y_i-t_i$$
    * A val. of 0.000001 is much better than 0.000000001
    * The steepness of dC/dy exactly balances the flatness of dy/dz
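
The result dC/dz_i = y_i - t_i can be checked against a finite-difference estimate. The sketch below uses made-up logits in which the correct class starts with a very small probability. The helper names are hypothetical.

```python
import math


def softmax(logits):
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, targets):
    """C = -sum_j t_j * log(y_j), skipping terms where t_j is zero."""
    y = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(targets, y) if t > 0)


def grad_analytic(logits, targets):
    """The closed form from the notes: dC/dz_i = y_i - t_i."""
    return [p - t for p, t in zip(softmax(logits), targets)]


def grad_numeric(logits, targets, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, targets) - cross_entropy(down, targets)) / (2 * eps))
    return grads


logits = [-10.0, 2.0, 1.0]   # the correct class (index 0) gets a tiny probability
targets = [1.0, 0.0, 0.0]
analytic = grad_analytic(logits, targets)
numeric = grad_numeric(logits, targets)
print(softmax(logits)[0], analytic)
```

Although the output for the correct class is only a few millionths, the gradient on its logit is close to -1, unlike the squared-error case.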

# Neuro-probabilistic language models

## A basic problem in speech recognition

* Cannot identify phonemes perfectly in noisy speech
    * The acoustic input is often ambiguous: there are several different words that fit the acoustic signal equally well
* People use their understanding of the meaning of the utterance to hear the right words
    * We do this unconsciously, e.g. when we "wreck a nice beach" (which sounds almost the same as "recognize speech")
    * We are very good at it
* This means speech recognizers have to know which words are likely to come next and which are not
    * Fortunately, words can be predicted quite well w/o full understanding
    
## The standard "trigram" method

* Take huge amount of text and count the frequencies of all triples of words
* Use these frequencies to make bets on the relative probabilities of words given the previous two words:
    * $$\frac{p(w_3=c|w_2=b,w_1=a)}{p(w_3=d|w_2=b,w_1=a)}=\frac{count(abc)}{count(abd)}$$
* Until very recently this was the state-of-the-art
    * Cannot use a much bigger context b/c there are too many possibilities to store and the counts would mostly be zero
    * Have to "back-off" to digrams when the count for a trigram is too small
        * The probability is not zero just b/c the count is zero
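
A rough sketch of the counting idea, using a made-up toy corpus. The back-off here is deliberately crude: if the trigram context was seen fewer than `min_count` times, it simply falls back to bigram counts. Real systems use more careful discounting schemes. The function names and `min_count` are assumptions for illustration.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def next_word_counts(tokens, a, b, min_count=1):
    """Counts of words following the context (a, b), backing off to (b,) if needed."""
    trigrams = ngram_counts(tokens, 3)
    scores = {w3: c for (w1, w2, w3), c in trigrams.items() if (w1, w2) == (a, b)}
    if sum(scores.values()) < min_count:
        # Too little evidence for this context: back off to bigram counts.
        bigrams = ngram_counts(tokens, 2)
        scores = {w2: c for (w1, w2), c in bigrams.items() if w1 == b}
    return scores


corpus = "the cat sat on the mat the cat sat on the rug the cat ran".split()
seen_context = next_word_counts(corpus, "the", "cat")
unseen_context = next_word_counts(corpus, "big", "the")
print(seen_context)
print(unseen_context)
```

The ratio of two counts for the same context gives the relative probability in the formula above, e.g. count(the cat sat) / count(the cat ran).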
        
## Info. that the trigram model fails to use

* Suppose we have seen the sentence
    * "the cat got squashed in the garden on friday"
* This should help predict words in the sentence
    * "the dog got flattened in the yard on monday"
* A trigram model does not understand the similarities b/w:
    * cat/dog
    * squashed/flattened
    * garden/yard
    * friday/monday
* To overcome this limitation, need to use the semantic and syntactic features of previous words to predict the features of the next word
    * Using a feature representation also allows a context that contains many more previous words
    
## Bengio's NN for predicting the next word

* ![img45](imgs/img45.jpg)

## Problem w/ having 100,000 output words

* Each unit in the last hidden layer has 100,000 outgoing weights
    * So cannot afford to have many hidden units
        * Unless have a huge # of training cases
    * Could make the last hidden layer small, but then it's hard to get the 100,000 probabilities right
        * The small probabilities are often relevant
* Best way to deal w/ such a large # of outputs?

# Ways to deal w/ large # of possible outputs in neuro-probabilistic language models

## A serial architecture

* ![img46](imgs/img46.jpg)

## Learning in the serial architecture

* After computing the logit score for each candidate word, use all of the logits in a softmax to get word probabilities
* The difference b/w the word probabilities and their target probabilities gives cross-entropy error derivatives
    * The derivatives try to raise the score of the correct candidate and lower the scores of its high-scoring rivals
* Can save a lot of time if only use a small set of candidates suggested by some other kind of predictor
    * e.g., could use the NN to revise the probabilities of the words that the trigram model thinks are likely

## Learning to predict the next word by predicting a path through a tree

* Arrange all the words in a binary tree w/ words as the leaves
* Use the previous context to generate a "prediction vector" v.
    * Compare v w/ a learned vector, u, at each node of the tree
    * Apply the logistic func. to the scalar product of u and v to predict the probabilities of taking the two branches of the tree
    * ![img47](imgs/img47.jpg)
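
A minimal sketch of the tree idea with four words and made-up vectors. It assumes that the logistic of u·v gives the probability of taking the right branch and 1 minus that gives the left branch. Which branch gets which is a convention chosen here, and all names and numbers are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# One learned vector u per internal node (values invented for illustration).
node_vectors = {"root": [0.5, -1.0], "left": [1.0, 0.3], "right": [-0.7, 0.8]}

# Each word is a leaf, reached by a path of (node, go_right) decisions.
paths = {
    "cat": [("root", False), ("left", False)],
    "dog": [("root", False), ("left", True)],
    "garden": [("root", True), ("right", False)],
    "yard": [("root", True), ("right", True)],
}


def log_prob(word, v):
    """Sum of the log probabilities of the branches on the path to the word."""
    total = 0.0
    for node, go_right in paths[word]:
        p_right = sigmoid(dot(node_vectors[node], v))
        total += math.log(p_right if go_right else 1.0 - p_right)
    return total


v = [0.2, 1.5]  # prediction vector produced from the previous context
probs = {word: math.exp(log_prob(word, v)) for word in paths}
print(probs)
```

Scoring one word touches only the nodes on its path (2 here, about log2(N) in general), and the leaf probabilities still sum to 1 without any normalization over the whole vocabulary.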

## Picture of the learning

* ![img48](imgs/img48.jpg)

## Convenient decomposition

* Maximizing the log probability of picking the target word is equivalent to maximizing the sum of the log probabilities of taking all the branches on the path that leads to the target word
    * During learning, only need to consider the nodes on the correct path. This is an exponential win: <font color='orange'> log(N) instead of N </font>
    * For each of these nodes, know the correct branch and know the current probability of taking it so can get derivatives for learning both the prediction vector v and that node vector u
* Unfortunately, still slow at test time.

## A simpler way to learn feature vectors for words

* ![img49](imgs/img49.jpg)

## Displaying the learned feature vectors in a 2-D map

* Can get an idea of the quality of the learned feature vectors by displaying them in a 2-D map
    * Display very similar vectors very close to each other
    * Use a multi-scale method called "t-SNE" that also displays similar clusters near each other
* The learned feature vectors capture lots of subtle semantic distinctions, just by looking at strings of words.
    * No extra supervision is required
    * The info. is all in the contexts that the word is used in
    * Consider "She scrommed him w/ the frying pan": the context alone suggests what the made-up word "scrommed" means

## Part of a 2-D map of the 2500 most common words

* ![img50](imgs/img50.jpg)
* ![img51](imgs/img51.jpg)
* ![img52](imgs/img52.jpg)