TODO
Speed up the parser using the tricks of pruning and refactoring.
Write a unit test of tag_rhmm on specific model parameters, etc.
DEBUG:root:The validation error of different batches and learning rates was the following: {
(1, 0.04): 0.8057851239669421,
(5, 0.2): 0.7913223140495868,
(3, 0.04): 0.6487603305785123,
(1, 0.008): 0.6053719008264463,
(5, 0.04): 0.6033057851239669,
(10, 0.04): 0.5557851239669421,
(3, 0.008): 0.5289256198347108,
(5, 0.008): 0.4462809917355372,
(1, 0.0016): 0.4462809917355372,
(10, 0.2): 0.3450413223140496,
(3, 0.00032): 0.23553719008264462,
(5, 0.0016): 0.23553719008264462,
(3, 0.0016): 0.23553719008264462,
(1, 0.2): 0.0,
(10, 0.00032): 0.21487603305785125,
(5, 0.00032): 0.22107438016528927,
(1, 6.4e-05): 0.22107438016528927,
(3, 6.4e-05): 0.20867768595041322,
(10, 0.008): 0.23553719008264462,
(10, 0.0016): 0.23553719008264462,
(3, 0.2): 0.03925619834710744,
(1, 0.00032): 0.23553719008264462,
(5, 6.4e-05): 0.19421487603305784,
(10, 6.4e-05): 0.15289256198347106}
There are 8 GPU machines with two GPUs each (hosts b0* and g0*); use qlogin -l gpu=1 and nvidia-smi.
training_method decides the update equation. It is one of
[add_projected, add_naturalparam, mult_exponentiated, mult_prod]:
add_projected      | additive SGD on the probabilities, which are
                   | then reprojected back to probability land
add_naturalparam   | additive SGD on a multinomial expressed in
                   | its natural parameters
mult_exponentiated | multiplicative exponentiated-gradient SGD on
                   | the probabilities
mult_prod          | multiplicative prod SGD.
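A minimal numpy sketch of the first three updates for a single multinomial parameter p with gradient g = dF/dp (p, g, and eta are illustrative names, not project code; mult_prod is omitted since its exact update is not spelled out in these notes, and the projection shown for add_projected is one common choice):

import numpy as np

def add_projected(p, g, eta):
    # additive step in probability space, then project back to the simplex
    p = np.clip(p + eta * g, 1e-12, None)
    return p / p.sum()

def add_naturalparam(p, g, eta):
    # additive step on the natural (softmax) parameters theta = log p;
    # chain rule: dF/dtheta = p * (g - p.dot(g))
    theta = np.log(p) + eta * p * (g - p.dot(g))
    e = np.exp(theta - theta.max())
    return e / e.sum()

def mult_exponentiated(p, g, eta):
    # exponentiated gradient: p *= exp(eta * g), then renormalize
    p = p * np.exp(eta * g)
    return p / p.sum()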
python src/predict_tag.py res/train_tag_rhmm~addlambda0.1~LL~sgd~0.5~l1~0.01~toy.tag.vocab~toy.word.vocab~toy_sup~toy_unsup~toy.dev~toy_embedding@toy.dev.tagstrip res/ log/predict_tag_rhmm~addlambda0.1~LL~sgd~0.5~l1~0.01~toy.tag.vocab~toy.word.vocab~toy_sup~toy_unsup~toy.dev~toy_embedding@toy.dev.tagstrip
python src/postag_accuracy.py log/predict_tag_rhmm~addlambda0.1~LL~sgd~0.5~l1~0.01~toy.tag.vocab~toy.word.vocab~toy_sup~toy_unsup~toy.dev~toy_embedding@toy.dev.tagstrip res/
# OPTIMIZATION = em, sgd, naturalgrad #LBFGS, EM, SGD, Natural Gradient
import itertools

def train_on_file(mo_model, train_fn, num_pass, eta, dev_fn, filetype):
    # One SGD sweep per pass over a single file of supervised ("sup") or
    # unsupervised ("unsup") sentences.
    for pass_ in xrange(num_pass):
        for row in open(train_fn, "rb"):
            row = row.strip().split()
            if filetype == "sup":
                sentence, tag = get_sentence_tag_from_row(row)
                mo_model.update_parameter(mo_model.gradient_sto(sentence, tag) * eta)
            elif filetype == "unsup":
                mo_model.update_parameter(mo_model.gradient_so(row) * eta)
            else:
                raise ValueError(filetype)
        eval_on_train_dev("train_on_file, Pass: %d" % pass_, mo_model, train_fn, dev_fn)
    return
def train_on_twofile(mo_model, sup_train_fn, unsup_train_fn, num_pass, eta, sup_dev_fn):
    # Interleave supervised and unsupervised updates by zipping the two
    # files; izip_longest pads the shorter file with None, which we skip.
    for pass_ in xrange(num_pass):
        for sup_row, unsup_row in itertools.izip_longest(
                open(sup_train_fn, "rb"), open(unsup_train_fn, "rb")):
            if sup_row is not None:
                row = sup_row.strip().split()
                sentence, tag = get_sentence_tag_from_row(row)
                mo_model.update_parameter(mo_model.gradient_sto(sentence, tag) * eta)
            if unsup_row is not None:
                row = unsup_row.strip().split()
                mo_model.update_parameter(mo_model.gradient_so(row) * eta)
        eval_on_train_dev("train_on_twofile, Pass: %d" % pass_, mo_model, sup_train_fn, sup_dev_fn)
    return
# Three training schedules to compare:
# (1) all supervised passes first, then all unsupervised passes;
train_on_file(mo_model, sup_train_fn, num_pass, eta, sup_dev_fn, "sup")
train_on_file(mo_model, unsup_train_fn, num_pass, eta, sup_dev_fn, "unsup")
mo_model.reinitialize()
# (2) the reverse order;
train_on_file(mo_model, unsup_train_fn, num_pass, eta, sup_dev_fn, "unsup")
train_on_file(mo_model, sup_train_fn, num_pass, eta, sup_dev_fn, "sup")
mo_model.reinitialize()
# (3) interleave supervised and unsupervised updates within each pass.
train_on_twofile(mo_model, sup_train_fn, unsup_train_fn, num_pass, eta, sup_dev_fn)
# Projected gradient is fastest to move currently small probabilities
# when the objective calls for it, while GD in logspace is slowest to
# move them; EG is in between. This is because projected gradient does
# φ += ε ∂F/∂φ (followed by additive renormalization). EG scales the
# update size by a factor of φ, since for small ε the EG update
# φ *= exp(ε ∂F/∂φ) is close to φ += ε φ ∂F/∂φ (followed by
# multiplicative renormalization). GD in logspace adds another factor
# of φ (after shifting the gradient by its expectation E).
# EG can actually be viewed as a projected subgradient method using
# generalized relative entropy (D(x || y) = Σ_i x_i log(x_i/y_i)
# - x_i + y_i) as the distance function for projections
# (Beck & Teboulle, 2003).
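# A quick numeric check of the relative step sizes described above, with
# made-up numbers (phi = a small probability, g = dF/dphi,
# renormalization and the shift by E ignored):
# import math
# phi, g, eps = 0.01, 1.0, 0.1
# pg = eps * g                                  # projected gradient step
# eg = phi * (math.exp(eps * g) - 1.0)          # EG step, ~ eps*phi*g
# lg = phi * (math.exp(eps * phi * g) - 1.0)    # logspace GD, ~ eps*phi**2*g
# print pg, eg, lg   # 0.1, ~0.00105, ~1e-5: each a factor of ~phi smaller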
# So most of the non-convex functions I've optimized have been done
# with LBFGS optimization with one crucial change: any time the
# optimizer thinks it's converged, dump the history cache and force it
# to flush the current approximation of the inverse Hessian and take
# just a normal gradient step. Most of the Berkeley NLP papers since
# 2006 that do LBFGS non-convex optimization have used this trick and,
# I believe, found it pretty important.
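# A minimal sketch of that restart trick, assuming scipy.optimize (these
# notes do not name a library); each fresh call to minimize() starts
# L-BFGS with an empty history, so its first move from the current point
# is effectively a plain gradient step:
# import numpy as np
# from scipy.optimize import minimize
# def f(x):
#     return np.sum((x ** 2 - 1.0) ** 2)        # toy non-convex objective
# def fprime(x):
#     return 4.0 * x * (x ** 2 - 1.0)
# x = np.full(5, 0.3)
# for restart in xrange(3):                     # restart = dump the history
#     res = minimize(f, x, jac=fprime, method="L-BFGS-B")
#     x = res.x                                 # continue from where it stopped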
# Pylearn2 represents the training while-loop itself as a class that
# holds the algorithm, the model, save_path, save_freq, and
# "extensions". I think these are the most important pieces. What is
# interesting to learn is how the problem has been parametrized and how
# the classes are defined, quite beautifully. The training algorithms
# have a base class (which specifies only three functions:
# continue_training, setup, train) and then a default implementation.
# Then there are learning_rules for SGD. The main SGD algorithm is
# pylearn2.training_algorithms.sgd.SGD,
# but there is also batch gradient descent.
# The objective is written as a Theano function. The only problem is
# that here I would have a fairly large number of parameters. So a
# simple warm-up problem could be optimizing a max-ent LM using Theano.
# If I can do that using Theano (and pylearn2), then I can do this also.
# For this part I found the following code which I can learn from:
# https://github.com/ddahlmeier/neural_lm/blob/master/lbl.py
# https://github.com/ddahlmeier/neural_lm/blob/master/lbl_nce.py
# https://github.com/turian/neural-language-model
# https://github.com/gwtaylor/theano-rnn
# My model has an objective, and during training, since I want to
# leverage both supervised and unsupervised data, I need to write 3
# kinds of scores:
# 1. score_sto: sentence and tags observed, but some of the tags can be
#    missing; this is the most general case.
# 2. score_ao: all variables observed.
# 3. score_so: only the sentence observed (the tags are latent); the
#    unsupervised case.
# (Each is calculated from the observed data and the parameters.)
# Theano helps me calculate gradients automatically, as long as I can
# write the score (calculated using DP or not) as a Theano function.
# That is possible by using scan in Theano.
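# For instance, a minimal sketch (toy sizes, not the project's code) of
# a sentence score written as a forward-algorithm DP inside theano.scan,
# from which T.grad gives the unsupervised gradient for free:
# import numpy
# import theano
# import theano.tensor as T
# K, V = 3, 5                                      # toy tagset/vocab sizes
# trans = theano.shared(numpy.ones((K, K)) / K, name="trans")
# emit = theano.shared(numpy.ones((K, V)) / V, name="emit")
# words = T.ivector("words")
# def step(w, alpha_prev):
#     # forward recursion: alpha_t = (alpha_{t-1} . trans) * emit[:, w]
#     return T.dot(alpha_prev, trans) * emit[:, w]
# alphas, _ = theano.scan(step, sequences=words,
#                         outputs_info=T.ones((K,)) / K)
# score = T.log(alphas[-1].sum())       # log p(sentence), tags summed out
# grads = T.grad(score, [trans, emit])  # gradient with tags latent
# score_so = theano.function([words], [score] + grads)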
# I need two types of gradients: the gradient given a supervised data
# point, and the gradient given an unsupervised data point.
# My model has a method to give the gradient (wrt the parameters),
# given only the observed data and the parameters; the gradients are
# calculated using Theano.
# Train the model using pylearn2/scipy.optimize.
# pylearn2 allows me to quickly switch between different optimization
# methods, along with the tricks of the trade that are part of the
# optimization business.
# Perform prediction using whatever framework.
# My model can predict hidden variables (given observed variables and
# parameters). The prediction method would be used in calculating the
# objective/score anyway; this is necessary because the objective is an
# expectation. So the parameters are an intrinsic part of the model.
# Probability of tag given the previous sequence of words. And we are
# assuming that this parametrization is the appropriate one to use. The
# important thing is that we can "efficiently" predict, so we need to
# maintain that. In fact, code for that.
# So generate toy data, and call make on that target all the time.
# Train the model and then use it as a LM (calculate the probability
# efficiently) and get the perplexity; also make an eval method that
# gives a tagged sequence, and write code to evaluate it.
# TODO
# 1.[X] Make a function that can serialize and deserialize a model and load parameters from file.
# 2.[X] Implement the training and prediction part.
# 3.[ ] Test that the score function is correct.
# 4.[ ] Discuss how we can update locally normalized models through SGD.
#   The problem: say we have a 0th-order HMM. It is locally normalized;
#   however, when we stream through the data there is no guarantee that
#   a gradient step on the probabilities keeps the model locally
#   normalized. So should we normalize at each step?
# Also: sampling from log-transformed probabilities.
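# (Aside on that last point, not from these notes: a standard way to
# sample from a categorical stored in log space without exponentiating
# is the Gumbel-max trick; argmax_i(logp_i + g_i) with g_i ~ Gumbel(0,1)
# is an exact sample.)
# import numpy as np
# def sample_from_logprobs(logp, rng=np.random):
#     g = rng.gumbel(size=logp.shape)    # i.i.d. Gumbel(0, 1) noise
#     return int(np.argmax(logp + g))    # exact categorical sample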
# Show that
# self = tag_order0hmm("lbl10", "LL", "L1", 0.01, 0.5, False, None, dict(t0=0, t1=1, t2=2), dict(w0=0, w1=1, w2=2))
# self.score_ao([1], [1])
# self.score_so([1], [-1])
# self.gradient_ao([1], [1])
# self.gradient_so([1], [-1])
# According to Apps Hungarian:
# mo = model (object)
# it = iterator
# cl = class
# Perform training and testing using the small test data on 3 different types of models.
# all: # Train_train > Predict_train > Evaluate_train > Tune.parameters + Profile/debug.code (Loop) > Train_train > Predict_dev
# theano.config.compute_test_value = 'warn'
# x = theano.tensor.dvector('x')
# f = theano.function([x], x * 5)
# x_printed = theano.printing.Print('this is a very important value')(x)
# f_with_print = theano.function([x], x_printed * 5)
# See also pp, debugprint, printing.pydotprint.
# With MonitorMode, every Apply node is printed out, along with its
# position in the graph, the arguments to its perform or c_code, and the
# output it computed, and inside the hooks you can do something like
# numpy.isnan(output[0]).any():
# def inspect_inputs(i, node, fn):
#     print i, node, "input(s) value(s):", [input[0] for input in fn.inputs]
# def inspect_outputs(i, node, fn):
#     print "output(s) value(s):", [output[0] for output in fn.outputs]
# f_monitored = theano.function([x], x * 5,
#     mode=theano.compile.MonitorMode(pre_func=inspect_inputs,
#                                     post_func=inspect_outputs))
Take many steps along the supervised data and then interleave between
supervised and unsupervised.
Run a bigram model and see what differences I get.
Add the idea to the google doc that we can use the tag vector with a
tensor (the tag vector may be one-hot) and then use those to predict
the tags.
Also the bigram is
Implement two taggers. One is a zero-order word-based probability
tagger. Then build an accuracy evaluator.
Also add tuning support, proper training support, and then dev.
The other is an HMM-based tagger. Use Theano for both; do it so that
you get comfortable with the API etc.
WRITE A MAXENT LM IN THEANO.
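A minimal sketch of such a max-ent (log-linear) LM in Theano, with a made-up vocabulary size and only a previous-word feature (illustrative, not the project's model):

import numpy
import theano
import theano.tensor as T

V = 100                                            # assumed vocabulary size
W = theano.shared(numpy.zeros((V, V)), name="W")   # feature weights
b = theano.shared(numpy.zeros(V), name="b")        # per-word bias
prev = T.ivector("prev")                           # previous-word ids
nxt = T.ivector("nxt")                             # next-word ids
probs = T.nnet.softmax(W[prev] + b)                # locally normalized
nll = -T.mean(T.log(probs[T.arange(nxt.shape[0]), nxt]))
gW, gb = T.grad(nll, [W, b])
eta = T.scalar("eta")
train = theano.function([prev, nxt, eta], nll,
                        updates=[(W, W - eta * gW), (b, b - eta * gb)])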
We present enriched generative models for tagged sentences and
dependency-parsed sentences. There's been a lot of discussion in the
literature of CRFs and MEMMs. These are discriminative models, which
are rich and robust to model misspecification but also cannot be
trained on unsupervised examples. Standard generative alternatives
such as HMMs are less rich. We show how generative models can be
enriched and still be efficient for use with dynamic programming
(i.e., we do not go all the way to a globally normalized joint model
although we discuss the possibility). In particular, we can condition
on all the words of the input sentence just as for CRFs, but our
generative approach allows semi-supervised training.
This class of models has not been entirely missed before, as
generative models with latent tag and parse variables have been
previously investigated in the language modeling community, where the
goal is to predict words. However, they have apparently [check!] not
been used to predict the tags and parses, a setting in which (in
contrast to language modeling) one can do efficient decoding with an
unbounded amount of sentence context. Also, our dependency syntax
model differs from the SLM and related models in the LM literature --
it is top-down generative rather than history-based, and requires a
novel parsing algorithm.
1. [ ] Change the size of the tagset in unsupervised tagging.
2. [ ] Compare different model topologies as LMs, as taggers.
3. [ ] Train/eval in the same way (regularization, jackknifing, sup-unsup).
4. [ ] /joint prob of tagged words, tagging, LM, using sup data and
unsup data.
5. [ ] /cite the paper that was using HMM.
6. [ ] The main claim is that our work gives richer models for
tagging/parsing and they can work better for tagging on small datasets.
7. [ ] Cite multi-conditional learning.
8. [ ] Read the paper on the distributed, lexical-semantic, syntactic LM
by Shaojun Wang, for backoff strategies.
9. [ ] The tree parse version.
10. [ ] Cite multi-conditional learning by McCallum.
11. [ ] Vary the window length.
12. [ ] We will need to have unbounded history, at least for tagging,
to make our model sufficiently different from the one in the paper
by Shaojun Wang.
13. [ ] Ask TIM for features for CRFs that model unbounded contexts.
14. [ ] Look at the multi-floor Chinese restaurant process/franchise.
15. [ ] Nonparametric interpolation -> Sequence memoizer.
16. [ ] Back-off smoothing is justified as a small approximation to
the hierarchical Bayesian stuff.
17. [ ] Hierarchical Bayesian, Frank Wood (multi-floor Chinese restaurant process):
http://www.stats.ox.ac.uk/~teh/research/compling/WooGasArc2011a.pdf
http://ilk.uvt.nl/~stehouwer/files/ICGI2010.pdf
http://www.di.ens.fr/~fbach/anil_emnlp.pdf
The Data Sheet
I still have to use this model for tagging and then to get an accuracy.
I also have to get the gradients of either the likelihood or the objective.
I have not added a tree-like hierarchy for fast enumeration of the
partition function, nor have I implemented NCE yet.
Also, currently I am treating the embeddings as fixed, not as
parameters to be learnt. Should I learn them?
Dependency shift-reduce (generative vs discriminative);
then NCE, and how it fares against importance sampling.
This leads to a thought for us: train a hierarchy of
stepwise-generation models where each model generates the sentence in
successively larger steps:
1. One bit at a time (using a clustering of words).
2. One word or tag at a time.
Write paper.
Write pseudocode of the project in terms of building blocks like the
Earley parser, inside-outside, and forward-backward (a sketch of the
last follows below).
Implement CRF / MEMM / HMM.
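A minimal numpy sketch of forward-backward for a toy HMM (pi, trans, emit, and the word ids are placeholders, not project code):

import numpy as np

def forward_backward(pi, trans, emit, words):
    # pi: (K,) initial tag probs; trans: (K, K); emit: (K, V); words: id list
    n, K = len(words), len(pi)
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    alpha[0] = pi * emit[:, words[0]]
    for t in xrange(1, n):
        alpha[t] = alpha[t - 1].dot(trans) * emit[:, words[t]]
    beta[-1] = 1.0
    for t in xrange(n - 2, -1, -1):
        beta[t] = trans.dot(emit[:, words[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # posterior tag marginals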
NEXT STEPS: 1a and 1b in parallel, 1c less important, 2a depends on 1a and 1b
1a. Figure out what is meant by tree-structured sparsity and
Nellakanti's work. Specifically, how does it apply to our project for
selecting the model? How can sparsity help set the order of the HMM
dynamically? This would affect how I build the later models.
1b. We decided that I should build at least the following pipeline
(since parts of this pipeline would be used for many other model
topologies):
1. Use the model to tag words
2. Compute accuracy of the model
3. Use Chokkan crfsuite as the baseline CRF tagger.
4. Clean up the test suite
5. Build GMEMM likelihood as well
6. Use Auto-Diff to calculate the gradients.
7. Learn parameters of the model
1c. Look at different ways of formulating the RNN structure. (Note that
word2vec uses a simpler model: it is flat, uses the previous 15 words,
and learns a single matrix for all of them.)
2a. Compare different model topologies (HMM, Ngram, HNMM) (NGram, MEMM, GMEMM)
a. as LM
b. as tagger
c. as unsupervised tagger (this is not a numeric comparison; we want to
understand what the model topology learns)
- Train/eval all in the same way:
a. Regularization
b. Jackknifing
c. Sup-Unsup LL mix
d. Different sup/unsup training set sizes
e. Different sizes of the target tagset in the unsupervised tagging case.
- Parametrization
Log-linear (the features are the words/tags being predicted, the n_w
previous words, and the n_t previous generated tags).
Prune a feature if its frequency is not high enough.
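A hypothetical sketch of that feature map (the templates are made up; n_w and n_t follow the line above):

def features(tag, prev_words, prev_tags, n_w=2, n_t=1):
    # conjoin the predicted tag with a bounded window of history
    feats = [("bias", tag)]
    for i, w in enumerate(prev_words[::-1][:n_w]):
        feats.append(("w-%d=%s" % (i + 1, w), tag))
    for i, t in enumerate(prev_tags[::-1][:n_t]):
        feats.append(("t-%d=%s" % (i + 1, t), tag))
    return feats

Pruning would then just count feature occurrences over the training data and drop any feature whose count falls below a threshold.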