TODO
Speed up the parser using the tricks of pruning and refactoring.
Write a unit test of tag_rhmm on specific model parameters, etc.
DEBUG:root:The validation error of different batches and learning rates was the following: {
(1, 0.04): 0.8057851239669421,
(5, 0.2): 0.7913223140495868,
(3, 0.04): 0.6487603305785123,
(1, 0.008): 0.6053719008264463,
(5, 0.04): 0.6033057851239669,
(10, 0.04): 0.5557851239669421,
(3, 0.008): 0.5289256198347108,
(5, 0.008): 0.4462809917355372,
(1, 0.0016): 0.4462809917355372,
(10, 0.2): 0.3450413223140496,
(3, 0.00032): 0.23553719008264462,
(5, 0.0016): 0.23553719008264462,
(3, 0.0016): 0.23553719008264462,
(1, 0.2): 0.0,
(10, 0.00032): 0.21487603305785125,
(5, 0.00032): 0.22107438016528927,
(1, 6.4e-05): 0.22107438016528927,
(3, 6.4e-05): 0.20867768595041322,
(10, 0.008): 0.23553719008264462,
(10, 0.0016): 0.23553719008264462,
(3, 0.2): 0.03925619834710744,
(1, 0.00032): 0.23553719008264462,
(5, 6.4e-05): 0.19421487603305784,
(10, 6.4e-05): 0.15289256198347106}
There are 8 GPU machines with two GPUs each (hosts b0* and g0*); use qlogin -l gpu=1 and nvidia-smi.
training_method decides the update equation. It is one of
[add_projected, add_naturalparam, mult_exponentiated, mult_prod]:
add_projected      | additive SGD on the probabilities, which are
                   | then reprojected back to probability land
add_naturalparam   | additive SGD on a multinomial expressed in
                   | its natural parameters
mult_exponentiated | multiplicative exponentiated-gradient SGD on
                   | the probabilities
mult_prod          | multiplicative prod SGD.
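A minimal numpy sketch of the first three updates for a single multinomial parameter p with gradient g = dF/dp (p, g, and eta are illustrative names, not project code; mult_prod is omitted since its exact update is not spelled out in these notes, and the projection shown for add_projected is one common choice):

import numpy as np

def add_projected(p, g, eta):
    # additive step in probability space, then project back to the simplex
    p = np.clip(p + eta * g, 1e-12, None)
    return p / p.sum()

def add_naturalparam(p, g, eta):
    # additive step on the natural (softmax) parameters theta = log p;
    # chain rule: dF/dtheta = p * (g - p.dot(g))
    theta = np.log(p) + eta * p * (g - p.dot(g))
    e = np.exp(theta - theta.max())
    return e / e.sum()

def mult_exponentiated(p, g, eta):
    # exponentiated gradient: p *= exp(eta * g), then renormalize
    p = p * np.exp(eta * g)
    return p / p.sum()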
python src/predict_tag.py res/train_tag_rhmm~addlambda0.1~LL~sgd~0.5~l1~0.01~toy.tag.vocab~toy.word.vocab~toy_sup~toy_unsup~toy.dev~toy_embedding@toy.dev.tagstrip res/ log/predict_tag_rhmm~addlambda0.1~LL~sgd~0.5~l1~0.01~toy.tag.vocab~toy.word.vocab~toy_sup~toy_unsup~toy.dev~toy_embedding@toy.dev.tagstrip
python src/postag_accuracy.py log/predict_tag_rhmm~addlambda0.1~LL~sgd~0.5~l1~0.01~toy.tag.vocab~toy.word.vocab~toy_sup~toy_unsup~toy.dev~toy_embedding@toy.dev.tagstrip res/
# OPTIMIZATION = em, sgd, naturalgrad #LBFGS, EM, SGD, Natural Gradient
import itertools

def train_on_file(mo_model, train_fn, num_pass, eta, dev_fn, filetype):
    # One SGD sweep per pass over a single file of supervised ("sup") or
    # unsupervised ("unsup") sentences.
    for pass_ in xrange(num_pass):
        for row in open(train_fn, "rb"):
            row = row.strip().split()
            if filetype == "sup":
                sentence, tag = get_sentence_tag_from_row(row)
                mo_model.update_parameter(mo_model.gradient_sto(sentence, tag) * eta)
            elif filetype == "unsup":
                mo_model.update_parameter(mo_model.gradient_so(row) * eta)
            else:
                raise ValueError(filetype)
        eval_on_train_dev("train_on_file, Pass: %d" % pass_, mo_model, train_fn, dev_fn)
    return
def train_on_twofile(mo_model, sup_train_fn, unsup_train_fn, num_pass, eta, sup_dev_fn):
    # Interleave supervised and unsupervised updates by zipping the two
    # files; izip_longest pads the shorter file with None, which we skip.
    for pass_ in xrange(num_pass):
        for sup_row, unsup_row in itertools.izip_longest(
                open(sup_train_fn, "rb"), open(unsup_train_fn, "rb")):
            if sup_row is not None:
                row = sup_row.strip().split()
                sentence, tag = get_sentence_tag_from_row(row)
                mo_model.update_parameter(mo_model.gradient_sto(sentence, tag) * eta)
            if unsup_row is not None:
                row = unsup_row.strip().split()
                mo_model.update_parameter(mo_model.gradient_so(row) * eta)
        eval_on_train_dev("train_on_twofile, Pass: %d" % pass_, mo_model, sup_train_fn, sup_dev_fn)
    return
# Three training schedules to compare:
# (1) all supervised passes first, then all unsupervised passes;
train_on_file(mo_model, sup_train_fn, num_pass, eta, sup_dev_fn, "sup")
train_on_file(mo_model, unsup_train_fn, num_pass, eta, sup_dev_fn, "unsup")
mo_model.reinitialize()
# (2) the reverse order;
train_on_file(mo_model, unsup_train_fn, num_pass, eta, sup_dev_fn, "unsup")
train_on_file(mo_model, sup_train_fn, num_pass, eta, sup_dev_fn, "sup")
mo_model.reinitialize()
# (3) interleave supervised and unsupervised updates within each pass.
train_on_twofile(mo_model, sup_train_fn, unsup_train_fn, num_pass, eta, sup_dev_fn)
# Projected gradient is fastest to move currently small probabilities
# when the objective calls for it, while GD in logspace is slowest to
# move them; EG is in between. This is because projected gradient does
# φ += ε ∂F/∂φ (followed by additive renormalization). EG scales the
# update size by a factor of φ, since for small ε the EG update
# φ *= exp(ε ∂F/∂φ) is close to φ += ε φ ∂F/∂φ (followed by
# multiplicative renormalization). GD in logspace adds another factor
# of φ (after shifting the gradient by its expectation E).
# EG can actually be viewed as a projected subgradient method using
# generalized relative entropy (D(x || y) = Σ_i x_i log(x_i/y_i)
# - x_i + y_i) as the distance function for projections
# (Beck & Teboulle, 2003).
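# A quick numeric check of the relative step sizes described above, with
# made-up numbers (phi = a small probability, g = dF/dphi,
# renormalization and the shift by E ignored):
# import math
# phi, g, eps = 0.01, 1.0, 0.1
# pg = eps * g                                  # projected gradient step
# eg = phi * (math.exp(eps * g) - 1.0)          # EG step, ~ eps*phi*g
# lg = phi * (math.exp(eps * phi * g) - 1.0)    # logspace GD, ~ eps*phi**2*g
# print pg, eg, lg   # 0.1, ~0.00105, ~1e-5: each a factor of ~phi smaller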
# So most of the non-convex functions I've optimized have been done
# with LBFGS optimization with one crucial change: any time the
# optimizer thinks it's converged, dump the history cache and force it
# to flush the current approximation of the inverse Hessian and take
# just a normal gradient step. Most of the Berkeley NLP papers since
# 2006 that do LBFGS non-convex optimization have used this trick and,
# I believe, found it pretty important.
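# A minimal sketch of that restart trick, assuming scipy.optimize (these
# notes do not name a library); each fresh call to minimize() starts
# L-BFGS with an empty history, so its first move from the current point
# is effectively a plain gradient step:
# import numpy as np
# from scipy.optimize import minimize
# def f(x):
#     return np.sum((x ** 2 - 1.0) ** 2)        # toy non-convex objective
# def fprime(x):
#     return 4.0 * x * (x ** 2 - 1.0)
# x = np.full(5, 0.3)
# for restart in xrange(3):                     # restart = dump the history
#     res = minimize(f, x, jac=fprime, method="L-BFGS-B")
#     x = res.x                                 # continue from where it stopped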
# Pylearn2 represents the training while-loop itself as a class that
# holds the algorithm, the model, save_path, save_freq, and
# "extensions". I think these are the most important pieces. What is
# interesting to learn is how the problem has been parametrized and how
# the classes are defined, quite beautifully. The training algorithms
# have a base class (which specifies only three functions:
# continue_training, setup, train) and then a default implementation.
# Then there are learning_rules for SGD. The main SGD algorithm is
# pylearn2.training_algorithms.sgd.SGD,
# but there is also batch gradient descent.
# The objective is written as a Theano function. The only problem is
# that here I would have a fairly large number of parameters. So a
# simple warm-up problem could be optimizing a max-ent LM using Theano.
# If I can do that using Theano (and pylearn2), then I can do this also.
# For this part I found the following code which I can learn from:
# https://github.com/ddahlmeier/neural_lm/blob/master/lbl.py
# https://github.com/ddahlmeier/neural_lm/blob/master/lbl_nce.py
# https://github.com/turian/neural-language-model
# https://github.com/gwtaylor/theano-rnn
# My model has an objective, and during training, since I want to
# leverage both supervised and unsupervised data, I need to write 3
# kinds of scores:
# 1. score_sto: sentence and tags observed, but some of the tags can be
#    missing; this is the most general case.
# 2. score_ao: all variables observed.
# 3. score_so: only the sentence observed (the tags are latent); the
#    unsupervised case.
# (Each is calculated from the observed data and the parameters.)
# Theano helps me calculate gradients automatically, as long as I can
# write the score (calculated using DP or not) as a Theano function.
# That is possible by using scan in Theano.
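# For instance, a minimal sketch (toy sizes, not the project's code) of
# a sentence score written as a forward-algorithm DP inside theano.scan,
# from which T.grad gives the unsupervised gradient for free:
# import numpy
# import theano
# import theano.tensor as T
# K, V = 3, 5                                      # toy tagset/vocab sizes
# trans = theano.shared(numpy.ones((K, K)) / K, name="trans")
# emit = theano.shared(numpy.ones((K, V)) / V, name="emit")
# words = T.ivector("words")
# def step(w, alpha_prev):
#     # forward recursion: alpha_t = (alpha_{t-1} . trans) * emit[:, w]
#     return T.dot(alpha_prev, trans) * emit[:, w]
# alphas, _ = theano.scan(step, sequences=words,
#                         outputs_info=T.ones((K,)) / K)
# score = T.log(alphas[-1].sum())       # log p(sentence), tags summed out
# grads = T.grad(score, [trans, emit])  # gradient with tags latent
# score_so = theano.function([words], [score] + grads)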
# I need two types of gradients: the gradient given a supervised data
# point, and the gradient given an unsupervised data point.
# My model has a method to give the gradient (wrt the parameters),
# given only the observed data and the parameters; the gradients are
# calculated using Theano.
# Train the model using pylearn2/scipy.optimize.
# pylearn2 allows me to quickly switch between different optimization
# methods, along with the tricks of the trade that are part of the
# optimization business.
# Perform prediction using whatever framework.
# My model can predict hidden variables (given observed variables and
# parameters). The prediction method would be used in calculating the
# objective/score anyway; this is necessary because the objective is an
# expectation. So the parameters are an intrinsic part of the model.
# Probability of tag given the previous sequence of words. And we are
# assuming that this parametrization is the appropriate one to use. The
# important thing is that we can "efficiently" predict, so we need to
# maintain that. In fact, code for that.
# So generate toy data, and call make on that target all the time.
# Train the model and then use it as a LM (calculate the probability
# efficiently) and get the perplexity; also make an eval method that
# gives a tagged sequence, and write code to evaluate it.
# TODO
# 1.[X] Make a function that can serialize and deserialize a model and load parameters from file.
# 2.[X] Implement the training and prediction part.
# 3.[ ] Test that the score function is correct.
# 4.[ ] Discuss how we can update locally normalized models through SGD.
#   The problem: say we have a 0th-order HMM. It is locally normalized;
#   however, when we stream through the data there is no guarantee that
#   a gradient step on the probabilities keeps the model locally
#   normalized. So should we normalize at each step?
# Also: sampling from log-transformed probabilities.
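# (Aside on that last point, not from these notes: a standard way to
# sample from a categorical stored in log space without exponentiating
# is the Gumbel-max trick; argmax_i(logp_i + g_i) with g_i ~ Gumbel(0,1)
# is an exact sample.)
# import numpy as np
# def sample_from_logprobs(logp, rng=np.random):
#     g = rng.gumbel(size=logp.shape)    # i.i.d. Gumbel(0, 1) noise
#     return int(np.argmax(logp + g))    # exact categorical sample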
# Show that
# self = tag_order0hmm("lbl10", "LL", "L1", 0.01, 0.5, False, None, dict(t0=0, t1=1, t2=2), dict(w0=0, w1=1, w2=2))
# self.score_ao([1], [1])
# self.score_so([1], [-1])
# self.gradient_ao([1], [1])
# self.gradient_so([1], [-1])
# According to Apps Hungarian:
# mo = model (object)
# it = iterator
# cl = class
# Perform training and testing using the small test data on 3 different types of models.
# all: # Train_train > Predict_train > Evaluate_train > Tune.parameters + Profile/debug.code (Loop) > Train_train > Predict_dev
# theano.config.compute_test_value = 'warn'
# x = theano.tensor.dvector('x')
# f = theano.function([x], x * 5)
# x_printed = theano.printing.Print('this is a very important value')(x)
# f_with_print = theano.function([x], x_printed * 5)
# See also pp, debugprint, printing.pydotprint.
# With MonitorMode, every Apply node is printed out, along with its
# position in the graph, the arguments to its perform or c_code, and the
# output it computed, and inside the hooks you can do something like
# numpy.isnan(output[0]).any():
# def inspect_inputs(i, node, fn):
#     print i, node, "input(s) value(s):", [input[0] for input in fn.inputs]
# def inspect_outputs(i, node, fn):
#     print "output(s) value(s):", [output[0] for output in fn.outputs]
# f_monitored = theano.function([x], x * 5,
#     mode=theano.compile.MonitorMode(pre_func=inspect_inputs,
#                                     post_func=inspect_outputs))
Take many steps along the supervised data and then interleave between
supervised and unsupervised.
Run a bigram model and see what differences I get.
Add the idea to the google doc that we can use the tag vector with a
tensor (the tag vector may be one-hot) and then use those to predict
the tags.
Also the bigram is
Implement two taggers. One is a zero-order word-based probability
tagger. Then build an accuracy evaluator.
Also add tuning support, proper training support, and then dev.
The other is an HMM-based tagger. Use Theano for both; do it so that
you get comfortable with the API etc.
WRITE A MAXENT LM IN THEANO.
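A minimal sketch of such a max-ent (log-linear) LM in Theano, with a made-up vocabulary size and only a previous-word feature (illustrative, not the project's model):

import numpy
import theano
import theano.tensor as T

V = 100                                            # assumed vocabulary size
W = theano.shared(numpy.zeros((V, V)), name="W")   # feature weights
b = theano.shared(numpy.zeros(V), name="b")        # per-word bias
prev = T.ivector("prev")                           # previous-word ids
nxt = T.ivector("nxt")                             # next-word ids
probs = T.nnet.softmax(W[prev] + b)                # locally normalized
nll = -T.mean(T.log(probs[T.arange(nxt.shape[0]), nxt]))
gW, gb = T.grad(nll, [W, b])
eta = T.scalar("eta")
train = theano.function([prev, nxt, eta], nll,
                        updates=[(W, W - eta * gW), (b, b - eta * gb)])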
We present enriched generative models for tagged sentences and
dependency-parsed sentences. There's been a lot of discussion in the
literature of CRFs and MEMMs. These are discriminative models, which
are rich and robust to model misspecification but also cannot be
trained on unsupervised examples. Standard generative alternatives
such as HMMs are less rich. We show how generative models can be
enriched and still be efficient for use with dynamic programming
(i.e., we do not go all the way to a globally normalized joint model
although we discuss the possibility). In particular, we can condition
on all the words of the input sentence just as for CRFs, but our
generative approach allows semi-supervised training.
This class of models has not been entirely missed before, as
generative models with latent tag and parse variables have been
previously investigated in the language modeling community, where the
goal is to predict words. However, they have apparently [check!] not
been used to predict the tags and parses, a setting in which (in
contrast to language modeling) one can do efficient decoding with an
unbounded amount of sentence context. Also, our dependency syntax
model differs from the SLM and related models in the LM literature --
it is top-down generative rather than history-based, and requires a
novel parsing algorithm.
1. [ ] Change the size of the tagset in unsupervised tagging.
2. [ ] Compare different model topologies as LMs, as taggers.
3. [ ] Train/eval in the same way (regularization, jackknifing, sup-unsup).
4. [ ] /joint prob of tagged words, tagging, LM, using sup data and
unsup data.
5. [ ] /cite the paper that was using HMM.
6. [ ] The main claim is that our work gives richer models for
tagging/parsing and they can work better for tagging on small datasets.
7. [ ] Cite multi-conditional learning.
8. [ ] Read the paper on the distributed, lexical-semantic, syntactic LM
by Shaojun Wang, for backoff strategies.
9. [ ] The tree parse version.
10. [ ] Cite multi-conditional learning by McCallum.
11. [ ] Vary the window length.
12. [ ] We will need to have unbounded history, at least for tagging,
to make our model sufficiently different from the one in the paper
by Shaojun Wang.
13. [ ] Ask TIM for features for CRFs that model unbounded contexts.
14. [ ] Look at the multi-floor Chinese restaurant process/franchise.
15. [ ] Nonparametric interpolation -> Sequence memoizer.
16. [ ] Back-off smoothing is justified as a small approximation to
the hierarchical Bayesian stuff.
17. [ ] Hierarchical Bayesian, Frank Wood (multi-floor Chinese restaurant process):
http://www.stats.ox.ac.uk/~teh/research/compling/WooGasArc2011a.pdf
http://ilk.uvt.nl/~stehouwer/files/ICGI2010.pdf
http://www.di.ens.fr/~fbach/anil_emnlp.pdf
The Data Sheet
I still have to use this model for tagging and then to get an accuracy.
I also have to get the gradients of either the likelihood or the objective.
I have not added a tree-like hierarchy for fast enumeration of the
partition function, nor have I implemented NCE yet.
Also, currently I am treating the embeddings as fixed, not as
parameters to be learnt. Should I learn them?
Dependency shift-reduce (generative vs discriminative);
then NCE, and how it fares against importance sampling.
This leads to a thought for us: train a hierarchy of
stepwise-generation models where each model generates the sentence in
successively larger steps:
1. One bit at a time (using a clustering of words).
2. One word or tag at a time.
Write paper.
Write pseudocode of the project in terms of building blocks like the
Earley parser, inside-outside, and forward-backward (a sketch of the
last follows below).
Implement CRF / MEMM / HMM.
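A minimal numpy sketch of forward-backward for a toy HMM (pi, trans, emit, and the word ids are placeholders, not project code):

import numpy as np

def forward_backward(pi, trans, emit, words):
    # pi: (K,) initial tag probs; trans: (K, K); emit: (K, V); words: id list
    n, K = len(words), len(pi)
    alpha = np.zeros((n, K))
    beta = np.zeros((n, K))
    alpha[0] = pi * emit[:, words[0]]
    for t in xrange(1, n):
        alpha[t] = alpha[t - 1].dot(trans) * emit[:, words[t]]
    beta[-1] = 1.0
    for t in xrange(n - 2, -1, -1):
        beta[t] = trans.dot(emit[:, words[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)  # posterior tag marginals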
NEXT STEPS: 1a and 1b in parallel, 1c less important, 2a depends on 1a and 1b
1a. Figure out what is meant by tree-structured sparsity and
Nellakanti's work. Specifically, how does it apply to our project for
selecting the model? How can sparsity help set the order of the HMM
dynamically? This would affect how I build the later models.
1b. We decided that I should build at least the following pipeline
(since parts of this pipeline would be used for many other model
topologies):
1. Use the model to tag words
2. Compute accuracy of the model
3. Use Chokkan crfsuite as the baseline CRF tagger.
4. Clean up the test suite
5. Build GMEMM likelihood as well
6. Use Auto-Diff to calculate the gradients.
7. Learn parameters of the model
1c. Look at different ways of formulating the RNN structure. (Note that
word2vec uses a simpler model: it is flat, uses the previous 15 words,
and learns a single matrix for all of them.)
2a. Compare different model topologies (HMM, Ngram, HNMM) (NGram, MEMM, GMEMM)
a. as LM
b. as tagger
c. as unsupervised tagger (this is not a numeric comparison; we want to
understand what the model topology learns)
- Train/eval all in the same way:
a. Regularization
b. Jackknifing
c. Sup-Unsup LL mix
d. Different sup/unsup training set sizes
e. Different sizes of the target tagset in the unsupervised tagging case.
- Parametrization
Log-linear (the features are the words/tags being predicted, the n_w
previous words, and the n_t previous generated tags).
Prune a feature if its frequency is not high enough.
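A hypothetical sketch of that feature map (the templates are made up; n_w and n_t follow the line above):

def features(tag, prev_words, prev_tags, n_w=2, n_t=1):
    # conjoin the predicted tag with a bounded window of history
    feats = [("bias", tag)]
    for i, w in enumerate(prev_words[::-1][:n_w]):
        feats.append(("w-%d=%s" % (i + 1, w), tag))
    for i, t in enumerate(prev_tags[::-1][:n_t]):
        feats.append(("t-%d=%s" % (i + 1, t), tag))
    return feats

Pruning would then just count feature occurrences over the training data and drop any feature whose count falls below a threshold.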