
Combining data_pipeline and simple_example #7

Closed
hedgy123 opened this issue Apr 20, 2017 · 4 comments

@hedgy123

Hi Egil,

Thank you so much for making your code available! This is really great stuff.

So, in trying to understand better how it all works, I tried using the data extracted from tensorflow.log (as in your data_pipeline notebook) as input to the network (same config as in your simple_example). Unfortunately I got all NaNs as losses:

Model summary:

 init_alpha:  -785.866918162
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
gru_1 (GRU)                  (None, 101, 1)            18        
_________________________________________________________________
dense_1 (Dense)              (None, 101, 2)            4         
_________________________________________________________________
activation_1 (Activation)    (None, 101, 2)            0         
=================================================================
Total params: 22.0
Trainable params: 22.0
Non-trainable params: 0.0  

Results of running model.fit:

Train on 72 samples, validate on 24 samples
Epoch 1/75
2s - loss: nan - val_loss: nan
....

I was wondering if you've tried doing the same experiment and if so, whether it worked for you? Thanks so much!

@ragulpr
Owner

ragulpr commented Apr 21, 2017

Hi there,
Thanks for reaching out! I need to be clearer about this: I haven't had time to join the two scripts together yet. I'll get back to you ASAP with a fuller answer, but for now:
init_alpha: -785.866918162 is an error (alpha must be > 0).

Note that for large magnitudes of alpha, the mean of the TTE is approximately the same as the more complex log-based estimate.
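As a rough sanity check, something along these lines (a sketch only; the variable names y_train, tte_mean_train and mean_u are assumptions mirroring the simple_example notebook, not guaranteed to match your code):

import numpy as np

# y_train[:, :, 0] = time to event, y_train[:, :, 1] = censoring indicator u (1 = observed).
tte_mean_train = np.nanmean(y_train[:, :, 0])
mean_u = np.nanmean(y_train[:, :, 1])

# Geometric-style initialization; for large tte_mean_train this is roughly tte_mean_train itself.
init_alpha = -1.0 / np.log(1.0 - 1.0 / (tte_mean_train + 1.0))
init_alpha = init_alpha / mean_u

assert init_alpha > 0, 'init_alpha must be positive; a negative value means the targets are off'
print('init_alpha:', init_alpha)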

Furthermore

  • Initialization is important. Gradients explode if you're too far off, and more censored data leads to a higher probability of exploding gradients initially.
  • The learning rate depends on the data and may need to be in magnitudes you didn't expect.
  • Are you feeding in masked steps? Variable-length sequences have no clean implementation at the moment; I haven't had time to get the masking layer to work. The current workaround: set n_timesteps = None and run one training step per input sequence with something like:

Note: not tested:

def epoch():
    # One gradient step per (unpadded) sequence; requires the model to be built with n_timesteps = None.
    for i in range(n_samples):
        # Slice with i:i+1 to keep the batch dimension: shape (1, seq_length[i], n_features).
        model.fit(x_train[i:i + 1, :seq_length[i], :],
                  y_train[i:i + 1, :seq_length[i], :],
                  epochs=1,
                  batch_size=1,
                  verbose=2)

But an even better initial debug mode is to simply transform the data to [n_non_masked_samples, 1, n_features] (i.e. feed in only the observed timesteps), fit a simple ANN on that, and only once that works test the RNN. A sketch of that transformation follows below.
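Something like this, as an untested sketch (the boolean mask construction is an assumption; use whatever marks padding in your data):

import numpy as np

# mask[i, t] is True where timestep t of sequence i is actually observed (not padding).
mask = ~np.isnan(x_train[:, :, 0])   # assumption: padding is marked with NaN in the first feature

x_flat = x_train[mask]               # shape: [n_non_masked_samples, n_features]
y_flat = y_train[mask]               # shape: [n_non_masked_samples, 2]

# Add back a length-1 time axis so the same output layer / loss can be reused.
x_flat = x_flat[:, None, :]          # [n_non_masked_samples, 1, n_features]
y_flat = y_flat[:, None, :]          # [n_non_masked_samples, 1, 2]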

Would love to see forks!

@ragulpr
Owner

ragulpr commented Apr 24, 2017

There are multiple reasons for NaNs to show up, but I just found a very important one:

shift_discrete_padded_features, which is supposed to hide the target, is currently broken and apparently doesn't. This means that if "event" is fed in as an input feature, the network can make a perfect prediction, causing exploding gradients.

I'm trying to fix it asap
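For intuition, this is roughly what the shift is meant to achieve (a hypothetical numpy sketch, not the actual shift_discrete_padded_features implementation):

import numpy as np

def shift_events_right(x, fill=0.0):
    # Lag the features by one step so the input at time t cannot contain
    # the event occurring at time t (only history up to t-1).
    # x has shape [n_sequences, n_timesteps, n_features].
    x_shifted = np.roll(x, shift=1, axis=1)
    x_shifted[:, 0, :] = fill  # the first step has no history; pad it
    return x_shifted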

@NataliaVConnolly

Hi Egil,

Thanks for the update! Here's a fork with the notebook Combined_data_pipeline_and_analysis in examples/keras:

  https://github.com/NataliaVConnolly/wtte-rnn-1

The last cell shows an example of training with just one input sequence. It does result in a non-NaN loss, although a very large one (I didn't optimize the initial alpha or the network config much).

Cheers,
Natalia (aka hedgy123 :))

@ragulpr
Owner

ragulpr commented May 6, 2017

@NataliaVConnolly Sorry for the wait. It took me some time to figure out what was wrong!

  • Too much censoring leads to instability. It works when using more frequent committers (<50% censoring); in the example I only use those who committed on at least 10 days. See the sketch after this list.
  • You train on one subject but initialize alpha using the mean over all subjects. This leads to a high probability of exploding gradients.
  • As mentioned above, if this was done before the fix of shift_discrete_padded_features, that would also lead to NaN (a perfect fit) after some training.
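As an illustration of the first point, one way to filter and re-initialize (an untested sketch; the exact filter in the example notebook is by commit activity, not by this per-subject censoring ratio):

import numpy as np

# Assumes y_train has shape [n_subjects, n_timesteps, 2] with
# y_train[:, :, 1] the censoring indicator u (1 = event observed at that step).
def censored_fraction(y):
    return 1.0 - np.nanmean(y[:, :, 1])

# Keep the less-censored subjects for a stable start (< 50% censoring);
# then compute init_alpha from this same subset, as in the earlier snippet.
keep = np.array([censored_fraction(y_train[i:i + 1]) < 0.5
                 for i in range(len(y_train))])
x_sub, y_sub = x_train[keep], y_train[keep]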

Check out the new data_pipeline and let me know if you have more questions! :)

ragulpr closed this as completed May 6, 2017
ragulpr referenced this issue May 6, 2017
- Check it out. 
- Poor performance atm 
- TODO will add masking/batchsize>1 support soon.