
XLNET Base for Malay and Indonesian languages (not an issue) #160

Closed
huseinzol05 opened this issue Jul 14, 2019 · 23 comments

Comments

@huseinzol05

Hi! This is not an issue, I just want to say XLNET is really great and I successfully pretrained XLNET from scratch for Malay and Indonesian languages. You can read the comparison and download the pretrained models here: https://github.com/huseinzol05/Malaya/tree/master/xlnet

I am planning to release XLNET Large for these languages!

@stefan-it

Thanks for sharing it (and the parameters you used for pre-training) 👍 Could you say something about the loss you achieved after 700K steps?

@huseinzol05
Author

My loss was around 2.XX after 700k steps; it never got into 1.XX during the entire training session.

@3NFBAGDU

Hi, can you tell us what hardware you used for training, and how long it took for the base and small models?
Thank you.

@huseinzol05
Author

A single Tesla V100 with 16GB VRAM; 700k steps took around 4 days for the base model and 2 days for the small model.

@3NFBAGDU

Thank you for your response. How many sentences do you have?

@abhi060698

Is there one for English as well?

@huseinzol05
Author

@3NFBAGDU, around 600k+, total size 1.21GB of pure text. @abhi060698, I believe you can download it from the original repository?

@abhi060698

@huseinzol05 From what I saw there's only the Large model. Do you have a link to the Base model?

@huseinzol05
Author

Nopeeeee :(

@3NFBAGDU

3NFBAGDU commented Jul 17, 2019

Hi, I trained an XLNET model a week ago.
I had 1,600,000 sentences; my previous parameters were:

sudo python3 train_gpu.py --record_info_dir=forum+books/tfrecords --train_batch_size=32 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=6 --d_model=764 --d_embed=764 --n_head=16 --d_head=64 --d_inner=2048 --untie_r=True --model_dir=forum+books_model --uncased=False --num_predict=85 --mask_alpha=6 --mask_beta=1 --train_steps=700000 --iterations=500

and it had bad results.

Now I want to train a base model with 3,000,000 sentences, and I want to change my train_gpu parameters to the following:

sudo python3 train_gpu.py --record_info_dir=forum+books/tfrecords --train_batch_size=32 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=12 --d_model=512 --d_embed=512 --n_head=16 --d_head=64 --d_inner=2048 --untie_r=True --model_dir=forum+books_model --uncased=False --num_predict=85 --mask_alpha=6 --mask_beta=1 --num_passes=20 --train_steps=700000 --iterations=10 --learning_rate=2.5e-5

I think the number of iterations in the second example is too low. What parameters do you think I should choose to make a good base model?

I want to make sentence embeddings. I have used sent2vec for this and had good results, but I want to improve on it with XLNET.

@huseinzol05
Author

You can increase train_steps if you are not satisfied with 700k. If your loss is lower than 3.0, that is more than enough.

@3NFBAGDU

Hello, do you know how I can get a sentence-embedding vector for a sentence?
For example, I want to get the sentence-embedding vector for "Hello, how are you".

@huseinzol05
Author

huseinzol05 commented Jul 18, 2019

Actually, if you read run_classifier.py, you can get the answer:

import tensorflow as tf
import xlnet

X = tf.placeholder(tf.int32, [None, None])
segment_ids = tf.placeholder(tf.int32, [None, None])
input_masks = tf.placeholder(tf.float32, [None, None])

# XLNet expects inputs shaped [seq_len, batch_size], so transpose the [batch, seq] placeholders.
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=xlnet_parameters,
    input_ids=tf.transpose(X, [1, 0]),
    seg_ids=tf.transpose(segment_ids, [1, 0]),
    input_mask=tf.transpose(input_masks, [1, 0]))

summary = xlnet_model.get_pooled_out("last", True)  # your vector for 'Hello, how are you'.
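
For completeness, a minimal sketch of actually running that graph on a tokenized sentence. The spiece.model path and the checkpoint restore are assumptions for illustration; note that XLNet's input_mask convention is 0.0 for real tokens and 1.0 for padding.

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spiece.model')  # hypothetical path to your pretrained SentencePiece model

ids = sp.EncodeAsIds('Hello, how are you')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # in practice, restore the pretrained weights here, e.g. tf.train.Saver().restore(sess, checkpoint_path)
    vector = sess.run(summary, feed_dict={
        X: [ids],
        segment_ids: [[0] * len(ids)],    # a single segment
        input_masks: [[0.0] * len(ids)],  # 0.0 marks real (non-padding) tokens
    })
print(vector.shape)  # (1, d_model)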

@luv4me

luv4me commented Jul 22, 2019

@huseinzol05 hi, I have a question. Every time I run it, I get a different vector for the same sentence. It drives me crazy.

@vanh17

vanh17 commented Jul 22, 2019

@huseinzol05, when you pretrained your model, did you see an error like this: #168? Thank you.

@huseinzol05
Author

@luv4me, lol, obviously we have finite precision; the neural network will cut off some floating points. So give it a break.

@huseinzol05
Author

@vanh17, sounds like your data is an empty array. Did you check that your data is not empty?

@vanh17

vanh17 commented Jul 22, 2019

@huseinzol05 how could we really check for that? I ran data_utils.py on the txt file where each sentence is on one line, with an empty line at the end of each document before the next document is inserted.
I checked train-0-0.bsz-8.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-100.tfrecords, and it looks like it is filled with a bunch of human-unreadable characters, so I do not know how it could be empty.
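
One way I could sanity-check it, I suppose, is to iterate over the file and count the records (plain TF 1.x record iteration, nothing XLNet-specific):

import tensorflow as tf

path = 'train-0-0.bsz-8.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-100.tfrecords'
count = 0
for record in tf.python_io.tf_record_iterator(path):
    example = tf.train.Example()
    example.ParseFromString(record)  # raises an error if a record is corrupt
    count += 1
print('records:', count)  # 0 would mean the file really is empty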

@OmriPi

OmriPi commented Jul 23, 2019

@huseinzol05 I have the exact same problem as @luv4me: each time I get a drastically different vector for the same sentence, and the values are not even remotely similar. Is it really the fault of floating points? It seems to me like something is either inconsistent there or the model is still training (even though I set is_training=False).

EDIT: OK, I found out that I can get the same vector consistently if I set TensorFlow's random seed with tf.compat.v1.random.set_random_seed(42).
This is quite strange. Where is randomization used in inference, and why? Usually the output of prediction should be consistent, afaik.

@huseinzol05
Author

You should set is_training to False, or else the dropout layers randomly apply zero masking.
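
A toy illustration of the effect (plain tf.nn.dropout here, not XLNet's internals, just to show the randomness):

import tensorflow as tf

x = tf.ones([1, 6])
dropped = tf.nn.dropout(x, keep_prob=0.5)  # training-style dropout: random zeros, survivors scaled by 1/keep_prob

with tf.Session() as sess:
    print(sess.run(dropped))  # a different zero pattern...
    print(sess.run(dropped))  # ...on every run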

@huseinzol05
Author

Even after you set is_training to False, the difference is quite big?

@OmriPi

OmriPi commented Jul 24, 2019

@huseinzol05 yes, I set both is_training and is_finetune to False. I am still getting random outputs.
Here is my code:

FLAGS(sys.argv)
sp_model.Load(FLAGS.spiece_model_file)
xlnet_config = xlnet.xlnet.XLNetConfig(json_path=FLAGS.model_config_path)
run_config = xlnet.xlnet.create_run_config(is_training=False, is_finetune=False, FLAGS=FLAGS)
...
with tf.Session() as sess:
    xlnet_model = xlnet.xlnet.XLNetModel(xlnet_config=xlnet_config, run_config=run_config,
                                         input_ids=np.expand_dims(sentence_features.input_ids, 1).astype('int32'),
                                         seg_ids=np.expand_dims(sentence_features.segment_ids, 1).astype('int32'),
                                         input_mask=np.expand_dims(sentence_features.input_mask, 1).astype('float32'))
    init_from_checkpoint(FLAGS, True)
    summary = xlnet_model.get_pooled_out(summary_type="last")
    sess.run(tf.global_variables_initializer())
    seq_out = xlnet_model.get_sequence_output()
    nps = summary.eval()
    print(np.sum(np.abs(summary.eval()[0])))
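
(Per my EDIT above, pinning the graph-level seed before building the model is what makes this reproducible for me; a minimal sketch, with 42 being an arbitrary value:)

import tensorflow as tf

tf.compat.v1.random.set_random_seed(42)  # must run before XLNetModel builds the graph
# ... then construct xlnet_model and evaluate summary exactly as above ...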

@3NFBAGDU

3NFBAGDU commented Jul 25, 2019

How many nodes are in the layers, if you know, and is it fully connected?
