
XLNET Base for Malay and Indonesian languages (not an issue) #160

Closed
huseinzol05 opened this issue Jul 14, 2019 · 23 comments

Comments

@huseinzol05

Hi! This is not an issue, I just want to say XLNET is really great and I successfully pretrained XLNET from scratch for Malay and Indonesian languages. You can read the comparison and download the pretrained models here: https://github.com/huseinzol05/Malaya/tree/master/xlnet

I am planning to release XLNET Large for these languages!

@stefan-it

Thanks for sharing it (and the parameters you used for pre-training) 👍 Could you say something about the loss you achieved after 700K steps?

@huseinzol05
Author

My loss was around 2.XX after 700k steps; it never got into 1.XX during the entire training session.

@3NFBAGDU

Hi, can you tell us what hardware you used for training, and how long it took for the base and small models?
Thank you.

@huseinzol05
Author

A single Tesla V100 with 16GB VRAM; 700k steps took around 4 days for the base model and 2 days for the small model.

@3NFBAGDU

Thank you for your response. How many sentences do you have?

@abhi060698

Is there one for English as well?

@huseinzol05
Author

@3NFBAGDU, around 600k+, total size 1.21GB of pure text. @abhi060698, I believe you can download it from the original repository?

@abhi060698

@huseinzol05 From what I saw there's only the Large model. Do you have a link to the Base model?

@huseinzol05
Author

Nopeeeee :(

@3NFBAGDU

3NFBAGDU commented Jul 17, 2019

Hi, I trained an XLNET model a week ago.
I had 1,600,000 sentences; my previous parameters were:

sudo python3 train_gpu.py --record_info_dir=forum+books/tfrecords --train_batch_size=32 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=6 --d_model=764 --d_embed=764 --n_head=16 --d_head=64 --d_inner=2048 --untie_r=True --model_dir=forum+books_model --uncased=False --num_predict=85 --mask_alpha=6 --mask_beta=1 --train_steps=700000 --iterations=500

and it had bad results.

Now I want to train a base model with 3,000,000 sentences, and I want to change my train_gpu parameters to the following:

sudo python3 train_gpu.py --record_info_dir=forum+books/tfrecords --train_batch_size=32 --seq_len=512 --reuse_len=256 --mem_len=384 --perm_size=256 --n_layer=12 --d_model=512 --d_embed=512 --n_head=16 --d_head=64 --d_inner=2048 --untie_r=True --model_dir=forum+books_model --uncased=False --num_predict=85 --mask_alpha=6 --mask_beta=1 --num_passes=20 --train_steps=700000 --iterations=10 --learning_rate=2.5e-5

I think the number of iterations in the second example is too low. What parameters do you think I should choose to make a good base model?

I want to make sentence embeddings. I have used sent2vec for this and had good results, but I want to improve on it with XLNET.

@huseinzol05
Author

You can increase train_steps if you are not satisfied with 700k. If your loss is lower than 3.0, that is more than enough.

@3NFBAGDU

Hello, do you know how I can get a sentence-embedding vector for a sentence?
For example, I want to get the sentence-embedding vector for "Hello, how are you".

@huseinzol05
Author

huseinzol05 commented Jul 18, 2019

Actually, if you read run_classifier.py, you can get the answer:

import tensorflow as tf
import xlnet

X = tf.placeholder(tf.int32, [None, None])
segment_ids = tf.placeholder(tf.int32, [None, None])
input_masks = tf.placeholder(tf.float32, [None, None])

# XLNet expects inputs shaped [seq_len, batch_size], so transpose the [batch, seq] placeholders.
xlnet_model = xlnet.XLNetModel(
    xlnet_config=xlnet_config,
    run_config=xlnet_parameters,
    input_ids=tf.transpose(X, [1, 0]),
    seg_ids=tf.transpose(segment_ids, [1, 0]),
    input_mask=tf.transpose(input_masks, [1, 0]))

summary = xlnet_model.get_pooled_out("last", True)  # your vector for 'Hello, how are you'.
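
For completeness, a minimal sketch of actually running that graph on a tokenized sentence. The spiece.model path and the checkpoint restore are assumptions for illustration; note that XLNet's input_mask convention is 0.0 for real tokens and 1.0 for padding.

import numpy as np
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load('spiece.model')  # hypothetical path to your pretrained SentencePiece model

ids = sp.EncodeAsIds('Hello, how are you')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # in practice, restore the pretrained weights here, e.g. tf.train.Saver().restore(sess, checkpoint_path)
    vector = sess.run(summary, feed_dict={
        X: [ids],
        segment_ids: [[0] * len(ids)],    # a single segment
        input_masks: [[0.0] * len(ids)],  # 0.0 marks real (non-padding) tokens
    })
print(vector.shape)  # (1, d_model)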

@luv4me

luv4me commented Jul 22, 2019

@huseinzol05 hi, I have a question. Every time I run it, I get a different vector for the same sentence. It drives me crazy.

@vanh17

vanh17 commented Jul 22, 2019

@huseinzol05, when you pretrained your model, did you see an error like this: #168? Thank you.

@huseinzol05
Author

@luv4me, lol, obviously we have finite precision; the neural network will cut off some floating points. So give it a break.

@huseinzol05
Author

@vanh17, sounds like your data is an empty array. Did you check that your data is not empty?

@vanh17

vanh17 commented Jul 22, 2019

@huseinzol05 how could we really check for that? I ran data_utils.py on the txt file where each sentence is on one line, with an empty line at the end of each document before the next document is inserted.
I checked train-0-0.bsz-8.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-100.tfrecords, and it looks like it is filled with a bunch of human-unreadable characters, so I do not know how it could be empty.
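
One way I could sanity-check it, I suppose, is to iterate over the file and count the records (plain TF 1.x record iteration, nothing XLNet-specific):

import tensorflow as tf

path = 'train-0-0.bsz-8.seqlen-512.reuse-256.bi.alpha-6.beta-1.fnp-100.tfrecords'
count = 0
for record in tf.python_io.tf_record_iterator(path):
    example = tf.train.Example()
    example.ParseFromString(record)  # raises an error if a record is corrupt
    count += 1
print('records:', count)  # 0 would mean the file really is empty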

@OmriPi

OmriPi commented Jul 23, 2019

@huseinzol05 I have the exact same problem as @luv4me: each time I get a drastically different vector for the same sentence, and the values are not even remotely similar. Is it really the fault of floating points? It seems to me like something is either inconsistent there or the model is still training (even though I set is_training=False).

EDIT: OK, I found out that I can get the same vector consistently if I set TensorFlow's random seed with tf.compat.v1.random.set_random_seed(42).
This is quite strange. Where is randomization used in inference, and why? Usually the output of prediction should be consistent, afaik.

@huseinzol05
Author

You should set is_training to False, or else the dropout layers randomly apply zero masking.
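
A toy illustration of the effect (plain tf.nn.dropout here, not XLNet's internals, just to show the randomness):

import tensorflow as tf

x = tf.ones([1, 6])
dropped = tf.nn.dropout(x, keep_prob=0.5)  # training-style dropout: random zeros, survivors scaled by 1/keep_prob

with tf.Session() as sess:
    print(sess.run(dropped))  # a different zero pattern...
    print(sess.run(dropped))  # ...on every run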

@huseinzol05
Author

Even after you set is_training to False, the difference is quite big?

@OmriPi

OmriPi commented Jul 24, 2019

@huseinzol05 yes, I set both is_training and is_finetune to False. I am still getting random outputs.
Here is my code:

FLAGS(sys.argv)
sp_model.Load(FLAGS.spiece_model_file)
xlnet_config = xlnet.xlnet.XLNetConfig(json_path=FLAGS.model_config_path)
run_config = xlnet.xlnet.create_run_config(is_training=False, is_finetune=False, FLAGS=FLAGS)
...
with tf.Session() as sess:
    xlnet_model = xlnet.xlnet.XLNetModel(xlnet_config=xlnet_config, run_config=run_config,
                                         input_ids=np.expand_dims(sentence_features.input_ids, 1).astype('int32'),
                                         seg_ids=np.expand_dims(sentence_features.segment_ids, 1).astype('int32'),
                                         input_mask=np.expand_dims(sentence_features.input_mask, 1).astype('float32'))
    init_from_checkpoint(FLAGS, True)
    summary = xlnet_model.get_pooled_out(summary_type="last")
    sess.run(tf.global_variables_initializer())
    seq_out = xlnet_model.get_sequence_output()
    nps = summary.eval()
    print(np.sum(np.abs(summary.eval()[0])))
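
(Per my EDIT above, pinning the graph-level seed before building the model is what makes this reproducible for me; a minimal sketch, with 42 being an arbitrary value:)

import tensorflow as tf

tf.compat.v1.random.set_random_seed(42)  # must run before XLNetModel builds the graph
# ... then construct xlnet_model and evaluate summary exactly as above ...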

@3NFBAGDU

3NFBAGDU commented Jul 25, 2019

How many nodes are in the layers, if you know, and is it fully connected?
