TSDAE + GPL and TAS-B + GPL #7

Open
Yuan0320 opened this issue May 28, 2022 · 10 comments

Yuan0320 commented May 28, 2022

@kwang2049 Hi, thanks for your amazing work!

I wonder whether the TSDAE + GPL mentioned in the paper refers to fine-tuning the distilbert-base PLM in the following order: (1) TSDAE on ${dataset} -> (2) GPL on ${dataset}?

Thx.

Yuan0320 changed the title from "TSADE + GPL" to "TSDAE + GPL" on May 28, 2022
@kwang2049 (Member)

Hi @Yuan0320, thanks for your attention!

Sorry, the description of this in the paper is somewhat misleading. The method is composed of three training stages:
(1) TSDAE on ${dataset} -> (2) MarginMSE on MS MARCO -> (3) GPL on ${dataset}.

It can also be understood as the TSDAE baseline (which was also trained on MS MARCO) in Table 1 + GPL training.

Actually, omitting step (2) makes very little difference (cf. Table 4 in the paper). From my observation, the largest gap comes from Robust04 (around -1.0 nDCG@10 points).

Yuan0320 (Author) commented Jun 6, 2022

@kwang2049 Thanks for the detailed response!

Is there public code in this repo for (1) TSDAE on ${dataset} -> (2) MarginMSE on MS MARCO?

Thx.

kwang2049 (Member) commented Jun 8, 2022

Sorry, there is currently no one-step solution for this. To reproduce it, please run these steps one by one:
(1) Train a TSDAE model on the target corpus: https://github.com/UKPLab/sentence-transformers/tree/master/examples/unsupervised_learning/TSDAE. Note that if one wants to start from distilbert-base-uncased (i.e. the setting in the GPL paper), one needs a small plug-in, since sadly the original DistilBERT does not support being used as a decoder: UKPLab/sentence-transformers#962 (comment). A minimal sketch of this step is included after the step-(2) command below.
(2) Continue training with the MarginMSE loss. This can be done with our GPL repo:

wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/msmarco.zip
unzip msmarco.zip
export dataset="msmarco"
python -m gpl.train \
    --path_to_generated_data "./$dataset" \
    --base_ckpt "YOUR_TSDAE_CHECKPOINT" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 140000 \
    --output_dir "output/$dataset" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --use_train_qrels
    # --use_amp   # Use this for efficient training if the machine supports AMP

# Run `python -m gpl.train --help` for information on all the arguments
# Note: please do not set `qgen_prefix` here; leave it as None (the default).

So basically, this skips the query-generation step (by setting use_train_qrels=True) and uses the existing training qrels from MS MARCO.
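
For step (1), here is a minimal Python sketch of the TSDAE training, closely following the linked sentence-transformers example. The corpus file name and output path are placeholders, and using a separate bert-base-uncased decoder is just one possible workaround for the DistilBERT-as-decoder limitation (the linked comment describes the exact plug-in):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

# Build a CLS-pooling sentence encoder from distilbert-base-uncased
word_embedding_model = models.Transformer("distilbert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Load the target corpus as plain sentences (placeholder file name)
with open("target_corpus.txt") as f:
    sentences = [line.strip() for line in f if line.strip()]

# DenoisingAutoEncoderDataset applies the word-deletion noise on the fly
train_dataset = datasets.DenoisingAutoEncoderDataset(sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True, drop_last=True)

# DistilBERT cannot act as the decoder, so instead of tying encoder and
# decoder, plug in a separate decoder checkpoint (one possible workaround)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path="bert-base-uncased", tie_encoder_decoder=False
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
    optimizer_params={"lr": 3e-5},
    show_progress_bar=True,
)
model.save("output/tsdae-distilbert")  # pass this path as YOUR_TSDAE_CHECKPOINT in step (2)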

@Yuan0320 (Author)

@kwang2049 Thanks for the detailed pipeline for TSDAE + GPL!

BTW, if you have a backup of the gpl-training-data.tsv file from the MarginMSE training on MS MARCO, could you share it with me?

Thanks.

@kwang2049 (Member)

You can use this file instead: https://sbert.net/datasets/msmarco-hard-negatives.jsonl.gz. The format is different from our gpl-training-data.tsv, but thankfully @nreimers has wrapped the code for training this zero-shot baseline in a single file: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_margin-mse.py. I actually used similar code at the very beginning of the GPL project.
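
In case it helps, the core of that script is just MarginMSE training in sentence-transformers. Here is a stripped-down sketch: the parsing of the hard-negatives file is omitted, the checkpoint paths are placeholders, and the example triplet and margin are made up for illustration.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the stage-(1) TSDAE checkpoint (placeholder path)
model = SentenceTransformer("output/tsdae-distilbert")

# Each example is (query, positive, negative), labeled with the cross-encoder
# margin CE(query, positive) - CE(query, negative); in practice the triplets
# and scores come from msmarco-hard-negatives.jsonl.gz
train_examples = [
    InputExample(
        texts=["what is gpl", "GPL is a domain adaptation method ...", "A gpl license is ..."],
        label=8.3,  # made-up margin, for illustration only
    ),
]
train_dataloader = DataLoader(train_examples, batch_size=32, shuffle=True, drop_last=True)
train_loss = losses.MarginMSELoss(model)  # uses dot-product similarity by default

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
)
model.save("output/tsdae-distilbert-marginmse")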

Yuan0320 (Author) commented Jun 18, 2022

@kwang2049 Thanks!

Regarding the best method GPL/${dataset}-distilbert-tas-b-gpl-self_miner, is the following script correct? (Compared to the usage in your repo's README.md, I only changed base_ckpt and retrievers to 'msmarco-distilbert-base-tas-b'.)

export dataset="MYDATASET"
python -m gpl.train \
    --path_to_generated_data "./$dataset" \
    --base_ckpt "msmarco-distilbert-base-tas-b" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 140000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-tas-b"  \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "qgen" \
    --do_evaluation

Yuan0320 changed the title from "TSDAE + GPL" to "TSDAE + GPL and TAS-B + GPL" on Jun 18, 2022
kwang2049 (Member) commented Jun 22, 2022

I think you just need to change --retriever_score_functions "cos_sim" "cos_sim" into --retriever_score_functions "dot", since you have only one negative miner and that miner is the TAS-B model, which was trained with the dot product. The other parts look good to me.
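
For reference, here is how the corrected configuration could look via the gpl.train Python entry point documented in the repo README; the keyword arguments mirror the CLI flags, and "MYDATASET" is a placeholder:

import gpl

dataset = "MYDATASET"  # placeholder dataset name
gpl.train(
    path_to_generated_data=f"./{dataset}",
    base_ckpt="msmarco-distilbert-base-tas-b",
    gpl_score_function="dot",
    batch_size_gpl=32,
    gpl_steps=140000,
    new_size=-1,
    queries_per_passage=-1,
    output_dir=f"output/{dataset}",
    evaluation_data=f"./{dataset}",
    evaluation_output=f"evaluation/{dataset}",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-tas-b"],
    retriever_score_functions=["dot"],  # one miner, trained with dot product
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    do_evaluation=True,
)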

@ArtemisDicoTiar

Hello @kwang2049!
I appreciate your lovely work.
While reading the GPL paper, I was quite confused about what TSDAE and TSDAE + {something} mean in Table 9.
So I searched this repo for TSDAE and found this issue.
Even after reading this thread, I am still a bit confused about TSDAE itself.
Is the 'TSDAE' mentioned in Table 9 trained with the following pipeline?
(1) TSDAE on ${dataset} -> (2) MarginMSE on MSMARCO
And therefore, does the TSDAE in Table 9 correspond to Table 1's TSDAE (target → MS-MARCO)?

kwang2049 (Member) commented Aug 24, 2022

Hi @ArtemisDicoTiar, your understanding is correct. "TSDAE" in Tables 1 and 9 refers to the same method, i.e. TSDAE (target → MS-MARCO). "Target" here means a certain dataset from the target domain (we simply trained with TSDAE on the corresponding unlabeled corpus).

GuodongFan commented Jan 5, 2023

In the paper, I found that TSDAE is used to train the retrievers (domain adaptation for dense retrieval).
Should the TSDAE checkpoint be passed as --base_ckpt, as --retrievers, or both?
Thanks!
