[QUESTION] Comet kiwi architecture #216
Comments
You are right. The diagram is correct, but the hparams are confusing because that flag is actually not used for this model. Contrary to RegressionMetric models, where the different pooling options influence the sentence embedding computation, in UnifiedMetric models we always use the same pooling technique (the CLS token).
The config in the YAML is just there because all classes inherit from CometModel.
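To make the difference concrete, here is a minimal sketch of the two strategies (illustrative code, not the actual COMET implementation). `hidden_states` is the encoder output of shape `(batch, seq_len, dim)`:

```python
import torch

def cls_pooling(hidden_states: torch.Tensor) -> torch.Tensor:
    # UnifiedMetric / CometKiwi: always take the first (<s> / CLS) token.
    return hidden_states[:, 0, :]

def average_pooling(hidden_states: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    # RegressionMetric-style alternative: mask-aware mean over all tokens.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```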
But why?
This was something I ran a couple of tests on, and it was not worth it for models where we perform cross-encoding. With a model like CometKiwi, where target and source are encoded together and the self-attention can look at both sentences at the same time, the representations captured in the CLS token are superior to average pooling across the entire input. Another thing we tried was to gather only the embeddings of the target (which already received attention from the source) and average those... the result is very similar to using the CLS token only, and it complicates the code a bit because you have to keep track of the separator tokens in the middle of the sequence. So the decision was based on performance and simplicity... This is not the case for other models where there is no attention between sentences; for those models we saw benefits in doing average pooling. Btw, our experiments seem to validate some findings from retrieval tasks, where there is a long-running debate about cross-encoding vs dual encoding with average pooling.
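For reference, the discarded alternative would look roughly like this. This is a hypothetical sketch (assuming the target segment comes first in the concatenated input and `sep_id` is the tokenizer's separator id); the separator bookkeeping it needs is exactly what the CLS option avoids:

```python
import torch

def target_only_average(hidden_states: torch.Tensor,
                        input_ids: torch.Tensor,
                        sep_id: int) -> torch.Tensor:
    embs = []
    for h, ids in zip(hidden_states, input_ids):
        # find the first separator, which closes the target segment
        first_sep = (ids == sep_id).nonzero(as_tuple=True)[0][0].item()
        embs.append(h[1:first_sep].mean(dim=0))  # skip <s>, stop at first </s>
    return torch.stack(embs)
```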
Last ablation question: did you try the same architecture without the layerwise_attention? Does it bring a lot?
I did; it's basically the same performance. For some tasks different layers can give you different results, and some layers might be better than others. The idea behind using the layerwise_attention was to reduce the need for that layer search when doing hyper-parameter tuning, and I found it worked well... Additionally, we could eventually prune top layers if needed, but we ended up not doing it. We describe the layer pruning here. Anyway, training a model without the layerwise_attention will eventually lead to similar results, and it's not an absolute need.
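For anyone curious, layerwise attention boils down to an ELMo-style learned scalar mix over the encoder's per-layer outputs. A rough sketch (illustrative names, not COMET's exact class):

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Learn one softmax-normalized weight per encoder layer."""

    def __init__(self, num_layers: int):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))
        self.gamma = nn.Parameter(torch.ones(1))

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq, dim) tensors, one per layer
        w = torch.softmax(self.weights, dim=0)
        mixed = sum(w_i * h_i for w_i, h_i in zip(w, layer_outputs))
        return self.gamma * mixed
```

Dropping it just means feeding the last layer's output straight to the feed-forward head, which, per the above, ends up at similar performance.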
Thanks for this. Another question: I scored a dataset with cometkiwi-XL and trained an xlm-roberta-large model on the dataset / scores from cometkiwi-XL. It barely improves on the "original" wmt22-cometkiwi-da model, which means it is quite difficult to distill cometkiwi-XL into a smaller model. Did you observe the same?
Yes, it's hard to distill an XXL/XL model into a large model... I believe this is the case because the large model is already close to the XL and XXL models; there is not a lot of improvement with scale. I had a student working on distillation who had nice results distilling XL/XXL into a model based on MiniLM V2. The resulting model is fast and has good performance... It's a bit better than training with the annotations from WMT.
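The setup described here amounts to plain score regression on teacher outputs. A minimal sketch of such a loss (an assumed shape of the recipe, not the exact one used; `alpha` is a hypothetical mixing weight):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()

def distillation_loss(student_scores: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      human_scores: torch.Tensor = None,
                      alpha: float = 0.5) -> torch.Tensor:
    # teacher_scores: precomputed with the XL/XXL model on the same segments
    loss = mse(student_scores, teacher_scores)
    if human_scores is not None:
        # optionally blend in the original human (e.g. WMT DA) annotations
        loss = alpha * loss + (1 - alpha) * mse(student_scores, human_scores)
    return loss
```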
Hmm, I am surprised; in my tests XL is much better than Large. I have not tested XXL, but based on your last paper it seems a marginal improvement over XL.
It's true, the improvements from Large to XL are marginal. You notice a bit bigger improvement when going to XXL, but for its size the improvement is not that big. I think this is the case because InfoXLM is a really strong encoder for its size, while XLM-R XL and XXL are undertrained for their size... Unfortunately, no one seems to be interested in training large multilingual encoders anymore.
I think we are not saying the same thing :)
My understanding based on this
[image: CometKiwi / UnifiedMetric architecture diagram]
and this:
https://github.com/Unbabel/COMET/blob/master/comet/models/multitask/unified_metric.py#L473
is that for the wmt22/wmt23-cometkiwi models you take only the first token as the sentence embedding to compute the score through the layerwise attention + feedforward layers.
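In other words (my paraphrase, not the exact code at that line):

```python
# keep only the first (<s>/CLS) token of the encoder output
sentemb = last_hidden_state[:, 0, :]
```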
This setting in the hparams is a bit confusing: https://huggingface.co/Unbabel/wmt23-cometkiwi-da-xl/blob/main/hparams.yaml#L30
However, what triggered the choice of first token vs average pooling in the case of CometKiwi?
Thanks