I ran the DiffCSE source code on my own hardware (Tesla V100-SXM2-32GB), using the exact configuration specified in run_diffcse.sh. The results I obtained, shown below, differ from those reported in your paper and on your GitHub repository. For example, there is a 3.24-point gap (78.49 − 75.25 = 3.24) in the average STS score between your reported results and mine.
Do you have any insights or suggestions as to what might cause this performance gap? (@voidism)
[INFO|trainer.py:358] 2023-09-21 19:27:21,467 >> Using amp fp16 backend
09/21/2023 19:27:21 - INFO - __main__ - *** Evaluate ***
tasks: ['STSBenchmark', 'SICKRelatedness', 'STS12', 'STS13', 'STS14', 'STS15', 'STS16', 'MR', 'CR', 'SUBJ', 'MPQA', 'SST2', 'MRPC', 'TREC']
./SentEval/senteval/sts.py:42: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
sent1 = np.array([s.split() for s in sent1])[not_empty_idx]
./SentEval/senteval/sts.py:43: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
sent2 = np.array([s.split() for s in sent2])[not_empty_idx]
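(As an aside, the two VisibleDeprecationWarnings above are harmless but easy to silence. A minimal sketch of the fix the warning itself suggests — passing `dtype=object` when building an array from ragged token lists; the sample sentences here are made up for illustration:)

```python
import numpy as np

# Tokenized sentences have unequal lengths, so the list of lists is "ragged".
sentences = ["a short sentence", "an example with a few more tokens"]
tokens = [s.split() for s in sentences]

# NumPy >= 1.20 deprecates implicitly creating an object array from ragged
# sequences; specifying dtype=object makes the intent explicit and removes
# the VisibleDeprecationWarning seen in SentEval/senteval/sts.py.
arr = np.array(tokens, dtype=object)
print(arr.shape, arr.dtype)  # (2,) object
```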
09/21/2023 19:27:54 - INFO - root - Generating sentence embeddings
09/21/2023 19:28:02 - INFO - root - Generated sentence embeddings
09/21/2023 19:28:02 - INFO - root - Training pytorch-MLP-nhid0-rmsprop-bs128 with (inner) 5-fold cross-validation
09/21/2023 19:28:10 - INFO - root - Best param found at split 1: l2reg = 0.001 with score 82.31
09/21/2023 19:28:20 - INFO - root - Best param found at split 2: l2reg = 0.001 with score 81.99
09/21/2023 19:28:32 - INFO - root - Best param found at split 3: l2reg = 0.0001 with score 82.27
09/21/2023 19:28:42 - INFO - root - Best param found at split 4: l2reg = 0.01 with score 81.54
09/21/2023 19:28:53 - INFO - root - Best param found at split 5: l2reg = 0.0001 with score 82.04
09/21/2023 19:28:54 - INFO - root - Generating sentence embeddings
09/21/2023 19:28:56 - INFO - root - Generated sentence embeddings
09/21/2023 19:28:56 - INFO - root - Training pytorch-MLP-nhid0-rmsprop-bs128 with (inner) 5-fold cross-validation
09/21/2023 19:28:59 - INFO - root - Best param found at split 1: l2reg = 1e-05 with score 87.81
09/21/2023 19:29:03 - INFO - root - Best param found at split 2: l2reg = 0.0001 with score 88.15
09/21/2023 19:29:07 - INFO - root - Best param found at split 3: l2reg = 1e-05 with score 87.32
09/21/2023 19:29:11 - INFO - root - Best param found at split 4: l2reg = 1e-05 with score 87.05
09/21/2023 19:29:15 - INFO - root - Best param found at split 5: l2reg = 0.0001 with score 87.25
09/21/2023 19:29:15 - INFO - root - Generating sentence embeddings
09/21/2023 19:29:23 - INFO - root - Generated sentence embeddings
09/21/2023 19:29:23 - INFO - root - Training pytorch-MLP-nhid0-rmsprop-bs128 with (inner) 5-fold cross-validation
09/21/2023 19:29:32 - INFO - root - Best param found at split 1: l2reg = 0.001 with score 95.22
09/21/2023 19:29:42 - INFO - root - Best param found at split 2: l2reg = 1e-05 with score 95.51
09/21/2023 19:29:52 - INFO - root - Best param found at split 3: l2reg = 0.0001 with score 95.31
09/21/2023 19:30:01 - INFO - root - Best param found at split 4: l2reg = 0.001 with score 95.45
09/21/2023 19:30:09 - INFO - root - Best param found at split 5: l2reg = 0.0001 with score 95.46
09/21/2023 19:30:10 - INFO - root - Generating sentence embeddings
09/21/2023 19:30:12 - INFO - root - Generated sentence embeddings
09/21/2023 19:30:12 - INFO - root - Training pytorch-MLP-nhid0-rmsprop-bs128 with (inner) 5-fold cross-validation
09/21/2023 19:30:21 - INFO - root - Best param found at split 1: l2reg = 0.001 with score 89.16
09/21/2023 19:30:29 - INFO - root - Best param found at split 2: l2reg = 1e-05 with score 88.19
09/21/2023 19:30:37 - INFO - root - Best param found at split 3: l2reg = 0.001 with score 88.91
09/21/2023 19:30:45 - INFO - root - Best param found at split 4: l2reg = 0.001 with score 88.44
09/21/2023 19:30:54 - INFO - root - Best param found at split 5: l2reg = 0.001 with score 88.93
09/21/2023 19:30:55 - INFO - root - Computing embedding for train
09/21/2023 19:31:22 - INFO - root - Computed train embeddings
09/21/2023 19:31:22 - INFO - root - Computing embedding for dev
09/21/2023 19:31:23 - INFO - root - Computed dev embeddings
09/21/2023 19:31:23 - INFO - root - Computing embedding for test
09/21/2023 19:31:24 - INFO - root - Computed test embeddings
09/21/2023 19:31:24 - INFO - root - Training pytorch-MLP-nhid0-rmsprop-bs128 with standard validation..
09/21/2023 19:31:36 - INFO - root - [('reg:1e-05', 87.73), ('reg:0.0001', 87.84), ('reg:0.001', 87.61), ('reg:0.01', 86.93)]
09/21/2023 19:31:36 - INFO - root - Validation : best param found is reg = 0.0001 with score 87.84
09/21/2023 19:31:36 - INFO - root - Evaluating...
09/21/2023 19:31:39 - INFO - root - ***** Transfer task : MRPC *****
09/21/2023 19:31:39 - INFO - root - Computing embedding for train
09/21/2023 19:31:45 - INFO - root - Computed train embeddings
09/21/2023 19:31:45 - INFO - root - Computing embedding for test
09/21/2023 19:31:47 - INFO - root - Computed test embeddings
09/21/2023 19:31:47 - INFO - root - Training pytorch-MLP-nhid0-rmsprop-bs128 with 5-fold cross-validation
09/21/2023 19:31:51 - INFO - root - [('reg:1e-05', 74.85), ('reg:0.0001', 74.85), ('reg:0.001', 74.93), ('reg:0.01', 74.07)]
09/21/2023 19:31:51 - INFO - root - Cross-validation : best param found is reg = 0.001 with score 74.93
09/21/2023 19:31:51 - INFO - root - Evaluating...
09/21/2023 19:31:52 - INFO - root - ***** Transfer task : TREC *****
09/21/2023 19:31:54 - INFO - root - Computed train embeddings
09/21/2023 19:31:54 - INFO - root - Computed test embeddings
09/21/2023 19:31:54 - INFO - root - Training pytorch-MLP-nhid0-rmsprop-bs128 with 5-fold cross-validation
09/21/2023 19:32:00 - INFO - root - [('reg:1e-05', 84.15), ('reg:0.0001', 84.02), ('reg:0.001', 83.47), ('reg:0.01', 76.76)]
09/21/2023 19:32:00 - INFO - root - Cross-validation : best param found is reg = 1e-05 with score 84.15
09/21/2023 19:32:00 - INFO - root - Evaluating...
09/21/2023 19:32:00 - INFO - __main__ - ***** Eval results *****
09/21/2023 19:32:00 - INFO - __main__ - STS12 = 0.6466070114897755
09/21/2023 19:32:00 - INFO - __main__ - STS13 = 0.7940081784855644
09/21/2023 19:32:00 - INFO - __main__ - STS14 = 0.7106309581907064
09/21/2023 19:32:00 - INFO - __main__ - STS15 = 0.8022190201969241
09/21/2023 19:32:00 - INFO - __main__ - STS16 = 0.7800045550188356
09/21/2023 19:32:00 - INFO - __main__ - eval_CR = 87.52
09/21/2023 19:32:00 - INFO - __main__ - eval_MPQA = 88.73
09/21/2023 19:32:00 - INFO - __main__ - eval_MR = 82.03
09/21/2023 19:32:00 - INFO - __main__ - eval_MRPC = 74.93
09/21/2023 19:32:00 - INFO - __main__ - eval_SST2 = 87.84
09/21/2023 19:32:00 - INFO - __main__ - eval_SUBJ = 95.39
09/21/2023 19:32:00 - INFO - __main__ - eval_TREC = 84.15
09/21/2023 19:32:00 - INFO - __main__ - eval_avg_sts = 0.7525457395203998
09/21/2023 19:32:00 - INFO - __main__ - eval_avg_transfer = 85.79857142857144
09/21/2023 19:32:00 - INFO - __main__ - eval_sickr_spearman = 0.734116144071677
09/21/2023 19:32:00 - INFO - __main__ - eval_stsb_spearman = 0.8002343091893147
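For reference, the averages in the log check out as the unweighted means of the individual task scores. A quick sketch (values copied from the output above; 78.49 is the average STS score reported in the paper, assumed here for the gap calculation):

```python
# Seven STS tasks feeding eval_avg_sts (Spearman correlations from the log)
sts = {
    "STS12": 0.6466070114897755, "STS13": 0.7940081784855644,
    "STS14": 0.7106309581907064, "STS15": 0.8022190201969241,
    "STS16": 0.7800045550188356, "SICKR": 0.734116144071677,
    "STSB": 0.8002343091893147,
}
avg_sts = sum(sts.values()) / len(sts)          # eval_avg_sts = 0.7525457...

# Seven transfer tasks feeding eval_avg_transfer (accuracies from the log)
transfer = {
    "MR": 82.03, "CR": 87.52, "SUBJ": 95.39, "MPQA": 88.73,
    "SST2": 87.84, "TREC": 84.15, "MRPC": 74.93,
}
avg_transfer = sum(transfer.values()) / len(transfer)  # 85.7985714...

# Gap versus the reported average STS score of 78.49
gap = 78.49 - round(avg_sts * 100, 2)
print(round(avg_sts, 4), round(avg_transfer, 4), round(gap, 2))
```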