Conll perl script refusing to score because of "too many repeated mentions (>10) in the response" #37

Closed
ritwikmishra opened this issue Nov 15, 2022 · 9 comments

Comments

@ritwikmishra

I ran the preparation scripts successfully.

I downloaded the RoBERTa checkpoint from the Dropbox link and placed it in the data folder.

Then I ran the command: python calculate_conll.py roberta test 20

I got some errors from subprocess because I was using Python 3.6 instead of Python 3.7.

The error was: unexpected keyword argument 'capture_output'

Fixed the issue with this
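
For context, the capture_output argument was added to subprocess.run in Python 3.7. A minimal sketch of an equivalent call that also works on Python 3.6 is below; run_captured is a hypothetical helper name, not something from the repository, and the child command is only a stand-in for the scorer.

```python
import subprocess
import sys


def run_captured(cmd):
    """Run cmd and capture stdout/stderr as bytes (works on Python 3.6+)."""
    # Passing PIPE for both streams is what capture_output=True does in 3.7+.
    return subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)


# Stand-in command: a child Python process that prints one line.
proc = run_captured([sys.executable, "-c", "print('ok')"])
print(type(proc.stdout))  # <class 'bytes'>
```

Note that the captured stdout is a bytes object, so it has to be decoded before any text processing.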

But then I got another error: 'NoneType' object has no attribute 'group' (origin of the error: line 15 of calculate_conll.py)

I ran the perl script directly in bash: perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none

The MUC F1 came out to be 86, but while calculating B3 (bcub), I got this error: Found too many repeated mentions (> 10) in the response, so refusing to score. Please fix the output

I think this is also why line 15 was throwing the error above: the scorer output was empty, so there was nothing to match.

How should I proceed now? How can I evaluate the results?
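
On the 'NoneType' object has no attribute 'group' message: that is what Python raises when re.search finds no match and .group() is called on the returned None. A minimal sketch of a guarded lookup follows, using the F1 pattern that appears in the traceback further down this thread; find_f1 is a hypothetical name.

```python
import re

F1_PATTERN = re.compile(r"F1:\s*([0-9.]+)%")


def find_f1(line):
    """Return the F1 value found in line, or raise a readable error."""
    match = F1_PATTERN.search(line)
    if match is None:
        # Without this check, match.group(1) fails with AttributeError on None.
        raise ValueError("no F1 score found in line: %r" % line)
    return float(match.group(1))


print(find_f1("Precision: (13376 / 15760) 84.87%       F1: 86.31%"))  # 86.31
```

With a guard like this, an empty or unexpected line produces an error that shows the offending text instead of a bare AttributeError.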

@ritwikmishra
Author

UPDATE:

I ran the bash command separately for each metric:

perl reference-coreference-scorers/scorer.pl <muc/bcub/ceafe> <keys_file> <response_file> none

I got F1 scores of 86, 79, and 76 for muc, bcub, and ceafe respectively. The average is 80.3, which is close to the ~81 claimed in the paper.

But I still cannot figure out why calculate_conll.py gives an error...
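
The per-metric invocation above could be scripted along these lines. This is only a sketch: scorer_command is a hypothetical helper, the paths are the ones used earlier in this thread, and the commands are printed rather than executed.

```python
SCORER = "reference-coreference-scorers/scorer.pl"


def scorer_command(metric, key_file, response_file):
    """Build the argument list for one scorer run (hypothetical helper)."""
    # The trailing "none" is copied from the commands shown in this thread.
    return ["perl", SCORER, metric, key_file, response_file, "none"]


gold = "data/conll_logs/roberta_test_e20.gold.conll"
pred = "data/conll_logs/roberta_test_e20.pred.conll"
for metric in ("muc", "bcub", "ceafe"):
    print(" ".join(scorer_command(metric, gold, pred)))
```

Each list can be passed to subprocess.run as-is, which avoids shell quoting issues.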

@vdobrovolskii
Owner

Please see this solution:
#4

@ritwikmishra
Author

ritwikmishra commented Nov 16, 2022

@vdobrovolskii I replaced the loop mechanism as suggested here.

I also commented out the error-throwing condition in the Perl script, as suggested here.

The perl script runs fine through bash:

perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none

But the Python script still raises an error:

$ python calculate_conll.py roberta test 20
Traceback (most recent call last):
  File "calculate_conll.py", line 40, in <module>
    extract_f1(subprocess.run(part_a + [metric] + part_b, **kwargs)))
  File "calculate_conll.py", line 15, in extract_f1
    return float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))
AttributeError: 'NoneType' object has no attribute 'group'

Is there any other way to fix calculate_conll.py?

@vdobrovolskii
Owner

I believe the output is a bit different than expected for at least one of the perl scripts. Can you send me the outputs (just the last two lines) for the perl script with "muc", "ceafe" and "bcub" as metrics?

@ritwikmishra

This comment was marked as off-topic.

@vdobrovolskii
Owner

I mean, can you send me the outputs of the perl script?

The calculate_conll.py reads the stdout of the perl script and searches for metrics there.
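
As a rough illustration of that approach, the sketch below takes scorer output as text and reads the F1 from the second-to-last line, mirroring the extract_f1 logic visible in the traceback above. The sample text is abbreviated from the scorer's TOTALS format, and f1_from_output is a hypothetical name.

```python
import re

# Abbreviated sample of the scorer's TOTALS block, already decoded to text.
SAMPLE_OUTPUT = (
    "====== TOTALS =======\n"
    "Identification of Mentions: Recall: (17786 / 19764) 89.99%\t"
    "Precision: (17786 / 20350) 87.4%\tF1: 88.67%\n"
    "--------------------------------------------------------------------------\n"
    "Coreference: Recall: (13376 / 15232) 87.81%\t"
    "Precision: (13376 / 15760) 84.87%\tF1: 86.31%\n"
    "--------------------------------------------------------------------------\n"
)


def f1_from_output(text):
    """Read the coreference F1 from the second-to-last line of the output."""
    lines = text.splitlines()
    # The last line is a dashed separator; the one before it holds the score.
    return float(re.search(r"F1:\s*([0-9.]+)%", lines[-2]).group(1))


print(f1_from_output(SAMPLE_OUTPUT))  # 86.31
```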

@ritwikmishra
Author

Here is the output of perl reference-coreference-scorers/scorer.pl all data/conll_logs/roberta_test_e20.gold.conll data/conll_logs/roberta_test_e20.pred.conll none

Output

version: 8.01 /media/data_dump/Ritwik/git/wl-coref/reference-coreference-scorers/lib/CorScorer.pm

METRIC muc:
Repeated mention in the response: 231, 237 2121
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 76, 78 1313

====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99%      Precision: (17786 / 20350) 87.4%        F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (13376 / 15232) 87.81%     Precision: (13376 / 15760) 84.87%       F1: 86.31%
--------------------------------------------------------------------------

METRIC bcub:
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 231, 237 2121

====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99%      Precision: (17786 / 20350) 87.4%        F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (16320.8804761468 / 19764) 82.57%  Precision: (15765.8216227679 / 20351) 77.46%    F1: 79.94%
--------------------------------------------------------------------------

METRIC ceafm:
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 231, 237 2121

====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99%      Precision: (17786 / 20350) 87.4%        F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (16541 / 19764) 83.69%     Precision: (16541 / 20351) 81.27%       F1: 82.46%
--------------------------------------------------------------------------

METRIC ceafe:
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 231, 237 2121

====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99%      Precision: (17786 / 20350) 87.4%        F1: 88.67%
--------------------------------------------------------------------------
Coreference: Recall: (3495.57012386207 / 4532) 77.13%   Precision: (3495.57012386207 / 4591) 76.13%     F1: 76.63%
--------------------------------------------------------------------------

METRIC blanc:
Repeated mention in the response: 76, 78 1313
Repeated mention in the response: 154, 158 55
Repeated mention in the response: 57, 62 1515
Repeated mention in the response: 158, 160 44
Repeated mention in the response: 119, 122 3030
Repeated mention in the response: 231, 237 2121

====== TOTALS =======
Identification of Mentions: Recall: (17786 / 19764) 89.99%      Precision: (17786 / 20350) 87.4%        F1: 88.67%
--------------------------------------------------------------------------

Coreference:
Coreference links: Recall: (98009 / 111931) 87.56%      Precision: (98009 / 121567) 80.62%      F1: 83.94%
--------------------------------------------------------------------------
Non-coreference links: Recall: (703839 / 883032) 79.7%  Precision: (703839 / 925055) 76.08%     F1: 77.85%
--------------------------------------------------------------------------
BLANC: Recall: (0.836345287908459 / 1) 83.63%   Precision: (0.783537821987766 / 1) 78.35%       F1: 80.9%
--------------------------------------------------------------------------

@vdobrovolskii
Owner

Hmm. Each line that is supposed to be fed to the script is matched correctly:
[image]

Then it might be the case that when calling each metric separately the output is different...

I could investigate it further. Could you kindly modify the extract_f1 function as follows and run the script again? Then send me the output.

def extract_f1(proc: subprocess.CompletedProcess) -> float:
    prev_line = ""
    curr_line = ""
    for line in str(proc.stdout).splitlines():
        prev_line = curr_line
        curr_line = line
    print(repr(prev_line))
    return float(re.search(r"F1:\s*([0-9.]+)%", prev_line).group(1))

@ritwikmishra
Author

The issue was in the way you were converting bytes to a string. As stated here, simply casting bytes to a string with str() gives unintended results. Bytes have to be decoded to get a proper string.

Changing the for loop worked! Before:

for line in str(proc.stdout).splitlines():

After:

for line in proc.stdout.decode('utf-8').splitlines():
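
A small sketch of why the two loops behave differently: str() on a bytes object returns its repr, with the newlines turned into literal backslash-n sequences, so splitlines() sees a single line and the "previous line" stays empty. The sample bytes are abbreviated scorer output, and second_to_last_line is a hypothetical name that replicates the prev_line/curr_line loop from extract_f1.

```python
import re


def second_to_last_line(text):
    """Replicate the prev_line/curr_line loop from extract_f1."""
    prev_line = ""
    curr_line = ""
    for line in text.splitlines():
        prev_line = curr_line
        curr_line = line
    return prev_line


raw = b"Coreference: Recall: (13376 / 15232) 87.81%\tF1: 86.31%\n----------\n"

as_repr = str(raw)             # "b'Coreference: ...\\n----------\\n'"
as_text = raw.decode("utf-8")  # real text with real newlines

print(len(as_repr.splitlines()))  # 1
print(len(as_text.splitlines()))  # 2

pattern = r"F1:\s*([0-9.]+)%"
print(re.search(pattern, second_to_last_line(as_repr)))           # None
print(re.search(pattern, second_to_last_line(as_text)).group(1))  # 86.31
```

The None in the third print is the value that .group(1) was being called on in the traceback.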

Output:

muc 86.31
ceafe 76.63
bcub 79.94
avg 80.96
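
As a quick sanity check on the reported average, it is the plain mean of the three F1 scores:

```python
# F1 scores reported above, in percent.
scores = {"muc": 86.31, "ceafe": 76.63, "bcub": 79.94}

avg = sum(scores.values()) / len(scores)
print("avg %.2f" % avg)  # avg 80.96
```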
