Confusion about argument detection evaluation #3

Closed
xixy opened this issue Mar 14, 2020 · 5 comments

Comments

@xixy

xixy commented Mar 14, 2020

Thank you very much for releasing the source code for this paper.

However, I notice you used func/f_score to calculate argument detection performance, which basically considers only whether the predicted roles and gold roles match. The event type is ignored in the evaluation. I think this is wrong, considering that the criterion is as follows:

An argument is correctly classified if its offsets, role, related trigger type and trigger’s offsets exactly match a reference argument.

There are some cases you probably miss:

  1. If a trigger is mislabeled as 'None', the trigger and its arguments are ignored, because you only select instances whose predicted event type is not 'None' for the dev/test sets (as model/DMCNN/process_data_for_argument shows).
  2. If a trigger is mislabeled as another event type, the predicted argument roles are meaningless and should be marked as wrong (if I haven't misunderstood the criteria). However, the code ignores the wrong event type and treats the detection as correct if the predicted role matches the gold role (as func/f_score shows).
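
As a rough illustration of the quoted criterion and the two cases, here is a minimal sketch that scores arguments by exact match on all four fields. The `Argument` type, the `strict_prf` helper, and the example spans and labels are hypothetical; none of them come from the repository.

```python
from typing import NamedTuple, Set, Tuple


class Argument(NamedTuple):
    trigger_span: Tuple[int, int]  # (start, end) offsets of the trigger
    event_type: str                # event type assigned to the trigger
    arg_span: Tuple[int, int]      # (start, end) offsets of the argument
    role: str                      # argument role


def strict_prf(pred: Set[Argument], gold: Set[Argument]) -> Tuple[float, float, float]:
    """Precision, recall and F1 under exact match of all four fields."""
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative gold annotations (made-up offsets).
gold = {
    Argument((3, 4), "Attack", (0, 1), "Attacker"),
    Argument((3, 4), "Attack", (5, 6), "Target"),
    Argument((9, 10), "Die", (5, 6), "Victim"),
}

# Case 2: the (3, 4) trigger is predicted as "Injure", so neither argument
# matches gold even though the role strings agree.
# Case 1: the (9, 10) trigger is predicted as 'None', so no argument is
# produced for it; the gold "Victim" argument still counts against recall.
pred = {
    Argument((3, 4), "Injure", (0, 1), "Attacker"),
    Argument((3, 4), "Injure", (5, 6), "Target"),
}

print(strict_prf(pred, gold))
```

Under this reading, a role-only comparison would count two of the predictions above as correct, while the strict comparison counts none.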

Correct me if I'm wrong and thank you again.

@wzq016
Collaborator

wzq016 commented Mar 19, 2020

For the second case: in our original experiments we used a tuple, i.e. (event type, role), for evaluation, and classified an instance as N/A if and only if the event type is N/A or the role is N/A. This code was fully rewritten because the original code was quite messy, and I apologize for missing this point in the rewrite. I have now fixed it; please check the commit history for the modification. I also tested the modified code, and the results remain the same as in the paper.
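
The tuple-based evaluation described above could look roughly like the following. This is a sketch under assumptions, not the code from the commit: the `NA` label string, the `collapse` and `tuple_f_score` helpers, and the aligned-list input format are all hypothetical.

```python
from typing import List, Tuple

NA = "N/A"  # assumed label for "no event" / "no role"

Label = Tuple[str, str]  # (event type, role)


def collapse(label: Label) -> Label:
    """Map a pair to the N/A class if either component is N/A."""
    event_type, role = label
    return (NA, NA) if NA in (event_type, role) else label


def tuple_f_score(pred: List[Label], gold: List[Label]) -> Tuple[float, float, float]:
    """Micro precision/recall/F1 over aligned candidates, ignoring the N/A class."""
    pred = [collapse(p) for p in pred]
    gold = [collapse(g) for g in gold]
    n_pred = sum(p != (NA, NA) for p in pred)
    n_gold = sum(g != (NA, NA) for g in gold)
    n_correct = sum(p == g and g != (NA, NA) for p, g in zip(pred, gold))
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if n_correct else 0.0
    return precision, recall, f1


# Illustrative labels: the second candidate has the right role but the wrong
# event type, so it is counted as an error under the tuple metric.
gold_labels = [("Attack", "Attacker"), ("Attack", "Target"), ("Attack", NA)]
pred_labels = [("Attack", "Attacker"), ("Injure", "Target"), (NA, "Place")]
print(tuple_f_score(pred_labels, gold_labels))
```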

Just a reminder: once you use the tuple for classification, the evaluation is stricter, so you need to do some more processing on the dataset to reach 53.5 with DMCNN, such as cutting sentences whose length exceeds a threshold; please do this processing yourself. Also, from the experimental results, once DMCNN reaches 53.5 under each of these two evaluation metrics, the other models perform similarly under each of them. This means it is just a trick in data preprocessing, and this is one of the main disadvantages of ACE2005.

For the first case: when a trigger is classified as N/A, it is not necessary to run the EAE stage for this trigger, because all entities are N/A in that situation and are pure noise when testing EAE. Also, all models are actually tested in this way. If you want to see what happens when the first case is taken into consideration, you can simply comment out lines 167 and 168 in models.py, i.e.
dev = [np.take(d,dev_slices,axis=0) for d in dev]
test = [np.take(d,test_slices,axis=0) for d in test]
You will see a dramatic drop in performance, and DMCNN will have no way to reach 53.5 because there is too much meaningless noise. From this perspective, I hope you can understand what happened here.
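
To make the effect of the two `np.take` lines concrete, here is a small sketch with made-up arrays. How `dev_slices` is actually built in models.py is not shown in this thread, so the `NONE_ID` and `NA_ROLE_ID` constants, the array names, and the index computation below are assumptions.

```python
import numpy as np

NONE_ID = 0     # assumed id of the 'None' event type
NA_ROLE_ID = 0  # assumed id of the N/A role

# Hypothetical parallel arrays for four candidate instances.
pred_event_type = np.array([2, 0, 5, 0])
tokens = np.array([[11, 12], [21, 22], [31, 32], [41, 42]])
gold_role = np.array([3, 1, 0, 4])
dev = [tokens, pred_event_type, gold_role]

# Indices of instances whose predicted event type is not 'None'.
dev_slices = np.flatnonzero(pred_event_type != NONE_ID)

# Same selection pattern as the two lines quoted above.
filtered = [np.take(d, dev_slices, axis=0) for d in dev]

# Gold arguments that remain in the scored set versus all gold arguments.
kept_gold = int(np.count_nonzero(gold_role[dev_slices] != NA_ROLE_ID))
all_gold = int(np.count_nonzero(gold_role != NA_ROLE_ID))
print(dev_slices, kept_gold, all_gold)
```

In this toy data, two of the three gold arguments belong to instances whose trigger was predicted as 'None', so they drop out of the scored set once the filter is applied.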

@xixy
Author

xixy commented Mar 19, 2020

Thank you for your response.

I'm okay with the second case, and thank you for the modification.

For the first case, if we mislabel a trigger as the None type, the trigger and its arguments are not included in the EAE stage for performance calculation. I get your point, but it doesn't look like the right way.

Thank you again for the modification.

@xixy xixy closed this as completed Mar 19, 2020
@xixy xixy reopened this Mar 19, 2020
@xixy
Author

xixy commented Mar 19, 2020

Another question: which event detection model is used when you report HMEAE(CNN) and HMEAE(BERT) in Table 3/4? Is DMCNN used as the ED model? The experimental setting is confusing to me, since you used two event detection models:

As our work does not involve the event detection stage, we conduct the argument role classification based on the event detection models in Chen et al. (2015) and Wang et al. (2019) for HMEAE (CNN) and HMEAE (BERT) respectively.

Or did HMEAE(CNN) use DMCNN as the ED model and HMEAE(BERT) use the model from Wang et al. (2019) as the ED model?

Thank you.

@wzq016
Collaborator

wzq016 commented Mar 19, 2020

The second one, i.e. DMCNN as ED + HMEAE(CNN) as EAE, and DMBERT as ED + HMEAE(BERT) as EAE.

@xixy
Author

xixy commented Mar 19, 2020

Got it, thanks!

@xixy xixy closed this as completed Mar 19, 2020