Confusion about argument detection evaluation #3

xixy · 2020-03-14T06:19:17Z

Thank you very much for releasing source code about this paper.

However, I notice that you use func/f_score to calculate argument detection performance, which basically only checks whether the predicted roles match the gold roles. The event type is ignored in the evaluation. I think something is wrong here, considering that the criterion is as follows:

An argument is correctly classified if its offsets, role, related trigger type and trigger’s offsets exactly match a reference argument.

There are two cases you have probably missed:

If a trigger is mislabeled as 'None', the trigger and its arguments are ignored, because you only select instances whose predicted event type is not 'None' for the dev/test sets (as model/DMCNN/process_data_for_argument shows).
If a trigger is mislabeled as another event type, the predicted argument roles are meaningless and should be marked as wrong (if I haven't misunderstood the criterion). However, the code ignores the wrong event type and treats the detection as correct if the predicted role matches the gold role (as func/f_score shows).
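To make the criterion quoted above concrete, here is a minimal sketch of strict matching. It is not the repository's func/f_score; the `Argument` type, the `strict_prf` helper, and the example event types and roles are all hypothetical and only illustrate the idea that a prediction counts when all four fields match.

```python
from typing import NamedTuple, Set, Tuple


class Argument(NamedTuple):
    """One argument mention, keyed by everything the strict criterion compares."""
    arg_span: Tuple[int, int]      # (start, end) offsets of the argument
    role: str                      # argument role
    trigger_span: Tuple[int, int]  # (start, end) offsets of the related trigger
    event_type: str                # event type of the related trigger


def strict_prf(predicted: Set[Argument], gold: Set[Argument]) -> Tuple[float, float, float]:
    """Precision, recall and F1 where only exact four-field matches are correct."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Illustrative data only (hypothetical offsets, types and roles).
gold = {
    Argument((0, 1), "Attacker", (2, 3), "Conflict.Attack"),
    Argument((4, 5), "Target", (2, 3), "Conflict.Attack"),
}
predicted = {
    # Exact match on all four fields: counted as correct.
    Argument((0, 1), "Attacker", (2, 3), "Conflict.Attack"),
    # Offsets and role match, but the trigger type is wrong: counted as an error.
    Argument((4, 5), "Target", (2, 3), "Life.Die"),
}
precision, recall, f1 = strict_prf(predicted, gold)
```

Under a role-only comparison both predictions would be counted as correct; under the strict criterion only the first one is.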

Please correct me if I'm wrong, and thank you again.

wzq016 · 2020-03-19T02:57:05Z

For the second case: in our original experiments we used a tuple, i.e. (event type, role), for evaluation, and assigned the N/A class if and only if the event type is N/A or the role is N/A. This code was fully rewritten because the original code was quite messy, and I apologize for missing this point in the rewrite. I have now fixed it; please check the commit history to see the modification. I also tested the modified code, and the results remain the same as in the paper.
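A rough sketch of the tuple-based scoring described here, assuming one predicted and one gold (event type, role) pair per candidate argument. The names `tuple_label` and `tuple_f1`, the "ANY" placeholder, and the example data are hypothetical; this is not the code from the commit.

```python
NA = "N/A"


def tuple_label(event_type, role):
    """Collapse to N/A if and only if the event type or the role is N/A."""
    if event_type == NA or role == NA:
        return NA
    return (event_type, role)


def tuple_f1(pred_pairs, gold_pairs):
    """Micro F1 over aligned candidates, scored on (event type, role) labels."""
    pred = [tuple_label(e, r) for e, r in pred_pairs]
    gold = [tuple_label(e, r) for e, r in gold_pairs]
    correct = sum(1 for p, g in zip(pred, gold) if p == g and p != NA)
    n_pred = sum(1 for p in pred if p != NA)
    n_gold = sum(1 for g in gold if g != NA)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return 2 * precision * recall / (precision + recall) if correct else 0.0


# Illustrative data only: the second candidate has the right role
# but the wrong predicted event type.
gold_pairs = [("Conflict.Attack", "Attacker"), ("Conflict.Attack", "Target"), ("Conflict.Attack", NA)]
pred_pairs = [("Conflict.Attack", "Attacker"), ("Life.Die", "Target"), ("Conflict.Attack", NA)]

strict_score = tuple_f1(pred_pairs, gold_pairs)

# Role-only scoring, emulated by replacing every event type with a constant.
role_only_score = tuple_f1(
    [("ANY", r) for _, r in pred_pairs],
    [("ANY", r) for _, r in gold_pairs],
)
```

On this toy input the role-only score is higher than the tuple score, which is the gap the second case points at.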

Just a reminder: once you use the tuple for classification, the evaluation is stricter, so you need some additional preprocessing of the dataset to reach 53.5 with DMCNN, such as cutting sentences whose length exceeds a threshold. Please do this processing yourself. Also, from the experimental results, once DMCNN reaches 53.5 under each of the two evaluation metrics, the other models perform similarly under both metrics, which means it is just a trick in data preprocessing. This is one of the main disadvantages of ACE2005.

For the first case: when a trigger is classified as N/A, it is not necessary to run the EAE stage for this trigger, because all entities are N/A in that situation and are pure noise when testing EAE. Also, all models are actually tested this way. If you want to see what happens when the first case is taken into account, you can simply comment out lines 167 and 168 in models.py, i.e.
dev = [np.take(d,dev_slices,axis=0) for d in dev]
test = [np.take(d,test_slices,axis=0) for d in test]
You will see a dramatic drop in performance, and DMCNN will have no way to reach 53.5 because there is too much meaningless noise. I hope this helps you understand what happens here.
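For readers unfamiliar with the two lines above, here is a small self-contained illustration of what that kind of `np.take` filtering does. The arrays, the convention that 0 encodes the None type, and the way `dev_slices` is built are assumptions made for this toy example, not the actual contents of models.py.

```python
import numpy as np

# Hypothetical toy data: one row per candidate (trigger, entity) instance.
pred_event_type = np.array([3, 0, 5, 0])  # assumption: 0 encodes the None / N/A type
features = np.arange(8).reshape(4, 2)     # stand-in for the model inputs
gold_roles = np.array([2, 1, 0, 0])       # assumption: 0 encodes the N/A role

dev = [features, gold_roles]

# Indices of instances whose predicted event type is not None.
dev_slices = np.nonzero(pred_event_type != 0)[0]

# Same pattern as the two lines quoted above: keep only the selected rows
# of every array in the dev set.
dev_filtered = [np.take(d, dev_slices, axis=0) for d in dev]
```

Here instance 1 has a non-N/A gold role but a trigger predicted as None, so it is dropped before scoring. Commenting out the filtering would keep it, together with every N/A instance attached to a None trigger.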

xixy · 2020-03-19T03:43:32Z

Thank you for your response.

I'm okay with the second case, and thank you for the modification.

For the first case, if we mislabel a trigger as the None type, the trigger and its arguments are not included in the EAE stage when performance is calculated. I get your point, but it doesn't look like the right way to do it.
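The concern can be phrased in terms of recall: gold arguments attached to a missed trigger never enter the scored set, so they are not counted as false negatives. A tiny sketch with made-up counts (the `recall` helper and all numbers are purely illustrative):

```python
def recall(correct, gold_scored, gold_dropped=0):
    """Recall over the gold arguments, optionally including dropped ones."""
    total = gold_scored + gold_dropped
    return correct / total if total else 0.0


# Made-up counts: 80 gold arguments survive the filtering, 20 belong to
# triggers that were mislabeled as None and were filtered out.
recall_filtered = recall(correct=40, gold_scored=80)
recall_end_to_end = recall(correct=40, gold_scored=80, gold_dropped=20)
```

With these counts the filtered recall is 0.5, while counting the dropped gold arguments as misses gives 0.4.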

Thank you again for the modification.

xixy · 2020-03-19T12:54:29Z

Another question: which event detection model is used when you report HMEAE(CNN) and HMEAE(BERT) in Tables 3 and 4? Is DMCNN used as the ED model? The experimental setting is confusing to me since you used two event detection models:

As our work does not involve the event detection stage, we conduct the argument role classification based on the event detection models in Chen et al. (2015) and Wang et al. (2019) for HMEAE (CNN) and HMEAE (BERT) respectively.

Or did HMEAE(CNN) use DMCNN as the ED model and HMEAE(BERT) use the model from Wang et al. (2019) as the ED model?

Thank you.

wzq016 · 2020-03-19T12:56:48Z

The second one, i.e. DMCNN as ED + HMEAE(CNN) as EAE, and DMBERT as ED + HMEAE(BERT) as EAE.

xixy · 2020-03-19T13:09:43Z

Got it, thanks!

xixy closed this as completed Mar 19, 2020

xixy reopened this Mar 19, 2020

xixy closed this as completed Mar 19, 2020
