train/dev/test split #20
Comments
The NELL dataset originally came with only a train/test split. We created the dev set for hyperparameter tuning. The preprocessing script in the repository is not complete: a separate script was used to remove duplicates and inverse duplicates. You can use the dev set we created (https://github.com/shehzaadzd/MINERVA/blob/master/datasets/data_preprocessed/nell/dev.txt).
Thank you for your response. I understand that you split the NELL train data into train and dev sets. Could you please tell me what train/dev proportion you used in your paper? I am trying to reproduce the paper's experimental results, and I noticed the proportion is not mentioned there. Thank you.
We tried to extract 20%, but after removing duplicates (and inverse duplicates) and removing triples that contained the only occurrence of an entity, we were left with ~500 triples. You could use https://github.com/shehzaadzd/MINERVA/blob/master/datasets/data_preprocessed/nell/dev.txt to reproduce our results.
I see, appreciate it.
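For anyone trying to approximate that filtering step, here is a minimal sketch of the three filters described above. It is not the authors' script, which is not in the repository. The function names, the `(e1, r, e2)` tuple layout, and the `_inv` suffix used to recognize inverse relations are all assumptions made for illustration.

```python
def inverse_of(rel, inv_suffix="_inv"):
    """Return the name of the inverse relation (assumed naming: a '_inv' suffix)."""
    if rel.endswith(inv_suffix):
        return rel[: -len(inv_suffix)]
    return rel + inv_suffix


def filter_dev_candidates(candidates, train, inv_suffix="_inv"):
    """Keep dev candidates that are not duplicates or inverse duplicates
    (of train triples or of already kept dev triples) and whose entities
    both still occur somewhere in the training triples."""
    train_set = set(train)
    train_entities = {e for e1, _, e2 in train_set for e in (e1, e2)}
    kept, seen = [], set()
    for e1, r, e2 in candidates:
        triple = (e1, r, e2)
        inverse = (e2, inverse_of(r, inv_suffix), e1)
        if triple in seen or inverse in seen:
            continue  # duplicate or inverse duplicate within dev
        if triple in train_set or inverse in train_set:
            continue  # duplicate or inverse duplicate of a train triple
        if e1 not in train_entities or e2 not in train_entities:
            continue  # entity would have no occurrence left in train
        seen.add(triple)
        kept.append(triple)
    return kept
```

Filters like these can shrink a 20% sample considerably, which is consistent with ending up with far fewer triples than the initial sample.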
Why are your dev triples included in the training data?
code/data/preprocessing_scripts/nell.py:
out_file.write(e1+'\t'+r+'\t'+e2+'\n')
if np.random.normal() > 0.2:
    dev.write(e1+'\t'+r+'\t'+e2+'\n')
In principle, the data should be split into two sets (train/test) or three (train/dev/test) with no overlap between them. Could you please explain the reasoning behind this? Thank you.
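For comparison, a non-overlapping split could look like the sketch below. It is only an illustration, not the repository's code: `split_train_dev`, `dev_fraction`, and `seed` are hypothetical names. Each unique triple is routed to exactly one of the two lists, and the draw is uniform, so `dev_fraction` is the actual expected share. (In the snippet quoted above, `np.random.normal() > 0.2` is true roughly 42% of the time, since it thresholds a standard normal sample rather than a uniform one.)

```python
import random


def split_train_dev(triples, dev_fraction=0.2, seed=0):
    """Split triples into disjoint train and dev lists.

    Exact duplicates are dropped first so that the same triple
    cannot land on both sides of the split.
    """
    rng = random.Random(seed)
    unique = list(dict.fromkeys(triples))  # de-duplicate, keep order
    train, dev = [], []
    for triple in unique:
        # Each triple goes to exactly one list.
        if rng.random() < dev_fraction:
            dev.append(triple)
        else:
            train.append(triple)
    return train, dev
```

Writing each list to its own file (for example `train.txt` and `dev.txt`) would then give two sets with no shared triples. Note that this does not handle inverse duplicates or entities that end up only in dev.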