Venue prediction with bag-of-words + heterogeneous information as features, using sklearn SGDClassifier
training: https://www.dropbox.com/s/rrbksqvvoefrr4p/training.txt?dl=0
validation: https://www.dropbox.com/s/tw094y2xfcoosv3/validation.txt?dl=0
Dataset description (fields are tab-separated):
Paper_Id \t Paper_title \t Publication_venue \t Cited_Papers \t Cited_Papers_Venues
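
A minimal sketch of loading this format with pandas. It assumes the files have no header row and that fields are separated by literal tab characters. The column names simply mirror the description above, `load_papers` is a hypothetical helper (not part of `src/`), and the sample rows are made up for illustration. How multiple cited papers or venues are delimited inside a field is not specified here, so the sketch leaves those fields as raw strings.

```python
import csv
import io

import pandas as pd

# Column names mirror the dataset description; they are assumed,
# not read from the file itself.
COLUMNS = [
    "paper_id",
    "paper_title",
    "publication_venue",
    "cited_papers",
    "cited_papers_venues",
]


def load_papers(path_or_buffer):
    """Read a tab-separated paper file into a DataFrame (assumes no header row)."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        header=None,
        names=COLUMNS,
        dtype=str,               # keep ids and venues as plain strings
        quoting=csv.QUOTE_NONE,  # titles may contain quote characters
        keep_default_na=False,   # empty citation fields stay as ""
    )


# Made-up sample rows, only to show the expected layout.
sample = (
    "p1\tDeep learning for graphs\tKDD\tp7\tICML\n"
    "p2\tQuery optimization\tVLDB\t\t\n"
)
df = load_papers(io.StringIO(sample))
print(df.shape)
```

For the real data, the same helper would be called with a path such as `./input/training.txt`.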
python3
sklearn
pandas
numpy
pickle
mkdir input # Create input directory
<Download the training and validation datasets from the links above and move them into the input directory>
python3 ./src/clean_data.py --input ./input/training.txt --output ./input/cleaned_training.txt
python3 ./src/clean_data.py --input ./input/validation.txt --output ./input/cleaned_validation.txt
python3 ./src/create_data_example.py --train ./input/cleaned_training.txt --validation ./input/cleaned_validation.txt
python3 ./src/train_classifier.py --train ./input/cleaned_training.txt --validation ./input/cleaned_validation.txt
bag-of-words dimension: 3000
classifier: sklearn SGDClassifier (default parameters)
Feature | F1-micro | F1-macro | Accuracy |
---|---|---|---|
title info. | 0.266 | 0.172 | 0.267 |
title + cited_venue info. | 0.982 | 0.758 | 0.981 |
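
The setup behind the two rows above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `src/train_classifier.py`: it assumes titles are encoded with a `CountVectorizer` capped at 3000 features, that cited venues arrive as a whitespace-separated string and are counted as whole tokens, and that both blocks are concatenated before fitting a default `SGDClassifier`. The helper `build_features` and the toy data are hypothetical, and `random_state` is set only to make the toy run repeatable.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy data, invented for illustration only.
train_titles = [
    "mining frequent patterns in graphs",
    "scalable clustering of large graphs",
    "anomaly detection in data streams",
    "query optimization for column stores",
    "indexing methods for relational databases",
    "transaction processing in distributed databases",
]
train_cited = ["KDD ICDM", "KDD", "ICDM KDD", "VLDB SIGMOD", "SIGMOD", "VLDB"]
train_venues = ["KDD", "KDD", "KDD", "VLDB", "VLDB", "VLDB"]

val_titles = ["pattern mining in streams", "query processing in databases"]
val_cited = ["KDD", "VLDB SIGMOD"]
val_venues = ["KDD", "VLDB"]

# Bag-of-words over titles, capped at 3000 dimensions.
title_vec = CountVectorizer(max_features=3000)
# Cited venues: each whitespace-separated venue name is one token (assumed format).
venue_vec = CountVectorizer(token_pattern=r"\S+", lowercase=False)


def build_features(titles, cited, fit=False):
    """Concatenate title bag-of-words and cited-venue counts into one sparse matrix."""
    if fit:
        t = title_vec.fit_transform(titles)
        v = venue_vec.fit_transform(cited)
    else:
        t = title_vec.transform(titles)
        v = venue_vec.transform(cited)
    return hstack([t, v]).tocsr()


X_train = build_features(train_titles, train_cited, fit=True)
X_val = build_features(val_titles, val_cited)

clf = SGDClassifier(random_state=0)  # otherwise default parameters
clf.fit(X_train, train_venues)
pred = clf.predict(X_val)

micro = f1_score(val_venues, pred, average="micro")
macro = f1_score(val_venues, pred, average="macro")
acc = accuracy_score(val_venues, pred)
print(f"F1-micro={micro:.3f} F1-macro={macro:.3f} Accuracy={acc:.3f}")
```

Dropping the `venue_vec` block from `build_features` gives the title-only variant, which corresponds to the first row of the table.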