First, download and decompress data/NYTimes_orig, data/VOA, and NLP_toolbox.
Our pipeline relies on several preprocessing steps adapted from prior work:
- Step 0: Prepare the raw data in a parsed format that is standard across the two datasets, NYTimes and VOA
- Step 1: BERT tokenization for textual summarization features [paper] [code]
- Step 2: Bottom-Up-Attention visual semantic feature extraction [paper] [code]
- Step 3: Building the IE/KG for a given news article [paper] [code]
We are porting the code for the above components into our repo so that they can be run with the following commands.
dataset=NYTimes
## When running on the dataset for the first time:
if [ "$dataset" = "NYTimes" ]; then
    python scripts/get_NYTimes_data.py #step 0
fi
sh scripts/preproc_bert.sh "" "" ${dataset} #step 1a
sh scripts/preproc_bert.sh "" caption ${dataset} #step 1b
sh scripts/preproc_bert.sh "" title ${dataset} #step 1c
sh scripts/preproc_bua.sh ${dataset} #step 2
sh scripts/preproc_IE.sh ${dataset} #step 3
python data_preproc/prepare_indicator_factors.py ${dataset}
# git clone https://github.com/NVIDIA/apex.git && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" . && cd .. && rm -rf apex
# git clone https://github.com/NVIDIA/apex.git
# cd apex
# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
# pip install .
# cd ..
Note that this port is still in beta. If you encounter setup or runtime issues, please check out and run the original preprocessing source code and documentation linked above.
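Before launching the full preprocessing run, it can help to print the commands for a dataset and check their order. The following is a minimal sketch, not part of the repo: `preproc_plan` is a hypothetical helper name, and it assumes the only supported dataset names are `NYTimes` and `VOA`, as in the steps above. It only echoes the commands and does not execute them.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of the repo): print the
# preprocessing commands for one dataset, in order, without running them.
preproc_plan() {
    dataset="$1"
    # Assumption: only these two dataset names are valid.
    case "$dataset" in
        NYTimes|VOA) ;;
        *) echo "unknown dataset: $dataset" >&2; return 1 ;;
    esac
    # Step 0 applies only to NYTimes.
    if [ "$dataset" = "NYTimes" ]; then
        echo "python scripts/get_NYTimes_data.py"
    fi
    # Steps 1a-1c: body text (empty field), caption, title.
    for field in '""' caption title; do
        echo "sh scripts/preproc_bert.sh \"\" $field $dataset"
    done
    echo "sh scripts/preproc_bua.sh $dataset"    # step 2
    echo "sh scripts/preproc_IE.sh $dataset"     # step 3
    echo "python data_preproc/prepare_indicator_factors.py $dataset"
}
```

For example, `preproc_plan VOA > run_voa.sh` writes the plan to a file that can be reviewed and then run from the repo root.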
# Example usage for doc-level detection task:
python code/engine.py --task doc-level --data_dir data/${dataset}/ --lrate 5e-6 --num_epochs 5 --ckpt_name ${dataset}
# Example usage for KE-level detection task:
# python code/engine.py --task KE-level --data_dir data/VOA/ --lrate 0.001 --ckpt_name VOA
The NYTimes dataset originated from GoodNews, which Tan et al., 2020 extended into the multimedia NeuralNews dataset.
The pristine (unmanipulated) VOA news articles used in our data were originally collected by Manling Li. Many thanks to her.
To view a Jupyter notebook that is running on a remote server from your local machine, do something along the lines of:
jupyter notebook --no-browser --port=5050 # in the server
ssh -N -f -L localhost:5051:localhost:5050 username@server-entry-address # from local machine
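In the `ssh` command above, `-L` forwards a local port to a port on the remote side, `-N` skips running a remote command, and `-f` sends `ssh` to the background. The sketch below shows how the pieces fit together. `tunnel_cmd` is a hypothetical helper, not part of the repo, and it only prints the command so the ports can be checked before connecting.

```shell
#!/bin/sh
# Hypothetical helper: build the ssh port-forwarding command.
# Arguments: <local_port> <remote_port> <user@host>
tunnel_cmd() {
    if [ "$#" -ne 3 ]; then
        echo "usage: tunnel_cmd <local_port> <remote_port> <user@host>" >&2
        return 1
    fi
    local_port="$1"
    remote_port="$2"
    target="$3"
    # -L local_host:local_port:remote_host:remote_port
    echo "ssh -N -f -L localhost:${local_port}:localhost:${remote_port} ${target}"
}
```

With the values used above, `tunnel_cmd 5051 5050 username@server-entry-address` prints the same command, after which the notebook should be reachable at `http://localhost:5051` in a local browser.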