InfoSurgeon

Get the Datasets

First, download and decompress data/NYTimes_orig, data/VOA, and NLP_toolbox.
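Decompression can be scripted along these lines; note the archive filenames and destination below are assumptions, since the release layout is not specified here:

```python
import pathlib
import tarfile

# Sketch only: these archive filenames are assumptions -- substitute
# whatever the actual data release provides.
ARCHIVES = ["NYTimes_orig.tar.gz", "VOA.tar.gz", "NLP_toolbox.tar.gz"]

def extract_archives(archives, dest="."):
    """Extract each archive that is present; return the ones that are not."""
    pathlib.Path(dest).mkdir(parents=True, exist_ok=True)
    missing = []
    for name in archives:
        if pathlib.Path(name).is_file():
            with tarfile.open(name) as tf:
                tf.extractall(dest)  # e.g. unpacks data/NYTimes_orig, data/VOA, ...
        else:
            missing.append(name)
    return missing

extract_archives(ARCHIVES)
```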

Preprocess Data

Our pipeline requires several preprocessing steps, adapted from prior work:

  • Step 0: Prepare raw data into a parsed format standard across the two datasets, NYTimes and VOA

  • Step 1: BERT tokenization for textual summarization features [paper] [code]

  • Step 2: Bottom-Up-Attention visual semantic feature extraction [paper] [code]

  • Step 3: Building the IE/KG given the news article [paper] [code]
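The "standard parsed format" of Step 0 can be pictured as normalizing both corpora into one record schema. The field names below are illustrative assumptions, not the repo's actual schema:

```python
import json

# Sketch only: these field names are hypothetical -- step 0 just needs to
# emit one consistent record shape for both NYTimes and VOA articles.
def to_standard_record(article_id, title, body, images, caption):
    return {
        "id": article_id,
        "title": title,
        "body": body,
        "images": images,    # paths to associated image files
        "caption": caption,  # caption paired with the lead image
    }

record = to_standard_record("nyt-0001", "A headline", "Body text...",
                            ["img/0.jpg"], "A caption.")
print(json.dumps(record))
```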

We are porting the code for the above components into our repo so they can be run via the following commands.

dataset=NYTimes

# When running a dataset for the first time:
if [ "$dataset" == "NYTimes" ]; then
    python scripts/get_NYTimes_data.py  #step 0
fi
sh scripts/preproc_bert.sh "" "" ${dataset}  #step 1a
sh scripts/preproc_bert.sh "" caption ${dataset}  #step 1b
sh scripts/preproc_bert.sh "" title ${dataset}  #step 1c
sh scripts/preproc_bua.sh ${dataset}  #step 2
sh scripts/preproc_IE.sh ${dataset}  #step 3
python data_preproc/prepare_indicator_factors.py ${dataset}
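Before moving on to training, a quick pre-flight check that the dataset directory exists can save a failed run. The data/<dataset>/ layout follows the commands above; the specific files inside it are deliberately not assumed here:

```python
import os

# Minimal pre-flight check: only the data/<dataset>/ directory layout is
# taken from the preprocessing commands; file names inside are not assumed.
def preflight(dataset, root="data"):
    path = os.path.join(root, dataset)
    if not os.path.isdir(path):
        return f"missing {path}: run the preprocessing steps first"
    return f"ok: found {path}"

print(preflight("NYTimes"))
```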

# Optional: install NVIDIA apex
# git clone https://github.com/NVIDIA/apex.git
# cd apex
# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
# cd ..
# rm -rf apex

This port is still in a beta development phase. If you encounter setup or runtime issues, please fall back to the original preprocessing source code and documentation linked above!

Run Misinformation Detection

# Example usage for doc-level detection task:
python code/engine.py --task doc-level --data_dir data/${dataset}/ --lrate 5e-6 --num_epochs 5 --ckpt_name ${dataset}

# Example usage for KE-level detection task:
python code/engine.py --task KE-level --data_dir data/VOA/ --lrate 0.001 --ckpt_name VOA

Credits & Acknowledgements

The NYTimes dataset originated from GoodNews, which Tan et al., 2020 extended with multimedia to create NeuralNews.

The pristine/unmanipulated VOA news articles used in our data were originally collected by Manling Li; many thanks to her.

General Tips

If you would like to view a Jupyter notebook running on a remote server from your local machine, run something along the lines of:

jupyter notebook --no-browser --port=5050  # on the remote server
ssh -N -f -L localhost:5051:localhost:5050 username@server-entry-address  # from the local machine

Then open http://localhost:5051 in your local browser.
