InfoSurgeon

Get the Datasets

First, download and decompress data/NYTimes_orig, data/VOA, and NLP_toolbox.
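Decompression can be scripted along these lines; note the archive filenames and destination below are assumptions, since the release layout is not specified here:

```python
import pathlib
import tarfile

# Sketch only: these archive filenames are assumptions -- substitute
# whatever the actual data release provides.
ARCHIVES = ["NYTimes_orig.tar.gz", "VOA.tar.gz", "NLP_toolbox.tar.gz"]

def extract_archives(archives, dest="."):
    """Extract each archive that is present; return the ones that are not."""
    pathlib.Path(dest).mkdir(parents=True, exist_ok=True)
    missing = []
    for name in archives:
        if pathlib.Path(name).is_file():
            with tarfile.open(name) as tf:
                tf.extractall(dest)  # e.g. unpacks data/NYTimes_orig, data/VOA, ...
        else:
            missing.append(name)
    return missing

extract_archives(ARCHIVES)
```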

Preprocess Data

Our pipeline requires several preprocessing steps, adapted from prior work:

  • Step 0: Prepare raw data into a parsed format standard across the two datasets, NYTimes and VOA

  • Step 1: BERT tokenization for textual summarization features [paper] [code]

  • Step 2: Bottom-Up-Attention visual semantic feature extraction [paper] [code]

  • Step 3: Building the IE/KG given the news article [paper] [code]
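The "standard parsed format" of Step 0 can be pictured as normalizing both corpora into one record schema. The field names below are illustrative assumptions, not the repo's actual schema:

```python
import json

# Sketch only: these field names are hypothetical -- step 0 just needs to
# emit one consistent record shape for both NYTimes and VOA articles.
def to_standard_record(article_id, title, body, images, caption):
    return {
        "id": article_id,
        "title": title,
        "body": body,
        "images": images,    # paths to associated image files
        "caption": caption,  # caption paired with the lead image
    }

record = to_standard_record("nyt-0001", "A headline", "Body text...",
                            ["img/0.jpg"], "A caption.")
print(json.dumps(record))
```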

We are porting the code for the above components into our repo so they can be run via the following commands.

dataset=NYTimes

# When running a dataset for the first time:
if [ "$dataset" == "NYTimes" ]; then
    python scripts/get_NYTimes_data.py  #step 0
fi
sh scripts/preproc_bert.sh "" "" ${dataset}  #step 1a
sh scripts/preproc_bert.sh "" caption ${dataset}  #step 1b
sh scripts/preproc_bert.sh "" title ${dataset}  #step 1c
sh scripts/preproc_bua.sh ${dataset}  #step 2
sh scripts/preproc_IE.sh ${dataset}  #step 3
python data_preproc/prepare_indicator_factors.py ${dataset}
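Before moving on to training, a quick pre-flight check that the dataset directory exists can save a failed run. The data/<dataset>/ layout follows the commands above; the specific files inside it are deliberately not assumed here:

```python
import os

# Minimal pre-flight check: only the data/<dataset>/ directory layout is
# taken from the preprocessing commands; file names inside are not assumed.
def preflight(dataset, root="data"):
    path = os.path.join(root, dataset)
    if not os.path.isdir(path):
        return f"missing {path}: run the preprocessing steps first"
    return f"ok: found {path}"

print(preflight("NYTimes"))
```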

# Optional: install NVIDIA apex
# git clone https://github.com/NVIDIA/apex.git
# cd apex
# pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
# cd ..
# rm -rf apex

This port is still in a beta development phase. If you encounter setup or runtime issues, please fall back to the original preprocessing source code and documentation linked above!

Run Misinformation Detection

# Example usage for doc-level detection task:
python code/engine.py --task doc-level --data_dir data/${dataset}/ --lrate 5e-6 --num_epochs 5 --ckpt_name ${dataset}

# Example usage for KE-level detection task:
python code/engine.py --task KE-level --data_dir data/VOA/ --lrate 0.001 --ckpt_name VOA

Credits & Acknowledgements

The NYTimes dataset originated from GoodNews, which Tan et al., 2020 extended with multimedia to create NeuralNews.

The pristine/unmanipulated VOA news articles used in our data were originally collected by Manling Li; many thanks to her.

General Tips

If you would like to view a Jupyter notebook running on a remote server from your local machine, run something along the lines of:

jupyter notebook --no-browser --port=5050  # on the remote server
ssh -N -f -L localhost:5051:localhost:5050 username@server-entry-address  # from the local machine

Then open http://localhost:5051 in your local browser.
