This is a fork of HuggingFace's PyTorch implementation of BERT. Please see the original README at https://github.com/huggingface/pytorch-pretrained-BERT
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.
Install prerequisites as specified in the original README.
Copy the Twitter15/16 dataset into the project folder `./data/raw_data`.

The `data` folder should look like this:

```
data/
  raw_data/
    twitter15/
    twitter16/
    tweet_details.json
    user_details.json
```
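Before preprocessing, it can help to sanity-check the layout. This is a minimal sketch (the `check_layout` helper is illustrative, not part of the repo) that verifies the paths from the tree above exist:

```python
import os

# Expected raw-data layout, taken from the tree above.
expected = [
    "data/raw_data/twitter15",
    "data/raw_data/twitter16",
    "data/raw_data/tweet_details.json",
    "data/raw_data/user_details.json",
]

def check_layout(root="."):
    """Return the expected paths that are missing under root."""
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]

missing = check_layout()
if missing:
    print("Missing:", ", ".join(missing))
else:
    print("Raw data layout looks good.")
```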
- Navigate to the `examples/` folder.
- Run `preprocess_rumdect_data_concat_tweets.py` to preprocess files for the LinearBERT model.

  ```
  python preprocess_rumdect_data_concat_tweets.py
  ```
- In `split_data.py`, set the variable `data_mode = linear`.
- Run `split_data.py`:

  ```
  python split_data.py
  ```

  This will produce data for 5 splits under `data/processed_data/linear_structure/twitter15/split_data/`.
- Run `preprocess_rumdect_data_hierarchical.py` to preprocess files for the HierarchicalBERT model.

  ```
  python preprocess_rumdect_data_hierarchical.py
  ```

- In `split_data.py`, set the variable `data_mode = hierarchical`.
- Run `split_data.py`:

  ```
  python split_data.py
  ```

  This will produce data for 5 splits under `data/processed_data/hierarchical_structure/twitter15/split_data/`.
This section lists the commands to train the classifier for the Linear and Hierarchical models on the Twitter15/16 dataset.

Run the following to train the linear model on the first fold. Change `$DATA_DIR` to the correct location.
```
export DATA_DIR=/opt/src/rumor_lstm/data/processed_data/linear_structure/twitter15/split_data
python run_classifier.py \
  --task_name twitter-1516-linear \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $DATA_DIR/split_0/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 8.0 \
  --output_dir ../logs/twitter15_split_0/
```
To train the subsequent folds (folds 1-4), change the `data_dir` and `output_dir` parameters in the python command accordingly.
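The per-fold runs can also be scripted. This is a sketch, assuming the five splits are named `split_0` through `split_4` as produced above; it prints each training command, and dropping the `echo` would actually launch the runs:

```shell
# Sketch: print the per-fold training commands for the linear model.
# Remove "echo" to actually start training (each run may take a while).
DATA_DIR=/opt/src/rumor_lstm/data/processed_data/linear_structure/twitter15/split_data

for i in 0 1 2 3 4; do
  echo python run_classifier.py \
    --task_name twitter-1516-linear \
    --do_train --do_eval --do_lower_case \
    --data_dir "$DATA_DIR/split_$i/" \
    --bert_model bert-base-uncased \
    --max_seq_length 128 \
    --train_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 8.0 \
    --output_dir "../logs/twitter15_split_$i/"
done
```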
Run the following to train the hierarchical model on the first fold. Change `$DATA_DIR` to the correct location.
```
export DATA_DIR=/opt/src/rumor_lstm/data/processed_data/hierarchical_structure/twitter15/split_data
python run_classifier.py \
  --task_name twitter-1516-2segments \
  --do_train \
  --do_eval \
  --do_lower_case \
  --data_dir $DATA_DIR/split_0/ \
  --bert_model bert-base-uncased \
  --max_seq_length 128 \
  --train_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 8.0 \
  --output_dir ../logs/twitter15_split_0/
```
To train the subsequent folds (folds 1-4), change the `data_dir` and `output_dir` parameters in the python command accordingly.
The following steps save the attention weights of BERT's last layer, summing over all attention heads of that layer.
- In `Interpret_BERT.py`, specify the trained model location for `model_fn`:

  ```
  model_fn = '../logs/hierarchical_models/twitter15_split_0/pytorch_model.bin'  # Example. Change the model location accordingly.
  ```
- In `Interpret_BERT.py`, specify the directory of the test data. Note that if the model chosen in step 1 comes from `split_0`, it is important to use the same split for the data:

  ```
  # Example. Change the data location accordingly.
  data_dir = 'C:/git/rumor-lstm/data/processed_data/hierarchical_structure/twitter15/split_data/split_0/'
  ```
- In `Interpret_BERT.py`, set the variable `task_name` to `"twitter-1516-2segments"` to run the hierarchical model, or to `"twitter-1516-linear"` to run the linear model:

  ```
  task_name = "twitter-1516-2segments"  # value can be "twitter-1516-2segments" or "twitter-1516-linear"
  ```
- Run `Interpret_BERT.py`:

  ```
  python Interpret_BERT.py
  ```
The heatmaps are saved as `.png` images under the folder `./heatmap_output/`. The file naming convention is `{test-index}_{ground-truth}_{prediction}.png`. For example, `6_non-rumor_unverified.png` means test sample #6 has "Non-rumor" as its label but is predicted as "Unverified".
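The naming convention makes it easy to summarize the saved heatmaps. This is a small sketch (the helpers `parse_heatmap_name` and `summarize` are illustrative, not part of the repo) that splits each filename into its parts and counts (ground-truth, prediction) pairs:

```python
import os
from collections import Counter

def parse_heatmap_name(fname):
    """Split '{index}_{truth}_{prediction}.png' into its three parts."""
    stem, _ = os.path.splitext(fname)
    index, truth, pred = stem.split("_", 2)
    return int(index), truth, pred

def summarize(folder="./heatmap_output/"):
    """Count (ground-truth, prediction) pairs over all saved heatmaps."""
    pairs = Counter()
    for f in os.listdir(folder):
        if f.endswith(".png"):
            _, truth, pred = parse_heatmap_name(f)
            pairs[(truth, pred)] += 1
    return pairs

# e.g. parse_heatmap_name("6_non-rumor_unverified.png")
#      -> (6, "non-rumor", "unverified")
```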