
Winter 2024 - Evidence Augmented LLMs For Misinformation Detection

Project Description

This project proposes a novel approach to fact-checking that leverages Large Language Models (LLMs) within a multi-model pipeline to produce both veracity labels and informative explanations for claims. Building on previous research, we integrate several predictive AI models and external evidence from reliable sources to improve the contextuality and accuracy of our predictions.

Project Websites

Files

Our predictive models can be trained by running the following scripts:

  • clickbait.py (clickbait model)
  • context_veracity.py (context veracity model)
  • fallacy_detection.py (logical fallacy model)
  • political_bias.py (political bias model)
  • source_reliable.py (source reliability model)
  • spam.py (spam model)
  • text_manipulation.py (textual manipulation model)

The final pipeline (including the generative model) is found in final_pipeline.py.
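
To illustrate the overall shape of the pipeline, here is a minimal sketch of how the predictive models' signals and retrieved evidence can be combined into a prompt for the generative model. The function names, stub predictors, and prompt format below are illustrative assumptions, not the actual interfaces in final_pipeline.py.

```python
# Illustrative sketch only; final_pipeline.py's real interfaces may differ.
from typing import Callable, Dict

# Hypothetical stand-ins for the trained predictive models listed above,
# each mapping a claim string to a score in [0, 1].
PREDICTORS: Dict[str, Callable[[str], float]] = {
    "clickbait": lambda claim: 0.2,           # clickbait.py
    "context_veracity": lambda claim: 0.7,    # context_veracity.py
    "logical_fallacy": lambda claim: 0.1,     # fallacy_detection.py
    "political_bias": lambda claim: 0.4,      # political_bias.py
    "source_reliability": lambda claim: 0.8,  # source_reliable.py
    "spam": lambda claim: 0.05,               # spam.py
    "text_manipulation": lambda claim: 0.3,   # text_manipulation.py
}

def build_llm_prompt(claim: str, evidence: str, scores: Dict[str, float]) -> str:
    """Assemble a prompt giving the generative model the claim, the
    retrieved evidence, and the predictive models' signals."""
    signals = "\n".join(f"- {name}: {score:.2f}" for name, score in scores.items())
    return (
        f"Claim: {claim}\n"
        f"Evidence: {evidence}\n"
        f"Model signals:\n{signals}\n"
        "Provide a veracity label and a short explanation."
    )

if __name__ == "__main__":
    claim = "Example political claim."
    evidence = "Snippet retrieved from an external source."  # placeholder
    scores = {name: predict(claim) for name, predict in PREDICTORS.items()}
    print(build_llm_prompt(claim, evidence, scores))
```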

Most of our experiments can be found in the Experiment_scripts folder.

Running Instructions

  • To install the dependencies, run the following command from the root directory of the project: pip install -r requirements.txt
  • To run the full pipeline, run python final_pipeline.py
  • To get the web app running locally, run streamlit run app.py
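
For a rough idea of the web app's shape, a minimal Streamlit front end might look like the sketch below. The run_pipeline stub is a hypothetical stand-in for whatever entry point app.py actually calls, not the project's real function.

```python
# Minimal sketch of a Streamlit front end; app.py's actual code may differ.
import streamlit as st

def run_pipeline(claim: str) -> str:
    # Hypothetical stand-in for the real pipeline entry point.
    return f"Label: half-true. Explanation for: {claim!r}"

st.title("Evidence-Augmented Misinformation Detection")
claim = st.text_input("Enter a claim to fact-check")
if st.button("Check claim") and claim:
    st.write(run_pipeline(claim))
```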

Data Usage

  • LIAR-PLUS: an extension of the foundational LIAR dataset with 16 distinct features, including historical evaluations, subject matter, party affiliation, and justification. It comprises a training set of 10,242 instances, a validation set of 1,284 instances, and a test set of 1,267 instances, making it a well-structured resource for training, validation, and testing (see the loading sketch after this list).
  • Data Scraped from PolitiFact.com: the collected dataset contains 25,615 records and ten attributes, including statements, summaries, and historical evaluations. We use it to build our predictive models (credibility, spam, source reliability, etc.) and, in conjunction with LIAR-PLUS, to evaluate the full pipeline.
  • Data from Kaggle.com: consists of 32,000 rows with two columns: the first contains headlines from diverse news sites, and the second contains numerical labels indicating clickbait status (1 for clickbait, 0 for non-clickbait).
  • POLUSA Dataset: a large dataset of news articles that we use for evidence retrieval. It contains approximately 0.9M articles covering political topics, published between January 2017 and August 2019 by 18 news outlets.
  • Entity-Manipulated Text Dataset: a large dataset, split into training, validation, and test subsets, that lets us detect text manipulation in context. Text and label are the two features we use to train our style (text manipulation) model.
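
For reference, a loading sketch for the LIAR-PLUS training split is shown below. The file path and column names are assumptions based on the dataset description (the standard distribution ships as headerless TSV files), not verified against this repository's layout.

```python
# Sketch: loading the LIAR-PLUS training split with pandas.
# The path and the 16 column names are assumptions, not repo-verified.
import pandas as pd

COLUMNS = [
    "index", "id", "label", "statement", "subject", "speaker",
    "job_title", "state", "party", "barely_true_counts", "false_counts",
    "half_true_counts", "mostly_true_counts", "pants_on_fire_counts",
    "context", "justification",
]

train = pd.read_csv("data/liar_plus/train2.tsv", sep="\t", names=COLUMNS)
print(train[["label", "statement"]].head())
```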

Notes

  • Make sure to run pip install --upgrade --no-cache-dir gdown before downloading the large models from Google Drive.
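
After upgrading gdown, a download might look like the sketch below; the file ID and output path are placeholders, not the project's actual checkpoints.

```python
# Sketch: fetching a large model checkpoint from Google Drive with gdown.
# "FILE_ID" and the output path are placeholders.
import gdown

url = "https://drive.google.com/uc?id=FILE_ID"
gdown.download(url, output="models/checkpoint.bin", quiet=False)
```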
