# Detecting COVID-19 and General Health Misinformation
**Authors**: Santiago von Straussburg, Kyle Parfait

## Problem Overview

Our project aims to build a robust model for detecting fake health news, with a primary focus on COVID-19 misinformation. The challenge extends beyond COVID-19 detection: we want to determine whether a model trained on pandemic-specific misinformation can generalize to other types of misleading health claims.

This problem matters because false health information can have serious real-world consequences. During the COVID-19 pandemic, we witnessed how misinformation about cures, treatments, and vaccines could influence public behavior and potentially harm public health. Our hypothesis is that health misinformation shares underlying patterns that transcend specific topics. In other words, a model that successfully identifies COVID-19 falsehoods might also detect misleading claims about other health issues, such as miracle cures or unproven treatments.

If our approach successfully transfers from COVID-19 to other health misinformation, it would demonstrate that our system is not merely memorizing pandemic-specific language patterns but is actually learning meaningful characteristics of health misinformation in general. This would be a significant contribution to the ongoing battle against health-related fake news.

## Data

We are using two primary datasets focused on COVID-19 misinformation:

1. **COVID-19 Fake News Dataset** (from Kaggle): This dataset contains news articles labeled as either "fake" or "real" regarding COVID-19 information.

2. **CoAID (COVID-19 Healthcare Misinformation Dataset)**: This is a diverse collection that combines news articles, social media posts, and user engagement data related to COVID-19 information, all labeled as "fake" or "real".

### Data Collection Process

We obtained these datasets from their respective sources:

1. The COVID-19 Fake News Dataset was downloaded from Kaggle, where it was compiled by researchers collecting news articles and fact-checking their veracity during the pandemic. The data collection process involved monitoring trusted fact-checking sources such as Poynter, IFCN, and WHO Mythbusters.

2. The CoAID dataset was accessed through GitHub, where it was published by researchers at Pennsylvania State University. This dataset was compiled by collecting both news articles and social media content from December 2019 to July 2020, which was then labeled through a combination of fact-checking website references and expert review.

### Data Challenges

The collection and preparation of these datasets presented several challenges:

1. **Noisy data from social media**: Social media content often contains slang, abbreviations, emoji, hashtags, and non-standard language, all of which make preprocessing more complex.

2. **Class imbalance**: There are typically fewer examples of fake news than of legitimate news, which can bias model training toward the majority class.

3. **Topic specificity**: COVID-19 data contains pandemic-specific terminology (e.g., "hydroxychloroquine", "remdesivir") that may not generalize to other health topics.

4. **Temporal shifts**: Misinformation evolves over time. News from early 2020 focused on different aspects than later content.

5. **URL accessibility**: Some news URLs in the datasets were inaccessible or returned empty content.
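One common way to counter the class imbalance noted above is to weight each class inversely to its frequency during training. The sketch below is a minimal illustration, not our final implementation: the function name `balanced_class_weights` and the toy label list are assumptions. It uses the same heuristic as scikit-learn's `class_weight="balanced"` option, `n_samples / (n_classes * class_count)`.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical label distribution (8 real, 2 fake), for illustration only.
labels = ["real"] * 8 + ["fake"] * 2
weights = balanced_class_weights(labels)
print(weights)
```

With these toy labels the minority "fake" class receives a weight four times that of the "real" class, so misclassified fake examples would contribute more to the training loss.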

## Method

Our approach to detecting fake health news involves a multi-step process:

### Text Preprocessing Pipeline

We've implemented a robust preprocessing pipeline that handles the challenges specific to social media and news content:

1. **Tokenization**: Breaking text into individual tokens (words, punctuation)
2. **Lowercasing**: Converting all text to lowercase so that case variants map to the same token, which shrinks the vocabulary
3. **Stop word removal**: Removing common words that don't contribute much meaning
4. **Lemmatization**: Reducing words to their base forms
5. **Special handling for URLs, mentions, and COVID terminology**: Replacing or normalizing these elements
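The steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the stop-word list, the `COVID_TERMS` mapping, and the `<url>`/`<user>` placeholder tokens are hypothetical choices, not the exact ones in our pipeline. Lemmatization is omitted because it needs an external resource (for example, a lemmatizer from NLTK or spaCy).

```python
import re

# Tiny illustrative stop-word list (assumption); a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "that", "this"}

# Hypothetical mapping of COVID-19 spelling variants to one canonical token.
COVID_TERMS = {
    "covid19": "covid",
    "covid-19": "covid",
    "coronavirus": "covid",
    "sars-cov-2": "covid",
}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# A token is either a placeholder such as <url> or a (possibly hyphenated) word.
TOKEN_RE = re.compile(r"<[A-Za-z]+>|[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

def preprocess(text):
    """Return a list of cleaned tokens for one document."""
    text = URL_RE.sub(" <url> ", text)        # replace URLs with a placeholder
    text = MENTION_RE.sub(" <user> ", text)   # replace @mentions with a placeholder
    tokens = TOKEN_RE.findall(text)           # tokenize (drops punctuation and '#')
    tokens = [tok.lower() for tok in tokens]  # lowercase
    tokens = [COVID_TERMS.get(tok, tok) for tok in tokens]  # normalize COVID terms
    return [tok for tok in tokens if tok not in STOP_WORDS]  # remove stop words

example = "The CoronaVirus vaccine is a HOAX! Read https://example.com/x @WHO #COVID19"
tokens = preprocess(example)
print(tokens)
# ['covid', 'vaccine', 'hoax', 'read', '<url>', '<user>', 'covid']
```

Replacing URLs and mentions with placeholders, rather than deleting them, keeps the signal that a post contained a link or a mention without tying the model to specific accounts or domains.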

### Modeling Approach

We are implementing and comparing two main approaches:

1. **Baseline Model**: A traditional machine learning approach using TF-IDF features with either Logistic Regression or Support Vector Machine (SVM). This serves as a benchmark for comparison.

2. **Advanced Model**: A transformer-based approach using a fine-tuned BERT model, which has shown strong performance in various text classification tasks.
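To make the baseline's features concrete, the sketch below computes TF-IDF weights by hand for a toy corpus. It is an illustration only: the function name `tfidf` and the example documents are assumptions, and it uses the plain textbook formula `tf * log(N / df)`. Library implementations such as scikit-learn's `TfidfVectorizer` apply additional smoothing and normalization, so their numbers will differ.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

# Toy, already-tokenized documents (hypothetical).
docs = [
    ["covid", "vaccine", "kills", "people"],
    ["covid", "vaccine", "trial", "results"],
    ["covid", "miracle", "cure", "kills"],
]
vectors = tfidf(docs)
for term, weight in sorted(vectors[0].items()):
    print(f"{term}: {weight:.3f}")
```

A term that appears in every document ("covid" here) gets weight zero, while rarer terms are weighted up. A linear classifier such as Logistic Regression or an SVM is then trained on these vectors.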

### Unique Variations to Existing Methods

Our implementation includes several novel modifications to standard approaches:

1. **Domain-specific preprocessing**: Our preprocessing pipeline includes special handling for COVID-19 terminology and social media artifacts that standard NLP pipelines might miss.

2. **Hybrid feature approach**: Beyond using just TF-IDF or word embeddings alone, we're experimenting with combining them with linguistic and statistical features like lexical diversity, sentiment scores, and readability metrics.

3. **Cross-domain transfer learning**: We're developing a novel domain adaptation technique that uses COVID-19 data as a source domain and other health misinformation as a target domain, with a special focus on preserving general deception signals while reducing topic-specific biases.
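As a rough illustration of the linguistic and statistical features mentioned in the hybrid approach, the sketch below computes lexical diversity (type-token ratio) and two crude readability proxies. The function name `linguistic_features` and the specific feature set are assumptions for illustration. Sentiment scores are left out because they require an external lexicon or model.

```python
import re

def linguistic_features(text):
    """Return a small dict of topic-independent style features for one document."""
    words = re.findall(r"[A-Za-z']+", text)
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        # Type-token ratio: unique words divided by total words.
        "lexical_diversity": len({w.lower() for w in words}) / n_words if n_words else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Words per sentence, a crude readability proxy.
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "exclamation_count": text.count("!"),
    }

# Hypothetical example text.
feats = linguistic_features("This cure works! This cure is a miracle. Doctors hate it!")
print(feats)
```

Features like these could be concatenated with the TF-IDF or embedding vector of the same document. Because they describe style rather than topic, they are candidates for signals that carry over from COVID-19 to other health domains.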

### Evaluation Metrics

We will evaluate our models using accuracy, precision, recall, F1-score, and confusion matrices. We'll also perform cross-domain evaluation by testing our COVID-19-trained model on non-COVID health misinformation to assess generalization capabilities.
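For reference, the metrics above can be derived from the four cells of a binary confusion matrix. The sketch below is a minimal version that treats "fake" as the positive class. The function name `binary_metrics` and the toy labels are assumptions, and the predictions are made up for illustration rather than taken from our models.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels and predictions, for illustration only.
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "fake", "real", "real", "real", "real"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

For cross-domain evaluation, the same function would be applied to predictions on the non-COVID test set, so that in-domain and out-of-domain scores can be compared directly.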

## Intermediate/Preliminary Experiments & Results

At this milestone, we have conducted several preliminary experiments and analyses:

### Target Word Analysis

We analyzed the frequency of specific target words ("kills", "vaccine", "force", "death", "facebook") in news articles. This analysis helps us understand linguistic patterns that might differentiate fake from real news. The results show that terms like "vaccine" and "death" appear more frequently in fake news articles, often in more sensationalist contexts.
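A frequency comparison of this kind can be sketched as follows. The function name `target_word_rates`, the per-1,000-token normalization, and the two toy articles are assumptions for illustration. The example does not reproduce our dataset or the results reported above.

```python
import re
from collections import Counter

TARGET_WORDS = ("kills", "vaccine", "force", "death", "facebook")

def target_word_rates(articles):
    """articles: iterable of (text, label) pairs.

    Returns {label: {word: occurrences per 1,000 tokens}} so that classes
    with different total lengths can be compared fairly.
    """
    token_totals = Counter()
    hits = {}
    for text, label in articles:
        tokens = re.findall(r"[a-z]+", text.lower())
        token_totals[label] += len(tokens)
        counts = hits.setdefault(label, Counter())
        counts.update(tok for tok in tokens if tok in TARGET_WORDS)
    return {
        label: {
            word: 1000 * hits[label][word] / token_totals[label] if token_totals[label] else 0.0
            for word in TARGET_WORDS
        }
        for label in hits
    }

# Two invented articles, for illustration only.
articles = [
    ("Vaccine kills, vaccine death toll hidden", "fake"),
    ("Vaccine trial reports no death among volunteers", "real"),
]
rates = target_word_rates(articles)
for label, by_word in rates.items():
    print(label, {w: round(r, 1) for w, r in by_word.items()})
```

Normalizing by the total token count per class matters when one class has more or longer articles than the other, since raw counts would otherwise favor the larger class.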

### Data Preparation Progress

We have completed the following steps in data preparation:

1. **Data Collection**: Acquired the COVID-19 Fake News Dataset and the CoAID collection
2. **Initial Data Exploration**: Analyzed dataset structure, class distribution, and basic statistics
3. **Text Preprocessing Pipeline**: Developed and tested preprocessing functions for cleaning text data
4. **Target Word Analysis**: Analyzed frequency of specific words that might indicate fake news

### Preliminary Model Testing

We have designed the framework for our baseline models (TF-IDF with Logistic Regression or SVM) and are in the process of implementing our advanced BERT-based approach. Initial tests on small subsets of data show promising results, though comprehensive evaluation is still pending.

Our baseline models have achieved the following preliminary results on a small validation set:
- Logistic Regression with TF-IDF: 78.3% accuracy
- SVM with TF-IDF: 79.5% accuracy

These initial results suggest that even simple models can capture some patterns of misinformation, though we expect the advanced models to perform significantly better, especially on out-of-domain data.

## Related Work

Several research papers have addressed fake news detection, particularly in the context of health and COVID-19 misinformation. Here, we summarize five key papers and compare them to our approach:

### 1. Patwa et al. (2021) - "Fighting an Infodemic: COVID-19 Fake News Dataset"

This paper introduced a large-scale COVID-19 fake news dataset and tested various classical machine learning and deep learning models. Interestingly, they found that simpler models like SVM and logistic regression sometimes outperformed more complex architectures.

**Comparison to our work**: 
- **Similarities**: We also evaluate both traditional ML and transformer-based approaches; we use similar evaluation metrics (accuracy, F1-score)
- **Differences**: While Patwa et al. focused purely on COVID-19 misinformation, our project extends beyond this to test generalization to other health topics; we apply more sophisticated preprocessing specific to health domain terminology
- **Our enhancements**: We're incorporating linguistic features beyond just word frequencies; our evaluation includes cross-domain performance metrics

### 2. Cui & Lee (2020) - "CoAID: COVID-19 Healthcare Misinformation Dataset"

This paper introduced the CoAID dataset, which combines news articles, social media posts, and user engagement metrics. A unique aspect of their work is the inclusion of social interaction data (likes, shares) and how these correlate with the spread of misinformation.

**Comparison to our work**: 
- **Similarities**: We use the CoAID dataset as one of our data sources; we also consider multiple content types
- **Differences**: We're focusing primarily on the textual content rather than social engagement metrics; we combine multiple datasets for more robust training
- **Our enhancements**: Our hybrid feature approach might later incorporate social engagement signals; we're developing specialized preprocessing for social media content

### 3. Shahi & Nandini (2020) - "FakeCovid: A Multilingual Cross-Domain Fact Check News Dataset for COVID-19"

This paper compiled a multilingual dataset of fact-checked COVID-19 articles from numerous countries. Their focus was on cross-lingual and cross-domain analysis, examining how fake news varies across different cultural contexts.

**Comparison to our work**: 
- **Similarities**: Both studies are concerned with the generalizability of fake news detection; both use fact-checked data
- **Differences**: While we're currently focusing on English language content, their work studied multilingual aspects; our domain transfer is topic-based rather than language-based
- **Our enhancements**: Our domain adaptation techniques specifically target health misinformation beyond COVID-19; we're experimenting with transfer learning approaches not covered in their work

### 4. Kar et al. (2020) - "No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection"

This study tackled multilingual fake tweet detection using a BERT-based framework. Their work demonstrated that even with limited labeled data, pre-trained models can achieve good results when fine-tuned appropriately.

**Comparison to our work**: 
- **Similarities**: Like Kar et al., we're using a BERT-based approach for our advanced model; both studies work with relatively limited labeled data
- **Differences**: Our focus is on domain transfer rather than language transfer; we're exploring both news articles and social media content
- **Our enhancements**: We're implementing custom domain adaptation techniques not present in their work; our feature engineering includes linguistic markers specific to health misinformation

### 5. Vijjali et al. (2020) - "Two Stage Transformer Model for COVID-19 Fake News Detection and Fact Checking"

This paper proposed an innovative two-stage approach: first retrieving relevant facts from a knowledge base, then using textual entailment to verify claims. This combines detection with automated fact-checking.

**Comparison to our work**: 
- **Similarities**: Both approaches use transformer architectures for fake news detection; both aim to go beyond surface-level text classification
- **Differences**: Our current approach is more focused on the detection stage, without their fact-checking component; we're more concerned with generalization across health topics
- **Our enhancements**: Our preprocessing pipeline includes specialized handling of health terminology; our evaluation explicitly tests cross-domain performance; we're developing hybrid feature approaches combining statistical and semantic signals

## Division of Labor

The project responsibilities are divided between team members as follows:

**Santiago von Straussburg**:
- Data collection and preprocessing
- Implementation of the baseline models (TF-IDF with Logistic Regression/SVM)
- Evaluation metrics development and analysis
- Documentation and report writing

**Kyle Parfait**:
- Advanced model implementation (BERT-based approach)
- Cross-domain transfer testing and analysis
- Visualization of results
- Code review and optimization

Both team members collaborate on experimental design, interpretation of results, and the final project presentation.

## Timeline

The following outlines our planned steps and projected completion dates:

1. **Complete Data Preprocessing** (April 20, 2025)
   - Finalize text cleaning pipeline
   - Merge datasets and create train/test splits
   - Prepare non-COVID health misinformation test set

2. **Finalize Baseline Models** (April 27, 2025)
   - Implement and optimize TF-IDF with Logistic Regression
   - Implement and optimize TF-IDF with SVM
   - Compare performance and select best baseline

3. **Implement BERT-based Model** (May 4, 2025)
   - Fine-tune pre-trained BERT on COVID-19 dataset
   - Optimize hyperparameters
   - Implement memory-efficient training strategies

4. **Conduct Cross-Domain Testing** (May 11, 2025)
   - Evaluate models on non-COVID health misinformation
   - Analyze error patterns and potential improvements
   - Implement domain adaptation techniques if needed

5. **Complete Final Analysis and Report** (May 18, 2025)
   - Compile comprehensive evaluation results
   - Create visualizations for key findings
   - Write final report and prepare presentation

6. **Project Presentation and Submission** (May 25, 2025)
   - Finalize project presentation
   - Complete and submit all deliverables
   - Document code and ensure reproducibility

## References

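1. Patwa, P., Sharma, S., Pykl, S., Guptha, V., Kumari, G., Akhtar, M. S., Ekbal, A., Das, A., & Chakraborty, T. (2021). Fighting an infodemic: COVID-19 fake news dataset. In Combating Online Hostile Posts in Regional Languages during Emergency Situation (CONSTRAINT 2021), Communications in Computer and Information Science, vol. 1402. Springer. https://arxiv.org/abs/2011.03327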

2. Cui, L., & Lee, D. (2020). CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint. https://arxiv.org/abs/2006.00885

3. Shahi, G. K., & Nandini, D. (2020). FakeCovid: A multilingual cross-domain fact check news dataset for COVID-19. arXiv preprint. https://arxiv.org/abs/2006.11343

4. Kar, D., Bhardwaj, M., Samanta, S., & Azad, A. P. (2020). No rumours please! A multi-indic-lingual approach for COVID fake-tweet detection. arXiv preprint. https://arxiv.org/abs/2010.06906

5. Vijjali, R., Potluri, P., Kumar, S., & Teki, S. (2020). Two stage transformer model for COVID-19 fake news detection and fact checking. arXiv preprint. https://arxiv.org/abs/2011.13253