# Detecting COVID-19 and General Health Misinformation
**Authors**: Santiago von Straussburg, Kyle Parfait

## Milestone Report
This notebook contains our milestone report for the project on detecting COVID-19 and general health misinformation.

## Problem Overview

Our project aims to build a robust model for detecting fake health news, with a primary focus on COVID-19 misinformation. The challenge extends beyond COVID-19 detection: we want to determine whether a model trained on pandemic-specific misinformation can generalize to other types of misleading health claims.

This problem is significant because fake health information can have serious real-world consequences. During the COVID-19 pandemic, we witnessed how misinformation about cures, treatments, and vaccines could influence public behavior and potentially harm public health. Our hypothesis is that there are underlying patterns in health misinformation that transcend specific topics; in other words, a model that successfully identifies COVID-19 falsehoods might also detect misleading claims about other health issues, such as miracle cures or unproven treatments.

If our approach successfully transfers from COVID-19 to other health misinformation, it would demonstrate that our system is not merely memorizing pandemic-specific language patterns but is actually learning meaningful characteristics of health misinformation in general. This would be a significant contribution to the ongoing battle against health-related fake news.

## Data

We are using two primary datasets focused on COVID-19 misinformation:

1. **COVID-19 Fake News Dataset** (from Kaggle): This dataset contains news articles labeled as either "fake" or "real" regarding COVID-19 information.

2. **CoAID (COVID-19 Healthcare Misinformation Dataset)**: This is a diverse collection that combines news articles, social media posts, and user engagement data related to COVID-19 information, all labeled as "fake" or "real".

### Data Collection Process

We obtained these datasets from their respective sources:

1. The COVID-19 Fake News Dataset was downloaded from Kaggle, where it was compiled by researchers collecting news articles and fact-checking their veracity during the pandemic.

2. The CoAID dataset was accessed through GitHub, where it was published by researchers at Pennsylvania State University. This dataset was compiled by collecting both news articles and social media content, which was then labeled through a combination of fact-checking website references and expert review.

### Data Challenges

The collection and preparation of these datasets presented several challenges:

1. **Noisy data from social media**: Social media content often contains slang, abbreviations, and non-standard language, which make preprocessing more complex.

2. **Class imbalance**: There are typically fewer fake news examples than legitimate ones, which can bias a model toward the majority class during training.

3. **Topic specificity**: COVID-19 data contains pandemic-specific terminology that may not generalize to other health topics.
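
One common way to counter the class imbalance described above is to weight each class inversely to its frequency during training. The sketch below is a minimal illustration under our own assumptions: the function name `balanced_class_weights` and the toy label list are hypothetical, and the formula is the widely used "balanced" heuristic, `n_samples / (n_classes * class_count)`, not necessarily the weighting we will adopt.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical toy labels: 8 real items and 2 fake items.
toy_labels = ["real"] * 8 + ["fake"] * 2
weights = balanced_class_weights(toy_labels)
print(weights)
```

With these toy labels the minority "fake" class receives a weight four times larger than the "real" class, so errors on fake items would count more heavily in a weighted loss.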

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the visualization style
plt.style.use('ggplot')
sns.set(font_scale=1.2)

# Example code for loading datasets
# covid_fake_news_df = pd.read_csv('../dataset/NewsFakeCOVID-19.csv')
# fake_real_news_df = pd.read_csv('../dataset/fake_and_real_news.csv')
```

## Method

Our approach to detecting fake health news involves a multi-step process:

### Text Preprocessing Pipeline

1. **Tokenization**: Breaking text into individual tokens (words, punctuation)
2. **Lowercasing**: Converting all text to lowercase so that case variants map to the same token, which shrinks the vocabulary
3. **Stop word removal**: Removing common words that don't contribute much meaning
4. **Lemmatization**: Reducing words to their base forms
5. **Special handling for URLs and mentions**: Replacing or removing these elements
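
The steps above can be sketched with the standard library alone. This is an illustrative sketch, not our actual pipeline: the placeholder tokens `<url>` and `<user>`, the tiny stop-word set, and the `TOY_LEMMAS` lookup table are all assumptions made for the example (a real pipeline would use a full stop-word list and a proper lemmatizer from an NLP library). The sketch also handles URLs and mentions first, so that tokenization does not split them apart; that ordering is a choice of the sketch.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Keep the placeholder tokens intact; otherwise match lowercase words and numbers.
TOKEN_RE = re.compile(r"<url>|<user>|[a-z0-9]+(?:'[a-z]+)?")

# Tiny illustrative resources (assumptions, far smaller than real ones).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "that", "this"}
TOY_LEMMAS = {"cures": "cure", "vaccines": "vaccine", "says": "say", "said": "say"}


def preprocess(text):
    """Return a list of cleaned tokens for one document."""
    text = URL_RE.sub(" <url> ", text)        # special handling: URLs
    text = MENTION_RE.sub(" <user> ", text)   # special handling: mentions
    text = text.lower()                       # lowercasing
    tokens = TOKEN_RE.findall(text)           # tokenization (drops punctuation)
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop word removal
    return [TOY_LEMMAS.get(t, t) for t in tokens]        # toy lemmatization


example = "BREAKING: @DrSmith says garlic CURES the virus! https://example.com/miracle"
print(preprocess(example))
```

Replacing URLs and mentions with placeholders, rather than deleting them, keeps the signal that a post contained a link or addressed a user without adding every distinct URL to the vocabulary.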

### Modeling Approach

We are implementing and comparing two main approaches:

1. **Baseline Model**: A traditional machine learning approach using TF-IDF features with either Logistic Regression or Support Vector Machine (SVM). This serves as a benchmark for comparison.

2. **Advanced Model**: A transformer-based approach using a fine-tuned BERT model, which has shown strong performance in various text classification tasks.
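
To make the TF-IDF features used by the baseline concrete, the following sketch computes them by hand for three toy documents. It is an illustration under stated assumptions rather than our implementation: the function name `tfidf_vectors` and the toy documents are hypothetical, and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization is one common variant. In practice a library vectorizer would replace this code, and the resulting vectors would be passed to Logistic Regression or an SVM.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return (list of sparse TF-IDF dicts, idf dict) for tokenized documents."""
    n_docs = len(tokenized_docs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so document length does not dominate the features.
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors, idf


# Hypothetical, already-preprocessed toy documents.
docs = [
    ["garlic", "cure", "virus"],
    ["vaccine", "trial", "virus"],
    ["vaccine", "approved", "virus"],
]
vectors, idf = tfidf_vectors(docs)
for term in ("virus", "vaccine", "garlic"):
    print(term, round(idf[term], 3))
```

A term that appears in every document ("virus") gets the lowest IDF, while a term unique to one document ("garlic") gets the highest, which is why TF-IDF emphasizes distinctive vocabulary.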

### Evaluation Metrics

We will evaluate our models using the following metrics:

1. **Accuracy**: The proportion of correctly classified instances
2. **Precision**: The proportion of true positive predictions among all positive predictions
3. **Recall**: The proportion of true positive predictions among all actual positives
4. **F1-Score**: The harmonic mean of precision and recall
5. **Confusion Matrix**: A table of predicted versus actual labels that shows correct classifications and each type of error
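
The scalar metrics above all derive from the four confusion-matrix counts. The sketch below computes them directly; treating "fake" as the positive class is our assumption for the example, and the function name `binary_metrics` and the toy labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Compute confusion counts and the derived metrics for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical toy labels: 3 fake and 5 real items.
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "real"]
y_pred = ["fake", "fake", "real", "fake", "real", "real", "real", "real"]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

In this toy case accuracy is 0.75 while precision, recall, and F1 are each 2/3, which illustrates why accuracy alone can look optimistic when the positive class is the minority.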

We'll also perform cross-domain evaluation by testing our COVID-19-trained model on non-COVID health misinformation to assess generalization capabilities.

## Intermediate/Preliminary Experiments & Results

At this milestone, we have conducted several preliminary experiments and analyses:

### Target Word Analysis

We analyzed the frequency of specific target words ("kills", "vaccine", "force", "death", "facebook") in news articles. This analysis helps us understand linguistic patterns that might differentiate fake from real news. The `wordCount.py` script was used to process a sample of articles and count these target words.
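
We do not reproduce `wordCount.py` here. As a rough sketch of the kind of counting described above, the following function tallies exact, case-insensitive matches of the target words in a text; the function name `count_target_words` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

TARGET_WORDS = ("kills", "vaccine", "force", "death", "facebook")


def count_target_words(text, targets=TARGET_WORDS):
    """Count case-insensitive, whole-word occurrences of each target word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w in targets)
    # Report every target, including those that never occur.
    return {t: counts[t] for t in targets}


sample = "Vaccine kills? Facebook post claims the vaccine causes death."
target_counts = count_target_words(sample)
print(target_counts)
```

Because matching is exact, inflected forms such as "vaccines" or "killed" are not counted; lemmatizing the text first would be one way to merge them.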

### Data Preparation Progress

We have completed the following steps in data preparation:

1. **Data Collection**: Acquired the COVID-19 Fake News Dataset and the CoAID collection
2. **Initial Data Exploration**: Analyzed dataset structure, class distribution, and basic statistics
3. **Text Preprocessing Pipeline**: Developed and tested preprocessing functions for cleaning text data
4. **Target Word Analysis**: Analyzed frequency of specific words that might indicate fake news

### Preliminary Model Testing

We have designed the framework for our baseline models (TF-IDF with Logistic Regression or SVM) and are in the process of implementing our advanced BERT-based approach. Initial tests on small subsets of data show promising results, though comprehensive evaluation is still pending.

### Challenges and Adjustments

During our preliminary work, we've encountered several challenges:

1. **Data Quality Issues**: Some news URLs were inaccessible or returned empty content. We've implemented robust error handling to deal with these cases.

2. **Computational Constraints**: BERT models are computationally intensive. We're exploring methods to optimize memory usage, such as gradient accumulation and mixed-precision training.

3. **Domain Transfer Challenge**: Initial tests suggest that models trained solely on COVID-19 data struggle with non-COVID health misinformation. We're investigating domain adaptation techniques to improve cross-domain performance.
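
The gradient accumulation mentioned under computational constraints can be illustrated without any deep learning framework. The sketch below uses a hypothetical one-parameter model `y_hat = w * x` with mean squared error and toy data, and shows that averaging gradients over small micro-batches reproduces the gradient of one large batch, which is what lets a memory-limited GPU emulate a larger batch size. In a framework such as PyTorch, the same effect is usually obtained by dividing each micro-batch loss by the number of accumulation steps before the backward pass and stepping the optimizer once per group.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error w.r.t. w for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch_size):
    """Accumulate scaled micro-batch gradients into one effective gradient."""
    micro_batches = [data[i:i + micro_batch_size]
                     for i in range(0, len(data), micro_batch_size)]
    accum_steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each contribution so the sum equals the full-batch mean.
        total += mse_grad(w, micro_batch) / accum_steps
    return total


# Hypothetical toy data: (x, y) pairs roughly following y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.0), (8.0, 15.9)]
w = 0.5

full_batch = mse_grad(w, data)
accumulated = accumulated_grad(w, data, micro_batch_size=2)
print(full_batch, accumulated)
```

The two gradients agree up to floating-point error as long as every micro-batch has the same size; with a ragged final micro-batch, the contributions would need to be weighted by micro-batch size instead.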

## Related Work

Several research papers have addressed fake news detection, particularly in the context of health and COVID-19 misinformation. Here, we summarize five key papers and compare them to our approach:

### 1. Patwa et al. (2021) - "Fighting an Infodemic: COVID-19 Fake News Dataset"

This paper introduced a manually annotated dataset of 10,700 COVID-19 social media posts and articles and benchmarked it with four classical machine learning baselines: Decision Tree, Logistic Regression, Gradient Boost, and SVM. The SVM baseline performed best, showing that simple models can be strong on this task.

**Comparison to our work**: While Patwa et al. focused purely on COVID-19 misinformation, our project extends beyond this to test generalization to other health topics. Their strong classical baselines also motivate our decision to compare a TF-IDF baseline against the BERT model.

### 2. Cui & Lee (2020) - "CoAID: COVID-19 Healthcare Misinformation Dataset"

This paper introduced the CoAID dataset, which combines news articles, social media posts, and user engagement metrics. A unique aspect of their work is the inclusion of social interaction data (likes, shares) and how these correlate with the spread of misinformation.

**Comparison to our work**: We're using the CoAID dataset but focusing primarily on the textual content rather than social engagement metrics. However, their insights about the viral potential of health misinformation inform our understanding of why this problem is important.

### 3. Shahi & Nandini (2020) - "FakeCovid: A Multilingual Cross-Domain Fact Check News Dataset for COVID-19"

This paper compiled a multilingual dataset of fact-checked COVID-19 articles from numerous countries. Their focus was on cross-lingual and cross-domain analysis, examining how fake news varies across different cultural contexts.

**Comparison to our work**: While we're currently focusing on English language content, Shahi & Nandini's cross-domain approach aligns with our goal of testing generalization from COVID-19 to other health topics. Their findings about challenges in cross-domain detection inform our expectations about transfer learning difficulties.

### 4. Kar et al. (2020) - "No Rumours Please! A Multi-Indic-Lingual Approach for COVID Fake-Tweet Detection"

This study tackled multilingual fake tweet detection using a BERT-based framework. Their work demonstrated that even with limited labeled data, pre-trained models can achieve good results when fine-tuned appropriately.

**Comparison to our work**: Like Kar et al., we're using a BERT-based approach, though our focus is on domain transfer rather than language transfer. Their finding that domain-specific features boost detection accuracy is relevant to our project, suggesting we might benefit from incorporating health-specific features.

### 5. Vijjali et al. (2020) - "Two Stage Transformer Model for COVID-19 Fake News Detection and Fact Checking"

This paper proposed an innovative two-stage approach: first retrieving relevant facts from a knowledge base, then using textual entailment to verify claims. This combines detection with automated fact-checking.

**Comparison to our work**: Our current approach is more focused on the detection stage, without the fact-checking component that Vijjali et al. implemented. However, their transformer-based architecture is similar to our BERT approach, and their results provide a benchmark for what's achievable with transformer models on COVID-19 misinformation.

## Division of Labor

The project responsibilities are divided between team members as follows:

**Santiago von Straussburg**:
- Data collection and preprocessing
- Implementation of the baseline models (TF-IDF with Logistic Regression/SVM)
- Evaluation metrics development and analysis
- Documentation and report writing

**Kyle Parfait**:
- Advanced model implementation (BERT-based approach)
- Cross-domain transfer testing and analysis
- Visualization of results
- Code review and optimization

Both team members collaborate on experimental design, interpretation of results, and the final project presentation.

## Timeline

The following outlines our planned steps and projected completion dates:

1. **Complete Data Preprocessing** (April 20, 2025)
   - Finalize text cleaning pipeline
   - Merge datasets and create train/test splits
   - Prepare non-COVID health misinformation test set

2. **Finalize Baseline Models** (April 27, 2025)
   - Implement and optimize TF-IDF with Logistic Regression
   - Implement and optimize TF-IDF with SVM
   - Compare performance and select best baseline

3. **Implement BERT-based Model** (May 4, 2025)
   - Fine-tune pre-trained BERT on COVID-19 dataset
   - Optimize hyperparameters
   - Implement memory-efficient training strategies

4. **Conduct Cross-Domain Testing** (May 11, 2025)
   - Evaluate models on non-COVID health misinformation
   - Analyze error patterns and potential improvements
   - Implement domain adaptation techniques if needed

5. **Complete Final Analysis and Report** (May 18, 2025)
   - Compile comprehensive evaluation results
   - Create visualizations for key findings
   - Write final report and prepare presentation

6. **Project Presentation and Submission** (May 25, 2025)
   - Finalize project presentation
   - Complete and submit all deliverables
   - Document code and ensure reproducibility

## References

1. Patwa, P., Sharma, S., Pykl, S., Guptha, V., Kumari, G., Akhtar, M. S., Ekbal, A., Das, A., & Chakraborty, T. (2021). Fighting an infodemic: COVID-19 fake news dataset. Communications in Computer and Information Science. https://arxiv.org/abs/2011.03327

2. Cui, L., & Lee, D. (2020). CoAID: COVID-19 healthcare misinformation dataset. arXiv preprint. https://arxiv.org/abs/2006.00885

3. Shahi, G. K., & Nandini, D. (2020). FakeCovid: A multilingual cross-domain fact check news dataset for COVID-19. arXiv preprint. https://arxiv.org/abs/2006.11343

4. Kar, D., Bhardwaj, M., Samanta, S., & Azad, A. P. (2020). No rumours please! A multi-indic-lingual approach for COVID fake-tweet detection. arXiv preprint. https://arxiv.org/abs/2010.06906

5. Vijjali, R., Potluri, P., Kumar, S., & Teki, S. (2020). Two stage transformer model for COVID-19 fake news detection and fact checking. arXiv preprint. https://arxiv.org/abs/2011.13253