In [9]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Abstract

# 1. Introduction

## 1.1 The Growing Need for Fact-Checking in Social Media

In the digital era, social media platforms have become dominant sources of information for millions of users. However, the open nature and virality of these platforms have also made them fertile ground for the spread of misinformation and false claims. 

As noted by Zeng et al. <cite id="h0n81"><a href="#zotero%7C79625%2FZ62QSGRH">[1]</a></cite>, "the rise of social media has drastically increased the speed and reach of information dissemination, often bypassing traditional gatekeepers of truth such as journalists and editors." This shift has created a pressing demand for scalable, automated systems capable of evaluating the factual accuracy of content in real time. The objective is not merely to flag content post hoc but to build proactive mechanisms that can assist users, journalists, and platforms in identifying misleading information before it goes viral.

Automated fact-checking has emerged as a promising solution to this challenge. 

According to Thorne and Vlachos <cite id="6hhjr"><a href="#zotero%7C79625%2FYFIM6W6S">[2]</a></cite>, automated fact-checking systems typically consist of three core stages: claim detection, evidence retrieval, and claim verification. While many recent advances in this area rely on deep learning techniques, classical mathematical methods such as **TF-IDF** and **cosine similarity** still play a crucial role in the early stages, particularly in efficiently retrieving candidate texts that are relevant to a claim.

As Chen et al. <cite id="8kosb"><a href="#zotero%7C79625%2FZ792DB78">[3]</a></cite> demonstrated in their open-domain QA system, TF-IDF-based retrieval remains a reliable and computationally lightweight approach to identifying supporting information from large knowledge sources like Wikipedia. These findings support the use of TF-IDF in educational and prototype systems where interpretability, speed, and mathematical transparency are critical.




## 1.2 The Research Question

In the context of growing misinformation on social media, the need for tools that allow rapid verification of factual claims is increasingly pressing. This project is driven by the following research question:

> **How can classical mathematical methods be used to assess semantic similarity between texts for the purpose of preliminary fact-checking?**

This project explores the application of **TF-IDF** vectorization and **cosine similarity** as methods for evaluating semantic closeness between social media posts and a database of verified claims. The rationale for choosing these two methods is laid out in Sections 1.3 and 1.4.

## 1.3 Foundations of Fact-Checking and Semantic Comparison

Fact-checking is a structured process aimed at evaluating the truthfulness of a given claim based on available evidence. As outlined by *Thorne and Vlachos (2018)*, most computational fact-checking pipelines can be broken down into three fundamental stages:  
1. **Claim detection**, where a potentially check-worthy statement is identified  
2. **Evidence retrieval**, where supporting or contradicting information is located from trusted sources  
3. **Claim verification**, where the semantic relationship between the claim and the evidence is assessed to determine the truth value <cite id="rl1xa"><a href="#zotero%7C79625%2FZ62QSGRH">[1]</a></cite> <cite id="sw4j6"><a href="#zotero%7C79625%2FYFIM6W6S">[2]</a></cite>.

This pipeline reflects not only the logical reasoning humans engage in when evaluating information but also the operational structure that automated systems must emulate to perform reliable verification.

A critical step in this pipeline is evidence retrieval, which depends heavily on the ability to identify **semantically similar** texts. The task is nontrivial: unlike keyword search, semantic similarity measures aim to capture conceptual overlap even when different surface forms are used. For example, the claim *“The president visited Berlin in 2023”* should be recognized as semantically related to a sentence like *“In 2023, the German capital welcomed the head of state.”* Such tasks demand methods that go beyond syntactic comparison and measure meaning, even in the absence of shared vocabulary.

While modern deep learning models such as BERT have advanced the state of the art in semantic understanding, **classical mathematical approaches remain essential tools**, especially in the evidence retrieval phase. **TF-IDF** (Term Frequency–Inverse Document Frequency) is one of the most commonly used techniques for text vectorization, offering a computationally efficient way to represent documents numerically. When combined with **cosine similarity**, TF-IDF enables systems to retrieve texts that are most relevant to a given claim, by quantifying how closely related their term distributions are. Because the comparison rests on shared terms, it approximates semantic similarity through lexical overlap: a paraphrase with little common vocabulary, such as the Berlin example above, will score low, which is why TF-IDF is typically used as a fast first-pass filter rather than as a final judge of meaning. This approach was notably used by Chen et al. <cite id="zpxoj"><a href="#zotero%7C79625%2FZ792DB78">[3]</a></cite> in their open-domain QA system, where TF-IDF was applied to quickly filter a large corpus of Wikipedia articles before deeper analysis.

Zeng et al. <cite id="5d20n"><a href="#zotero%7C79625%2FZ62QSGRH">[1]</a></cite> highlight that even with the rise of neural models, TF-IDF and cosine similarity continue to serve as effective first-pass methods in hybrid systems. They offer a transparent and resource-efficient alternative, especially valuable in real-time environments or early-stage prototypes where training complex models is not feasible. These methods also align well with educational contexts, where the focus is on understanding underlying mathematical principles rather than relying entirely on pretrained models. As such, semantic comparison via TF-IDF is not only relevant but often preferable in settings where speed, interpretability, and low computational cost are priorities.


## 1.4 Methods for Semantic Text Comparison

### 1.4.1 Classical approach: TF-IDF and cosine similarity

One of the most widely used classical techniques for representing textual data numerically is **TF-IDF** (Term Frequency–Inverse Document Frequency). This method assigns weight to each term in a document based on its frequency within the document (TF) and the inverse of its frequency across all documents in a corpus (IDF). The IDF component penalizes common words (e.g., “the”, “is”) and amplifies more informative, discriminative terms. This formulation, introduced by Spärck Jones <cite id="p0gle"><a href="#zotero%7C79625%2FYS2A7YKH">[4]</a></cite> and mathematically refined by Robertson <cite id="ee99k"><a href="#zotero%7C79625%2FQPZSI7AS">[5]</a></cite>, enables a simple yet powerful transformation of unstructured text into vectors suitable for quantitative analysis.

#### 1.4.1.1 Term Frequency (TF)

Measures how frequently a term appears in a document relative to the total number of terms in that document:

$$
TF(t, d) = \frac{f_{t,d}}{\sum_k f_{k,d}}
$$

Where:  
- $f_{t,d}$ = frequency of term $t$ in document $d$  
- $\sum_k f_{k,d}$ = total number of terms in document $d$
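
The TF definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's final implementation: the function name `term_frequency` and the sample document are assumptions made for the example, and the input is assumed to be already tokenized.

```python
from collections import Counter

def term_frequency(tokens):
    """Relative frequency of each term: its count divided by the document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical pre-tokenized document (six tokens, "the" appears twice)
doc = ["the", "earth", "is", "flat", "the", "end"]
tf = term_frequency(doc)

print(tf["the"])    # 2 occurrences out of 6 tokens
print(tf["earth"])  # 1 occurrence out of 6 tokens
```

Because every count is divided by the same total, the TF values of a document always sum to 1.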

#### 1.4.1.2 Inverse Document Frequency (IDF)

Evaluates how "unique" or "rare" a term is across the entire document collection:

$$
IDF(t, D) = \log\left( \frac{N}{1 + |\{d \in D : t \in d\}|} \right)
$$

Where:  
- $N$ = total number of documents in the collection  
- $|\{d \in D : t \in d\}|$ = number of documents in which term $t$ appears  

> **Note:** The "+1" in the denominator is used to avoid division by zero. As a side effect, a term that appears in every document receives a slightly negative IDF, because $N / (1 + N) < 1$.
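
A short sketch may help to see which values this IDF formula produces for common and rare terms. The function name `inverse_document_frequency`, the whitespace tokenization, and the use of the natural logarithm are assumptions made for this example; the formula itself does not fix the base of the logarithm.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF as defined above: log(N / (1 + number of documents containing the term))."""
    n_docs = len(documents)
    doc_freq = sum(1 for doc in documents if term in doc)
    return math.log(n_docs / (1 + doc_freq))

# Four short example claims, lowercased and split on whitespace
docs = [
    "climate change is real".split(),
    "the earth is flat".split(),
    "global warming is a serious issue".split(),
    "the moon landing was faked".split(),
]

idf_is = inverse_document_frequency("is", docs)      # in 3 of 4 documents: log(4/4)
idf_the = inverse_document_frequency("the", docs)    # in 2 of 4 documents: log(4/3)
idf_moon = inverse_document_frequency("moon", docs)  # in 1 of 4 documents: log(4/2)

print(idf_is, idf_the, idf_moon)
```

The rarer the term, the larger its IDF, which is the behavior the weighting scheme relies on.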

#### 1.4.1.3 TF-IDF

Combines the two metrics to evaluate how important a term is to a specific document, while accounting for how common or rare it is across the full corpus:

$$
TF\text{-}IDF(t, d, D) = TF(t,d) \times IDF(t,D)
$$

Or, writing $df(t) = |\{d \in D : t \in d\}|$ for the number of documents that contain term $t$:

$$
TF\text{-}IDF(t, d, D) = TF(t,d) \times \log\left( \frac{N}{1 + df(t)} \right)
$$
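
The combined weight can be sketched as a single function that follows the formula above term by term. The name `tf_idf`, the example corpus, and the natural logarithm are assumptions for illustration; a library implementation may use a different IDF variant.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, using TF * log(N / (1 + df))."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights

docs = [
    "climate change is real".split(),
    "the earth is flat".split(),
    "global warming is a serious issue".split(),
    "the moon landing was faked".split(),
]
weights = tf_idf(docs)

# Weights for "the earth is flat": rare terms ("earth", "flat") outweigh "the" and "is"
for term, weight in weights[1].items():
    print(f"{term:>6}: {weight:.4f}")
```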

After text is converted into TF-IDF vectors, **cosine similarity** is commonly used to measure the semantic closeness between two documents. It calculates the cosine of the angle between two vectors in an n-dimensional space:

$$
\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \cdot ||\vec{B}||}
$$

A cosine value close to 1 indicates high similarity, while a value close to 0 indicates that the two documents share almost no weighted terms. This approach is computationally efficient, easy to implement, and interpretable, which makes it a popular baseline in many information retrieval and natural language processing (NLP) applications. Mihalcea et al. <cite id="pdr5j"><a href="#zotero%7C79625%2FMDDH62P9">[6]</a></cite> demonstrated that TF-IDF and cosine similarity are effective in a wide range of tasks, including summarization, question answering, and semantic similarity estimation. More recently, Marcińczuk et al. <cite id="v2rio"><a href="#zotero%7C79625%2F2P77V7I4">[7]</a></cite> showed that, in some constrained environments, TF-IDF methods can perform comparably to, or even outperform, modern embedding-based techniques such as Word2Vec and BERT.
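
The cosine formula can be sketched with NumPy as follows. The vectors are hypothetical TF-IDF vectors over a shared four-term vocabulary, invented for this example, and returning 0.0 for an all-zero vector is a convention chosen here, since the formula itself is undefined in that case.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        return 0.0
    return float(np.dot(a, b) / norm_product)

# Hypothetical TF-IDF vectors over the same four-term vocabulary
claim = [0.5, 0.0, 0.5, 0.0]
evidence_a = [1.0, 0.0, 1.0, 0.0]  # same direction as the claim, larger magnitude
evidence_b = [0.0, 0.7, 0.0, 0.3]  # no terms in common with the claim

sim_a = cosine_similarity(claim, evidence_a)
sim_b = cosine_similarity(claim, evidence_b)
print(sim_a, sim_b)
```

The first pair illustrates that cosine similarity depends only on direction, not on vector length, so a short post and a long article with the same term proportions are treated as equally similar.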


### 1.4.2 Comparison with modern embedding methods

Modern alternatives to TF-IDF include **neural embedding methods** like Word2Vec, GloVe, Doc2Vec, and most notably BERT. These techniques rely on pretrained language models that capture context, syntax, and semantics by mapping words and sentences into dense, high-dimensional vectors based on their surrounding text. For example, BERT uses deep bidirectional transformers trained on massive corpora to understand nuanced context and polysemy, allowing for superior performance in many downstream NLP tasks.

However, this power comes at a cost. Embedding methods are **computationally intensive**, often requiring specialized hardware (e.g., GPUs), large datasets, and familiarity with machine learning frameworks. Moreover, they are less interpretable than TF-IDF. While TF-IDF clearly highlights which terms contribute most to similarity, embeddings operate in a latent space that is difficult to intuitively understand or audit. As Chandrasekaran & Mago <cite id="vs5yt"><a href="#zotero%7C79625%2FMLYXVA9L">[8]</a></cite> point out in their comprehensive survey, embeddings offer significant advantages in terms of contextualization, but the trade-offs in terms of transparency and resource demands must be considered carefully.


### 1.4.3 Why TF-IDF is used in this project

Given the mathematical focus of this course and the goal of building a demonstrable prototype within a limited time frame, **TF-IDF combined with cosine similarity is a well-justified and appropriate choice**. It enables a rigorous yet transparent application of mathematical concepts such as logarithmic scaling, vector norms, and inner products—all foundational to the field of machine learning. In contrast, embedding methods—while powerful—abstract away many of the underlying mathematical details, making them less pedagogically valuable in an early-stage AI engineering curriculum.

Additionally, the interpretability of TF-IDF provides a clear advantage in educational contexts. It is easy to trace how each word contributes to a similarity score, which not only aids in debugging but also facilitates deeper understanding of the connection between language and mathematics. The method is also **scalable, fast, and language-agnostic**, making it ideal for the limited dataset and controlled setting envisioned in this course project.

For these reasons, TF-IDF will serve as the primary method for semantic comparison in this project, laying a solid mathematical foundation while leaving the door open to future enhancements using more complex NLP models in later stages of the AI engineering program.



# 2. Building a Mini-Dataset and Preprocessing

## 2.1 Selecting the set of social media claims

For the purposes of this educational project, I will construct a mini-dataset of around 8 to 12 short text claims, simulating real-world social media posts. These claims should be labeled as either "True" or "False", with 4–6 examples per class. The goal is not quantity but clarity and control. The dataset should contain concise, varied claims with known veracity.

Criteria for selecting the media posts for the mini-dataset:
* Each post should be under 300 characters, in line with typical tweets, media headlines, and comment-section posts.
* Informal phrasing should be used where appropriate, to simulate real-world posts.
* Each "False" claim should be verifiably incorrect, based on trusted sources.

## 2.2 Text preprocessing

Once the mini-dataset is compiled, the text needs to be cleaned and tokenized to prepare it for vectorization. The goal is to reduce noise and standardize the structure while retaining meaning.

Preprocessing steps:
* Lowercasing – Convert all text to lowercase for uniformity.
* Punctuation removal – Strip out punctuation marks (periods, commas, quotes, etc.) using regular expressions.
* Tokenization – Split each text into a list of individual words or "tokens". Recommended tool: `nltk.word_tokenize()` or Python’s built-in `str.split()`.
* Stop word removal – Filter out common non-informative words (e.g., “the”, “is”, “and”). Recommended: use NLTK’s built-in English stop word list (`from nltk.corpus import stopwords`).
* (Optional) Lemmatization or stemming – Reduce words to their base or root form. Use `nltk.stem.WordNetLemmatizer()` for lemmatization if needed.
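
The required steps above can be sketched without external downloads as follows. This is a simplified stand-in, not the final pipeline: the function name `preprocess` is hypothetical, the tiny `STOP_WORDS` set is an assumption that replaces NLTK's much longer English list, tokenization uses `str.split()`, and the optional lemmatization step is left out.

```python
import re

# Small illustrative stop word set (an assumption); NLTK's English list is far longer
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "was"}

def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stop words."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)  # remove everything except word characters and whitespace
    tokens = text.split()
    return [token for token in tokens if token not in stop_words]

tokens = preprocess("The Moon landing was FAKED!!")
print(tokens)  # ['moon', 'landing', 'faked']
```

Note that this punctuation rule also removes apostrophes, so a contraction such as "don't" becomes "dont"; whether that is acceptable depends on the claims in the dataset.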


# 3. Mathematical Analysis and Visualization

## 3.1 Calculating TF, IDF, and TF-IDF

### 3.1.1 Mathematical Formulas

**Term Frequency (TF):**

$$
TF(t, d) = \frac{f_{t,d}}{\sum_k f_{k,d}}
$$

**Inverse Document Frequency (IDF):**

$$
IDF(t, D) = \log\left( \frac{N}{1 + |\{d \in D : t \in d\}|} \right)
$$

**TF-IDF:**

$$
TF\text{-}IDF(t, d, D) = TF(t,d) \times IDF(t,D)
$$

### 3.1.2 Manual Calculation (Demonstration)

- Choose 2–3 short sample documents (3–5 words each).
- Show manual calculation of:
  - Term frequency for each word
  - IDF for each term in the corpus
  - Resulting TF-IDF values for each word in each document
- Present results in a simple table for clarity.
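
As a minimal illustration of these steps, consider three hypothetical three-word documents, chosen for this sketch rather than taken from the project dataset: $d_1$ = "earth is flat", $d_2$ = "earth is round", and $d_3$ = "moon is far". Here $N = 3$, and every term occurs once in a document of three tokens, so $TF = 1/3$ wherever a term is present. Assuming the natural logarithm, the IDF values are:

$$
\begin{aligned}
IDF(\text{is}) &= \ln\left(\frac{3}{1 + 3}\right) \approx -0.288 \\
IDF(\text{earth}) &= \ln\left(\frac{3}{1 + 2}\right) = 0 \\
IDF(\text{flat}) &= \ln\left(\frac{3}{1 + 1}\right) \approx 0.405
\end{aligned}
$$

The terms "round", "moon", and "far" each appear in exactly one document and therefore share the IDF of "flat". Multiplying each IDF by $TF = 1/3$ gives the TF-IDF values below, with 0 wherever the term is absent from the document:

| Term | $d_1$ | $d_2$ | $d_3$ |
|---|---|---|---|
| earth | 0 | 0 | 0 |
| is | -0.096 | -0.096 | -0.096 |
| flat | 0.135 | 0 | 0 |
| round | 0 | 0.135 | 0 |
| moon | 0 | 0 | 0.135 |
| far | 0 | 0 | 0.135 |

In this toy corpus the only terms that distinguish the documents are the ones that occur in a single document, which is the intended effect of the IDF weighting.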


### 3.1.3 Programmatic TF-IDF using `TfidfVectorizer`

In [None]:
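# Four short example claims; each string is treated as one document
corpus = [
    "Climate change is real",
    "The earth is flat",
    "Global warming is a serious issue",
    "The moon landing was faked"
]

# Learn the vocabulary and IDF weights, then convert the corpus to a sparse TF-IDF matrix
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Dense view for inspection: one row per document, one column per vocabulary term
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df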
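
The values produced by `TfidfVectorizer` will generally not match a manual calculation based on the formulas in Section 3.1.1. As far as I understand scikit-learn's documented defaults, the vectorizer lowercases the text, keeps only tokens of two or more word characters, uses raw term counts instead of relative frequencies, applies a smoothed IDF of the form $\ln\left(\frac{1 + N}{1 + df(t)}\right) + 1$, and finally scales each row to unit Euclidean length. The sketch below re-implements that recipe in plain Python under these assumptions; the helper names `smoothed_idf` and `tfidf_row` are hypothetical, and the output should be checked against the real vectorizer before relying on it.

```python
import math
import re
from collections import Counter

corpus = [
    "Climate change is real",
    "The earth is flat",
    "Global warming is a serious issue",
    "The moon landing was faked",
]

# Assumed default tokenization: lowercase, tokens of two or more word characters
docs = [re.findall(r"\b\w\w+\b", text.lower()) for text in corpus]
n_docs = len(docs)
doc_freq = Counter(term for doc in docs for term in set(doc))

def smoothed_idf(term):
    """Smoothed IDF: ln((1 + N) / (1 + df)) + 1, which is always positive."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalized so the row has unit length."""
    counts = Counter(doc)
    raw = {term: count * smoothed_idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
print({term: round(weight, 3) for term, weight in rows[1].items()})
```

Because every row has unit length under this scheme, the cosine similarity between two documents reduces to the dot product of their rows.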

# References

<!-- BIBLIOGRAPHY START -->
<div class="csl-bib-body">
  <div class="csl-entry"><i id="zotero|79625/Z62QSGRH"></i>
    <div class="csl-left-margin">[1]</div><div class="csl-right-inline">X. Zeng, A. S. Abumansour, and A. Zubiaga, “Automated fact-checking: A survey,” <i>Language and Linguistics Compass</i>, vol. 15, no. 10, p. e12438, 2021, doi: <a href="https://doi.org/10.1111/lnc3.12438">10.1111/lnc3.12438</a>.</div>
  </div>
  <div class="csl-entry"><i id="zotero|79625/YFIM6W6S"></i>
    <div class="csl-left-margin">[2]</div><div class="csl-right-inline">J. Thorne and A. Vlachos, “Automated Fact Checking: Task Formulations, Methods and Future Directions,” in <i>Proceedings of the 27th International Conference on Computational Linguistics</i>, Santa Fe, New Mexico, USA, Aug. 2018, pp. 3346–3359. Accessed: May 24, 2025. [Online]. Available: <a href="https://aclanthology.org/C18-1283/">https://aclanthology.org/C18-1283/</a></div>
  </div>
  <div class="csl-entry"><i id="zotero|79625/Z792DB78"></i>
    <div class="csl-left-margin">[3]</div><div class="csl-right-inline">D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading Wikipedia to Answer Open-Domain Questions,” in <i>Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</i>, Vancouver, Canada, Jul. 2017, pp. 1870–1879. doi: <a href="https://doi.org/10.18653/v1/P17-1171">10.18653/v1/P17-1171</a>.</div>
  </div>
  <div class="csl-entry"><i id="zotero|79625/YS2A7YKH"></i>
    <div class="csl-left-margin">[4]</div><div class="csl-right-inline">K. Spärck Jones, “A Statistical Interpretation of Term Specificity and Its Application in Retrieval,” <i>Journal of Documentation</i>, vol. 28, no. 1, pp. 11–21, Jan. 1972, doi: <a href="https://doi.org/10.1108/eb026526">10.1108/eb026526</a>.</div>
  </div>
  <div class="csl-entry"><i id="zotero|79625/QPZSI7AS"></i>
    <div class="csl-left-margin">[5]</div><div class="csl-right-inline">S. Robertson, “Understanding inverse document frequency: on theoretical arguments for IDF,” <i>Journal of Documentation</i>, vol. 60, no. 5, pp. 503–520, Jan. 2004, doi: <a href="https://doi.org/10.1108/00220410410560582">10.1108/00220410410560582</a>.</div>
  </div>
  <div class="csl-entry"><i id="zotero|79625/MDDH62P9"></i>
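    <div class="csl-left-margin">[6]</div><div class="csl-right-inline">R. Mihalcea, C. Corley, and C. Strapparava, “Corpus-based and Knowledge-based Measures of Text Semantic Similarity,” in <i>Proceedings of the 21st National Conference on Artificial Intelligence (AAAI-06)</i>, 2006.</div>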
  </div>
  <div class="csl-entry"><i id="zotero|79625/2P77V7I4"></i>
    <div class="csl-left-margin">[7]</div><div class="csl-right-inline">M. Marcińczuk, M. Gniewkowski, T. Walkowiak, and M. Będkowski, “Text Document Clustering: Wordnet vs. TF-IDF vs. Word Embeddings,” in <i>Proceedings of the 11th Global Wordnet Conference</i>, University of South Africa (UNISA), Jan. 2021, pp. 207–214. Accessed: May 24, 2025. [Online]. Available: <a href="https://aclanthology.org/2021.gwc-1.24/">https://aclanthology.org/2021.gwc-1.24/</a></div>
  </div>
  <div class="csl-entry"><i id="zotero|79625/MLYXVA9L"></i>
    <div class="csl-left-margin">[8]</div><div class="csl-right-inline">D. Chandrasekaran and V. Mago, “Evolution of Semantic Similarity—A Survey,” <i>ACM Comput. Surv.</i>, vol. 54, no. 2, p. 41:1-41:37, Feb. 2021, doi: <a href="https://doi.org/10.1145/3440755">10.1145/3440755</a>.</div>
  </div>
</div>
<!-- BIBLIOGRAPHY END -->