import numpy as np


---

## 🧭 1.1 Formulating the Research Question

> **How can we use mathematical models to measure semantic similarity between texts for the purpose of preliminary fact-checking?**

The aim of this project is to apply classical mathematical tools (TF-IDF, vector representation, cosine similarity) to compare texts and assess how similar they are. The project focuses on the mathematics rather than on machine learning.

---

## 🧠 1.2 Methods for Semantic Text Comparison

### 🔹 Term Frequency (TF)

\[
TF(t, d) = \frac{f_{t,d}}{\sum_k f_{k,d}}
\]

- \( f_{t,d} \): number of occurrences of word _t_ in document _d_  
- \( \sum_k f_{k,d} \): total number of words in the document
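
The TF definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `term_frequency` and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def term_frequency(tokens):
    """Return {term: count / total number of tokens} for one tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}  # an empty document has no term frequencies
    counts = Counter(tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical example document, split on whitespace (an assumed tokenizer).
doc = "the vaccine is safe and the vaccine is tested".split()
tf = term_frequency(doc)
print(tf["vaccine"])  # 2 occurrences out of 9 tokens
```

Because every count is divided by the same document length, the TF values of a document sum to 1, which makes documents of different lengths comparable.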

---

### 🔹 Inverse Document Frequency (IDF)

\[
IDF(t, D) = \log\left( \frac{N}{1 + |\{d \in D : t \in d\}|} \right)
\]

- \( N \): total number of documents  
- \( |\{d \in D : t \in d\}| \): number of documents that contain the word _t_
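
A small sketch of the IDF formula as written above, assuming the natural logarithm (the formula does not fix the base) and a non-empty corpus of tokenized documents. The function name and the three-sentence corpus are hypothetical.

```python
import math

def inverse_document_frequency(term, documents):
    """IDF(t, D) = log(N / (1 + df)), with documents given as lists of tokens."""
    n_docs = len(documents)  # assumed to be at least 1
    df = sum(1 for doc in documents if term in doc)  # documents containing the term
    return math.log(n_docs / (1 + df))

# Hypothetical mini-corpus, tokenized on whitespace.
corpus = [
    "the earth orbits the sun".split(),
    "the moon orbits the earth".split(),
    "vaccines are tested before approval".split(),
]
idf_rare = inverse_document_frequency("vaccines", corpus)  # df = 1, log(3 / 2)
idf_common = inverse_document_frequency("earth", corpus)   # df = 2, log(3 / 3)
idf_unseen = inverse_document_frequency("mars", corpus)    # df = 0, log(3 / 1)
```

One consequence of the `1 +` in the denominator is worth noting: a term that occurs in every document gets \( \log(N / (N + 1)) < 0 \), so very common words receive a slightly negative weight rather than zero.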

---

### 🔹 TF–IDF

\[
TF\text{-}IDF(t, d, D) = TF(t,d) \times IDF(t,D)
\]

This value measures how important a given word is to a particular document in the context of the whole collection.
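
As a worked illustration with assumed numbers (none of them come from the project's corpus): suppose document \( d \) has 5 words, the term \( t \) occurs twice in it, the collection has \( N = 3 \) documents, and \( t \) appears in exactly one of them. Taking \( \log \) to be the natural logarithm, which is an assumption since the formulas above do not fix the base:

\[
TF(t, d) = \frac{2}{5} = 0.4, \qquad
IDF(t, D) = \log\left( \frac{3}{1 + 1} \right) \approx 0.405, \qquad
TF\text{-}IDF(t, d, D) \approx 0.4 \times 0.405 \approx 0.162
\]

A different logarithm base would rescale every IDF value by the same constant factor, so the relative ordering of terms would not change.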

---

## 📐 1.3 Vector Representation and Cosine Similarity

Once the TF-IDF values are computed, each document can be represented as a vector. The similarity between two vectors \( \vec{A} \) and \( \vec{B} \) is measured by:

\[
\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \cdot ||\vec{B}||}
\]

- \( \vec{A} \cdot \vec{B} \): dot product  
- \( ||\vec{A}|| \): Euclidean norm of vector _A_
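
The cosine formula translates almost directly into NumPy. The sketch below is illustrative only. Returning 0.0 when either vector is all zeros is a convention chosen here, not something the formula defines, since the ratio is undefined in that case.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        return 0.0  # assumed convention for zero vectors (formula is undefined)
    return float(np.dot(a, b) / norm_product)

same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])  # parallel vectors
orthogonal = cosine_similarity([1, 0, 0], [0, 3, 0])      # no shared components
```

Because the result depends only on the angle between the vectors, a document and a copy of it repeated twice would score as identical. For comparing texts of different lengths, that is usually the desired behavior.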

---

## 🧮 1.4 Mathematical Foundations

| Concept                  | Application                                                |
|--------------------------|------------------------------------------------------------|
| Vectors                  | Representing documents as rows of numbers                  |
| Dot product              | Measuring closeness between texts                          |
| Euclidean norm           | Normalization and computing cosine similarity              |
| Logarithms               | Component of the IDF formula                               |
| Matrices                 | Representing a set of documents (TF-IDF matrix)            |
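
To make the last table row concrete, here is one possible way to assemble a TF-IDF matrix with NumPy, with one row per document and one column per vocabulary term. The function name `tfidf_matrix`, the alphabetical column order, and the toy corpus are assumptions made for illustration. The weighting follows the TF and IDF formulas given earlier, with the natural logarithm assumed.

```python
import numpy as np

def tfidf_matrix(documents):
    """Return (matrix, vocabulary); rows are documents, columns are terms."""
    vocabulary = sorted({term for doc in documents for term in doc})
    index = {term: j for j, term in enumerate(vocabulary)}

    # Raw counts f_{t,d}.
    counts = np.zeros((len(documents), len(vocabulary)))
    for i, doc in enumerate(documents):
        for term in doc:
            counts[i, index[term]] += 1

    # TF: each row divided by its document length (empty rows stay zero).
    lengths = counts.sum(axis=1, keepdims=True)
    tf = np.divide(counts, lengths, out=np.zeros_like(counts), where=lengths > 0)

    # IDF: log(N / (1 + df)), where df counts the documents containing each term.
    df = (counts > 0).sum(axis=0)
    idf = np.log(len(documents) / (1 + df))

    return tf * idf, vocabulary

# Hypothetical mini-corpus, tokenized on whitespace.
corpus = [
    "the earth orbits the sun".split(),
    "the moon orbits the earth".split(),
    "vaccines are tested before approval".split(),
]
matrix, vocabulary = tfidf_matrix(corpus)
print(matrix.shape)  # (number of documents, vocabulary size)
```

In this toy corpus the word "earth" appears in two of three documents, so its IDF is \( \log(3/3) = 0 \) and its whole column is zero. With so few documents, the weighting is quite coarse.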

---

## 🧱 1.5 Project Structure

1. **Introduction**  
   - Motivation  
   - Problem statement  
   - Aim and questions

2. **Mathematical theory**  
   - TF, IDF, TF-IDF  
   - Cosine similarity  
   - Vector space

3. **Sample corpus of texts**

4. **Computations and visualizations**

5. **Example of checking a new text**

6. **Conclusion and possible extensions**

---

🧭 _Phase 1 ends with a clearly formulated theoretical framework. Phase 2 will proceed to building and processing a small corpus of texts._


# 1. Introduction

## 1.1 The Growing Need for Fact-Checking in Social Media

In the digital era, social media platforms have become dominant sources of information for millions of users. However, the open nature and virality of these platforms have also made them fertile ground for the spread of misinformation and false claims. 

As noted by Zeng et al. <cite id="h0n81"><a href="#zotero%7C79625%2FZ62QSGRH">[1]</a></cite>, "the rise of social media has drastically increased the speed and reach of information dissemination, often bypassing traditional gatekeepers of truth such as journalists and editors." This shift has created a pressing demand for scalable, automated systems capable of evaluating the factual accuracy of content in real time. The objective is not merely to flag content post hoc but to build proactive mechanisms that can assist users, journalists, and platforms in identifying misleading information before it goes viral.

Automated fact-checking has emerged as a promising solution to this challenge. 

According to *Thorne & Vlachos (2018)*, automated fact-checking systems typically consist of three core stages: claim detection, evidence retrieval, and claim verification. While many recent advances in this area rely on deep learning techniques, classical mathematical methods such as **TF-IDF** and **cosine similarity** still play a crucial role in the early stages—particularly in retrieving semantically relevant candidate texts efficiently. As *Chen et al. (2017)* demonstrated in their open-domain QA system, TF-IDF-based retrieval remains a reliable and computationally lightweight approach to identifying supporting information from large knowledge sources like Wikipedia. These findings support the use of TF-IDF in educational and prototype systems where interpretability, speed, and mathematical transparency are critical.






## 1.1 Formulating the Research Question

In the context of growing misinformation on social media, the need for tools that allow rapid verification of factual claims is increasingly pressing. This project is driven by the following research question:

> **How can classical mathematical methods be used to assess semantic similarity between texts for the purpose of preliminary fact-checking?**

This project explores the application of **TF-IDF** vectorization and **cosine similarity** as methods for evaluating semantic closeness between social media posts and a database of verified claims.

---

## 1.2 Foundations of Fact-Checking and Semantic Comparison

Fact-checking typically involves three stages:  
- identifying a claim,  
- retrieving relevant evidence, and  
- logically comparing the claim and evidence (Thorne & Vlachos, 2018).

In automated systems, **semantic text similarity** methods play a key role in retrieving relevant information. Even in modern systems, **TF-IDF and cosine similarity** are frequently used in the retrieval phase for their speed, simplicity, and interpretability (Chen et al., 2017; Zeng et al., 2021).

---

## 1.3 Methods for Semantic Text Comparison

### 🔹 Classical Approach: TF-IDF and Cosine Similarity

TF-IDF (Term Frequency–Inverse Document Frequency) is a widely used technique for representing text as vectors. It combines the local importance of a word in a document (TF) with how rare that word is across the corpus (IDF) (Spärck Jones, 1972; Robertson, 2004):

\[
TF\text{-}IDF(t, d, D) = TF(t,d) \times \log\left( \frac{N}{1 + df(t)} \right)
\]

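Here \( N \) is the number of documents in the corpus \( D \), and \( df(t) \) is the number of documents that contain the term \( t \).

Once text is vectorized, semantic similarity can be measured using **cosine similarity**: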

\[
\cos(\theta) = \frac{\vec{A} \cdot \vec{B}}{||\vec{A}|| \cdot ||\vec{B}||}
\]
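
The two formulas combine into a simple retrieval step: vectorize a set of verified claims, vectorize an incoming post with the same vocabulary, and rank the claims by cosine similarity. The sketch below is one possible arrangement under stated assumptions. The helper names (`tokenize`, `fit_idf`, `vectorize`, `rank_claims`), the lowercase whitespace tokenizer, and the example claims are all hypothetical. Words in the post that never occur in the claims are simply ignored.

```python
import numpy as np

def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def fit_idf(documents):
    """Build the vocabulary and IDF weights log(N / (1 + df)) from token lists."""
    vocabulary = sorted({t for doc in documents for t in doc})
    df = np.array([sum(t in doc for doc in documents) for t in vocabulary])
    idf = np.log(len(documents) / (1 + df))
    return vocabulary, idf

def vectorize(tokens, vocabulary, idf):
    """TF-IDF vector of one document over a fixed vocabulary."""
    index = {t: j for j, t in enumerate(vocabulary)}
    counts = np.zeros(len(vocabulary))
    for t in tokens:
        if t in index:  # out-of-vocabulary terms are ignored
            counts[index[t]] += 1
    tf = counts / max(len(tokens), 1)
    return tf * idf

def rank_claims(post, claims):
    """Return (claim, cosine score) pairs, most similar first."""
    docs = [tokenize(c) for c in claims]
    vocabulary, idf = fit_idf(docs)
    matrix = np.array([vectorize(d, vocabulary, idf) for d in docs])
    query = vectorize(tokenize(post), vocabulary, idf)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Zero-norm vectors get a score of 0 (a convention chosen for this sketch).
    scores = np.divide(matrix @ query, norms, out=np.zeros(len(docs)), where=norms > 0)
    order = np.argsort(-scores)
    return [(claims[i], float(scores[i])) for i in order]

# Hypothetical verified claims and an incoming post.
claims = [
    "the earth orbits the sun",
    "the moon orbits the earth",
    "vaccines are tested before approval",
    "water boils at one hundred degrees celsius",
]
ranked = rank_claims("the sun is orbited by the earth", claims)
for claim, score in ranked:
    print(f"{score:.3f}  {claim}")
```

The example also shows a limitation of this approach. "orbited" in the post does not match "orbits" in the claims, because the comparison is purely lexical, so the score rests only on the exactly shared words "the", "sun", and "earth".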

Mihalcea et al. (2006) showed that such classical similarity methods are highly effective across various NLP tasks. Marcińczuk et al. (2021) further demonstrated that **TF-IDF can sometimes perform as well as or better than embedding-based methods** like Word2Vec and BERT, especially in constrained or low-resource settings.

### 🔹 Comparison to Embedding Methods

Embedding methods (e.g., BERT, Word2Vec) offer deeper contextual understanding but require more resources and data. According to Chandrasekaran & Mago (2021), **TF-IDF remains a well-justified choice** in early-stage engineering projects and educational settings due to its transparency and effectiveness.

---

## 📚 References (APA Style)

- Chandrasekaran, D., & Mago, V. (2021). *Evolution of semantic similarity – A survey*. ACM Computing Surveys, 54(2). [Link](https://arxiv.org/abs/2004.13820)
- Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). *Reading Wikipedia to answer open-domain questions*. Proceedings of ACL 2017. [Link](https://aclanthology.org/P17-1171/)
- Marcińczuk, M., et al. (2021). *Text document clustering: WordNet vs. TF-IDF vs. Word embeddings*. GWC 2021. [Link](https://aclanthology.org/2021.gwc-1.24.pdf)
- Mihalcea, R., Corley, C., & Strapparava, C. (2006). *Corpus-based and knowledge-based measures of text semantic similarity*. AAAI 2006. [Link](https://aaai.org/Papers/AAAI/2006/AAAI06-123.pdf)
- Robertson, S. (2004). *Understanding inverse document frequency: On theoretical arguments for IDF*. Journal of Documentation, 60(5). [Link](https://www.staff.city.ac.uk/~sbrp622/idfpapers/Robertson_idf_JDoc.pdf)
- Spärck Jones, K. (1972). *A statistical interpretation of term specificity and its application in retrieval*. Journal of Documentation, 28(1). [Link](https://www.staff.city.ac.uk/~sbrp622/idfpapers/ksj_orig.pdf)
- Thorne, J., & Vlachos, A. (2018). *Automated fact checking: Task formulations, methods and future directions*. COLING 2018. [Link](https://aclanthology.org/C18-1283/)
- Zeng, X., Abumansour, A. S., & Zubiaga, A. (2021). *Automated fact-checking: A survey*. Language and Linguistics Compass, 15(10). [Link](https://arxiv.org/abs/2109.11427)

---

_Next step: creating and preprocessing a small corpus of sample texts (Phase 2)._


# References

<!-- BIBLIOGRAPHY START -->
<div class="csl-bib-body">
  <div class="csl-entry"><i id="zotero|79625/Z62QSGRH"></i>
    <div class="csl-left-margin">[1]</div><div class="csl-right-inline">X. Zeng, A. S. Abumansour, and A. Zubiaga, “Automated fact-checking: A survey,” <i>Language and Linguistics Compass</i>, vol. 15, no. 10, p. e12438, 2021, doi: <a href="https://doi.org/10.1111/lnc3.12438">10.1111/lnc3.12438</a>.</div>
  </div>
</div>
<!-- BIBLIOGRAPHY END -->