### Project: Domain Adaptation of Portuguese SLMs via Self-Supervised Fine-Tuning with LoRA
MO436C - Introduction to Self-Supervised Learning (SSRL)

**Team Members:**
- Alejandro N√∫√±ez Arroyo. <a href="mailto:a299215@dac.unicamp.br">a299215@dac.unicamp.br</a>  
- Daniel da Costa Nunes Resende Neto. <a href="mailto:d169408@dac.unicamp.br">d169408@dac.unicamp.br</a>  
- Jos√© Augusto de Almeida Neto. <a href="mailto:j299218@dac.unicamp.br">j299218@dac.unicamp.br</a>  

*Instituto de Computa√ß√£o (IC), Universidade Estadual de Campinas (UNICAMP)*  
*Campinas, November 2025*

---

#### License

This notebook and its source code are released under the **GNU General Public License v3.0 (GPLv3)**.  
You are free to use, modify, and redistribute this work under the following terms:

> **GNU General Public License v3.0**  
> Copyright ¬© 2025 The Authors listed above  
>
> This program is free software: you can redistribute it and/or modify  
> it under the terms of the GNU General Public License as published by  
> the Free Software Foundation, either version 3 of the License, or  
> (at your option) any later version.  
>
> This program is distributed in the hope that it will be useful,  
> but **without any warranty**; without even the implied warranty of  
> merchantability or fitness for a particular purpose. See the  
> GNU General Public License for more details.  
>
> You should have received a copy of the GNU General Public License  
> along with this program. If not, see  
> [https://www.gnu.org/licenses/gpl-3.0.en.html](https://www.gnu.org/licenses/gpl-3.0.en.html).

---

# Notebook 1: Data

This notebook focuses on the data **acquisition, exploration, and preprocessing** stage of the project.  
It is responsible for **collecting, cleaning, and organizing datasets** that will later be used for modeling and experimentation.

---

**Overview**

The main objectives of this notebook are:

1. **Setup & Imports**  
   Load all core Python libraries (NumPy, Pandas, PyTorch, etc.) and configure the environment.

2. **Load and Analyze the MMLU Dataset**  
   - Import the **Portuguese (PT-BR)** version of the *Massive Multitask Language Understanding* (MMLU) dataset from Hugging Face.
   - Map each subject area to a higher-level **macrodomain** (e.g., Law, Medicine, Economics).
   - Generate statistics, verify mapping coverage, and perform **train/test splits** by macrodomain.

3. **Load the Wikipedia PT-BR Dataset**  
   - Download and concatenate multiple `.parquet` files containing Wikipedia articles in Portuguese.
   - Use **keyword-based filtering** and **semantic embedding similarity** to select articles relevant to Law, Governance, and Ethics.
   - Apply efficient filtering with multithreading for scalability.
   - Combine results from keyword and embedding approaches.
   - Remove duplicates and generate the final curated dataset `wiki_final.csv`.

**Output Artifacts**  
   - `mmlu_train.csv` / `mmlu_test.csv` ‚Äî split subsets of the MMLU dataset.
   - `wiki_keyword.csv` ‚Äî articles filtered by keywords.
   - `wiki_final.csv` ‚Äî combined final dataset.


## Summary

* [Part 1: Setup & Imports](#1-setup--imports)
* [Part 2: Massive Multitask Language Understanding (MMLU)](#2-massive-multitask-language-understanding-mmlu)
  - [2.1 Loading the Dataset](#21-loading-the-dataset)
  - [2.2 Mapping Subjects to Macrodomains](#22-mapping-subjects-to-macrodomains)
  - [2.3 Sampling Questions by Macrodomain](#23-sampling-questions-by-macrodomain)
  - [2.4 Train/Test Split](#24-traintest-split)
* [Part 3: Wikipedia PT-BR](#3-wikipedia-pt-br)
  - [3.1 Loading the Dataset](#31-loading-the-dataset)
  - [3.2 Keyword-Based Filtering](#32-keyword-based-filtering)
  - [3.3 Semantic Filtering with Embeddings](#33-semantic-filtering-with-embeddings)
  - [3.4 Final Dataset](#34-final-dataset)

<!-- ## 1. Setup & Imports -->
## 1. Setup & Imports <a id="part_01"></a>


Here we load the main libraries that will be used throughout this notebook.

In [1]:
import glob
import os
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed
from functools import partial

import numpy as np
import pandas as pd
import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
from sklearn.model_selection import train_test_split
from tqdm import tqdm

# Create output directory
output_path = "data/"
os.makedirs(output_path, exist_ok=True)

## 2. Massive Multitask Language Understanding (MMLU) <a id="part_02"></a>

This section loads and prepares the **Portuguese (PT-BR)** version of the [MMLU dataset](https://huggingface.co/datasets/openai/MMMLU), a benchmark for evaluating general knowledge and reasoning across multiple academic subjects.

**Main steps:**
1. Load the MMLU PT-BR dataset from Hugging Face.  
2. Map each subject to a broader **macrodomain** (e.g., Law, Medicine, Psychology).  
3. Generate summary statistics and verify coverage.  
4. Perform a **train/test split** stratified by subject.  
5. Save the processed files for downstream training (`mmlu_train.csv`, `mmlu_test.csv`).

> This step ensures the MMLU dataset is clean, organized, and aligned with higher-level knowledge domains.


### 2.1 Loading the Dataset

Import the **MMLU PT-BR** CSV file and inspect its structure and distribution of subjects.

In [2]:
# Load the MMLU Portuguese (Brazil) dataset
mmmlu_pt_path = "https://drive.google.com/uc?export=download&id=1WwhkiRUZaDSj-3aWm0iTujMIFn4KUGhh"
df_mmmlu_pt = pd.read_csv(mmmlu_pt_path)
df_mmmlu_pt.head()

Unnamed: 0.1,Unnamed: 0,Question,A,B,C,D,Answer,Subject
0,0,Encontre o √¢ngulo para a extens√£o de campo dad...,0,4,2,6,B,abstract_algebra
1,1,"Considere p = (1, 2, 5, 4)(2, 3) em S_5. Encon...",8,2,24,120,C,abstract_algebra
2,2,Encontre todos os zeros no campo finito indica...,0,1,01,04,D,abstract_algebra
3,3,Declara√ß√£o 1 | Um grupo quociente de um grupo ...,"Verdadeiro, Verdadeiro","Falso, Falso","Verdadeiro, Falso","Falso, Verdadeiro",B,abstract_algebra
4,4,Encontre o produto dos polin√¥mios dados no ane...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract_algebra


In [3]:
# Count rows for each category in the 'Subject' column
subject_counts = df_mmmlu_pt['Subject'].value_counts()
subject_counts

Subject
professional_law                       1534
moral_scenarios                         895
miscellaneous                           783
professional_psychology                 612
high_school_psychology                  545
high_school_macroeconomics              390
elementary_mathematics                  378
moral_disputes                          346
prehistory                              324
philosophy                              311
high_school_biology                     310
nutrition                               306
professional_accounting                 282
professional_medicine                   272
high_school_mathematics                 270
clinical_knowledge                      265
security_studies                        245
high_school_microeconomics              238
high_school_world_history               237
conceptual_physics                      235
marketing                               234
human_aging                             223
high_school_statistics  

### 2.2 Mapping Subjects to Macrodomains

In this step, we grouped the detailed **MMLU subjects** into broader thematic **macrodomains**  
(e.g., *Law, Governance, and Ethics*, *Medicine, Health, and Life Sciences*, *Psychology, Human Behavior, and Society*).

To ensure a coherent and well-balanced structure, the **subject‚Äìmacrodomain mapping** was created with the assistance of **ChatGPT 5**,  
which analyzed the subject list and organized them into meaningful high-level categories.

---

#### Prompt used with ChatGPT 5

> ‚ÄúHelp me group the following subjects into coherent *macrodomains*. 
> Each macrodomain should represent a broad conceptual or thematic area that can be mapped to Wikipedia categories or article collections, allowing me to gather related Wikipedia data for fine-tuning a specialized language model (SLM). 
> 
> The subjects I‚Äôm providing come from my QA dataset. Please group them in a way that maximizes semantic cohesion and domain relevance for model fine-tuning ‚Äî ensuring that each macrodomain corresponds to a consistent knowledge field, minimizes overlap, and can be associated with clear, well-defined Wikipedia subdomains. 
> 
> For each macrodomain you identify, provide: 
> 
> 1. A **macrodomain name** (concise but descriptive). 
> 2. A **short definition or scope note** (what it covers and excludes). 
> 3. The **list of subjects** from my dataset that belong to it.‚Äù
>
> **Subjects:**
> ```text
> professional_law      1534
> moral_scenarios        895
> miscellaneous          783
> professional_psychology 612
> high_school_psychology  545
> ...
> ```

* The evaluated mapping was reviewed by the authors and is used throughout this notebook to ensure consistent subject categorization across all analyses and datasets.


In [4]:
# Mapping dictionary for Subject ‚Üí Macrodomain
subject_to_macrodomain = {
    # --- 1. Law, Governance, and Ethics ---
    'professional_law': 'Law, Governance, and Ethics',
    'international_law': 'Law, Governance, and Ethics',
    'jurisprudence': 'Law, Governance, and Ethics',
    'business_ethics': 'Law, Governance, and Ethics',
    'moral_scenarios': 'Law, Governance, and Ethics',
    'moral_disputes': 'Law, Governance, and Ethics',
    'philosophy': 'Law, Governance, and Ethics',
    'logical_fallacies': 'Law, Governance, and Ethics',

    # --- 2. Psychology, Human Behavior, and Society ---
    'professional_psychology': 'Psychology, Human Behavior, and Society',
    'high_school_psychology': 'Psychology, Human Behavior, and Society',
    'sociology': 'Psychology, Human Behavior, and Society',
    'human_sexuality': 'Psychology, Human Behavior, and Society',
    'human_aging': 'Psychology, Human Behavior, and Society',

    # --- 3. Medicine, Health, and Life Sciences ---
    'professional_medicine': 'Medicine, Health, and Life Sciences',
    'clinical_knowledge': 'Medicine, Health, and Life Sciences',
    'college_medicine': 'Medicine, Health, and Life Sciences',
    'anatomy': 'Medicine, Health, and Life Sciences',
    'medical_genetics': 'Medicine, Health, and Life Sciences',
    'virology': 'Medicine, Health, and Life Sciences',
    'nutrition': 'Medicine, Health, and Life Sciences',
    'college_biology': 'Medicine, Health, and Life Sciences',
    'high_school_biology': 'Medicine, Health, and Life Sciences',

    # --- 4. Economics, Business, and Management ---
    'high_school_microeconomics': 'Economics, Business, and Management',
    'high_school_macroeconomics': 'Economics, Business, and Management',
    'econometrics': 'Economics, Business, and Management',
    'professional_accounting': 'Economics, Business, and Management',
    'management': 'Economics, Business, and Management',
    'marketing': 'Economics, Business, and Management',
    'public_relations': 'Economics, Business, and Management',

    # --- 5. Political Science, Security, and Global Affairs ---
    'high_school_government_and_politics': 'Political Science, Security, and Global Affairs',
    'us_foreign_policy': 'Political Science, Security, and Global Affairs',
    'security_studies': 'Political Science, Security, and Global Affairs',
    'international_law': 'Political Science, Security, and Global Affairs',  # cross-listed

    # --- 6. Natural Sciences and Engineering ---
    'conceptual_physics': 'Natural Sciences and Engineering',
    'high_school_physics': 'Natural Sciences and Engineering',
    'college_physics': 'Natural Sciences and Engineering',
    'high_school_chemistry': 'Natural Sciences and Engineering',
    'college_chemistry': 'Natural Sciences and Engineering',
    'electrical_engineering': 'Natural Sciences and Engineering',
    'astronomy': 'Natural Sciences and Engineering',

    # --- 7. Mathematics, Statistics, and Computer Science ---
    'elementary_mathematics': 'Mathematics, Statistics, and Computer Science',
    'high_school_mathematics': 'Mathematics, Statistics, and Computer Science',
    'high_school_statistics': 'Mathematics, Statistics, and Computer Science',
    'college_mathematics': 'Mathematics, Statistics, and Computer Science',
    'abstract_algebra': 'Mathematics, Statistics, and Computer Science',
    'formal_logic': 'Mathematics, Statistics, and Computer Science',
    'college_computer_science': 'Mathematics, Statistics, and Computer Science',
    'high_school_computer_science': 'Mathematics, Statistics, and Computer Science',
    'machine_learning': 'Mathematics, Statistics, and Computer Science',
    'computer_security': 'Mathematics, Statistics, and Computer Science',

    # --- 8. History, Geography, and Culture ---
    'prehistory': 'History, Geography, and Culture',
    'high_school_world_history': 'History, Geography, and Culture',
    'high_school_european_history': 'History, Geography, and Culture',
    'high_school_us_history': 'History, Geography, and Culture',
    'high_school_geography': 'History, Geography, and Culture',
    'global_facts': 'History, Geography, and Culture',

    # --- 9. Religion and Worldviews ---
    'world_religions': 'Religion and Worldviews',

    # --- 10. Miscellaneous ---
    'miscellaneous': 'Miscellaneous and Cross-domain Knowledge',
}

# Add new column to the DataFrame
df_mmmlu_pt['Macrodomain'] = df_mmmlu_pt['Subject'].map(subject_to_macrodomain)

# Check mapping coverage
unmapped = df_mmmlu_pt[df_mmmlu_pt['Macrodomain'].isna()]['Subject'].unique()
if len(unmapped) > 0:
    print("\nUnmapped subjects detected:", unmapped)
else:
    print("\nAll subjects successfully mapped to macrodomains!")

# Count rows for each Macrodomain
macrodomain_counts = df_mmmlu_pt['Macrodomain'].value_counts()
print(macrodomain_counts)

# Preview updated DataFrame
df_mmmlu_pt.head()


All subjects successfully mapped to macrodomains!
Macrodomain
Law, Governance, and Ethics                        3457
Medicine, Health, and Life Sciences                1871
Psychology, Human Behavior, and Society            1712
Mathematics, Statistics, and Computer Science      1602
Economics, Business, and Management                1471
History, Geography, and Culture                    1228
Natural Sciences and Engineering                   1088
Miscellaneous and Cross-domain Knowledge            783
Political Science, Security, and Global Affairs     659
Religion and Worldviews                             171
Name: count, dtype: int64


Unnamed: 0.1,Unnamed: 0,Question,A,B,C,D,Answer,Subject,Macrodomain
0,0,Encontre o √¢ngulo para a extens√£o de campo dad...,0,4,2,6,B,abstract_algebra,"Mathematics, Statistics, and Computer Science"
1,1,"Considere p = (1, 2, 5, 4)(2, 3) em S_5. Encon...",8,2,24,120,C,abstract_algebra,"Mathematics, Statistics, and Computer Science"
2,2,Encontre todos os zeros no campo finito indica...,0,1,01,04,D,abstract_algebra,"Mathematics, Statistics, and Computer Science"
3,3,Declara√ß√£o 1 | Um grupo quociente de um grupo ...,"Verdadeiro, Verdadeiro","Falso, Falso","Verdadeiro, Falso","Falso, Verdadeiro",B,abstract_algebra,"Mathematics, Statistics, and Computer Science"
4,4,Encontre o produto dos polin√¥mios dados no ane...,2x^2 + 5,6x^2 + 4x + 6,0,x^2 + 1,B,abstract_algebra,"Mathematics, Statistics, and Computer Science"


### 2.3 Sampling Questions by Macrodomain

Display a few representative questions from each **macrodomain** to:
- Visually confirm that the mapping and labeling are coherent.
- Analyze and select the most fitted macrodomain for the experiments

In [None]:
# Print sample questions organized by Macrodomain and Subject
n_samples_per_subject = 2
macrodomain_order = (
    df_mmmlu_pt['Macrodomain']
    .value_counts()
    .index
)

for macrodomain in macrodomain_order:
    group = df_mmmlu_pt[df_mmmlu_pt['Macrodomain'] == macrodomain]

    print("=" * 100)
    print(f"üìò MACRODOMAIN: {macrodomain.upper()} - Total questions: {len(group)}")
    print("=" * 100)
    
    # Iterate over subjects inside each Macrodomain
    for subject, sub_df in group.groupby('Subject'):
        print(f"\nüîπ Subject: {subject}\n" + "-" * 80)
        
        # Take sample questions
        samples = sub_df.sample(n=min(n_samples_per_subject, len(sub_df)), random_state=42)
        
        for i, row in samples.iterrows():
            print(f"Question: {row['Question']}")
            print(f"a) {row['A']}")
            print(f"b) {row['B']}")
            print(f"c) {row['C']}")
            print(f"d) {row['D']}")
            print(f"Correct Answer: {row['Answer']}")
            print("-" * 60)
    print("\n\n")


üìò MACRODOMAIN: LAW, GOVERNANCE, AND ETHICS - Total questions: 3457

üîπ Subject: business_ethics
--------------------------------------------------------------------------------
Question: ________________ o ambiente de trabalho envolve capacitar os empregados, por exemplo, por meio de ‚Äúenriquecimento do cargo‚Äù, pelo qual os empregados recebem um escopo maior para decidir como organizar seu trabalho, ou ‚Äúexpans√£o do cargo‚Äù, por onde os empregados recebem mais tarefas.
a) Revigorar
b) Renovar
c) Revitalizar
d) Reumanizar
Correct Answer: D
------------------------------------------------------------
Question: Conforme Mitchell et al (1997), ___________, a capacidade percebida de uma parte interessada de influenciar a a√ß√£o organizacional, _______________, se a organiza√ß√£o percebe as a√ß√µes da parte interessada como desej√°veis, apropriadas e corretas, e ____________, a imedia√ß√£o da aten√ß√£o que a parte interessada exige, determinam a __________ da parte interessada.
a)

### 2.4 Train/Test Split

Split the dataset for the selected macrodomain (*Law, Governance, and Ethics*) into **training (70%)** and **test (30%)** subsets,  
stratified by subject for balanced evaluation.

In [None]:
# Generate counts of questions per Macrodomain and Subject
subject_counts = df_mmmlu_pt.groupby(['Macrodomain', 'Subject']).size().reset_index(name='Count')
subject_counts_sorted = subject_counts.sort_values(by='Count', ascending=False)

# Select 'Law, Governance, and Ethics' Macrodomain    
subject_counts_sorted[subject_counts_sorted['Macrodomain'] == 'Law, Governance, and Ethics']

Unnamed: 0,Macrodomain,Subject,Count
19,"Law, Governance, and Ethics",professional_law,1534
17,"Law, Governance, and Ethics",moral_scenarios,895
16,"Law, Governance, and Ethics",moral_disputes,346
18,"Law, Governance, and Ethics",philosophy,311
15,"Law, Governance, and Ethics",logical_fallacies,163
14,"Law, Governance, and Ethics",jurisprudence,108
13,"Law, Governance, and Ethics",business_ethics,100


In [None]:
# Split dataset into train (70%) and test (30%) sets for the selected macrodomain
df_train, df_test = train_test_split(
    df_mmmlu_pt[df_mmmlu_pt['Macrodomain'] == 'Law, Governance, and Ethics'],
    test_size=0.3,                  # 30% for the test set (which leaves 70% for the train set)
    random_state=42,                # Use a fixed random state for reproducibility
    stratify=df_mmmlu_pt[df_mmmlu_pt['Macrodomain'] == 'Law, Governance, and Ethics']['Subject'] # Stratify by the 'Subject' column
)

In [None]:
# Verify distribution in the test set
test_subject_counts = df_test.groupby(['Macrodomain', 'Subject']).size().reset_index(name='Count').sort_values(by='Count', ascending=False)
df_test.to_csv(f"{output_path}/mmlu_test.csv", index=False) # Save
test_subject_counts

Unnamed: 0,Macrodomain,Subject,Count
6,"Law, Governance, and Ethics",professional_law,461
4,"Law, Governance, and Ethics",moral_scenarios,269
3,"Law, Governance, and Ethics",moral_disputes,104
5,"Law, Governance, and Ethics",philosophy,93
2,"Law, Governance, and Ethics",logical_fallacies,49
1,"Law, Governance, and Ethics",jurisprudence,32
0,"Law, Governance, and Ethics",business_ethics,30


In [None]:
# Verify distribution in the train set
train_subject_counts = df_train.groupby(['Macrodomain', 'Subject']).size().reset_index(name='Count').sort_values(by='Count', ascending=False)
df_train.to_csv(f"{output_path}/mmlu_train.csv", index=False) # Save
train_subject_counts

Unnamed: 0,Macrodomain,Subject,Count
6,"Law, Governance, and Ethics",professional_law,1073
4,"Law, Governance, and Ethics",moral_scenarios,626
3,"Law, Governance, and Ethics",moral_disputes,242
5,"Law, Governance, and Ethics",philosophy,218
2,"Law, Governance, and Ethics",logical_fallacies,114
1,"Law, Governance, and Ethics",jurisprudence,76
0,"Law, Governance, and Ethics",business_ethics,70


## 3. Portuguese Wikipedia Corpus <a id="part_03"></a>

This section loads, filters, and prepares the **Portuguese Wikipedia (PT-BR)** dataset  
from [HuggingFace ‚Äî *pablo-moreira/wikipedia-pt*](https://huggingface.co/datasets/pablo-moreira/wikipedia-pt).

**Main steps:**
1. Load and inspect the full Wikipedia PT-BR dataset.  
2. Filter articles using domain-specific **keyword groups** related to *Law, Governance, and Ethics*.  
3. Apply **semantic similarity filtering** with multilingual embeddings.  
4. Merge both filtering approaches into a **final curated dataset**.

> This stage builds a focused Wikipedia subset aligned with the *Law, Governance, and Ethics* macrodomain.


### 3.1 Loading the Dataset

Load and concatenate all `.parquet` files from the Wikipedia PT-BR corpus,  
creating a unified DataFrame with article titles and text content.


In [None]:
# Load the dataset
ds = load_dataset("pablo-moreira/wikipedia-pt", "latest", split="train")

# Convert to a pandas DataFrame (optional, for convenience)
df_wiki = pd.DataFrame(ds)

print("Total articles:", len(df_wiki))
df_wiki.head()

(1857355, 3)


Unnamed: 0,id,title,text
0,220,Astronomia,Astronomia\n\nAstronomia √© uma ci√™ncia natural...
1,223,Am√©rica Latina,Am√©rica Latina\n\nA Am√©rica Latina (; ) √© uma ...
2,224,Albino Forjaz de Sampaio,Albino Forjaz de Sampaio\n\nAlbino Maria Perei...
3,226,Anno Domini,Anno Domini\n\nAnno Domini (A.D.) √© uma expres...
4,228,Aquiles,"Aquiles\n\nAquiles (), na mitologia grega, foi..."


### 3.2 Keyword-Based Filtering

Define **keyword groups** to identify relevant articles related to *Law, Governance, and Ethics*  
(e.g., legal terminology, governance topics, ethical concepts, and professional figures).

The keyword strategy and domain taxonomy were refined with the help of **ChatGPT 5**,  
which assisted in expanding lists with relevant Portuguese legal, political, and ethical terminology.

---

#### Prompt 1 ‚Äî Domain Keywords (Law, Governance, Ethics, Business)

> ‚ÄúGenerate comprehensive Portuguese keyword lists related to **Law, Governance, and Ethics**,  
> including legal terms, governance and policy expressions, ethical and moral vocabulary,  
> and business or corporate responsibility terminology.  
> Each list should capture stems or word roots suitable for text pattern matching.‚Äù

These lists formed the basis of `law_keywords`, `governance_keywords`, `ethics_keywords`, and `business_keywords`.

---

#### Prompt 2 ‚Äî Biography Keywords

> ‚ÄúList common **Portuguese words or phrases** that typically appear in biographical or professional  
> descriptions of people connected to **law, governance, politics, or moral philosophy** ‚Äî  
> including professions, roles, and titles (e.g., *advogado*, *fil√≥sofo*, *ativista*).  
> Focus on general descriptors likely to occur in the first sentences of Wikipedia biographies.‚Äù

This prompt generated the `bio_keywords` set used to detect professional or biographical relevance.

---

#### Prompt 3 ‚Äî Named Entities (People and Thinkers)

> ‚ÄúCreate an extensive list of **names of historical and contemporary figures** relevant to  
> *law, governance, ethics, and political or moral philosophy* in Portuguese and global contexts.  
> Include jurists, philosophers, political leaders, sociologists, activists, and thinkers  
> whose Wikipedia articles could contain valuable domain knowledge.‚Äù

This prompt produced the `name_keywords` list used for named-entity matching in the filtering stage.

---

* All three evaluated keyword sets were reviewed, refined, and integrated into the filtering pipeline  
to ensure wide coverage of the *Law, Governance, and Ethics* domain while minimizing irrelevant noise.



In [None]:
# Domain Keywords
law_keywords = [
    "direito", "leis", "lei", "jur√≠dic", "constitui√ß√£o", "justi√ßa", "crime",
    "criminal", "penal", "civil", "advogad", "tribunal", "senten", "norma",
    "lit√≠gio", "processo", "recurso", "constitucional", "supremo", "judicial",
    "magistrad", "jurisprud", "procurador", "ministerio p√∫blico"
]

governance_keywords = [
    "governo", "governan√ßa", "estado", "poder p√∫blico", "administra",
    "pol√≠tica p√∫blica", "pol√≠ticas p√∫blicas", "soberania", "cidadania",
    "direitos humanos", "parlamento", "congresso", "senado", "c√¢mara",
    "prefeit", "corrup√ß√£o", "transpar√™ncia", "accountability", "elei√ß√£o",
    "elei√ß√µes", "partido pol√≠tico", "democracia", "constitucionalismo"
]

ethics_keywords = [
    "√©tica", "moral", "deontolog", "utilitar", "kant", "virtude", "fal√°ci",
    "fal√°cia", "filosofia moral", "dilema", "bio√©tica", "moralidade",
    "responsabilidade", "justi√ßa social", "igualdade", "liberdade",
    "honestidade", "corrup√ß√£o √©tica"
]

business_keywords = [
    "governan√ßa corporativa", "compliance", "responsabilidade social",
    "√©tica empresarial", "conduta profissional", "transpar√™ncia corporativa",
    "corrup√ß√£o corporativa", "responsabilidade socioambiental"
]

# Biography Keywords
bio_keywords = [
    # Legal professions
    "jurista", "advogado", "advogada", "magistrado", "magistrada",
    "procurador", "procuradora", "ministro do supremo", "juiz", "ju√≠za",
    
    # Political/governance roles
    "pol√≠tico", "pol√≠tica", "estadista", "governante", "presidente", 
    "parlamentar", "deputado", "senador", "prefeito", "ministro", "governador",
    
    # Academic / philosophical
    "fil√≥sofo", "fil√≥sofa", "pensador", "pensadora", "te√≥rico", "te√≥rica",
    "professor de filosofia", "moralista", "te√≥logo", "intelectual p√∫blico",
    
    # Ethics / activism
    "ativista", "defensor dos direitos humanos", "reformador", "humanista"
]

# Named Entity Keywords
name_keywords = [
    # --- Classic jurists and legal thinkers ---
    "Rui Barbosa", "Cl√≥vis Bevil√°qua", "Tobias Barreto", "Pontes de Miranda",
    "Miguel Reale", "Miguel Reale J√∫nior", "Paulo Bonavides", "Celso Ant√¥nio Bandeira de Mello",
    "Jos√© Afonso da Silva", "F√°bio Konder Comparato", "Raymundo Faoro", "Carlos Maximiliano",
    "Pimenta Bueno", "Castro Nunes", "Teixeira de Freitas", "Nelson Hungria", "Vicente R√°o",
    "Jos√© Levi Mello do Amaral", "Luis Roberto Barroso", "Gilmar Mendes", "Joaquim Barbosa",
    "C√°rmen L√∫cia", "Alexandre de Moraes", "Ayres Britto", "Sep√∫lveda Pertence", "Eros Grau",
    "Marco Aur√©lio Mello", "S√©rgio Moro", "Maria Berenice Dias", "Lu√≠s Felipe Salom√£o",

    # --- Political and constitutional figures ---
    "Ulysses Guimar√£es", "Tancredo Neves", "Jos√© Sarney", "Itamar Franco",
    "Fernando Henrique Cardoso", "Lula", "Luiz In√°cio Lula da Silva", "Dilma Rousseff",
    "Michel Temer", "Jair Bolsonaro", "Get√∫lio Vargas", "Juscelino Kubitschek",
    "Jo√£o Goulart", "Castelo Branco", "Costa e Silva", "Ernesto Geisel", "Figueiredo",
    "Jos√© Bonif√°cio de Andrada e Silva", "Afonso Arinos de Melo Franco", "Teot√¥nio Vilela",
    "Leonel Brizola", "Darcy Ribeiro", "Celso Furtado", "Celso Lafer", "Mario Covas",
    "Eduardo Campos", "Marina Silva", "Ciro Gomes", "Fernando Haddad", "S√©rgio Cabral",
    "Antonio Carlos Magalh√£es", "Jos√© Serra", "A√©cio Neves", "Geraldo Alckmin",

    # --- Philosophers, sociologists, educators, and thinkers ---
    "Paulo Freire", "Florestan Fernandes", "Roberto Mangabeira Unger", "Marilena Chau√≠",
    "Renato Janine Ribeiro", "Leandro Konder", "Luiz Felipe Pond√©", "Olavo de Carvalho",
    "Mario S√©rgio Cortella", "Cl√≥vis de Barros Filho", "Vladimir Safatle", "Emir Sader",
    "Milton Santos", "Gilberto Freyre", "S√©rgio Buarque de Holanda", "Caio Prado J√∫nior",
    "Darcy Ribeiro", "Euclides da Cunha", "Jos√© Guilherme Merquior", "Antonio Candido",
    "Nelson Rodrigues", "Nina Rodrigues", "Gilberto Amado", "Afr√¢nio Coutinho",

    # --- Activists, journalists, and human rights defenders ---
    "Maria da Penha", "Marielle Franco", "Zilda Arns", "Herbert de Souza", "Betinho",
    "Dom H√©lder C√¢mara", "Chico Mendes", "Carlos Drummond de Andrade", "Clarice Lispector",
    "Ariano Suassuna", "Alda√≠za Sposati", "Nise da Silveira", "Heleieth Saffioti",
    "L√©lia Gonzalez", "Djamila Ribeiro", "Carolina Maria de Jesus", "Sueli Carneiro",
    "Abdias do Nascimento", "Milton Santos", "Leonardo Boff", "Frei Betto",
    "Rubem Alves", "Henfil", "Mill√¥r Fernandes", "Ferreira Gullar", "Patr√≠cia Campos Mello"

    # --- Classical philosophy ---
    "S√≥crates", "Plat√£o", "Arist√≥teles", "Epicuro", "C√≠cero", "S√™neca", "Agostinho de Hipona",
    "Tom√°s de Aquino", "Maquiavel", "Francis Bacon", "Ren√© Descartes", "Baruch Spinoza",
    "David Hume", "John Locke", "Thomas Hobbes", "Jean-Jacques Rousseau",
    "Montesquieu", "Voltaire", "Immanuel Kant", "Georg Wilhelm Friedrich Hegel",
    "Karl Marx", "Friedrich Engels", "John Stuart Mill", "Jeremy Bentham",
    "Alexis de Tocqueville", "Giambattista Vico", "Edmund Burke", "Thomas Paine",

    # --- Modern philosophy & ethics ---
    "Friedrich Nietzsche", "Arthur Schopenhauer", "Auguste Comte", "Emile Durkheim",
    "Max Weber", "Karl Popper", "Ludwig Wittgenstein", "Hannah Arendt",
    "Michel Foucault", "J√ºrgen Habermas", "John Rawls", "Robert Nozick",
    "Alasdair MacIntyre", "Peter Singer", "Amartya Sen", "Martha Nussbaum",
    "Simone de Beauvoir", "Jean-Paul Sartre", "Albert Camus", "Paul Ricoeur",
    "Emmanuel Levinas", "Charles Taylor", "Isaiah Berlin", "Erich Fromm",

    # --- Legal theory & jurisprudence ---
    "Hans Kelsen", "Herbert Hart", "Ronald Dworkin", "Lon Fuller", "Carl Schmitt",
    "Cesare Beccaria", "Gustav Radbruch", "Michel Villey", "Roscoe Pound",
    "Oliver Wendell Holmes", "Benjamin Cardozo", "John Austin", "Jeremy Bentham",
    "John Finnis", "Joseph Raz", "Richard Posner", "Cass Sunstein",

    # --- Political theory, governance, economics ---
    "John Maynard Keynes", "Friedrich Hayek", "Karl Polanyi", "Antonio Gramsci",
    "Ernesto Laclau", "Chantal Mouffe", "Robert Dahl", "Giovanni Sartori",
    "Max Weber", "David Easton", "John Dewey", "Niccol√≤ Machiavelli", "Raymond Aron",
    "Hannah Pitkin", "Elinor Ostrom", "Amartya Sen", "Douglass North",

    # --- Moral and applied ethics ---
    "Peter Singer", "Philippa Foot", "Judith Jarvis Thomson", "Elizabeth Anscombe",
    "Bernard Williams", "Thomas Nagel", "Christine Korsgaard", "Charles Taylor",
    "Susan Wolf", "Michael Sandel", "Onora O'Neill", "Iris Murdoch",

    # --- Human rights and social thought ---
    "Eleanor Roosevelt", "Nelson Mandela", "Martin Luther King", "Mahatma Gandhi",
    "Desmond Tutu", "Vaclav Havel", "Aung San Suu Kyi", "Malala Yousafzai",
    "Noam Chomsky", "Cornel West", "bell hooks", "Angela Davis", "Judith Butler"
]

This cell defines and compiles **regular expression patterns** used to identify relevant Wikipedia articles.

It first wraps keyword lists (law, governance, ethics, business, biography, and names) into **compiled regex objects**  
for efficient text search. These patterns detect domain-specific terms, named entities, and biography markers.  

A separate `exclude_re` pattern filters out irrelevant topics (e.g., sports, music).  
Finally, all domain keywords are merged into a single `domain_pattern` for global keyword counting.

In [4]:
def compile_pattern(words):
    return re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b", flags=re.IGNORECASE)

# Base keyword groups
law_re              = compile_pattern(law_keywords)
governance_re       = compile_pattern(governance_keywords)
ethics_re           = compile_pattern(ethics_keywords)
business_re         = compile_pattern(business_keywords)
bio_re              = compile_pattern(bio_keywords)
name_re             = re.compile("|".join(map(re.escape, name_keywords)), flags=re.IGNORECASE)
exclude_re          = re.compile(r"(?i)\b(lista|campeonato|futebol|√°lbum|m√∫sica|temporada|condado|jogos?)\b")

# Combined domain pattern for efficient counting
all_domain_keywords = law_keywords + governance_keywords + ethics_keywords + business_keywords
domain_pattern      = compile_pattern(all_domain_keywords)

This cell implements the core **filtering logic** applied to each Wikipedia chunk.

The function `process_chunk()` scans article titles and texts to create boolean masks  
based on multiple inclusion and exclusion rules:

1. **Named-entity rule** ‚Äî selects articles mentioning known domain figures.  
2. **Biography rule** ‚Äî detects professional or biographical indicators in the introduction.  
3. **Domain keyword rule** ‚Äî keeps articles exceeding a keyword threshold.  
4. **Exclusion rule** ‚Äî removes entertainment or unrelated topics.

It returns masks and statistics summarizing how many articles matched each rule.

In [5]:
def process_chunk(chunk_df, threshold=5):
    """
    Process a chunk of the dataframe and return boolean mask for inclusion.
    
    Args:
        chunk_df: DataFrame chunk with 'title' and 'text' columns
        threshold: Minimum domain keyword count for inclusion
    
    Returns:
        numpy array of boolean values indicating which rows to include
    """
    titles = chunk_df["title"].fillna("").astype(str)
    texts = chunk_df["text"].fillna("").astype(str)
    
    # Rule 1: Named-entity rule (check title or multiple mentions in text)
    mask_name_title = titles.str.contains(name_re, na=False, regex=True)
    
    # For text, count occurrences and select if >= 4 mentions
    name_counts = texts.str.count(name_re)
    mask_name_text = name_counts >= 4
    mask_name = mask_name_title | mask_name_text
    
    # Rule 2: Biography rule (profession keyword in first 200 chars)
    intro_snippet = texts.str.slice(0, 200)
    mask_bio = intro_snippet.str.contains(bio_re, na=False, regex=True)
    
    # Rule 3: Domain keyword scoring (count all domain keywords)
    domain_counts = texts.str.count(domain_pattern)
    mask_domain = domain_counts >= threshold
    
    # Combine inclusion rules
    mask_include = mask_name | mask_bio | mask_domain
    
    # Apply exclusion filter
    mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
    
    # Final mask
    final_mask = mask_include & ~mask_exclude

    return (final_mask.values, 
            mask_exclude.sum(), 
            mask_name.sum(), 
            mask_bio.sum(), 
            mask_domain.sum())


This cell defines `filter_wikipedia_parallel()`, which executes the filtering pipeline in **parallel threads**  
for scalability on large Wikipedia datasets.

It splits the DataFrame into chunks, processes each using `process_chunk()`,  
and aggregates results while tracking detailed metrics ‚Äî including total matches, exclusions,  
and overall reduction percentage.

The function prints a summary report and returns both the filtered dataset and a statistics dictionary.

In [6]:
def filter_wikipedia_parallel(df, threshold=5, n_workers=None, chunk_size=10000):
    """
    Filter Wikipedia dataframe in parallel using ThreadPoolExecutor.
    
    Args:
        df: Input DataFrame with 'title' and 'text' columns
        threshold: Minimum domain keyword count
        n_workers: Number of worker threads (default: CPU count)
        chunk_size: Number of rows per chunk
    
    Returns:
        tuple: (filtered_df, stats_dict)
    """
    print(f"Processing {len(df):,} rows with {n_workers or 'auto'} workers...")
    
    # Split dataframe into chunks
    chunks = [df.iloc[i:i + chunk_size].copy() for i in range(0, len(df), chunk_size)]
    n_chunks = len(chunks)
    print(f"Split into {n_chunks} chunks of ~{chunk_size:,} rows each")
    
    # Process chunks in parallel
    results = []
    process_fn = partial(process_chunk, threshold=threshold)
    
    # Initialize statistics accumulators
    total_excluded = 0
    total_name = 0
    total_bio = 0
    total_domain = 0
    
    with ThreadPoolExecutor(max_workers=n_workers) as executor:
        # Submit all chunks
        future_to_idx = {executor.submit(process_fn, chunk): i 
                        for i, chunk in enumerate(chunks)}
        
        # Collect results as they complete
        for i, future in enumerate(tqdm(as_completed(future_to_idx), total=n_chunks, desc="Processing chunks")):
            chunk_idx = future_to_idx[future]
            try:
                mask, excluded, name_cnt, bio_cnt, domain_cnt = future.result()
                results.append((chunk_idx, mask))
                
                # Accumulate statistics
                total_excluded += excluded
                total_name += name_cnt
                total_bio += bio_cnt
                total_domain += domain_cnt
                
                print(f"Chunk {i}/{n_chunks} (idx {chunk_idx}): "
                      f"{mask.sum():,} selected | "
                      f"name={name_cnt} bio={bio_cnt} domain={domain_cnt} excluded={excluded}")
            except Exception as e:
                print(f"Error processing chunk {chunk_idx}: {e}")
                raise
    
    # Sort results by original chunk order and concatenate masks
    results.sort(key=lambda x: x[0])
    full_mask = np.concatenate([mask for _, mask in results])
    
    # Apply mask to original dataframe
    df_filtered = df[full_mask].copy()
    
    # Calculate overlap statistics (rules can overlap)
    total_selected = len(df_filtered)
    
    # Compile statistics
    stats = {
        "total_rows": len(df),
        "selected_rows": total_selected,
        "reduction_pct": (1 - total_selected / len(df)) * 100,
        "excluded_by_filter": total_excluded,
        "matched_name_rule": total_name,
        "matched_bio_rule": total_bio,
        "matched_domain_rule": total_domain,
        "threshold_used": threshold
    }
    
    # Print summary
    print(f"\n{'='*60}")
    print(f"FILTERING COMPLETE")
    print(f"{'='*60}")
    print(f"Original rows:        {stats['total_rows']:>10,}")
    print(f"Filtered rows:        {stats['selected_rows']:>10,}")
    print(f"Reduction:            {stats['reduction_pct']:>10.1f}%")
    print(f"\nRule Matches (with overlaps):")
    print(f"  Named entities:     {stats['matched_name_rule']:>10,}")
    print(f"  Biography keywords: {stats['matched_bio_rule']:>10,}")
    print(f"  Domain keywords:    {stats['matched_domain_rule']:>10,}")
    print(f"\nExcluded (sport/entertainment): {stats['excluded_by_filter']:>6,}")
    print(f"Domain threshold used:          {stats['threshold_used']:>6}")
    print(f"{'='*60}\n")
    
    return df_filtered, stats

This cell executes the full **parallel filtering process** and exports the resulting dataset.

It calls `filter_wikipedia_parallel()` with the configured parameters (e.g., threshold, number of workers),  
then computes an additional metric: the **average number of rule matches per article**.  

Finally, it saves the filtered output to `wiki_keyword.csv`, providing the curated Wikipedia subset  
aligned with the *Law, Governance, and Ethics* domain.

> ‚ö†Ô∏è **Warning:** The next Python cell involves **high-performance computing (HPC)**.  
> Execution requires a **dedicated or cloud machine with multiple cores**, not a standard desktop or notebook.  
> Runtime and cell outputs are reported below.


In [None]:
if __name__ == "__main__":
    # Filter with parallel processing
    df_filtered, stats = filter_wikipedia_parallel(
        df_wiki, 
        threshold=5,
        n_workers=10,  # or None for auto
        chunk_size=20000
    )
    
    # Access statistics
    print("Additional Analysis:")
    print(f"Average rules matched per article: "
          f"{(stats['matched_name_rule'] + stats['matched_bio_rule'] + stats['matched_domain_rule']) / stats['selected_rows']:.2f}")
    
    # Save results
    df_filtered.to_csv(f"{output_path}/wiki_keyword.csv", index=False)

df_filtered

Processing 1,857,355 rows with 10 workers...
Split into 93 chunks of ~20,000 rows each


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   1%|          | 1/93 [06:22<9:47:07, 382.90s/it]

Chunk 0/93 (idx 6): 958 selected | name=129 bio=554 domain=626 excluded=435


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   2%|‚ñè         | 2/93 [07:43<5:11:00, 205.06s/it]

Chunk 1/93 (idx 9): 932 selected | name=112 bio=416 domain=592 excluded=495


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   3%|‚ñé         | 3/93 [08:17<3:10:40, 127.12s/it]

Chunk 2/93 (idx 8): 1,232 selected | name=158 bio=539 domain=865 excluded=383


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   4%|‚ñç         | 4/93 [08:42<2:08:19, 86.51s/it] 

Chunk 3/93 (idx 4): 1,221 selected | name=181 bio=411 domain=951 excluded=950


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   5%|‚ñå         | 5/93 [09:45<1:54:24, 78.00s/it]

Chunk 4/93 (idx 5): 1,450 selected | name=208 bio=519 domain=1093 excluded=1405


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   6%|‚ñã         | 6/93 [10:07<1:25:34, 59.02s/it]

Chunk 5/93 (idx 2): 1,453 selected | name=212 bio=379 domain=1202 excluded=247


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   8%|‚ñä         | 7/93 [10:08<57:43, 40.27s/it]  

Chunk 6/93 (idx 7): 2,010 selected | name=215 bio=1032 domain=1262 excluded=734


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:   9%|‚ñä         | 8/93 [11:23<1:12:25, 51.13s/it]

Chunk 7/93 (idx 3): 1,811 selected | name=354 bio=592 domain=1408 excluded=838


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  10%|‚ñâ         | 9/93 [12:47<1:26:02, 61.45s/it]

Chunk 8/93 (idx 13): 591 selected | name=75 bio=273 domain=397 excluded=293


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  11%|‚ñà         | 10/93 [13:35<1:19:08, 57.21s/it]

Chunk 9/93 (idx 10): 1,653 selected | name=138 bio=1036 domain=803 excluded=422


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  12%|‚ñà‚ñè        | 11/93 [13:37<55:25, 40.55s/it]  

Chunk 10/93 (idx 14): 648 selected | name=72 bio=306 domain=408 excluded=380


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  13%|‚ñà‚ñé        | 12/93 [14:28<58:53, 43.63s/it]

Chunk 11/93 (idx 11): 1,331 selected | name=176 bio=604 domain=903 excluded=572


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  14%|‚ñà‚ñç        | 13/93 [14:44<47:05, 35.32s/it]

Chunk 12/93 (idx 12): 1,404 selected | name=149 bio=695 domain=897 excluded=578


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  15%|‚ñà‚ñå        | 14/93 [15:30<50:42, 38.51s/it]

Chunk 13/93 (idx 15): 1,175 selected | name=125 bio=549 domain=761 excluded=450


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  16%|‚ñà‚ñå        | 15/93 [15:38<38:04, 29.29s/it]

Chunk 14/93 (idx 16): 1,138 selected | name=115 bio=546 domain=676 excluded=677


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  17%|‚ñà‚ñã        | 16/93 [15:47<29:38, 23.10s/it]

Chunk 15/93 (idx 0): 3,979 selected | name=741 bio=535 domain=3566 excluded=184


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  18%|‚ñà‚ñä        | 17/93 [15:52<22:19, 17.62s/it]

Chunk 16/93 (idx 1): 3,889 selected | name=514 bio=645 domain=3482 excluded=332


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  19%|‚ñà‚ñâ        | 18/93 [16:07<21:08, 16.91s/it]

Chunk 17/93 (idx 17): 1,084 selected | name=146 bio=540 domain=631 excluded=578


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  20%|‚ñà‚ñà        | 19/93 [16:15<17:33, 14.24s/it]

Chunk 18/93 (idx 20): 338 selected | name=35 bio=169 domain=201 excluded=211


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  22%|‚ñà‚ñà‚ñè       | 20/93 [16:40<21:11, 17.42s/it]

Chunk 19/93 (idx 18): 966 selected | name=121 bio=477 domain=595 excluded=866


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  23%|‚ñà‚ñà‚ñé       | 21/93 [17:22<29:42, 24.76s/it]

Chunk 20/93 (idx 19): 472 selected | name=56 bio=230 domain=279 excluded=249


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  24%|‚ñà‚ñà‚ñé       | 22/93 [17:40<27:04, 22.88s/it]

Chunk 21/93 (idx 27): 156 selected | name=15 bio=75 domain=92 excluded=85


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  25%|‚ñà‚ñà‚ñç       | 23/93 [17:42<19:28, 16.70s/it]

Chunk 22/93 (idx 26): 252 selected | name=30 bio=121 domain=159 excluded=944


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  26%|‚ñà‚ñà‚ñå       | 24/93 [18:15<24:47, 21.56s/it]

Chunk 23/93 (idx 21): 487 selected | name=54 bio=256 domain=274 excluded=337


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)


Chunk 24/93 (idx 22): 1,174 selected | name=122 bio=729 domain=556 excluded=535


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  28%|‚ñà‚ñà‚ñä       | 26/93 [19:06<25:28, 22.81s/it]

Chunk 25/93 (idx 23): 1,173 selected | name=122 bio=724 domain=535 excluded=414


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  29%|‚ñà‚ñà‚ñâ       | 27/93 [19:56<34:05, 31.00s/it]

Chunk 26/93 (idx 24): 1,109 selected | name=124 bio=592 domain=603 excluded=524


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  30%|‚ñà‚ñà‚ñà       | 28/93 [20:08<27:35, 25.47s/it]

Chunk 27/93 (idx 25): 1,118 selected | name=126 bio=613 domain=634 excluded=705


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  31%|‚ñà‚ñà‚ñà       | 29/93 [20:11<19:45, 18.52s/it]

Chunk 28/93 (idx 28): 1,203 selected | name=93 bio=813 domain=589 excluded=1373


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  32%|‚ñà‚ñà‚ñà‚ñè      | 30/93 [21:02<29:49, 28.41s/it]

Chunk 29/93 (idx 29): 1,311 selected | name=113 bio=830 domain=692 excluded=1092


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  33%|‚ñà‚ñà‚ñà‚ñé      | 31/93 [21:23<26:49, 25.96s/it]

Chunk 30/93 (idx 31): 916 selected | name=49 bio=499 domain=546 excluded=744


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  34%|‚ñà‚ñà‚ñà‚ñç      | 32/93 [21:25<19:06, 18.80s/it]

Chunk 31/93 (idx 30): 1,143 selected | name=69 bio=627 domain=624 excluded=1185


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  35%|‚ñà‚ñà‚ñà‚ñå      | 33/93 [21:48<20:11, 20.20s/it]

Chunk 32/93 (idx 32): 1,356 selected | name=71 bio=901 domain=587 excluded=915


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  37%|‚ñà‚ñà‚ñà‚ñã      | 34/93 [22:01<17:35, 17.89s/it]

Chunk 33/93 (idx 33): 990 selected | name=106 bio=512 domain=542 excluded=833


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  38%|‚ñà‚ñà‚ñà‚ñä      | 35/93 [22:45<24:55, 25.78s/it]

Chunk 34/93 (idx 34): 899 selected | name=91 bio=463 domain=527 excluded=1052


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  39%|‚ñà‚ñà‚ñà‚ñä      | 36/93 [22:48<18:02, 19.00s/it]

Chunk 35/93 (idx 35): 890 selected | name=90 bio=496 domain=486 excluded=1090


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  40%|‚ñà‚ñà‚ñà‚ñâ      | 37/93 [23:21<21:39, 23.20s/it]

Chunk 36/93 (idx 36): 830 selected | name=59 bio=464 domain=452 excluded=905


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  42%|‚ñà‚ñà‚ñà‚ñà‚ñè     | 39/93 [23:36<13:12, 14.68s/it]

Chunk 37/93 (idx 37): 860 selected | name=59 bio=508 domain=453 excluded=706
Chunk 38/93 (idx 40): 592 selected | name=53 bio=399 domain=258 excluded=801


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  43%|‚ñà‚ñà‚ñà‚ñà‚ñé     | 40/93 [23:51<13:03, 14.78s/it]

Chunk 39/93 (idx 38): 914 selected | name=80 bio=557 domain=465 excluded=1436


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  44%|‚ñà‚ñà‚ñà‚ñà‚ñç     | 41/93 [24:00<11:15, 13.00s/it]

Chunk 40/93 (idx 44): 365 selected | name=26 bio=266 domain=126 excluded=355


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  45%|‚ñà‚ñà‚ñà‚ñà‚ñå     | 42/93 [24:19<12:30, 14.71s/it]

Chunk 41/93 (idx 39): 825 selected | name=65 bio=469 domain=457 excluded=1380


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  46%|‚ñà‚ñà‚ñà‚ñà‚ñå     | 43/93 [24:54<17:25, 20.91s/it]

Chunk 42/93 (idx 42): 848 selected | name=50 bio=552 domain=441 excluded=647


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  47%|‚ñà‚ñà‚ñà‚ñà‚ñã     | 44/93 [25:00<13:14, 16.21s/it]

Chunk 43/93 (idx 43): 725 selected | name=52 bio=450 domain=396 excluded=950


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  48%|‚ñà‚ñà‚ñà‚ñà‚ñä     | 45/93 [25:16<12:55, 16.17s/it]

Chunk 44/93 (idx 45): 560 selected | name=51 bio=378 domain=252 excluded=524


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  49%|‚ñà‚ñà‚ñà‚ñà‚ñâ     | 46/93 [25:19<09:37, 12.30s/it]

Chunk 45/93 (idx 41): 1,246 selected | name=97 bio=818 domain=566 excluded=1402


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  51%|‚ñà‚ñà‚ñà‚ñà‚ñà     | 47/93 [27:13<32:49, 42.81s/it]

Chunk 46/93 (idx 46): 1,436 selected | name=78 bio=1031 domain=588 excluded=965


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  52%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè    | 48/93 [27:43<29:14, 38.98s/it]

Chunk 47/93 (idx 48): 1,044 selected | name=82 bio=594 domain=563 excluded=897


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  53%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé    | 49/93 [27:49<21:19, 29.09s/it]

Chunk 48/93 (idx 49): 1,024 selected | name=85 bio=563 domain=572 excluded=1114


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  54%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç    | 50/93 [27:55<15:51, 22.13s/it]

Chunk 49/93 (idx 47): 1,197 selected | name=90 bio=692 domain=655 excluded=972


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  55%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç    | 51/93 [28:02<12:17, 17.55s/it]

Chunk 50/93 (idx 51): 766 selected | name=54 bio=433 domain=444 excluded=805


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  56%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå    | 52/93 [28:13<10:40, 15.62s/it]

Chunk 51/93 (idx 50): 1,096 selected | name=107 bio=614 domain=597 excluded=1046


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  57%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã    | 53/93 [28:30<10:43, 16.10s/it]

Chunk 52/93 (idx 52): 781 selected | name=56 bio=500 domain=428 excluded=1324


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  58%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä    | 54/93 [28:40<09:20, 14.37s/it]

Chunk 53/93 (idx 55): 790 selected | name=61 bio=446 domain=458 excluded=767


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  59%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ    | 55/93 [28:46<07:25, 11.74s/it]

Chunk 54/93 (idx 53): 967 selected | name=77 bio=606 domain=522 excluded=1134


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  60%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà    | 56/93 [29:04<08:21, 13.55s/it]

Chunk 55/93 (idx 54): 831 selected | name=76 bio=461 domain=463 excluded=1001


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  61%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè   | 57/93 [30:22<19:40, 32.79s/it]

Chunk 56/93 (idx 56): 801 selected | name=75 bio=428 domain=452 excluded=1114


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  62%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè   | 58/93 [30:38<16:18, 27.97s/it]

Chunk 57/93 (idx 57): 625 selected | name=57 bio=354 domain=343 excluded=882


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  63%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé   | 59/93 [31:08<16:12, 28.61s/it]

Chunk 58/93 (idx 58): 849 selected | name=62 bio=527 domain=434 excluded=911


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  65%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç   | 60/93 [31:09<11:10, 20.33s/it]

Chunk 59/93 (idx 60): 757 selected | name=55 bio=473 domain=387 excluded=874


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  66%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå   | 61/93 [31:28<10:38, 19.97s/it]

Chunk 60/93 (idx 61): 657 selected | name=59 bio=351 domain=413 excluded=565


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  67%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã   | 62/93 [31:36<08:24, 16.27s/it]

Chunk 61/93 (idx 59): 1,082 selected | name=72 bio=651 domain=588 excluded=871


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  68%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä   | 63/93 [31:58<08:55, 17.84s/it]

Chunk 62/93 (idx 65): 1,085 selected | name=55 bio=758 domain=444 excluded=705


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  69%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ   | 64/93 [31:59<06:12, 12.86s/it]

Chunk 63/93 (idx 64): 769 selected | name=67 bio=427 domain=448 excluded=700


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  70%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ   | 65/93 [32:13<06:07, 13.11s/it]

Chunk 64/93 (idx 62): 1,156 selected | name=126 bio=756 domain=603 excluded=1021


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  71%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà   | 66/93 [32:15<04:24,  9.79s/it]

Chunk 65/93 (idx 63): 835 selected | name=85 bio=457 domain=466 excluded=985


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  72%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè  | 67/93 [33:19<11:21, 26.22s/it]

Chunk 66/93 (idx 66): 986 selected | name=74 bio=648 domain=448 excluded=841


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)


Chunk 67/93 (idx 67): 1,123 selected | name=100 bio=792 domain=596 excluded=1068


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  74%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç  | 69/93 [34:03<09:27, 23.64s/it]

Chunk 68/93 (idx 75): 471 selected | name=29 bio=288 domain=253 excluded=456


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  75%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå  | 70/93 [34:43<10:55, 28.48s/it]

Chunk 69/93 (idx 68): 974 selected | name=70 bio=552 domain=576 excluded=2097


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  77%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã  | 72/93 [35:10<06:52, 19.65s/it]

Chunk 70/93 (idx 70): 1,142 selected | name=81 bio=718 domain=619 excluded=1651
Chunk 71/93 (idx 69): 1,132 selected | name=101 bio=617 domain=682 excluded=2853


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  78%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä  | 73/93 [35:34<07:01, 21.06s/it]

Chunk 72/93 (idx 73): 1,613 selected | name=87 bio=1092 domain=797 excluded=1099


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  80%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ  | 74/93 [35:35<04:47, 15.15s/it]

Chunk 73/93 (idx 71): 1,223 selected | name=103 bio=725 domain=725 excluded=1453


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  81%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà  | 75/93 [36:06<05:59, 19.97s/it]

Chunk 74/93 (idx 74): 1,093 selected | name=75 bio=698 domain=575 excluded=1236


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  82%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè | 76/93 [36:21<05:09, 18.23s/it]

Chunk 75/93 (idx 72): 1,679 selected | name=97 bio=1096 domain=810 excluded=1000


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  83%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé | 77/93 [37:29<08:53, 33.32s/it]

Chunk 76/93 (idx 76): 1,045 selected | name=92 bio=547 domain=639 excluded=1427


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  84%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç | 78/93 [38:19<09:33, 38.27s/it]

Chunk 77/93 (idx 77): 1,426 selected | name=107 bio=873 domain=847 excluded=1954


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  85%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç | 79/93 [38:36<07:26, 31.92s/it]

Chunk 78/93 (idx 78): 1,402 selected | name=86 bio=911 domain=830 excluded=1544


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  86%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå | 80/93 [39:34<08:38, 39.87s/it]

Chunk 79/93 (idx 79): 1,404 selected | name=117 bio=853 domain=752 excluded=1419


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  87%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã | 81/93 [40:04<07:21, 36.77s/it]

Chunk 80/93 (idx 80): 1,390 selected | name=118 bio=769 domain=879 excluded=1109


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  88%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä | 82/93 [40:10<05:01, 27.42s/it]

Chunk 81/93 (idx 81): 1,414 selected | name=114 bio=853 domain=787 excluded=1709


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  89%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ | 83/93 [40:16<03:31, 21.19s/it]

Chunk 82/93 (idx 83): 1,327 selected | name=91 bio=732 domain=832 excluded=978


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  90%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà | 84/93 [40:36<03:07, 20.82s/it]

Chunk 83/93 (idx 85): 1,243 selected | name=100 bio=759 domain=698 excluded=1692


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  92%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñè| 86/93 [41:09<02:00, 17.21s/it]

Chunk 84/93 (idx 82): 1,643 selected | name=139 bio=909 domain=986 excluded=1408
Chunk 85/93 (idx 84): 2,700 selected | name=81 bio=2183 domain=864 excluded=1567


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  94%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñé| 87/93 [41:28<01:44, 17.46s/it]

Chunk 86/93 (idx 86): 1,144 selected | name=76 bio=681 domain=608 excluded=1332


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  95%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñç| 88/93 [42:00<01:50, 22.03s/it]

Chunk 87/93 (idx 88): 1,143 selected | name=76 bio=723 domain=595 excluded=1020


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  96%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñå| 89/93 [42:31<01:38, 24.71s/it]

Chunk 88/93 (idx 87): 1,773 selected | name=106 bio=1073 domain=960 excluded=1230


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  97%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñã| 90/93 [42:54<01:12, 24.03s/it]

Chunk 89/93 (idx 89): 1,684 selected | name=112 bio=1002 domain=937 excluded=633


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  98%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñä| 91/93 [43:15<00:46, 23.19s/it]

Chunk 90/93 (idx 92): 1,473 selected | name=105 bio=681 domain=958 excluded=1126


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks:  99%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñâ| 92/93 [43:19<00:17, 17.45s/it]

Chunk 91/93 (idx 90): 2,045 selected | name=157 bio=1087 domain=1297 excluded=1510


  mask_exclude = titles.str.contains(exclude_re, na=False, regex=True)
Processing chunks: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 93/93 [43:21<00:00, 27.97s/it]


Chunk 92/93 (idx 91): 2,233 selected | name=134 bio=1295 domain=1318 excluded=1427

FILTERING COMPLETE
Original rows:         1,857,355
Filtered rows:           108,150
Reduction:                  94.2%

Rule Matches (with overlaps):
  Named entities:         10,006
  Biography keywords:     58,438
  Domain keywords:        63,646

Excluded (sport/entertainment): 87,705
Domain threshold used:               5

Additional Analysis:
Average rules matched per article: 1.22


Unnamed: 0,id,title,text
0,220,Astronomia,Astronomia\n\nAstronomia √© uma ci√™ncia natural...
1,223,Am√©rica Latina,Am√©rica Latina\n\nA Am√©rica Latina (; ) √© uma ...
2,224,Albino Forjaz de Sampaio,Albino Forjaz de Sampaio\n\nAlbino Maria Perei...
3,226,Anno Domini,Anno Domini\n\nAnno Domini (A.D.) √© uma expres...
5,229,Anarcocapitalismo,Anarcocapitalismo\n\nAnarcocapitalismo (tamb√©m...
...,...,...,...
1857297,7219343,Azerbaij√£o Ocidental (conceito pol√≠tico),Azerbaij√£o Ocidental (conceito pol√≠tico)\n\nAz...
1857305,7219375,Ferdinand Berthier,"Ferdinand Berthier\n\n \nFerdinand Berthier (,..."
1857325,7219482,Camille Dimmer,"Camille Dimmer\n\nCamille Dimmer (Clervaux, 20..."
1857335,7219529,Supremo Tribunal Federal da Su√≠√ßa,Supremo Tribunal Federal da Su√≠√ßa\n\nO Supremo...


### 3.3 Semantic Filtering with Embeddings

This step extends the filtering process beyond explicit keywords by using **semantic similarity**.  
Articles are selected based on how closely their textual content aligns conceptually with  
*Law, Governance, and Ethics* through **multilingual sentence embeddings**.

A **SentenceTransformer model** (`paraphrase-multilingual-MiniLM-L12-v2`) is used to encode both  
Wikipedia article titles and domain-representative **seed sentences**, which serve as semantic anchors.

The design of the seed sentences and conceptual scope was developed with the assistance of **ChatGPT 5**,  
which helped formulate concise and representative textual prompts describing the domain themes  
(*law, governance, ethics, and logic*).

---

#### Prompt used with ChatGPT 5

> ‚ÄúCreate a set of short Portuguese sentences that represent the main themes of the  
> *Law, Governance, and Ethics* macrodomain.  
> Include references to legal systems, justice, human rights, public governance,  
> moral philosophy, and logical reasoning.  
> The sentences should capture semantic meaning suitable for use as embedding seeds.‚Äù

These seed sentences were encoded and compared against article titles using **cosine similarity**.  
Articles whose maximum similarity exceeded the threshold (`0.48`) were retained as semantically relevant.

---

* This semantic approach complements keyword filtering by capturing **contextual and conceptual matches**  
beyond explicit terminology, improving the coverage and precision of the final Wikipedia subset.


In [None]:
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Define macrodomain seed sentences
seed_texts = [
    # --- Legal Systems & Jurisprudence ---
    "direito civil, penal e constitucional, estudo das leis e da justi√ßa",
    "tribunais, constitui√ß√£o, legisla√ß√£o e aplica√ß√£o do direito",
    "jurisprud√™ncia, processos judiciais e princ√≠pios jur√≠dicos",
    "direitos humanos, igualdade perante a lei e cidadania",

    # --- Governance & Public Institutions ---
    "governan√ßa democr√°tica, estado de direito e pol√≠ticas p√∫blicas",
    "poder legislativo, executivo e judici√°rio na administra√ß√£o p√∫blica",
    "√©tica p√∫blica, responsabilidade pol√≠tica e combate √† corrup√ß√£o",
    "rela√ß√µes internacionais sob a √≥tica jur√≠dica e diplom√°tica",

    # --- Moral Philosophy & Ethics ---
    "filosofia moral, √©tica normativa e dilemas √©ticos contempor√¢neos",
    "teorias √©ticas como utilitarismo, deontologia e virtudes morais",
    "bio√©tica, justi√ßa social e responsabilidade individual e coletiva",
    "√©tica empresarial e governan√ßa corporativa respons√°vel",

    # --- Logic & Critical Thinking ---
    "fal√°cias l√≥gicas, racioc√≠nio dedutivo e an√°lise de argumentos",
    "pensamento cr√≠tico e filosofia da raz√£o aplicada √† √©tica e ao direito",
]

seed_emb = model.encode(seed_texts, convert_to_tensor=True)

> ‚ö†Ô∏è **Warning:** The next Python cell involves **high-performance computing (HPC)**.  
> Execution requires a **dedicated or cloud machine with multiple cores**, not a standard desktop or notebook.  


In [None]:
# Compute cosine similarity with article titles
# titles_emb = model.encode(df_wiki["title"].tolist(), convert_to_tensor=True, batch_size=256)
# torch.save(titles_emb, f"{output_path}/wiki_titles_emb.pt")

# ====== OR ========

# Import saved embeddings
titles_emb = torch.load(f"{output_path}/wiki_titles_emb.pt")

This cell applies the similarity threshold to create a mask of selected articles  
and extracts those most semantically related to the defined domain.  

It prints the total number of selected articles and analyzes the most frequent tokens  
from their titles, providing a quick lexical overview of the filtered subset.  

The resulting `df_semantic` DataFrame contains Wikipedia articles that were  
**selected purely by semantic similarity** rather than explicit keyword matches.

In [12]:
# Compute all cosine similarities (matrix of shape [N_titles, N_seeds])
sims_matrix = util.cos_sim(titles_emb, seed_emb)

# Take the maximum similarity per title (best match with any seed)
sims, _ = torch.max(sims_matrix, dim=1)

# Select threshold (you can tune between 0.35 and 0.45)
threshold = 0.48
mask_semantic = sims > threshold

df_semantic = df_wiki[mask_semantic.cpu().numpy()].copy()

print(f"Selected {len(df_semantic):,} articles above similarity {threshold}")

# Check top tokens
words = re.findall(r'\b[a-z√†-√∫A-Z√Ä-√ö]{4,}\b', " ".join(df_semantic['title'].tolist()).lower())
print("Most common terms in titles:", Counter(words).most_common(30))

df_semantic

Selected 26,428 articles above similarity 0.48
Most common terms in titles: [('lista', 1227), ('governo', 863), ('legislatura', 652), ('tratado', 643), ('direito', 627), ('estado', 626), ('deputados', 604), ('conselho', 603), ('na√ß√µes', 536), ('pol√≠tica', 535), ('rep√∫blica', 511), ('estaduais', 498), ('unidas', 492), ('social', 490), ('justi√ßa', 489), ('estados', 447), ('direitos', 437), ('tribunal', 424), ('internacional', 423), ('nacional', 393), ('seguran√ßa', 389), ('federal', 386), ('para', 381), ('partido', 378), ('rela√ß√µes', 360), ('unidos', 347), ('geral', 336), ('constitui√ß√£o', 331), ('p√∫blico', 331), ('resolu√ß√£o', 320)]


Unnamed: 0,id,title,text
105,367,Administra√ß√£o,Administra√ß√£o\n\nA √© a ci√™ncia social que estu...
122,386,An√°lise matem√°tica,An√°lise matem√°tica\n\nAn√°lise √© o ramo da mate...
144,416,Hino da Proclama√ß√£o da Rep√∫blica,Hino da Proclama√ß√£o da Rep√∫blica\n\nO Hino √† P...
165,460,Behaviorismo radical,Behaviorismo radical\n\nBehaviorismo radical √©...
166,461,Behaviorismo metodol√≥gico,Behaviorismo metodol√≥gico\n\nO Behaviorismo Me...
...,...,...,...
1856973,7218014,Bandeira dos Direitos Humanos,Bandeira dos Direitos Humanos\n\nA Bandeira do...
1857015,7218272,Servi√ßo Diplom√°tico da Let√¥nia no ex√≠lio,Servi√ßo Diplom√°tico da Let√¥nia no ex√≠lio\n\nO ...
1857152,7218800,Instituto de Estudos Pol√≠ticos de Grenoble,Instituto de Estudos Pol√≠ticos de Grenoble\n\n...
1857335,7219529,Supremo Tribunal Federal da Su√≠√ßa,Supremo Tribunal Federal da Su√≠√ßa\n\nO Supremo...


### 3.4 Final Dataset

In this final step, both filtering strategies ‚Äî **keyword-based** and **semantic embedding-based** ‚Äî  
are combined to build the consolidated Wikipedia dataset for the *Law, Governance, and Ethics* macrodomain.

Articles selected by either approach are merged, and **duplicates are removed** to ensure data integrity.  
The resulting dataset captures both **explicit textual relevance** (via keywords and named entities)  
and **implicit semantic alignment** (via embeddings).

The merged DataFrame is exported as `wiki_final.csv`, and basic lexical statistics are computed  
to inspect the most frequent words appearing in article titles.


In [None]:
# Join both approaches - name/bio/domain keywords and semantic similarity
df_final = pd.concat([df_filtered, df_semantic]).drop_duplicates().reset_index(drop=True)

# Count duplicates
duplicates = pd.concat([df_filtered, df_semantic]).duplicated(keep=False).sum()
print(f"‚úÖ Joined datasets with {duplicates:,} duplicates removed.")

# Save final DataFrame
df_final.to_csv(f"{output_path}/wiki_final.csv", index=False)
df_final

‚úÖ Joined datasets with 12,314 duplicates removed.


Unnamed: 0,id,title,text
0,220,Astronomia,Astronomia\n\nAstronomia √© uma ci√™ncia natural...
1,223,Am√©rica Latina,Am√©rica Latina\n\nA Am√©rica Latina (; ) √© uma ...
2,224,Albino Forjaz de Sampaio,Albino Forjaz de Sampaio\n\nAlbino Maria Perei...
3,226,Anno Domini,Anno Domini\n\nAnno Domini (A.D.) √© uma expres...
4,229,Anarcocapitalismo,Anarcocapitalismo\n\nAnarcocapitalismo (tamb√©m...
...,...,...,...
128416,7215469,O Condenado (1921),O Condenado (1921)\n\nO Condenado √© um filme m...
128417,7216611,Especialista em regula√ß√£o,Especialista em regula√ß√£o\n\n\n\n
128418,7216852,Conselho Nacional de Combate √† Discrimina√ß√£o,Conselho Nacional de Combate √† Discrimina√ß√£o\n...
128419,7217114,Ovo mundial,Ovo mundial\n\n\n\n
