
# Pipeline de Coleta → Extração → Limpeza → Deduplicação → Shards (Notebook)

Este notebook executa o pipeline em **etapas**, permitindo inspeção e depuração fase a fase.

> **Tema**: violência doméstica no contexto da atuação policial  
> **Entrada**: `crawl_manifest.jsonl` (manifesto deduplicado com URLs, **formato JSONL**)  
> **Saída**: diretório `OUTDIR/` com brutos (`data/raw`), textos limpos (`data/text`), shards JSONL (`data/shards`) e relatórios (`logs/`)


## 1) Configurações

In [None]:
#!pip install jsonlines
#!pip install tabulate
!pip install PyMuPDF pdfminer.six

In [1]:

from pathlib import Path
import os

# Caminhos
MANIFEST = Path("ds_grupos_vulneraveis.jsonl")  # <--- ALTERADO: JSONL
OUTDIR = Path("corpus_out_nb")

# Parâmetros
RATE = 1.0                # requisições por segundo (global)
MAX_WORKERS = 6           # threads de download/extração
TIMEOUT = 15              # timeout por requisição (s)
MIN_CHARS = 300           # mínimo de caracteres por doc (após limpeza)
SHARD_SIZE_MB = 100.0     # tamanho alvo de cada shard
LOG_LEVEL = "INFO"        # "DEBUG" para mais verbosidade
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119 Safari/537.36"

# Preparar diretórios de saída
for sub in ["data/raw", "data/text", "data/shards", "logs"]:
    (OUTDIR / sub).mkdir(parents=True, exist_ok=True)

print("OUTDIR:", OUTDIR.resolve())


OUTDIR: /home/ricardo/Documentos/PESSOAL/Mestrado/Dataset/corpus_out_nb


## 2) Imports e utilitários

In [2]:

import csv, re, json, time, hashlib, logging
from urllib.parse import urlparse, urlunparse
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests
from bs4 import BeautifulSoup

# Adicionar importação de jsonlines
try:
    import jsonlines
except ImportError:
    logging.error("A biblioteca 'jsonlines' é necessária. Instale com 'pip install jsonlines'")
    jsonlines = None

# Dependências opcionais (tratadas com fallback)
try:
    import trafilatura
except Exception:
    trafilatura = None

try:
    import fitz  # PyMuPDF
except Exception:
    fitz = None

try:
    from pdfminer.high_level import extract_text as pdfminer_extract_text
except Exception:
    pdfminer_extract_text = None

# Logging
logging.basicConfig(level=getattr(logging, LOG_LEVEL.upper(), logging.INFO),
                    format="%(asctime)s %(levelname)s: %(message)s")
log = logging.getLogger("pipeline_nb")

# Sessão HTTP com retries e headers de navegador
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(timeout: int, user_agent: str):
    s = requests.Session()
    retries = Retry(total=3, backoff_factor=0.8, status_forcelist=[429,500,502,503,504])
    s.mount("http://", HTTPAdapter(max_retries=retries))
    s.mount("https://", HTTPAdapter(max_retries=retries))
    s.headers.update({
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "pt-BR,pt;q=0.9,en;q=0.8",
        "Connection": "keep-alive",
    })
    orig = s.request
    def wrapped(method, url, **kwargs):
        kwargs.setdefault("timeout", timeout)
        return orig(method, url, **kwargs)
    s.request = wrapped
    return s

session = build_session(TIMEOUT, USER_AGENT)

def sha1(s: str) -> str: return hashlib.sha1(s.encode("utf-8")).hexdigest()
def sha256(s: str) -> str: return hashlib.sha256(s.encode("utf-8")).hexdigest()

def norm_url(u: str) -> str:
    p = urlparse(u); path = p.path.rstrip("/")
    return urlunparse((p.scheme, p.netloc.lower(), path, "", "", ""))

# Rate limiter simples (token bucket)
class RateLimiter:
    def __init__(self, rate_per_sec: float):
        import threading
        self.rate = max(rate_per_sec, 0.1)
        self.tokens = self.rate
        self.last = time.monotonic()
        self.lock = threading.Lock()
    def acquire(self):
        with self.lock:
            while True:
                now = time.monotonic()
                elapsed = now - self.last
                self.tokens = min(self.rate, self.tokens + elapsed * self.rate)
                self.last = now
                if self.tokens >= 1.0:
                    self.tokens -= 1.0
                    return
                time.sleep((1.0 - self.tokens) / self.rate)

limiter = RateLimiter(RATE)

print("Setup complete.")


Setup complete.


## 3) Checagem de `robots.txt` (com timeout)

In [3]:

from urllib import robotparser

class RobotsCache:
    """Busca robots.txt com requests e timeout; se falhar, assume permitido."""
    def __init__(self, user_agent: str, timeout: int = 15):
        self.user_agent = user_agent
        self.timeout = timeout
        self.cache = {}
    def can_fetch(self, url: str) -> bool:
        p = urlparse(url)
        base = f"{p.scheme}://{p.netloc}"
        if base in self.cache:
            rp = self.cache[base]
            if rp is None: return True
            try:
                return rp.can_fetch(self.user_agent, url)
            except Exception:
                return True
        robots_url = base + "/robots.txt"
        rp = robotparser.RobotFileParser()
        try:
            resp = session.get(robots_url)  # session já tem timeout
            if resp.status_code != 200 or not resp.content:
                self.cache[base] = None
                return True
            rp.parse(resp.text.splitlines())
            self.cache[base] = rp
            return rp.can_fetch(self.user_agent, url)
        except Exception:
            self.cache[base] = None
            return True

robots = RobotsCache(USER_AGENT, timeout=TIMEOUT)

print("Setup complete.")

Setup complete.


## 4) Carregar manifesto e deduplicar URLs (Lendo JSONL)

In [3]:

import pandas as pd

if not jsonlines:
    raise ImportError("A biblioteca 'jsonlines' não foi importada corretamente. Verifique a instalação.")

records = []
try:
    with open(MANIFEST, 'r', encoding='utf-8') as f:
        reader = jsonlines.Reader(f)
        for obj in reader:
            # Garantir que a URL e o título existam para validade básica
            if 'url' in obj and 'title' in obj:
                records.append(obj)
            else:
                log.warning(f"Registro ignorado no manifesto JSONL (Falta URL/Title). Chaves: {obj.keys()}")
except FileNotFoundError:
    log.error(f"Arquivo manifesto não encontrado: {MANIFEST}")
except Exception as e:
    log.error(f"Erro ao ler o manifesto JSONL: {e}")

# Converter para DataFrame para facilidade de manipulação Pandas
df = pd.DataFrame(records)

if df.empty:
    log.error("Nenhum registro válido encontrado no manifesto. Finalizando.")
    dedup = pd.DataFrame()
else:
    df.columns = [c.strip().lower() for c in df.columns]
    df["url_norm"] = df["url"].apply(norm_url)
    
    # Deduplicação baseada na URL normalizada
    dedup = df.drop_duplicates(subset=["url_norm"]).reset_index(drop=True)

print(f"Registros no Manifesto (originais): {len(df)}")
print(f"Registros Únicos (após dedup por URL): {len(dedup)}")
print("\nPrimeiros 3 registros (deduplicados):\n")
print(dedup.head(3).to_markdown(index=False))


Registros no Manifesto (originais): 759
Registros Únicos (após dedup por URL): 184

Primeiros 3 registros (deduplicados):

| title                                                                                        | url                                                                                                         | date       | source_type            | type   |   group | url_norm                                                                                                    |
|:---------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------|:-----------|:-----------------------|:-------|--------:|:------------------------------------------------------------------------------------------------------------|
| Portaria DGP 08/2022 (PC/SP): Tratamento Específico a Travestis e Transexuais nas Delegacias | https://www.stj.jus.br/sites/por

## 5) Funções de extração e limpeza

In [None]:

import fitz, pdfminer.high_level

EMAIL_RE = re.compile(r"\b[\w\.-]+@[\w\.-]+\.\w{2,}\b", re.IGNORECASE)
CPF_RE = re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b")
CNPJ_RE = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")
MULTISPACE = re.compile(r"[ \t]+")
MULTINEWLINE = re.compile(r"\n{3,}")

def clean_text(s: str) -> str:
    if not s: return ""
    s = s.replace("\ufeff", "").replace("-\n","")
    s = MULTISPACE.sub(" ", s)
    s = MULTINEWLINE.sub("\n\n", s)
    s = EMAIL_RE.sub("<EMAIL>", s)
    s = CPF_RE.sub("<CPF>", s)
    s = CNPJ_RE.sub("<CNPJ>", s)
    return s.strip()

def extract_html_main(html_bytes: bytes) -> str:
    if trafilatura is not None:
        try:
            txt = trafilatura.extract(html_bytes, include_comments=False, include_tables=False, favor_recall=True)
            if txt and txt.strip():
                return txt.strip()
        except Exception:
            pass
    try:
        soup = BeautifulSoup(html_bytes, "lxml")
        for tag in soup(["script","style","nav","header","footer","noscript","aside"]):
            tag.decompose()
        return soup.get_text("\n").strip()
    except Exception:
        return ""

def extract_pdf_text(pdf_path: str) -> str:
    if fitz is not None:
        try:
            doc = fitz.open(pdf_path)
            parts = [page.get_text("text") for page in doc]
            out = "\n".join(parts)
            if out.strip():
                return out.strip()
        except Exception as e:
            print(f"fitz extraction failed. ERRO: {e}") 
            
    if pdfminer_extract_text is not None:
        try:
            out = pdfminer_extract_text(pdf_path) or ""
            if out.strip():
                return out.strip()
        except Exception as e:
            print(f"pdfminer extraction failed. ERRO: {e}")
    
    print("Nenhum método de extração funcionou.")
    return ""

print("Text extraction functions defined.")

Text extraction functions defined.


## 6) Baixar e extrair (HTML/PDF)

In [5]:
from tqdm import tqdm
from concurrent.futures import ThreadPoolExecutor, as_completed
# Assume-se que 'RAW_DIR', 'TEXT_DIR', 'limiter', 'session', 'sha256', 
# 'extract_pdf_text', 'extract_html_main', 'clean_text', 'MIN_CHARS', 
# e 'dedup' estão definidos em células anteriores.

RAW_DIR = OUTDIR / "data" / "raw"
TEXT_DIR = OUTDIR / "data" / "text"
results = []
download_results = [] # Lista para armazenar resultados da Fase 1

# --- Método 1: Download e Salvamento do Arquivo Bruto ---
def download_raw(row):
    """Realiza a requisição HTTP, salva o arquivo bruto e retorna o status e o caminho."""
    url = row["url"]
    item_type = row.get("type", "").lower()
    url_norm = row["url_norm"] # Usando url_norm já calculada

    # O true pula a verificação de robots.txt
    if not True: #robots.can_fetch(url): 
        return {"url": url, "status": "robots_disallow"}
    
    limiter.acquire()
    
    try:
        resp = session.get(url)
        ct = resp.headers.get("Content-Type", "")
        content = resp.content
        status = resp.status_code
    except Exception as e:
        return {"url": url, "status": f"fetch_error:{e}"}
    
    if status != 200 or not content:
        return {"url": url, "status": f"http_{status}"}
    
    # Determinação do tipo de arquivo, usando 'type' do manifesto como prioridade
    is_pdf = (item_type == "pdf")
    ext = ".pdf" if is_pdf else ".html"
    fname = sha256(url_norm)[:20] + ext
    raw_path = RAW_DIR / fname
    
    try:
        raw_path.write_bytes(content)
    except Exception as e:
        return {"url": url, "status": f"save_raw_error:{e}"}

    print(f"Downloaded and saved: {url} -> {raw_path} -> {item_type}")    
    return {
        "url": url, 
        "status": "download_ok", 
        "raw_path": str(raw_path),
        "is_pdf": is_pdf,
        "content": content # Passa o conteúdo para evitar ler o disco novamente para HTML
    }


# 1. FASE DE DOWNLOAD
print("\n--- FASE 1: DOWNLOAD DOS ARQUIVOS BRUTOS ---")
if 'dedup' in locals() and not dedup.empty:
    dedup_list = dedup.to_dict(orient="records")
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex_fetch:
        futs_fetch = [ex_fetch.submit(download_raw, r) for r in dedup_list]
        for fut_fetch in tqdm(as_completed(futs_fetch), total=len(futs_fetch), desc="Baixando Brutos"):
            download_results.append(fut_fetch.result())

else:
    log.warning("Nenhuma URL para processar após a deduplicação.")



# Saída do bloco original
print(f"\nTotal de resultados: {len(results)}")
print(f"Primeiros 3 resultados: {results[:3]}")

len(results), results[:3]


--- FASE 1: DOWNLOAD DOS ARQUIVOS BRUTOS ---


Baixando Brutos:   1%|          | 1/184 [00:01<04:40,  1.53s/it]

Downloaded and saved: https://www.stj.jus.br/sites/portalp/WebPub/NovoPortal/assets/flip-page/panorama_9_2024_noticias_2023.pdf -> corpus_out_nb/data/raw/f3b1ee82c4c4c7eeae02.pdf -> pdf


Baixando Brutos:   1%|          | 2/184 [00:02<03:20,  1.10s/it]

Downloaded and saved: https://www.ssp.pi.gov.br/wp-content/uploads/2025/06/SSP01_fa3d283ba1.pdf -> corpus_out_nb/data/raw/12ebadbd73215f42a57d.pdf -> pdf
Downloaded and saved: https://social.rs.gov.br/upload/arquivos/202212/06142737-2021-12-procedoperacionalpadrao-pop-idoso-pcdf.pdf -> corpus_out_nb/data/raw/3e74f3867048c2972726.pdf -> pdf


Baixando Brutos:   2%|▏         | 4/184 [00:03<02:06,  1.42it/s]

Downloaded and saved: https://news.un.org/pt/story/2024/10/1838571 -> corpus_out_nb/data/raw/f394fb49d7416d15c71c.html -> html


Baixando Brutos:   4%|▍         | 7/184 [00:06<02:52,  1.03it/s]

Downloaded and saved: https://www.gov.br/mj/pt-br/assuntos/noticias/mjsp-e-acnur-divulgam-relatorios-sobre-refugio-e-deslocamento-forcado-no-brasil-e-no-mundo -> corpus_out_nb/data/raw/9de4ddf770e95351dfb2.html -> html


Baixando Brutos:   4%|▍         | 8/184 [00:07<02:32,  1.15it/s]

Downloaded and saved: https://goias.gov.br/imb/wp-content/uploads/sites/29/2018/07/POP_Limites_Versao_001-982.pdf -> corpus_out_nb/data/raw/247d544f1d548fb31e66.pdf -> pdf


Baixando Brutos:   5%|▍         | 9/184 [00:08<02:16,  1.28it/s]

Downloaded and saved: https://www.gov.br/participamaisbrasil/blob/baixar/2100 -> corpus_out_nb/data/raw/046f13982dc3ab0d8917.html -> html


Baixando Brutos:   5%|▌         | 10/184 [00:09<02:39,  1.09it/s]

Downloaded and saved: https://www.mpf.mp.br/atuacao-tematica/ccr6/enunciados -> corpus_out_nb/data/raw/968f41fb52a5a484f312.html -> html


Baixando Brutos:   7%|▋         | 12/184 [00:11<02:30,  1.14it/s]

Downloaded and saved: https://www.planalto.gov.br/ccivil_03/leis/2003/l10.741.htm -> corpus_out_nb/data/raw/e719b3a708017147da53.html -> html


Baixando Brutos:   7%|▋         | 13/184 [00:12<02:39,  1.07it/s]

Downloaded and saved: https://www.planalto.gov.br/ccivil_03/_ato2015-2018/2015/lei/l13146.htm -> corpus_out_nb/data/raw/3b7e80d9119d6c0f76fe.html -> html


Baixando Brutos:   8%|▊         | 14/184 [00:13<02:42,  1.05it/s]

Downloaded and saved: https://www.gov.br/mj/pt-br/assuntos/sua-seguranca/seguranca-publica/analise-e-pesquisa/download/estudos/pspvolume5/filtragem_racial_selecao_policial_suspeitos.pdf -> corpus_out_nb/data/raw/50ef9e7853707dcc7e3d.pdf -> pdf


Baixando Brutos:   8%|▊         | 15/184 [00:14<02:55,  1.04s/it]

Downloaded and saved: https://antrabrasil.org/wp-content/uploads/2020/03/manual-de-seguranc387a-pc39ablica-atendimento-e-abordagem-lgbti.pdf -> corpus_out_nb/data/raw/12a2baa6475306526b56.pdf -> pdf


Baixando Brutos:   9%|▊         | 16/184 [00:15<03:06,  1.11s/it]

Downloaded and saved: https://www.coede.pr.gov.br/sites/coede/arquivos_restritos/files/documento/2023-05/nota_de_instrucao_no_001-2022_-_procedimentos_a_serem_observados_em_ocorrencias_envolvendo_pessoa_com_transtorno_do_espectro_autista_tea_1.pdf -> corpus_out_nb/data/raw/f91bccf104cebdc7a31e.pdf -> pdf


Baixando Brutos:   9%|▉         | 17/184 [00:17<03:40,  1.32s/it]

Downloaded and saved: https://ri.unipac.br/repositorio/wp-content/uploads/tainacan-items/282/129550/THIAGO-FREDERICO-MENDONCA-ABORDAGEM-POLICIAL-A-PESSOAS-COM-DEFICIENCIA.pdf -> corpus_out_nb/data/raw/d8eb2d6b9e806eb8ca82.pdf -> pdf


Baixando Brutos:  10%|▉         | 18/184 [00:17<02:52,  1.04s/it]

Downloaded and saved: https://www.planalto.gov.br/ccivil_03/_ato2015-2018/2017/lei/l13445.htm -> corpus_out_nb/data/raw/51522d2d8105b7dba399.html -> html


Baixando Brutos:  10%|█         | 19/184 [00:18<02:13,  1.24it/s]

Downloaded and saved: https://www.planalto.gov.br/ccivil_03/leis/L9474.htm -> corpus_out_nb/data/raw/3ce06512d4e4ec77f988.html -> html


Baixando Brutos:  11%|█         | 20/184 [00:19<02:19,  1.17it/s]

Downloaded and saved: https://www.planalto.gov.br/ccivil_03/_ato2015-2018/2015/lei/l13104.htm -> corpus_out_nb/data/raw/049da488d622b4d861a0.html -> html


Baixando Brutos:  11%|█▏        | 21/184 [00:20<02:41,  1.01it/s]

Downloaded and saved: https://www.mpac.mp.br/wp-content/uploads/Manual-de-abordagem-a-crianca-e-ao-adolescente-CEL-PM-Gilson-Santiago-Messias-CEDECA-BA.pdf -> corpus_out_nb/data/raw/5c9c0eafc75b6df2a2b8.pdf -> pdf


Baixando Brutos:  15%|█▍        | 27/184 [00:28<02:50,  1.08s/it]

Downloaded and saved: https://terradedireitos.org.br/uploads/arquivos/PROTOCOLO_CONSULTA_WEB-min.pdf -> corpus_out_nb/data/raw/b063cdbd449fb6cdef67.pdf -> pdf


Baixando Brutos:  16%|█▋        | 30/184 [00:30<01:56,  1.32it/s]

Downloaded and saved: https://www.pcdf.df.gov.br/images/Folders/cartilhaLGBTQI_FINAL_lancamento.pdf -> corpus_out_nb/data/raw/99d93e6f72a55d068e43.pdf -> pdf


Baixando Brutos:  17%|█▋        | 31/184 [00:31<02:01,  1.26it/s]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2021/06/manual_resolucao348_LGBTI.pdf -> corpus_out_nb/data/raw/33ae05366615c6ff60c7.pdf -> pdf


Baixando Brutos:  17%|█▋        | 32/184 [00:32<02:01,  1.25it/s]

Downloaded and saved: https://www.gov.br/pf/pt-br/assuntos/seguranca-privada/legislacao-normas-e-orientacoes/cartilha-seguranca-sem-preconceito/cartilha-seguranca-sem-preconceito.pdf -> corpus_out_nb/data/raw/bf4e87175249a7689e57.pdf -> pdf


Baixando Brutos:  18%|█▊        | 33/184 [00:33<02:11,  1.15it/s]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2025/06/relatoria-audiencia-publica-quilombolas-1jun25.pdf -> corpus_out_nb/data/raw/3b08dd64c64f5d74a23f.pdf -> pdf


Baixando Brutos:  20%|█▉        | 36/184 [00:35<02:02,  1.21it/s]

Downloaded and saved: https://www.hrw.org/pt/news/2023/12/15/un-experts-call-brazil-end-brutal-police-violence -> corpus_out_nb/data/raw/aaa6d1492af0ec687685.html -> html


Baixando Brutos:  20%|██        | 37/184 [00:37<02:29,  1.02s/it]

Downloaded and saved: https://www.dn.pt/internacional/desigualdade-racismo-e-violencia-afetam-direitos-humanos-no-brasil -> corpus_out_nb/data/raw/908e3286d7ce01cfff1a.html -> html


Baixando Brutos:  21%|██        | 38/184 [00:37<02:02,  1.19it/s]

Downloaded and saved: https://www5.bahiana.edu.br/index.php/psicologia/article/view/5809/5337 -> corpus_out_nb/data/raw/0120247f8e5c526a7cf3.html -> html


Baixando Brutos:  21%|██        | 39/184 [00:38<02:03,  1.18it/s]

Downloaded and saved: https://www.tjdft.jus.br/informacoes/cidadania/nucleo-judiciario-da-mulher/documentos-e-links/arquivos/livro-njm_pmdf.pdf -> corpus_out_nb/data/raw/937f032850e506a7a6b7.pdf -> pdf


Baixando Brutos:  22%|██▏       | 40/184 [00:39<02:01,  1.18it/s]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2025/08/concurso-nacional-djadh-v2.pdf -> corpus_out_nb/data/raw/ba192250deb3e66bd6e3.pdf -> pdf




Downloaded and saved: https://www.saude.sp.gov.br/resources/instituto-de-saude/homepage/pdfs/temassaudecoletiva25.pdf -> corpus_out_nb/data/raw/086d5480b0bb2df265d9.pdf -> pdf


Baixando Brutos:  23%|██▎       | 42/184 [00:42<02:55,  1.24s/it]

Downloaded and saved: https://www.fadileste.edu.br/revistavox/index.php/revistavox/article/download/105/91/125 -> corpus_out_nb/data/raw/33df62dd62407e958ff8.pdf -> pdf


Baixando Brutos:  23%|██▎       | 43/184 [00:43<02:24,  1.02s/it]

Downloaded and saved: https://www.sgb.gov.br/documents/d/guest/cartilha_pgf_assedio_sexual_agosto_lilas_revisado_final-pdf -> corpus_out_nb/data/raw/f9dcce05ac7e2a7b754b.pdf -> pdf


Baixando Brutos:  24%|██▍       | 44/184 [00:44<02:05,  1.12it/s]

Downloaded and saved: https://www.tjdft.jus.br/informacoes/cidadania/nucleo-judiciario-da-mulher/documentos-e-links/normas-tecnicas/nota-tecnica-14-2024-medidas-protetivas-de-urgencia-sua-autonomia-e-prazo-de-duracao.pdf -> corpus_out_nb/data/raw/342c59451685a5eeb0fa.pdf -> pdf




Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2025/06/manual-tratamento-indigenas-adolescentes-jovens.pdf -> corpus_out_nb/data/raw/967b28ea6b87b31598dc.pdf -> pdf


Baixando Brutos:  26%|██▌       | 47/184 [00:48<02:49,  1.24s/it]

Downloaded and saved: https://pc.es.gov.br/Media/PCES/Legislação/PORTARIA%20Nº%2080-S%20-%20DEPPI.pdf -> corpus_out_nb/data/raw/5ab735bf8f1634c78d0d.pdf -> pdf


Baixando Brutos:  26%|██▌       | 48/184 [00:49<02:11,  1.03it/s]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2024/11/protocolo-para-julgamento-com-perspectiva-racial-2.pdf -> corpus_out_nb/data/raw/10598a25a3d17742a2b2.pdf -> pdf


Baixando Brutos:  27%|██▋       | 49/184 [00:50<02:10,  1.03it/s]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2024/06/protocolo-tecnico-interacao-pessoa-tea.pdf -> corpus_out_nb/data/raw/9a19b5bacd59418ea6eb.pdf -> pdf


Baixando Brutos:  28%|██▊       | 51/184 [00:51<01:39,  1.34it/s]

Downloaded and saved: https://www.planalto.gov.br/ccivil_03/_ato2023-2026/2023/lei/L14532.htm -> corpus_out_nb/data/raw/9e94d69d18eb7285064c.html -> html


Baixando Brutos:  29%|██▉       | 54/184 [00:53<01:47,  1.21it/s]

Downloaded and saved: https://congesp.rn.gov.br/anais/v-16/politicas-publicas-e-desenvolvimento-sustentavel/as-inovacoes-alteracoes-na-lei-maria-da-penha-o-atendimento-policial-militar-nos-casos-de-violencia-domestica.pdf -> corpus_out_nb/data/raw/b4d82409a0cda3940a1f.pdf -> pdf


Baixando Brutos:  30%|██▉       | 55/184 [00:54<01:38,  1.31it/s]

Downloaded and saved: https://www.gov.br/mj/pt-br/centrais-de-conteudo/publicacoes/categorias-de-publicacoes/manuais/diretrizes_nacionais_para_o_atendimento_policial_militar_as_mulheres_21_junho_2022-versao-final-1.pdf -> corpus_out_nb/data/raw/b87c735cde6a8ee0868c.pdf -> pdf


Baixando Brutos:  30%|███       | 56/184 [00:55<01:39,  1.29it/s]

Downloaded and saved: https://site.mppr.mp.br/sites/hotsites/arquivos_restritos/files/migrados/File/publi/caopca/lei_13431_comentada_jun2018.pdf -> corpus_out_nb/data/raw/c314da411f30a4f00eb7.pdf -> pdf


Baixando Brutos:  31%|███       | 57/184 [00:57<02:21,  1.11s/it]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2025/10/policia-judicial-antirracista-protocolo.pdf -> corpus_out_nb/data/raw/47e5ee5c3db2491ce6b6.pdf -> pdf


Baixando Brutos:  32%|███▏      | 58/184 [01:00<03:28,  1.66s/it]

Downloaded and saved: https://www.mprs.mp.br/media/areas/evcm/arquivos/padronizacao_nacional_das_deams.pdf -> corpus_out_nb/data/raw/aa74f030d3b232340bce.pdf -> pdf


Baixando Brutos:  34%|███▎      | 62/184 [01:04<02:04,  1.02s/it]

Downloaded and saved: http://www.onumulheres.org.br/wp-content/uploads/2023/05/guia_para_acolhimento_de_migrantes_refugiadas_refugiados.pdf -> corpus_out_nb/data/raw/8b10a6d0fc22bc11f467.pdf -> pdf


Baixando Brutos:  34%|███▍      | 63/184 [01:04<01:35,  1.27it/s]

Downloaded and saved: https://www.icmpd.org/file/download/54254/file/MT%2520Brasil%2520-%2520Guia%2520de%2520Atendimento.pdf -> corpus_out_nb/data/raw/bf2806b354e7e5239303.pdf -> pdf


Baixando Brutos:  35%|███▍      | 64/184 [01:06<01:59,  1.00it/s]

Downloaded and saved: https://app1.sefaz.mt.gov.br/Sistema/legislacao/LeiComplEstadual.nsf/9733a1d3f5bb1ab384256710004d4754/d0fcc124ca2a33e18425775700750cc3?OpenDocument -> corpus_out_nb/data/raw/5aa5b8f047fbce2ec6e0.html -> html


Baixando Brutos:  35%|███▌      | 65/184 [01:06<01:44,  1.14it/s]

Downloaded and saved: https://central.to.gov.br/download/398869 -> corpus_out_nb/data/raw/78e818edaf8f0c48e543.pdf -> pdf


Baixando Brutos:  36%|███▌      | 66/184 [01:08<02:03,  1.05s/it]

Downloaded and saved: https://www.gov.br/mdh/pt-br/assuntos/noticias/2022/junho/guia-direitos-humanos-e-os-sistemas-de-seguranca-publica-socioeducativo-e-penitenciario-portugues.pdf -> corpus_out_nb/data/raw/779df5279dc6ee5c95eb.pdf -> pdf


Baixando Brutos:  36%|███▋      | 67/184 [01:08<01:40,  1.17it/s]

Downloaded and saved: https://bibliotecadigital.mdh.gov.br/jspui/bitstream/192/8729/1/comissao_direitos_humanos.pdf -> corpus_out_nb/data/raw/4ccfe7447a0ecc97b822.pdf -> pdf


Baixando Brutos:  37%|███▋      | 68/184 [01:09<01:37,  1.19it/s]

Downloaded and saved: https://camposnovos.sc.gov.br/uploads/sites/405/2021/12/867067_Plano_Decenal_Campos_Novos__2017___2026_1.pdf -> corpus_out_nb/data/raw/bd44eabde9a8ebc71927.pdf -> pdf


Baixando Brutos:  38%|███▊      | 70/184 [01:11<01:45,  1.08it/s]

Downloaded and saved: https://www.tjdft.jus.br/informacoes/cidadania/nucleo-judiciario-da-mulher/documentos-e-links/arquivos/livro-contribuicoes-para-a-formacao-de-profissionais-da-seguranca-publica-do-enfrentamento-da-vdfcm_2-edicao.pdf -> corpus_out_nb/data/raw/cf5a4d2f0c4b497bccef.pdf -> pdf


Baixando Brutos:  39%|███▊      | 71/184 [01:12<01:37,  1.16it/s]

Downloaded and saved: https://www.gov.br/mj/pt-br/assuntos/sua-seguranca/seguranca-publica/operacoes-integradas/cgoi/vips/relatorio-geral-da-operacao-virtude-2024.pdf -> corpus_out_nb/data/raw/6d474b60b1b0f0f8b38a.pdf -> pdf


Baixando Brutos:  39%|███▉      | 72/184 [01:13<01:45,  1.06it/s]

Downloaded and saved: https://noticias.stf.jus.br/postsnoticias/supremo-define-que-abordagem-policial-motivada-por-cor-da-pele-e-ilegal/ -> corpus_out_nb/data/raw/2c27e352d208facc3f73.html -> html


Baixando Brutos:  40%|███▉      | 73/184 [01:14<01:41,  1.09it/s]

Downloaded and saved: https://www.stj.jus.br/sites/portalp/Paginas/Comunicacao/Noticias/20042022-Revista-pessoal-baseada-em-%E2%80%9Catitude-suspeita%E2%80%9D-e-ilegal--decide-Sexta-Turma.aspx -> corpus_out_nb/data/raw/850592bdef1d78159e3e.html -> html


Baixando Brutos:  41%|████      | 75/184 [01:15<01:19,  1.37it/s]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2023/05/cadernos-stf-igualdade-racial-web-23-05-03.pdf -> corpus_out_nb/data/raw/412decf462a61a0ae266.pdf -> pdf


Baixando Brutos:  41%|████▏     | 76/184 [01:16<01:12,  1.49it/s]

Downloaded and saved: https://www.tjdft.jus.br/institucional/imprensa/campanhas-e-produtos/direito-facil/edicao-semanal/escuta-especializada-x-depoimento-especial -> corpus_out_nb/data/raw/0ab2fb8ae83cb95428a5.html -> html


Baixando Brutos:  42%|████▏     | 77/184 [01:17<01:28,  1.21it/s]

Downloaded and saved: https://esaj.tjsp.jus.br/gcn-frontend-vue/legislacao/find/202735 -> corpus_out_nb/data/raw/ca8d4e6585cbf4cb85d1.html -> html


Baixando Brutos:  42%|████▏     | 78/184 [01:20<02:44,  1.55s/it]

Downloaded and saved: https://www.policiacomunitaria.ms.gov.br/wp-content/uploads/2014/12/LIVRO_MULTIPLICADOR_POLICIA_COMUNITRIA.pdf -> corpus_out_nb/data/raw/96a7bda743463b3b9fe8.pdf -> pdf


Baixando Brutos:  43%|████▎     | 79/184 [01:21<02:20,  1.34s/it]

Downloaded and saved: https://www.prefeitura.sp.gov.br/cidade/secretarias/upload/direitos_humanos/POP_RUA/CPDPopRua%20-%20Revisado.pdf -> corpus_out_nb/data/raw/8d39667c29961e4436cc.pdf -> pdf


Baixando Brutos:  43%|████▎     | 80/184 [01:22<02:08,  1.24s/it]

Downloaded and saved: https://www.mogidascruzes.sp.gov.br/public/site/doc/20250110102418678111729daab.pdf -> corpus_out_nb/data/raw/d2d4e7beaaa5bb76a5ac.pdf -> pdf


Baixando Brutos:  44%|████▍     | 81/184 [01:22<01:41,  1.01it/s]

Downloaded and saved: https://docs.bvsalud.org/biblioref/2023/06/1436927/issue-d03a857a23b5285736c4d55e0bb067c8.pdf -> corpus_out_nb/data/raw/2afa1ab1673fccc0bbf0.pdf -> pdf


Baixando Brutos:  45%|████▍     | 82/184 [01:23<01:30,  1.12it/s]

Downloaded and saved: https://www.tjsp.jus.br/download/EPM/Publicacoes/CadernosJuridicos/cj_n63_05_o%20magistrado%20garantidor%20no%20depoimento%20especial_2p.pdf?d=638070538284252291 -> corpus_out_nb/data/raw/e622eacb90d6c43d4907.pdf -> pdf


Baixando Brutos:  45%|████▌     | 83/184 [01:24<01:24,  1.19it/s]

Downloaded and saved: https://www.tjdft.jus.br/informacoes/infancia-e-juventude/publicacoes-textos-e-artigos/publicacoes/publicacoes-1/ProtocoloAtenIntegralCriancasAdolecentesVitimasViol.pdf -> corpus_out_nb/data/raw/915b3ec7f0ec2e048a49.pdf -> pdf


Baixando Brutos:  46%|████▌     | 84/184 [01:26<02:08,  1.29s/it]

Downloaded and saved: https://forumseguranca.org.br/wp-content/uploads/2020/10/manual-formacao-de-policiais-para-o-enfrentamento-da-violencia-de-genero.pdf -> corpus_out_nb/data/raw/169111217e2454ca8bc2.pdf -> pdf


Baixando Brutos:  46%|████▌     | 85/184 [01:28<02:23,  1.45s/it]

Downloaded and saved: https://www.icrc.org/pt/document/cicv-promove-capacitacao-de-forcas-policiais-e-de-seguranca-em-direitos-humanos -> corpus_out_nb/data/raw/c08630700edc592d61fe.html -> html


Baixando Brutos:  47%|████▋     | 86/184 [01:28<01:51,  1.14s/it]

Downloaded and saved: https://www.gov.br/igualdaderacial/pt-br/assuntos/plano-juventude-negra-viva/2024_Plano_Juventude_Negra_Viva_.pdf -> corpus_out_nb/data/raw/f5ef19a711823c7a4fae.pdf -> pdf


Baixando Brutos:  47%|████▋     | 87/184 [01:29<01:38,  1.01s/it]

Downloaded and saved: https://www.pcdf.df.gov.br/images/DIVICOM/Portaria_86.pdf -> corpus_out_nb/data/raw/2302c6438e2e3139d456.pdf -> pdf


Baixando Brutos:  48%|████▊     | 88/184 [01:30<01:40,  1.05s/it]

Downloaded and saved: https://www.pcdf.df.gov.br/images/conteudo/institucional/TCU/RELATORIO_DE_GESTAO_2021_V5.pdf -> corpus_out_nb/data/raw/7fd153494eb18315d4ce.pdf -> pdf


Baixando Brutos:  48%|████▊     | 89/184 [01:31<01:31,  1.04it/s]

Downloaded and saved: https://www.pc.ms.gov.br/wp-content/uploads/2024/03/DO11430_01_03_2024-pag-16-17.pdf -> corpus_out_nb/data/raw/fc52f8cf7762818bbd9a.pdf -> pdf


Baixando Brutos:  49%|████▉     | 90/184 [01:31<01:19,  1.19it/s]

Downloaded and saved: https://policiacivil.se.gov.br/wp-content/uploads/2025/01/Portaria-no-009_2025-Cria-o-Nucleo-de-Atendimento-a-Mulher-e-Demais-Grupos-Vulneraveis-NAGV-de-Itabaianinha-SE.pdf -> corpus_out_nb/data/raw/b53f8bd249a35e4f6110.pdf -> pdf


Baixando Brutos:  49%|████▉     | 91/184 [01:32<01:05,  1.42it/s]

Downloaded and saved: https://atos.cnj.jus.br/atos/detalhar/3399 -> corpus_out_nb/data/raw/f2dfb9cb6648811b64e5.html -> html


Baixando Brutos:  50%|█████     | 92/184 [01:33<01:23,  1.10it/s]

Downloaded and saved: https://leidominutoseguinte.mpf.mp.br/ -> corpus_out_nb/data/raw/a6f1d9654bbb41ebdd5f.html -> html


Baixando Brutos:  51%|█████     | 93/184 [01:34<01:10,  1.30it/s]

Downloaded and saved: https://www.tjdft.jus.br/institucional/imprensa/noticias/arquivos/ggcorp_manual_dos_fluxos-final_isbn_09maio2025-1.pdf -> corpus_out_nb/data/raw/86f27c8076f7f614b1f3.pdf -> pdf


Baixando Brutos:  51%|█████     | 94/184 [01:35<01:25,  1.05it/s]

Downloaded and saved: https://mpmt.mp.br/site/storage/webdisco/arquivos/PROTOCOLO%20PARA%20INVESTIGAC%CC%A7A%CC%83O%20DE%20CRIMES%20DE%20VIOLE%CC%82NCIA%20DOME%CC%81STICA%20E%20FAMILIAR%20CONTRA%20A%20MULHER%2C%20COM%20PERSPECTIVA%20DE%20GE%CC%82NERO%20(1)_compressed.pdf -> corpus_out_nb/data/raw/6b849444cb08ce619201.pdf -> pdf


Baixando Brutos:  52%|█████▏    | 95/184 [01:37<02:05,  1.41s/it]

Downloaded and saved: https://goias.gov.br/policiacientifica/wp-content/uploads/sites/63/2017/06/Manual.pop_.iml_.pdf -> corpus_out_nb/data/raw/2b329c210891ecb2f794.pdf -> pdf
Downloaded and saved: https://ead.pm.ma.gov.br/pluginfile.php/11527/mod_resource/content/6/14_MANOEL%20MARIA%20PIMENTA%20SILVA%20FILHO%20e%20DIEGO%20FELIPE%20BATISTA%20RIBEIRO.pdf -> corpus_out_nb/data/raw/576f9903021e7468fe30.pdf -> pdf


Baixando Brutos:  53%|█████▎    | 97/184 [01:38<01:25,  1.01it/s]

Downloaded and saved: https://periodicos.pf.gov.br/index.php/CadANP/article/view/6/19 -> corpus_out_nb/data/raw/7857559958be104a093d.pdf -> pdf


Baixando Brutos:  53%|█████▎    | 98/184 [01:40<01:34,  1.10s/it]

Downloaded and saved: https://www.mprj.mp.br/documents/20184/4655937/resolucao_2651.pdf -> corpus_out_nb/data/raw/1613e32eb1457da1521a.pdf -> pdf


Baixando Brutos:  54%|█████▍    | 100/184 [01:41<01:10,  1.19it/s]

Downloaded and saved: https://www.ba.gov.br/policiacivil/sites/site-pcba/files/2024-08/Manual%20de%20Procedimentos%20de%20Pol%C3%ADcia%20Judici%C3%A1ria%20do%20Estado%20da%20Bahia%20-%202%C2%AA%20Edi%C3%A7%C3%A3o.pdf -> corpus_out_nb/data/raw/0960a172def7d021c40b.pdf -> pdf
Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2011/02/b3f18ac2f32a661bd02ca82c1afbe3bb.pdf -> corpus_out_nb/data/raw/56edc9dc6672a608855e.pdf -> pdf


Baixando Brutos:  55%|█████▌    | 102/184 [01:44<01:21,  1.01it/s]

Downloaded and saved: https://www.mpce.mp.br/wp-content/uploads/2015/12/EUROSOCIAL-DIRETRIZES-NACIONAIS-DE-INVESTIGACAO-CRIMINAL.pdf -> corpus_out_nb/data/raw/de851ee1b807f6831f12.pdf -> pdf
Downloaded and saved: https://www.unodc.org/documents/justice-and-prison-reform/Manual_Uso_da_Forca_online2.pdf -> corpus_out_nb/data/raw/872302bc7743f89fad68.pdf -> pdf


Baixando Brutos:  56%|█████▌    | 103/184 [01:44<01:00,  1.34it/s]

Downloaded and saved: https://prefeitura.sp.gov.br/cidade/secretarias/upload/direitos_humanos/Manual%20de%20Atendimendo%20-%20Abrigo.pdf -> corpus_out_nb/data/raw/2b3914a4a8308c1590c0.pdf -> pdf


Baixando Brutos:  57%|█████▋    | 105/184 [01:45<00:48,  1.62it/s]

Downloaded and saved: https://jundiai.sp.gov.br/saude/wp-content/uploads/sites/17/2024/01/rede-de-aten__o-integral-_-mulher-em-situa__o-de-viol_ncia.pdf -> corpus_out_nb/data/raw/ae21c883d2701b7a3a90.pdf -> pdf


Baixando Brutos:  58%|█████▊    | 106/184 [01:46<00:56,  1.39it/s]

Downloaded and saved: https://memoriadigital.mpmg.mp.br/wp-content/uploads/tainacan-items/4829/139679/34_Guia-basico-para-atuacao-ministerial-em-contexto-de-violencia-domestica.pdf -> corpus_out_nb/data/raw/8d27e90d2f22d8def68c.pdf -> pdf


Baixando Brutos:  58%|█████▊    | 107/184 [01:48<01:22,  1.07s/it]

Downloaded and saved: https://sigconteudo.ufsb.edu.br/arquivos/20221651709b6e72594998850bbdc232/Dissertao_Mestrado-_Berg_PPGER_UFSB__1_-_Finalizado_com_ata.pdf -> corpus_out_nb/data/raw/2a90d1e8741d8bdb15c1.pdf -> pdf


Baixando Brutos:  59%|█████▊    | 108/184 [01:49<01:26,  1.14s/it]

Downloaded and saved: https://documents1.worldbank.org/curated/en/099031125103037423/pdf/P181402-a0c1ec80-38c6-45b3-aa5d-8759fb47d837.pdf -> corpus_out_nb/data/raw/a606f8ef68e5dcc4ad48.pdf -> pdf


Baixando Brutos:  59%|█████▉    | 109/184 [01:50<01:13,  1.03it/s]

Downloaded and saved: https://www.stj.jus.br/internet_docs/biblioteca/clippinglegislacao/Res_598_2024_CNJ.pdf -> corpus_out_nb/data/raw/85ff92249ceeeb09e5b2.pdf -> pdf


Baixando Brutos:  60%|█████▉    | 110/184 [01:50<00:57,  1.29it/s]

Downloaded and saved: https://saber.uniftc.edu.br/bitstreams/faf58238-34ec-43b9-a7e2-a2ff4b47ac2d/download -> corpus_out_nb/data/raw/816b12210b559ca8be36.pdf -> pdf


Baixando Brutos:  61%|██████    | 112/184 [01:52<00:55,  1.29it/s]

Downloaded and saved: https://repositorio.idp.edu.br/bitstream/123456789/5437/1/Monografia_ANY%20CAROLINE%20MACHADO%20DE%20OLIVEIRA_Curso%20de%20Direito.pdf -> corpus_out_nb/data/raw/1216f06315956f9bc161.pdf -> pdf


Baixando Brutos:  61%|██████▏   | 113/184 [01:52<00:51,  1.39it/s]

Downloaded and saved: https://revistas.editora.ufcg.edu.br/index.php/jmsi/article/download/841/751/5017 -> corpus_out_nb/data/raw/41cc40bc57bcb10d6844.pdf -> pdf


Baixando Brutos:  62%|██████▏   | 114/184 [01:53<00:57,  1.21it/s]

Downloaded and saved: https://periodicos.puc-campinas.edu.br/estpsi/article/download/8874/6283/34547 -> corpus_out_nb/data/raw/f7cf6759b92ad5ae865e.pdf -> pdf


Baixando Brutos:  62%|██████▎   | 115/184 [01:54<00:52,  1.31it/s]

Downloaded and saved: https://pdfs.semanticscholar.org/1133/ed54c0efd80c8692393d8850adae6f04ecd6.pdf -> corpus_out_nb/data/raw/1a76f9ab9b759b138096.pdf -> pdf


Baixando Brutos:  63%|██████▎   | 116/184 [01:55<00:54,  1.24it/s]

Downloaded and saved: https://portaldeimigracao.mj.gov.br/images/portarias/PORTARIA_N%C2%BA_87_DE_23_DE_MARC%CC%A7O_DE_2020.pdf -> corpus_out_nb/data/raw/b9d56e37a6ee4846bc0c.pdf -> pdf


Baixando Brutos:  64%|██████▎   | 117/184 [01:56<01:05,  1.03it/s]

Downloaded and saved: https://cimi.org.br/wp-content/uploads/2021/11/relatorio-violencia-povos-indigenas-2020-cimi.pdf -> corpus_out_nb/data/raw/89ce3cbefbcab7abf551.pdf -> pdf


Baixando Brutos:  64%|██████▍   | 118/184 [01:57<01:01,  1.07it/s]

Downloaded and saved: https://www.oas.org/pt/cidh/decisiones/corte/2022/br_12.569_pt.pdf -> corpus_out_nb/data/raw/ea9612a59b0755248953.pdf -> pdf


Baixando Brutos:  65%|██████▍   | 119/184 [01:58<01:05,  1.00s/it]

Downloaded and saved: https://files.ufgd.edu.br/arquivos/arquivos/78/MESTRADO-FRONTEIRAS/DISSERTA%C3%87%C3%83O%20JULIA%20STEFANELLO%20PIRES.pdf -> corpus_out_nb/data/raw/d6953df0fb5bff99bfbd.pdf -> pdf


Baixando Brutos:  65%|██████▌   | 120/184 [01:59<00:57,  1.10it/s]

Downloaded and saved: https://apublica.org/2015/10/especial-quilombolas/ -> corpus_out_nb/data/raw/d2d25a11c3c36fa6a8d7.html -> html


Baixando Brutos:  66%|██████▌   | 121/184 [02:02<01:30,  1.44s/it]

Downloaded and saved: https://brasil.un.org/pt-br/72443-brasil-viol%C3%AAncia-pobreza-e-criminaliza%C3%A7%C3%A3o-ainda-t%C3%AAm-cor-diz-relatora-da-onu-sobre-minorias -> corpus_out_nb/data/raw/f2fbcfe8425ff94b6cbb.html -> html


Baixando Brutos:  66%|██████▋   | 122/184 [02:02<01:19,  1.28s/it]

Downloaded and saved: https://tede.ufam.edu.br/bitstream/tede/4239/2/Tese%20-%20M%C3%A1rcia%20Maria%20de%20Oliveira.pdf -> corpus_out_nb/data/raw/161656b2d9c4dac8bffd.pdf -> pdf


Baixando Brutos:  67%|██████▋   | 123/184 [02:03<01:07,  1.11s/it]

Downloaded and saved: https://legis.policiacivil.pe.gov.br/b/api/files/373aad864b283ab54dd8ceef0126c426.pdf -> corpus_out_nb/data/raw/67f2a930c557b12a0d07.pdf -> pdf


Baixando Brutos:  67%|██████▋   | 124/184 [02:04<01:05,  1.09s/it]

Downloaded and saved: https://www.sds.pe.gov.br/images/media/1648750180_063%20BGSDS%20DE%2031MAR2022.pdf -> corpus_out_nb/data/raw/bbb367a3496f8c39f941.pdf -> pdf


Baixando Brutos:  68%|██████▊   | 126/184 [02:07<01:10,  1.22s/it]

Downloaded and saved: https://dspace.unila.edu.br/bitstreams/d7a5979a-f459-4b00-b064-4aa6bbd721d9/download -> corpus_out_nb/data/raw/9d31780372862382100d.pdf -> pdf


Baixando Brutos:  69%|██████▉   | 127/184 [02:07<00:56,  1.01it/s]

Downloaded and saved: https://repositorio.uergs.edu.br/xmlui/bitstream/handle/123456789/2119/_tcc_angaelica___04.08_atualizado-convertido.pdf?sequence=-1&isAllowed=y -> corpus_out_nb/data/raw/03d0873279a807d357eb.pdf -> pdf


Baixando Brutos:  70%|██████▉   | 128/184 [02:10<01:20,  1.43s/it]

Downloaded and saved: https://www.usf.edu.br/galeria/getImage/427/4525999146570085.pdf -> corpus_out_nb/data/raw/f4189afbb3376df517da.pdf -> pdf


Baixando Brutos:  70%|███████   | 129/184 [02:10<01:06,  1.21s/it]

Downloaded and saved: https://rdu.unicesumar.edu.br/bitstream/123456789/9951/1/Athely%2C%20Sidney%20Rodrigues%20Rezemde%20de%20Barros.pdf -> corpus_out_nb/data/raw/ba0c0d4506c9afc08308.pdf -> pdf


Baixando Brutos:  71%|███████   | 130/184 [02:11<00:59,  1.11s/it]

Downloaded and saved: https://ead.pm.ma.gov.br/pluginfile.php/11941/mod_resource/content/0/32.%20NEIDIANE%20SANTOS%20DE%20LIMA.pdf -> corpus_out_nb/data/raw/9e29369ad878c44e4a5b.pdf -> pdf
Downloaded and saved: https://www.gov.br/mdh/pt-br/assuntos/noticias/2025/agosto/governo-federal-lanca-protocolo-inedito-para-acolhimento-de-mulheres-lesbicas-bissexuais-travestis-transexuais-e-intersexo-em-situacao-de-violencia/POPRededeAtendimentoMulheremSituaodeViolnciaMulheresLBTI.pdf -> corpus_out_nb/data/raw/e905b65687be041ec4e6.pdf -> pdf


Baixando Brutos:  72%|███████▏  | 132/184 [02:14<01:06,  1.27s/it]

Downloaded and saved: https://tede.ufam.edu.br/bitstream/tede/5256/5/Disserta%C3%A7%C3%A3o_RaphaelLeone_PPGS.pdf -> corpus_out_nb/data/raw/a8e341e8210a43df4a1b.pdf -> pdf


Baixando Brutos:  73%|███████▎  | 134/184 [02:15<00:48,  1.03it/s]

Downloaded and saved: https://static.tre-al.jus.br/portal/o-tre/governanca-corporativa/comissoes-e-comites/TRE-AL-protocolo-atendimento-humanizado-popula%C3%A7%C3%A3o-trans-travesti.pdf -> corpus_out_nb/data/raw/899db7ff179538b6df90.pdf -> pdf


Baixando Brutos:  73%|███████▎  | 135/184 [02:16<00:44,  1.09it/s]

Downloaded and saved: https://repositorio.idp.edu.br/bitstream/123456789/4906/1/Disserta%C3%A7%C3%A3o_LUCAS%20BARROS%20BAPTISTA%20DE%20TOLEDO%20RIBEIRO_Mestrado_2023.pdf -> corpus_out_nb/data/raw/6d99b44eee5c37f9938b.pdf -> pdf


Baixando Brutos:  74%|███████▍  | 136/184 [02:16<00:37,  1.27it/s]

Downloaded and saved: https://seer.uniacademia.edu.br/index.php/cadernospsicologia/article/viewFile/4036/2994 -> corpus_out_nb/data/raw/22fa7e7122662e61f221.html -> html


Baixando Brutos:  74%|███████▍  | 137/184 [02:19<00:54,  1.15s/it]

Downloaded and saved: http://www.rio.rj.gov.br/dlstatic/10112/9492017/4238301/GuiadaDiversidade.pdf -> corpus_out_nb/data/raw/7b65905b2cfb5757ba4e.pdf -> pdf




Downloaded and saved: https://www.gov.br/mdh/pt-br/navegue-por-temas/migrantes-refugiados-e-apatridas/publicacoes/guiadeorientacaoemdireitoshumanosparapessoasdoafeganistaonobrasil_fevmarco2024.pdf -> corpus_out_nb/data/raw/235e68e8d333a12be512.pdf -> pdf


Baixando Brutos:  76%|███████▌  | 139/184 [02:20<00:43,  1.04it/s]

Downloaded and saved: https://www.riopreto.sp.gov.br/wp-content/uploads/arquivosPortalGOV/mulher/protocolo-regional/protocolo-municipal-atendimento-mulher.pdf -> corpus_out_nb/data/raw/81f770acb101b29b9174.pdf -> pdf


Baixando Brutos:  76%|███████▌  | 140/184 [02:20<00:34,  1.29it/s]

Downloaded and saved: https://www2.camarapiracicaba.sp.gov.br/documentos/Cartilha_Mulheres_GT_Rede_de_Atendimento_e_Protecao_as_Mulheres_de_Piracicaba_Outubro_2020.pdf -> corpus_out_nb/data/raw/df62ca03e9b269a4c7e0.pdf -> pdf


Baixando Brutos:  77%|███████▋  | 141/184 [02:22<00:39,  1.10it/s]

Downloaded and saved: https://www.spdo.ms.gov.br/diariodoe/Index/Download/DO11954_01_10_2025 -> corpus_out_nb/data/raw/34b27c2d8b100d29048d.html -> html


Baixando Brutos:  77%|███████▋  | 142/184 [02:23<00:40,  1.04it/s]

Downloaded and saved: https://www.cnmp.mp.br/portal/images/Publicacoes/documentos/2024/atividade_policial-v4-10out.pdf -> corpus_out_nb/data/raw/c6545be93b57af1ce2ad.pdf -> pdf




Downloaded and saved: https://www.mpac.mp.br/wp-content/uploads/Resolu%C3%A7%C3%A3o-n%C2%BA-005-2014-09.2014.000412-8-Redefine-Atribui%C3%A7%C3%B5es-Promotoria-Controle-Externo.pdf -> corpus_out_nb/data/raw/a6afc5bc6132128b5d87.pdf -> pdf


Baixando Brutos:  79%|███████▉  | 145/184 [02:27<00:53,  1.36s/it]

Downloaded and saved: https://www.rj.gov.br/saude/sites/default/files/arquivo_pagina_basica/Relatorio-Detalhado-do-Quadrimestre-Anterior_-_3%C2%BA-RDQA-2024.pdf -> corpus_out_nb/data/raw/35c4a78b7ee0ddb851f4.pdf -> pdf


Baixando Brutos:  79%|███████▉  | 146/184 [02:28<00:42,  1.13s/it]

Downloaded and saved: https://www.mpmt.mp.br/portalcao/news/1162/146419/cnmp-lanca-ouvidoria-de-combate-a-violencia-policial-e-firma-parceria-com-a-associacao-nacional-de-guardas-municipais-do-brasil/3 -> corpus_out_nb/data/raw/02ec2892aa02f93693ba.html -> html


Baixando Brutos:  82%|████████▏ | 150/184 [02:31<00:24,  1.40it/s]

Downloaded and saved: https://goias.gov.br/cultura/wp-content/uploads/sites/25/2024/10/IN_012024_SECULT_GO.pdf -> corpus_out_nb/data/raw/c5ab2e775e73eceef025.pdf -> pdf
Downloaded and saved: https://www.cnmp.mp.br/portal/images/Livro_controle_externo_da_atividade_policial_internet.pdf -> corpus_out_nb/data/raw/0ea04dc6d6ddcbe44f42.pdf -> pdf


Baixando Brutos:  82%|████████▏ | 151/184 [02:31<00:19,  1.70it/s]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2025/10/livro-pop-rua-17-09-2024.pdf -> corpus_out_nb/data/raw/a4e146c18dd3e2aecc42.pdf -> pdf


Baixando Brutos:  83%|████████▎ | 153/184 [02:32<00:16,  1.85it/s]

Downloaded and saved: https://www.trt5.jus.br/sites/default/files/sistema/normas/2024-01/0048-2024-cria-subcomite-gestor-regional-do-programa-de-equidade-de-raca-genero-e-diversidade.pdf -> corpus_out_nb/data/raw/df34dbe68b3cda12ebad.pdf -> pdf
Downloaded and saved: https://www.mpdft.mp.br/portal/pdf/nucleos/ncap/Manual_Nacional_Controle_Externo_Atividade_Policia.pdf -> corpus_out_nb/data/raw/4e22d519881ed588516c.pdf -> pdf


Baixando Brutos:  84%|████████▎ | 154/184 [02:33<00:20,  1.48it/s]

Downloaded and saved: https://ww2.trt2.jus.br/fileadmin/comunicacao/Links/20250408_Protocolos_de_Atuacao_e_Julgamento_da_Justica_do_Trabalho_.pdf -> corpus_out_nb/data/raw/93a931865bba2cabd604.pdf -> pdf


Baixando Brutos:  84%|████████▍ | 155/184 [02:34<00:19,  1.49it/s]

Downloaded and saved: https://www.pib.socioambiental.org/es/Not%C3%ADcias?id=216120 -> corpus_out_nb/data/raw/57c501eb5c115faea31e.html -> html


Baixando Brutos:  85%|████████▍ | 156/184 [02:36<00:34,  1.22s/it]

Downloaded and saved: https://portal.doe.sea.sc.gov.br/repositorio/2024/20241127/Jornal/22404.pdf -> corpus_out_nb/data/raw/2234111610fc264444ba.pdf -> pdf


Baixando Brutos:  85%|████████▌ | 157/184 [02:39<00:44,  1.66s/it]

Downloaded and saved: https://repositorio.idp.edu.br/bitstream/123456789/4936/1/Disserta%C3%A7%C3%A3o_RONALDO%20AUGUSTO%20COMAR%20MAR%C3%83O%20SAYEG_Mestrado_2023.pdf -> corpus_out_nb/data/raw/902bcf42b4590397ad17.pdf -> pdf


Baixando Brutos:  86%|████████▋ | 159/184 [02:41<00:32,  1.31s/it]

Downloaded and saved: https://www.hrw.org/pt/news/2024/10/10/un-experts-spotlight-devastating-police-brutality-brazil -> corpus_out_nb/data/raw/5f5daf1840ec9aa7a047.html -> html


Baixando Brutos:  87%|████████▋ | 160/184 [02:43<00:37,  1.54s/it]

Downloaded and saved: http://ejm.tjmmg.jus.br/ejm/wp-content/uploads/2024/12/casoteca-2017.pdf -> corpus_out_nb/data/raw/60105d962676977ec48e.pdf -> pdf




Downloaded and saved: https://www.policiacivil.ma.gov.br/wp-content/uploads/2024/05/IN-No-008-2021-PLANTAO-DEM-SLZ-NOVA-DOE.pdf -> corpus_out_nb/data/raw/c28be9acaa02965792a7.pdf -> pdf


Baixando Brutos:  89%|████████▊ | 163/184 [02:45<00:22,  1.06s/it]

Downloaded and saved: https://tede.ufam.edu.br/bitstream/tede/10191/5/DISS_RaimundoAlbuquerque_PPSCA -> corpus_out_nb/data/raw/eb36dd57c92e552be75a.pdf -> pdf


Baixando Brutos:  89%|████████▉ | 164/184 [02:46<00:18,  1.09it/s]

Downloaded and saved: https://www.hrw.org/reports/wr2015port_ForUpload.pdf -> corpus_out_nb/data/raw/6efc5773f57a3dbf0df6.pdf -> pdf


Baixando Brutos:  90%|████████▉ | 165/184 [02:47<00:18,  1.04it/s]

Downloaded and saved: https://www.tjdft.jus.br/informacoes/cidadania/nucleo-judiciario-da-mulher/parceiros/material-informativo-e-instrucional/cartilha-da-pessoa-com-deficiencia-easjur-e-dpdf.pdf -> corpus_out_nb/data/raw/094b6946739c5c3fbc24.pdf -> pdf


Baixando Brutos:  91%|█████████▏| 168/184 [02:49<00:10,  1.52it/s]

Downloaded and saved: https://sxpolitics.org/ptbr/wp-content/uploads/sites/3/2019/10/Copia-de-VERSAO-WEB-compactado-1.pdf -> corpus_out_nb/data/raw/798678a1a89f684aab45.pdf -> pdf
Downloaded and saved: https://www.pmvc.ba.gov.br/wp-content/uploads/Protocolo-Unificado-de-Atendimento-Integrado-a-Crian%C3%A7as-e-Adolescentes-V%C3%ADtimas-ou-Testemunhas-de-Viol%C3%AAncia.pdf -> corpus_out_nb/data/raw/ae24384f24e4e10014b0.pdf -> pdf


Baixando Brutos:  93%|█████████▎| 172/184 [02:52<00:09,  1.21it/s]

Downloaded and saved: https://www.prefeitura.sp.gov.br/cidade/secretarias/upload/direitos_humanos/MULHER/Casa%20da%20mulher%20-%20revisado%202%20(2).pdf -> corpus_out_nb/data/raw/3b72f2e9e0afdb5907ec.pdf -> pdf


Baixando Brutos:  95%|█████████▍| 174/184 [02:53<00:05,  1.70it/s]

Downloaded and saved: http://ejm.tjmmg.jus.br/ejm/wp-content/uploads/2024/12/casoteca-2021-2022.pdf -> corpus_out_nb/data/raw/b8143f65beeef3451e1f.pdf -> pdf
Downloaded and saved: https://www.gov.br/mdh/pt-br/navegue-por-temas/politicas-para-mulheres/arquivo/arquivos-diversos/sev/lei-maria-da-penha/cartilhabr-mulher09.pdf -> corpus_out_nb/data/raw/a6f55463c2bc494c5014.pdf -> pdf


Baixando Brutos:  95%|█████████▌| 175/184 [02:54<00:06,  1.37it/s]

Downloaded and saved: https://www.tjes.jus.br/wp-content/uploads/Cartilha_COMVIDES_2018-1.pdf -> corpus_out_nb/data/raw/66b710fc2a255a7f2a39.pdf -> pdf


Baixando Brutos:  96%|█████████▌| 176/184 [02:55<00:05,  1.36it/s]

Downloaded and saved: https://www.gov.br/mdh/pt-br/navegue-por-temas/politicas-para-mulheres/arquivo/arquivos-diversos/sev/lei-maria-da-penha/doc-do-item-8.2.-norma-deams-2010.pdf -> corpus_out_nb/data/raw/cd76bf37ebaf28434989.pdf -> pdf


Baixando Brutos:  96%|█████████▌| 177/184 [02:56<00:06,  1.13it/s]

Downloaded and saved: https://atos.cnj.jus.br/files/compilado2022562021082061200f20b40f5.pdf -> corpus_out_nb/data/raw/3d0e351afd1d25efea8b.pdf -> pdf


Baixando Brutos:  97%|█████████▋| 178/184 [02:57<00:05,  1.09it/s]

Downloaded and saved: https://www.tjsp.jus.br/Download/Pdf/Comesp/Relatorios/RelatorioComesp2023.pdf -> corpus_out_nb/data/raw/7be252406411d0001424.pdf -> pdf


Baixando Brutos:  97%|█████████▋| 179/184 [02:59<00:06,  1.27s/it]

Downloaded and saved: https://www.prefeitura.sp.gov.br/cidade/secretarias/upload/Manual%20%20Violencia%20ESF%20%20MP(1).pdf -> corpus_out_nb/data/raw/1e7aacdba25bcecd9b2e.pdf -> pdf


Baixando Brutos:  98%|█████████▊| 180/184 [03:00<00:04,  1.21s/it]

Downloaded and saved: https://www.cnmp.mp.br/portal/images/Conatetrap/Materiais_de_Apoio/Livro_Trafico_de_Pessoas.pdf -> corpus_out_nb/data/raw/3eb149b1140975994cbb.pdf -> pdf


Baixando Brutos:  98%|█████████▊| 181/184 [03:01<00:03,  1.18s/it]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/conteudo/arquivo/2019/08/7b7cb6d9ac9042c8d3e40700b80bf207.pdf -> corpus_out_nb/data/raw/600dab2997c6b0616ad3.pdf -> pdf


Baixando Brutos:  99%|█████████▉| 182/184 [03:02<00:02,  1.06s/it]

Downloaded and saved: https://www.cnj.jus.br/wp-content/uploads/2023/08/manual-tomada-decisao-parametros-gerais-eletronica.pdf -> corpus_out_nb/data/raw/fd9875887854f6f22809.pdf -> pdf


Baixando Brutos:  99%|█████████▉| 183/184 [03:05<00:01,  1.56s/it]

Downloaded and saved: https://www.tjrj.jus.br/documents/5736540/6284571/dir-adm.pdf -> corpus_out_nb/data/raw/a9f3b17c91163e8ab545.pdf -> pdf


Baixando Brutos: 100%|██████████| 184/184 [03:07<00:00,  1.02s/it]

Downloaded and saved: https://www.tjam.jus.br/images/ESMAM/Projetos_Esmam/Cartilha_Nossa_dor.pdf -> corpus_out_nb/data/raw/9a743a94ea0ca7a559af.pdf -> pdf

Total de resultados: 0
Primeiros 3 resultados: []





(0, [])

In [38]:
# --- Método 2: Extração e Limpeza dos Dados ---
def process_raw_and_extract(raw_item):
    """Extrai o texto, limpa, verifica o tamanho e salva o arquivo .txt."""
    
    # Retorna itens que falharam na fase anterior
    if raw_item["status"] != "download_ok":
        return raw_item 
        
    url = raw_item["url"]
    raw_path = Path(raw_item["raw_path"])
    is_pdf = raw_item["is_pdf"]
    content = raw_item["content"]
    
    # Início da lógica de extração
    try:
        if is_pdf:
            text = extract_pdf_text(str(raw_path)) 
        else:
            text = extract_html_main(content)
    except Exception as e:
        text = ""
        print(e)
        log.error(f"Extraction error for {url}: {e}")
        
    text = clean_text(text)
    
    # Verifica o tamanho (lógica original)
    if len(text) < MIN_CHARS:
        return {"url": url, "status": "too_short", "chars": len(text)}
        
    # Salvar o texto limpo
    ext_raw = ".pdf" if is_pdf else ".html"
    tname = raw_path.name.replace(ext_raw, ".txt")
    text_path = TEXT_DIR / tname
    
    try:
        text_path.write_text(text, encoding="utf-8")
    except Exception as e:
        return {"url": url, "status": f"save_text_error:{e}"}
        
    # Retorno final de sucesso
    return {
        "url": url, 
        "status": "ok", 
        "text_path": str(text_path), 
        "chars": len(text)
    }

# 2. FASE DE EXTRAÇÃO
print("\n--- FASE 2: EXTRAÇÃO E LIMPEZA DE TEXTO ---")
# Filtra apenas os itens baixados com sucesso para a extração
raw_items_to_process = [r for r in download_results if r["status"] == "download_ok"]
failed_downloads = [r for r in download_results if r["status"] != "download_ok"] # Itens que falharam no download

if raw_items_to_process:
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as ex_extract:
        futs_extract = [ex_extract.submit(process_raw_and_extract, r) for r in raw_items_to_process]
        for fut_extract in tqdm(as_completed(futs_extract), total=len(futs_extract), desc="Extraindo Texto"):
            results.append(fut_extract.result())
            
    # Adiciona as falhas de download aos resultados finais
    results.extend(failed_downloads)
else:
    log.warning("Nenhum arquivo bruto baixado com sucesso para extração.")

print(f"\nItens processados: {len(raw_items_to_process)}")
print(f"\nItens falhados: {len(failed_downloads)}")


--- FASE 2: EXTRAÇÃO E LIMPEZA DE TEXTO ---
DEBUG: Texto extraído por fitz (31753 chars).


Extraindo Texto:   0%|          | 0/150 [00:00<?, ?it/s]

DEBUG: Texto extraído por fitz (20635 chars).


Extraindo Texto:   2%|▏         | 3/150 [00:00<00:37,  3.91it/s]

DEBUG: Texto extraído por fitz (218071 chars).
DEBUG: Texto extraído por fitz (368422 chars).


Extraindo Texto:   3%|▎         | 4/150 [00:01<00:40,  3.59it/s]encoding error : input conversion failed due to input error, bytes 0x8F 0x1C 0xB7 0x11
encoding error : input conversion failed due to input error, bytes 0x8F 0x1C 0xB7 0x11
Extraindo Texto:   6%|▌         | 9/150 [00:02<00:34,  4.14it/s]

DEBUG: Texto extraído por fitz (122781 chars).
DEBUG: Texto extraído por fitz (41443 chars).
DEBUG: Texto extraído por fitz (29697 chars).
DEBUG: Texto extraído por fitz (30677 chars).


Extraindo Texto:  10%|█         | 15/150 [00:04<00:24,  5.56it/s]

DEBUG: Texto extraído por fitz (61222 chars).
DEBUG: Texto extraído por fitz (6586 chars).
DEBUG: Texto extraído por fitz (39112 chars).
DEBUG: Texto extraído por fitz (135146 chars).


Extraindo Texto:  13%|█▎        | 19/150 [00:04<00:16,  7.85it/s]

DEBUG: Texto extraído por fitz (68218 chars).
DEBUG: Texto extraído por fitz (206796 chars).


Extraindo Texto:  14%|█▍        | 21/150 [00:05<00:40,  3.17it/s]encoding error : input conversion failed due to input error, bytes 0x90 0x01 0x01 0x04
encoding error : input conversion failed due to input error, bytes 0x90 0x01 0x01 0x04
Extraindo Texto:  15%|█▌        | 23/150 [00:06<00:43,  2.91it/s]

DEBUG: Texto extraído por fitz (643546 chars).
DEBUG: Texto extraído por fitz (1008770 chars).
DEBUG: Texto extraído por fitz (1528025 chars).


Extraindo Texto:  17%|█▋        | 26/150 [00:08<00:49,  2.51it/s]

DEBUG: Texto extraído por fitz (52958 chars).
DEBUG: Texto extraído por fitz (138094 chars).


Extraindo Texto:  19%|█▊        | 28/150 [00:08<00:36,  3.32it/s]

DEBUG: Texto extraído por fitz (179231 chars).
DEBUG: Texto extraído por fitz (4827 chars).


Extraindo Texto:  20%|██        | 30/150 [00:09<00:30,  3.97it/s]

DEBUG: Texto extraído por fitz (425720 chars).


Extraindo Texto:  21%|██        | 31/150 [00:09<00:30,  3.93it/s]

DEBUG: Texto extraído por fitz (499983 chars).
DEBUG: Texto extraído por fitz (54429 chars).
DEBUG: Texto extraído por fitz (48596 chars).


Extraindo Texto:  23%|██▎       | 34/150 [00:09<00:19,  5.99it/s]

DEBUG: Texto extraído por fitz (129930 chars).


Extraindo Texto:  23%|██▎       | 35/150 [00:09<00:21,  5.23it/s]

DEBUG: Texto extraído por fitz (426327 chars).
DEBUG: Texto extraído por fitz (13742 chars).
DEBUG: Texto extraído por fitz (187882 chars).


Extraindo Texto:  26%|██▌       | 39/150 [00:10<00:16,  6.68it/s]

DEBUG: Texto extraído por fitz (100068 chars).
DEBUG: Texto extraído por fitz (166625 chars).


Extraindo Texto:  27%|██▋       | 41/150 [00:10<00:13,  8.02it/s]

DEBUG: Texto extraído por fitz (23346 chars).


Extraindo Texto:  28%|██▊       | 42/150 [00:11<00:31,  3.40it/s]

DEBUG: Texto extraído por fitz (246843 chars).


Extraindo Texto:  29%|██▉       | 44/150 [00:12<00:40,  2.64it/s]

DEBUG: Texto extraído por fitz (923461 chars).
DEBUG: Texto extraído por fitz (87945 chars).


Extraindo Texto:  31%|███▏      | 47/150 [00:13<00:28,  3.55it/s]

DEBUG: Texto extraído por fitz (643345 chars).
DEBUG: Texto extraído por fitz (43943 chars).


Extraindo Texto:  33%|███▎      | 49/150 [00:13<00:22,  4.41it/s]

DEBUG: Texto extraído por fitz (347906 chars).


Extraindo Texto:  35%|███▍      | 52/150 [00:14<00:22,  4.38it/s]

DEBUG: Texto extraído por fitz (992025 chars).


Extraindo Texto:  35%|███▌      | 53/150 [00:14<00:24,  3.90it/s]

DEBUG: Texto extraído por fitz (90144 chars).


Extraindo Texto:  36%|███▌      | 54/150 [00:15<00:36,  2.64it/s]

DEBUG: Texto extraído por fitz (410671 chars).


Extraindo Texto:  37%|███▋      | 56/150 [00:15<00:29,  3.14it/s]

DEBUG: Texto extraído por fitz (44047 chars).


Extraindo Texto:  38%|███▊      | 57/150 [00:16<00:32,  2.82it/s]

DEBUG: Texto extraído por fitz (233694 chars).
DEBUG: Texto extraído por fitz (472398 chars).


Extraindo Texto:  39%|███▊      | 58/150 [00:16<00:26,  3.47it/s]

DEBUG: Texto extraído por fitz (228897 chars).


Extraindo Texto:  40%|████      | 60/150 [00:17<00:23,  3.81it/s]

DEBUG: Texto extraído por fitz (534381 chars).
DEBUG: Texto extraído por fitz (16159 chars).


Extraindo Texto:  44%|████▍     | 66/150 [00:18<00:18,  4.55it/s]

DEBUG: Texto extraído por fitz (220725 chars).
DEBUG: Texto extraído por fitz (8650 chars).
DEBUG: Texto extraído por fitz (2 chars).
Nenhum método de extração funcionou.


Extraindo Texto:  45%|████▌     | 68/150 [00:18<00:16,  4.98it/s]

DEBUG: Texto extraído por fitz (170113 chars).
DEBUG: Texto extraído por fitz (168369 chars).


Extraindo Texto:  46%|████▌     | 69/150 [00:19<00:18,  4.29it/s]

DEBUG: Texto extraído por fitz (462919 chars).


Extraindo Texto:  47%|████▋     | 71/150 [00:19<00:16,  4.81it/s]

DEBUG: Texto extraído por fitz (205734 chars).
DEBUG: Texto extraído por fitz (148937 chars).
DEBUG: Texto extraído por fitz (4790 chars).


Extraindo Texto:  49%|████▊     | 73/150 [00:19<00:16,  4.79it/s]

DEBUG: Texto extraído por fitz (415490 chars).
DEBUG: Texto extraído por fitz (182987 chars).


Extraindo Texto:  50%|█████     | 75/150 [00:20<00:14,  5.12it/s]

DEBUG: Texto extraído por fitz (205163 chars).


Extraindo Texto:  51%|█████     | 76/150 [00:20<00:20,  3.58it/s]

DEBUG: Texto extraído por fitz (642398 chars).
DEBUG: Texto extraído por fitz (94926 chars).


Extraindo Texto:  52%|█████▏    | 78/150 [00:21<00:14,  5.07it/s]

DEBUG: Texto extraído por fitz (115331 chars).


Extraindo Texto:  53%|█████▎    | 80/150 [00:21<00:12,  5.54it/s]

DEBUG: Texto extraído por fitz (90499 chars).
DEBUG: Texto extraído por fitz (187890 chars).


Extraindo Texto:  55%|█████▌    | 83/150 [00:21<00:12,  5.30it/s]

DEBUG: Texto extraído por fitz (30066 chars).
DEBUG: Texto extraído por fitz (534497 chars).
DEBUG: Texto extraído por fitz (90379 chars).


Extraindo Texto:  57%|█████▋    | 86/150 [00:22<00:09,  7.10it/s]

DEBUG: Texto extraído por fitz (506486 chars).
DEBUG: Texto extraído por fitz (52045 chars).
DEBUG: Texto extraído por fitz (39781 chars).
DEBUG: Texto extraído por fitz (11912 chars).
DEBUG: Texto extraído por fitz (70277 chars).


Extraindo Texto:  60%|██████    | 90/150 [00:22<00:04, 12.09it/s]

DEBUG: Texto extraído por fitz (388835 chars).
DEBUG: Texto extraído por fitz (311316 chars).


Extraindo Texto:  61%|██████▏   | 92/150 [00:23<00:12,  4.76it/s]

DEBUG: Texto extraído por fitz (1167059 chars).


Extraindo Texto:  63%|██████▎   | 94/150 [00:24<00:16,  3.47it/s]

DEBUG: Texto extraído por fitz (736351 chars).
DEBUG: Texto extraído por fitz (51067 chars).


Extraindo Texto:  65%|██████▍   | 97/150 [00:24<00:11,  4.63it/s]

DEBUG: Texto extraído por fitz (192425 chars).
DEBUG: Texto extraído por fitz (117140 chars).
DEBUG: Texto extraído por fitz (87597 chars).


Extraindo Texto:  67%|██████▋   | 100/150 [00:25<00:10,  4.72it/s]

DEBUG: Texto extraído por fitz (519980 chars).
DEBUG: Texto extraído por fitz (140806 chars).


Extraindo Texto:  67%|██████▋   | 101/150 [00:25<00:11,  4.22it/s]

DEBUG: Texto extraído por fitz (187402 chars).
DEBUG: Texto extraído por fitz (152487 chars).


Extraindo Texto:  69%|██████▊   | 103/150 [00:26<00:09,  5.16it/s]

DEBUG: Texto extraído por fitz (216619 chars).
DEBUG: Texto extraído por fitz (19635 chars).


Extraindo Texto:  70%|███████   | 105/150 [00:26<00:08,  5.61it/s]

DEBUG: Texto extraído por fitz (241528 chars).


Extraindo Texto:  71%|███████▏  | 107/150 [00:26<00:09,  4.78it/s]

DEBUG: Texto extraído por fitz (76415 chars).


Extraindo Texto:  72%|███████▏  | 108/150 [00:27<00:12,  3.40it/s]

DEBUG: Texto extraído por fitz (258738 chars).
DEBUG: Texto extraído por fitz (30845 chars).


Extraindo Texto:  73%|███████▎  | 110/150 [00:27<00:09,  4.07it/s]

DEBUG: Texto extraído por fitz (68488 chars).


Extraindo Texto:  75%|███████▍  | 112/150 [00:29<00:20,  1.82it/s]

DEBUG: Texto extraído por fitz (480847 chars).


Extraindo Texto:  75%|███████▌  | 113/150 [00:29<00:16,  2.20it/s]

DEBUG: Texto extraído por fitz (38612 chars).


Extraindo Texto:  76%|███████▌  | 114/150 [00:30<00:18,  1.94it/s]encoding error : input conversion failed due to input error, bytes 0x81 0xC7 0xAA 0x41
Extraindo Texto:  77%|███████▋  | 115/150 [00:30<00:14,  2.46it/s]encoding error : input conversion failed due to input error, bytes 0x81 0xC7 0xAA 0x41
Extraindo Texto:  77%|███████▋  | 116/150 [00:30<00:11,  3.02it/s]

DEBUG: Texto extraído por fitz (70986 chars).


Extraindo Texto:  78%|███████▊  | 117/150 [00:31<00:15,  2.06it/s]

DEBUG: Texto extraído por fitz (553793 chars).
DEBUG: Texto extraído por fitz (228572 chars).


Extraindo Texto:  79%|███████▉  | 119/150 [00:31<00:09,  3.18it/s]

DEBUG: Texto extraído por fitz (21349 chars).


Extraindo Texto:  80%|████████  | 120/150 [00:32<00:10,  2.97it/s]

DEBUG: Texto extraído por fitz (187576 chars).


Extraindo Texto:  81%|████████  | 121/150 [00:32<00:10,  2.68it/s]

DEBUG: Texto extraído por fitz (439276 chars).
DEBUG: Texto extraído por fitz (1350967 chars).


Extraindo Texto:  82%|████████▏ | 123/150 [00:35<00:21,  1.28it/s]

DEBUG: Texto extraído por fitz (688501 chars).


Extraindo Texto:  83%|████████▎ | 124/150 [00:35<00:18,  1.40it/s]

DEBUG: Texto extraído por fitz (366969 chars).


Extraindo Texto:  84%|████████▍ | 126/150 [00:37<00:19,  1.21it/s]

DEBUG: Texto extraído por fitz (278173 chars).
DEBUG: Texto extraído por fitz (29034 chars).


Extraindo Texto:  85%|████████▌ | 128/150 [00:38<00:12,  1.69it/s]

DEBUG: Texto extraído por fitz (316732 chars).


Extraindo Texto:  86%|████████▌ | 129/150 [00:38<00:11,  1.89it/s]

DEBUG: Texto extraído por fitz (264583 chars).


Extraindo Texto:  87%|████████▋ | 130/150 [00:38<00:09,  2.21it/s]

DEBUG: Texto extraído por fitz (92009 chars).
DEBUG: Texto extraído por fitz (180092 chars).


Extraindo Texto:  89%|████████▊ | 133/150 [00:40<00:08,  2.10it/s]

DEBUG: Texto extraído por fitz (489138 chars).
DEBUG: Texto extraído por fitz (132270 chars).


Extraindo Texto:  90%|█████████ | 135/150 [00:40<00:04,  3.12it/s]

DEBUG: Texto extraído por fitz (19186 chars).


Extraindo Texto:  91%|█████████ | 136/150 [00:40<00:04,  3.04it/s]

DEBUG: Texto extraído por fitz (224074 chars).DEBUG: Texto extraído por fitz (46592 chars).



Extraindo Texto:  92%|█████████▏| 138/150 [00:41<00:03,  3.99it/s]

DEBUG: Texto extraído por fitz (118568 chars).
DEBUG: Texto extraído por fitz (86566 chars).


Extraindo Texto:  93%|█████████▎| 140/150 [00:41<00:03,  3.12it/s]

DEBUG: Texto extraído por fitz (91396 chars).
DEBUG: Texto extraído por fitz (222473 chars).


Extraindo Texto:  95%|█████████▍| 142/150 [00:43<00:04,  1.74it/s]

DEBUG: Texto extraído por fitz (555392 chars).


Extraindo Texto:  95%|█████████▌| 143/150 [00:45<00:07,  1.04s/it]

DEBUG: Texto extraído por fitz (526002 chars).


Extraindo Texto:  96%|█████████▌| 144/150 [00:45<00:04,  1.27it/s]

DEBUG: Texto extraído por fitz (190083 chars).
DEBUG: Texto extraído por fitz (62712 chars).


Extraindo Texto:  97%|█████████▋| 146/150 [00:47<00:02,  1.48it/s]encoding error : input conversion failed due to input error, bytes 0x90 0x64 0xCC 0x5C
encoding error : input conversion failed due to input error, bytes 0x90 0x64 0xCC 0x5C
Extraindo Texto:  99%|█████████▊| 148/150 [01:02<00:09,  4.74s/it]

DEBUG: Texto extraído por fitz (1250181 chars).


Extraindo Texto: 100%|██████████| 150/150 [01:02<00:00,  2.39it/s]


Itens processados: 150

Itens falhados: 34





## 7) Deduplicar por hash e gerar shards JSONL (Mantendo Metadados Originais)

In [44]:

import json

SHARDS_DIR = OUTDIR / "data" / "shards"
SHARDS_DIR.mkdir(parents=True, exist_ok=True)
seen_hash = set()
shard_bytes_limit = int(SHARD_SIZE_MB * 1024 * 1024)
buf = []; size = 0; shard_id = 0

def flush():
    global buf, size, shard_id
    if not buf: return
    out_path = SHARDS_DIR / f"corpus_{shard_id:04d}.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for line in buf:
            f.write(line + "\n")
    buf = []; size = 0; shard_id += 1
    log.info("Shard salvo: %s", out_path)

ok_items = [r for r in results if r.get("status") == "ok"]

# Mapear URL_norm de volta para os metadados originais
meta_map = dedup.set_index('url_norm').to_dict('index')

for item in ok_items:
    try:
        text = Path(item["text_path"]).read_text(encoding="utf-8")
    except Exception:
        continue
        
    # Deduplicação por Hash do Conteúdo
    h = sha1(text)
    if h in seen_hash:
        continue
    seen_hash.add(h)
    
    # Resgatar Metadados Originais
    url_norm = norm_url(item["url"])
    metadata = meta_map.get(url_norm, {})
    
    # Criar o objeto JSONL final (estrutura de saída inalterada no formato dos metadados)
    final_record = {
        "text": text
    }
    
    # Filtrar chaves com valor None/NaN para manter a limpeza do JSONL
    final_record = {k: v for k, v in final_record.items() if v is not None and not pd.isna(v)}
    
    line = json.dumps(final_record, ensure_ascii=False)
    buf.append(line); size += len(line.encode("utf-8"))
    
    if size >= shard_bytes_limit:
        flush()
flush()

log.info("Itens OK após extração: %d", len(ok_items))
log.info("Documentos únicos (após deduplicação por hash): %d", len(seen_hash))

len(ok_items), len(seen_hash)


2025-10-30 16:35:42,507 INFO: Shard salvo: corpus_out_nb/data/shards/corpus_0000.jsonl
2025-10-30 16:35:42,508 INFO: Itens OK após extração: 338
2025-10-30 16:35:42,508 INFO: Documentos únicos (após deduplicação por hash): 145


(338, 145)

## 8) Relatório final

In [43]:
# Célula 8 (Sintética): Imprimir Estatística Principal

# A variável 'seen_hash' é assumida como definida no Bloco 7.
# Garante que a variável exista para evitar NameError (fallback de segurança)
if 'seen_hash' not in locals():
    seen_hash = set()

# Calcula a estatística principal
total_documentos_unicos = len(seen_hash)

# Imprime o resultado de forma clara
print("=========================================")
print(f"✅ Documentos Únicos e Válidos Extraídos: {total_documentos_unicos}")
print("=========================================")

# Retorna a variável para manter a consistência do notebook
total_documentos_unicos

✅ Documentos Únicos e Válidos Extraídos: 145


145


## 9) Dicas
- Se um host estiver lento, ajuste: `MAX_WORKERS = 1`, `RATE = 0.5`, `TIMEOUT = 10`.
- Para debug detalhado: `LOG_LEVEL = "DEBUG"`.
- Se precisar **ignorar robots** para diagnóstico rápido, troque `robots.can_fetch(url)` para sempre `True` (apenas para teste).
- Para enriquecer metadados, crie um `index.csv` registrando `url`, `hash`, `path` e `chars` durante o loop.
