<a href="https://colab.research.google.com/github/sergiocostaifes/PPCOMP_DM/blob/main/02_clean_normalize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What the notebook does (methodology)

1. Reads the validated dataset produced by Notebook 01 (trace_raw_validated.parquet).

2. Coerces time to numeric int64 and drops rows with a null time (consistency).

3. Creates the relative time:

t0 = min(time)
t_rel_us = time - t0

4. Applies a conservative cutoff to the analyzed period:

T_REL_MAX_US = 3e12 (~34.7 days)

5. Creates hour from the relative time (hour = t_rel_us // 3.6e9, that is, whole hours elapsed since t0).

6. Removes the hour==0 artifact (the first occurrences, the "hour 0" seen in the EDA), recording before/after metrics (row count and hour range) in the summary for methodological traceability.

7. Saves the clean dataset as google_trace_clean.parquet and writes a JSON summary.
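Steps 3 to 6 can be tried on a toy frame before running the full cell. The sketch below is only an illustration under stated assumptions: the function name `add_relative_hour`, the constant `US_PER_HOUR`, and the toy `time` values are hypothetical and not part of the notebook; only `T_REL_MAX_US` and the column names come from it.

```python
import pandas as pd

US_PER_HOUR = 3_600_000_000        # microseconds in one hour (assumed helper constant)
T_REL_MAX_US = 3_000_000_000_000   # same cutoff value as the notebook


def add_relative_hour(df: pd.DataFrame) -> pd.DataFrame:
    """Relative time, cutoff, hour bucket, then drop hour == 0 (hypothetical helper)."""
    out = df.copy()
    t0 = int(out["time"].min())
    out["t_rel_us"] = out["time"] - t0
    out = out[out["t_rel_us"] <= T_REL_MAX_US].copy()
    out["hour"] = (out["t_rel_us"] // US_PER_HOUR).astype("int64")
    return out[out["hour"] != 0].copy()


# Toy timestamps in microseconds: two fall in hour 0, one lies past the cutoff.
toy = pd.DataFrame(
    {"time": [0, 1_000, 3_600_000_000, 195_684_022_913, 3_100_000_000_000]}
)
clean = add_relative_hour(toy)
print(clean)

# The cutoff expressed in days: microseconds -> seconds -> days.
cutoff_days = T_REL_MAX_US / 1e6 / 86_400
print(f"cutoff is about {cutoff_days:.1f} days")
```

On the toy data only the rows in hours 1 and 54 survive: the first two rows fall in hour 0 and the last one exceeds the cutoff.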

# What the results indicate (sanity check)

1. Before the hour==0 removal, the data spanned hours 0..744 with 405891 records after the cutoff, exactly the volume expected from the validated parquet of Notebook 01.
2. Removing hour==0 dropped 56740 records. That is a large share (~14%), but it is consistent with discarding the initial-phase artifact identified in the EDA, which is precisely why the rule is recorded and formalized.
3. The final dataset was saved to the correct path, and df.head() shows the expected columns plus t_rel_us and hour (23 columns).
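The arithmetic behind these figures can be reproduced from the summary keys. This is a minimal sketch: the dictionary is filled by hand with the numbers quoted above (it is not read from the real JSON file), and the helper name `hour0_removal_stats` is hypothetical.

```python
def hour0_removal_stats(summary: dict) -> dict:
    """Derive the removed share and remaining rows from summary-style keys."""
    before = summary["rows_before_hour0"]
    removed = summary["rows_removed_hour0"]
    return {
        "rows_after": before - removed,
        "removed_share": removed / before,
        # Largest hour index converted to days (744 h / 24).
        "max_hour_in_days": summary["hour_max_before"] / 24,
    }


# Values copied from the text above; assumed to match the saved summary.
reported = {
    "rows_before_hour0": 405_891,
    "rows_removed_hour0": 56_740,
    "hour_max_before": 744,
}
stats = hour0_removal_stats(reported)
print(stats["rows_after"])
print(f"{stats['removed_share']:.1%}")
print(stats["max_hour_in_days"])
```

The largest hour index corresponds to 31 days, which sits below the ~34.7-day cutoff, so on these numbers the data ends before the cutoff is reached.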


In [5]:
# =========================
# 02_clean_normalize.ipynb
# Cleaning + normalization + clean Parquet
# =========================

# ===== Standard PPCOMP_DM bootstrap (Colab) =====
from google.colab import drive
drive.mount("/content/drive", force_remount=False)

import sys, subprocess, importlib
from pathlib import Path
from importlib.machinery import PathFinder

REPO_DIR = Path("/content/drive/MyDrive/Mestrado/PPCOMP_DM")
GITHUB_REPO = "https://github.com/sergiocostaifes/PPCOMP_DM.git"

if not REPO_DIR.exists():
    REPO_DIR.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "clone", GITHUB_REPO, str(REPO_DIR)], check=True)

repo_str = str(REPO_DIR)
if repo_str in sys.path:
    sys.path.remove(repo_str)
sys.path.insert(0, repo_str)

importlib.invalidate_caches()
if PathFinder not in sys.meta_path:
    sys.meta_path.append(PathFinder)

from importlib import reload
import src.paths as _paths
reload(_paths)

from src.paths import RAW_PATH, PROCESSED_PATH, REPORTS_PATH, ensure_dirs
ensure_dirs()

def log(msg: str) -> None:
    print(f"[02_clean_normalize] {msg}")

# ===== Processing =====
import pandas as pd
import numpy as np

VALIDATED_PARQUET = PROCESSED_PATH / "trace_raw_validated.parquet"
assert VALIDATED_PARQUET.exists(), (
    f"Validated parquet not found: {VALIDATED_PARQUET}. Run 01_ingest_validate first."
)

df = pd.read_parquet(VALIDATED_PARQUET)
log("Lido do parquet validado.")

# time should already be int64, but enforce it anyway:
df["time"] = pd.to_numeric(df["time"], errors="coerce")
df = df.dropna(subset=["time"]).copy()
df["time"] = df["time"].astype("int64")

# Relative time (microseconds since the earliest event)
t0 = int(df["time"].min())
df["t_rel_us"] = df["time"] - t0

# Conservative cutoff of the analyzed period
T_REL_MAX_US = 3_000_000_000_000  # ~34.7 days
df = df[(df["t_rel_us"] >= 0) & (df["t_rel_us"] <= T_REL_MAX_US)].copy()

# hour (derived from the relative time; 3.6e9 us per hour)
df["hour"] = (df["t_rel_us"] // 3_600_000_000).astype("int64")
log(f"Horas (min..max): {df['hour'].min()}..{df['hour'].max()}")
log(f"Registros após corte T_REL_MAX_US: {len(df)}")

# Remove hour==0 (artifact), keeping before/after metrics for the summary
rows_before_hour0 = len(df)
hour_min_before = int(df["hour"].min())
hour_max_before = int(df["hour"].max())

df = df[df["hour"] != 0].copy()

rows_after_hour0 = len(df)
hour_min_after = int(df["hour"].min())
hour_max_after = int(df["hour"].max())

# ===== Persistence =====
CLEAN_PARQUET = PROCESSED_PATH / "google_trace_clean.parquet"
df.to_parquet(CLEAN_PARQUET, index=False, compression="snappy")
log(f"Salvo: {CLEAN_PARQUET}")

import json
summary02 = {
    "input_file": str(VALIDATED_PARQUET),
    "output_file": str(CLEAN_PARQUET),

    # final dataset (after cleaning)
    "rows_out": int(len(df)),
    "cols_out": int(df.shape[1]),

    # time references
    "t0_time": int(t0),
    "t_rel_max_us": int(T_REL_MAX_US),

    # hour==0 artifact (before/after)
    "rows_before_hour0": int(rows_before_hour0),
    "rows_after_hour0": int(rows_after_hour0),
    "rows_removed_hour0": int(rows_before_hour0 - rows_after_hour0),

    "hour_min_before": int(hour_min_before),
    "hour_max_before": int(hour_max_before),
    "hour_min_after": int(hour_min_after),
    "hour_max_after": int(hour_max_after),
}
summary_file = REPORTS_PATH / "02_clean_normalize_summary.json"
summary_file.write_text(json.dumps(summary02, indent=2, ensure_ascii=False), encoding="utf-8")
log(f"Resumo salvo: {summary_file}")

df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[02_clean_normalize] Lido do parquet validado.
[02_clean_normalize] Horas (min..max): 0..744
[02_clean_normalize] Registros após corte T_REL_MAX_US: 405891
[02_clean_normalize] Salvo: /content/drive/MyDrive/Mestrado/02-datasets/02-processed/google_trace_clean.parquet
[02_clean_normalize] Resumo salvo: /content/drive/MyDrive/Mestrado/04-reports/02_clean_normalize_summary.json


| | time | collection_id | scheduling_class | priority | instance_index | machine_id | resource_request | scheduler | start_time | end_time | ... | assigned_memory | page_cache_memory | cycles_per_instruction | memory_accesses_per_instruction | sample_rate | cluster | event | failed | t_rel_us | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2517305308183 | 260697606809 | 2 | 360 | 335 | 85515092 | {'cpus': 0.00724029541015625, 'memory': 0.0013... | 0.0 | 1800713000000 | 1800714000000 | ... | 0.0 | 0.0 | NaN | NaN | 1.0 | 7 | FAIL | 1 | 2517305308183 | 699 |
| 2 | 195684022913 | 276227177776 | 2 | 103 | 376 | 169321752432 | {'cpus': 0.048583984375, 'memory': 0.004165649... | 1.0 | 81300000000 | 81600000000 | ... | 0.010422 | 0.000235 | 0.939919 | 0.001318 | 1.0 | 7 | SCHEDULE | 0 | 195684022913 | 54 |
| 4 | 1810627494172 | 25911621841 | 2 | 0 | 3907 | 231364893292 | {'cpus': 0.00244903564453125, 'memory': 0.0002... | 0.0 | 1565315000000 | 1565317000000 | ... | 0.000272 | 1e-05 | NaN | NaN | 1.0 | 2 | FINISH | 0 | 1810627494172 | 502 |
| 5 | 1626744497194 | 235085571060 | 0 | 103 | 345 | 34202965855 | {'cpus': 0.0615234375, 'memory': 0.00540924072... | 1.0 | 1626600000000 | 1626761000000 | ... | 0.005409 | 0.001844 | NaN | NaN | 1.0 | 5 | SCHEDULE | 0 | 1626744497194 | 451 |
| 6 | 130721370174 | 275444626052 | 1 | 117 | 13138 | 10129440520 | {'cpus': 0.00566864013671875, 'memory': 0.0015... | 0.0 | 343800000000 | 344100000000 | ... | 0.00219 | 6e-06 | 0.646689 | 0.007937 | 1.0 | 7 | ENABLE | 0 | 130721370174 | 36 |
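As a quick cross-check of the preview, the invariants the cleaning step is meant to guarantee can be asserted on the five rows shown. This is a sketch under assumptions: `check_clean_invariants` and `US_PER_HOUR` are hypothetical names, and the sample frame copies only the time, t_rel_us and hour values from the output above rather than loading the saved parquet.

```python
import pandas as pd

US_PER_HOUR = 3_600_000_000  # microseconds in one hour (assumed helper constant)


def check_clean_invariants(df: pd.DataFrame) -> list:
    """Return a list of violated invariants (an empty list means all checks passed)."""
    problems = []
    if (df["t_rel_us"] < 0).any():
        problems.append("negative t_rel_us")
    if (df["hour"] == 0).any():
        problems.append("hour==0 rows present")
    if not (df["hour"] == df["t_rel_us"] // US_PER_HOUR).all():
        problems.append("hour does not match t_rel_us")
    return problems


# time, t_rel_us and hour copied from the five preview rows.
sample = pd.DataFrame(
    {
        "time": [2517305308183, 195684022913, 1810627494172, 1626744497194, 130721370174],
        "t_rel_us": [2517305308183, 195684022913, 1810627494172, 1626744497194, 130721370174],
        "hour": [699, 54, 502, 451, 36],
    }
)
problems = check_clean_invariants(sample)
print(problems or "all invariants hold on the sample")

# In these rows t_rel_us equals time, which is consistent with t0 = 0.
t0_is_zero_in_sample = bool((sample["time"] == sample["t_rel_us"]).all())
```

The same function could be pointed at the full cleaned frame; the row labels 0 and 3 missing from the preview are consistent with rows having been dropped without resetting the index.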
