<a href="https://colab.research.google.com/github/tivanello/fase5/blob/main/notebooks/Entrega%20DATATHON%20Fase5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Importing the dataset directly from GitHub (3 sheets)**

This block reads the Excel file stored in the GitHub repository (folder **data/raw/**) and loads its three sheets (**PEDE2022**, **PEDE2023**, **PEDE2024**) into separate dataframes. It then adds an **ano_base** column identifying the year of each record and concatenates everything into the final dataframe **df_fase5**.

**What each part does:**

1) **Builds the GitHub "raw" URL**  
- **base** points to the file's folder through **raw.githubusercontent.com**, which serves the file contents directly so Python can download it.  
- **nome** is the file name exactly as it appears in the repository.

2) **Handles spaces in the file name**  
- **quote(nome)** converts spaces and special characters into a URL-safe form (for example, a space becomes **%20**).  
- This avoids a "file not found" error when the name contains spaces.

3) **Checks the Excel sheets before reading**  
- **pd.ExcelFile(url)** opens the workbook so its structure can be inspected.  
- **xls.sheet_names** returns the list of sheets found; the cell prints it so the exact names can be confirmed.

4) **Reads the three sheets**  
- **pd.read_excel(url, sheet_name="PEDE2022")** (and the equivalents) creates **df_2022**, **df_2023**, **df_2024**.  
- If a sheet name differs (upper/lower case, spaces), adjust **sheet_name** to match what was printed after **Abas encontradas**.

5) **Creates the ano_base column**  
- Adds the corresponding year to each dataframe so every record stays traceable after the concat.

6) **Concatenates into a single dataframe**  
- **pd.concat([df_2022, df_2023, df_2024], ignore_index=True)** stacks everything into **df_fase5** and rebuilds the index from zero.

7) **Quick validation**  
- Prints the **shape** of each year and of the final dataframe.  
- Shows the first rows (**head()**) to confirm the import worked.
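
To make step 2 concrete, the short sketch below prints what `quote` produces for the file name used in this notebook. It only builds the string and does not download anything; the variable `encoded` is introduced here for illustration.

```python
from urllib.parse import quote

base = "https://raw.githubusercontent.com/tivanello/fase5/main/data/raw/"
nome = "BASE DE DADOS PEDE 2024 - DATATHON.xlsx"

# quote() percent-encodes characters that are not URL-safe;
# letters, digits, "-" and "." pass through unchanged.
encoded = quote(nome)
url = base + encoded

print(encoded)
print(url)
```

Every space in the file name becomes `%20`, while the hyphen and the extension dot are left alone.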



In [44]:
###############################################################################################################################################
# Import the sheets (PEDE2022, PEDE2023, PEDE2024) and build df_fase5
#
# What this cell does:
# - Builds the raw GitHub URL of the Excel file, escaping spaces and special characters in the file name.
# - Opens the remote workbook and lists the available sheets (quick check of the names).
# - Reads the three sheets (PEDE2022, PEDE2023, PEDE2024) into separate dataframes.
# - Adds an ano_base column to each one to track where every record came from.
# - Concatenates everything into a single dataframe (df_fase5) for EDA and modeling.
# - Prints the shapes and displays a few rows to confirm the load worked.
###############################################################################################################################################

import pandas as pd
from urllib.parse import quote

base = "https://raw.githubusercontent.com/tivanello/fase5/main/data/raw/"
nome = "BASE DE DADOS PEDE 2024 - DATATHON.xlsx"

url = base + quote(nome)

# Check the sheet names
xls = pd.ExcelFile(url)
print("Abas encontradas:", xls.sheet_names)

# Read the sheets (adjust sheet_name if the case or spacing differs)
df_2022 = pd.read_excel(url, sheet_name="PEDE2022")
df_2023 = pd.read_excel(url, sheet_name="PEDE2023")
df_2024 = pd.read_excel(url, sheet_name="PEDE2024")

# Tag each frame with its year so the origin survives the concat
df_2022["ano_base"] = 2022
df_2023["ano_base"] = 2023
df_2024["ano_base"] = 2024

# Concatenate; columns missing from a given year are filled with NaN
df_fase5 = pd.concat([df_2022, df_2023, df_2024], ignore_index=True)

print("Shapes:", df_2022.shape, df_2023.shape, df_2024.shape)
print("df_fase5 shape:", df_fase5.shape)
display(df_fase5.head())



Abas encontradas: ['PEDE2022', 'PEDE2023', 'PEDE2024']
Shapes: (860, 43) (1014, 49) (1156, 51)
df_fase5 shape: (3030, 64)


Unnamed: 0,RA,Fase,Turma,Nome,Ano nasc,Idade 22,Gênero,Ano ingresso,Instituição de ensino,Pedra 20,...,Fase Ideal,Defasagem,Destaque IPV.1,INDE 2024,Pedra 2024,Avaliador5,Avaliador6,Escola,Ativo/ Inativo,Ativo/ Inativo.1
0,RA-1,7,A,Aluno-1,2003.0,19.0,Menina,2016,Escola Pública,Ametista,...,,,,,,,,,,
1,RA-2,7,A,Aluno-2,2005.0,17.0,Menina,2017,Rede Decisão,Ametista,...,,,,,,,,,,
2,RA-3,7,A,Aluno-3,2005.0,17.0,Menina,2016,Rede Decisão,Ametista,...,,,,,,,,,,
3,RA-4,7,A,Aluno-4,2005.0,17.0,Menino,2017,Rede Decisão,Ametista,...,,,,,,,,,,
4,RA-5,7,A,Aluno-5,2005.0,17.0,Menina,2016,Rede Decisão,Ametista,...,,,,,,,,,,
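
The output above shows 43, 49 and 51 columns per sheet but 64 in `df_fase5`. That is expected: `pd.concat` keeps the union of the column names and fills the gaps with NaN. The sketch below reproduces the effect on two toy frames; the values and the `sheets` dictionary are hypothetical and not taken from the real workbook.

```python
import pandas as pd

# Toy stand-ins for two yearly sheets (hypothetical values).
sheets = {
    2022: pd.DataFrame({"RA": ["RA-1", "RA-2"], "Idade 22": [19, 17]}),
    2023: pd.DataFrame({"RA": ["RA-1"], "Idade": [18]}),
}

# Tag each frame with its year, then stack them.
frames = [frame.assign(ano_base=year) for year, frame in sheets.items()]
combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
print(combined.isna().sum())
```

Columns that exist in only one year (`Idade 22`, `Idade`) end up null for the other year's rows, which is why many columns in `df_fase5.info()` have far fewer than 3030 non-null values.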


In [45]:
df_fase5.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3030 entries, 0 to 3029
Data columns (total 64 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   RA                     3030 non-null   object 
 1   Fase                   3030 non-null   object 
 2   Turma                  3030 non-null   object 
 3   Nome                   860 non-null    object 
 4   Ano nasc               860 non-null    float64
 5   Idade 22               860 non-null    float64
 6   Gênero                 3030 non-null   object 
 7   Ano ingresso           3030 non-null   int64  
 8   Instituição de ensino  3029 non-null   object 
 9   Pedra 20               754 non-null    object 
 10  Pedra 21               1061 non-null   object 
 11  Pedra 22               1932 non-null   object 
 12  INDE 22                1932 non-null   float64
 13  Cg                     860 non-null    float64
 14  Cf                     860 non-null    float64
 15  Ct  

In [46]:
###############################################################################################################################################
# BLOCK: Post-import sanity checks on df_fase5
#
# What this cell does:
# - Confirms that the concatenation is consistent (row count per ano_base).
# - Counts unique students (RA) per ano_base.
# - Checks for RAs duplicated within the same year (which would distort EDA/modeling).
# - Lists columns that are 100% empty (leftovers from the Excel export).
# - Flags "suspicious" columns stored as text (object) that should be numeric or dates,
#   and shows sample values so the pattern is understood before fixing it.
###############################################################################################################################################

# 1) Rows per year + unique RAs
print("Linhas por ano_base:")
print(df_fase5["ano_base"].value_counts(dropna=False).sort_index())

print("\nRAs únicos por ano_base:")
print(df_fase5.groupby("ano_base")["RA"].nunique())

# 2) Duplicate RAs within the same year
dup = df_fase5.duplicated(subset=["ano_base", "RA"], keep=False)
print("\nLinhas com RA duplicado dentro do mesmo ano:", int(dup.sum()))
if dup.any():
    display(df_fase5.loc[dup, ["ano_base", "RA", "Turma"]].sort_values(["ano_base","RA"]).head(20))

# 3) Columns that are 100% empty
zero_non_null = [c for c in df_fase5.columns if df_fase5[c].notna().sum() == 0]

print("Qtd de colunas 100% vazias:", len(zero_non_null))
print("Colunas 100% vazias:")
for c in zero_non_null:
    print("-", c)

# 4) Suspicious dtypes (object)
suspeitas = [c for c in df_fase5.columns
             if ("INDE" in c or c in ["Idade","Data de Nasc","INDE 2024"])
             and df_fase5[c].dtype == "object"]
print("\nColunas suspeitas (object onde deveria ser numérico/data):", suspeitas)

# 5) Sample values from the suspicious columns
for c in suspeitas[:6]:
    print("\nAmostra ->", c)
    display(df_fase5[c].dropna().astype(str).head(10))



Linhas por ano_base:
ano_base
2022     860
2023    1014
2024    1156
Name: count, dtype: int64

RAs únicos por ano_base:
ano_base
2022     860
2023    1014
2024    1156
Name: RA, dtype: int64

Linhas com RA duplicado dentro do mesmo ano: 0
Qtd de colunas 100% vazias: 1
Colunas 100% vazias:
- Destaque IPV.1

Colunas suspeitas (object onde deveria ser numérico/data): ['Data de Nasc', 'Idade', 'INDE 2024']

Amostra -> Data de Nasc


Unnamed: 0,Data de Nasc
860,6/17/2015
861,5/31/2014
862,2/25/2016
863,2015-12-03 00:00:00
864,11/13/2014
865,2016-10-02 00:00:00
866,6/29/2015
867,2015-08-11 00:00:00
868,1/15/2015
869,10/20/2014



Amostra -> Idade


Unnamed: 0,Idade
860,8
861,9
862,7
863,1900-01-08 00:00:00
864,8
865,1900-01-07 00:00:00
866,8
867,1900-01-07 00:00:00
868,8
869,9



Amostra -> INDE 2024


Unnamed: 0,INDE 2024
1874,7.611366666700001
1875,8.002866666700001
1876,7.952200000100001
1877,7.156366666600001
1878,5.444199999900001
1879,8.0822
1880,8.959700000000002
1881,7.346677272700001
1882,8.152624242500002
1883,7.982890909100001
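
The row count and the unique-RA count match for every year above, which is what rules out duplicates. As a minimal sketch of how the same two checks behave when a duplicate does exist, here is a toy frame (hypothetical values) with one RA repeated in 2022:

```python
import pandas as pd

# Toy data (hypothetical): RA-1 appears twice in 2022.
df = pd.DataFrame({
    "ano_base": [2022, 2022, 2023, 2023],
    "RA": ["RA-1", "RA-1", "RA-1", "RA-2"],
})

# keep=False marks every row of a duplicated (ano_base, RA) pair, not just the repeats.
dup = df.duplicated(subset=["ano_base", "RA"], keep=False)

# Rows vs. unique RAs per year: a mismatch points to duplicates in that year.
per_year = df.groupby("ano_base")["RA"].agg(rows="size", unique="nunique")

print(dup.tolist())
print(per_year)
```

The same RA appearing in different years (RA-1 in 2022 and 2023) is not a duplicate, because `ano_base` is part of the key.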


In [47]:
###############################################################################################################################################
#  Copy df_fase5 to work safely
#
# What this cell does:
# - Creates a copy of df_fase5 so the original cannot be damaged by accident.
###############################################################################################################################################

import pandas as pd
import numpy as np

df = df_fase5.copy()
print("OK. Cópia criada. Shape:", df.shape)


OK. Cópia criada. Shape: (3030, 64)


In [48]:
###############################################################################################################################################
#  Drop columns that are 100% empty (no filled values)
#
# What this cell does:
# - Finds columns with no filled values at all (only NaN).
# - Drops them: they are leftovers from the Excel export and carry no information.
###############################################################################################################################################

vazias = [c for c in df.columns if df[c].notna().sum() == 0]
print("Colunas 100% vazias (serão removidas):", vazias)

if vazias:
    df = df.drop(columns=vazias)

print("OK. Shape após remover vazias:", df.shape)


Colunas 100% vazias (serão removidas): ['Destaque IPV.1']
OK. Shape após remover vazias: (3030, 63)
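
As an alternative to the list comprehension, pandas can find and drop all-null columns directly. The sketch below uses a toy frame with hypothetical column names and is equivalent in effect, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): column "b" has no filled values at all.
df = pd.DataFrame({"a": [1, 2], "b": [np.nan, np.nan], "c": [None, "x"]})

# Names of the columns where every value is null.
empty_cols = df.columns[df.isna().all()].tolist()

# how="all" drops a column only when all of its values are null.
cleaned = df.dropna(axis=1, how="all")

print(empty_cols)
print(cleaned.columns.tolist())
```

Listing `empty_cols` before dropping keeps a record of what was removed, which is what the original cell prints as well.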


In [49]:
###############################################################################################################################################
#  Convert Data de Nasc to datetime
#
# What this cell does:
# - Converts the Data de Nasc column to datetime.
# - The column mixes formats (e.g. 6/17/2015 and 2015-12-03 00:00:00).
# - With errors="coerce", any value pandas cannot parse becomes NaT (null date) instead of raising,
#   so it is worth comparing the null count before and after the conversion.
###############################################################################################################################################

if "Data de Nasc" in df.columns:
    df["Data de Nasc"] = pd.to_datetime(df["Data de Nasc"], errors="coerce")
    print("OK. Data de Nasc ->", df["Data de Nasc"].dtype)
else:
    print("Coluna Data de Nasc não existe no df.")


OK. Data de Nasc -> datetime64[ns]
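
Because `errors="coerce"` silently turns unparseable values into NaT, a quick way to see whether anything was lost is to compare nulls before and after. The sketch below parses each value on its own, so that no single inferred format decides how the others are read, and lists the values that failed. The toy series is hypothetical; on the real column, replace `raw` with the original `Data de Nasc` values.

```python
import pandas as pd

# Toy values (hypothetical), mimicking the mixed formats seen in "Data de Nasc".
raw = pd.Series(["6/17/2015", "2015-12-03 00:00:00", "not a date", None], dtype="object")

# Parse element by element; failures become NaT.
parsed = pd.Series(
    [pd.to_datetime(value, errors="coerce") for value in raw],
    index=raw.index,
    dtype="datetime64[ns]",
)

# Values that were filled in the source but are null after parsing.
lost = raw.notna() & parsed.isna()

print(parsed)
print("Lost to coercion:", raw[lost].tolist())
```

Element-wise parsing is slower than a single vectorized call, so it is best kept as a diagnostic on columns of this size rather than a default.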


In [50]:
###############################################################################################################################################
#  Convert Idade to integer (fixing a typical Excel artifact)
#
# What this cell does:
# - Tries to convert Idade to a number.
# - Where Idade was stored as a date (1900-01-xx), takes the day as the age (e.g. 1900-01-08 -> 8).
#   This only covers ages up to 31, since larger values fall outside January 1900.
# - Idade ends up as Int64 (integer dtype that supports nulls).
###############################################################################################################################################

if "Idade" in df.columns:
    s = df["Idade"]

    idade_num = pd.to_numeric(s, errors="coerce")
    idade_dt = pd.to_datetime(s, errors="coerce")

    idade_from_date = np.where(
        idade_dt.notna() & (idade_dt.dt.year == 1900) & (idade_dt.dt.month == 1),
        idade_dt.dt.day,
        np.nan
    )

    # Reuse the original index so fillna aligns row by row even if df's index is not 0..n-1.
    df["Idade"] = idade_num.fillna(pd.Series(idade_from_date, index=s.index)).astype("Int64")
    print("OK. Idade ->", df["Idade"].dtype)
else:
    print("Coluna Idade não existe no df.")


OK. Idade -> Int64
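
The `1900-01-08` values appear because Excel stored the number 8 in a date-formatted cell, and day 8 of Excel's calendar is 8 January 1900. Taking the day of the month works for the ages seen here; the sketch below shows a slightly more general variant that counts days since 31 December 1899, so an age such as 32 (stored as 1900-02-01) is also recovered. It assumes the date cells arrive as `datetime` objects and is only meant for values below 60, where Excel's calendar and the real one still agree. The helper `recover_age` and the toy series are hypothetical.

```python
from datetime import datetime

import pandas as pd

# Day zero for this sketch: Excel's day 1 is 1900-01-01.
EXCEL_DAY_ZERO = datetime(1899, 12, 31)

def recover_age(value):
    """Return the day count behind a date-formatted cell, or the value unchanged."""
    if isinstance(value, datetime):  # pd.Timestamp is a datetime subclass
        return (value - EXCEL_DAY_ZERO).days
    return value

# Toy column (hypothetical): plain numbers mixed with date-formatted ones.
raw = pd.Series(
    [8, pd.Timestamp("1900-01-08"), 9, pd.Timestamp("1900-02-01"), "n/a"],
    dtype="object",
)

ages = pd.to_numeric(raw.map(recover_age), errors="coerce").astype("Int64")
print(ages.tolist())
```

Anything that is neither a number nor a date ends up as `<NA>`, matching the behavior of the cell above.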


In [51]:
###############################################################################################################################################
#  Convert INDE 2024 to float
#
# What this cell does:
# - Converts INDE 2024 to a number (float).
# - Text values become float; anything unparseable becomes NaN.
###############################################################################################################################################

if "INDE 2024" in df.columns:
    df["INDE 2024"] = pd.to_numeric(df["INDE 2024"], errors="coerce")
    print("OK. INDE 2024 ->", df["INDE 2024"].dtype)
else:
    print("Coluna INDE 2024 não existe no df.")


OK. INDE 2024 -> float64


In [52]:
###############################################################################################################################################
# Final check + update df_fase5
#
# What this cell does:
# - Prints the final dtypes of the critical columns.
# - Points df_fase5 at the fixed copy so the rest of the flow uses the corrected data.
###############################################################################################################################################

for c in ["Data de Nasc", "Idade", "INDE 2024"]:
    if c in df.columns:
        print(c, "->", df[c].dtype)

df_fase5 = df
print("OK. df_fase5 atualizado. Shape:", df_fase5.shape)


Data de Nasc -> datetime64[ns]
Idade -> Int64
INDE 2024 -> float64
OK. df_fase5 atualizado. Shape: (3030, 63)


In [53]:
###############################################################################################################################################
# BLOCK: General dtype audit (diagnostic only)
#
# What this cell does:
# - Scans every column of df_fase5 and flags it as "suspicious" when:
#   1) it is text (object/string) but looks numeric (most values become numbers after cleaning);
#   2) it is text (object/string), its name suggests a date (e.g. data, nasc, nascimento),
#      and most values parse as datetime;
#   3) it is float but looks like an integer (nearly all values are whole; it is float only because of NaN).
# - Date parsing is attempted only on columns whose name suggests a date (avoids warnings and cost).
###############################################################################################################################################

import pandas as pd
import numpy as np
import re
import warnings

df = df_fase5.copy()

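def clean_numeric_series(s):
    # Caveat: every non-numeric character is stripped, so labels such as "Avaliador-21"
    # reduce to "-21" and count as numeric. Treat parece_numero as a hint to inspect, not a verdict.
    x = s.astype("string").str.strip()
    x = x.str.replace(r"\.", "", regex=True)     # drop dots (thousands separator)
    x = x.str.replace(",", ".", regex=False)     # decimal comma -> dot
    x = x.str.replace(r"[^0-9\.\-]", "", regex=True)
    return pd.to_numeric(x, errors="coerce")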

def looks_like_date_col(colname):
    c = str(colname).lower()
    return any(k in c for k in ["data", "dt", "nasc", "nascimento", "date"])

def guess_profile(col):
    s = df[col]
    dtype = str(s.dtype)

    sample = s.dropna()
    if len(sample) > 500:
        sample = sample.sample(500, random_state=42)

    pct_num = None
    pct_date = None
    pct_intlike = None

    if dtype in ["object", "string"]:
        num = clean_numeric_series(sample)
        pct_num = float(num.notna().mean()) if len(sample) else 0.0

        # Only try date parsing when the column name suggests a date
        if looks_like_date_col(col):
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")
                dt = pd.to_datetime(sample.astype("string").str.strip(), errors="coerce")
            pct_date = float(dt.notna().mean()) if len(sample) else 0.0
        else:
            pct_date = 0.0

    if "float" in dtype:
        vals = sample.astype(float)
        pct_intlike = float(np.isclose(vals.dropna() % 1, 0).mean()) if vals.notna().any() else 0.0

    return dtype, pct_num, pct_date, pct_intlike, sample.astype("string").head(8).tolist()

suspeitas = []

for col in df.columns:
    dtype, pct_num, pct_date, pct_intlike, amostra = guess_profile(col)

    flag = False
    motivo = []

    if pct_num is not None and pct_num >= 0.85:
        flag = True
        motivo.append(f"parece_numero={pct_num:.0%}")

    # flag "parece_data" only for columns whose name suggests a date
    if looks_like_date_col(col) and pct_date is not None and pct_date >= 0.60:
        flag = True
        motivo.append(f"parece_data={pct_date:.0%}")

    if pct_intlike is not None and pct_intlike >= 0.98:
        flag = True
        motivo.append(f"float_parece_inteiro={pct_intlike:.0%}")

    if flag:
        suspeitas.append({
            "coluna": col,
            "dtype_atual": dtype,
            "motivo": "; ".join(motivo),
            "amostra": amostra
        })

suspeitas_df = pd.DataFrame(suspeitas).sort_values(["dtype_atual","coluna"]).reset_index(drop=True)

print("Qtd colunas suspeitas:", len(suspeitas_df))
display(suspeitas_df)


Qtd colunas suspeitas: 21


Unnamed: 0,coluna,dtype_atual,motivo,amostra
0,Ano nasc,float64,float_parece_inteiro=100%,"[2013.0, 2012.0, 2006.0, 2008.0, 2009.0, 2013...."
1,Cf,float64,float_parece_inteiro=100%,"[166.0, 169.0, 33.0, 140.0, 114.0, 6.0, 75.0, ..."
2,Cg,float64,float_parece_inteiro=100%,"[604.0, 704.0, 413.0, 843.0, 649.0, 32.0, 279...."
3,Ct,float64,float_parece_inteiro=100%,"[9.0, 14.0, 5.0, 7.0, 9.0, 1.0, 4.0, 3.0]"
4,Defas,float64,float_parece_inteiro=100%,"[-1.0, -1.0, -2.0, -1.0, -1.0, -1.0, -1.0, -2.0]"
5,Defasagem,float64,float_parece_inteiro=100%,"[-1.0, -2.0, -1.0, 0.0, -1.0, -1.0, 0.0, -2.0]"
6,IAN,float64,float_parece_inteiro=98%,"[5.0, 5.0, 5.0, 10.0, 5.0, 5.0, 5.0, 5.0]"
7,Idade 22,float64,float_parece_inteiro=100%,"[9.0, 10.0, 16.0, 14.0, 13.0, 9.0, 10.0, 15.0]"
8,Nº Av,float64,float_parece_inteiro=100%,"[3.0, 3.0, 4.0, 0.0, 2.0, 4.0, 2.0, 3.0]"
9,Avaliador1,object,parece_numero=100%,"[Avaliador-21, Avaliador-13, Avaliador-22, Ava..."
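
`Avaliador1` is flagged as `parece_numero=100%` even though its values are labels such as `Avaliador-21`: the cleaning step strips the letters and leaves `-21`, which parses as a number. A stricter check only counts a value as numeric when the whole string already looks like a number. The sketch below is one possible variant, not the notebook's method; `NUMERIC_PATTERN` and `share_numeric` are hypothetical names.

```python
import pandas as pd

# Optional sign, digits, and at most one decimal part written with "." or ",".
NUMERIC_PATTERN = r"[+-]?\d+(?:[.,]\d+)?"

def share_numeric(series):
    """Fraction of non-null values whose full text matches NUMERIC_PATTERN."""
    text = series.dropna().astype("string").str.strip()
    if text.empty:
        return 0.0
    return float(text.str.fullmatch(NUMERIC_PATTERN).mean())

labels = pd.Series(["Avaliador-21", "Avaliador-13"])
mixed = pd.Series(["7,5", "8.0", "-3", "x"])

print(share_numeric(labels))
print(share_numeric(mixed))
```

With this rule the evaluator labels score 0%, while a column of genuine numbers stored as text would still be caught. The pattern deliberately ignores thousands separators, which would need their own rule.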


In [54]:
###############################################################################################################################################
# Fix the dtype of "Ano nasc" (float -> nullable integer)
#
# What this cell does:
# - Converts "Ano nasc" to numeric safely (errors coerced to NaN).
# - Rounds (values arrive as 2013.0) and casts to Int64 (accepts NA).
# - Sets values outside a plausible range to NA (keeps junk out).
# - Prints the null count and a sample before/after.
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

COL = "Ano nasc"

print("Antes:")
print(" dtype:", df[COL].dtype)
print(" nulos:", int(df[COL].isna().sum()))
print(" amostra:", df[COL].dropna().head(10).tolist())

# 1) make sure the values are numeric
s = pd.to_numeric(df[COL], errors="coerce")

# 2) the audit showed 100% whole numbers, so rounding only strips the .0
s = s.round(0)

# 3) validate a plausible range (adjust as needed)
#    Example: students born between 1990 and 2020
MIN_ANO = 1990
MAX_ANO = 2020
s = s.where((s >= MIN_ANO) & (s <= MAX_ANO), pd.NA)

# 4) store as nullable integer
df[COL] = s.astype("Int64")

print("\nDepois:")
print(" dtype:", df[COL].dtype)
print(" nulos:", int(df[COL].isna().sum()))
print(" amostra:", df[COL].dropna().head(10).tolist())

# keep df_fase5 in sync with the fixed copy
df_fase5 = df


Antes:
 dtype: float64
 nulos: 2170
 amostra: [2003.0, 2005.0, 2005.0, 2005.0, 2005.0, 2004.0, 2004.0, 2002.0, 2004.0, 2004.0]

Depois:
 dtype: Int64
 nulos: 2170
 amostra: [2003, 2005, 2005, 2005, 2005, 2004, 2004, 2002, 2004, 2004]
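
The same three steps (coerce to numeric, round, null out implausible values, cast to `Int64`) are repeated for several columns in the following cells. A minimal sketch of how they could be wrapped in one helper is shown below; `to_nullable_int` is a hypothetical name, and the bounds are optional so that columns with only a lower limit can reuse it.

```python
import pandas as pd

def to_nullable_int(series, min_value=None, max_value=None):
    """Coerce to numeric, round, null out-of-range values, return an Int64 series."""
    s = pd.to_numeric(series, errors="coerce").round(0)
    if min_value is not None:
        s = s.where(s >= min_value)  # values below the bound become NaN
    if max_value is not None:
        s = s.where(s <= max_value)  # values above the bound become NaN
    return s.astype("Int64")

# Toy input (hypothetical): a float, a null, an implausible year and a numeric string.
raw = pd.Series([2003.0, None, 1850.0, "2010"], dtype="object")
years = to_nullable_int(raw, min_value=1990, max_value=2020)

print(years.tolist())
```

Comparing the null count before and after, as the cells do, remains the easiest way to see how many values the range check discarded.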


In [55]:
###############################################################################################################################################
# BLOCK: Fix the dtype of the ranking columns (Cf, Cg, Ct) -> nullable integer (Int64)
#
# Context (data dictionary):
# - CF_2022 = ranking within the Fase
# - CG_2022 = overall ranking
# - CT_2022 = ranking within the Turma
#
# What this cell does:
# - Converts Cf, Cg and Ct to numeric, coercing errors.
# - Rounds (drops the .0) and casts to Int64 (accepts NA).
# - Validates only the lower bound (a ranking must be >= 1); no upper bound, so valid values are never cut.
# - Prints before/after.
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

cols = ["Cf", "Cg", "Ct"]

for COL in cols:
    if COL not in df.columns:
        print(f"Coluna ausente: {COL} (pulando)")
        continue

    print("\n" + "-"*90)
    print(f"Coluna: {COL}")
    print("Antes -> dtype:", df[COL].dtype, "| nulos:", int(df[COL].isna().sum()))
    print("Amostra:", df[COL].dropna().head(10).tolist())

    s = pd.to_numeric(df[COL], errors="coerce").round(0)

    # valid ranking: >= 1 (0 and negatives become NA)
    s = s.where(s >= 1, pd.NA)

    df[COL] = s.astype("Int64")

    print("Depois -> dtype:", df[COL].dtype, "| nulos:", int(df[COL].isna().sum()))
    print("Amostra:", df[COL].dropna().head(10).tolist())

df_fase5 = df




------------------------------------------------------------------------------------------
Coluna: Cf
Antes -> dtype: float64 | nulos: 2170
Amostra: [18.0, 8.0, 13.0, 15.0, 6.0, 16.0, 11.0, 21.0, 1.0, 17.0]
Depois -> dtype: Int64 | nulos: 2170
Amostra: [18, 8, 13, 15, 6, 16, 11, 21, 1, 17]

------------------------------------------------------------------------------------------
Coluna: Cg
Antes -> dtype: float64 | nulos: 2170
Amostra: [753.0, 469.0, 629.0, 731.0, 344.0, 745.0, 550.0, 836.0, 113.0, 752.0]
Depois -> dtype: Int64 | nulos: 2170
Amostra: [753, 469, 629, 731, 344, 745, 550, 836, 113, 752]

------------------------------------------------------------------------------------------
Coluna: Ct
Antes -> dtype: float64 | nulos: 2170
Amostra: [10.0, 3.0, 6.0, 7.0, 2.0, 8.0, 5.0, 13.0, 1.0, 9.0]
Depois -> dtype: Int64 | nulos: 2170
Amostra: [10, 3, 6, 7, 2, 8, 5, 13, 1, 9]


In [56]:
###############################################################################################################################################
# Merge and fix the dtype of "Defas" and "Defasagem" (keeping negative codes)
#
# What this cell does:
# - Converts Defas and Defasagem to numeric, coercing errors.
# - Creates "Defasagem_final":
#     uses Defasagem when filled, otherwise Defas.
# - Detects conflicts (both filled with different values).
# - Casts to Int64 (nullable integer), keeping negative values (codes).
# - Prints the value distribution (counts) for a quick check.
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

COL_A = "Defas"
COL_B = "Defasagem"
COL_OUT = "Defasagem_final"

# warn about missing columns
for c in [COL_A, COL_B]:
    if c not in df.columns:
        print(f"Coluna ausente: {c}")

# convert to numeric (negatives are preserved); a missing column becomes all-NaN
if COL_A in df.columns:
    a = pd.to_numeric(df[COL_A], errors="coerce").round(0)
else:
    a = pd.Series(float("nan"), index=df.index, dtype="float64")

if COL_B in df.columns:
    b = pd.to_numeric(df[COL_B], errors="coerce").round(0)
else:
    b = pd.Series(float("nan"), index=df.index, dtype="float64")

# conflicts: both columns filled, with different values
mask_conf = a.notna() & b.notna() & (a != b)
print("Conflitos (Defas vs Defasagem):", int(mask_conf.sum()))

if int(mask_conf.sum()) > 0:
    print("\nAmostra de conflitos (até 20):")
    display(df.loc[mask_conf, ["ano_base", COL_A, COL_B]].head(20))

# final column: prefer Defasagem, fall back to Defas
out = b.where(b.notna(), a)

# cast to Int64 (nullable integer), keeping negatives
df[COL_OUT] = out.astype("Int64")

print("\nResumo Defasagem_final:")
print(" dtype:", df[COL_OUT].dtype)
print(" nulos:", int(df[COL_OUT].isna().sum()))
print(" valores (contagem):")
print(df[COL_OUT].value_counts(dropna=False).sort_index())

# keep df_fase5 in sync with the fixed copy
df_fase5 = df


Conflitos (Defas vs Defasagem): 0

Resumo Defasagem_final:
 dtype: Int64
 nulos: 0
 valores (contagem):
Defasagem_final
-5       1
-4       5
-3      39
-2     383
-1    1259
0     1152
1      165
2       24
3        2
Name: count, dtype: Int64
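
The "prefer one column, fall back to the other" rule can also be written with `combine_first`. The sketch below shows it on toy values (hypothetical) together with the conflict check, including one row where the two columns disagree:

```python
import pandas as pd

# Toy data (hypothetical): the last row has conflicting values.
df = pd.DataFrame({
    "Defas": [-1.0, None, 0.0, 2.0],
    "Defasagem": [None, -2.0, 0.0, 1.0],
})

# A conflict needs both values present and different.
conflicts = df["Defas"].notna() & df["Defasagem"].notna() & (df["Defas"] != df["Defasagem"])

# combine_first keeps Defasagem where filled and takes Defas elsewhere.
merged = df["Defasagem"].combine_first(df["Defas"]).astype("Int64")

print("Conflicts:", int(conflicts.sum()))
print(merged.tolist())
```

When a conflict exists the preferred column wins silently, so it is worth inspecting the conflicting rows before trusting the merged result. In this dataset the check found none.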


In [57]:
###############################################################################################################################################
#  Sanity check of IAN (numeric dtype and plausible range)
#
# What this cell does:
# - Forces IAN to numeric (float), coercing errors.
# - Prints min/max and any suspicious values (outside the expected range).
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

COL = "IAN"

df[COL] = pd.to_numeric(df[COL], errors="coerce")

print("IAN -> dtype:", df[COL].dtype)
print("nulos:", int(df[COL].isna().sum()))
print("min:", df[COL].min(), "| max:", df[COL].max())

# adjust the range if you know the indicator's actual rule
MIN_V = 0
MAX_V = 10

sus = df[(df[COL].notna()) & ((df[COL] < MIN_V) | (df[COL] > MAX_V))]
print("suspeitos fora da faixa 0–10:", len(sus))

if len(sus) > 0:
    display(sus[["ano_base", "RA", COL]].head(20))


IAN -> dtype: float64
nulos: 0
min: 2.5 | max: 10.0
suspeitos fora da faixa 0–10: 0
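
The range test can be expressed a little more compactly with `Series.between`, which is inclusive on both ends by default. A small sketch on toy values (hypothetical), using the same 0 to 10 bounds assumed in the cell above:

```python
import pandas as pd

# Toy values (hypothetical): one value above the assumed 0..10 range and one null.
ian = pd.Series([2.5, 5.0, 10.0, 12.0, None])

# between() is False for nulls, so notna() keeps them out of the suspect list.
out_of_range = ian.notna() & ~ian.between(0, 10)

print(ian[out_of_range].tolist())
```

Nulls are reported separately by the null count, so excluding them here avoids counting the same problem twice.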


In [58]:
###############################################################################################################################################
# BLOCK: Range validation of "Idade" (already Int64 after the earlier fix)
#
# What this cell does:
# - Re-applies the numeric conversion to "Idade", coercing errors.
# - Rounds and casts to Int64 (accepts NA); the dtype does not change because the column is already Int64.
# - Sets ages outside a plausible range to NA (keeps junk out).
# - Prints before/after.
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

COL = "Idade"

print("Antes:")
print(" dtype:", df[COL].dtype)
print(" nulos:", int(df[COL].isna().sum()))
print(" amostra:", df[COL].dropna().head(10).tolist())

s = pd.to_numeric(df[COL], errors="coerce").round(0)

# plausible range (adjust as needed)
MIN_IDADE = 0
MAX_IDADE = 100
s = s.where((s >= MIN_IDADE) & (s <= MAX_IDADE), pd.NA)

df[COL] = s.astype("Int64")

print("\nDepois:")
print(" dtype:", df[COL].dtype)
print(" nulos:", int(df[COL].isna().sum()))
print(" amostra:", df[COL].dropna().head(10).tolist())

df_fase5 = df


Antes:
 dtype: Int64
 nulos: 860
 amostra: [8, 9, 7, 8, 8, 7, 8, 7, 8, 9]

Depois:
 dtype: Int64
 nulos: 860
 amostra: [8, 9, 7, 8, 8, 7, 8, 7, 8, 9]


In [59]:
###############################################################################################################################################
# BLOCK: Fix the dtype of "Nº Av" (float -> nullable integer)
#
# What this cell does:
# - Converts "Nº Av" to numeric, coercing errors.
# - Rounds (drops the .0) and casts to Int64 (accepts NA).
# - Validates the minimum and maximum (an evaluator count cannot be negative).
# - Prints before/after and the value distribution.
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

COL = "Nº Av"

print("Antes:")
print(" dtype:", df[COL].dtype)
print(" nulos:", int(df[COL].isna().sum()))
print(" amostra:", df[COL].dropna().head(10).tolist())

s = pd.to_numeric(df[COL], errors="coerce").round(0)

# valid count: >= 0
# upper bound: the data goes up to Avaliador6, so 0..6 is a sensible limit
MIN_V = 0
MAX_V = 6
s = s.where((s >= MIN_V) & (s <= MAX_V), pd.NA)

df[COL] = s.astype("Int64")

print("\nDepois:")
print(" dtype:", df[COL].dtype)
print(" nulos:", int(df[COL].isna().sum()))
print(" valores (contagem):")
print(df[COL].value_counts(dropna=False).sort_index())

df_fase5 = df


Antes:
 dtype: float64
 nulos: 76
 amostra: [4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]

Depois:
 dtype: Int64
 nulos: 76
 valores (contagem):
Nº Av
0        127
2        703
3       1100
4        876
5        142
6          6
<NA>      76
Name: count, dtype: Int64


In [60]:
###############################################################################################################################################
# BLOCK: Clean the Avaliador* columns (keep as text + normalize; optional: extract a numeric id)
#
# What this cell does:
# - Finds every column whose name starts with "Avaliador".
# - Normalizes the text: trims, collapses internal whitespace, standardizes dashes, lowercases.
# - Turns empty strings into NA.
# - (Optional) Creates Avaliador*_id columns by extracting the number (e.g. "avaliador-21" -> 21).
###############################################################################################################################################

import pandas as pd
import re

df = df_fase5.copy()

avaliador_cols = [c for c in df.columns if str(c).lower().startswith("avaliador")]
print("Colunas Avaliador* encontradas:", avaliador_cols)

def norm_txt(x):
    if pd.isna(x):
        return pd.NA
    s = str(x).strip()
    s = re.sub(r"\s+", " ", s)
    if s == "":
        return pd.NA
    s = s.replace("–", "-").replace("—", "-")
    s = s.lower()
    return s

for c in avaliador_cols:
    df[c] = df[c].apply(norm_txt).astype("string")

# optional: extract the numeric id (comment this loop out if you do not need it)
for c in avaliador_cols:
    df[c + "_id"] = (
        df[c]
        .str.extract(r"(\d+)", expand=False)
        .astype("Int64")
    )

# quick check
for c in avaliador_cols:
    print(f"\n{c} -> distintos:", int(df[c].nunique(dropna=True)), "| nulos:", int(df[c].isna().sum()))
    print(" amostra:", df[c].dropna().unique()[:8])

df_fase5 = df


Colunas Avaliador* encontradas: ['Avaliador1', 'Avaliador2', 'Avaliador3', 'Avaliador4', 'Avaliador5', 'Avaliador6']

Avaliador1 -> distintos: 20 | nulos: 203
 amostra: <StringArray>
[ 'avaliador-5',  'avaliador-6',  'avaliador-4',  'avaliador-7',
 'avaliador-11', 'avaliador-13', 'avaliador-19', 'avaliador-15']
Length: 8, dtype: string

Avaliador2 -> distintos: 23 | nulos: 203
 amostra: <StringArray>
['avaliador-27', 'avaliador-30',  'avaliador-2',  'avaliador-3',
  'avaliador-4',  'avaliador-5',  'avaliador-6',  'avaliador-7']
Length: 8, dtype: string

Avaliador3 -> distintos: 21 | nulos: 996
 amostra: <StringArray>
['avaliador-28', 'avaliador-29', 'avaliador-24', 'avaliador-30',
 'avaliador-31', 'avaliador-15',  'avaliador-3',  'avaliador-2']
Length: 8, dtype: string

Avaliador4 -> distintos: 11 | nulos: 1979
 amostra: <StringArray>
['avaliador-31',  'avaliador-8',  'avaliador-2', 'avaliador-15',
  'avaliador-5', 'avaliador-10',  'avaliador-7', 'avaliador-17']
Length: 8, dtype: string
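
The per-value `norm_txt` function can also be expressed with pandas' vectorized string methods, which avoids calling a Python function for every cell. The sketch below is an equivalent formulation on toy values (hypothetical), not a change to the cell above:

```python
import pandas as pd

# Toy values (hypothetical): extra spaces, an en dash, an empty string and a null.
raw = pd.Series(["Avaliador-21", "  avaliador – 5 ", "", None], dtype="object")

clean = (
    raw.astype("string")
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)   # collapse internal whitespace
    .str.replace(r"[–—]", "-", regex=True)  # en/em dash -> plain hyphen
    .str.lower()
)

# Empty strings become NA; existing nulls stay NA.
clean = clean.mask(clean.eq("").fillna(False))

# First run of digits as a nullable integer id.
ids = clean.str.extract(r"(\d+)", expand=False).astype("Int64")

print(clean.tolist())
print(ids.tolist())
```

Keeping the normalized label and the numeric id in separate columns preserves the original text for auditing while giving models a compact key.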

In [61]:
###############################################################################################################################################
# BLOCK: Standardize Fase / Fase Ideal (text) + extract the phase number (numeric features)
#
# What this cell does:
# - Normalizes the text in "Fase", "Fase Ideal" and "Fase ideal" (whitespace and dashes; case is left as is).
# - Merges "Fase Ideal" and "Fase ideal" into one final column: "FaseIdeal_txt".
# - Extracts the phase number:
#     * "Fase_num" from "Fase" (e.g. "FASE 1", "3", "2L" -> 1, 3, 2)
#     * "FaseIdeal_num" from the ideal-phase text (e.g. "Fase 3 (...)" -> 3)
# - (Optional) Extracts the suffix of "Fase" when present, lowercased (e.g. "2L" -> "l")
###############################################################################################################################################

import pandas as pd
import re

df = df_fase5.copy()

COL_FASE = "Fase"
COL_FI_A = "Fase Ideal"
COL_FI_B = "Fase ideal"

def norm_txt(x):
    if pd.isna(x):
        return pd.NA
    s = str(x).strip()
    s = re.sub(r"\s+", " ", s)
    if s == "":
        return pd.NA
    s = s.replace("–", "-").replace("—", "-")
    return s

# 1) normalize the columns if they exist
if COL_FASE in df.columns:
    df[COL_FASE] = df[COL_FASE].apply(norm_txt).astype("string")

if COL_FI_A in df.columns:
    df[COL_FI_A] = df[COL_FI_A].apply(norm_txt).astype("string")

if COL_FI_B in df.columns:
    df[COL_FI_B] = df[COL_FI_B].apply(norm_txt).astype("string")

# 2) merge Fase Ideal (prefer "Fase Ideal", otherwise fall back to "Fase ideal")
if (COL_FI_A in df.columns) and (COL_FI_B in df.columns):
    df["FaseIdeal_txt"] = df[COL_FI_A].where(df[COL_FI_A].notna(), df[COL_FI_B])
elif COL_FI_A in df.columns:
    df["FaseIdeal_txt"] = df[COL_FI_A]
elif COL_FI_B in df.columns:
    df["FaseIdeal_txt"] = df[COL_FI_B]
else:
    df["FaseIdeal_txt"] = pd.NA

df["FaseIdeal_txt"] = df["FaseIdeal_txt"].astype("string")

# 3) extract the phase number (first run of digits); values with no digits, e.g. "ALFA", become <NA>
if COL_FASE in df.columns:
    df["Fase_num"] = df[COL_FASE].str.extract(r"(\d+)", expand=False).astype("Int64")
    # (optional) suffix letter, such as "L" in "2L"
    df["Fase_sufixo"] = df[COL_FASE].str.extract(r"\d+\s*([A-Za-z])", expand=False).str.lower().astype("string")
else:
    df["Fase_num"] = pd.Series([pd.NA]*len(df), index=df.index, dtype="Int64")
    df["Fase_sufixo"] = pd.Series([pd.NA]*len(df), index=df.index, dtype="string")

df["FaseIdeal_num"] = df["FaseIdeal_txt"].str.extract(r"(\d+)", expand=False).astype("Int64")

# 4) quick checks
print("Fase - valores distintos (amostra):", df[COL_FASE].dropna().unique()[:12] if COL_FASE in df.columns else "sem coluna")
print("Fase_num - distintos:", int(df["Fase_num"].nunique(dropna=True)), "| nulos:", int(df["Fase_num"].isna().sum()))
print("Fase_sufixo - distintos:", df["Fase_sufixo"].dropna().unique()[:12])
print("FaseIdeal_num - distintos:", int(df["FaseIdeal_num"].nunique(dropna=True)), "| nulos:", int(df["FaseIdeal_num"].isna().sum()))

# (optional) drop the original duplicated columns once they have been validated:
# df = df.drop(columns=[c for c in [COL_FI_A, COL_FI_B] if c in df.columns])

df_fase5 = df


Fase - valores distintos (amostra): <StringArray>
['7', '6', '5', '4', '3', '2', '1', '0', 'ALFA', 'FASE 1', 'FASE 2', 'FASE 3']
Length: 12, dtype: string
Fase_num - distintos: 10 | nulos: 427
Fase_sufixo - distintos: <StringArray>
['a', 'b', 'c', 'd', 'e', 'g', 'h', 'j', 'k', 'l', 'm', 'n']
Length: 12, dtype: string
FaseIdeal_num - distintos: 8 | nulos: 0
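
To make the two regular expressions in the cell above easier to check, the sketch below applies them to a handful of made-up values in the formats seen in `Fase`. The sample Series is hypothetical; only the patterns come from the cell. A value with no digits, such as "ALFA", yields a missing `Fase_num`, so such rows contribute to the nulls reported for that column.

```python
import pandas as pd

# Hypothetical sample values in the formats seen in "Fase".
fase = pd.Series(["7", "FASE 1", "2L", "ALFA", pd.NA], dtype="string")

# First run of digits -> nullable integer.
fase_num = pd.to_numeric(
    fase.str.extract(r"(\d+)", expand=False), errors="coerce"
).astype("Int64")

# Single letter right after the digits, lower-cased.
fase_sufixo = fase.str.extract(r"\d+\s*([A-Za-z])", expand=False).str.lower()

demo = pd.DataFrame({"Fase": fase, "Fase_num": fase_num, "Fase_sufixo": fase_sufixo})
print(demo)
```

Whether "ALFA" should stay missing or be mapped to a specific phase number is a modeling decision that the source does not settle here.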


In [62]:
###############################################################################################################################################
# BLOCK: Clean "Nome" and "Nome Anonimizado" (text) + optional numeric ID extraction
#
# What this cell does:
# - Normalizes "Nome" and "Nome Anonimizado" as text (strip, whitespace, dashes, lowercase).
# - Replaces empty strings with NA.
# - (Optional) Extracts the number from the "Aluno-####" pattern into *_id columns (Int64).
###############################################################################################################################################

import pandas as pd
import re

df = df_fase5.copy()

COL_NOME = "Nome"
COL_ANON = "Nome Anonimizado"

def norm_txt(x):
    if pd.isna(x):
        return pd.NA
    s = str(x).strip()
    s = re.sub(r"\s+", " ", s)
    if s == "":
        return pd.NA
    s = s.replace("–", "-").replace("—", "-")
    s = s.lower()
    return s

for col in [COL_NOME, COL_ANON]:
    if col in df.columns:
        df[col] = df[col].apply(norm_txt).astype("string")
    else:
        print(f"Coluna ausente: {col}")

# optional: extract the numeric id
if COL_NOME in df.columns:
    df["Nome_id"] = df[COL_NOME].str.extract(r"(\d+)", expand=False).astype("Int64")

if COL_ANON in df.columns:
    df["NomeAnon_id"] = df[COL_ANON].str.extract(r"(\d+)", expand=False).astype("Int64")

# quick checks
if COL_NOME in df.columns:
    print("\nNome -> distintos:", int(df[COL_NOME].nunique(dropna=True)), "| nulos:", int(df[COL_NOME].isna().sum()))
    print("amostra:", df[COL_NOME].dropna().unique()[:8])

if COL_ANON in df.columns:
    print("\nNome Anonimizado -> distintos:", int(df[COL_ANON].nunique(dropna=True)), "| nulos:", int(df[COL_ANON].isna().sum()))
    print("amostra:", df[COL_ANON].dropna().unique()[:8])

df_fase5 = df



Nome -> distintos: 860 | nulos: 2170
amostra: <StringArray>
['aluno-1', 'aluno-2', 'aluno-3', 'aluno-4', 'aluno-5', 'aluno-6', 'aluno-7',
 'aluno-8']
Length: 8, dtype: string

Nome Anonimizado -> distintos: 1405 | nulos: 860
amostra: <StringArray>
['aluno-861', 'aluno-862', 'aluno-863', 'aluno-864', 'aluno-865', 'aluno-866',
 'aluno-867', 'aluno-868']
Length: 8, dtype: string
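
The null counts above (2170 for `Nome`, 860 for `Nome Anonimizado`) hint that each row has only one of the two columns filled. That is an inference from the counts, not something the notebook states, so it is worth checking directly. A minimal sketch on toy data (the values are hypothetical; `exactly_one_filled` is an illustrative helper, not part of the notebook):

```python
import pandas as pd

def exactly_one_filled(a: pd.Series, b: pd.Series) -> pd.Series:
    """True where exactly one of the two columns has a value."""
    return a.notna() ^ b.notna()

# Toy frame (hypothetical values) mimicking the pattern suggested by the output.
toy = pd.DataFrame({
    "Nome": pd.array(["aluno-1", "aluno-2", pd.NA, pd.NA], dtype="string"),
    "Nome Anonimizado": pd.array([pd.NA, pd.NA, "aluno-861", "aluno-862"], dtype="string"),
})

mask = exactly_one_filled(toy["Nome"], toy["Nome Anonimizado"])
print("rows with exactly one name filled:", int(mask.sum()), "of", len(toy))
```

On the real frame, `exactly_one_filled(df["Nome"], df["Nome Anonimizado"]).all()` would confirm or refute the assumption.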


In [63]:
###############################################################################################################################################
# BLOCK: Extract the RA number (RA-####) into "ra_num" (Int64)
#
# What this cell does:
# - Finds the RA column (RA/ra) without relying on snake_case.
# - Normalizes the text (strip, whitespace, dashes) before extracting.
# - Extracts the number from the RA-#### pattern and stores it as Int64 (NA allowed).
# - Shows a sample and checks for inconsistencies (1:1 mapping).
###############################################################################################################################################

import pandas as pd
import re

df = df_fase5.copy()

# find the RA column (case-insensitive)
ra_candidates = [c for c in df.columns if str(c).strip().lower() == "ra"]
if not ra_candidates:
    raise ValueError("Não encontrei coluna RA/ra no df.")
base_col = ra_candidates[0]

# normalize the text
s = df[base_col].astype("string")
s = s.str.strip().str.replace(r"\s+", " ", regex=True)
s = s.str.replace("–", "-", regex=False).str.replace("—", "-", regex=False)

# extract the number
df["ra_num"] = pd.to_numeric(
    s.str.extract(r"(\d+)", expand=False),
    errors="coerce"
).astype("Int64")

print("OK. ra_num criado.")
print("Coluna base:", base_col)
print("Nulos em ra_num:", int(df["ra_num"].isna().sum()))
display(df[[base_col, "ra_num"]].drop_duplicates().head(15))

# quick check: each textual RA should map to exactly one ra_num, and vice versa
tmp = df[[base_col, "ra_num"]].dropna().drop_duplicates()
dup_txt = tmp.groupby(base_col)["ra_num"].nunique().sort_values(ascending=False)
dup_num = tmp.groupby("ra_num")[base_col].nunique().sort_values(ascending=False)

print("\nRA textual com mais de 1 ra_num (top 5):")
print(dup_txt[dup_txt > 1].head(5))

print("\nra_num com mais de 1 RA textual (top 5):")
print(dup_num[dup_num > 1].head(5))

df_fase5 = df


OK. ra_num criado.
Coluna base: RA
Nulos em ra_num: 0


      RA  ra_num
0   RA-1       1
1   RA-2       2
2   RA-3       3
3   RA-4       4
4   RA-5       5
5   RA-6       6
6   RA-7       7
7   RA-8       8
8   RA-9       9
9  RA-10      10



RA textual com mais de 1 ra_num (top 5):
Series([], Name: ra_num, dtype: int64)

ra_num com mais de 1 RA textual (top 5):
Series([], Name: RA, dtype: int64)
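
Both result Series are empty here, so it may not be obvious what a failure would look like. The sketch below wraps the same two `groupby(...).nunique()` checks in a helper and runs it on toy data with a deliberate inconsistency ("RA-2" and "RA 2" sharing the number 2). The helper name and the toy rows are assumptions for illustration.

```python
import pandas as pd

def one_to_one_violations(df: pd.DataFrame, left: str, right: str):
    """Return (left keys with >1 right value, right keys with >1 left value)."""
    pairs = df[[left, right]].dropna().drop_duplicates()
    left_multi = pairs.groupby(left)[right].nunique()
    right_multi = pairs.groupby(right)[left].nunique()
    return left_multi[left_multi > 1], right_multi[right_multi > 1]

# Toy data (hypothetical): two spellings collapse to the same number.
toy = pd.DataFrame({
    "RA": ["RA-1", "RA-2", "RA 2", "RA-3"],
    "ra_num": [1, 2, 2, 3],
})

left_bad, right_bad = one_to_one_violations(toy, "RA", "ra_num")
print("textual RA with several numbers:", left_bad.to_dict())
print("numbers with several textual RA:", right_bad.to_dict())
```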


In [64]:
###############################################################################################################################################
# Check whether ra_num is unique within each ano_base
#
# What this cell does:
# - Checks that ra_num is unique per ano_base (the same check done earlier on ra).
# - If it is not, shows examples (to see whether an RA is reused in another context).
###############################################################################################################################################

df = df_fase5.copy()

if "ano_base" in df.columns and "ra_num" in df.columns:
    dup = df.duplicated(subset=["ano_base", "ra_num"], keep=False)
    print("Linhas com ra_num duplicado dentro do mesmo ano:", int(dup.sum()))
    if dup.any():
        cols_show = [c for c in ["ano_base", "ra", "RA", "ra_num", "turma", "Turma"] if c in df.columns]
        display(df.loc[dup, cols_show].sort_values(["ano_base","ra_num"]).head(30))
else:
    print("Faltando ano_base ou ra_num no df.")


Linhas com ra_num duplicado dentro do mesmo ano: 0


In [65]:
df_fase5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3030 entries, 0 to 3029
Data columns (total 77 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   RA                     3030 non-null   object        
 1   Fase                   3030 non-null   string        
 2   Turma                  3030 non-null   object        
 3   Nome                   860 non-null    string        
 4   Ano nasc               860 non-null    Int64         
 5   Idade 22               860 non-null    float64       
 6   Gênero                 3030 non-null   object        
 7   Ano ingresso           3030 non-null   int64         
 8   Instituição de ensino  3029 non-null   object        
 9   Pedra 20               754 non-null    object        
 10  Pedra 21               1061 non-null   object        
 11  Pedra 22               1932 non-null   object        
 12  INDE 22                1932 non-null   float64       
 13  Cg 

In [66]:
###############################################################################################################################################
# BLOCK: Convert "Idade 22" (float -> Int64)
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

COL = "Idade 22"
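if COL in df.columns:
    # coerce non-numeric entries to NaN and round to whole years
    s = pd.to_numeric(df[COL], errors="coerce").round(0)
    # ages outside 0-100 are treated as missing
    s = s.where((s >= 0) & (s <= 100), pd.NA)
    # nullable integer dtype keeps the missing values
    df[COL] = s.astype("Int64")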

print("dtype Idade 22:", df["Idade 22"].dtype if "Idade 22" in df.columns else "coluna ausente")
df_fase5 = df


dtype Idade 22: Int64
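
The conversion above can be tried on a few edge cases to see what ends up as missing. This is a small sketch with made-up inputs; `to_nullable_age` is a hypothetical helper that mirrors the three steps of the cell (coerce, range filter, cast).

```python
import pandas as pd

def to_nullable_age(values, lo=0, hi=100):
    """Coerce to numeric, blank out values outside [lo, hi], cast to nullable Int64."""
    s = pd.to_numeric(pd.Series(values), errors="coerce").round(0)
    s = s.where((s >= lo) & (s <= hi))
    return s.astype("Int64")

# Made-up inputs: a clean age, a fractional age, a missing value,
# a non-numeric entry, and two out-of-range values.
ages = to_nullable_age([12.0, 15.4, None, "abc", 150, -3])
print(ages)
```

Only the first two inputs survive (as 12 and 15); the rest become `<NA>`, which the capital-I `Int64` dtype can hold, unlike NumPy's `int64`.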


In [67]:
###############################################################################################################################################
# BLOCK: Check whether "Ativo/ Inativo" and "Ativo/ Inativo.1" are duplicates
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

c1 = "Ativo/ Inativo"
c2 = "Ativo/ Inativo.1"

if c1 in df.columns and c2 in df.columns:
    # fill NA with a sentinel so that two missing values compare as equal
    eq = df[c1].astype("string").fillna("<NA>") == df[c2].astype("string").fillna("<NA>")
    print("Linhas iguais (%):", round(100 * eq.mean(), 2))
    print("Divergências:", int((~eq).sum()))
    if int((~eq).sum()) > 0:
        display(df.loc[~eq, ["ano_base", "RA", c1, c2]].head(20))
else:
    print("Uma das colunas não existe.")


Linhas iguais (%): 100.0
Divergências: 0
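
The sentinel in the comparison above is needed because, with the nullable `string` dtype, comparing two missing values gives `<NA>` rather than `True`. An alternative that avoids choosing a sentinel string is sketched below; `same_values` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def same_values(a: pd.Series, b: pd.Series) -> pd.Series:
    """Element-wise equality that treats two missing values as equal."""
    a = a.astype("string")
    b = b.astype("string")
    both_missing = a.isna() & b.isna()
    # a == b is <NA> wherever either side is missing; count those as not equal
    equal = (a == b).fillna(False).astype(bool)
    return equal | both_missing

# Toy columns (hypothetical values): equal, both missing, different.
col_a = pd.Series(["Cursando", None, "Inativo"])
col_b = pd.Series(["Cursando", None, "Cursando"])

result = same_values(col_a, col_b)
print(result.tolist())
```

This version cannot collide with a real value that happens to equal the sentinel text.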


In [68]:
###############################################################################################################################################
# BLOCK: Remove the duplicated column "Ativo/ Inativo.1" (identical to "Ativo/ Inativo")
###############################################################################################################################################

df = df_fase5.copy()

dup = "Ativo/ Inativo.1"
if dup in df.columns:
    df = df.drop(columns=[dup])
    print("OK. Removida:", dup)

df_fase5 = df


OK. Removida: Ativo/ Inativo.1


In [69]:
###############################################################################################################################################
# BLOCK: Basic normalization of categorical columns (strip + whitespace)
###############################################################################################################################################

import pandas as pd
import re

df = df_fase5.copy()

cat_cols = [
    "Gênero", "Instituição de ensino", "Turma", "Escola",
    "Ativo/ Inativo",  # "Ativo/ Inativo.1" was already dropped in the previous cell
    "Indicado", "Atingiu PV",
    "Destaque IEG", "Destaque IDA", "Destaque IPV",
    "Rec Av1", "Rec Av2", "Rec Av3", "Rec Av4", "Rec Psicologia",
    "Pedra 20", "Pedra 21", "Pedra 22", "Pedra 23", "Pedra 2023", "Pedra 2024"
]

def norm_cat(x):
    if pd.isna(x):
        return pd.NA
    s = str(x).strip()
    s = re.sub(r"\s+", " ", s)
    return pd.NA if s == "" else s

for c in cat_cols:
    if c in df.columns:
        df[c] = df[c].apply(norm_cat).astype("string")

df_fase5 = df
print("OK. Categorias normalizadas (strip/espaços).")


OK. Categorias normalizadas (strip/espaços).


In [70]:
###############################################################################################################################################
# BLOCK: Consistency check (ano_base vs Ano nasc / Data de Nasc)
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

# 2022: Ano nasc vs Idade 22
if "Ano nasc" in df.columns and "Idade 22" in df.columns:
    calc_22 = df["ano_base"].astype("Int64") - df["Ano nasc"].astype("Int64")
    m = df["Idade 22"].notna() & df["Ano nasc"].notna()
    dif = (df.loc[m, "Idade 22"].astype("Int64") - calc_22.loc[m]).abs()
    print("Idade 22 vs (ano_base - Ano nasc) | divergência > 1 ano:", int((dif > 1).sum()))
    if int((dif > 1).sum()) > 0:
        display(df.loc[m].assign(calc=calc_22)[dif > 1][["ano_base","RA","Ano nasc","Idade 22","calc"]].head(20))

# 2023/2024: Data de Nasc vs Idade
if "Data de Nasc" in df.columns and "Idade" in df.columns:
    calc = df["ano_base"].astype("Int64") - df["Data de Nasc"].dt.year.astype("Int64")
    m = df["Idade"].notna() & df["Data de Nasc"].notna()
    dif = (df.loc[m, "Idade"].astype("Int64") - calc.loc[m]).abs()
    print("Idade vs (ano_base - ano(Data de Nasc)) | divergência > 1 ano:", int((dif > 1).sum()))
    if int((dif > 1).sum()) > 0:
        display(df.loc[m].assign(calc=calc)[dif > 1][["ano_base","RA","Data de Nasc","Idade","calc"]].head(20))


Idade 22 vs (ano_base - Ano nasc) | divergência > 1 ano: 0
Idade vs (ano_base - ano(Data de Nasc)) | divergência > 1 ano: 0
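
The checks above tolerate a one-year difference because subtracting calendar years ignores whether the birthday has already happened in the reference year. The sketch below illustrates this with a made-up birth date; the reference dates are assumptions, since the source does not say on which date the ages were recorded.

```python
import pandas as pd

def age_on(ref: pd.Timestamp, birth: pd.Timestamp) -> int:
    """Completed years of age on the reference date."""
    had_birthday = (ref.month, ref.day) >= (birth.month, birth.day)
    return ref.year - birth.year - (0 if had_birthday else 1)

birth = pd.Timestamp("2010-09-15")   # hypothetical date of birth
year_diff = 2023 - birth.year        # what a plain year subtraction gives

age_march = age_on(pd.Timestamp("2023-03-01"), birth)      # before the birthday
age_december = age_on(pd.Timestamp("2023-12-01"), birth)   # after the birthday
print(year_diff, age_march, age_december)
```

A recorded age can therefore legitimately be one less than the year difference, which is why only gaps larger than one year are flagged.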


In [71]:
###############################################################################################################################################
# BLOCK: Consolidate INDE and PEDRA by ano_base (single final columns)
#
# What this cell does:
# - Creates INDE_final and PEDRA_final from the column that matches each row's ano_base.
# - Records which column was used (INDE_fonte / PEDRA_fonte) for auditing.
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

# candidate column names for each year, in priority order
map_inde = {
    2022: ["INDE 22", "INDE_2022", "INDE 2022"],
    2023: ["INDE 23", "INDE 2023", "INDE_2023"],
    2024: ["INDE 2024", "INDE_2024", "INDE 24"],
}

map_pedra = {
    2022: ["Pedra 22", "Pedra_2022", "Pedra 2022"],
    2023: ["Pedra 23", "Pedra 2023", "Pedra_2023"],
    2024: ["Pedra 2024", "Pedra_2024", "Pedra 24"],
}

def pick_value(row, candidates):
    for c in candidates:
        if c in row.index and pd.notna(row[c]):
            return row[c], c
    return pd.NA, pd.NA

inde_vals = []
inde_src  = []
ped_vals  = []
ped_src   = []

for _, row in df.iterrows():
    ano = int(row["ano_base"])
    v, s = pick_value(row, map_inde.get(ano, []))
    inde_vals.append(v)
    inde_src.append(s)

    v, s = pick_value(row, map_pedra.get(ano, []))
    ped_vals.append(v)
    ped_src.append(s)

df["INDE_final"] = pd.to_numeric(pd.Series(inde_vals, index=df.index), errors="coerce")
df["INDE_fonte"] = pd.Series(inde_src, index=df.index).astype("string")

df["PEDRA_final"] = pd.Series(ped_vals, index=df.index).astype("string")
df["PEDRA_fonte"] = pd.Series(ped_src, index=df.index).astype("string")

print("INDE_final nulos:", int(df["INDE_final"].isna().sum()))
print("PEDRA_final nulos:", int(df["PEDRA_final"].isna().sum()))
print("Fontes INDE (top):")
print(df["INDE_fonte"].value_counts(dropna=False).head(10))
print("Fontes PEDRA (top):")
print(df["PEDRA_fonte"].value_counts(dropna=False).head(10))

df_fase5 = df


INDE_final nulos: 185
PEDRA_final nulos: 147
Fontes INDE (top):
INDE_fonte
INDE 2024    1054
INDE 2023     931
INDE 22       860
<NA>          185
Name: count, dtype: Int64
Fontes PEDRA (top):
PEDRA_fonte
Pedra 2024    1092
Pedra 2023     931
Pedra 22       860
<NA>           147
Name: count, dtype: Int64
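
The row-by-row loop with `iterrows()` is easy to follow, but it becomes slow as the frame grows. Below is a sketch of a column-wise alternative that keeps the same rule (first non-null candidate listed for the row's year). `pick_by_year` is a hypothetical helper and the toy values are made up; only the column names follow the mapping used in the cell.

```python
import numpy as np
import pandas as pd

def pick_by_year(df, candidates_by_year, year_col="ano_base"):
    """For each row, take the first non-null candidate column listed for its year."""
    values = pd.Series(pd.NA, index=df.index, dtype="object")
    source = pd.Series(pd.NA, index=df.index, dtype="string")
    for year, candidates in candidates_by_year.items():
        in_year = df[year_col] == year
        for col in candidates:
            if col not in df.columns:
                continue
            # rows of this year that are still unfilled and have a value in `col`
            take = in_year & values.isna() & df[col].notna()
            values.loc[take] = df.loc[take, col]
            source.loc[take] = col
    return values, source

# Toy data (hypothetical values). The 2023 row also has a value in "INDE 22",
# which must be ignored because that column is not a 2023 candidate.
toy = pd.DataFrame({
    "ano_base":  [2022, 2023, 2024, 2024],
    "INDE 22":   [7.1, 5.0, np.nan, np.nan],
    "INDE 2023": [np.nan, 6.5, np.nan, np.nan],
    "INDE 2024": [np.nan, np.nan, 8.0, np.nan],
})
candidates = {2022: ["INDE 22"], 2023: ["INDE 23", "INDE 2023"], 2024: ["INDE 2024"]}

vals, src = pick_by_year(toy, candidates)
print(pd.DataFrame({"INDE_final": vals, "INDE_fonte": src}))
```

The loops here run over years and candidate columns, not over rows, so the cost depends on the size of the mapping rather than on the number of records.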


In [72]:
###############################################################################################################################################
# BLOCK: Standardize the labels in the source columns (INDE_fonte / PEDRA_fonte)
#
# What this cell does:
# - Replaces "INDE 22" -> "INDE 2022"
# - Replaces "Pedra 22" -> "Pedra 2022"
# (everything else is left as is)
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

if "INDE_fonte" in df.columns:
    df["INDE_fonte"] = (
        df["INDE_fonte"].astype("string")
        .str.replace(r"^INDE[\s_]*22$", "INDE 2022", regex=True)
    )

if "PEDRA_fonte" in df.columns:
    df["PEDRA_fonte"] = (
        df["PEDRA_fonte"].astype("string")
        .str.replace(r"^Pedra[\s_]*22$", "Pedra 2022", regex=True)
    )

print("Fontes INDE (top):")
print(df["INDE_fonte"].value_counts(dropna=False).head(10))

print("\nFontes PEDRA (top):")
print(df["PEDRA_fonte"].value_counts(dropna=False).head(10))

df_fase5 = df


Fontes INDE (top):
INDE_fonte
INDE 2024    1054
INDE 2023     931
INDE 2022     860
<NA>          185
Name: count, dtype: Int64

Fontes PEDRA (top):
PEDRA_fonte
Pedra 2024    1092
Pedra 2023     931
Pedra 2022     860
<NA>           147
Name: count, dtype: Int64
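
The two replacements above are specific to the 2022 labels. If other two-digit labels were to appear (for example "INDE 23", which is listed among the candidates but was not the source of any row here), a single pattern with capture groups could cover them. This is a sketch under the assumption that every two-digit year belongs to the 2000s; the sample labels are made up.

```python
import pandas as pd

# Hypothetical labels: two-digit, underscore-separated, and already four-digit.
labels = pd.Series(["INDE 22", "INDE_23", "INDE 2024", "Pedra 22", pd.NA], dtype="string")

# Expand a trailing two-digit year to four digits.
# Four-digit labels do not match the pattern and pass through unchanged.
normalized = labels.str.replace(
    r"^(INDE|Pedra)[\s_]*(\d{2})$", r"\1 20\2", regex=True
)
print(normalized.tolist())
```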


In [73]:
###############################################################################################################################################
# BLOCK: Quick inspection of column contents (audit vs. final)
#
# What this cell does:
# - Shows dtype, nulls, distinct count and a sample of values (most frequent + unique values).
# - For numeric columns: min/max.
# - For text columns: example values.
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

cols = [
    "Defas", "Defasagem", "Defasagem_final",
    "Fase Ideal", "Fase ideal", "FaseIdeal_txt",
    "Nome", "Nome Anonimizado", "Nome_id", "NomeAnon_id"
]

def inspeciona_coluna(df, col, top_n=10, amostra_n=12):
    if col not in df.columns:
        print(f"\n{col}: coluna ausente")
        return

    s = df[col]
    print("\n" + "="*100)
    print(f"Coluna: {col}")
    print("dtype:", s.dtype, "| nulos:", int(s.isna().sum()), "| distintos:", int(s.nunique(dropna=True)))

    if pd.api.types.is_numeric_dtype(s):
        print("min:", s.min(), "| max:", s.max())
        print("top valores (contagem):")
        print(s.value_counts(dropna=False).head(top_n))
    else:
        print("top valores (contagem):")
        print(s.astype("string").value_counts(dropna=False).head(top_n))
        print("amostra (valores únicos):")
        vals = s.astype("string").dropna().unique()
        print(vals[:amostra_n])

for c in cols:
    inspeciona_coluna(df, c)



Coluna: Defas
dtype: float64 | nulos: 2170 | distintos: 8
min: -5.0 | max: 2.0
top valores (contagem):
Defas
 NaN    2170
-1.0     410
 0.0     247
-2.0     163
-3.0      23
 1.0       9
-4.0       4
 2.0       3
-5.0       1
Name: count, dtype: int64

Coluna: Defasagem
dtype: float64 | nulos: 860 | distintos: 8
min: -4.0 | max: 3.0
top valores (contagem):
Defasagem
 0.0    905
 NaN    860
-1.0    849
-2.0    220
 1.0    156
 2.0     21
-3.0     16
 3.0      2
-4.0      1
Name: count, dtype: int64

Coluna: Defasagem_final
dtype: Int64 | nulos: 0 | distintos: 9
min: -5 | max: 3
top valores (contagem):
Defasagem_final
-1    1259
0     1152
-2     383
1      165
-3      39
2       24
-4       5
3        2
-5       1
Name: count, dtype: Int64

Coluna: Fase Ideal
dtype: string | nulos: 860 | distintos: 9
top valores (contagem):
Fase Ideal
<NA>                       860
Fase 2 (5° e 6° ano)       527
Fase 3 (7° e 8° ano)       437
Fase 1 (3° e 4° ano)       290
Fase 4 (9° ano)            18

In [74]:
###############################################################################################################################################
# BLOCK: Remove the "Defas" and "Defasagem" columns (keep only "Defasagem_final")
###############################################################################################################################################

df = df_fase5.copy()

cols_drop = [c for c in ["Defas", "Defasagem"] if c in df.columns]
df = df.drop(columns=cols_drop)

print("Removidas:", cols_drop)
print("Colunas restantes relacionadas à defasagem:", [c for c in df.columns if "defas" in str(c).lower()])

df_fase5 = df


Removidas: ['Defas', 'Defasagem']
Colunas restantes relacionadas à defasagem: ['Defasagem_final']
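
The cell that built `Defasagem_final` is outside this section, but the counts printed in the inspection cell are consistent with it being a coalescence of the two originals (for instance, 410 + 849 = 1259 rows with value -1). Before dropping source columns it can be worth asserting that explicitly. A minimal sketch, assuming that construction; `is_coalesce_of` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def is_coalesce_of(final: pd.Series, first: pd.Series, second: pd.Series) -> bool:
    """True if `final` equals `first` with its gaps filled from `second`."""
    expected = first.combine_first(second).round(0).astype("Int64")
    return final.astype("Int64").equals(expected)

# Toy frame (hypothetical): each row has only one of the two source columns filled.
toy = pd.DataFrame({
    "Defas":     [-1.0, 0.0, None, None],
    "Defasagem": [None, None, -2.0, 1.0],
})
toy["Defasagem_final"] = pd.array([-1, 0, -2, 1], dtype="Int64")

ok = is_coalesce_of(toy["Defasagem_final"], toy["Defasagem"], toy["Defas"])
print("safe to drop the source columns:", ok)
```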


In [75]:
###############################################################################################################################################
# BLOCK: Clean up the ideal-phase columns (keep only FaseIdeal_txt + standardize ° -> º)
#
# What this cell does:
# - Standardizes the symbol in FaseIdeal_txt (° -> º) to remove spurious duplicates.
# - Drops the original duplicated columns: "Fase Ideal" and "Fase ideal".
###############################################################################################################################################

import pandas as pd

df = df_fase5.copy()

# 1) standardize the symbol in FaseIdeal_txt
if "FaseIdeal_txt" in df.columns:
    df["FaseIdeal_txt"] = df["FaseIdeal_txt"].astype("string").str.replace("°", "º", regex=False)

# 2) drop the original duplicated columns
cols_drop = [c for c in ["Fase Ideal", "Fase ideal"] if c in df.columns]
df = df.drop(columns=cols_drop)

print("Removidas:", cols_drop)

print("OK. FaseIdeal_txt dtype:", df["FaseIdeal_txt"].dtype if "FaseIdeal_txt" in df.columns else "coluna ausente")
print("Distinct FaseIdeal_txt:", int(df["FaseIdeal_txt"].nunique(dropna=True)) if "FaseIdeal_txt" in df.columns else "-")

df_fase5 = df


Removidas: ['Fase Ideal', 'Fase ideal']
OK. FaseIdeal_txt dtype: string
Distinct FaseIdeal_txt: 11
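
The two symbols look almost identical on screen but are different Unicode characters, which is why otherwise equal labels can count as distinct. The sketch below prints their code points and shows the effect of the replacement on two made-up labels.

```python
import unicodedata
import pandas as pd

DEGREE = "\u00b0"    # the character replaced in the cell above
ORDINAL = "\u00ba"   # the character it is replaced with

for ch in (DEGREE, ORDINAL):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Hypothetical labels that differ only in the symbol used.
labels = pd.Series(
    [f"Fase 1 (3{DEGREE} e 4{DEGREE} ano)", f"Fase 1 (3{ORDINAL} e 4{ORDINAL} ano)"],
    dtype="string",
)
unified = labels.str.replace(DEGREE, ORDINAL, regex=False)
print(labels.nunique(), "->", unified.nunique())
```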


In [76]:
###############################################################################################################################################
# BLOCK: Remove identifier columns (Nome / Nome Anonimizado / IDs) before modeling
#
# What this cell does:
# - Drops columns that act as identifiers and tend to cause leakage/memorization in the model:
#   "Nome", "Nome Anonimizado" and the derived ID columns.
# - Shows what was removed and the new shape.
###############################################################################################################################################

df = df_fase5.copy()

cols_drop = [
    "Nome", "Nome Anonimizado",
    "Nome_id", "NomeAnon_id",
    "nome_num", "nome_anon_num"
]

cols_drop = [c for c in cols_drop if c in df.columns]

df = df.drop(columns=cols_drop)

print("Removidas (IDs/leakage):", cols_drop)
print("df_fase5 shape (novo):", df.shape)

df_fase5 = df


Removidas (IDs/leakage): ['Nome', 'Nome Anonimizado', 'Nome_id', 'NomeAnon_id']
df_fase5 shape (novo): (3030, 72)
