# Notebook 02 — Train/Test Split

**Objectives:**
- Load the cleaned dataset (`cleaned_df.csv`)
- Separate features (`X`) and target (`y`)
- Split into training and test sets (80/20)
- Stratify on the churn target (`churnvalue`) so both sets keep the same class balance
- Save the two files to `data/processed/`


In [None]:
import sys, os
sys.path.append(os.path.abspath(".."))


In [None]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from src.utils_data import save_df

DATA_DIR = os.path.join("..", "data")
CLEAN_PATH = os.path.join(DATA_DIR, "interim", "cleaned_df.csv")


In [None]:
df = pd.read_csv(CLEAN_PATH)
print("Shape of the cleaned dataset:", df.shape)
df.head()


Shape of the cleaned dataset: (7043, 50)


Unnamed: 0,customerid,gender,age,under30,seniorcitizen,married,dependents,numberofdependents,country,state,...,totallongdistancecharges,totalrevenue,satisfactionscore,customerstatus,churnlabel,churnvalue,churnscore,cltv,churncategory,churnreason
0,8779-QRDMV,Male,78,No,Yes,No,No,0,United States,California,...,0.0,59.65,3,Churned,Yes,1,91,5433,Competitor,Competitor offered more data
1,7495-OOKFY,Female,74,No,Yes,Yes,Yes,1,United States,California,...,390.8,1024.1,3,Churned,Yes,1,69,5302,Competitor,Competitor made better offer
2,1658-BYGOY,Male,71,No,Yes,No,Yes,3,United States,California,...,203.94,1910.88,2,Churned,Yes,1,81,3179,Competitor,Competitor made better offer
3,4598-XLKNJ,Female,78,No,Yes,Yes,Yes,1,United States,California,...,494.0,2995.07,2,Churned,Yes,1,88,5337,Dissatisfaction,Limited range of services
4,4846-WHAFZ,Female,80,No,Yes,Yes,Yes,1,United States,California,...,234.21,3102.36,2,Churned,Yes,1,67,2793,Price,Extra data charges


In [None]:
# Target column
target_col = "churnvalue"

# Features and target
X = df.drop(columns=[target_col])
y = df[target_col]

print("X shape:", X.shape)
print("y distribution:")
print(y.value_counts(normalize=True))


X shape: (7043, 49)
y distribution:
churnvalue
0    0.73463
1    0.26537
Name: proportion, dtype: float64
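Several of the 49 columns left in `X` look like direct encodings of the target: in the preview above, `churnlabel` is `Yes` and `customerstatus` is `Churned` on every row where `churnvalue` is 1, and `churnscore`, `churncategory` and `churnreason` also appear to be derived from churn. A model trained on them would likely score unrealistically well. The sketch below is one heuristic for flagging such columns before modelling. The helper name `find_leaky_columns`, the `max_levels` cutoff and the toy frame are illustrative assumptions, not part of this notebook.

```python
import pandas as pd


def find_leaky_columns(df, target_col, max_levels=50):
    """Flag low-cardinality columns whose value fully determines the target.

    Heuristic only: columns with more than `max_levels` distinct values
    (identifiers, continuous amounts) are skipped, because near-unique
    values determine the target trivially.
    """
    leaky = []
    for col in df.columns.drop(target_col):
        n_levels = df[col].nunique(dropna=False)
        if n_levels < 2 or n_levels > max_levels:
            continue
        # Number of distinct target values seen for each level of the column;
        # dropna=False keeps missing values as their own level.
        per_level = df.groupby(col, dropna=False)[target_col].nunique()
        if (per_level == 1).all():
            leaky.append(col)
    return leaky


# Toy data (made up for illustration) that mimics the pattern in the preview
toy = pd.DataFrame({
    "customerid": ["a", "b", "c", "d", "e", "f"],
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "churnlabel": ["Yes", "No", "Yes", "No", "No", "Yes"],
    "churnreason": ["Price", None, "Competitor", None, None, "Price"],
    "churnvalue": [1, 0, 1, 0, 0, 1],
})

leaky = find_leaky_columns(toy, "churnvalue", max_levels=5)
print("Columns that determine the target:", leaky)

# Drop the target, the flagged columns and the identifier from the features
X_toy = toy.drop(columns=["churnvalue", "customerid"] + leaky)
print("Remaining features:", list(X_toy.columns))
```

A numeric score such as `churnscore` would not be caught by this exact-match rule, so a manual review of column meanings is still needed.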


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y  # keeps the same churn proportion in both sets
)
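To illustrate what `stratify=y` contributes, the sketch below splits a synthetic target with the same class counts as this dataset (1,869 churners out of 7,043 rows, which matches the 0.26537 proportion printed earlier) under several seeds, with and without stratification. The data and the helper `churn_rate_in_test` are assumptions made up for the comparison.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with the same class counts as the notebook's dataset
n_rows, n_churn = 7043, 1869
y_demo = np.array([1] * n_churn + [0] * (n_rows - n_churn))
X_demo = np.arange(n_rows).reshape(-1, 1)  # placeholder feature


def churn_rate_in_test(seed, stratified):
    """Return the share of churners in the test set for one random split."""
    _, _, _, y_te = train_test_split(
        X_demo, y_demo,
        test_size=0.2,
        random_state=seed,
        stratify=y_demo if stratified else None,
    )
    return y_te.mean()


seeds = range(5)
strat_rates = [churn_rate_in_test(s, stratified=True) for s in seeds]
plain_rates = [churn_rate_in_test(s, stratified=False) for s in seeds]

print("Overall churn rate:     ", round(n_churn / n_rows, 5))
print("Stratified test rates:  ", [round(r, 5) for r in strat_rates])
print("Unstratified test rates:", [round(r, 5) for r in plain_rates])
```

With stratification the test-set churn rate stays pinned near the overall rate regardless of the seed, while a plain random split lets it drift from one seed to the next.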


In [None]:
# Re-attach the target so each saved file holds features and label together
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

# save_df was already imported from src.utils_data in an earlier cell
save_df(train_df, "train_df", folder="processed")
save_df(test_df, "test_df", folder="processed")

print("✅ Files saved to data/processed/")


✅ Guardado: /Users/pedroazevedo/Documents/GitHub/EnterpriseDataScienceBootcamp_workgroup/data/processed/train_df.csv
✅ Guardado: /Users/pedroazevedo/Documents/GitHub/EnterpriseDataScienceBootcamp_workgroup/data/processed/test_df.csv
✅ Files saved to data/processed/
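`save_df` lives in the project's `src/utils_data.py`, which is not shown in this notebook. As a rough stand-in, the hypothetical helper below writes a DataFrame to `<base_dir>/<folder>/<name>.csv` with `index=False`. Leaving the index out avoids the extra `Unnamed: 0` column that typically appears when a CSV saved with its index is read back, as seen in the preview of `cleaned_df.csv`. The function name, its signature and the directory layout are assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_df_sketch(df, name, folder, base_dir):
    """Hypothetical stand-in for save_df: write df to base_dir/folder/name.csv."""
    out_dir = Path(base_dir) / folder
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.csv"
    df.to_csv(out_path, index=False)  # do not persist the row index
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    # Small made-up frame with a non-default index, as left behind by a split
    demo = pd.DataFrame(
        {"customerid": ["a", "b"], "churnvalue": [1, 0]},
        index=[10, 20],
    )

    # Saved without the index: columns round-trip unchanged
    path = save_df_sketch(demo, "train_df", folder="processed", base_dir=tmp)
    reloaded = pd.read_csv(path)

    # Saved with the index: read_csv exposes it as an unnamed first column
    with_index_path = Path(tmp) / "with_index.csv"
    demo.to_csv(with_index_path)
    reloaded_with_index = pd.read_csv(with_index_path)

print(list(reloaded.columns))
print(list(reloaded_with_index.columns))
```

If a file has already been written with its index, `pd.read_csv(path, index_col=0)` restores it as the index instead of a feature column.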


In [None]:
# Confirm that both sets preserve the overall churn proportion
for name, part in [("Train", train_df), ("Test", test_df)]:
    print(f"\n{name} set:")
    print(part[target_col].value_counts(normalize=True))



Train set:
churnvalue
0    0.734647
1    0.265353
Name: proportion, dtype: float64

Test set:
churnvalue
0    0.734564
1    0.265436
Name: proportion, dtype: float64
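Beyond eyeballing the proportions, a few programmatic checks can confirm the split is sound: the two sets together cover every row, share no rows, and keep the churn rate close to the overall rate. The sketch below runs such checks on a synthetic frame with the same row and churn counts as this dataset. The helper `check_split` and its tolerance are assumptions, not part of the project code.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with the same size and class counts as the notebook's data
n_rows, n_churn = 7043, 1869
full_df = pd.DataFrame({
    "feature": range(n_rows),  # placeholder feature
    "churnvalue": [1] * n_churn + [0] * (n_rows - n_churn),
})

X = full_df.drop(columns=["churnvalue"])
y = full_df["churnvalue"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)


def check_split(train_df, test_df, full_df, target_col, tol=0.005):
    """Raise AssertionError if the split loses rows, overlaps, or is unbalanced."""
    # Every row ends up in exactly one of the two sets
    assert len(train_df) + len(test_df) == len(full_df)
    assert train_df.index.intersection(test_df.index).empty
    # For a 0/1 target the mean is the positive-class rate
    overall = full_df[target_col].mean()
    for part in (train_df, test_df):
        assert abs(part[target_col].mean() - overall) < tol


check_split(train_df, test_df, full_df, "churnvalue")
print("Train rows:", len(train_df), "| Test rows:", len(test_df))
```

With 7,043 rows and `test_size=0.2`, scikit-learn rounds the test set up to 1,409 rows and leaves 5,634 for training, which is consistent with the proportions printed above.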
