## Dataset Overview


- **Title**: Bank Marketing
- **Created by**: Paulo Cortez (Univ. Minho) and Sérgio Moro (ISCTE-IUL), 2012
- **Total Instances**: 45,211 (bank-full.csv)
- **Number of Attributes**: 16 input variables + 1 target variable
- **Missing Values**: None (note that several categorical variables include an "unknown" category)

### Context

This dataset relates to direct marketing campaigns run by a Portuguese banking institution. The campaigns were conducted by phone and often required several contacts with the same client to determine whether they would subscribe to a term deposit.

### Target Variable

- **y**: Whether the client subscribed to a term deposit (binary: "yes"/"no")
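
As a minimal sketch, the binary target can be mapped to 0/1 and its class share inspected before modelling. The four-row frame below is made-up illustration data, not rows from the dataset, and the names `toy`, `y_binary`, and `positive_rate` are assumptions introduced here.

```python
import pandas as pd

# Hypothetical toy rows for illustration only; the real file has 45,211 rows.
toy = pd.DataFrame({"y": ["no", "yes", "no", "no"]})

# Map the "yes"/"no" labels to 1/0 so metrics such as F1 can be computed.
y_binary = (toy["y"] == "yes").astype(int)

# Share of positive ("yes") rows; a low value signals class imbalance.
positive_rate = y_binary.mean()
print(y_binary.tolist(), positive_rate)
```

Running the same two lines on the full dataset would show how unbalanced the two classes are, which is the motivation for the resampling step later in the notebook.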

### Variable Descriptions

#### Client Personal Data

1. **age**

   - Type: Numeric
   - Description: Client's age

2. **job**

   - Type: Categorical
   - Values: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services"
   - Description: Client's type of job

3. **marital**

   - Type: Categorical
   - Values: "married", "divorced", "single"
   - Note: "divorced" covers both divorced and widowed clients

4. **education**

   - Type: Categorical
   - Values: "unknown", "secondary", "primary", "tertiary"
   - Description: Client's education level

5. **default**

   - Type: Binary
   - Values: "yes", "no"
   - Description: Whether the client has credit in default

6. **balance**

   - Type: Numeric
   - Description: Average yearly balance, in euros

7. **housing**

   - Type: Binary
   - Values: "yes", "no"
   - Description: Whether the client has a housing loan

8. **loan**

   - Type: Binary
   - Values: "yes", "no"
   - Description: Whether the client has a personal loan

#### Current Campaign Contact Information

9. **contact**

   - Type: Categorical
   - Values: "unknown", "telephone", "cellular"
   - Description: Communication type used for the contact

10. **day**

    - Type: Numeric
    - Description: Day of the month of the last contact

11. **month**

    - Type: Categorical
    - Values: "jan", "feb", "mar", ..., "nov", "dec"
    - Description: Month of the year of the last contact

12. **duration**

    - Type: Numeric
    - Description: Duration of the last contact, in seconds

#### Campaign Information

13. **campaign**

    - Type: Numeric
    - Description: Number of contacts made to this client during this campaign (includes the last contact)

14. **pdays**

    - Type: Numeric
    - Description: Number of days since the client was last contacted in a previous campaign
    - Note: -1 means the client was not previously contacted

15. **previous**

    - Type: Numeric
    - Description: Number of contacts made to this client before this campaign

16. **poutcome**

    - Type: Categorical
    - Values: "unknown", "other", "failure", "success"
    - Description: Outcome of the previous marketing campaign
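
Two of the conventions listed above act as sentinels rather than ordinary values: the "unknown" category in several categorical variables and the -1 in **pdays**. The sketch below shows one possible way to surface them with pandas. The toy rows and the column names `previously_contacted` and `pdays_clean` are assumptions made for illustration; they are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows for illustration only, not taken from the dataset.
toy = pd.DataFrame(
    {
        "job": ["admin.", "unknown", "student"],
        "poutcome": ["unknown", "failure", "unknown"],
        "pdays": [-1, 92, -1],
    }
)

# Count "unknown" labels per categorical column; isna() would report zero here.
unknown_counts = (toy[["job", "poutcome"]] == "unknown").sum()

# Flag clients with a previous contact, then blank out the -1 sentinel so it
# is not mistaken for a real number of days.
toy["previously_contacted"] = toy["pdays"] != -1
toy["pdays_clean"] = toy["pdays"].where(toy["pdays"] != -1)

print(unknown_counts.to_dict())
print(toy[["previously_contacted", "pdays_clean"]])
```

Whether to keep "unknown" as its own category or treat it as missing is a modelling choice; the notebook below keeps it as a regular category.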


## Exploration


In [1]:
import pandas as pd
import numpy as np
import sklearn as skl
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

In [None]:
df = pd.read_csv("datasets/bank.csv", sep=";")
df[df['y']=='yes'].head()

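| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | 20 | student | single | secondary | no | 502 | no | no | cellular | 30 | apr | 261 | 1 | -1 | 0 | unknown | yes |
| 30 | 68 | retired | divorced | secondary | no | 4189 | no | no | telephone | 14 | jul | 897 | 2 | -1 | 0 | unknown | yes |
| 33 | 32 | management | single | tertiary | no | 2536 | yes | no | cellular | 26 | aug | 958 | 6 | -1 | 0 | unknown | yes |
| 34 | 49 | technician | married | tertiary | no | 1235 | no | no | cellular | 13 | aug | 354 | 3 | -1 | 0 | unknown | yes |
| 36 | 78 | retired | divorced | primary | no | 229 | no | no | telephone | 22 | oct | 97 | 1 | -1 | 0 | unknown | yes |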


In [None]:
df[df['y']=='no'].head()

In [3]:
feature_cols = [
  "job",
  "marital",
  "education",
  "contact",
  "housing",
  "loan",
  "default",
  "day",
]

In [4]:
df.loc[df['y'] == 'yes', feature_cols].iloc[30].to_dict()

{'job': 'blue-collar',
 'marital': 'single',
 'education': 'secondary',
 'contact': 'cellular',
 'housing': 'no',
 'loan': 'yes',
 'default': 'no',
 'day': 10}

In [5]:
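from scipy.sparse import spmatrix


def ingest_and_prep_data(
    bank_dataset: str = "datasets/bank-full.csv",
) -> tuple[spmatrix, spmatrix, pd.Series, pd.Series, OneHotEncoder]:
    """Load the bank CSV, split it, and one-hot encode the selected features."""
    # The file is semicolon-delimited
    df = pd.read_csv(bank_dataset, delimiter=";", decimal=",")

    # Note: "day" is numeric, but it is one-hot encoded below like the
    # categorical columns, so each day of the month becomes its own column.
    feature_cols = [
        "job",
        "marital",
        "education",
        "contact",
        "housing",
        "loan",
        "default",
        "day",
    ]

    X = df[feature_cols].copy()
    # Map the "yes"/"no" target to 1/0
    y = df["y"].apply(lambda x: 1 if x == "yes" else 0).copy()

    print(f"Dataset shape: {X.shape}, {y.shape}")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )

    # Fit the encoder on the training split only, then reuse it on the test
    # split; categories unseen during fit are encoded as all zeros.
    enc = OneHotEncoder(handle_unknown="ignore")
    X_train_enc = enc.fit_transform(X_train)
    X_test_enc = enc.transform(X_test)

    return X_train_enc, X_test_enc, y_train, y_test, enc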

In [6]:
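def rebalance_classes(
    X: pd.DataFrame, y: pd.Series
) -> tuple[pd.DataFrame, pd.Series]:
    """Oversample the minority class with SMOTE (training split only).

    X may also be the sparse matrix returned by OneHotEncoder.
    """
    # A fixed random_state makes the synthetic samples reproducible
    sm = SMOTE(random_state=42)
    X_balanced, y_balanced = sm.fit_resample(X, y)
    return X_balanced, y_balanced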
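
To give an intuition for what the oversampling step produces, here is a deliberately simplified sketch of the interpolation idea behind SMOTE. It is not imblearn's implementation: it always uses the single nearest neighbour, whereas SMOTE chooses among several neighbours. The function `synthesize` and the toy points are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical minority-class points in a 2-D feature space.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0]])


def synthesize(points: np.ndarray, n_new: int) -> np.ndarray:
    """Create n_new points, each on the segment between a sample and its nearest neighbour."""
    new_points = []
    for _ in range(n_new):
        i = rng.integers(len(points))
        # Distances from the chosen point to every minority point
        dists = np.linalg.norm(points - points[i], axis=1)
        dists[i] = np.inf  # exclude the point itself
        j = int(np.argmin(dists))
        gap = rng.random()  # position along the segment, in [0, 1)
        new_points.append(points[i] + gap * (points[j] - points[i]))
    return np.array(new_points)


synthetic = synthesize(minority, n_new=4)
print(synthetic.shape)
```

Because each new point is a blend of two existing ones, applying this kind of interpolation to one-hot encoded columns can produce fractional values between 0 and 1 rather than clean 0/1 indicators.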

In [7]:
def get_hyperparam_grid() -> dict:
    """Return the hyperparameter search space for the random forest."""
    # Number of trees in the forest
    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    # Number of features to consider at every split
    # ("auto" is omitted because it raised errors)
    max_features = ["log2", "sqrt"]
    # Maximum number of levels in each tree (None means no limit)
    max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
    max_depth.append(None)
    # Minimum number of samples required to split a node
    min_samples_split = [2, 5, 10]
    # Minimum number of samples required at each leaf node
    min_samples_leaf = [1, 2, 4]
    # Whether each tree is trained on a bootstrap sample
    bootstrap = [True, False]
    # Assemble the random grid
    random_grid = {
        "n_estimators": n_estimators,
        "max_features": max_features,
        "max_depth": max_depth,
        "min_samples_split": min_samples_split,
        "min_samples_leaf": min_samples_leaf,
        "bootstrap": bootstrap,
    }
    return random_grid
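
To put the size of this search space in perspective, the sketch below rebuilds the same value lists with plain Python and multiplies their lengths. The names `grid` and `total_combinations` are introduced here for illustration only.

```python
from math import prod

# Same value counts as get_hyperparam_grid, rebuilt without NumPy.
grid = {
    "n_estimators": list(range(200, 2001, 200)),      # 10 values
    "max_features": ["log2", "sqrt"],                 # 2 values
    "max_depth": list(range(10, 111, 10)) + [None],   # 12 values
    "min_samples_split": [2, 5, 10],                  # 3 values
    "min_samples_leaf": [1, 2, 4],                    # 3 values
    "bootstrap": [True, False],                       # 2 values
}

# Number of distinct parameter combinations in the full grid
total_combinations = prod(len(values) for values in grid.values())
print(total_combinations)  # 4320
```

An exhaustive grid search would therefore need 4,320 candidates per fold, which is why a randomised search that samples only a few of them is used in this notebook.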

In [8]:
def get_randomised_rf_cv(
    random_grid: dict,
) -> RandomizedSearchCV:
    """Build (but do not fit) a randomised search over random-forest settings."""
    # Base model to tune
    rf = RandomForestClassifier()
    # Random search of parameters using 3-fold cross-validation, sampling
    # 3 parameter combinations, scored by F1, and using all available cores
    rf_random = RandomizedSearchCV(
        estimator=rf,
        param_distributions=random_grid,
        n_iter=3,
        cv=3,
        verbose=2,
        random_state=42,
        n_jobs=-1,
        scoring="f1",
    )
    return rf_random
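
The search above ranks candidates with `scoring="f1"`. As a reminder of what that score measures, here is a small hand-rolled sketch of F1 for the positive class. The function `f1_from_labels` and the label lists are hypothetical; in the notebook itself the value comes from scikit-learn's `f1_score`.

```python
def f1_from_labels(y_true: list[int], y_pred: list[int]) -> float:
    """Compute F1 for the positive class (label 1) from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))
```

Unlike accuracy, this score stays at zero for a model that always predicts "no", which makes it a more informative target when positive cases are rare.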

## Model Training


In [9]:
import mlflow
from mlflow import MlflowClient

In [10]:
client = MlflowClient()

In [11]:
mlflow.set_tracking_uri(uri="http://localhost:8080")

In [12]:
import joblib

def main():
    with mlflow.start_run() as run:
        X_train, X_test, y_train, y_test, enc = ingest_and_prep_data()

        X_balanced, y_balanced = rebalance_classes(X_train, y_train)

        rf_random = get_randomised_rf_cv(random_grid=get_hyperparam_grid())

        mlflow.log_params(rf_random.get_params())
        rf_random.fit(X_balanced, y_balanced)

        # X_test is already encoded, so it does not need to be transformed again
        y_pred = rf_random.predict(X_test)

        mlflow.log_metrics({"f1": f1_score(y_test, y_pred)})

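        # Save the fitted encoder and log it as a run artifact alongside the model
        encoder_path = "encoder.joblib"
        joblib.dump(enc, encoder_path)
        mlflow.log_artifact(encoder_path)

        # Log the fitted search object as the model. Passing
        # registered_model_name also registers a new model version.
        mlflow.sklearn.log_model(
            sk_model=rf_random,
            name="rf-model",
            input_example=X_test[:5],  # a few rows are enough for the input example
            registered_model_name="random-forest",
            tags={"version": "latest"},
        )

        # Note: this call registers the same logged model a second time, which
        # is why the output below reports two new versions for a single run.
        model_uri = f"runs:/{run.info.run_id}/rf-model"
        mv = mlflow.register_model(
            model_uri, "random-forest", tags={"version": "latest"}
        )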

        print(f"Name: {mv.name}")
        print(f"Version: {mv.version}")


if __name__ == "__main__":
    main()

Dataset shape: (45211, 8), (45211,)
Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] END bootstrap=True, max_depth=50, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=  36.2s
[CV] END bootstrap=True, max_depth=50, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=  38.1s
[CV] END bootstrap=True, max_depth=50, max_features=sqrt, min_samples_leaf=2, min_samples_split=10, n_estimators=200; total time=  38.3s
[CV] END bootstrap=False, max_depth=90, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=600; total time= 1.6min
[CV] END bootstrap=False, max_depth=90, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=600; total time= 1.7min
[CV] END bootstrap=False, max_depth=90, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=600; total time= 1.7min
[CV] END bootstrap=False, max_depth=60, max_features=log2, min_samples_leaf=2, m

Registered model 'random-forest' already exists. Creating a new version of this model...
2025/11/06 12:15:35 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: random-forest, version 3
Created version '3' of model 'random-forest'.
Registered model 'random-forest' already exists. Creating a new version of this model...
2025/11/06 12:15:35 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: random-forest, version 4
Created version '4' of model 'random-forest'.


Name: random-forest
Version: 4
🏃 View run beautiful-steed-940 at: http://localhost:8080/#/experiments/0/runs/8bcfcdb44e6542eda2106b8d4cc3485f
🧪 View experiment at: http://localhost:8080/#/experiments/0
