# Binary Classification with a Bank Dataset

**Goal**:<br>
Predict whether a client will subscribe to a bank term deposit. The main metric is ROC AUC.
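
As a quick illustration of the metric, the sketch below computes ROC AUC with scikit-learn on a handful of made-up labels and scores (the numbers are not taken from this dataset). It also shows that ROC AUC depends only on how the predictions are ranked, so any strictly increasing rescaling of the scores leaves it unchanged.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted scores, invented purely for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# A strictly increasing transformation keeps the ranking, hence the same AUC.
auc_rescaled = roc_auc_score(y_true, [10 * s + 3 for s in y_score])

print(auc, auc_rescaled)
```

Of the four (positive, negative) pairs here, three are ranked correctly, which gives an AUC of 0.75.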

**Data Description**:
- `age`: Age of the client (numeric)
- `job`: Type of job (categorical: "admin.", "blue-collar", "entrepreneur", etc.)
- `marital`: Marital status (categorical: "married", "single", "divorced")
- `education`: Level of education (categorical: "primary", "secondary", "tertiary", "unknown")
- `default`: Has credit in default? (categorical: "yes", "no")
- `balance`: Average yearly balance in euros (numeric)
- `housing`: Has a housing loan? (categorical: "yes", "no")
- `loan`: Has a personal loan? (categorical: "yes", "no")
- `contact`: Type of communication contact (categorical: "unknown", "telephone", "cellular")
- `day`: Last contact day of the month (numeric, 1-31)
- `month`: Last contact month of the year (categorical: "jan", "feb", "mar", …, "dec")
- `duration`: Last contact duration in seconds (numeric)
- `campaign`: Number of contacts performed during this campaign (numeric)
- `pdays`: Number of days since the client was last contacted from a previous campaign (numeric; -1 means the client was not previously contacted)
- `previous`: Number of contacts performed before this campaign (numeric)
- `poutcome`: Outcome of the previous marketing campaign (categorical: "unknown", "other", "failure", "success")
- `y`: The target variable, whether the client subscribed to a term deposit (binary: "yes", "no")

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/playground-series-s5e8/sample_submission.csv
/kaggle/input/playground-series-s5e8/train.csv
/kaggle/input/playground-series-s5e8/test.csv


## Loading and inspecting the data

In [2]:
# === Imports and settings ===================================================
import os, warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd

# Configure pandas to show full column text without truncation
pd.set_option('display.max_colwidth', None)

from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import lightgbm as lgb

RANDOM_STATE = 42
USE_GPU = False

In [3]:
# === Environment parameters =================================================
TRAIN_PATH = "/kaggle/input/playground-series-s5e8/train.csv"
TEST_PATH  = "/kaggle/input/playground-series-s5e8/test.csv"

# Fall back to local "train.csv" and "test.csv" when not running on Kaggle
if not os.path.exists(TRAIN_PATH):
    TRAIN_PATH = "train.csv"
if not os.path.exists(TEST_PATH):
    TEST_PATH = "test.csv"

# === Loading the data =======================================================
train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)

print("train.shape:", train.shape)
print("test.shape :", test.shape)
print("\nFirst rows of train:")
display(train.head())
display(train.info())

train.shape: (750000, 18)
test.shape : (250000, 17)

First rows of train:


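|   | id | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|----|-----|-----|---------|-----------|---------|---------|---------|------|---------|-----|-------|----------|----------|-------|----------|----------|---|
| 0 | 0 | 42 | technician | married | secondary | no | 7 | no | no | cellular | 25 | aug | 117 | 3 | -1 | 0 | unknown | 0 |
| 1 | 1 | 38 | blue-collar | married | secondary | no | 514 | no | no | unknown | 18 | jun | 185 | 1 | -1 | 0 | unknown | 0 |
| 2 | 2 | 36 | blue-collar | married | secondary | no | 602 | yes | no | unknown | 14 | may | 111 | 2 | -1 | 0 | unknown | 0 |
| 3 | 3 | 27 | student | single | secondary | no | 34 | yes | no | unknown | 28 | may | 10 | 2 | -1 | 0 | unknown | 0 |
| 4 | 4 | 26 | technician | married | secondary | no | 889 | yes | no | cellular | 3 | feb | 902 | 1 | -1 | 0 | unknown | 1 |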


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 750000 entries, 0 to 749999
Data columns (total 18 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   id         750000 non-null  int64 
 1   age        750000 non-null  int64 
 2   job        750000 non-null  object
 3   marital    750000 non-null  object
 4   education  750000 non-null  object
 5   default    750000 non-null  object
 6   balance    750000 non-null  int64 
 7   housing    750000 non-null  object
 8   loan       750000 non-null  object
 9   contact    750000 non-null  object
 10  day        750000 non-null  int64 
 11  month      750000 non-null  object
 12  duration   750000 non-null  int64 
 13  campaign   750000 non-null  int64 
 14  pdays      750000 non-null  int64 
 15  previous   750000 non-null  int64 
 16  poutcome   750000 non-null  object
 17  y          750000 non-null  int64 
dtypes: int64(9), object(9)
memory usage: 103.0+ MB


None
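
`info()` reports no nulls, but the data description lists "unknown" as a category for several columns, so missing information may be hiding as an ordinary label. The sketch below shows one way to measure that, along with the share of positive targets. The `sample` frame is a hypothetical four-row stand-in that reuses the notebook's column names. In the notebook the same lines would run on `train`.

```python
import pandas as pd

# Hypothetical mini-sample; values are invented, column names follow train.csv.
sample = pd.DataFrame({
    "contact": ["cellular", "unknown", "unknown", "telephone"],
    "poutcome": ["unknown", "unknown", "failure", "unknown"],
    "education": ["secondary", "tertiary", "unknown", "primary"],
    "y": [0, 0, 1, 0],
})

# Treat every non-numeric column as categorical.
cat_cols = sample.select_dtypes(exclude="number").columns

# Fraction of rows per column whose value is the literal string "unknown".
unknown_share = (sample[cat_cols] == "unknown").mean().sort_values(ascending=False)

# Share of positive targets, useful for judging class imbalance.
positive_rate = sample["y"].mean()

print(unknown_share)
print(f"positive rate: {positive_rate:.2f}")
```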

## Model training

In [4]:
# ===== Split into X and y (id and previous are dropped later, in the preprocessor) =====
X_train = train.drop(columns=["y"]).copy()
y_train = train["y"].astype(int).values

In [5]:
class FeatureEngineeringTransformer(BaseEstimator, TransformerMixin):
    """
    Feature engineering:
    - balance_log_signed, duration_log
    - duration_bin (quantile bins)
    - age_bin (fixed bins)
    - campaign_cap = min(campaign, 6)
    - month_sin/cos (month as 1..12)
    - day_sin/cos  (day of month 1..31)
    """
    def __init__(self, duration_bins=5):
        self.duration_bins = duration_bins
        self.duration_edges_ = None
        self._month_map = {
            'jan':1,'feb':2,'mar':3,'apr':4,'may':5,'jun':6,
            'jul':7,'aug':8,'sep':9,'oct':10,'nov':11,'dec':12
        }

    def fit(self, X, y=None):
        X = X.copy()
        if "duration" in X.columns:
            self.duration_edges_ = np.unique(
                np.quantile(X["duration"].values, q=np.linspace(0, 1, self.duration_bins+1))
            )
            if len(self.duration_edges_) <= 2:
                self.duration_edges_ = None
        return self

    def transform(self, X):
        X = X.copy()
        # balance_log_signed
        if "balance" in X.columns:
            X["balance_log_signed"] = np.sign(X["balance"]) * np.log1p(np.abs(X["balance"]))
        # duration_log + duration_bin
        if "duration" in X.columns:
            X["duration_log"] = np.log1p(X["duration"])
            if self.duration_edges_ is not None and len(self.duration_edges_) > 2:
                X["duration_bin"] = pd.cut(
                    X["duration"], bins=self.duration_edges_,
                    include_lowest=True, duplicates="drop"
                ).astype(str)
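            else:
                # Fallback when fit() found too few distinct edges: bin on the data being transformed
                X["duration_bin"] = pd.qcut(X["duration"], q=self.duration_bins, duplicates="drop").astype(str)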
        # age_bin
        if "age" in X.columns:
            X["age_bin"] = pd.cut(
                X["age"],
                bins=[17, 25, 35, 45, 60, 120],
                labels=["18-25","26-35","36-45","46-60","60+"],
                include_lowest=True
            ).astype(str)
        # campaign_cap
        if "campaign" in X.columns:
            X["campaign_cap"] = np.minimum(X["campaign"], 6)
        # month_sin/cos
        if "month" in X.columns:
            mnum = pd.Series(X["month"]).map(self._month_map).fillna(0).astype(int)
            X["month_sin"] = np.sin(2*np.pi * mnum / 12)
            X["month_cos"] = np.cos(2*np.pi * mnum / 12)
        # day_sin/cos
        if "day" in X.columns:
            day_num = pd.to_numeric(X["day"], errors="coerce").fillna(0).astype(int)
            X["day_sin"] = np.sin(2*np.pi * day_num / 31)
            X["day_cos"] = np.cos(2*np.pi * day_num / 31)
        return X


class PdaysTransformer(BaseEstimator, TransformerMixin):
    """
    - was_contacted = (pdays != -1).astype(int)
    - pdays_pos     = clip(pdays, 0, None)
    - drop 'pdays' and 'previous'
    - cast day/campaign to string (treated as categories)
    """
    def __init__(self, cast_to_str_cols=("day", "campaign")):
        self.cast_to_str_cols = cast_to_str_cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        if "pdays" in X.columns:
            X["was_contacted"] = (X["pdays"] != -1).astype(int)
            X["pdays_pos"] = X["pdays"].clip(lower=0)
            X = X.drop(columns=["pdays"])
        if "previous" in X.columns:
            X = X.drop(columns=["previous"])
        for c in self.cast_to_str_cols:
            if c in X.columns:
                X[c] = X[c].astype(str)
        return X
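
Two of the transformations above are easier to judge in isolation. The sketch below re-implements the cyclic encoding and the signed log as standalone helpers. The names `cyclic_encode` and `signed_log1p` are hypothetical and do not appear in the notebook. It checks that December lies closer to January than to June on the unit circle, and that the signed log is symmetric around zero. It also shows a side effect of the `fillna(0)` fallback for unmapped months: the value 0 encodes to the same point as December.

```python
import numpy as np

def cyclic_encode(values, period):
    """Map values onto the unit circle so that the end of a period meets its start."""
    values = np.asarray(values, dtype=float)
    angle = 2 * np.pi * values / period
    return np.sin(angle), np.cos(angle)

def signed_log1p(values):
    """Log-compress magnitudes while keeping the sign (balance can be negative)."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.log1p(np.abs(values))

# January, June and December as month numbers 1..12.
m_sin, m_cos = cyclic_encode([1, 6, 12], period=12)
points = np.column_stack([m_sin, m_cos])
dist_dec_jan = np.linalg.norm(points[2] - points[0])
dist_dec_jun = np.linalg.norm(points[2] - points[1])

# Month 0 (the fallback for unmapped values) and month 12 share one encoding.
fallback_equals_dec = np.allclose(cyclic_encode(0, 12), cyclic_encode(12, 12))

balance_demo = signed_log1p([-100, 0, 100])

print(dist_dec_jan, dist_dec_jun, fallback_equals_dec, balance_demo)
```

If the collision between unmapped months and December matters, one option is to add a separate indicator column for the unmapped case.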

In [6]:
# ===== Features =====
cat_features = [
    'job','marital','education','default','housing','loan',
    'contact','month','poutcome','day','campaign',
    'was_contacted','duration_bin','age_bin'
]
num_features = [
    'age','balance','duration','pdays_pos',
    'balance_log_signed','duration_log',
    'campaign_cap','month_sin','month_cos','day_sin','day_cos'
]

In [7]:
# ===== Preprocessor =====
preprocess = Pipeline(steps=[
    ("fe", FeatureEngineeringTransformer(duration_bins=5)),
    ("pdays_block", PdaysTransformer(cast_to_str_cols=("day","campaign"))),
    ("ct", ColumnTransformer(
        transformers=[
            ("num", "passthrough", num_features),
            ("cat", OneHotEncoder(handle_unknown="ignore", sparse_output=True), cat_features),
        ],
        remainder="drop"
    ))
])
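
The `handle_unknown="ignore"` setting matters because the binned features and the string-cast `day`/`campaign` columns can produce labels at prediction time that never occurred during fitting. The sketch below uses two tiny, made-up frames to show what happens: a category unseen in `fit` becomes an all-zero one-hot block rather than raising an error. The `"astronaut"` value is invented for the demonstration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frames; "astronaut" is deliberately absent from the fitting data.
train_df = pd.DataFrame({
    "balance": [7, 514, 602],
    "job": ["technician", "blue-collar", "student"],
})
test_df = pd.DataFrame({"balance": [34], "job": ["astronaut"]})

ct = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["balance"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ],
    remainder="drop",
)
ct.fit(train_df)

encoded = ct.transform(test_df)
# ColumnTransformer returns sparse or dense output depending on density.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

print(encoded)  # one numeric column followed by three one-hot columns
```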

In [8]:
# ---------- Best LGBM hyperparameters ----------
best_params = {
    'colsample_bytree': 0.6351472444970627,
    'learning_rate': 0.019717734140941426,
    'max_bin': 255,               # GPU-compatible
    'max_depth': -1,
    'min_child_samples': 26,
    'min_split_gain': 0.023339212828336266,
    'n_estimators': 1710,
    'num_leaves': 182,
    'reg_alpha': 1.2122037621116772,
    'reg_lambda': 3.971447249832312,
    'scale_pos_weight': 1.0,
    'subsample': 0.9718116576991304,   # only takes effect when subsample_freq > 0
    'random_state': RANDOM_STATE,
    'verbosity': -1,
}
if USE_GPU:
    best_params['device_type'] = 'gpu'

In [9]:
# ---------- Pipeline and training ----------
pipe_submit = Pipeline(steps=[
    ("prep", preprocess),
    ("clf", lgb.LGBMClassifier(objective="binary", **best_params))
])

pipe_submit.fit(X_train, y_train)
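
The notebook fits on the full training set and goes straight to prediction, so it never produces a local estimate of ROC AUC. One possible way to get one is stratified K-fold cross-validation, sketched below. To stay self-contained and quick, the sketch relies on stand-ins that are assumptions, not part of the notebook: synthetic data from `make_classification` and a `LogisticRegression` model. In the notebook, `pipe_submit`, `X_train` and `y_train` would take their place, at a much higher training cost.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (assumption; not the bank dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.85, 0.15], random_state=42
)

# Stand-in estimator; any scikit-learn compatible pipeline could be used instead.
model = LogisticRegression(max_iter=1000)

# Stratification keeps the positive rate similar across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")

print(f"ROC AUC: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Because the preprocessing sits inside the pipeline, cross-validating the whole pipeline refits the quantile bin edges on each training fold, which keeps validation rows out of the binning step.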

In [10]:
# ---------- Prediction and submission ----------
test_ids = test["id"].copy()
proba = pipe_submit.predict_proba(test)[:, 1]

submission = pd.DataFrame({"id": test_ids, "y": proba})
submission.to_csv("submission.csv", index=False)
print("Saved: submission.csv")

Saved: submission.csv
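
Before uploading, it can be worth running a few cheap checks on the file. The helper below is a hypothetical sketch. `check_submission` is not part of the notebook, and the expected `id`/`y` layout is inferred from the `submission` frame built above. It checks the column names, row count and id order, and that every prediction is a finite probability.

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems; an empty list means the frame looks fine."""
    problems = []
    if list(submission.columns) != ["id", "y"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems  # the remaining checks rely on these two columns
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test set")
    elif not np.array_equal(submission["id"].to_numpy(), np.asarray(expected_ids)):
        problems.append("ids differ from the test set or are out of order")
    if submission["y"].isna().any():
        problems.append("missing predictions")
    if not submission["y"].dropna().between(0.0, 1.0).all():
        problems.append("predictions outside [0, 1]")
    return problems

# Made-up ids and predictions for demonstration only.
ids = [10, 11, 12]
good = pd.DataFrame({"id": ids, "y": [0.1, 0.9, 0.5]})
bad = pd.DataFrame({"id": ids, "y": [0.1, np.nan, 1.2]})

print(check_submission(good, ids))
print(check_submission(bad, ids))
```

In the notebook this would be called as `check_submission(submission, test_ids)`.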
