# Stacking Classifier Training

We implemented a stacking ensemble for the Titanic dataset using a modular scikit-learn pipeline: logistic regression, random forest, and XGBoost base learners combined by a logistic regression meta-learner.
Validation accuracy on a stratified 20% hold-out split reached 0.799 (comparable to the official Kaggle tutorial's 0.775).
The project demonstrates pipeline encapsulation, data leakage prevention, and ensemble design.

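## Feature engineering and preprocessing
1. Apply the same feature steps to both the train and test frames.
2. Add an Age_is_missing flag that records whether Age was absent.
3. Fill missing Age values with the median.
4. Split the features into categorical and numeric groups, and impute any remaining missing values inside the pipeline.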
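
The flag-then-fill steps above can be sketched on a toy frame. This is a minimal illustration rather than the notebook's exact code: `toy_train` and `toy_test` are made-up frames, and the sketch assumes the fill value is taken from the training frame only, so that no test-set statistic enters the features.

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv
toy_train = pd.DataFrame({"Age": [22.0, None, 38.0, 26.0]})
toy_test = pd.DataFrame({"Age": [None, 30.0]})

# Assumption for this sketch: the median comes from the training frame only
train_age_median = toy_train["Age"].median()

for frame in (toy_train, toy_test):
    # Record missingness before filling; after the fill the flag would always be 0
    frame["Age_is_missing"] = frame["Age"].isna().astype(int)
    frame["Age"] = frame["Age"].fillna(train_age_median)

print(toy_train)
print(toy_test)
```

The order matters: computing the flag after the fill would lose the information the flag is meant to carry.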


In [1]:
import os
os.chdir("d:/playground/kaggle/Titanic")
os.environ["PYTHONIOENCODING"] = "utf-8"

# Create a clean ASCII-only temp folder on D:
os.environ["JOBLIB_TEMP_FOLDER"] = r"D:\playground\temp\joblib_temp"
os.makedirs(os.environ["JOBLIB_TEMP_FOLDER"], exist_ok=True)


In [2]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, StackingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# Load data
df = pd.read_csv("data/train.csv")
t_df = pd.read_csv("data/test.csv")

# ---- Features ----
for tar in (df, t_df):
    # Age missing flag
    tar['Age_is_missing'] = tar['Age'].isna().astype(int)

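    # Fill Age with the median of the same frame.
    # Note: Age itself is not in feature_cols below, so only the flag reaches the model.
    tar['Age'] = tar['Age'].fillna(tar['Age'].median())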

feature_cols = ['Pclass','Sex','SibSp','Parch','Age_is_missing']
X = df[feature_cols].copy()
y = df['Survived']
X_test_final = t_df[feature_cols].copy()

# ---- Define transformers ----
numeric_features = ['SibSp','Parch']   # small integers
categorical_features = ['Pclass','Sex','Age_is_missing']  # treated as categorical

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())   # helps LR
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
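
To see what a preprocessor of this shape produces, the sketch below fits an equivalent ColumnTransformer on a small made-up frame. The `toy` data and the `toy_preprocessor` name are illustrative only, and `sparse_threshold=0` is set here just to force a dense array that is easy to inspect; the notebook's own preprocessor does not set it.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with the same column roles as the Titanic features
toy = pd.DataFrame({
    "SibSp": [0, 1, 0, 3],
    "Parch": [0, 0, 2, 1],
    "Pclass": [1, 3, 3, 2],
    "Sex": ["male", "female", "female", "male"],
})

num_pipe = Pipeline([("imputer", SimpleImputer(strategy="median")),
                     ("scaler", StandardScaler())])
cat_pipe = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                     ("onehot", OneHotEncoder(handle_unknown="ignore"))])

toy_preprocessor = ColumnTransformer(
    transformers=[("num", num_pipe, ["SibSp", "Parch"]),
                  ("cat", cat_pipe, ["Pclass", "Sex"])],
    sparse_threshold=0,  # sketch only: always return a dense array
)

encoded = toy_preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 3 Pclass levels + 2 Sex levels
print(encoded.shape)

# A category never seen during fit is encoded as all zeros instead of raising
unseen = pd.DataFrame({"SibSp": [0], "Parch": [0], "Pclass": [4], "Sex": ["male"]})
unseen_encoded = toy_preprocessor.transform(unseen)
print(unseen_encoded)
```

Because the imputers, scaler, and encoder are fitted inside the transformer, calling `fit` on training rows only keeps their statistics free of validation or test data.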


## Model training
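Three base models, combined by a logistic regression meta-learner:
1. Logistic regression: a linear baseline that captures additive effects.
2. XGBoost: boosted trees that reduce bias by fitting each new tree to the errors of the previous ones.
3. Random forest: bagged trees that reduce variance by averaging many decorrelated trees.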

In [3]:
# Base learners
estimators = [
    ('lr', LogisticRegression(max_iter=2000)),
    ('rf', RandomForestClassifier(
        n_estimators=200, max_depth=7, min_samples_leaf=3, random_state=42)),
    ('xgb', XGBClassifier(
        n_estimators=300, learning_rate=0.05, max_depth=4,
        subsample=0.8, colsample_bytree=0.8, random_state=42,
        eval_metric='logloss'))
]

# Meta learner
final_estimator = LogisticRegression(max_iter=2000)

stack_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=final_estimator,
    cv=StratifiedKFold(n_splits=5),
    n_jobs=-1,
    passthrough=False
)
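
The idea behind stacking with `passthrough=False` can be sketched by hand: each base learner produces out-of-fold probabilities, and the meta-learner is trained on those columns alone. The sketch below is a simplified illustration on synthetic data, not a reproduction of StackingClassifier's internals; `X_demo`, `y_demo`, and the two-learner setup are assumptions made to keep it short and free of the XGBoost dependency.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic binary classification data (stand-in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
cv = StratifiedKFold(n_splits=5)

base_learners = [
    LogisticRegression(max_iter=2000),
    RandomForestClassifier(n_estimators=50, random_state=0),
]

# One column per base learner: the out-of-fold probability of class 1.
# Every row is predicted by a model that never saw that row during fitting.
meta_features = np.column_stack([
    cross_val_predict(est, X_demo, y_demo, cv=cv, method="predict_proba")[:, 1]
    for est in base_learners
])

# The meta-learner sees only the base-learner predictions, not the raw features
meta_learner = LogisticRegression(max_iter=2000).fit(meta_features, y_demo)
print(meta_features.shape)
print(meta_learner.coef_)
```

Using out-of-fold predictions is what keeps the meta-learner from simply rewarding whichever base model overfits the training rows most.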

In [4]:
from sklearn.model_selection import train_test_split

# Split the data explicitly
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Full pipeline = preprocessing + stacking model
pipe = Pipeline([
    ('preprocess', preprocessor),
    ('model', stack_clf)
])

# Fit and evaluate
pipe.fit(X_train, y_train)
print("Train accuracy:", pipe.score(X_train, y_train))
print("Validation accuracy:", pipe.score(X_val, y_val))

Train accuracy: 0.824438202247191
Validation accuracy: 0.7988826815642458
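
A single hold-out score on a split this small carries noticeable sampling noise. The arithmetic below is a rough sketch using the normal approximation to a binomial proportion; it assumes the validation set has 179 rows (20% of the 891 rows in the standard train.csv), which the notebook does not print explicitly.

```python
import math

n_val = 179                      # assumption: 20% of 891 training rows
accuracy = 0.7988826815642458    # validation accuracy printed above

# Number of correctly classified validation rows implied by the accuracy
correct = round(accuracy * n_val)

# Standard error of a proportion and an approximate 95% interval
std_err = math.sqrt(accuracy * (1 - accuracy) / n_val)
low = accuracy - 1.96 * std_err
high = accuracy + 1.96 * std_err

print(f"{correct} of {n_val} correct")
print(f"standard error: {std_err:.3f}")
print(f"approx. 95% interval: {low:.3f} to {high:.3f}")
```

The standard error comes out at roughly 0.03, so differences of a couple of points between models on this split should be read with caution.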


In [7]:
pred_test = pipe.predict(X_test_final)

submission_df = pd.DataFrame({
    'PassengerId': t_df['PassengerId'],
    'Survived': pred_test.astype(int)
})
os.makedirs("submissions", exist_ok=True)  # make sure the output folder exists
submission_df.to_csv("submissions/submission_stack3.csv", index=False)
print("Saved submission_stack3.csv")

Saved submission_stack3.csv
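
Before uploading, a quick structural check on the submission frame can catch formatting mistakes. The helper below is a hypothetical sketch (`check_submission` and `toy_submission` are not part of the notebook); it only assumes the two-column PassengerId/Survived layout used above.

```python
import pandas as pd

# Hypothetical stand-in for submission_df
toy_submission = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Survived": [0, 1, 0],
})

def check_submission(sub: pd.DataFrame) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(sub.columns) == ["PassengerId", "Survived"], "unexpected columns"
    assert sub["PassengerId"].is_unique, "duplicate passenger ids"
    assert not sub.isna().any().any(), "missing values present"
    assert sub["Survived"].isin([0, 1]).all(), "predictions must be 0 or 1"

check_submission(toy_submission)
print("submission looks valid")
```

In the notebook the same call could be made on `submission_df` just before `to_csv`.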
