<p><font size="6" color='grey'> <b>
Machine Learning
</b></font> </br></p>
<p><font size="5" color='grey'> <b>
Ensemble Learning - Sklearn Boosting - Titanic
</b></font> </br></p>

---


In [1]:
#@title 🔧 Colab environment { display-mode: "form" }
!uv pip install --system -q git+https://github.com/ralf-42/Python_Modules
from ml_lib.utilities import get_ipinfo
import sys
print()
print(f"Python Version: {sys.version}")
print()
get_ipinfo()


Python Version: 3.12.11 (main, Jun  4 2025, 08:56:18) [GCC 11.4.0]

IP-Adresse: 35.229.185.225
Hostname: 225.185.229.35.bc.googleusercontent.com
Stadt: Taipei
Region: Taiwan
Land: TW
Koordinaten: 25.0531,121.5264
Provider: AS396982 Google LLC
Postleitzahl: None
Zeitzone: Asia/Taipei


# 0  | Install & Import
***

In [2]:
# Install
!uv pip install --system -q dtreeviz

In [3]:
# Import
from pandas import read_csv, DataFrame
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    classification_report,
)

import plotly.express as px
import plotly.subplots as sp

In [4]:
# Suppress warnings
import warnings

warnings.filterwarnings("ignore")

# 1  | Understand
***


<p><font color='black' size="5">📋 Checklist</font></p>

✅ Understand the task</br>
✅ Collect data</br>
✅ Statistical analysis (min, max, mean, correlation, ...)</br>
✅ Data visualization (scatter plot, box plot, ...)</br>
✅ Define the Prepare steps</br>

<p><font color='black' size="5">
📒 Use case
</font></p>

This is the legendary Titanic ML competition, a classic first challenge for getting started with ML modeling.

The task is simple: use machine learning to build a model that predicts which passengers survived the sinking of the Titanic.

In this notebook we use two boosting algorithms from scikit-learn: AdaBoost and Gradient Boosting.

[Titanic Org](https://www.encyclopedia-titanica.org/)

[DataSet](https://www.openml.org/search?type=data&status=active&id=41265)

[Info](https://www.kaggle.com/competitions/titanic/data)


**Data fields:**   
+ Age: age of the passenger
+ Fare: ticket price
+ Sex: sex of the passenger. The ordinal encoding used later sorts categories alphabetically, so female = 0 and male = 1 (assuming the raw column holds the strings `female` and `male`)
+ sibsp: number of siblings or spouses aboard. The dataset defines these family relations as follows: sibling = brother, sister, stepbrother, stepsister; spouse = husband, wife (mistresses and fiancés were ignored)
+ parch: number of parents or children aboard. The dataset defines these family relations as follows: parent = mother, father; child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, so parch = 0 for them
+ Pclass: passenger class, 1st to 3rd
+ Embarked: port of embarkation

Only pclass, sex, age, sibsp, and parch are loaded as features below; Fare and Embarked are listed for reference.

In [5]:
df = read_csv(
    "https://raw.githubusercontent.com/ralf-42/ML_Intro/main/02%20data/Titanic.csv",
    usecols=["pclass", "survived", "sex", "age", "sibsp", "parch"],
)

IncompleteRead: IncompleteRead(32768 bytes read, 83738 more expected)
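
Downloads from a raw GitHub URL can fail transiently, for example with an `IncompleteRead`. The following is a minimal sketch of a retry wrapper, not part of the original notebook. The helper name `read_csv_with_retry` and its parameters `retries`, `delay`, and `reader` are hypothetical and chosen for illustration only.

```python
import time

from pandas import read_csv


def read_csv_with_retry(url, retries=3, delay=2.0, reader=read_csv, **kwargs):
    """Call reader(url, **kwargs) and retry up to `retries` times on failure."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return reader(url, **kwargs)
        except Exception as err:  # broad on purpose: network errors vary by backend
            last_error = err
            print(f"Attempt {attempt}/{retries} failed: {err!r}")
            if attempt < retries:
                time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"Could not read {url!r} after {retries} attempts") from last_error
```

It could be used in place of the plain call, for instance `df = read_csv_with_retry(url, usecols=[...])` with the same URL and column list as above.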

In [None]:
data = df.copy()
target = data.pop("survived")

<p><font color='black' size="5">
🔎 EDA (Exploratory Data Analysis) with pandas
</font></p>

In [None]:
data.info()

In [None]:
data.describe().T

In [None]:
data.groupby("sex").count()

In [None]:
target.value_counts()

# 2 | Prepare

---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Drop unneeded features</br>
✅ Determine/change data types</br>
✅ Find/remove duplicates</br>
✅ Handle missing values</br>
✅ Handle outliers</br>
✅ Encode categorical features</br>
✅ Scale numerical features</br>
✅ Feature engineering (create new features)</br>
✅ Reduce dimensionality</br>
✅ Resampling (over-/undersampling)</br>
✅ Create/configure pipeline</br>
✅ Perform train-test split</br>

<p><font color='black' size="5">
Determine data types
</font></p>

In [None]:
all_col = data.columns
num_col = data.select_dtypes(include="number").columns
cat_col = data.select_dtypes(exclude="number").columns

<p><font color='black' size="5">
Missing Values
</font></p>

In [None]:
mv = data.isnull().sum()
mv_col = list(mv[mv > 0].index)

In [None]:
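# Fill missing values with the most frequent value of each affected column.
# Assigning the transformed array directly keeps the original index and column
# names instead of relying on alignment with a freshly built DataFrame.
imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
data[mv_col] = imputer.fit_transform(data[mv_col])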
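
To make the effect of the imputation strategy visible, here is a small sketch on a toy column. The values are invented for illustration and are not taken from the Titanic data. It compares `strategy="most_frequent"`, as used above, with `strategy="median"`, which is a common alternative for a continuous feature such as age.

```python
import numpy as np
from pandas import DataFrame
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative values only)
toy = DataFrame({"age": [22.0, 22.0, 35.0, np.nan, 58.0]})

most_frequent = SimpleImputer(strategy="most_frequent").fit_transform(toy)
median = SimpleImputer(strategy="median").fit_transform(toy)

print(most_frequent.ravel())  # missing value replaced by the mode of the column
print(median.ravel())         # missing value replaced by the median of the observed values
```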

<p><font color='black' size="5">
Encoding
</font></p>

In [None]:
coder = OrdinalEncoder()
data[cat_col] = coder.fit_transform(data[cat_col])
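
`OrdinalEncoder` assigns integer codes in sorted category order. The sketch below shows how to inspect the resulting mapping on a toy column. It assumes the raw `sex` column contains the strings `male` and `female`, which is not verified here.

```python
from pandas import DataFrame
from sklearn.preprocessing import OrdinalEncoder

# Toy column (assumption: the real column uses the same two string labels)
toy = DataFrame({"sex": ["male", "female", "female", "male"]})

toy_coder = OrdinalEncoder()
encoded = toy_coder.fit_transform(toy)

# categories_ lists the labels in code order: index 0 is encoded as 0.0, and so on
print(toy_coder.categories_)
print(encoded.ravel())
```

Checking `coder.categories_` on the fitted encoder from the cell above is the reliable way to read the codes in later plots and reports.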

<p><font color='black' size="5">
Scaling
</font></p>

In [None]:
scaler = MinMaxScaler()
data[num_col] = scaler.fit_transform(data[num_col])
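
As a quick check of what `MinMaxScaler` computes, the following sketch reproduces its default output by hand with the formula (x - min) / (max - min) on a toy array. The values are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[1.0], [2.0], [3.0]])

scaled = MinMaxScaler().fit_transform(x)

# Manual version of the default scaling to the range [0, 1]
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Tree-based models split on thresholds, so a monotonic rescaling like this should not change the splits they find. The step matters mainly if the same prepared data is reused for scale-sensitive models.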

<p><font color='black' size="5">
Train-Test-Split
</font></p>

In [None]:
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.30, random_state=42, stratify=target
)
data_train.shape, data_test.shape, target_train.shape, target_test.shape
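
The cells above fit the imputer, encoder, and scaler on the full dataset before splitting, so statistics from the test rows flow into the preprocessing. One way to avoid this is to split first and wrap the preprocessing in a `Pipeline` that is fitted on the training data only. The sketch below uses a synthetic stand-in with the same column names. The data, the target rule, and the choice of median imputation are assumptions made for illustration.

```python
import numpy as np
from pandas import DataFrame
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in for the Titanic features (not real data)
rng = np.random.default_rng(42)
n = 200
X = DataFrame({
    "pclass": rng.integers(1, 4, n),
    "sex": rng.choice(["male", "female"], n),
    "age": rng.uniform(1, 80, n),
    "sibsp": rng.integers(0, 4, n),
    "parch": rng.integers(0, 3, n),
})
X.loc[rng.choice(n, 20, replace=False), "age"] = np.nan  # inject missing ages
y = (X["sex"] == "female").astype(int)  # artificial target rule for the demo

# Split first, so the test rows never influence the fitted preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["pclass", "age", "sibsp", "parch"]),
    ("cat", OrdinalEncoder(), ["sex"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("model", GradientBoostingClassifier(random_state=42)),
])

pipe.fit(X_train, y_train)          # imputer and encoder are fitted on train only
test_acc = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.2f}")
```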

# 3 | Modeling
---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Select model</br>
✅ Extend/configure pipeline</br>
✅ Run training</br>
✅ Hyperparameter tuning</br>
✅ Cross-validation</br>
✅ Bootstrapping</br>
✅ Regularization</br>

 <p><font color='black' size="5">
Model selection & training - AdaBoost
</font></p>

[AdaBoost docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)    

In [None]:
# AdaBoost with a depth-1 decision tree (stump) as the base estimator
ada_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    learning_rate=1.0,
    random_state=42
)

In [None]:
ada_model.fit(data_train, target_train)

 <p><font color='black' size="5">
Model selection & training - Gradient Boosting
</font></p>

[Gradient Boosting docs](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)    

In [None]:
# Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

In [None]:
gb_model.fit(data_train, target_train)
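
The checklist mentions hyperparameter tuning and cross-validation, which the cells above do not carry out. A minimal sketch of both with `GridSearchCV` follows. It uses synthetic data, and the grid values are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Small example grid around the values used in the notebook
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,                # 5-fold cross-validation for every parameter combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(f"Mean CV accuracy of best setting: {search.best_score_:.3f}")
```

With the notebook's data, `search.fit(data_train, target_train)` would keep the test set untouched for the final evaluation.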

# 4 | Evaluate
---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Create predictions (train, test)</br>
✅ Check model quality</br>
✅ Create residual analysis</br>
✅ Check feature importance/selection</br>
✅ Create robustness test</br>
✅ Create model interpretation</br>
✅ Create sensitivity analysis</br>
✅ Communication (key takeaways)</br>

<p><font color='black' size="5">
Prediction - AdaBoost
</font></p>

In [None]:
ada_train_pred = ada_model.predict(data_train)
ada_test_pred = ada_model.predict(data_test)

<p><font color='black' size="5">
Prediction - Gradient Boosting
</font></p>

In [None]:
gb_train_pred = gb_model.predict(data_train)
gb_test_pred = gb_model.predict(data_test)

<p><font color='black' size="5">
Accuracy comparison
</font></p>

In [None]:
# AdaBoost Accuracy
ada_acc_train = accuracy_score(target_train, ada_train_pred) * 100
ada_acc_test = accuracy_score(target_test, ada_test_pred) * 100

print(f"AdaBoost -- Train -- Accuracy: {ada_acc_train:5.2f}%")
print(f"AdaBoost -- Test  -- Accuracy: {ada_acc_test:5.2f}%")
print()

# Gradient Boosting Accuracy
gb_acc_train = accuracy_score(target_train, gb_train_pred) * 100
gb_acc_test = accuracy_score(target_test, gb_test_pred) * 100

print(f"Gradient Boosting -- Train -- Accuracy: {gb_acc_train:5.2f}%")
print(f"Gradient Boosting -- Test  -- Accuracy: {gb_acc_test:5.2f}%")

<p><font color='black' size="5">
Confusion Matrix - AdaBoost
</font></p>

In [None]:
conf_matrix = confusion_matrix(target_test, ada_test_pred)
display_labels_ = ["Not Survived", "Survived"]
disp = ConfusionMatrixDisplay(conf_matrix, display_labels=display_labels_)
disp.plot(cmap="Blues")

In [None]:
print(
    classification_report(target_test, ada_test_pred, target_names=display_labels_)
)
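
The precision and recall values in the classification report follow directly from the confusion matrix. The sketch below derives them by hand for the positive class on a small invented label vector and compares the result with scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Invented labels for illustration (1 = survived, 0 = not survived)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted survivors that really survived
recall = tp / (tp + fn)     # share of actual survivors that were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(recall, recall_score(y_true, y_pred)))
```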

<p><font color='black' size="5">
Confusion Matrix - Gradient Boosting
</font></p>

In [None]:
conf_matrix = confusion_matrix(target_test, gb_test_pred)
display_labels_ = ["Not Survived", "Survived"]
disp = ConfusionMatrixDisplay(conf_matrix, display_labels=display_labels_)
disp.plot(cmap="Greens")

In [None]:
print(
    classification_report(target_test, gb_test_pred, target_names=display_labels_)
)

<p><font color='black' size="5">
Feature importance comparison
</font></p>

In [None]:
# AdaBoost Feature Importance
fig1 = px.bar(
    x=ada_model.feature_importances_,
    y=data.columns,
    title="AdaBoost Feature Importance",
    width=500,
    height=400
).update_yaxes(categoryorder="total ascending")

# Gradient Boosting Feature Importance
fig2 = px.bar(
    x=gb_model.feature_importances_,
    y=data.columns,
    title="Gradient Boosting Feature Importance",
    width=500,
    height=400
).update_yaxes(categoryorder="total ascending")

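# Create subplots
fig = sp.make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("AdaBoost", "Gradient Boosting")
)

for trace in fig1.data:
    fig.add_trace(trace, row=1, col=1)
for trace in fig2.data:
    fig.add_trace(trace, row=1, col=2)

# The category order set on fig1/fig2 is a layout property and is not copied
# with the traces, so it is applied again to both subplot axes here.
fig.update_yaxes(categoryorder="total ascending")
fig.update_layout(width=1000, height=500, title_text="Feature importance comparison")
fig.show()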
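
The `feature_importances_` attribute used above is impurity-based and computed from the training data. As a complementary view, permutation importance measures how much the test score drops when a single feature is shuffled. The sketch below uses synthetic data, so the ranking it prints is unrelated to the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with five features
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the accuracy drop
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)

for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

With the notebook's objects, the call would be `permutation_importance(gb_model, data_test, target_test, ...)`, and `data.columns` would supply the feature names.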

<p><font color='black' size="5">
Model comparison - summary
</font></p>

In [None]:
# Summary of the results (Overfitting = train accuracy minus test accuracy)
results = DataFrame({
    'Model': ['AdaBoost', 'Gradient Boosting'],
    'Train_Accuracy': [ada_acc_train, gb_acc_train],
    'Test_Accuracy': [ada_acc_test, gb_acc_test],
    'Overfitting': [ada_acc_train - ada_acc_test, gb_acc_train - gb_acc_test]
})

results

# 5 | Deploy
---

<p><font color='black' size="5">📋 Checklist</font></p>

✅ Model export and storage</br>
✅ Dependencies and environment</br>
✅ Security and data privacy</br>
✅ Integrate into production</br>
✅ Tests and validation</br>
✅ Documentation & maintenance</br>
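
For the first checklist item, one common option is to persist the fitted model with `joblib`. The sketch below trains a model on synthetic data, writes it to a temporary directory, and reloads it. The file name `gb_model.joblib` is an arbitrary example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data and model
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Save the fitted model and load it back
path = os.path.join(tempfile.mkdtemp(), "gb_model.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model should give identical predictions
same = bool((restored.predict(X) == model.predict(X)).all())
print(same)
```

A saved model should only be loaded from a trusted source and with matching library versions, since joblib files are pickle-based. Fitted preprocessing objects such as the encoder and scaler need to be saved as well, or bundled with the model in a pipeline.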