<font size="6" color='grey'> <b>
Machine Learning
</b></font> </br>

<font size="5" color='grey'> <b>
Pipeline - Decision Tree Regression - Diamonds
</b></font> </br>

---

In [None]:
#@title 🔧 Colab environment { display-mode: "form" }
!uv pip install --system -q git+https://github.com/ralf-42/Python_Modules
from ml_lib.utilities import get_ipinfo
import sys
print()
print(f"Python Version: {sys.version}")
print()
get_ipinfo()

# 0  | Install & Import
***

In [None]:
# Install

In [None]:
# Import
from pandas import read_csv, DataFrame
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error

import plotly.express as px
import plotly.subplots as sp
import matplotlib.pyplot as plt

In [None]:
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")

# 1 | Understand
***

<font color='black' size="5">📋 Checklist</font>

✅ Understand the task</br>
✅ Collect data</br>
✅ Statistical analysis (min, max, mean, correlation, ...)</br>
✅ Data visualization (scatter plot, box plot, ...)</br>
✅ Define the Prepare steps</br>

<font color='black' size="5">
Use case
</font>

---

This classic dataset contains the prices and other attributes of almost 54,000 diamonds.



[DataSet](https://www.openml.org/search?type=data&status=active&id=42225)

[Info](https://www.kaggle.com/datasets/shivam2503/diamonds)

In [None]:
df = read_csv(
    "https://raw.githubusercontent.com/ralf-42/ML_Intro/main/02%20data/diamonds.csv",
    usecols=[ "carat", "cut", "color", "clarity", "depth", "table", "price", ])

In [None]:
data = df.copy()
target = data.pop("price")
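
The checklist above calls for a statistical analysis (min, max, mean, correlation), which the notebook does not show. The following is a minimal sketch of that step. To stay self-contained it runs on a tiny hand-made frame named `sample`, whose values are made up and which only borrows column names from the diamonds data; in the notebook the same calls would be applied to `df`.

```python
from pandas import DataFrame

# Hypothetical mini sample with column names borrowed from the diamonds data;
# the values are made up for illustration only.
sample = DataFrame({
    "carat": [0.23, 0.31, 0.70, 1.01, 1.50],
    "cut": ["Ideal", "Good", "Premium", "Ideal", "Fair"],
    "depth": [61.5, 63.3, 60.2, 62.1, 64.8],
    "price": [326, 335, 2757, 4500, 9000],
})

# Min, mean and max of the numeric columns
summary = sample.describe()
print(summary.loc[["min", "mean", "max"]])

# Pearson correlation of each numeric column with the target
numeric = sample.select_dtypes(include="number")
corr_with_price = numeric.corr()["price"].sort_values(ascending=False)
print(corr_with_price)

# Frequency of each level of a categorical column
cut_counts = sample["cut"].value_counts()
print(cut_counts)
```

Sorting the correlations puts the features most linearly related to `price` at the top, which can help decide which columns deserve a closer look in a scatter plot.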

# 2 |  Prepare

---

<font color='black' size="5">📋 Checklist</font>

✅ Drop unneeded features</br>
✅ Determine/change data types</br>
✅ Find/remove duplicates</br>
✅ Handle missing values</br>
✅ Handle outliers</br>
✅ Encode categorical features</br>
✅ Scale numerical features</br>
✅ Feature engineering (create new features)</br>
✅ Reduce dimensionality</br>
✅ Resampling (over-/undersampling)</br>
✅ Create/configure pipeline</br>
✅ Perform train-test split</br>

<font color='black' size="5">
Determine data types
</font>

In [None]:
all_col = data.columns
num_col = data.select_dtypes(include="number").columns
cat_col = data.select_dtypes(exclude="number").columns

<font color='black' size="5">
Missing values & encoding - pipeline for categorical features
</font>

In [None]:
pipe_cat = Pipeline(
    [
        ("imputer_c", SimpleImputer(strategy="most_frequent")),
        ("encoder",OrdinalEncoder(unknown_value=999, handle_unknown="use_encoded_value")),
    ]
)
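
To illustrate what the encoder settings in `pipe_cat` do, here is a small sketch on made-up values. Levels seen during `fit` receive the codes 0, 1, 2, ... in sorted (alphabetical) order, and a level that first appears at transform time is mapped to `unknown_value=999`. The second half shows the `categories` parameter, which fixes an explicit order instead; the quality ranking in `cut_order` is an assumption about the data, not something the notebook specifies.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Made-up values for illustration
train = np.array([["Good"], ["Ideal"], ["Fair"], ["Ideal"]], dtype=object)
new = np.array([["Ideal"], ["Premium"]], dtype=object)  # "Premium" is unseen

# Same settings as in pipe_cat: codes follow the sorted order of the levels
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=999)
enc.fit(train)
print(enc.categories_[0])        # levels learned during fit
new_codes = enc.transform(new).ravel()
print(new_codes)                 # the unseen level is mapped to 999

# Variant with an explicit order (assumed ranking, worst to best)
cut_order = [["Fair", "Good", "Very Good", "Premium", "Ideal"]]
enc_ordered = OrdinalEncoder(
    categories=cut_order,
    handle_unknown="use_encoded_value",
    unknown_value=999,
)
codes = enc_ordered.fit_transform(train).ravel()
print(codes)
```

For a decision tree the explicit order can matter, because splits on an ordinal code are thresholds: an order that reflects quality lets a single split separate "better" from "worse" levels.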

<font color='black' size="5">
Missing values & scaling - pipeline for numerical features
</font>

In [None]:
pipe_num = Pipeline(
    [("imputer_n", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)

<font color='black' size="5">
Combining the two pipelines into one prepare pipeline
</font>

In [None]:
pipe_prepare = ColumnTransformer(
    transformers=[("categorical", pipe_cat, cat_col), ("numerical", pipe_num, num_col)]
)

<font color='black' size="5">
Train-test split
</font>

In [None]:
data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.30, shuffle=True, random_state=42
)
data_train.shape, data_test.shape, target_train.shape, target_test.shape

# 3 |  Modeling
---

 <font color='black' size="5">
Model selection, combining the prepare and model pipelines, training
</font>

<font color='black' size="5">📋 Checklist</font>

✅ Select a model</br>
✅ Extend/configure pipeline</br>
✅ Run training</br>
✅ Hyperparameter tuning</br>
✅ Cross-validation</br>
✅ Bootstrapping</br>
✅ Regularization</br>

In [None]:
# Create a separate pipeline for the model
pipe_model = Pipeline([("tree", DecisionTreeRegressor(max_depth=4, min_samples_split=50))])

In [None]:
# Combine the prepare and model pipelines
model = Pipeline([("prepare", pipe_prepare), ("model", pipe_model)])

In [None]:
model.fit(data_train, target_train)
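
The checklist lists hyperparameter tuning and cross-validation, while the notebook fixes `max_depth=4` and `min_samples_split=50` by hand. The sketch below shows one way such a search could look with `GridSearchCV`. It uses synthetic stand-in data and a simplified `prepare` step (both assumptions), but keeps the notebook's nesting of the steps `model` and `tree`, so that the parameter names carry over.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical): the target depends on the first feature
rng = np.random.default_rng(42)
X = rng.uniform(0, 3, size=(300, 2))
y = 4000 * X[:, 0] ** 2 + rng.normal(0, 200, size=300)

# Same nesting as in the notebook: outer steps "prepare" and "model",
# inner step "tree"
inner = Pipeline([("tree", DecisionTreeRegressor(random_state=42))])
nested = Pipeline([("prepare", StandardScaler()), ("model", inner)])

# Nested parameters are addressed as <outer step>__<inner step>__<parameter>
param_grid = {
    "model__tree__max_depth": [2, 4, 6],
    "model__tree__min_samples_split": [10, 50],
}
search = GridSearchCV(nested, param_grid, cv=5, scoring="r2")
search.fit(X, y)

print(search.best_params_)
print(f"Mean CV R2 of best setting: {search.best_score_:.3f}")
```

In the notebook, the search would be fitted on `data_train` and `target_train` only, so that the test set stays untouched for the final evaluation.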

# 4 | Evaluate
---

<font color='black' size="5">📋 Checklist</font>

✅ Create predictions (train, test)</br>
✅ Check model quality</br>
✅ Perform residual analysis</br>
✅ Check feature importance/selection</br>
✅ Run robustness tests</br>
✅ Interpret the model</br>
✅ Perform sensitivity analysis</br>
✅ Communication (key takeaways)</br>

<font color='black' size="5">
Prediction
</font>

In [None]:
target_train_pred = model.predict(data_train)
target_test_pred = model.predict(data_test)

<font color='black' size="5">
Coefficient of determination (R²)
</font>

In [None]:
r2 = r2_score(target_train, target_train_pred)
print(f"-- Train --- R² score: {r2:5.2f}")

In [None]:
r2 = r2_score(target_test, target_test_pred)
print(f"-- Test --- R² score: {r2:5.2f}")

<font color='black' size="5">
Mean Absolute Error
</font>

In [None]:
mae = mean_absolute_error(target_test, target_test_pred)
print(f"-- Test -- Mean Absolute Error: {mae:5.2f}")
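
To make the two metrics concrete, the following sketch computes them by hand in plain Python on four made-up price pairs (hypothetical values, not taken from the dataset). `r2_score` and `mean_absolute_error` compute the same quantities: MAE is the average absolute deviation in the unit of the target, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Hypothetical true and predicted prices, for illustration only
y_true = [500.0, 1500.0, 3000.0, 5000.0]
y_pred = [700.0, 1300.0, 3500.0, 4500.0]

n = len(y_true)
mean_true = sum(y_true) / n

# MAE: average absolute deviation, in the unit of the target
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# R2 = 1 - SS_res / SS_tot
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")  # MAE: 350.00
print(f"R2:  {r2:.4f}")   # R2:  0.9496
```

Because MAE is expressed in price units while R² is unitless, the two complement each other: an R² close to 1 can still go along with an MAE that is large in absolute terms.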

<font color='black' size="5">
Inspecting intermediate steps of a pipeline
</font>

In [None]:
# Configuration of the 'prepare' pipeline step
model.named_steps.prepare

In [None]:
# Per-feature means learned by the StandardScaler (used to center the numerical features)
pipe_prepare.named_transformers_["numerical"].named_steps["scaler"].mean_

In [None]:
# Number of features the DecisionTreeRegressor saw during fit
model.named_steps["model"].named_steps["tree"].n_features_in_

<font color='black' size="5">
Feature Importance
</font>

In [None]:
# Feature importance from the decision tree
importance = model.named_steps["model"].named_steps["tree"].feature_importances_
print("Feature Importance Values:")
for i, imp in enumerate(importance):
    print(f"Feature {i}: {imp:.4f}")

In [None]:
# Assign feature names
# Column order after the ColumnTransformer: [categorical features] + [numerical features]
feature_names = list(cat_col) + list(num_col)
print("Feature names after transformation:")
print(feature_names)

# Feature importance with names
feature_importance_df = DataFrame({
    'Feature': feature_names,
    'Importance': importance
}).sort_values('Importance', ascending=False)

print("\nFeature importance (sorted):")
print(feature_importance_df)

In [None]:
# Visualization with Plotly
fig = px.bar(feature_importance_df,
             x='Importance',
             y='Feature',
             orientation='h',
             title='Decision Tree - Feature Importance',
             labels={'Importance': 'Feature Importance', 'Feature': 'Features'})
fig.update_layout(yaxis={'categoryorder':'total ascending'})
fig.show()

# 5 | Deploy
---

<font color='black' size="5">📋 Checklist</font>

✅ Model export and storage</br>
✅ Dependencies and environment</br>
✅ Security and data privacy</br>
✅ Integrate into production</br>
✅ Tests and validation</br>
✅ Documentation & maintenance</br>
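
The notebook stops at the checklist for this phase. As a minimal sketch of the first item, model export and storage, the example below saves a fitted pipeline with `joblib` and reloads it. The small stand-in pipeline, the training values, and the file name are hypothetical; in the notebook the object to persist would be `model`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Hypothetical stand-in for the fitted notebook pipeline
X = np.array([[0.2], [0.5], [1.0], [1.5], [2.0], [2.5]])
y = np.array([300.0, 1200.0, 4500.0, 9000.0, 14000.0, 17000.0])
fitted = Pipeline([
    ("prepare", StandardScaler()),
    ("tree", DecisionTreeRegressor(max_depth=2, random_state=42)),
]).fit(X, y)

# Persist preprocessing and model together as a single artifact
path = os.path.join(tempfile.mkdtemp(), "diamonds_tree.joblib")  # hypothetical name
joblib.dump(fitted, path)

# Later, or in another process with the same library versions installed
restored = joblib.load(path)
same = np.allclose(fitted.predict(X), restored.predict(X))
print(f"Predictions identical after reload: {same}")
```

Saving the whole pipeline rather than only the tree keeps imputation, encoding, and scaling tied to the model. Since joblib files are based on pickle, they should only be loaded from trusted sources, which touches the security item on the checklist.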