# 05_Train and export the final model
Notebook 04_Modelle_evaluieren identified the best model and its best hyperparameters. Result: **Random Forest Regressor**.
In this notebook, the model is trained on all available data and then exported for the app.

## 1) Import libraries

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 2) Load the data

In [3]:
# Import data
df_clean = pd.read_csv("../Data/clean_data.csv")

# Inspect data
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70864 entries, 0 to 70863
Data columns (total 11 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   body_type             70864 non-null  object 
 1   fuel_type             70864 non-null  object 
 2   horsepower            70864 non-null  float64
 3   make_name             70864 non-null  object 
 4   mileage               70864 non-null  float64
 5   model_name            70864 non-null  object 
 6   price                 70864 non-null  float64
 7   wheel_system_display  70864 non-null  object 
 8   average_fuel_economy  70864 non-null  float64
 9   age                   70864 non-null  int64  
 10  manual                70864 non-null  int64  
dtypes: float64(4), int64(2), object(5)
memory usage: 5.9+ MB


## 3) Create dummy variables

In [7]:
# Store dummies in a new dataset
df_dummies = pd.get_dummies(df_clean, drop_first=True, dtype=float)

In [11]:
# Info
df_dummies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70864 entries, 0 to 70863
Columns: 778 entries, horsepower to wheel_system_display_Rear-Wheel Drive
dtypes: float64(776), int64(2)
memory usage: 420.6 MB
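
The app will eventually have to encode a single new listing with exactly the same dummy columns as the training data. One possible way to do that is sketched below: encode the new row with `pd.get_dummies` and then `reindex` it to the training columns. The miniature frame `df_small` and the record `new_car` are made up for illustration and are not taken from `clean_data.csv`.

```python
import pandas as pd

# Hypothetical miniature stand-in for df_clean (not the real data)
df_small = pd.DataFrame({
    "fuel_type": ["Gasoline", "Diesel", "Hybrid", "Gasoline"],
    "horsepower": [150.0, 190.0, 120.0, 200.0],
})

# Same encoding call as in the notebook
train_dummies = pd.get_dummies(df_small, drop_first=True, dtype=float)

# drop_first=True drops the first category in sorted order ("Diesel") as the baseline
print(list(train_dummies.columns))

# Hypothetical new listing; no drop_first here, because a single row
# contains only one category and that column would be dropped
new_car = pd.DataFrame({"fuel_type": ["Hybrid"], "horsepower": [135.0]})
new_dummies = pd.get_dummies(new_car, dtype=float)

# Align to the training columns: missing dummy columns are filled with 0,
# columns unknown to the training data (e.g. the baseline) are discarded
new_aligned = new_dummies.reindex(columns=train_dummies.columns, fill_value=0.0)
print(new_aligned)
```

With 776 feature columns in the real data, aligning by column name like this is less error-prone than building the input vector by hand.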


## 4) Define X and y

In [14]:
# Define X and y
X = df_dummies.drop(["price"], axis=1)
y = df_dummies["price"]
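
The evaluation cells further down use `X_train`, `X_test`, `y_train` and `y_test`, which are never created in this notebook. The sketch below shows how such a hold-out split could be produced with scikit-learn's `train_test_split`. The 80/20 ratio, the `random_state`, and the synthetic stand-in data are assumptions, not values taken from Notebook 04.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_dummies (assumption: the real frame comes from the cells above)
rng = np.random.default_rng(42)
df_demo = pd.DataFrame({
    "horsepower": rng.uniform(70, 400, size=100),
    "mileage": rng.uniform(0, 200_000, size=100),
    "price": rng.uniform(5_000, 60_000, size=100),
})

# Same feature/target separation as in the notebook
X_demo = df_demo.drop(["price"], axis=1)
y_demo = df_demo["price"]

# Assumed 80/20 split; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```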

## 5) Train the Random Forest Regressor

In [None]:
# Imports
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_squared_error,
    mean_absolute_error,
    mean_absolute_percentage_error,
    r2_score,
)

In [None]:
# Best parameters from Notebook 04
best_params = {
    'max_depth': None,
    'min_samples_leaf': 1,
    'min_samples_split': 4,
    'n_estimators': 100}
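
The `best_params` dictionary is defined above, but the training cell that follows only sets `n_estimators` explicitly. One way to apply all tuned values is to unpack the dictionary into the constructor. The sketch below does this on synthetic data from `make_regression`, an assumption made only so the example runs on its own; `demo_model` is a hypothetical name.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data as a stand-in for X and y (assumption)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Same values as best_params in the notebook
best_params = {
    "max_depth": None,
    "min_samples_leaf": 1,
    "min_samples_split": 4,
    "n_estimators": 100,
}

# ** unpacks the dictionary into keyword arguments of the constructor
demo_model = RandomForestRegressor(**best_params, random_state=42)
demo_model.fit(X_demo, y_demo)

# Confirm that the tuned value was actually applied
print(demo_model.get_params()["min_samples_split"])
```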

In [42]:
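# Create the Random Forest Regressor (n_estimators=100) and train it
# Note: X_train and y_train are expected from a train/test split that is not shown in this notebook
forest_model = RandomForestRegressor(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)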

In [44]:
# Model evaluation on the training data

# Predict on the training set
y_pred_train = forest_model.predict(X_train)

# Compute evaluation metrics
r2_train = r2_score(y_train, y_pred_train)
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
mape_train = mean_absolute_percentage_error(y_train, y_pred_train)

# Prints
print("Evaluation on training data")
print("R² train:", r2_train)
print("RMSE train:", np.sqrt(mse_train))
print("MAE train:", mae_train)
print("MAPE train:", mape_train)

Evaluation on training data
R² train: 0.9709970043985188
RMSE train: 2387.22712257198
MAE train: 1474.5640531867305
MAPE train: 0.054278848777977985


In [45]:
# Model evaluation on the test data

# Predict on the test set
y_pred = forest_model.predict(X_test)

# Compute evaluation metrics
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
mape = mean_absolute_percentage_error(y_test, y_pred)

# Prints
print("Evaluation on test data")
print("R²:", r2)
print("RMSE:", np.sqrt(mse))
print("MAE:", mae)
print("MAPE:", mape)

Evaluation on test data
R²: 0.9225251505194979
RMSE: 3898.235467994862
MAE: 2667.0212807479406
MAPE: 0.10687002602434235
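
The introduction mentions exporting the model for the app. The sketch below shows how that could look with `joblib`, assuming the model is refit on all rows first. The file name `final_model.joblib`, the temporary directory, the name `export_model`, and the synthetic data are placeholders, not the project's actual paths or data.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the full X and y (assumption)
X_all, y_all = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Refit on all available rows before exporting
export_model = RandomForestRegressor(n_estimators=100, random_state=42)
export_model.fit(X_all, y_all)

# Hypothetical export location; a real project would use a fixed path
export_path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")
joblib.dump(export_model, export_path)

# Reload the file and check that predictions match the in-memory model
loaded_model = joblib.load(export_path)
same = np.allclose(
    export_model.predict(X_all[:5]), loaded_model.predict(X_all[:5])
)
print(same)
```

Storing the list of training column names next to the model would let the app rebuild its inputs in the same order; whether the project does this is not shown here.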
