# Exercise 8: Feature Engineering and Parameter Tuning

## Task: Titanic reloaded

![](images/titanic.jpg)

As we learned in Exercise 4, there is still a lot we can do to improve performance on the Titanic dataset, including feature engineering and parameter tuning. We cover both today.

1. Write a function that assigns the Titanic passengers to the age groups 0-16, 16-32, 32-48, 48-64, and over 64.
2. Write a function that counts the number of family members and computes the fare per person.
3. Write a function that extracts the titles from the names. Combine rare titles into a single category.
4. Use a `Pipeline` to add the functions from tasks 1-3 to the pipeline known from Exercise 4.
5. Tune the parameters of a tree-based classifier (the solution below uses a `RandomForestClassifier`) and compare the results with those from Exercise 4.

In [1]:
import numpy as np
import pandas as pd
from sklearn import set_config
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

In [2]:
titanic_train = pd.read_csv("data/titanic/train.csv")
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [21]:
def transform_age(df):
    X_temp = df.copy()

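    def age_discriminator(age):
        # Bins: [0, 16) -> 0, [16, 32) -> 1, [32, 48) -> 2, [48, 64) -> 3, 64+ -> 4.
        # Note: a missing age (NaN) fails every comparison and therefore lands in bin 4.
        if age < 16:
            return 0
        elif age < 32:
            return 1
        elif age < 48:
            return 2
        elif age < 64:
            return 3
        else:
            return 4

    X_temp["Age_binned"] = X_temp["Age"].map(age_discriminator)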

    return X_temp


age_transformer = FunctionTransformer(transform_age)
age_transformer.transform(titanic_train)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_binned
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,4
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1
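
The same binning can also be sketched in vectorized form with `pd.cut`. This is an alternative to the `map`-based function above, not the notebook's solution; the helper name `bin_age` and the toy ages are assumptions for illustration. One difference: missing ages stay `NaN` here instead of falling into bin 4.

```python
import numpy as np
import pandas as pd


def bin_age(ages: pd.Series) -> pd.Series:
    """Assign ages to the bins [0, 16), [16, 32), [32, 48), [48, 64), [64, inf)."""
    edges = [0, 16, 32, 48, 64, np.inf]
    # right=False makes each bin include its left edge, matching `age < 16` etc.
    # labels=False returns the integer bin index; NaN input stays NaN.
    return pd.cut(ages, bins=edges, right=False, labels=False)


# Hypothetical toy ages, including the bin boundaries and one missing value.
toy_ages = pd.Series([5, 16, 31.9, 48, 70, np.nan])
binned = bin_age(toy_ages)
print(binned.tolist())
```

Keeping `NaN` lets a later imputation step decide how to treat passengers with unknown age, rather than silently grouping them with the oldest passengers.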


In [22]:
def transform_family_fare(df):
    X_temp = df.copy()

    X_temp["Fcount"] = X_temp["Parch"] + X_temp["SibSp"] + 1
    X_temp["FarePerPerson"] = X_temp["Fare"] / X_temp["Fcount"]

    return X_temp


family_fare_transformer = FunctionTransformer(transform_family_fare)
family_fare_transformer.transform(titanic_train)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fcount,FarePerPerson
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,2,3.62500
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2,35.64165
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1,7.92500
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,2,26.55000
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,1,8.05000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1,13.00000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1,30.00000
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,4,5.86250
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1,30.00000


In [5]:
# There are many different titles, some of them variants or typos
titanic_train.Name.str.extract(r"\s*([A-Za-z]+)\.", expand=False).unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'Countess',
       'Jonkheer'], dtype=object)
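
To decide which titles count as rare, it can help to look at how often each one occurs. The following is a small sketch on a handful of names taken from the table above; on the real data the same chain could be applied to `titanic_train.Name`. The threshold `min_count` is an assumption for illustration.

```python
import pandas as pd

names = pd.Series(
    [
        "Braund, Mr. Owen Harris",
        "Heikkinen, Miss. Laina",
        "Allen, Mr. William Henry",
        "Montvila, Rev. Juozas",
        "Graham, Miss. Margaret Edith",
        "Behr, Mr. Karl Howell",
    ]
)

# Same regular expression as in the cell above: the word directly before a period.
titles = names.str.extract(r"\s*([A-Za-z]+)\.", expand=False)
title_counts = titles.value_counts()
print(title_counts)

# Titles that occur fewer than `min_count` times are candidates for a "rare" group.
min_count = 2  # hypothetical threshold
rare_titles = title_counts[title_counts < min_count].index.tolist()
print(rare_titles)
```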

In [23]:
def transform_name(df):
    X_temp = df.copy()
    X_temp["Title"] = df.Name.str.extract(r"\s*([A-Za-z]+)\.", expand=False)
    X_temp["Title"] = X_temp["Title"].replace(
        [
            "Capt",
            "Countess",
            "Dona",
            "Col",
            "Don",
            "Dr",
            "Jonkheer",
            "Lady",
            "Major",
            "Rev",
            "Sir",
        ],
        "Selten",
    )

    X_temp["Title"] = X_temp["Title"].replace(
        {"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"}
    )

    return X_temp


name_transformer = FunctionTransformer(transform_name)
name_transformer.transform(titanic_train)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,Selten
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,Miss
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,Miss
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,Mr


In [24]:
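feature_engineering = Pipeline(
    steps=[
        ("age", age_transformer),
        ("family_fare", family_fare_transformer),
        ("title", name_transformer),
    ]
)
# fit_transform instead of transform: the steps are stateless, but newer
# scikit-learn versions expect a pipeline to be fitted before transform is called.
feature_engineering.fit_transform(titanic_train)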

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_binned,Fcount,FarePerPerson,Title
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,1,2,3.62500,Mr
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,2,2,35.64165,Mrs
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,1,1,7.92500,Miss
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,2,2,26.55000,Mrs
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,2,1,8.05000,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S,1,1,13.00000,Selten
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1,1,30.00000,Miss
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S,4,4,5.86250,Miss
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C,1,1,30.00000,Mr


In [25]:
ordinal_features = ["Sex"]
nominal_features = ["Embarked", "Title"]
numeric_features = ["Pclass", "Age", "Fare", "Fcount", "FarePerPerson", "Age_binned"]

In [26]:
numeric_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="mean")), ("scaler", StandardScaler())]
)
nominal_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("one_hot_encoding", OneHotEncoder(handle_unknown="ignore")),
    ]
)
ordinal_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ordinal_encoding", OrdinalEncoder()),
    ]
)

column_transformer = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat_nominal", nominal_transformer, nominal_features),
        ("cat_ordinal", ordinal_transformer, ordinal_features),
    ],
)

# Full pipeline
preprocessor = Pipeline(
    steps=[
        ("age", age_transformer),
        ("family_fare", family_fare_transformer),
        ("title", name_transformer),
        ("column", column_transformer),
    ]
)

set_config(display="diagram")
preprocessor

In [10]:
X = titanic_train.drop("Survived", axis=1)
y = titanic_train[["Survived"]]

X_train, X_test, y_train, y_test = train_test_split(X, y)
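
The split above is drawn at random on every run. If reproducible numbers are wanted, `train_test_split` accepts `random_state`, and `stratify` keeps the class ratio similar in both parts. A minimal sketch on toy data (the arrays and the seed are assumptions, not part of the exercise):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 40 of them in the positive class.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 40 + [0] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

# Repeating the call with the same seed yields the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)
print(np.array_equal(X_te, X_te2))

# With stratify, the share of positives is close in both parts.
print(y_tr.mean(), y_te.mean())
```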

In [11]:
X_train_prepared = preprocessor.fit_transform(X_train)
y_train_prepared = y_train.to_numpy().ravel()

In [12]:
from sklearn.model_selection import RandomizedSearchCV

random_grid = {
    "n_estimators": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "min_samples_split": [2, 4, 6, 8, 10, 15, 20],
    "max_depth": [2, 4, 8, 10, 15, 20, 30],
}

rf_clf = RandomForestClassifier()
rand_search = RandomizedSearchCV(
    estimator=rf_clf,
    param_distributions=random_grid,
    n_iter=25,
    cv=3,
    random_state=42,
    n_jobs=-1,
)
# Fit the random search model
rand_search.fit(X_train_prepared, y_train_prepared)

rand_search.best_params_

{'n_estimators': 30, 'min_samples_split': 8, 'max_depth': 30}
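
The search above runs on data that was already transformed. An alternative sketch is to put the classifier into the pipeline and tune the whole thing, so that imputation and scaling are refit on each cross-validation fold. Step parameters are then addressed as `<step name>__<parameter>`. The toy data, the step names, and the small grid below are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the Titanic features.
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)
X_toy[::10, 0] = np.nan  # a few missing values for the imputer to handle

model = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
        ("clf", RandomForestClassifier(random_state=0)),
    ]
)

# Parameters of the "clf" step get the prefix "clf__".
param_distributions = {
    "clf__n_estimators": [10, 30, 50],
    "clf__min_samples_split": [2, 4, 8],
    "clf__max_depth": [2, 4, 8],
}

search = RandomizedSearchCV(
    estimator=model,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is the complete pipeline refit on all training data, so it can be applied directly to untransformed test rows.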

## Load test data and make predictions

In [13]:
X_test_prepared = preprocessor.transform(X_test)
clf = RandomForestClassifier(oob_score=True)
clf.fit(X_train_prepared, y_train_prepared)
print(f"Out-of-bag score without tuning: {clf.oob_score_}")
clf = RandomForestClassifier(**rand_search.best_params_, oob_score=True)
clf = clf.fit(X_train_prepared, y_train_prepared)
print(f"Out-of-bag score with tuning: {clf.oob_score_}")

Out-of-bag score without tuning: 0.7859281437125748
Out-of-bag score with tuning: 0.8293413173652695


In [14]:
predicted = clf.predict(X_test_prepared)

accuracy = accuracy_score(y_pred=predicted, y_true=y_test)
precision = precision_score(y_pred=predicted, y_true=y_test)
recall = recall_score(y_pred=predicted, y_true=y_test)
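# Note: AUC and average precision are computed from hard class labels here,
# not from predicted probabilities; aps is computed but not printed below.
auc = roc_auc_score(y_true=y_test, y_score=predicted)
aps = average_precision_score(y_true=y_test, y_score=predicted)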
f1 = f1_score(y_true=y_test, y_pred=predicted)

print(f"accuracy: {accuracy}")
print(f"precision: {precision}")
print(f"recall: {recall}")
print(f"F1 Score: {f1}")
print(f"AUC: {auc}")
print("\n")

accuracy: 0.8385650224215246
precision: 0.8160919540229885
recall: 0.7802197802197802
F1 Score: 0.797752808988764
AUC: 0.8295038295038294
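
`roc_auc_score` and `average_precision_score` are ranking metrics, so they are usually fed scores such as the predicted probability of the positive class rather than hard 0/1 predictions. A minimal sketch on toy data (the data and variable names are assumptions; in the notebook the analogous call would be `clf.predict_proba(X_test_prepared)[:, 1]`):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data standing in for the prepared Titanic features.
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

toy_clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
proba = toy_clf.predict_proba(X_te)[:, 1]
hard = toy_clf.predict(X_te)

print(f"AUC from probabilities: {roc_auc_score(y_te, proba):.3f}")
print(f"AUC from hard labels:   {roc_auc_score(y_te, hard):.3f}")
print(f"Average precision:      {average_precision_score(y_te, proba):.3f}")
```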




## Evaluation

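Our tuned model reaches an accuracy of 0.84, which is clearly better than last time, when it was only 0.77. Because `train_test_split` is called without a fixed `random_state`, the exact numbers vary from run to run.

For comparison, the results from Exercise 4: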

```
RandomForest
accuracy: 0.7737219730941704
precision: 0.7261904761904762
recall: 0.7261904761904762
F1 Score: 0.7261904761904762
AUC: 0.7803614251455979
```