# Exercise 7: Regularization and Decision Trees

#### Task 1

In the lecture we learned that L1 regularization tends to set many coefficients to exactly 0.

Use the data of the male smokers and fit a regression with varying regularization strength, using both L1 and L2. Look at summary statistics of the regression coefficients such as mean, min/max and quantiles. Count how many coefficients are 0. What do you observe?

#### Task 2

Now take all the data from 'insurance.csv' and train a regression with a decision tree. Vary the parameters 'max_depth' and 'min_samples_leaf' and look for the best combination.

#### Task 3

As you saw in Task 2, finding the right parameters by hand is very laborious. Fortunately, scikit-learn can automate this. Read up on 'GridSearchCV' and find the best parameters for 'max_depth', 'min_samples_split' and 'min_samples_leaf'.

In [2]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, PolynomialFeatures

In [3]:
insurance = pd.read_csv("data/insurance.csv")
insurance.head()

Unnamed: 0,age,sex,bmi,children,smoker,region,expenses
0,19,female,27.9,0,yes,southwest,16884.92
1,18,male,33.8,1,no,southeast,1725.55
2,28,male,33.0,3,no,southeast,4449.46
3,33,male,22.7,0,no,northwest,21984.47
4,32,male,28.9,0,no,northwest,3866.86


In [4]:
male_smokers = insurance[
    (insurance["sex"] == "male") & (insurance["smoker"] == "yes")
].sort_values(by=["bmi"])

X = male_smokers["bmi"].to_numpy().reshape(-1, 1)
y = male_smokers["expenses"].to_numpy().reshape(-1, 1)

## Task 1

In [21]:
# Use np.abs(coeff) < eps instead of testing for exactly 0
eps = 1e-4

poly_features = PolynomialFeatures(degree=50, include_bias=False)
min_max_scaler = MinMaxScaler()

for lam in [0, 0.1, 1, 2, 5]:
    print(f"Lambda = {lam}")
    lasso_reg = Lasso(alpha=lam)
    pipe_lasso = Pipeline(
        [
            ("min_max_scaler", min_max_scaler),
            ("poly_features", poly_features),
            ("lasso", lasso_reg),
        ]
    )

    pipe_lasso.fit(X, y)
    # Coefficients of the fitted Lasso step in the pipeline
    coeff = pipe_lasso.named_steps["lasso"].coef_
    print(coeff)

    series = pd.Series(coeff.flatten())
    print(series.describe())

    # Count the coefficients whose absolute value is below eps
    zero_coeff = len(series[np.abs(series) < eps])
    print(f"Coefficients that are 0: {zero_coeff}")

# Analogous to the above
# pipe_lasso = ...

# Loop over lam(bda)

Lambda = 0
[-3.02840635e+04  2.57377213e+05 -1.12597045e+05 -1.64812205e+05
 -6.28544058e+04  2.43027714e+04  5.29230586e+04  4.57269142e+04
  2.83391919e+04  1.31300278e+04  3.22734986e+03 -2.03680959e+03
 -4.20505790e+03 -4.61760107e+03 -4.18175785e+03 -3.43175048e+03
 -2.64695719e+03 -1.95300109e+03 -1.39120911e+03 -9.61237656e+02
 -6.45184547e+02 -4.20197305e+02 -2.64428698e+02 -1.59382312e+02
 -9.04367502e+01 -4.65446356e+01 -1.96309846e+01 -3.94773322e+00
  4.50404094e+00  8.44734645e+00  9.69664224e+00  9.43277559e+00
  8.40320250e+00  7.06462429e+00  5.68269007e+00  4.40047604e+00
  3.28471019e+00  2.35640789e+00  1.61075999e+00  1.02972604e+00
  5.89756107e-01  2.66317388e-01  3.63691038e-02 -1.20445642e-01
 -2.21382349e-01 -2.80621354e-01 -3.09527698e-01 -3.17014802e-01
 -3.09934908e-01 -2.93453194e-01]
count        50.000000
mean        549.372535
std       48967.303386
min     -164812.205078
25%        -882.224379
50%          -0.287037
75%           5.388028
max      25737
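
The cell above leaves the second loop as a placeholder. Assuming it is meant for the L2 case, the following is a minimal sketch of how the Lasso and Ridge counts could be compared side by side. The data here is synthetic and only stands in for the BMI/expenses columns, and the helper `count_near_zero`, the polynomial degree of 20 and the raised `max_iter` are illustrative choices, not part of the original exercise.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for the BMI/expenses data (assumption, not the real data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(17.0, 53.0, size=(150, 1))
y_demo = 1500.0 * X_demo.ravel() - 13000.0 + rng.normal(0.0, 5000.0, size=150)

eps = 1e-4


def count_near_zero(estimator, degree=20):
    """Fit scaler -> polynomial features -> estimator, count |coef| < eps."""
    pipe = Pipeline(
        [
            ("min_max_scaler", MinMaxScaler()),
            ("poly_features", PolynomialFeatures(degree=degree, include_bias=False)),
            ("reg", estimator),
        ]
    )
    pipe.fit(X_demo, y_demo)
    coeff = pipe.named_steps["reg"].coef_
    return int(np.sum(np.abs(coeff) < eps))


zero_counts = {}
for lam in [0.1, 1, 2, 5]:
    zero_counts[lam] = {
        # L1 penalty: coordinate descent can set coefficients exactly to 0
        "lasso": count_near_zero(Lasso(alpha=lam, max_iter=10_000)),
        # L2 penalty: coefficients shrink but usually stay non-zero
        "ridge": count_near_zero(Ridge(alpha=lam)),
    }
    print(f"Lambda = {lam}: {zero_counts[lam]}")
```

`alpha=0` is left out of this sketch because an unpenalized Lasso fit is the case scikit-learn warns about; plain `LinearRegression` would be the usual choice there.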



## Task 2

In [25]:
X = insurance.drop(columns=["expenses"])
y = insurance["expenses"].to_numpy().reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [30]:
from sklearn.tree import DecisionTreeRegressor

numeric_features = ["age", "bmi", "children"]
ordinal_features = ["sex", "smoker"]

numeric_transformer = Pipeline(  # A decision tree does not need scaling
    steps=[
        (
            "imputer",
            SimpleImputer(strategy="mean"),
        ),
        ("poly_features", PolynomialFeatures(degree=8)),
    ]
)

ordinal_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("ordinal_encoding", OrdinalEncoder()),
    ]
)

y_preprocessor = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
    ]
)

X_preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat_ordinal", ordinal_transformer, ordinal_features),
    ]
)


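# Fit the preprocessing on the training data only
X_train_prepared = X_preprocessor.fit_transform(X_train)
y_train_prepared = y_preprocessor.fit_transform(y_train)

# Reuse the fitted preprocessing on the test data (transform, not fit_transform)
X_test_prepared = X_preprocessor.transform(X_test)
y_test_prepared = y_preprocessor.transform(y_test)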


# Nested loop over max_depth and min_samples_leaf

for depth in [1, 2, 10, 20]:
    for min_sample in [2, 4, 8, 16]:
        print(f"max_depth = {depth}, min_samples_leaf={min_sample}")

        reg = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=min_sample)
        reg.fit(X_train_prepared, y_train_prepared)

        train_score = reg.score(X_train_prepared, y_train_prepared)
        test_score = reg.score(X_test_prepared, y_test_prepared)

        print(f"train score: {train_score}, test score: {test_score} \n")

max_depth = 1, min_samples_leaf=2
train score: 0.6239920522984277, test score: 0.6047215682149817 

max_depth = 1, min_samples_leaf=4
train score: 0.6239920522984277, test score: 0.6047215682149817 

max_depth = 1, min_samples_leaf=8
train score: 0.6239920522984277, test score: 0.6047215682149817 

max_depth = 1, min_samples_leaf=16
train score: 0.6239920522984277, test score: 0.6047215682149816 

max_depth = 2, min_samples_leaf=2
train score: 0.8341822124289608, test score: 0.8019209567011335 

max_depth = 2, min_samples_leaf=4
train score: 0.8341822124289608, test score: 0.8019209567011335 

max_depth = 2, min_samples_leaf=8
train score: 0.8341822124289608, test score: 0.8019209567011335 

max_depth = 2, min_samples_leaf=16
train score: 0.8341822124289608, test score: 0.8019209567011335 

max_depth = 10, min_samples_leaf=2
train score: 0.9428950740465415, test score: 0.76001
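
Reading the best combination off a long printout is error-prone. One possible way to do it programmatically is to collect the scores in a DataFrame and sort them, as in the sketch below. It runs on synthetic data, and it ranks by a separate validation split rather than by the test score so that the test set stays untouched for the final evaluation; both the data and that choice are assumptions, not part of the original exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): three numeric features, nonlinear target
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(600, 3))
y_demo = (
    10.0 * (X_demo[:, 0] > 0.5)
    + 5.0 * X_demo[:, 1]
    + rng.normal(0.0, 1.0, size=600)
)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)

rows = []
for depth in [1, 2, 10, 20]:
    for min_leaf in [2, 4, 8, 16]:
        reg = DecisionTreeRegressor(
            max_depth=depth, min_samples_leaf=min_leaf, random_state=0
        )
        reg.fit(X_tr, y_tr)
        rows.append(
            {
                "max_depth": depth,
                "min_samples_leaf": min_leaf,
                "train_score": reg.score(X_tr, y_tr),
                "val_score": reg.score(X_val, y_val),
            }
        )

# Best combination first: highest R^2 on the validation split
results = pd.DataFrame(rows).sort_values("val_score", ascending=False)
best = results.iloc[0]
print(results.head())
```

A large gap between `train_score` and `val_score` in a row points to overfitting, which is the pattern visible for `max_depth = 10` in the printout above.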

## Task 3

In [None]:
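from sklearn.model_selection import GridSearchCV

# Candidate values 1, 4, 9, ..., (n - 1)^2
square = lambda n: [x**2 for x in np.arange(1, n)]
param_grid = {
    "max_depth": square(10),
    # min_samples_split must be at least 2, so the leading 1 is dropped
    "min_samples_split": square(10)[1:],
    "min_samples_leaf": square(10),
}

# Exhaustive search over the grid with 5-fold cross-validation
grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5)
grid_search.fit(X_train_prepared, y_train_prepared)
best_params = grid_search.best_params_
print(best_params)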

In [None]:
clf = DecisionTreeRegressor(**best_params)
clf.fit(X_train_prepared, y_train_prepared)

print(f"Train score: {clf.score(X_train_prepared, y_train_prepared)}")
print(f"Test score: {clf.score(X_test_prepared, y_test_prepared)}")
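
As a possible variation, the preprocessing can be placed inside the estimator that `GridSearchCV` tunes, so that it is refitted on each cross-validation fold. The sketch below assumes a synthetic DataFrame that only borrows the column names of 'insurance.csv'; the step names `prep` and `tree` and the smaller grid are illustrative. Parameters of a pipeline step are addressed with the `<step>__<parameter>` naming scheme.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in (assumption) using column names from insurance.csv
rng = np.random.default_rng(0)
n = 400
demo = pd.DataFrame(
    {
        "age": rng.integers(18, 65, size=n),
        "bmi": rng.uniform(17.0, 50.0, size=n),
        "children": rng.integers(0, 5, size=n),
        "sex": rng.choice(["female", "male"], size=n),
        "smoker": rng.choice(["yes", "no"], size=n),
    }
)
demo["expenses"] = (
    250.0 * demo["age"]
    + 20000.0 * (demo["smoker"] == "yes")
    + rng.normal(0.0, 2000.0, size=n)
)

X_demo = demo.drop(columns=["expenses"])
y_demo = demo["expenses"]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# Encode the categorical columns, pass the numeric ones through unchanged
model = Pipeline(
    [
        (
            "prep",
            ColumnTransformer(
                [("cat", OrdinalEncoder(), ["sex", "smoker"])],
                remainder="passthrough",
            ),
        ),
        ("tree", DecisionTreeRegressor(random_state=0)),
    ]
)

param_grid = {
    "tree__max_depth": [2, 4, 8],
    "tree__min_samples_split": [2, 8, 32],
    "tree__min_samples_leaf": [1, 4, 16],
}

search = GridSearchCV(model, param_grid, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)
print(f"Test score: {search.score(X_te, y_te)}")
```

With the default `refit=True`, `search` is refitted on the full training split using the best parameters, so it can be scored on the test split directly.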