Notebook Linear Regression Case
Data Scientist Exercise
Geert Vandezande

Goal:
- Apply supervised learning
- Perform EDA on a dataset
- Apply linear regression to the data: the target is to beat the r² = 80% that most published solutions reach
- Also apply other forms of regression; the goal is to get r² as high as possible

==> Result: r² = 86% and a 27% error margin


Extra:
- logging is provided before and after the important steps, so the steps and their results can be followed up. The log is written to LinReg_logging.log
- a number of reusable code blocks have been put into functions
- a separate class, BinaryValueEncoder, was written as an experiment (the same can of course be done with OneHotEncoder)
- a function was written to quickly evaluate a series of models, both without and with pipelining

Dataset: 
- More info: see kaggle https://www.kaggle.com/datasets/mirichoi0218/insurance/data

Metadata :
- age: age of primary beneficiary
- sex: insurance contractor gender, female, male
- bmi: body mass index, an objective index of body weight relative to height (kg / m²), computed as weight divided by height squared; it indicates whether a weight is relatively high or low for a given height, ideally 18.5 to 24.9
- children: Number of children covered by health insurance / Number of dependents
- smoker: whether the beneficiary smokes (yes, no)
- region: the beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
- charges: Individual medical costs billed by health insurance
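
As a small worked example of the bmi definition in the metadata (weight in kilograms divided by the square of the height in metres), the arithmetic can be sketched as follows. The helper name `body_mass_index` and the sample person are illustrative assumptions; they are not part of the dataset.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return weight divided by height squared (kg / m^2)."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# hypothetical person: 70 kg and 1.75 m tall
bmi_example = body_mass_index(70.0, 1.75)
print(round(bmi_example, 1))  # → 22.9
```

A value of about 22.9 falls inside the 18.5 to 24.9 range that the metadata calls ideal.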


Order of activities in this notebook (cf. DataCamp "Preparing data for modelling"):
- read the data
- inspect the data, visually and numerically
- summarize the data with summarytools
- resolve missing and duplicated data
- check for incorrect types
- standardize the numerical values
- process the categorical variables
- check the feature engineering
- run linear, ridge, lasso, gradient boosting, random forest, ... models
- all models are applied in two train/test split situations:
    - train_test_split without stratify
    - train_test_split with stratify on a categorization of charges (the target variable), because charges is skewed
- everything is worked out both with and without pipelining
- for the best model (XGBoost), hyperparameter tuning is performed as well
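
The stratified split on a binned target can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's own cell: the lognormal `charges` series and the `age` feature are made-up stand-ins, and only the bin edges (0, 20000, 30000, infinity) are taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# synthetic, right-skewed stand-in for the charges column (assumption: lognormal shape)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=0.9, size=1000), name="charges")
features = pd.DataFrame({"age": rng.integers(18, 65, size=1000)})

# bin the continuous target so it can serve as a stratification key
charges_cat = pd.cut(charges, bins=[0, 20000, 30000, np.inf], labels=[1, 2, 3])

X_train, X_test, y_train, y_test, cat_train, cat_test = train_test_split(
    features, charges, charges_cat,
    test_size=0.2, stratify=charges_cat, random_state=42,
)

# the share of each bin should be nearly identical in both splits
train_share = cat_train.value_counts(normalize=True).sort_index()
test_share = cat_test.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Without `stratify`, the rare high-charge bins can end up over- or under-represented in the test set purely by chance, which makes test metrics on a skewed target less stable.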





In [None]:
# import the required modules
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.preprocessing import PolynomialFeatures


# Machine learning algorithms
from statsmodels.formula.api import ols
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.ensemble import IsolationForest

# Evaluation
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_percentage_error

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# system utils
import warnings
from pathlib import Path
import datetime
from colorama import Fore, Back, Style
import sys
import os
import chardet

Extra code snippets used throughout the notebook:

save_fig: after an image has been generated, it can be written to a file in the images/ directory. Always give it a meaningful name

read_JSON: to easily read a JSON file

log_info:
- logging function to write the status to file throughout the notebook
- the log statements are kept in a list while the code runs; that list can be printed to the screen or written to a file at any point
- log_info_write_to_file: write the log information to file
- log_info_print_on_screen: print all log info to the screen
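
The same collect-first, write-later idea can also be sketched with Python's standard `logging` module. This is an alternative shown for comparison, not the approach this notebook uses; the names `ListHandler`, `notebook_log` and `write_log` are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that keeps the formatted log lines in a list."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


logger = logging.getLogger("notebook_log")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the demo messages out of the root logger

handler = ListHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s : %(levelname)s : %(message)s", datefmt="%d/%m/%Y %H:%M:%S")
)
logger.addHandler(handler)

logger.info("File data/insurance.csv")
logger.info("Check for duplicates after drop")


def write_log(lines, filename):
    """Write the collected lines to a file, one per line."""
    with open(filename, "w", encoding="utf-8") as file:
        file.write("\n".join(lines) + "\n")


# print everything collected so far; write_log(handler.lines, "some.log") would persist it
for line in handler.lines:
    print(line)
```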

In [None]:
# a few extra code snippets used throughout the exercise
import json  # needed by read_JSON below


# to plot or not to plot - set to True to see the plots, set to False to skip them on a Run All
plot_graphs = True

# write a figure to file
IMAGES_PATH = Path() / "images" 
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# read a JSON file

def read_JSON(file_path_read):
    with open(file_path_read, 'r') as file:
        files_from_json = json.load(file)
    return files_from_json


# functions for logging to file

log_info_lijst = []

def log(log_code="INFO", boodschap="(no message)"):
    # timestamp the message, keep it in the in-memory list and echo it to the screen
    now = datetime.datetime.now()
    formatted_date = now.strftime("%d/%m/%Y %H:%M:%S")
    log_message = f"{Style.RESET_ALL}{formatted_date} : {log_code} : {boodschap}"
    log_info_lijst.append(log_message)
    print(log_message)

def log_info(boodschap):
    log("Info",boodschap)

def log_info_write_to_file(filename):
    with open(filename, 'w') as file:
        for string in log_info_lijst:
            file.write(string + '\n')  # add a newline after each string

def log_info_print_on_screen():
    for boodschap in log_info_lijst:
        print(boodschap)

In [None]:
# function: draws a boxplot and a histogram for columns of a pandas DataFrame
# df_col is a list of the column names to plot

def plot_boxplot(df, df_col, filenaam):
    if plot_graphs:
        # boxplot (left) and histogram (right) of the numerical values
        sns.set_theme(style="whitegrid", palette="bright")
        plt.figure(figsize=(15, 15)) 
        for i, col in enumerate(df_col):
            plt.subplot(len(df_col), 2, 2 * i + 1)
            sns.boxplot(x=df[col], orient='h', linewidth=1.5)
            plt.title(f"Boxplot of {col}", fontsize=12, fontweight="bold")
            plt.xlabel(col, fontsize=10)

            plt.subplot(len(df_col), 2, 2 * i + 2)
            sns.histplot(df[col], kde=True,  linewidth=1)
            plt.title(f"Distribution Plot of {col}", fontsize=12, fontweight="bold")
            plt.xlabel(col, fontsize=10)
            plt.ylabel("Count", fontsize=10)  # histplot shows counts by default

        plt.tight_layout()
        save_fig(filenaam)
        plt.show()

In [None]:
# function to compute the percentage of outliers for a set of columns in a dataframe
def bereken_percentage_aantal_outliers(df , columns_to_use):
    # Initialize the Isolation Forest model
    iso_forest = IsolationForest(n_estimators=100, contamination='auto', random_state=42)

    # Fit the model
    iso_forest.fit(df[columns_to_use])
    # Predictions: -1 for outliers and 1 for inliers
    labels = iso_forest.predict(df[columns_to_use])
    # Add the labels to a copy of the DataFrame to identify the outliers
    df_intern = df.copy()
    df_intern['outlier'] = labels
    outliers = df_intern[df_intern['outlier'] == -1]
    aantal_outliers = df_intern['outlier'].value_counts()
    print(aantal_outliers)
    percentage_aantal_outliers = (len(outliers) / len(df_intern)) * 100

    return percentage_aantal_outliers


# function to cap the outliers in a column at the IQR whiskers derived from two percentiles
def cap_values(df_input, column, lower_percentile=25, upper_percentile=75):
    log("Info", f"Capping values for column {column} using lower percentile {lower_percentile} - upper percentile {upper_percentile}")
    q1, q3 = np.percentile(df_input[column], [lower_percentile, upper_percentile])  # lower and upper percentile (Q1 and Q3 by default)
    iqr = q3 - q1  # interquartile range (IQR)
    lower_bound = q1 - 1.5 * iqr  # lower whisker (Q1 - 1.5 * IQR)
    upper_bound = q3 + 1.5 * iqr  # upper whisker (Q3 + 1.5 * IQR)

    # Cap both sides in one step. Two successive np.where calls that both read
    # from df_input would let the second call undo the lower cap.
    df_output = df_input.copy()
    df_output[column] = df_input[column].clip(lower=lower_bound, upper=upper_bound)
    return df_output


# helper class to convert categorical values with two possible values to 0 and 1
# this was a try-out to write an encoder ourselves
# OneHotEncoder does the same more simply, so OneHotEncoder is what is actually used further on

class BinaryValueEncoder(TransformerMixin, BaseEstimator):
    def __init__(self, string_zero="nul", string_one="een"):
        self.string_zero = string_zero
        self.string_one = string_one

    def fit(self, X, y=None):
        # No fitting is needed for this simple encoding; only remember the number of columns
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        # X is assumed to be a pandas DataFrame
        log("Info", f"BinaryValueEncoder transform called for one_value {self.string_one} and zero_value {self.string_zero}")
        # Map string_one to 1 and every other value (including string_zero) to 0.
        # The comparison returns a new DataFrame, so the original is not modified.
        return (X == self.string_one).astype(int)

    def get_feature_names_out(self, input_features=None):
        # Simple passthrough: the feature names do not change
        if input_features is not None:
            return np.asarray(input_features, dtype=object)
        return np.array([f"x{i}" for i in range(self.n_features_in_)], dtype=object)



Here the real work starts: reading the data and running the first checks on it

In [6]:
# read the data file

insurance_data_filename = 'data/insurance.csv'
df = pd.read_csv(insurance_data_filename)
log_info(f"File {insurance_data_filename}")

# check for duplicates and, if there are any, remove them immediately
df.drop_duplicates(inplace=True)
duplicate_waarden = df.duplicated().sum()
log_info(f"Check for duplicates after drop \n{duplicate_waarden}")

# keep a copy of the de-duplicated data
df_original = df.copy()

11/02/2025 07:25:54 : Info : File data/insurance.csv
11/02/2025 07:25:54 : Info : Check for duplicates after drop 
0
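
A minimal sketch of the duplicate check on a made-up frame (the values below are invented for illustration and are not rows from insurance.csv):

```python
import pandas as pd

# invented rows; the second row repeats the first one exactly
demo = pd.DataFrame({
    "age": [20, 20, 35, 20],
    "smoker": ["yes", "yes", "no", "no"],
    "charges": [1000.0, 1000.0, 2000.0, 3000.0],
})

n_duplicates_before = int(demo.duplicated().sum())  # rows identical to an earlier row
deduplicated = demo.drop_duplicates()               # keeps the first occurrence
n_duplicates_after = int(deduplicated.duplicated().sum())

print(n_duplicates_before, n_duplicates_after, list(deduplicated.index))  # → 1 0 [0, 2, 3]
```

Note that `drop_duplicates` keeps the original index labels, so the index has a gap afterwards. That is consistent with a summary that reports 1337 entries with labels running from 0 to 1337; `reset_index(drop=True)` would renumber the rows if a contiguous index is preferred.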


In [7]:
# a few simple checks
# df.info() prints its summary itself and returns None, so it is called on its own
# instead of being interpolated into a log message
df.info()
log_info(f"df.describe : \n{df.describe()}")

# no null values
from summarytools import dfSummary
dfSummary(df)

<class 'pandas.core.frame.DataFrame'>
Index: 1337 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1337 non-null   int64  
 1   sex       1337 non-null   object 
 2   bmi       1337 non-null   float64
 3   children  1337 non-null   int64  
 4   smoker    1337 non-null   object 
 5   region    1337 non-null   object 
 6   charges   1337 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 83.6+ KB
11/02/2025 07:25:54 : Info : df.describe : 
               age          bmi     children       charges
count  1337.000000  1337.000000  1337.000000   1337.000000
mean     39.222139    30.663452     1.095737  13279.121487
std      14.044333     6.100468     1.205571  12110.359656
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.290000     0.000000   4746.344000
50%      39.000000    30.400000   

| No | Variable | Stats / Values | Freqs / (% of Valid) | Missing |
|---|---|---|---|---|
| 1 | age [int64] | Mean (sd) : 39.2 (14.0); min < med < max: 18.0 < 39.0 < 64.0; IQR (CV) : 24.0 (2.8) | 47 distinct values | 0 (0.0%) |
| 2 | sex [object] | 1. male; 2. female | 675 (50.5%); 662 (49.5%) | 0 (0.0%) |
| 3 | bmi [float64] | Mean (sd) : 30.7 (6.1); min < med < max: 16.0 < 30.4 < 53.1; IQR (CV) : 8.4 (5.0) | 548 distinct values | 0 (0.0%) |
| 4 | children [int64] | 1. 0; 2. 1; 3. 2; 4. 3; 5. 4; 6. 5 | 573 (42.9%); 324 (24.2%); 240 (18.0%); 157 (11.7%); 25 (1.9%); 18 (1.3%) | 0 (0.0%) |
| 5 | smoker [object] | 1. no; 2. yes | 1,063 (79.5%); 274 (20.5%) | 0 (0.0%) |
| 6 | region [object] | 1. southeast; 2. southwest; 3. northwest; 4. northeast | 364 (27.2%); 325 (24.3%); 324 (24.2%); 324 (24.2%) | 0 (0.0%) |
| 7 | charges [float64] | Mean (sd) : 13279.1 (12110.4); min < med < max: 1121.9 < 9386.2 < 63770.4; IQR (CV) : 11911.4 (1.1) | 1,337 distinct values | 0 (0.0%) |
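
`DataFrame.info()` prints its summary and returns None, so its text cannot be interpolated into a log message directly. If the summary should end up in the log as well, one option is to let `info()` write into a string buffer first. A minimal sketch on a made-up frame:

```python
import io
import pandas as pd

demo = pd.DataFrame({"age": [19, 33], "smoker": ["yes", "no"]})

# info() always returns None, regardless of where it writes
direct = demo.info(buf=io.StringIO())

# write the summary into a buffer instead of stdout, then read it back as text
buffer = io.StringIO()
demo.info(buf=buffer)
info_text = buffer.getvalue()

print(direct)
print(info_text)
```

The resulting `info_text` is an ordinary string and could be passed to a logging helper such as the notebook's `log_info`.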


In [8]:
# three categorical features: smoker, region and sex
# four numerical columns, including the target variable "charges"
# create the column lists and the sub-dataframes

df_cat_col = ['smoker','region','sex']
df_num_col = ['age', 'bmi','children']
df_label_col = ['charges']

# split the categorical columns: smoker and region are listed as nominal; the remaining column (sex)
# ends up in the "ordinal" list, although sex is a binary nominal variable as well
df_cat_nom_col = ['smoker','region']
df_cat_ord_col = list(set(df_cat_col) - set(df_cat_nom_col))

df_num = df[df_num_col]
df_cat = df[df_cat_col]
df_label = df[df_label_col]
df_cat_nom = df[df_cat_nom_col]
df_cat_ord = df[df_cat_ord_col]

In [9]:
# short analysis of the categorical variables
print(df_cat.value_counts())

# check for nulls
nul_waarden = df.isnull().sum()
log_info(f"Check for null values \n{nul_waarden}")
# no nulls

smoker  region     sex   
no      southwest  female    141
        southeast  female    139
        northwest  female    135
        southeast  male      134
        northeast  female    132
        northwest  male      131
        southwest  male      126
        northeast  male      125
yes     southeast  male       55
        northeast  male       38
        southwest  male       37
        southeast  female     36
        northwest  male       29
                   female     29
        northeast  female     29
        southwest  female     21
Name: count, dtype: int64
11/02/2025 07:25:54 : Info : Check for null values 
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


In [10]:
plot_boxplot(df, df_num_col,"Boxplot en histogram van de numerische waarden")

In [None]:
plot_boxplot(df, df_label_col,"Boxplot en histogram van de target value")

In [None]:
if plot_graphs:
    sns.set_theme(style="whitegrid", palette="bright")
    plt.figure(figsize=(15, len(df_cat_col) * 2))  

    for i, col in enumerate(df_cat_col):
        plt.subplot(1, len(df_cat_col), i + 1)
        sns.countplot(x=col, data=df, palette='bright')
        plt.title(f"Count Plot of {col}", fontsize=14, fontweight="bold")
        plt.xlabel(col, fontsize=12)
        plt.ylabel("Count", fontsize=12)
        plt.xticks(rotation=45)

    plt.tight_layout()
    save_fig("Categorische features countplot")
    plt.show()
    log_info("Check of categorical features")

In [None]:
if plot_graphs:
    sns.set_theme(style="whitegrid", palette="bright")
    plt.figure(figsize=(15, len(df_cat_col) * 2))  

    for i, col in enumerate(df_cat_col):
        plt.subplot(1, len(df_cat_col), i + 1)
        sns.countplot(x=col, data=df, palette='bright', hue='smoker')
        plt.title(f"Count Plot of {col} (Hue: Smoker)", fontsize=14, fontweight="bold")
        plt.xlabel(col, fontsize=12)
        plt.ylabel("Count", fontsize=12)
        plt.xticks(rotation=45)

    plt.tight_layout()
    save_fig("Categorische variabelen tov smoker")
    plt.show()

In [None]:
if plot_graphs:
    sns.pairplot(df, hue='smoker',  kind='reg')
    plt.tight_layout()
    save_fig("Numerische features onderlinge scatter")
    plt.show()

In [None]:
# also plot the correlation between the numerical values

if plot_graphs:
    plt.figure(figsize=(10, 6))
    sns.set_theme(style="whitegrid", palette="bright")
    sns.heatmap(df[df_num_col].corr(), annot=True, cmap='Reds')
    save_fig("Numerische features correlatie")
    plt.show()


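# encode the categorical columns as integers so they can be included in the correlation matrix
# (the integer codes for region are arbitrary, so its correlations should be read with care)
Dataset = df.copy()
Dataset['sex'] = Dataset['sex'].map({'male': 0, 'female': 1})
Dataset['smoker'] = Dataset['smoker'].map({'yes': 1, 'no': 0})
Dataset['region'] = Dataset['region'].map({'southwest': 0, 'southeast': 1, 'northwest': 2, 'northeast': 3})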

sns.heatmap(Dataset.corr(),annot=True,cmap='Blues')
plt.title('correlation of Insurance Data')
plt.show()

In [None]:
def verwijder_outliers(df_input, df_input_col):
    # Compute the percentage of outliers before capping
    percentage_aantal_outliers = bereken_percentage_aantal_outliers(df_input, df_input_col)
    print(percentage_aantal_outliers)
    log_info(f"Outlier check for columns {df_input_col} : {percentage_aantal_outliers}")

    # Cap every requested column
    df_output = df_input.copy()
    for col in df_input_col:
        df_output = cap_values(df_output, col)    

    # Compute the percentage of outliers again after capping
    percentage_aantal_outliers = bereken_percentage_aantal_outliers(df_output, df_input_col)
    print(percentage_aantal_outliers)
    log_info(f"Outlier check for columns {df_input_col} after capping : {percentage_aantal_outliers}")
    log_info("Capping is not applied to the dataset used for modelling")

    return df_output

# the capped data is only plotted for comparison; the models below are trained on df_original
# the "before" plot gets its own file name so the "after" plot does not overwrite it
plot_boxplot(df_original, df_num_col,"Boxplot en histogram voor removing van de outliers")
df_zonder_outliers = verwijder_outliers(df_original,['bmi'])
plot_boxplot(df_zonder_outliers, df_num_col,"Boxplot en histogram na removing van de outliers")
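
The two steps of this cell, measuring the outlier share with IsolationForest and capping at the IQR whiskers, can be sketched on synthetic data as follows. The `demo` column with one planted extreme value and the helper name `outlier_share` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# synthetic stand-in for a numeric column: 200 typical values plus one planted extreme value
values = np.append(rng.normal(loc=30.0, scale=5.0, size=200), 150.0)
demo = pd.DataFrame({"bmi": values})


def outlier_share(frame, columns):
    """Return the percentage of rows labelled -1 (outlier) and the labels themselves."""
    forest = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
    labels = forest.fit_predict(frame[columns])
    return 100.0 * float((labels == -1).mean()), labels


share_before, labels_before = outlier_share(demo, ["bmi"])

# IQR whiskers: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
q1, q3 = np.percentile(demo["bmi"], [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# clip caps both tails in one step
capped = demo.copy()
capped["bmi"] = demo["bmi"].clip(lower=lower, upper=upper)
share_after, _ = outlier_share(capped, ["bmi"])

print(f"outliers before capping: {share_before:.1f}%, after capping: {share_after:.1f}%")
print(f"largest value before: {demo['bmi'].max():.1f}, after: {capped['bmi'].max():.1f}")
```

With `contamination="auto"`, IsolationForest typically still flags a share of ordinary points in the tails, so the percentage does not necessarily drop to zero after capping.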




In [None]:
# Feature preparation
# StandardScaler on the numerical values, one-hot encoding on the categorical values
# (MinMaxScaler and LabelEncoder are imported but not used below)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer



num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

#set_matplotlib_closeV_encoder = BinaryValueEncoder("male","female")
#set_matplotlib_smoking_encoder = BinaryValueEncoder("yes","no")
# bv_encoder = PredefinedBinaryCategoricalEncoder(positive_class='female')

male_female_transformer = Pipeline(steps=[
     #('male_female_encoder', BinaryValueEncoder("male","female"))
    ('male_female_encoder', OneHotEncoder(drop='first'))
])

smoking_transformer = Pipeline(steps=[
    #('smoking_encoder', BinaryValueEncoder("yes","no"))
    ('smoking_encoder', OneHotEncoder(drop='first'))
])

regio_transformer = Pipeline(steps=[
    ('regio', OneHotEncoder(drop='first', handle_unknown="ignore"))
])

preprocessing = ColumnTransformer([
    ("num", num_pipeline, df_num_col),
    ("male_female", male_female_transformer, ['sex']), 
    ("smoker", smoking_transformer, ['smoker']), 
    ("regio", regio_transformer, ['region'])],
     remainder='passthrough')

df_features = df_original.drop(['charges'], axis= 1)
np_prepared =  preprocessing.fit_transform(df_features)

# use the generated feature names, stripping the "<transformer>__" prefix that ColumnTransformer adds
df_prepared = pd.DataFrame(
    np_prepared,
    columns = [name.split('__')[-1] for name in preprocessing.get_feature_names_out()],
    index=df_original.index)

df_prepared.head()

plot_boxplot(df_prepared,df_num_col,"Boxplot van numerische waarden na standard scaling")




In [None]:
def excute_regression_models(X_train, X_test, y_train,y_test):
    # XGBoost parameters, unpacked into keyword arguments below
    best_XGB_params = {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50, 'subsample': 0.8}
                       
    models = {
        'Linear Regression': LinearRegression(),
        'Lasso': Lasso(alpha=0.1) ,
        'Ridge': Ridge(alpha=0.01) ,
        'Random Forest': RandomForestRegressor(random_state=42),
        'Gradient Boosting': GradientBoostingRegressor(random_state=42), 
        'Decision Tree': DecisionTreeRegressor(random_state=42), 
        'XGBRegressor': XGBRegressor(random_state=42, **best_XGB_params)
    }

    results = []
    for name, model in models.items():
        # 5-fold cross-validation on the training set (scikit-learn reports MSE as a negative score)
        mean_rmse = np.sqrt(-cross_val_score(model, X_train, y_train, cv=5, scoring='neg_mean_squared_error').mean())
        mean_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring='r2').mean()      

        # refit on the full training set and evaluate on the held-out test set
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        test_r2 = r2_score(y_test, y_pred)
        test_mape = mean_absolute_percentage_error(y_test, y_pred)
        results.append([name, mean_rmse, mean_r2, test_rmse, test_r2, test_mape])
    df = pd.DataFrame(results, columns=['model', 'mean_rmse', 'mean_r2', 'test_rmse', 'test_r2', 'test_mape'])

    return df
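
The cross-validation part of this function can be sketched in isolation on synthetic data. The data from `make_regression` and the two candidate models are stand-ins chosen for illustration, not the notebook's dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# synthetic regression problem as a stand-in for the prepared insurance features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)

candidates = {"Linear Regression": LinearRegression(), "Ridge": Ridge(alpha=0.01)}
cv_results = {}
for name, model in candidates.items():
    # scikit-learn maximizes scores, so MSE is reported with a negative sign
    neg_mse = cross_val_score(model, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
    r2 = cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")
    cv_results[name] = {
        "mean_rmse": float(np.sqrt(-neg_mse.mean())),  # flip the sign before taking the root
        "mean_r2": float(r2.mean()),
    }

for name, scores in cv_results.items():
    print(name, scores)
```

The RMSE here is the root of the mean of the fold MSEs, which is how the notebook computes it; averaging per-fold RMSE values would give a slightly different number.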


In [None]:
def excute_regression_models_with_pipeline(X_train, X_test, y_train,y_test,preprocessor):
    models = {
        'Linear Regression': LinearRegression(),
        'Lasso': Lasso(alpha=0.1) ,
        'Ridge': Ridge(alpha=0.01) ,
        'Random Forest': RandomForestRegressor(random_state=42),
        'Gradient Boosting': GradientBoostingRegressor(random_state=42), 
        'Decision Tree': DecisionTreeRegressor(random_state=42), 
        'XGBRegressor': XGBRegressor(random_state=42, learning_rate=0.1, max_depth=3, n_estimators=50, subsample=0.8)
    }

    results = []
    for name, model in models.items():        
        # preprocessing, polynomial expansion and model in one pipeline;
        # with degree=1, PolynomialFeatures leaves the features unchanged
        pipeline = Pipeline([
            ('preprocessor', preprocessor),
            ('poly', PolynomialFeatures(degree=1, include_bias=False)),
            ('regressor', model)
        ])

        # the preprocessor is fitted on the training data only
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)

        test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
        test_r2 = r2_score(y_test, y_pred)
        test_mape = mean_absolute_percentage_error(y_test, y_pred)

        results.append([name, test_rmse, test_r2, test_mape])
    df = pd.DataFrame(results, columns=['model', 'test_rmse', 'test_r2', 'test_mape'])

    return df
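
Why the pipeline variant matters can be sketched on synthetic data: when the preprocessing is part of the pipeline, the scaler and the encoder are fitted on the training rows only, so no information from the test rows leaks into the preprocessing. The `demo` frame and the formula for `target` are made-up assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.choice(["yes", "no"], size=n),
})
# synthetic target (assumption): linear in age plus a fixed surcharge for smokers, plus noise
target = 250.0 * demo["age"] + 20000.0 * (demo["smoker"] == "yes") + rng.normal(0.0, 1000.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(demo, target, test_size=0.2, random_state=42)

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(drop="first"), ["smoker"]),
    ])),
    ("regressor", LinearRegression()),
])

# fit() fits the scaler and the encoder on the training rows only
demo_pipeline.fit(X_train, y_train)
# predict() applies the already fitted preprocessing to the test rows
test_r2 = r2_score(y_test, demo_pipeline.predict(X_test))
print(round(test_r2, 3))
```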


In [None]:
X = df_prepared.copy()
y = df_original['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

resultaat_eenvoudig = excute_regression_models(X_train, X_test, y_train, y_test)
print(resultaat_eenvoudig)
log_info("")
log_info("Regression models with no stratify")
log_info(resultaat_eenvoudig)
log_info("")

In [None]:

X = df_prepared.copy()
y = df_original['charges']
df = df_original.copy()
df["charges_cat"] = pd.cut(df["charges"], bins=[0, 20000,30000, np.inf], labels=[1, 2, 3])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=df["charges_cat"] , test_size=0.2, random_state=42)
resultaat_eenvoudig = excute_regression_models(X_train, X_test, y_train, y_test)

print(resultaat_eenvoudig)
log_info("Stratify on charges_cat: extra feature engineering: charges converted to categories : bins=[0, 20000, 30000, np.inf]")
log_info("Regression models with stratify")
log_info("")
log_info(resultaat_eenvoudig)
log_info("")



In [None]:

X = df_original.drop(['charges'], axis=1)
y = df_original['charges']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
resultaat_eenvoudig = excute_regression_models_with_pipeline(X_train, X_test, y_train, y_test, preprocessing)

print(resultaat_eenvoudig)
log_info("")
log_info("Regression models with polynomial features, no stratify")
log_info(resultaat_eenvoudig)
log_info("")


In [None]:
X = df_original.drop(['charges'], axis=1)
y = df_original['charges']

df = df_original.copy()
df["charges_cat"] = pd.cut(df["charges"], bins=[0, 20000,30000, np.inf], labels=[1, 2, 3])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=df["charges_cat"] , test_size=0.2, random_state=42)
resultaat_eenvoudig = excute_regression_models_with_pipeline(X_train, X_test, y_train, y_test, preprocessing)

print(resultaat_eenvoudig)
log_info("")
log_info("Stratify on charges_cat: extra feature engineering: charges converted to categories : bins=[0, 20000, 30000, np.inf]")
log_info("Regression models with polynomial features and stratify on charges_cat")
log_info(resultaat_eenvoudig)
log_info("")

In [None]:
# hyperparameter tuning for XGBoost

df_features = df_original.drop(['charges'], axis= 1)
np_prepared =  preprocessing.fit_transform(df_features)

# use the generated feature names, stripping the "<transformer>__" prefix that ColumnTransformer adds
df_prepared = pd.DataFrame(
    np_prepared,
    columns = [name.split('__')[-1] for name in preprocessing.get_feature_names_out()],
    index=df_original.index)

df_prepared.head()


X = df_prepared
y = df_original['charges']

df = df_original.copy()
df["charges_cat"] = pd.cut(df["charges"], bins=[0, 20000,30000, np.inf], labels=[1, 2, 3])
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=df["charges_cat"] , test_size=0.2, random_state=42)

xgb_model = XGBRegressor(objective='reg:squarederror')

param_grid = {
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.7, 0.8, 0.9]
}

# set up GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1)
# tune on the training split only, so the test split stays untouched for a final evaluation
grid_search.fit(X_train, y_train)

# print the best parameters and the best score
print("Best parameters:", grid_search.best_params_)
print("Best score (MSE):", -grid_search.best_score_)  # convert the negative MSE back to a positive value
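
The tuning workflow can be sketched end to end on synthetic data. To keep the sketch dependent on scikit-learn only, it swaps in `GradientBoostingRegressor` for `XGBRegressor`; the data, the small grid and the variable names are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# synthetic regression problem as a stand-in for the prepared insurance features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_grid = {"max_depth": [2, 3], "n_estimators": [50, 100]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid=demo_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)  # the tuning only sees the training split

best_params = search.best_params_
cv_mse = -search.best_score_  # flip the sign of the negative MSE
# with the default refit=True, predict() uses the best model refitted on the whole training split
test_mse = mean_squared_error(y_test, search.predict(X_test))

# the winning parameters can be unpacked into a fresh estimator as keyword arguments
tuned_model = GradientBoostingRegressor(random_state=42, **best_params)

print(best_params, round(cv_mse, 1), round(test_mse, 1))
```

Unpacking with `**best_params` passes each entry as its own keyword argument, which is the usual way to carry tuned values over into a model definition.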






In [None]:
log_info_write_to_file("LinReg_logging.log")