## Exercise 2: Use Gradient Boosting for Regression

Instructions:

- Use the dataset file (train.csv) to train your model.
- Use the test file (test.csv) to generate your predictions.
- Use the sample submission file to produce output in the same format.
- Submit your results to:
  https://www.kaggle.com/competitions/playground-series-s4e12/overview



In [256]:
import pandas as pd
import seaborn as sns
import xgboost as xgb
import numpy as np

from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

## Dataset
The train, test, and sample submission files can be found at this link:
https://www.kaggle.com/competitions/playground-series-s4e12/data

## 1. Load the Data

In [257]:
# put your answer here

df = pd.read_csv("train.csv")
dt = pd.read_csv("test.csv")
sf = pd.read_csv("sample_submission.csv")

df['source'] = 'train'
dt['source'] = 'test'

data = pd.concat([df, dt], axis=0).reset_index(drop=True)
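
The tag-and-concatenate pattern used above can be sketched on toy data. The frames and values below are made up for illustration; only the column names `id`, `Premium Amount`, and `source` come from the notebook. The sketch shows that test rows receive NaN for the target after concatenation, and that the `source` tag is enough to split the frames apart again.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are invented).
train_toy = pd.DataFrame(
    {"id": [0, 1], "Age": [30.0, None], "Premium Amount": [100.0, 250.0]}
)
test_toy = pd.DataFrame({"id": [2, 3], "Age": [41.0, 55.0]})

# Tag each frame so the rows can be separated again later.
train_toy["source"] = "train"
test_toy["source"] = "test"

combined = pd.concat([train_toy, test_toy], axis=0).reset_index(drop=True)

# Test rows have no target column, so concat fills it with NaN.
n_missing_target = int(combined["Premium Amount"].isna().sum())
print(n_missing_target)

# Splitting on the tag recovers the original rows.
train_back = combined[combined["source"] == "train"].drop(columns=["source"])
test_back = combined[combined["source"] == "test"].drop(
    columns=["source", "Premium Amount"]
)
print(len(train_back), list(test_back.columns))
```

One consequence worth keeping in mind: any statistic computed on the combined frame (a mean, a mode) also reflects the test rows.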

## 2. Perform Data preprocessing

In [260]:
data['Policy Start Date'] = pd.to_datetime(data['Policy Start Date'], errors='coerce')
data['year'] = data['Policy Start Date'].dt.year
data['month'] = data['Policy Start Date'].dt.month
data['day'] = data['Policy Start Date'].dt.day
data['hour'] = data['Policy Start Date'].dt.hour
data['minute'] = data['Policy Start Date'].dt.minute
data['second'] = data['Policy Start Date'].dt.second

# The raw column is no longer needed once its components are extracted.
# Filling it at this point would not change the derived columns; any NaN
# components from unparseable dates are filled by the numeric imputer below.
data = data.drop(columns=['Policy Start Date'])

cat = data.select_dtypes(include=['object']).columns.tolist()
num = data.select_dtypes(include=['float64', 'int64']).columns.tolist()

num_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
data[num] = num_imputer.fit_transform(data[num])

cat_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data[cat] = cat_imputer.fit_transform(data[cat])
        
# The imputers above have already filled the numeric and categorical columns,
# so no further fillna call is needed here.
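
The date handling in the cell above can be illustrated on a toy series. This is a minimal sketch, not the notebook's exact procedure: the date strings are invented, and it fills missing dates before extracting the components, which is one way to keep the derived columns free of gaps.

```python
import pandas as pd

# Toy dates; the second entry is deliberately unparseable.
raw = pd.Series(["2023-12-23 15:21:39", "not a date", "2022-06-12 15:21:39"])

# errors="coerce" turns strings that cannot be parsed into NaT.
parsed = pd.to_datetime(raw, errors="coerce")
n_missing = int(parsed.isna().sum())

# Fill NaT with the most frequent date first, then extract the components.
filled = parsed.fillna(parsed.mode()[0])

parts = pd.DataFrame(
    {
        "year": filled.dt.year,
        "month": filled.dt.month,
        "day": filled.dt.day,
    }
)
print(n_missing, int(parts.isna().sum().sum()))
```

If the components are extracted before filling, the rows with NaT produce NaN in every derived column, and a later mean imputation can yield fractional values such as a non-integer month.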


In [262]:
df = data[data['source'] == 'train'].drop(columns=['id','source'])
# The test rows carry only an imputed placeholder target, so drop it as well.
dt = data[data['source'] == 'test'].drop(columns=['source', 'Premium Amount'])
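
Because the imputers earlier in the notebook were fitted on the combined frame, the test rows influence the fill values. A leakage-free variant fits on the training rows only and reuses those statistics for the test rows. The sketch below uses invented values and a single hypothetical `Age` column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames (values are invented).
train_toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
test_toy = pd.DataFrame({"Age": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="mean")

# The mean is learned from the training rows only: (20 + 40) / 2 = 30.
train_filled = imputer.fit_transform(train_toy)

# transform() reuses the training mean; the test value 90 plays no part in it.
test_filled = imputer.transform(test_toy)

print(train_filled[1, 0], test_filled[0, 0])
```

Placing the imputers inside the scikit-learn `Pipeline` achieves the same effect automatically, since `fit` only ever sees the training split.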

In [263]:
X = df.drop(columns=['Premium Amount'])
y = df['Premium Amount']

In [264]:
cat = X.select_dtypes(include=['object']).columns.tolist()
num = X.select_dtypes(include=['float64', 'int64']).columns.tolist()
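
Selecting numeric columns by listing `float64` and `int64` skips any column stored with another numeric dtype. Depending on the pandas version, the `.dt` accessors may return `int32`, in which case the date components would not appear in `num`. The sketch below builds such a column explicitly (an assumption made for illustration) and compares the strict selection with `include="number"`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Age": np.array([30.0, 41.0]),                  # float64
        "year": np.array([2022, 2023], dtype="int32"),  # int32
        "Gender": ["Male", "Female"],                   # non-numeric
    }
)

# Listing exact dtypes misses the int32 column.
strict = toy.select_dtypes(include=["float64", "int64"]).columns.tolist()

# "number" matches every numeric dtype.
broad = toy.select_dtypes(include="number").columns.tolist()

print(strict)
print(broad)
```

Printing `X.dtypes` in the notebook is a quick way to check whether this applies to the actual data.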

In [265]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(), cat),
        ("num", StandardScaler(), num)
    ]
)
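
A small sketch of what a `ColumnTransformer` of this shape produces, on invented data. It differs from the cell above in one respect, which is an assumption of the sketch rather than the notebook's setting: `handle_unknown="ignore"` is passed to `OneHotEncoder` so that a category not seen during fitting is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (values are invented).
train_toy = pd.DataFrame(
    {"Gender": ["Male", "Female", "Male"], "Age": [20.0, 30.0, 40.0]}
)
new_toy = pd.DataFrame({"Gender": ["Other"], "Age": [30.0]})  # unseen category

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
        ("num", StandardScaler(), ["Age"]),
    ]
)

# Two one-hot columns for Gender plus one scaled column for Age.
train_matrix = toy_preprocessor.fit_transform(train_toy)
new_matrix = toy_preprocessor.transform(new_toy)

print(train_matrix.shape, new_matrix.shape)
```

Columns that are not named in any transformer are dropped by default (`remainder="drop"`), so extra columns in the input frame do not reach the model.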

In [266]:
params = {
    "n_estimators": 500,
    "max_depth": 4,
    "min_samples_split": 5,  # scikit-learn name; not used by XGBRegressor
    "learning_rate": 0.01,
    "loss": "squared_error",  # scikit-learn name; not used by XGBRegressor
}
     

## 3. Create a Pipeline

In [267]:
model = xgb.XGBRegressor(**params)

In [268]:
# put your answer here

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('model', model)
])

In [269]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## 4. Train the Model

In [270]:
pipeline.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)

Parameters: { "loss", "min_samples_split" } are not used.



## 5. Evaluate the Model

In [271]:
# put your answer here

from sklearn.metrics import root_mean_squared_log_error

rmsle = root_mean_squared_log_error(y_test, y_pred)

print("Root mean squared log error:", rmsle)

Root mean squared log error: 1.1589541934553096
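
For reference, the metric can be written out by hand with NumPy. This is a minimal sketch; the helper name `rmsle` and the clipping of negative predictions to zero are choices made here, not part of the notebook or of scikit-learn.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared log error, using log1p so that zeros are allowed."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip negatives to zero: the metric is defined for non-negative values.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0.0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Identical predictions give an error of zero.
perfect = rmsle([100.0, 1000.0], [100.0, 1000.0])

# log1p(e - 1) = 1 and log1p(0) = 0, so this single-row error is 1.
one_unit = rmsle([0.0], [np.e - 1.0])

print(perfect, one_unit)
```

Since the metric compares logarithms, a common approach is to train on `np.log1p(y)` and convert predictions back with `np.expm1`, so that the squared-error objective lines up with the evaluation metric.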


## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [272]:
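# Take the ids from the sample submission so they keep their original type
# (the 'id' column in dt was passed through the numeric imputer, which returns floats).
# A separate name avoids shadowing the built-in id().
test_ids = sf['id']
y_pred = pipeline.predict(dt.drop(columns=['id']))

# Create a submission DataFrame
submission_df = pd.DataFrame({
    'id': test_ids,
    'Premium Amount': y_pred
})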

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

Submission file created: submission_file.csv
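
Before uploading, the file can be compared against the sample submission. The helper `matches_sample` below is hypothetical and the frames are toy data; it checks that the column names, row count, and id order agree and that no prediction is missing.

```python
import pandas as pd

# Toy stand-ins for the sample submission and the generated file.
sample_toy = pd.DataFrame({"id": [10, 11, 12], "Premium Amount": [0.0, 0.0, 0.0]})
submission_toy = pd.DataFrame(
    {"id": [10, 11, 12], "Premium Amount": [812.5, 1044.0, 930.2]}
)


def matches_sample(submission, sample):
    """Return True if the submission has the same layout as the sample."""
    return (
        list(submission.columns) == list(sample.columns)
        and len(submission) == len(sample)
        and submission["id"].equals(sample["id"])
        and not submission["Premium Amount"].isna().any()
    )


ok = matches_sample(submission_toy, sample_toy)
truncated_ok = matches_sample(submission_toy.iloc[:2], sample_toy)
print(ok, truncated_ok)
```

In the notebook, the same check could be applied to `submission_df` and a freshly loaded copy of `sample_submission.csv`.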
