# Basic XGBoost model with parameter tuning

In this notebook I use XGBoost to fit the data and do some parameter tuning. If you have any hints on how to improve it, please feel free to comment below :)

Thank you!

### Load libraries

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import ParameterGrid

from xgboost import XGBRegressor
import copy
        
input_path = Path('/kaggle/input/tabular-playground-series-feb-2021/')

### Load data

In [None]:
train = pd.read_csv(input_path / 'train.csv', index_col='id')
#display(train.head())

In [None]:
test = pd.read_csv(input_path / 'test.csv', index_col='id')
#display(test.head())

In [None]:
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')
#display(submission.head())

### Encode categorical variables

In [None]:
for c in train.columns:
    if train[c].dtype=='object': 
        lbl = LabelEncoder()
        lbl.fit(list(train[c].values) + list(test[c].values))
        train[c] = lbl.transform(train[c].values)
        test[c] = lbl.transform(test[c].values)

#display(train.head())
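
The encoder above is fitted on the train and test values together because `LabelEncoder.transform` raises an error on labels it has not seen during `fit`. A minimal pure-Python sketch of the same idea, using made-up values rather than the competition data (`train_col`, `test_col`, and `code_of` are illustrative names):

```python
# Toy columns standing in for one categorical feature (made-up values).
train_col = ["B", "A", "C", "A"]
test_col = ["D", "B", "A"]  # "D" never appears in the training column

# Build one shared vocabulary from both splits, in sorted order,
# which is also how LabelEncoder orders its classes.
classes = sorted(set(train_col) | set(test_col))
code_of = {label: code for code, label in enumerate(classes)}

train_codes = [code_of[v] for v in train_col]
test_codes = [code_of[v] for v in test_col]

print(classes)
print(train_codes)
print(test_codes)
```

Building the vocabulary from the training column alone would leave `"D"` without a code, which is the failure the combined fit avoids.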

### Pull out the target and make a validation split

In [None]:
target = train.pop('target')

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(train, target, train_size=0.80)

## First model: XGBoost regressor with default settings

In [None]:
# Fit model with default settings
model = XGBRegressor()
model.fit(X_train, y_train)

In [None]:
# Make predictions and compute RMSE on the validation set
predictions = model.predict(X_valid)
print("RMSE: " + str(np.sqrt(mean_squared_error(y_valid, predictions))))

In [None]:
# Create first submission file
submission['target'] = model.predict(test)
submission.to_csv('xgboost_1.csv')

With this first submission file, LB score was **0.84924**.
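
Note that the validation number computed in this notebook is a root mean squared error, not a plain mean squared error: it is the square root of the averaged squared residuals. A small sketch with made-up numbers (not the competition data) showing the difference between the two:

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 7.0])

mse = np.mean((y_true - y_pred) ** 2)  # mean of squared residuals
rmse = np.sqrt(mse)                    # same units as the target

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```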

## Second model using some parameter tuning

In [None]:
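# Add a few parameters to improve the performance of the model
# Note: newer XGBoost releases expect early_stopping_rounds to be passed
# to the XGBRegressor constructor rather than to fit()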
model = XGBRegressor(n_estimators=500, 
                     learning_rate=0.05, 
                     n_jobs=-1)
model.fit(X_train, y_train, 
          early_stopping_rounds=5,
          eval_set=[(X_valid, y_valid)],
          verbose=False)

In [None]:
# Make predictions and compute RMSE on the validation set
predictions = model.predict(X_valid)
print("RMSE: " + str(np.sqrt(mean_squared_error(y_valid, predictions))))

In [None]:
# Create second submission file
submission['target'] = model.predict(test)
submission.to_csv('xgboost_2.csv')

With this second submission file, LB score was **0.84586**, a little better than the first one.

## Third model with parameter tuning after a simple grid search

In [None]:
# This cell takes a long time to run, so I have commented it out.

#model = XGBRegressor()

# Create a dictionary of hyperparameters to search
#grid = {'max_depth': [6, 7], 'n_estimators': [100, 500, 1000], 'n_jobs': [-1], 'learning_rate': [0.05, 0.10],}

#model_scores = []

# Loop through the parameter grid, set the hyperparameters, and save the scores
#for g in ParameterGrid(grid):
#    model.set_params(**g) 
#    model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_valid, y_valid)], verbose=False)
#    predictions = model.predict(X_valid)
#    model_score = np.sqrt(mean_squared_error(y_valid, predictions))
#    model_scores.append(model_score)
#    print('RMSE =', f'{model_score:0.5f} ', 'Parameters:', g)

# Find best hyperparameters from the validation score and print
#best_idx = np.argmin(model_scores)
#print()
#print('Best score: ', model_scores[best_idx], ParameterGrid(grid)[best_idx])

These were the results from the original run, which printed the label `MSE`; the values are root mean squared errors:

`MSE = 0.85165  Parameters: {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 100, 'n_jobs': -1}`
`MSE = 0.84347  Parameters: {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 500, 'n_jobs': -1}`
`MSE = 0.84347  Parameters: {'learning_rate': 0.05, 'max_depth': 6, 'n_estimators': 1000, 'n_jobs': -1}`
`MSE = 0.85032  Parameters: {'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 100, 'n_jobs': -1}`
`MSE = 0.84367  Parameters: {'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 500, 'n_jobs': -1}`
`MSE = 0.84367  Parameters: {'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 1000, 'n_jobs': -1}`
`MSE = 0.84582  Parameters: {'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 100, 'n_jobs': -1}`
`MSE = 0.84375  Parameters: {'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 500, 'n_jobs': -1}`
`MSE = 0.84375  Parameters: {'learning_rate': 0.1, 'max_depth': 6, 'n_estimators': 1000, 'n_jobs': -1}`
`MSE = 0.84526  Parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 100, 'n_jobs': -1}`
`MSE = 0.84426  Parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 500, 'n_jobs': -1}`
`MSE = 0.84426  Parameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 1000, 'n_jobs': -1}`

`Best score:  0.8434681452062144 {'n_jobs': -1, 'n_estimators': 500, 'max_depth': 6, 'learning_rate': 0.05}`
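
The twelve rows come from taking every combination of the values in `grid`. As a rough illustration of that enumeration, here is a standard-library sketch that sorts the keys and takes the Cartesian product of the value lists, which is the order the results above appear in. It does not train anything, and `combos` is just an illustrative name:

```python
from itertools import product

grid = {
    'max_depth': [6, 7],
    'n_estimators': [100, 500, 1000],
    'n_jobs': [-1],
    'learning_rate': [0.05, 0.10],
}

# Sort the keys, then vary the last key fastest (Cartesian product).
keys = sorted(grid)
combos = [dict(zip(keys, values))
          for values in product(*(grid[k] for k in keys))]

print(len(combos))  # 2 * 3 * 1 * 2 combinations
for params in combos[:3]:
    print(params)
```

Each extra value added to any list multiplies the number of fits, which is why even this small grid takes a long time to run.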

The best set of parameters found was ... the same one I had already tried! `max_depth=6` is the XGBoost default and `n_jobs=-1` only controls parallelism, so this is exactly the second model's configuration.

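And it seems that raising `n_estimators` from 500 to 1000 did not make a difference: the scores are identical to five decimals, which suggests that early stopping ended training before round 500 in both cases.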
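
To see why a larger cap on the number of trees can leave the result unchanged when `early_stopping_rounds=5` is used, here is a toy sketch of a patience-based stopping rule. It is a simplified stand-in, not XGBoost's actual implementation; `stopping_round` and `made_up_scores` are hypothetical helpers and the validation curve is synthetic:

```python
def stopping_round(scores, patience):
    """Toy early-stopping rule for a lower-is-better score (hypothetical helper).

    Returns (rounds_run, best_round): training halts once `patience`
    consecutive rounds have failed to beat the best score seen so far.
    """
    best_round = 0
    for i, score in enumerate(scores):
        if score < scores[best_round]:
            best_round = i
        elif i - best_round >= patience:
            return i + 1, best_round
    return len(scores), best_round


def made_up_scores(n_rounds):
    # Synthetic validation curve: improves until round 120, then degrades.
    return [0.84 + ((i - 120) / 1000) ** 2 for i in range(n_rounds)]


for cap in (100, 500, 1000):
    print(cap, stopping_round(made_up_scores(cap), patience=5))
```

With these made-up scores, the caps of 500 and 1000 stop at the same round and keep the same best round, while the cap of 100 runs out before the curve has flattened.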

In any case, I will fit the model once again.

In [None]:
# Fit model with the best parameters found
model = XGBRegressor(n_estimators=500,
                     learning_rate=0.05,
                     n_jobs=-1)
model.fit(X_train, y_train, 
          early_stopping_rounds=5,
          eval_set=[(X_valid, y_valid)],
          verbose=False)

In [None]:
# Make predictions and compute RMSE on the validation set
predictions = model.predict(X_valid)
print("RMSE: " + str(np.sqrt(mean_squared_error(y_valid, predictions))))

In [None]:
# Create third submission file
submission['target'] = model.predict(test)
submission.to_csv('xgboost_3.csv')

With this third submission file, LB score was **0.84586**, the same as the second model's.

### So this is what I have so far. As mentioned before, any tips on how to improve this simple model are welcome. Thank you! :D