# Random Forest and XGBoost on Tabular Playground August 2021
In this notebook, I train a Random Forest and an XGBoost regression model on the Tabular Playground Series August 2021 data and compare their validation errors, measured as mean squared error (MSE).

## Library and Data Imports
I'll start with the basic library and data imports.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Loading in the train and test data
train_data = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
test_data = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')

# Features are columns 1 to 100 (column 0 is skipped); the target is the 'loss' column
X = train_data.iloc[:, 1:101]
y = train_data['loss']
X.head()

In [None]:
# Attempt to use feature engineering with mutual info regression, but this took up too much space
# from sklearn.feature_selection import mutual_info_regression

# # Function code from Mutual Information lesson in Kaggle Feature Engineering course
# def make_mi_scores(X, y, discrete_features):
#     '''create series for the mutual info scores for each feature'''
#     mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
#     mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
#     mi_scores = mi_scores.sort_values(ascending=False)
#     return mi_scores

# # Remove the discrete features and find the MI scores
# X = X.loc[:,X.dtypes == float]
# mi_scores = make_mi_scores(X, y, False)
# mi_scores.head(20)
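
If memory or runtime was what made the mutual information attempt impractical, one option is to estimate the scores on a random subsample of rows instead of the full training set. The sketch below is only an illustration under that assumption: `mi_scores_on_sample`, the `n_rows` default, and the synthetic `demo_X` / `demo_y` data are hypothetical and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def mi_scores_on_sample(X, y, n_rows=10_000, random_state=0):
    """Estimate MI scores on a random subsample of rows (hypothetical helper)."""
    n_rows = min(n_rows, len(X))
    # Pick the same random rows from the features and the target
    sample_idx = X.sample(n=n_rows, random_state=random_state).index
    scores = mutual_info_regression(
        X.loc[sample_idx], y.loc[sample_idx], random_state=random_state
    )
    scores = pd.Series(scores, name="MI Scores", index=X.columns)
    return scores.sort_values(ascending=False)

# Synthetic stand-in data, NOT the competition data:
# the target depends on f0 only, so f0 should rank first.
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(2000, 3)), columns=["f0", "f1", "f2"])
demo_y = pd.Series(demo_X["f0"] ** 2 + 0.1 * rng.normal(size=2000), name="loss")

demo_scores = mi_scores_on_sample(demo_X, demo_y, n_rows=500)
print(demo_scores)
```

Scores from a subsample are noisier than scores from the full data, so they are better suited to a rough ranking of features than to fine distinctions between them.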

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size = 0.2, random_state = 1)
print(len(X_train), len(X_valid), len(y_train), len(y_valid))
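
As a quick illustration of what the split above does, here is a small sketch on made-up data (the `demo_X` and `demo_y` frames are hypothetical stand-ins, not the competition data). It checks that `test_size = 0.2` holds out a fifth of the rows, that features and targets stay aligned, and that no row ends up in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with 1000 rows, standing in for the real training frame
demo_X = pd.DataFrame({"f0": np.arange(1000), "f1": np.arange(1000) * 2.0})
demo_y = pd.Series(np.arange(1000) % 7, name="loss")

X_tr, X_va, y_tr, y_va = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=1
)

# 80% of the rows go to training, 20% to validation
split_sizes = (len(X_tr), len(X_va))
# Features and targets keep the same row order within each split
rows_aligned = X_tr.index.equals(y_tr.index) and X_va.index.equals(y_va.index)
# No row appears in both the training and the validation set
no_overlap = set(X_tr.index).isdisjoint(X_va.index)

print(split_sizes, rows_aligned, no_overlap)
```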

## Training the Model
Here I'll import and train the Random Forest and XGBoost models on the tabular data.

In [None]:
# models to be used
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
# metric for measuring prediction error (lower is better)
from sklearn.metrics import mean_squared_error

In [None]:
# creating the RandomForestRegressor model with 500 estimators and 4 jobs (for faster processing)
rf_model = RandomForestRegressor(n_estimators = 500, n_jobs = 4, random_state = 0)
rf_model.fit(X_train, y_train)
# Rounded predictions
predictions_rf = rf_model.predict(X_valid).round()
print("Random Forest MSE (rounded): " + str(mean_squared_error(y_valid, predictions_rf)))

In [None]:
# Unrounded predictions
predictions_rf = rf_model.predict(X_valid)
print("Random Forest MSE (unrounded): " + str(mean_squared_error(y_valid, predictions_rf)))
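
To make the rounded versus unrounded comparison concrete, here is a tiny worked example with a hand-written MSE. The numbers are made up, and the integer-valued targets are an assumption about why rounding is worth trying at all; `mse`, `demo_y_true`, and `demo_preds` are hypothetical names.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up integer targets and continuous predictions (not competition data)
demo_y_true = [0, 1, 2, 3]
demo_preds = np.array([0.6, 1.4, 2.6, 2.4])

mse_unrounded = mse(demo_y_true, demo_preds)          # errors stay fractional
mse_rounded = mse(demo_y_true, demo_preds.round())    # errors become whole numbers

print("unrounded:", mse_unrounded)
print("rounded:  ", mse_rounded)
```

In this toy case rounding pushes three of the four predictions onto the wrong integer, so the rounded error is higher. On other data rounding can just as well help, which is why both variants are measured on the validation set.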

In [None]:
# creating the XGBRegressor model with 500 estimators, a learning_rate of 0.05,
# 5 rounds for early stopping, and 4 jobs (for faster processing)
# Note: XGBoost 1.6+ takes early_stopping_rounds in the constructor;
# older releases expected it as an argument to fit() instead
xgb_model = XGBRegressor(n_estimators = 500, learning_rate = 0.05, n_jobs = 4,
                         early_stopping_rounds = 5, random_state = 0)
xgb_model.fit(X_train, y_train,
              eval_set = [(X_valid, y_valid)],
              verbose = False)
# Rounded predictions
predictions_xgb = xgb_model.predict(X_valid).round()
print("XGBoost MSE (rounded): " + str(mean_squared_error(y_valid, predictions_xgb)))

In [None]:
# Unrounded predictions
predictions_xgb = xgb_model.predict(X_valid)
print("XGBoost MSE (unrounded): " + str(mean_squared_error(y_valid, predictions_xgb)))
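
The idea behind early stopping with 5 rounds can be sketched in plain Python: keep boosting while the validation error improves, and stop once 5 rounds in a row have failed to beat the best error so far. This is a simplified illustration of the rule, not XGBoost's actual implementation; `early_stopping_round` and the error values are hypothetical.

```python
def early_stopping_round(val_errors, patience=5):
    """Return (best_round, stop_round) for a lower-is-better validation metric.

    Training stops at the first round that is `patience` rounds past the
    best round without any improvement in between.
    """
    best_round = 0
    best_error = float("inf")
    for round_idx, err in enumerate(val_errors):
        if err < best_error:
            # New best validation error: remember it and keep going
            best_error = err
            best_round = round_idx
        elif round_idx - best_round >= patience:
            # `patience` rounds have passed without improvement: stop here
            return best_round, round_idx
    # Never triggered: training ran through every round
    return best_round, len(val_errors) - 1

# Made-up validation errors: they improve until round 3, then drift upward
demo_errors = [10.0, 9.0, 8.0, 7.5, 7.6, 7.7, 7.8, 7.9, 8.0, 8.1]
best_round, stop_round = early_stopping_round(demo_errors, patience=5)
print(best_round, stop_round)
```

With these numbers the best round is 3 and training stops at round 8, after rounds 4 to 8 all fail to improve on it.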

The **XGBoost model** with unrounded predictions appears to give the lower validation error, so I will use it for the final submission.

## Final Training and Submission
Here I'll retrain the better of the two models (XGBRegressor) on the full training set and create the submission file.

In [None]:
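# create the final model and train it on the full training set
# (no validation set is held out here, so all 500 rounds run without early stopping)
final_model = XGBRegressor(n_estimators = 500, learning_rate = 0.05, n_jobs = 4, random_state = 0)
final_model.fit(X, y, verbose = False)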

In [None]:
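# predict on the test features (every column after the first)
final_predictions = final_model.predict(test_data.iloc[:,1:])
# take the ids from the test file instead of hard-coding the range;
# this assumes the first column of test.csv holds the row ids
submission = pd.DataFrame({"id": test_data.iloc[:,0],
                          "loss": final_predictions})
submission.to_csv("submission.csv", index=False, header=True)
print("Final submission created!")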
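
Before uploading, it can be worth checking the submission frame for the most common mistakes. The sketch below is a hypothetical helper, not part of the original notebook: `check_submission` and the small demo frames are made up, and only the `id` and `loss` column names come from the cell above.

```python
import numpy as np
import pandas as pd

def check_submission(submission, expected_ids):
    """Return a list of problems; an empty list means the frame looks fine."""
    problems = []
    if list(submission.columns) != ["id", "loss"]:
        problems.append(f"unexpected columns: {list(submission.columns)}")
        return problems
    if len(submission) != len(expected_ids):
        problems.append("row count differs from the test set")
    elif not (submission["id"].to_numpy() == expected_ids.to_numpy()).all():
        problems.append("ids do not match the test set order")
    if submission["loss"].isna().any():
        problems.append("missing predictions in 'loss'")
    return problems

# Made-up stand-in for the test data (not the competition file)
demo_test = pd.DataFrame({"id": [250000, 250001, 250002], "f0": [0.1, 0.2, 0.3]})

good = pd.DataFrame({"id": demo_test["id"], "loss": [1.0, 2.5, 0.3]})
bad = pd.DataFrame({"id": [0, 1, 2], "loss": [1.0, np.nan, 0.3]})

good_problems = check_submission(good, demo_test["id"])
bad_problems = check_submission(bad, demo_test["id"])
print(good_problems)
print(bad_problems)
```

The first frame passes with no problems, while the second is flagged both for its mismatched ids and for the missing prediction.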