<h1>Tabular Playground Series - Aug 2021</h1>

**Description**

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new in their data science journey. In the past, we've launched many Playground competitions that are more approachable than our Featured competitions and thus, more beginner-friendly.

In order to have a more consistent offering of these competitions for our community, we're trying a new experiment in 2021. We'll be launching month-long tabular Playground competitions on the 1st of every month and continue the experiment as long as there's sufficient interest and participation.

The goal of these competitions is to provide a fun tabular dataset that is approachable for anyone. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition. If you're an established competitions master or grandmaster, these probably won't be much of a challenge for you. We encourage you to avoid saturating the leaderboard.

For each monthly competition, we'll be offering Kaggle Merchandise for the top three teams. And finally, because we want these competitions to be more about learning, we're limiting team sizes to 3 individuals.

The dataset used for this competition is synthetic, but based on a real dataset and generated using a CTGAN. The original dataset deals with calculating the loss associated with loan defaults. Although the features are anonymized, they have properties relating to real-world features.

Good luck and have fun!

For ideas on how to improve your score, check out the Intro to Machine Learning and Intermediate Machine Learning courses on Kaggle Learn.

**Data Description**

For this competition, you will be predicting a target loss based on a number of feature columns given in the data. The ground truth loss is integer valued, although predictions can be continuous.

**Files**

- `train.csv` - the training data with the target `loss` column
- `test.csv` - the test set; you will be predicting the `loss` for each row in this file
- `sample_submission.csv` - a sample submission file in the correct format


**Evaluation**

Submissions are scored on the root mean squared error. RMSE is defined as:

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the ground truth value, and $n$ is the number of rows in the test data.
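
As a rough illustration of the metric, here is a minimal sketch that computes RMSE with NumPy: the square root of the mean squared difference between ground truth and predictions. The helper name `rmse` and the sample values are made up for this example and are not competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical values: integer-valued ground truth, continuous predictions.
y_true = [0, 2, 5, 1]
y_pred = [1.0, 2.0, 3.0, 1.0]

# Squared errors are 1, 0, 4, 0, so the mean is 1.25.
score = rmse(y_true, y_pred)
print(score)
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones.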


In [None]:
import numpy as np 
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df_train = pd.read_csv('../input/tabular-playground-series-aug-2021/train.csv')
df_test = pd.read_csv('../input/tabular-playground-series-aug-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-aug-2021/sample_submission.csv')

In [None]:
df_train.head()

In [None]:
df_train.shape

In [None]:
df_train.describe()

In [None]:
df_test.head()

In [None]:
df_test.shape

In [None]:
df_test.describe()

The test data has no `loss` column, so it carries no target values. We therefore have to fit and validate our model on the training set.

In [None]:
y = df_train.loss

In [None]:
df_train.columns

Selecting our features to train our model and predict target variable.

In [None]:
X = df_train.drop(['id','loss'],axis=1)
X

In [None]:
X.columns

Building Model

In [None]:
# from sklearn.tree import DecisionTreeRegressor

# #Specifying random state
# tree_model = DecisionTreeRegressor(random_state = 1)

# #Fit Model
# tree_model.fit(X,y)

Predictions on first 5 rows to check how our model is performing

In [None]:
# print("Making predictions for the following 5 rows:")
# print(X.head())
# print("The predictions are")
# print(tree_model.predict(X.head()))

Model Validation: To measure the quality of our model, the relevant measure is predictive accuracy. In other words, will the model's predictions be close to what actually happens?

A simple metric for summarizing model quality is Mean Absolute Error (MAE). The error for each row is `error = actual - predicted`; MAE takes the absolute value of each error and averages them over all rows.
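
To make the MAE idea concrete, here is a small sketch with made-up numbers (the arrays are assumptions for illustration, not values from the dataset). It computes the per-row error, takes absolute values, and averages them.

```python
import numpy as np

# Hypothetical actual and predicted losses for four rows.
actual = np.array([3.0, 0.0, 7.0, 2.0])
predicted = np.array([2.0, 1.0, 4.0, 2.0])

errors = actual - predicted            # per-row error: 1, -1, 3, 0
mae = float(np.mean(np.abs(errors)))   # (1 + 1 + 3 + 0) / 4
print(mae)
```

Taking the absolute value keeps positive and negative errors from cancelling each other out.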

In [None]:
# from sklearn.metrics import mean_absolute_error

# predicted_values = tree_model.predict(X)
# mean_absolute_error(y, predicted_values)

Since we did not split our data during the model-building process, the MAE computed so far is an in-sample score. We need to split our training data into a training set and a validation set, so that MAE is measured on data the model has not seen.

https://www.kaggle.com/dansbecker/model-validation
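
To see why a held-out set matters, here is a minimal sketch on synthetic data. The arrays and the `*_demo` names are assumptions made for illustration and are unrelated to the competition files. An unrestricted decision tree memorizes its training rows, so its in-sample MAE looks perfect while its validation MAE does not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: one informative feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=500)

# Hold out 25% of the rows (the train_test_split default) for validation.
X_tr_demo, X_va_demo, y_tr_demo, y_va_demo = train_test_split(
    X_demo, y_demo, random_state=0
)

tree_demo = DecisionTreeRegressor(random_state=0).fit(X_tr_demo, y_tr_demo)

# Score the same model on rows it has seen and rows it has not.
in_sample_mae = mean_absolute_error(y_tr_demo, tree_demo.predict(X_tr_demo))
validation_mae = mean_absolute_error(y_va_demo, tree_demo.predict(X_va_demo))
print(in_sample_mae, validation_mae)
```

The gap between the two numbers is the overfitting that an in-sample score hides.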

In [None]:
# from sklearn.model_selection import train_test_split

# train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

# tree_model = DecisionTreeRegressor()

# tree_model.fit(train_X, train_y)

# #get predicted loss on validation data
# val_predictions = tree_model.predict(val_X)
# print(mean_absolute_error(val_y, val_predictions))

To check our model for overfitting and underfitting, we compare the validation MAE across different tree sizes.

In [None]:
# def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
#     model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
#     model.fit(train_X, train_y)
#     preds_val = model.predict(val_X)
#     mae = mean_absolute_error(val_y, preds_val)
#     return(mae)

We can use a for-loop to compare the accuracy of models built with different values for max_leaf_nodes.

In [None]:
# for max_leaf_nodes in [5, 50, 500, 5000]:
#     my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
#     print("Max leaf nodes: %d \t\t Mean Absolute Error: %.4f" %(max_leaf_nodes, my_mae))

## Random Forest

Let's apply the random forest algorithm. A random forest builds many trees and makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree, and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
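
The averaging behaviour can be checked directly. The sketch below uses synthetic data (the arrays and the `*_demo` names are assumptions for illustration only) and compares the forest's prediction with a manual mean over its fitted trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, unrelated to the competition files.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

forest_demo = RandomForestRegressor(n_estimators=20, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Predict the first five rows with each tree, then average across trees.
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest_demo.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction for the same rows.
forest_pred = forest_demo.predict(X_demo[:5])
print(np.allclose(manual_average, forest_pred))
```

Averaging many trees, each fit on a different bootstrap sample, smooths out the errors that any single tree makes.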

In [None]:
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.metrics import mean_squared_error

# forest_model = RandomForestRegressor(random_state = 1)
# forest_model.fit(train_X, train_y)
# loss_preds = forest_model.predict(val_X)
# print(mean_squared_error(val_y, loss_preds))

In [None]:
# df_test_X = df_test.drop(['id'], axis=1)
# df_test_X.columns

In [None]:
# df_test_preds = forest_model.predict(df_test_X)

In [None]:
# output = pd.DataFrame({'id': df_test.id, 'loss': df_test_preds})
# output.to_csv('submission.csv', index=False)

In [None]:
df_train

In [None]:
useful_features = [c for c in df_train.columns if c not in ("id", "loss")]
useful_features

In [None]:
for col in useful_features:
    df_train[col] = np.log1p(df_train[col])
    df_test[col] = np.log1p(df_test[col])

In [None]:
df_train

## XGBoost

In [None]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [None]:
model = XGBRegressor(n_estimators=1375, max_depth = 3, learning_rate=0.14, colsample_bytree= 0.5, subsample=0.99, 
                     random_state=1, reg_alpha = 25.4, n_jobs = 5, 
                     tree_method='gpu_hist', gpu_id=0, predictor="gpu_predictor")

In [None]:
X = df_train[useful_features]  # rebuild after log1p so train and test features match
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)

In [None]:
model.fit(train_X, train_y, early_stopping_rounds = 100, eval_set=[(val_X, val_y)], verbose=False)

In [None]:
preds_valid = model.predict(val_X)
rmse = np.sqrt(mean_squared_error(val_y, preds_valid))  # RMSE without the `squared` flag, which newer scikit-learn removed
print(rmse)

In [None]:
df_test_X = df_test.drop(['id'], axis=1)
df_test_X.columns

In [None]:
df_test_preds = model.predict(df_test_X)

In [None]:
output = pd.DataFrame({'id': df_test.id, 'loss': df_test_preds})
output.to_csv('Tabular_submission.csv', index=False)

In [None]:
sub = pd.read_csv('./Tabular_submission.csv')
sub.shape

In [None]:
sub.head(10)