To practice machine learning, I started working with this Tabular Playground dataset. This notebook is one of my first projects after a long time without building ML models, and I tried out some strategies I had read about.

I ran the following experiments:
1. Trained a random forest regressor with default hyperparameters to set a baseline model and score.
2. Used MinMaxScaler to rescale the features to a common range. I think this was a mistake, because min-max scaling is sensitive to outliers, which is exactly what I was trying to "fix".
3. Tried StandardScaler with "with_std=False" (centering only, without dividing by the standard deviation).
4. Tried a model without scaling the data. This was the best model so far => RandomForestRegressor(n_estimators=150, max_depth=70, min_samples_split=2, min_samples_leaf=4, max_features='sqrt', bootstrap=True, random_state=0)
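
The outlier concern in step 2 can be illustrated with a minimal sketch. The toy column below is made up for illustration (it is not taken from the competition data); it compares MinMaxScaler with the centering-only StandardScaler from step 3.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature column: four ordinary values and one outlier.
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling maps the minimum to 0 and the maximum to 1, so the
# outlier squeezes the ordinary values into a narrow band near 0.
x_minmax = MinMaxScaler().fit_transform(x)

# with_std=False only subtracts the mean; the spread is left unchanged.
x_centered = StandardScaler(with_std=False).fit_transform(x)

print(x_minmax.ravel().round(3))
print(x_centered.ravel())
```

Tree-based models split on feature thresholds, so a monotonic rescaling like either of these is not expected to change a random forest much, which is consistent with the unscaled model doing at least as well.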

I also one-hot encoded the target variable and used sklearn pipelines to train the model and make predictions.

There is a lot to improve here: the best leaderboard placement I got was around 900 (as of Jun 30). I used a randomized search with cross-validation (RandomizedSearchCV) to find the best hyperparameters for the model.

I will also try a random forest classifier to see how the results change.
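
As a rough sketch of that next step: a RandomForestClassifier can be fit on the raw class labels, and its predict_proba output already has one probability column per class, so the target would not need to be one-hot encoded. The data below is random and purely illustrative; the three class names and five features are assumptions, not the competition's.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 300 rows, 5 features, 3 string class labels.
X_demo = rng.normal(size=(300, 5))
y_demo = rng.choice(['Class_1', 'Class_2', 'Class_3'], size=300)

clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=4, random_state=0)
clf.fit(X_demo, y_demo)

# One row per sample, one column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X_demo[:5])
print(clf.classes_)
print(proba.shape)
```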

In [None]:
import numpy as np
import pandas as pd

from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

Loading the data and taking a look at the train set

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/train.csv')
test = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2021/test.csv')

In [None]:
train.head()

In [None]:
# Check whether either dataset contains NaN values.
# isna().any().sum() counts the columns that contain at least one NaN.
print("Columns with missing data in train set:", train.isna().any().sum())
print("Columns with missing data in test set:", test.isna().any().sum())
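
Note that isna().any().sum() counts columns that contain a NaN, not individual missing cells. A small made-up frame shows the difference; the column names here are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column 'a' has two NaNs, 'b' has one, 'c' has none.
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [np.nan, 2.0, 3.0],
                   'c': [1.0, 2.0, 3.0]})

cols_with_nan = df.isna().any().sum()   # number of columns with at least one NaN
cells_with_nan = df.isna().sum().sum()  # total number of missing cells

print(cols_with_nan, cells_with_nan)
```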

In [None]:
# Main statistics from train dataset
train.describe()

In [None]:
# Split the data into X and y, then one-hot encode y so that the model's
# predictions come out in the format required for the submission file
y = train.target
X = train.drop(['id','target'], axis=1)

encoder = OneHotEncoder(categories='auto')
y_enc = encoder.fit_transform(y.values.reshape(-1, 1)).toarray()

# Split validation set from training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y_enc, train_size=0.8, test_size=0.2, random_state=0)

In [None]:
# I used the lines below once to find the best hyperparameters with sklearn's randomized search

# from sklearn.model_selection import RandomizedSearchCV
# # Number of trees in random forest
# n_estimators = [int(x) for x in np.linspace(start = 50, stop = 200, num = 50)]
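# # Number of features to consider at every split
# # (originally ['auto', 'sqrt']; 'auto' was removed in scikit-learn 1.3 and,
# # for regressors, meant all features, i.e. 1.0)
# max_features = [1.0, 'sqrt']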
# # Maximum number of levels in tree
# max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
# max_depth.append(None)
# # Minimum number of samples required to split a node
# min_samples_split = [2, 5, 10]
# # Minimum number of samples required at each leaf node
# min_samples_leaf = [1, 2, 4]
# # Method of selecting samples for training each tree
# bootstrap = [True, False]
# # Create the random grid
# random_grid = {'n_estimators': n_estimators,
#                'max_features': max_features,
#                'max_depth': max_depth,
#                'min_samples_split': min_samples_split,
#                'min_samples_leaf': min_samples_leaf,
#                'bootstrap': bootstrap}


In [None]:
# # Use the random grid to search for best hyperparameters
# # First create the base model to tune
# rf = RandomForestRegressor()
# # Random search of parameters, using 3 fold cross validation, 
# # search across 100 different combinations, and use all available cores
# rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# # Fit the random search model
# rf_random.fit(X_train, y_train)

In [None]:
# rf_random.best_params_

In [None]:
# I then created a model with the best parameters found and used a Pipeline to train it.
model = RandomForestRegressor(n_estimators=150, max_depth=70, min_samples_split=2,
                              min_samples_leaf=4, max_features='sqrt',
                              bootstrap=True, random_state=0)

my_pipeline = Pipeline(steps=[('scale', StandardScaler()),
                              ('model', model)])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
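
MAE on the one-hot targets works as a sanity check, but as far as I know this competition is scored with multi-class log loss, so it may be worth tracking that on the validation set as well. The sketch below uses small made-up arrays in place of y_valid and preds; in this notebook the true labels could be recovered with y_valid.argmax(axis=1).

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical stand-ins for y_valid (one-hot) and preds (class probabilities).
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1]])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])

# Recover integer class labels from the one-hot rows.
y_labels = y_onehot.argmax(axis=1)

# Average negative log-probability assigned to the true class.
loss = log_loss(y_labels, proba, labels=[0, 1, 2])
print(round(loss, 4))
```

A forest can assign a probability of exactly 0 to the true class, which log loss penalizes heavily, so keeping predictions slightly away from 0 and 1 may help the score.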


In [None]:
# Remove the id column from the test data (same as for the training features).
test_ = test.drop(['id'], axis=1)

In [None]:
# Get predictions
output = my_pipeline.predict(test_)

In [None]:
pred_df = pd.DataFrame(output, columns=['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5',
                                        'Class_6', 'Class_7', 'Class_8', 'Class_9'])
pred_df['id'] = test['id'].values
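
Instead of typing the class names by hand, the column order can be read from the fitted encoder: OneHotEncoder stores the sorted categories in categories_, and the one-hot columns follow that order. A minimal sketch with made-up labels, ids and predictions (all names below are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical target column and test ids, for illustration only.
y_demo = np.array(['Class_2', 'Class_1', 'Class_3', 'Class_1'])
test_ids = [100, 101]

enc = OneHotEncoder()
y_demo_enc = enc.fit_transform(y_demo.reshape(-1, 1)).toarray()

# categories_ holds one array per encoded column; there is only one here.
class_cols = enc.categories_[0].tolist()

# Made-up predictions with one column per class, in encoder order.
demo_preds = np.array([[0.2, 0.5, 0.3],
                       [0.6, 0.1, 0.3]])
demo_df = pd.DataFrame(demo_preds, columns=class_cols)
demo_df.insert(0, 'id', test_ids)
print(demo_df)
```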

In [None]:
# pred_df

In [None]:
# Creating an output file to submit and verify competition score.
output = pred_df[['id','Class_1','Class_2','Class_3','Class_4','Class_5','Class_6','Class_7','Class_8','Class_9']]
output.to_csv('submission.csv', index=False)