# Intro

In [None]:
import this

Hello, my name is Russell (JapanColorado). I'm new to the field of machine learning, and I'm excited to learn all that I can! I've always been interested in math and programming, so machine learning/AI seemed like the natural next step. To learn machine learning, I'm following Aurélien Géron's *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*, as well as a number of Kaggle Learn courses, competitions, and other similar MOOCs. I'm excited to start my machine learning journey and to interact with the community! This notebook serves as both an introduction and my exercise for chapter 2 of Hands-On ML, so without further ado, let's jump into the code!

By the way, some of you might wonder why I'm using the August 2021 Tabular Playground data for my first notebook. The motivation is to save the Titanic competition for after the chapter on classifiers, and the housing prices one for once I know more advanced algorithms and basic EDA. (Since there are no descriptive column labels in the Tabular Playground data, there's no real need for EDA beyond a correlation matrix or so. Of course, if I'm wrong about that, please correct me.) I'm following the machine learning checklist from *Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow*. Since I'm only on chapter 2, some of the algorithms and data inspection/preparation will be quite basic. Any feedback for improvement is much appreciated!

OK, now for the code!

## Setup (De Facto Step 1)

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression,SGDRegressor,ElasticNet,Lasso
from sklearn.svm import LinearSVR,SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

P.S.A. I'm skipping Step 1 (Frame the Problem), as the data is artificially generated, with no real practical use beyond machine learning practice.

## **Step Two:** Get the Data

First things first, we have to load the data:

In [None]:
train_data = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/train.csv", index_col="id")
test_data = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/test.csv", index_col="id")
train_data.head()

## **Step Two:** Explore the Data

In this step, we'll do some basic data exploration, see if there are any categorical values or missing values, graph correlations, etc.

In [None]:
s = (train_data.dtypes == 'object')
object_cols = list(s[s].index)

print(train_data.info())
print([col for col in train_data.columns if train_data[col].isnull().any()])
print(object_cols)
print(train_data.shape)

Seems pretty simple. No missing values, no categorical features. That means the most we'll probably have to do for data preparation is normalization/standardization. Alright, let's look at some graphs!

In [None]:
train_data.hist(bins=50,figsize=(35,30))
plt.show()

Hmm... There seems to be a wide variety of distribution shapes, as well as scales, ranging from the tens of thousands all the way down to decimals. My guess is that standardization will perform better, considering the wide range and the number of outliers, but we can try normalization as well. Most notably, loss (the target) seems to be very tail-heavy. Do I know how to fix that? Not really! Let's move on to the correlation matrix!
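On the tail-heavy target: one common option (not something this notebook does) is to train on a log-transformed target and invert the transform before scoring. Here's a minimal sketch, assuming the target is non-negative. The `toy_loss` array and the `sample_skewness` helper are made up for illustration and are not the competition data.

```python
import numpy as np

def sample_skewness(values):
    """Third standardized moment: a large positive value means a heavy right tail."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3

# Hypothetical stand-in for a tail-heavy target column
rng = np.random.default_rng(0)
toy_loss = rng.lognormal(mean=1.5, sigma=0.8, size=10_000)

log_target = np.log1p(toy_loss)   # log(1 + y), safe when y contains zeros
restored = np.expm1(log_target)   # inverse transform, applied to predictions

skew_raw = sample_skewness(toy_loss)
skew_log = sample_skewness(log_target)
print(f"skewness before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
```

If a model is trained on `log_target`, its predictions would need `np.expm1` applied before computing RMSE, since the competition metric is on the original scale.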

In [None]:
corr_matrix = train_data.corr()
corr_matrix["loss"].sort_values(ascending=False)
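Sorting the signed correlations puts strongly negative features at the bottom of the list, so it can help to rank by absolute value instead. A small sketch on made-up data (the `toy` frame and its `f0`..`f3` columns are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(1_000, 4)), columns=["f0", "f1", "f2", "f3"])
# f0 has a strong positive effect, f2 a weaker negative one
toy["loss"] = 2.0 * toy["f0"] - 0.5 * toy["f2"] + rng.normal(size=1_000)

# Drop the target's correlation with itself, then rank by magnitude
corr_with_target = toy.corr()["loss"].drop("loss")
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```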

Again, there seems to be a lot of variety, and no features are strongly correlated with the target. It seems like the best solution will probably be throwing a bunch of algorithms at the problem and seeing what sticks, since there is no practical way (to my knowledge) to do feature engineering. Let's go ahead and separate the data into X and y, as well as make a pipeline for normalization and one for standardization, then we'll start with models.

In [None]:
y = train_data.loss
X = train_data.drop("loss", axis=1)

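# Note: Normalizer rescales each row (sample) to unit norm;
# it does not rescale each feature to a fixed range.
norm_pipeline = Pipeline([
    ("normalization", Normalizer())
])
# StandardScaler centers each feature to mean 0 and scales it to unit variance.
stan_pipeline = Pipeline([
    ("standardization", StandardScaler())
])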
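One thing worth checking here is what each scaler actually does. As far as I can tell, scikit-learn's `Normalizer` works per row (each sample is scaled to unit length), while `StandardScaler` and `MinMaxScaler` work per column. `MinMaxScaler` is not used in this notebook; it appears below only for comparison. A tiny sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

standardized = StandardScaler().fit_transform(X_demo)   # per column: mean 0, std 1
min_maxed = MinMaxScaler().fit_transform(X_demo)        # per column: range [0, 1]
row_normalized = Normalizer().fit_transform(X_demo)     # per row: Euclidean norm 1

print(standardized.mean(axis=0), standardized.std(axis=0))
print(min_maxed.min(axis=0), min_maxed.max(axis=0))
print(np.linalg.norm(row_normalized, axis=1))
```

With `Normalizer`, the large second feature dominates every row, so the first feature ends up close to zero everywhere.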

Alright, now for the fun part!

## **Step 3:** Shortlist Promising Models

Alright, for this step, my plan is to write a function that will train many different models using default parameters on both normalized and standardized data, see which preprocessing performs best, which models perform best, and hopefully find some models that will perform well after further tuning. Let's get started!

In [None]:
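def quick_model_eval(models, preprocessor, random_state=69, X=X, y=y):
    """Print the 3-fold cross-validated RMSE of each model behind the given preprocessor."""
    for model in models:
        # Only seed models that actually expose a random_state parameter
        if "random_state" in model.get_params():
            model.set_params(random_state=random_state)
        model_pipeline = Pipeline([
            ("preprocessor", preprocessor),
            ("model", model)
        ])
        # The scorer returns negated RMSE (higher is better), so flip the sign back
        score = -1 * cross_val_score(model_pipeline, X, y, cv=3, n_jobs=-1, scoring="neg_root_mean_squared_error")
        print(f"The {model} scored: {score}")
        print(f"The {model}'s mean score was: {score.mean()}")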
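To make the sign flip in the function above concrete, here's a small sketch on synthetic data (the `X_toy`/`y_toy` arrays are made up). scikit-learn scorers follow a "higher is better" convention, so `neg_root_mean_squared_error` returns negative numbers that need negating to read as RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=300)

toy_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),
    ("model", LinearRegression()),
])

# One score per fold; each fold refits the scaler and the model from scratch
raw_scores = cross_val_score(toy_pipeline, X_toy, y_toy, cv=3,
                             scoring="neg_root_mean_squared_error")
rmse_per_fold = -raw_scores
print(raw_scores, rmse_per_fold)
```

Note that with `cv=3` the whole pipeline is fitted three times, which is part of why cross validation costs more than a single holdout split.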

Alright, let's try it out!

In [None]:
#quick_model_eval([RandomForestRegressor()],stan_pipeline)

Alright, so cross validation is taking TOO LONG. Linear regression was fine, but running the function on a random forest never finished, and I waited 33 minutes. If it can't handle a random forest with default parameters, it certainly won't be able to handle XGBoost. (If anyone knows a reason why cross validation might be failing beyond the amount of data, let me know.) For thoroughness, here's the function on linear regression:

In [None]:
quick_model_eval([LinearRegression()],stan_pipeline)

And while that doesn't take too long, as I said, it probably won't be able to handle complex algorithms like XGBoost. So let's implement a new function, but with holdout validation instead!

In [None]:
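def quicker_model_eval(models, preprocessor, random_state=69, X=X, y=y):
    """Print the holdout RMSE of each model using a single 80/20 train/validation split."""
    # Seed the split so every run compares models on the same validation rows
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=random_state)
    X_train_trans = preprocessor.fit_transform(X_train)
    X_valid_trans = preprocessor.transform(X_valid)
    for model in models:
        # Only seed models that actually expose a random_state parameter
        if "random_state" in model.get_params():
            model.set_params(random_state=random_state)
        model.fit(X_train_trans, y_train)
        predictions = model.predict(X_valid_trans)
        # RMSE as the square root of MSE; this does not rely on the `squared` argument
        score = np.sqrt(mean_squared_error(y_valid, predictions))
        print(f"The {model} scored: {score}")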
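For reference, RMSE is just the square root of the mean squared error, so it can be computed without depending on the `squared` argument of `mean_squared_error` (which, as far as I know, newer scikit-learn releases have deprecated). A tiny worked example with made-up numbers:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

# Squared errors are 1, 0, and 4, so the MSE is 5 / 3
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

# The same value computed by hand
rmse_by_hand = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(mse, rmse, rmse_by_hand)
```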

Alright, let's try this out. This time, I'm going to start with linear regression.

In [None]:
quicker_model_eval([LinearRegression()],stan_pipeline)

OK, so that was a lot quicker, and the results are pretty similar to cross validation's. Let's try out a random forest.

In [None]:
#quicker_model_eval([RandomForestRegressor(random_state=69)],stan_pipeline)

So interestingly, this is still taking quite a while. I looked around on Stack Overflow, and I think I've figured out why. I haven't constrained the algorithm, and since it is an ensemble method training on quite a large amount of data (250000 x 101 values, including targets), the problem is that every tree has to train on a bootstrap sample as large as the whole training set, and the default forest builds 100 of them. So I'm going to try my original cross validation function as well as my new holdout validation one on a random forest with a fixed, small number of trees (e.g. 5) and use that for quick shortlisting, then find the optimal number of trees later. Let's try it!

In [None]:
quicker_model_eval([RandomForestRegressor(n_estimators=5)],stan_pipeline)
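Besides `n_estimators`, a few other `RandomForestRegressor` parameters limit how much work each tree does. Here's a hedged sketch on synthetic data; the specific values are arbitrary picks for illustration, not tuned settings, and `X_toy`/`y_toy` are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2_000, 10))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(size=2_000)

small_forest = RandomForestRegressor(
    n_estimators=5,    # number of trees to build
    max_depth=8,       # stop growing each tree at this depth
    max_samples=0.2,   # each tree draws a bootstrap sample of 20% of the rows
    n_jobs=-1,         # build trees in parallel on all CPU cores
    random_state=0,
)
small_forest.fit(X_toy, y_toy)

tree_depths = [tree.get_depth() for tree in small_forest.estimators_]
print(len(small_forest.estimators_), tree_depths)
```

Constraining depth and sample size trades some accuracy for speed, so this kind of setup seems most useful for quick shortlisting rather than a final model.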

In [None]:
#quick_model_eval([RandomForestRegressor(random_state=69, n_estimators=5)],stan_pipeline)

Alright, so the holdout validation function took a lot less time (~10 minutes), while the cross validation never finished. I think we can go ahead and try this out on some promising models! (By the way, I'm following sklearn's algorithm cheat sheet, but also trying out some of my own choices :P)

In [None]:
quicker_model_eval([
    Lasso(),
    SGDRegressor(),
    ElasticNet(),
    RandomForestRegressor(n_estimators=5),
    XGBRegressor(tree_method='gpu_hist', gpu_id=0),
    LinearSVR(),
    #SVR(),
    KNeighborsRegressor()], stan_pipeline)

Alright, so it looks like results are quite varied across the board. Some of the best performers were Lasso, SGD, ElasticNet, and XGBoost, and this is going to be my shortlist. While I could definitely make the random forest perform much better, the amount of time it takes to train, as well as the fact that it can't utilize the GPU, means I'm not going to include it in my shortlist. Meanwhile, LinearSVR and KNeighbors both performed terribly, and the regular SVR hasn't finished training. Alright, I think we should look at the variables that Lasso, SGD, and ElasticNet prioritized, then we can move on to model tuning. Oh wait, these models were created inside the function call, so I can't inspect them afterwards... Let me redefine the function to print out the coefficients of each model. (By the way, if anyone thinks I should have tried a different model, let me know! Feedback is always welcome.)

In [None]:
def quicker_model_eval_coef(models, preprocessor, random_state=69, X=X, y=y):
    """Like quicker_model_eval, but also print each linear model's coefficients."""
    # Seed the split so every run compares models on the same validation rows
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=random_state)
    X_train_trans = preprocessor.fit_transform(X_train)
    X_valid_trans = preprocessor.transform(X_valid)
    for model in models:
        # Only seed models that actually expose a random_state parameter
        if "random_state" in model.get_params():
            model.set_params(random_state=random_state)
        model.fit(X_train_trans, y_train)
        predictions = model.predict(X_valid_trans)
        # RMSE as the square root of MSE; this does not rely on the `squared` argument
        score = np.sqrt(mean_squared_error(y_valid, predictions))
        # A coefficient of exactly 0 means the model ignores that feature
        coefficients = pd.Series(model.coef_, index=X.columns)
        print(f"The {model} scored: {score}")
        print(f"The {model}'s coefficients:\n{coefficients}, with {(coefficients != 0).sum()} picked, and the other {(coefficients == 0).sum()} rejected\n")
        #Credit to Alexandru Papiu for his wonderful notebook on Regularized Linear Models for housing, which showed me that you could look at the coefficients for linear models

In [None]:
quicker_model_eval_coef([
    Lasso(),
    SGDRegressor(),
    ElasticNet()], stan_pipeline)

So interestingly, the lasso and elastic net models have 0 features picked. Whether this means that they just supply a dummy value, or something else is going on, is beyond me. If anyone has any feedback on why this might be, please let me know. Oddities aside, let's move on to model tuning!
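One plausible explanation (an assumption, not something verified on the competition data): with standardized features, Lasso sets every coefficient to exactly zero whenever the penalty `alpha` is larger than the strongest feature/target covariance, and the default is `alpha=1.0`. The model then predicts its intercept, which is the mean of the training target. A sketch on synthetic data with one weak signal (`X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = StandardScaler().fit_transform(rng.normal(size=(2_000, 5)))
# Only the first feature carries a (weak) signal; the rest is noise
y_toy = 7.0 + 0.3 * X_toy[:, 0] + rng.normal(scale=3.0, size=2_000)

# Smallest alpha at which all Lasso coefficients become zero
alpha_max = np.abs(X_toy.T @ (y_toy - y_toy.mean())).max() / len(y_toy)

strong = Lasso(alpha=1.0).fit(X_toy, y_toy)    # default penalty
weak = Lasso(alpha=0.01).fit(X_toy, y_toy)     # much lighter penalty

print(f"alpha_max: {alpha_max:.3f}")
print("alpha=1.0 coefficients:", strong.coef_)
print("alpha=0.01 coefficients:", weak.coef_)
```

If the same thing is happening in the notebook, trying smaller `alpha` values should bring some coefficients back, which is what the tuning step further down explores.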

## **Step 4:** Fine-Tune the System

Alright, now it's time to try and boost the performance of the shortlisted models. Each has at least a couple of hyperparameters that can be tuned (especially XGBoost), so let's see if we can improve the score a *bit* (keep in mind that the top score is 7.7955, so there's probably not going to be any dramatic improvement). Alright! Let's start by identifying what hyperparameters each learning algorithm has; we can do this through sklearn's get_params() method:

In [None]:
models = [Lasso(), SGDRegressor(), ElasticNet(), XGBRegressor(tree_method='gpu_hist', gpu_id=0)]
for model in models:
    print(f"{model.get_params()}\n")
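Since the full `get_params()` output is long, it can help to pull out a single entry. This small sketch compares the default `alpha` of the three linear models and shows `set_params()`, the counterpart that the search classes use to try new values. The value 0.01 is an arbitrary pick for illustration.

```python
from sklearn.linear_model import ElasticNet, Lasso, SGDRegressor

# Map each model's class name to its default alpha
default_alphas = {
    type(model).__name__: model.get_params()["alpha"]
    for model in (Lasso(), SGDRegressor(), ElasticNet())
}
print(default_alphas)

# set_params returns the estimator itself, so it can be chained
weaker_lasso = Lasso().set_params(alpha=0.01)
print(weaker_lasso.get_params()["alpha"])
```

The same parameter name does not imply the same default scale across models, which is worth keeping in mind when reusing one grid of values for all three.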

Alright, so my goal is to have about 2-3 hyperparameters per algorithm. What I've narrowed it down to is this:
* Lasso:
  * Alpha(Float)
  * Fit Intercept(Bool)
* SGDRegressor:
  * Alpha(Float)
  * Fit Intercept(Bool)
  * Early Stopping(Bool)
* ElasticNet:
  * Alpha(Float)
  * Fit Intercept(Bool)
* XGBoost:
  * N Estimators(Int)
  * Early Stopping Rounds(Int)
  * Learning Rate(Float)

My plan is to define a parameter tuner function to find the hyperparameters. Let's also not forget to treat the normalization and standardization pipelines as hyperparameters. Let's go ahead and start!

In [None]:
lasso_params = [
    {"fit_intercept":[True, False],"alpha":[0.0001, 0.001, 0.01, 0.1, 1, 10]}
]

SGD_params = [
    {"fit_intercept":[True, False],"alpha":[0.0001, 0.001, 0.01, 0.1, 1, 10], "early_stopping":[True, False]}
]

elastic_params = [
    {"fit_intercept":[True, False],"alpha":[0.0001, 0.001, 0.01, 0.1, 1, 10]}
]

XGB_params = [
    {"n_estimators":[5, 10, 20, 30, 50, 100, 1000, 5000],"early_stopping_rounds":[3, 5, 10],"learning_rate":[0.0001, 0.001, 0.01, 0.1, 0.3, 0.7]}
]

def model_param_tuner(model, preprocessor, tuner, params, random_state=69, cv=5, X=X, y=y):
    """Tune `model` with the given search class, then report CV and holdout RMSE."""
    # Seed the split so every run tunes and validates on the same rows
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=random_state)
    X_train_trans = preprocessor.fit_transform(X_train)
    X_valid_trans = preprocessor.transform(X_valid)
    # Only seed models that actually expose a random_state parameter
    if "random_state" in model.get_params():
        model.set_params(random_state=random_state)
    tuned_model = tuner(model, params, cv=cv,
                       scoring='neg_mean_squared_error')
    tuned_model.fit(X_train_trans, y_train)
    print(f"The {model}'s ideal parameters were: {tuned_model.best_params_}")
    print(f"The preprocessor was: {preprocessor}")
    print("These were the scores: \n")
    cv_results = tuned_model.cv_results_
    # Use a separate loop name so the `params` argument is not shadowed
    for mean_score, candidate in zip(cv_results["mean_test_score"], cv_results["params"]):
        # Scores are negated MSE, so negate and take the root to get RMSE
        print(np.sqrt(-mean_score), candidate)
    predictions = tuned_model.predict(X_valid_trans)
    score = np.sqrt(mean_squared_error(y_valid, predictions))
    print(f"The {model}'s validation score: {score}")
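The idea of treating the preprocessor as a hyperparameter can also be expressed inside the search itself, because a `Pipeline` step can be swapped through the parameter grid. This is a sketch of that alternative on synthetic data, not what the function above does; `X_toy`, `y_toy`, and the grid values are made up for illustration. A side benefit is that the scaler is refit within each fold rather than once on the whole training split.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer, StandardScaler

rng = np.random.default_rng(0)
# Four features on very different scales
X_toy = rng.normal(loc=5.0, scale=[1, 10, 100, 1000], size=(400, 4))
y_toy = 2.0 * X_toy[:, 0] + 0.1 * X_toy[:, 1] + rng.normal(size=400)

search_pipeline = Pipeline([
    ("preprocessor", StandardScaler()),   # placeholder, replaced by the grid
    ("model", Lasso()),
])
param_grid = {
    "preprocessor": [StandardScaler(), Normalizer()],   # swap the whole step
    "model__alpha": [0.001, 0.01, 0.1],                 # step name + "__" + parameter
}

search = GridSearchCV(search_pipeline, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(X_toy, y_toy)
print(search.best_params_)
```

Because the fitted search wraps the whole pipeline, `search.predict` can be called on raw, unscaled features.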

Let's go ahead and test it:

In [None]:
model_param_tuner(Lasso(), stan_pipeline, GridSearchCV, lasso_params)

Alright, it works! Let's go ahead and run it on all the models with both normalization and standardization pipelines:

In [None]:
for processor in [stan_pipeline, norm_pipeline]:
    model_param_tuner(Lasso(), processor, RandomizedSearchCV, lasso_params)
    model_param_tuner(SGDRegressor(), processor, RandomizedSearchCV, SGD_params)
    model_param_tuner(ElasticNet(), processor, RandomizedSearchCV, elastic_params)
    model_param_tuner(XGBRegressor(tree_method="gpu_hist", gpu_id=0), processor, RandomizedSearchCV, XGB_params)

As expected, standardization performed slightly better than normalization, with XGBoost (unsurprisingly) performing the best. Let's go ahead and submit the results! I'll be retuning the XGBRegressor model with GridSearchCV and standardization for submission.

In [None]:
def model_param_tuner_returned(model, preprocessor, tuner, params, test_set=test_data, random_state=69, cv=5, X=X, y=y):
    """Same as model_param_tuner, but return the fitted search object for later predictions."""
    # Seed the split so every run tunes and validates on the same rows
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=random_state)
    # Note: this fits `preprocessor` in place, so it can be reused to transform new data
    X_train_trans = preprocessor.fit_transform(X_train)
    X_valid_trans = preprocessor.transform(X_valid)
    # Only seed models that actually expose a random_state parameter
    if "random_state" in model.get_params():
        model.set_params(random_state=random_state)
    tuned_model = tuner(model, params, cv=cv,
                       scoring='neg_mean_squared_error')
    tuned_model.fit(X_train_trans, y_train)
    print(f"The {model}'s ideal parameters were: {tuned_model.best_params_}")
    predictions = tuned_model.predict(X_valid_trans)
    score = np.sqrt(mean_squared_error(y_valid, predictions))
    print(f"The {model}'s validation score: {score}")
    return tuned_model
xgb_tuned = model_param_tuner_returned(XGBRegressor(tree_method="gpu_hist", gpu_id=0), stan_pipeline, GridSearchCV, XGB_params)

In [None]:
submissions = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/sample_submission.csv")
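# The model was trained on standardized features, so scale the test set with the same fitted pipeline
submissions["loss"] = xgb_tuned.predict(stan_pipeline.transform(test_data))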
submissions.to_csv("submission.csv", index = False)
submissions.head()
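Because the model is trained on scaled features, new data needs the same scaling before prediction. One way to make that hard to forget is to keep the scaler and the model together in a single `Pipeline`. Here's a sketch of that design on synthetic data, using `LinearRegression` as a stand-in model; all the `*_toy` names are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50.0, scale=20.0, size=(500, 3))
y_train_toy = X_train_toy @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=500)
X_test_toy = rng.normal(loc=50.0, scale=20.0, size=(100, 3))

full_model = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
full_model.fit(X_train_toy, y_train_toy)

# The pipeline scales the raw test features before the model sees them
pipeline_preds = full_model.predict(X_test_toy)

# Equivalent manual route: transform first, then call the bare model
scaler = full_model.named_steps["scaler"]
bare_model = full_model.named_steps["model"]
manual_preds = bare_model.predict(scaler.transform(X_test_toy))

# What happens if the scaling step is skipped at prediction time
unscaled_preds = bare_model.predict(X_test_toy)
print(np.abs(pipeline_preds - unscaled_preds).max())
```

The first two routes agree, while skipping the transform gives different (and wrong) predictions.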

## **Conclusion**

Thanks to everyone who got through this notebook! I know my code is hard to read, convoluted, and doesn't follow DRY (Don't Repeat Yourself), but it was still fun to start writing some code and experimenting along the way. I hope to keep improving, and this is definitely not the last you'll see of me on Kaggle. Again, thanks, and I can't wait to keep on improving! Feedback is welcome, and I'll try to implement suggestions in future code. Thanks!