This notebook contains the application of the following regression models:
1. Linear Regression 
2. Ridge Regression 
3. Lasso Regression
4. Poisson Regression 
5. K-Neighbor Regressor
6. LGBM
7. XGB
8. Random Forest

In [None]:
import math
import numpy as np
import pandas as pd

import seaborn as sns
sns.set_theme(color_codes=True)
import matplotlib.pyplot as plt


from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.metrics import make_scorer, accuracy_score

from sklearn.linear_model import LinearRegression

# from cuml.ensemble import RandomForestRegressor as cuRFC
# import cudf
from sklearn.ensemble import RandomForestRegressor

from sklearn.neighbors import KNeighborsRegressor

from sklearn.ensemble import AdaBoostRegressor

from sklearn.linear_model import PoissonRegressor

from sklearn.linear_model import RidgeCV
from sklearn.linear_model import Ridge

from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso

import lightgbm as lgbm
from lightgbm import LGBMRegressor

import xgboost
from xgboost import XGBRegressor

In [None]:
# df = cudf.read_csv("../input/tabular-playground-series-jan-2021/train.csv", index_col = "id")
# tdf = cudf.read_csv("../input/tabular-playground-series-jan-2021/test.csv")

df = pd.read_csv(
    "../input/tabular-playground-series-jan-2021/train.csv", index_col="id"
    )
tdf = pd.read_csv("../input/tabular-playground-series-jan-2021/test.csv")

# sns.regplot(x= df.drop(columns = ["target"]), y = df["target"], data = df)
# sns.lmplot(x= df.drop(columns = ["target"]), y = df["target"], data = df)

In [None]:
df = df.astype("float32")
tdf = tdf.astype("float32")

In [None]:
X = df.drop(columns="target")
y = df["target"]

* To summarise the distribution of each column, pandas provides `describe()`.
* It reports the following statistics:
1. count (number of non-null values)
2. mean (arithmetic mean)
3. std (standard deviation)
4. min, max (minimum and maximum values)
5. 25%, 50%, 75% (first quartile, median and third quartile)


In [None]:
df.describe()

In [None]:
sns.set_style("dark")
sns.set_color_codes(palette="deep")
f, ax = plt.subplots(figsize=(9, 8))

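# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement.
sns.histplot(df["target"], color="c", kde=True, stat="density", ax=ax)

ax.xaxis.grid(False)
ax.set(ylabel="density")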
ax.set(xlabel="target")
plt.show()

Observing the correlation of the predictors

In [None]:
df.corr().style.background_gradient(cmap='Blues')

In [None]:
Corelation = sns.heatmap(df.corr(), cmap="YlGnBu")

* The cont variables 1, 6, 7, 8, 9, 10, 11, 12 and 13 are the most strongly inter-correlated.
* Note: the correlation between any single predictor and the target has a maximum absolute value of 0.067.
* Since this is close to 0, no single predictor has a strong linear relationship with the target, which suggests that a linear regression model will explain little of its variance.
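
The maximum absolute correlation with the target can be read directly off the correlation matrix. The sketch below uses a small synthetic frame (`demo`, with made-up columns `cont1` and `cont2`) as a stand-in for the competition data, so the numbers it prints are not the ones quoted above; only the `corr()` line is meant to carry over to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the competition data (an assumption for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "cont1": rng.normal(size=500),
    "cont2": rng.normal(size=500),
})
demo["target"] = 0.1 * demo["cont1"] + rng.normal(size=500)

# Absolute correlation of each predictor with the target, strongest first.
target_corr = (
    demo.corr()["target"].drop("target").abs().sort_values(ascending=False)
)
print(target_corr)
print("max |corr| with target:", target_corr.max())
```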


# Regression
* The following are mathematical models that predict a continuous outcome (result) based on one or more inputs (predictor variables).
## Linear Regression
* Simple approach for [supervised learning](https://en.wikipedia.org/wiki/Supervised_learning)
* We assume that the _true_ relationship between X and Y takes the form Y = f(X) + ϵ (f is an unknown function, ϵ is a mean-zero random error term)
* Linear regression approximates f with a straight line: Y = β0 + β1X + ϵ
    * β0 - intercept term (the expected value of Y when X = 0)
    * β1 - slope (the average increase in Y associated with a one-unit increase in X)
    * ϵ - catch-all for what we miss with this simple model: the true relationship is probably not linear, there may be other variables that cause variation in Y, and there may be measurement error
* Analysing each individual variable
* For estimating coefficients
    * We use the _least squares method_: choose the coefficients that minimise the RSS (Residual Sum of Squares)
* To quantify uncertainty: the RSE (Residual Standard Error) estimates the standard deviation of ϵ and is used to build confidence intervals
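
To make RSS and RSE concrete, here is a minimal sketch on synthetic data drawn from a known line. The data, the noise level and the variable names are assumptions chosen for illustration; nothing here uses the competition data.

```python
import numpy as np

# Synthetic data from a known line y = 2 + 3x + noise (illustrative only).
rng = np.random.default_rng(42)
n = 200
x = rng.uniform(0, 1, size=n)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=n)

# Design matrix with a column of ones for the intercept beta0.
A = np.column_stack([np.ones(n), x])

# Least squares: choose (beta0, beta1) minimising RSS = sum((y - A @ beta) ** 2).
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ beta
rss = float(residuals @ residuals)

# RSE estimates the standard deviation of the error term;
# the divisor is n - 2 because two coefficients were fitted.
rse = np.sqrt(rss / (n - 2))

print("beta0, beta1:", beta)
print("RSS:", rss, "RSE:", rse)
```

The fitted coefficients should land near the true values 2 and 3, and the RSE near the noise scale 0.5 used to generate the data.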



### Using [Cross validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) for model selection.

In [None]:
%%time

cv = cross_validate(
        estimator = LinearRegression(n_jobs = -1),
        X = df.drop(columns = ["target"]),
        y = df["target"],
        cv = 5,
        scoring = ["r2","neg_mean_squared_error"],
        verbose = True,
)


In [None]:
cv["test_neg_mean_squared_error"].mean()

Since the R2 values are very low, it is evident that Linear Regression does not explain the variance of our data well.
* The model is underfitting

In [None]:
%%time
# After cross-validation, fit the model on the full training set
model = LinearRegression(n_jobs=-1)
# when using GPU
# model = cuRFC(verbose=True)
model.fit(df.drop(columns=["target"]), df["target"])
# predict on the test set
pred = model.predict(tdf.drop(columns=["id"]))

In [None]:
ans = pd.DataFrame({"id": tdf["id"], "target": pred})
ans["id"] = ans["id"].astype(int)
# converting to submission file. Since we have set the id col, setting index = False
ans.to_csv("submission_LinearRegression.csv", index=False)

In [None]:
# "test_r2" is present in cv because "r2" was passed in scoring: take its mean over the folds
cv["test_r2"].mean()

The following are some ways in which the simple linear model can be improved,
by replacing plain least squares fitting with some alternative fitting procedures.
# Ridge
* We apply regularisation to our linear regression model.
* Regularisation shrinks the coefficients towards zero. As alpha increases, the coefficients of the predictors that matter most for the target tend to shrink the least.
* Because it shrinks the coefficients, it is often used to reduce the impact of multicollinearity.
* Uses the L2 regularisation technique.
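
The shrinkage effect can be seen on a small example. This is a sketch on synthetic data with two nearly collinear predictors; the data and the alpha values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with two nearly collinear predictors (illustrative only).
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)  # almost a copy of x1
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(size=300)

# Ridge minimises RSS + alpha * sum(beta_j ** 2); a larger alpha shrinks harder.
norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>8}: coef={coef}, L2 norm={norms[-1]:.4f}")
```

With a tiny alpha the two collinear coefficients can be unstable; as alpha grows, the penalty pulls them towards each other and towards zero.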

In [None]:
X = df.drop(columns=["target"])
y = df['target'] 
kf = KFold(n_splits=5)
kf.get_n_splits(X)
print(kf)

* Using K-fold approach to implement RidgeCV model
* RidgeCV will internally apply Cross-validation to choose the optimal value of tuning variable alpha

In [None]:
score = 0
for train_index, test_index in kf.split(X, df["target"]):
    print("TRAIN:", train_index, "TEST:", test_index)
    # train_index and test_index are integer row positions,
    # so we need iloc to select the rows.
    # iloc: axes left out of the specification are assumed to be :,
    # e.g. X.iloc[test_index] is equivalent to X.iloc[test_index, :].
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = RidgeCV().fit(X_train, y_train)

    y_pred = model.predict(X_test)
    score += mean_squared_error(y_test, y_pred)
# mean of MSE =  0.5275229
print((score / kf.get_n_splits(X)))


In [None]:
model = RidgeCV()
model.fit(df.drop(columns="target"), df["target"])
# Fit the data and get the optimal value of alpha chosen
# Here: 10
model.alpha_


In [None]:
# log-spaced from 1 to 100000 so the points are spread evenly on the log x-axis
alphas = np.logspace(0, 5, 100)
ridge = Ridge(max_iter=10000)
coefs = []

for a in alphas:
    ridge.set_params(alpha=a)
    ridge.fit(df.drop(columns=["target"]), df["target"])
    coefs.append(ridge.coef_)

ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale("log")
plt.axis("tight")
plt.xlabel("alpha")
plt.legend(X.columns,bbox_to_anchor=(0.85, -0.25), fancybox=True, shadow=True, ncol=3)
plt.ylabel("Coefficients")
plt.title("Ridge coefficients as a function of alpha")

* We can observe that as alpha increases, the magnitude of the coefficients decreases: the values approach zero but never become exactly zero

# Lasso
* Lasso is similar to ridge regression, but here the coefficients can become exactly 0
* Uses the L1 regularisation technique
* Because it zeroes out coefficients, it can be used for feature selection
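
The difference from ridge is easiest to see by counting exact zeros. The sketch below uses synthetic data in which only 2 of 10 predictors matter; the data and the choice `alpha=0.5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 2 of 10 predictors influence y (illustrative only).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 10))
y_demo = 4.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=400)

lasso_coef = Lasso(alpha=0.5).fit(X_demo, y_demo).coef_
ridge_coef = Ridge(alpha=0.5).fit(X_demo, y_demo).coef_

# Lasso sets some coefficients to exactly zero; ridge only makes them small.
print("Lasso zero coefficients:", int(np.sum(lasso_coef == 0.0)))
print("Ridge zero coefficients:", int(np.sum(ridge_coef == 0.0)))
print("Features kept by Lasso:", np.flatnonzero(lasso_coef))
```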

In [None]:
# Note: LassoCV internally performs cross-validation to choose the optimal value of the tuning variable alpha
cv = cross_validate(
    estimator=LassoCV(n_jobs=-1),
    X=df.drop(columns=["target"]),
    y=df["target"],
    verbose=1,
    return_train_score=True,
    scoring=["r2", "neg_mean_squared_error"],
    cv=5,
)

In [None]:
cv["test_neg_mean_squared_error"].mean()

* To better understand how the coefficients vary with the tuning variable, we will plot the coefficients as a function of alpha.

In [None]:
model = LassoCV(n_jobs=-1)
model.fit(df.drop(columns="target"), df["target"])
# Fit the data and get the optimal value of alpha chosen
model.alpha_


In [None]:
alphas = np.linspace(2.7800706909230952e-05, 0.01, 100)
lasso = Lasso(max_iter=10000)
coefs = []

for a in alphas:
    lasso.set_params(alpha=a)
    lasso.fit(df.drop(columns=["target"]), df["target"])
    coefs.append(lasso.coef_)

ax = plt.gca()

ax.plot(alphas, coefs)
ax.set_xscale("log")
plt.axis("tight")
plt.xlabel("alpha")
plt.legend(X.columns,bbox_to_anchor=(0.85, -0.25), fancybox=True, shadow=True, ncol=3)
plt.ylabel("Coefficients")
plt.title("Lasso coefficients as a function of alpha")

# Poisson Regression
* Poisson regression assumes the response variable Y has a Poisson distribution
* It assumes the logarithm of its expected value can be modeled by a linear combination of the predictors, so the response must be non-negative.
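
The log link can be checked by hand. This is a sketch on synthetic count data whose log-mean is linear in a single predictor; the data, `alpha=0.0` and `max_iter=300` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic counts whose log-mean is linear in x (illustrative only).
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=(2000, 1))
true_mean = np.exp(0.5 + 1.0 * x[:, 0])
counts = rng.poisson(true_mean)

# alpha=0 switches off the default L2 penalty, giving a plain maximum-likelihood fit.
poisson = PoissonRegressor(alpha=0.0, max_iter=300).fit(x, counts)

# The model predicts exp(intercept + coef * x), so predictions are always positive.
manual = np.exp(poisson.intercept_ + x @ poisson.coef_)
print("intercept, slope:", poisson.intercept_, poisson.coef_)
print("matches exp(linear predictor):", np.allclose(manual, poisson.predict(x)))
```

Because predictions are an exponential of the linear predictor, the model can never output a negative value, which is worth keeping in mind when the target is not count-like.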

In [None]:
cv = cross_validate(
    estimator=PoissonRegressor(),
    X=df.drop(columns=["target"]).astype("float32"),
    y=df["target"].astype("float32"),
    verbose=1,
    return_train_score=True,
    scoring=["r2", "neg_mean_squared_error", "neg_mean_poisson_deviance"],
    cv=3,
)


In [None]:
cv["test_neg_mean_squared_error"].mean()

# K-Neighbors Regressor

In [None]:
cv = cross_validate(
    estimator=KNeighborsRegressor(n_neighbors=3, n_jobs=-1),
    X=df.drop(columns="target"),
    y=df["target"],
    verbose=True,
    cv=5,
    scoring=["r2", "neg_mean_squared_error"],
    n_jobs=-1,
)

In [None]:
cv["test_neg_mean_squared_error"].mean()

# Boosting
* Boosting is a sequential process in which each subsequent model attempts to correct the errors of the previous one, so every model depends on its predecessors.
* When an observation is mispredicted by one hypothesis, its weight is increased so that the next hypothesis is more likely to predict it correctly. Combining the whole set at the end turns the weak learners into a better-performing model.
* The final model is the weighted mean of all the models (weak learners).
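
The sequential error-correcting idea can be sketched in a few lines. Note that this is the residual-fitting (gradient boosting) flavour rather than the reweighting scheme described above, and it is a hand-rolled illustration, not what LightGBM or XGBoost do internally. The data, the depth-1 trees and `learning_rate = 0.1` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(500, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.2, size=500)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # stage 0: predict the mean
train_mse = [np.mean((y_demo - prediction) ** 2)]

for stage in range(50):
    residual = y_demo - prediction  # errors of the ensemble so far
    stump = DecisionTreeRegressor(max_depth=1).fit(X_demo, residual)
    # Each weak learner corrects a fraction of the remaining error.
    prediction += learning_rate * stump.predict(X_demo)
    train_mse.append(np.mean((y_demo - prediction) ** 2))

print("training MSE after 0, 10, 50 stages:",
      train_mse[0], train_mse[10], train_mse[50])
```

The training error never increases from one stage to the next here, which is why boosted models need a validation set or early stopping to detect overfitting.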


# Light Gradient Boosting Model

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="target"), df["target"], test_size=0.15
)

param = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": "RMSE",
    "learning_rate": 0.0045,
}


model = LGBMRegressor(**param)
model.fit(X_train, y_train)


ypred2 = model.predict(X_test)

# Note: this prints the MSE on the held-out split; take the square root to get the RMSE.
print(mean_squared_error(y_test, ypred2))


## Light Gradient Boosting Model With k-fold

In [None]:
# Evaluate the same LGBM model with 5-fold cross-validation
X = df.drop(columns=["target"])
kf = KFold(n_splits=5)

param = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "metric": "RMSE",
    "learning_rate": 0.0045,
}

scores = []
for train_index, test_index in kf.split(X, df["target"]):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    model = LGBMRegressor(**param)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # RMSE on the held-out fold
    scores.append(np.sqrt(mean_squared_error(y_test, y_pred)))

# Mean RMSE across the folds
print(np.mean(scores))

# XGBoost
* XGBoost is short for “eXtreme Gradient Boosting.” 
* The “eXtreme” refers to speed enhancements such as parallel computing and cache awareness, which are reported to make XGBoost roughly 10 times faster than traditional gradient boosting.
* XGBoost is regularized, so default models often don’t overfit
* It has extensive hyperparameters for fine-tuning


In [None]:
cv = cross_validate(
    estimator=XGBRegressor(),
    X=df.drop(columns="target"),
    y=df["target"],
    scoring=["r2", "neg_mean_squared_error"],
    verbose=True,
    cv=5,
    n_jobs=-1,
)


In [None]:
cv["test_neg_mean_squared_error"].mean()

# Bagging

* Bagging is short for “bootstrap aggregation,” meaning that samples are chosen with replacement (bootstrapping), and combined (aggregated)
* Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction comes from only the few observations at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data. Averaging many deep trees, each grown on a different bootstrap sample, reduces this variance.
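
A hand-rolled sketch of bagging, assuming synthetic data and 25 unpruned trees (both chosen only for illustration): each tree is fitted on a bootstrap sample, and the predictions are averaged.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=300)
X_new = np.linspace(-3, 3, 50).reshape(-1, 1)

n_trees = 25
tree_preds = []
for _ in range(n_trees):
    # Bootstrap: draw n row indices with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    # An unpruned tree: low bias, high variance.
    tree = DecisionTreeRegressor().fit(X_demo[idx], y_demo[idx])
    tree_preds.append(tree.predict(X_new))

# Aggregate: average the individual trees' predictions.
tree_preds = np.array(tree_preds)
bagged_pred = tree_preds.mean(axis=0)

# Compare errors against the noise-free signal.
signal = np.sin(X_new[:, 0])
print("mean single-tree MSE:", np.mean((tree_preds - signal) ** 2))
print("bagged MSE:          ", np.mean((bagged_pred - signal) ** 2))
```

The averaged prediction can never have a higher squared error than the average error of the individual trees, which is the variance reduction that bagging relies on.
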
# Random Forest
* The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree via bagging. 
* It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. 
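* This model generates decorrelated trees by considering a fresh random sample of m predictors at each split (commonly m ≈ √p for classification and m ≈ p/3 for regression). Note that in recent scikit-learn versions `RandomForestRegressor` considers all p predictors by default, which is equivalent to bagged trees.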
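
The effect of restricting m can be tried with the `max_features` parameter. This is a sketch on synthetic data with p = 9 predictors; the data, `n_estimators=200` and `max_features="sqrt"` are assumptions for illustration, not the settings used in the cells of this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with p = 9 predictors (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 9))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.3, size=500)

forest = RandomForestRegressor(
    n_estimators=200,
    max_features="sqrt",  # consider about sqrt(p) = 3 predictors at each split
    oob_score=True,       # score each row using only trees that did not see it
    random_state=0,
    n_jobs=-1,
).fit(X_demo, y_demo)

print("training R^2:", forest.score(X_demo, y_demo))
print("out-of-bag R^2:", forest.oob_score_)
```

The out-of-bag score is computed on rows each tree did not train on, so it behaves like a validation score and is usually well below the training score.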


In [None]:
%%time 

cv = cross_validate(
    estimator=RandomForestRegressor(n_jobs=-1, verbose=True),
    #     estimator=cuRFC(verbose=True),
    X=df.drop(columns=["target"]),
    y=df["target"],
    cv=5,
    scoring=["r2", "neg_mean_squared_error"],
    verbose=True,
)

In [None]:
cv["test_neg_mean_squared_error"].mean()

In [None]:
# We use the OOB score (approximately a validation score) to compare the training and test error
# model = RandomForestRegressor(n_jobs=-1, verbose=True, oob_score = True)
# model.fit(X, y)
# # Training score
# model.score(X,y)
# # OOB score
# model.oob_score_
# * We see the training score = 0.87 while the test (OOB) score = 0.05
# * From the scores, we can say that our random forest is overfitting the training dataset
# * Note that the CV and OOB scores are almost the same

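# Negative mean squared errors
Values are the mean negative MSE returned by `cross_validate`, except those marked (MSE), which are plain positive MSE values computed with `mean_squared_error`.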
1. Linear Regression --> -0.5274229884147644
2. Ridge Regression --> (MSE) 0.5274229003487053
3. Lasso Regression --> -0.5274227619171142
4. Poisson Regression --> -0.5332794126312232
5. K-Neighbor Regressor --> -0.656536448001861
6. LGBM --> (MSE) 0.5258897237509819
7. XGB --> -0.49416557550430296
8. Random Forest --> -0.5009868281839414