## Trying out a linear model: 

Author: Alexandru Papiu ([@apapiu](https://twitter.com/apapiu), [GitHub](https://github.com/apapiu))
 
If you use parts of this notebook in your own scripts, please give some sort of credit (for example link back to this). Thanks!


There have been a few [great](https://www.kaggle.com/comartel/house-prices-advanced-regression-techniques/house-price-xgboost-starter/run/348739) [scripts](https://www.kaggle.com/zoupet/house-prices-advanced-regression-techniques/xgboost-10-kfolds-with-scikit-learn/run/357561) on [xgboost](https://www.kaggle.com/tadepalli/house-prices-advanced-regression-techniques/xgboost-with-n-trees-autostop-0-12638/run/353049) already, so I figured I'd try something simpler: a regularized linear regression model. Surprisingly, it does really well with very little feature engineering. The key point is to log-transform the numeric variables, since most of them are skewed.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib

import matplotlib.pyplot as plt
from scipy.stats import skew
from scipy.stats import pearsonr


%config InlineBackend.figure_format = 'retina' #set 'png' here when working on notebook
%matplotlib inline


In [None]:
train = pd.read_csv("input/train.csv")
test = pd.read_csv("input/test.csv")

In [None]:
train.head()

In [None]:
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))

### Data preprocessing: 
We're not going to do anything fancy here: 
 
- First I'll transform the skewed numeric features by taking log(feature + 1) - this brings their distributions closer to normal
- Create dummy variables for the categorical features
- Replace the missing numeric values (NaNs) with the mean of their respective columns
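
As a rough illustration of the three steps above, here is a minimal sketch on a tiny made-up frame. The `toy` DataFrame and its values are hypothetical and only borrow column names from the housing data; the 0.75 skewness cutoff is the one used later in this notebook.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

# Hypothetical toy data (not the competition data), with one missing value.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 215245.0, np.nan],
    "OverallQual": [7, 6, 7, 7, 5, 6],
    "Street": ["Pave", "Pave", "Grvl", "Pave", "Pave", "Grvl"],
})

# 1. Log-transform the numeric columns whose skewness exceeds 0.75.
numeric_cols = toy.select_dtypes(include="number").columns
skewness = toy[numeric_cols].apply(lambda col: skew(col.dropna()))
skewed_cols = skewness[skewness > 0.75].index
toy[skewed_cols] = np.log1p(toy[skewed_cols])

# 2. One-hot encode the categorical column(s).
toy = pd.get_dummies(toy)

# 3. Fill the remaining missing numeric values with the column mean.
toy = toy.fillna(toy.mean())

print(list(skewed_cols))
print(toy)
```

Only `LotArea` is transformed here, because the single very large lot makes it strongly right-skewed, while `OverallQual` is left as-is.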

In [None]:
matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)
prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()

In [None]:
#log transform the target:
train["SalePrice"] = np.log1p(train["SalePrice"])

#log transform skewed numeric features:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

In [None]:
all_data = pd.get_dummies(all_data)

In [None]:
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())

In [None]:
#creating matrices for sklearn:
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

In [None]:
from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, Lasso, LassoCV, LassoLarsCV
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    rmse= np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 5))
    return(rmse)

In [None]:
model_ridge = Ridge(alpha = 0.1)

The main tuning parameter for the Ridge model is alpha - a regularization parameter that controls how flexible the model is. The higher the regularization, the less prone the model is to overfitting. However, it also loses flexibility and might not capture all of the signal in the data.
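
To make this trade-off concrete, the sketch below fits Ridge at several alpha values and reports the size of the coefficient vector next to a 5-fold cross-validated RMSE. It uses synthetic data from `make_regression` as a stand-in for `X_train` and `y`; the names `X_demo`, `y_demo`, `rmse_cv_demo` and the alpha grid are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing matrices (assumption for illustration).
X_demo, y_demo = make_regression(n_samples=200, n_features=50,
                                 n_informative=10, noise=10.0, random_state=0)

def rmse_cv_demo(model):
    # Same idea as rmse_cv above: 5-fold CV, scored by root mean squared error.
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    return np.sqrt(-scores)

alphas_demo = [0.05, 1, 10, 100, 1000]
coef_norms, cv_rmse = [], []
for alpha in alphas_demo:
    model = Ridge(alpha=alpha)
    cv_rmse.append(rmse_cv_demo(model).mean())
    # Larger alpha shrinks the coefficients toward zero.
    coef_norms.append(np.linalg.norm(model.fit(X_demo, y_demo).coef_))

for alpha, norm, err in zip(alphas_demo, coef_norms, cv_rmse):
    print(f"alpha={alpha:>7}: ||coef||={norm:8.2f}  CV RMSE={err:8.2f}")
```

The coefficient norm shrinks as alpha grows, while the cross-validated error shows where the extra regularization starts to cost more than it helps. Scoring with cross-validation rather than on the training set is the design choice that makes the comparison meaningful.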

In [None]:
from sklearn.metrics import mean_squared_error
model_ridge.fit(X_train,y)
pred = model_ridge.predict(X_train) # using X_train since X_test has no SalePrice to compare to
rmse = np.sqrt(mean_squared_error(y,pred))

In [None]:
print("Ridge Pred: " + str(pred))
print("Ridge RMSE: " + str(rmse))

In [None]:
model_lasso = Lasso(alpha = 0.1)
model_lasso.fit(X_train, y)
pred = model_lasso.predict(X_train) # using X_train since X_test has no SalePrice to compare to
rmse = np.sqrt(mean_squared_error(y,pred))

In [None]:
print("Lasso Pred: " + str(pred))
print("Lasso RMSE: " + str(rmse))

In [None]:
model_ridge = RidgeCV(alphas = [0.0005, 0.001, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]).fit(X_train, y)
best_alpha = model_ridge.alpha_
model_ridge = Ridge(alpha=best_alpha)
model_ridge.fit(X_train,y)
print("Best alpha for Ridge:  " + str(best_alpha))
print("Score from best alpha: " + str(model_ridge.score(X_train, y)))

In [None]:
model_lasso = LassoCV(alphas = [0.0005, 0.001, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]).fit(X_train, y)
best_alpha = model_lasso.alpha_
model_lasso = Lasso(alpha=best_alpha)
model_lasso.fit(X_train,y)
print("Best alpha for Lasso:  " + str(best_alpha))
print("Score from best alpha: " + str(model_lasso.score(X_train, y)))

Nice! The lasso performs even better so we'll just use this one to predict on the test set. Another neat thing about the Lasso is that it does feature selection for you - setting coefficients of features it deems unimportant to zero. Let's take a look at the coefficients:
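
The feature-selection behaviour can be seen in isolation with a small sketch on synthetic data, where only 5 of 30 features carry signal. The data, the names `X_demo`, `y_demo`, `lasso_demo`, and the choice of `alpha=1.0` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 30 features, only 5 of them informative.
X_demo, y_demo = make_regression(n_samples=200, n_features=30,
                                 n_informative=5, noise=5.0, random_state=0)

lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)
coef = lasso_demo.coef_

# Features whose coefficient was set exactly to zero are effectively dropped.
n_kept = int(np.count_nonzero(coef))
n_dropped = coef.size - n_kept
print(f"Lasso kept {n_kept} features and dropped {n_dropped}")
print("Indices of kept features:", np.flatnonzero(coef))
```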

In [None]:
alphas = [0.0005, 0.001, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
l0_norms = []

for alpha in alphas:
    model_lasso = Lasso(alpha=alpha)
    model_lasso.fit(X_train, y)
    # number of non-zero coefficients (the L0 "norm") at this alpha
    l0_norms.append(np.count_nonzero(model_lasso.coef_))

plt.plot(alphas, l0_norms)
plt.show()
plt.plot(alphas, l0_norms)
plt.xscale('log')
plt.show()

In [None]:
model_lasso = Lasso(alpha=10)
model_lasso.fit(X_train, y)
output_lasso = model_lasso.predict(X_train)
model_ridge = Ridge(alpha=0.0005)
model_ridge.fit(X_train, y)
output_ridge = model_ridge.predict(X_train)

In [None]:
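# work on a copy so the new columns are not assigned to a slice of all_data
X_train = X_train.copy()
X_train['lasso'] = pd.Series(output_lasso, index=X_train.index)
X_train['ridge'] = pd.Series(output_ridge, index=X_train.index)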

model_ridge = Ridge(alpha=0.0005)
model_ridge.fit(X_train, y)
print("Ensemble Ridge Score: " + str(model_ridge.score(X_train,y)))

In [None]:
import sys
!{sys.executable} -m pip install xgboost

In [None]:
import xgboost as xgb

In [None]:
#re-creating the matrices for sklearn (without the 'lasso' and 'ridge' columns added above):
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

In [None]:
model_xgb = xgb.XGBRegressor(max_depth=2) #the params were tuned using xgb.cv
model_xgb.fit(X_train, y)

In [None]:
xgb_preds = model_xgb.predict(X_train)

#R^2 computed by hand: 1 - MSE(model) / MSE(always predicting the mean)
u = mean_squared_error(y,xgb_preds)
true_mean = [y.mean()] * y.shape[0]
v = mean_squared_error(y,true_mean)

score = 1 - (u/v)
print("XGB Score: " + str(score))