# Prediction Model

We will build several regression models that try to predict housing prices from the features in our cleaned datasets. The models will be evaluated on the root mean squared error (RMSE) of their predictions against the validation set. Once the best production model is found, it will be retrained on the entire training dataset, and its test set predictions will be submitted to Kaggle to determine the actual test score.
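
As a reference for the metric named above, here is a minimal sketch of how RMSE is computed. The function name `rmse` and the sale prices are hypothetical and only serve as an illustration; the notebook itself relies on scikit-learn's scorers.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical sale prices and predictions, for illustration only
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 320000]
example_rmse = rmse(actual, predicted)
print(round(example_rmse, 2))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result stays in the same unit as the sale price.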

## Contents:
- [Regression Modelling](#Regression-Modelling)
- [Initial Kaggle Submission](#Initial-Kaggle-Submission)
- [Model Improvement](#Model-Improvement)
- [Final Kaggle Submission](#Final-Kaggle-Submission)
- [Conclusion](#Conclusion)

------------------------------------------------

In [None]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNet, ElasticNetCV
from sklearn.feature_selection import RFE, RFECV
import sklearn.metrics as metrics
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold

In [None]:
train=pd.read_csv('./datasets/training_model.csv')
valid=pd.read_csv('./datasets/validation_model.csv')
test=pd.read_csv('./datasets/test_kaggle.csv')
ori_test=pd.read_csv('./datasets/test.csv')

In [None]:
X_train = train.drop(columns='saleprice')
X_valid = valid.drop(columns='saleprice')
y_train = train['saleprice']
y_valid = valid['saleprice']

X_test = test

In [None]:
# Checking the number of columns for each dataset
print(f'Number of columns for X_train is {X_train.shape[1]}')
print(f'Number of columns for X_valid is {X_valid.shape[1]}')
print(f'Number of columns for X_test is {X_test.shape[1]}')

------------------------------------------------
## Regression Modelling

### Linear Regression

In [None]:
# Instantiate
lr = LinearRegression()

#cross-validation
np.abs(cross_val_score(lr, X_train, y_train, scoring='neg_root_mean_squared_error').mean())

In [None]:
#check r2 scores
lr_scores = cross_val_score(lr, X_train, y_train, cv=3)

print(lr_scores)
lr_scores.mean() 

In [None]:
# Evaluation against validation set
lr.fit(X_train, y_train)

np.sqrt(metrics.mean_squared_error(y_valid, lr.predict(X_valid)))

### Ridge Regression

In [None]:
# Finding best alpha term
r_alphas = np.linspace(.1, 10, 100)
ridge_cv = RidgeCV(alphas=r_alphas)
ridge_cv.fit(X_train, y_train)

In [None]:
# Best alpha
ridge_cv.alpha_

In [None]:
# Instantiate
ridge = Ridge(alpha=ridge_cv.alpha_)

#cross-validation
np.abs(cross_val_score(ridge, X_train, y_train, scoring='neg_root_mean_squared_error').mean())

In [None]:
#check r2 scores
ridge_scores = cross_val_score(ridge, X_train, y_train, cv=3)

print(ridge_scores)
ridge_scores.mean() 

In [None]:
## Evaluation against validation set
ridge.fit(X_train, y_train)

np.sqrt(metrics.mean_squared_error(y_valid, ridge.predict(X_valid)))

### Lasso Regression

In [None]:
# Finding best alpha term
lasso_cv = LassoCV(n_alphas=200)

# Fit model
lasso_cv.fit(X_train, y_train)

In [None]:
# Best alpha
lasso_cv.alpha_

In [None]:
# Instantiate
lasso = Lasso(alpha=lasso_cv.alpha_, max_iter=10000)

#cross-validation
np.abs(cross_val_score(lasso, X_train, y_train, scoring='neg_root_mean_squared_error').mean())

In [None]:
#check r2 scores
lasso_scores = cross_val_score(lasso, X_train, y_train, cv=3)

print(lasso_scores)
lasso_scores.mean() 

In [None]:
## Evaluation against validation set
lasso.fit(X_train, y_train)

np.sqrt(metrics.mean_squared_error(y_valid, lasso.predict(X_valid)))

### Elastic Net

In [None]:
# Finding best alpha term
enet_alphas = np.linspace(0.5, 1.0, 100)  # 100 evenly spaced alphas from 0.5 to 1.0
enet_cv = ElasticNetCV(alphas=enet_alphas, cv=5)  # l1_ratio keeps its default of 0.5

# Fit the cross-validated model to find the best alpha
enet_cv.fit(X_train, y_train)

In [None]:
# Best alpha
enet_cv.alpha_

In [None]:
# Instantiate
enet = ElasticNet(alpha=enet_cv.alpha_)

In [None]:
#cross-validation
np.abs(cross_val_score(enet, X_train, y_train, scoring='neg_root_mean_squared_error').mean())

In [None]:
#check r2 score
enet_scores = cross_val_score(enet, X_train, y_train, cv=7)

print(enet_scores)
enet_scores.mean() 

In [None]:
## Evaluation against validation set
enet.fit(X_train, y_train)

np.sqrt(metrics.mean_squared_error(y_valid, enet.predict(X_valid)))


------------------------------------------------
## Initial Kaggle Submission

In [None]:
def kaggle_submission(preds, model_name):
    
    submission = pd.DataFrame(data=preds)
    submission = pd.merge(ori_test['Id'], submission, left_index = True, right_index = True)
    
    submission.rename({'Id' : 'ID',
                      0 : 'SalePrice'},
                     inplace = True,
                     axis = 1)
    
    submission.to_csv(f'./datasets/submission_{model_name}.csv', index=False)
    
    return submission

In [None]:
test_preds_lr = lr.predict(X_test)
kaggle_submission(test_preds_lr, 'linear_reg_01')

In [None]:
test_preds_ridge = ridge.predict(X_test)
kaggle_submission(test_preds_ridge, 'ridge_01')

In [None]:
test_preds_lasso = lasso.predict(X_test)
kaggle_submission(test_preds_lasso, 'lasso_01')

In [None]:
test_preds_enet = enet.predict(X_test)
kaggle_submission(test_preds_enet, 'enet_01')

We ran our datasets through four regression models: Linear, Ridge, Lasso and ElasticNet.


| Model | Penalty | α | Train Score (R²) | Cross-Validation Score (RMSE) | Kaggle Score (Public) |
|---|---|---|---|---|---|
| Linear Regression | - | - | -2.0611 | 1.6718e+14 | 7.0848e+14 |
| Ridge Regression | L2 | 4.5 | 0.906 | 24810.7735 | 23432.86209 |
| Lasso Regression | L1 | 65.3995 | 0.9075 | 24670.8261 | 23531.36260 |
| ElasticNet Regression | L1+L2 | 0.5 | 0.8884 | 26979.0956 | 28275.50000 |

For Linear Regression, the scores show that the model is not useful on this dataset. There are 191 features after pre-processing, and Linear Regression applies no regularization, so any 'noise' present in the features is still fitted by the model.
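
One plausible mechanism behind such unstable linear regression scores is near-collinearity between columns, which can arise after one-hot encoding. This is an assumption and has not been verified on this dataset. The sketch below uses synthetic data (not the housing features) and an arbitrary `alpha` to compare unpenalized least squares with the closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two nearly identical columns (synthetic data, not the housing features)
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + 0.1 * rng.normal(size=n)

# Ordinary least squares: no penalty on coefficient size
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge closed form: (X'X + alpha*I)^-1 X'y, with an arbitrary alpha
alpha = 1.0
ridge_coef = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

ols_norm = np.linalg.norm(ols_coef)
ridge_norm = np.linalg.norm(ridge_coef)
print(f"OLS coefficient norm:   {ols_norm:.2f}")
print(f"Ridge coefficient norm: {ridge_norm:.2f}")
```

Without a penalty, the fit is free to assign large offsetting coefficients to the two near-duplicate columns in order to chase noise, whereas the ridge penalty keeps both coefficients small.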

The Ridge and Lasso models had the best scores of the four models, with Lasso slightly ahead on cross-validation. They also scored well on Kaggle on the first submission. The main reason is that these regressions regularize the model by shrinking the regression coefficients: features that contribute little to the predictive value of the model are given smaller coefficients. As a result, the models tend to have lower variance and generalise better to new data. Although both Ridge and Lasso produced strong prediction scores, Lasso had the best cross-validation score of the three regularized models. This is likely because Lasso can shrink the coefficients of irrelevant features all the way to zero, which removes them from the model.
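
The difference between the two penalties can be sketched in the special case of an orthonormal design (`X'X = I`), which is an assumption made here purely for illustration. In that case the ridge objective `||y - Xb||² + α||b||²` scales every least-squares coefficient by `1 / (1 + α)`, while the lasso objective `½||y - Xb||² + α||b||₁` soft-thresholds it. The coefficient values and `alpha` below are hypothetical, and penalty scaling conventions differ between libraries.

```python
import numpy as np

def ridge_shrink(beta_ols, alpha):
    """Ridge solution when X'X = I: every coefficient is scaled towards zero."""
    return beta_ols / (1.0 + alpha)

def lasso_shrink(beta_ols, alpha):
    """Lasso solution when X'X = I: soft-thresholding, small coefficients become zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - alpha, 0.0)

# Hypothetical least-squares coefficients, for illustration only
beta_ols = np.array([3.0, -0.2, 0.05, -1.5])
alpha = 0.5

ridge_coef = ridge_shrink(beta_ols, alpha)
lasso_coef = lasso_shrink(beta_ols, alpha)

print("ridge:", ridge_coef)
print("lasso:", lasso_coef)
print("non-zero lasso coefficients:", np.count_nonzero(lasso_coef))
```

Ridge keeps all four coefficients but makes them smaller, while the lasso drops the two whose magnitude is below `alpha`.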

Despite the good results, we will not stop here. There are 191 features in the dataset (58 before one-hot encoding), and with so many features the model is more complex than it may need to be. Even with regularization, there may still be 'noise' in the model that hurts its predictive power. Therefore, to improve the model, we will try to reduce its complexity.

We will seek to reduce the number of features through recursive feature elimination [(RFE)](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html), which selects features by recursively considering smaller and smaller sets of features: at each step the estimator is refitted and the least important features, judged by the size of their coefficients, are dropped.
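
The elimination loop can be sketched in a few lines. This is a simplified stand-in, not scikit-learn's implementation: it uses plain least squares instead of Lasso as the estimator, drops one feature per round, and runs on synthetic data. The helper name `simple_rfe` is hypothetical.

```python
import numpy as np

def simple_rfe(X, y, n_features_to_select):
    """Refit repeatedly, dropping the feature with the smallest |coefficient| each round."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        del remaining[weakest]
    return remaining

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 6))
# Only columns 0, 2 and 5 carry signal in this synthetic target
y = 5.0 * X[:, 0] - 4.0 * X[:, 2] + 3.0 * X[:, 5] + 0.1 * rng.normal(size=300)

selected = simple_rfe(X, y, n_features_to_select=3)
print(selected)
```

Comparing coefficient magnitudes is only meaningful when the features are on comparable scales, which holds here because every synthetic column is drawn from the same distribution.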

Following that, hyperparameter tuning will be done on the reduced feature set selected by RFE to see whether it brings any improvement.
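
To show what a tuning step of this kind does, here is a minimal sketch of a grid search over `alpha` scored by K-fold cross-validated RMSE. It swaps in closed-form ridge regression for Lasso to stay short, and the data, the alpha grid and the helper names `ridge_fit` and `cv_rmse` are all hypothetical.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_rmse(X, y, alpha, n_splits=5, seed=0):
    """Mean RMSE over shuffled K-fold splits for one alpha."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, n_splits)
    scores = []
    for k in range(n_splits):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        coef = ridge_fit(X[train_idx], y[train_idx], alpha)
        residuals = y[test_idx] - X[test_idx] @ coef
        scores.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(scores))

# Synthetic data: only the first two columns matter
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 10))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=120)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {alpha: cv_rmse(X, y, alpha) for alpha in alphas}
best_alpha = min(cv_scores, key=cv_scores.get)
print(cv_scores)
print("best alpha:", best_alpha)
```

If the best value sits at the edge of the grid, that usually suggests widening the grid in that direction before trusting the result.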

------------------------------------------------
## Model Improvement

### Feature Selection by RFE

In [None]:
# Using similar alpha for ease of computation
model = Lasso(alpha=lasso_cv.alpha_, max_iter=10000)
rfe = RFE(model, n_features_to_select=50)

# Fitting to training data
X_train_rfe = rfe.fit_transform(X_train, y_train)
model.fit(X_train_rfe, y_train)

# Tabulating RFE results
rfe_results = [np.array(X_train.columns), rfe.ranking_]
rfe_results_df = pd.DataFrame(rfe_results).T
rfe_results_df.columns = ['Feature', 'RFE Ranking']

# Finding features used by lasso (1 means feature was used)
rfe_lasso_features = rfe_results_df.loc[rfe_results_df['RFE Ranking'] == 1, 'Feature'].tolist()

#filtering the top features by RFE
X_train_reduced = X_train[rfe_lasso_features]

#Instantiate and fit
lasso_cv_reduced = LassoCV(n_alphas=100, max_iter=10000)
lasso_cv_reduced.fit(X_train_reduced, y_train)

#modelling
lasso_reduced = Lasso(alpha=lasso_cv_reduced.alpha_, max_iter=10000)

print(f'CVS for top 50 is : {np.abs(cross_val_score(lasso_reduced, X_train_reduced, y_train, scoring="neg_root_mean_squared_error").mean())}')
print(f'R2 cross for top 50 is {(cross_val_score(lasso_reduced, X_train_reduced, y_train, cv=3)).mean()}')

lasso_reduced.fit(X_train_reduced, y_train)

print(f'Validation score for top 50 is : {np.sqrt(metrics.mean_squared_error(y_valid, lasso_reduced.predict(X_valid[rfe_lasso_features])))}')
print('-'*30)

In [None]:
lasso_reduced = Lasso(alpha=lasso_cv_reduced.alpha_, max_iter=10000)
lasso_reduced.fit(X_train_reduced, y_train)

In [None]:
#predicting for X_test for the top 50 features for lasso
test_preds_lasso_reduced = lasso_reduced.predict(X_test[rfe_lasso_features])
kaggle_submission(test_preds_lasso_reduced, 'lasso_02')

With the top 50 features selected by RFE, the new Kaggle score showed no improvement, despite improved scores on our own model metrics.
The Kaggle score was **23643.55423**.

### Hyperparameter Tuning

In [None]:
# Lasso 
lasso_reduced_hyper = Lasso()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
lasso_params = {'alpha':[0.005, 0.02, 0.03, 0.05, 0.06]}
# define search
search = GridSearchCV(lasso_reduced_hyper , lasso_params, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X_train[rfe_lasso_features], y_train)

In [None]:
results.best_score_

In [None]:
results.best_params_

In [None]:
# Instantiate
lasso_reduced_hyper = Lasso(alpha=0.06, max_iter=50000)

#check r2 score
print((cross_val_score(lasso_reduced_hyper , X_train[rfe_lasso_features], y_train)).mean())

#cross-validation
np.abs(cross_val_score(lasso_reduced_hyper , X_train[rfe_lasso_features], y_train, scoring='neg_root_mean_squared_error').mean())

In [None]:
lasso_reduced_hyper.fit(X_train[rfe_lasso_features], y_train)
test_preds_lasso_hyper = lasso_reduced_hyper.predict(X_test[rfe_lasso_features])

#making kaggle submission dataset
kaggle_submission(test_preds_lasso_hyper , 'lasso_03')

With the top 50 features selected by RFE and after hyperparameter tuning, the new Kaggle score again showed no improvement, despite improved scores on our own model metrics.
The Kaggle score was **23605.50435**.

------------------------------------------------
## Conclusion

The final production model was the Lasso regression model with 50 features selected using Recursive Feature Elimination (with Lasso as the estimator). Even though it scored well on our metrics when checked against the validation dataset, the final Kaggle score showed no improvement, even after the hyperparameter tuning process. This is likely because the cross-validated estimators (`RidgeCV`, `LassoCV`, `ElasticNetCV`) had already been used to find the best alpha for each model. Thus, our initial model gave the best result, and the scores changed only slightly after RFE filtering.

The final tabulation of the models is below:

| Model | Penalty | α | Train Score (R²) | Cross-Validation Score (RMSE) | Kaggle Score (Public) |
|---|---|---|---|---|---|
| Linear Regression | - | - | -2.0611 | 1.6718e+14 | 7.0848e+14 |
| Ridge Regression | L2 | 4.5 | 0.906 | 24810.7735 | 23429.53810 |
| Lasso Regression | L1 | 65.3995 | 0.9075 | 24670.8261 | 23531.36260 |
| ElasticNet Regression | L1+L2 | 0.5 | 0.8884 | 26979.0956 | 28275.50000 |
| Lasso Regression (after RFE) | L1 | 0.5 | 0.913 | 24019.6993 | 23643.55423 |
| Lasso Regression (after hyperparameter tuning) | L1 | 0.06 | 0.915 | 23582.1916 | 23932.42010 |

From the Kaggle scores, Ridge was slightly better at predicting the sale price of properties in the test dataset, whereas Lasso performed better during the modelling stage. The scores differed by only a few hundred points, and both Ridge and Lasso did a respectable job of predicting the test dataset.

Our best Kaggle score with a Lasso model was **23531**.

### Interpreting Coefficients

In [None]:
lasso_reduced.coef_

In [None]:
lasso_reduced.intercept_

In [None]:
#making dataframe with coefficients of the reduced lasso features based on RFE 
coefs = pd.DataFrame([rfe_lasso_features, lasso_reduced.coef_]).T

In [None]:
coefs.columns = ['feature', 'Coefficient']

In [None]:
#top 10 features with positive coefficient
coefs.sort_values(by='Coefficient', ascending=False).head(10)

In [None]:
#top 10 features with negative coefficient
coefs.sort_values(by='Coefficient', ascending=True).head(10)

Feature coefficients can be interpreted as `Bx` in the linear regression equation `Y = B0 + B1X1 + ... + BxXx + e`, where `B0` represents the intercept and each `Bx` represents the slope parameter of feature `Xx`. For example, `'overall_qual'` enters our model as follows:

`'sale_price'` = -11325.42 + 168148.42(`'overall_qual'`)

That is, holding the other features fixed, a one-unit increase in `'overall_qual'` relates to a predicted increase in sale price of $168,148.42.
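
The one-unit reading of a coefficient can be checked numerically. The sketch below uses a hypothetical intercept, hypothetical coefficients and placeholder feature names (`feature_a`, `feature_b`), not the fitted values from `lasso_reduced`.

```python
def predict(intercept, coefs, features):
    """Linear prediction: intercept plus the sum of coefficient * feature value."""
    return intercept + sum(coefs[name] * value for name, value in features.items())

# Hypothetical intercept and coefficients, for illustration only
intercept = 1000.0
coefs = {"feature_a": 250.0, "feature_b": -40.0}

house = {"feature_a": 3.0, "feature_b": 10.0}
house_plus_one = {"feature_a": 3.0, "feature_b": 11.0}  # feature_b raised by one unit

change = predict(intercept, coefs, house_plus_one) - predict(intercept, coefs, house)
print(change)
```

The change in the prediction equals the coefficient of the feature that moved, regardless of the values of the other features. If a feature was scaled before fitting, "one unit" refers to the scaled feature rather than its original unit.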

From the coefficients, we see that neighbourhood and the area/size of the property generally contribute the most to sale price. 'Prime' locations lead to a higher sale price, and so does the size of the property, as it shows a positive coefficient. Conversely, the features with the greatest negative effect on sale price were the age of the house and certain roof styles (Mansard). The older the property, the lower the sale price, and lower-quality roof styles (those least popular among homeowners) are also associated with a lower sale price.

Therefore, we can answer our project statement: the coefficient dataframe highlights several groups of related features that have an impact on the sale price of properties in Ames, Iowa. Namely, **neighbourhood/location, size & area of properties and building types** contribute the most to the housing market.

### Recommendations

Our model is able to predict housing prices relatively well, and it can be refitted to suit the needs of the stakeholders. Our dataset lacks some features, such as economic indicators like employment or wage growth in the area, and interest rates. Based on our outside research, such macro features have an impact on housing prices. An increase in interest rates can reduce one's ability to afford a home, as more of the individual's budget goes to paying off additional interest (for example, on credit cards or short-term loans). [(source)](https://www.opendoor.com/w/blog/factors-that-influence-home-value)

Therefore, we can include such features to better predict the housing market, if our stakeholders require it. As it stands, the model caters more to homeowners who want to find ways to improve their position and sell their properties at a higher price, since the features mostly describe the property itself. If our stakeholders are instead real estate developers or investors, whose main purpose is to find intrinsic value through long-term price appreciation [(source)](https://www.investopedia.com/articles/investing/110614/most-important-factors-investing-real-estate.asp#:~:text=Expected%20cash%20flow%20from%20rental,to%20get%20a%20better%20price), then a more macro view and more macro data are required for our model to run predictions on.

Even with added features, our model will still be able to fit them and evaluate whether they are significant to its predictive power. The Lasso regression model in particular can sieve out noise by zeroing the coefficients of unhelpful features. With proper datasets and training, we recommend that the model be trained on more macro-related features to boost its predictive power in the housing market.