## Background
This house price prediction project is part of the Machine Learning course on Kaggle, where we use RandomForestRegressor to predict house prices based on various attributes of real estate. The analysis is done in Python with the pandas and sklearn libraries.

## Ask
House prices are affected by many factors, including but not limited to the number of rooms, size, number of floors, and whether there is a built-in fireplace. We are given a set of historical sale price data for houses described by 79 variables/descriptors. The goal of this project is to use machine learning to predict the sale price of houses based on their unique variables/descriptors.

## Prepare
We are provided with two data sets as follows:
* train.csv - the training set
* test.csv - the test set

The train.csv data set contains explanatory variables/descriptors of houses together with their historical sale prices. It is drawn from the Ames Housing data set, which contains 2930 observations and 79 explanatory variables involved in assessing the values of individual residential properties in Ames, Iowa from 2006 to 2010; the Kaggle train.csv file holds 1460 of these observations. This data set is used to train our machine learning model.

Descriptions of the 79 explanatory variables/descriptors can be found at <https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data?select=train.csv>.

The test.csv data set contains a set of house observations with the same 79 explanatory variables/descriptors, for which the trained machine learning model is used to predict the sale price.

These data sets were compiled and provided by Dean De Cock for use in data science education and this exercise. Dean De Cock is a Professor of Statistics at Truman State University. 

Further analysis of the data set to evaluate its accuracy, currency and completeness is outside the scope of this project.

## Process

We skip the Process phase because the data set provided here has already been cleaned from the original raw data provided by the Ames Assessor's Office. We focus on the machine learning portion to predict the sale prices of houses.

## Analysis

To begin the analysis phase of the project, we first import the libraries used to evaluate the data set and train the model. We have chosen "pandas" for its ability to manipulate data for analysis and "sklearn" for its RandomForestRegressor machine learning model.

In [None]:
# Import helpful libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

Next, we load the train.csv data set into our analysis using the read_csv function in the pandas library.

In [None]:
# Load the data, and separate the target
iowa_file_path = '../input/house-prices-advanced-regression-techniques/train.csv'
home_data = pd.read_csv(iowa_file_path)

For the machine learning to take place, we have to define (1) the prediction target, conventionally named "y", and (2) the features associated with the prediction target, conventionally named "X". The idea is to train the machine learning model to associate the features in X with the outcome y. By repeating this training process over many observations, the model will ideally be able to predict y based on X.

In the train.csv data, we define y and X, then display the first 5 rows of X for a brief overview.

In [None]:
# Create prediction target y as sale price 
y = home_data.SalePrice

# Create features X as the various variables/descriptors of houses
features = [ 'MSSubClass','LotArea', 'HalfBath', 'ScreenPorch', 'OverallQual', 'YearBuilt', 'OverallCond', 'YearRemodAdd', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'WoodDeckSF', 'OpenPorchSF', '3SsnPorch', 'PoolArea', 'MoSold', 'YrSold']
X = home_data[features]
X.head()
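
The source does not state how the feature list above was chosen. One plausible starting point, shown here as a sketch on a toy DataFrame, is to keep numeric columns that have no missing values and exclude the target. The values in `toy_data` are made up for illustration, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up rows standing in for home_data
toy_data = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],    # numeric, but has a missing value
    "Street": ["Pave", "Pave", "Grvl"],     # non-numeric
    "SalePrice": [208500, 181500, 223500],  # target, excluded from the features
})

# Keep numeric columns only
numeric_cols = toy_data.select_dtypes(include="number").columns

# Of those, keep columns with no missing values
complete_cols = [col for col in numeric_cols if toy_data[col].notna().all()]

# Drop the prediction target from the candidate features
candidate_features = [col for col in complete_cols if col != "SalePrice"]
print(candidate_features)
```

Applied to the real data, the same filter would only suggest candidates; the final list is still a modelling choice.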

RandomForestRegressor is a meta estimator that fits a number of decision tree regressors on various sub-samples of the data set and averages their predictions to improve predictive accuracy and control over-fitting. Many machine learning models are available, but for this project we use RandomForestRegressor as a starting point.
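
The averaging behaviour can be checked directly. The sketch below uses synthetic data from `make_regression` (an assumption for illustration, not the Ames data) and compares the forest's prediction with the mean of the predictions of its individual trees, which sklearn exposes through the `estimators_` attribute.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustration only, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted tree is stored in estimators_; average their predictions by hand
tree_preds = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = tree_preds.mean(axis=0)

# The forest's own prediction should match the manual average
forest_preds = forest.predict(X_demo)
print(np.allclose(manual_average, forest_preds))
```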

Before training on the full data set to predict the sale prices of the houses in test.csv, we run a validation phase to calculate the mean absolute error (MAE) of the RandomForestRegressor model. This is done by splitting the training data set into (1) a subset for training the RandomForestRegressor model and (2) a held-out subset on which the predicted sale prices are compared against the actual sale prices to calculate the mean absolute error.
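
As a small worked example of the metric itself, the sketch below computes the mean absolute error by hand for three hypothetical houses and compares it with sklearn's `mean_absolute_error`. The prices are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted sale prices for three houses
actual = np.array([200_000, 150_000, 300_000])
predicted = np.array([210_000, 140_000, 330_000])

# MAE is the mean of the absolute differences: (10,000 + 10,000 + 30,000) / 3
manual_mae = np.mean(np.abs(actual - predicted))
sklearn_mae = mean_absolute_error(actual, predicted)

print("Manual MAE:  {:,.0f}".format(manual_mae))
print("sklearn MAE: {:,.0f}".format(sklearn_mae))
```

Because the error is expressed in the same unit as the target, an MAE here reads directly as an average miss in sale price.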

We can use the following code chunk to identify which features (X) are sufficiently meaningful for training the RandomForestRegressor model, with the aim of reducing the mean absolute error.

In [None]:
# Use train_test_split function to split the training data set into validation and training data
# train_X and train_y will be used to train the RandomForestRegressor model
# val_X and val_y will be used to evaluate the predicted sales price to calculate MAE
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Define a random forest model
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
# mean_absolute_error expects the true values first, then the predictions
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
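
To compare candidate feature lists, the split, fit and score steps can be wrapped in a helper and run once per list. The sketch below is one way to do this under stated assumptions: it uses synthetic data rather than train.csv, and `validation_mae`, `demo_data` and the column names are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (illustration only)
X_arr, y_arr = make_regression(
    n_samples=300, n_features=6, n_informative=3, noise=5.0, random_state=1
)
demo_data = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])
demo_target = pd.Series(y_arr, name="target")


def validation_mae(feature_names):
    """Fit a random forest on the given columns and return the validation MAE."""
    train_X, val_X, train_y, val_y = train_test_split(
        demo_data[feature_names], demo_target, random_state=1
    )
    model = RandomForestRegressor(random_state=1)
    model.fit(train_X, train_y)
    return mean_absolute_error(val_y, model.predict(val_X))


# Hypothetical candidate feature lists to compare
candidates = {
    "first two columns": ["feature_0", "feature_1"],
    "all columns": list(demo_data.columns),
}
results = {name: validation_mae(cols) for name, cols in candidates.items()}
for name, mae in results.items():
    print("{}: validation MAE {:,.1f}".format(name, mae))
```

Fixing `random_state` in both the split and the model keeps the comparison between feature lists repeatable.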

Once we are satisfied with the mean absolute error of the model, we proceed to train the RandomForestRegressor model with the full training data to improve the accuracy of the model.

In [None]:
# Fix random_state, as in the validation model, so the results are reproducible
rf_model_on_full_data = RandomForestRegressor(random_state=1)
rf_model_on_full_data.fit(X, y)

After the RandomForestRegressor model has been trained on the full training data set, we load the test.csv data set, which contains the 79 variables/descriptors, and predict the sale prices of these houses. The predicted sale prices will be submitted to the competition for evaluation.

In [None]:

# path to the file containing the variables/descriptors of houses
test_data_path = '../input/house-prices-advanced-regression-techniques/test.csv'

# read test data file using pandas read_csv function
test_data = pd.read_csv(test_data_path)

# create test_X from test_data, keeping only the columns used for prediction (those that gave a satisfactory MAE)
# The list of columns is stored in a variable called features.
test_X = test_data[features]

# make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(test_X)

We write the predictions to a CSV file for submission.

In [None]:
# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
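
Before submitting, it can help to read the file back and confirm its shape. The sketch below is a minimal check under stated assumptions: the Ids and prices are made up, and it writes to an in-memory buffer rather than to `submission.csv` so that it runs on its own.

```python
import io
import pandas as pd

# Hypothetical predictions for three test houses (Ids and prices are made up)
demo_output = pd.DataFrame({"Id": [1461, 1462, 1463],
                            "SalePrice": [120000.0, 155000.5, 180250.0]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
demo_output.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Checks: expected columns only, no missing values, one row per Id
assert list(round_trip.columns) == ["Id", "SalePrice"]
assert not round_trip.isnull().values.any()
assert round_trip["Id"].is_unique
print(round_trip.shape)
```

Passing `index=False` matters here: without it, pandas writes the row index as an extra unnamed column, and the column check above would fail.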

## Share and Act

The Share and Act phases, and data visualisation of the analysis, are outside the scope of this project.

## Other Remarks
Many machine learning models are available for predicting sale prices. Regression, regularization, Bayesian, artificial neural network and deep learning algorithms, among others, can be considered for future projects. Whether any of them improves upon the RandomForestRegressor model used in this exercise would be judged by comparing their MAE.
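
Such a comparison can reuse the same validation split for every model. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_regression`, and the choice of `LinearRegression` and `Ridge` as alternatives is ours, not something evaluated in this project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustration only, not the housing data)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=1)
train_X, val_X, train_y, val_y = train_test_split(X_demo, y_demo, random_state=1)

# Candidate models, all scored on the same validation split
models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(random_state=1),
}

scores = {}
for name, model in models.items():
    model.fit(train_X, train_y)
    scores[name] = mean_absolute_error(val_y, model.predict(val_X))
    print("{}: validation MAE {:,.1f}".format(name, scores[name]))
```

The ranking on synthetic data says nothing about the ranking on the Ames data; the point is only the shape of the comparison.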