**This notebook is an exercise in the [Introduction to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning) course.  You can reference the tutorial at [this link](https://www.kaggle.com/alexisbcook/machine-learning-competitions).**

---


![alt text for screen readers](https://thumbs.dreamstime.com/b/outline-ames-iowa-skyline-white-buildings-vector-illustration-business-travel-tourism-concept-historic-architecture-203382169.jpg "Credits: dreamstime.com")

<span style="color:PaleVioletRed;font-weight:600;font-size:50px;font-family:monotype;">
    Housing Prices Competition: Ames Housing dataset 
</span>
    


        

    This is a tutorial for beginners working through the Housing Prices Competition for Kaggle Learn Users ([competition page](https://www.kaggle.com/competitions/home-data-for-ml-course/)).
    
    Here is a short problem description:
    "With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home."
    The task is to predict the selling price of a home from a dataset that describes existing homes, their features, and their final sale prices.
    
    In this kernel, we will work through the Kaggle Ames Housing dataset to understand the basic concepts of a Random Forest Regressor.
    
    Hope you enjoy the kernel! 

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
<span style="color:pink;font-weight:600;font-size:30px;font-style:serif">
    Objective
</span>

Predict the selling price of a home from a dataset that describes existing homes, their features, and their final sale prices.

<span style="color:pink;font-weight:600;font-size:30px;font-style:serif">
    File description
</span>
    
- Starter: Housing Price prediction | Random Forest.ipynb : Tutorial and steps to load and wrangle the data, fit a Random Forest Regressor, and test model performance.
- home-data-for-ml-course : data files consisting of the Ames Housing dataset

<span style="color:pink;font-weight:600;font-size:30px;font-style:serif">
   Data description
</span> 
 
 Tabular data consisting of 1460 entries, each with 81 attributes, including 'SalePrice', which would be the target for prediction 
 
**Source:** The Ames Housing dataset was compiled by Dean De Cock for use in data science education. 
</span>


<a id='21'>
</a>
<span style="color:pink;font-weight:600;font-size:30px;font-style:serif">
   Data Analysis Content
</span>
<span style="font-weight:600;font-size:20px;font-style:serif">
    
1. [Import Dependencies](#1)
1. [Load Data](#2)
1. [Data preprocessing](#3)
    1. [Remove null values](#4)
    1. [Remove highly correlated features](#5) 
    1. [Feature engineering](#6)
    1. [Split dataset](#7)
    1. [One hot encode](#8)
    1. [Feature scaling](#8b)
1. [Random Forest Regressor](#9)
    1. [Performance Evaluation](#10)
    1. [Train model on all data](#11)
1. [Load raw data for prediction](#12)
1. [Test-data preprocessing](#13)
    1. [Feature match with training data](#14) 
    1. [Feature engineering](#15)
    1. [Replace null values](#16)
    1. [One hot encode](#17)
    1. [Feature scaling](#18)
1. [Data prediction on test data](#19)
1. [Generate submission](#20)

</span>

<a id='1'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Import Dependencies
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
    First we will import our helper modules that we will use throughout this kernel
</span>


In [1]:
# Import helpful libraries
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler 
from sklearn.decomposition import PCA
from sklearn.cluster import k_means

<a id='2'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Load Data
</span>


In [2]:
# Load the data, and separate the target
iowa_file_path = '../input/train.csv'
df = pd.read_csv(iowa_file_path)


pd.options.display.max_rows = 20

In [3]:
df.columns

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

<a id='4'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Data preprocessing
</span>


In [4]:
# Split dataset into independent & dependent datasets
X = df.drop('SalePrice', axis=1)
y = df.SalePrice

<a id='3'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Remove null values
</span>


In [5]:
# Detect null values
X.isna().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MiscVal            0
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
Length: 80, dtype: int64

In [6]:
# Drop null values
def nulldrop(df):
    df.dropna(axis=1, inplace=True)
    return df
       
X = nulldrop(X)
X.isna().sum()

Id               0
MSSubClass       0
MSZoning         0
LotArea          0
Street           0
                ..
MiscVal          0
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
Length: 61, dtype: int64

In [7]:
X.shape

(1460, 61)

<a id='4'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Remove highly correlated features
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
Highly correlated features can lead to multicollinearity among the independent variables, which can affect model performance. Hence, it is important to remove such features.
</span>

In [8]:
# Remove highly correlated features to avoid multicollinearity
def corrdrop(df, maxcorr):
    # Correlations are only defined for numeric columns
    corr_matrix = df.select_dtypes(include='number').corr().abs()
    # Mask the upper triangle and diagonal so each pair is checked once
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    tri_df = corr_matrix.mask(mask)
    to_drop = [x for x in tri_df.columns if any(tri_df[x] > maxcorr)]
    df = df.drop(to_drop, axis=1)
    return df

X = corrdrop(X, 0.95)
print(f"The reduced dataframe has {X.shape[1]} columns.")

The reduced dataframe has 61 columns.
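
No column was dropped from the housing data at this threshold, so the effect of the filter is easier to see on a small example. The sketch below is only an illustration: the toy frame, its column names and the helper name `drop_correlated` are hypothetical, and the helper simply mirrors the logic of the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is an exact linear function of "a", "c" is unrelated noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"a": np.arange(50, dtype=float)})
toy["b"] = 2.0 * toy["a"] + 1.0
toy["c"] = rng.normal(size=50)

def drop_correlated(df, maxcorr):
    # Absolute pairwise correlations between the numeric columns only
    corr_matrix = df.select_dtypes(include="number").corr().abs()
    # Hide the upper triangle and the diagonal so each pair is checked once
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    lower = corr_matrix.mask(mask)
    # A column is dropped if it is too strongly correlated with any later column
    to_drop = [col for col in lower.columns if (lower[col] > maxcorr).any()]
    return df.drop(columns=to_drop)

reduced = drop_correlated(toy, 0.95)
print(list(reduced.columns))
```

Of each highly correlated pair, the column that appears first is the one removed, so here "a" is dropped and "b" is kept.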


In [9]:
# Note down column names for later
corr_cols = X.columns

<a id='5'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Feature engineering
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
Here, we will create new features from exisiting ones that would act as good indicators of the price of a house, for example, the total number of bathrooms a house has.
</span>

In [10]:
def create_extra_features(df):    
    df['TotBath'] = df['FullBath'] + (0.5* df['HalfBath']) + df['BsmtFullBath'] + (0.5*df['BsmtHalfBath'])    
    df['Total_Home_Quality'] = df['OverallQual'] + df['OverallCond']
    df["HighQualSF"] = df["1stFlrSF"] + df["2ndFlrSF"]
    return df

X = create_extra_features(X)
X.shape

(1460, 64)

In [11]:
X.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,TotBath,Total_Home_Quality,HighQualSF
0,1,60,RL,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,2,2008,WD,Normal,3.5,12,1710
1,2,20,RL,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,5,2007,WD,Normal,2.5,14,1262
2,3,60,RL,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,9,2008,WD,Normal,3.5,12,1786
3,4,70,RL,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,0,0,0,2,2006,WD,Abnorml,2.0,12,1717
4,5,60,RL,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,12,2008,WD,Normal,3.5,13,2198


<a id='6'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Split dataset
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
    We need data to build the model and separate data to validate it. For this purpose we split our dataset into a training set and a test set, respectively. So that we get the same split every time this notebook is run, we set random_state to a fixed number. This is considered good practice.
</span>

In [12]:
# Split into validation and training data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [13]:
X_train.shape
X_test.shape

(365, 64)
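
As a small illustration of why a fixed random_state matters, the sketch below splits a hypothetical toy array twice and checks that both splits are identical. The arrays `X_demo` and `y_demo` are made up for this example and are not part of the housing data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same random_state produce the same split
split_a = train_test_split(X_demo, y_demo, random_state=1)
split_b = train_test_split(X_demo, y_demo, random_state=1)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))

# test_size is not set, so the default share of 25% of the rows is held out
X_tr, X_te, y_tr, y_te = split_a
print(same_split, X_tr.shape, X_te.shape)
```

The same default explains the shapes in this notebook: 365 of the 1460 rows are held out and the remaining 1095 are used for training.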

<a id='7'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    One hot encode
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
    For a machine learning algorithm to work with your data, all values must be entered in a numerical (machine-readable) form, so we first need to look at the datatypes present in our dataset. get_dummies from the pandas library is a great tool for one hot encoding the categorical features of a tabular dataset like ours.
</span>


In [14]:
def onehotencode(df):
    encoded = pd.get_dummies(df)
    return encoded

In [15]:
X_train_encoded = onehotencode(X_train)
X_train_encoded.shape

(1095, 215)

In [16]:
X_test_encoded = onehotencode(X_test)
X_test_encoded.shape

(365, 195)

In [17]:
y_train.shape
y_test.shape

(365,)

In [18]:
X_train_encoded_final , X_test_encoded_final = X_train_encoded.align(X_test_encoded, join='inner', axis=1) 
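
The two encoded frames above have different numbers of columns (215 and 195) because some categories occur in only one of the two splits. The sketch below reproduces that situation on hypothetical toy frames (`train_demo` and `valid_demo` are made-up names and values) and shows how an inner join on the columns brings both frames back to the same set of features.

```python
import pandas as pd

# Hypothetical toy data: the category "C" only appears in the training rows
train_demo = pd.DataFrame({"area": [10, 20, 30], "zone": ["A", "B", "C"]})
valid_demo = pd.DataFrame({"area": [15, 25], "zone": ["A", "B"]})

train_enc = pd.get_dummies(train_demo)  # gains a zone_C column
valid_enc = pd.get_dummies(valid_demo)  # has no zone_C column

# join="inner" keeps only the columns that are present in both frames
train_aligned, valid_aligned = train_enc.align(valid_enc, join="inner", axis=1)
print(list(train_aligned.columns))
print(list(valid_aligned.columns))
```

The price of the inner join is that information about categories seen in only one split is discarded.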

<a id='8'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Feature scaling
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
    We need to bring our feature values within a comparable range, while still maintaining their independence. For this we use the fit_transform method on the training dataset. This method learns the mean and variance of each feature and then uses them to transform the data to zero mean and unit variance. We apply the same learned statistics to transform the test data, instead of fitting on the test data as well, so that the model is still 'surprised' by the test data (so no peeking!)
</span>

In [19]:
scaler = StandardScaler()
X_train_encoded_scaled = scaler.fit_transform(X_train_encoded_final)
X_test_encoded_scaled = scaler.transform(X_test_encoded_final)

In [20]:
X_train_encoded_scaled.shape
X_test_encoded_scaled.shape

(365, 191)
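
To make the fit/transform distinction concrete, here is a minimal sketch on a hypothetical single-feature example. The values and variable names are invented for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one feature, separate training and validation rows
train_vals = np.array([[1.0], [2.0], [3.0], [4.0]])
valid_vals = np.array([[5.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train_vals)  # learns mean and variance here
valid_scaled = demo_scaler.transform(valid_vals)      # reuses the training statistics

print(demo_scaler.mean_, demo_scaler.var_)
print(train_scaled.mean(), train_scaled.std())
print(valid_scaled)
```

The validation value is scaled with the training mean (2.5) and variance (1.25), so it lands outside the range of the scaled training values instead of being re-centred on itself.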

<span style="color:pink;font-weight:600;font-size:16px;font-style:serif">
    Choosing a Machine Learning Model
</span>


    We choose the machine learning model according to the problem we are trying to solve. Since our data is already labeled (every house in the training data has a known SalePrice), we will use the supervised learning approach. Two different kinds of problems in supervised learning are classification and regression. Since our aim is to predict a continuous value, the sale price, we are dealing with a regression problem.
    Here, we will use a Random Forest Regressor.

![alt text for screen readers](https://www.frontiersin.org/files/Articles/284242/fnagi-09-00329-HTML/image_m/fnagi-09-00329-g001.jpg)


**Source:** Sarica, Cerasa & Quattrone, 2017

<span style="color:pink;font-weight:600;font-size:16px;font-style:serif">
    What is a Random Forest Regressor?
</span>

    - Random Forest Regression is a supervised learning algorithm used for regression tasks. It uses ensemble learning, a technique that combines the predictions of several models into a single prediction, which tends to be more accurate than the prediction of any single model.

    - To do this, a Random Forest builds several decision trees during training and combines their results (for regression, by averaging) to produce a strong prediction. This is a very powerful method, but it comes with a caveat: overfitting can still occur in a Random Forest model. It is advantageous to compare results across several values of tree depth to find the best one. Mean absolute error is a useful metric for this.

![alt text for screen readers](https://www.kdnuggets.com/wp-content/uploads/rand-forest-2.jpg "Credits: kdnuggets.com")
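
The idea of comparing several tree depths by their mean absolute error can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic (not the Ames dataset), and the candidate depths and the number of trees are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression data (not the Ames dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 5.0 * X_demo[:, 0] + np.sin(X_demo[:, 1]) + rng.normal(scale=0.5, size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, random_state=1)

# Fit one forest per candidate depth and record its validation MAE
mae_by_depth = {}
for depth in [2, 4, 8, None]:  # None places no limit on the depth of the trees
    model = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    mae_by_depth[depth] = mean_absolute_error(y_va, model.predict(X_va))

# The depth with the lowest validation MAE is the preferred candidate
best_depth = min(mae_by_depth, key=mae_by_depth.get)
print(mae_by_depth, best_depth)
```

The comparison uses held-out data on purpose: the error on the training rows keeps falling as trees get deeper, so it cannot reveal overfitting.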


<a id='9'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Random Forest Regressor
</span>

[Top](#21)


In [21]:
# Define a random forest model
rf_model = RandomForestRegressor(random_state=0)
rf_model.fit(X_train_encoded_scaled, y_train)

RandomForestRegressor(random_state=0)

<a id='10'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Performance evaluation
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
    Now that we have our model, it is important to do a basic evaluation to see how it is performing. The mean_absolute_error function is a simple way to do this. It computes the average absolute difference between the true values and the predicted values of the prediction target, in our case the final price of the house. Note: the order in which the two arguments are provided does not change the result, although by convention the true values come first.
</span>

In [22]:
rf_val_predictions = rf_model.predict(X_test_encoded_scaled)
rf_val_mae = mean_absolute_error(y_test, rf_val_predictions)  # true values first, by convention

print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))

Validation MAE for Random Forest Model: 15,507
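
For intuition, the metric can be reproduced by hand on a few numbers. The prices below are hypothetical and serve only to show what the function computes and that swapping the arguments gives the same value.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, for illustration only
y_true = np.array([200000, 150000, 320000])
y_pred = np.array([210000, 140000, 300000])

# Average of the absolute differences between true and predicted values
manual_mae = np.mean(np.abs(y_true - y_pred))

print(manual_mae)
print(mean_absolute_error(y_true, y_pred))
print(mean_absolute_error(y_pred, y_true))
```

Read this way, the validation MAE reported above means that the predictions miss the true sale price by roughly 15,500 on average, in the same units as SalePrice.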


In [23]:
rf_val_predictions.shape
y_test.shape

(365,)

<a id='11'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Train model on all data
</span>


In [24]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor(random_state=0)

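# Fit rf_model_on_full_data on all of the training data.
# A fresh scaler is fitted on the full encoded training set; it is reused
# later to scale the competition test data.
X_encoded = onehotencode(X)
scaler = StandardScaler()
X_encoded_scaled = scaler.fit_transform(X_encoded)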

rf_model_on_full_data.fit(X_encoded_scaled,y)

RandomForestRegressor(random_state=0)

<a id='12'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Load raw data for prediction
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
This data will be referred to as test data from now on, not to be confused with the validation data used to assess model performance after model training.
</span>

In [25]:
# path to file you will use for predictions
test_data_path = '../input/test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)
test_data.shape

(1459, 80)

<a id='13'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Test-data preprocessing
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
Test data must match the structure that the random forest model is expecting
</span>

In [26]:
test_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


<a id='14'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Feature match with training data
</span>

<span style="color:grey;font-weight:500;font-size:20px;font-style:serif">
Training data and test data must have the same features (columns)
    </span>
    
    




In [27]:
test_data = test_data[corr_cols]
test_data.shape

(1459, 61)

In [28]:
test_data.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,11622,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,120,0,0,6,2010,WD,Normal
1,1462,20,RL,14267,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,36,0,0,0,0,12500,6,2010,WD,Normal
2,1463,60,RL,13830,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,34,0,0,0,0,0,3,2010,WD,Normal
3,1464,60,RL,9978,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,36,0,0,0,0,0,6,2010,WD,Normal
4,1465,120,RL,5005,Pave,IR1,HLS,AllPub,Inside,Gtl,...,82,0,0,144,0,0,1,2010,WD,Normal


<a id='15'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Feature engineering
</span>


In [29]:
test_data = create_extra_features(test_data)
test_data.shape

(1459, 64)

<a id='16'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Replace null values
</span>


In [30]:
test_data.isna().sum()

Id                    0
MSSubClass            0
MSZoning              4
LotArea               0
Street                0
                     ..
SaleType              1
SaleCondition         0
TotBath               2
Total_Home_Quality    0
HighQualSF            0
Length: 64, dtype: int64

In [31]:
test_data = test_data.fillna(0)           
test_data.isna().sum()

Id                    0
MSSubClass            0
MSZoning              0
LotArea               0
Street                0
                     ..
SaleType              0
SaleCondition         0
TotBath               0
Total_Home_Quality    0
HighQualSF            0
Length: 64, dtype: int64

<a id='17'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    One hot encode
</span>


In [32]:
test_data = pd.get_dummies(test_data)

In [33]:
test_data.shape

(1459, 212)

In [34]:
X_encoded.shape

(1460, 219)

In [35]:
final_train, final_test = X_encoded.align(test_data, join='left', axis=1, fill_value=0)  # left join: keep the training columns, fill columns missing from the test data with 0

In [36]:
final_test.shape

(1459, 219)

In [37]:
final_train.shape

(1460, 219)
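
The effect of the left join with fill_value=0 is easier to follow on a small case. The two already-encoded frames below are hypothetical: the training frame has a dummy column that the test frame lacks, and the test frame has one that the model never saw.

```python
import pandas as pd

# Hypothetical encoded frames: zone_B exists only in train, zone_C only in test
train_enc = pd.DataFrame({"area": [10, 20], "zone_A": [1, 0], "zone_B": [0, 1]})
test_enc = pd.DataFrame({"area": [15], "zone_A": [1], "zone_C": [1]})

# join="left" keeps exactly the training columns; gaps in the test frame become 0
left_train, left_test = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(left_test.columns))
print(left_test.iloc[0].tolist())
```

After the join the test frame has the training columns only: the missing zone_B is filled with 0 and the unseen zone_C is dropped, which is what lets the fitted scaler and model accept it.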

<a id='18'></a>
<span style="color:pink;font-weight:600;font-size:20px;font-style:serif">
    Feature scaling
</span>


In [38]:
final_test_scaled = scaler.transform(final_test)
final_test_scaled.shape

(1459, 219)

In [39]:
final_test_scaled.shape

(1459, 219)

<a id='19'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Data prediction on test data
</span>


In [40]:
# make predictions which we will submit. 
test_preds = rf_model_on_full_data.predict(final_test_scaled)

In [41]:
final_test.shape

(1459, 219)

<a id='20'></a>
<span style="color:pink;font-weight:600;font-size:25px;font-style:serif">
    Generate submission
</span>


In [42]:
# Run the code to save predictions in the format used for competition scoring
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)

<span style="font-weight:600;font-size:25px;font-style:serif;">
    
    I hope you enjoyed this kernel!
    If you have any questions or tips, feel free to reach out :)
</span>



[Top](#21)

<div class="alert" style="background-color:thistle; text-align:center; color:white; weight:200; font-size:30px;">
    🕊 Feedback & upvotes much appreciated! 🕊
</div>