Import the libraries we will use.

- **pandas** - for storing, exploring and manipulating our data in a DataFrame.
- **scikit-learn** (sklearn) - for regression models (Decision Tree and Random Forest).

In [1]:
# Import libraries: pandas, scikit-learn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

The Iowa home price data is stored in a CSV file at the path below. We load it into a pandas DataFrame.

In [2]:
# Dataset: home prices in Iowa
iowa_file_path = "./data/iowa_home.csv"

# Read csv file and load it as a DataFrame in pandas
iowa_data = pd.read_csv(iowa_file_path)

**Exploring our data**

Now that our dataset is stored in a DataFrame, we can view summary statistics of the data. We'll use the following:
- DataFrame.shape
- DataFrame.describe()
- DataFrame.head() 


From the output of `DataFrame.shape`, the dataset has 1460 rows (one per Iowa home) and 81 columns, which hold the features and the sale price.
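
As a quick illustration of what the three calls return, here is a minimal sketch on a tiny made-up DataFrame. The values and the `toy` name are assumptions for illustration only; they are not rows from the Iowa file.

```python
import pandas as pd

# Hypothetical toy data, NOT taken from the Iowa dataset
toy = pd.DataFrame({
    "LotArea": [8000, 9500, 11000, 7500],
    "YearBuilt": [2003, 1976, 2001, 1915],
    "SalePrice": [200000, 180000, 220000, 140000],
})

n_rows, n_cols = toy.shape   # shape is a (rows, columns) tuple, not a method
summary = toy.describe()     # count, mean, std, min, quartiles, max per numeric column
first_rows = toy.head(2)     # first two rows (head() defaults to five)

print(n_rows, n_cols)
print(summary.loc["mean"])
print(first_rows)
```

Note that `describe()` only summarizes numeric columns by default, which is why the real output lists fewer than 81 columns.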


In [3]:
# Print summary statistics of the data
print(iowa_data.shape)
print(iowa_data.describe())
#print(iowa_data.head())

(1460, 81)
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  ...  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  ...   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726  ...   
std       1.112799    30.202904     20.645407   181.066207   456

In [None]:
# Histogram
#iowa_data.hist(column=["YrSold","YearBuilt","YearRemodAdd","GarageYrBlt"]);

**Data Cleaning** 

The Iowa data has 81 columns, which include the features and our target (SalePrice). We will use only numeric columns that we expect to be good predictors of the sale price.

After checking for NaN values in our dataset, we can see that the features Alley, PoolQC, Fence and MiscFeature have too many NaNs, so we remove them.
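
One way to make "too many NaNs" concrete is to look at the fraction of missing values per column rather than the raw count. The sketch below uses a small made-up DataFrame, and the 0.5 cutoff is an arbitrary assumption, not a rule used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values, NOT the Iowa dataset
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8000, 9500, 11000, 7500],
})

# isnull() gives True/False per cell; the column mean is the fraction missing
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.5  # assumed cutoff for illustration
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()

print(missing_fraction)
print(mostly_missing)
```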

In [None]:
# Check for NaN values in our features
null_columns = iowa_data.columns[iowa_data.isnull().any()]
# print() is needed because this is not the last expression in the cell
print(iowa_data[null_columns].isnull().sum())


# Selecting prediction target using dot notation in pandas
iowa_target = iowa_data.SalePrice

# First, we drop the target column to have DataFrame of features
features = iowa_data.drop(['SalePrice'], axis=1)

to_drop = ['Alley', 'PoolQC', 'Fence', 'MiscFeature'] # drop columns because of many NaNs
# For this project, we will only include numeric features.
features = features.drop(to_drop, axis=1).select_dtypes(exclude=['object'])

# possible features
# numeric: LotFrontage, LotArea, YearBuilt, MasVnrArea, GarageArea,
#"1stFlrSF", "2ndFlrSF", "FullBath", "BedroomAbvGr", "TotRmsAbvGrd"

iowa_features = ['LotFrontage', 'LotArea', 'YearBuilt', 'MasVnrArea', 'GarageArea',
                 '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
features = features[iowa_features]

# Replace NaN values in LotFrontage and MasVnrArea with 0
values = {'LotFrontage': 0, 'MasVnrArea': 0}
features = features.fillna(value=values)
#features.head()
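
Filling with 0 is one choice among several. `SimpleImputer` is imported at the top of the notebook but not used, so here is a hedged sketch of how it could fill gaps with each column's median instead. The toy values and the `toy_features` name are assumptions for illustration; this is an alternative, not what the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy features with gaps, NOT the Iowa dataset
toy_features = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "MasVnrArea": [0.0, 100.0, np.nan, 200.0],
})

# Replace each NaN with the median of its column
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(
    imputer.fit_transform(toy_features),  # returns a NumPy array
    columns=toy_features.columns,
    index=toy_features.index,
)

print(imputed)
```

If you go this route, fit the imputer on the training split only and call `transform` on the validation split, so that validation rows do not influence the fill values.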

In [None]:
# Split data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(features, 
                                                    iowa_target,
                                                    train_size=0.7, 
                                                    test_size=0.3, 
                                                    random_state=0)

**Modeling** <br>
We first explore a Decision Tree, then we'll try a Random Forest.

**Model Validation** <br>
Mean Absolute Error (MAE) is one of the metrics for summarizing model quality: it is the average absolute difference between the predicted and the actual prices.
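
To see what MAE measures, the sketch below computes it by hand and compares the result with `mean_absolute_error`. The prices are made-up numbers for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted prices
y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 270000])

# MAE = mean of |actual - predicted|
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true, y_pred)

print(manual_mae, sklearn_mae)
```

The individual errors here are 10000, 10000 and 30000, so both values come out to about 16666.67, in the same units as the target (price).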

In [None]:
# Define and train model using training data
# Specify a number for random_state to ensure same results each run
iowa_model = DecisionTreeRegressor(random_state=0)
iowa_model.fit(X_train, y_train)

# Predict using validation/test data
predictions = iowa_model.predict(X_test)

# Model validation: measure the quality of models (Predictive Accuracy)
print("On average, our predictions are off by about %.2f by using a Decision Tree" %(mean_absolute_error(y_test, predictions)))

In [None]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Helper: fit a tree limited to max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

In [None]:
# Concepts of underfitting and overfitting
# (get_mae must be defined before this cell runs)

# compare MAE with differing values of max_leaf_nodes
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
scores = {leaf_size: get_mae(leaf_size, X_train, X_test, y_train, y_test) for leaf_size in candidate_max_leaf_nodes}
#print(scores)

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = min(scores, key=scores.get)
print("Best tree size (value of max_leaf_nodes) is %d with MAE score of %.2f" %(best_tree_size, scores[best_tree_size]))
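
Comparing `max_leaf_nodes` values is a way to look for the point between underfitting and overfitting. A minimal sketch of the idea, on synthetic data from `make_regression` (an assumption for illustration, not the Iowa data), is to print training and validation MAE side by side: a very small tree tends to do poorly on both, while a very large tree can fit the training set almost perfectly without doing as well on unseen rows.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only to illustrate the pattern
X, y = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for n_leaves in [2, 10, 50, 350]:
    tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
    tree.fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, tree.predict(X_tr))
    val_mae = mean_absolute_error(y_val, tree.predict(X_val))
    results[n_leaves] = (train_mae, val_mae)
    print("leaves=%d  train MAE=%.2f  validation MAE=%.2f" % (n_leaves, train_mae, val_mae))
```

Only the validation MAE is a fair guide for choosing the tree size, which is why `get_mae` scores the model on the held-out split.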

**Random Forest**

Let's try a new model, Random Forest, and see whether it provides better predictive accuracy than a single Decision Tree.

*Overview:* The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.
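
The averaging can be checked directly. The sketch below fits a small forest on synthetic data (an assumption for illustration, not the Iowa data), averages the predictions of the individual trees stored in `estimators_`, and compares the result with `forest.predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only to illustrate the averaging
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

# One row of predictions per component tree, for the first five samples
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])

print(np.allclose(manual_average, forest_prediction))
```

Each tree is trained on a different bootstrap sample of the rows, so the trees disagree with one another, and averaging smooths out their individual errors.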

*Parameter Tuning:*

In [None]:
# Random Forest - ML model
forest_model = RandomForestRegressor(random_state=0, n_estimators=10000)
forest_model.fit(X_train, y_train)
iowa_preds = forest_model.predict(X_test)
print("On average, our predictions are off by about %.2f by using Random Forest" %(mean_absolute_error(y_test, iowa_preds)))