# House Prices - Random Forest
For this problem I built a random forest and tuned its number of trees to limit overfitting and underfitting. To do this, I defined a list of candidate tree counts, trained one model per count, computed the mean absolute error (MAE) of each model on held-out validation data, and kept the model with the lowest MAE.
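
For reference, MAE is the average of the absolute differences between predicted and actual prices, so it is expressed in the same unit as `SalePrice`. The following is a minimal pure-Python sketch of the metric. The helper name `manual_mae` and the sample prices are made up for illustration; the notebook itself relies on `sklearn.metrics.mean_absolute_error`.

```python
def manual_mae(actual, predicted):
    """Return the mean absolute error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one value is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Illustrative values only, not taken from the dataset.
actual_prices = [200_000, 150_000, 320_000]
predicted_prices = [210_000, 140_000, 300_000]
print(manual_mae(actual_prices, predicted_prices))
```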

## Implementation
### Imports

In [1]:
import pandas as pd
from pandas import DataFrame
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from datetime import datetime
import pickle
from numpy import ndarray

### Features and prediction target
These are the house features that the model will learn from and use to predict house prices.

In [2]:
HOUSE_FEATURES = ['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', '1stFlrSF', 
                  '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 
                  'TotRmsAbvGrd', 'Fireplaces', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 
                  'PoolArea', 'MiscVal', 'MoSold', 'YrSold']
TARGET_PREDICTION = "SalePrice"
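
Before selecting these columns, it can help to confirm that they are all present in the loaded data, since indexing a DataFrame with a list that contains a missing column raises a `KeyError`. The sketch below is one way to do that. `missing_columns` is a hypothetical helper, not part of the notebook, and it accepts any iterable of column names, such as `dataset.columns`.

```python
def missing_columns(available, required):
    """Return the required column names that are absent, in their original order."""
    available_set = set(available)
    return [name for name in required if name not in available_set]


# Hypothetical column listing; with pandas, pass `dataset.columns` instead.
columns_in_file = ["Id", "LotArea", "YearBuilt", "SalePrice"]
wanted = ["LotArea", "YearBuilt", "GrLivArea"]
print(missing_columns(columns_in_file, wanted))
```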

### Load data function
This function loads a dataset from a CSV file.

In [3]:
def load_data(path: str, describe: bool = False) -> DataFrame:
    """Loads a dataset from a csv file.

    Args:
        path (str): Path to the csv file.
        describe (bool, optional): Whether to print summary statistics of the dataset. Defaults to False.

    Returns:
        DataFrame: the dataset from the csv file
    """
    dataset = pd.read_csv(path)
    if describe:
        print("Dataset details: \n", dataset.describe())
    return dataset

### Create model function
This function creates a random forest model with a given number of trees and returns the model together with its MAE on the validation split.

In [4]:
def random_forest(train_dataset: DataFrame, test_size: float = 0.3, nodes: int = 100, save: bool = False) -> tuple[RandomForestRegressor, float]:
    """Splits train_dataset into two subsets: one for fitting and one for validation.
    Then creates the model, fits it on the fitting subset, predicts on the validation subset, and calculates the MAE.

    Args:
        train_dataset (DataFrame): The data that will be used for training and validation.
        test_size (float, optional): The fraction of the data held out for validation. Defaults to 0.3.
        nodes (int, optional): The number of trees in the forest. Defaults to 100.
        save (bool, optional): A value that indicates whether or not the model will be saved as a .pkl file. Defaults to False.

    Returns:
        tuple[RandomForestRegressor, float]: A tuple that contains the model and its MAE.
    """
    X = train_dataset[HOUSE_FEATURES]
    y = train_dataset[TARGET_PREDICTION]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=test_size)

    model = RandomForestRegressor(
        random_state=1,
        n_estimators=nodes
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mae = mean_absolute_error(y_test, predictions)

    if save:
        with open(f"models/random_forrest_n{nodes}.pkl", 'wb') as f:
            pickle.dump(model, f)
    return (model, mae)

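
A model saved with `save=True` can later be restored with `pickle.load`. The sketch below assumes a hypothetical helper, `load_model`, and demonstrates the round trip with a plain dictionary standing in for the fitted model, so that it runs without any training. Only unpickle files you trust, because loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile


def load_model(path):
    """Restore an object that was previously written with pickle.dump."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object; in the notebook this would be the fitted RandomForestRegressor.
stand_in = {"n_estimators": 100}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(stand_in, f)
    restored = load_model(path)

print(restored == stand_in)
```

Note that the `models/` directory must already exist before saving, since `open(..., 'wb')` does not create missing directories.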

### Best random forest
This function returns the random forest with the lowest validation MAE among the candidate tree counts, which is how this notebook tries to avoid overfitting and underfitting.

In [5]:
def best_random_forest(train_dataset: DataFrame, validation_data: DataFrame,
                       nodes_list: list[int], test_size: float = 0.3,
                       save: bool = False) -> tuple[RandomForestRegressor, ndarray, float, int]:
    """Iterates through a list of tree counts, fits one model per count, and calculates the MAE of each.
    Returns only the model with the lowest MAE.

    Args:
        train_dataset (DataFrame): Data used for fitting and validation.
        validation_data (DataFrame): New data used for predictions.
        nodes_list (list[int]): A list with numbers of trees.
        test_size (float, optional): The fraction of the training data held out for validation. Defaults to 0.3.
        save (bool, optional): A value that indicates whether or not the model will be saved as a .pkl file. Defaults to False.

    Returns:
        tuple[RandomForestRegressor, ndarray, float, int]: The best model, its predictions on the new data, its MAE, and its number of trees.
    """
    best_nodes = nodes_list[0]
    best_model, min_mae = random_forest(train_dataset, nodes=nodes_list[0], test_size=test_size)
    # The first candidate is already fitted above, so only the remaining ones are evaluated here.
    for nodes in nodes_list[1:]:
        model, mae = random_forest(train_dataset=train_dataset, nodes=nodes, test_size=test_size)
        if mae < min_mae:
            best_nodes = nodes
            min_mae = mae
            best_model = model
    if save:
        with open(f"models/random_forrest_n{best_nodes}.pkl", 'wb') as f:
            pickle.dump(best_model, f)
    X_val = validation_data[HOUSE_FEATURES]
    predictions = best_model.predict(X_val)
    return (best_model, predictions, min_mae, best_nodes)
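
Because every candidate is scored on a single train/validation split, small MAE differences between tree counts may be due to the split rather than the model. One possible alternative, not used in this notebook, is to average the MAE over several folds with `cross_val_score`. The sketch below uses synthetic data from `make_regression` and small tree counts so that it runs quickly. The helper name `cv_mae` and all values are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_mae(X, y, n_trees, folds=3):
    """Return the MAE averaged over `folds` cross-validation folds."""
    model = RandomForestRegressor(n_estimators=n_trees, random_state=1)
    # scikit-learn reports the negated MAE so that higher is better; flip the sign.
    scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=folds)
    return -scores.mean()


# Synthetic stand-in for the house price data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
candidates = [10, 50]
results = {n: cv_mae(X, y, n) for n in candidates}
best_n = min(results, key=results.get)
print(best_n, results[best_n])
```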

### Finding and training the best model for the house prices problem

In [8]:
nodes_list = [50, 500, 1000, 2000, 3000, 5000] # Candidate numbers of trees, one model per value
train_dataset = load_data("dataset/train.csv") # forward slashes work on Windows, Linux and macOS
test_dataset = load_data("dataset/test.csv")
model, prediction, mae, nodes = best_random_forest(
    train_dataset=train_dataset,
    validation_data=test_dataset, 
    nodes_list=nodes_list, 
    save=True) # the best model will be saved as a .pkl file

print(f"Number of trees: {nodes}\n")
print(f"MAE: {mae}\n")

Number of trees: 3000

MAE: 18934.660216496344
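
The `prediction` array returned above is not used further in the notebook. If the predictions are meant to be stored, for example as a competition-style submission, they can be written to a CSV file. The sketch below is an assumption about that workflow: the helper `write_submission`, the `Id` column name, and the sample values are hypothetical and should be adapted to the actual dataset.

```python
import csv
import os
import tempfile


def write_submission(ids, predictions, path):
    """Write one row per house with its identifier and predicted sale price."""
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "SalePrice"])  # assumed header names
        for house_id, price in zip(ids, predictions):
            writer.writerow([house_id, float(price)])


# Illustrative values only, not real predictions.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "submission.csv")
    write_submission([1, 2], [125000.5, 160000.0], out_path)
    with open(out_path, newline="") as f:
        rows = list(csv.reader(f))

print(rows)
```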

