Now that we know which models perform best, it's time to cross-validate and tune hyperparameters.
- Grid search handles both tasks at once: it scores every hyperparameter combination using cross-validation.
- Search online for typical hyperparameter ranges for each type of model.
- Check out RandomizedSearchCV for faster computation with large grids.
- If you have access to a GPU, XGBoost can make use of it, but this requires additional parameters (for example, device="cuda" in XGBoost 2.0 and later).
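
As a rough illustration of the randomized search mentioned above, here is a minimal sketch. It assumes synthetic data from make_regression as a stand-in for the housing features, and the parameter ranges are illustrative only, not recommendations. Note that a built-in search like this does not recompute a per-fold feature (such as a city mean price) unless that step lives inside the estimator, which is one reason to build folds by hand.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in for the housing features and sold prices (assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# Illustrative ranges only; sensible values depend on the dataset
param_distributions = {
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=20, random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # sample 5 of the 72 possible combinations
    scoring="r2",
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X, y)

best_params = search.best_params_  # dict holding the winning combination
best_score = search.best_score_    # mean cross-validated R^2 for that combination
```

Raising n_iter trades computation time for a more thorough search; GridSearchCV takes the same grid and tries every combination.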

In [1]:
# perform tuning and cross validation here

import pandas as pd

cities = pd.read_csv('../data/cities.csv')
X_train = pd.read_csv('../data/X_train.csv')
y_train = pd.read_csv('../data/y_train.csv')
X_test = pd.read_csv('../data/X_test.csv')
y_test = pd.read_csv('../data/y_test.csv')

CV_df = pd.concat([cities, X_train, y_train], axis = 1)

In [2]:
from sklearn.model_selection import KFold


In [3]:
CV_df.drop(columns='city_mean_sold_price', inplace=True)

In [4]:
def custom_cross_validation(train_df, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    train_df_copy = train_df.copy()
    
    training_folds = []
    validation_folds = []
    
    for train_index, val_index in kf.split(train_df_copy):
        train_fold, val_fold = train_df_copy.iloc[train_index], train_df_copy.iloc[val_index]

        city_mean_train = train_fold.groupby('location.address.city')['description.sold_price'].mean()

        # Merge mean price into both training and validation folds
        train_fold = train_fold.merge(city_mean_train, left_on='location.address.city', right_index=True, how='left', suffixes=('', '_city_mean'))
        val_fold = val_fold.merge(city_mean_train, left_on='location.address.city', right_index=True, how='left', suffixes=('', '_city_mean'))

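        # Fill cities not seen in the training fold with the training fold's overall mean.
        # Using the mean of the full data here would leak validation prices into the features.
        global_mean = train_fold['description.sold_price'].mean()
        train_fold['description.sold_price_city_mean'] = train_fold['description.sold_price_city_mean'].fillna(global_mean)
        val_fold['description.sold_price_city_mean'] = val_fold['description.sold_price_city_mean'].fillna(global_mean)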

        # Drop the city column from both folds
        train_fold.drop(columns=['location.address.city'], inplace=True)
        val_fold.drop(columns=['location.address.city'], inplace=True)

        training_folds.append(train_fold)
        validation_folds.append(val_fold)

    return training_folds, validation_folds

In [5]:
training_folds, validation_folds = custom_cross_validation(CV_df)

In [6]:
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

In [7]:
training_folds[0].shape 

(3363, 57)

In [8]:
validation_folds[0].shape

(841, 57)

In [9]:
for i, train_fold in enumerate(training_folds):
    nan_indices = train_fold.isnull().any(axis=1)
    if nan_indices.any():
        print(f"Training fold {i+1} contains NaN values.")
        print(train_fold[nan_indices])
    else:
        print(f"Training fold {i+1} does not contain any NaN values.")

Training fold 1 does not contain any NaN values.
Training fold 2 does not contain any NaN values.
Training fold 3 does not contain any NaN values.
Training fold 4 does not contain any NaN values.
Training fold 5 does not contain any NaN values.


In [10]:
from sklearn.ensemble import RandomForestRegressor
import numpy as np
import itertools

def hyperparameter_search(training_folds, validation_folds, param_grid):
    """Score every combination in param_grid on the prebuilt folds."""
    target = 'description.sold_price'
    param_names = list(param_grid.keys())
    all_params = []
    all_r2_scores = []

    for values in itertools.product(*param_grid.values()):
        # Map this combination of values back to its parameter names
        params = dict(zip(param_names, values))
        fold_scores = []

        for train_fold, val_fold in zip(training_folds, validation_folds):
            rf = RandomForestRegressor(**params)

            X_train_fold = train_fold.drop(columns=[target])
            y_train_fold = train_fold[target]
            X_val_fold = val_fold.drop(columns=[target])
            y_val_fold = val_fold[target]

            rf.fit(X_train_fold, y_train_fold)

            # For regressors, score() returns R^2
            fold_scores.append(rf.score(X_val_fold, y_val_fold))

        all_params.append(params)
        all_r2_scores.append(fold_scores)

    # Average R^2 across folds, then pick the best combination
    avg_r2_scores = np.mean(all_r2_scores, axis=1)
    best_params = all_params[np.argmax(avg_r2_scores)]  # dict of name -> value

    return avg_r2_scores, best_params


In [11]:
# avg_r2_scores, best_params = hyperparameter_search(training_folds, validation_folds, param_grid)

In [12]:
# best_params

We want to make sure that we save our models. Historically, a model was simply pickled (serialized). Now, however, certain model types have their own save formats. A model from sklearn can be pickled. An XGBoost model can also be pickled, but the newer option is its native JSON format. It's a good idea to stay with the most current method for each library.

In [13]:
best_model = RandomForestRegressor(n_estimators= 100, max_depth= 30, 
                                   min_samples_split= 2, min_samples_leaf= 1, 
                                   max_features= 'sqrt').fit(X_train, np.array(y_train).ravel())

print(best_model.score(X_test, y_test))

0.9958753792639732


In [14]:
import pickle

with open('../models/tuned_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

Once you've identified which model works best, implement a pipeline to make sure that you haven't leaked any data and that the model can be easily deployed if desired.
- Beware that every step in a pipeline except the last must implement fit and transform methods; the final step only needs fit.
- Custom classes can provide these methods, and sklearn's FunctionTransformer wraps a user-defined function so that it can be used as a step.
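
To make the FunctionTransformer point concrete, here is a minimal sketch. The log_sqft helper, the column names, and the toy data are hypothetical and are not taken from functions_variables; the sketch only shows how a plain function can become a pipeline step.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def log_sqft(df):
    """Hypothetical helper: return a copy with the sqft column log-transformed."""
    out = df.copy()
    out["sqft"] = np.log(out["sqft"])
    return out


# Toy data standing in for the real listings (assumption)
X = pd.DataFrame({
    "sqft": [800.0, 1200.0, 1500.0, 2400.0],
    "beds": [2.0, 3.0, 3.0, 4.0],
})
y = np.array([150_000.0, 210_000.0, 250_000.0, 400_000.0])

pipe = Pipeline([
    ("log_sqft", FunctionTransformer(log_sqft)),  # wraps the plain function
    ("scale", MinMaxScaler()),                    # learns min/max only from the data passed to fit
    ("model", LinearRegression()),                # final step only needs fit
])
pipe.fit(X, y)
predictions = pipe.predict(X)

# Output of every step before the model, useful for inspection
transformed = pipe[:-1].transform(X)
```

Because the scaler is fitted inside the pipeline, calling pipe.fit on training data alone keeps test data out of the scaling statistics.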

In [15]:
# Build pipeline 
from functions_variables import *
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline

directory = '../data'
train = pd.read_csv('../data/X_train.csv')
columns_to_keep = list(train.columns)

columns_to_drop_na = ['description.type', 'description.year_built',
                      'description.lot_sqft', 'description.sqft', 'location.address.coordinate.lon',
                      'location.address.coordinate.lat']

columns_fill_values = {
    'description.baths_3qtr': 0,
    'description.baths_full': 0,
    'description.baths_half': 0,
    'description.baths': 0,
    'description.garage': 0,
    'description.beds': 0,
    'description.sub_type': 'N/A',
    'location.address.city': 'N/A',
    'description.stories': 1
}

columns_to_log = ['description.lot_sqft', 'description.sqft']

In [16]:
from sklearn.base import TransformerMixin

class LogTransform(TransformerMixin):
    """Drop rows with zeros in the given columns, then log-transform those columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; present so the class works as a pipeline step
        return self

    def transform(self, X):
        X_transformed = X.copy()

        # Drop rows with zero values in specified columns (log(0) is undefined)
        for col in self.columns:
            X_transformed = X_transformed[X_transformed[col] != 0]

        # Apply logarithmic transformation
        for col in self.columns:
            X_transformed[col] = np.log(X_transformed[col])

        return X_transformed

In [17]:
pipeline = Pipeline([
    ('get_dataframe', FunctionTransformer(get_dataframe)),
    ('encode_tags', TagsEncoder()),
    ('drop_NAs', DropMissingValues(columns=columns_to_drop_na)),
    ('fill_NAs', FillMissingValues(fill_values_dict=columns_fill_values)),
    ('transform_types', TypeTransformer()),
    ('merge_city_means', MergeAndImputeTransformer('../data/city_means.csv')),
    ('log_transform', LogTransform(columns_to_log)),
    ('select_columns', ColumnSelector(columns_to_keep)),
    ('scale', PretrainedMinMaxScale('../models/scaler.pkl')),
    ('predict', PredictionsFromModel('../models/tuned_model.pkl'))
])

df, pred = pipeline.fit_transform(directory)

In [18]:
df.shape

(6354, 56)

In [21]:
pred.shape

(6354,)

Pipelines come from sklearn. When a fitted pipeline is pickled, every step is stored with it, including whatever each step learned during fitting. For example, if we were deploying a model and had fit a scaler on the training data, we would want that same, already fitted scaler to transform the new data. Pickling the pipeline preserves it.
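
The following is a minimal sketch of that behaviour, assuming a small synthetic dataset and a two-step pipeline rather than the one built in this notebook. It pickles a fitted pipeline and checks that the restored copy carries the same fitted scaler.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic training data and "new" data (assumption, for illustration only)
rng = np.random.default_rng(42)
X_train = rng.uniform(0, 100, size=(50, 3))
y_train = X_train @ np.array([1.0, 2.0, 3.0])
X_new = rng.uniform(0, 100, size=(5, 3))

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("model", RandomForestRegressor(n_estimators=10, random_state=42)),
]).fit(X_train, y_train)

# Serialize to bytes here; pickle.dump to an open file works the same way
payload = pickle.dumps(pipe)
restored = pickle.loads(payload)

# The restored scaler carries the minimum values learned from X_train
same_scaler = np.array_equal(pipe["scale"].data_min_, restored["scale"].data_min_)
same_predictions = np.allclose(pipe.predict(X_new), restored.predict(X_new))
```

One caveat: pickle stores references to classes, not their source code, so any custom transformer classes must be importable wherever the pipeline is loaded.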

In [19]:
# save your pipeline here