Now that we know which models are performing better, it's time to perform cross validation and tune hyperparameters.
- Do a google search for hyperparameter ranges for each type of model.

GridSearch/RandomSearch (sklearn's `GridSearchCV` and `RandomizedSearchCV`) are great methods for checking off both of these tasks.
- BUT we have a problem - if we calculated a numerical value to encode city (such as the mean of sale prices in that city) on the full training data, we can't safely cross validate with the built-in tools
- The rows in each validation fold were part of the original calculation of the mean for that city - that means we're leaking information!
- While sklearn's built in functions are extremely useful, sometimes it is necessary to do things ourselves
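
To make the leak concrete, here is a minimal sketch on toy data. The column names `city` and `sold_price` and the choice of validation rows are assumptions made for illustration, not the project's real schema.

```python
import pandas as pd

# Toy data with hypothetical column names
df = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B"],
    "sold_price": [100.0, 200.0, 900.0, 50.0, 60.0, 70.0],
})

# Pretend the last row of each city forms the validation fold
val_idx = [2, 5]
train = df.drop(index=val_idx)
val = df.loc[val_idx]

# Leaky: means computed on ALL rows, so validation prices feed into the feature
leaky_means = df.groupby("city")["sold_price"].mean()

# Safe: means computed on the training rows only, then mapped onto validation
safe_means = train.groupby("city")["sold_price"].mean()

comparison = pd.DataFrame({
    "city": val["city"].values,
    "true_price": val["sold_price"].values,
    "leaky_feature": val["city"].map(leaky_means).values,
    "safe_feature": val["city"].map(safe_means).values,
})
print(comparison)
```

In this toy data the leaky feature for city A is 400.0, pulled up by the 900.0 validation row, while the training-only value is 150.0. The leaky version has already "seen" part of the answer it is being validated against.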

You need to create two functions to replicate what GridSearch does under the hood:

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits. 
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data. 
- Within your function, you should compute the city means on the training folds just like you did in Notebook 1 - you may have to re-join the city column to do this - and then join these values to the validation fold
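
One possible shape for the fold loop is sketched below, assuming the raw city column has been re-joined as a single `city` column (rather than one-hot columns). The function name `fold_city_means`, the toy column names, and the fallback to the training fold's overall mean for unseen cities are all illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import KFold

def fold_city_means(data, city_col, target_col, n_splits=3, seed=0):
    """Yield (train_fold, val_fold) pairs with a fold-specific city mean feature."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(data):
        train_fold = data.iloc[train_idx].copy()
        val_fold = data.iloc[val_idx].copy()

        # City means come from the training rows of this fold only
        means = train_fold.groupby(city_col)[target_col].mean()
        # Assumed fallback for cities missing from the training rows
        fallback = train_fold[target_col].mean()

        train_fold["city_mean_price"] = train_fold[city_col].map(means)
        val_fold["city_mean_price"] = val_fold[city_col].map(means).fillna(fallback)
        yield train_fold, val_fold

toy = pd.DataFrame({
    "city": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "sold_price": [10.0, 20.0, 30.0, 100.0, 110.0, 120.0, 5.0, 6.0, 7.0],
})
folds = list(fold_city_means(toy, "city", "sold_price"))
for train_fold, val_fold in folds:
    print(len(train_fold), len(val_fold))
```

Using `groupby` plus `map` avoids looping over one-hot columns, at the cost of needing the original city column to be available.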

**`hyperparameter_search()`**
- Should take the validation and training splits from your previous function, along with your dictionary of hyperparameter values
- For each set of hyperparameter values, fit your chosen model on each set of training folds, score it on the matching validation fold, and take the average of your chosen scoring metric across folds. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values
- Your function should output the hyperparameter values corresponding to the best average score across all folds (the highest for a score such as R², the lowest for an error such as MSE). Alternatively, it could also output a model object fit on the full training dataset with these parameters.
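
The grid expansion itself can be sketched in a few lines. The grid values and the stand-in scoring function `fake_cv_error` below are made up for illustration. In the real function, the score would come from fitting the model on each training fold and averaging the validation metric.

```python
import itertools

# Example grid only; real ranges depend on the model being tuned
param_grid = {"max_depth": [10, 20], "n_estimators": [50, 100]}

keys = list(param_grid)
combos = [dict(zip(keys, values)) for values in itertools.product(*param_grid.values())]

def fake_cv_error(params):
    # Stand-in for "fit on each training fold, score on validation, average"
    return abs(params["max_depth"] - 20) + abs(params["n_estimators"] - 50)

# For an error metric we want the minimum; for a score like R² use max() instead
best = min(combos, key=fake_cv_error)
print(len(combos), best)
```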

Docstrings have been provided below to get you started. Once you're done developing your functions, you should move them to `functions_variables.py` to keep your notebook clean 

Bear in mind that these instructions are just one way to tackle this problem - the inputs and output formats don't need to be exactly as specified here.

In [36]:
import os
import joblib
import pandas as pd
import itertools
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

In [37]:
def custom_cross_validation(training_data, city_prefix, target_column, n_splits=5):
    '''Creates n_splits sets of training and validation folds.

    Each fold gets a 'city_mean_price' column computed from that fold's
    training rows only, so validation rows never contribute to their own feature.
    '''
    # One-hot encoded city columns, identified by their shared prefix
    city_columns = [col for col in training_data.columns if col.startswith(city_prefix)]
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    training_folds = []
    validation_folds = []

    for train_index, val_index in kf.split(training_data):
        train_fold = training_data.iloc[train_index].copy()
        val_fold = training_data.iloc[val_index].copy()

        # Start with NaN so the column exists even if no city flag matches
        train_fold['city_mean_price'] = np.nan
        val_fold['city_mean_price'] = np.nan

        for city in city_columns:
            # Mean sale price for this city, from the training fold only
            city_mean = train_fold.loc[train_fold[city] == 1, target_column].mean()
            # A city absent from the training fold yields NaN here
            train_fold.loc[train_fold[city] == 1, 'city_mean_price'] = city_mean
            val_fold.loc[val_fold[city] == 1, 'city_mean_price'] = city_mean

        training_folds.append(train_fold)
        validation_folds.append(val_fold)

    return training_folds, validation_folds

def hyperparameter_search(training_folds, validation_folds, param_grid,
                          target_column='description.sold_price'):
    '''Outputs the best combination of hyperparameter settings in the param grid.

    Returns (best_params, best_score), where best_score is the lowest
    mean squared error averaged across the validation folds.
    '''
    best_params = None
    best_score = float('inf')  # MSE: lower is better
    param_combinations = list(itertools.product(*param_grid.values()))

    for params in param_combinations:
        param_dict = dict(zip(param_grid.keys(), params))
        scores = []

        for train_fold, val_fold in zip(training_folds, validation_folds):
            model = RandomForestRegressor(**param_dict)
            X_train, y_train = train_fold.drop(columns=[target_column]), train_fold[target_column]
            X_val, y_val = val_fold.drop(columns=[target_column]), val_fold[target_column]

            # Fit on the training fold, score on the matching validation fold
            model.fit(X_train, y_train)
            y_pred = model.predict(X_val)
            scores.append(mean_squared_error(y_val, y_pred))

        avg_score = np.mean(scores)

        # Keep the combination with the lowest average error
        if avg_score < best_score:
            best_score = avg_score
            best_params = param_dict

    return best_params, best_score

def load_model(model_path):
    '''Load the trained model from the specified path.'''
    return joblib.load(model_path)

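def preprocess_data(data, imputer=None):
    '''Impute missing values, preferably with the imputer fitted on the training data.'''
    if imputer is None:
        # Fallback: fitting on the new data does NOT reproduce the training-set means
        imputer = SimpleImputer(strategy='mean')
        return imputer.fit_transform(data)
    return imputer.transform(data)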

def predict(data, model):
    '''Predict using the loaded model.'''
    predictions = model.predict(data)
    return predictions

def main(input_json_path, model_path, output_csv_path):
    '''Main function to load data, preprocess, load model, and predict.'''
    # Load new data
    new_data = pd.read_json(input_json_path)

    # Extract features (new data may not include the target column)
    features = new_data.drop(columns=['description.sold_price'], errors='ignore')

    # Preprocess features
    preprocessed_features = preprocess_data(features)

    # Load model
    model = load_model(model_path)

    # Predict
    predictions = predict(preprocessed_features, model)

    # Save predictions
    output_df = new_data.copy()
    output_df['predictions'] = predictions
    output_df.to_csv(output_csv_path, index=False)
    print(f"Predictions saved to {output_csv_path}")

We want to make sure that we save our models. In the old days, one simply pickled (serialized) the model. Now, however, certain model types have their own save format. A model from sklearn can be pickled. An xgboost model, for example, can also be pickled, but its newest save format is JSON. It's a good idea to stay with the most current methods.
- you may want to create a new `models/` subdirectory in your repo to stay organized

Once you've identified which model works the best, implement a prediction pipeline to make sure that you haven't leaked any data, and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions
- Assume that the new data is in the same JSON format as your original data - you can use your original data to check that the pipeline works correctly
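- Beware that every step of a pipeline except the last must be an object with `fit` and `transform` methods (the final step only needs `fit`), so a plain function can't be used as a step directly.
- Classes can be used to get around this, but sklearn now also has a wrapper for user-defined functions: `sklearn.preprocessing.FunctionTransformer`.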
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py` 
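
As a rough illustration of wrapping a user-defined function as a pipeline step, the sketch below uses `sklearn.preprocessing.FunctionTransformer`. The function `add_log_sqft` and the `sqft`/`beds` columns are hypothetical and stand in for whatever stateless cleaning steps your own data needs.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def add_log_sqft(frame):
    """Hypothetical stateless step: add a log-transformed copy of 'sqft'."""
    out = frame.copy()
    out["log_sqft"] = np.log1p(out["sqft"])
    return out

pipe = Pipeline([
    ("add_log_sqft", FunctionTransformer(add_log_sqft)),  # wraps the plain function
    ("impute", SimpleImputer(strategy="mean")),           # learns column means in fit
])

toy = pd.DataFrame({"sqft": [1000.0, np.nan, 2000.0], "beds": [2.0, 3.0, np.nan]})
transformed = pipe.fit_transform(toy)
print(transformed.shape)
```

`FunctionTransformer` suits stateless steps. Anything that has to learn values from the training data, such as per-city mean prices, still needs a class with its own `fit` and `transform`.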

Pipelines come from sklearn.  When a pipeline is pickled, all of the information in the pipeline is stored with it.  For example, if we were deploying a model, and we had fit a scaler on the training data, we would want the same, already fitted scaling object to transform the new data with.  This is all stored when the pipeline is pickled.
- save your final pipeline in your `models/` folder
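
A small sketch of that round trip, using synthetic data and a temporary directory in place of the real `models/` folder. The file name `pipeline_example.joblib` and the scaler-plus-linear-regression pipeline are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data, for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(30, 2))
y_train = X_train @ np.array([1.5, -2.0]) + rng.normal(scale=0.5, size=30)

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X_train, y_train)

# Save and reload the whole fitted pipeline as one object
path = os.path.join(tempfile.mkdtemp(), "pipeline_example.joblib")
joblib.dump(pipe, path)
restored = joblib.load(path)

# The restored scaler carries the means learned from the training data
scaler_mean_kept = np.allclose(
    pipe.named_steps["scale"].mean_, restored.named_steps["scale"].mean_
)

# New data is scaled with those stored means, so predictions match
X_new = rng.normal(loc=50.0, scale=10.0, size=(5, 2))
same_predictions = np.allclose(pipe.predict(X_new), restored.predict(X_new))
print(scaler_mean_kept, same_predictions)
```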

In [38]:
# Main workflow
df = pd.read_csv('e:/Vocational/Lighthouse Labs/Flex Course/Projects/P02_Midterm_Supervised Learning/data_project_midterm/data/processed_data.csv') 

# Define the features and target
features = df.drop(columns=['description.sold_price'])
target = df['description.sold_price']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

# Create training and validation folds from the training set
city_prefix = 'location.address.city_'
target_column = 'description.sold_price'
training_folds, validation_folds = custom_cross_validation(pd.concat([X_train, y_train], axis=1), city_prefix, target_column, n_splits=5)

# Define the parameter grid for hyperparameter search
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Perform hyperparameter search using training folds
best_params, best_score = hyperparameter_search(training_folds, validation_folds, param_grid)

print("Best Parameters:", best_params)
print("Best Score (MSE):", best_score)

# Train the best model on the entire training set and evaluate on the test set
best_model = RandomForestRegressor(**best_params)
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
best_model.fit(X_train_imputed, y_train)
y_pred_test = best_model.predict(X_test_imputed)
test_score = mean_squared_error(y_test, y_pred_test)

print("Test Score (MSE):", test_score)

# Create 'models' directory if it doesn't exist
models_dir = os.path.join(os.path.dirname('e:/Vocational/Lighthouse Labs/Flex Course/Projects/P02_Midterm_Supervised Learning/data_project_midterm/'), 'models')
os.makedirs(models_dir, exist_ok=True)

# Save the best model
model_path = os.path.join(models_dir, 'best_model.joblib')
joblib.dump(best_model, model_path)

print(f"Model saved to {model_path}")

Best Parameters: {'n_estimators': 100, 'max_depth': 20, 'min_samples_split': 2, 'min_samples_leaf': 1}
Best Score (MSE): 2309105936.732564
Test Score (MSE): 5129586332.630376
Model saved to e:/Vocational/Lighthouse Labs/Flex Course/Projects/P02_Midterm_Supervised Learning/data_project_midterm\models\best_model.joblib
