## Hyperparam Tuning

Now that we know which models are performing better, it's time to perform cross validation and tune hyperparameters.
- Do a google search for hyperparameter ranges for each type of model.

GridSearch/RandomSearch are a great methods for checking off both of these tasks.

There is a fairly significant issue with this approach for this particular problem (described below). But in the interest of creating a basic functional pipeline, you can just use the default Sklearn methods for now.

## Preventing Data Leakage in Tuning - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its highly recommended you complete it, if you have time!**

BUT we have a problem - if we calculated a numerical value to encode city (such as the mean of sale prices in that city) on the training data, we can't cross validate 
- The rows in each validation fold were part of the original calculation of the mean for that city - that means we're leaking information!
- While sklearn's built in functions are extremely useful, sometimes it is necessary to do things ourselves

You need to create two functions to replicate what Gridsearch does under the hood. This is a challenging, real world data problem! To help you out, we've created some psuedocode and docstrings to get you started. 

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits. 
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data. 
- Within your function, you should compute the city means on the training folds just like you did in Notebook 1 - you may have to re-join the city column to do this - and then join these values to the validation fold

This psuedocode may help you fill in the function:

```python
kfold = KFold() # fit sklearn k folds on X_train
train_folds = []
val_folds = []
for training_index, val_index in kfold.split(X_train):
    train_fold, val_fold = #.iloc loop variables on X_train

    # recompute training city means like you did in notebook 1 
    # merge to validation fold
        
    train_folds.append(train_fold)
    val_folds.append(val_fold)

    return train_folds, val_folds
```


**`hyperparameter_search()`**
- Should take the validation and training splits from your previous function, along with your dictionary of hyperparameter values
- For each set of hyperparameter values, fit your chosen model on each set of training folds, and take the average of your chosen scoring metric. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values
- Your function should output the hyperparameter values corresponding the highest average score across all folds. Alternatively, it could also output a model object fit on the full training dataset with these parameters.


This psuedocode may help you fill in the function:

```python
hyperparams = # Generate hyperparam options with itertools
hyperparam-scores = []
for hyperparam-combo in hyperparams:

    scores = []

    for folds in allmyfolds:
        # score fold the fold with the model/ hyperparams
        scores.append(score-fold)
        
    score = scores.mean()
    hyperparam-scores.append(score)
# After loop, find max of hyperparam-scores. Best params are at same index in `hyperparams` loop iteratble
```

Docstrings have been provided below to get you started. Once you're done developing your functions, you should move them to `functions_variables.py` to keep your notebook clean 

Bear in mind that these instructions are just one way to tackle this problem - the inputs and output formats don't need to be exactly as specified here.

In [133]:
from sklearn.model_selection import ParameterGrid
from sklearn.metrics import mean_squared_error
import numpy as np
from sklearn.model_selection import KFold

#Developing a Custom Cross-Validation Function

def custom_cross_validation(X_train, y_train, n_splits=5):
    '''Creates n_splits sets of training and validation folds using K-Fold cross-validation.

    Args:
      training_data (pd.DataFrame): The dataframe of features and target to be divided into folds.
      n_splits (int): The number of sets of folds to be created.

    Returns:
      tuple: A tuple of lists, where the first index is a list of the training folds, 
             and the second index is the corresponding validation folds.

    Example:
        >>> output = custom_cross_validation(train_df, n_splits=10)
        >>> output[0][0] # The first training fold
        >>> output[1][0] # The first validation fold
        >>> output[0][1] # The second training fold
        >>> output[1][1] # The second validation fold... etc.
    '''
    training_data = pd.concat([X_train, y_train], axis=1)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=42)  # Shuffle
    train_folds = []
    val_folds = []

    for train_index, val_index in kfold.split(training_data):
        train_fold = training_data.iloc[train_index] 
        val_fold = training_data.iloc[val_index]  
        train_folds.append(train_fold)
        val_folds.append(val_fold)

    return train_folds, val_folds

In [134]:
#Hyperparameter Search Function Creation 
from sklearn.metrics import mean_squared_error
import numpy as np
import itertools

def hyperparameter_search(training_folds, validation_folds, param_grid, model, scoring=mean_squared_error, higher_is_better=False):
    '''
    Performs a custom grid search for the best hyperparameters using k-fold validation.
    
    Args:
      training_folds (list): List of training fold dataframes (features and target concatenated).
      validation_folds (list): List of validation fold dataframes (features and target concatenated).
      param_grid (dict): Dictionary of possible hyperparameter values.
      model: Model that will be used to fit.
      scoring (function): Scoring function to evaluate model performance. Default is mean_squared_error.
      higher is better (bool): If True, higher scores are better; if False, lower scores are better. Default is False. This is to take into account R2 where the larger the score is better
      
    Returns:
      dict: Best hyperparameter settings based on the chosen metric.
    '''
    param_combinations = list(itertools.product(*param_grid.values()))
    param_names = list(param_grid.keys())
    
    best_score = float('-inf') if higher_is_better else float('inf')
    best_params = None
    
    for combination in param_combinations:
        params = dict(zip(param_names, combination))
        scores = []
        print(f"Testing parameters: {params}")
        
        for train_fold, val_fold in zip(training_folds, validation_folds):
            X_train, y_train = train_fold.iloc[:, :-1], train_fold.iloc[:, -1]
            X_val, y_val = val_fold.iloc[:, :-1], val_fold.iloc[:, -1]
            
            model.set_params(**params)
            model.fit(X_train, y_train)
            predictions = model.predict(X_val)
            
            score = scoring(y_val, predictions)
            scores.append(score)
        
        avg_score = np.mean(scores)
        print(f"Average Score: {avg_score:.4f}\n")
        
        if (higher_is_better and avg_score > best_score) or (not higher_is_better and avg_score < best_score):
            best_score = avg_score
            best_params = params
    
    print(f"Best Parameters: {best_params}")
    print(f"Best Score: {best_score:.4f}")
    
    return best_params

## Hyperparam Tuning

In [None]:
# perform tuning and cross validation here 
# using GridsearchCV/ RandomsearchCV (MVP)
# or your custom functions

In [109]:
import pandas as pd
data = pd.read_csv('chosen_features.csv')
data.head()

X = data.dropX = data.drop(columns = ['sold_price'], axis=1) #Dropping Target
y = data['sold_price'] #Target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

  X = data.dropX = data.drop(columns = ['sold_price'], axis=1) #Dropping Target


In [None]:
test = custom_cross_validation(data, n_splits=5)
test

([          sqft  baths  price_reduced_amount         Stdev           Mean  \
  0     2.116150    2.0              0.000000  55875.000000   91146.750000   
  1     2.108643    2.0              2.197969  22159.000000   32307.000000   
  2     2.155242    2.0              0.000000  37301.400000   54429.200000   
  3     2.112335    2.0              2.313039  22159.000000   32307.000000   
  4     2.139868    2.0              2.253121  43379.000000   51585.000000   
  ...        ...    ...                   ...           ...            ...   
  1187  2.210339    2.0              0.000000  32922.500000   38341.500000   
  1188  2.092173    1.0              0.000000  72521.000000  103147.000000   
  1189  2.066444    1.0              0.000000  55144.694444   63721.111111   
  1190  2.115496    1.0              0.000000  72521.000000  103147.000000   
  1191  2.225506    2.0              0.000000  32922.500000   38341.500000   
  
        waterfront  garage  cost_of_living_housing  total_pop

In [111]:
train_folds, val_folds = custom_cross_validation(data, n_splits=5)
train_folds

[      sold_price      sqft  baths  price_reduced_amount         Stdev  \
 0       2.547453  2.116150    2.0              0.000000  55875.000000   
 1       2.516952  2.108643    2.0              2.197969  22159.000000   
 2       2.556025  2.155242    2.0              0.000000  37301.400000   
 3       2.491730  2.112335    2.0              2.313039  22159.000000   
 4       2.567842  2.139868    2.0              2.253121  43379.000000   
 ...          ...       ...    ...                   ...           ...   
 1187    2.502682  2.210339    2.0              0.000000  32922.500000   
 1188    2.525959  2.092173    1.0              0.000000  72521.000000   
 1189    2.424709  2.066444    1.0              0.000000  55144.694444   
 1190    2.564711  2.115496    1.0              0.000000  72521.000000   
 1191    2.538213  2.225506    2.0              0.000000  32922.500000   
 
                Mean  waterfront  garage  cost_of_living_housing  \
 0      91146.750000           0     2.0  

In [108]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20]
}
test1 = hyperparameter_search(train_folds, val_folds, param_grid, RandomForestRegressor(), scoring=r2_score)
test1

Testing parameters: {'n_estimators': 50, 'max_depth': 5}
Average Score: 0.7416

Testing parameters: {'n_estimators': 50, 'max_depth': 10}
Average Score: 0.8714

Testing parameters: {'n_estimators': 50, 'max_depth': 20}
Average Score: 0.8745

Testing parameters: {'n_estimators': 100, 'max_depth': 5}
Average Score: 0.7386

Testing parameters: {'n_estimators': 100, 'max_depth': 10}
Average Score: 0.8704

Testing parameters: {'n_estimators': 100, 'max_depth': 20}
Average Score: 0.8689

Testing parameters: {'n_estimators': 200, 'max_depth': 5}
Average Score: 0.7355

Testing parameters: {'n_estimators': 200, 'max_depth': 10}
Average Score: 0.8664

Testing parameters: {'n_estimators': 200, 'max_depth': 20}
Average Score: 0.8730

Best Parameters: {'n_estimators': 200, 'max_depth': 5}
Best Score: 0.7355


{'n_estimators': 200, 'max_depth': 5}

In [95]:
import xgboost as xgb

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20]
}
test2 = hyperparameter_search(train_folds, val_folds, param_grid, xgb.XGBRegressor(objective='reg:absoluteerror'), scoring=r2_score)
test2

Testing parameters: {'n_estimators': 50, 'max_depth': 5}
Average Score: 0.6580

Testing parameters: {'n_estimators': 50, 'max_depth': 10}
Average Score: 0.8386

Testing parameters: {'n_estimators': 50, 'max_depth': 20}
Average Score: 0.8242

Testing parameters: {'n_estimators': 100, 'max_depth': 5}
Average Score: 0.6580

Testing parameters: {'n_estimators': 100, 'max_depth': 10}
Average Score: 0.8427

Testing parameters: {'n_estimators': 100, 'max_depth': 20}
Average Score: 0.8242

Testing parameters: {'n_estimators': 200, 'max_depth': 5}
Average Score: 0.6580

Testing parameters: {'n_estimators': 200, 'max_depth': 10}
Average Score: 0.8427

Testing parameters: {'n_estimators': 200, 'max_depth': 20}
Average Score: 0.8242

Best Parameters: {'n_estimators': 50, 'max_depth': 5}
Best Score: 0.6580


{'n_estimators': 50, 'max_depth': 5}

We want to make sure that we save our models.  In the old days, one just simply pickled (serialized) the model.  Now, however, certain model types have their own save format.  If the model is from sklearn, it can be pickled, if it's xgboost, for example, the newest format to save it in is JSON, but it can also be pickled.  It's a good idea to stay with the most current methods. 
- you may want to create a new `models/` subdirectory in your repo to stay organized

In [None]:
# save your best model here

## Building a Pipeline (Stretch)

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its highly recommended you complete it if you have time!**

Once you've identified which model works the best, implement a prediction pipeline to make sure that you haven't leaked any data, and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions
- Assume that the new data is in the same JSON format as your original data - you can use your original data to check that the pipeline works correctly
- Beware that a pipeline can only handle functions with fit and transform methods.
- Classes can be used to get around this, but now sklearn has a wrapper for user defined functions.
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py` 

In [73]:
# Build pipeline here
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
import joblib
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
import itertools

#Build Test Data
data = pd.read_csv('chosen_features.csv')

data_split = train_test_split(data, test_size=0.2, random_state=42)

In [149]:
def train_and_evaluate(X_train, y_train, X_test, y_test, best_params, model):
    # Set the best hyperparameters
    model.set_params(**best_params)
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Evaluate Model
    accuracy = r2_score(y_test, y_pred)
    print(f"Model Accuracy (R2 Score): {accuracy:.4f}")

    return model, y_pred

In [None]:
#Test Params: 

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20]
}

pipe = Pipeline(["Cross Validation", custom_cross_validation(X_train,y_train, n_splits=5 ),
                     "Model Prediction", train_and_evaluate(X_train, y_train, X_test,y_test, hyperparameter_search(train_folds, val_folds, param_grid, RandomForestRegressor(), scoring=r2_score),RandomForestRegressor()),
                     ])

Testing parameters: {'n_estimators': 50, 'max_depth': 5}
Average Score: 0.7412

Testing parameters: {'n_estimators': 50, 'max_depth': 10}
Average Score: 0.8674

Testing parameters: {'n_estimators': 50, 'max_depth': 20}
Average Score: 0.8701

Testing parameters: {'n_estimators': 100, 'max_depth': 5}
Average Score: 0.7335

Testing parameters: {'n_estimators': 100, 'max_depth': 10}
Average Score: 0.8681

Testing parameters: {'n_estimators': 100, 'max_depth': 20}
Average Score: 0.8659

Testing parameters: {'n_estimators': 200, 'max_depth': 5}
Average Score: 0.7360

Testing parameters: {'n_estimators': 200, 'max_depth': 10}
Average Score: 0.8725

Testing parameters: {'n_estimators': 200, 'max_depth': 20}
Average Score: 0.8752

Best Parameters: {'n_estimators': 100, 'max_depth': 5}
Best Score: 0.7335
Model Accuracy (R2 Score): 0.5202


Pipelines come from sklearn.  When a pipeline is pickled, all of the information in the pipeline is stored with it.  For example, if we were deploying a model, and we had fit a scaler on the training data, we would want the same, already fitted scaling object to transform the new data with.  This is all stored when the pipeline is pickled.
- save your final pipeline in your `models/` folder

In [None]:
# save your pipeline here