## Hyperparam Tuning

Now that we know which models are performing better, it's time to perform cross validation and tune hyperparameters.
- Do a google search for hyperparameter ranges for each type of model.

GridSearch/RandomSearch are a great methods for checking off both of these tasks.

There is a fairly significant issue with this approach for this particular problem (described below). But in the interest of creating a basic functional pipeline, you can just use the default Sklearn methods for now.

## Preventing Data Leakage in Tuning - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its highly recommended you complete it, if you have time!**

BUT we have a problem - if we calculated a numerical value to encode city (such as the mean of sale prices in that city) on the training data, we can't cross validate 
- The rows in each validation fold were part of the original calculation of the mean for that city - that means we're leaking information!
- While sklearn's built in functions are extremely useful, sometimes it is necessary to do things ourselves

You need to create two functions to replicate what Gridsearch does under the hood. This is a challenging, real world data problem! To help you out, we've created some psuedocode and docstrings to get you started. 

**`custom_cross_validation()`**
- Should take the training data, and divide it into multiple train/validation splits. 
- Look into `sklearn.model_selection.KFold` to accomplish this - the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) shows how to split a dataframe and loop through the indexes of your split data. 
- Within your function, you should compute the city means on the training folds just like you did in Notebook 1 - you may have to re-join the city column to do this - and then join these values to the validation fold

This psuedocode may help you fill in the function:

```python
kfold = KFold() # fit sklearn k folds on X_train
train_folds = []
val_folds = []
for training_index, val_index in kfold.split(X_train):
    train_fold, val_fold = #.iloc loop variables on X_train

    # recompute training city means like you did in notebook 1 
    # merge to validation fold
        
    train_folds.append(train_fold)
    val_folds.append(val_fold)

    return train_folds, val_folds
```


**`hyperparameter_search()`**
- Should take the validation and training splits from your previous function, along with your dictionary of hyperparameter values
- For each set of hyperparameter values, fit your chosen model on each set of training folds, and take the average of your chosen scoring metric. [itertools.product()](https://docs.python.org/3/library/itertools.html) will be helpful for looping through all combinations of hyperparameter values
- Your function should output the hyperparameter values corresponding the highest average score across all folds. Alternatively, it could also output a model object fit on the full training dataset with these parameters.


This psuedocode may help you fill in the function:

```python
hyperparams = # Generate hyperparam options with itertools
hyperparam-scores = []
for hyperparam-combo in hyperparams:

    scores = []

    for folds in allmyfolds:
        # score fold the fold with the model/ hyperparams
        scores.append(score-fold)
        
    score = scores.mean()
    hyperparam-scores.append(score)
# After loop, find max of hyperparam-scores. Best params are at same index in `hyperparams` loop iteratble
```

Docstrings have been provided below to get you started. Once you're done developing your functions, you should move them to `functions_variables.py` to keep your notebook clean 

Bear in mind that these instructions are just one way to tackle this problem - the inputs and output formats don't need to be exactly as specified here.

In [37]:
#This cell loaded in the training/ testing input data & their corresponding target values (prices) 

import pandas as pd

X_train = pd.read_csv("/Users/zarahbaloch/Downloads/X_train_scaled.csv")
X_test = pd.read_csv("/Users/zarahbaloch/Downloads/X_test_scaled.csv")
y_train = pd.read_csv("/Users/zarahbaloch/Downloads/y_train.csv", header=None).iloc[:, 0]
y_test = pd.read_csv("/Users/zarahbaloch/Downloads/y_test.csv", header=None).iloc[:, 0]

training_data = X_train.copy()
train_df["target"] = y_train

training_data.head()

Unnamed: 0,list_price,description.year_built,description.lot_sqft,description.sqft,description.baths,description.garage,description.stories,location.address.coordinate.lon,location.address.coordinate.lat,city_mean_price,state_mean_price,total_rooms
0,2.778383,-1.948745,0.507606,3.041264,3.417552,0.844051,0.546019,0.006553,1.013723,0.373691,0.118948,3.667236
1,-0.086086,0.193304,-0.202453,0.560428,-0.124811,-0.003403,-0.784779,1.15187,0.686155,0.845089,0.711957,0.051856
2,-0.551562,-1.559281,-1.710199,-0.141726,-0.124811,-0.850858,,1.005019,0.908108,-0.540734,-0.622041,1.084821
3,-0.004727,1.556426,1.186098,-0.021335,-0.124811,0.844051,0.546019,0.88879,0.00158,-0.14475,-0.394479,0.051856
4,-0.504815,-0.196159,-0.03157,-1.20179,-1.010402,-0.003403,-0.784779,0.22197,0.172715,-1.393877,-0.709501,-0.464627


In [150]:
#This function splits the training data into 5 for manual k-fold cross-validation (n=5)
def custom_cross_validation(training_data, n_splits=5):

    #Shuffling the dataset to prevent bias (50 was a random number chosen)
    training_data = training_data.sample(frac=1, random_state=50).reset_index(drop=True)

    #Calculates number of folds + empty lists to store the splits
    fold_size = len(training_data) // n_splits
    validation_folds = []
    training_folds = []

    #Loop over the number of folds (5 times over) 
    for i in range(n_splits):
        start = i * fold_size
        end = start + fold_size if i < n_splits - 1 else len(training_data)

        #Assigns data to the validation set (and remainder is assigned to training) 
        validationfold = training_data.iloc[start:end]
        trainfold = pd.concat([training_data.iloc[:start], training_data.iloc[end:]])

        #Adds each fold pair to lists 
        validation_folds.append(validationfold)
        training_folds.append(trainfold)
    
    return training_folds, validation_folds


## Hyperparam Tuning

In [153]:
#This cell tests the custom cross-validation function to ensure it is working

#Calling on custom_cross_validation() to split data in n-folds
training_folds, validation_folds = custom_cross_validation(train_df, n_splits=5)

print(f"Total folds created: {len(training_folds)}")
print(f"Size of 1st training fold: {training_folds[0].shape}")
print(f"Size of 1st validation fold: {validation_folds[0].shape}")

Total folds created: 5
Size of 1st training fold: (4184, 13)
Size of 1st validation fold: (1045, 13)


In [162]:
#Defining & Establishing a dictionary of hyperparameter values 
        #max_depth= how deep a tree can go 
        #min_samples_split= minimum number of samples required to split a node
        #min_samples_leaf= minimum number of samples that must be in a leaf node
param_grid = {
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

#generate every possible combination of values
from itertools import product

#creating the list of keys & their corresponding values
keys = list(param_grid.keys())
values = list(param_grid.values())
for combo in combinations:
    print(combo)

(10, 2, 1)
(10, 2, 2)
(10, 5, 1)
(10, 5, 2)
(20, 2, 1)
(20, 2, 2)
(20, 5, 1)
(20, 5, 2)


In [168]:
#This cell is trying all combinations of hyperparameters, evaluating each using k-fold cross-validation then picking the combination that performs the best (lowest RMSE) 

#Defining the function
def hyperparameter_search(training_folds, validation_folds, param_grid):
    from itertools import product
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import root_mean_squared_error
    import numpy as np

   
    keys = list(param_grid.keys())
    values = list(param_grid.values())
    combinations = [dict(zip(keys, v)) for v in product(*values)]

    #start with a high RMSE so anything lower will be best & holds the best combination once found
    bestscore = float("inf")
    bestparameter = None

    for combo in combinations:
        scores = []

        #inner loop for k-fold cross-validation (manual) 
        for train_fold, val_fold in zip(training_folds, validation_folds):
            X_train = train_fold.drop(columns=["target"])
            y_train = train_fold["target"]
            X_val = val_fold.drop(columns=["target"])
            y_val = val_fold["target"]

            #Creates model with hyperparameter combo, fits data and predicts housing prices on fold
            model = RandomForestRegressor(**combo, random_state=50)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_val)

            #calculates and saves folds score to list
            rmse = root_mean_squared_error(y_val, y_pred)
            scores.append(rmse)

        #determines best paramters through average RMSE
        averagescore = np.mean(scores)
        if averagescore < bestscore:
            bestscore = averagescore
            bestparameter = combo
            
    return bestparameter

#Tells us which tree configuration performed the best on average throughout all folds.
best_combo = hyperparameter_search(training_folds, validation_folds, param_grid)
print(best_combo)


{'max_depth': 20, 'min_samples_split': 2, 'min_samples_leaf': 1}


We want to make sure that we save our models.  In the old days, one just simply pickled (serialized) the model.  Now, however, certain model types have their own save format.  If the model is from sklearn, it can be pickled, if it's xgboost, for example, the newest format to save it in is JSON, but it can also be pickled.  It's a good idea to stay with the most current methods. 
- you may want to create a new `models/` subdirectory in your repo to stay organized

In [244]:
# Training model through chosen hyperparameters

from sklearn.ensemble import RandomForestRegressor

#drop rows where target is missing
train_df = train_df.dropna(subset=["target"])

#splitting of the training data into features & targets 
X_final_train = train_df.drop(columns=["target"])
y_final_train = train_df["target"]

#final creation of model instance with the best hyperparameter combo
final_model = RandomForestRegressor(**best_combo, random_state=50)

#training the model in question 
final_model.fit(X_final_train, y_final_train)


In [246]:
#create the 'models' folder
import os
os.makedirs("models", exist_ok=True)

#saving the trained model into the 'models' folder
import joblib
joblib.dump(final_model, "models/final_rf_model.pkl")

['models/final_rf_model.pkl']

## Building a Pipeline (Stretch)

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but its highly recommended you complete it if you have time!**

Once you've identified which model works the best, implement a prediction pipeline to make sure that you haven't leaked any data, and that the model could be easily deployed if desired.
- Your pipeline should load the data, process it, load your saved tuned model, and output a set of predictions
- Assume that the new data is in the same JSON format as your original data - you can use your original data to check that the pipeline works correctly
- Beware that a pipeline can only handle functions with fit and transform methods.
- Classes can be used to get around this, but now sklearn has a wrapper for user defined functions.
- You can develop your functions or classes in the notebook here, but once they are working, you should import them from `functions_variables.py` 

In [249]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import joblib

#load features & target 
X = train_df.drop(columns=["target"])
y = train_df["target"]

#detect column types (numeric or categoric) 
numericfeatures = X.select_dtypes(include=["int64", "float64"]).columns.tolist()
categoricalfeatures = X.select_dtypes(include=["object", "category"]).columns.tolist()

#create pre-processing steps depending on data types 
#numeric data preprocessing
numericprocessor = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

#categorical data preprocessing
categoricalprocessor = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])

# create one ultimate preprocessor 
preprocessor = ColumnTransformer([
    ("num", numericprocessor, numericfeatures),
    ("cat", categoricalprocessor, categoricalfeatures)
])

# construct the final pipeline
final_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(
        max_depth=20, min_samples_split=2, min_samples_leaf=1, random_state=50))
])

# train the pipeline
final_pipeline.fit(X, y)



In [251]:
#final pipeline saved to 'models' folder
joblib.dump(final_pipeline, "models/final_pipeline.pkl")

['models/final_pipeline.pkl']

In [284]:
#Test the pipeline with a random dataset provided 

#Load & Normalize Data 
import pandas as pd
import json

with open("/Users/zarahbaloch/Downloads/MI_Lansing_2.json", "r") as f:
    results = raw_json["data"]["results"]
    df = pd.json_normalize(results)

#carrying out feature engineering to help with model context
df["total_rooms"] = df["description.beds"].fillna(0) + df["description.baths"].fillna(0)
df["list_price"] = pd.to_numeric(df["list_price"])
df["city_mean_price"] = df.groupby("location.address.city")["list_price"].transform("mean")
df["state_mean_price"] = df.groupby("location.address.state_code")["list_price"].transform("mean")

#defining input columns for pipelines
pipelinecolumns = [
    "description.beds",
    "description.baths",
    "description.sqft",
    "description.year_built",
    "description.garage",
    "description.lot_sqft",
    "description.type",
    "description.stories",
    "location.address.city",
    "location.address.postal_code",
    "location.address.state_code",
    "location.address.coordinate.lat",
    "location.address.coordinate.lon",
    "list_price",
    "state_mean_price",
    "city_mean_price",
    "total_rooms" 
]

pipelinedata = df[pipelinecolumns]

#loading the model & predicting
finalpipeline = joblib.load("models/final_pipeline.pkl") 
predictions = finalpipeline.predict(pipelinedata)

#attach predictions and view (made a copy to avoid error message)
pipelinedata = df[pipelinecolumns].copy()
pipelinedata.loc[:, "predicted_price"] = predictions
print(pipelinedata.head(10))

   description.beds  description.baths  description.sqft  \
0                 2                  1             995.0   
1                 2                  2             984.0   
2                 3                  1            1028.0   
3                 4                  2               NaN   
4                 3                  1            1000.0   
5                 3                  2            1954.0   
6                 2                  1             832.0   
7                 3                  3            1015.0   
8                 2                  1               NaN   
9                 3                  2             960.0   

   description.year_built  description.garage  description.lot_sqft  \
0                    1949                 1.0                6098.0   
1                    1924                 NaN                8276.0   
2                    1979                 NaN               13939.0   
3                    1984                 NaN          

Pipelines come from sklearn.  When a pipeline is pickled, all of the information in the pipeline is stored with it.  For example, if we were deploying a model, and we had fit a scaler on the training data, we would want the same, already fitted scaling object to transform the new data with.  This is all stored when the pipeline is pickled.
- save your final pipeline in your `models/` folder

In [290]:
#predictions generated through model were saved
pipelinedata.to_csv("models/predicted_listings.csv")