# Case Study

## Part 1

### Git & version control

1. Create a GitHub repository called "ames-housing-analysis".
1. Copy the `ames.csv` data from the `data/` directory into this repository.
1. Update the README with a short synopsis of this repo.
1. Create a folder called `notebooks/`.
1. Add, commit, and push what you have so far. Verify that it appears on your repository page on GitHub.

### Exploratory data analysis

1. In the repo's `notebooks/` folder, create a new notebook: `eda.ipynb`.
2. Load the ames.csv data.
3. Assess the distribution of the response variable (`Sale_Price`).
4. How many features are numeric vs. categorical?
5. Pick a numeric feature that you believe would be influential on a home's `Sale_Price`. Assess the distribution of the numeric feature. Assess the relationship between that feature and the `Sale_Price`.
6. Pick a categorical feature that you believe would be influential on a home's `Sale_Price`. Assess the distribution of the categorical feature. Assess the relationship between that feature and the `Sale_Price`.
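
The checks above can be sketched on a tiny stand-in table before you run them on the real data. This is a minimal illustration, not a solution: the values are made up, and the feature names `Gr_Liv_Area` and `Neighborhood` are assumed here only as examples of a numeric and a categorical column.

```python
import pandas as pd

# Toy stand-in for the Ames data; every value here is invented for illustration.
toy = pd.DataFrame({
    "Sale_Price": [105000, 172000, 244000, 189900, 195500, 130000],
    "Gr_Liv_Area": [896, 1329, 2110, 1629, 1604, 1000],
    "Neighborhood": ["North_Ames", "North_Ames", "Gilbert",
                     "Gilbert", "Gilbert", "North_Ames"],
})

# Distribution of the response variable (count, mean, quartiles, ...).
price_summary = toy["Sale_Price"].describe()

# Count numeric vs. non-numeric (categorical) feature columns.
toy_features = toy.drop(columns="Sale_Price")
n_numeric = toy_features.select_dtypes(include="number").shape[1]
n_categorical = toy_features.select_dtypes(exclude="number").shape[1]

# Numeric feature vs. response: linear correlation as a first look.
area_corr = toy["Gr_Liv_Area"].corr(toy["Sale_Price"])

# Categorical feature vs. response: median price within each level.
median_by_group = toy.groupby("Neighborhood")["Sale_Price"].median()
```

Plots such as histograms, scatter plots, and box plots usually tell you more than these summaries alone, so treat the numbers as a starting point.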

### Modular code & Scikit-learn model

1. Copy `my_module.py` (that we created together) into the notebooks folder.
2. Import your module and use `get_features_and_target` to load the numeric features of the Ames data, with `Sale_Price` as the target column.

With your features and target prepared:
1. Split the data into training and test sets. Use 75% of the data for training and 25% for testing.
2. Fit a default `sklearn.neighbors.KNeighborsRegressor` model on the training data and score on the test data. Note that the `score` method of a scikit-learn regressor returns the $R^2$.
3. Fit a default `sklearn.linear_model.LinearRegression` model on the training data and score on the test data.
4. Fit a default `sklearn.ensemble.RandomForestRegressor` model on the training data and score on the test data.
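
The split, fit, and score pattern in these steps can be sketched on synthetic data. The data set and the `demo`-style variable names below are assumptions made for illustration, chosen so they do not overwrite your own `X_train` and `y_train`. The sketch also checks that `score` matches `sklearn.metrics.r2_score` computed on the test predictions.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data standing in for the Ames features and target.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# test_size=0.25 gives the 75% / 25% train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit with default hyperparameters, then score on held-out data.
knn = KNeighborsRegressor().fit(X_tr, y_tr)
test_r2 = knn.score(X_te, y_te)

# The same R^2, computed explicitly from the predictions.
same_r2 = r2_score(y_te, knn.predict(X_te))
```

The same three calls (`train_test_split`, `fit`, `score`) apply unchanged to the other two estimators.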

### Feature engineering

1. Fill in the blanks to standardize the numeric features and then apply a linear regression model. Does standardizing the numeric features improve the linear regression's $R^2$?

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import ________

lm_model_scaled = make_pipeline(__________, LinearRegression())
lm_model_scaled.fit(X_train, y_train)
lm_model_scaled.score(X_test, y_test)

2. Run the two code cells below, which do the following:

- identify the numeric, categorical, and ordinal columns in our full feature set,
- replace the special "no feature" values in our ordinal columns (e.g. "No_basement", "No_garage") with `'NA'`, and
- create our encoders for the numeric, categorical, and ordinal columns.

<div class="admonition note alert alert-info">
    <p class="first admonition-title" style="font-weight: bold;"><b>Note</b></p>
<p class="last">Run the following two code cells without changing anything.</p>
</div>

In [None]:
######## RUN THIS CODE CELL AS-IS ########

# get columns of interest
numerical_columns = num_features.columns
ordinal_columns = cat_features.filter(regex='Qual').columns
categorical_columns = cat_features.drop(columns=ordinal_columns).columns

# replace unique values in our ordinal columns (i.e. "No_basement", "No_garage") with 'NA'
for col in ordinal_columns:
    features[col] = features[col].replace(to_replace='No_.*', value='NA', regex=True)
    
# split full feature set (numeric, categorical, & ordinal features) into train & test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=123)

In [None]:
######## RUN THIS CODE CELL AS-IS ########

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder

# create our numeric, categorical, and ordinal preprocessor encoders
numerical_preprocessor = StandardScaler()
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")

ordinal_categories = [
    "NA", "Very_Poor", "Poor", "Fair", "Below_Average", "Average", "Typical",
    "Above_Average", "Good", "Very_Good", "Excellent", "Very_Excellent"
]
list_of_ord_cats = [ordinal_categories for col in ordinal_columns]
ordinal_preprocessor = OrdinalEncoder(categories=list_of_ord_cats)
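
To see what the two cells above do to a single ordinal column, here is a minimal sketch on a toy column. The column name `Bsmt_Qual`, its values, and the shortened level list are assumptions for illustration. With an explicit `categories` list, `OrdinalEncoder` assigns each level the integer code of its position in that list, which is why the order of `ordinal_categories` matters.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy ordinal column; the name and values are illustrative only.
demo = pd.DataFrame({"Bsmt_Qual": ["Good", "No_Basement", "Excellent", "Poor"]})

# Same regex replacement as above: anything starting with "No_" becomes "NA".
demo["Bsmt_Qual"] = demo["Bsmt_Qual"].replace(
    to_replace="No_.*", value="NA", regex=True
)

# Shortened, ordered list of levels (worst to best) for this demo.
demo_levels = ["NA", "Poor", "Typical", "Good", "Excellent"]

# One list of levels per encoded column; codes follow list position.
demo_encoder = OrdinalEncoder(categories=[demo_levels])
demo_codes = demo_encoder.fit_transform(demo[["Bsmt_Qual"]].to_numpy(dtype=object))
```

Here "NA" maps to 0 and "Excellent" to 4, so the encoded column keeps the quality ordering that a one-hot encoding would discard.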

2. Continued...

Now fill in the blanks to create our `ColumnTransformer` that:

- standardizes numerical columns (preprocessor: `numerical_preprocessor`; columns of interest: `numerical_columns`) 
- one-hot encodes categorical columns (preprocessor: `categorical_preprocessor`; columns of interest: `categorical_columns`) 
- ordinal encodes ordinal columns (preprocessor: `ordinal_preprocessor`; columns of interest: `ordinal_columns`) 

In [None]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('standard_scaler', __________, __________),
    ('one_hot_encoder', __________, __________),
    ('ordinal_encoder', __________, __________),
])

3. Now create a pipeline that includes the preprocessing step and applies a linear regression model. Does this improve the linear regression's $R^2$?

In [None]:
lm_full = make_pipeline(___________, ___________)
_ = lm_full.fit(X_train, y_train)
lm_full.score(X_test, y_test)

4. If time allows, create a pipeline that applies these preprocessing steps with a default random forest model and see if performance improves.

### GitHub Check-in

Add, commit (with a good message!), and push your code to this point.

## Part 2

### Model evaluation & selection

1. Using the same preprocessing pipeline you created in Part 1, evaluate a default random forest model with 5-fold cross-validation, using root mean squared error as the metric (`'neg_root_mean_squared_error'`).

2. Run the following two code cells as-is, without making any changes. They create a random forest model pipeline and define the hyperparameter distributions to sample from.

In [None]:
######## RUN THIS CODE CELL AS-IS ########

from scipy.stats import loguniform


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
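
To get a feel for what this class samples, the sketch below draws directly from `scipy.stats.loguniform` and converts to integers in the same way. The bounds 50 and 1000 and the seed are chosen here only for illustration. Log-uniform draws are spread evenly on a log scale, so small values are sampled more often than large ones, and `.astype(int)` truncates each draw toward zero.

```python
from scipy.stats import loguniform

# Draw from a log-uniform distribution on [50, 1000] with a fixed seed.
draws = loguniform(50, 1000).rvs(size=1000, random_state=0)

# Same integer conversion used by loguniform_int.rvs (truncation, not rounding).
int_draws = draws.astype(int)

# Share of draws below the arithmetic midpoint of the range (525).
# Under a plain uniform distribution this would be about one half.
share_below_midpoint = (draws < 525).mean()
```

This skew toward small values is useful for hyperparameters such as `min_samples_leaf`, where the difference between 1 and 10 tends to matter more than the difference between 91 and 100.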

In [None]:
######## RUN THIS CODE CELL AS-IS ########

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# create preprocessor & modeling pipeline
rf = RandomForestRegressor(random_state=123)
pipeline = Pipeline([('prep', preprocessor), ('rf', rf)])

# specify hyperparameter distributions to randomly sample from
param_distributions = {
    'rf__n_estimators': loguniform_int(50, 1000),
    'rf__max_features': loguniform(.1, .8),
    'rf__max_depth': loguniform_int(2, 30),
    'rf__min_samples_leaf': loguniform_int(1, 100),
    'rf__max_samples': loguniform(.5, 1),
}

2. Continued...

Fill in the blanks to perform a random hyperparameter search based on the following:

- use the parameter distributions specified above,
- sample 25 random hyperparameter combinations,
- use a 5-fold cross-validation procedure, and
- use root mean squared error (RMSE) as our scoring metric.

Which hyperparameter values give the lowest RMSE? What is the lowest cross-validated RMSE?

In [None]:
%%time
from sklearn.model_selection import ___________

random_search = RandomizedSearchCV(
    pipeline, 
    param_distributions=___________, 
    n_iter=__,
    cv=__, 
    scoring='___________',
    verbose=1,
    n_jobs=-1,
)

results = random_search.___________
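
Both steps in this section score models with `'neg_root_mean_squared_error'`. scikit-learn scorers follow a higher-is-better convention, so this scorer returns the negated RMSE and you need to flip the sign to read it as an RMSE. A minimal sketch on synthetic data, with a plain linear regression standing in for the random forest pipeline (the data and variable names are illustrative assumptions, not part of the case study):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the Ames features and target.
X_cv, y_cv = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# One score per fold; each is the negative of that fold's RMSE.
neg_rmse = cross_val_score(
    LinearRegression(), X_cv, y_cv,
    cv=5, scoring="neg_root_mean_squared_error",
)

# Flip the sign to report RMSE in the units of the target.
rmse_per_fold = -neg_rmse
mean_rmse = rmse_per_fold.mean()
```

The same convention applies to the `best_score_` attribute of a fitted search object, which will be negative under this scorer.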

### Unit Tests

1. TBD
1. TBD
1. TBD

### ML lifecycle management

1. Create and set an MLflow experiment titled "UC Advanced Python Case Study".
2. Rerun the random hyperparameter search from above, logging it with MLflow's autologging. Title this run "rf_hyperparameter_tuning".

### Reproducibility with dependency tracking

1. TBD
1. TBD
1. TBD