### 3. Load Data and Model

**[3.1]** Import the pandas and numpy packages

In [1]:
# Solution
import pandas as pd
import numpy as np

**[3.2]** Load the prepared dataset from `data/interim` into a dataframe called `df`

In [2]:
# Solution
df = pd.read_csv('../data/interim/Mall_Customers.csv')

**[3.3]** Create a copy of `df` and save it into a variable called `df_cleaned`

In [3]:
# Solution
df_cleaned = df.copy()

**[3.4]** Import `OneHotEncoder` from `sklearn.preprocessing`

In [4]:
# Solution
from sklearn.preprocessing import StandardScaler, OneHotEncoder

**[3.5]** Instantiate a `OneHotEncoder` with `sparse=False` and `drop='first'` and save it to a variable called `ohe`

In [5]:
# Solution
ohe = OneHotEncoder(sparse=False, drop='first')
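
The `sparse` argument used in [3.5] matches the older scikit-learn release this notebook ran on. To my knowledge, scikit-learn 1.2 renamed it to `sparse_output` and later releases removed `sparse`. The helper below is a minimal sketch for picking the right keyword; `dense_output_kwarg` is a hypothetical name, not part of scikit-learn.

```python
import re


def dense_output_kwarg(sklearn_version: str) -> dict:
    """Return the keyword that asks OneHotEncoder for a dense array."""
    # Read the leading "major.minor" numbers, e.g. "1.4.2" -> (1, 4).
    match = re.match(r"(\d+)\.(\d+)", sklearn_version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {sklearn_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Assumption: `sparse_output` is available from scikit-learn 1.2 onwards.
    if (major, minor) >= (1, 2):
        return {"sparse_output": False}
    return {"sparse": False}


print(dense_output_kwarg("0.24.1"))  # {'sparse': False}
print(dense_output_kwarg("1.4.2"))   # {'sparse_output': False}

# Possible use in the notebook (not executed here):
#   import sklearn
#   ohe = OneHotEncoder(drop='first', **dense_output_kwarg(sklearn.__version__))
```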

**[3.6]** Fit and transform the `Gender` feature of `df_cleaned` and overwrite the column with the encoded values

In [6]:
# Solution
df_cleaned['Gender'] = ohe.fit_transform(df_cleaned[['Gender']])
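
To see why the encoded result fits back into a single column, here is a dependency-free sketch of what `drop='first'` computes for one categorical column. It assumes `Gender` holds two values such as `Male` and `Female`, and that categories are ordered alphabetically, which is my understanding of the encoder's default. `one_hot_drop_first` is a hypothetical helper, not the scikit-learn implementation.

```python
def one_hot_drop_first(values):
    """One-hot encode a single categorical column, dropping the first category."""
    categories = sorted(set(values))  # alphabetical order (assumed encoder default)
    kept = categories[1:]             # drop='first' discards the first category
    rows = [[1.0 if value == cat else 0.0 for cat in kept] for value in values]
    return kept, rows


kept, rows = one_hot_drop_first(["Male", "Female", "Female", "Male"])
print(kept)  # ['Male']
print(rows)  # [[1.0], [0.0], [0.0], [1.0]]
```

With two categories only one indicator column remains, so it can replace `Gender` directly. A feature with three or more categories would produce several columns, and the single-column assignment in [3.6] would no longer fit.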

**[3.7]** Import `split_sets_random`, `save_sets` from `src.data.sets`

In [7]:
# Solution
from src.data.sets import split_sets_random, save_sets

**[3.8]** Split the data into training, validation and testing sets with an 80-20 ratio

In [8]:
# Solution
X_train, y_train, X_val, y_val, X_test, y_test = split_sets_random(df_cleaned, target_col='Spending Score (1-100)', test_ratio=0.2, to_numpy=False)
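
The body of `split_sets_random` lives in `src.data.sets` and is not shown in this notebook. As a rough illustration of a random three-way split, the sketch below shuffles row positions and cuts them into three lists. It assumes the validation set takes the same share as the test set (`test_ratio` each), which may differ from the real function; `split_indices_random` and its `seed` parameter are hypothetical.

```python
import random


def split_indices_random(n_rows, test_ratio=0.2, seed=8):
    """Shuffle row positions and cut them into train, validation and test lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # seeded so the split is reproducible

    # Assumption: validation and test sets each hold `test_ratio` of the rows.
    n_holdout = round(n_rows * test_ratio)
    test = positions[:n_holdout]
    val = positions[n_holdout:2 * n_holdout]
    train = positions[2 * n_holdout:]
    return train, val, test


# 200 is an illustrative row count, not taken from the dataset.
train_idx, val_idx, test_idx = split_indices_random(200, test_ratio=0.2)
print(len(train_idx), len(val_idx), len(test_idx))  # 120 40 40
```

The returned positions could then select rows with `df_cleaned.iloc[train_idx]`, after which the target column is separated from the features.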

**[3.9]** Save the sets into the `data/processed` folder

In [9]:
# Solution
save_sets(X_train, y_train, X_val, y_val, X_test, y_test, path='../data/processed/')

### 4. Configure MLflow

**[4.1]** Import `mlflow` and `mlflow.sklearn`

In [10]:
# Solution
import mlflow
import mlflow.sklearn

**[4.2]** Set the MLflow Server URI to `http://mlflow:5000` using `.set_tracking_uri()`

In [11]:
# Solution
mlflow.set_tracking_uri('http://mlflow:5000')

**[4.3]** Define `xgboost_spending` as the MLflow experiment to be used with `.set_experiment()`

In [12]:
# Solution
mlflow.set_experiment('xgboost_spending')

INFO: 'xgboost_spending' does not exist. Creating a new experiment


**[4.4]** Start tracking with MLflow using `.start_run()`

In [13]:
# Solution
run = mlflow.start_run()

### 5. Train RandomForest and log MLflow

**[5.1]** Set an MLflow tag with `model.description` as key and `RandomForest with default hyperparameter` as value using `.set_tag()`

In [14]:
# Solution
mlflow.set_tag("model.description", "RandomForest with default hyperparameter")

**[5.2]** Set an MLflow tag with `model.version` as key and `0.1` as value using `.set_tag()`

In [15]:
# Solution
mlflow.set_tag("model.version", "0.1")

**[5.3]** Turn on automatic logging with sklearn

In [16]:
# Solution
mlflow.sklearn.autolog()

**[5.4]** Import `RandomForestRegressor` from `sklearn.ensemble` and instantiate it into a variable called `rf1` with `random_state=8`

In [17]:
# Solution
from sklearn.ensemble import RandomForestRegressor

rf1 = RandomForestRegressor(random_state=8)

**[5.5]** Fit the model on the training set

In [18]:
# Solution
rf1.fit(X_train, y_train)

2022/03/10 01:38:27 INFO mlflow.utils.autologging_utils: sklearn autologging will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow to the MLflow run with ID '69dc7cc9c1684330ba25da0e01337e5b'

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=8, verbose=0, warm_start=False)

**[5.6]** Import `infer_signature` from `mlflow.models.signature`

In [19]:
# Solution
from mlflow.models.signature import infer_signature

**[5.7]** Apply `infer_signature()` on the training set and save the result in a variable called `signature`

In [20]:
# Solution
signature = infer_signature(X_train, y_train)

**[5.8]** Log the trained model with its signature under the artifact path `model` and register it under the name `sklearn-rf-spending`

In [21]:
# Solution
mlflow.sklearn.log_model(rf1, artifact_path="model", signature=signature, registered_model_name="sklearn-rf-spending")

Successfully registered model 'sklearn-rf-spending'.
2022/03/10 01:38:31 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: sklearn-rf-spending, version 1
Created version '1' of model 'sklearn-rf-spending'.


**[5.9]** Close the MLflow experiment run

In [22]:
# Solution
mlflow.end_run()

**[5.10]** Open a browser and navigate to http://127.0.0.1:5000/#/

**[5.11]** Navigate into `xgboost_spending` and select the experiment run