## Try this Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truefoundry/mlfoundry-examples/blob/main/examples/sklearn/ca_housing_regression.ipynb)

## Install dependencies

In [1]:
! pip install --quiet "numpy>=1.0.0,<2.0.0" "pandas>=1.0.0,<2.0.0" scikit-learn shap==0.40.0
! pip install -U mlfoundry


## Initialize MLFoundry Client

In [2]:
import os
import getpass
import urllib.parse
import mlfoundry as mlf



In [3]:
TFY_URL = os.environ.get('TFY_URL', 'https://app.truefoundry.com/')
TFY_API_KEY = os.environ.get('TFY_API_KEY')
if not TFY_API_KEY:
    print(f'Paste your TrueFoundry API key\nYou can find it over at {urllib.parse.urljoin(TFY_URL, "settings")}')
    TFY_API_KEY = getpass.getpass()
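
The settings link above is built with `urllib.parse.urljoin`, which resolves the second argument relative to the base URL. The short sketch below shows why the trailing slash in `TFY_URL` matters once the base URL has a path component. The `example.com` URLs are hypothetical and stand in for a self-hosted deployment; they are not taken from this notebook.

```python
from urllib.parse import urljoin

# Default host from the cell above: the trailing slash keeps the join predictable.
default_settings = urljoin("https://app.truefoundry.com/", "settings")

# Hypothetical base URLs with a path component (illustration only).
with_slash = urljoin("https://example.com/tfy/", "settings")
without_slash = urljoin("https://example.com/tfy", "settings")

print(default_settings)  # https://app.truefoundry.com/settings
print(with_slash)        # https://example.com/tfy/settings
print(without_slash)     # https://example.com/settings (last path segment is replaced)
```

If you override `TFY_URL` with a URL that contains a path, ending it with `/` avoids losing the last path segment.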

In [4]:
client = mlf.get_client(api_key=TFY_API_KEY)

---

## California Housing Price Prediction as a Regression problem

In [5]:
import shap
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import mlfoundry as mlf

### Load the California Housing dataset

In [6]:
data = datasets.fetch_california_housing(as_frame=True)
print(data.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


In [7]:
print(data.DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block
        - HouseAge      median house age in block
        - AveRooms      average number of rooms
        - AveBedrms     average number of bedrooms
        - Population    block population
        - AveOccup      average house occupancy
        - Latitude      house block latitude
        - Longitude     house block longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
http://lib.stat.cmu.edu/datasets/

The target variable is the median house value for California districts.

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bur

In [8]:
data.frame.head()

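|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|-------------|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |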


### Split Dataset into Training and Validation

In [9]:
# Create a Pandas dataframe with all the features
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
feature_columns = X_train.columns.tolist()
X_train = X_train[feature_columns]
X_test = X_test[feature_columns]

print('Feature columns:', feature_columns)
print('Train samples:', len(X_train))
print('Test samples:', len(X_test))

Feature columns: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Train samples: 16512
Test samples: 4128
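
The sample counts above follow from `train_size=0.8` applied to the 20640 rows in the dataset. The sketch below is only an arithmetic check, assuming the training count is the floor of the fraction times the total and the remainder goes to the test split. It does not call scikit-learn, and the variable names are illustrative.

```python
import math

n_samples = 20640      # number of instances reported in the dataset description
train_fraction = 0.8   # same value as train_size in the cell above

# Assumed rounding rule: floor for the training split, remainder for the test split.
n_train = math.floor(train_fraction * n_samples)
n_test = n_samples - n_train

print(n_train, n_test)  # 16512 4128
```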


### Start an MLFoundry Run

In [11]:
run = client.create_run(project_name='sklearn-ca-housing-example')
print('RUN ID:', run.run_id)
print(f'You can track your runs live at {urllib.parse.urljoin(TFY_URL, "mlfoundry")}')

[mlfoundry] 2022-05-16T16:27:07+0530 INFO Run is created with id b614e94615d244e0bc389d19886246b0 and name would-wrong-month
RUN ID: b614e94615d244e0bc389d19886246b0
You can track your runs live at https://app.truefoundry.com/mlfoundry


### Define the model and set tags for our run

In [12]:
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=15, min_samples_leaf=30)
run.set_tags({'framework': 'sklearn', 'task': 'regression'})

### Training Model

In [13]:
rf_reg.fit(X_train, y_train)

RandomForestRegressor(max_depth=15, min_samples_leaf=30)

### Logging Parameters & Model

In [14]:
print(rf_reg.get_params())
run.log_params(rf_reg.get_params())
run.log_model(rf_reg, framework=mlf.ModelFramework.SKLEARN)

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': 15, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 30, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
[mlfoundry] 2022-05-16T16:27:24+0530 INFO Parameters logged successfully
[mlfoundry] 2022-05-16T16:27:55+0530 INFO Model logged successfully


### Computing Predictions

In [15]:
y_pred_train = rf_reg.predict(X_train)
y_pred_test = rf_reg.predict(X_test)

### Logging metrics

In [16]:
metrics_dict = {
    'train/mae': mean_absolute_error(y_true=y_train, y_pred=y_pred_train),
    'train/mse': mean_squared_error(y_true=y_train, y_pred=y_pred_train),
    'train/r2_score': r2_score(y_true=y_train, y_pred=y_pred_train),
    'test/mae': mean_absolute_error(y_true=y_test, y_pred=y_pred_test),
    'test/mse': mean_squared_error(y_true=y_test, y_pred=y_pred_test),
    'test/r2_score': r2_score(y_true=y_test, y_pred=y_pred_test)
}
print(metrics_dict)
run.log_metrics(metrics_dict)

{'train/mae': 0.34231054424771856, 'train/mse': 0.2599938097546138, 'train/r2_score': 0.8055071458663416, 'test/mae': 0.37744712287022186, 'test/mse': 0.3186573910717649, 'test/r2_score': 0.756826001375897}
[mlfoundry] 2022-05-16T16:27:59+0530 INFO Metrics logged successfully
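
To make the three logged metrics concrete, here is a minimal pure-Python sketch of the standard definitions of mean absolute error, mean squared error, and the R² score. The helper names and the toy values are illustrative assumptions and not part of the notebook; in practice the `sklearn.metrics` functions used above are the ones to call.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """R^2: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not taken from the housing data).
toy_true = [3.0, 5.0, 2.0, 7.0]
toy_pred = [2.5, 5.0, 3.0, 8.0]

toy_metrics = {
    "mae": mae(toy_true, toy_pred),
    "mse": mse(toy_true, toy_pred),
    "r2_score": r2(toy_true, toy_pred),
}
print(toy_metrics)
```

Because the target is the median house value, MAE is in the same units as the target, while MSE is in squared units. An R² of 1.0 means perfect predictions, and 0.0 means no better than always predicting the mean.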


### Log the dataset

In [17]:
run.log_dataset(
    dataset_name='train',
    features=X_train,
    predictions=y_pred_train,
    actuals=y_train,
)

In [18]:
run.log_dataset(
    dataset_name='test',
    features=X_test,
    predictions=y_pred_test,
    actuals=y_test,
)

### Logging Dataset Stats

In [19]:
# shap value computation
explainer = shap.TreeExplainer(rf_reg)
shap_values = explainer.shap_values(X_test)


X_test_df = X_test.copy()
X_test_df['targets'] = y_test
X_test_df['predictions'] = y_pred_test

run.log_dataset_stats(
    X_test_df, 
    data_slice='test',
    data_schema=mlf.Schema(
        feature_column_names=feature_columns,
        prediction_column_name='predictions',
        actual_column_name='targets'
    ),
    model_type='regression',
    shap_values=shap_values
)

WARN: Missing config
[mlfoundry] 2022-05-16T16:29:21+0530 INFO Metrics logged successfully
[mlfoundry] 2022-05-16T16:29:33+0530 INFO Dataset stats have been successfully computed and logged
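
A common way to summarize SHAP values is to rank features by their mean absolute SHAP value. The sketch below illustrates this with plain lists, assuming the SHAP values form a matrix with one row per sample and one column per feature (which is what I would expect for a single-output regressor, though it is worth checking the shape in your environment). The numbers are made-up toy values, and `mean_abs_shap` is a hypothetical helper, not an MLFoundry or SHAP function.

```python
def mean_abs_shap(shap_matrix, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important."""
    n_rows = len(shap_matrix)
    scores = []
    for col, name in enumerate(feature_names):
        total = sum(abs(row[col]) for row in shap_matrix)
        scores.append((name, total / n_rows))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy SHAP matrix: 3 samples x 2 features (illustrative values only).
toy_features = ["HouseAge", "MedInc"]
toy_shap = [
    [-0.1, 0.5],
    [0.2, -0.3],
    [-0.3, 0.1],
]

ranking = mean_abs_shap(toy_shap, toy_features)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With the real `shap_values` and `feature_columns` from the cell above, the same idea would apply after converting the array to rows, for example with `shap_values.tolist()`.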


### Optionally, log training data slice stats

This may take longer because the training split is much larger than the test split (16512 rows versus 4128).

```python
# shap value computation
explainer = shap.TreeExplainer(rf_reg)
shap_values = explainer.shap_values(X_train)


X_train_df = X_train.copy()
X_train_df['targets'] = y_train
X_train_df['predictions'] = y_pred_train


run.log_dataset_stats(
    X_train_df, 
    data_slice='train',
    data_schema=mlf.Schema(
        feature_column_names=feature_columns,
        prediction_column_name='predictions',
        actual_column_name='targets'
    ),
    model_type='regression',
    shap_values=shap_values
)
```

### End the run

In [20]:
run.end()

[mlfoundry] 2022-05-16T16:29:34+0530 INFO Shutting down background jobs and syncing data for run with id 'b614e94615d244e0bc389d19886246b0', please don't kill this process...
