## Try this Notebook in Google Colab

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truefoundry/mlfoundry-examples/blob/main/examples/sklearn/ca_housing_regression.ipynb)

## Install dependencies

In [None]:
! pip install --quiet "numpy>=1.0.0,<2.0.0" "pandas>=1.0.0,<2.0.0" "matplotlib>=3.5.2,<3.6.0" scikit-learn shap==0.40.0
! pip install -U mlfoundry

## Initialize MLFoundry Client

In [2]:
import getpass
import mlfoundry as mlf
# Prompt for the API key rather than hard-coding a secret in the notebook
client = mlf.get_client(api_key=getpass.getpass("MLFoundry API key: "))


---

## California Housing Price Prediction as a Regression problem

In [26]:
import os
import getpass
import urllib.parse

import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import mlfoundry as mlf

### Load the California Housing dataset

In [27]:
data = datasets.fetch_california_housing(as_frame=True)
print(data.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


In [28]:
data['DESCR']

'.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block group\n        - HouseAge      median house age in block group\n        - AveRooms      average number of rooms per household\n        - AveBedrms     average number of bedrooms per household\n        - Population    block group population\n        - AveOccup      average number of household members\n        - Latitude      block group latitude\n        - Longitude     block group longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe target variable is the median house value for California districts,\nexpressed in hundreds of thousands of dollars ($100,000

In [30]:
data.frame.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


### Split Dataset into Training and Validation

In [31]:
# Create a Pandas dataframe with all the features
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
feature_columns = X_train.columns.tolist()
X_train = X_train[feature_columns]
X_test = X_test[feature_columns]

print('Feature columns:', feature_columns)
print('Train samples:', len(X_train))
print('Test samples:', len(X_test))

Feature columns: ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Train samples: 16512
Test samples: 4128


In [33]:
dict(X_train.iloc[0])

{'MedInc': 3.2596,
 'HouseAge': 33.0,
 'AveRooms': 5.017656500802568,
 'AveBedrms': 1.0064205457463884,
 'Population': 2300.0,
 'AveOccup': 3.691813804173355,
 'Latitude': 32.71,
 'Longitude': -117.03}

### Start an MLFoundry Run

In [34]:
run = client.create_run(project_name='sklearn-ca-housing-example')

[mlfoundry] 2022-08-05T01:55:08+0530 INFO No run_name given. Using a randomly generated name ruby-cheetah. You can pass your own using the `run_name` argument
[mlfoundry] 2022-08-05T01:55:08+0530 INFO project sklearn-ca-housing-example does not exist. Creating sklearn-ca-housing-example.
Link to the dashboard for the run: https://app.truefoundry.com/mlfoundry/244/57dc144b0b0747fba41fcaf159afeb18/
[mlfoundry] 2022-08-05T01:55:15+0530 INFO Run 'truefoundry/Nikhil/sklearn-ca-housing-example/ruby-cheetah' has started.


### Set tags for our run

In [35]:
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=15, min_samples_leaf=30)
run.set_tags({'framework': 'sklearn', 'task': 'regression'})

[mlfoundry] 2022-08-05T01:55:16+0530 INFO Tags set successfully


### Training Model

In [36]:
rf_reg.fit(X_train, y_train)

RandomForestRegressor(max_depth=15, min_samples_leaf=30)

### Logging Parameters & Model

In [37]:
print(rf_reg.get_params())
run.log_params(rf_reg.get_params())
run.log_model(rf_reg, framework=mlf.ModelFramework.SKLEARN)

{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 15, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 30, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
[mlfoundry] 2022-08-05T01:55:20+0530 INFO Parameters logged successfully
[mlfoundry] 2022-08-05T01:55:42+0530 INFO Model logged successfully


### Computing Predictions

In [38]:
y_pred_train = rf_reg.predict(X_train)
y_pred_test = rf_reg.predict(X_test)
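
The predictions are scored next with MAE, MSE, and R². As a reminder of what those numbers mean, here is a minimal pure-Python sketch of the three formulas on a made-up four-point example. The helper names and the toy values are illustrative assumptions; the notebook itself uses the `sklearn.metrics` functions imported earlier.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error: average of |y - y_hat|."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean squared error: average of (y - y_hat)^2."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values, not taken from the housing data
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 2.5, 4.0]

toy_scores = {
    "mae": mae(toy_true, toy_pred),
    "mse": mse(toy_true, toy_pred),
    "r2_score": r2(toy_true, toy_pred),
}
print(toy_scores)
```

An R² of 1 would mean perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.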

### Logging metrics

In [39]:
metrics_dict = {
    'train/mae': mean_absolute_error(y_true=y_train, y_pred=y_pred_train),
    'train/mse': mean_squared_error(y_true=y_train, y_pred=y_pred_train),
    'train/r2_score': r2_score(y_true=y_train, y_pred=y_pred_train),
    'test/mae': mean_absolute_error(y_true=y_test, y_pred=y_pred_test),
    'test/mse': mean_squared_error(y_true=y_test, y_pred=y_pred_test),
    'test/r2_score': r2_score(y_true=y_test, y_pred=y_pred_test)
}
print(metrics_dict)
run.log_metrics(metrics_dict)

{'train/mae': 0.34019707891900497, 'train/mse': 0.258091398185982, 'train/r2_score': 0.8069302776557842, 'test/mae': 0.37597906136966514, 'test/mse': 0.318378757736289, 'test/r2_score': 0.7570386321958185}
[mlfoundry] 2022-08-05T01:55:43+0530 INFO Metrics logged successfully
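
The dataset description states that the target is expressed in units of $100,000, so the errors above are easier to read once converted to dollars. The sketch below copies the printed metric values, derives RMSE from MSE, and rescales. The variable names are illustrative, and the figures will differ on a rerun because the model was trained without a fixed `random_state`.

```python
import math

# Values copied from the metrics printed above
logged = {
    "train/mae": 0.34019707891900497,
    "train/mse": 0.258091398185982,
    "test/mae": 0.37597906136966514,
    "test/mse": 0.318378757736289,
}

UNIT_DOLLARS = 100_000  # target unit, per the dataset description

# RMSE is the square root of MSE, which puts it back in the target's units
rmse = {split: math.sqrt(logged[split + "/mse"]) for split in ("train", "test")}

for split in ("train", "test"):
    rmse_usd = rmse[split] * UNIT_DOLLARS
    mae_usd = logged[split + "/mae"] * UNIT_DOLLARS
    print(f"{split}: RMSE ~ ${rmse_usd:,.0f}, MAE ~ ${mae_usd:,.0f}")
```

The gap between the train and test errors gives a rough sense of how much the forest overfits with these hyperparameters.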


### Log the dataset

In [40]:
run.log_dataset(
    dataset_name='train',
    features=X_train,
    predictions=y_pred_train,
    actuals=y_train,
)

[mlfoundry] 2022-08-05T01:55:43+0530 INFO Logging Dataset, this might take a while ...
[mlfoundry] 2022-08-05T01:56:10+0530 INFO Dataset logged successfully
To visualize the logged dataset, click on the link https://app.truefoundry.com/mlfoundry/244/57dc144b0b0747fba41fcaf159afeb18/?tab=data-feature-metrics


In [41]:
run.log_dataset(
    dataset_name='test',
    features=X_test,
    predictions=y_pred_test,
    actuals=y_test,
)

[mlfoundry] 2022-08-05T01:56:10+0530 INFO Logging Dataset, this might take a while ...
[mlfoundry] 2022-08-05T01:56:31+0530 INFO Dataset logged successfully
To visualize the logged dataset, click on the link https://app.truefoundry.com/mlfoundry/244/57dc144b0b0747fba41fcaf159afeb18/?tab=data-feature-metrics


In [42]:
run.end()

[mlfoundry] 2022-08-05T01:56:32+0530 INFO Shutting down background jobs and syncing data for run 'truefoundry/Nikhil/sklearn-ca-housing-example/ruby-cheetah', please don't kill this process...
[mlfoundry] 2022-08-05T01:56:33+0530 INFO Finished syncing data for run 'truefoundry/Nikhil/sklearn-ca-housing-example/ruby-cheetah'. Thank you for waiting!
Link to the dashboard for the run: https://app.truefoundry.com/mlfoundry/244/57dc144b0b0747fba41fcaf159afeb18/


## Log predictions and actuals

In [43]:
monitoring_client = mlf.get_monitoring_client(monitoring_uri='https://ml-monitoring-server.tfy-ctl-euwe1-devtest.devtest.truefoundry.tech/', model_id="0bde36c8e9f249a882061fff60db1e71" , model_version="0.0")

In [44]:
X = pd.concat([X_train, X_test])

In [46]:
Y = pd.concat([y_train, y_test])

In [56]:
Y_pred = np.concatenate([y_pred_train, y_pred_test])

In [65]:
import time
from datetime import datetime

# Log every prediction together with its matching actual value
for i in range(len(Y_pred)):
    features = dict(X.iloc[i])
    prediction_data = {
        "value": Y_pred[i],
        "probabilities": {},
        "shap_values": {}
    }
    # Ground-truth target for the same row as the prediction
    actual = Y.iloc[i]
    # Derive a data id from the features and the current UTC timestamp
    id1 = monitoring_client.generate_id_from_data(features=features, timestamp=datetime.utcnow())
    monitoring_client.log_prediction(
        mlf.Prediction(
            data_id=id1,
            features=features,
            prediction_data=prediction_data,
            raw_data={}
        )
    )
    monitoring_client.log_actual(
        mlf.Actual(
            data_id=id1,
            value=actual
        )
    )
    # Pause for 10 seconds after every 1000 records
    if (i + 1) % 1000 == 0:
        time.sleep(10)

1.03
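
The loop above pauses for 10 seconds after every 1000 logged records. If the same pattern is needed elsewhere, it could be factored into a small generator. This is only a sketch: `throttled` and its parameters are hypothetical names, not part of `mlfoundry`, and the `sleep` argument exists so the demo can record pauses instead of actually waiting.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")

def throttled(
    items: Iterable[T],
    every: int = 1000,
    pause_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items unchanged, pausing once after every `every` items."""
    for count, item in enumerate(items, start=1):
        yield item
        # Runs after the consumer has finished processing `item`
        if count % every == 0:
            sleep(pause_seconds)

# Demo: record the pauses in a list rather than sleeping
pauses = []
seen = list(throttled(range(2500), every=1000, pause_seconds=10.0, sleep=pauses.append))
print(len(seen), pauses)  # prints: 2500 [10.0, 10.0]
```

With this helper, the logging loop would iterate over `throttled(range(len(Y_pred)))` and drop the explicit `time.sleep` check at the end of its body.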