## Example of using LTFMSelector for Regression
As an example, we will experiment with the California Housing dataset. The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average number of rooms and bedrooms in this dataset are provided per household, these columns may take surprisingly large values for block groups with few households and many empty houses, such as vacation resorts.

In [1]:
from ltfmselector import LTFMSelector

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing


In [None]:
# Loading the California Housing Dataset
housing = fetch_california_housing()

# Get data
X = housing['data']

# Get target
y = housing['target']

# Get feature names
feature_names = housing['feature_names']

# Get description
dataset_description = housing['DESCR']
print(dataset_description)

In [None]:
# Convert data into pandas DataFrame
housing_df = pd.DataFrame(
    np.c_[X, y], columns = np.append(feature_names, ['target'])
)

The data will then be split for training and testing.

Note: The training data (`X`) must be passed as a `pandas.DataFrame` and the labels (`y`) as a `pandas.Series`. Other formats will be accommodated in later versions.
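If your data starts out as NumPy arrays, a minimal sketch of the conversion could look like the following. The toy values and the names `X_example` and `y_example` are made up for illustration; only the `pandas.DataFrame` / `pandas.Series` requirement comes from the note above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 samples, 2 features (values are illustrative only)
X_array = np.array([[8.3, 41.0], [8.3, 21.0], [7.3, 52.0], [5.6, 52.0]])
y_array = np.array([4.526, 3.585, 3.521, 3.413])

# Wrap the feature matrix in a DataFrame with named columns
X_example = pd.DataFrame(X_array, columns=["MedInc", "HouseAge"])

# Wrap the target values in a Series
y_example = pd.Series(y_array, name="target")

print(type(X_example).__name__, X_example.shape)
print(type(y_example).__name__, y_example.shape)
```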

In [None]:
# Split the dataset for training and test
X_df = housing_df.drop(['target'], axis=1)
y_df = housing_df['target']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_df, y_df, test_size=0.2, random_state=5)

y_train = y_train.reset_index(drop=True)
y_test  = y_test.reset_index(drop=True)
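The `reset_index(drop=True)` calls in the cell above are there because `train_test_split` shuffles the rows and keeps their original index labels. A small sketch on a made-up Series (the values and labels are hypothetical) shows what the reset does:

```python
import pandas as pd

# Hypothetical target values; after a shuffled split the index labels
# are no longer 0, 1, 2, ...
y_shuffled = pd.Series([2.1, 0.9, 3.4], index=[17, 4, 42], name="target")

# drop=True discards the old labels instead of keeping them as a column
y_reset = y_shuffled.reset_index(drop=True)

print(list(y_shuffled.index))
print(list(y_reset.index))
```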

We will now train an agent using LTFMSelector to select features and a prediction model tailored to each sample.

When initializing LTFMSelector, one required hyperparameter is the number of episodes over which the agent is trained. My personal recommendation is to set it to roughly 2-3 times the number of training examples.
 - For example, here we have 16512 training examples, hence ~32000 episodes.
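The rule of thumb can be written out as simple arithmetic. This is only a sketch of the recommendation in the text; the variable names are illustrative, and the ~32000 figure quoted above sits near the lower end of the computed range.

```python
# Number of training examples in this notebook (from the text)
n_train = 16512

# Rule of thumb: roughly 2-3 times the number of training examples
episodes_low = 2 * n_train
episodes_high = 3 * n_train

print(f"Suggested range: {episodes_low} to {episodes_high} episodes")
```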

Another hyperparameter that should be set is `pType`, which should be `regression` for this example.

If `pModels=None`, the following default models are used:
 - Support Vector Machine
 - Random Forest
 - Ridge Regression (Linear least squares with L2 regularization)

All are taken from the scikit-learn library with default hyperparameters. Users can also pass a list of regression model objects, each of which must provide `fit` and `predict` methods.
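To illustrate the `fit`/`predict` requirement, here is a minimal sketch of a custom regressor that could, in principle, be placed in such a list. `MeanRegressor` is a hypothetical class written for this example; it is not part of LTFMSelector or scikit-learn, and whether LTFMSelector expects anything beyond these two methods is an assumption not confirmed here.

```python
import numpy as np


class MeanRegressor:
    """Hypothetical baseline model: always predicts the training mean."""

    def fit(self, X, y):
        # Store the mean of the training targets
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # Return one prediction per row of X
        return np.full(len(X), self.mean_)


# A candidate list of model objects (illustrative only)
candidate_models = [MeanRegressor()]

# Duck-typing check: every model must expose callable fit and predict
for model in candidate_models:
    assert callable(getattr(model, "fit", None))
    assert callable(getattr(model, "predict", None))

# Quick demonstration on made-up data
demo = MeanRegressor().fit([[1.0], [2.0], [3.0]], [1.0, 2.0, 6.0])
demo_pred = demo.predict([[4.0], [5.0]])
print(demo_pred)
```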

In [None]:
# Training an agent using LTFMSelector to select features and an appropriate prediction model tailored to each sample
AgentSelector = LTFMSelector(100, pType='regression') # If you have time, go for 20000

Train the agent by passing the training examples and labels.

The argument `agent_neuralnetwork` takes a PyTorch neural network that is used to learn the agent's policy. If `None`, a feed-forward network (multilayer perceptron) with two hidden layers of 1024 units each is used.

`lr` refers to the learning rate of the `AdamW` optimizer, used to update the policy network.

The `fit` function returns a `dict<dict>` object, storing meta-information during the training process.

Note: Training an agent over 1300 episodes may take some time. If you are simply interested in getting a feel for the interface, set the number of episodes to 30 or fewer for now.

In [None]:
# Now letting the agent train, this could take some time ...
doc = AgentSelector.fit(X_train, y_train, agent_neuralnetwork=None, lr=1e-5)

In [None]:
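# Let's check the regression model's performance in terms of the coefficient of determination
from sklearn.metrics import r2_score

# Assumption: `predict` returns the per-sample predictions together with a
# record of the selections made; adjust this call to match your installed version
y_pred, doc_test = AgentSelector.predict(X_test)

r2 = r2_score(y_test, y_pred)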
print(f"R2: {r2}")
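For reference, the coefficient of determination is defined as R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of the true values. The sketch below computes it by hand on made-up numbers (unrelated to the housing data) to show what the score measures.

```python
import numpy as np

# Made-up true and predicted values, purely for illustration
y_true_demo = np.array([3.0, -0.5, 2.0, 7.0])
y_pred_demo = np.array([2.5, 0.0, 2.0, 8.0])

# Sum of squared residuals between truth and prediction
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)

# Total sum of squares around the mean of the true values
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

r2_demo = 1.0 - ss_res / ss_tot
print(f"R2: {r2_demo:.4f}")
```

A value of 1 means perfect predictions, while 0 corresponds to always predicting the mean of the true values.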

For examples of how to investigate the features and models selected per sample, refer to the previous notebooks.