Getting started with predictsignauxfaibles - Training a logistic regression
===

This notebook uses `predictsignauxfaibles` to train a logistic regression. Most of the code here can be reused to quickly test other models.\
\
In `predictsignauxfaibles`, each model is declared and specified in `models/<MODEL_NAME>/model_conf.py`.\
Our processing pipeline works as follows:
- fetching input variables for the train, test and (where applicable) prediction sets
- pre-processing the data to produce model features
- feeding the pre-processed data into a model to produce evaluation metrics and predictions
- logging training, testing and prediction statistics
\
Here we assume that you wish to train a model that uses the same pre-processing steps as `models/default/model_conf.py`.

In [None]:
from pathlib import Path
import importlib.util

import logging
logging.getLogger().setLevel(logging.INFO)

from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import fbeta_score, balanced_accuracy_score
from sklearn_pandas import DataFrameMapper

import pandas as pd

import predictsignauxfaibles.models
from predictsignauxfaibles.data import SFDataset
from predictsignauxfaibles.config import OUTPUT_FOLDER, IGNORE_NA
from predictsignauxfaibles.pipelines import run_pipeline
from predictsignauxfaibles.utils import load_conf

Loading a preprocessing & model configuration
---
The following code fetches the configuration module for model `default`, so that we can easily access, use and adapt the preprocessing steps and the train and test sets.

In [None]:
conf = load_conf("default")

We can then look into and modify the current configuration:
- `conf.VARIABLES` contains the list of variables to be fetched
- `conf.FEATURES` contains the list of features to be produced from those variables during pre-processing steps
- `conf.TRANSFO_PIPELINE` contains the pre-processing pipeline, which is a list of `predictsignauxfaibles.Preprocessor` objects. Each preprocessor is defined by a function, a set of inputs and a set of outputs
- `conf.MODEL_PIPELINE` contains a `sklearn.pipeline.Pipeline` with `fit` and `predict` methods
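
To make the idea of a preprocessor defined by a function, a set of inputs and a set of outputs more concrete, here is a minimal sketch. It is not the library's implementation: `SketchPreprocessor`, `run_sketch_pipeline`, `add_debt_ratio` and the toy columns are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

import pandas as pd


@dataclass
class SketchPreprocessor:
    """Hypothetical stand-in: a function plus the columns it reads and writes."""

    function: Callable[[pd.DataFrame], pd.DataFrame]
    input: List[str]
    output: List[str]


def run_sketch_pipeline(
    data: pd.DataFrame, pipeline: List[SketchPreprocessor]
) -> pd.DataFrame:
    """Apply each step in order, checking declared inputs and outputs."""
    data = data.copy()
    for step in pipeline:
        missing = set(step.input) - set(data.columns)
        if missing:
            raise ValueError(f"Missing input columns: {sorted(missing)}")
        data = step.function(data)
        not_produced = set(step.output) - set(data.columns)
        if not_produced:
            raise ValueError(f"Step did not produce: {sorted(not_produced)}")
    return data


def add_debt_ratio(df: pd.DataFrame) -> pd.DataFrame:
    """Toy transformation: derive one feature from two raw variables."""
    df = df.copy()
    df["debt_ratio"] = df["debt"] / df["revenue"]
    return df


# Toy data with made-up column names
toy = pd.DataFrame({"debt": [10.0, 30.0], "revenue": [100.0, 60.0]})
steps = [SketchPreprocessor(add_debt_ratio, ["debt", "revenue"], ["debt_ratio"])]
transformed = run_sketch_pipeline(toy, steps)
```

Declaring inputs and outputs explicitly lets a pipeline runner fail early when a variable is missing, rather than deep inside a transformation.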

In [None]:
conf.VARIABLES

In [None]:
conf.TRANSFO_PIPELINE

In [None]:
conf.FEATURES

In [None]:
conf.MODEL_PIPELINE

In [None]:
train = conf.TRAIN_DATASET
train.sample_size = 1e4

test = conf.TEST_DATASET
test.sample_size = 1e4

Fetching data
---
At this point, we have allocated datasets but we have not fetched any data into them:

In [None]:
conf.TRAIN_DATASET

### Option 1: Load data from MongoDB (requires an authorized connection to our database):

In [None]:
train.fetch_data().raise_if_empty()
test.fetch_data().raise_if_empty()
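logging.info("Successfully loaded Features data from MongoDB")

# Set to False to skip saving a local copy of the extracts
SAVE_LOCAL_COPY = True
if SAVE_LOCAL_COPY:
    savepath = "path/to/local_features_save"
    train.data.to_csv(f"{savepath}_train.csv")
    test.data.to_csv(f"{savepath}_test.csv")
    logging.info("Saved Features extracts to %s_train.csv and %s_test.csv", savepath, savepath)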

### Option 2: Load data from a local file, for instance a csv:

In [None]:
train_filepath = "/path/to/train_dataset.csv"
test_filepath = "/path/to/test_dataset.csv"

train.data = pd.read_csv(train_filepath)
test.data = pd.read_csv(test_filepath)

logging.info("Successfully loaded Features data from %s and %s", train_filepath, test_filepath)

### Option 3: Perform your train/test split a posteriori from a single saved extract from Features:

In [None]:
features_filepath = "/path/to/features_extract.csv"
df = SFDataset(
    date_min="2018-01-01",
    date_max="2018-12-31",
    fields=conf.VARIABLES,
    sample_size=2e4,
)
df.data = pd.read_csv(features_filepath)

X_train, X_test, _, _ = train_test_split(
    df.data,
    df.data["outcome"],
    test_size=0.33,
    random_state=42
)
train = SFDataset()
train.data = X_train

test = SFDataset()
test.data = X_test

Pre-processing our data
---

To avoid biasing the evaluation, our test set should not contain any SIRET that belongs to the same SIREN as a SIRET in the train set:

In [None]:
train_siren_set = train.data["siren"].unique().tolist()
test.remove_siren(train_siren_set)
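
As an illustration of what this filtering achieves, here is a minimal pandas sketch on toy data. It does not reproduce the internals of `remove_siren`; the identifiers below are made up, and the only fact relied on is that a SIRET has 14 digits, the first 9 of which are the SIREN of the owning company.

```python
import pandas as pd

# Toy establishments (made-up identifiers), stored as strings to keep leading zeros
train_df = pd.DataFrame(
    {"siret": ["12345678900011", "98765432100027"], "outcome": [0, 1]}
)
test_df = pd.DataFrame(
    {"siret": ["12345678900029", "55555555500013"], "outcome": [1, 0]}
)

# The SIREN is the first 9 digits of the SIRET
for frame in (train_df, test_df):
    frame["siren"] = frame["siret"].str[:9]

# Drop every test establishment whose company also appears in the train set
train_sirens = set(train_df["siren"])
test_filtered = test_df[~test_df["siren"].isin(train_sirens)].reset_index(drop=True)
```

Here the first test establishment shares its SIREN with a train establishment, so only the second one is kept.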

We then run the transformation (i.e. pre-processing) pipeline on both sets:

In [None]:
train.replace_missing_data().remove_na(ignore=IGNORE_NA)
train.data = run_pipeline(train.data, conf.TRANSFO_PIPELINE)

test.replace_missing_data().remove_na(ignore=IGNORE_NA)
test.data = run_pipeline(test.data, conf.TRANSFO_PIPELINE)

Training our model
---
To train any model on our data, you can create and modify your own modeling pipeline.

In [None]:
model_pp = conf.MODEL_PIPELINE
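
If you would rather build a pipeline from scratch, the following is a minimal sketch using plain scikit-learn on synthetic data. It is not the content of `conf.MODEL_PIPELINE`: the step names (`scale`, `model`), the synthetic features and the hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, linearly separable data standing in for pre-processed features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# A two-step pipeline: scale the features, then fit a logistic regression
sketch_pipeline = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("model", LogisticRegression(class_weight="balanced", max_iter=1000)),
    ]
)
sketch_pipeline.fit(X, y)
predictions = sketch_pipeline.predict(X)

# Hyperparameters of a step are addressed as "<step name>__<parameter>"
sketch_pipeline.set_params(model__C=0.1)
sketch_pipeline.fit(X, y)
```

Because a `Pipeline` exposes `fit` and `predict` like any estimator, swapping the final step for another classifier leaves the rest of the notebook unchanged.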

In [None]:
fit = model_pp.fit(train.data, train.data["outcome"])

In [None]:
params = fit.get_params()

Model evaluation
---

In [None]:
def evaluate(
    model, dataset, beta
):  # To be turned into a SFModel method when refactoring models
    """
    Returns evaluation metrics of model evaluated on dataset
    Args:
        model: a sklearn-like model with a predict method
        dataset: a dataset whose `data` attribute holds the features and an "outcome" column
        beta: weight of recall relative to precision in the F-beta score
    """
    # Predict once and reuse the predictions for both metrics
    predictions = model.predict(dataset.data)
    balanced_accuracy = balanced_accuracy_score(dataset.data["outcome"], predictions)
    fbeta = fbeta_score(dataset.data["outcome"], predictions, beta=beta)
    return {"balanced_accuracy": balanced_accuracy, "fbeta": fbeta}
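
To see what these two metrics measure, here is a small pure-Python sketch that computes them from confusion-matrix counts for a binary problem. The helper names and the toy labels are hypothetical; in practice the scikit-learn functions used above should be preferred.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta: beta > 1 weights recall more heavily, beta < 1 favours precision."""
    b2 = beta**2
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def balanced_accuracy_from_counts(tp, fp, fn, tn):
    """Mean of recall on the positive class and recall on the negative class."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (recall + specificity) / 2


# Toy labels: 3 positives, 5 negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
f2 = fbeta_from_counts(tp, fp, fn, beta=2)
balanced = balanced_accuracy_from_counts(tp, fp, fn, tn)
```

Balanced accuracy averages the per-class recalls, so it is not inflated by a large majority class, which matters when positive outcomes are rare.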

In [None]:
eval_metrics = evaluate(fit, test, conf.EVAL_BETA)

In [None]:
eval_metrics