# Signaux Faibles - Data Science Demo

The purpose of this notebook is to get you started with the `predictsignauxfaibles` repository.

In this notebook, we will retrieve some data in a `SFDataset` object, train a basic `SFModelGAM` on it and make some predictions using our trained model.

### Setup

You should have created a `.env` file at the root of your local copy of the repo. The required entries are documented in `.env.example`. _Never_ commit your `.env` file.
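To double-check the setup, a small helper can compare the variable names in `.env.example` with those in your local `.env`. This is a minimal sketch, not part of `predictsignauxfaibles`: the function names are hypothetical, and it assumes both files use plain `KEY=value` lines.

```python
def parse_env_keys(text: str) -> set:
    """Return the variable names defined in dotenv-style text (KEY=value lines)."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_env_keys(example_text: str, env_text: str) -> set:
    """Return the keys documented in the example file but absent from the local file."""
    return parse_env_keys(example_text) - parse_env_keys(env_text)
```

From this notebook, you could pass it the contents of `../.env.example` and `../.env` (read with `pathlib.Path.read_text`), assuming the notebook sits one level below the repo root. An empty result means every documented entry is present.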

In [None]:
%config Completer.use_jedi = False

In [None]:
# Add root of the repo to PYTHONPATH
import sys
sys.path.append("../.")

# Mute warnings (do not do this in production!)
# TODO: fix pyGAM warnings https://github.com/signaux-faibles/predictsignauxfaibles/issues/12
import warnings
warnings.filterwarnings('ignore')

# Set logging level to INFO
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Import required libraries and modules
import pandas as pd
import predictsignauxfaibles.config as config
from predictsignauxfaibles.data import SFDataset
from predictsignauxfaibles.models import SFModelGAM
import json
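
The setup cell above silences every warning for the whole session. If you would rather keep warnings visible and mute only one noisy call, the standard library's `warnings.catch_warnings()` context manager restores the previous filters on exit. A minimal sketch, where `noisy_computation` is a hypothetical stand-in for a call that emits warnings:

```python
import warnings


def noisy_computation() -> int:
    """Hypothetical stand-in for a library call that emits a warning."""
    warnings.warn("something looks off", UserWarning)
    return 42


# Silence warnings only inside this block; the previous filters are restored on exit
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    result = noisy_computation()

# Outside the block, warnings are reported again; record them here to check
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_computation()
```

The second block is only there to show that the warning is emitted again once the first context manager has exited.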

Make sure that you have access to MongoDB. If you are unsure how to do this, just ask.

### Load Data

The easiest way to load a dataset is via the `SFDataset` class. It can be instantiated in two ways:
- via its constructor, `dataset = SFDataset(...)`, which is better for developing and exploring the data
- via a YAML configuration file, `dataset = SFDataset.from_config_file("../models/rocketscience/model.yml")`, which is best for ensuring reproducibility and for production use.

There is also an `OversampledSFDataset` class that lets you ask for a given proportion of positive observations in the resulting dataset.
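
To make the idea of a target proportion of positive observations concrete, here is a generic oversampling sketch in plain Python. It is not how `OversampledSFDataset` is implemented (that is not documented here); the `oversample` function, its parameters, and the toy data are all hypothetical.

```python
import random


def oversample(rows, is_positive, proportion_positive, size, seed=0):
    """Draw `size` rows with replacement so that `proportion_positive` of them are positive."""
    rng = random.Random(seed)
    positives = [r for r in rows if is_positive(r)]
    negatives = [r for r in rows if not is_positive(r)]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative row")
    n_pos = round(size * proportion_positive)
    sample = rng.choices(positives, k=n_pos) + rng.choices(negatives, k=size - n_pos)
    rng.shuffle(sample)
    return sample


# Hypothetical toy data: 2 positives among 20 observations (10% positives)
rows = [{"siret": i, "outcome": i < 2} for i in range(20)]
balanced = oversample(rows, lambda r: r["outcome"], proportion_positive=0.3, size=100)
share = sum(r["outcome"] for r in balanced) / len(balanced)
```

Here `share` is 0.3 by construction. Keep in mind that a model trained on oversampled data sees more positives than it would in production, so its predicted probabilities are not directly comparable to real-world rates.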

In [None]:
from predictsignauxfaibles.data import SFDataset

In [None]:
MY_FEATURES = [
    "montant_part_ouvriere_past_1",
    "montant_part_patronale_past_1",
    "ratio_dette",
]

# It's always a good idea to query periods, siret, and outcomes too
FIELDS_TO_QUERY = ["siret", "siren", "periode", "outcome"] + MY_FEATURES

dataset = SFDataset(
    date_min="2015-01-01",
    date_max="2020-06-30",
    fields=FIELDS_TO_QUERY,
    sample_size=100
)

We have successfully created an (empty) dataset. Use the `fetch_data` method to fill it. The data is stored as a Pandas DataFrame in the `.data` attribute.

In [None]:
dataset.fetch_data()

# show first 5 rows of dataset
dataset.data.head()

Run `prepare_data()` for standard data preprocessing. This method:
- fills missing values with their defaults defined in `config.py`
- drops any remaining observations with NAs
- optionally removes "strong signals"


You can also manipulate `dataset.data` yourself if you want to apply your own transformations. Look into the `predictsignauxfaibles.preprocessors` module for common preprocessing functions.
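
The first two steps of `prepare_data()` can be pictured with the plain-Python sketch below. It only illustrates the logic and is not the library's code: `DEFAULT_VALUES` is a hypothetical stand-in for the defaults in `config.py`, and the removal of "strong signals" is left out because its definition is not given here.

```python
# Hypothetical defaults, standing in for the ones defined in config.py
DEFAULT_VALUES = {
    "montant_part_ouvriere_past_1": 0.0,
    "montant_part_patronale_past_1": 0.0,
}


def fill_defaults(row: dict, defaults: dict) -> dict:
    """Replace missing (None) values by their default when one is defined."""
    return {
        key: defaults[key] if value is None and key in defaults else value
        for key, value in row.items()
    }


def prepare(rows: list, defaults: dict) -> list:
    """Fill defaults, then drop rows that still contain a missing value."""
    filled = [fill_defaults(row, defaults) for row in rows]
    return [row for row in filled if all(v is not None for v in row.values())]


rows = [
    {"montant_part_ouvriere_past_1": None, "montant_part_patronale_past_1": 12.0, "ratio_dette": 0.4},
    {"montant_part_ouvriere_past_1": 3.0, "montant_part_patronale_past_1": None, "ratio_dette": None},
]
prepared = prepare(rows, DEFAULT_VALUES)
```

Only the first row survives: the second one still has a missing `ratio_dette`, which has no default. On a DataFrame, the same two steps correspond roughly to `df.fillna(value=defaults).dropna()`.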

In [None]:
dataset.prepare_data()

If needed, you can also load `variables.json` to get the full list of available features.

In [None]:
# Load the variable catalogue and keep the unique feature names
with open("../variables.json", encoding="utf-8") as json_file:
    variables = pd.DataFrame(json.load(json_file))
MY_FEATURES_ALL = list(variables["name"].unique())

dataset_all = SFDataset(
    date_min="2015-01-01",
    date_max="2016-06-30",
    fields=MY_FEATURES_ALL,  # NB: the default value is "all" too :)
    sample_size=10_000,
)
dataset_all.fetch_data()

# We got all the variables, i.e. more than 300 features
dataset_all.data.shape

### Train a model

Just like datasets, models can be instantiated in two ways:
- via their constructor, `model = SFModel(...)`, which is better for developing and experimenting
- via a YAML configuration file, `model = SFModel.from_config_file("../models/rocketscience/model.yml")`, which is best for ensuring reproducibility and for production use.

Once you are done developing a new model, don't forget to write its configuration file so that your coworkers can reproduce and audit your work :)

In [None]:
gam = SFModelGAM(dataset, features=MY_FEATURES, target="outcome")

Train a model using its `train` method. The (trained) model is stored in the `.model` attribute.

In [None]:
gam.train()
gam.model.summary()

### Evaluate the model

Signaux Faibles uses a fairly specific way to evaluate a model. This evaluation process is implemented in `SFModelEvaluator`.

First, query a validation dataset:

In [None]:
validation_set = SFDataset(
    date_min="2018-01-01",
    date_max="2018-06-30",
    fields=FIELDS_TO_QUERY,
    sample_size=5_000,
)
validation_set.fetch_data().prepare_data()

Run the cross-validation evaluation method on our model:

In [None]:
from predictsignauxfaibles.model_selection import SFModelEvaluator

cv_scores = SFModelEvaluator(model=gam).cv_evaluation(num_folds=5, validate_set=validation_set)
cv_scores

Compute the average performance of the model:

In [None]:
average_score = sum(cv_scores.values()) / len(cv_scores)
round(average_score, 3)
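
An average alone hides how much the score varies from fold to fold, so it can help to report the spread as well. The sketch below assumes, as the cell above does, that the scores come as a dict mapping each fold to a single number. The values are made up for illustration and are not results from this model.

```python
from statistics import mean, stdev

# Hypothetical per-fold scores, shaped like the dict the cell above assumes
example_scores = {0: 0.61, 1: 0.58, 2: 0.64, 3: 0.60, 4: 0.57}

avg = mean(example_scores.values())
spread = stdev(example_scores.values())  # sample standard deviation across folds

print(f"score: {avg:.3f} +/- {spread:.3f}")
```

A large spread relative to the mean suggests that the score depends heavily on which observations land in each fold.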

### Make predictions on new data

In [None]:
new_data = validation_set.data[MY_FEATURES]

# predict probabilities (floats)
pred_probas = gam.predict_proba(new_data)

# predict outcome (True/False)
pred_outcomes = gam.predict(new_data)

pred_probas[:5], pred_outcomes[:5]
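
A boolean outcome is usually derived from a probability by applying a cutoff. The cutoff used by `SFModelGAM.predict` is not stated in this notebook, so the sketch below uses a hypothetical `to_outcomes` helper with an assumed default of 0.5, on made-up probabilities, to show how the choice of threshold changes which observations are flagged.

```python
def to_outcomes(probas, threshold=0.5):
    """Flag an observation as positive when its probability reaches the threshold."""
    return [p >= threshold for p in probas]


# Hypothetical probabilities, standing in for the output of gam.predict_proba(new_data)
example_probas = [0.02, 0.47, 0.50, 0.91]

strict = to_outcomes(example_probas)        # assumed default cutoff of 0.5
lenient = to_outcomes(example_probas, 0.4)  # a lower cutoff flags more observations
```

Lowering the threshold catches more true positives at the cost of more false alarms, so the right value depends on how the predictions are used.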

### Save the model

Work in progress :) In the meantime, you can use `pickle` to serialize any Python object.
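
As a minimal sketch of that `pickle` workaround, the round trip below saves an object to disk and loads it back. A plain dict stands in for the trained model, and the file name is hypothetical. Whether a given `SFModelGAM` instance pickles cleanly has not been verified here.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model object such as `gam`; any picklable object works
trained = {"features": ["ratio_dette"], "coefficients": [0.42]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pickle"  # hypothetical file name

    # Serialize the object to disk
    with model_path.open("wb") as f:
        pickle.dump(trained, f)

    # Load it back; only unpickle files you trust, as loading can run arbitrary code
    with model_path.open("rb") as f:
        restored = pickle.load(f)
```

In practice you would write to a persistent path rather than a temporary directory, and load the file with the same library versions that were used to save it.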