# Signaux Faibles - Data Science Demo

<div class="alert alert-warning">
Never commit this notebook along with its output.

Please make sure that you have defined the "strip-notebook-output" filter in .git/config:
```ini
[filter "strip-notebook-output"]
        clean = "jupyter nbconvert --to=notebook --ClearOutputPreprocessor.enabled=True --stdout %f"
        smudge = cat
        required
```
    
Only if the filter does not work and you must stage notebooks, run `jupyter nbconvert --clear-output --inplace 00-get_started.ipynb`
or use Kernel > Restart Kernel and Clear All Outputs...
</div>
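
As a quick sanity check before staging, the sketch below tests whether a notebook file still carries outputs. It relies only on the standard nbformat 4 JSON layout (`cells`, `cell_type`, `outputs`, `execution_count`); the helper name `has_outputs` is hypothetical, and this is not a replacement for the git filter or `nbconvert`.

```python
import json
from pathlib import Path


def has_outputs(notebook: dict) -> bool:
    """Return True if any code cell still has outputs or an execution count."""
    return any(
        bool(cell.get("outputs")) or cell.get("execution_count") is not None
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    )


def notebook_file_has_outputs(path: str) -> bool:
    """Load a .ipynb file (plain JSON) and check it for leftover outputs."""
    return has_outputs(json.loads(Path(path).read_text(encoding="utf-8")))
```

For example, `notebook_file_has_outputs("00-get_started.ipynb")` should return `False` once the outputs have been cleared.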

The purpose of this notebook is to get you started with the `predictsignauxfaibles` repository.

### Setup

You should have created a `.env` file at the root of your local copy of the repo. The required entries are documented in `.env.example`. _Never_ commit your `.env` file.
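
One way to confirm that your `.env` covers every entry listed in `.env.example` is to compare the variable names in both files. The sketch below assumes both files use simple `KEY=VALUE` lines with `#` comments; the function names are hypothetical and not part of the package.

```python
from pathlib import Path


def env_keys(text: str) -> set[str]:
    """Collect variable names from KEY=VALUE lines, skipping blanks and comments."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name.startswith("export "):
            name = name[len("export "):].strip()
        keys.add(name)
    return keys


def missing_env_keys(example_text: str, env_text: str) -> set[str]:
    """Names present in .env.example but absent from .env."""
    return env_keys(example_text) - env_keys(env_text)


def check_env_file(root: str = ".") -> set[str]:
    """Compare the two files at the repo root and return the missing names."""
    root_path = Path(root)
    return missing_env_keys(
        (root_path / ".env.example").read_text(encoding="utf-8"),
        (root_path / ".env").read_text(encoding="utf-8"),
    )
```

Calling `check_env_file()` from the root of the repo returns an empty set when nothing is missing.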

Make sure that the latest version of the `predictsignauxfaibles` package is installed.

```sh
cd predictsignauxfaibles
git pull origin <branch>  # replace <branch> with the branch you want to try
pip install -e .
```

In [None]:
# Set logging level to INFO
import logging
logging.getLogger().setLevel(logging.INFO)

# Import required libraries and modules
import pandas as pd
import json

Make sure that you have access to MongoDB. If you are unsure how to do this, just ask.

### Load Data

The easiest way to load a dataset is via the `SFDataset` class which does all of the MongoDB-related heavy-lifting for you.

There is also an `OversampledSFDataset` class available that lets you ask for a given proportion of positive observations in the resulting dataset.
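
To illustrate what "a given proportion of positive observations" means, here is a minimal pandas sketch that draws a sample with a fixed share of positives from an in-memory DataFrame. It is not the implementation of `OversampledSFDataset`; the function name, its parameters, and the idea of sampling with replacement when a class is too small are all assumptions made for illustration.

```python
import pandas as pd


def sample_with_positive_share(df, n, proportion_positive, outcome_col="outcome", seed=0):
    """Draw n rows so that round(n * proportion_positive) of them are positives."""
    if not 0 <= proportion_positive <= 1:
        raise ValueError("proportion_positive must be between 0 and 1")
    n_pos = round(n * proportion_positive)
    n_neg = n - n_pos
    is_positive = df[outcome_col].astype(bool)
    positives, negatives = df[is_positive], df[~is_positive]
    # Sample with replacement only when a class has fewer rows than requested
    pos = positives.sample(n=n_pos, replace=len(positives) < n_pos, random_state=seed)
    neg = negatives.sample(n=n_neg, replace=len(negatives) < n_neg, random_state=seed)
    # Shuffle so that positives and negatives are interleaved
    return pd.concat([pos, neg]).sample(frac=1, random_state=seed).reset_index(drop=True)
```

Keep in mind that a model trained on such a sample sees a different class balance than the one found in production data.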

The package should be well documented; use `help(SFDataset)` for help on how to use these objects.

In [None]:
from predictsignauxfaibles.data import SFDataset, OversampledSFDataset

In [None]:
MY_FEATURES = [
    "montant_part_ouvriere_past_1",
    "montant_part_patronale_past_1",
    "ratio_dette",
]

# It's always a good idea to query periods, siret, and outcomes too
FIELDS_TO_QUERY = ["siret", "siren", "periode", "outcome", "time_til_outcome"] + MY_FEATURES

dataset = SFDataset(
    date_min="2015-01-01",
    date_max="2020-06-30",
    fields=FIELDS_TO_QUERY,
    sample_size=100
)

We have successfully created an (empty) dataset. Use the `fetch_data` method to fill it. The data is stored as a pandas DataFrame in the `.data` attribute.

In [None]:
dataset.fetch_data()

# show first 5 rows of dataset
dataset.data.head()

Some commonly used preprocessing tasks are implemented as `SFDataset` methods:
- fill missing values with their default values defined in `config.py`
- drop any remaining observations with NAs
- remove "strong signals"

In [None]:
dataset.replace_missing_data().remove_na(ignore=["time_til_outcome"]).remove_strong_signals()
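
For intuition, the three steps can be approximated in plain pandas as sketched below. This is not the package's code: the default value shown is hypothetical (the real defaults live in `config.py`), and the definition of a "strong signal" as a row whose `time_til_outcome` is zero or negative is an assumption.

```python
import pandas as pd

# Hypothetical default; the real values are defined in the package's config.py
DEFAULT_VALUES = {"ratio_dette": 0.0}


def preprocess(df: pd.DataFrame, defaults: dict, ignore: list[str]) -> pd.DataFrame:
    """Approximate the fill / drop / filter chain on a plain DataFrame."""
    # 1. Fill missing values with per-column defaults
    out = df.fillna(value=defaults)
    # 2. Drop rows that still contain NAs, except in the ignored columns
    checked = [col for col in out.columns if col not in ignore]
    out = out.dropna(subset=checked)
    # 3. Assumed meaning of "strong signal": time_til_outcome <= 0
    #    (rows where time_til_outcome is missing compare as False and are kept)
    strong = out["time_til_outcome"] <= 0
    return out[~strong].reset_index(drop=True)
```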

You can also manipulate `dataset.data` yourself if you want to perform your own transformation of the data.

Look into the `predictsignauxfaibles.preprocessors` module for common preprocessing functions.

We also use transformation pipelines for model-specific preprocessing tasks (see `predictsignauxfaibles.pipelines`).

## Run a model

Models are configured in a Python configuration file stored in `models/<model_name>/model_conf.py`. Some configuration values can be changed at run time via the CLI (run `python -m predictsignauxfaibles --help` for more information).

Every model run produces two files in `model_runs/<model_id>/` (**which is never committed to Git**):
- the model's predictions in CSV format
- model statistics and metadata in JSON format
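
To inspect a finished run from Python, something like the following sketch could be used. The exact file names are not specified here, so the helper (a hypothetical `load_model_run`) simply picks up every `.csv` and `.json` file it finds in the run directory.

```python
import csv
import json
from pathlib import Path


def load_model_run(run_dir):
    """Load every CSV and JSON file found in a model run directory."""
    run_path = Path(run_dir)
    predictions = {}
    stats = {}
    for csv_file in sorted(run_path.glob("*.csv")):
        # Each CSV becomes a list of dicts keyed by the header row
        with csv_file.open(newline="", encoding="utf-8") as handle:
            predictions[csv_file.name] = list(csv.DictReader(handle))
    for json_file in sorted(run_path.glob("*.json")):
        stats[json_file.name] = json.loads(json_file.read_text(encoding="utf-8"))
    return predictions, stats
```

Both return values are dictionaries keyed by file name. For larger prediction files, `pd.read_csv` is likely a better fit than `csv.DictReader`.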

The environment variable `ENV` lets you run the model in `develop` mode (by default, on a few thousand observations) or in `prod` mode (using much more data but taking longer to run).

```sh
export ENV=prod # or develop to make this faster
python -m predictsignauxfaibles
```
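
The same run can be launched from a notebook or script without touching your shell environment. The sketch below wraps the command above with `subprocess`; the helper names are hypothetical, and restricting `ENV` to `develop` and `prod` is an assumption based on the two modes mentioned here.

```python
import os
import subprocess
import sys


def build_run(env_name: str = "develop"):
    """Return the command and environment for `python -m predictsignauxfaibles`."""
    if env_name not in {"develop", "prod"}:
        raise ValueError(f"unexpected ENV value: {env_name!r}")
    command = [sys.executable, "-m", "predictsignauxfaibles"]
    # Copy the current environment and override ENV for the child process only
    environment = {**os.environ, "ENV": env_name}
    return command, environment


def run_model(env_name: str = "develop") -> int:
    """Run the model in a subprocess and return its exit code."""
    command, environment = build_run(env_name)
    return subprocess.run(command, env=environment, check=False).returncode
```

Using `sys.executable` keeps the run inside the same Python environment in which the package was installed.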