# Signaux Faibles - Data Science Demo

The purpose of this notebook is to get you started with the `predictsignauxfaibles` repository.

In this notebook, we will retrieve some data in a `SFDataset` object, train a basic `SFModelGAM` on it and make some predictions using our trained model.

### Setup

You should have created a `.env` file at the root of your local copy of the repo. The required entries are documented in `.env.example`. _Never_ commit your `.env` file.
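A quick way to catch a missing entry early is to check the environment before running the rest of the notebook. The sketch below is illustrative only: it assumes the entries of `.env` end up in the process environment, and the names `MONGODB_URI` and `MONGODB_DB` are hypothetical placeholders. The real list is the one in `.env.example`.

```python
import os


def missing_env_vars(required, environ=None):
    """Return the names in `required` that are unset or empty in `environ`."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Hypothetical names: replace them with the entries listed in `.env.example`
REQUIRED = ["MONGODB_URI", "MONGODB_DB"]

# A fake environment is used here so the example does not depend on your machine
fake_env = {"MONGODB_URI": "mongodb://localhost:27017", "MONGODB_DB": ""}
print(missing_env_vars(REQUIRED, fake_env))  # ['MONGODB_DB']
```

Calling `missing_env_vars(REQUIRED)` without a second argument checks the real environment instead.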

In [1]:
# Add root of the repo to PYTHONPATH
import sys
sys.path.append("../.")

# Set logging level to INFO
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Import required libraries and modules
import pandas as pd

import config
from lib.data import SFDataset
from lib.models import SFModelGAM

Make sure that you have access to MongoDB. If you are unsure how to do this, just ask.

### Load Data

The easiest way to load a dataset is via the `SFDataset` class. It can be instantiated in two ways:
- via its constructor, `dataset = SFDataset(...)`, which is better for developing and exploring the data
- via a YAML configuration file, `dataset = SFDataset.from_config_file("../models/rocketscience/model.yml")`, which is best for ensuring reproducibility and for production use.

In [2]:
MY_FEATURES = [
    "montant_part_ouvriere_past_1",
    "montant_part_patronale_past_1",
    "ratio_dette",
]

# It's always a good idea to query periods, siret, and outcomes too
FIELDS_TO_QUERY =  ["siret", "periode", "outcome"] + MY_FEATURES

dataset = SFDataset(
    date_min="2015-01-01",
    date_max="2016-06-30",
    fields=FIELDS_TO_QUERY,
    sample_size=10_000,
    batch_id="2009_5"
)

dataset


        -----------------------
        Signaux Faibles Dataset
        -----------------------

        batch_id : 2009_5
        ---------- 

        Fields:
        -------
            ['siret', 'periode', 'outcome', 'montant_part_ouvriere_past_1', 'montant_part_patronale_past_1', 'ratio_dette']

        MongoDB Aggregate Pipeline:
        ---------------------------
            []
        

We have successfully created an (empty) dataset. Use the `fetch_data` method to fill it. The data is stored as a Pandas DataFrame in the `.data` attribute.

In [3]:
dataset.fetch_data()

dataset.data.head()

| | siret | periode | outcome | montant_part_ouvriere_past_1 | montant_part_patronale_past_1 | ratio_dette |
|---|---|---|---|---|---|---|
| 0 | 32657703800028 | 2015-04-01 | False | 0.0 | 0.0 | 0.0 |
| 1 | 32018930100015 | 2016-04-01 | False | 0.0 | 0.0 | 0.0 |
| 2 | 33037752400203 | 2015-11-01 | False | 0.0 | 0.0 | |
| 3 | 44937840500012 | 2016-05-01 | False | 0.0 | 0.0 | 0.0 |
| 4 | 42481349100018 | 2015-08-01 | False | 0.0 | 0.0 | |


Run `prepare_data()` for standard data preprocessing. This method:
- creates a `siren` column from the `siret`
- fills missing values with the defaults defined in `config.py`
- drops any remaining observations with NAs
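The three steps above can be approximated in plain pandas. This is a minimal sketch, not the actual implementation of `prepare_data()`: the function name `prepare` and the contents of `DEFAULT_VALUES` are hypothetical, and the real defaults are the ones defined in `config.py`. It relies on the fact that a SIREN is the first 9 digits of a 14-digit SIRET.

```python
import pandas as pd

# Hypothetical defaults: the real ones are defined in `config.py`
DEFAULT_VALUES = {"ratio_dette": 0.0}


def prepare(df, defaults):
    """Sketch of the three preprocessing steps, applied to a copy of `df`."""
    out = df.copy()
    # 1. A SIREN is the first 9 digits of the 14-digit SIRET
    out["siren"] = out["siret"].astype(str).str[:9]
    # 2. Fill missing values column by column with their defaults
    out = out.fillna(value=defaults)
    # 3. Drop rows that still contain missing values
    return out.dropna().reset_index(drop=True)


raw = pd.DataFrame(
    {
        "siret": ["32657703800028", "33037752400203", "44937840500012"],
        "montant_part_ouvriere_past_1": [0.0, 0.0, None],  # no default: row dropped
        "ratio_dette": [0.0, None, 0.0],  # has a default: value filled
    }
)
clean = prepare(raw, DEFAULT_VALUES)
print(clean)
```

In this toy input, the second row keeps its place because `ratio_dette` has a default, while the third row is dropped because its missing field has none.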


You can also manipulate `dataset.data` yourself.

In [4]:
dataset.prepare_data()

INFO:root:Creating a `siren` column
INFO:root:Replacing missing data with default values
INFO:root:Drop observations with missing required fields.
INFO:root:Removing NAs from dataset.
INFO:root:Number of observations before: 10000
INFO:root:Number of observations after: 10000


### Train a model

Just like datasets, models can be instantiated in two ways:
- via their constructor, `model = SFModel(...)`, which is better for developing and exploring
- via a YAML configuration file, `model = SFModel.from_config_file("../models/rocketscience/model.yml")`, which is best for ensuring reproducibility and for production use.

Once you are done developing a new model, don't forget to write its configuration file so that your coworkers can reproduce and audit your work :)

In [5]:
gam = SFModelGAM(dataset, features=MY_FEATURES, target="outcome")

Train a model using its `train` method. The (trained) model is stored in the `.model` attribute.

In [6]:
gam.train()
gam.model.summary()

LogisticGAM                                                                                               
Distribution:                      BinomialDist Effective DoF:                                       7.648
Link Function:                        LogitLink Log Likelihood:                                  -417.5112
Number of Samples:                         7000 AIC:                                              850.3184
                                                AICc:                                             850.3422
                                                UBRE:                                               2.1223
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.4499
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
s(0)                              [0.

Please do not make inferences based on these values!

Collaborate on a solution, and stay up to date at:
github.com/dswah/pyGAM/issues/163
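The summary reports 7000 samples, while the prepared dataset holds 10000 observations, which would be consistent with a 70/30 train/test split. How `SFModelGAM` actually splits the data is not shown in this notebook. The sketch below is only one plausible approach, a random row-level split with pandas, and the function name `split_train_test` is hypothetical.

```python
import pandas as pd


def split_train_test(df, train_frac=0.7, seed=0):
    """Randomly split the rows of `df` into a train and a test DataFrame."""
    # Sample the training rows, then keep every other row for testing
    train = df.sample(frac=train_frac, random_state=seed)
    test = df.drop(train.index)
    return train, test


# Toy data with the same number of rows as the prepared dataset
df = pd.DataFrame({"x": range(10_000)})
train, test = split_train_test(df)
print(len(train), len(test))  # 7000 3000
```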


### Evaluate the model

In [7]:
gam.evaluate()

0.9886666666666667
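The notebook does not say which metric `evaluate()` returns. If it is an accuracy-like score, it helps to compare it with the share of the majority class, since the first rows displayed earlier all have `outcome` equal to `False`. The sketch below uses made-up labels, not results from this dataset, and the helper `accuracy` is hypothetical.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up, heavily imbalanced labels: 98 negatives and 2 positives
y_true = [False] * 98 + [True] * 2
# A trivial "model" that always predicts the majority class
y_pred = [False] * 100

print(accuracy(y_true, y_pred))  # 0.98
```

With labels this imbalanced, a classifier that never predicts a positive already reaches a high accuracy, so a high score alone says little about how well positives are detected.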

### Save the model

Work in progress :)