# Signaux Faibles - Data Science Demo

The purpose of this notebook is to get you started with the `predictsignauxfaibles` repository.

In this notebook, we will retrieve some data in a `SFDataset` object, train a basic `SFModelGAM` on it and make some predictions using our trained model.

### Setup

You should have created a `.env` file at the root of your local copy of the repo. The required entries are documented in `.env.example`. _Never_ commit your `.env` file.
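Before going further, it can help to check that your `.env` file defines every entry listed in `.env.example`. The helper below is a minimal sketch using only the standard library; the function names `read_env_keys` and `missing_env_keys` are hypothetical (they are not part of `predictsignauxfaibles`), and the sketch assumes both files use plain `KEY=VALUE` lines.

```python
from pathlib import Path


def read_env_keys(path):
    """Return the set of variable names defined in a KEY=VALUE file."""
    keys = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines and comments
        if not line or line.startswith("#"):
            continue
        if "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_env_keys(env_path=".env", example_path=".env.example"):
    """Return the entries present in the example file but absent from the .env file."""
    return sorted(read_env_keys(example_path) - read_env_keys(env_path))
```

From this notebook, something like `missing_env_keys("../.env", "../.env.example")` should return an empty list when the setup is complete.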

In [7]:
# Add root of the repo to PYTHONPATH
import sys
sys.path.append("../.")

# mute warnings (! do not do this when working in prod !)
# TODO: fix pyGAM warnings https://github.com/signaux-faibles/predictsignauxfaibles/issues/12
import warnings
warnings.filterwarnings('ignore')

# Set logging level to INFO
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# Import required libraries and modules
import pandas as pd
import config
from predictsignauxfaibles.data import SFDataset
from predictsignauxfaibles.models import SFModelGAM
import json

Make sure that you have access to MongoDB. If you are unsure how to do this, just ask.

### Load Data

The easiest way to load a dataset is via the `SFDataset` class. It can be instantiated in two ways:
- via its constructor `dataset = SFDataset(...)`, better for developing and exploring the data
- via a YAML configuration file `dataset = SFDataset.from_config_file("../models/rocketscience/model.yml")`, which is best for ensuring reproducibility and for production use.

In [21]:
MY_FEATURES = [
    "montant_part_ouvriere_past_1",
    "montant_part_patronale_past_1",
    "ratio_dette",
]

# It's always a good idea to query periods, siret, and outcomes too
FIELDS_TO_QUERY =  ["siret", "siren", "periode", "outcome"] + MY_FEATURES

dataset = SFDataset(
    date_min="2015-01-01",
    date_max="2016-06-30",
    fields=FIELDS_TO_QUERY,
    sample_size=10_000
)

dataset


Signaux Faibles Dataset (batch_id : 2102)
----------------------------------------------------
Empty Dataset
[...]
----------------------------------------------------
Number of observations = 0
        

We have successfully created an (empty) dataset. Use the `fetch_data` method to fill it. The data is stored as a Pandas DataFrame in the `.data` attribute.

In [22]:
dataset.fetch_data()

# hide siret and siren as the repo is public
dataset.data.head().loc[:, ~dataset.data.columns.isin(["siret", "siren"])]

| | periode | outcome | montant_part_ouvriere_past_1 | montant_part_patronale_past_1 | ratio_dette |
|---|---|---|---|---|---|
| 0 | 2015-04-01 | False | 0.0 | 0.0 | 0.0 |
| 1 | 2016-04-01 | False | 0.0 | 0.0 | 0.0 |
| 2 | 2015-11-01 | False | 0.0 | 0.0 | NaN |
| 3 | 2016-05-01 | False | 0.0 | 0.0 | 0.0 |
| 4 | 2015-04-01 | False | 0.0 | 0.0 | 0.0 |


Run `prepare_data()` for standard data preprocessing. This method:
- creates a `siren` column from the `siret`
- fills missing values with their defaults defined in `config.py`
- drops any remaining observation with NAs


You can also manipulate `dataset.data` yourself if you want to perform your own transformation of the data.
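As a rough illustration of what such a manual transformation could look like, the sketch below reproduces the three steps listed above with plain pandas on a toy DataFrame. It is not the library's implementation: `prepare_by_hand` and `DEFAULT_VALUES` are hypothetical names, the default value is made up (the real defaults are defined in `config.py`), and the identifiers are fake.

```python
import pandas as pd

# Hypothetical defaults, for illustration only; the real ones are defined in config.py
DEFAULT_VALUES = {"ratio_dette": 0.0}


def prepare_by_hand(df, defaults=DEFAULT_VALUES):
    """Approximate the three preprocessing steps on a copy of the DataFrame."""
    out = df.copy()
    # 1. A SIREN is made of the first 9 digits of a SIRET
    out["siren"] = out["siret"].astype(str).str[:9]
    # 2. Fill missing values with per-column defaults
    out = out.fillna(value=defaults)
    # 3. Drop the rows that still contain missing values, then reset the index
    return out.dropna().reset_index(drop=True)


# Toy data with made-up identifiers
toy = pd.DataFrame(
    {
        "siret": ["12345678900011", "98765432100022", "11111111100033"],
        "ratio_dette": [0.2, None, 0.1],
        "montant_part_ouvriere_past_1": [0.0, 10.0, None],
    }
)
prepared = prepare_by_hand(toy)
```

In this toy example the second row gets its missing `ratio_dette` filled with the default, while the third row is dropped because `montant_part_ouvriere_past_1` has no default and is still missing.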

In [23]:
dataset.prepare_data()

INFO:root:Replacing missing data with default values
INFO:root:Drop observations with missing required fields.
INFO:root:Removing NAs from dataset.
INFO:root:Number of observations before: 10000
INFO:root:Number of observations after: 9865
INFO:root:Resetting index for DataFrame.



Signaux Faibles Dataset (batch_id : 2102)
----------------------------------------------------
            siret    periode  outcome  montant_part_ouvriere_past_1  \
0  32657703800028 2015-04-01    False                           0.0   
1  32018930100015 2016-04-01    False                           0.0   
2  33037752400203 2015-11-01    False                           0.0   
3  44937840500012 2016-05-01    False                           0.0   
4  77977749900079 2015-04-01    False                           0.0   

   montant_part_patronale_past_1      siren  ratio_dette  
0                            0.0  326577038          0.0  
1                            0.0  320189301          0.0  
2                            0.0  330377524          0.0  
3                            0.0  449378405          0.0  
4                            0.0  779777499          0.0  
[...]
----------------------------------------------------
Number of observations = 9865
        

It is also possible to load the json file "variables.json" to get the entire list of features, if needed. 

In [18]:
with open("../variables.json",  encoding='utf-8') as json_file:
    doc = json.load(json_file)
doc = pd.DataFrame(doc)
MY_FEATURES_ALL = list(doc.name.unique())
dataset_all = SFDataset(
    date_min="2015-01-01",
    date_max="2016-06-30",
    fields=MY_FEATURES_ALL, # NB: the default value is "all" too :)
    sample_size=10_000
)
dataset_all.fetch_data()
dataset_all.data.shape
# We got all the variables, i.e. more than 300 features.

(10000, 316)

### Train a model

Just like datasets, models can be instantiated in two ways:
- via their constructor `model = SFModel(...)`, better for development and exploration
- via a YAML configuration file `model = SFModel.from_config_file("../models/rocketscience/model.yml")`, which is best for ensuring reproducibility and for production use.

Once you are done developing a new model, don't forget to write its configuration file so that your coworkers can reproduce and audit your work :)

In [24]:
gam = SFModelGAM(dataset, features=MY_FEATURES, target="outcome")

Train a model using its `train` method. The (trained) model is stored in the `.model` attribute.

In [25]:
gam.train()
gam.model.summary()

LogisticGAM                                                                                               
Distribution:                      BinomialDist Effective DoF:                                      6.1966
Link Function:                        LogitLink Log Likelihood:                                  -979.1556
Number of Samples:                         9865 AIC:                                             1970.7044
                                                AICc:                                            1970.7164
                                                UBRE:                                               2.2003
                                                Scale:                                                 1.0
                                                Pseudo R-Squared:                                   0.4898
Feature Function                  Lambda               Rank         EDoF         P > x        Sig. Code   
s(0)                              [0.

### Evaluate the model

Signaux Faibles uses a fairly specific way to evaluate a model. This evaluation process is implemented in `SFModelEvaluator`.

First, start by querying a validation dataset:

In [26]:
validation_set = SFDataset(
        date_min="2018-01-01",
        date_max="2018-06-30",
        fields=FIELDS_TO_QUERY,
        sample_size=5_000
)
validation_set.fetch_data().prepare_data()

INFO:root:Replacing missing data with default values
INFO:root:Drop observations with missing required fields.
INFO:root:Removing NAs from dataset.
INFO:root:Number of observations before: 5000
INFO:root:Number of observations after: 4939
INFO:root:Resetting index for DataFrame.



Signaux Faibles Dataset (batch_id : 2102)
----------------------------------------------------
            siret    periode  outcome  montant_part_ouvriere_past_1  \
0  78439368800063 2018-02-01    False                           0.0   
1  32992501093876 2018-03-01    False                           0.0   
2  57980541700022 2018-03-01    False                           0.0   
3  52289945900021 2018-02-01    False                           0.0   
4  38994147700041 2018-05-01    False                           0.0   

   montant_part_patronale_past_1      siren  ratio_dette  
0                            0.0  784393688          0.0  
1                            0.0  329925010          0.0  
2                            0.0  579805417          0.0  
3                            0.0  522899459          0.0  
4                            0.0  389941477          0.0  
[...]
----------------------------------------------------
Number of observations = 4939
        

Run the cross-validation evaluation method on our model:

In [27]:
from predictsignauxfaibles.model_selection import SFModelEvaluator

cv_scores = SFModelEvaluator(model = gam).cv_evaluation(num_folds=5, validate_set=validation_set)
cv_scores

{0: 0.5805391396545401,
 1: 0.5912488741294926,
 2: 0.5889695609995358,
 3: 0.6007928337183918,
 4: 0.5979756981215532}

Compute the average performance of the model:

In [28]:
average_score = sum(cv_scores.values()) / len(cv_scores)
round(average_score, 3)

0.592
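The average alone hides how much the score varies from one fold to the next. A minimal sketch, using only the standard library and the fold scores reported above, also computes the sample standard deviation across folds; the variable name `score_spread` is just an illustrative choice.

```python
from statistics import mean, stdev

# Fold scores copied from the cross-validation output above
cv_scores = {
    0: 0.5805391396545401,
    1: 0.5912488741294926,
    2: 0.5889695609995358,
    3: 0.6007928337183918,
    4: 0.5979756981215532,
}

scores = list(cv_scores.values())
average_score = mean(scores)
# Sample standard deviation: a rough measure of fold-to-fold variability
score_spread = stdev(scores)

print(f"{average_score:.3f} +/- {score_spread:.3f}")
```

A small spread relative to the mean suggests that the score does not depend much on the particular split.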

### Make predictions on new data

In [29]:
new_data = validation_set.data[MY_FEATURES]

# predict probabilities (a float)
pred_probas = gam.predict_proba(new_data)

# predict outcome (True/False)
pred_outcomes = gam.predict(new_data)

pred_probas[:5], pred_outcomes[:5]

(array([0.01483678, 0.01483678, 0.01483678, 0.01483678, 0.01483678]),
 array([False, False, False, False, False]))
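To make the link between the two outputs concrete, the sketch below turns probabilities into boolean outcomes by thresholding. The 0.5 cutoff is an assumption made for illustration; it is not necessarily the rule applied by `SFModelGAM.predict`.

```python
import numpy as np

# Probabilities copied from the output above
pred_probas = np.array([0.01483678, 0.01483678, 0.01483678, 0.01483678, 0.01483678])

# Assumed cutoff, not necessarily the one used by SFModelGAM
THRESHOLD = 0.5

# An observation is flagged as positive when its probability reaches the cutoff
manual_outcomes = pred_probas >= THRESHOLD
```

With probabilities this low, every observation ends up on the `False` side of the cutoff, which matches the outcomes shown above.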

### Save the model

Work in progress :) In the meantime, you can use `pickle` to serialize most Python objects.
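As a minimal sketch of the `pickle` approach, the helpers below write an object to disk and read it back. `save_object` and `load_object` are hypothetical names, and a plain dictionary stands in for the trained model; whether a trained `SFModelGAM` pickles cleanly has not been verified here.

```python
import pickle
import tempfile
from pathlib import Path


def save_object(obj, path):
    """Serialize any picklable object to a file."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load an object previously written by save_object."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a trained model (replace with e.g. `gam` in the notebook)
stand_in = {
    "features": ["montant_part_ouvriere_past_1", "ratio_dette"],
    "target": "outcome",
}

model_path = Path(tempfile.gettempdir()) / "sf_model_demo.pkl"
save_object(stand_in, model_path)
restored = load_object(model_path)
```

Only unpickle files that come from a source you trust, since loading a pickle can execute arbitrary code.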