# Inference Notes
-----

## What is with this inference parameter?

In the previous notebook, we started creating features using the `EnginePandas`. Its `df_from_csv` call takes a parameter **inference**, which can be *True* or *False*. Let's go through a couple of examples and see how to use that parameter.

Broadly speaking, a machine learning model can be used in two modes.
- **Training**, where the model is trained on a specific data set. This is the learning mode.
- **Inference**, a more operational mode, where the model is actually used to generate output.


## Requirements
Before running the experiment, make sure the `numpy`, `pandas` and `numba` packages are installed in your virtual environment:
```
> pip install numpy
> pip install pandas
> pip install numba
```
Also make sure that the notebook can find the `f3atur3s` and `eng1n3` packages.

## Preparation

Before creating features, we will have to import a couple of packages

In [1]:
import numpy as np
import pandas as pd
import math
import f3atur3s as ft
import eng1n3.pandas as en

And we define the **file** we will read from.

In [2]:
file = './data/intro_card.csv'

### Training Mode

Certain features have inference attributes: data that is collected during training. For instance, `FeatureOneHot` collects the names (one per unique value of the base feature) during training.

Note that not all features have inference attributes. For instance, a `FeatureSource` does not have any; it does not need to know anything about the underlying data to construct the feature.
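
To make the idea of an inference attribute concrete, here is a minimal sketch in plain Python. The class `ToyOneHot` and its `build` method are hypothetical and are not how `f3atur3s` or `eng1n3` implement things; only the attribute names `expand_names` and `inference_ready` are borrowed from this notebook.

```python
class ToyOneHot:
    """Hypothetical stand-in for a feature with one inference attribute."""

    def __init__(self, base_name):
        self.base_name = base_name
        # Unknown until the feature has seen training data.
        self.expand_names = None

    @property
    def inference_ready(self):
        return self.expand_names is not None

    def build(self, values, inference):
        if not inference:
            # Training mode: learn the column names from the data.
            unique = sorted(set(values))
            self.expand_names = [f"{self.base_name}__{v}" for v in unique]
        elif not self.inference_ready:
            raise ValueError("inference mode needs a feature that has seen training data")
        # Both modes: one row per value, one column per learned name.
        return [
            [int(f"{self.base_name}__{v}" == name) for name in self.expand_names]
            for v in values
        ]


toy = ToyOneHot("Country")
before = toy.inference_ready                                  # nothing learned yet
train_rows = toy.build(["DE", "GB", "DE", "FR"], inference=False)
infer_rows = toy.build(["DE"], inference=True)                # reuses learned names
```

In this sketch, a feature that stores nothing about the data (the analogue of a `FeatureSource`) would simply have no attribute like `expand_names` to fill in.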

In [3]:
card = ft.FeatureSource('Card', ft.FEATURE_TYPE_STRING)
merchant = ft.FeatureSource('Merchant', ft.FEATURE_TYPE_STRING)
amount = ft.FeatureSource('Amount', ft.FEATURE_TYPE_FLOAT)
date = ft.FeatureSource('Date', ft.FEATURE_TYPE_DATE, format_code='%Y%m%d')
mcc = ft.FeatureSource('MCC', ft.FEATURE_TYPE_CATEGORICAL, default='0000')
country = ft.FeatureSource('Country', ft.FEATURE_TYPE_CATEGORICAL)

# The one-hot encoding of Country
country_oh = ft.FeatureOneHot('Country_OH', ft.FEATURE_TYPE_INT_8, country)

td = ft.TensorDefinition('Features', [date, card, merchant, amount, country_oh])

Right after creation, the inference attributes are `None`. They are unknown because the feature has not seen any data yet.

In [4]:
country_oh.expand_names is None

True

This is reflected in the `inference_ready` property. Both the Feature and the TensorDefinition know they have not seen any data.

In [5]:
country_oh.inference_ready, td.inference_ready

(False, False)

Let's read our file again and see what happens. The `inference` flag is *False*, which tells the engine that we are in **training mode**.

In [6]:
with en.EnginePandas(num_threads=1) as e:
    df = e.df_from_csv(td, file, inference=False)
    
df

2023-03-17 14:29:09.115 eng1n3.common.engine           INFO     Start Engine...
2023-03-17 14:29:09.117 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-03-17 14:29:09.117 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5
2023-03-17 14:29:09.117 eng1n3.pandas.pandasengine     INFO     Building Panda for : Features from file ./data/intro_card.csv
2023-03-17 14:29:09.158 eng1n3.pandas.pandasengine     INFO     Reshaping DataFrame to: Features


Unnamed: 0,Date,Card,Merchant,Amount,Country__DE,Country__FR,Country__GB
0,2020-01-01,CARD-1,MRC-1,1.0,1,0,0
1,2020-01-02,CARD-2,MRC-2,2.0,0,0,1
2,2020-01-03,CARD-1,MRC-3,3.0,1,0,0
3,2020-01-04,CARD-1,MRC-3,4.0,0,1,0
4,2020-01-04,CARD-2,MRC-2,5.0,0,0,1
5,2020-01-06,CARD-2,MRC-4,6.0,1,0,0


When we check the expand_names attribute and the inference_ready flags, we see things have changed. The `FeatureOneHot` now knows it has seen 3 unique values for the *Country* feature and that it will have to build out 3 columns.

In [7]:
country_oh.expand_names

['Country__DE', 'Country__FR', 'Country__GB']

In [8]:
country_oh.inference_ready, td.inference_ready

(True, True)

Let's create a copy of the original file containing only the first two lines, i.e. the header and one data row.

In [9]:
first_lines = './data/2lines_intro_card.csv'
!head -2 $file > $first_lines
!cat $first_lines

Date,Amount,Card,Merchant,MCC,Country,Fraud
20200101,1.0,CARD-1,MRC-1,0001,DE,0


Now imagine we want to use this second file as a test file or in some sort of production environment. This is what would happen if we use a **new** `TensorDefinition` and a **new** feature and do **not** run in inference mode.

In [10]:
country_oh_new = ft.FeatureOneHot('Country_OH', ft.FEATURE_TYPE_INT_8, country)
td_new = ft.TensorDefinition('Features', [date, card, merchant, amount, country_oh_new])
with en.EnginePandas(num_threads=1) as e:
    df = e.df_from_csv(td_new, first_lines, inference=False)
    
df

2023-03-17 14:29:14.894 eng1n3.common.engine           INFO     Start Engine...
2023-03-17 14:29:14.895 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-03-17 14:29:14.895 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5
2023-03-17 14:29:14.896 eng1n3.pandas.pandasengine     INFO     Building Panda for : Features from file ./data/2lines_intro_card.csv
2023-03-17 14:29:14.903 eng1n3.pandas.pandasengine     INFO     Reshaping DataFrame to: Features


Unnamed: 0,Date,Card,Merchant,Amount,Country__DE
0,2020-01-01,CARD-1,MRC-1,1.0,1


This is somewhat problematic. We would have trained a model that has seen 3 'Country' columns, and in this file we only have one. That is because the file only contains one unique value for the 'Country' feature.

This is bound to create problems. You cannot give a model a file with one layout during training and then a file with a different layout during test or in production.
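
The same effect can be reproduced outside the engine. The sketch below uses plain `pandas.get_dummies` on two small, made-up frames that mimic the 'Country' column of the two files; it illustrates the layout mismatch and is not what `EnginePandas` does internally.

```python
import pandas as pd

# Made-up stand-ins for the 'Country' column of the full file and the 2-line file.
train = pd.DataFrame({"Country": ["DE", "GB", "DE", "FR", "GB", "DE"]})
test = pd.DataFrame({"Country": ["DE"]})

# One-hot encode each frame independently, i.e. without any memory of training.
train_oh = pd.get_dummies(train["Country"], prefix="Country", prefix_sep="__")
test_oh = pd.get_dummies(test["Country"], prefix="Country", prefix_sep="__")

print(list(train_oh.columns))
print(list(test_oh.columns))
```

Encoding each file on its own yields three columns for the first frame and a single column for the second, so a model trained on the first layout cannot consume the second.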

### Inference Mode
This is why we have inference mode. If we read the second file again, but this time with the original `TensorDefinition` and with the `inference` flag set to *True*, we get a different result. We are now back to having 3 'Country' columns.

The `FeatureOneHot` remembered that it saw 3 values during training (its inference attributes) and it built the columns for us, even though two of the values are not in the data.

In [11]:
with en.EnginePandas(num_threads=1) as e:
    df = e.df_from_csv(td, first_lines, inference=True)
    
df

2023-03-17 14:29:17.547 eng1n3.common.engine           INFO     Start Engine...
2023-03-17 14:29:17.548 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-03-17 14:29:17.548 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5
2023-03-17 14:29:17.549 eng1n3.pandas.pandasengine     INFO     Building Panda for : Features from file ./data/2lines_intro_card.csv
2023-03-17 14:29:17.554 eng1n3.pandas.pandasengine     INFO     Reshaping DataFrame to: Features


Unnamed: 0,Date,Card,Merchant,Amount,Country__DE,Country__FR,Country__GB
0,2020-01-01,CARD-1,MRC-1,1.0,1,0,0
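
As a rough analogy in plain pandas, the remembered names can be used to re-align a freshly encoded frame. This is only a sketch of the idea, assuming the names are available as a list; how the engine actually builds the missing columns is not shown in this notebook.

```python
import pandas as pd

# Names learned during training (copied from the `expand_names` output above).
expand_names = ["Country__DE", "Country__FR", "Country__GB"]

# Made-up stand-in for the 'Country' column of the 2-line file.
test = pd.DataFrame({"Country": ["DE"]})
test_oh = pd.get_dummies(test["Country"], prefix="Country", prefix_sep="__", dtype="int8")

# Force the training layout: missing columns are created and filled with 0.
aligned = test_oh.reindex(columns=expand_names, fill_value=0).astype("int8")
print(aligned)
```

The design point is that the column list comes from training, not from the file being read, so the layout stays fixed regardless of which values happen to appear.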


When we build test or production files, we should always build them with the `inference` flag set to *True*. That makes sure the test/production and training files are consistent with respect to the applied transformations. The following features have inference attributes:
- FeatureOneHot
- FeatureIndex
- FeatureBin
- FeatureNormalizeScale
- FeatureNormalizeStandard
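
The same reasoning applies to the scaling features in this list. The sketch below assumes a simple min/max style scaling and uses a hypothetical `ToyMinMaxScaler` class; the attributes that `FeatureNormalizeScale` really stores are not shown in this notebook, so treat the names and the formula as assumptions.

```python
class ToyMinMaxScaler:
    """Hypothetical scaler that remembers the training minimum and maximum."""

    def __init__(self):
        self.minimum = None
        self.maximum = None

    def build(self, values, inference):
        if not inference:
            # Training mode: collect the statistics from the data.
            self.minimum, self.maximum = min(values), max(values)
        elif self.minimum is None:
            raise ValueError("inference mode needs statistics collected during training")
        # Guard against a zero range (all values identical).
        span = (self.maximum - self.minimum) or 1.0
        return [(v - self.minimum) / span for v in values]


scaler = ToyMinMaxScaler()
train_scaled = scaler.build([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], inference=False)

# Inference mode: a single made-up amount is scaled against the training range.
test_scaled = scaler.build([3.0], inference=True)

# Without inference mode, a fresh scaler only sees one value and loses the range.
naive_scaled = ToyMinMaxScaler().build([3.0], inference=False)
```

Here the amount 3.0 maps to 0.4 when the training range 1.0 to 6.0 is reused, but to 0.0 when the scaler is re-fitted on the single value, which is the numeric counterpart of the missing one-hot columns.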

In order to keep the inference attributes over time, it is possible to save an entire `TensorDefinition`. That creates a directory that holds metadata on the `TensorDefinition` as well as metadata on each feature embedded in it. This is how we **save** a `TensorDefinition`:

### Saving and Loading

In [12]:
save_location = './data/oh_save'
ft.TensorDefinitionSaver.save(td, save_location)

This creates a directory structure in the save location. There is a JSON file describing the `TensorDefinition` and a directory named 'features' where all the features of the `TensorDefinition` are stored.

In [13]:
!ls -l $save_location

total 8
drwxrwxr-x 2 toms toms 4096 Mär 17 14:29 features
-rw-rw-r-- 1 toms toms  157 Mär 17 14:29 tensor.json


The specific metadata on the feature 'Country_OH' is saved in a JSON file. We can see that it has stored the column names it expects in the DataFrame (the 'expand_names' key).

In [14]:
country_oh_json = save_location+'/features/Country_OH.json'
!cat $country_oh_json

{
    "name": "Country_OH",
    "type": {
        "key": 8,
        "name": "INT_8",
        "root_type": {
            "key": 1,
            "name": "INTEGER"
        },
        "precision": 8
    },
    "embedded_features": [
        "Country"
    ],
    "base_feature": "Country",
    "expand_names": [
        "Country__DE",
        "Country__FR",
        "Country__GB"
    ],
    "class": "FeatureOneHot"
}
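
Because the file is plain JSON, it can also be inspected with Python's standard `json` module. The sketch below parses a trimmed copy of the content shown above from a string rather than from disk; it only reads the metadata and is no substitute for `TensorDefinitionLoader`.

```python
import json

# Trimmed copy of the saved feature file shown above (the "type" block is left out).
saved = """
{
    "name": "Country_OH",
    "embedded_features": ["Country"],
    "base_feature": "Country",
    "expand_names": ["Country__DE", "Country__FR", "Country__GB"],
    "class": "FeatureOneHot"
}
"""

meta = json.loads(saved)
expand_names = meta["expand_names"]

print(meta["class"], "expects", len(expand_names), "columns:", expand_names)
```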

We can reload the saved directory into a **new** `TensorDefinition`. Its name is the same as that of the original `TensorDefinition` and its `inference_ready` property is *True*.

In [15]:
td_reload = ft.TensorDefinitionLoader.load(save_location)
(td_reload.name, td_reload.inference_ready)

('Features', True)

The new `FeatureOneHot` remembers the inference attributes, so we can load our production/test file with the `inference` flag set to *True*.

In [16]:
oh_reload = td_reload.features[4]
(oh_reload.name, oh_reload.expand_names)

('Country_OH', ['Country__DE', 'Country__FR', 'Country__GB'])

In [17]:
with en.EnginePandas(num_threads=1) as e:
    df = e.df_from_csv(td_reload, first_lines, inference=True)
    
df

2023-03-17 14:29:30.293 eng1n3.common.engine           INFO     Start Engine...
2023-03-17 14:29:30.294 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-03-17 14:29:30.295 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5
2023-03-17 14:29:30.295 eng1n3.pandas.pandasengine     INFO     Building Panda for : Features from file ./data/2lines_intro_card.csv
2023-03-17 14:29:30.303 eng1n3.pandas.pandasengine     INFO     Reshaping DataFrame to: Features


Unnamed: 0,Merchant,Date,Card,Amount,Country__DE,Country__FR,Country__GB
0,MRC-1,2020-01-01,CARD-1,1.0,1,0,0


### Don't forget to clean-up

In [18]:
!rm $first_lines
!rm -rf $save_location

# Conclusion
Some features are not just plain built from other features or parameters, but need information from the data itself in order to be built. Those features store that information in inference attributes. 

It is very important for the consistency of the model input that training and test/production sets are created with the same inference attributes. Test/production data should always be built with the `inference` flag set to *True*, and only after the features have read a representative set of training data.

The `TensorDefinition` object and all its features, including the inference attributes, can be saved and re-loaded.