# Creating Input a Neural Net Model Can Actually Use

Up until this point in the notebooks we turned the file into a `Pandas DataFrame` object. This is very convenient for getting to know the data and running some basic analytics on it, but it is not something a Neural Net Model can use directly. We'll have to turn the `DataFrames` into `Numpy` arrays. Those can almost directly be used by a Neural Net, for instance by wrapping them in a PyTorch `DataLoader` object.

There are some differences between a DataFrame and a Numpy array. An important one is that a DataFrame can have a different data type for each column (feature). We made `DataFrames` with 'object' (string), 'float' and 'categorical' data types, all in one DataFrame.

That is not possible with `Numpy` arrays. A given Numpy array has a single dtype, which applies homogeneously across the entire array.
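A quick way to see the single-dtype rule in action is to hand NumPy a mixed list. The sketch below uses only standard NumPy calls; the values are made up for illustration.

```python
import numpy as np

# Integers and floats are promoted to one common numeric dtype.
numeric = np.array([1, 2.5])
print(numeric.dtype)

# Mixing numbers and strings turns every element into a string.
mixed = np.array([1, 2.5, 'card-1'])
print(mixed.dtype.kind)   # 'U' is NumPy's code for unicode strings
print(repr(mixed[0]))     # the integer 1 has become a string
```

This is why string, float and one-hot features cannot simply be stacked into one array without losing their numeric meaning.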

## Requirements
Before running the experiment, make sure the `numpy`, `pandas` and `numba` packages are installed in your virtual environment:
```
> pip install numpy
> pip install pandas
> pip install numba
```
Also make sure the notebook can find the `f3atur3s` and `eng1n3` packages.

## Preparation

Before creating features, we will have to import a couple of packages.

In [1]:
import numpy as np
import pandas as pd
import math
import f3atur3s as ft
import eng1n3.pandas as en

And we define the **file** we will read from.

In [2]:
file = './data/intro_card.csv'

In [3]:
date     = ft.FeatureSource('Date', ft.FEATURE_TYPE_DATE, format_code='%Y%m%d')
card     = ft.FeatureSource('Card', ft.FEATURE_TYPE_STRING)
merchant = ft.FeatureSource('Merchant', ft.FEATURE_TYPE_STRING)
amount   = ft.FeatureSource('Amount', ft.FEATURE_TYPE_FLOAT_32)
mcc      = ft.FeatureSource('MCC', ft.FEATURE_TYPE_CATEGORICAL, default='0000')
country  = ft.FeatureSource('Country', ft.FEATURE_TYPE_CATEGORICAL)
fraud    = ft.FeatureSource('Fraud', ft.FEATURE_TYPE_INT_8)

mcc_oh = ft.FeatureOneHot('MCC_OH', ft.FEATURE_TYPE_INT_8,  mcc)
country_oh = ft.FeatureOneHot('Country_OH', ft.FEATURE_TYPE_INT_8, country)
fraud_label = ft.FeatureLabelBinary('Fraud_Label', ft.FEATURE_TYPE_INT_8, fraud)

### Not being smart about it

When we define a feature, we provide a `FeatureType`. The FeatureType defines what the dtype of the Pandas DataFrame and Numpy array will be.

The `EnginePandas` has a method that builds a **Numpy** array instead of a **DataFrame**. It is named `np_from_csv`. We can try to use it to build all the features we created, like before, and see what happens.

In [4]:
td = ft.TensorDefinition('Features', [date, card, merchant, amount, mcc_oh, country_oh])

with en.EnginePandas(num_threads=1) as e:
    ti = e.np_from_csv(td, file, inference=False)

2023-04-18 20:59:33.253 eng1n3.common.engine           INFO     Start Engine...
2023-04-18 20:59:33.254 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-04-18 20:59:33.255 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5


EnginePandasException: Error creating source: Found more than one feature root type. ['STRING', 'FLOAT', 'INTEGER'] in TensorDefinition Features. This process can only handle features of the same root type, for instance only INTEGER or only FLOAT

### A bit smarter

We get an error, which is expected. The engine is telling us that it cannot build the *TensorDefinition* because the features are not all of the same **FeatureRootType**.

When we want to build a Numpy array, we need to split the features into several TensorDefinitions, where each *TensorDefinition* only contains features of a **single FeatureRootType**.
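The grouping idea can be sketched in plain Python. The feature names and root type labels below are illustrative assumptions modelled on the error message; this is not how the `f3atur3s` package stores or exposes root types.

```python
from collections import defaultdict

# Hypothetical (feature name, root type) pairs, for illustration only.
features = [
    ('Card', 'STRING'),
    ('Merchant', 'STRING'),
    ('Amount', 'FLOAT'),
    ('MCC_OH', 'INTEGER'),
    ('Country_OH', 'INTEGER'),
]

# Collect the features that share a root type into one group.
groups = defaultdict(list)
for name, root_type in features:
    groups[root_type].append(name)

# Each group could become its own TensorDefinition, and so its own array.
for root_type, names in groups.items():
    print(root_type, names)
```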

For instance, we can bundle `mcc_oh` and `country_oh`, since they are of the same type.

In [5]:
td = ft.TensorDefinition('Features', [mcc_oh, country_oh])

with en.EnginePandas(num_threads=1) as e:
    ti = e.np_from_csv(td, file, inference=False)
    
type(ti), ti.number_of_lists, type(ti.numpy_lists[0])

2023-04-18 20:59:35.788 eng1n3.common.engine           INFO     Start Engine...
2023-04-18 20:59:35.788 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-04-18 20:59:35.788 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5
2023-04-18 20:59:35.789 eng1n3.pandas.pandasengine     INFO     Building Panda for : All_r_1 from file ./data/intro_card.csv
2023-04-18 20:59:35.804 eng1n3.pandas.pandasengine     INFO     Reshaping DataFrame to: All_r_1
2023-04-18 20:59:35.804 eng1n3.pandas.pandasengine     INFO     Converting All_r_1 to 1 numpy arrays


(eng1n3.common.tensorinstance.TensorInstanceNumpy, 1, numpy.ndarray)

This looks better. There is no error, and we get back a `TensorInstanceNumpy` object, which can hold several Numpy arrays. Since we passed one TensorDefinition as input, we get a TensorInstanceNumpy with exactly one list.

We can have a look at the content of that list. It is a matrix (Rank-2 tensor) of size 6 x 7, i.e. 6 rows and 7 columns. The dtype is uint8 (a very small unsigned integer), and the array contains only zeros and ones.

In [6]:
ti.numpy_lists[0].shape, ti.numpy_lists[0]

((6, 7),
 array([[1, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 0, 1],
        [0, 0, 0, 1, 1, 0, 0]], dtype=uint8))

The content looks a bit cryptic, but we can make it more familiar by building a `DataFrame` with the same features. Observe that the Numpy array holds basically the same data as the DataFrame, just in a less visually attractive form.

In [7]:
with en.EnginePandas(num_threads=1) as e:
    df = e.df_from_csv(td, file, inference=False)
    
df

2023-04-18 20:59:38.868 eng1n3.common.engine           INFO     Start Engine...
2023-04-18 20:59:38.869 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-04-18 20:59:38.869 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5
2023-04-18 20:59:38.870 eng1n3.pandas.pandasengine     INFO     Building Panda for : Features from file ./data/intro_card.csv
2023-04-18 20:59:38.877 eng1n3.pandas.pandasengine     INFO     Reshaping DataFrame to: Features


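|   | MCC__0001 | MCC__0002 | MCC__0003 | MCC__0000 | Country__DE | Country__FR | Country__GB |
|---|-----------|-----------|-----------|-----------|-------------|-------------|-------------|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 5 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |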
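To make the link between the column names and the 6 x 7 matrix concrete, here is a minimal NumPy sketch of one-hot encoding. The raw `mcc` and `country` values are inferred from the output above, not read from the file, and the `one_hot` helper is hypothetical; it is not the encoder the engine uses.

```python
import numpy as np

def one_hot(values, categories):
    """Return one row per value with a single 1 in the matching column."""
    index = {category: i for i, category in enumerate(categories)}
    codes = np.array([index[value] for value in values])
    # Picking rows from an identity matrix yields the one-hot rows.
    return np.eye(len(categories), dtype=np.uint8)[codes]

# Values inferred from the DataFrame output above (assumption).
mcc = ['0001', '0002', '0003', '0003', '0002', '0000']
country = ['DE', 'GB', 'DE', 'FR', 'GB', 'DE']

encoded = np.hstack([
    one_hot(mcc, ['0001', '0002', '0003', '0000']),
    one_hot(country, ['DE', 'FR', 'GB']),
])
print(encoded.shape, encoded.dtype)
print(encoded)
```

Each row contains exactly two ones: one for the MCC and one for the country.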


### Multiple TensorDefinitions

In a real-life case we may not want to be restricted to a single data type, and we will want to provide more than just the OneHot features to a model. Luckily we can provide multiple TensorDefinitions to the `np_from_csv` call. Let's define a second `TensorDefinition` for the amount and a third `TensorDefinition` for the Fraud label and ask to build them all at once.

In [8]:
td_oh     = ft.TensorDefinition('Features_OH', [mcc_oh, country_oh])
td_amount = ft.TensorDefinition('Feature_Amount', [amount])
td_label  = ft.TensorDefinition('Feature_Fraud', [fraud_label])

with en.EnginePandas(num_threads=1) as e:
    ti = e.np_from_csv(
        (td_oh, td_amount, td_label),  # A Tuple of multiple TensorDefinitions
        file, 
        inference=False)
    
type(ti), ti.number_of_lists, type(ti.numpy_lists[0])

2023-04-18 20:59:41.427 eng1n3.common.engine           INFO     Start Engine...
2023-04-18 20:59:41.427 eng1n3.pandas.pandasengine     INFO     Pandas Version : 1.5.3
2023-04-18 20:59:41.427 eng1n3.pandas.pandasengine     INFO     Numpy Version : 1.23.5
2023-04-18 20:59:41.428 eng1n3.pandas.pandasengine     INFO     Building Panda for : All_r_1 from file ./data/intro_card.csv
2023-04-18 20:59:41.434 eng1n3.pandas.pandasengine     INFO     Reshaping DataFrame to: All_r_1
2023-04-18 20:59:41.435 eng1n3.pandas.pandasengine     INFO     Converting All_r_1 to 3 numpy arrays


(eng1n3.common.tensorinstance.TensorInstanceNumpy, 3, numpy.ndarray)

As we asked for 3 TensorDefinitions to be built, we get 3 Numpy arrays back in the TensorInstanceNumpy. The first list is the one we recognize from the previous examples, the OneHot encoded 'MCC' and 'Country' fields.

The second list is a 6x1 matrix containing the amount of each sample. We have 6 samples and 1 feature.

The third list is the 'Fraud' column. That is what we will use as the **label** when we run predictive models.

In [9]:
ti.numpy_lists[0].shape, ti.numpy_lists[0]

((6, 7),
 array([[1, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 0, 0],
        [0, 0, 1, 0, 0, 1, 0],
        [0, 1, 0, 0, 0, 0, 1],
        [0, 0, 0, 1, 1, 0, 0]], dtype=uint8))

In [10]:
ti.numpy_lists[1].shape, ti.numpy_lists[1]

((6, 1),
 array([[1.],
        [2.],
        [3.],
        [4.],
        [5.],
        [6.]], dtype=float32))

In [11]:
ti.numpy_lists[2].shape, ti.numpy_lists[2]

((6, 1),
 array([[0],
        [0],
        [1],
        [0],
        [0],
        [0]], dtype=int8))
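All three arrays share their first dimension, so row `i` of each array describes the same sample. That alignment is what lets a data loader cut them into mini-batches. Below is a framework-free sketch of the idea; `iterate_batches` is a hypothetical helper and the arrays are stand-ins with the same shapes and dtypes as the output above.

```python
import numpy as np

def iterate_batches(arrays, batch_size):
    """Yield tuples of row-aligned slices, one slice per array."""
    n_rows = arrays[0].shape[0]
    if any(a.shape[0] != n_rows for a in arrays):
        raise ValueError('All arrays must have the same number of rows')
    for start in range(0, n_rows, batch_size):
        yield tuple(a[start:start + batch_size] for a in arrays)

# Stand-ins for the three arrays shown above.
one_hot = np.zeros((6, 7), dtype=np.uint8)
amount = np.arange(1, 7, dtype=np.float32).reshape(6, 1)
label = np.array([[0], [0], [1], [0], [0], [0]], dtype=np.int8)

batches = list(iterate_batches((one_hot, amount, label), batch_size=4))
for oh, am, lb in batches:
    print(oh.shape, am.shape, lb.shape)
```

With 6 rows and a batch size of 4, the last batch is simply shorter than the first.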

### Label Data

The attentive reader will note that the third list, containing the label, holds small integers just like the first list (int8 versus uint8; both features were declared with `FEATURE_TYPE_INT_8`), and may wonder why we did not build them together.

Neural Nets generally need a target, the objective they are going to solve for. This is often referred to as the **label**, and it is generally fed to the model as a separate tensor. That is why we give it its own TensorDefinition and build it out separately. It is also useful for production: typically you will not provide the label in production, so it is best to keep it separate from the actual input data.

If and when we want to train a Neural Net Model, we'll have to tell it where it can find the label. The TensorInstanceNumpy understands that a TensorDefinition that only contains `FeatureLabel` classes is very likely to be the label we will want to use.

We can ask it which Numpy array(s) it thinks are the label(s) and use the result to index into the TensorInstance. That will come in handy later on, when we need to use the label correctly in the Neural Net Models.

In [12]:
ti.label_indexes

(2,)

In [13]:
ti.numpy_lists[ti.label_indexes[0]]

array([[0],
       [0],
       [1],
       [0],
       [0],
       [0]], dtype=int8)
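Knowing the label positions makes it straightforward to separate model inputs from targets. The sketch below uses plain tuples as stand-ins for `ti.numpy_lists` and `ti.label_indexes`, so it runs without the engine; the variable names `inputs` and `labels` are illustrative.

```python
import numpy as np

# Stand-ins for ti.numpy_lists and ti.label_indexes as shown above.
numpy_lists = (
    np.zeros((6, 7), dtype=np.uint8),
    np.arange(1, 7, dtype=np.float32).reshape(6, 1),
    np.array([[0], [0], [1], [0], [0], [0]], dtype=np.int8),
)
label_indexes = (2,)

# Everything that is not a label is model input.
inputs = [a for i, a in enumerate(numpy_lists) if i not in label_indexes]
labels = [numpy_lists[i] for i in label_indexes]

print(len(inputs), len(labels))
```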

### Split data
When you train a Neural Net, you'll always want to split your data into a training, validation and test set.
- The training set is the one used to train the model.
- The validation set is used to make sure the model does not overfit during training.
- The test set is used only at the very end of the modelling process to publish results. It should really only be used once.

Now that we have a `TensorInstance` object we can ask it to perform a sequential split. This is normally the best way to split data that has a time dimension. **Never randomly shuffle transactional data**, as that can create data leakage.

Make sure the data is **ordered** in the time dimension before applying the split. The split function itself does not order the data, it assumes the data *is* ordered.
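If the source file is not already sorted, one option is to order every array by the same permutation before splitting. The sketch below assumes a separate array of dates is available; the dates and values are made up, and sorting the file itself up front would work just as well.

```python
import numpy as np

# Made-up dates in YYYYMMDD form, deliberately out of order.
dates = np.array([20200103, 20200101, 20200102])
amounts = np.array([[3.0], [1.0], [2.0]], dtype=np.float32)
labels = np.array([[1], [0], [0]], dtype=np.int8)

# A stable sort keeps the original order of rows that share a date.
order = np.argsort(dates, kind='stable')

# Apply the same permutation to every array so the rows stay aligned.
amounts_sorted = amounts[order]
labels_sorted = labels[order]
print(amounts_sorted.ravel(), labels_sorted.ravel())
```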

The call to `split_sequential` below splits the data sequentially: the last 1 record becomes *test*, the 2 records before it become *validation*, and everything before that becomes *training*.

Note how the result we get has each of the Numpy arrays split in the same way.

In [14]:
train, val, test = ti.split_sequential(2,1)

In [15]:
train, val, test

(TensorInstance with shapes: ((3, 7), (3, 1), (3, 1)),
 TensorInstance with shapes: ((2, 7), (2, 1), (2, 1)),
 TensorInstance with shapes: ((1, 7), (1, 1), (1, 1)))

In [16]:
train.numpy_lists[0], train.numpy_lists[1], train.numpy_lists[2]

(array([[1, 0, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 1, 0, 0]], dtype=uint8),
 array([[1.],
        [2.],
        [3.]], dtype=float32),
 array([[0],
        [0],
        [1]], dtype=int8))
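The slicing behind such a split is simple enough to sketch by hand. The `sequential_split` function below is a hypothetical re-creation based on the behaviour described above, not the library's implementation, and it works on a plain tuple of arrays.

```python
import numpy as np

def sequential_split(arrays, val_count, test_count):
    """Split row-aligned arrays into train, validation and test by position."""
    n_rows = arrays[0].shape[0]
    if val_count + test_count > n_rows:
        raise ValueError('Not enough rows for the requested split')
    val_start = n_rows - val_count - test_count
    test_start = n_rows - test_count
    train = tuple(a[:val_start] for a in arrays)
    val = tuple(a[val_start:test_start] for a in arrays)
    test = tuple(a[test_start:] for a in arrays)
    return train, val, test

# Stand-ins for the amount and label arrays.
amount = np.arange(1, 7, dtype=np.float32).reshape(6, 1)
label = np.array([[0], [0], [1], [0], [0], [0]], dtype=np.int8)

train, val, test = sequential_split((amount, label), val_count=2, test_count=1)
print([a.shape for a in train], [a.shape for a in val], [a.shape for a in test])
```

Because no shuffling takes place, every validation and test row comes later in the file than every training row.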

# Conclusion

At this point we have created something that can almost readily be used in a training, validation and testing methodology for a Neural Net. Next up is creating and training models.