## Using ticdat to build modular engines

The goal of the `ticdat` package is to facilitate solve engines that are modular and robust. For example, the multicommodity `netflow.py` engine can read from and write to a variety of file types when run from the command line. It can also be run from a Python script that contains embedded static data, or from a script that reads from and writes to a system-of-record data source such as an ERP system.

With regard to the latter, note that Python is one of the most popular "glue" [languages](https://en.wikipedia.org/wiki/Scripting_language#Glue_languages). The market has recognized that Python scripts are easy to write, manage data with intuitive programming syntax, and can be connected to nearly any data source.

The `ticdat` package can easily be used in any Python glue script. One way to do this is to exploit `ticdat`'s ability to recognize data tables as list-of-lists. The inner lists contain data values in the field order defined by the `TicDatFactory` (i.e. `netflow.input_schema`).

For example, suppose the `netflow` engine needs to connect to an Oracle database for a daily automated solve. The integration engineer can use the `cx_Oracle` [package](https://oracle.github.io/python-cx_Oracle/) (or something equivalent) to turn system data into a list-of-lists for each input table. These data structures can then be used to create a `TicDat` object that can be passed as input data to `netflow.solve`. The solution `TicDat` object returned by `netflow.solve` can then be converted back into a list-of-lists representation of each solution report table. (The list-of-lists strategy is just one approach. It might make sense to convert system-of-record data into `pandas.DataFrame` objects, and then use these `DataFrame`s to build the `TicDat` object.)
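As a minimal sketch of the glue step described above, the snippet below uses the standard-library `sqlite3` module as a stand-in for an Oracle connection, since both expose the usual Python DB-API cursor pattern. The table name `commodities_tbl`, its column names, and the helper `table_as_lists` are hypothetical; a real integration would substitute its own connection and queries.

```python
import sqlite3

def table_as_lists(connection, query):
    """Run a query and return the result as a list-of-lists (one inner list per row)."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        return [list(row) for row in cursor.fetchall()]
    finally:
        cursor.close()

# Stand-in for the system of record: an in-memory SQLite database (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commodities_tbl (name TEXT PRIMARY KEY, volume REAL)")
conn.executemany("INSERT INTO commodities_tbl VALUES (?, ?)",
                 [("Pencils", 0.5), ("Pens", 0.2125)])
conn.commit()

# The column order in the SELECT must match the field order of the schema.
commodities_from_db = table_as_lists(
    conn, "SELECT name, volume FROM commodities_tbl ORDER BY name")
print(commodities_from_db)
conn.close()
```

The resulting list has the same shape as a hand-written list-of-lists table, so the rest of the script does not need to know where the data came from.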

We demonstrate this approach without explicit references to `cx_Oracle`. By demonstrating that `ticdat` is compatible with list-of-lists and `DataFrame` table representations, we show that `ticdat` is compatible with any data source that can be connected to Python, and also with human-readable static data.

In [1]:
commodities = [['Pencils', 0.5], ['Pens', 0.2125]]

# a one column table can just be a simple list 
nodes = ['Boston', 'Denver', 'Detroit', 'New York',  'Seattle']

cost = [['Pencils', 'Denver', 'Seattle', 30.0],
        ['Pens', 'Denver', 'Seattle', 30.0],
        ['Pencils', 'Detroit', 'New York', 20.0],
        ['Pens', 'Detroit', 'New York', 20.0],
        ['Pens', 'Detroit', 'Boston', 20.0],
        ['Pencils', 'Detroit', 'Boston', 10.0],
        ['Pens', 'Detroit', 'Seattle', 80.0],
        ['Pens', 'Denver', 'New York', 70.0],
        ['Pens', 'Denver', 'Boston', 60.0],
        ['Pencils', 'Denver', 'New York', 40.0],
        ['Pencils', 'Detroit', 'Seattle', 60.0],
        ['Pencils', 'Denver', 'Boston', 40.0]]

inflow = [['Pens', 'Boston', -40.0],
          ['Pens', 'New York', -30.0],
          ['Pencils', 'Detroit', 50.0],
          ['Pencils', 'Seattle', -10.0],
          ['Pencils', 'Denver', 60.0],
          ['Pens', 'Detroit', 60.0],
          ['Pens', 'Seattle', -30.0],
          ['Pens', 'Denver', 40.0],
          ['Pencils', 'New York', -50.0],
          ['Pencils', 'Boston', -50.0]]
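Because the inner lists are positional, a glue script may want to confirm that each row has the expected width before handing it to the engine. The sketch below pairs values with field names for two tables whose field names appear in the table outputs of this notebook. The helper `rows_as_records` and the `field_order` dictionary are hypothetical and are not part of `ticdat`.

```python
# Field orders as they appear in the DataFrame outputs of this notebook.
field_order = {
    "commodities": ["Name", "Volume"],
    "cost": ["Commodity", "Source", "Destination", "Cost"],
}

def rows_as_records(table_name, rows):
    """Pair each positional value with its field name, rejecting rows of the wrong width."""
    fields = field_order[table_name]
    records = []
    for row in rows:
        if len(row) != len(fields):
            raise ValueError(
                f"{table_name}: expected {len(fields)} values, got {row!r}")
        records.append(dict(zip(fields, row)))
    return records

sample_cost = [['Pencils', 'Denver', 'Seattle', 30.0],
               ['Pens', 'Detroit', 'Boston', 20.0]]
cost_records = rows_as_records("cost", sample_cost)
print(cost_records[0])
```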

An integration engineer might prefer to copy system-of-record data into `pandas.DataFrame` objects. Note that `pandas` is itself [capable](https://stackoverflow.com/questions/35781580/cx-oracle-import-data-from-oracle-to-pandas-dataframe) of reading directly from various SQL databases, although it usually needs a supporting package like `cx_Oracle`.

In [2]:
from pandas import DataFrame
arcs = DataFrame({"Source": ["Denver", "Denver", "Denver", "Detroit", "Detroit", "Detroit",], 
                 "Destination": ["Boston", "New York", "Seattle", "Boston", "New York", 
                                 "Seattle"], 
                 "Capacity": [120, 120, 120, 100, 80, 120]})
arcs

| | Source | Destination | Capacity |
|---|---|---|---|
| 0 | Denver | Boston | 120 |
| 1 | Denver | New York | 120 |
| 2 | Denver | Seattle | 120 |
| 3 | Detroit | Boston | 100 |
| 4 | Detroit | New York | 80 |
| 5 | Detroit | Seattle | 120 |
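The remark that `pandas` can read directly from SQL databases can be sketched as follows. This is only an illustration: it uses an in-memory `sqlite3` database in place of Oracle, and the table `arcs_tbl` and its column names are hypothetical. The query aliases the columns so that the resulting `DataFrame` carries the field names used by the `arcs` table above.

```python
import sqlite3
import pandas as pd

# Stand-in for the system of record (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE arcs_tbl (source TEXT, destination TEXT, capacity REAL)")
conn.executemany("INSERT INTO arcs_tbl VALUES (?, ?, ?)",
                 [("Denver", "Boston", 120), ("Detroit", "New York", 80)])
conn.commit()

# Alias the columns so they match the field names of the arcs table.
arcs_from_db = pd.read_sql_query(
    'SELECT source AS "Source", destination AS "Destination", '
    'capacity AS "Capacity" FROM arcs_tbl',
    conn)
conn.close()
print(list(arcs_from_db.columns))
```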


Next we create a `TicDat` input data object from the list-of-lists/`DataFrame` representations.

In [3]:
import netflow
dat = netflow.input_schema.TicDat(commodities=commodities, nodes=nodes, cost=cost, arcs=arcs, 
                          inflow=inflow)

We now create a `TicDat` solution data object by calling `solve`.

In [4]:
sln = netflow.solve(dat)

Using license file /Users/petercacioppi/gurobi.lic
Gurobi Optimizer version 9.1.2 build v9.1.2rc0 (mac64)
Thread count: 4 physical cores, 4 logical processors, using up to 4 threads
Optimize a model with 16 rows, 12 columns and 36 nonzeros
Model fingerprint: 0x6b98959e
Coefficient statistics:
  Matrix range     [2e-01, 1e+00]
  Objective range  [1e+01, 8e+01]
  Bounds range     [0e+00, 0e+00]
  RHS range        [1e+01, 1e+02]
Presolve removed 16 rows and 12 columns
Presolve time: 0.01s
Presolve: All rows and columns removed
Iteration    Objective       Primal Inf.    Dual Inf.      Time
       0    5.5000000e+03   0.000000e+00   2.000000e+01      0s
Extra simplex iterations after uncrush: 1
       1    5.5000000e+03   0.000000e+00   0.000000e+00      0s

Solved in 1 iterations and 0.02 seconds
Optimal objective  5.500000000e+03


In [5]:
sln

td: {flow: 7, parameters: 1}

In [6]:
sln.flow

{('Pencils', 'Denver', 'Seattle'): _td:{'Quantity': 10.0},
 ('Pens', 'Denver', 'Seattle'): _td:{'Quantity': 30.0},
 ('Pens', 'Detroit', 'New York'): _td:{'Quantity': 30.0},
 ('Pens', 'Detroit', 'Boston'): _td:{'Quantity': 30.0},
 ('Pencils', 'Detroit', 'Boston'): _td:{'Quantity': 50.0},
 ('Pens', 'Denver', 'Boston'): _td:{'Quantity': 10.0},
 ('Pencils', 'Denver', 'New York'): _td:{'Quantity': 50.0}}

We can then easily convert the `sln` object to represent the solution tables as `pandas.DataFrame` objects.

In [7]:
sln_to_pandas = netflow.solution_schema.copy_to_pandas(sln, reset_index=True)
sln_to_pandas

pd: {flow: 7, parameters: 1}

In [8]:
sln_to_pandas.parameters

| | Parameter | Value |
|---|---|---|
| 0 | Total Cost | 5500.0 |


In [9]:
sln_to_pandas.flow

| | Commodity | Source | Destination | Quantity |
|---|---|---|---|---|
| 0 | Pencils | Denver | Seattle | 10.0 |
| 1 | Pens | Denver | Seattle | 30.0 |
| 2 | Pens | Detroit | New York | 30.0 |
| 3 | Pens | Detroit | Boston | 30.0 |
| 4 | Pencils | Detroit | Boston | 50.0 |
| 5 | Pens | Denver | Boston | 10.0 |
| 6 | Pencils | Denver | New York | 50.0 |


Of course, converting from `DataFrame` to list-of-lists is trivial.

In [10]:
sln_lists = {}
for k in netflow.solution_schema.all_tables:
    df = getattr(sln_to_pandas, k)
    sln_lists[k] = [list(row) for row in df.itertuples(index=False)]

In [11]:
import pprint
for sln_table_name, sln_table_data in sln_lists.items():
    print("\n\n**\nSolution Table %s\n**"%sln_table_name)
    pprint.pprint(sln_table_data)



**
Solution Table parameters
**
[['Total Cost', 5500.0]]


**
Solution Table flow
**
[['Pencils', 'Denver', 'Seattle', 10.0],
 ['Pens', 'Denver', 'Seattle', 30.0],
 ['Pens', 'Detroit', 'New York', 30.0],
 ['Pens', 'Detroit', 'Boston', 30.0],
 ['Pencils', 'Detroit', 'Boston', 50.0],
 ['Pens', 'Denver', 'Boston', 10.0],
 ['Pencils', 'Denver', 'New York', 50.0]]
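With both the inputs and the solution available as list-of-lists, a glue script can run simple sanity checks without touching the engine. The sketch below recomputes the reported `Total Cost` from the `cost` input rows and the `flow` solution rows shown above, assuming the objective is simply quantity times unit cost summed over all flows. The variable names are illustrative only.

```python
# Cost rows are [Commodity, Source, Destination, Cost], copied from the input data.
cost_rows = [['Pencils', 'Denver', 'Seattle', 30.0],
             ['Pens', 'Denver', 'Seattle', 30.0],
             ['Pencils', 'Detroit', 'New York', 20.0],
             ['Pens', 'Detroit', 'New York', 20.0],
             ['Pens', 'Detroit', 'Boston', 20.0],
             ['Pencils', 'Detroit', 'Boston', 10.0],
             ['Pens', 'Detroit', 'Seattle', 80.0],
             ['Pens', 'Denver', 'New York', 70.0],
             ['Pens', 'Denver', 'Boston', 60.0],
             ['Pencils', 'Denver', 'New York', 40.0],
             ['Pencils', 'Detroit', 'Seattle', 60.0],
             ['Pencils', 'Denver', 'Boston', 40.0]]

# Flow rows are [Commodity, Source, Destination, Quantity], copied from the solution.
flow_rows = [['Pencils', 'Denver', 'Seattle', 10.0],
             ['Pens', 'Denver', 'Seattle', 30.0],
             ['Pens', 'Detroit', 'New York', 30.0],
             ['Pens', 'Detroit', 'Boston', 30.0],
             ['Pencils', 'Detroit', 'Boston', 50.0],
             ['Pens', 'Denver', 'Boston', 10.0],
             ['Pencils', 'Denver', 'New York', 50.0]]

# Index unit costs by (Commodity, Source, Destination) for direct lookup.
unit_cost = {tuple(row[:3]): row[3] for row in cost_rows}

# Sum quantity * unit cost over every flow row.
recomputed_total = sum(qty * unit_cost[c, s, d] for c, s, d, qty in flow_rows)
print(recomputed_total)  # 5500.0, matching the Total Cost parameter
```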


## Modularity with PanDatFactory
The `netflow` code uses `TicDatFactory` for `input_schema` and `solution_schema`. If you prefer organizing your `solve` code around `DataFrame` objects, then you will likely prefer `PanDatFactory`. This is demonstrated with the `netflow_pd` engine. 

Here, we run through the same steps using `netflow_pd`.

In [12]:
import netflow_pd
dat_pd = netflow_pd.input_schema.PanDat(commodities=commodities, nodes=nodes, cost=cost, arcs=arcs, 
                                     inflow=inflow)
dat_pd

pd: {arcs: 6, commodities: 2, cost: 12, inflow: 10, nodes: 5}

In [13]:
sln_pd = netflow_pd.solve(dat_pd)
sln_pd

Gurobi Optimizer version 9.1.2 build v9.1.2rc0 (mac64)
Thread count: 4 physical cores, 4 logical processors, using up to 4 threads
Optimize a model with 16 rows, 12 columns and 36 nonzeros
Model fingerprint: 0x6b98959e
Coefficient statistics:
  Matrix range     [2e-01, 1e+00]
  Objective range  [1e+01, 8e+01]
  Bounds range     [0e+00, 0e+00]
  RHS range        [1e+01, 1e+02]
Presolve removed 16 rows and 12 columns
Presolve time: 0.01s
Presolve: All rows and columns removed
Iteration    Objective       Primal Inf.    Dual Inf.      Time
       0    5.5000000e+03   0.000000e+00   2.000000e+01      0s
Extra simplex iterations after uncrush: 1
       1    5.5000000e+03   0.000000e+00   0.000000e+00      0s

Solved in 1 iterations and 0.01 seconds
Optimal objective  5.500000000e+03


pd: {flow: 7, parameters: 1}

In [14]:
sln_pd.parameters

| | Parameter | Value |
|---|---|---|
| 0 | Total Cost | 5500.0 |


In [15]:
sln_pd.flow

| | Commodity | Source | Destination | Quantity |
|---|---|---|---|---|
| 0 | Pencils | Denver | Seattle | 10.0 |
| 1 | Pens | Denver | Seattle | 30.0 |
| 2 | Pens | Detroit | New York | 30.0 |
| 3 | Pens | Detroit | Boston | 30.0 |
| 4 | Pencils | Detroit | Boston | 50.0 |
| 5 | Pens | Denver | Boston | 10.0 |
| 6 | Pencils | Denver | New York | 50.0 |


As you can see, `sln_pd` is the same collection of `DataFrame` results as `sln_to_pandas`, and thus converting into list-of-lists would similarly be trivial.

## Using ticdat to build robust engines

We have demonstrated how we can use `ticdat` to build modular engines. We now demonstrate how we can use `ticdat` to build engines that check `solve` pre-conditions, and are thus robust with respect to data integrity problems.

First, let's violate our (somewhat artificial) rule that the commodity volume must be positive.

In [16]:
dat.commodities["Pens"] = 0

The `input_schema` not only flags this problem but also gives us a useful data structure to examine.

In [17]:
netflow.input_schema.find_data_type_failures(dat)

{TableField(table='commodities', field='Volume'): ValuesPks(bad_values=(0,), pks=('Pens',))}
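To make the shape of this result concrete, here is a minimal pure-Python sketch of a positivity check that returns a similarly shaped dictionary. It is an illustration only and not `ticdat`'s implementation: the `TableField` and `ValuesPks` names are borrowed from the output above, and `find_nonpositive` is a hypothetical helper.

```python
from collections import namedtuple

# Names mirror the output above; this is an illustration, not ticdat's implementation.
TableField = namedtuple("TableField", ["table", "field"])
ValuesPks = namedtuple("ValuesPks", ["bad_values", "pks"])

def find_nonpositive(table_name, field_name, table):
    """table maps primary key -> field value; report values that are not > 0."""
    bad = [(pk, value) for pk, value in table.items() if not value > 0]
    if not bad:
        return {}
    pks, values = zip(*bad)
    return {TableField(table_name, field_name): ValuesPks(tuple(values), tuple(pks))}

volumes = {"Pencils": 0.5, "Pens": 0}
failures = find_nonpositive("commodities", "Volume", volumes)
print(failures)
```

Keying the result by table and field means that a caller can look up one specific rule violation directly, and an empty dictionary signals clean data.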

Next, let's add a `cost` record for a non-existent commodity and see how `input_schema` flags this problem.

In [18]:
dat.cost['Crayons', 'Detroit', 'Seattle'] = 10
netflow.input_schema.find_foreign_key_failures(dat, verbosity="Low")

{('cost', 'commodities', ('Commodity', 'Name')): (('Crayons',),
  (('Crayons', 'Detroit', 'Seattle'),))}
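The idea behind a foreign key check can be sketched in a few lines of plain Python: collect the child records whose reference has no match among the parent keys. The helper `find_missing_parents` below is hypothetical and only mimics the shape of the result above, a pair of missing parent values and offending child primary keys.

```python
def find_missing_parents(child_pks, parent_keys, position=0):
    """Return (missing parent values, child primary keys that reference them).

    position is the index of the referencing field within each child primary key.
    """
    parents = set(parent_keys)
    offending = tuple(pk for pk in child_pks if pk[position] not in parents)
    missing = tuple(sorted({pk[position] for pk in offending}))
    return missing, offending

cost_pks = [("Pencils", "Denver", "Seattle"),
            ("Pens", "Detroit", "Boston"),
            ("Crayons", "Detroit", "Seattle")]
commodity_names = ["Pencils", "Pens"]

missing, offending = find_missing_parents(cost_pks, commodity_names)
print(missing, offending)
```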

In real life, data integrity failures can typically be grouped into a small number of categories. However, the number of failures in each category might be quite large. `ticdat` creates data structures for each of these categories that can themselves be examined programmatically. As a result, an analyst can leverage the power of Python to detect patterns in the data integrity problems.
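As one example of examining such a structure programmatically, the sketch below counts how many failing records each missing value accounts for. The failure dictionary is hypothetical data shaped like the foreign key output above (the `Markers` records are invented for illustration), and the helper assumes the referencing field is the first element of the child primary key.

```python
from collections import Counter

# Hypothetical failure report shaped like the foreign key output above:
# key -> (missing parent values, offending child primary keys).
fk_failures = {
    ("cost", "commodities", ("Commodity", "Name")): (
        ("Crayons", "Markers"),
        (("Crayons", "Detroit", "Seattle"),
         ("Crayons", "Denver", "Boston"),
         ("Markers", "Denver", "Seattle")),
    )
}

def count_by_missing_value(failures, position=0):
    """For each failure category, count offending records per missing value."""
    summary = {}
    for key, (_missing_values, bad_pks) in failures.items():
        summary[key] = Counter(pk[position] for pk in bad_pks)
    return summary

summary = count_by_missing_value(fk_failures)
for key, counts in summary.items():
    print(key, counts.most_common())
```

A summary like this can show at a glance whether a large failure count stems from one missing master record or from many unrelated problems.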

## Robust engines with PanDatFactory
As before, we will demonstrate the same sort of integrity checks, except this time using a `solve` engine that is based on `DataFrame` input and thus uses `PanDatFactory`.

First, let's make a `DataFrame` copy of the `dat` object that has integrity violations.

In [19]:
dat_pd = netflow.input_schema.copy_to_pandas(dat, reset_index=True)

Now, let's check the two types of integrity failures. Bear in mind that `find_data_type_failures` and `find_foreign_key_failures` will now return dictionaries that themselves contain `DataFrame` objects, so we will pretty them up a bit for display.

In [20]:
data_type_fails = netflow_pd.input_schema.find_data_type_failures(dat_pd)
data_type_fails.keys()

dict_keys([TableField(table='commodities', field='Volume')])

In [21]:
data_type_fails['commodities', 'Volume']

| | Name | Volume |
|---|---|---|
| 1 | Pens | 0.0 |


In [22]:
fk_fails = netflow_pd.input_schema.find_foreign_key_failures(dat_pd, verbosity="Low")
fk_fails.keys()

dict_keys([('cost', 'commodities', ('Commodity', 'Name'))])

In [23]:
fk_fails['cost', 'commodities', ('Commodity', 'Name')]

| | Commodity | Source | Destination | Cost |
|---|---|---|---|---|
| 12 | Crayons | Detroit | Seattle | 10.0 |
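The failure tables above keep the original row index of each offending record. The same kind of result can be approximated with plain `pandas` boolean filtering, as in the sketch below. This is not how `PanDatFactory` works internally; the small `DataFrame` objects are abbreviated copies of the example data, and the variable names are illustrative.

```python
import pandas as pd

# Abbreviated copies of the example tables, including the two integrity violations.
commodities_df = pd.DataFrame({"Name": ["Pencils", "Pens"],
                               "Volume": [0.5, 0.0]})
cost_df = pd.DataFrame({"Commodity": ["Pencils", "Pens", "Crayons"],
                        "Source": ["Denver", "Detroit", "Detroit"],
                        "Destination": ["Seattle", "Boston", "Seattle"],
                        "Cost": [30.0, 20.0, 10.0]})

# Rows that fail the positivity rule on Volume.
bad_volume = commodities_df[~(commodities_df["Volume"] > 0)]

# Cost rows whose Commodity has no match in commodities.Name.
orphan_cost = cost_df[~cost_df["Commodity"].isin(commodities_df["Name"])]

print(bad_volume)
print(orphan_cost)
```

Because boolean filtering preserves the index, each failing row can be traced back to its position in the source table.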
