### ticdat: A Useful Shim Library on Top of pandas 

We're going to demonstrate how `ticdat` can be seen as a thin, useful library to enhance your use of `pandas`.

To run this notebook, first run `pip install gurobi_optimods`.

In [1]:
from gurobi_optimods import datasets
data = datasets.load_workforce()

The `data` object here has `DataFrame` objects as attributes. 

In [2]:
data.availability[:5]

Unnamed: 0,Worker,Shift,Preference
0,Siva,2023-05-02,2.0
1,Siva,2023-05-03,2.0
2,Siva,2023-05-05,5.0
3,Siva,2023-05-07,3.0
4,Siva,2023-05-09,2.0


In [3]:
data.shift_requirements[:5]

Unnamed: 0,Shift,Required
0,2023-05-01,3
1,2023-05-02,2
2,2023-05-03,4
3,2023-05-04,2
4,2023-05-05,5


In [4]:
data.worker_limits

Unnamed: 0,Worker,MinShifts,MaxShifts
0,Siva,6,8
1,Ziqiang,6,7
2,Matsumi,6,8
3,Femke,5,8
4,Vincent,6,8
5,Marisa,5,8
6,Pauline,6,8


We can pass these `DataFrame` objects to the `solve_workforce_scheduling` function to solve the workforce scheduling problem demonstrated by `gurobi_optimods.workforce`.

In [5]:
from gurobi_optimods.workforce import solve_workforce_scheduling
assigned_shifts = solve_workforce_scheduling(
    availability=data.availability,
    shift_requirements=data.shift_requirements,
    worker_limits=data.worker_limits,
    preferences="Preference",
)
assigned_shifts[:5]

Set parameter WLSAccessID
Set parameter WLSSecret
Set parameter LicenseID to value 945452
WLS license 945452 - registered to Decision Spot
Gurobi Optimizer version 12.0.0 build v12.0.0rc1 (mac64[arm] - Darwin 22.6.0 22H420)

CPU model: Apple M2 Max
Thread count: 12 physical cores, 12 logical processors, using up to 12 threads

WLS license 945452 - registered to Decision Spot
Optimize a model with 28 rows, 72 columns and 216 nonzeros
Model fingerprint: 0xf3d4e6ad
Variable types: 0 continuous, 72 integer (72 binary)
Coefficient statistics:
  Matrix range     [1e+00, 1e+00]
  Objective range  [1e+00, 5e+00]
  Bounds range     [1e+00, 1e+00]
  RHS range        [2e+00, 8e+00]
Found heuristic solution: objective 170.0000000
Presolve removed 6 rows and 22 columns
Presolve time: 0.00s
Presolved: 22 rows, 50 columns, 145 nonzeros
Variable types: 0 continuous, 50 integer (50 binary)

Root relaxation: objective 1.850000e+02, 23 iterations, 0.00 seconds (0.00 work units)

    Nodes    |    Current

Unnamed: 0,Worker,Shift,Preference
0,Siva,2023-05-03,2.0
1,Siva,2023-05-05,5.0
2,Siva,2023-05-07,3.0
3,Siva,2023-05-10,4.0
4,Siva,2023-05-11,5.0


The problem is that `solve_workforce_scheduling` is brittle. If you introduce a minor flaw into one of the arguments, the function crashes ungracefully. The docstring doesn't even mention this risk.

In [6]:
# Work on a deep copy so the original data stays intact
worker_limits_dup = data.worker_limits.copy(deep=True)
# Append a second, conflicting row for "Vincent"
worker_limits_dup.loc[len(worker_limits_dup)] = ['Vincent', 3, 5]
worker_limits_dup

Unnamed: 0,Worker,MinShifts,MaxShifts
0,Siva,6,8
1,Ziqiang,6,7
2,Matsumi,6,8
3,Femke,5,8
4,Vincent,6,8
5,Marisa,5,8
6,Pauline,6,8
7,Vincent,3,5


In [7]:
solve_workforce_scheduling(
    availability=data.availability,
    shift_requirements=data.shift_requirements,
    worker_limits=worker_limits_dup,
    preferences="Preference",
)

Set parameter WLSAccessID
Set parameter WLSSecret
Set parameter LicenseID to value 945452
WLS license 945452 - registered to Decision Spot


KeyError: 'series must be aligned'

This appears to be unintended behavior; in other words, a bug. We did pass a `worker_limits` table with two "Vincent" rows, but `solve_workforce_scheduling` should respond with a clearer error message. As it stands, `KeyError: 'series must be aligned'` gives no direct insight into the real issue (duplicate rows).
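To make "a clearer error message" concrete, here is a minimal sketch of a guard that could run before the solver is called. The `require_unique` helper is hypothetical (it is not part of `gurobi_optimods` or `ticdat`), and it assumes that `Worker` is meant to identify each row of `worker_limits` uniquely. The table is rebuilt by hand from the output above so the sketch runs without Gurobi.

```python
import pandas as pd


def require_unique(df: pd.DataFrame, key: str, table_name: str) -> None:
    """Raise a descriptive ValueError if `key` has repeated values in `df`.

    Hypothetical helper for illustration only.
    """
    # keep=False flags every row that shares a key value, not just the repeats
    dup_rows = df[df.duplicated(subset=[key], keep=False)]
    if not dup_rows.empty:
        repeated = sorted(dup_rows[key].unique())
        raise ValueError(
            f"{table_name} has duplicate {key} values {repeated}; "
            f"offending row labels: {dup_rows.index.tolist()}"
        )


# Same rows as worker_limits_dup above, rebuilt so the sketch is self-contained
worker_limits_dup = pd.DataFrame(
    {
        "Worker": ["Siva", "Ziqiang", "Matsumi", "Femke",
                   "Vincent", "Marisa", "Pauline", "Vincent"],
        "MinShifts": [6, 6, 6, 5, 6, 5, 6, 3],
        "MaxShifts": [8, 7, 8, 8, 8, 8, 8, 5],
    }
)

try:
    require_unique(worker_limits_dup, key="Worker", table_name="worker_limits")
    message = None
except ValueError as err:
    message = str(err)

print(message)
```

The message names the table, the repeated worker, and both offending row labels, which is the information the `KeyError` leaves out.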

By contrast, consider how the [netflow_pd.py](https://github.com/ticdat/ticdat/blob/master/examples/gurobipy/netflow/netflow_pd.py) `ticdat` example handles a similar problem. 

In [8]:
import netflow_pd
dat = netflow_pd.input_schema.csv.create_pan_dat("netflow_sample_data")

The `dat` object here has `DataFrame` attributes, similar to `data`. Let's create a similar data integrity problem and see how `netflow_pd` handles it.

In [9]:
dat.commodities.loc[len(dat.commodities)] = ['Pencils', 1.3]
dat.commodities

Unnamed: 0,Name,Volume
0,Pencils,0.5
1,Pens,0.2125
2,Pencils,1.3


In [10]:
netflow_pd.solve(dat)

AssertionError: 

We get an error here too, but a much clearer one. The problem is that `input_schema.find_duplicates` found something. The natural next step is to see what it found.

In [11]:
netflow_pd.input_schema.find_duplicates(dat)

{'commodities':       Name  Volume
 2  Pencils     1.3}

Let's look at this a bit more closely.

In [12]:
netflow_pd.input_schema.find_duplicates(dat)["commodities"]

Unnamed: 0,Name,Volume
2,Pencils,1.3


`ticdat` identified the problem for us quite nicely. There is a second row in the commodities table for "Pencils". 
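For comparison, a check of this kind needs very little `pandas`. The sketch below is not `ticdat`'s implementation. `find_duplicate_rows` is a hypothetical helper, and treating `Name` as the primary key of `commodities` is an assumption based on the output above. It returns a dictionary shaped like the one we just saw: table name mapped to the rows that repeat an earlier key.

```python
import pandas as pd


def find_duplicate_rows(tables, primary_keys):
    """Return {table_name: rows repeating an earlier primary key}.

    Hypothetical helper; tables without duplicates are omitted.
    """
    found = {}
    for name, df in tables.items():
        # duplicated() marks every occurrence after the first by default
        mask = df.duplicated(subset=primary_keys[name])
        if mask.any():
            found[name] = df[mask]
    return found


# The commodities table from above, rebuilt by hand
commodities = pd.DataFrame(
    {"Name": ["Pencils", "Pens", "Pencils"], "Volume": [0.5, 0.2125, 1.3]}
)

duplicates = find_duplicate_rows(
    {"commodities": commodities},
    {"commodities": ["Name"]},  # assumed primary key
)
print(duplicates)
```

An empty dictionary means no duplicates were found, so `assert not find_duplicate_rows(...)` works as a precondition at the top of a solve function.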

Of course, `ticdat` here just functioned as a thin library on top of `pandas`. The `find_duplicates` routine is actually implemented by calling `DataFrame.duplicated`. If you'd rather make that call directly instead of using `ticdat`, feel free. The point is this: a brittle subroutine is a buggy subroutine. Don't assume that the magic data fairy is going to pass you perfect `DataFrame` objects. Validate any assumptions you make about your subroutine arguments before running the optimization logic.
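Duplicates are only one assumption worth validating. As a rough sketch, the hypothetical `check_workforce_inputs` function below collects several problems at once rather than stopping at the first. The three rules it checks (one limits row per worker, every available worker has limits, `MinShifts` no greater than `MaxShifts`) are our guesses at what a workforce model would rely on, not documented requirements of `gurobi_optimods`. The tables are small and made up for illustration.

```python
import pandas as pd


def check_workforce_inputs(availability, worker_limits):
    """Collect human-readable problems instead of stopping at the first one.

    Hypothetical helper; the rules below are assumed, not documented.
    """
    problems = []
    # Each worker should have exactly one limits row
    dup = worker_limits.loc[worker_limits["Worker"].duplicated(), "Worker"]
    if not dup.empty:
        problems.append(f"worker_limits repeats workers: {sorted(set(dup))}")
    # Every worker with availability should also have a limits row
    unknown = set(availability["Worker"]) - set(worker_limits["Worker"])
    if unknown:
        problems.append(f"availability names workers without limits: {sorted(unknown)}")
    # A minimum above the maximum can never be satisfied
    bad = worker_limits[worker_limits["MinShifts"] > worker_limits["MaxShifts"]]
    if not bad.empty:
        problems.append(f"MinShifts exceeds MaxShifts for: {sorted(set(bad['Worker']))}")
    return problems


# Made-up tables containing one instance of each problem
availability = pd.DataFrame(
    {
        "Worker": ["Siva", "Femke", "Ravi"],
        "Shift": pd.to_datetime(["2023-05-02", "2023-05-02", "2023-05-03"]),
        "Preference": [2.0, 3.0, 1.0],
    }
)
worker_limits = pd.DataFrame(
    {"Worker": ["Siva", "Femke", "Femke"], "MinShifts": [6, 5, 9], "MaxShifts": [8, 8, 8]}
)

problems = check_workforce_inputs(availability, worker_limits)
for problem in problems:
    print(problem)
```

Returning a list rather than raising immediately lets the caller report every issue in one pass, which matters when the input tables come from someone else.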