In [1]:
import sys
sys.path.append('..')
from panelctmc import panelctmc, panel_to_datalist

### Data Preprocessing with Numpy
The input panel data is assumed to be a numpy array whose elements all have data type `object`.
pandas produces such a numpy array automatically when a dataframe is converted, but when loading data with numpy's `loadtxt`, `dtype=object` needs to be specified explicitly.

In [2]:
import numpy as np
paneldata = np.loadtxt('../data/demo1.csv', delimiter=',', skiprows=1, dtype=object)

The raw input could also be read as strings with `dtype=str`, but numpy would then apply some automatic string and date conversion, which can be avoided by using `dtype=object`.
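As a small illustration of the loading step, the sketch below reads an in-memory CSV with `np.loadtxt` and `dtype=object`. The CSV text and its header names are made up for this example; the real file `../data/demo1.csv` may look different.

```python
import io
import numpy as np

# Hypothetical CSV content: a header row followed by (identifier, date, label) rows.
csv_text = """country,date,rating
Angola,2012-05-23,BB-
Angola,2011-05-24,BB-
Zambia,2012-03-01,B+
"""

# skiprows=1 drops the header; dtype=object keeps every field as a Python object.
sample = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1, dtype=object)

print(sample.shape, sample.dtype)
print(type(sample[0, 1]))
```

Printing the type of a single element is a quick way to confirm that the dates were left untouched.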

The array contains three columns.
`panelctmc` expects each column at a fixed column index:

* 0: The example identifier, e.g. a country
* 1: The date as string `'%Y-%m-%d'` or a `datetime.datetime` object
* 2: The label, i.e. a nominal value
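Because the column order matters, it can help to sanity-check an array before passing it on. The following is a minimal sketch; `check_panel_layout` is a hypothetical helper written for this notebook and is not part of `panelctmc`.

```python
from datetime import datetime
import numpy as np

def check_panel_layout(panel):
    """Return a list of (row index, problem) pairs; empty if nothing was found."""
    if panel.ndim != 2 or panel.shape[1] != 3:
        return [(-1, "expected a 2D array with exactly 3 columns")]
    problems = []
    for i, row in enumerate(panel):
        date = row[1]  # column 1 must hold the date
        if isinstance(date, datetime):
            continue
        try:
            datetime.strptime(date, '%Y-%m-%d')
        except (TypeError, ValueError):
            problems.append((i, "unparseable date: {!r}".format(date)))
    return problems

# Made-up rows: one string date, one datetime object, one badly formatted date.
demo = np.array([['Angola', '2012-05-23', 'BB-'],
                 ['Zambia', datetime(2012, 3, 1), 'B+'],
                 ['Zambia', '01.03.2011', 'B+']], dtype=object)

problems = check_panel_layout(demo)
print(problems)
```

Only the third row is reported, since its date does not match `'%Y-%m-%d'`.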

In [3]:
paneldata

array([['Abu Dhabi', '2007-07-02', 'AA'],
       ['Angola', '2012-05-23', 'BB-'],
       ['Angola', '2011-05-24', 'BB-'],
       ...,
       ['Vietnam', '2002-06-11', 'BB-'],
       ['Zambia', '2012-03-01', 'B+'],
       ['Zambia', '2011-03-02', 'B+']], dtype=object)

### Example 1

The labels (third column) are usually messy and often contain data entry errors.
You should first check which unique labels exist.

In [4]:
np.unique(paneldata[:, 2])

array(['-', 'A', 'A+', 'A-', 'AA', 'AA+', 'AA-', 'AAA', 'B', 'B+', 'B-',
       'BB', 'BB+', 'BB-', 'BBB', 'BBB+', 'BBB-', 'C', 'CC', 'CCC',
       'CCC+', 'CCC-', 'D', 'DD', 'DDD', 'RD', 'withdrawn'], dtype=object)
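Besides the unique labels themselves, their frequencies are often informative, because rare labels are typical candidates for data entry errors. A minimal sketch using `np.unique` with `return_counts=True` on a made-up label column:

```python
import numpy as np

# Made-up label column standing in for paneldata[:, 2].
labels = np.array(['AA', 'BB-', 'BB-', 'withdrawn', 'B+', 'B+', '-'], dtype=object)

# return_counts=True additionally returns how often each unique label occurs.
values, counts = np.unique(labels, return_counts=True)
label_counts = dict(zip(values.tolist(), counts.tolist()))

# List the labels from most to least frequent.
for label, n in sorted(label_counts.items(), key=lambda item: -item[1]):
    print("{:>10s}: {:d}".format(label, n))
```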

Now we need some domain knowledge to make sense of these labels.
They are credit ratings for sovereign bonds assigned by the credit rating agency (CRA) Fitch.

We will group these labels as follows:

* `AAA` -- supposedly the best credit quality
* `AA` and all modifications (notches)
* `A` and all modifications
* `BBB` and all modifications; this is the lowest "Investment Grade" rating
* `BB` and all modifications
* `B` and all modifications
* all `C` ratings
* all `D` ratings

Everything else (e.g. `-` and `withdrawn`) is ignored for now.
`panelctmc` will automatically create a state for missing values.

In [5]:
mapping = [['AAA'], ['AA+', 'AA', 'AA-'], ['A+', 'A', 'A-'], 
          ['BBB+', 'BBB', 'BBB-'], ['BB+', 'BB', 'BB-'], 
          ['B+', 'B', 'B-'], ['CCC+', 'CCC', 'CCC-', 'CC', 'C'], 
          ['DDD', 'DD', 'D', 'RD']]
mapping

[['AAA'],
 ['AA+', 'AA', 'AA-'],
 ['A+', 'A', 'A-'],
 ['BBB+', 'BBB', 'BBB-'],
 ['BB+', 'BB', 'BB-'],
 ['B+', 'B', 'B-'],
 ['CCC+', 'CCC', 'CCC-', 'CC', 'C'],
 ['DDD', 'DD', 'D', 'RD']]
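One way to picture what such a mapping does: each group becomes one state index, and any label outside the mapping falls into an additional state. The sketch below only illustrates this idea; `build_state_lookup` is a hypothetical helper, and `panelctmc` may implement the step differently internally.

```python
mapping = [['AAA'], ['AA+', 'AA', 'AA-'], ['A+', 'A', 'A-'],
           ['BBB+', 'BBB', 'BBB-'], ['BB+', 'BB', 'BB-'],
           ['B+', 'B', 'B-'], ['CCC+', 'CCC', 'CCC-', 'CC', 'C'],
           ['DDD', 'DD', 'D', 'RD']]

def build_state_lookup(groups):
    """Map every label to the index of the group it belongs to."""
    lookup = {}
    for state, group in enumerate(groups):
        for label in group:
            if label in lookup:
                raise ValueError("label {!r} appears in two groups".format(label))
            lookup[label] = state
    return lookup

lookup = build_state_lookup(mapping)
missing_state = len(mapping)  # assumed extra state for labels outside the mapping

raw_labels = ['AA-', 'BBB', 'withdrawn', 'RD', '-']
states = [lookup.get(label, missing_state) for label in raw_labels]
print(states)
```

Raising on duplicate labels is a deliberate choice here: a label listed in two groups would otherwise be silently assigned to whichever group comes last.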

Estimate the transition matrix

In [6]:
transmat, genmat, transcount, statetime, datalist = panelctmc(paneldata, mapping)

In [7]:
print("Num Examples: {:d}".format(len(datalist)))
print("Num Transitions: {:d}".format(transcount.sum()))
print(statetime.round(1))
transmat.round(3)

Num Examples: 71
Num Transitions: 173
[ 70.1 172.1 152.8 194.6 182.8 152.8  20.9  10.3  28.3]


array([[0.959, 0.04 , 0.001, 0.   , 0.   , 0.   , 0.   , 0.   , 0.   ],
       [0.043, 0.897, 0.057, 0.003, 0.   , 0.   , 0.   , 0.   , 0.   ],
       [0.001, 0.029, 0.887, 0.08 , 0.003, 0.   , 0.   , 0.   , 0.   ],
       [0.   , 0.001, 0.063, 0.864, 0.064, 0.007, 0.   , 0.   , 0.   ],
       [0.   , 0.   , 0.004, 0.109, 0.8  , 0.082, 0.003, 0.   , 0.002],
       [0.   , 0.   , 0.   , 0.007, 0.104, 0.793, 0.051, 0.008, 0.038],
       [0.   , 0.   , 0.   , 0.001, 0.014, 0.212, 0.526, 0.172, 0.075],
       [0.   , 0.   , 0.   , 0.001, 0.027, 0.381, 0.064, 0.515, 0.012],
       [0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 0.   , 1.   ]])

First, note that we specified 8 groups but `panelctmc` outputs 9 states. The last state (9th row and column) is the state for missing labels.

Second, let's look at `statetime`.
In my opinion, `statetime` should be somewhat evenly distributed across all states.
`statetime` is used as the denominator in the calculation of the generator matrix.
As a rule of thumb, a short total duration in a particular state indicates that there are also few observations for that state. In other words, a state in which little time was spent yields transition probability estimates that do not generalize well.
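To see why the denominator matters, here is a minimal sketch of the textbook duration-based estimator for a generator matrix: off-diagonal rates are transition counts divided by the time spent in the origin state, and each diagonal entry is the negative sum of its row. This is an assumption about the general technique, not a copy of `panelctmc`'s code; the counts and durations are toy numbers.

```python
import numpy as np

def generator_from_counts(transcount, statetime):
    """Estimate q_ij = N_ij / T_i for i != j and q_ii = -sum_j q_ij."""
    counts = np.asarray(transcount, dtype=float)
    time_in_state = np.asarray(statetime, dtype=float)
    gen = np.zeros_like(counts)
    visited = time_in_state > 0  # avoid dividing by zero for unvisited states
    gen[visited] = counts[visited] / time_in_state[visited, None]
    np.fill_diagonal(gen, 0.0)               # ignore any self-transition counts
    np.fill_diagonal(gen, -gen.sum(axis=1))  # rows of a generator sum to zero
    return gen

# Toy data: 3 states, counts[i, j] = observed transitions from state i to j.
toy_counts = np.array([[0, 4, 0],
                       [2, 0, 1],
                       [0, 0, 0]])
toy_time = np.array([40.0, 20.0, 5.0])  # total time spent in each state

toy_gen = generator_from_counts(toy_counts, toy_time)
print(toy_gen.round(3))
```

With a small denominator, a single additional transition changes the estimated rate considerably, which is why states with little accumulated time give fragile estimates.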

### Example 2
In this example, we change the group label mapping.
Check `statetime` and compare it with Example 1.

In [8]:
mapping = [['AAA', 'AA+', 'AA', 'AA-', 'A+', 'A', 'A-'], 
           ['BBB+', 'BBB', 'BBB-'],
           ['BB+', 'BB', 'BB-'], 
           ['B+', 'B', 'B-']]

transmat, genmat, transcount, statetime, datalist = panelctmc(paneldata, mapping)

In [9]:
print("Num Examples: {:d}".format(len(datalist)))
print("Num Transitions: {:d}".format(transcount.sum()))
print(statetime.round(1))
transmat.round(3)

Num Examples: 59
Num Transitions: 136
[198.1 194.6 182.8 152.8  59.5]


array([[0.934, 0.063, 0.002, 0.   , 0.   ],
       [0.064, 0.864, 0.064, 0.007, 0.   ],
       [0.004, 0.109, 0.8  , 0.083, 0.005],
       [0.   , 0.007, 0.104, 0.794, 0.095],
       [0.   , 0.   , 0.01 , 0.15 , 0.84 ]])
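The comparison of `statetime` between the two examples can be made more concrete by looking at each state's share of the total time. The sketch below copies the rounded `statetime` values printed in Example 1 and Example 2; `time_shares` is a hypothetical helper.

```python
import numpy as np

# Rounded values copied from the printed outputs of the two examples.
statetime_ex1 = np.array([70.1, 172.1, 152.8, 194.6, 182.8, 152.8, 20.9, 10.3, 28.3])
statetime_ex2 = np.array([198.1, 194.6, 182.8, 152.8, 59.5])

def time_shares(statetime):
    """Fraction of the total observed time spent in each state."""
    return statetime / statetime.sum()

shares_ex1 = time_shares(statetime_ex1)
shares_ex2 = time_shares(statetime_ex2)

print("Example 1 smallest share: {:.3f}".format(shares_ex1.min()))
print("Example 2 smallest share: {:.3f}".format(shares_ex2.min()))
```

A larger smallest share means the time is spread more evenly across states.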

In [10]:
sum(transmat[0,:])

1.0
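The same check can be applied to every row at once. The sketch below uses the rounded matrix printed in Example 2, so a small tolerance is needed to absorb the rounding; `is_row_stochastic` is a hypothetical helper.

```python
import numpy as np

# Rounded transition matrix copied from the Example 2 output.
transmat_rounded = np.array([[0.934, 0.063, 0.002, 0.   , 0.   ],
                             [0.064, 0.864, 0.064, 0.007, 0.   ],
                             [0.004, 0.109, 0.8  , 0.083, 0.005],
                             [0.   , 0.007, 0.104, 0.794, 0.095],
                             [0.   , 0.   , 0.01 , 0.15 , 0.84 ]])

def is_row_stochastic(matrix, atol=1e-8):
    """True if all entries are non-negative and every row sums to 1 within atol."""
    row_sums = matrix.sum(axis=1)
    return bool((matrix >= 0).all() and np.allclose(row_sums, 1.0, rtol=0, atol=atol))

# atol=5e-3 allows for the three-decimal rounding of the printed values.
print(is_row_stochastic(transmat_rounded, atol=5e-3))
```

On the unrounded `transmat`, the default tolerance should be the appropriate starting point.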