# Tabulation of categorical data

In this tutorial, we explore samplics' APIs for creating design-based tabulations. There are two main Python classes for tabulation: ```Tabulation()``` for one-way tables and ```CrossTabulation()``` for two-way tables.

In [1]:
import numpy as np
import pandas as pd

import samplics 
from samplics.categorical import Tabulation, CrossTabulation


## One-way tabulation

The birth dataset has four variables: region, agecat, birthcat, and pop. The variables agecat and birthcat are categorical, but pandas reads them as numerical by default. We can change this by passing ```dtype="string"``` or ```dtype="category"``` for those columns.

In [2]:
birth = pd.read_csv("../../../datasets/docs/birth.csv", dtype={"agecat":"string", "birthcat":"category"})

region = birth["region"]
agecat = birth["agecat"]
birthcat = birth["birthcat"]

birth.head(15)

 Unnamed: 0  region  agecat  birthcat     pop
          0       1       1         1   28152
          1       1       1         1  103101
          2       1       1         1  113299
          3       1       1         1  112028
          4       1       1         1   99588
          5       1       1         1   22356
          6       1       1         1  102926
          7       1       1         1   12627
          8       1       1         1  112885
          9       1       1         1  150297
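The effect of the ```dtype``` argument can be checked without the dataset file. The following is a minimal sketch that assumes a small inline CSV; only its first row comes from the output above, and the other rows are made up for illustration.

```python
import io

import pandas as pd

# Inline stand-in for birth.csv; rows after the first are made up for illustration
csv_text = "region,agecat,birthcat,pop\n1,1,1,28152\n1,2,3,103101\n2,1,2,113299\n"

# Default parsing: the coded columns come back as integers
default = pd.read_csv(io.StringIO(csv_text))

# Explicit dtypes: agecat as string, birthcat as category
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"agecat": "string", "birthcat": "category"},
)

print(default.dtypes)
print(typed.dtypes)
```

Columns that are not listed in the ```dtype``` dictionary, such as region and pop here, keep their inferred types.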


When requesting a table, the user can set ```parameter="count"```, which produces a tabulation with counts in the cells, or ```parameter="proportion"```, which produces cells with proportions. The expression ```Tabulation(parameter="count")``` creates an instance of the class ```Tabulation()```, whose method ```tabulate()``` produces the table.

In [3]:
birth_count = Tabulation(parameter="count")
birth_count.tabulate(birthcat, remove_nan=True)

print(birth_count)


Tabulation of birthcat
 Number of strata: 1
 Number of PSUs: 923
 Number of observations: 923
 Degrees of freedom: 922.00

  variable category  count   stderror    lower_ci    upper_ci
 birthcat        1  240.0  13.333695  213.832087  266.167913
 birthcat        2  450.0  15.193974  420.181215  479.818785
 birthcat        3  233.0  13.204959  207.084737  258.915263
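As a sanity check on the table above, the counts and standard errors can be approximated by hand. The sketch below is not samplics' internal implementation; it assumes unit weights, a single stratum, and every observation acting as its own PSU (which is what the summary above reports), and applies the usual with-replacement variance estimator for a total. The helper name ```count_stderr``` is hypothetical.

```python
import math

# Values copied from the printed table above
n_obs = 923
counts = {"1": 240, "2": 450, "3": 233}


def count_stderr(count, n):
    """Standard error of a category count under the stated assumptions.

    With 0/1 indicators z_i, the estimator
    n / (n - 1) * sum((z_i - mean(z)) ** 2)
    simplifies to count * (n - count) / (n - 1).
    """
    return math.sqrt(count * (n - count) / (n - 1))


for category, count in counts.items():
    proportion = count / n_obs  # the unweighted share of this category
    stderr = count_stderr(count, n_obs)
    print(category, count, round(proportion, 4), round(stderr, 4))
```

Under these assumptions the computed standard errors closely match the stderror column. The proportions printed here are plain shares of the sample, shown only to illustrate what ```parameter="proportion"``` reports in the cells.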


When ```remove_nan=False```, missing values (numpy's ```np.nan``` and pandas' ```NaN```) are treated as a valid category and shown in the tabulation, as follows:

In [4]:
birth_count = Tabulation(parameter="count")
birth_count.tabulate(birthcat, remove_nan=False)

print(birth_count)


Tabulation of birthcat
 Number of strata: 1
 Number of PSUs: 956
 Number of observations: 956
 Degrees of freedom: 955.00

  variable category  count   stderror    lower_ci    upper_ci
 birthcat        1  240.0  13.414066  213.675550  266.324450
 birthcat        2  450.0  15.441157  419.697485  480.302515
 birthcat        3  233.0  13.281448  206.935807  259.064193
 birthcat      nan   33.0   5.647499   21.917060   44.082940
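The difference between the two settings can be illustrated with a small standard-library sketch. The helper ```count_categories``` is hypothetical and only mimics the counting behaviour described above; it is not how samplics handles missing values internally, and the input values are made up.

```python
import math
from collections import Counter


def count_categories(values, remove_nan):
    """Hypothetical helper: tally categories, optionally keeping NaN as a category."""
    counts = Counter()
    for value in values:
        is_missing = isinstance(value, float) and math.isnan(value)
        if is_missing:
            if remove_nan:
                continue  # drop the observation entirely
            value = "nan"  # NaN != NaN, so group all missing values under one label
        counts[value] += 1
    return counts


# Made-up values for illustration only
values = ["1", "2", float("nan"), "2", "3", float("nan")]

print(count_categories(values, remove_nan=True))
print(count_categories(values, remove_nan=False))
```

Keeping the missing values also changes the number of observations used, which is why the summary above reports 956 observations instead of 923 and why the standard errors of the other categories shift slightly.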
