# Categorical Variables Generation

When building Machine Learning algorithms it is very useful to test them under conditions in which we know exactly the properties and probability distributions of the data. 

In this course we will make extensive use of random
sampling to investigate the behavior of ML procedures.

As an example, in this notebook we will learn how to
1. generate random categorical variables with their *natural* multinomial distribution
2. test that our algorithm is really generating the expected distribution
3. generate a categorical **dependent**  variable.
4. compare the performance of a few different implementations.
5. save results for later

We will work on the reverse process (estimating the underlying probability distribution from sample data) in a separate notebook.

## Preliminaries

### Imports

In [1]:
import os
import numpy as np
import numpy.random as random
import scipy.special as special
import pandas as pd

### Random Seed

In [2]:
seed=67421
random.seed(seed)
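
Fixing the seed is what makes the results below reproducible. The following is a minimal sketch of that idea, not part of the original notebook: it re-seeds the global generator and checks that the same draws come back, and it also shows NumPy's newer `default_rng` generator objects as an alternative (an assumption on our side; the rest of this notebook keeps using `numpy.random` directly).

```python
import numpy as np

seed = 67421                      # same seed as in the cell above
p = [0.4, 0.3, 0.25, 0.05]        # any valid probability vector works here

# Global-state API, as used in this notebook: re-seeding replays the same draws.
np.random.seed(seed)
first = np.random.multinomial(1, p, 5)
np.random.seed(seed)
second = np.random.multinomial(1, p, 5)
same_legacy = np.array_equal(first, second)

# Generator API: each generator object carries its own independent state.
rng_a = np.random.default_rng(seed)
rng_b = np.random.default_rng(seed)
same_generator = np.array_equal(rng_a.multinomial(1, p, 5),
                                rng_b.multinomial(1, p, 5))

print(same_legacy, same_generator)
```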

### Data Directories

In [3]:
base_data_dir="../../data"

The next cell creates a directory in which to save results for later.

In [4]:
data_dir=base_data_dir+"/probabilisticTools"
if not os.path.exists(data_dir):
    os.makedirs(data_dir)
    print("Directory",data_dir,"created.")

## Generating  a Categorical Random Variable

In [5]:
categories=np.array(["A","B","C","D"])

We generate a random variable with a multinomial distribution
$$
    P(X=d) = P^x_d
$$
where $d=1,\cdots,D$ runs over the $D$ possible categories.

There is no implied ordering or numerical relationship between the different categories, but
$$
    \sum_{d=1}^D P^x_d =1
$$
as $X$ must belong to one, and only one, of the categories.

In [6]:
p_x=np.array([0.4,0.3,0.25,0.05])
D=len(categories)
print("D = ",D)
p_x

D =  4


array([ 0.4 ,  0.3 ,  0.25,  0.05])

In [7]:
N=1000

NumPy has a `random.multinomial` function that returns one-hot encoded samples of a multinomial variable when called with a single trial per sample.

In [8]:
Z_x=random.multinomial(1,p_x,N)
print("Z_x.shape = ",Z_x.shape)
Z_x[:10]

Z_x.shape =  (1000, 4)


array([[1, 0, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0]])

The mean of $Z_x$ over the samples is the sample (empirical) probability
distribution.

In [9]:
Z_x.mean(axis=0)

array([ 0.415,  0.297,  0.243,  0.045])

We use the `argmax` function to find which category was chosen in each sample.

In [10]:
xi=Z_x.argmax(axis=1)
xi[:10]

array([0, 0, 2, 0, 1, 2, 1, 1, 2, 0], dtype=int64)

We can now assign the labels

In [11]:
X=categories[xi]
X[:10]

array(['A', 'A', 'C', 'A', 'B', 'C', 'B', 'B', 'C', 'A'], 
      dtype='<U1')
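
When the one-hot encoding itself is not needed, the labels can also be drawn directly. The sketch below is an alternative to the cells above, not the method this notebook uses: it assumes NumPy's `Generator.choice` with its `p` argument, and the names `rng` and `X_direct` are ours.

```python
import numpy as np

rng = np.random.default_rng(67421)          # seed reused from this notebook
categories = np.array(["A", "B", "C", "D"])
p_x = np.array([0.4, 0.3, 0.25, 0.05])
N = 1000

# Draw N labels directly, each with probability p_x[d] for category d.
X_direct = rng.choice(categories, size=N, p=p_x)

# Empirical frequency of each category, to compare against p_x.
p_hat = np.array([np.mean(X_direct == c) for c in categories])
for c, p, ph in zip(categories, p_x, p_hat):
    print(c, p, ph, sep="\t")
```

The one-hot matrix remains useful later in this notebook, where it is multiplied with other matrices to obtain joint frequencies.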

Let's check we got the right probability distribution

In [12]:

print("index","categor","p_X","sample p_X",sep="\t")
for idx,c in  enumerate(categories):
    p_hat=np.average(X==c)
    print(idx,c,p_x[idx],p_hat,sep="\t")

index	categor	p_X	sample p_X
0	A	0.4	0.415
1	B	0.3	0.297
2	C	0.25	0.243
3	D	0.05	0.045


Looks close, but is it close enough?

We can run [Pearson's](https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test) $\chi^2$ test

$$
C^2 = N \sum_{d=1}^D\frac{(\hat{p}_d - p_d)^2}{p_d}
$$

where $\hat{p}_d$ is the sample distribution and $p_d$ is the population distribution.

If the sample is really generated from $p_d$ then

$$
    C^2 \sim \chi^2_{D-1}
$$



In [13]:
p_hat=Z_x.mean(axis=0)
df=p_hat-p_x
C2 = N*(df**2/p_x).sum()
C2

1.288499999999998

The probability that a $\chi^2_{D-1}$ variable takes a value at least this large (the p-value) is

In [14]:
special.chdtrc(D-1,C2)

0.73186566009435738
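
As a cross-check, the same test can be run with `scipy.stats.chisquare`, which takes observed and expected counts rather than frequencies. This is a sketch under the assumption that the sample frequencies are the ones printed above; it is not a cell of the original notebook.

```python
import numpy as np
from scipy import stats

N = 1000
p_x = np.array([0.4, 0.3, 0.25, 0.05])
p_hat = np.array([0.415, 0.297, 0.243, 0.045])   # sample frequencies printed above

observed = np.rint(p_hat * N)    # observed counts per category
expected = p_x * N               # expected counts under p_x

# With the default ddof=0 the statistic is compared to a chi-square
# distribution with D-1 degrees of freedom.
result = stats.chisquare(f_obs=observed, f_exp=expected)
statistic, pvalue = result.statistic, result.pvalue
print(statistic, pvalue)
```

The statistic and p-value should agree with the values of `C2` and `special.chdtrc(D-1,C2)` computed above.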

## Generating a Dependent Categorical Variable

We will now generate a new categorical variable $Y$ where
$$
    P(Y=k | X=d) = P^y_{k,d}
$$


In [15]:
labels=np.array(["a","b","c","d","e"])

K=len(labels)
print(D,K)

4 5


In [16]:
p_y=np.array([
    [ 0.1, 0.6, 0.2, 0.97],
    [ 0.1, 0.2, 0.2, 0.01],
    [ 0.7, 0.1, 0.2, 0.01],
    [ 0.1, 0.095, 0.2, 0.01],
    [ 0.0,  0.005, 0.2,  0]
])

Let's check the conditional probabilities are well defined

In [17]:
p_y.sum(axis=0)

array([ 1.,  1.,  1.,  1.])

### Method 1

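We loop over each sample of $X$ and draw the corresponding $Y$ from the conditional distribution for that value of $X$.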

In [18]:

Z_y=np.empty((N,K))
for i,x in enumerate(Z_x):    
    p=np.dot(p_y,x) # this picks up the right column because x is one-hot
    Z_y[i]=random.multinomial(1,p,1)

Z_y.shape

(1000, 5)

In [19]:
Z_y[:10]

array([[ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.]])

In [20]:
Z_y.mean(axis=0)

array([ 0.336,  0.154,  0.366,  0.093,  0.051])

In [21]:
Y=labels[Z_y.argmax(axis=1)]
Y[:10]

array(['c', 'c', 'c', 'c', 'a', 'a', 'a', 'a', 'b', 'c'], 
      dtype='<U1')

Let's look at $Y$'s marginal probabilities
$$
    P(Y=k) = \sum_d P(Y=k| X=d)*P(X=d)
$$
or, expressed in terms of the matrix components
$$
    p^M_k =\sum_d P^y_{k,d} P^x_{d}  = P^Y * P^X
$$
where the last product is a matrix product

In [22]:
P_M=np.dot(p_y,p_x)
P_M

array([ 0.3185,  0.1505,  0.3605,  0.119 ,  0.0515])

Let's check that the generated $Y$ has the right marginal distribution

In [23]:
print("index","label","p_Y","sample p_Y",sep="\t")
for idx,c in  enumerate(labels):
    print(idx,c,P_M[idx],np.average(Y==c),sep="\t")

index	label	p_Y	sample p_Y
0	a	0.3185	0.336
1	b	0.1505	0.154
2	c	0.3605	0.366
3	d	0.119	0.093
4	e	0.0515	0.051


We can now compute the empirical (sample) joint distribution as
$$
        \hat{P}(Y=k,X=d) =  \frac{1}{N}\sum_i z_{y\ {i,k}} z_{x\ {i,d}} = \frac{1}{N} Z_Y^T * Z_X
$$

In [24]:
P_hat=np.dot(Z_y.T,Z_x)/N
P_hat

array([[ 0.052,  0.176,  0.064,  0.044],
       [ 0.046,  0.057,  0.051,  0.   ],
       [ 0.287,  0.028,  0.05 ,  0.001],
       [ 0.03 ,  0.031,  0.032,  0.   ],
       [ 0.   ,  0.005,  0.046,  0.   ]])

Using the product rule (the definition of conditional probability), the true (population) probabilities should be

$$ 
    P(Y=k,X=d) = P(Y=k|X=d)P(X=d) = P^y_{k,d} P^x_d
$$

In [25]:
P_J = p_y*p_x
P_J.shape

(5, 4)

In [26]:
P_J

array([[ 0.04  ,  0.18  ,  0.05  ,  0.0485],
       [ 0.04  ,  0.06  ,  0.05  ,  0.0005],
       [ 0.28  ,  0.03  ,  0.05  ,  0.0005],
       [ 0.04  ,  0.0285,  0.05  ,  0.0005],
       [ 0.    ,  0.0015,  0.05  ,  0.    ]])
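
The product `p_y*p_x` relies on NumPy broadcasting: `p_x` has shape `(D,)`, so column $d$ of `p_y` is multiplied by `p_x[d]`. The short sketch below, added here as an illustration, checks that the resulting joint distribution is normalized and that summing over either index recovers the marginals used earlier.

```python
import numpy as np

p_x = np.array([0.4, 0.3, 0.25, 0.05])
p_y = np.array([
    [0.1, 0.6,   0.2, 0.97],
    [0.1, 0.2,   0.2, 0.01],
    [0.7, 0.1,   0.2, 0.01],
    [0.1, 0.095, 0.2, 0.01],
    [0.0, 0.005, 0.2, 0.0],
])

# Broadcasting: P_J[k, d] = p_y[k, d] * p_x[d]
P_J = p_y * p_x

total_is_one = np.isclose(P_J.sum(), 1.0)                   # joint is normalized
x_marginal_ok = np.allclose(P_J.sum(axis=0), p_x)           # sum over k gives P(X=d)
y_marginal_ok = np.allclose(P_J.sum(axis=1), p_y @ p_x)     # sum over d gives P(Y=k)
print(P_J.shape, total_is_one, x_marginal_ok, y_marginal_ok)
```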

We can compare the sample and population distributions again using Pearson's test.

The following quantity

$$
    C^2=N \sum_{k,d} \frac{(\hat{P}_{k,d} - P_{k,d})^2}{P_{k,d}} 
$$

should be distributed as a $\chi^2$ random variable with $ K \times D -1$ degrees of freedom. 

In [27]:
# We need the floor there because some of the P_J entries are zero
C2 = N* ((P_hat-P_J)**2/np.maximum(1e-9,P_J)).sum()
C2

28.590712907698794

Let's look at its p-value

In [28]:
special.chdtrc(K*D-1,C2)

0.07270150002974464

The p-value is above the conventional 5% significance level, so we cannot reject the hypothesis that $\hat{P}$ was generated from $P$, as it should be!
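
The floor in the denominator handles the zero entries of $P$ numerically. One possible refinement, sketched here as an assumption rather than as this notebook's method, is to leave the zero-probability cells out of the test and count degrees of freedom over the remaining cells only. `P_hat` and `P_J` are copied from the outputs above; `support` and `dof` are our names.

```python
import numpy as np
from scipy import stats

N = 1000
P_J = np.array([
    [0.04, 0.18,   0.05, 0.0485],
    [0.04, 0.06,   0.05, 0.0005],
    [0.28, 0.03,   0.05, 0.0005],
    [0.04, 0.0285, 0.05, 0.0005],
    [0.0,  0.0015, 0.05, 0.0],
])
P_hat = np.array([
    [0.052, 0.176, 0.064, 0.044],
    [0.046, 0.057, 0.051, 0.0],
    [0.287, 0.028, 0.05,  0.001],
    [0.03,  0.031, 0.032, 0.0],
    [0.0,   0.005, 0.046, 0.0],
])

support = P_J > 0                        # cells that can actually occur
# No sample may fall in a zero-probability cell.
assert np.all(P_hat[~support] == 0)

C2 = N * ((P_hat[support] - P_J[support]) ** 2 / P_J[support]).sum()
dof = int(support.sum()) - 1             # possible cells minus one
p_value = stats.chi2.sf(C2, dof)
print(dof, C2, p_value)
```

The statistic is unchanged, but with 17 degrees of freedom instead of 19 the p-value comes out smaller than the one computed above. Several cells also have expected counts $N P_{k,d}$ below 1, where the $\chi^2$ approximation is known to be rough, so either p-value should be read with some caution.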

### Method 2

Looping in python is slow, so we would like to generate all the random variables in one go.

We can do it by linearizing the joint probability of $X$ and $Y$. Define

$$
    P'_{n={D\,k+d}} = P(Y=k,X=d)
$$

so that the index $n=0,\cdots,D\times K-1$ (with $k$ and $d$ counted from zero, as in the code) runs through all possible combinations of $X$ and $Y$ values.

In [29]:
# linearize the joint probability
P_J=P_J.ravel()
print("P_J.shape = ",P_J.shape)
P_J

P_J.shape =  (20,)


array([ 0.04  ,  0.18  ,  0.05  ,  0.0485,  0.04  ,  0.06  ,  0.05  ,
        0.0005,  0.28  ,  0.03  ,  0.05  ,  0.0005,  0.04  ,  0.0285,
        0.05  ,  0.0005,  0.    ,  0.0015,  0.05  ,  0.    ])
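
To make the index mapping concrete, here is a small sketch (ours, not an original cell) that uses `np.unravel_index` to recover the row $k$ and column $d$ of every flat index and checks the relation $n = D\,k + d$ for the row-major layout produced by `ravel()`.

```python
import numpy as np

K, D = 5, 4                              # sizes used in this notebook
n = np.arange(K * D)                     # zero-based flat index
k, d = np.unravel_index(n, (K, D))       # row (Y) and column (X) of each flat index

# The row-major layout used by ravel() satisfies n = D*k + d.
formula_ok = np.array_equal(n, D * k + d)
roundtrip_ok = np.array_equal(n, np.ravel_multi_index((k, d), (K, D)))
print(formula_ok, roundtrip_ok)
print(k[8], d[8])                        # flat index 8 is row 2, column 0
```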

We can now generate samples of that categorical variable with $D\times K$ categories

In [30]:
Z2=random.multinomial(1,P_J,N)
print("Z2.shape = ",Z2.shape)
Z2[:3]

Z2.shape =  (1000, 20)


array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])

We then reshape it into an $N\times K \times D$ array in which each sample is a $K\times D$ one-hot matrix with exactly one non-zero element.

The column of the non-zero element gives the $X$ label, and the row gives the $Y$ label.

In [31]:

Z2=Z2.reshape(N,K,D)
print("Z2.shape = ",Z2.shape)
Z2[:3]

Z2.shape =  (1000, 5, 4)


array([[[0, 0, 0, 0],
        [0, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 0, 0, 1],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]],

       [[0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]])

As before, we can easily get the joint empirical probabilities

In [32]:
Z2.mean(axis=0)

array([[ 0.038,  0.173,  0.04 ,  0.074],
       [ 0.04 ,  0.053,  0.049,  0.001],
       [ 0.286,  0.024,  0.052,  0.   ],
       [ 0.041,  0.023,  0.062,  0.   ],
       [ 0.   ,  0.001,  0.043,  0.   ]])

Extracting the $X$ and $Y$ values is a bit of work

In [33]:
X1=Z2.argmax(axis=2) # column of the on (1) value
X=X1.max(axis=1) # flatten array
X[:20]

array([0, 3, 2, 0, 0, 2, 0, 1, 2, 0, 0, 1, 1, 0, 0, 2, 1, 1, 2, 2], dtype=int64)

In [34]:
Y1=Z2.argmax(axis=1) # row of the on (1) value
Y=Y1.max(axis=1)

Y[:20]

array([2, 0, 1, 2, 0, 2, 2, 0, 2, 3, 3, 0, 0, 2, 2, 2, 0, 0, 0, 3], dtype=int64)
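
The `argmax` followed by `max` works because every all-zero row or column has `argmax` equal to 0, so the maximum picks out the position of the single 1. The sketch below rebuilds the first three samples shown above (flat positions 8, 3 and 6) and compares this extraction with a shorter alternative, assumed here for illustration, that takes the `argmax` of the flat one-hot vector and splits it with `np.divmod`.

```python
import numpy as np

K, D = 5, 4
flat = np.array([8, 3, 6])               # position of the 1 in the first three samples above
Z_flat = np.zeros((len(flat), K * D), dtype=int)
Z_flat[np.arange(len(flat)), flat] = 1   # rebuild the one-hot rows
Z3 = Z_flat.reshape(-1, K, D)

# Extraction used in the cells above.
X_a = Z3.argmax(axis=2).max(axis=1)
Y_a = Z3.argmax(axis=1).max(axis=1)

# Alternative: flat position n = D*k + d, so divmod by D gives (k, d).
Y_b, X_b = np.divmod(Z_flat.argmax(axis=1), D)

print(X_a, Y_a)   # [0 3 2] [2 0 1]
print(np.array_equal(X_a, X_b), np.array_equal(Y_a, Y_b))
```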

### Performance Comparison

#### Generate X and Y one by one

In [35]:
%%timeit -n 5 -r 5
x=np.empty((N,D))
Z=np.empty((N,K))
for i1 in range(N):
    x[i1]=random.multinomial(1,p_x)
    prob=np.dot(p_y,x[i1])
    Z[i1]=random.multinomial(1,prob,1)
Y=Z.argmax(axis=1)
Z.shape

5 loops, best of 5: 5.86 ms per loop


#### Vectorize X, generate Y one by one

In [36]:
%%timeit -n 5 -r 5
x=random.multinomial(1,p_x,N)
Z=np.empty((N,K))
for idx,s in enumerate(x):
    prob=np.dot(p_y,s)
    Z[idx]=random.multinomial(1,prob,1)
Y=Z.argmax(axis=1)
Z.shape

5 loops, best of 5: 3.46 ms per loop


#### Vectorize X and Y

In [37]:
%%timeit -n 5 -r 5

Z2=random.multinomial(1,P_J,N)
Z2=Z2.reshape(N,K,D)
X1=Z2.argmax(axis=2)
X=X1.max(axis=1)
Y1=Z2.argmax(axis=1)
Y=Y1.max(axis=1)

5 loops, best of 5: 736 µs per loop


The second method is roughly **5 to 8 times faster** than the loop-based versions in the timings above!

In general, you should **avoid Python for loops** that run over large amounts of data. They are slow.

Loops over the sample size, which can be large, are especially problematic.
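
A further variation, sketched here as an assumption and not timed in this notebook, skips the one-hot matrix altogether: it draws the flat index $n$ directly with NumPy's `Generator.choice` and splits it into the $Y$ and $X$ indices. The names `rng`, `P_flat`, `X_idx` and `Y_idx` are ours.

```python
import numpy as np

rng = np.random.default_rng(67421)       # seed reused from this notebook
p_x = np.array([0.4, 0.3, 0.25, 0.05])
p_y = np.array([
    [0.1, 0.6,   0.2, 0.97],
    [0.1, 0.2,   0.2, 0.01],
    [0.7, 0.1,   0.2, 0.01],
    [0.1, 0.095, 0.2, 0.01],
    [0.0, 0.005, 0.2, 0.0],
])
K, D = p_y.shape
N = 1000

P_flat = (p_y * p_x).ravel()             # linearized joint probability
n = rng.choice(K * D, size=N, p=P_flat)  # one flat index per sample
Y_idx, X_idx = np.divmod(n, D)           # n = D*k + d

print(X_idx[:10], Y_idx[:10])
```

Whether this is faster than the one-hot version would need to be measured with `%%timeit` in the same way as above.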

### A Function to Generate Conditionally Dependent Categorical Variables

In [38]:
def generate_conditional_categorical(p_X,p_Y, X_labels,Y_labels,N):
    K,D=p_Y.shape # sizes taken from the arguments, not from global variables
    P_J=(p_Y*p_X).ravel() # linearized joint probability P(Y=k,X=d)
    Z2=random.multinomial(1,P_J,N)
    Z2=Z2.reshape(N,K,D)
    X1=Z2.argmax(axis=2) # find which index in Z2 is not zero
    X=X1.max(axis=1) # return the index, collapsing one dimension
    Y1=Z2.argmax(axis=1)
    Y=Y1.max(axis=1)
    return X_labels[X],Y_labels[Y]

In [39]:
X,Y=generate_conditional_categorical(p_x,p_y,categories,labels,N)
print("X,Y shapes =",X.shape,Y.shape)

X,Y shapes = (1000,) (1000,)


In [40]:
X[:5],Y[:5]

(array(['A', 'A', 'B', 'B', 'A'], 
       dtype='<U1'), array(['a', 'c', 'a', 'a', 'a'], 
       dtype='<U1'))

### Save generated data

First we join X and Y into an Nx2 array

In [41]:
data=np.c_[X,Y]
print("data.shape = ",data.shape)
data[:5]

data.shape =  (1000, 2)


array([['A', 'a'],
       ['A', 'c'],
       ['B', 'a'],
       ['B', 'a'],
       ['A', 'a']], 
      dtype='<U1')

Next we create a pandas DataFrame

In [42]:
df=pd.DataFrame(data,columns=["X","Y"])
df.head()

Unnamed: 0,X,Y
0,A,a
1,A,c
2,B,a
3,B,a
4,A,a


And we save it for later

In [43]:
df.to_csv(data_dir+"/dependent_categorical.csv",
          index=False)
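
To see what "for later" looks like, the sketch below writes a small frame with the same two columns and reads it back with `pd.read_csv`. It is an illustration only: it uses a temporary directory and a three-row stand-in (`df_demo`) instead of the notebook's `data_dir` and `df`.

```python
import os
import tempfile

import pandas as pd

# Three-row stand-in for the generated data (hypothetical values).
df_demo = pd.DataFrame({"X": ["A", "A", "B"], "Y": ["a", "c", "a"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dependent_categorical.csv")
    df_demo.to_csv(path, index=False)    # index=False keeps only the X and Y columns
    df_back = pd.read_csv(path)          # read while the temporary file still exists

same_columns = list(df_back.columns) == ["X", "Y"]
same_values = df_back.values.tolist() == df_demo.values.tolist()
print(same_columns, same_values)
```

Because the file is written with `index=False`, no extra index column appears when it is read back.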

#### Save Test Data

Data generated using the same $P(X)$ and $P(Y|X)$

In [44]:
X_test,Y_test=generate_conditional_categorical(p_x,p_y,categories,labels,int(0.25*N))
print("X,Y shapes =",X_test.shape,Y_test.shape)

X,Y shapes = (250,) (250,)


In [45]:
test_data=np.c_[X_test,Y_test]
df_test=pd.DataFrame(test_data,columns=["X","Y"])
df_test.to_csv(data_dir+"/dependent_categorical_test1.csv",
          index=False)

Data generated with different $P(X)$ probability distribution, but same $P(Y|X)$

In [46]:
p1_x=np.array([0.3,0.4,0.2,0.1])

In [47]:
X_test,Y_test=generate_conditional_categorical(p1_x,p_y,categories,labels,int(0.25*N))
test_data=np.c_[X_test,Y_test]
df_test=pd.DataFrame(test_data,columns=["X","Y"])
df_test.to_csv(data_dir+"/dependent_categorical_test2.csv",
          index=False)

Data generated with the same $P(X)$ probability distribution, but a different $P(Y|X)$

In [48]:
p1_y=np.array([
    [ 0.3, 0.4, 0.05, 0.97],
    [ 0.1, 0.2, 0.15, 0.01],
    [ 0.5, 0.3, 0.1, 0.01],
    [ 0.1, 0.095, 0.1, 0.01],
    [ 0.0,  0.005, 0.5,  0]
])
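
As was done for `p_y`, it is worth checking that each column of the new conditional matrix sums to one. The sketch below repeats that check on the values as written above; `col_sums` and `bad_columns` are our names.

```python
import numpy as np

p1_y = np.array([
    [0.3, 0.4,   0.05, 0.97],
    [0.1, 0.2,   0.15, 0.01],
    [0.5, 0.3,   0.1,  0.01],
    [0.1, 0.095, 0.1,  0.01],
    [0.0, 0.005, 0.5,  0.0],
])

col_sums = p1_y.sum(axis=0)                               # one sum per value of X
bad_columns = np.flatnonzero(~np.isclose(col_sums, 1.0))  # columns that are not normalized
print(col_sums)
print(bad_columns)
```

With these values the third column (category `C`) adds up to 0.9 rather than 1. The NumPy documentation for `multinomial` states that the last entry of `pvals` is assumed to account for any remaining probability, so the missing mass would presumably be assigned to the last joint category instead of raising an error.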

In [49]:
X_test,Y_test=generate_conditional_categorical(p_x,p1_y,categories,labels,int(0.25*N))
test_data=np.c_[X_test,Y_test]
df_test=pd.DataFrame(test_data,columns=["X","Y"])
df_test.to_csv(data_dir+"/dependent_categorical_test3.csv",
          index=False)

Data generated with a different $P(X)$ probability distribution and a different $P(Y|X)$

In [50]:
X_test,Y_test=generate_conditional_categorical(p1_x,p1_y,categories,labels,int(0.25*N))
test_data=np.c_[X_test,Y_test]
df_test=pd.DataFrame(test_data,columns=["X","Y"])
df_test.to_csv(data_dir+"/dependent_categorical_test4.csv",
          index=False)

### For Homework

In [51]:
p2_y=np.array([
    [0.2,0.2,0.2,0.2],
    [0.2,0.2,0.2,0.2],
    [0.2,0.2,0.2,0.2],
    [0.2,0.2,0.2,0.2],
    [0.2,0.2,0.2,0.2]
     ])

In [52]:
X_homework2,Y_homework2=generate_conditional_categorical(p1_x,p2_y,categories,labels,int(N))
homework_data=np.c_[X_homework2,Y_homework2]
df_homework=pd.DataFrame(homework_data,columns=["X","Y"])
df_homework.to_csv(data_dir+"/homework.csv",
          index=False)