### 2) Posterior Sampling of Chinese Restaurant Franchise

This section contains code for the posterior sampling of our [data](https://web.archive.org/web/20040328153507/http://elegans.swmed.edu/wli/cgcbib) using the algorithm in **Section 5.1** of [Hierarchical Dirichlet Processes](https://people.eecs.berkeley.edu/~jordan/papers/hdp.pdf).

*I don't understand the posterior sampling for concentration parameters that is in the appendix. How does this tie in?*
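
One possible way this ties in: the concentration parameters can be resampled between sweeps over $t$ and $k$, conditioning on the current counts. The sketch below is the Escobar and West (1995) auxiliary-variable update, which, as far as I can tell, is the technique the appendix applies to $\gamma$. The Gamma prior parameters `a` and `b` (rate), the function name, and the toy values of `K` and `m_total` are assumptions for illustration. The update for $\alpha_0$ uses a related but different auxiliary-variable scheme and is not shown.

```python
import numpy as np

def resample_gamma(gamma_cur, K, m_total, a=1.0, b=1.0, rng=None):
    """One auxiliary-variable update of gamma under a Gamma(a, rate=b) prior.

    K is the current number of dishes and m_total the total number of tables.
    """
    rng = np.random.default_rng() if rng is None else rng
    eta = rng.beta(gamma_cur + 1.0, m_total)   # auxiliary variable
    rate = b - np.log(eta)
    odds = (a + K - 1.0) / (m_total * rate)    # odds between the two Gamma components
    shape = a + K if rng.random() < odds / (1.0 + odds) else a + K - 1.0
    return rng.gamma(shape, 1.0 / rate)        # NumPy parameterizes by scale = 1 / rate

rng = np.random.default_rng(0)
gamma_cur = 1.0
for sweep in range(5):
    # In a full sampler, t and k would be resampled here and K, m_total recomputed
    gamma_cur = resample_gamma(gamma_cur, K=5, m_total=40, rng=rng)
print(gamma_cur)
```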

In [2]:
import pandas as pd
import numpy as np
import scipy.special as sp
import random

data = pd.read_csv('final_project_data.csv', index_col = 0)

In [325]:
# Priors on concentration parameters
alpha_0 = np.random.gamma(1, 1) 
gamma_ = np.random.gamma(1, .1) # np.random.gamma takes (shape, scale), so this prior has mean 0.1. A small gamma_ pushes the Beta(1, gamma_) draws toward 1, which puts nearly all mass on the first component. This can't be right...

# j is the number of documents
j = data.shape[0]

# T is truncated on global level
T = 100 # arbitrary number

# Generate Betas
#beta_ = np.random.beta(1, gamma_, T)
#BETA = np.zeros(len(beta_))
#BETA[0] = beta_[0]
#BETA[1:] = beta_[1:]*(1 - beta_[:-1]).cumprod(axis = 0)

# Generate Pis
#pi_ = np.zeros((j, len(BETA)))
#PI = np.zeros((j, len(BETA)))

#for k in range(len(BETA)):
#    pi_[:,k] = np.random.beta(alpha_0*BETA[k], alpha_0*(1 - sum(BETA[:k+1])), j)

#for i in range(j):
#    for k in range(len(BETA)):
#        PI[i,k] = pi_[i,k]*np.prod(1 - pi_[i,:k])

# H is the prior over topic distributions. 
# We don't know the number of topic distributions so
# how do we determine the length of H? Set = to arbitrary T?
H = np.zeros((T))
prior_H = np.random.beta(1, .5, T)
for i in range(T):
    H[i] = prior_H[i]*np.prod(1 - prior_H[:i])

PHI = H
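
One possible answer to the question in the comments above: in the topic-model setting, each dish $\phi_k$ is a distribution over words, so a draw from $H$ has one entry per vocabulary word rather than one per truncation level `T`. A minimal sketch, assuming $H$ is a symmetric Dirichlet. The names `draw_topics`, `vocab_size`, and `beta_h` are hypothetical and the sizes are toy values.

```python
import numpy as np

def draw_topics(num_topics, vocab_size, beta_h=0.5, rng=None):
    """Draw num_topics word distributions phi_k ~ Dirichlet(beta_h, ..., beta_h)."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.dirichlet(np.full(vocab_size, beta_h), size=num_topics)

phi = draw_topics(num_topics=5, vocab_size=8, rng=np.random.default_rng(0))
print(phi.shape)        # (5, 8): one row per dish, one column per word
print(phi.sum(axis=1))  # each row sums to 1 up to floating point
```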

In [429]:
num_customers = data.shape[0]*data.shape[1]
table_ = [1]
next_table_ = 2


num_restaurantJ_tableT = np.zeros((j, T))

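# This process is for customers entering one at a time; customer 1 is already seated at table 1
for i in range(len(data.iloc[0,:]) - 1):
    # i + 1 customers are already seated when the next one arrives
    if np.random.rand() < alpha_0/(i + 1 + alpha_0):
        # The arriving customer sits at a new table
        table_.append(next_table_)
        next_table_ += 1
    else:
        # Joining the table of a uniformly chosen seated customer picks
        # an existing table with probability proportional to its occupancy
        choice = np.random.choice(np.array(table_))
        table_.append(choice)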

# We need to consider if people 
# are already assigned and the N_ij'th person is removed

## Sampling of t

**Questions:**
-  I'm not sure what $f_{k_{jt}}^{-x_{ji}}(x_{ji})$ is

$$f_{k^{\text{new}}}^{-x_{ji}}(x_{ji}) = \int f(x_{ji}\mid \phi)\, h(\phi)\, d\phi$$
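
On the question above: in the HDP paper, $f_k^{-x_{ji}}(x_{ji})$ is the conditional density of $x_{ji}$ under dish $k$ given all other observations currently assigned to $k$, with $\phi_k$ integrated out. A minimal sketch, assuming words are multinomial draws and $H$ is a symmetric Dirichlet with parameter `beta` (an assumption; this notebook does not pin down $H$). Under that assumption the integral has a closed form. The function name and toy counts are made up.

```python
import numpy as np

def conditional_word_density(word_counts_k, word_id, beta):
    """f_k^{-x}(x): probability of word_id under topic k with phi_k integrated out.

    word_counts_k[w] is the number of times word w is assigned to topic k,
    NOT counting the observation being resampled.
    """
    word_counts_k = np.asarray(word_counts_k, dtype=float)
    vocab_size = len(word_counts_k)
    return (word_counts_k[word_id] + beta) / (word_counts_k.sum() + vocab_size * beta)

counts = [3, 0, 1, 0]   # toy counts for a 4-word vocabulary
beta = 0.5
probs = [conditional_word_density(counts, w, beta) for w in range(4)]
print(probs)            # the four values sum to 1

# A brand-new topic has no counts, so every word gets probability 1 / vocab_size
new_topic_prob = conditional_word_density([0, 0, 0, 0], 2, beta)
print(new_topic_prob)
```

The `beta` term acts as a pseudo-count contributed by the prior, which is why unseen words keep a nonzero probability.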

In [524]:
# Assign customers to tables with probability PHI
initial_tables = np.random.choice(range(1, T + 1), len(data.iloc[0,:]), p = PHI)
pd.Series(initial_tables).value_counts()

2    3791
1    2384
3      12
4       2
dtype: int64

In [545]:
# Remove the i-th person and resample their table from the seating prior
for i in range(len(initial_tables)):
    others = np.delete(initial_tables, i)  # everyone except customer i

    if np.random.rand() < alpha_0 / (len(others) + alpha_0):
        # New table: use a label that no remaining customer holds
        initial_tables[i] = others.max() + 1
    else:
        # Existing table, chosen with probability proportional to its occupancy
        initial_tables[i] = np.random.choice(others)

pd.Series(initial_tables).value_counts()

2    3750
1    2423
3      13
7       1
6       1
5       1
dtype: int64

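After removing each person and resampling $t_i$, the table counts are nearly identical to the initial assignment. This is expected, since the update above draws from the seating prior alone and never uses the observed data.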
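
Equation 32 of the HDP paper weights each table by both its occupancy and the likelihood of the customer's observation under that table's dish. A sketch of that sampling step, assuming the per-table likelihood values and the new-table likelihood (equation 31) have already been computed. `sample_table` and its arguments are hypothetical names, and the numbers are toy values.

```python
import numpy as np

def sample_table(table_counts, f_at_table, p_new, alpha_0, rng=None):
    """Sample t_ji: existing table t has weight n_jt * f_{k_jt}(x_ji),
    a new table has weight alpha_0 * p_new.

    Returns an index into table_counts, or len(table_counts) for a new table.
    """
    rng = np.random.default_rng() if rng is None else rng
    existing = np.asarray(table_counts, dtype=float) * np.asarray(f_at_table, dtype=float)
    weights = np.append(existing, alpha_0 * p_new)
    return rng.choice(len(weights), p=weights / weights.sum())

rng = np.random.default_rng(0)
counts = [5, 2, 0]            # customers per table, with customer i removed
f_vals = [0.01, 0.30, 0.20]   # likelihood of x_ji under each table's dish
draws = [sample_table(counts, f_vals, p_new=0.05, alpha_0=1.0, rng=rng) for _ in range(2000)]
freq = np.bincount(draws, minlength=4) / 2000
print(freq)
```

In this toy setup the smaller table 1 is chosen far more often than the larger table 0, because the likelihood term outweighs the occupancy term. The empty table 2 is never chosen.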

### Variable Descriptions

- **H** - base distribution
- **F** - data distribution

#### Concentration Parameters:
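- $\alpha_0$ - concentration parameter of each restaurant-level process, $G_j \sim DP(\alpha_0, G_0)$
- $\gamma$ - concentration parameter of the franchise-level process, $G_0 \sim DP(\gamma, H)$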

#### Random Variables
- $x_{ji}$ - observed data (arises as a draw from the distribution $F(\theta_{ji})$)
- $\theta_{ji}$ - the factor associated with customer $i$ in restaurant $j$
    - is equal to $\psi_{jt_{ji}}$
- $\psi_{jt}$ - is the dish served at table $t$ in restaurant $j$
    - $\psi_{jt} = \phi_{k_{jt}}$
    - The table-specific choice of dishes
    - is instance of mixture component $k_{jt}$
- $\phi_k$ - the global menu of dishes 
    - Prior over $\phi_k$ is $H$
    - $K$ iid r.v's
- $z_{ji} = k_{jt_{ji}}$ denotes the mixture component associated with the observation $x_{ji}$
- $t_{ji}$ - customer $i$ in restaurant $j$ sits at table $t_{ji}$
    - The index of the $\psi_{jt}$ associated with $\theta_{ji}$
- $k_{jt}$ - Table $t$ in restaurant $j$ serves dish $k_{jt}$

#### Counts
- $n_{jtk}$ - number of customers in restaurant $j$ at table $t$ eating dish $k$
- $m_{jk}$ - number of tables in restaurant $j$ serving dish $k$
- $K$ - denotes the number of dishes being served throughout the franchise
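
To make the counts concrete, here is a small sketch that derives them for a single restaurant from the assignments $t_{ji}$ and $k_{jt}$. All values are made up for illustration.

```python
import numpy as np

# Toy state for a single restaurant j
t_ji = np.array([0, 0, 1, 2, 1, 0])   # table of each customer
k_jt = np.array([0, 1, 0])            # dish served at each table
K = 2                                 # dishes served in the franchise

n_jt = np.bincount(t_ji, minlength=len(k_jt))   # customers per table
m_jk = np.bincount(k_jt, minlength=K)           # tables per dish
z_ji = k_jt[t_ji]                               # dish eaten by each customer
n_jk = np.bincount(z_ji, minlength=K)           # customers per dish

print(n_jt)  # [3 2 1]
print(m_jk)  # [2 1]
print(n_jk)  # [4 2]
```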

#### Calculation of $p(x_{ji}\mid \textbf{t}^{-ji}, t_{ji} = t^{\text{new}}, \textbf{k})$

............
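
For reference, equation 31 of the HDP paper gives this quantity by marginalizing over the dish that the new table would serve:

$$p(x_{ji}\mid \textbf{t}^{-ji}, t_{ji} = t^{\text{new}}, \textbf{k}) = \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}\, f_k^{-x_{ji}}(x_{ji}) + \frac{\gamma}{m_{\cdot\cdot} + \gamma}\, f_{k^{\text{new}}}^{-x_{ji}}(x_{ji})$$

Here $m_{\cdot k}$ is the number of tables across the franchise serving dish $k$ and $m_{\cdot\cdot}$ is the total number of tables.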




In [None]:
# Need to change data structure back to way I had it..
# in the list of words per document

# Also pretty sure I need to change doc.wordIndex to be some global index mapping a value to a word

In [117]:
import numpy as np

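class documentInfo:
    '''Per-document (restaurant) state for the sampler'''
    def __init__(self, document, docIndex):
        self.document = document                # list of words in the document
        self.docIndex = docIndex                # j, the restaurant index
        self.tableInRestaurantCount = 0         # number of tables created so far in this restaurant
        self.docLength = len(document)
        # Placeholder: the position of each word within the document.
        # This should become a global vocabulary id (see the note above).
        self.wordIndex = list(range(self.docLength))
        self.tableAssignment = [-1] * self.docLength   # t_{ji}; -1 means not yet seated
        self.wordCountByTable = np.zeros(2)     # n_{jt.}
        self.tableTopic = np.zeros(2)           # k_{jt}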
        
temp = documentInfo(['the', 'boy', 'jumped', 'over', 'the', 'fence'], 0)

array([0., 0.])

In [59]:
beta = 0.5  # symmetric Dirichlet parameter of H; placeholder value, not set anywhere else in this notebook

def initialState(documents):
    '''Initially assign the words to tables and topics'''
    # The sampler state lives in module-level globals so the helpers below can update it
    global vocabSize, wordCount, docCount, topicCount, tableCount, docInfo
    global p, f, tableCountByTopic, wordCountByTopic, wordCountByTopicTerm

    # Map each distinct word to a global integer id
    unique_words = sorted(set(word for doc in documents for word in doc))
    wordToId = {word: idx for idx, word in enumerate(unique_words)}
    vocabSize = len(unique_words)
    wordCount = 0                  # n_{...}
    docCount = len(documents)      # Number of Documents
    topicCount = 1                 # K
    tableCount = 0                 # m_{..}
    docInfo = []

    for i in range(docCount):
        doc = documentInfo(documents[i], i)
        doc.wordIndex = [wordToId[word] for word in documents[i]]  # global vocabulary ids
        docInfo.append(doc)
        wordCount += len(documents[i])

    p = np.zeros(20)
    f = np.zeros(20)

    tableCountByTopic = np.zeros(topicCount + 1)  # m_{.k}
    wordCountByTopic = np.zeros(topicCount + 1) # n_{..k}
    wordCountByTopicTerm = np.zeros((topicCount + 1, vocabSize)) # n_{word_i, topic k}

    # Assign each topic a single document
    for k in range(topicCount):
        doc = docInfo[k]
        for i in range(doc.docLength):
            # Assign each word i in document doc.docIndex to topic k, table 0
            assignWord(doc.docIndex, i, 0, k)

    # Randomly assign the remaining documents to topics
    for j in range(topicCount, docCount):
        doc = docInfo[j]
        k = np.random.randint(topicCount) # randomly choose k
        for i in range(doc.docLength):
            assignWord(doc.docIndex, i, 0, k)

def assignWord(docIndex, i, table, k):
    '''Assign word i of document docIndex to table `table`, which serves topic k'''
    global tableCount, topicCount, tableCountByTopic, wordCountByTopic, wordCountByTopicTerm
    doc = docInfo[docIndex]
    doc.tableAssignment[i] = table
    doc.wordCountByTable[table] += 1
    wordCountByTopic[k] += 1
    wordCountByTopicTerm[k][doc.wordIndex[i]] += 1

    if doc.wordCountByTable[table] == 1: # create new table
        doc.tableInRestaurantCount += 1
        doc.tableTopic[table] = k
        tableCount += 1
        tableCountByTopic[k] += 1
        doc.tableTopic = ENSURECAPACITY(doc.tableTopic, doc.tableInRestaurantCount)
        doc.wordCountByTable = ENSURECAPACITY(doc.wordCountByTable, doc.tableInRestaurantCount)
        if k == topicCount: # create new topic
            topicCount += 1
            tableCountByTopic = ENSURECAPACITY(tableCountByTopic, topicCount)
            wordCountByTopic = ENSURECAPACITY(wordCountByTopic, topicCount)
            wordCountByTopicTerm = ADD(wordCountByTopicTerm, np.zeros(vocabSize), topicCount)

def ENSURECAPACITY(input_, min_req = 1):
    '''Return input_, grown with trailing zeros if needed so that index min_req is valid'''
    if min_req < len(input_):
        return input_
    array = np.zeros(max(2*len(input_), min_req + 1), dtype = input_.dtype)
    array[:len(input_)] = input_
    return array

def ADD(input_, newRow, index_):
    '''Write newRow into row index_ of input_, growing input_ if needed'''
    if len(input_) <= index_:
        tmp = np.zeros((max(2*index_, index_ + 1), len(newRow)))
        tmp[:len(input_), :] = input_
        input_ = tmp
    input_[index_, :] = newRow
    return input_

def updateK():
    '''Sample a topic for a new table from its full conditional.
    Uses the f values computed by the most recent call to updateT.'''
    global p
    p = ENSURECAPACITY(p, topicCount)
    pSum = 0
    for k in range(topicCount):
        pSum += tableCountByTopic[k] * f[k]
        p[k] = pSum

    pSum += gamma_ / vocabSize # weight of a brand-new topic
    p[topicCount] = pSum

    rand_draw = np.random.rand()*pSum
    newTopic = topicCount
    for k in range(topicCount + 1): # include the new-topic slot
        if rand_draw < p[k]:
            newTopic = k
            break
    return newTopic

def updateT(docIndex, i):
    '''Sample a new table T from full conditional for customer i in restaurant docIndex'''
    global p, f
    doc = docInfo[docIndex]
    f = ENSURECAPACITY(f, topicCount)
    p = ENSURECAPACITY(p, doc.tableInRestaurantCount)
    # Right side of equation 31: under a symmetric Dirichlet prior, a new topic
    # gives every word probability 1/vocabSize, so gamma * f_new = gamma / vocabSize
    fNew = gamma_ / vocabSize
    for k in range(topicCount):
        # Probability of this word under topic k with phi_k integrated out;
        # beta is a pseudo-count added to each of the vocabSize words by the prior H
        f[k] = (wordCountByTopicTerm[k, doc.wordIndex[i]] + beta) / (wordCountByTopic[k] + vocabSize*beta)
        fNew += tableCountByTopic[k] * f[k] # left side of equation 31
    pSum = 0
    for j in range(doc.tableInRestaurantCount):
        if doc.wordCountByTable[j] > 0:
            pSum += doc.wordCountByTable[j] * f[int(doc.tableTopic[j])] # top of equation 32
        p[j] = pSum

    pSum += alpha_0 * fNew/(tableCount + gamma_) # bottom of 32 and division in 31
    p[doc.tableInRestaurantCount] = pSum
    rand_draw = np.random.rand()*pSum
    table_choice = doc.tableInRestaurantCount # index of a new table
    for j in range(doc.tableInRestaurantCount + 1): # include the new-table slot
        if rand_draw < p[j]:
            table_choice = j
            break

    return table_choice

def removeWord(docIndex, i):
    '''Remove word i of document docIndex from its table and topic counts'''
    global tableCount
    doc = docInfo[docIndex]
    table = doc.tableAssignment[i] # the table where word i is sitting
    k = int(doc.tableTopic[table]) # the topic at table word i is sitting
    doc.wordCountByTable[table] -= 1
    wordCountByTopic[k] -= 1
    wordCountByTopicTerm[k][doc.wordIndex[i]] -= 1
    if doc.wordCountByTable[table] == 0: # remove table if no one at it
        tableCount -= 1
        tableCountByTopic[k] -= 1
        doc.tableTopic[table] = -1 # mark the empty table as serving no topic
        


In [109]:
np.random.rand()*10

7.997176611775973

In [58]:
len(np.array(([3,3,3,3,3,], [2,2,2,2,2], [4,4,4,4,4])))

3

## CODE ADD (LINE 55) FUNCTIONS

In [48]:
g = 0
for i in range(10):
    g += 1

In [296]:
# Expected value of log(theta) under a Dirichlet(alpha) distribution
def dirichlet_EV(alpha):
    if (len(alpha.shape) == 1):
        return(sp.psi(alpha)- sp.psi(np.sum(alpha)))
    
    return(sp.psi(alpha) - sp.psi(np.sum(alpha, 1))[:, np.newaxis])

0.9999999999999999

In [None]:
#G_0 | gamma_, H ~ DP(gamma_, H)

#G_j | alpha_0, G_0 ~ DP(alpha_0, G_0)

#theta_ji | G_j ~ G_j

#x_ji | theta_ji ~ F(theta_ji)

#G_0 = sum(beta_k * delta_phi_k)
#G_j = sum(pi_jk * delta_phi_k)
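
The two sums above can be simulated with a truncated stick-breaking construction, along the lines of the commented-out code earlier in this section. A minimal sketch, assuming truncation at `T` components. The function name, the `tiny` clipping constant, and the toy arguments are assumptions for illustration.

```python
import numpy as np

def stick_breaking_hdp(gamma_, alpha_0, num_groups, T, rng=None):
    """Truncated stick-breaking weights for G_0 (beta_weights) and each G_j (pi)."""
    rng = np.random.default_rng() if rng is None else rng
    tiny = 1e-12  # keeps the Beta parameters strictly positive

    # Global weights: beta_k = beta'_k * prod_{l<k} (1 - beta'_l), beta'_k ~ Beta(1, gamma_)
    b_prime = rng.beta(1.0, gamma_, size=T)
    beta_weights = b_prime * np.concatenate(([1.0], np.cumprod(1.0 - b_prime[:-1])))

    # Group weights: pi'_jk ~ Beta(alpha_0 * beta_k, alpha_0 * (1 - sum_{l<=k} beta_l))
    remaining = np.maximum(1.0 - np.cumsum(beta_weights), tiny)
    a = np.maximum(alpha_0 * beta_weights, tiny)
    pi_prime = rng.beta(a, alpha_0 * remaining, size=(num_groups, T))

    # pi_jk = pi'_jk * prod_{l<k} (1 - pi'_jl)
    leftover = np.cumprod(1.0 - pi_prime, axis=1)
    pi = pi_prime * np.hstack([np.ones((num_groups, 1)), leftover[:, :-1]])
    return beta_weights, pi

beta_weights, pi = stick_breaking_hdp(gamma_=1.0, alpha_0=1.0, num_groups=3, T=20,
                                      rng=np.random.default_rng(0))
print(beta_weights.sum())   # at most 1; the shortfall is the truncated tail
print(pi.sum(axis=1))       # one total per group, each at most 1
```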
