In [17]:
import pandas as pd
import numpy as np

state_filename: str = "../../tests/test_data/train_0ec98113/mallet/state.mallet.gz"
state: pd.DataFrame = pd.read_csv(state_filename, compression='gzip', sep=' ', skiprows=[1,2] )

#state.columns = ['doc','source','pos','typeindex','type','topic']
#state.columns = ['document_id','source','pos','token_id','token','topic_id']
state.columns = ['document_id','source','pos','w','token','z']

vocab: dict = state[['w', 'token']].drop_duplicates().set_index('token',drop=True).w.to_dict()

state = state[["document_id", "w", "z", "token"]]
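To make the loading step above easier to follow, here is a minimal sketch that writes a tiny gzip file and reads it back with the same `read_csv` arguments. The file layout (a `#`-prefixed column header followed by two hyperparameter lines) is an assumption about how the state file looks, and the rows, file name, and the names `toy_state` and `toy_vocab` are made up for illustration.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical miniature state file (assumed layout): a '#'-prefixed column
# header, two hyperparameter lines, then one row per token occurrence.
raw = (
    "#doc source pos typeindex type topic\n"
    "#alpha : 0.5 0.5\n"
    "#beta : 0.01\n"
    "0 a.txt 0 0 apple 1\n"
    "0 a.txt 1 1 pear 0\n"
    "1 b.txt 0 0 apple 1\n"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_state.gz")
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(raw)
    # skiprows=[1, 2] drops the two hyperparameter lines but keeps the header.
    toy_state = pd.read_csv(path, compression="gzip", sep=" ", skiprows=[1, 2])

toy_state.columns = ["document_id", "source", "pos", "w", "token", "z"]

# Map each token string to its integer id, as done for `vocab` above.
toy_vocab = toy_state[["w", "token"]].drop_duplicates().set_index("token").w.to_dict()
print(toy_vocab)
```

Each remaining row is one token occurrence, so the frame has as many rows as there are words in the corpus.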


Legend:

| Symbol | Notation | Description | Probability | Dimension | Status |
| --- | --- | --- | --- | --- | --- |
| d in D | | A set of documents; each d holds document-term counts | | | known |
| W | | Vocabulary (unique tokens) | | | known |
| K | | Number of topics | | | unknown |
| z | | Topic assignment of every word in every document. From z we can derive the document topic mixture ($\theta$) and the word distribution ($\phi$) of each topic. | | | unknown |
| theta | $\theta$ | Each row is the distribution of topics in one document | $P(topic_k \| document_d)$ | $\|D\| \times K$ | |
| phi | $\phi$ | Each row is the distribution of words in one topic | $P(token_v \| topic_k)$ | $K \times \|W\|$ | |
| gamma | $\gamma$ | Distribution of topics for each token | $P(topic_k \| token_v)$ | | |
| alpha | $\alpha$ | Input parameter of the Dirichlet distribution used when inferring $\theta$, i.e. when sampling the *topic distribution of a document*. Prior information about a document's topic mixture. | | | |
| beta | $\beta$ | Input parameter of the Dirichlet distribution used when sampling the *word distribution of a given topic*. | | | |

The three most important matrices are theta, which gives $P(topic_k | document_d)$, phi, which gives $P(token_v | topic_k)$, and gamma, which gives $P(topic_k | token_v)$.
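As a rough illustration of how the three matrices relate to raw counts, the sketch below normalizes two small, made-up count matrices. It assumes plain count-based estimates with no smoothing from $\alpha$ or $\beta$, and the names `Nd_toy`, `Nk_toy`, `theta_toy`, `phi_toy` and `gamma_toy` are hypothetical.

```python
import numpy as np

# Hypothetical counts: 2 documents, 2 topics, 3 tokens (8 words in total).
Nd_toy = np.array([[3, 1],      # document-topic counts
                   [1, 3]])
Nk_toy = np.array([[2, 0, 2],   # topic-token counts
                   [1, 3, 0]])

# theta: normalize each document's row, giving P(topic | document).
theta_toy = Nd_toy / Nd_toy.sum(axis=1, keepdims=True)

# phi: normalize each topic's row, giving P(token | topic).
phi_toy = Nk_toy / Nk_toy.sum(axis=1, keepdims=True)

# gamma: normalize each token's column, then transpose so that
# row v holds P(topic | token v).
gamma_toy = (Nk_toy / Nk_toy.sum(axis=0, keepdims=True)).T

print(theta_toy)
print(phi_toy)
print(gamma_toy)
```

Note that phi and gamma come from the same count matrix; only the axis used for normalization differs.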


Let's get the ugly part out of the way: the parameters and variables that are going to be used in the model.

* <b>beta</b> ($\overrightarrow{\beta}$) : To determine $\phi$, the word distribution of a given topic, we sample from a Dirichlet distribution with $\overrightarrow{\beta}$ as the input parameter. What does this mean? The $\overrightarrow{\beta}$ values are our prior information about the word distribution in a topic. Example: I am creating a document generator to mimic other documents that have a topic labeled for each word in the doc. I can use the number of times each word was used for a given topic as the $\overrightarrow{\beta}$ values.

* <b>theta</b> ($\theta$) : The topic proportions of a given document. More importantly, it is the parameter of the multinomial distribution used to pick the topic of the next word. The selected topic's word distribution is then used to select a word _w_.

* <b>phi</b> ($\phi$) : The word distribution of each topic, i.e. the probability of each word in the vocabulary being generated when a given topic _z_ (z ranges from 1 to k) is selected.

* <b>xi</b> ($\xi$) : For a variable-length document, the document length is determined by sampling from a Poisson distribution with mean $\xi$.

* <b>k</b> : Topic index
* <b>z</b> : Topic selected for the next word to be generated. 
* <b>w</b> : Generated Word
* <b>d</b> : Current Document

Apart from the variables above, all the distributions should be familiar from the previous chapter.
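The roles of these variables can be sketched as a small document generator. This is an illustration under stated assumptions rather than the model used in this notebook: the sizes, the symmetric prior values, and the names `generate_document`, `alpha_demo`, `beta_demo` and `phi_demo` are all hypothetical.

```python
import numpy as np


def generate_document(rng, alpha, phi, xi):
    """Sample one document as a list of (topic, word id) pairs."""
    n_topics, vocab_size = phi.shape
    theta = rng.dirichlet(alpha)   # topic proportions of this document
    n_words = rng.poisson(xi)      # variable document length with mean xi
    pairs = []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta)      # topic of the next word
        w = rng.choice(vocab_size, p=phi[z])   # word drawn from that topic
        pairs.append((int(z), int(w)))
    return theta, pairs


rng = np.random.default_rng(0)
n_topics, vocab_size = 3, 6                  # hypothetical sizes
alpha_demo = np.full(n_topics, 0.5)          # assumed symmetric prior over topics
beta_demo = np.full(vocab_size, 0.5)         # assumed symmetric prior over words

# One word distribution per topic, each sampled from Dirichlet(beta).
phi_demo = rng.dirichlet(beta_demo, size=n_topics)

theta_demo, doc_demo = generate_document(rng, alpha_demo, phi_demo, xi=8)
print(theta_demo)
print(doc_demo)
```

Inference runs this process in reverse: the words are observed, and the topic assignments _z_ (and from them $\theta$ and $\phi$) are what has to be estimated.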

In [14]:

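# Ids are zero-based, so each size is the largest id plus one.
M = state["document_id"].max() + 1  # number of documents
V = state["w"].max() + 1            # vocabulary size
K = state["z"].max() + 1            # number of topics

Nd = np.zeros((M, K), dtype=int)  # document-topic counts
Nk = np.zeros((K, V), dtype=int)  # topic-token counts

# Optional list with one word position per document; None keeps every assignment.
pos = None

Z = np.zeros(M, dtype=int) if pos else []

for m in range(M):

    doc = state.loc[state["document_id"] == m]

    if pos:
        Z[m] = doc.iloc[pos[m]]["z"]

    # Each row unpacks in column order: document_id, w, z, token.
    for i, (d, w, z, token) in doc.iterrows():

        Nd[d, z] += 1
        Nk[z, w] += 1

        if pos is None:
            Z.append(z)

print(d, w, z, token)
# Normalize each row of counts into a probability distribution.
theta = Nd / Nd.sum(axis=1)[:, np.newaxis]
phi = Nk / Nk.sum(axis=1)[:, np.newaxis]
results = [theta, phi, Nd, Nk, Z]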
    

3 248 3 sabatini


In [7]:
phi

array([[0.        , 0.        , 0.01010101, ..., 0.        , 0.        ,
        0.        ],
       [0.11188811, 0.        , 0.        , ..., 0.        , 0.00699301,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.00763359],
       [0.        , 0.01980198, 0.        , ..., 0.00990099, 0.        ,
        0.        ]])
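A matrix like the one above is easier to read once the largest entries of each row are mapped back to tokens. The helper below is a hypothetical sketch (`top_tokens` is not part of the notebook), shown on a tiny made-up `phi` and vocabulary so that it runs on its own.

```python
import numpy as np


def top_tokens(phi, vocab, n=3):
    """Return the n most probable tokens of each topic (one list per row of phi)."""
    # vocab maps token -> id; invert it to look tokens up by column index.
    id2token = {token_id: token for token, token_id in vocab.items()}
    # Sort each row in descending order of probability and keep the first n ids.
    top_ids = np.argsort(-phi, axis=1)[:, :n]
    return [[id2token[int(i)] for i in row] for row in top_ids]


# Made-up example: 2 topics over a 3-token vocabulary.
phi_small = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6]])
vocab_small = {"a": 0, "b": 1, "c": 2}

print(top_tokens(phi_small, vocab_small, n=2))
```

With the real data, the same call would take the `phi` matrix and the `vocab` dictionary built earlier.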


In [3]:
import time

def get_results(state, pos=None):
    """
    Retrieve model parameters and topic indicator counts from the state dataframe.
    If pos is given, return topic indicators only for the word at position pos[m]
    of each document m (the count matrices are the same either way).
    """
    start = time.time()
    vocab = {}
    # Ids are zero-based, so each size is the largest id plus one.
    M = state["document_id"].max() + 1
    V = state["w"].max() + 1
    K = state["z"].max() + 1
    Nd = np.zeros((M, K), dtype=int)
    Nk = np.zeros((K, V), dtype=int)
    Z = np.zeros(M, dtype=int) if pos else []
    for m in range(M):
        doc = state.loc[state["document_id"] == m]
        if pos:
            Z[m] = doc.iloc[pos[m]]["z"]
        for i, row in doc.iterrows():
            d, w, z, token = row[["document_id", "w", "z", "token"]]
            Nd[d, z] += 1
            Nk[z, w] += 1
            if pos is None:
                Z.append(z)
            if token not in vocab:
                vocab[token] = w

    # Normalize each row of counts into a probability distribution.
    theta = Nd / Nd.sum(axis=1)[:, np.newaxis]
    phi = Nk / Nk.sum(axis=1)[:, np.newaxis]
    results = [theta, phi, Nd, Nk, Z]
    print(f'Results retrieved after {round(time.time()-start)} seconds.')
    return [results, vocab]

results, vocab = get_results(state)


