## Goal

I found an improved implementation of the anchor-word algorithm at https://aclanthology.org/D19-1504.pdf. I want to see how well it performs on a simulated dataset that slightly violates the "anchor-word" assumption.

In [1]:
import os
import sys
import pandas as pd
from scipy import sparse

import numpy as np
import matplotlib.pyplot as plt

script_dir = "../"
sys.path.append(os.path.abspath(script_dir))
from file2 import *
from factorize import *
from smallsim_functions import *
from misc import *


np.random.seed(123)

## Small, uncorrelated example

I simulate a count matrix from $L, F$, where each column of $F$ has 20 words that are 100 times more highly expressed than the remaining words. This does not satisfy the strict "anchor word" assumption, but it is close: $F_{s_k, k} \gg F_{s_k, l}, \forall l \neq k$, where $s_k$ is an "approximate" anchor word for topic $k$.
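
To make the setup concrete, here is a minimal sketch of one way such data could be generated. It is not the code behind `smallsim_independent`; the function name `simulate_approx_anchor`, the exponential baseline for $F$, the Dirichlet prior on $L$ (with concentration 0.3), and the multinomial sampling step are all assumptions made for illustration.

```python
import numpy as np

def simulate_approx_anchor(n=600, p=400, k=6, doc_len=50,
                           n_top=20, boost=100.0, seed=123):
    """Hypothetical stand-in for smallsim_independent (an assumption, not the notebook's code)."""
    rng = np.random.default_rng(seed)

    # Baseline word weights, one column per topic (assumed i.i.d. exponential).
    F = rng.exponential(scale=1.0, size=(p, k))

    # Choose n_top disjoint words per topic and multiply their weight by `boost`.
    id_m = rng.permutation(p)[: k * n_top].reshape(k, n_top)
    for j in range(k):
        F[id_m[j], j] *= boost

    # Normalize so each column of F is a distribution over the p words.
    F /= F.sum(axis=0, keepdims=True)

    # Document-topic loadings (assumed symmetric Dirichlet); rows sum to 1.
    L = rng.dirichlet(np.full(k, 0.3), size=n)

    # Per-document word probabilities, then multinomial word counts.
    P = L @ F.T
    X = np.vstack([rng.multinomial(doc_len, P[i]) for i in range(n)])
    return {"X": X, "L": L, "F": F, "id_m": id_m}

sim_sketch = simulate_approx_anchor()
print(sim_sketch["X"].shape, sim_sketch["F"].shape, sim_sketch["L"].shape)
```

Because the boosted words keep a small but nonzero weight in the other topics, none of them is an exact anchor word, which is the mild violation this notebook is meant to probe.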

In [2]:
n = 600
p = 400
k = 6
doc_len = 50

sim = smallsim_independent(n = n, p = p, k = k, doc_len = doc_len)
X = sparse.coo_matrix(sim["X"])
L = sim["L"]
F = sim["F"]
id_m = sim["id_m"]
C, D1, D2 = X2C(X)

[file.bows2C] Start constructing dense C...
- Counting the co-occurrence for each document...
+ Finish constructing C and D!
  - The sum of all entries = 1.000000
  - Elapsed Time = 0.0530 seconds


In [3]:
S, B, A, Btilde, Cbar, C_rowSums, diagR, C = factorizeC(C, K=k, rectifier='AP', optimizer='activeSet')


+ Start rectifying C...
+ Start alternating projection
  - 1-th iteration... (7.469974e-04 / 2.790099e-07)
  - 2-th iteration... (1.510219e-06 / 2.790142e-07)
  - 3-th iteration... (9.042341e-07 / 2.790178e-07)
  - 4-th iteration... (6.287480e-07 / 2.790207e-07)
  - 5-th iteration... (4.865718e-07 / 2.790230e-07)
  - 6-th iteration... (3.969262e-07 / 2.790250e-07)
  - 7-th iteration... (3.346263e-07 / 2.790267e-07)
  - 8-th iteration... (2.876316e-07 / 2.790282e-07)
  - 9-th iteration... (2.504964e-07 / 2.790295e-07)
  - 10-th iteration... (2.201664e-07 / 2.790307e-07)
  - 11-th iteration... (1.947226e-07 / 2.790318e-07)
  - 12-th iteration... (1.736961e-07 / 2.790328e-07)
  - 13-th iteration... (1.564211e-07 / 2.790337e-07)
  - 14-th iteration... (1.413543e-07 / 2.790345e-07)
  - 15-th iteration... (1.281040e-07 / 2.790352e-07)
+ Finish alternating projection
  - Elapsed seconds = 0.1108

  - Finish rectifying C! [0.110793]
+ Start finding the set of anchor bases S...
[inference.findS

## Evaluate results

In [4]:
topic_idx = match_topics(F, B).astype(int)
topic_idx

array([4, 0, 5, 3, 2, 1])

In [5]:
cand_set = set(np.unique(id_m))
[w in cand_set for w in S]

[True, True, True, True, False, True]

In [12]:
F[S[topic_idx],:].round(3)

array([[0.   , 0.   , 0.   , 0.   , 0.   , 0.001],
       [0.001, 0.062, 0.001, 0.   , 0.   , 0.   ],
       [0.001, 0.   , 0.038, 0.   , 0.001, 0.001],
       [0.001, 0.   , 0.   , 0.059, 0.   , 0.   ],
       [0.   , 0.   , 0.   , 0.   , 0.029, 0.   ],
       [0.001, 0.   , 0.   , 0.001, 0.   , 0.076]])
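
The approximate anchor condition $F_{s_k, k} \gg F_{s_k, l}$ can be checked row by row on the matrix printed above. The sketch below copies the rounded values from that output (so the comparison is crude at three decimals) and tests whether each diagonal entry exceeds the largest off-diagonal entry in its row; the variable names are illustrative.

```python
import numpy as np

# Rounded values copied from the output of F[S[topic_idx], :].round(3) above.
F_anchor = np.array([
    [0.   , 0.   , 0.   , 0.   , 0.   , 0.001],
    [0.001, 0.062, 0.001, 0.   , 0.   , 0.   ],
    [0.001, 0.   , 0.038, 0.   , 0.001, 0.001],
    [0.001, 0.   , 0.   , 0.059, 0.   , 0.   ],
    [0.   , 0.   , 0.   , 0.   , 0.029, 0.   ],
    [0.001, 0.   , 0.   , 0.001, 0.   , 0.076],
])

diag = np.diag(F_anchor)                          # F[s_k, k] for each topic k
off_max = (F_anchor - np.diag(diag)).max(axis=1)  # largest F[s_k, l] with l != k
dominant = diag > off_max

for t in range(len(diag)):
    print(f"topic {t}: diagonal={diag[t]:.3f}, max off-diagonal={off_max[t]:.3f}, dominant={dominant[t]}")
```

On these rounded values the selected word for the first topic is the only one that is not dominated by its own topic, which appears consistent with the single `False` in the candidate-set check.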

In [6]:
# compare A (rows and columns reordered to match the true topics) with L^T L / n
A_reorder = A[topic_idx,:]
A_reorder = A_reorder[:, topic_idx]
A_reorder.round(decimals=2)

array([[0.11, 0.01, 0.  , 0.  , 0.01, 0.02],
       [0.01, 0.15, 0.01, 0.01, 0.01, 0.01],
       [0.  , 0.01, 0.13, 0.02, 0.01, 0.01],
       [0.  , 0.01, 0.02, 0.12, 0.01, 0.01],
       [0.01, 0.01, 0.01, 0.01, 0.09, 0.01],
       [0.02, 0.01, 0.01, 0.01, 0.01, 0.12]])

In [7]:
L = sim["L"]
(L.T.dot(L)/n).round(decimals=2)

array([[0.1 , 0.01, 0.01, 0.01, 0.01, 0.01],
       [0.01, 0.13, 0.01, 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.11, 0.01, 0.01, 0.01],
       [0.01, 0.01, 0.01, 0.11, 0.01, 0.01],
       [0.01, 0.01, 0.01, 0.01, 0.09, 0.01],
       [0.01, 0.01, 0.01, 0.01, 0.01, 0.12]])
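
The two matrices above can also be compared numerically rather than by eye. This sketch uses the rounded values copied from the two outputs, so the result is only accurate to about two decimals; in the notebook itself the same lines could be applied to `A_reorder` and `L.T.dot(L)/n` directly.

```python
import numpy as np

# Rounded values copied from the outputs above.
A_reorder_shown = np.array([
    [0.11, 0.01, 0.  , 0.  , 0.01, 0.02],
    [0.01, 0.15, 0.01, 0.01, 0.01, 0.01],
    [0.  , 0.01, 0.13, 0.02, 0.01, 0.01],
    [0.  , 0.01, 0.02, 0.12, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.09, 0.01],
    [0.02, 0.01, 0.01, 0.01, 0.01, 0.12],
])
LtL_n_shown = np.array([
    [0.1 , 0.01, 0.01, 0.01, 0.01, 0.01],
    [0.01, 0.13, 0.01, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.11, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.11, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.09, 0.01],
    [0.01, 0.01, 0.01, 0.01, 0.01, 0.12],
])

abs_diff = np.abs(A_reorder_shown - LtL_n_shown)
max_abs_diff = abs_diff.max()
rel_frob_err = np.linalg.norm(abs_diff) / np.linalg.norm(LtL_n_shown)

print("max abs difference:", round(max_abs_diff, 3))
print("relative Frobenius error:", round(rel_frob_err, 3))
```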