## Goal

In practice, people often remove words that "don't contain much information", such as the most and least frequent words. This is claimed to give better models (the topics make more sense). Also, in a [previous experiment](https://github.com/zihao12/pyJSMF-RAW/blob/master/experiments/sla_multinomial1.ipynb), we saw that estimating the high-dimensional $C$ can be a challenge, and that poor estimation on those possibly unimportant coordinates leads to bad estimates of $F$ and $A$. Here I want to use simulation to see whether this is indeed the case.
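For readers unfamiliar with $C$: it is the word-word co-occurrence matrix. The sketch below shows one common way to build it from a document-by-word count matrix, averaging $(x_d x_d^\top - \mathrm{diag}(x_d)) / (n_d (n_d - 1))$ over documents. This is an assumption made for illustration and not necessarily what `X2C` in this repository does; the function name `cooccurrence` and the toy data are hypothetical.

```python
import numpy as np

def cooccurrence(X):
    """Illustrative word-word co-occurrence matrix (assumed construction).

    X is an (n_docs, n_words) array of counts. Each document contributes
    (x x^T - diag(x)) / (n_d * (n_d - 1)), so each contribution sums to 1.
    """
    X = np.asarray(X, dtype=float)
    n_d = X.sum(axis=1)                 # document lengths
    keep = n_d >= 2                     # a word pair needs at least 2 tokens
    X, n_d = X[keep], n_d[keep]
    w = 1.0 / (n_d * (n_d - 1))         # per-document normalizer
    XW = X * w[:, None]
    C = XW.T @ X - np.diag(XW.sum(axis=0))
    return C / X.shape[0]               # average over the kept documents

# Toy data (hypothetical): 50 documents, 20 words, 100 tokens each.
rng = np.random.default_rng(0)
X_demo = rng.multinomial(100, np.full(20, 1 / 20), size=50)
C_demo = cooccurrence(X_demo)
print(C_demo.shape, C_demo.sum())
```

Under this construction the entries of $C$ sum to one, which matches the "sum of all entries = 1.000000" line in the logs further down, and $C$ has $p^2$ entries for a vocabulary of size $p$.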

## Setting

There are $5000$ words in the dictionary, but only the first $80$ words contain structural information. I use three datasets, $X$, $X_{\text{small}}$ and $X_{\text{mid}}$, where $X$ is the full data, $X_{\text{small}}$ contains exactly those $80$ words, and $X_{\text{mid}}$ contains the first $500$ words. Then I compare the estimated topic-topic matrix $A$ from each fit.

## Result
Not surprisingly, the quality of the estimates ranks as (bigger is better) $X < X_{\text{mid}}, X_{\text{small}}$. In fact, the fit using $X$ ignores the correlation among topics: its estimated $A$ is diagonal to two decimal places (why is that?).

## Thoughts
Removing words in an ad hoc way is not optimal. But the result shows the potential benefit of imposing, when estimating $C$, the assumption that most words are background words (so that $C$'s "effective dimension" is reduced).
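To make the "effective dimension" point concrete, here is a rough count of how many entries $C$ has for each vocabulary size, set against the number of ordered word pairs observed in the full data. It uses the simulation sizes from this notebook ($n = 1000$, `doc_len` $= 100$) and assumes every document has exactly `doc_len` tokens, which is an assumption about the simulator.

```python
# Sizes taken from this notebook; equal document lengths are an assumption.
n, doc_len = 1000, 100

# Ordered pairs of distinct token positions within one document.
pairs_per_doc = doc_len * (doc_len - 1)
total_pairs = n * pairs_per_doc

# Number of entries in C (p x p) for each vocabulary size.
vocab_sizes = {"X": 5000, "X_mid": 500, "X_small": 80}
entries = {name: p * p for name, p in vocab_sizes.items()}

print(f"ordered word pairs in the full data: {total_pairs:,d}")
for name, n_entries in entries.items():
    print(f"{name:>8}: C has {n_entries:,d} entries")
```

With the full vocabulary, $C$ has more entries than there are observed word pairs, so most entries are estimated from very few co-occurrences. The counts for the two subsets are only the sizes of $C$; restricting the vocabulary also shortens each document, so the pair count above does not carry over to them.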

In [1]:
import os
import sys
import pandas as pd
from scipy import sparse

import numpy as np
import matplotlib.pyplot as plt

script_dir = "../"
sys.path.append(os.path.abspath(script_dir))
from file2 import *
from factorize import *
from smallsim_functions4 import *
from misc import *


np.random.seed(123)

In [2]:
n = 1000
p = 5000
k = 4
doc_len = 100

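# Simulate counts; Xsmall below keeps the first p0 words (the structural ones)
X, Atrue, Ftrue, p0 = smallsim_independent(n = n, p = p, k = k, doc_len = doc_len)

# Take the subsets before filtering X, so the column indices still refer to the original dictionary
Xsmall = X[:,:p0]
Xmid = X[:,:500]

# Drop words that never appear in each dataset
w_idx = np.where(X.sum(axis = 0) > 0)[0]
X = X[:,w_idx]

w_idx_small = np.where(Xsmall.sum(axis = 0) > 0)[0]
Xsmall = Xsmall[:,w_idx_small]

w_idx_mid = np.where(Xmid.sum(axis = 0) > 0)[0]
Xmid = Xmid[:,w_idx_mid]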

## Fit with $X$

In [3]:
C, _, _ = X2C(sparse.coo_matrix(X))
S, B, A, Btilde, Cbar, C_rowSums, diagR, C = factorizeC(C, K=k, rectifier='AP', optimizer='activeSet')


[file.bows2C] Start constructing dense C...
- Counting the co-occurrence for each document...
+ Finish constructing C and D!
  - The sum of all entries = 1.000000
  - Elapsed Time = 0.8630 seconds
+ Start rectifying C...
+ Start alternating projection
  - 1-th iteration... (3.108002e-04 / 4.829844e-08)
  - 2-th iteration... (1.651053e-07 / 4.829848e-08)
  - 3-th iteration... (1.218360e-07 / 4.829853e-08)
  - 4-th iteration... (9.207829e-08 / 4.829858e-08)
  - 5-th iteration... (7.075159e-08 / 4.829862e-08)
  - 6-th iteration... (5.529315e-08 / 4.829866e-08)
  - 7-th iteration... (4.404138e-08 / 4.829869e-08)
  - 8-th iteration... (3.593319e-08 / 4.829871e-08)
  - 9-th iteration... (3.011415e-08 / 4.829873e-08)
  - 10-th iteration... (2.592578e-08 / 4.829875e-08)
  - 11-th iteration... (2.287429e-08 / 4.829877e-08)
  - 12-th iteration... (2.059198e-08 / 4.829878e-08)
  - 13-th iteration... (1.883374e-08 / 4.829879e-08)
  - 14-th iteration... (1.744964e-08 / 4.829881e-08)
  - 15-th itera

In [4]:
Atrue.round(2)

array([[0.17, 0.03, 0.02, 0.02],
       [0.03, 0.19, 0.02, 0.02],
       [0.02, 0.02, 0.18, 0.02],
       [0.02, 0.02, 0.02, 0.18]])

In [5]:
A.round(2)

array([[0.32, 0.  , 0.  , 0.  ],
       [0.  , 0.29, 0.  , 0.  ],
       [0.  , 0.  , 0.3 , 0.  ],
       [0.  , 0.  , 0.  , 0.3 ]])

## Fit with $X_{\text{small}}$

In [6]:
Csmall, _, _ = X2C(sparse.coo_matrix(Xsmall))
Ssmall, Bsmall, Asmall, _, _, _, _, Csmall = factorizeC(Csmall, K=k, rectifier='AP', optimizer='activeSet')

[file.bows2C] Start constructing dense C...
- Counting the co-occurrence for each document...
+ Finish constructing C and D!
  - The sum of all entries = 1.000000
  - Elapsed Time = 0.0924 seconds
+ Start rectifying C...
+ Start alternating projection
  - 1-th iteration... (1.034361e-03 / 5.349512e-07)
  - 2-th iteration... (1.577341e-07 / 5.349518e-07)
  - 3-th iteration... (5.533484e-09 / 5.349519e-07)
  - 4-th iteration... (1.941216e-10 / 5.349519e-07)
  - 5-th iteration... (6.810031e-12 / 5.349519e-07)
  - 6-th iteration... (2.389036e-13 / 5.349519e-07)
  - 7-th iteration... (8.382330e-15 / 5.349519e-07)
  - 8-th iteration... (2.940625e-16 / 5.349519e-07)
  - 9-th iteration... (1.558816e-17 / 5.349519e-07)
  - 10-th iteration... (1.129429e-17 / 5.349519e-07)
  - 11-th iteration... (1.541077e-17 / 5.349519e-07)
  - 12-th iteration... (1.410298e-17 / 5.349519e-07)
  - 13-th iteration... (1.048977e-17 / 5.349519e-07)
  - 14-th iteration... (1.011943e-17 / 5.349519e-07)
  - 15-th itera

## Fit with $X_{\text{mid}}$

In [7]:
Cmid, _, _ = X2C(sparse.coo_matrix(Xmid))
Smid, Bmid, Amid, _, _, _, _, Cmid = factorizeC(Cmid, K=k, rectifier='AP', optimizer='activeSet')

[file.bows2C] Start constructing dense C...
- Counting the co-occurrence for each document...
+ Finish constructing C and D!
  - The sum of all entries = 1.000000
  - Elapsed Time = 0.1075 seconds
+ Start rectifying C...
+ Start alternating projection
  - 1-th iteration... (8.616572e-04 / 3.712268e-07)
  - 2-th iteration... (1.922946e-07 / 3.712268e-07)
  - 3-th iteration... (1.457203e-07 / 3.712269e-07)
  - 4-th iteration... (1.192272e-07 / 3.712270e-07)
  - 5-th iteration... (1.003734e-07 / 3.712270e-07)
  - 6-th iteration... (8.669674e-08 / 3.712271e-07)
  - 7-th iteration... (7.555509e-08 / 3.712272e-07)
  - 8-th iteration... (6.697047e-08 / 3.712272e-07)
  - 9-th iteration... (6.052318e-08 / 3.712273e-07)
  - 10-th iteration... (5.520723e-08 / 3.712273e-07)
  - 11-th iteration... (5.071830e-08 / 3.712274e-07)
  - 12-th iteration... (4.683560e-08 / 3.712274e-07)
  - 13-th iteration... (4.344543e-08 / 3.712275e-07)
  - 14-th iteration... (4.045788e-08 / 3.712275e-07)
  - 15-th itera

## Compare estimation of $A$

In [8]:
print(Atrue.round(2))
print(Atrue.sum(axis = 0).round(2))

[[0.17 0.03 0.02 0.02]
 [0.03 0.19 0.02 0.02]
 [0.02 0.02 0.18 0.02]
 [0.02 0.02 0.02 0.18]]
[0.24 0.26 0.25 0.25]


In [9]:
print(A.round(2))
print(A.sum(axis = 0).round(2))

[[0.32 0.   0.   0.  ]
 [0.   0.29 0.   0.  ]
 [0.   0.   0.3  0.  ]
 [0.   0.   0.   0.3 ]]
[0.32 0.29 0.3  0.3 ]


In [10]:
print(Amid.round(2))
print(Amid.sum(axis = 0).round(2))

[[0.18 0.   0.01 0.  ]
 [0.   0.21 0.   0.03]
 [0.01 0.   0.29 0.01]
 [0.   0.03 0.01 0.22]]
[0.19 0.24 0.31 0.27]


In [11]:
print(Asmall.round(2))
print(Asmall.sum(axis = 0).round(2))

[[0.17 0.02 0.02 0.02]
 [0.02 0.2  0.01 0.02]
 [0.02 0.01 0.21 0.03]
 [0.02 0.02 0.03 0.18]]
[0.23 0.26 0.27 0.25]
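
The comparison above is by eye. A minimal sketch of a numeric summary is the Frobenius distance between each estimate and `Atrue`. The matrices below are copied from the rounded printouts (two decimals only), and the sketch assumes the estimated topics are already in the same order as the true ones; the names `A_full`, `A_mid`, `A_small` and `frob_error` are introduced here for illustration.

```python
import numpy as np

# Values copied from the rounded printouts above (two decimals only).
A_true = np.array([[0.17, 0.03, 0.02, 0.02],
                   [0.03, 0.19, 0.02, 0.02],
                   [0.02, 0.02, 0.18, 0.02],
                   [0.02, 0.02, 0.02, 0.18]])
A_full = np.diag([0.32, 0.29, 0.30, 0.30])
A_mid = np.array([[0.18, 0.00, 0.01, 0.00],
                  [0.00, 0.21, 0.00, 0.03],
                  [0.01, 0.00, 0.29, 0.01],
                  [0.00, 0.03, 0.01, 0.22]])
A_small = np.array([[0.17, 0.02, 0.02, 0.02],
                    [0.02, 0.20, 0.01, 0.02],
                    [0.02, 0.01, 0.21, 0.03],
                    [0.02, 0.02, 0.03, 0.18]])

def frob_error(A_hat, A_ref):
    """Frobenius norm of the difference; assumes topics are already aligned."""
    return np.linalg.norm(A_hat - A_ref, ord="fro")

estimates = {"full": A_full, "mid": A_mid, "small": A_small}
errors = {name: frob_error(A_hat, A_true) for name, A_hat in estimates.items()}
for name, err in errors.items():
    print(f"{name:>6}: {err:.3f}")
```

On these rounded values the error is smallest for $X_{\text{small}}$ and largest for $X$. In the notebook itself one would pass `A`, `Amid` and `Asmall` directly instead of the copied values, and permute the topics first if the fits return them in a different order.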
