---
title: "flashier_sla_text"
author: "Matthew Stephens"
date: "2023-10-17"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---

## Introduction

I want to try running flashier (non-negative) on some text data and see what happens. It is also a chance to try out the release of flashier to CRAN.

I tried running flashier both on the log1p-transformed counts directly and on the log1p transform of the fitted values from a topic model. Both produce somewhat promising results. It is hard to beat the log1p transform for simplicity and speed.

```{r}
library(Matrix)
library(readr)
library(tm)
library(fastTopics)
library(flashier)
library(ebpmf)
# paper metadata; drop papers with missing abstracts
sla <- read_csv("../../gsmash/data/SLA/SCC2016/Data/paperList.txt")
sla <- sla[!is.na(sla$abstract),]
sla$docnum = 1:nrow(sla)
# document-by-word count matrix
datax = readRDS('../../gsmash/data/sla_full.rds')
dim(datax$data)
sum(datax$data==0)/prod(dim(datax$data)) # proportion of zero entries
datax$data = Matrix(datax$data,sparse = TRUE)
```

## Data filtering

Filter out some documents: keep the top 60% longest ones, as in Ke and Wang (2022).

```{r}
# keep the 60% of documents with the largest total word counts
doc_to_use = order(rowSums(datax$data),decreasing = T)[1:round(nrow(datax$data)*0.6)]
mat = datax$data[doc_to_use,]
sla = sla[doc_to_use,]
samples = datax$samples
samples = lapply(samples, function(z){z[doc_to_use]})
```

Filter out words that appear in fewer than 5 documents. This leaves around 2000 words.

```{r}
word_to_use = which(colSums(mat>0)>=5)
mat = mat[,word_to_use]
mat = Matrix(mat,sparse=TRUE)
lmat = Matrix(log(mat+1),sparse=TRUE)
```
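
A quick check on the dimensions after filtering (documents by words):

```{r}
dim(mat)
```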

## Fit log1p counts

Here I take the log(mat+1) transform and fit flashier with point-exponential (non-negative) priors on both loadings and factors.

```{r}
set.seed(1)
fit.nn = flash(lmat,ebnm_fn = c(ebnm::ebnm_point_exponential,ebnm::ebnm_point_exponential),var_type=2,greedy_Kmax = 200)
```
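
Before looking at keywords, a quick summary of the fit: how many factors the greedy procedure kept, and each factor's proportion of variance explained (using the standard `n_factors` and `pve` fields of the flashier fit object):

```{r}
# number of factors retained by the greedy fit
fit.nn$n_factors
# proportion of variance explained by each factor
round(fit.nn$pve, 3)
```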

Look at the keywords for each factor. There are many single-word factors, and fewer additional factors than I would have expected.

```{r}
get_keywords = function(fit.nn){
  L = fit.nn$L_pm
  F_pm = fit.nn$F_pm
  if(is.null(L)){ # allows dealing with an ebpmf fit
    L = fit.nn$udv$u
    F_pm = fit.nn$udv$v
  }
  rownames(L) <- 1:nrow(L)
  # rescale each loading column to have maximum 1, absorbing the scale into F
  Lnorm = t(t(L)/apply(L,2,max))
  Fnorm = t(t(F_pm)*apply(L,2,max))
  # dominant factor per document (0 if no loading exceeds 0.1);
  # computed but not returned
  khat = apply(Lnorm,1,which.max)
  Lmax = apply(Lnorm,1,max)
  khat[Lmax<0.1] = 0
  # keywords for factor k: words whose rescaled factor value exceeds log(2),
  # ordered by decreasing value (uses colnames(mat) from the global environment)
  keyw.nn = list()
  for(k in 1:ncol(Fnorm)){
    key = Fnorm[,k] > log(2)
    keyw.nn[[k]] = (colnames(mat)[key])[order(Fnorm[key,k],decreasing = T)]
  }
  return(keyw.nn)
}
print(get_keywords(fit.nn))
```

Look at fitted values: the fit seems to miss quite a lot of the structure in the data. I think this is partly because the low counts add noise that the model does not handle well.

```{r}
fv= fitted(fit.nn)
sub = sample(1:length(fv),100000)
plot(lmat[sub],fv[sub])
```
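
As a rough numerical summary (just a heuristic on the same sampled entries, not a formal measure of fit):

```{r}
# correlation between observed log1p counts and fitted values on the sample
cor(as.numeric(lmat[sub]), fv[sub])
```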

## Topic model

Here I fit a topic model; the log of its fitted values seems to do a better job of fitting the data.

```{r}
fit_nmf_k50 = fit_poisson_nmf(mat,k=50)
# Poisson means implied by the topic model fit
fvals.nmf = fit_nmf_k50$L %*% t(fit_nmf_k50$F)
plot(lmat[sub],log(fvals.nmf[sub]+1))
```
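
As a rough aggregate measure of how well the topic model fits, fastTopics provides per-document Poisson log-likelihoods via `loglik_poisson_nmf`; summing them gives one number to track across fits (a quick check, not a formal comparison):

```{r}
# sum of per-document Poisson log-likelihoods under the k=50 fit
sum(loglik_poisson_nmf(mat, fit_nmf_k50))
```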

Here I tried running flash on the log1p of the fitted values from the topic model.

```{r}
set.seed(1)
fit.nn.nmf = flash(log(fvals.nmf+1),
                   ebnm_fn = c(ebnm::ebnm_point_exponential, ebnm::ebnm_point_exponential),
                   var_type = 2, greedy_Kmax = 200)
plot(log(fvals.nmf+1)[sub],fitted(fit.nn.nmf)[sub])
print(get_keywords(fit.nn.nmf))
```
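
A quick side-by-side of how many factors each flash fit retained (again using the `n_factors` field):

```{r}
# factors kept: log1p counts vs log1p of topic-model fitted values
c(log1p_counts = fit.nn$n_factors, log1p_nmf = fit.nn.nmf$n_factors)
```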

I'm still a bit surprised the topic model fit isn't closer to the data. Some large word counts are fit quite poorly.

```{r}
plot(mat[sub],fvals.nmf[sub])
```
Here I increased the number of topics to 100 to try to get a better fit.
```{r}
fit_nmf_k100 = fit_poisson_nmf(mat,k=100,init.method="random")
fvals.nmf.k100 = fit_nmf_k100$L %*% t(fit_nmf_k100$F)
plot(mat[sub],fvals.nmf.k100[sub])
plot(log(1+mat[sub]),log(1+fvals.nmf.k100[sub]))
```

I tried fitting flash to the transformed values again. This seems promising. Maybe we should fit a topic model with even larger k? Or experiment with other pseudocounts?

```{r}
set.seed(1)
fit.nn.nmf.k100 = flash(log(fvals.nmf.k100+1),
                        ebnm_fn = c(ebnm::ebnm_point_exponential, ebnm::ebnm_point_exponential),
                        var_type = 2, greedy_Kmax = 200)
plot(log(1+mat[sub]),fitted(fit.nn.nmf.k100)[sub])
print(get_keywords(fit.nn.nmf.k100))
```

## Change pseudocount

Here I wanted to see what happens if I change the pseudocount. At first I tried 0.001 after looking at some plots, but it was very different from 1, so I tried 0.1. It is still, interestingly, quite different from 1: many more keywords per topic. Note that log(1 + f/c) = log(c + f) - log(c), so the construction below is the pseudocount-c transform shifted by a constant to keep all entries non-negative.

```{r}
plot(log(1+mat[sub]),log(0.001+fvals.nmf.k100[sub]))
set.seed(1)
# pseudocount 0.1, shifted so all entries are non-negative
Y = Matrix(log(1+fvals.nmf.k100/0.1))
fit.nn.nmf.k100.001 = flash(Y,
                            ebnm_fn = c(ebnm::ebnm_point_exponential, ebnm::ebnm_point_exponential),
                            var_type = 2, greedy_Kmax = 100)
print(get_keywords(fit.nn.nmf.k100.001))
```
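
A quick sanity check of the shift identity behind the construction of Y above:

```{r}
# log(1 + f/c) = log(c + f) - log(c): the pseudocount-c transform,
# shifted by a constant so that all entries are non-negative
f = c(0, 0.05, 1, 10)
c0 = 0.1
all.equal(log(1 + f/c0), log(c0 + f) - log(c0))
```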

## Anscombe transform

Try the Anscombe transformation, sqrt(x + 3/8).

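For reference, sqrt(x + 3/8) approximately stabilizes the variance of Poisson counts at 1/4 once the mean is not too small; a quick simulation check (an illustration, not part of the analysis):

```{r}
# variance of sqrt(X + 3/8) for Poisson X at several means;
# it should approach 1/4 as the mean grows
set.seed(1)
sapply(c(1, 5, 20, 100), function(lam) var(sqrt(rpois(1e5, lam) + 3/8)))
```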
```{r}
fit.nn.a = flash(sqrt(mat+3/8),
                 ebnm_fn = c(ebnm::ebnm_point_exponential, ebnm::ebnm_point_exponential),
                 var_type = 2, greedy_Kmax = 200)
print(get_keywords(fit.nn.a))
```
```{r}
fv= fitted(fit.nn.a)
sub = sample(1:length(fv),100000)
plot(sqrt(mat+3/8)[sub],fv[sub])
```