# In this notebook
This notebook validates the output of the VAE to check that the reconstruction *generally* looks good.

Specific model parameters analyzed:
* Count vector: minimum document count 100 (`min_df=100`), maximum document frequency 0.1 (`max_df=0.1`) (~/wildfires/data/san_francisco/cached/min100max_01/count_vec.joblib)
* Number of topics: 20

In [1]:
import sys
import joblib
import numpy as np
import torch
sys.path.append("../scripts/")
import vae

In [2]:
model = torch.load('../scripts/model/model_3epoch.pt')

In [3]:
cv = joblib.load('../data/san_francisco/cached/count_vec.joblib')

In [4]:
tweets = vae.Tweets('../data/san_francisco/')
x_test = tweets.load(test=True)

Loading in the data...


  tweets = self._load_data()


Cached file was found...loading lemmatized tweets from the cache.
Creating the count vector


In [5]:
model = vae.VAE(tweets.vocab_size)

In [6]:
model.load_state_dict(torch.load('../scripts/model/model_3epoch.pt'))

<All keys matched successfully>

In [7]:
model.eval()

test_doc = x_test[0] 
recon = model(test_doc)
s, W, mu, logvar = recon

In [55]:
s

tensor([[4.5883e-07, 6.2493e-15, 7.7969e-16, 1.9733e-14, 1.4801e-12, 3.0517e-14,
         1.2722e-14, 5.9346e-14, 2.1961e-16, 7.7079e-15, 2.2167e-14, 5.0431e-16,
         2.3461e-15, 1.1333e-13, 5.6398e-15, 2.8376e-13, 3.6686e-15, 2.3433e-15,
         2.8595e-15, 1.1051e+00]], grad_fn=<SoftplusBackward0>)

In [19]:
wtm = model.W_tilde

In [20]:
wtm.requires_grad = False

In [38]:
words = cv.get_feature_names_out()

In [47]:
def get_top_words(word_topic, num_words):
    top_words = torch.argsort(word_topic, descending=True)[:, :num_words]
    for i, topic in enumerate(top_words):
        print(f"Topic {i}:")
        print("\t"+"\n\t".join(words[topic]))
        print()
get_top_words(wtm, 20)

Topic 0:
	editing
	bestie
	telegraph
	election
	passive
	weave
	attendee
	iso
	block
	daughter
	soul
	iced
	laker
	yah
	candle
	boarding
	hump
	roughly
	irresponsible
	meal

Topic 1:
	murderer
	cruelty
	shed
	prince
	excel
	comparable
	mature
	boy
	jalen
	okay
	packaging
	juicy
	pup
	fade
	sibling
	medium
	barista
	canadian
	debug
	thanks

Topic 2:
	raid
	harris
	wrong
	ignorance
	diary
	somewhat
	dramatically
	kkk
	nazis
	ml
	favorite
	shred
	dictatorship
	sweaty
	upside
	perfect
	mark
	manufacturer
	nfl
	various

Topic 3:
	baseball
	lay
	shady
	record
	claim
	absolutely
	memory
	biz
	reset
	foreal
	sweep
	leaf
	difficulty
	loose
	walgreen
	skull
	reserve
	barkley
	rodger
	changer

Topic 4:
	killing
	hope
	dealer
	tax
	vicious
	democracy
	viking
	channel
	unusual
	dray
	loan
	photo
	oppress
	fuk
	rope
	wound
	dan
	walgreen
	cashier
	focus

Topic 5:
	fil
	homework
	doc
	mrs
	carve
	rhythm
	helicopter
	swoop
	shrug
	bone
	breeze
	packet
	popularity
	atrocious
	disrespectful
	liz
	frank


In [50]:
dates = x_test.dates

In [60]:
s.detach().numpy().flatten()



array([4.5883192e-07, 6.2492553e-15, 7.7969256e-16, 1.9733184e-14,
       1.4801237e-12, 3.0517452e-14, 1.2721615e-14, 5.9346157e-14,
       2.1961304e-16, 7.7078567e-15, 2.2167454e-14, 5.0431458e-16,
       2.3461409e-15, 1.1332620e-13, 5.6398338e-15, 2.8375937e-13,
       3.6686361e-15, 2.3433412e-15, 2.8594699e-15, 1.1051416e+00],
      dtype=float32)

In [62]:
topic_loadings = []

for i, d in enumerate(dates):
    
    # Get the word vector for document i
    word_vec = x_test[i]
    word_vec.requires_grad = False
    # Pass the vector through the model
    s, W, mu, logvar = model(word_vec)
    
    topic_loadings.append(s.detach().numpy().flatten())

In [73]:
# argsort is ascending, so columns 1:6 are the 2nd- through 6th-smallest loadings per document
np.argsort(topic_loadings, axis=1)[:,1:6]

array([[ 2, 11, 12,  6,  9],
       [ 2, 12, 18, 11, 14],
       [ 2, 18, 11, 12,  9],
       ...,
       [12, 11,  2, 14, 18],
       [ 2, 11, 12, 17, 18],
       [ 2, 18, 11, 14,  6]])
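
`np.argsort` sorts in ascending order, so a slice such as `[:, 1:6]` picks up low-ranked columns rather than the largest loadings. A minimal sketch of selecting the top-k topics per document follows; the toy array and the helper name `top_k_topics` are illustrative assumptions, not part of the `vae` module.

```python
import numpy as np

def top_k_topics(loadings, k):
    """Return indices of the k largest loadings per row, largest first."""
    loadings = np.asarray(loadings)
    # Negate so that the ascending argsort yields descending loadings.
    order = np.argsort(-loadings, axis=1)
    return order[:, :k]

# Toy loadings for 2 documents over 4 topics (made-up values).
toy = np.array([[0.1, 0.7, 0.0, 0.2],
                [0.5, 0.1, 0.3, 0.1]])
top2 = top_k_topics(toy, 2)
print(top2)
```

Applied to `topic_loadings`, the same call would list the dominant topics for each test document instead of the weakest ones.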

In [74]:
mu

tensor([[-15.3875, -33.9825, -35.7263, -34.0246, -29.6566, -34.0913, -34.4844,
         -33.0297, -37.6620, -34.4011, -31.7631, -35.4285, -34.4079, -30.0886,
         -35.1199, -31.5863, -34.1912, -34.2304, -35.5713,  -0.1589]],
       grad_fn=<MmBackward0>)

Interesting...so when there are not enough topics, all of the words seem to get lumped into one topic: the loading `s` is about 1.1 on topic 19 and effectively zero everywhere else.
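
One way to quantify this lumping is the share of each document's total loading carried by its largest topic. A rough sketch, assuming the loadings are non-negative with a positive row sum (as a softplus output would be); `dominant_share` is a hypothetical helper and the input values are made up.

```python
import numpy as np

def dominant_share(loadings):
    """Fraction of each row's total loading that sits on its largest topic."""
    loadings = np.asarray(loadings, dtype=float)
    totals = loadings.sum(axis=1)  # assumed to be strictly positive
    return loadings.max(axis=1) / totals

# Made-up example: one collapsed document, one evenly spread document.
toy = np.array([[1e-7, 1e-14, 1.1],
                [0.3, 0.3, 0.4]])
shares = dominant_share(toy)
print(shares)
```

A share close to 1 for most documents would point to the collapse described above, while values near `1 / n_topics` would indicate evenly spread loadings.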

In [118]:
# Make sure sensible words are in our corpus
words_to_check = [
    'cough',
    'lung',
    'eye',
    'itch',
    'fire',
    'wildfire',
    'smoke',
    'hurt'
]

for w in words_to_check:
    
    print(w, ': ', w in words)

cough :  True
lung :  True
eye :  True
itch :  True
fire :  False
wildfire :  True
smoke :  True
hurt :  True
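
The same check can be written to return only the missing words, which is handier when the candidate list grows. A small sketch, assuming the vocabulary is any iterable of strings (for example the array returned by `cv.get_feature_names_out()`); `missing_words` and the toy vocabulary are illustrative.

```python
def missing_words(vocabulary, candidates):
    """Return the candidate words that are absent from the vocabulary."""
    vocab = set(vocabulary)  # set membership avoids rescanning the vocabulary
    return [w for w in candidates if w not in vocab]

# Toy vocabulary standing in for the real feature names (made-up subset).
toy_vocab = ["cough", "lung", "smoke", "wildfire"]
absent = missing_words(toy_vocab, ["cough", "fire", "smoke"])
print(absent)
```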


In [137]:
# What are the most probable words in this reconstruction?
for i, v in enumerate(reversed(np.argsort(recon_vals))):
    print(words[v])
    if i > 20:
        break

mexican
keeper
jeopardy
attack
florida
spanish
dem
cater
mullen
tip
mute
dreamforce
sacrifice
max
hacker
hitter
specific
lego
module
sibling
contrary
transportation


In [123]:
# What are the most probable words in the original (input) test document?
for i, v in enumerate(reversed(np.argsort(test_doc))):
    print(words[v])
    if i > 20:
        break

just
good
know
like
thank
love
make
people
game
want
say
right
great
look
today
way
think
hit
don
work
time
day


In [138]:
# Least probable words in the reconstruction
for i, v in enumerate(np.argsort(recon_vals)):
    print(words[v])
    if i > 20:
        break

japanese
appt
francis
doom
oil
cowardly
misogynistic
patent
employment
dough
fraud
fit
outing
outfit
wood
quite
omarosa
hahaha
rewatch
title
prob
surgeon


In [125]:
# What are the least probable words in the original (input) test document?
for i, v in enumerate(np.argsort(test_doc)):
    print(words[v])
    if i > 20:
        break

aa
organic
organ
org
oreo
oregon
ordinary
orbit
oracle
optional
optimize
optimistic
optimism
opt
oppression
oppress
opposition
oppose
opportunity
opponent
opioid
opinion


In [111]:
# Min probability of a word being in this corpus. Note: this uses a count of 1000;
# with min_df=100 the value would be 100/(1.8*10**6), about 5.6e-05
print("Min: ", 1000/(1.8*10**6))

# Max prob of a word being in this corpus with max_df=0.1
print("Max: ", 0.1)


Min:  0.0005555555555555556
Max:  0.1


In [77]:
1.8*10**6*0.05

90000.0
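
The arithmetic in these cells converts between a document count and a fraction of the corpus. A small sketch of that conversion, assuming a corpus of about 1.8 million tweets as used in the cells above; `df_fraction` and `df_count` are hypothetical helper names.

```python
def df_fraction(doc_count, n_docs):
    """Fraction of documents containing a word that appears in doc_count documents."""
    return doc_count / n_docs

def df_count(fraction, n_docs):
    """Number of documents corresponding to a document-frequency fraction."""
    return fraction * n_docs

N_DOCS = 1_800_000  # corpus size assumed from the cells above

low = df_fraction(100, N_DOCS)  # a word that appears in 100 tweets
high = df_count(0.1, N_DOCS)    # the document count at a 10% cutoff
print(low, high)
```

Keeping both directions in one place makes it easier to compare a count-based lower cutoff with a fraction-based upper cutoff on the same scale.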

In [144]:
mu

tensor([[-15.7212, -34.9691, -36.0413, -35.4208, -30.5385, -33.4864, -35.0444,
         -33.2605, -37.6280, -35.3061, -32.4391, -37.4034, -37.1712, -30.5614,
         -37.9732, -32.9141, -35.1267, -34.4124, -35.7661,   0.3856]],
       grad_fn=<MmBackward0>)

In [148]:
torch.exp(logvar)

tensor([[2.2437e-06, 2.2084e-14, 3.6074e-16, 3.9704e-15, 1.1269e-12, 9.9991e-15,
         7.4339e-16, 2.9564e-15, 3.2217e-14, 1.4587e-15, 2.1465e-15, 1.1409e-14,
         3.5717e-15, 4.5019e-15, 7.6575e-15, 2.9934e-14, 3.7548e-16, 1.6810e-15,
         3.8571e-15, 2.8558e-14]], grad_fn=<ExpBackward0>)

## Conclusion: 

Looks like this specific model is probably doing OK, but some words are perhaps too infrequent or too frequent.

We should look at words that only appear in at least every 20 or 50 tweets, but not more than 1000, so that they have a probability of showing up enough, but not too much.

Also, it looks like we are getting strongly negative values of `mu` (around -15 to -38 for all but one topic) with very small variances (`exp(logvar)` below 1e-5). This doesn't make sense. We need to constrain `mu` to be positive, with small variances.
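
For reference when reviewing the KL term: if the model uses a diagonal Gaussian posterior with a standard normal prior (an assumption, since the prior used in the `vae` module is not shown here), the closed-form KL divergence per dimension is `0.5 * (mu**2 + exp(logvar) - logvar - 1)`. That expression penalizes the magnitude of `mu` regardless of sign, so on its own it would not push `mu` to be positive. A plain-Python sketch with a hypothetical helper `gaussian_kl`:

```python
import math

def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over dimensions."""
    return sum(
        0.5 * (m * m + math.exp(lv) - lv - 1.0)
        for m, lv in zip(mu, logvar)
    )

# The KL is zero when the posterior equals the prior...
kl_zero = gaussian_kl([0.0, 0.0], [0.0, 0.0])
# ...and it is symmetric in the sign of mu.
kl_neg = gaussian_kl([-2.0], [0.0])
kl_pos = gaussian_kl([2.0], [0.0])
print(kl_zero, kl_neg, kl_pos)
```

If positive means are required, that constraint would have to come from the encoder's output activation or from a different prior rather than from this term.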

TODO:
 - [ ] Change the count vector parameters.
 - [ ] Review the KL divergence...shouldn't these `mu` values be positive?
 - [ ] Add more components to the topic matrix. 20 seems too small...how about 50?