The goal of this project is to learn the latent variables behind tweets. To do this, I am experimenting with various forms of sequence-to-sequence auto-encoders.
You can find some tweet data here. It was intended for sentiment analysis, but it can be repurposed for this task. Note that it is slightly biased: only tweets containing emoticons were collected.
I trained a model with 768 LSTM cells per layer and a bottleneck layer of 1024 neurons. After a day of training on a Titan X, the cost gets down to about 0.7 nats. The model is fairly good at reconstructions; a rough sketch of the architecture follows the table:
| Original | Reconstructed |
| --- | --- |
| I hate my job. | I hate my job. |
| today will be a good day | today will be a good day |
| Well, that's my musical day set then. | Well, that's my musical days then sleep. |
| @unixpickle I am not sure if you're serious... | @inupciline I am so tired your superfure... ok. |
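To make the setup concrete, here is a minimal PyTorch sketch of this kind of seq2seq auto-encoder. The layer sizes match the numbers above, but the module names, the single-layer LSTMs, the character-level embedding, and the teacher-forced decoding are simplifying assumptions, not the exact training code:

```python
import torch
import torch.nn as nn

class TweetAutoencoder(nn.Module):
    """Sketch: LSTM encoder -> fixed-size bottleneck -> LSTM decoder."""

    def __init__(self, vocab_size, hidden_size=768, bottleneck_size=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        # Bottleneck: compress the final encoder state into one vector.
        self.to_latent = nn.Linear(hidden_size, bottleneck_size)
        self.from_latent = nn.Linear(bottleneck_size, hidden_size)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def encode(self, tokens):
        _, (h, _) = self.encoder(self.embed(tokens))
        return self.to_latent(h[-1])           # (batch, bottleneck_size)

    def decode(self, latent, tokens):
        # Teacher-forced decoding, conditioned on the latent vector.
        h0 = self.from_latent(latent).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.decoder(self.embed(tokens), (h0, c0))
        return self.out(out)                   # per-step logits

    def forward(self, tokens):
        return self.decode(self.encode(tokens), tokens)
```

Minimizing `nn.CrossEntropyLoss` on the per-step logits measures cost in nats (it uses the natural log), which is consistent with the 0.7-nat figure above.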
You can also use the model to interpolate between two tweets. Based on this paper, I suspect that I would get better interpolations with a variational auto-encoder (something I am looking into). For now, here is what the current model produces (a code sketch of the interpolation procedure follows the tables):
| Interpolation | Output |
| --- | --- |
| 0.000 | I hate my job. |
| 0.167 | I hate my 10 one. |
| 0.333 | I have my toddlering. |
| 0.500 | I have my folding trashes |
| 0.667 | I love my friends at hang |
| 0.833 | I love my friends and family |
| 1.000 | I love my friends and family |

| Interpolation | Output |
| --- | --- |
| 0.000 | I hate my job. |
| 0.167 | I hate my job. |
| 0.333 | I had to be my agile. |
| 0.500 | Ita do we had my big lonely |
| 0.667 | today we blit a good hand |
| 0.833 | today will be a good day |
| 1.000 | today will be a good day |
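Here is how that interpolation can be implemented on top of the `TweetAutoencoder` sketch above: encode both tweets, blend the latent vectors linearly, and greedily decode each mixture. The `start_id`/`end_id` tokens and the greedy decoding loop are my own assumptions, not the project's actual sampling code:

```python
import torch

@torch.no_grad()
def greedy_decode(model, latent, start_id, end_id, max_len=80):
    """Decode one token at a time from a latent vector (batch size 1)."""
    h = model.from_latent(latent).unsqueeze(0)  # initial decoder state
    c = torch.zeros_like(h)
    tok = torch.full((latent.size(0), 1), start_id, dtype=torch.long)
    ids = []
    for _ in range(max_len):
        out, (h, c) = model.decoder(model.embed(tok), (h, c))
        tok = model.out(out[:, -1]).argmax(dim=-1, keepdim=True)
        if tok.item() == end_id:
            break
        ids.append(tok.item())
    return ids

@torch.no_grad()
def interpolate(model, tokens_a, tokens_b, start_id, end_id, steps=7):
    """Linearly blend two latent codes and decode each mixture."""
    za, zb = model.encode(tokens_a), model.encode(tokens_b)
    for i in range(steps):
        t = i / (steps - 1)
        z = (1 - t) * za + t * zb  # linear blend of latent vectors
        print(f"{t:.3f} | {greedy_decode(model, z, start_id, end_id)}")
```

With `steps=7`, the mixing weights come out to the 0.000 through 1.000 values shown in the tables above.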