A document classifier based on the https://deeplearning4j.org/doc2vec library.
We use a supervised algorithm, which needs a lot of data:
training data (and the same data after NLP preprocessing)
based on the Open ANC corpus, together with some samples of 'labeled' text.
The current approach is based on the following idea:
- train the model on a language corpus,
- build vectors for the labeled data,
- calculate the 'cosine distance' between the sample text and the labeled data (a scoring sketch follows below).
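A minimal sketch of this scoring step, assuming DL4J's ParagraphVectors and ND4J's Transforms.cosineSim; the classify function and labelVectors map are illustrative names, not part of this project:

```kotlin
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.ops.transforms.Transforms

// Picks the label whose vector is closest (by cosine similarity) to the
// inferred vector of the sample text. Names here are illustrative.
fun classify(model: ParagraphVectors, text: String, labelVectors: Map<String, INDArray>): String? {
    val sample = model.inferVector(text) // vector for unseen text from the trained model
    return labelVectors.maxByOrNull { (_, v) -> Transforms.cosineSim(sample, v) }?.key
}
```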
We tested the same data with different learning configurations; the most effective configuration found uses the following values (see the builder sketch after this list):
- minWordFrequency: 3 - minimum number of occurrences a word must have to be considered
- learningRate: 0.05
- layersSize: 200 - size of the neural network layer, typically between 50 and 300
- iterations: 2
- epochs: 14 - see explanation
- window: 7 - length of the word sequence that the NN consumes
- negative: 7.0 - negative-sampling value (word count, unsigned integer)
- sampling: 1.0E-3 - subsampling value, usually between 1e-3 and 1e-5
- trainElementsVectors: true
- trainSequenceVectors: true
- wordsConversion: "LEMMA_POS" - one of:
  - RAW (no conversion)
  - POS (append the POS tag to the end of the word)
  - LEMMA (use the lemma instead of the word)
  - LEMMA_POS (use the lemma + POS)
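For reference, a hedged sketch of how these values map onto DL4J's ParagraphVectors.Builder; the buildModel function, iterator, tokenizer, and corpusPath are illustrative, and wordsConversion is applied by our own preprocessing rather than by the builder:

```kotlin
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory

fun buildModel(corpusPath: String): ParagraphVectors {
    val vectors = ParagraphVectors.Builder()
        .minWordFrequency(3)        // min frequency of a word to be considered
        .learningRate(0.05)
        .layerSize(200)             // 'layersSize' above
        .iterations(2)
        .epochs(14)
        .windowSize(7)              // 'window' above
        .negativeSample(7.0)        // 'negative' above
        .sampling(1.0E-3)
        .trainElementsRepresentation(true)  // 'trainElementsVectors' above
        .trainSequencesRepresentation(true) // 'trainSequenceVectors' above
        .iterate(BasicLineIterator(corpusPath))
        .tokenizerFactory(DefaultTokenizerFactory())
        .build()
    vectors.fit()
    return vectors
}
```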
Subsampling in DL4J uses the following formula:

R = (sqrt(wordFrequency / subsampling) + 1) * (subsampling / wordFrequency)

A word is then removed if R is less than a random value in [0; 1); therefore any word with R greater than or equal to 1 is always kept.
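For illustration, the same check in Kotlin (here wordFrequency is the word's frequency as a fraction of the corpus; the function name is ours):

```kotlin
import kotlin.math.sqrt
import kotlin.random.Random

// A word is dropped when R is below a uniform draw from [0, 1),
// so R >= 1 means the word always survives subsampling.
fun keepWord(wordFrequency: Double, subsampling: Double = 1.0E-3): Boolean {
    val r = (sqrt(wordFrequency / subsampling) + 1) * (subsampling / wordFrequency)
    return r >= Random.nextDouble()
}
```

For example, with subsampling = 1e-3 a word making up 0.1% of the corpus gets R = (sqrt(1) + 1) * 1 = 2 and is always kept, while a word making up 10% of the corpus gets R = (sqrt(100) + 1) * 0.01 = 0.11 and is usually dropped.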
Based on Word2Vec
We tested the same data with different learning configurations; the most effective configuration found uses:
- batchSize: Int = 64, // number of examples in each minibatch
- vectorSize: Int = 300, // size of the word vectors; 300 in the Google News model
- nEpochs: Int = 2, // number of epochs (full passes of the training data) to train on
- truncateReviewsToLength: Int = 256, // truncate reviews longer than this (in words)
- learningRate: Double = 2e-2
Lemmas can also be used for learning; see com/codeabovelab/tpc/tool/learn/sentiment/SentimentIterator.kt:init. A sketch of the network side follows below.
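A hedged sketch of that network, assuming an LSTM sentiment classifier in the style of DL4J's Word2VecSentimentRNN example; the layer width of 256 and the two-class output are illustrative, and older DL4J versions use GravesLSTM instead of LSTM:

```kotlin
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.LSTM
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.learning.config.Adam
import org.nd4j.linalg.lossfunctions.LossFunctions

val conf = NeuralNetConfiguration.Builder()
    .updater(Adam(2e-2))                       // learningRate above
    .list()
    .layer(0, LSTM.Builder()
        .nIn(300)                              // vectorSize above
        .nOut(256)
        .activation(Activation.TANH).build())
    .layer(1, RnnOutputLayer.Builder()
        .activation(Activation.SOFTMAX)
        .lossFunction(LossFunctions.LossFunction.MCXENT)
        .nIn(256).nOut(2).build())             // two sentiment classes
    .build()

val net = MultiLayerNetwork(conf).also { it.init() }
```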
- GloVe: https://nlp.stanford.edu/projects/glove/
- Word2vec: https://en.wikipedia.org/wiki/Word2vec
- TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- Skip-gram, negative sampling, subsampling, etc.: http://arxiv.org/pdf/1310.4546.pdf
- Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling: https://arxiv.org/abs/1611.01462
- Cosine similarity: https://en.wikipedia.org/wiki/Cosine_similarity
- Example analysis of Enron email: https://jaycode.github.io/enron/identifying-fraud-from-enron-email.html
- Second example analysis of Enron email: https://github.com/yielder/identifying-fraud-from-enron-email
- Analysis of communication patterns with scammers in the Enron corpus: https://arxiv.org/pdf/1509.00705.pdf