A document classifier based on the https://deeplearning4j.org/doc2vec library.
We use a supervised algorithm, which needs a lot of data:
training data (and the same data after NLP preprocessing)
based on the Open ANC corpus, together with some samples of 'labeled' text.
The current approach is based on the following idea:
- train the model on a language corpus,
- build vectors for the labeled data,
- calculate the 'cosine distance' between the sample text and the labeled data (a scoring sketch follows below).
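A minimal sketch of this scoring step, assuming DL4J's ParagraphVectors and ND4J's Transforms.cosineSim; the classify function and labelVectors map are illustrative names, not part of this project:

```kotlin
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors
import org.nd4j.linalg.api.ndarray.INDArray
import org.nd4j.linalg.ops.transforms.Transforms

// Picks the label whose vector is closest (by cosine similarity) to the
// inferred vector of the sample text. Names here are illustrative.
fun classify(model: ParagraphVectors, text: String, labelVectors: Map<String, INDArray>): String? {
    val sample = model.inferVector(text) // vector for unseen text from the trained model
    return labelVectors.maxByOrNull { (_, v) -> Transforms.cosineSim(sample, v) }?.key
}
```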
We tested the same data with different learning configurations; the most effective configuration found uses the following values (see the builder sketch after this list):
- minWordFrequency: 3 - minimum number of occurrences a word must have to be considered
- learningRate: 0.05
- layersSize: 200 - size of the neural network layer, typically between 50 and 300
- iterations: 2
- epochs: 14 - see explanation
- window: 7 - length of the word sequence that the NN consumes
- negative: 7.0 - negative-sampling value (word count, unsigned integer)
- sampling: 1.0E-3 - subsampling value, usually between 1e-3 and 1e-5
- trainElementsVectors: true
- trainSequenceVectors: true
- wordsConversion: "LEMMA_POS" - one of:
  - RAW (no conversion)
  - POS (append the POS tag to the end of the word)
  - LEMMA (use the lemma instead of the word)
  - LEMMA_POS (use the lemma + POS)
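For reference, a hedged sketch of how these values map onto DL4J's ParagraphVectors.Builder; the buildModel function, iterator, tokenizer, and corpusPath are illustrative, and wordsConversion is applied by our own preprocessing rather than by the builder:

```kotlin
import org.deeplearning4j.models.paragraphvectors.ParagraphVectors
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory

fun buildModel(corpusPath: String): ParagraphVectors {
    val vectors = ParagraphVectors.Builder()
        .minWordFrequency(3)        // min frequency of a word to be considered
        .learningRate(0.05)
        .layerSize(200)             // 'layersSize' above
        .iterations(2)
        .epochs(14)
        .windowSize(7)              // 'window' above
        .negativeSample(7.0)        // 'negative' above
        .sampling(1.0E-3)
        .trainElementsRepresentation(true)  // 'trainElementsVectors' above
        .trainSequencesRepresentation(true) // 'trainSequenceVectors' above
        .iterate(BasicLineIterator(corpusPath))
        .tokenizerFactory(DefaultTokenizerFactory())
        .build()
    vectors.fit()
    return vectors
}
```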
Subsampling in DL4J uses the following formula:

R = (sqrt(wordFrequency / subsampling) + 1) * (subsampling / wordFrequency)

A word is then removed if R is less than a random value in [0; 1); therefore any word with R greater than or equal to 1 is always kept.
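For illustration, the same check in Kotlin (here wordFrequency is the word's frequency as a fraction of the corpus; the function name is ours):

```kotlin
import kotlin.math.sqrt
import kotlin.random.Random

// A word is dropped when R is below a uniform draw from [0, 1),
// so R >= 1 means the word always survives subsampling.
fun keepWord(wordFrequency: Double, subsampling: Double = 1.0E-3): Boolean {
    val r = (sqrt(wordFrequency / subsampling) + 1) * (subsampling / wordFrequency)
    return r >= Random.nextDouble()
}
```

For example, with subsampling = 1e-3 a word making up 0.1% of the corpus gets R = (sqrt(1) + 1) * 1 = 2 and is always kept, while a word making up 10% of the corpus gets R = (sqrt(100) + 1) * 0.01 = 0.11 and is usually dropped.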
Based on Word2Vec
We tested the same data with different learning configurations; the most effective configuration found uses:
- batchSize: Int = 64, // number of examples in each minibatch
- vectorSize: Int = 300, // size of the word vectors; 300 in the Google News model
- nEpochs: Int = 2, // number of epochs (full passes of the training data) to train on
- truncateReviewsToLength: Int = 256, // truncate reviews longer than this (in words)
- learningRate: Double = 2e-2
Lemmas can also be used for learning; see com/codeabovelab/tpc/tool/learn/sentiment/SentimentIterator.kt:init. A sketch of the network side follows below.
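A hedged sketch of that network, assuming an LSTM sentiment classifier in the style of DL4J's Word2VecSentimentRNN example; the layer width of 256 and the two-class output are illustrative, and older DL4J versions use GravesLSTM instead of LSTM:

```kotlin
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.LSTM
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.learning.config.Adam
import org.nd4j.linalg.lossfunctions.LossFunctions

val conf = NeuralNetConfiguration.Builder()
    .updater(Adam(2e-2))                       // learningRate above
    .list()
    .layer(0, LSTM.Builder()
        .nIn(300)                              // vectorSize above
        .nOut(256)
        .activation(Activation.TANH).build())
    .layer(1, RnnOutputLayer.Builder()
        .activation(Activation.SOFTMAX)
        .lossFunction(LossFunctions.LossFunction.MCXENT)
        .nIn(256).nOut(2).build())             // two sentiment classes
    .build()

val net = MultiLayerNetwork(conf).also { it.init() }
```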
- GloVe: https://nlp.stanford.edu/projects/glove/
- Word2vec: https://en.wikipedia.org/wiki/Word2vec
- TF-IDF: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
- Skip-gram, negative sampling, subsampling, etc.: http://arxiv.org/pdf/1310.4546.pdf
- Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling: https://arxiv.org/abs/1611.01462
- Cosine similarity: https://en.wikipedia.org/wiki/Cosine_similarity
- Example analysis of Enron email: https://jaycode.github.io/enron/identifying-fraud-from-enron-email.html
- Second example analysis of Enron email: https://github.com/yielder/identifying-fraud-from-enron-email
- Analysis of communication patterns with scammers in the Enron corpus: https://arxiv.org/pdf/1509.00705.pdf