## Feature ideas

+ TF-IDF similarity
+ doc2vec (tf-idf weighted w2v) similarity
+ LDA similarity

    - (further improvement?) entities intersection (Polyglot / NER by Chaplinskyi)
    - alignment in headlines + word vectors between corresponding words
    - mean and std for similarities between all pairs of words in 2 headlines
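
The last idea in the list (mean and std of similarities over all word pairs of two headlines) can be sketched as follows. This is a dependency-free illustration, not code from this notebook: `pairwise_similarity_stats` and the 2-dimensional `toy_vectors` are hypothetical, and a real run would look words up in the trained word2vec model instead.

```python
import math
from itertools import product
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pairwise_similarity_stats(headline_1, headline_2, vectors):
    """Return (mean, std) of cosine similarity over all cross-headline word pairs.

    Words missing from `vectors` are skipped; (0.0, 0.0) is returned when
    no pair can be compared.
    """
    known_1 = [w for w in headline_1 if w in vectors]
    known_2 = [w for w in headline_2 if w in vectors]
    sims = [cosine(vectors[a], vectors[b]) for a, b in product(known_1, known_2)]
    if not sims:
        return 0.0, 0.0
    return mean(sims), pstdev(sims)


# Hypothetical 2-dimensional word vectors, for illustration only.
toy_vectors = {
    "rain": [1.0, 0.0],
    "storm": [1.0, 0.0],
    "election": [0.0, 1.0],
}
sim_mean, sim_std = pairwise_similarity_stats(["rain", "election"], ["storm"], toy_vectors)
```

The two numbers would be appended to the pair feature vector as two extra columns.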


http://jens-lehmann.org/files/2017/kcap_simdoc.pdf

In [1]:
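import logging
from zipfile import ZipFile
import re, glob, pandas as pd, numpy as np
from scipy import sparse  # sparse.hstack / sparse.vstack are used for the pair features

from cp_utils.vectorizers import *

logging.getLogger('polyglot').setLevel(logging.CRITICAL)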

## Initialize vectorizers

In [3]:
tfidf = TfIdfGensimVectoriser(tfidf_path='./vectors/tfidf_lemm_nofunctors_unigramms.gensim',
                              dictionary='./vectors/pos_lemmatized_nofunctors.dict',
                              lemmatize=True)

2018-05-03 23:42:11,596 : INFO : loading TfidfModel object from ./vectors/tfidf_lemm_nofunctors_unigramms.gensim
2018-05-03 23:42:13,563 : INFO : loading id2word recursively from ./vectors/tfidf_lemm_nofunctors_unigramms.gensim.id2word.* with mmap=None
2018-05-03 23:42:13,564 : INFO : loaded ./vectors/tfidf_lemm_nofunctors_unigramms.gensim
2018-05-03 23:42:13,565 : INFO : loading Dictionary object from ./vectors/pos_lemmatized_nofunctors.dict
2018-05-03 23:42:13,765 : INFO : loaded ./vectors/pos_lemmatized_nofunctors.dict


In [None]:
lda_vect = LdaVectorizer(lda_path='./vectors/lda050418_1000dim_15pass_100iter_10offset_0.7lr.lda',
                         tfidf_path='./vectors/tfidf_lemm_nofunctors_unigramms.gensim',
                         dictionary='./vectors/pos_lemmatized_nofunctors.dict',
                         lemmatize=True)

In [4]:
d2v_tfidf = TfIdfD2vVectoriser(vec_path='./vectors/w2v_sent_5dim_5win_5Mwaxvocab_15Kbatch_10epoch.w2v',
                               tfidf_path='./vectors/tfidf_lemm_nofunctors_unigramms.gensim',
                               dictionary='./vectors/pos_lemmatized_nofunctors.dict',
                               lemmatize=True)

2018-05-03 23:42:15,927 : INFO : loading Word2VecKeyedVectors object from ./vectors/w2v_sent_5dim_5win_5Mwaxvocab_15Kbatch_10epoch.w2v
2018-05-03 23:42:18,753 : INFO : loading wv recursively from ./vectors/w2v_sent_5dim_5win_5Mwaxvocab_15Kbatch_10epoch.w2v.wv.* with mmap=None
2018-05-03 23:42:18,754 : INFO : loading vectors from ./vectors/w2v_sent_5dim_5win_5Mwaxvocab_15Kbatch_10epoch.w2v.wv.vectors.npy with mmap=None
2018-05-03 23:42:20,486 : INFO : setting ignored attribute vectors_norm to None
2018-05-03 23:42:20,488 : INFO : loading vocabulary recursively from ./vectors/w2v_sent_5dim_5win_5Mwaxvocab_15Kbatch_10epoch.w2v.vocabulary.* with mmap=None
2018-05-03 23:42:20,490 : INFO : loading trainables recursively from ./vectors/w2v_sent_5dim_5win_5Mwaxvocab_15Kbatch_10epoch.w2v.trainables.* with mmap=None
2018-05-03 23:42:20,491 : INFO : loading syn1neg from ./vectors/w2v_sent_5dim_5win_5Mwaxvocab_15Kbatch_10epoch.w2v.trainables.syn1neg.npy with mmap=None
2018-05-03 23:42:22,195 : INF
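
The internals of `TfIdfD2vVectoriser` are not shown in this notebook. Going by the feature list ("tf-idf weighted w2v"), a plausible reading is that each document vector is the TF-IDF-weighted average of its word vectors. A minimal sketch under that assumption, with a hypothetical helper `weighted_doc_vector` and made-up toy weights:

```python
def weighted_doc_vector(tokens, word_vectors, tfidf_weights, dim):
    """Average the vectors of known tokens, weighting each by its TF-IDF weight."""
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in word_vectors or token not in tfidf_weights:
            continue  # skip out-of-vocabulary tokens
        weight = tfidf_weights[token]
        for i, value in enumerate(word_vectors[token]):
            total[i] += weight * value
        weight_sum += weight
    if weight_sum == 0:
        return total  # all-zero vector for documents with no known tokens
    return [value / weight_sum for value in total]


# Hypothetical word vectors and TF-IDF weights, for illustration only.
toy_vectors = {"rain": [1.0, 0.0], "election": [0.0, 1.0]}
toy_weights = {"rain": 3.0, "election": 1.0}
doc_vec = weighted_doc_vector(["rain", "election", "unknown"], toy_vectors, toy_weights, dim=2)
```

Here "rain" carries three times the weight of "election", so the document vector is pulled towards the "rain" direction; the unknown token contributes nothing.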

## Vectorize texts and map them to ids

In [5]:
def text_generator():
    with ZipFile('aggr_texts.zip') as zf:
        names = zf.namelist()
        for fn in names:
            with zf.open(fn) as f:
                yield fn, f.read().decode()

In [39]:
texts = []
for fname, doc in text_generator():
    d = {}
    d['id'] = re.sub(r'\D', '', fname)   # keep only the digits of the file name
    d['text'] = re.split(r'\s+', doc)    # whitespace tokenization
    texts += [d]
    
texts_df = pd.DataFrame(texts)
id2index = {int(row['id']): i for i, row in texts_df.iterrows()}
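
A small standalone sketch of the id-to-row mapping built above, using hypothetical archive member names. It also shows one caveat of stripping non-digits: every digit in the path is kept, so a digit in a directory name ends up in the id.

```python
import re


def build_id_index(file_names):
    """Map the integer id embedded in each file name to its position in the list."""
    id2index = {}
    for position, name in enumerate(file_names):
        digits = re.sub(r'\D', '', name)  # keep digits only
        if not digits:
            continue  # no id in this name, e.g. a directory entry
        id2index[int(digits)] = position
    return id2index


# Hypothetical archive member names, for illustration only.
index = build_id_index(['texts/1042.txt', 'texts/7.txt', 'texts/README'])
# Digits from the directory name leak into the id: 'v2/15.txt' becomes 215.
leaky = build_id_index(['v2/15.txt'])
```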

In [10]:
%%time

lda_feats = lda_vect.transform(texts_df.text.values)
print('vectorized LDA')

d2v_feats = d2v_tfidf.transform(texts_df.text.values)
print('vectorized D2V')

tfidf_feats = tfidf.transform(texts_df.text.values)
print('vectorized Tf-Idf')



vectorized D2V
CPU times: user 52.1 s, sys: 20 ms, total: 52.1 s
Wall time: 52.2 s


In [13]:
import pickle as pkl  # the public module; it uses the C implementation automatically

with open('d2v_feats.pkl', 'wb') as f:
    pkl.dump(d2v_feats, f)
    
with open('tfidf_feats.pkl', 'wb') as f:
    pkl.dump(tfidf_feats, f)

with open('lda_feats.pkl', 'wb') as f:
    pkl.dump(lda_feats, f)

## Construct features for pairs of documents

In [17]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import classification_report, confusion_matrix

In [40]:
train_pairs = pd.read_csv('train_pair_ids.tsv', sep='\t')
test_pairs = pd.read_csv('test_pair_ids.tsv', sep='\t')

In [58]:
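def get_feature_agg_vector(row):
    """Build the feature vector for one document pair (row.id1, row.id2)."""
    # Look up the precomputed vectors of both documents
    lda_row_1 = lda_feats[id2index[row.id1]]
    lda_row_2 = lda_feats[id2index[row.id2]]

    d2v_row_1 = d2v_feats[id2index[row.id1]]
    d2v_row_2 = d2v_feats[id2index[row.id2]]

    tfidf_row_1 = tfidf_feats[id2index[row.id1]]
    tfidf_row_2 = tfidf_feats[id2index[row.id2]]

    # Pairwise features: cosine similarities and element-wise absolute LDA difference
    d2v_cosine = cosine_similarity(d2v_row_1, d2v_row_2)[0, 0]
    lda_diff = np.abs(lda_row_1 - lda_row_2)
    tfidf_cosine = cosine_similarity(tfidf_row_1, tfidf_row_2)[0, 0]

    # Stack everything into one row: both d2v vectors plus the pairwise features
    feature_vec = sparse.hstack([d2v_row_1,
                                 d2v_row_2,
                                 d2v_cosine,
                                 lda_diff,
                                 tfidf_cosine])
    return feature_vec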
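
The layout of the pair feature vector can be checked on toy data. The sketch below mirrors the concatenation order of `get_feature_agg_vector` with plain Python lists; `pair_features` and all input values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_features(d2v_1, d2v_2, lda_1, lda_2, tfidf_1, tfidf_2):
    """Concatenate both d2v vectors, their cosine, |lda_1 - lda_2| and the tf-idf cosine."""
    lda_diff = [abs(a - b) for a, b in zip(lda_1, lda_2)]
    return d2v_1 + d2v_2 + [cosine(d2v_1, d2v_2)] + lda_diff + [cosine(tfidf_1, tfidf_2)]


# Toy inputs: 2-dimensional document vectors, 2 LDA topics, 2 tf-idf terms.
features = pair_features([1.0, 0.0], [0.0, 1.0],
                         [0.2, 0.8], [0.5, 0.5],
                         [1.0, 1.0], [1.0, 1.0])
```

The resulting width is `2 * d + 1 + k + 1` for `d`-dimensional document vectors and `k` LDA topics. If the LDA model has 1000 topics, as its file name suggests, the 1102 columns reported for `test_feats` would correspond to two 50-dimensional document vectors plus the two cosine values (2 * 50 + 1 + 1000 + 1 = 1102); the document vector size is inferred here, not stated in the notebook.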

In [63]:
# Balance the classes: keep all similar pairs and a random sample of 16276 dissimilar ones
train_balanced = pd.concat([train_pairs.loc[train_pairs.is_similar],
                            train_pairs.loc[~train_pairs.is_similar].sample(n=16276)])

train_feats = sparse.vstack([get_feature_agg_vector(row) for i, row in train_balanced.iterrows()])
print('Got train features')

test_feats = sparse.vstack([get_feature_agg_vector(row) for i, row in test_pairs.iterrows()])
print('Got test features')

Got train features
Got test features
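
The sample size 16276 is hard-coded above; presumably it equals the number of similar pairs in the training set, though the notebook does not say so. A sketch of the same undersampling idea that derives the size from the data and fixes the random seed, using a hypothetical helper `undersample_majority` on plain dictionaries:

```python
import random


def undersample_majority(rows, label_key, seed=0):
    """Keep every minority-class row and an equally sized random sample of the majority class."""
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    minority, majority = sorted([positives, negatives], key=len)
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return minority + rng.sample(majority, len(minority))


# Toy data: 3 similar pairs and 7 dissimilar ones.
rows = [{'id': i, 'is_similar': i < 3} for i in range(10)]
balanced = undersample_majority(rows, 'is_similar')
```

With pandas, the equivalent would be to pass the count of similar pairs as `n` and a `random_state` to `DataFrame.sample`, so that reruns train on the same subset.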


In [65]:
test_feats

<70275x1102 sparse matrix of type '<class 'numpy.float64'>'
	with 6956648 stored elements in COOrdinate format>

## Train classifier

In [75]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [76]:
%%time

cls = MLPClassifier(hidden_layer_sizes=(100, 100,))
cls.fit(train_feats, train_balanced.is_similar)

CPU times: user 11min 30s, sys: 14min 31s, total: 26min 1s
Wall time: 6min 58s


In [77]:
predicted = cls.predict(test_feats)

In [78]:
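# Note: sklearn's signature is (y_true, y_pred). Predictions are passed first here,
# so in the report below the precision and recall columns are swapped, support counts
# predicted labels, and the confusion-matrix rows are predictions.
print(classification_report(predicted, test_pairs.is_similar))

confusion_matrix(predicted, test_pairs.is_similar,
                 labels=[True, False])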

             precision    recall  f1-score   support

      False       0.76      0.90      0.82     45449
       True       0.72      0.47      0.57     24826

avg / total       0.74      0.75      0.73     70275



array([[11660, 13166],
       [ 4645, 40804]])
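
Precision, recall and F1 for the similar class can be recomputed directly from the matrix above. `confusion_matrix` indexes rows by its first argument and columns by its second; since `predicted` was passed first, rows are predictions and columns are true labels. A short worked check:

```python
# Confusion matrix copied from the output above; label order is [True, False].
# Rows follow the first argument (the predictions), columns the second (the true labels).
matrix = [[11660, 13166],
          [4645, 40804]]

true_positives = matrix[0][0]    # predicted similar, actually similar
false_positives = matrix[0][1]   # predicted similar, actually dissimilar
false_negatives = matrix[1][0]   # predicted dissimilar, actually similar

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
# prints: precision=0.47 recall=0.72 f1=0.57
```

The F1 score matches the report, while precision and recall appear there in the opposite columns.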

* Baseline 0.56 f1 score for similar pairs - 0.73
* Adding weighted averaged word2vec, LDA, and tf-idf features resulted in a better f1 for dissimilar documents and far fewer false positives for similarity.