## Original Basic Siamese RNN (LSTM)
   - Use Jieba tokenizer to get tokens
   - Build dictionary
   - Turn titles into index vectors
   - Zero padding to make fixed-length index vector 
   - Turn label into one-hot vectors
   - Siamese LSTM model 
   - Train and test
   - Submission

In [1]:
import numpy as np
import pandas as pd

In [2]:
import jieba.posseg as pseg
import os
import keras

Using TensorFlow backend.


In [3]:
TRAIN_CSV_PATH = './project1_data/train.csv'
TEST_CSV_PATH = './project1_data/test.csv'
TOKENIZED_TRAIN_CSV_PATH = './project1_data/tokenized_train.csv'
TOKENIZED_TEST_CSV_PATH = './project1_data/tokenized_test.csv'

In [4]:
train = pd.read_csv(TRAIN_CSV_PATH, index_col='id')
train.head(3)

Unnamed: 0_level_0,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
0,0,1,2017养老保险又新增两项，农村老人人人可申领，你领到了吗,警方辟谣“鸟巢大会每人领5万” 仍有老人坚持进京,There are two new old-age insurance benefits f...,"Police disprove ""bird's nest congress each per...",unrelated
3,2,3,"""你不来深圳，早晚你儿子也要来""，不出10年深圳人均GDP将超香港",深圳GDP首超香港？深圳统计局辟谣：只是差距在缩小,"""If you do not come to Shenzhen, sooner or lat...",Shenzhen's GDP outstrips Hong Kong? Shenzhen S...,unrelated
1,2,4,"""你不来深圳，早晚你儿子也要来""，不出10年深圳人均GDP将超香港",GDP首超香港？深圳澄清：还差一点点……,"""If you do not come to Shenzhen, sooner or lat...",The GDP overtopped Hong Kong? Shenzhen clarifi...,unrelated


### Use Jieba tokenizer to get tokens

In [6]:
def jieba_tokenizer(text):
    words = pseg.cut(text)
    return ' '.join([word for word, flag in words if flag != 'x'])

In [7]:
train.isna().any()

title1_zh    False
title2_zh     True
label        False
dtype: bool

In [8]:
train.title2_zh.fillna('UNKNOWN', inplace=True)
train.isna().any()

title1_zh    False
title2_zh    False
label        False
dtype: bool

In [9]:
def process(data):
    res = data.apply(jieba_tokenizer)
    return res

def check_merge_idx(data, res):
    assert((data.index == res.index).all(), 'Something error when merge data')

def parallelize(data, func):
    from multiprocessing import cpu_count, Pool
    cores = partitions = cpu_count()
    data_split = np.array_split(data, partitions)
    pool = Pool(cores)
    res = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    check_merge_idx(data, res)
    return res

  assert((data.index == res.index).all(), 'Something error when merge data')


In [10]:
np.all(train.index == train.title1_zh.index)

True

In [11]:
if os.path.exists(TOKENIZED_TRAIN_CSV_PATH):
    print("Use prepared tokenized train data")
    train = pd.read_csv(TOKENIZED_TRAIN_CSV_PATH, index_col='id')
else:
    print("start to training")
    train['title1_tokenized'] = parallelize(train.loc[:, 'title1_zh'], process)
    train['title2_tokenized'] = parallelize(train.loc[:, 'title2_zh'], process)
    train.to_csv('tokenized_train.csv',index=True)

start to training


Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Loading model cost 1.406 seconds.
Prefix dict has been built succesfully.
Loading model cost 1.420 seconds.
Loading model cost 1.390 seconds.
Prefix dict has been built succesfully.
Prefix dict has been built succesfully.
Loading model cost 1.368 seconds.
Prefix dict has been built succesfully.
Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Building 

In [12]:
train.loc[:, ["title1_zh", "title1_tokenized"]].head(5)

Unnamed: 0_level_0,title1_zh,title1_tokenized
id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,2017养老保险又新增两项，农村老人人人可申领，你领到了吗,2017 养老保险 又 新增 两项 农村 老人 人人 可 申领 你 领到 了 吗
3,"""你不来深圳，早晚你儿子也要来""，不出10年深圳人均GDP将超香港",你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
1,"""你不来深圳，早晚你儿子也要来""，不出10年深圳人均GDP将超香港",你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
2,"""你不来深圳，早晚你儿子也要来""，不出10年深圳人均GDP将超香港",你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
9,"""用大蒜鉴别地沟油的方法,怎么鉴别地沟油",用 大蒜 鉴别 地沟油 的 方法 怎么 鉴别 地沟油


In [13]:
train.loc[:, ["title2_zh", "title2_tokenized"]].head(5)

Unnamed: 0_level_0,title2_zh,title2_tokenized
id,Unnamed: 1_level_1,Unnamed: 2_level_1
0,警方辟谣“鸟巢大会每人领5万” 仍有老人坚持进京,警方 辟谣 鸟巢 大会 每人 领 5 万 仍 有 老人 坚持 进京
3,深圳GDP首超香港？深圳统计局辟谣：只是差距在缩小,深圳 GDP 首 超 香港 深圳 统计局 辟谣 只是 差距 在 缩小
1,GDP首超香港？深圳澄清：还差一点点……,GDP 首 超 香港 深圳 澄清 还 差 一点点
2,去年深圳GDP首超香港？深圳统计局辟谣：还差611亿,去年 深圳 GDP 首 超 香港 深圳 统计局 辟谣 还 差 611 亿
9,吃了30年食用油才知道，一片大蒜轻松鉴别地沟油,吃 了 30 年 食用油 才 知道 一片 大蒜 轻松 鉴别 地沟油


In [14]:
train.fillna('UNKNOWN', inplace=True)

In [22]:
MAX_NUM_WORDS = 70000
tokenizer = keras.preprocessing.text.Tokenizer(num_words=MAX_NUM_WORDS)

In [15]:
corpus_x1 = train.title1_tokenized
corpus_x2 = train.title2_tokenized
corpus = pd.concat([corpus_x1, corpus_x2])
corpus.shape

(641104,)

In [16]:
pd.DataFrame(corpus.iloc[:5],
             columns=['title'])

Unnamed: 0_level_0,title
id,Unnamed: 1_level_1
0,2017 养老保险 又 新增 两项 农村 老人 人人 可 申领 你 领到 了 吗
3,你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
1,你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
2,你 不 来 深圳 早晚 你 儿子 也 要 来 不出 10 年 深圳 人均 GDP 将 超 香港
9,用 大蒜 鉴别 地沟油 的 方法 怎么 鉴别 地沟油


### Build dictionary

In [17]:
with open('./project1_data/corpus.txt', 'w', encoding='utf-8')as f:
    for sent in corpus:
        f.write(sent)

### Turn titles into index vectors

In [23]:
tokenizer.fit_on_texts(corpus)
x1_train = tokenizer.texts_to_sequences(corpus_x1)
x2_train = tokenizer.texts_to_sequences(corpus_x2)

### Zero padding to make fixed-length index vector 

In [24]:
MAX_SEQUENCE_LENGTH = 20
x1_train = keras.preprocessing.sequence.pad_sequences(x1_train, maxlen=MAX_SEQUENCE_LENGTH)

x2_train = keras.preprocessing.sequence.pad_sequences(x2_train, maxlen=MAX_SEQUENCE_LENGTH)

### Turn labels into one-hot vectors

In [19]:
import numpy as np 


label_to_index = {
    'unrelated': 0, 
    'agreed': 1, 
    'disagreed': 2
}


y_train = train.label.apply(
    lambda x: label_to_index[x])

y_train = np.asarray(y_train).astype('float32')

y_train[:5]

array([0., 0., 0., 0., 1.], dtype=float32)

In [20]:
y_train.shape

(320552,)

In [21]:
y_train = keras.utils.to_categorical(y_train)
y_train[:5]

array([[1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.]], dtype=float32)

### Siamese LSTM model 

In [25]:
from sklearn.model_selection import train_test_split

VALIDATION_RATIO = 0.1
RANDOM_STATE = 0

x1_train, x1_val, x2_train, x2_val, y_train, y_val = \
    train_test_split(
        x1_train, x2_train, y_train, 
        test_size=VALIDATION_RATIO, 
        random_state=RANDOM_STATE
)


In [36]:
# parameters
NUM_CLASSES = 3
MAX_NUM_WORDS = 70000
MAX_SEQUENCE_LENGTH = 20
NUM_EMBEDDING_DIM = 256
NUM_LSTM_UNITS = 128

In [37]:
from keras import Input
from keras.layers import Embedding,LSTM, concatenate, Dense
from keras.models import Model

top_input = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype='int32')
bm_input = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype='int32')

embedding_layer = Embedding(MAX_NUM_WORDS, NUM_EMBEDDING_DIM)
top_embedded = embedding_layer(top_input)
bm_embedded = embedding_layer(bm_input)

shared_lstm = LSTM(NUM_LSTM_UNITS)
top_output = shared_lstm(top_embedded)
bm_output = shared_lstm(bm_embedded)

merged = concatenate([top_output, bm_output], axis=-1)

dense =  Dense(units=NUM_CLASSES, activation='softmax')
predictions = dense(merged)


model = Model(inputs=[top_input, bm_input], outputs=predictions)

model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, 20)           0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, 20)           0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 20, 256)      17920000    input_5[0][0]                    
                                                                 input_6[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   (None, 128)          197120      embedding_2[0][0]                
          

In [38]:
from keras.optimizers import Adam

In [39]:
lr = 1e-3
opt = Adam(lr=lr, decay=lr/50)
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy'])

In [40]:
y_train.shape

(288496, 3)

### Train

In [None]:
BATCH_SIZE = 512

NUM_EPOCHS = 10


history = model.fit(
    x=[x1_train, x2_train], 
    y=y_train,
    batch_size=BATCH_SIZE,
    epochs=NUM_EPOCHS,
    validation_data=(
        [x1_val, x2_val], 
        y_val
    ),
    shuffle=True
)

### Test

In [55]:
# tokenized
import pandas as pd
if os.path.exists(TOKENIZED_TEST_CSV_PATH):
    print("Use tokenized test csv")
    test = pd.read_csv(TOKENIZED_TEST_CSV_PATH, index_col=0)
else:
    print("Use raw test csv")
    test = pd.read_csv(TEST_CSV_PATH, index_col=0)
    test.fillna('UNKNOWN', inplace=True)
    test['title1_tokenized'] = parallelize(test.loc[:, 'title1_zh'], process)
    test['title2_tokenized'] = parallelize(test.loc[:, 'title2_zh'], process)
    test.fillna('UNKNOWN', inplace=True)
test.head(3)

Use raw test csv


Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
2019-05-10 11:04:25,240 : DEBUG : Building prefix dict from the default dictionary ...
2019-05-10 11:04:25,248 : DEBUG : Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
2019-05-10 11:04:25,265 : DEBUG : Building prefix dict from the default dictionary ...
Building prefix dict from the default dictionary ...
2019-05-10 11:04:25,310 : DEBUG : Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Dumping model to file cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Dumping model to file cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
Dumping model to file cache /var/folders/90/jkdn_401557c3mztpykpjdfw0000gn/T/jieba.cache
2019-05-10 11:04:27,177 : DEBUG : Dumping model to file cache /var/folders/90/jkdn_4

Unnamed: 0_level_0,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en,title1_tokenized,title2_tokenized
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
321187,167562,59521,萨拉赫人气爆棚!埃及总统大选未参选获百万选票 现任总统压力山大,辟谣！里昂官方否认费基尔加盟利物浦，难道是价格没谈拢？,egypt 's presidential election failed to win m...,Lyon! Lyon officials have denied that Felipe F...,萨拉 赫 人气 爆棚 埃及 总统大选 未 参选 获 百万 选票 现任 总统 压力 山 大,辟谣 里昂 官方 否认 费 基尔 加盟 利物浦 难道 是 价格 没 谈拢
321190,167564,91315,萨达姆被捕后告诫美国的一句话，发人深思,10大最让美国人相信的荒诞谣言，如蜥蜴人掌控着美国,A message from Saddam Hussein after he was cap...,The Top 10 Americans believe that the Lizard M...,萨达姆 被捕 后 告诫 美国 的 一句 话 发人深思,10 大 最 让 美国 人 相信 的 荒诞 谣言 如 蜥蜴人 掌控 着 美国
321189,167563,167564,萨达姆此项计划没有此国破坏的话，美国还会对伊拉克发动战争吗,萨达姆被捕后告诫美国的一句话，发人深思,Will the United States wage war on Iraq withou...,A message from Saddam Hussein after he was cap...,萨达姆 此项 计划 没有 此国 破坏 的话 美国 还 会 对 伊拉克 发动战争 吗,萨达姆 被捕 后 告诫 美国 的 一句 话 发人深思


In [None]:
# index vectors
x1_test = tokenizer.texts_to_sequences(test.title1_tokenized)
x2_test = tokenizer.texts_to_sequences(test.title2_tokenized)

# zero padding
x1_test = keras.preprocessing.sequence.pad_sequences(x1_test, maxlen=MAX_SEQUENCE_LENGTH)

x2_test = keras.preprocessing.sequence.pad_sequences(x2_test,maxlen=MAX_SEQUENCE_LENGTH)    

# predict 
predictions = model.predict([x1_test, x2_test])

In [None]:
index_to_label = {v: k for k, v in label_to_index.items()}
test['Category'] = [index_to_label[idx] for idx in np.argmax(predictions, axis=1)]


### Submission

In [None]:
submission = test.loc[:, ['Category']].reset_index()
submission.columns = ['Id', 'Category']
submission.to_csv('./submission.csv', index=False)
submission.head()

## pre-trained embedding
- word2vec
- doc2vec
- fastText
- bert-as-service

### word2vec (word-level)

In [57]:
import logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.LineSentence('./project1_data/corpus.txt')
w2v_model = word2vec.Word2Vec(sentences, size=250, workers=3)
w2v_model.save("word2vec250_word.model")
# how to load model
#w2v = word2vec.Word2Vec.load("word2vec250_word.model")

2019-05-10 11:18:04,638 : INFO : collecting all words and their counts
2019-05-10 11:18:05,991 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-10 11:18:07,474 : INFO : collected 298614 word types from a corpus of 7483099 raw words and 749 sentences
2019-05-10 11:18:07,475 : INFO : Loading a fresh vocabulary
2019-05-10 11:18:07,669 : INFO : effective_min_count=5 retains 63882 unique words (21% of original 298614, drops 234732)
2019-05-10 11:18:07,670 : INFO : effective_min_count=5 leaves 7150609 word corpus (95% of original 7483099, drops 332490)
2019-05-10 11:18:07,913 : INFO : deleting the raw counts dictionary of 298614 items
2019-05-10 11:18:07,962 : INFO : sample=0.001 downsamples 27 most-common words
2019-05-10 11:18:07,963 : INFO : downsampling leaves estimated 6665665 word corpus (93.2% of prior 7150609)
2019-05-10 11:18:08,147 : INFO : estimated required memory for 63882 words and 250 dimensions: 159705000 bytes
2019-05-10 11:18:08,148 : INFO 

2019-05-10 11:18:50,993 : INFO : training on a 37415495 raw words (33326731 effective words) took 42.0s, 794212 effective words/s
2019-05-10 11:18:51,041 : INFO : saving Word2Vec object under word2vec250_word.model, separately None
2019-05-10 11:18:51,043 : INFO : storing np array 'vectors' to word2vec250_word.model.wv.vectors.npy
2019-05-10 11:18:51,239 : INFO : not storing attribute vectors_norm
2019-05-10 11:18:51,240 : INFO : storing np array 'syn1neg' to word2vec250_word.model.trainables.syn1neg.npy
2019-05-10 11:18:51,431 : INFO : not storing attribute cum_table
2019-05-10 11:18:51,557 : INFO : saved word2vec250_word.model


### word2vec (char-level)

In [59]:
train_corpus = np.unique([v for v in np.concatenate([train.title1_zh.unique(), train.title2_zh.unique()]) if type(v) == str])
test_corpus = np.unique([v for v in np.concatenate([test.title1_zh.unique(), test.title2_zh.unique()]) if type(v) == str])

In [60]:
all_corpus_char = np.concatenate([train_corpus, test_corpus])

In [61]:
with open('./project1_data/corpus_char.txt', 'w', encoding='utf-8') as corpus:
    for sentence in all_corpus_char:
        for char in sentence:
            corpus.write(char + ' ')
        corpus.write('\n')

In [67]:
import logging
from gensim.models import word2vec
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = word2vec.LineSentence('./project1_data/corpus_char.txt')
w2v_model_char = word2vec.Word2Vec(sentences, sg=0, hs=0, window=5, size=250, min_count=5, workers = 3)
w2v_model_char.save("word2vec250_char.model")

2019-05-10 11:25:39,587 : INFO : collecting all words and their counts
2019-05-10 11:25:39,590 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-10 11:25:39,672 : INFO : PROGRESS: at sentence #10000, processed 268913 words, keeping 3122 word types
2019-05-10 11:25:39,744 : INFO : PROGRESS: at sentence #20000, processed 527364 words, keeping 3764 word types
2019-05-10 11:25:39,813 : INFO : PROGRESS: at sentence #30000, processed 776712 words, keeping 4033 word types
2019-05-10 11:25:39,881 : INFO : PROGRESS: at sentence #40000, processed 1026225 words, keeping 4173 word types
2019-05-10 11:25:39,949 : INFO : PROGRESS: at sentence #50000, processed 1274541 words, keeping 4291 word types
2019-05-10 11:25:40,018 : INFO : PROGRESS: at sentence #60000, processed 1509680 words, keeping 4441 word types
2019-05-10 11:25:40,086 : INFO : PROGRESS: at sentence #70000, processed 1758381 words, keeping 4532 word types
2019-05-10 11:25:40,153 : INFO : PROGRESS: at sen

2019-05-10 11:26:03,925 : INFO : EPOCH 5 - PROGRESS: at 87.51% examples, 1084350 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:26:04,470 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-10 11:26:04,480 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-10 11:26:04,483 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-10 11:26:04,484 : INFO : EPOCH - 5 : training on 5701393 raw words (4976046 effective words) took 4.6s, 1086871 effective words/s
2019-05-10 11:26:04,484 : INFO : training on a 28506965 raw words (24882527 effective words) took 23.1s, 1076733 effective words/s
2019-05-10 11:26:04,485 : INFO : saving Word2Vec object under word2vec250_char.model, separately None
2019-05-10 11:26:04,486 : INFO : not storing attribute vectors_norm
2019-05-10 11:26:04,487 : INFO : not storing attribute cum_table
2019-05-10 11:26:04,551 : INFO : saved word2vec250_char.model


### doc2vec

In [63]:
from gensim.test.utils import common_texts

In [64]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

In [68]:
corpus_x1 = train.title1_tokenized
corpus_x2 = train.title2_tokenized
corpus = pd.concat([corpus_x1, corpus_x2])
corpus.shape

(641104,)

In [69]:
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(corpus)]

In [70]:
d2v_model = Doc2Vec(documents, vector_size=250, window=2, min_count=1, workers=4)

2019-05-10 11:27:49,228 : INFO : collecting all words and their counts
2019-05-10 11:27:49,230 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags
2019-05-10 11:27:49,307 : INFO : PROGRESS: at example #10000, processed 375109 words (4950535/s), 2019 word types, 10000 tags
2019-05-10 11:27:49,373 : INFO : PROGRESS: at example #20000, processed 747818 words (5701198/s), 2610 word types, 20000 tags
2019-05-10 11:27:49,444 : INFO : PROGRESS: at example #30000, processed 1095334 words (4919136/s), 2879 word types, 30000 tags
2019-05-10 11:27:49,506 : INFO : PROGRESS: at example #40000, processed 1433188 words (5506742/s), 3109 word types, 40000 tags
2019-05-10 11:27:49,568 : INFO : PROGRESS: at example #50000, processed 1765594 words (5438962/s), 3292 word types, 50000 tags
2019-05-10 11:27:49,632 : INFO : PROGRESS: at example #60000, processed 2102958 words (5353966/s), 3409 word types, 60000 tags
2019-05-10 11:27:49,698 : INFO : PROGRESS: at example #70000, pro

2019-05-10 11:27:53,470 : INFO : PROGRESS: at example #620000, processed 21080412 words (5593124/s), 4976 word types, 620000 tags
2019-05-10 11:27:53,546 : INFO : PROGRESS: at example #630000, processed 21422550 words (4522156/s), 4980 word types, 630000 tags
2019-05-10 11:27:53,628 : INFO : PROGRESS: at example #640000, processed 21773433 words (4329830/s), 4987 word types, 640000 tags
2019-05-10 11:27:53,638 : INFO : collected 4987 word types and 641104 unique tags from a corpus of 641104 examples and 21812179 words
2019-05-10 11:27:53,639 : INFO : Loading a fresh vocabulary
2019-05-10 11:27:53,652 : INFO : effective_min_count=1 retains 4987 unique words (100% of original 4987, drops 0)
2019-05-10 11:27:53,652 : INFO : effective_min_count=1 leaves 21812179 word corpus (100% of original 21812179, drops 0)
2019-05-10 11:27:53,674 : INFO : deleting the raw counts dictionary of 4987 items
2019-05-10 11:27:53,675 : INFO : sample=0.001 downsamples 26 most-common words
2019-05-10 11:27:53,6

2019-05-10 11:29:02,455 : INFO : EPOCH 2 - PROGRESS: at 62.32% examples, 409613 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:03,468 : INFO : EPOCH 2 - PROGRESS: at 65.42% examples, 411853 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:04,475 : INFO : EPOCH 2 - PROGRESS: at 68.56% examples, 413556 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:05,492 : INFO : EPOCH 2 - PROGRESS: at 71.40% examples, 413051 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:06,506 : INFO : EPOCH 2 - PROGRESS: at 74.63% examples, 415065 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:07,524 : INFO : EPOCH 2 - PROGRESS: at 77.68% examples, 416062 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:08,541 : INFO : EPOCH 2 - PROGRESS: at 80.42% examples, 415299 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:09,552 : INFO : EPOCH 2 - PROGRESS: at 83.48% examples, 416263 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:29:10,565 : INFO : EPOCH 2 - PROGRESS: at 86.72% examples, 418340 words/s, in_qsiz

2019-05-10 11:30:07,653 : INFO : EPOCH 4 - PROGRESS: at 55.41% examples, 445214 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:08,687 : INFO : EPOCH 4 - PROGRESS: at 58.96% examples, 448401 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:09,694 : INFO : EPOCH 4 - PROGRESS: at 62.32% examples, 450333 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:10,699 : INFO : EPOCH 4 - PROGRESS: at 65.79% examples, 453579 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:11,699 : INFO : EPOCH 4 - PROGRESS: at 69.21% examples, 455485 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:12,702 : INFO : EPOCH 4 - PROGRESS: at 72.66% examples, 457317 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:13,729 : INFO : EPOCH 4 - PROGRESS: at 76.05% examples, 458323 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:14,740 : INFO : EPOCH 4 - PROGRESS: at 79.53% examples, 460169 words/s, in_qsize 7, out_qsize 0
2019-05-10 11:30:15,740 : INFO : EPOCH 4 - PROGRESS: at 82.95% examples, 461957 words/s, in_qsiz

In [71]:
d2v_model.save("doc2vec_model")

2019-05-10 11:30:52,950 : INFO : saving Doc2Vec object under doc2vec_model, separately None
2019-05-10 11:30:52,952 : INFO : storing np array 'vectors_docs' to doc2vec_model.docvecs.vectors_docs.npy
2019-05-10 11:30:54,665 : INFO : saved doc2vec_model


### fastText (word-level)

In [72]:
from gensim.models import FastText
sentences = word2vec.LineSentence('./project1_data/corpus.txt')
fasttext_model = FastText(sentences, size=250, window=3, min_count=5, workers=3)

2019-05-10 11:38:11,088 : INFO : resetting layer weights
2019-05-10 11:38:41,151 : INFO : collecting all words and their counts
2019-05-10 11:38:43,585 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-10 11:38:44,994 : INFO : collected 298614 word types from a corpus of 7483099 raw words and 749 sentences
2019-05-10 11:38:44,997 : INFO : Loading a fresh vocabulary
2019-05-10 11:38:45,200 : INFO : effective_min_count=5 retains 63882 unique words (21% of original 298614, drops 234732)
2019-05-10 11:38:45,201 : INFO : effective_min_count=5 leaves 7150609 word corpus (95% of original 7483099, drops 332490)
2019-05-10 11:38:45,381 : INFO : deleting the raw counts dictionary of 298614 items
2019-05-10 11:38:45,414 : INFO : sample=0.001 downsamples 27 most-common words
2019-05-10 11:38:45,415 : INFO : downsampling leaves estimated 6665665 word corpus (93.2% of prior 7150609)
2019-05-10 11:38:45,863 : INFO : estimated required memory for 63882 words, 296687 bu

2019-05-10 11:39:47,743 : INFO : EPOCH 4 - PROGRESS: at 52.34% examples, 503048 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:39:48,751 : INFO : EPOCH 4 - PROGRESS: at 60.61% examples, 506788 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:39:49,753 : INFO : EPOCH 4 - PROGRESS: at 68.22% examples, 505313 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:39:50,765 : INFO : EPOCH 4 - PROGRESS: at 75.17% examples, 499137 words/s, in_qsize 4, out_qsize 1
2019-05-10 11:39:51,775 : INFO : EPOCH 4 - PROGRESS: at 82.64% examples, 497527 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:39:52,789 : INFO : EPOCH 4 - PROGRESS: at 90.92% examples, 500283 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:39:53,794 : INFO : EPOCH 4 - PROGRESS: at 97.86% examples, 496254 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:39:54,196 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-10 11:39:54,210 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-10 11:39:54,217 : I

In [73]:
fasttext_model.save("fasttext_model250_word")

2019-05-10 11:40:08,817 : INFO : saving FastText object under fasttext_model250_word, separately None
2019-05-10 11:40:08,830 : INFO : storing np array 'vectors' to fasttext_model250_word.wv.vectors.npy
2019-05-10 11:40:09,046 : INFO : storing np array 'vectors_vocab' to fasttext_model250_word.wv.vectors_vocab.npy
2019-05-10 11:40:09,242 : INFO : storing np array 'vectors_ngrams' to fasttext_model250_word.wv.vectors_ngrams.npy
2019-05-10 11:40:13,151 : INFO : not storing attribute vectors_norm
2019-05-10 11:40:13,152 : INFO : not storing attribute vectors_vocab_norm
2019-05-10 11:40:13,152 : INFO : not storing attribute vectors_ngrams_norm
2019-05-10 11:40:13,153 : INFO : not storing attribute buckets_word
2019-05-10 11:40:13,154 : INFO : storing np array 'syn1neg' to fasttext_model250_word.trainables.syn1neg.npy
2019-05-10 11:40:13,358 : INFO : storing np array 'vectors_vocab_lockf' to fasttext_model250_word.trainables.vectors_vocab_lockf.npy
2019-05-10 11:40:13,559 : INFO : storing n

### fastText (char-level)

In [74]:
from gensim.models import FastText
sentences = word2vec.LineSentence('./project1_data/corpus_char.txt')
fasttext_model_char = FastText(sentences, size=250, window=3, min_count=5, workers=3)
fasttext_model_char.save("fasttext_model250_char")

2019-05-10 11:40:17,400 : INFO : resetting layer weights
2019-05-10 11:40:47,002 : INFO : collecting all words and their counts
2019-05-10 11:40:47,007 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-05-10 11:40:47,092 : INFO : PROGRESS: at sentence #10000, processed 268913 words, keeping 3122 word types
2019-05-10 11:40:47,164 : INFO : PROGRESS: at sentence #20000, processed 527364 words, keeping 3764 word types
2019-05-10 11:40:47,233 : INFO : PROGRESS: at sentence #30000, processed 776712 words, keeping 4033 word types
2019-05-10 11:40:47,301 : INFO : PROGRESS: at sentence #40000, processed 1026225 words, keeping 4173 word types
2019-05-10 11:40:47,369 : INFO : PROGRESS: at sentence #50000, processed 1274541 words, keeping 4291 word types
2019-05-10 11:40:47,434 : INFO : PROGRESS: at sentence #60000, processed 1509680 words, keeping 4441 word types
2019-05-10 11:40:47,504 : INFO : PROGRESS: at sentence #70000, processed 1758381 words, keeping 4532 wor

2019-05-10 11:41:22,522 : INFO : EPOCH - 4 : training on 5701393 raw words (4976771 effective words) took 5.9s, 846763 effective words/s
2019-05-10 11:41:23,528 : INFO : EPOCH 5 - PROGRESS: at 16.93% examples, 866943 words/s, in_qsize 5, out_qsize 0
2019-05-10 11:41:24,533 : INFO : EPOCH 5 - PROGRESS: at 35.40% examples, 885786 words/s, in_qsize 3, out_qsize 0
2019-05-10 11:41:25,536 : INFO : EPOCH 5 - PROGRESS: at 52.59% examples, 872677 words/s, in_qsize 0, out_qsize 0
2019-05-10 11:41:26,537 : INFO : EPOCH 5 - PROGRESS: at 68.53% examples, 849782 words/s, in_qsize 1, out_qsize 0
2019-05-10 11:41:27,538 : INFO : EPOCH 5 - PROGRESS: at 83.82% examples, 832840 words/s, in_qsize 0, out_qsize 0
2019-05-10 11:41:28,479 : INFO : worker thread finished; awaiting finish of 2 more threads
2019-05-10 11:41:28,487 : INFO : worker thread finished; awaiting finish of 1 more threads
2019-05-10 11:41:28,498 : INFO : worker thread finished; awaiting finish of 0 more threads
2019-05-10 11:41:28,499 :

### bert-as-service

In [None]:
from bert_serving.client import BertClient
bc1 = BertClient()
bc2 = BertClient()

In [77]:
train_x1_bert_c = []
for idx, sent in enumerate(corpus_x1):
    train_x1_bert_c.append(sent.replace(' ',''))

In [78]:
train_x2_bert_c = []
for idx, sent in enumerate(corpus_x2):
    train_x2_bert_c.append(sent.replace(' ',''))

In [None]:
train_x1_bert = bc1.encode(train_x1_bert_c)
train_x2_bert = b2.encode(train_x2_bert_c)

## Handcrafted features
   - TF-IDF similarity of title 1 and title 2
   - Statistics features of rumor keywords
   - Overlap ratio of string matching between title 1 and title 2
   - Token set ratio matching

### Overlap ratio of string matching between title 1 and title 2

### Token set ratio matching

In [102]:
overlap_ratio_list = []
corpus_x1 = np.array(corpus_x1)
corpus_x2 = np.array(corpus_x2)

for i in range(len(corpus_x1)):
    total = len(set(corpus_x1[0].split(' ')+corpus_x2[0].split(' ')))
    overlap_ratio_list.append(len(list(set(corpus_x1[i].split(' ')) & set(corpus_x2[i].split(' '))))/total)

In [103]:
len(overlap_ratio_list)

320552

In [104]:
train['overlap_ratio'] = pd.Series(overlap_ratio_list).values

In [105]:
train['overlap_ratio'].head()

id
0    0.038462
3    0.153846
1    0.153846
2    0.153846
9    0.115385
Name: overlap_ratio, dtype: float64

## Other Models

### RandomForest

In [203]:
from sklearn.ensemble import RandomForestClassifier

In [218]:
xtrain,xval,ytrain,yval = train_test_split(X_train, y_train)

In [219]:
rcf_body = RandomForestClassifier(n_estimators=100,n_jobs=3, verbose=1)

In [220]:
rcf_body.fit(xtrain, ytrain)
y_rc_body_pred = rcf_body.predict(xval)

[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:   40.3s
[Parallel(n_jobs=3)]: Done 100 out of 100 | elapsed:  1.5min finished
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.5s
[Parallel(n_jobs=3)]: Done 100 out of 100 | elapsed:    1.0s finished


In [212]:
len(xtrain[0])

500

In [221]:
# print metrics
from sklearn.metrics import f1_score, accuracy_score , recall_score , precision_score
print ("Random Forest F1 and Accuracy Scores : \n")
print ( "F1 score {:.4}%".format( f1_score(yval, y_rc_body_pred, average='macro')*100 ) )
print ( "Accuracy score {:.4}%".format(accuracy_score(yval, y_rc_body_pred)*100) )

Random Forest F1 and Accuracy Scores : 

F1 score 68.14%
Accuracy score 82.53%


In [234]:
y_rc_body_pred_test = rcf_body.predict(X_test)

[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
[Parallel(n_jobs=3)]: Done  44 tasks      | elapsed:    0.3s
[Parallel(n_jobs=3)]: Done 100 out of 100 | elapsed:    0.7s finished


In [229]:
y_rc_body_pred_test

array([[0.7978194 , 0.13778322, 0.06439738],
       [0.71346107, 0.27265004, 0.01388889],
       [0.67297281, 0.26678427, 0.06024292],
       ...,
       [0.53353043, 0.45266005, 0.01380952],
       [0.58165873, 0.39834127, 0.02      ],
       [0.55850866, 0.42734848, 0.01414286]])

In [230]:
index_to_label = {v: k for k, v in label_to_index.items()}

In [231]:
test['Category'] = [index_to_label[idx] for idx in np.argmax(y_rc_body_pred_test, axis=1)]

In [232]:
submission = test \
    .loc[:, ['Category']] \
    .reset_index()

In [233]:
submission.columns = ['Id', 'Category']
submission.to_csv('submission.csv', index=False)
submission.head()

Unnamed: 0,Id,Category
0,321187,unrelated
1,321190,unrelated
2,321189,unrelated
3,321193,unrelated
4,321191,unrelated


In [236]:
list(y_rc_body_pred_test).count(0)

77215

In [237]:
list(y_rc_body_pred_test).count(1)

2910

In [238]:
list(y_rc_body_pred_test).count(2)

1

In [None]:
y_rc_body_pred = rcf_body.predict(xval)

In [113]:
list(y_rc_body_pred).count(1)

21450

In [114]:
list(y_rc_body_pred).count(2)

1223

### XGBoost

In [239]:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

In [240]:
xgb_body = XGBClassifier(verbose=True)

In [241]:
xtrain,xval,ytrain,yval = train_test_split(X_train, y_train)

In [242]:
len(xtrain)

240414

In [243]:
xgb_body.fit(np.array(xtrain), np.array(ytrain))

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='multi:softprob', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1, verbose=True)

In [261]:
y_xgb_body_pred = xgb_body.predict_proba(X_test)

In [246]:
# print metrics  
print ("XGBoost F1 and Accuracy Scores : \n")
print ( "F1 score {:.4}%``".format( f1_score(yval, y_xgb_body_pred, average='macro')*100 ) )
print ( "Accuracy score {:.4}%".format(accuracy_score(yval, y_xgb_body_pred)*100) )

XGBoost F1 and Accuracy Scores : 

F1 score 34.91%``
Accuracy score 69.63%


### LogisticRegression

In [252]:
from sklearn.linear_model import LogisticRegression
lr_body = LogisticRegression(penalty='l1', verbose=1, n_jobs=3)

In [253]:
# train model
lr_body.fit(xtrain, ytrain)

  " = {}.".format(effective_n_jobs(self.n_jobs)))


[LibLinear]

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=3,
          penalty='l1', random_state=None, solver='warn', tol=0.0001,
          verbose=1, warm_start=False)

In [257]:
# get predictions for article section
y_body_pred = lr_body.predict_proba(X_test)

In [256]:
# print metrics
print ("Logistig Regression F1 and Accuracy Scores : \n")
print ( "F1 score {:.4}%".format( f1_score(yval, y_body_pred, average='macro')*100 ) )
print ( "Accuracy score {:.4}%".format(accuracy_score(yval, y_body_pred)*100) )

Logistig Regression F1 and Accuracy Scores : 

F1 score 31.56%
Accuracy score 68.41%


  'precision', 'predicted', average, warn_for)
