## Load the Dataset
MatchZoo expects a list of *quintuples* as training data:

```python
train = [('qid0', 'did0', 'query 0', 'document 0', 'label 0'),
         ('qid0', 'did1', 'query 0', 'document 1', 'label 1'),
         ...,
         ('qid1', 'did2', 'query 1', 'document 2', 'label 3')]
```

The corresponding columns are `(text_left_id, text_right_id, text_left, text_right, label)`. For information retrieval tasks, `text_left` is the *query* and `text_right` is the *document*.

At test time, MatchZoo expects a list of *quadruples* as input, since labels are not needed:

```python
test = [('qid9', 'did5', 'query 9', 'document 5'),
        ...,
        ('qid2', 'did7', 'query 2', 'document 7')]
```
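
Since the test layout is the training layout without the label, labelled rows can be turned into test-style rows by dropping the last field. The sketch below illustrates the two layouts only; `drop_labels` is a hypothetical helper and not part of MatchZoo.

```python
# Hypothetical helper (not a MatchZoo function): convert labelled
# quintuples into the unlabelled quadruple layout used at test time.
def drop_labels(quintuples):
    quadruples = []
    for row in quintuples:
        if len(row) != 5:
            raise ValueError("expected 5 fields, got {}".format(len(row)))
        quadruples.append(tuple(row[:4]))
    return quadruples


labelled = [('qid0', 'did0', 'query 0', 'document 0', 1),
            ('qid0', 'did1', 'query 0', 'document 1', 0)]
unlabelled = drop_labels(labelled)
print(unlabelled[0])
```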

In [22]:
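def read_data(path, stage):
    """Read a tab-separated file into the tuples MatchZoo expects."""
    def scan_file():
        with open(path) as in_file:
            next(in_file)  # skip header
            for line in in_file:
                yield line.strip().split('\t')
    if stage == 'train':
        # quintuples: (qid, did, query, document, label)
        return [(qid, did, q, d, label) for qid, did, q, d, label in scan_file()]
    elif stage == 'predict':
        # quadruples: any column after the first four is ignored
        return [(elem[0], elem[1], elem[2], elem[3]) for elem in scan_file()]
    raise ValueError("stage must be 'train' or 'predict', got {!r}".format(stage))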

train = read_data('data/matchzoo_input.txt', stage='train')
#predict  = read_data('data/matchzoo_predict.txt', stage='predict')
rank = read_data('data/matchzoo_rank.txt', stage='predict')

In [23]:
print(train[0])
#print(predict[0])
print(rank[0])

('350', 'FT934-11789', 'Health and Computer Terminals', "11 18 20,000 29 70 93 931029 _an a a a a a a a a a a action action after against against agency agree also although an an and and and and and and and and and and and anything appeal are arms as as as as as as as ascribe at at at authentic award be be because because been being being bernard between books both britain brought but by by by by by by care case case case case cast causal cause charter claim claim claim clerical colleague come comp company company computer computer concept condition condition condition conditions conditions confidence confuse considering continue correspondent costs could could country court court court court court court court criticise damages damages damages describe disappointed dismiss disorder dj2dcad8ft doubt down due ec editor elbow emergence emotional employ employee employer even exist expert factor felt financial first for for for for former ft future future gbz go greatest had had hand he he
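
The loader above implies a file layout of one header line followed by tab-separated fields. The following sketch writes a tiny file in that layout and parses it the same way; the header names and the two sample rows are invented for illustration and are not taken from the real data files.

```python
import os
import tempfile

# Assumed layout: one header line, then tab-separated
# qid, did, query, document, label (header names are illustrative).
sample = ("qid\tdid\tquery\tdocument\tlabel\n"
          "350\tD1\tHealth and Computer Terminals\tsome document text\t1\n"
          "350\tD2\tHealth and Computer Terminals\tanother document\t0\n")

with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

with open(sample_path) as in_file:
    next(in_file)  # skip the header line
    rows = [tuple(line.rstrip('\n').split('\t')) for line in in_file]
os.remove(sample_path)

print(rows[0])
```

Note that every field, including the label, comes back as a string when parsed this way.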

## Preprocessing

In [25]:
from matchzoo import preprocessor
dssm_preprocessor = preprocessor.DSSMPreprocessor()
datapack_train = dssm_preprocessor.fit_transform(train, stage='train')

Start building vocabulary & fitting parameters.
100it [00:00, 3365.84it/s]
11011it [01:53, 97.08it/s] 
Start processing input data for train stage.
100it [00:00, 2190.72it/s]
11011it [02:05, 88.04it/s] 


In [26]:
type(datapack_train)

matchzoo.datapack.DataPack

In [27]:
# left table: the pre-processed `text_left` vectors, indexed by `id_left`
datapack_train.left.head()

         text_left                                           length_left
id_left
350      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
351      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
352      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
353      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
354      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813


In [28]:
# right table: the pre-processed `text_right` vectors, indexed by `id_right`
datapack_train.right.head()

               text_right                                          length_right
id_right
FT934-11789    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
LA091090-0108  [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
LA120789-0021  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
LA031990-0076  [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813
FT921-12910    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  20813


In [29]:
# relation table: pairs each `id_left` with an `id_right` and the label of that pair
datapack_train.relation.head()

   id_left  id_right       label
0  350      FT934-11789    1
1  350      LA091090-0108  1
2  350      LA120789-0021  1
3  350      LA031990-0076  1
4  350      FT921-12910    1
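
The three views above (`left`, `right`, `relation`) amount to a normalized form of the input quintuples: each text is stored once under its id, and the relation table records which pairs go together. The sketch below reproduces that split with plain dictionaries. It is an illustration only; `split_quintuples` is a hypothetical name, the sample documents are invented, and the real `DataPack` appears to hold tabular objects (given the `.head()` calls) rather than dictionaries.

```python
# Plain-dict sketch of the left / right / relation split.
# `split_quintuples` is a hypothetical helper, not MatchZoo's implementation.
def split_quintuples(quintuples):
    left, right, relation = {}, {}, []
    for id_left, id_right, text_left, text_right, label in quintuples:
        left.setdefault(id_left, text_left)      # one entry per distinct query
        right.setdefault(id_right, text_right)   # one entry per distinct document
        relation.append((id_left, id_right, label))
    return left, right, relation


rows = [('350', 'FT934-11789', 'Health and Computer Terminals', 'doc a', 1),
        ('350', 'LA091090-0108', 'Health and Computer Terminals', 'doc b', 1)]
left, right, relation = split_quintuples(rows)
print(len(left), len(right), len(relation))
```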


In [30]:
# other information stored during the pre-processing process
datapack_train.context.keys()

dict_keys(['term_index', 'input_shapes'])

In [31]:
# vocabulary size
len(datapack_train.context['term_index'])

20812

In [32]:
# DSSM input shapes are dynamic: they depend on the tri-letters
# generated from the data, so they have to be computed
# during pre-processing
datapack_train.context['input_shapes']

[(20813,), (20813,)]
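
To make the tri-letter idea concrete, here is a minimal sketch of letter-trigram word hashing as described for DSSM: each word is wrapped in `#` boundary marks and split into overlapping three-character pieces, and a text becomes a count vector over the trigram vocabulary. The function names are hypothetical, and reserving index 0 for unseen trigrams is an assumption of this sketch. It happens to give a vector one element longer than the vocabulary, like the 20812 / 20813 pair above, but whether MatchZoo uses the extra slot this way is not confirmed here.

```python
from collections import Counter


def letter_trigrams(word):
    """Split one word into letter trigrams, using '#' as the boundary mark."""
    marked = '#' + word.lower() + '#'
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def build_term_index(texts):
    """Map each distinct trigram to an index; 0 is kept for unseen trigrams (assumption)."""
    trigrams = sorted({t for text in texts
                       for word in text.split()
                       for t in letter_trigrams(word)})
    return {t: i for i, t in enumerate(trigrams, start=1)}


def to_count_vector(text, term_index):
    """Bag-of-trigrams count vector of length len(term_index) + 1."""
    counts = Counter(t for word in text.split() for t in letter_trigrams(word))
    vector = [0.0] * (len(term_index) + 1)
    for trigram, n in counts.items():
        vector[term_index.get(trigram, 0)] += n
    return vector


term_index = build_term_index(['good health', 'computer terminals'])
vector = to_count_vector('good computer', term_index)
print(letter_trigrams('good'))  # ['#go', 'goo', 'ood', 'od#']
print(len(vector) == len(term_index) + 1)
```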

## Data Generation

In [33]:
from matchzoo import generators
from matchzoo import tasks
generator_train = generators.PointGenerator(
    inputs=datapack_train, task=tasks.Ranking(), batch_size=64, stage='train')
#generator_predict = generators.PointGenerator(
#   inputs=datapack_predict, task=tasks.Ranking(), batch_size=64, stage='predict')
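
Conceptually, a pointwise generator serves the relation table one mini-batch at a time, with each (query, document) pair treated as an independent example. The class below is a hypothetical stand-in that shows only the indexing behaviour (`len(...)` and `generator[i]`); it is not how `PointGenerator` is implemented and it skips the lookup of the actual text vectors.

```python
import math


class SimplePointBatches:
    """Hypothetical stand-in for a pointwise generator: indexable mini-batches."""

    def __init__(self, relation, batch_size=64):
        self.relation = list(relation)
        self.batch_size = batch_size

    def __len__(self):
        # number of batches, the last one possibly smaller
        return math.ceil(len(self.relation) / self.batch_size)

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        start = index * self.batch_size
        batch = self.relation[start:start + self.batch_size]
        pairs = [(id_left, id_right) for id_left, id_right, _ in batch]
        labels = [label for _, _, label in batch]
        return pairs, labels


relation = [('q0', 'd{}'.format(i), i % 2) for i in range(150)]
batches = SimplePointBatches(relation, batch_size=64)
pairs, labels = batches[0]
print(len(batches), len(pairs), len(batches[2][0]))
```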

## Training

In [34]:
from matchzoo import models, load_model
from matchzoo import losses
from matchzoo import tasks
from matchzoo import metrics
dssm_model = models.DSSMModel()

In [35]:
# handle dynamic input shapes of DSSM
input_shapes = datapack_train.context['input_shapes']
dssm_model.params['input_shapes'] = input_shapes

In [36]:
dssm_model.params['task'] = tasks.Ranking()
dssm_model.params['task'].metrics = ['mae', 'map']

In [37]:
dssm_model.guess_and_fill_missing_params()
print(dssm_model.params)

name                          DSSMModel
model_class                   <class 'matchzoo.models.dssm_model.DSSMModel'>
input_shapes                  [(20813,), (20813,)]
task                          <matchzoo.tasks.ranking.Ranking object at 0x210694f28>
optimizer                     adam
w_initializer                 glorot_normal
b_initializer                 zeros
dim_fan_out                   128
dim_hidden                    300
activation_hidden             tanh
num_hidden_layers             2


In [53]:
dssm_model.build()
dssm_model.compile()
dssm_model.fit_generator(generator_train, steps_per_epoch=1000, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x157722be0>

In [54]:
# note: this evaluates on the first training batch, not on held-out data
X, Y = generator_train[0]
dssm_model.evaluate(X, Y)



{'loss': 0.03034413605928421,
 'mean_absolute_error': 0.09994634240865707,
 'mean_average_precision(0)': 0.3829787234042553}
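
For readers unfamiliar with the two reported metrics, the sketch below computes mean absolute error and average precision for a single ranked list in plain Python. It assumes that the `(0)` in `mean_average_precision(0)` is a relevance threshold, so labels above 0 count as relevant; MatchZoo's exact definition may differ in details such as tie handling or normalization. The sample labels and scores are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between labels and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_precision(y_true, y_pred, threshold=0):
    """AP for one ranked list: labels above `threshold` count as relevant."""
    ranked = sorted(zip(y_pred, y_true), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for position, (_, label) in enumerate(ranked, start=1):
        if label > threshold:
            hits += 1
            precision_sum += hits / position  # precision at this rank
    return precision_sum / hits if hits else 0.0


y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.8, 0.4, 0.1]
mae = mean_absolute_error(y_true, y_pred)
ap = average_precision(y_true, y_pred)
print(mae, ap)
```

Mean average precision is then the mean of this value over all queries.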

## Prediction Function

In [42]:
# `topscore` and `kthscore` are module-level names; `predict_proba` below reads `kthscore`
datapack_rank = dssm_preprocessor.fit_transform(rank, stage='predict')
generator_rank = generators.PointGenerator(
    inputs=datapack_rank, task=tasks.Ranking(), batch_size=len(rank), stage='predict')
X_rank, _ = generator_rank[0]
k = 10
ranking = dssm_model.predict(X_rank)
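# scores in descending order
rank_list = sorted((r[0] for r in ranking), reverse=True)
topscore = rank_list[0]
# index k holds the (k+1)-th best score, so `score > kthscore` keeps the top k (barring ties)
kthscore = rank_list[k]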

Start processing input data for predict stage.
1it [00:00, 139.94it/s]
199it [00:06, 27.00it/s]
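
The cutoff logic in the cell above can be checked on a small list. In this sketch (with a hypothetical `top_k_threshold` helper and made-up scores), the threshold is the score at index `k` of the descending ranking, and a strict `>` comparison then selects exactly the top `k` scores as long as there are no ties at the threshold.

```python
def top_k_threshold(scores, k):
    """Return the score at index k of the descending ranking (the (k+1)-th best)."""
    ranked = sorted(scores, reverse=True)
    if k >= len(ranked):
        raise ValueError("k must be smaller than the number of scores")
    return ranked[k]


scores = [0.2, 0.9, 0.5, 0.7, 0.1]
threshold = top_k_threshold(scores, k=2)
selected = [s for s in scores if s > threshold]
print(threshold, selected)
```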


In [51]:
print(len(rank))
print(len(ranking))


199
199


In [52]:
with open("results.txt", "w") as f:
    # one line per document: document id and predicted score
    for r, score in zip(rank, ranking):
        f.write(str(r[1]) + " " + str(score[0]) + "\n")

In [19]:
import numpy as np

def predict_proba(doc_text):
    """Classifier function for LIME: score perturbed documents against the current query.

    Relies on the module-level names `qid`, `did` and `query` (set in the LIME loop)
    and `kthscore` (set in the ranking cell).
    """
    predict_data = list()
    for count, doc in enumerate(doc_text, start=1):
        # give every perturbed document its own id
        predict_data.append((qid, did + "_PRED_" + str(count), query, doc))

    datapack_predict = dssm_preprocessor.fit_transform(predict_data, stage='predict')
    generator_predict = generators.PointGenerator(
        inputs=datapack_predict, task=tasks.Ranking(), batch_size=len(doc_text), stage='predict')
    X_predict, _ = generator_predict[0]

    pred = dssm_model.predict(X_predict)
    pred_list = [p[0] for p in pred]

    # hard labels: a document counts as relevant (1) if it scores above `kthscore`
    relevant = [1 if score > kthscore else 0 for score in pred_list]
    # LIME expects one row per document: (P(irrelevant), P(relevant))
    return np.array([(1 - r, r) for r in relevant])
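
The last step of `predict_proba` turns ranking scores into the two-column array that LIME's text explainer expects from a classifier function (one row per sample, one column per class). A vectorized sketch of just that step, with a hypothetical helper name and made-up scores, looks like this:

```python
import numpy as np


def scores_to_proba(scores, threshold):
    """Turn raw scores into an (n, 2) array of [P(irrelevant), P(relevant)]."""
    relevant = (np.asarray(scores, dtype=float) > threshold).astype(float)
    return np.column_stack([1.0 - relevant, relevant])


proba = scores_to_proba([0.9, 0.3, 0.6], threshold=0.5)
print(proba.shape)
print(proba.tolist())
```

Because the values are hard 0/1 labels rather than graded probabilities, the explainer only sees whether a perturbed document crosses the cutoff, not by how much.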


## LIME Initialization

In [20]:
from lime.lime_text import LimeTextExplainer
import re

# `qid`, `did` and `query` are read by `predict_proba`, so they must be module-level names
token_pattern = re.compile(r"(?u)\b\w\w+\b")  # tokens of two or more word characters
tokenizer = token_pattern.findall
explainer = LimeTextExplainer(class_names=["irrelevant", "relevant"], split_expression=tokenizer)
for row in train:
    (qid, did, query, document_text, label) = row
    exp = explainer.explain_instance(document_text, predict_proba, num_features=10)
    print("Query:", query)
    print("Class:", label)
    #print("Document:", document_text)
    print(exp.as_list())
    print("-------------------------------------------")


Start processing input data for predict stage.
1it [00:00, 397.41it/s]
5000it [00:13, 357.74it/s]


Query: Health and Computer Terminals
Class: 1
[('union', -0.013153165917992123), ('yesterday', -0.011487588702034954), ('lie', 0.008016270926718397), ('claim', 0.0064727784825321264), ('to', 0.005746670594325541), ('costs', 0.005370372725414624), ('rule', 0.002767510543325756), ('rafiq', 0.002080211659524188), ('specialist', 0.0020101993505908995), ('employer', 0.0017860522267690092)]
-------------------------------------------


KeyboardInterrupt: 
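
The tokenizer passed to the explainer decides which units LIME can switch on and off. The pattern used here keeps runs of two or more word characters, so single-letter words and punctuation are dropped. A quick standalone check on a made-up sentence:

```python
import re

# Same pattern as the notebook's tokenizer: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(doc):
    return TOKEN_PATTERN.findall(doc)


tokens = tokenize("A union claim, costs up to 20,000 a year")
print(tokens)
```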

In [21]:
# `generator_predict` is commented out earlier, so reuse the ranking generator
X_predict, _ = generator_rank[0]
scores = dssm_model.predict(X_predict)
# show the first 10 predictions
for id_left, id_right, score, _ in zip(X_predict.id_left, X_predict.id_right, scores, range(10)):
    print("{}/{} is predicted as {}".format(id_left, id_right, score))

## Model Persistence

You can persist a trained model with `model.save()` and restore it with the `load_model` function:

In [None]:
dssm_model.save('/tmp/my_dssm_model')
loaded_dssm_model = load_model('/tmp/my_dssm_model')

In [None]:
(loaded_dssm_model.predict(X) == dssm_model.predict(X)).all()
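
A reloaded model would normally be expected to reproduce predictions exactly, which is what the `==` check above tests. If the two prediction arrays ever differ by tiny floating-point amounts (for instance after a dtype or backend change, which is an assumption and not something observed in this notebook), a tolerance-based comparison is a more forgiving alternative. The arrays below are synthetic stand-ins for the two `predict` outputs.

```python
import numpy as np

# Synthetic stand-ins for the predictions of the original and reloaded model.
original = np.array([[0.1], [0.9]], dtype=np.float64)
reloaded = original + 1e-9  # simulate a tiny numerical difference

exact_match = bool((reloaded == original).all())
close_match = bool(np.allclose(reloaded, original, atol=1e-6))
print(exact_match, close_match)
```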

## Reference

[Huang et al. 2013] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proc. CIKM. ACM, 2333–2338.