Setup

Requires Python 3.8.

Run pip3 install -r requirements.txt

Overview

run_models.py - main file for training and validation; see below
measure_rule_time.py - measure the run time of single rules
remove_bots.py - remove bot users
label_log.py - split a log into queries with the trained model
keep_only_queries.py - keep only the queries (the last log entry before each split)
log_insight.py - basic statistics for a split log
see_saw_stats.py - statistics for see-saw behavior in a split log

Paper results

Time measurements

Specify the path to the data in the variable FILE in measure_rule_time.py and run it. The run time results for

Accuracy results

General usage: "python3 run_models.py [path_to_dataset]". The results of the paper can be reproduced with the parameter "--test=True". In this case all results are computed on the test set (with the same parameters as in the paper).

To examine the errors further, all false positive, false negative, and true positive pairs can be saved. To do this, use the parameters "--savefp=True", "--savefn=True", or "--savetp=True".

Rules

To calculate the results of the rules alone, specify the parameter "--model=none". This ensures that no model is called and all results come from the rules.

The output of the rules is in the following form.

Total pairs: n
Predicted pairs: m
Percent decided: m/n
Metrics for this pairs:
Results: ACC =  PRE =  REC =  F2 =  TP =  TN =  FP =  FN = 
Needed time: ms
Time per pair:  ms

The predicted pairs are those pairs which could be decided with enough certainty.
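The metric names in the output above can be related to the four confusion counts with a short sketch. It assumes the standard definitions of accuracy, precision, and recall, and that F2 is the usual F-beta score with beta = 2; the function name summarize and the dictionary keys are hypothetical and not taken from run_models.py.

```python
def summarize(tp, tn, fp, fn, total_pairs):
    """Derive the reported metrics from confusion counts (assumed standard definitions)."""
    predicted = tp + tn + fp + fn  # pairs the rules could decide (m)
    accuracy = (tp + tn) / predicted if predicted else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-beta with beta = 2, which favors recall over precision.
    denom = 4 * precision + recall
    f2 = 5 * precision * recall / denom if denom else 0.0
    return {
        "total_pairs": total_pairs,
        "predicted_pairs": predicted,
        "percent_decided": predicted / total_pairs if total_pairs else 0.0,
        "ACC": accuracy,
        "PRE": precision,
        "REC": recall,
        "F2": f2,
    }
```

Note that the metrics are computed over the decided pairs only, which is why the share of decided pairs is reported alongside them.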

Each rule leaves some pairs that cannot be decided with enough certainty. In our proposed method those pairs are decided with machine learning. To determine the accuracy of the rules alone, the remaining (undecided) pairs can be set to the opposite of the rule decision (i.e., if the rule is designed to split pairs, all uncertain entries are merged). To do this, supply the parameter "--finalrules=True". Keep in mind that in this case the percentage of decided pairs is misleading.
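The fallback described for "--finalrules=True" can be illustrated as follows. This is a minimal sketch under the assumption that a decision is one of "split", "merge", or undecided (None); the names finalize and rule_decides are hypothetical and do not come from the repository.

```python
SPLIT, MERGE, UNDECIDED = "split", "merge", None

def finalize(decisions, rule_decides):
    """Give every undecided pair the opposite of the decision the rule is designed to make."""
    fallback = MERGE if rule_decides == SPLIT else SPLIT
    return [fallback if d is UNDECIDED else d for d in decisions]
```

For a rule that only ever splits, finalize([SPLIT, None, None], SPLIT) turns both undecided entries into merges, so every pair then counts as decided. This is why the percentage of decided pairs no longer says anything about the rule itself.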

For the paper results the following rulesets were used:

time_gap - split after 5 minutes
containment - check for string containment after the 5 minutes split
lexical_similarity - for undecided pairs invoke the lexical similarity step
lexical_dissimilarity - for undecided pairs invoke the lexical dissimilarity step
cetendil - decision by edit distance by Cetindil et al.
kim - decision by edit distance and time by Kim et al.
hagen - steps 1 to 3 of cascading log segmentation by Hagen et al.
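How the first two rulesets build on each other can be sketched as a small cascade. This is an illustration only, not the code in src: the entry layout (a dict with "time" and "text"), the strict "more than five minutes" comparison, and the assumption that containment leads to a merge are all hypothetical choices made for the example.

```python
from datetime import datetime, timedelta

MAX_GAP = timedelta(minutes=5)

def time_gap_rule(prev, curr):
    """Split when more than five minutes separate two consecutive log entries."""
    if curr["time"] - prev["time"] > MAX_GAP:
        return "split"
    return None  # undecided, hand over to the next rule

def containment_rule(prev, curr):
    """Assumed behavior: merge when one entry's text contains the other's."""
    a, b = prev["text"].lower(), curr["text"].lower()
    if a in b or b in a:
        return "merge"
    return None

def decide(prev, curr, rules):
    """Apply the rules in order and return the first decision, or None if all abstain."""
    for rule in rules:
        decision = rule(prev, curr)
        if decision is not None:
            return decision
    return None
```

A pair for which decide returns None corresponds to an undecided pair in the output described above.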

python3 run_models.py [path_to_dataset] --model=none --rules=time --test=True
python3 run_models.py [path_to_dataset] --model=none --rules=time --finalrules=True --test=True
python3 run_models.py [path_to_dataset] --model=none --rules=containment --test=True
python3 run_models.py [path_to_dataset] --model=none --rules=containment --finalrules=True --test=True
python3 run_models.py [path_to_dataset] --model=none --rules=lexical_similarity --test=True
python3 run_models.py [path_to_dataset] --model=none --rules=lexical_similarity --finalrules=True --test=True
python3 run_models.py [path_to_dataset] --model=none --rules=lexical_dissimilarity --test=True
python3 run_models.py [path_to_dataset] --model=none --rules=lexical_dissimilarity --finalrules=True --test=True
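The eight invocations above follow one pattern (four rulesets, each with and without "--finalrules=True"), so they can also be generated instead of typed. The helper below is a hypothetical convenience sketch and not part of the repository; it uses only the flags shown above and assumes the script file is run_models.py, as listed in the Overview.

```python
RULESETS = ["time", "containment", "lexical_similarity", "lexical_dissimilarity"]

def build_commands(dataset_path, script="run_models.py"):
    """Return one argument list per rule-only evaluation run."""
    commands = []
    for ruleset in RULESETS:
        for final in (False, True):
            cmd = ["python3", script, dataset_path, "--model=none", f"--rules={ruleset}"]
            if final:
                cmd.append("--finalrules=True")
            cmd.append("--test=True")
            commands.append(cmd)
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) or printed with shlex.join(cmd) for inspection before running.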

Model

For training the model behind a set of rules the following command can be used:

python3 run_models.py [path_to_dataset] --model=lgr --rules=[rule_set] --train=True

The rule set used in the paper is lexical_dissimilarity. This command runs each pair through the rules of the rule set; pairs that are still undecided afterwards are used to train the logistic regression.
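The selection of training data described here can be pictured as a partition of the pairs. The sketch below is an assumption-laden illustration, not the logic in run_models.py: the function name split_by_rules and the convention that a rule returns None when it abstains are hypothetical.

```python
def split_by_rules(pairs, rules):
    """Partition pairs into rule-decided ones and those left over for the model."""
    decided, undecided = [], []
    for pair in pairs:
        decision = None
        for rule in rules:
            decision = rule(pair)
            if decision is not None:
                break  # first rule that commits wins
        if decision is None:
            undecided.append(pair)  # training material for the classifier
        else:
            decided.append((pair, decision))
    return decided, undecided
```

Only the undecided list would then be turned into features and labels for the logistic regression (for example with scikit-learn, which is an assumption, since the library is not named here). Training on the leftovers only means the model concentrates on the cases the rules cannot handle.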

After training a model the results can be obtained by using

python3 run_models.py [path_to_dataset] --model=lgr --rules=[rule_set] --modelpath=[path to the model] --test=True

The pre-trained model used in the paper is included as well. Its file name is trained_model_paper.pyc.

webis-de/sigir21-identifying-queries-in-instant-search-logs
