Implementation of DeepTileBars
data
eval
.gitattributes
.gitignore
README.md
extract_file.py
preprocess.py
rank.py
text2img.py
texttiling.py
trecweb_parser.py
utils.py


DeepTileBars-release

Implementation of DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval

Dependencies

pyspark
nltk
BeautifulSoup
keras
krovetzstemmer
gensim

Running the model

0 Data Preparation

  • Trained gensim word2vec model: name the model word2vec.100 and put it, along with its auxiliary files, in the data/ directory.

  • Inverse Document Frequency (IDF) file: data/term2idf.json, which is essentially a dictionary storing the mapping word -> idf.

  • Query file: download it from TREC, unzip it, and place it at data/08.million-query-topics.

  • LETOR-MQ2008 file: ./MQ2008/ is the folder downloaded from Microsoft.
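The IDF file described above is a JSON dictionary mapping each word to its IDF value. The README does not say how the values were computed, so the following is only a minimal sketch of one way to produce such a file. The formula `log(N / df)`, the function names `build_term2idf` and `save_term2idf`, and the toy documents are all assumptions for illustration, not part of this repo.

```python
import json
import math
from collections import Counter


def build_term2idf(tokenized_docs):
    """Return {term: idf} with idf = log(N / df).

    The exact IDF formula is an assumption; the repo only requires
    a word -> idf mapping.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def save_term2idf(term2idf, path):
    """Write the mapping as JSON, e.g. to data/term2idf.json."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(term2idf, f)


if __name__ == "__main__":
    # Toy corpus; real input would be the tokenized (and stemmed) collection.
    docs = [["neural", "retrieval"], ["neural", "ranking"], ["tile", "bars"]]
    print(json.dumps(build_term2idf(docs), indent=2))
```

In practice the tokens should go through the same tokenization and stemming as the rest of the pipeline so that the keys match the terms looked up later.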

1 Preprocessing

python preprocess.py

2 Extracting and cleaning documents

spark-submit --master [your-spark-cluster] --py-files trecweb_parser.py extract_file.py /path/to/corpus /path/to/clean-file

3 TextTiling

Warning: Python 3 users may need to fix a bug in NLTK by following this post.

Update: As far as we know, NLTK 3.3.0 has fixed this bug.

spark-submit --master [your-spark-cluster] texttiling.py /path/to/clean-file /path/to/segmented-file

4 Coloring

spark-submit --master [your-spark-cluster] text2img.py /path/to/segmented-file /path/to/images
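To give a rough sense of what "coloring" a segmented document can mean, here is a small sketch of a TileBars-style grid: one row per query term, one column per text segment, and each cell holding how often the term occurs in that segment. This is an illustration only. It is not the code in text2img.py, the function name `term_segment_grid` is hypothetical, and the features the repo actually writes into its images may differ.

```python
def term_segment_grid(query_terms, segments):
    """Build a query-term x segment grid of raw term counts.

    query_terms: list of query tokens (rows).
    segments: list of token lists, one per text segment (columns).
    Raw counts are an assumption; the real pipeline may use other features.
    """
    return [[segment.count(term) for segment in segments] for term in query_terms]


if __name__ == "__main__":
    query = ["neural", "ranking"]
    segments = [
        ["neural", "ranking", "neural"],
        ["tile", "bars"],
        ["ranking", "model"],
    ]
    for term, row in zip(query, term_segment_grid(query, segments)):
        print(term, row)
```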

5 Run the model

python rank.py /path/to/images epochs

e.g.

python rank.py ./img 5
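Steps 1 to 5 have to run in order, with each step reading the output path of the previous one. The sketch below only assembles the command lines from this README so the paths stay consistent; it does not execute anything. The helper name `pipeline_commands` is hypothetical, and `local[*]` (Spark's local mode) is just a placeholder for your own Spark master.

```python
import shlex


def pipeline_commands(master, corpus, clean, segmented, images, epochs):
    """Return the README's commands, in order, as argument lists."""
    spark = ["spark-submit", "--master", master]
    return [
        ["python", "preprocess.py"],
        spark + ["--py-files", "trecweb_parser.py", "extract_file.py", corpus, clean],
        spark + ["texttiling.py", clean, segmented],
        spark + ["text2img.py", segmented, images],
        ["python", "rank.py", images, str(epochs)],
    ]


if __name__ == "__main__":
    commands = pipeline_commands(
        master="local[*]",  # placeholder; use your own Spark master
        corpus="/path/to/corpus",
        clean="/path/to/clean-file",
        segmented="/path/to/segmented-file",
        images="./img",
        epochs=5,
    )
    for cmd in commands:
        # Print shell-quoted commands; pass each list to subprocess.run to execute.
        print(shlex.join(cmd))
```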

Citation

If you are using this repo, please cite the following paper:

@inproceedings{deeptilebars2018,
    title={DeepTileBars: Visualizing Term Distribution for Neural Information Retrieval},
    author={Tang, Zhiwen and Yang, Grace Hui},
    booktitle={AAAI 2019},
    year={2019}
}