
Knowledge Base Shrink

Paper · Master thesis · YouTube video

The 768-dimensional embeddings of the 2019 Wikipedia dump (split into 100-token segments) take almost 150 GB. This poses practical issues for both research and applications. We aim to reduce the size through two methods:

Dimensionality reduction of the embedding:

  • PCA, autoencoder, random projections (see the sketch after this list)
  • Effect on inner product (IP) vs. L2 retrieval
  • Pre-processing
  • Training/evaluation data size dependency
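
As an illustration of the first method, below is a minimal sketch that reduces 768-dimensional vectors to 128 dimensions with PCA and, as a data-independent alternative, random projections. The random stand-in data, the variable names, and the 128-dimension target are illustrative assumptions, not values from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
# Stand-in for real 768-dim knowledge base embeddings.
kb_emb = rng.normal(size=(10_000, 768)).astype(np.float32)

# Pre-processing: center, then L2-normalize (see Recommendations).
kb_emb -= kb_emb.mean(axis=0)
kb_emb /= np.linalg.norm(kb_emb, axis=1, keepdims=True)

# PCA needs very little data to fit (~1k vectors).
pca = PCA(n_components=128).fit(kb_emb[:1_000])
kb_small = pca.transform(kb_emb)

# Random projection: a cheaper, data-independent alternative.
rp = GaussianRandomProjection(n_components=128).fit(kb_emb)
kb_small_rp = rp.transform(kb_emb)
```

Queries would be transformed with the same fitted model before retrieval, so that the index and the queries live in the same reduced space.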

Document splitting & filtering:

  • Split into segments respecting semantic boundaries
  • Get retrievability annotations and train a filtering system
  • Decrease knowledge base size by clustering, i.e. joining neighbours that point to the same document (see the sketch after this list)
    • Observe performance vs. cluster count
    • Cluster aggregation
    • Pre-train vs. post-train reduction effects
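
A hedged sketch of the clustering step: k-means is run within each document's segments, and every cluster is aggregated by its mean (one of several possible aggregations). The function name, the `doc_ids` array, and the cluster count are illustrative, not the repository's actual API.

```python
import numpy as np
from sklearn.cluster import KMeans

def shrink_by_doc(kb_emb, doc_ids, clusters_per_doc=2):
    """Merge segment embeddings that point to the same document."""
    new_vecs, new_ids = [], []
    for doc in np.unique(doc_ids):
        vecs = kb_emb[doc_ids == doc]
        k = min(clusters_per_doc, len(vecs))
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(vecs)
        for c in range(k):
            # Aggregate each cluster of neighbours by its centroid.
            new_vecs.append(vecs[labels == c].mean(axis=0))
            new_ids.append(doc)
    return np.stack(new_vecs), np.array(new_ids)
```

Sweeping `clusters_per_doc` then gives the performance-vs.-cluster-count curve mentioned above.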

Recommendations

  • Always use pre- and post-processing (centering & normalization).
  • PCA is a good-enough solution: it requires very little data (≈1k vectors) to fit and is stable. The autoencoder provides a slight improvement but is less stable.
  • 8-bit floats are supported and cause very little performance drop. Combine PCA with this precision reduction for the best trade-off (see the sketch below).
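
The last two recommendations combined into one end-to-end sketch: centering and normalization, PCA to 128 dimensions, then 8-bit storage. NumPy has no native 8-bit float, so per-dimension uint8 quantization stands in here for the precision reduction; the exact 8-bit format used in the paper may differ.

```python
import numpy as np
from sklearn.decomposition import PCA

def quantize_uint8(x):
    lo, hi = x.min(axis=0), x.max(axis=0)
    scale = np.maximum(hi - lo, 1e-12)  # guard against constant dimensions
    q = np.round((x - lo) / scale * 255).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    return q.astype(np.float32) / 255 * scale + lo

rng = np.random.default_rng(0)
kb_emb = rng.normal(size=(10_000, 768)).astype(np.float32)
kb_emb -= kb_emb.mean(axis=0)                            # centering
kb_emb /= np.linalg.norm(kb_emb, axis=1, keepdims=True)  # normalization

kb_small = PCA(n_components=128).fit(kb_emb[:1_000]).transform(kb_emb)
q, lo, scale = quantize_uint8(kb_small)  # stored index: 128 bytes/vector
approx = dequantize(q, lo, scale)        # cast back to float32 for search
```

At 128 uint8 values per vector instead of 768 float32 values (3072 bytes), the stored index is 24× smaller, before accounting for any accuracy loss from the reduction itself.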

Citation

@inproceedings{zouhar2022knowledge,
  title={Knowledge Base Index Compression via Dimensionality and Precision Reduction},
  author={Zouhar, Vil{\'e}m and Mosbach, Marius and Zhang, Miaoran and Klakow, Dietrich},
  booktitle={Proceedings of the 1st Workshop on Semiparametric Methods in NLP: Decoupling Logic from Knowledge},
  pages={41--53},
  year={2022},
  url={https://aclanthology.org/2022.spanlp-1.5/},
}

This project also forms the basis of a Master thesis.

Acknowledgement

  • Based on the KILT research & dataset.
  • This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) – Project-ID 232722074 – SFB 1102.
