rotemple/city-data-com-corpus-scripts

city-data-com-corpus-scripts

City-Data.com Corpus Python Recipes

This repository contains Python recipes to process and topic model the City-Data.com Corpus (Omizo, 2023).

Maintainer

Ryan M. Omizo

Credits

The Python code that calculates topic diversity is from Terragni (2023). All rights reserved.
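
For readers unfamiliar with the metric, one widely used topic diversity measure is the proportion of unique words among the top-k words of all topics. The sketch below is illustrative only and is not the code from Terragni (2023): the function name `proportion_unique_words`, the `topk` parameter, and the example topics are assumptions made for this example.

```python
def proportion_unique_words(topics, topk=10):
    """Return the fraction of distinct words among the top-k words of all topics.

    `topics` is a list of word lists, each ordered from most to least probable.
    A score near 1.0 means the topics share few top words; a score near 0.0
    means the topics are largely redundant.
    """
    if not topics:
        raise ValueError("topics must not be empty")
    if any(len(topic) < topk for topic in topics):
        raise ValueError(f"every topic needs at least {topk} words")

    unique_words = set()
    for topic in topics:
        unique_words.update(topic[:topk])

    return len(unique_words) / (topk * len(topics))


# Hypothetical topics, invented for illustration (not corpus output).
example_topics = [
    ["school", "district", "teacher", "student"],
    ["rent", "apartment", "lease", "landlord"],
    ["school", "rent", "commute", "traffic"],
]

score = proportion_unique_words(example_topics, topk=4)
print(f"Topic diversity: {score:.3f}")
```

Here the three topics contain 12 top words in total, 10 of them distinct, so the score is 10/12.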

References

Mikolov, T., Grave, E., Bojanowski, P., Puhrsch, C., & Joulin, A. (2017). Advances in pre-training distributed word representations. arXiv preprint arXiv:1712.09405.

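Mimno, D., Wallach, H., Talley, E., Leenders, M., & McCallum, A. (2011). Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (pp. 262–272).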

Omizo, R. (2023). City-Data.com Corpus [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10086354

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Řehůřek, R., & Sojka, P. (2011). Gensim: Statistical semantics in Python. Retrieved from gensim.org.

Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Terragni, S. (2023). A collection of Topic Diversity measures for topic modeling. [Python]. https://github.com/silviatti/topic-model-diversity (Original work published 2020)