This project evaluates the performance of different domains for pretraining language models. Here is the link to the paper, written in the IEEE CVPR conference paper style.
The data folder contains two files, bbc_news.zip and poetry.zip, which hold the two main datasets used in this project. The philosophy dataset was too large to add to this folder, so here is the link where you can download it.
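To work with the two bundled datasets, the archives need to be unpacked first. The snippet below is a minimal sketch of one way to do that, not part of the project's own code. It assumes the folder is named data and sits in the current working directory; the function name extract_archives and the choice to unpack each archive into a folder named after it are illustrative assumptions.

```python
import zipfile
from pathlib import Path


def extract_archives(data_dir="data", names=("bbc_news.zip", "poetry.zip")):
    """Unpack each named archive into a folder named after it (hypothetical helper)."""
    data_dir = Path(data_dir)
    extracted = []
    for name in names:
        archive = data_dir / name
        if not archive.is_file():
            # Skip missing archives so a partial checkout still works.
            print(f"skipping {archive}: not found")
            continue
        target = data_dir / archive.stem  # e.g. data/bbc_news
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted


if __name__ == "__main__":
    for folder in extract_archives():
        print(f"extracted to {folder}")
```

The philosophy dataset is not covered here because it is downloaded separately; if it also comes as a zip file, its file name could be passed through the names argument.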
I have also written a Medium article that summarizes the key points, methods, and insights of the paper, which you can check out here.