Some of the tasks done in this repo (papers dataset):
- Clean the text in the title and abstract fields where needed.
- Remove mathematical symbols, meaningless characters, stopwords, etc. from the text.
- Count the number of articles in each category (e.g., hep-ph or math.CO).
- Identify the category that has the most articles.
- Analyze the distribution of the number of authors per article.
- Extract and display the 20 most frequently used words in the article abstracts.
The 5 most frequent words in the abstracts:
model : 1188676
data : 917131
results : 859049
show : 831879
using : 809828
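
The cleaning and counting steps above could be sketched roughly as follows. This is a minimal illustration, not the code used in this repo: the field names `categories` (assumed to be a space-separated string) and `authors` (assumed to be a list), the tiny `STOPWORDS` set, and the two sample records are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "we", "is",
             "for", "on", "that", "this", "with"}


def clean_text(text):
    """Lowercase, drop inline LaTeX math and non-letter characters, remove stopwords."""
    text = re.sub(r"\$[^$]*\$", " ", text)         # inline math such as $x^2$
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # digits, punctuation, symbols
    return [w for w in text.split() if w not in STOPWORDS and len(w) > 1]


def category_counts(papers):
    """Count articles per category ('categories' is assumed space-separated)."""
    counts = Counter()
    for paper in papers:
        counts.update(paper["categories"].split())
    return counts


def author_count_distribution(papers):
    """Map number of authors -> number of articles ('authors' is assumed a list)."""
    return Counter(len(paper["authors"]) for paper in papers)


def top_words(papers, n=20):
    """Return the n most frequent cleaned words across all abstracts."""
    counts = Counter()
    for paper in papers:
        counts.update(clean_text(paper["abstract"]))
    return counts.most_common(n)


# Hypothetical sample records, only to make the sketch runnable.
papers = [
    {"abstract": "We show that the model fits the data.",
     "categories": "cs.LG stat.ML", "authors": ["A", "B"]},
    {"abstract": "The model uses $x^2$ data; results show gains.",
     "categories": "cs.LG", "authors": ["C"]},
]

if __name__ == "__main__":
    print(category_counts(papers).most_common(1))  # category with the most articles
    print(author_count_distribution(papers))
    for word, count in top_words(papers, n=5):
        print(f"{word} : {count}")
```

`Counter.most_common(1)` on the category counts gives the category with the most articles, and the same `Counter` type serves for the author distribution and the word frequencies.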
- Find the articles whose abstract mentions the word "algorithm".
- Count the number of words in the abstract of each of these articles.
- Sort them in descending order of word count.
- Display the five articles with the highest word counts as the final result.
Top 5 articles with the highest word counts in their abstract (containing 'algorithm'):
Title: The Nonlinearity Coefficient - A Practical Guide to Neural Architecture Design, Word Count: 498
Title: Generating a Generic Fluent API in Java, Word Count: 488
Title: Boxicity and Poset Dimension, Word Count: 484
Title: An Anytime Algorithm for Optimal Coalition Structure Generation, Word Count: 484
Title: McMini: A Programmable DPOR-Based Model Checker for Multithreaded Programs, Word Count: 475
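
The "algorithm" filter and ranking could be sketched like this. It is an illustration under assumptions rather than the repo's actual code: the match is taken to be whole-word and case-insensitive, words are counted by splitting the raw abstract on whitespace, and the function name and sample records are hypothetical.

```python
import re


def top_algorithm_articles(papers, keyword="algorithm", n=5):
    """Return (title, word_count) for the n longest abstracts mentioning keyword."""
    # Assumption: whole-word, case-insensitive match (so "algorithms" is not matched).
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    matches = [
        # Collapse any line breaks inside the title so it prints on one line.
        (" ".join(paper["title"].split()), len(paper["abstract"].split()))
        for paper in papers
        if pattern.search(paper["abstract"])
    ]
    matches.sort(key=lambda item: item[1], reverse=True)  # longest abstract first
    return matches[:n]


# Hypothetical sample records, only to make the sketch runnable.
sample = [
    {"title": "A", "abstract": "An algorithm for sorting."},
    {"title": "B", "abstract": "No match here."},
    {"title": "C\nContinued", "abstract": "This Algorithm is a longer one indeed."},
]

if __name__ == "__main__":
    for title, word_count in top_algorithm_articles(sample):
        print(f"Title: {title}, Word Count: {word_count}")
```

Swapping the pattern for a plain substring test would also catch plural forms such as "algorithms", at the cost of matching inside unrelated longer words.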