theveryhim/Massive-text-processing

Large-scale data analysis with the PySpark framework

Some of the tasks done in this repo (on a dataset of papers):

  • Clean the texts in the title and abstract fields if needed.
  • Remove mathematical symbols, meaningless characters, stopwords, etc. from the text.
  • Calculate the number of articles in each category (e.g., hep-ph or math.CO).
  • Identify the category that has the most articles.
  • Analyze the distribution of the number of authors in each article.
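
The cleaning and per-category counting steps above could be sketched roughly as follows. This is a minimal sketch, not the repo's actual code: the column names `title`, `abstract` and `categories` (assumed to hold space-separated category tags), the helper names, and the tiny `STOPWORDS` set are all assumptions made for illustration.

```python
import re

# Tiny illustrative stopword set (assumption); use a fuller list in practice.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "we"}


def clean_text(text):
    """Drop inline $...$ math, non-letter characters and stopwords; lowercase the rest."""
    if text is None:
        return ""
    text = re.sub(r"\$[^$]*\$", " ", text)         # remove $...$ math spans
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep letters and whitespace only
    return " ".join(w for w in text.split() if w not in STOPWORDS)


def clean_and_count(df):
    """Clean title/abstract and count articles per category.

    `df` is assumed to be a PySpark DataFrame with the hypothetical columns
    `title`, `abstract` and `categories`.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    clean_udf = F.udf(clean_text, StringType())
    cleaned = (df.withColumn("title", clean_udf(F.col("title")))
                 .withColumn("abstract", clean_udf(F.col("abstract"))))

    # One row per (article, category), then count and sort descending.
    per_category = (cleaned
                    .withColumn("category", F.explode(F.split(F.col("categories"), " ")))
                    .groupBy("category").count()
                    .orderBy(F.desc("count")))
    return cleaned, per_category
```

With this layout, `per_category.first()` would give the category with the most articles. A Python UDF is used here for readability; Spark's built-in column functions would likely be faster on a dataset of this size.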

  • Filter articles that have more than three authors and list their titles and authors.
  • Plot the number of articles registered in each year.
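
One possible way to get the author-count distribution, the "more than three authors" filter and the per-year counts is sketched below. It assumes an `authors` column holding a single string of names separated by commas and/or "and", plus a date-like column whose name is passed as `date_col`; both the column names and the function names are hypothetical.

```python
import re


def count_authors(authors):
    """Count names in a string such as 'A. One, B. Two and C. Three'."""
    if not authors:
        return 0
    parts = re.split(r",|\band\b", authors)
    return sum(1 for p in parts if p.strip())


def author_and_year_views(df, threshold=3, date_col="date"):
    """Return (author-count distribution, articles above threshold, articles per year).

    `df` is assumed to be a PySpark DataFrame with `title`, `authors` and
    a date column named by `date_col` (all assumptions).
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    n_authors = F.udf(count_authors, IntegerType())
    with_counts = df.withColumn("n_authors", n_authors(F.col("authors")))

    # How many articles have 1, 2, 3, ... authors.
    distribution = with_counts.groupBy("n_authors").count().orderBy("n_authors")

    # Articles with strictly more than `threshold` authors.
    many = (with_counts.filter(F.col("n_authors") > threshold)
                       .select("title", "authors"))

    # Number of articles per year, ready to be collected and plotted.
    per_year = (df.withColumn("year", F.year(F.to_date(F.col(date_col))))
                  .groupBy("year").count()
                  .orderBy("year"))
    return distribution, many, per_year
```

The small aggregated results (`distribution`, `per_year`) can then be brought to the driver, for example with `toPandas()`, and plotted with any plotting library.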

  • Extract and display the 20 most frequently used words in the abstracts of the articles.
5 most frequent words in abstract:
model : 1188676
data : 917131
results : 859049
show : 831879
using : 809828
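
A word-frequency count like the one above could be sketched with the classic RDD word-count pattern. This is only an illustration under assumptions: the `abstract` column name and the helper names are hypothetical, and the abstracts are assumed to have had stopwords removed already.

```python
import re


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", (text or "").lower())


def top_words(df, n=20):
    """Return the n most frequent words as a list of (word, count) pairs.

    `df` is assumed to be a PySpark DataFrame with an `abstract` column.
    """
    return (df.select("abstract").rdd
              .flatMap(lambda row: tokenize(row["abstract"]))  # one record per word
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b)                 # sum counts per word
              .takeOrdered(n, key=lambda kv: -kv[1]))          # largest counts first
```

`takeOrdered` returns only the top `n` pairs to the driver, so the full word table never has to be collected.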
  • Find the articles whose abstract mentions the word "algorithm".
  • Count the number of words in the abstract of each of these articles.
  • Sort them in descending order of word count.
  • Display the five articles with the highest number of words in the abstract as the final result.
Top 5 articles with the highest word counts in their abstract (containing 'algorithm'):
Title: The Nonlinearity Coefficient - A Practical Guide to Neural Architecture
  Design, Word Count: 498
Title: Generating a Generic Fluent API in Java, Word Count: 488
Title: Boxicity and Poset Dimension, Word Count: 484
Title: An Anytime Algorithm for Optimal Coalition Structure Generation, Word Count: 484
Title: McMini: A Programmable DPOR-Based Model Checker for Multithreaded
  Programs, Word Count: 475
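
The filter, count and sort steps behind this result could look roughly like the sketch below. The column names `title` and `abstract` and the helper names are assumptions, and the match is a case-insensitive substring test, so "algorithms" and "algorithmic" also count as mentions; the repo may use a stricter rule.

```python
def word_count(text):
    """Number of whitespace-separated words in text."""
    return len((text or "").split())


def mentions(text, word="algorithm"):
    """Case-insensitive substring check (also matches 'algorithms')."""
    return word in (text or "").lower()


def longest_abstracts_mentioning(df, n=5):
    """Return the n (title, word_count) pairs with the longest matching abstracts.

    `df` is assumed to be a PySpark DataFrame with `title` and `abstract` columns.
    """
    return (df.select("title", "abstract").rdd
              .filter(lambda row: mentions(row["abstract"]))
              .map(lambda row: (row["title"], word_count(row["abstract"])))
              .takeOrdered(n, key=lambda pair: -pair[1]))  # descending by word count
```

Each returned pair can then be printed in the `Title: ..., Word Count: ...` form shown above.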

About

Using the PySpark framework to process and analyze very large text datasets
