Web Mining

  1. scrapy
    Scrapy is a fast, high-level screen-scraping and web-crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
    Project Source: https://github.com/scrapy/scrapy
    Project Homepage: http://scrapy.org/

  2. Pattern
    Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
    Project Source: https://github.com/clips/pattern
    Project Homepage: http://www.clips.ua.ac.be/pages/pattern

  3. portia
    Portia is a tool for visually scraping websites without any programming knowledge.
    Project Source: https://github.com/scrapinghub/portia

  4. python-goose
    HTML content / article extractor, web scraping library in Python.
    Project Source: https://github.com/grangier/python-goose

  5. newspaper
    News extraction, article extraction and content curation in Python.
    Project Source: https://github.com/codelucas/newspaper
    Project Homepage: http://newspaper.readthedocs.org/en/latest/

  6. gensim
    Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.
    Project Source: https://github.com/piskvorky/gensim
    Project Homepage: http://radimrehurek.com/gensim/

  7. distribute_crawler
    A distributed web crawler.
    Project Source: https://github.com/gnemoug/distribute_crawler

  8. pyspider
    A spider system in Python.
    Project Source: https://github.com/binux/pyspider

  9. tagger
    A Python module for extracting relevant tags from text documents.
    Project Source: https://github.com/apresta/tagger

  10. cola
    A distributed crawling framework.
    Project Source: https://github.com/chineking/cola
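
Several of the projects above (scrapy, pyspider, cola) crawl pages and extract structured data from them. As a rough illustration of that extraction step only, here is a minimal sketch that uses nothing but the Python standard library. It does not use or represent the API of any project in this list; `LinkExtractor`, `extract` and the sample HTML are hypothetical names made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the page title and absolute link targets from an HTML string.

    Hypothetical helper for illustration; not part of any listed project.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html, base_url):
    """Return a small structured record for one page."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Assumed sample input; a real crawler would download this over HTTP.
SAMPLE_HTML = """
<html><head><title> Example page </title></head>
<body><a href="/about">About</a> <a href="http://example.org/x">X</a> <a>no href</a></body></html>
"""

result = extract(SAMPLE_HTML, "http://example.com/index.html")
print(result)
```

A crawler would repeat this step for every URL in `result["links"]`, keeping track of pages already visited. The frameworks listed above add scheduling, politeness, retries and distribution on top of that loop.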