Nepali News crawler

Installation

To use this crawler, install Scrapy and clone this repository. The last command below runs the hamrakura crawler as a quick check that everything works.


$ pip3 install scrapy
$ git clone https://github.com/vksbhandary/nepali-news-crawler.git
$ cd nepali-news-crawler
$ scrapy crawl news_hamrakura -o hamrakura.csv -t csv
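
Once a crawl finishes, it can help to confirm that the export actually contains rows. The following is a minimal sketch, not part of this repository: `summarize_csv` is a hypothetical helper name, and the file is assumed to be UTF-8 encoded. It makes no assumption about which columns the spiders export; it only reports whatever header row it finds.

```python
import csv


def summarize_csv(path):
    """Return (column_names, row_count) for an exported CSV file.

    Hypothetical helper for illustration; the encoding is an assumption.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # Count data rows; the header row is consumed by DictReader itself.
        row_count = sum(1 for _ in reader)
        return reader.fieldnames, row_count
```

For example, `summarize_csv("hamrakura.csv")` would return the header names and the number of scraped items from the command above.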


Supported Sites

  1. hamrakura
  2. kantipurdaily
  3. onlinekhabar
  4. pahilopost
  5. wordpress website (see note 1 below)

Executing crawler

  • Executing hamrakura crawler

    
    $ scrapy crawl news_hamrakura -o hamrakura.csv -t csv
    
    
  • Executing kantipurdaily crawler

    
    $ scrapy crawl kanti_news -o kantipur.csv -t csv
    
    
  • Executing onlinekhabar crawler

    
    $ scrapy crawl news_onlinekhabar -o onlinekhabar.csv -t csv
    
    
  • Executing pahilopost crawler

    
    $ scrapy crawl news_pahilo -o file.csv -t csv
    
    
  • Executing wordpress crawler

    
    $ scrapy crawl wordpress_news -o news24nepal.csv -t csv
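
The commands above can also be run in sequence from a small script. This is a sketch under stated assumptions: the spider names and output files are copied from the commands in this section, while `build_command` and `run_all` are hypothetical helper names that do not exist in the repository. It must be run from the cloned project directory, and it defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess

# Spider names and output files copied from the commands in this README.
SPIDERS = {
    "news_hamrakura": "hamrakura.csv",
    "kanti_news": "kantipur.csv",
    "news_onlinekhabar": "onlinekhabar.csv",
    "news_pahilo": "file.csv",
    "wordpress_news": "news24nepal.csv",
}


def build_command(spider, output):
    """Return the argument list for one crawl, mirroring the README commands."""
    return ["scrapy", "crawl", spider, "-o", output, "-t", "csv"]


def run_all(spiders=SPIDERS, dry_run=True):
    """Run each spider in turn; with dry_run=True only print the commands."""
    for spider, output in spiders.items():
        cmd = build_command(spider, output)
        print(shlex.join(cmd))
        if not dry_run:
            # check=False so one failing site does not stop the remaining crawls.
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Calling `run_all(dry_run=False)` would execute the crawls one after another; the wordpress spider still needs its domain configured first.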
    
    

Note 1. To use the wordpress website example, follow these steps:

  1. Open the file spiders/wordpress.py
  2. Edit line 14 to add your domain
  3. Open your terminal and run: scrapy crawl wordpress_news -o news24nepal.csv -t csv

Note 2. This crawler uses WordPress's REST API to fetch posts, so the target website must have the REST API enabled. To check whether a WordPress website is supported by this crawler:

  1. Go to yourdomainname.com/wp-json/wp/v2/posts/
  2. If you see JSON data, the site is supported.
  3. If you see a 404 or forbidden error page, the site is not supported.
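
The check above can be approximated in code. This is a rough sketch, not the crawler's own logic: `posts_url`, `looks_supported`, and `check_site` are hypothetical names, the `https://` scheme and the `User-Agent` header are assumptions, and "a bunch of JSON data" is interpreted here as an HTTP 200 response whose body parses as a JSON list.

```python
import json
import urllib.error
import urllib.request


def posts_url(domain):
    """Build the posts endpoint URL from step 1 (https is an assumption)."""
    return "https://" + domain.strip("/") + "/wp-json/wp/v2/posts/"


def looks_supported(status, body):
    """Return True when the response is HTTP 200 and the body is a JSON list."""
    if status != 200:
        return False
    try:
        data = json.loads(body)
    except ValueError:
        # An HTML error page or other non-JSON body ends up here.
        return False
    return isinstance(data, list)


def check_site(domain, timeout=10):
    """Fetch the posts endpoint and report whether the site looks supported."""
    request = urllib.request.Request(
        posts_url(domain), headers={"User-Agent": "Mozilla/5.0"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read().decode("utf-8", "replace")
            return looks_supported(response.status, body)
    except urllib.error.HTTPError:
        # 404 and 403 (forbidden) responses are raised as HTTPError.
        return False
    except OSError:
        # DNS failures, refused connections, and timeouts.
        return False
```

Calling `check_site("yourdomainname.com")` returns `True` only when the endpoint answers with a JSON list. A site that blocks scripted requests may report `False` here even though the browser check succeeds, so the manual steps remain the reference.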
