language_crawl

Synopsis

This custom routine scrapes a single link. I have scraped the story title, story date, and the main story body from https://www.anandabazar.com/sport/bwf-world-championships-final-pv-sindhu-vs-nozomi-okuhara-dgtl-1.1036258. Right now, scraping URLs with this tool requires a bit of scripting knowledge, because you have to identify the title and body segments of the page yourself. I will try to make the tool more flexible so that as little human intervention as possible is needed.
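
The actual spider lives in global_scrape.py (see step 3 below); what follows is only a minimal sketch of the shape such a Scrapy spider takes for this article. The spider name and output file match the repo (my_global_scraper, abp_scrap.csv), but the CSS selector strings are illustrative assumptions; you have to find the real segment selectors by inspecting the page (or by reading global_scrape.py).

```python
# Minimal sketch of a Scrapy spider for this article. The CSS selectors
# below are assumptions for illustration; the real ones are in
# language_crawl/spiders/global_scrape.py.
import scrapy


class GlobalScraper(scrapy.Spider):
    name = "my_global_scraper"  # matches the `scrapy crawl` command below
    start_urls = [
        "https://www.anandabazar.com/sport/"
        "bwf-world-championships-final-pv-sindhu-vs-nozomi-okuhara-dgtl-1.1036258"
    ]

    def parse(self, response):
        yield {
            # Hypothetical selectors: inspect the page's HTML and adjust.
            "title": response.css("h1::text").get(),
            "date": response.css("div.dateline::text").get(),
            "body": " ".join(response.css("div.articlebody p::text").getall()),
        }
```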

How to replicate my results

  1. Install Scrapy (see https://docs.scrapy.org/en/latest/intro/install.html).

  2. Clone the repo.

  3. The file you are looking for is global_scrape.py inside language_crawl/language_crawl/spiders/. Please go through the file; it should be straightforward.

  4. You can see my scraped result data in abp_scrap.csv (a short snippet for inspecting it follows these steps).

  5. To replicate my results, first go through the article once. Then open your terminal, go to the language_crawl/language_crawl directory, and run `scrapy crawl my_global_scraper -o abp_scrap.csv`.
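
Once the crawl finishes, a quick way to sanity-check abp_scrap.csv from Python is sketched below. The field names (title, date, body) are assumed from the Synopsis; check the CSV header row for the actual names the spider emits.

```python
# Sanity-check the scraped CSV. Field names are assumed; check the
# header row of abp_scrap.csv for the actual ones.
import csv

with open("abp_scrap.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["title"], "|", row["date"])
        print(row["body"][:200])  # first 200 characters of the story body
```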

About

This project crawls websites using Scrapy to build monolingual corpora.
