language_crawl

Synopsis

This custom routine scrapes a single link. I have scraped the story title, story date, and the main story body from https://www.anandabazar.com/sport/bwf-world-championships-final-pv-sindhu-vs-nozomi-okuhara-dgtl-1.1036258. Right now, scraping URLs with this tool requires a bit of scripting knowledge, because you have to identify the title and body segments of the page yourself. I will try to make the tool more flexible so that as little human intervention as possible is needed.
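
The actual spider lives in global_scrape.py (see step 3 below); what follows is only a minimal sketch of the shape such a Scrapy spider takes for this article. The spider name and output file match the repo (my_global_scraper, abp_scrap.csv), but the CSS selector strings are illustrative assumptions; you have to find the real segment selectors by inspecting the page (or by reading global_scrape.py).

```python
# Minimal sketch of a Scrapy spider for this article. The CSS selectors
# below are assumptions for illustration; the real ones are in
# language_crawl/spiders/global_scrape.py.
import scrapy


class GlobalScraper(scrapy.Spider):
    name = "my_global_scraper"  # matches the `scrapy crawl` command below
    start_urls = [
        "https://www.anandabazar.com/sport/"
        "bwf-world-championships-final-pv-sindhu-vs-nozomi-okuhara-dgtl-1.1036258"
    ]

    def parse(self, response):
        yield {
            # Hypothetical selectors: inspect the page's HTML and adjust.
            "title": response.css("h1::text").get(),
            "date": response.css("div.dateline::text").get(),
            "body": " ".join(response.css("div.articlebody p::text").getall()),
        }
```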

How to replicate my results

  1. Install Scrapy (see https://docs.scrapy.org/en/latest/intro/install.html).

  2. Clone the repo.

  3. The file you are looking for is global_scrape.py inside language_crawl/language_crawl/spiders/. Please go through the file; it should be straightforward.

  4. You can see my scraped result data in abp_scrap.csv (a short snippet for inspecting it follows these steps).

  5. To replicate my results, first go through the article once. Then open your terminal, go to the language_crawl/language_crawl directory, and run `scrapy crawl my_global_scraper -o abp_scrap.csv`.
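
Once the crawl finishes, a quick way to sanity-check abp_scrap.csv from Python is sketched below. The field names (title, date, body) are assumed from the Synopsis; check the CSV header row for the actual names the spider emits.

```python
# Sanity-check the scraped CSV. Field names are assumed; check the
# header row of abp_scrap.csv for the actual ones.
import csv

with open("abp_scrap.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["title"], "|", row["date"])
        print(row["body"][:200])  # first 200 characters of the story body
```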

About

This project crawls websites using Scrapy to build monolingual corpora.
