cafe-map-crawler

Description

A web crawler that collects basic information on 82,463 restaurants in Shanghai: the title, rating, area, address, and average price per person of each restaurant.

Purpose

I implemented this project mainly to get familiar with web crawling, especially (1) sending HTTP requests and parsing the returned HTML, and (2) getting around the scraping restrictions imposed by the website.
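As a rough illustration of the HTML-parsing half of point (1), the sketch below pulls the five fields listed in the Description out of a small HTML fragment using only the Python standard library. The markup, the class names, and the `ShopParser` / `parse_shop` names are all hypothetical; the real Dianping page structure and this project's actual parsing code are not shown here and may look quite different.

```python
from html.parser import HTMLParser

# Hypothetical markup, invented for this example. It is NOT the real
# Dianping page structure.
SAMPLE_HTML = """
<div class="shop">
  <h4 class="title">Example Cafe</h4>
  <span class="rating">4.5</span>
  <span class="area">Xuhui</span>
  <span class="address">123 Example Road</span>
  <span class="avg-price">58</span>
</div>
"""

# Class names (assumed) that mark the fields we want to keep.
FIELDS = {"title", "rating", "area", "address", "avg-price"}


class ShopParser(HTMLParser):
    """Collects the text of every element whose class is in FIELDS."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in FIELDS:
            self._current = cls

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.record[self._current] = text
            self._current = None

    def handle_endtag(self, tag):
        # Stop reading once the element closes, even if it was empty.
        self._current = None


def parse_shop(html):
    """Return a dict mapping field name to text for one shop fragment."""
    parser = ShopParser()
    parser.feed(html)
    parser.close()
    return parser.record


record = parse_shop(SAMPLE_HTML)
print(record["title"], record["rating"], record["avg-price"])
```

In a Scrapy project the same extraction would more likely be written with the framework's selectors; the standard-library parser is used here only to keep the example self-contained.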

I think it is useful to know how to prevent malicious users from scraping valuable data from a company's website. However, just as a coach who does not know the game cannot be a good coach, learning anti-scraping techniques first requires knowing how to scrape and how scrapers get around such restrictions. I found it very interesting to explore different ways to work around the website's crawling defenses while still following the robots exclusion protocol (robots.txt). Crawling some websites can be very tough!
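Following the robots rules can be checked in code before any page is requested. The sketch below uses Python's standard `urllib.robotparser` on an inline, made-up robots.txt; the rules, the `example.com` URLs, the `example-crawler` agent name, and the `build_rules` helper are assumptions for illustration and do not reflect Dianping's actual robots.txt or this project's code.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, used only for this example.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

AGENT = "example-crawler"  # hypothetical user-agent name


def build_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    rules = RobotFileParser()
    rules.parse(robots_text.splitlines())
    return rules


rules = build_rules(SAMPLE_ROBOTS)

# Ask before fetching: is this URL allowed for our user agent?
allowed = rules.can_fetch(AGENT, "https://example.com/shop/1")
blocked = rules.can_fetch(AGENT, "https://example.com/private/x")

# Seconds to wait between requests, or None if the file sets no delay.
delay = rules.crawl_delay(AGENT)

print(allowed, blocked, delay)
```

Scrapy also offers a built-in `ROBOTSTXT_OBEY` setting that applies these rules automatically; whether this project enables it is not stated in this README.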

Dependencies

Comments

More details

All data was collected from public Dianping pages and is for academic/non-profit use only. The data has not been distributed anywhere.


raphaellu/cafe-map-crawler
