GitHub - spatial-computing/generating-place-datasets-from-web: generating-place-datasets-from-web

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
ReadMe.txt		ReadMe.txt
_URLS.txt		_URLS.txt
findF.py		findF.py
findS.py		findS.py
geocode.py		geocode.py
googlecrawler.py		googlecrawler.py
input.txt		input.txt
newkeywords		newkeywords
place_type.txt		place_type.txt
user_agents		user_agents

Repository files navigation

1) For executing the script make sure you have following python packages installed in system.
-urllib2,re,random,time,os,sys,bs4,datetime,operator,csv,StringIO,gzip,geopy

You can install on ubuntu based systems using PIP as
sudo pip install <package name>

Install pip using
sudo apt-get install pip

2) Configure following two files <Please do not rename these files other wise script will be required to be changed for the same>
    a) input.txt:
        first line containing the city name and following lines containing the street names
        ex:
        Glendale
        S Glendale Ave
        S Brand Blvd
        S Central Ave
        E California Ave
        
    b) place_type.txt:
       Each line in this file corresponds to place types that will be searched in the specified area.
       For enabling a type of area just remove the # from the place type line and similarily for disabling add a # sign in front of place   type    .
       ex:
       #church
       school
       #hospital
       #club
       Script will only search for "school" in the given area for this place_type.txt

3) Run script now using:
    > python googlecrawler.py input.txt 
   Do not run parallel execution of script in same folder if so required duplicate the web_scrapper folder and run it separately.
   
   If any network error encountered re run the script it will resume from last execution step or even kill or stop script it will resume from last execution step.
   For starting a clean start and ignoring old files use -n option as:
   > python googlecrawler.py input.txt -n
   
   For starting a clean start and delete old files use -c option as:
   > python googlecrawler.py input.txt -c