Geospatial Semantic Pattern Recognition in Volunteered Geographic Data Using the Random Forest Algorithm
Master of Spatial Analysis, Ryerson University, 2016
Thesis Defended on April 27, 2016
Supervised by Dr. Claus Rinner
The ubiquitous availability of location technologies has enabled users worldwide to produce large quantities of Volunteered Geographic Data (VGD). VGD has been a cost-effective and scalable way to obtain unique and freely available geospatial data. However, VGD suffers from reliability issues because user behaviour is often variable, and the sheer quantity of user-generated data makes manual assessment inefficient, expensive, and impractical. This research used a random forest algorithm based on geospatial semantic variables to aid the improvement and understanding of multi-class VGD without ground-truth reference data. An automated Python script implementing a random forest-based procedure was developed. A demonstration of the script on OpenStreetMap (OSM) data with user-generated tags in Toronto, Ontario, was effective in recognizing patterns in the OSM data, with a predictive performance of ~0.71 (where 0 is the worst and 1 is the best) based on a class-weighted metric, and was able to reveal variable influences and outliers.
The code was written in Python 3.5 and has been tested with the Mapzen Toronto data on Windows and Linux operating systems. The code is described in Section 4 of the PDF; it uses a tree-optimized random forest model to learn geospatial patterns for the prediction and outlier detection of known spatial object classes (Figure 1).
Figure 1. Flowchart of code process
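The abstract reports performance as a class-weighted score between 0 and 1. As a rough illustration of what such a score can look like, the sketch below computes a support-weighted F1 score in plain Python. This is an assumption for illustration: the exact metric used in the thesis is defined in the PDF, and the function name `weighted_f1` and the tag values are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (0 is worst, 1 is best).

    Assumption: this is one common class-weighted metric, not necessarily
    the one used in the thesis.
    """
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to its share of the samples.
        score += f1 * n_true / total
    return score


# Hypothetical OSM-style tag values, for illustration only.
y_true = ["residential", "residential", "residential", "park", "school"]
y_pred = ["residential", "residential", "park", "park", "residential"]
print(round(weighted_f1(y_true, y_pred), 3))
```

Weighting by class support means that frequent classes dominate the score, which matters for multi-class data where some tags are far more common than others.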
- Install Anaconda Python 3.5 for Windows
- Download wheel files: GDAL, Fiona, pyproj, and shapely for Python 3.5 (cp35)
- Uninstall existing OSGeo4W, GDAL, Fiona, pyproj, or shapely libraries
- Navigate to downloaded wheel files using the console
- Install the wheel (.whl) files and libraries using
64-bit Example (same wheel files used in the thesis)
    cd path/to/downloaded_wheels
    pip install GDAL-2.0.3-cp35-cp35m-win_amd64.whl
    pip install Fiona-1.7.0-cp35-cp35m-win_amd64.whl
    pip install pyproj-184.108.40.206-cp35-cp35m-win_amd64.whl
    pip install Shapely-1.5.16-cp35-cp35m-win_amd64.whl
    pip install geopandas
    pip install joblib
    pip install seaborn
    pip install treeinterpreter
    pip install tqdm
    conda install -c ioos rtree
32-bit Example
    cd path/to/downloaded_wheels
    pip install GDAL-2.0.3-cp35-cp35m-win32.whl
    pip install Fiona-1.7.0-cp35-cp35m-win32.whl
    pip install pyproj-220.127.116.11-cp35-cp35m-win32.whl
    pip install Shapely-1.5.16-cp35-cp35m-win32.whl
    pip install geopandas
    pip install joblib
    pip install seaborn
    pip install treeinterpreter
    pip install tqdm
    conda install -c ioos rtree
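Which set of wheel files applies depends on whether the installed Python interpreter is 32-bit or 64-bit, not only on the operating system. The following is a minimal sketch for checking this before downloading wheels; the helper name `wheel_tags` is hypothetical and the check uses only the standard library.

```python
import struct
import sys


def wheel_tags():
    """Return the (python_tag, platform_tag) pair matching this interpreter.

    Assumes a Windows install, as in the steps above; the platform tag
    is only meaningful there.
    """
    # Pointer size in bits: 64 for a 64-bit interpreter, 32 for a 32-bit one.
    bits = struct.calcsize("P") * 8
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, platform_tag


python_tag, platform_tag = wheel_tags()
print("Look for wheels tagged {} and {}".format(python_tag, platform_tag))
```

With the Anaconda Python 3.5 install described above, the Python tag should come out as cp35, matching the wheel file names.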
- Install Anaconda Python 3.5 for Linux
- Install libraries using
    pip install treeinterpreter
    pip install tqdm
    conda install -c conda-forge geopandas
    conda install joblib
    conda install seaborn
    conda install -c ioos rtree
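After either installation, it can help to confirm that the libraries are importable before running the thesis code. The sketch below is one way to do that with the standard library. The module list is inferred from the install commands above (GDAL is imported as `osgeo`), and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names for the packages installed above (GDAL is imported as "osgeo").
REQUIRED_MODULES = (
    "osgeo", "fiona", "pyproj", "shapely", "geopandas",
    "joblib", "seaborn", "treeinterpreter", "tqdm", "rtree",
)


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("Missing modules:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be located; it does not verify versions or that compiled dependencies load correctly.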
- Download this repository
- Unzip the file and navigate to the code folder
- Execute the code using
    cd path/to/msa-thesis-master/py
    python thesis.py config.txt path/to/output_folder
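For repeated runs, the same command can be assembled from Python so that it always uses the interpreter in which the libraries were installed. This is a hypothetical wrapper, not part of the repository: the function name `build_run_command` is an assumption, and the paths are the same placeholders as in the command above.

```python
import os
import sys


def build_run_command(repo_dir, output_dir, config_name="config.txt"):
    """Return (command, working_directory) for running thesis.py.

    Hypothetical helper: mirrors the documented command line and nothing more.
    """
    code_dir = os.path.join(repo_dir, "py")
    # The output path is made absolute because the command runs from code_dir.
    cmd = [sys.executable, "thesis.py", config_name, os.path.abspath(output_dir)]
    return cmd, code_dir


# Placeholder paths, as in the command above.
cmd, cwd = build_run_command("path/to/msa-thesis-master", "path/to/output_folder")
print(cwd, cmd)
```

Passing the result to `subprocess.run(cmd, cwd=cwd, check=True)` would then start the run from the code folder.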
The config file can be used to apply and alter the methods to other datasets.
Please see Section 4.1 in the PDF for more details.
Note: The unedited config.txt file contains the settings used to obtain results for the most recent Mapzen Toronto data. The data used in the thesis is provided in the reproduce release, which contains instructions for reproducing the thesis results.
- Date: April 27, 2016
- Time: 2:00 p.m. to 4:00 p.m.
- Location: Jorgenson Hall 730, Ryerson University, Toronto, ON
- Chair: Dr. Lu Wang
- Examiner 1: Dr. Eric Vaz
- Examiner 2: Dr. Tony Hernandez
- Result: Pass with minor revisions
- Windows 8.1 64-bit
- i7-6700k 4.0 GHz Quad-Core
- 16 GB DDR4 2133 RAM
- 256 GB SSD + 512 GB SSD (Read: Up to 540 MB/sec, Write: Up to 520 MB/sec)
- Runtime: ~30-45 minutes
Virtual machine generously provided by Ryerson RC4:
- Debian Linux
- 6-Core CPU
- 6 GB RAM
- 66 GB Storage
- Runtime: ~50-60 minutes