Geospatial Semantic Pattern Recognition in Volunteered Geographic Data Using the Random Forest Algorithm
Richard Wen
rwen@ryerson.ca
Master of Spatial Analysis, Ryerson University, 2016
Thesis Defended on April 27, 2016
Supervised by Dr. Claus Rinner
The ubiquitous availability of location technologies has enabled users worldwide to produce large quantities of Volunteered Geographic Data (VGD). VGD has been a cost-effective and scalable way to obtain unique and freely available geospatial data. However, VGD suffers from reliability issues because user behaviour is often variable, and the large quantities involved make manual assessment of the user-generated data inefficient, expensive, and impractical. This research used a random forest algorithm based on geospatial semantic variables to aid the improvement and understanding of multi-class VGD without ground-truth reference data. An automated Python script implementing the random forest based procedure was developed. A demonstration of the script on OpenStreetMap (OSM) data with user-generated tags in Toronto, Ontario, was effective in recognizing patterns in the OSM data, with predictive performances of ~0.71 (where 0 is the worst and 1 is the best) based on a class-weighted metric, and was able to reveal variable influences and outliers.
The code was written in Python 3.5 and has been tested with the Mapzen Toronto data on Windows and Linux operating systems. The code is described in Section 4 of the PDF, which used a tree-optimized random forest model to learn geospatial patterns for the prediction and outlier detection of known spatial object classes (Figure 1).
Figure 1. Flowchart of code process
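To make the 0-to-1 performance scale more concrete, here is a minimal sketch of one common class-weighted metric, the support-weighted F1 score. This is an assumption for illustration: the text above does not say which class-weighted metric was used, and the function name `weighted_f1` and the sample tags are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (0 is worst, 1 is best)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and predicted count for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Each class contributes in proportion to how often it truly occurs
        score += f1 * n_true / total
    return score


# Hypothetical OSM-style tags, made up for illustration only
y_true = ["road", "road", "road", "building", "building", "park"]
y_pred = ["road", "road", "building", "building", "building", "road"]
print(round(weighted_f1(y_true, y_pred), 3))  # 0.6
```

Weighting by class support means frequent classes dominate the score, so a rare class that is never predicted correctly (here, the single "park") lowers the result only slightly.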
- Install Anaconda Python 3.5 for Windows
- Download wheel files: GDAL, Fiona, pyproj, and shapely for Python 3.5 (cp35)
- Uninstall existing OSGeo4W, GDAL, Fiona, pyproj, or shapely libraries
- Navigate to downloaded wheel files using the console
cd path/to/downloaded_wheels
- Install the wheel (.whl) files and libraries using pip install
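If it is unclear whether the 64-bit or the 32-bit wheel files apply, the following sketch reports the tags to look for in the wheel filenames. The helper name `wheel_tags` is hypothetical, and the platform tag it returns is only meaningful when run on Windows.

```python
import struct
import sys


def wheel_tags(version=None, pointer_bits=None):
    """Return (python_tag, platform_tag) as they appear in Windows wheel filenames."""
    if version is None:
        version = sys.version_info[:2]  # e.g. (3, 5)
    if pointer_bits is None:
        pointer_bits = struct.calcsize("P") * 8  # pointer size of this interpreter
    python_tag = "cp{}{}".format(version[0], version[1])
    platform_tag = "win_amd64" if pointer_bits == 64 else "win32"
    return python_tag, platform_tag


# Tags for the interpreter that runs this snippet
print(wheel_tags())
```

On a 64-bit Python 3.5 installation this should report `cp35` and `win_amd64`, which matches the filenames in the 64-bit example.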
64-bit Example (Same wheel files used in thesis)
cd path/to/downloaded_wheels
pip install GDAL-2.0.3-cp35-cp35m-win_amd64.whl
pip install Fiona-1.7.0-cp35-cp35m-win_amd64.whl
pip install pyproj-1.9.5.1-cp35-cp35m-win_amd64.whl
pip install Shapely-1.5.16-cp35-cp35m-win_amd64.whl
pip install geopandas
pip install joblib
pip install seaborn
pip install treeinterpreter
pip install tqdm
conda install -c ioos rtree
32-bit Example
cd path/to/downloaded_wheels
pip install GDAL-2.0.3-cp35-cp35m-win32.whl
pip install Fiona-1.7.0-cp35-cp35m-win32.whl
pip install pyproj-1.9.5.1-cp35-cp35m-win32.whl
pip install Shapely-1.5.16-cp35-cp35m-win32.whl
pip install geopandas
pip install joblib
pip install seaborn
pip install treeinterpreter
pip install tqdm
conda install -c ioos rtree
Thanks to Geoff Boeing for the "Using geopandas on Windows" blog post and Christoph Gohlke for the wheel files.
- Install Anaconda Python 3.5 for Linux
- Install libraries using pip install and conda install
pip install treeinterpreter
pip install tqdm
conda install -c conda-forge geopandas
conda install joblib
conda install seaborn
conda install -c ioos rtree
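After installing on either operating system, a quick check such as the sketch below can confirm that the libraries are visible to Python. The import names in `REQUIRED_MODULES` are assumptions based on the packages listed above (GDAL, for example, is normally imported as `osgeo`), and the helper name `missing_modules` is hypothetical.

```python
import importlib.util

# Assumed import names for the packages installed in the steps above
REQUIRED_MODULES = [
    "osgeo",  # GDAL
    "fiona",
    "pyproj",
    "shapely",
    "geopandas",
    "joblib",
    "seaborn",
    "treeinterpreter",
    "tqdm",
    "rtree",
]


def missing_modules(names):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing:", ", ".join(missing) if missing else "none")
```

This only checks that each module can be located; it does not import them, so a broken binary dependency could still surface when the thesis code runs.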
- Download this repository
- Unzip the file and navigate to the code folder
cd path/to/msa-thesis-master/py
- Execute the code using python thesis.py
cd path/to/msa-thesis-master/py
python thesis.py config.txt path/to/output_folder
The config file can be used to apply and alter the methods to other datasets.
Please see Section 4.1 in the PDF for more details.
Note: The unedited config.txt file contains the settings used to obtain results for the most recent Mapzen Toronto data. The data used in the thesis is provided in the reproduce release, which contains instructions to reproduce the thesis results.
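For repeated runs with different config files, the command above could be wrapped in a small Python helper. This is a sketch only: `build_command` and `run_thesis` are hypothetical names, the paths are the placeholders from the commands above, and creating the output folder in advance is an assumption about what the script expects.

```python
import subprocess
import sys
from pathlib import Path


def build_command(config_file, output_folder, script="thesis.py"):
    """Build the argument list for: python thesis.py <config file> <output folder>."""
    return [sys.executable, script, str(config_file), str(output_folder)]


def run_thesis(code_dir, config_file, output_folder):
    """Run the script from the code folder using the current interpreter."""
    # Resolve first so the path stays valid after changing the working directory
    output = Path(output_folder).resolve()
    output.mkdir(parents=True, exist_ok=True)  # assumption: harmless if the script also creates it
    command = build_command(config_file, output)
    return subprocess.run(command, cwd=str(code_dir), check=True)


# Show the command that would be run; call run_thesis(...) with real paths to execute it
print(build_command("config.txt", "path/to/output_folder"))
```

Using `sys.executable` keeps the run inside the same Anaconda environment where the libraries were installed, and `check=True` raises an error if the script exits with a non-zero status.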
- Date: April 27, 2016
- Time: 2:00 p.m. to 4:00 p.m.
- Location: Jorgenson Hall 730, Ryerson University, Toronto, ON
- Chair: Dr. Lu Wang
- Examiner 1: Dr. Eric Vaz
- Examiner 2: Dr. Tony Hernandez
- Result: Pass with minor revisions
Personal machine:
- Windows 8.1 64-bit
- i7-6700k 4.0 GHz Quad-Core
- 16 GB DDR4 2133 RAM
- 256 GB SSD + 512 GB SSD (Read: Up to 540 MB/sec, Write: Up to 520 MB/sec)
- Runtime: ~30-45 minutes
Virtual machine generously provided by Ryerson RC4:
- Debian Linux
- 6-Core CPU
- 6 GB RAM
- 66 GB Storage
- Runtime: ~50-60 minutes