Thesis titled "Geospatial Semantic Pattern Recognition in Volunteered Geographic Data Using the Random forest Algorithm" for the degree of Masters of Spatial Analysis at Ryerson University in 2016

Geospatial Semantic Pattern Recognition in Volunteered Geographic Data Using the Random forest Algorithm

Richard Wen
Masters of Spatial Analysis, Ryerson University, 2016
Thesis Defended on April 27, 2016
Supervised by Dr. Claus Rinner


The ubiquitous availability of location technologies has enabled users worldwide to produce large quantities of Volunteered Geographic Data (VGD). VGD has been a cost-effective and scalable source of unique and freely available geospatial data. However, VGD suffers from reliability issues because user behaviour is often variable. Large quantities make manual assessment of the user-generated data inefficient, expensive, and impractical. This research used a random forest algorithm based on geospatial semantic variables to aid the improvement and understanding of multi-class VGD without ground-truth reference data. An automated Python script implementing a random forest based procedure was developed. A demonstration of the script on OpenStreetMap (OSM) data with user-generated tags in Toronto, Ontario, was effective in recognizing patterns in the OSM data, with a predictive performance of ~0.71 (where 0 is the worst and 1 is the best) based on a class-weighted metric, and was able to reveal variable influences and outliers.





The code was written in Python 3.5 and has been tested with the Mapzen Toronto data on Windows and Linux operating systems. The code is described in Section 4 of the PDF and uses a tree-optimized random forest model to learn geospatial patterns for the prediction and outlier detection of known spatial object classes (Figure 1).
Figure 1. Flowchart of the code process
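
The abstract reports predictive performance with a class-weighted metric on a 0 (worst) to 1 (best) scale, but this README does not name the exact metric. As a minimal sketch, assuming a support-weighted F1 score (an assumption, not necessarily the metric used in the thesis), such a score could be computed as follows. The function name `weighted_f1` and the example class labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (0 is worst, 1 is best).

    Assumption: this illustrates one common class-weighted metric; the
    thesis may define its metric differently.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)    # number of true samples per class
    predicted = Counter(y_pred)  # number of predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_true     # weight each class by its support
    return total / len(y_true)


# Hypothetical OSM-like tags, used only for illustration
y_true = ["road", "road", "building", "park"]
y_pred = ["road", "building", "building", "park"]
score = weighted_f1(y_true, y_pred)
```

Weighting each class by its number of true samples keeps frequent classes from being under-represented in the overall score, which matters for multi-class data where class sizes differ widely.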


Windows Installation

  1. Install Anaconda Python 3.5 for Windows
  2. Download the wheel files for GDAL, Fiona, pyproj, and shapely for Python 3.5 (cp35)
  3. Uninstall any existing OSGeo4W, GDAL, Fiona, pyproj, or shapely libraries
  4. Navigate to the downloaded wheel files in the console: cd path/to/downloaded_wheels
  5. Install the wheel (.whl) files and libraries using pip install

64-bit Example (Same wheel files used in thesis)

cd path/to/downloaded_wheels
pip install GDAL-2.0.3-cp35-cp35m-win_amd64.whl
pip install Fiona-1.7.0-cp35-cp35m-win_amd64.whl
pip install pyproj-
pip install Shapely-1.5.16-cp35-cp35m-win_amd64.whl
pip install geopandas
pip install joblib
pip install seaborn
pip install treeinterpreter
pip install tqdm
conda install -c ioos rtree

32-bit Example

cd path/to/downloaded_wheels
pip install GDAL-2.0.3-cp35-cp35m-win32.whl
pip install Fiona-1.7.0-cp35-cp35m-win32.whl
pip install pyproj-
pip install Shapely-1.5.16-cp35-cp35m-win32.whl
pip install geopandas
pip install joblib
pip install seaborn
pip install treeinterpreter
pip install tqdm
conda install -c ioos rtree

Thanks to Geoff Boeing for the "Using geopandas on windows" blog post and to Christoph Gohlke for the wheel files.

Linux Installation

  1. Install Anaconda Python 3.5 for Linux
  2. Install the libraries using pip install and conda install:
pip install treeinterpreter
pip install tqdm
conda install -c conda-forge geopandas
conda install joblib
conda install seaborn
conda install -c ioos rtree
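
After installation on either platform, it can help to confirm that every library is visible to the Python interpreter before running the code. The check below is a minimal sketch using only the standard library. The helper name `find_missing` is hypothetical, and the module list assumes the usual import names for the packages above (GDAL is imported as `osgeo`).

```python
import importlib.util

# Import names for the libraries installed above.
# Assumption: GDAL is exposed to Python as the "osgeo" package.
REQUIRED_MODULES = [
    "osgeo", "fiona", "pyproj", "shapely", "geopandas",
    "joblib", "seaborn", "treeinterpreter", "tqdm", "rtree",
]


def find_missing(modules=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules
            if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Missing libraries:", ", ".join(missing))
    else:
        print("All required libraries were found.")
```

This only checks that each module can be located. It does not import the modules, so it will not catch broken binary dependencies such as a mismatched GDAL build.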


Running the Code

  1. Download this repository
  2. Unzip the file and navigate to the code folder: cd path/to/msa-thesis-master/py
  3. Execute the code using python:
cd path/to/msa-thesis-master/py
python config.txt path/to/output_folder

The config file can be used to apply and alter the methods to other datasets.
Please see Section 4.1 in the PDF for more details.

Note: The unedited config.txt file contains the settings used to obtain results for the most recent Mapzen Toronto data. The data used in the thesis is provided in the reproduce release, which contains instructions to reproduce the thesis results.
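
Because a run takes a config file and an output folder as arguments, a small pre-flight check can catch path mistakes before a long run starts. The following is a minimal sketch, not part of the repository: the helper `prepare_run` is hypothetical, and it assumes only that the config is a regular file and that the output folder may need to be created.

```python
from pathlib import Path


def prepare_run(config_path, output_folder):
    """Check that the config file exists and create the output folder.

    Hypothetical helper for illustration; the repository's own code may
    handle these paths differently.
    """
    config = Path(config_path)
    if not config.is_file():
        raise FileNotFoundError("Config file not found: {}".format(config))

    output = Path(output_folder)
    # Create the output folder (and any parents) if it does not exist yet
    output.mkdir(parents=True, exist_ok=True)
    return config.resolve(), output.resolve()
```

For example, `prepare_run("config.txt", "path/to/output_folder")` returns the absolute paths of both locations, or raises `FileNotFoundError` if the config file is missing.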



Thesis Defence

  • Date: April 27, 2016
  • Time: 2:00 p.m. to 4:00 p.m.
  • Location: Jorgenson Hall 730, Ryerson University, Toronto, ON
  • Chair: Dr. Lu Wang
  • Examiner 1: Dr. Eric Vaz
  • Examiner 2: Dr. Tony Hernandez
  • Result: Pass with minor revisions


Personal machine:

  • Windows 8.1 64-bit
  • i7-6700k 4.0 GHz Quad-Core
  • 16 GB DDR4 2133 RAM
  • 256 GB SSD + 512 GB SSD (Read: Up to 540 MB/sec, Write: Up to 520 MB/sec)
  • Runtime: ~30-45 minutes

Virtual machine generously provided by Ryerson RC4:

  • Debian Linux
  • 6-Core CPU
  • 6 GB RAM
  • 66 GB Storage
  • Runtime: ~50-60 minutes