Spatial Current Machine Learning Guide (sc-ml-guide)
A guide for machine learning with Internet-scale geospatial data. This guide includes information on classifiers and training datasets for parsing and understanding large geospatial datasets, including OpenStreetMap.
Many of the examples in this guide use the Python Natural Language Toolkit (NLTK) for machine learning tasks.
This work is distributed under the MIT License. See LICENSE file.
This guide includes a list of classifiers and how they can be used.
Naive Bayes Classifier
A naive Bayes classifier can be used for simple categorization of a set of documents. A simple use case is using two related attributes of geospatial features to fill in sparse data, e.g., type of cuisine.
Below is a sample workflow:
- Collect training dataset, e.g., OSM extract or Spatial Current SGOL query.
- Generate frequency distribution of words
- Create vocabulary (select the top N most common words from the frequency distribution)
- For each document in the training dataset, convert it into a series of binary features
- Train the classifier on the training dataset
- Convert input into document
- Classify document, e.g., category = classifier.classify(convert(document)).
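The workflow above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not this guide's actual implementation: the training names and cuisines are invented, and tokenize, the NaiveBayes class, and the "contains(word)" feature names are hypothetical helpers chosen for the example. The classifier is a small hand-rolled naive Bayes with add-one smoothing, written so that it exposes the same classify(convert(document)) call used in the workflow.

```python
import math
from collections import Counter, defaultdict

# Step 1: collect a training dataset.
# Hypothetical (name, cuisine) pairs invented for illustration;
# real data would come from an OSM extract or similar source.
TRAINING = [
    ("Luigi's Pizza Kitchen", "italian"),
    ("Roma Pizza Pasta", "italian"),
    ("Pasta Palace", "italian"),
    ("Tokyo Sushi Bar", "japanese"),
    ("Sakura Sushi Ramen", "japanese"),
    ("Ramen Corner", "japanese"),
]


def tokenize(text):
    """Lowercase a name and split it on whitespace."""
    return text.lower().split()


# Step 2: generate a frequency distribution of words.
freq = Counter(word for name, _ in TRAINING for word in tokenize(name))

# Step 3: create the vocabulary from the top N most common words.
N = 4
vocabulary = [word for word, _ in freq.most_common(N)]


def convert(document):
    """Step 4: turn a document into a dict of binary features."""
    words = set(tokenize(document))
    return {"contains(%s)" % word: (word in words) for word in vocabulary}


class NaiveBayes:
    """Tiny naive Bayes classifier over binary features (add-one smoothing)."""

    def __init__(self, labeled_features):
        self.label_counts = Counter(label for _, label in labeled_features)
        self.feature_counts = defaultdict(Counter)
        for features, label in labeled_features:
            for item in features.items():
                self.feature_counts[label][item] += 1

    def classify(self, features):
        total = sum(self.label_counts.values())

        def log_score(label):
            n = self.label_counts[label]
            score = math.log(n / total)  # log prior
            for item in features.items():
                # Each feature is True or False, hence the +2 in the denominator.
                score += math.log((self.feature_counts[label][item] + 1) / (n + 2))
            return score

        return max(self.label_counts, key=log_score)


# Step 5: train on the converted training dataset.
classifier = NaiveBayes([(convert(name), label) for name, label in TRAINING])

# Steps 6 and 7: convert the input into a document and classify it.
category = classifier.classify(convert("Golden Sushi Garden"))
print(category)
```

When NLTK is available, nltk.FreqDist and nltk.NaiveBayesClassifier.train can take the place of Counter and the hand-rolled class, with the same convert function supplying the feature dictionaries. Note that NLTK applies its own smoothing, so probabilities may differ from this sketch.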
The data.yml file in the /data/naive-bayes-classifier folder can be used for bootstrapping a classifier that guesses the cuisine of a point of interest based on its name.
This dataset includes a list of amenities in Washington, DC and Virginia that have