# Unsupervised Machine Learning

This tutorial demonstrates the unsupervised machine learning functions in mygeopackage, specifically clustering.

## Import module

In [1]:
from mygeopackage import Geo
import mygeopackage.unsupervised

First, import the Geo class and the mygeopackage.unsupervised module.

## K-Means Clustering

In [2]:
geojson = Geo(r'https://github.com/yungming0119/mygeopackage/blob/main/docs/notebooks/data/sample_points.geojson?raw=true')
cluster_results = mygeopackage.unsupervised.Cluster(geojson.data[0:100])

The unsupervised module's core class is *Cluster*, which stores the results of a cluster analysis. To instantiate *Cluster*, pass it a NumPy array containing your spatial and attribute data. Here, geojson.data[0:100] passes the first 100 rows.

In [3]:
mygeopackage.unsupervised.k_means(10,[0,1],cluster_results,2)

The k_means() function requires four arguments. The first is *n*, the desired number of clusters for the K-Means analysis. The second is a list of the column indices (fields) to cluster on. In Geo.data, the spatial coordinates are in columns 0 and 1, so passing [0,1] clusters on the spatial data. The third is the Cluster object where the results will be stored. The last is the index of the identifier column in your data.
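To make the role of each argument concrete, here is a minimal NumPy sketch of what such a call does conceptually: select the field columns, convert them to floats, and run K-Means (Lloyd's algorithm). It is not mygeopackage's implementation. The helper name `simple_k_means`, the sample rows, and the fixed seed are assumptions made for illustration.

```python
import numpy as np

def simple_k_means(n, fields, data, n_iter=20, seed=0):
    """Hypothetical helper: Lloyd's algorithm on the selected columns."""
    # Select the clustering fields and convert the string values to floats.
    points = data[:, fields].astype(float)
    rng = np.random.default_rng(seed)
    # Start from n distinct rows chosen at random.
    centers = points[rng.choice(len(points), size=n, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its members (keep it if empty).
        for k in range(n):
            members = points[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return centers, labels

# Made-up rows in the same layout as Geo.data: x, y, identifier.
sample = np.array([
    ['-86.20', '34.26', '1'],
    ['-86.21', '34.27', '2'],
    ['-86.22', '34.25', '3'],
    ['-88.00', '30.83', '4'],
    ['-88.01', '30.84', '5'],
    ['-88.02', '30.82', '6'],
])
centers, labels = simple_k_means(2, [0, 1], sample)
print(centers)
print(labels)
```

The two groups of made-up points are far apart, so the first three rows share one label and the last three share the other. Which group receives label 0 depends on the random starting rows.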

### Cluster object attributes

In [4]:
cluster_results.cluster_centers

array([[-86.0393075 ,  33.51200282],
       [-86.26039022,  34.31085656],
       [-88.00821942,  30.83185201],
       [-86.60439067,  33.55683823],
       [-87.01090287,  32.27727852],
       [-85.80784885,  32.85064648],
       [-86.88929961,  33.35613407],
       [-86.80296795,  34.71117208],
       [-86.05297041,  32.40150848],
       [-85.90201452,  31.42272588]])

Cluster has a cluster_centers attribute, which records the center of every cluster: one row per cluster and one column per clustering field.
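One thing you can do with the centers is find the cluster closest to a new location. The sketch below copies the centers from the output above and uses plain NumPy. The point `new_point` is a hypothetical location, and straight-line distance on the raw coordinates is an assumption.

```python
import numpy as np

# Cluster centers copied from the output above.
centers = np.array([
    [-86.0393075, 33.51200282],
    [-86.26039022, 34.31085656],
    [-88.00821942, 30.83185201],
    [-86.60439067, 33.55683823],
    [-87.01090287, 32.27727852],
    [-85.80784885, 32.85064648],
    [-86.88929961, 33.35613407],
    [-86.80296795, 34.71117208],
    [-86.05297041, 32.40150848],
    [-85.90201452, 31.42272588],
])

# A hypothetical new location, not part of the sample data.
new_point = np.array([-86.20, 34.26])

# Distance from the new point to every center, then the closest one.
nearest = int(np.linalg.norm(centers - new_point, axis=1).argmin())
print(nearest)  # → 1
```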

In [5]:
cluster_results.labels

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 6,
       3, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7,
       7, 7, 7, 7, 7, 0, 0, 0, 0, 7, 3, 3, 3, 3, 1, 1, 1, 1, 1, 3, 3, 3,
       3, 3, 4, 8, 8, 8, 4, 7, 6, 6, 6, 6, 7, 7, 1, 8, 6, 5, 5, 5, 5, 5,
       3, 3, 3, 2, 0, 8, 5, 5, 2, 9, 2, 0])

The labels attribute is an array that records the cluster label assigned to each row of the data.
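A label array like this is easy to summarize with NumPy. The following sketch counts the members of each cluster and picks out the identifiers of one cluster. The short `labels` and `ids` arrays are made up for illustration and are not the tutorial's results.

```python
import numpy as np

# Made-up labels and identifiers for nine points (illustration only).
labels = np.array([1, 1, 1, 6, 3, 6, 7, 7, 0])
ids = np.array(['1', '2', '3', '4', '5', '6', '7', '8', '9'])

# Number of points assigned to each cluster.
values, counts = np.unique(labels, return_counts=True)
summary = dict(zip(values.tolist(), counts.tolist()))
print(summary)  # → {0: 1, 1: 3, 3: 1, 6: 2, 7: 2}

# Identifiers of the points that fell into cluster 6.
in_cluster_6 = ids[labels == 6].tolist()
print(in_cluster_6)  # → ['4', '6']
```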

In [6]:
cluster_results.data

array([['-86.20617875299996', '34.260200473000054', '1', ..., '01026',
        '01009', '2018-2019'],
       ['-86.20488875199999', '34.26223247300004', '2', ..., '01026',
        '01009', '2018-2019'],
       ['-86.22014875799994', '34.27332447500004', '3', ..., '01026',
        '01009', '2018-2019'],
       ...,
       ['-85.90201452099996', '31.422725875000026', '98', ..., '01091',
        '01031', '2018-2019'],
       ['-87.61753298699995', '31.07607574000008', '99', ..., '01064',
        '01022', '2018-2019'],
       ['-86.08673267299997', '33.43270630300003', '100', ..., '01035',
        '01011', '2018-2019']], dtype='<U60')

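The data attribute holds the original NumPy array that you passed in. Note that every value, including the coordinates and the identifier, is stored as a string (dtype '<U60').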
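Because the values are strings, sorting and comparison work character by character rather than numerically. The short sketch below shows the difference, using the identifier values from the last three rows above and two of the longitude strings. Converting with `astype` is standard NumPy, not a mygeopackage feature.

```python
import numpy as np

# Identifier values taken from the last three rows shown above.
ids = np.array(['98', '99', '100'])

sorted_as_text = np.sort(ids).tolist()
sorted_as_numbers = np.sort(ids.astype(int)).tolist()
print(sorted_as_text)     # → ['100', '98', '99']
print(sorted_as_numbers)  # → [98, 99, 100]

# Coordinates need the same conversion before any arithmetic.
lon = np.array(['-86.20617875299996', '-86.20488875199999']).astype(float)
print(lon.dtype)  # → float64
```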

In [7]:
cluster_results.identifier

2

The identifier attribute holds the index of the identifier column for the data, similar to FID or ObjectID in a shapefile.

## DBSCAN

DBSCAN is a common density-based clustering method that is also supported in mygeopackage.

In [8]:
dbscan_results = mygeopackage.unsupervised.Cluster(geojson.data[0:100])
mygeopackage.unsupervised.dbscan(0.5,5,[0,1],dbscan_results,2)

dbscan() requires five arguments. The first is eps, the maximum distance between two samples for one to be considered in the neighborhood of the other. The second is min_samples, the number of samples in a neighborhood for a point to be considered a core point. The third is the list of fields used for clustering. The fourth is the Cluster object for storing the results. The last is the index of the field that serves as the identifier in the dataset.
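The interplay of eps and min_samples is easier to see in code. The sketch below flags core points by brute force with NumPy. It is not mygeopackage's implementation: the helper `core_point_mask` and the coordinates are made up, and counting the point itself as a neighbor follows scikit-learn's convention, which is assumed here. If distances are computed directly on longitude and latitude columns, eps is expressed in degrees.

```python
import numpy as np

def core_point_mask(points, eps, min_samples):
    """Hypothetical helper: flag DBSCAN core points by brute force."""
    # Pairwise Euclidean distances between all points.
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Count neighbors within eps. The point itself is included (distance 0),
    # as in scikit-learn's definition of min_samples (assumed here).
    neighbor_counts = (dists <= eps).sum(axis=1)
    return neighbor_counts >= min_samples

# Made-up coordinates: five points packed together and one far away.
pts = np.array([
    [-86.20, 34.26], [-86.21, 34.27], [-86.22, 34.25],
    [-86.19, 34.28], [-86.23, 34.26], [-88.00, 30.83],
])
mask = core_point_mask(pts, eps=0.5, min_samples=5)
print(mask.tolist())  # → [True, True, True, True, True, False]
```

The isolated point has only itself within eps, so it cannot be a core point. Lowering min_samples to 1 would make every point a core point.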

### Cluster class methods

In [9]:
cluster_results.show()

The Cluster object also provides a show() method, which plots your cluster results on a map with Folium. Each cluster is given a random color. Click a dot to see which cluster the point belongs to.

In [10]:
geoj = cluster_results.toGeoJson()
geoj

'{"type": "FeatureCollection", "name": "K-Means Results", "features": [{"type": "Feature", "properties": {"ID": "1", "Class": 1}, "geometry": {"type": "Point", "coordinates": ["-86.20617875299996", "34.260200473000054"]}}, {"type": "Feature", "properties": {"ID": "2", "Class": 1}, "geometry": {"type": "Point", "coordinates": ["-86.20488875199999", "34.26223247300004"]}}, {"type": "Feature", "properties": {"ID": "3", "Class": 1}, "geometry": {"type": "Point", "coordinates": ["-86.22014875799994", "34.27332447500004"]}}, {"type": "Feature", "properties": {"ID": "4", "Class": 1}, "geometry": {"type": "Point", "coordinates": ["-86.22181075699996", "34.25270647100007"]}}, {"type": "Feature", "properties": {"ID": "5", "Class": 1}, "geometry": {"type": "Point", "coordinates": ["-86.19329375099994", "34.28985548000003"]}}, {"type": "Feature", "properties": {"ID": "6", "Class": 1}, "geometry": {"type": "Point", "coordinates": ["-86.22177475699993", "34.25328347100003"]}}, {"type": "Feature", "p

Finally, toGeoJson() converts your results to a JSON string in GeoJSON format, which you can save later.
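Saving the string needs only the standard library. The sketch below uses a one-feature string in the same shape as the output above, shortened for illustration, and writes it to a hypothetical path in the temporary directory. It also converts the coordinates to numbers first, since they appear as strings in the output and the GeoJSON specification (RFC 7946) expects numeric positions. Whether you need that step depends on the tool that will read the file.

```python
import json
import os
import tempfile

# One-feature string in the same shape as the output above (shortened).
geoj = ('{"type": "FeatureCollection", "name": "K-Means Results", "features": '
        '[{"type": "Feature", "properties": {"ID": "1", "Class": 1}, '
        '"geometry": {"type": "Point", "coordinates": '
        '["-86.20617875299996", "34.260200473000054"]}}]}')

collection = json.loads(geoj)

# Convert string coordinates to floats so the positions are numeric.
for feature in collection["features"]:
    geometry = feature["geometry"]
    geometry["coordinates"] = [float(c) for c in geometry["coordinates"]]

# Hypothetical output location; replace it with your own path.
out_path = os.path.join(tempfile.gettempdir(), "kmeans_results.geojson")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(collection, f)

# Read the file back to confirm it is valid JSON.
with open(out_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded["features"][0]["geometry"]["coordinates"])
```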