# QuaSaR: Identifying EEW Rings

## GOAL AND OBJECTIVES
Quake Safety Rings ([QuaSaR](https://en.wikipedia.org/wiki/Quasar)) are essentially autonomous [Rings](https://brilliant.org/wiki/ring-theory/) of sensors that share discretized time-series of _waveform_ information to identify threats and issue warnings, giving people and machines lead time to respond to harmful earthquakes.

The overall __goal__ is to examine how the GeoNet seismic network can be augmented with a low-cost network to offer low-latency EEWs by making use of cutting-edge earthquake picking algorithms and machine learning techniques. The expected outcome is for the findings to serve as evidence for supporting a strategic deployment of a ring or rings of micro-array networks.

The analysis and tools are also intended to serve as inputs for earthquake hazard risk assessment. A community interested in operationalizing its own micro-array ring can thereby use the analysis and tools to determine whether, and how, it may need to invest in building a micro-array ring.

### Objectives
1. _Understand the [topology](https://brilliant.org/wiki/topology/) (structure of connection of the units and their capabilities)_; also an axiomatic way to make sense of when two points in a set are "near" each other
   1. Retrieve data on all the operational NZ seismic stations to __map the inventory__ by types and location [git-issue #2](https://github.com/waidyanatha/quasar/issues/2).
   1. Build a __station fault topology space__ comprising all the operational stations within a bound of the fault line paths; such that we create a metric space _(X,d)_ comprising _X = { x,y | for all coordinate pairs of stations x and faults y}_ and a [haversine](https://math.stackexchange.com/questions/993236/calculating-a-perpendicular-distance-to-a-line-when-using-coordinates-latitude) distance function _d = x - y_; relative to the fault lines, stations, and earthquake detection role and capacity
   1. Cluster the metric space into a partially ordered __coarser topology__ of metric subspaces; essentially to make a nearest neighbour map of station fault clusters such that stations are within ___d < &epsilon;___ distance to ensure optimal EEW application performance
1. _Apply earthquake __picking algorithms__ on the GeoNet wave form data_
   1. Test the __standard GeoNet algorithms__ (e.g. STA/LTA, Pd)
   1. Test with new __machine learning and wavefield algorithms__ (e.g. 8bit Picking, PLUM)
   1. Test above picking algorithms with __simulated earthquakes__ and for __selected high risk faults__ to observe the response of the picking algorithms
   
1. _Determine ways for improving the station rings for an incremental effectiveness of EEW_
   1. Propose to __fit additional stations__ to improve the 30Km nearest neighbour cluster; then show how that improves the picking
   1. Apply the geodetic methodology to __interpolate seismic data__ for the proposed station locations
   1. Try the earthquake __picking algorithms__ on the hypothetical network to measure effectiveness
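
The distance function and the _d < &epsilon;_ condition in objective 1 can be illustrated with a small sketch. It is not the notebook's implementation: the helper names, the mean Earth radius of 6371 Km, and the coordinates are assumptions chosen for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in Km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_epsilon(station, fault_points, epsilon_km=30.0):
    """Return the fault points closer than epsilon_km to the station."""
    s_lat, s_lon = station
    return [p for p in fault_points
            if haversine_km(s_lat, s_lon, p[0], p[1]) < epsilon_km]


# Hypothetical (lat, lon) coordinates, not taken from GeoNet
station = (-41.29, 174.78)
fault_points = [(-41.30, 174.90), (-43.53, 172.64)]
near = within_epsilon(station, fault_points)
print(near)
```

Only the first fault point survives the 30 Km filter; the second is several hundred kilometres away.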

In [None]:
from IPython.display import Image
Image(filename='../images/obspy.png',width=300, height=150)

### DEFINE data services and software modules

We make use of the International Federation of Digital Seismograph Networks (FDSN), the global standard and a [data service](http://www.fdsn.org/services/) for sharing seismic sensor waveform data. The ObsPy libraries support FDSN. The following resources and services are used for retrieving station inventory and waveform data.

1. FDSN station service
   1. FDSN as Client data sources; both (i) the FDSN client service and (ii) the FDSN compliant GeoNet API webservice
   1. retrieve station metadata information in a FDSN StationXML format or text format for all the channels in CECS station with no time limitations: https://service.geonet.org.nz/fdsnws/station/1/query?network=NZ&station=CECS&level=channel&format=text
1. ObsPy
   1. wavePicker is no longer supported by ObsPy; instead the [Pyrocko](http://pyrocko.org) Snuffler for seismic data inspection and picking is recommended
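
As a rough sketch of how the station webservice URL above is put together, the query string can be assembled from its parameters with the standard library. The helper name `build_station_query` is hypothetical; the endpoint and parameter values are the ones quoted in the list above, and no request is sent.

```python
from urllib.parse import urlencode

GEONET_STATION_URL = "https://service.geonet.org.nz/fdsnws/station/1/query"


def build_station_query(**params):
    """Assemble a station webservice URL from keyword parameters."""
    return GEONET_STATION_URL + "?" + urlencode(params)


# Channel-level metadata for station CECS in text format
url = build_station_query(network="NZ", station="CECS",
                          level="channel", format="text")
print(url)
```

Changing `level` or `format` gives the other variants used in this notebook, e.g. `level="station"` with `format="xml"`.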

In [None]:
import glob
from obspy import read_inventory
from obspy.clients.fdsn import Client
from obspy.core import read, UTCDateTime
#from datetime import date

# Establish start and end time for retrieving waveform data
t_start = UTCDateTime.now()-518400 #6 days ago = 60s x 60m x 24h x 6d
t_end = UTCDateTime.now()+86400 #1 day in the future = 60s x 60m x 24h
print('Station start time: ', t_start, '\n & ending time: ', t_end)

try:
    # use either the GeoNet station webservice URL or the ObsPy FDSN Client to retrieve station data
    st_ws = 'https://service.geonet.org.nz/fdsnws/station/1/query?network=NZ&level=station&endafter=2020-12-31&format=xml'
    #st_ws = 'https://service.geonet.org.nz/fdsnws/station/1/query?network=NZ&station=CECS&level=channel'
    # Set FDSN client URL to GEONET short code
    client  = Client('GEONET')
    print("Client is",client)
except Exception as err:
    print("Error message:", err)


### Define station types and channels

To learn about sensor type and channel code definitions [see section in ipynb](./stations_faultlnes_plot_1a.ipynb#sensor_code_desc)

#### Class of station data processing methods
The class is defined to manage all functions for retrieving, parsing, and preparing station data in an easily useable form.
* Class _station_data()_
   * _get_channels()_ returns abbreviated channel codes
   * _get_types()_ returns a list of all seismic station types with abbreviation and description
   * _get_stations()_ returns list of all stations with code, type abbr, lat/lon pair


In [None]:
''' All weak & strong motion, low gain, and mass position sensor types '''
class station_data():
    def __init__(self, name="station_metadata"):
        # __init__ must not return a value; keep the label as an attribute instead
        self.name = name
        
    def get_channels(self):
        channels = "UH*,VH*,LH*,BH*,SH*,HH*,EH*,UN*,VN*,LN*,BN*,SN*,HN*,EN*"
        return channels

    '''
        All combinations of the first and second letters that identify each station type
    '''
    def get_types(self):
        dict_st_types = {"UH" : "Weak motion sensor, e.g. measuring velocity\nUltra Long Period sampled at 0.01Hz, or SOH sampled at 0.01Hz",
                   "VH" : "Weak motion sensor, e.g. measuring velocity\nVery Long Period sampled at 0.1Hz, or SOH sampled at 0.1Hz",
                   "LH" : "Weak motion sensor, e.g. measuring velocity\nBroad band sampled at 1Hz, or SOH sampled at 1Hz",
                   "BH" : "Weak motion sensor, e.g. measuring velocity\nBroad band sampled at between 10 and 80 Hz, usually 10 or 50 Hz",
                   "SH" : "Weak motion sensor, e.g. measuring velocity\nShort-period sampled at between 10 and 80 Hz, usually 50 Hz", 
                   "HH" : "Weak motion sensor, e.g. measuring velocity\nHigh Broad band sampled at or above 80Hz, generally 100 or 200 Hz",
                   "EH" : "Weak motion sensor, e.g. measuring velocity\nExtremely Short-period sampled at or above 80Hz, generally 100 Hz",
                   "UN" : "Strong motion sensor, e.g. measuring acceleration\nUltra Long Period sampled at 0.01Hz, or SOH sampled at 0.01Hz",
                   "VN" : "Strong motion sensor, e.g. measuring acceleration\nVery Long Period sampled at 0.1Hz, or SOH sampled at 0.1Hz",
                   "LN" : "Strong motion sensor, e.g. measuring acceleration\nBroad band sampled at 1Hz, or SOH sampled at 1Hz",
                   "BN" : "Strong motion sensor, e.g. measuring acceleration\nBroad band sampled at between 10 and 80 Hz, usually 10 or 50 Hz",
                   "SN" : "Strong motion sensor, e.g. measuring acceleration\nShort-period sampled at between 10 and 80 Hz, usually 50 Hz",
                   "HN" : "Strong motion sensor, e.g. measuring acceleration\nHigh Broad band sampled at or above 80Hz, generally 100 or 200 Hz",
                   "EN" : "Strong motion sensor, e.g. measuring acceleration\nExtremely Short-period sampled at or above 80Hz, generally 100 Hz"}
        return dict_st_types

    '''Prepare an array of station data: (i) station code as a unique identifier, 
                                        (ii) coordinates longitude & latitude, and 
                                       (iii) elevation in meters above mean sea level
        return the construct as a list of stations including the list of invalid stations
    '''
    def get_stations(self):
        st_list = []
        invalid_st_list = []

        try:
            st_inv = client.get_stations(network='NZ', location="1?,2?", station='*', channel=self.get_channels(), level='channel', starttime=t_start, endtime = t_end)
        except Exception as err:
            print("Error message:", err)

        '''run through stations to parse code, type, and location'''
        try:
            for each_st in range(len(st_inv[0].stations)):
                ''' use lat/lon pairs only in and around NZ; remove all others '''
                if(st_inv[0].stations[each_st].latitude < 0 and st_inv[0].stations[each_st].longitude > 0):
                    each_st_type_dict = st_inv[0].stations[each_st].get_contents()
                    ''' get the second character representing the station type '''
#                    st_type_dict["st_type"].append(each_st_type_dict["channels"][0][-3:-1])
                    ''' list of corresponding station locations (lat / lon) '''
                    st_list.append([st_inv[0].stations[each_st].code, each_st_type_dict["channels"][0][-3:-1], st_inv[0][each_st].latitude, st_inv[0][each_st].longitude])
                else:
                    ''' list of all stations not in the NZ vicinity '''
                    invalid_st_list.append([st_inv[0].stations[each_st].code,st_inv[0][each_st].latitude, st_inv[0][each_st].longitude])

        except Exception as err:
            print("Error message:", err)
        
        return st_list, invalid_st_list
    

### Define fault lines

#### Class of Fault line methods

We have completed objective 1.A. However, we will also include a mapping of the fault lines to give a perception of the station distribution relative to that of the map of fault lines.

* Class fault_data()
   * _get_paths()_ to convert the WGS84 json file into a list
   * _interpolate_paths_ input results from get_paths() and specify an interpolation distance (e.g. distance=2.5)

In [None]:
class fault_data():

    ''' TODO at initialization download latest ZIP'd datasets from GeoNet then extract the *.json
    '''
    def __init__(self):
        pass

    ''' Extract nested values from a JSON tree to build a list of fault lines
        containing the fault name and lat / lon pairs of the path
    '''
    
    def get_paths(self):
        import json
        from dictor import dictor
        
        faults = []     # initialised before the try so the return is safe if loading fails
        try:
            with open('../data/NZAFD/JSON/NZAFD_Oct_2020_WGS84.json') as json_file: 
                data = json.load(json_file)

            fault_path_count = 1
            for each_feature in range(len(data['features'])):
                flt = dictor(data,'features.{}.attributes.NAME'.format(each_feature))
                if flt==" ":
                    flt = 'Unnamed fault '+ str(fault_path_count)
                    fault_path_count += 1
                points = []
                path = dictor(data,'features.{}.geometry.paths.0'.format(each_feature))
                for each_coordinate in range(len(path)):
                    points.append([path[each_coordinate][0],path[each_coordinate][1]])
                faults.append([flt,points])

        except Exception as err:
            print("Error message:", err)
        return faults

    '''
        Interpolate more points along each fault line until the separation between
        consecutive points is no greater than `distance` (default 2.5 Km)
    '''
    def interpolate_paths(self, paths, distance=float(2.5)):
        from shapely.geometry import LineString
        
        interp_paths = []
        try:
            ''' loop through each fault path to breakdown into line segments; i.e. coordinate pairs '''
            for path in range(len(paths)):
                path_index = 0
                ''' add the two line segment coordinates to begin with
                    now loop through each path line segment to add interpolated points  '''
                while (path_index < len(paths[path][1])-1):
                    ip = []     # interpolated point
                    rel_origin_coord = paths[path][1][path_index]     # relative starting point of the path
                    rel_nn_coord = paths[path][1][path_index+1]

                    ''' change to a while loop until all distances between consecutive points < delta_distance'''
                    while LineString([rel_origin_coord, rel_nn_coord]).length*6371.0 > distance:
                        ip = LineString([rel_origin_coord,rel_nn_coord]).interpolate((10.0**3)/6371.0, normalized=True).wkt
                        # conversion needs to happen otherwise throws an exception
                        ip_lat = float(ip[ip.find("(")+1:ip.find(")")].split()[0])
                        ip_lon = float(ip[ip.find("(")+1:ip.find(")")].split()[1])
                        rel_nn_coord = list([ip_lat,ip_lon])
                        ''' Add the interpolated coordinates to the path so that later iterations reuse
                        them to build a denser path; note that this may result in unequal distances
                        between consecutive points in the path. Comment the instruction below to disable.
                        '''
                        paths[path][1].insert(path_index+1,rel_nn_coord)    # interpolated coordinates closest to the relative origin

                    path_index += 1

                interp_paths.append([paths[path][0], paths[path][1]])

        except Exception as err:
            print("Error message:", err)

        return interp_paths

## OBJECTIVE 1.B - STATION FAULT METRIC

### Data preparation for analysis
The steps below build the list and array metrics for the stations and fault lines:
1. Interpolate points between fault line path coordinates
1. Calculate the station to fault line perpendicular distances

#### Why interpolate more coordinates?
The fault line paths might have been reduced by applying the [Ramer-Douglas-Peucker algorithm](https://pypi.org/project/rdp/) before publishing, leaving the GeoNet fault line paths with an optimal set of coordinates sufficient for mapping. _Edward Lee pointed out that instead of using the "perpendicular distance" from a point to a line, the algorithm should use the 'Shortest Distance' from a point to a line segment._ Therefore, we are essentially inverting the rdp PyPI algorithm to interpolate more coordinates and reduce the line segment lengths to ~1.0 Km.

#### Interpolate coordinates in ~1.0Km separations
The average distance between consecutive latitude and longitude pairs in each fault line path ranges from 2.0 - 30.0 Km. Therefore, we use the [shapely interpolation](https://shapely.readthedocs.io/en/latest/manual.html#linear-referencing-methods) techniques to synthesize coordinates such that the distance between consecutive coordinates is ~ 1.0 Km.
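
The idea of densifying a path can be sketched without shapely. The example below is an illustration only, not the method used in the class: it assumes points are `(lon, lat)` pairs in degrees, measures segment lengths with the haversine formula on a sphere of radius 6371 Km, and interpolates linearly in longitude and latitude, which is a reasonable approximation for segments of a few tens of kilometres. Note that a shapely `LineString.length` on lon/lat input is expressed in degrees, so a great-circle measure like this one is a useful cross-check on kilometre estimates.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in Km; p and q are (lon, lat) in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def densify(path, max_km=1.0):
    """Insert evenly spaced points so consecutive points are ~max_km apart or less."""
    dense = [tuple(path[0])]
    for start, end in zip(path, path[1:]):
        # number of pieces needed to bring each piece under max_km
        n = max(1, math.ceil(haversine_km(start, end) / max_km))
        for k in range(1, n + 1):
            t = k / n
            dense.append((start[0] + t * (end[0] - start[0]),
                          start[1] + t * (end[1] - start[1])))
    return dense


# Hypothetical two-point fault segment, roughly 10 Km long
path = [(174.0, -41.0), (174.1, -41.05)]
dense = densify(path, max_km=1.0)
gaps = [haversine_km(a, b) for a, b in zip(dense, dense[1:])]
print(len(dense), round(max(gaps), 3))
```

Unlike repeated insertion near the segment origin, splitting each segment into equal pieces keeps the spacing between consecutive points uniform.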


In [None]:
'''
    Interpolation method - to add more lat/lon coordinates between fault line segments
'''
import sys
from shapely.geometry import LineString

try:
    faults = fault_data()     # declare fault lines class
    original_paths = faults.get_paths()     # get all fault line paths

    ''' analyse the distance between fault line path coordinates '''
    print("Statistics of {} original fault lines before interpolating".format(len(original_paths)))
    for path in range(len(original_paths)):
        sum_lengths = float(0)
        for coords in range(len(original_paths[path][1])-1):
            sum_lengths += LineString([original_paths[path][1][coords], 
                                       original_paths[path][1][coords+1]]).length*6371.0 
#        print("{0} has {1} coordinates with an average inter-coordinate distance: {2} Km".format(original_paths[path][0], len(original_paths[path][1]), str(sum_lengths/len(original_paths[path][1]))))
        sys.stdout.write("{0} has {1} coordinates with an average inter-coordinate distance: {2} Km".format(original_paths[path][0], len(original_paths[path][1]), str(sum_lengths/len(original_paths[path][1]))))
    
    print("\nWait until interpolation is complete ...")
    interpolated_paths = faults.interpolate_paths(paths=original_paths,distance=2.5)
    print("\nPost interpolation statistics of {} fault lines".format(len(interpolated_paths)))
    for path in range(len(interpolated_paths)):
        sum_lengths = float(0)
        for coords in range(len(interpolated_paths[path][1])-1):
            sum_lengths += LineString([interpolated_paths[path][1][coords], 
                                   interpolated_paths[path][1][coords+1]]).length*6371.0 
#        print("{0} has {1} coordinates with an average inter-coordinate distance: {2} Km".format(interpolated_paths[path][0], len(interpolated_paths[path][1]), str(sum_lengths/len(interpolated_paths[path][1]))))
        sys.stdout.write("{0} has {1} coordinates with an average inter-coordinate distance: {2} Km".format(interpolated_paths[path][0], len(interpolated_paths[path][1]), str(sum_lengths/len(interpolated_paths[path][1]))))
    '''TODO change output to give numbers only; e.g. mean, median, and variance of fault coordinate distances'''
        
    '''TODO write the non-empty interpolated dataset to a file'''
#    if :
#        with open('../data/NZAFD/JSON/interpolated_NZAFD_Oct_2020_WGS84.json', 'w') as outfile:
#            json.dump(interpolated_paths, outfile)

except Exception as err:
    print("Error message:", err)

#### Station to nearest fault line distance metric
Estimate the distance from each station to the nearest fault line segment, then associate each station with its nearest neighbour fault line segments. We have a station with coordinates _A=\[s_lat, s_lon\]_ and two fault coordinates _B=\[f1_lat,f1_lon\]_ and _C=\[f2_lat, f2_lon\]_, and want to project A onto the arc between B and C and find the length of the projection arc.

1. __Loop through stations and faults__ to build a distance metric that can be used to determine the station sequence that might be triggered by a particular earthquake from a location along a fault line
1. Ideally __calculate perpendicular distance__ from the station to the line segment; i.e. [shortest arc length](https://math.stackexchange.com/questions/993236/calculating-a-perpendicular-distance-to-a-line-when-using-coordinates-latitude)
    1. _Convert the coordinates_ A, B, & C to _\[x,y,z\]_ triples with `x = sin(u)cos(v); y = sin(v); z = cos(u)cos(v)`, where _u_ is the longitude and _v_ the latitude
    1. _Compute_ `n = B × C` ("×" the cross product) and `N = n/√(n⋅n)` ("⋅" the dot product); _N_ is the unit normal of the plane through B, C, and the earth's center
    1. _Compute_ the angular distance between
        1. a ray from the earth's center to A and the plane with normal _N_ described above: `s = |90° − arccos(A⋅N)|`
        1. B and C as `s′ = arccos(B⋅C)`; assuming working in degrees (range from 0 to 180)
1. For now, defer to __calculating the shortest distance__ recommended by Edward Lee, discussed in [why we interpolate?](#Why-interpolate-more-coordinates?)
1. \[ERROR grumbling about lat / lon attributes\] __ObsPy geodetics__ [inside_geobounds](https://docs.obspy.org/packages/autogen/obspy.geodetics.base.inside_geobounds.html#obspy.geodetics.base.inside_geobounds) function can confirm whether the fault line segment B-C is within a given radius of the station A.
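
The perpendicular (cross-track) calculation in step 2 can be sketched as follows. This is an illustration under stated assumptions, not the code used in the next cell: a spherical Earth of radius 6371 Km, points given as `(lat, lon)` in degrees, and hypothetical function names. It measures the distance to the full great circle through B and C rather than to the segment itself, which is one reason the notebook falls back on the shortest distance to interpolated points.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def to_xyz(lat, lon):
    """Unit vector with x = sin(u)cos(v), y = sin(v), z = cos(u)cos(v)."""
    u, v = math.radians(lon), math.radians(lat)
    return (math.sin(u) * math.cos(v), math.sin(v), math.cos(u) * math.cos(v))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_track_km(station, f1, f2):
    """Distance from station A to the great circle through fault points B and C."""
    A, B, C = to_xyz(*station), to_xyz(*f1), to_xyz(*f2)
    n = cross(B, C)
    norm = math.sqrt(dot(n, n))      # zero if B and C coincide
    N = tuple(c / norm for c in n)
    cos_angle = max(-1.0, min(1.0, dot(A, N)))   # guard against rounding
    s = abs(math.pi / 2 - math.acos(cos_angle))  # angular distance in radians
    return s * EARTH_RADIUS_KM


# Fault along the equator, station one degree of latitude to the north
d = cross_track_km((1.0, 0.5), (0.0, 0.0), (0.0, 1.0))
print(round(d, 1))
```

One degree of arc on this sphere is about 111.2 Km, which is what the example returns.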

In [None]:
from obspy.geodetics import base
from shapely.geometry import LineString
import sys

def get_station_fault_metric_list():
#    try:
    st_meta = station_data()
    st_list, invalid_st_list = st_meta.get_stations()
    print('There are {0} active valid stations and {1} invalid station(s)'.format(len(st_list),len(invalid_st_list)))
    print('The invalid stations are:{0}'.format(invalid_st_list))
    print('Unique station types 1st & 2nd letters of station codes are: {}'.format(set(item[1] for item in st_list)))        

#    except Exception as err:
#        print("Error message:", err)

    #st_meta = station_data()
    #st_list, invalid_st_list = st_meta.get_stations()

    '''
        move along each fault line's coordinates to find a station closest to that point within a 30Km radius.
    '''
    try:
        st_flt_metric = []
        short_dist_ub = float(10**4)
        null_nearest_flt_coord = [0.0000, 0.0000]

        print("Wait for a few minutes to build the metric comprising {} stations and {} faults...".format(len(st_list), len(interpolated_paths)))
        for indx, each_station in enumerate(st_list):
            sys.stdout.write("\r" + "{0} of {1} Calculating faults closest to Station {2}.".format(indx+1, len(st_list), each_station[0]))
   #         print("{0} Calculating faults closest to Station {1} latitude {2} and longitude {3}".format(indx+1, each_station[0],each_station[2],each_station[3]))

            for each_fault in interpolated_paths:
                st_coord = [each_station[3],each_station[2]]
                shortest_distance = short_dist_ub
                nearest_fault_coord = null_nearest_flt_coord
                for flt_coord in range(len(each_fault[1])):
                    st_to_flt = LineString([each_fault[1][flt_coord], st_coord]).length*6371.0

                    ''' TODO make the correct projection
                    st_to_flt = LineString([each_fault[1][flt_coord], st_coord])
                    st_to_flt.srid = 4326
                    st_to_flt.transform(3857)
                    st_to_flt.length
                    '''
                    if st_to_flt < shortest_distance:
                        shortest_distance = st_to_flt
                        nearest_fault_coord = each_fault[1][flt_coord]
                if shortest_distance < short_dist_ub :
                    st_flt_metric.append([each_station[0], st_coord, each_fault[0], nearest_fault_coord, shortest_distance])
    
            '''
                TODO fix the error on the lat / lon attributes
                if base.inside_geobounds(each_fault[1], minlatitude=None, maxlatitude=None, 
                                         minlongitude=None, maxlongitude=None, 
                                         latitude=36, longitude=174, 
                                 minradius=1/6378137.0, maxradius=30.0/6378137):
                    print(each_fault[0],"yes")
                else:
                    print(each_fault[0],"no")
            '''
        print("Done building the metric size {}".format(len(st_flt_metric))  if len(st_flt_metric) > 0 else "Empty metric; no data was built")

        #            min_distance_to_fault = calc_vincenty_inverse(lat1, lon1, lat2, lon2, a=6378137.0, f=0.0033528106647474805)
        #            statio_faults.append[interpolated_paths[each_fault]]
    except Exception as err:
        print("Error message:", err)
        
    return st_flt_metric

#### Build a 2D array station-fault metric
We begin with the non-empty station fault list comprising the station code and coordinates, fault name and coordinates, and the distance between them. The list is transformed into an n_station by n_fault 2D array with element values:
* _r\_station\_type_ - a ranking of the [station types](#Class-of-station-data-processing-methods) based on their contribution to earthquake detection
* _d\_station\_fault_ - distance between the station coordinates and the nearest interpolated fault path coordinate
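
The list-to-array step can be sketched in plain Python as below. The row layout `[station code, station coord, fault name, fault coord, distance]` follows the list built earlier; the function name, the sample rows, and the choice to sort the codes are assumptions. Sorting is used because iterating over a `set` gives no guaranteed order, so sorted indices make the row and column positions reproducible between runs. Only the distance is stored here; `0.0` marks a station-fault pair with no recorded distance.

```python
def pivot_station_fault(rows):
    """Pivot [station, st_coord, fault, flt_coord, distance] rows into a matrix."""
    stations = sorted({r[0] for r in rows})
    faults = sorted({r[2] for r in rows})
    st_idx = {code: i for i, code in enumerate(stations)}
    flt_idx = {name: j for j, name in enumerate(faults)}
    matrix = [[0.0] * len(faults) for _ in stations]
    for code, _, name, _, dist in rows:
        matrix[st_idx[code]][flt_idx[name]] = dist
    return stations, faults, matrix


# Hypothetical rows, for illustration only
rows = [
    ["WEL", [174.77, -41.28], "Fault A", [174.80, -41.30], 3.4],
    ["WEL", [174.77, -41.28], "Fault B", [175.10, -41.00], 41.9],
    ["CECS", [172.62, -43.53], "Fault B", [172.70, -43.50], 7.2],
]
stations, faults, matrix = pivot_station_fault(rows)
print(stations, faults, matrix)
```

The nested list can be passed to `np.array(matrix)` when a NumPy array is needed for clustering.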

In [None]:
def get_station_fault_metric_array(list_st_flt_metric: list, max_separation: float = 30000.0):
    import numpy as np

    if not isinstance(list_st_flt_metric, list):
        raise TypeError
#    return list_st_flt_metric[index]
    
    count = 0
    st_flt_arr = np.empty((0, 0))     # placeholder returned if building the array fails
    st_flt_ub = max_separation     # 30Km distance between station and fault line
    cls_st = station_data()  # from the class 'station data' get dictionary of 'station types'
    lst_tmp_st_types = []
    for idx_st_type, val_st_type in enumerate(list(cls_st.get_types())):
        lst_tmp_st_types.append([idx_st_type, val_st_type])

    try:
        '''
            filter list to reflect maximum separation upper bound for the distance between faults and stations
        '''
        
        bounded_st_flt = [idx for idx, element in enumerate([row[4] for row in list_st_flt_metric]) if element <= st_flt_ub]
        print("\nNumber of stations and faults within {0}m distance {1} of a total {2}".format(st_flt_ub,len(bounded_st_flt), len(list_st_flt_metric)))

        '''
            Build the input array with rows = station and columns = faults
        '''
        unique_stations = set([row[0] for row in list_st_flt_metric])
        unique_faults = set([row[2] for row in list_st_flt_metric])
        st_flt_arr = np.zeros([len(unique_stations),len(unique_faults)], dtype = float)
        tmp_st_flts = []
        print("Wait a moment while we construct an array with shape {} for stations in rows and faults in columns".format(st_flt_arr.shape))

        '''
            TODO change the array element to a tuple with [station-type-ranking, station-fault-distance]
        '''
        for st_indx, st_val in enumerate(unique_stations):
            # retrieve the station code, fault name, and distance from the list
            tmp_st_flts = [[row[0],row[2],row[4]] for row in list_st_flt_metric if row[0] == st_val]
            tmp_st_type = [row[0] for row in lst_tmp_st_types if row[1] == str(st_val[1]+st_val[2])]
            for flt_indx, flt_val in enumerate(unique_faults):
                # from the st_flt_metric get the distance value corresponding to the st_val and flt_val
                for tmp_indx, row in enumerate(tmp_st_flts):
                    if  row[0] == st_val and row[1] == flt_val:
                        # a float array cell cannot hold a [rank, distance] pair; store the distance until the TODO above is done
                        st_flt_arr[st_indx,flt_indx] = row[2]

        ''' TODO remove all zero rows and columns '''
        #st_flt_arr[~np.all(st_flt_arr == 0, axis=0)]
        #st_flt_arr[~np.all(st_flt_arr[..., :] == 0, axis=0)]
        print("station fault {1}D array shape {0} has {2} elements and an itemsize {3}".format(st_flt_arr.shape, st_flt_arr.ndim, st_flt_arr.size, st_flt_arr.itemsize))
    except Exception as err:
        print("Error message:", err)
    return st_flt_arr

## OBJECTIVE 1.C - STATION FAULT COARSER TOPOLOGY

### Define clustering methods
[Learn about clustering methods](https://realpython.com/k-means-clustering-python/)
#### Class of Clustering algorithms
1. _get_dbscan_labels()_
    1. Compute the cluster property measures to estimate the acceptability
    1. Dump the output to a file including cluster label, lat/lon, station code, and so on
1. _get_nn_labels()_
    1. Compute the mean distance between [nearest neighbours](https://scikit-learn.org/stable/modules/neighbors.html) of a minimum 3 points
    1. Also consider [mean nearest neighbour distance](https://pysal.org/notebooks/explore/pointpats/distance_statistics.html#Mean-Nearest-Neighbor-Distance-Statistics)
1. _get_kmean_labels()_
    1. separates the station fault distances into _n\_clusters_ with similar variances from the mean centroid
    1. returns the cluster labels associated with the station fault metric

__Note 1:__ - Apply DBSCAN to cluster stations with an epsilon < 30Km. DBSCAN is preferred over K-means clustering because K-means considers the variance while DBSCAN considers a distance function. This gives the capacity to build clusters serving the criterion of < 30Km distance between stations.

__Note 2:__ - An inherent __problem of DBSCAN__ is that it places data points in the same cluster if pair-wise data points satisfy the epsilon condition. This does not adequately satisfy the required condition that all data points in a cluster are within the desired epsilon distance of each other.
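
The chaining effect in Note 2 can be demonstrated without scikit-learn. The sketch below is a simplification, not DBSCAN itself: it labels the connected components of the graph that links points no more than epsilon apart, which is what DBSCAN's clusters reduce to when `min_samples=1` makes every point a core point. The station coordinates and function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(p, q):
    """Great-circle distance in Km; p and q are (lat, lon) in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def eps_components(points, eps_km):
    """Label connected components of the graph linking points within eps_km."""
    labels = [-1] * len(points)
    current = 0
    for seed in range(len(points)):
        if labels[seed] != -1:
            continue
        labels[seed] = current
        stack = [seed]
        while stack:                      # depth-first expansion of the cluster
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and haversine_km(points[i], points[j]) <= eps_km:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Four hypothetical stations about 22 Km apart along a meridian
points = [(-41.0, 174.0), (-41.2, 174.0), (-41.4, 174.0), (-41.6, 174.0)]
labels = eps_components(points, eps_km=30.0)
span = haversine_km(points[0], points[-1])
print(labels, round(span, 1))
```

All four stations receive the same label even though the first and last are well over 30 Km apart, so a cluster found this way does not guarantee that every pair of members satisfies the epsilon bound.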


In [None]:
'''
    Relevant clustering functions necessary for the station-fault analysis
'''
class clustering():
    def __init__(self):
        pass
    
    '''
        TODO consider OPTICS (Ordering Points To Identify the Clustering Structure)
    '''

    '''
        DBSCAN clustering - lat/lon pairs
    '''
    def get_dbscan_labels(self,st_arr):
    
        import numpy as np     # needed below for np.radians and np.zeros_like
        from sklearn.cluster import DBSCAN
        from sklearn import metrics
        import sklearn.utils
        from sklearn.preprocessing import StandardScaler
        from sklearn.datasets import make_blobs

        err="0"
    #    try:
        X, labels_true = make_blobs(n_samples=len(st_arr), centers=st_arr, cluster_std=0.4,random_state=0)
        db = DBSCAN(eps=30.0/6371.0, min_samples=3, algorithm='ball_tree', metric='haversine').fit(np.radians(X))
        print('DBSCAN epsilon:',db.eps,'algorithm:', db.algorithm, 'metric: ', db.metric)
        core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
        core_samples_mask[db.core_sample_indices_] = True
#        print('core samples mask', len(core_samples_mask),core_samples_mask)
        labels = db.labels_
#        print("DBSCAN found %0.3f labels" % labels )
    #    except Exception as err:
    #        print("Error message:", err)
    #        labels = ""
        return labels, labels_true, core_samples_mask

    '''
        K nearest neighbour clustering
    '''
    def get_nn_labels(self,st_flt_list):
    
        from sklearn.neighbors import NearestNeighbors

        # Augment station array with cluster number
        # Start a new station coordinates and details tuple
        st_list = []
        i=0
        for i in range(len(labels)):
            st_row = [tmp_arr[i,0],labels[i],tmp_arr[i,1],tmp_arr[i,2],tmp_arr[i,3]]
            st_list.append(list(st_row))

        clusters = list({item[1] for item in st_list})

        for each_cluster in clusters:
            cluster_list = list(st_list[j] for j in range(len(st_list)) if st_list[j][1] == each_cluster)
            cluster_arr = np.delete(cluster_list, [0,1,4],axis=1).astype(float)     # np.float was removed from recent NumPy releases
            nbrs = NearestNeighbors(n_neighbors=3, algorithm='brute', metric='haversine').fit(cluster_arr)
            distances, indices = nbrs.kneighbors(cluster_arr)
            print(nbrs.kneighbors_graph(cluster_arr).toarray())
    
            each_cluster_clique = client.get_stations(latitude=-42.693,longitude=173.022,maxradius=30.0/6371.0, starttime = "2016-11-13 11:05:00.000",endtime = "2016-11-14 11:00:00.000")
            print(each_cluster_clique)
            _=inventory.plot(projection="local")
    
            break

        sorted_rank = sorted(st_list, key=lambda i: (int(i[1])), reverse=True)
        #print('Code, Cluster, Latitude, Longitude, Elevation')
        #print(sorted_rank)
        return sorted_rank
    
    '''
        K Means clustering - station-fault distance metric
        Parameters:
            number of clusters = 5 gives optimal Homogeneity, V-measure, and Silhouette Coefficient
            maximum number of iterations = 300 to minimize the sum of the squared error; i.e. the clustering quality measure
    '''
    def get_kmean_labels(self, st_flt_arr, num_clusters=5):
        
        from sklearn.cluster import KMeans
#        import sklearn.utils
        from sklearn.preprocessing import StandardScaler
        from sklearn.datasets import make_blobs
        import numpy as np
        
        st_flt_arr.reshape(-1, 1)
        X, labels_true = make_blobs(n_samples=len(st_flt_arr), centers=st_flt_arr, cluster_std=0.4,random_state=0)
        scaler = StandardScaler()
        scaled_features = scaler.fit_transform(X)
#        kmeans = KMeans(init="random", n_clusters=num_clusters, n_init=5,max_iter=300, random_state=5).fit(X)
        kmeans = KMeans(init="random", n_clusters=num_clusters, n_init=5,max_iter=300, random_state=5)
        kmeans.fit(scaled_features)
        labels = kmeans.labels_
        print(kmeans.labels_)
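        # k-means has no notion of core samples (that is a DBSCAN concept), so mark
        # every station as a core sample; an all-False mask would hide every point in the plot
        core_samples_mask = np.ones_like(kmeans.labels_, dtype=bool)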
        
#        return kmeans
        return labels, labels_true, core_samples_mask

### Cluster Stations and faults by distance

#### Apply K-means clustering
We use the k-means function defined in the [clustering class](#Class-of-Clustering-algorithms). The [drawbacks of k-means noted by scikit-learn](https://scikit-learn.org/stable/modules/clustering.html#k-means) have been considered: we assume that the clusters are convex and isotropic, and that a principal component analysis has been applied prior to the clustering.
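
To make the assumption above concrete, here is a minimal sketch of standardising features, applying a principal component analysis, and then running k-means. The synthetic `features` array, the choice of two components, and the two clusters are illustrative assumptions, not the notebook's station-fault data or settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for station features (assumption): two compact,
# well separated groups of 20 points each in a 3-D feature space.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.3, size=(20, 3))
group_b = rng.normal(loc=5.0, scale=0.3, size=(20, 3))
features = np.vstack([group_a, group_b])

# Standardise, project onto two principal components, then cluster.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(features)
print(labels)
```

Chaining the steps in a pipeline keeps the scaling and projection tied to the clustering, so the same transformations are applied whenever the model is refitted.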

In [None]:
from sklearn import metrics
import numpy as np

'''
    reconstruct the station fault metric list to an array
'''
try:
    st_flt_arr = np.array([])  # placeholder until the metric array is built below
    print("Wait a moment to construct the station fault metric list ...")
#    st_flt_list = get_station_fault_metric_list()
    if not isinstance(st_flt_list, list):
        raise TypeError
    else:
        print("Received station fault list with distance metric and it looks like this:\n{}".format(st_flt_list[0:5]))
        st_flt_arr = get_station_fault_metric_array(st_flt_list)
        print("Received array with {0} dimensions of shape {1} and it looks like this:\n{2}".format(st_flt_arr.ndim, st_flt_arr.shape,st_flt_arr))
except Exception as err:
    print("Error message:", err)

''' 
    Apply k-means clustering to the 2D array metric
'''
try:
    cl_method = clustering()
    # Run k means to get the cluster labels
    print("Begin k means clustering ...")
    labels, labels_true, core_samples_mask = cl_method.get_kmean_labels(st_flt_arr, 13)
    #print('core samples mask', len(core_samples_mask),core_samples_mask)
    print("Clustering complete!")

    # Number of clusters in labels, ignoring noise if present.
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)

    print('\nPerformance evaluation ...')
    print('Total number of stations: %d' % len(labels))
    print('Estimated number of clusters: %d' % n_clusters_)
    print('Estimated number of noise points: %d' % n_noise_)

    print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
    print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
    print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
    print("Adjusted Rand Index: %0.3f"
          % metrics.adjusted_rand_score(labels_true, labels))
    print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, labels))
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(st_flt_arr, labels))

except Exception as err:
    print("Error message:", err)

#### Plot results
1. Plot clusters as [Voronoi cells](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.voronoi_plot_2d.html) with varied colors unique to each cluster, also displaying the centroid
1. Plot fault lines to show the closest sensor in each cluster to the fault line
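
The Voronoi step in the list above could be sketched as follows. The centroid coordinates are made-up values chosen only for illustration, and this is not the plotting code used in the notebook; it only shows how `scipy.spatial.Voronoi` and `voronoi_plot_2d` fit together.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

# Hypothetical cluster centroids as (longitude, latitude) pairs (assumption).
centroids = np.array([
    [172.6, -43.5],
    [174.8, -41.3],
    [174.8, -36.9],
    [170.5, -45.9],
    [176.2, -38.7],
    [173.3, -42.0],
])

# Build the Voronoi diagram: one cell per centroid.
vor = Voronoi(centroids)

# Draw the cell boundaries, then overlay each centroid in its own colour.
fig, ax = plt.subplots(figsize=(8, 10))
voronoi_plot_2d(vor, ax=ax, show_vertices=False, line_colors="gray")
colors = plt.cm.Spectral(np.linspace(0, 1, len(centroids)))
ax.scatter(centroids[:, 0], centroids[:, 1], c=colors, edgecolors="k", s=120)
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. Note that the cells are computed in plain longitude/latitude space, so the boundaries are only an approximation of geodetic proximity.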

In [None]:
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

plt.figure(figsize=(30, 40))
#nz_map = Basemap(width=15000,height=15000,projection='merc',
#            resolution='l',lat_0=-40,lon_0=176.)
#nz_map.drawcoastlines()

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = station_coordinates[class_member_mask & core_samples_mask]
    # Label each cluster so that plt.legend() below has entries to display
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14, label='Cluster %d' % k)

    # uncomment to plot the noise
    #xy = station_coordinates[class_member_mask & ~core_samples_mask]
    #plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
    #         markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.legend(loc='upper left', fontsize=20)
plt.xlabel('Latitude')
plt.ylabel('Longitude')
plt.show()

# DISCUSSION

## Data preparation

## Clustering

### DBSCAN results
It is evident that the cluster with the largest volume of data points is spread across the geography. Therefore, DBSCAN is shown to be inappropriate for clustering stations to estimate whether they hold the property of being within 30 km of each other.
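
One way to test the 30 km property for a given cluster is to compute every pairwise haversine distance between its stations. The sketch below assumes stations are given as `(latitude, longitude)` pairs in degrees; the helper names `haversine_km` and `cluster_within` and the sample coordinates are hypothetical and not part of the notebook.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, as in the notebook's 30.0/6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def cluster_within(stations, max_km=30.0):
    """True if every pair of stations is at most max_km apart."""
    return all(
        haversine_km(*a, *b) <= max_km for a, b in combinations(stations, 2)
    )


# Hypothetical clusters: one compact, one with a distant station added.
compact = [(-42.693, 173.022), (-42.75, 173.10), (-42.60, 173.00)]
spread = compact + [(-41.3, 174.8)]
print(cluster_within(compact), cluster_within(spread))
```

A cluster that fails this check, as a geographically spread DBSCAN cluster would, cannot serve as a ring of stations within 30 km of each other.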


## RESOURCES
1. [Global data services and standards](http://www.fdsn.org/services/) offered by the International Federation of Digital Seismograph Networks (FDSN).
1. GeoNet resources:
   1. [Stream Naming Conventions](https://www.geonet.org.nz/data/supplementary/channels) are based on historical usage together with recommendations from the [SEED manual](https://www.fdsn.org/seed_manual/SEEDManual_V2.4.pdf)
   1. [Python tutorials](https://www.geonet.org.nz/data/tools/Tutorials) for using GeoNet resources
1. [Seismo-Live](https://krischer.github.io/seismo_live_build/html/Workshops/2017_Baku_STCU_IRIS_ObsPy_course/07_Basic_Processing_Exercise_solution_wrapper.html) examples of getting station waveforms, inventory, events, arrival times, and responses, and of plotting, using ObsPy
1. Choosing [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html) over KMeans: 
   1. Discussion of the [three clustering methods](https://realpython.com/k-means-clustering-python/): k-means, hierarchical, and density-based clustering
   1. Fundamentally, KMeans requires us to first select the number of clusters we wish to find, whereas DBSCAN does not.
   1. [Clustering to reduce spatial data sizes](https://geoffboeing.com/2014/08/clustering-to-reduce-spatial-data-set-size/): KMeans is not an ideal algorithm for latitude-longitude spatial data because it minimizes variance, not geodetic distance.
   1. [Explanation of DBSCAN clustering](https://towardsdatascience.com/explaining-dbscan-clustering-18eaf5c83b31) also identifies a drawback of KMeans clustering: it is vulnerable to outliers, which have a significant impact on the way the centroids move.
1. [Example of scikit-learn DBSCAN](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html)
1. [obspy.geodetics](https://docs.obspy.org/packages/obspy.geodetics.html) - various geodetic utilities for ObsPy - try an alternative clustering method with obspy geodetics
1. Mapping tutorials
   1. Visualization: [Mapping Global Earthquake Activity](http://introtopython.org/visualization_earthquakes.html)
   1. Plotting data on a map [(Example Gallery)](https://matplotlib.org/basemap/users/examples.html)
1. Calculating a [perpendicular distance to a line](https://math.stackexchange.com/questions/993236/calculating-a-perpendicular-distance-to-a-line-when-using-coordinates-latitude), when using coordinates (latitude & longitude)
1. Apply the [moment tensor](https://earthquake.usgs.gov/learn/glossary/?term=moment%20tensor), especially the Seismic Moment Tensor Inversion (SMTI) analysis, when computing fault line movement and picking the stations that might be triggered first, because the [moment tensor would determine the intensity and wave propagation characteristics](https://www.esgsolutions.com/technical-resources/microseismic-knowledgebase/what-is-a-moment-tensor).