# GEOL 593: Seismology and Earth (*and Mars*) Structure

## Lab assignment 10: Using machine learning algorithms to detect and locate earthquakes

In this lab, we will learn to use the machine learning-based research tools PhaseNet and GaMMA (Gaussian Mixture Model Associator) to build a seismic catalog. PhaseNet (Zhu et al., 2018) is a deep learning model for picking P and S wave arrivals from continuous seismic data. In order to relate individual travel time picks to earthquakes, phases must be associated with a common origin point (i.e., earthquake hypocenter). We will use GaMMA for the association problem. 

### 2022 Mauna Loa Eruption
On November 27, 2022, at 11:30 PM HST, the Hawaiian volcano Mauna Loa began erupting for the first time since 1984. During the ~2 week eruptive episode, effusive lava flows originating from Mauna Loa's summit caldera covered a total area of > 40 km$^2$. Prior to the eruption, an increased level of seismicity near Mauna Loa was observed, indicating unrest. Additionally, during the eruption, a swarm of seismic activity was recorded, with the USGS reporting > 200 events with magnitude greater than 1.2. We will use PhaseNet and GaMMA to further explore the seismicity during the early part of the eruptive sequence.

### QuakeFlow 

The end-to-end workflow of detecting and locating earthquakes with machine learning tools is a multi-step problem that can get complicated. Luckily for us, researchers have developed tools to make the workflow simpler. In particular, QuakeFlow (Zhu et al., 2023) was recently developed to make it easier to use PhaseNet and GaMMA together to create earthquake catalogs from raw data. For this lab, we will need to set up an environment for QuakeFlow and install the relevant packages. 

#### Creating a QuakeFlow environment
To avoid conflicting package requirements between QuakeFlow and your default python environment, we will make a specific environment for this project. To do this, open a terminal and enter the following commands.

`conda create -n quakeflow`

`conda activate quakeflow`

`conda install jupyter`

#### Installing PhaseNet and GaMMA
Both the PhaseNet and GaMMA packages are linked to the QuakeFlow GitHub repository. Therefore, to download all of the code you will need, run:

`git clone --recursive https://github.com/AI4EPS/QuakeFlow`

Next, in a terminal, navigate to the directory where you cloned QuakeFlow. To change directories in a terminal use `cd` (for both Windows and Unix-based operating systems). Make sure that your conda `quakeflow` environment is activated in the terminal. Once in the QuakeFlow directory, navigate into the PhaseNet directory, and install the package with `pip install -e .`. Do the same for the GaMMA package. Let me know if you have trouble with any of these steps.
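
To confirm that both installs worked, one option is to ask Python whether it can find the packages. The sketch below assumes the packages are importable as `phasenet` and `gamma`; the helper name `check_packages` is made up for illustration.

```python
import importlib.util


def check_packages(names):
    """Return {name: True/False} depending on whether each package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# "phasenet" and "gamma" are the assumed import names of the two packages.
status = check_packages(["phasenet", "gamma"])
for name, found in status.items():
    print(f"{name}: {'found' if found else 'NOT found'}")
```

If either package is reported as not found, the most likely causes are that the `quakeflow` environment was not active during `pip install -e .`, or that the notebook is running in a different environment.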

In [None]:
import os
import time  # used by download_waveforms to pause between retries
import json
import obspy
import pickle
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cartopy
import cartopy.crs as ccrs

from collections import defaultdict
from obspy import UTCDateTime
from obspy.clients.fdsn import Client

###  <font color='red'>Question 1 </font>

To keep track of all of the parameters required to generate a seismic catalog, we will make a configuration file in JSON format. The cell below provides a function where you can set all of the important parameters, including the geographical extent of your study region, the seismic network and stations you would like to analyze, and the time period you are interested in. In this study, we would like to use broadband stations in the PT network, during the beginning of the eruptive sequence.

**Modify the `set_config` function so that the time period starts 1 hour prior to the eruption, and ends 3 hours after the start of the eruption.** To do this, you will need to set the `starttime` and `duration_s` parameters. Remember, the time reported above for the start of the eruption is in HST, and you want UTC!
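
As a reminder of how the time zone conversion works, here is a minimal sketch using only the Python standard library. HST is a fixed offset of 10 hours behind UTC, with no daylight saving time. The date below is an arbitrary example, not the eruption time, and the helper name `hst_to_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# HST is a fixed offset of UTC-10 hours (Hawaii does not observe daylight saving time).
HST = timezone(timedelta(hours=-10), name="HST")


def hst_to_utc(year, month, day, hour, minute=0):
    """Convert a local Hawaii time to a timezone-aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=HST)
    return local.astimezone(timezone.utc)


# Arbitrary example (not the eruption time): 8:00 PM HST on 1 March 2020.
example_utc = hst_to_utc(2020, 3, 1, 20, 0)
print(example_utc.strftime("%Y-%m-%dT%H:%M:%S"))
```

Note that an evening time in HST falls on the following calendar day in UTC. A string in the printed format can be passed to `obspy.UTCDateTime`, and a duration in hours is converted to seconds by multiplying by 3600.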

In [None]:
def set_config(config_json='config.json'):
    region_name = "Hawaii"
    
    center = (-155.5, 19.5) #longitude, latitude 
    horizontal_degree = 1.0
    vertical_degree = 1.0

    #-----complete these parameters
    #starttime = obspy.UTCDateTime('')
    #duration_hrs = 
    #duration_s = 
    
    endtime = starttime + duration_s
    client = "IRIS"
    network_list = ["PT"]
    channel_list = 'HHZ'
    
    ####### save config ########
    degree2km = np.pi * 6371 / 180
    config = {}
    config["region"] = region_name
    config["center"] = center
    config["xlim_degree"] = [
        center[0] - horizontal_degree / 2,
        center[0] + horizontal_degree / 2,
    ]
    config["ylim_degree"] = [
        center[1] - vertical_degree / 2,
        center[1] + vertical_degree / 2,
    ]
    config["min_longitude"] = center[0] - horizontal_degree / 2
    config["max_longitude"] = center[0] + horizontal_degree / 2
    config["min_latitude"] = center[1] - vertical_degree / 2
    config["max_latitude"] = center[1] + vertical_degree / 2
    config["degree2km"] = degree2km
    config["starttime"] = starttime.datetime.isoformat(timespec="milliseconds")
    config["endtime"] = endtime.datetime.isoformat(timespec="milliseconds")
    config["networks"] = network_list
    config["channels"] = channel_list
    config["client"] = client

    ## PhaseNet
    config["phasenet"] = {}
    ## GaMMA
    config["gamma"] = {}
    ## HypoDD
    config["hypodd"] = {"MAXEVENT": 1e4}

    with open(config_json, "w") as fp:
        json.dump(config, fp, indent=2)

    print(json.dumps(config, indent=4))

###  <font color='red'>Question 2 </font>

Run the cell below to create the configuration file.
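
Once the configuration file has been written, it can be worth checking that the time window is as long as you intended. The sketch below reads the `starttime` and `endtime` strings back from a config file and returns the window length in hours. The helper name `config_duration_hours` and the demonstration values are made up for illustration.

```python
import json
import os
import tempfile
from datetime import datetime


def config_duration_hours(config_json="config.json"):
    """Read the start and end times from the config file and return the window length in hours."""
    with open(config_json, "r") as fp:
        config = json.load(fp)
    start = datetime.fromisoformat(config["starttime"])
    end = datetime.fromisoformat(config["endtime"])
    return (end - start).total_seconds() / 3600.0


# Demonstration with a made-up config written to a temporary file.
demo = {"starttime": "2020-03-01T00:00:00.000", "endtime": "2020-03-01T02:30:00.000"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_config.json")
    with open(path, "w") as fp:
        json.dump(demo, fp)
    demo_hours = config_duration_hours(path)
print(demo_hours)
```

For your own file, call `config_duration_hours()` with no arguments after running `set_config()`.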

In [None]:
#Run configuration set up
set_config()

In [None]:
def download_stations(config_json='config.json',plot=True):

    client = Client("IRIS")
    fname_list = ["fname"]
    
    with open(config_json, "r") as fp:
        config = json.load(fp)
        
    #make a directory for this project (if it doesn't exist)
    base_dir = '{}'.format(config['region'])
    if not os.path.exists(base_dir):
        os.makedirs(base_dir)
         
    ####### Download stations ########
    stations = client.get_stations(
        network=",".join(config["networks"]),
        station="*",
        starttime=config["starttime"],
        endtime=config["endtime"],
        minlongitude=config["xlim_degree"][0],
        maxlongitude=config["xlim_degree"][1],
        minlatitude=config["ylim_degree"][0],
        maxlatitude=config["ylim_degree"][1],
        channel=config["channels"],
        level="response",
        # filename="stations.xml"
    )

    ####### Save stations ########
    station_locs = defaultdict(dict)
    for network in stations:
        for station in network:
            print(station)
            for chn in station:
                
                #print(chn)
                print(chn.response)
                
                sid = f"{network.code}.{station.code}.{chn.location_code}.{chn.code[:-1]}"
                if sid in station_locs:
                    if chn.code[-1] not in station_locs[sid]["component"]:
                        station_locs[sid]["component"].append(chn.code[-1])
                        station_locs[sid]["response"].append(round(chn.response.instrument_sensitivity.value, 2))
                else:
                    tmp_dict = {
                        "longitude": chn.longitude,
                        "latitude": chn.latitude,
                        "elevation(m)": chn.elevation,
                        "component": [
                            chn.code[-1],
                        ],
                        "response": [
                            round(chn.response.instrument_sensitivity.value, 2),
                        ],
                        "unit": chn.response.instrument_sensitivity.input_units.lower(),
                        #"unit": 'm/s',
                    }
                    station_locs[sid] = tmp_dict
                    
    stations.write('{}/stations.xml'.format(base_dir),format='STATIONXML')

    station_json = '{}/stations.json'.format(base_dir)
    with open(station_json, "w") as fp:
        json.dump(station_locs, fp, indent=2)

    station_pkl = '{}/stations.pkl'.format(base_dir)
    with open(station_pkl, "wb") as fp:
        pickle.dump(stations, fp)
    
    if plot:
        ######## Plot stations ########
        station_locs = pd.DataFrame.from_dict(station_locs, orient="index")
        plt.figure()
        plt.plot(station_locs["longitude"], station_locs["latitude"], "^", label="Stations")
        plt.xlabel("Longitude (°)")
        plt.ylabel("Latitude (°)")
        plt.axis("scaled")
        plt.legend()
        plt.title(f"Number of stations: {len(station_locs)}")
        plt.show()

In [None]:
def download_waveforms(config_json='config.json'):

    client = Client("IRIS")
    fname_list = ["fname"]
    
    with open(config_json, "r") as fp:
        config = json.load(fp)
        
    waveform_dir = '{}/{}'.format(config['region'],'waveforms')
    
    if not os.path.exists(waveform_dir):
        os.makedirs(waveform_dir)
       
    station_pkl = '{}/stations.pkl'.format(config['region'])
    with open(station_pkl, "rb") as fp:
        stations = pickle.load(fp)

        
    #loop through stations and download data
    #-----------------------------------------------------------------------------
    max_retry = 10
    stream = obspy.Stream()
    
    starttime = UTCDateTime(config['starttime'])
    endtime = UTCDateTime(config['endtime'])
    
    num_sta = 0
    for network in stations:
        for station in network:
            print(f"********{network.code}.{station.code}********")
            
            retry = 0
            while retry < max_retry:
                try:
                    tmp = client.get_waveforms(
                        network.code,
                        station.code,
                        "*",
                        config["channels"],
                        starttime,
                        endtime,
                        )
                    stream += tmp
                    num_sta += len(tmp)
                    break
                except Exception as err:
                    print("Error {}.{}: {}".format(network.code, station.code, err))
                    message = "No data available for request."
                    if str(err)[: len(message)] == message:
                        break
                    retry += 1
                    time.sleep(5)
                    continue
            if retry == max_retry:
                # fname is not defined until after this loop, so it is not used in the message
                print(f"MAX {max_retry} retries reached: {network.code}.{station.code}")
   
    #-----------------------------------------------------------------------------   
    
    fname = "{}.mseed".format(starttime.datetime.strftime("%Y-%m-%dT%H:%M:%S"))
    
    if len(stream) > 0:
        stream.write(os.path.join(waveform_dir, fname))
        print('download successful')
    else:
        print('download failed')
        
    fname_list.append(fname)
    fname_csv = 'input_data.csv'
    with open(fname_csv, "w") as fp:
        fp.write("\n".join(fname_list))

###  <font color='red'>Question 3 </font>

The next step is to download the station information and waveforms. The two code blocks above provide the functions `download_stations` and `download_waveforms`, which read the parameters set in the `config.json` file. In the block below, run both of these functions. Note, the waveforms and station data will be saved in the directory `./Hawaii/` (or whatever you set your `region_name` to in `set_config`).

In [None]:
#Answer Q3 here.

###  <font color='red'>Question 4 </font>

**Make a map of your stations with cartopy**. The station metadata (including latitude & longitude) are written in the file `Hawaii/stations.json`. In the box below, I have written code to read this file, which loads it into a dictionary.

In [None]:
#Answer Q4 here.
import cartopy

with open('Hawaii/stations.json', "r") as fp:
    station_info = json.load(fp)
    
#complete code to plot stations map below

###  <font color='red'>Question 5 </font>

Make a plot of the waveforms that you downloaded. You can load the data stream with `st = obspy.read()` (passing the path to the file), and plot it with `st.plot()`. The data was written to an mseed file in `Hawaii/waveforms/`.

Hint: to see the name of the mseed file from Jupyter, you can type `ls Hawaii/waveforms` (i.e., to list the contents of that directory). `ls` is a Linux command, but many common shell commands also work when run in Jupyter notebook code cells.
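
If `ls` is not available in your setup, the same information can be obtained in plain Python. This is a small sketch using `pathlib`; the helper name `list_mseed_files` is made up for illustration, and the demonstration uses a temporary directory with an empty placeholder file rather than real data.

```python
import tempfile
from pathlib import Path


def list_mseed_files(directory="Hawaii/waveforms"):
    """Return the sorted names of all .mseed files in a directory."""
    return sorted(p.name for p in Path(directory).glob("*.mseed"))


# Demonstration on a temporary directory containing placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "example.mseed").touch()
    (Path(tmp) / "notes.txt").touch()
    demo_files = list_mseed_files(tmp)
print(demo_files)
```

In the notebook, `list_mseed_files()` with no arguments lists the files in `Hawaii/waveforms`.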


In [None]:
#Answer Q5 here.

###  <font color='red'>Question 6 </font>

To detect P and S waves in the data from each station, we will run PhaseNet's `predict.py` script. This is meant to be run from the command line, but we can also run it within a Jupyter notebook cell. We just need to set some paths, including where the `predict.py` script is located, what model we would like to use, and where our data is located.

**The block below is set up to run PhaseNet on my machine. Modify the code to run on your machine, by changing `predict_path` and `model_path` to the corresponding paths on your machine. Then run the code block!**

Note, if you are successful, you should see some code output, and a message that looks something like "Done with XXX P-picks and YYY S-picks". It may take a minute or two. Take note of how many P and S waves were picked by PhaseNet.
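
One way to reduce path mistakes is to build both paths from a single QuakeFlow root directory and check that they exist before running anything. The sketch below assumes the same layout as the paths in the cell below (`PhaseNet/phasenet/predict.py` and `PhaseNet/model/190703-214543/`) and reuses the same command-line options. The helper name `build_phasenet_command` and the root `/path/to/QuakeFlow` are placeholders.

```python
from pathlib import Path


def build_phasenet_command(quakeflow_root, region_dir="Hawaii"):
    """Build the predict.py command from one root directory (layout assumed from this lab)."""
    root = Path(quakeflow_root)
    predict_path = root / "PhaseNet" / "phasenet" / "predict.py"
    model_path = root / "PhaseNet" / "model" / "190703-214543"
    command = (
        f"python {predict_path} --model={model_path}/ "
        f"--data_dir={region_dir}/waveforms/ "
        f"--data_list=input_data.csv --stations={region_dir}/stations.json "
        f"--result_dir=./{region_dir} --format=mseed_array --amplitude"
    )
    return predict_path, model_path, command


# Placeholder root: replace with the directory where you cloned QuakeFlow.
predict_path, model_path, command = build_phasenet_command("/path/to/QuakeFlow")
for p in (predict_path, model_path):
    print(p, "exists" if p.exists() else "MISSING")
print(command)
```

If either path is reported as missing, fix the root directory before running the command.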

In [None]:
predict_path='/Users/rmaguire/Tools/QuakeFlow/PhaseNet/phasenet/predict.py'
model_path='/Users/rmaguire/Tools/QuakeFlow/PhaseNet/model/190703-214543/'

command = "python {} --model={} --data_dir=Hawaii/waveforms/ \
--data_list=input_data.csv --stations=Hawaii/stations.json \
--result_dir='./Hawaii' --format=mseed_array --amplitude".format(predict_path,model_path)

#run this as a command line argument
!{command}

###  <font color='red'>Question 7 </font>

If successful, PhaseNet should have created a file in the `Hawaii` directory called `picks.csv`, which contains a table of all of the potential P and S wave arrivals detected at each station. This file will serve as one of the inputs to GaMMA. To run the earthquake association problem and create a catalog of seismic events, run the block below.

This step can take several minutes. For ~1000 P-wave or S-wave picks, it runs in ~5 minutes on my laptop. If you have many more than 1000 picks from PhaseNet, you may have set some parameters incorrectly (e.g., the period of time is too long, or you included too many stations), and you should come to me for help.
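
To get a feel for what the association step works with, it helps to look at the velocity model set in the cell below: a P-wave speed of 6.0 km/s and an S-wave speed of 6.0 / 1.73 km/s. The sketch below computes the P and S travel times for one event-station pair, assuming straight rays in a uniform medium, which is what these constant values suggest. The geometry is hypothetical and the function `travel_times` is made up for illustration; it is not part of GaMMA.

```python
import math

VP = 6.0          # P-wave speed in km/s, as in config["vel"]["p"]
VS = 6.0 / 1.73   # S-wave speed in km/s, as in config["vel"]["s"]


def travel_times(event_xyz, station_xyz):
    """Return (P, S) travel times in seconds along a straight ray in a uniform medium."""
    distance_km = math.dist(event_xyz, station_xyz)
    return distance_km / VP, distance_km / VS


# Hypothetical geometry: event at 10 km depth, station at the surface 30 km east and 40 km north.
tp, ts = travel_times((0.0, 0.0, 10.0), (30.0, 40.0, 0.0))
print(f"P: {tp:.2f} s, S: {ts:.2f} s, S-P: {ts - tp:.2f} s")
```

Picks at different stations whose arrival times are consistent with a single origin time and location under this kind of model are candidates for belonging to the same earthquake.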

In [None]:
import pandas as pd
import json
from pyproj import Proj
from gamma.utils import association

pick_csv = 'Hawaii/picks.csv'
station_json = 'Hawaii/stations.json'
config_json = 'config.json'

picks = pd.read_csv(pick_csv, parse_dates=["phase_time"])
picks["id"] = picks["station_id"]
picks["timestamp"] = picks["phase_time"]
picks["amp"] = picks["phase_amp"]
picks["type"] = picks["phase_type"]
picks["prob"] = picks["phase_score"]

with open(config_json, "r") as fp:
    config = json.load(fp)

with open(station_json, "r") as fp:
    stations = json.load(fp)
    stations = pd.DataFrame.from_dict(stations, orient="index")
    stations["id"] = stations.index
    proj = Proj(f"+proj=sterea +lon_0={config['center'][0]} +lat_0={config['center'][1]} +units=km")
    stations[["x(km)", "y(km)"]] = stations.apply(
        lambda x: pd.Series(proj(longitude=x.longitude, latitude=x.latitude)), axis=1)
    stations["z(km)"] = stations["elevation(m)"].apply(lambda x: -x / 1e3)
    
## setting GaMMA configs
config["use_dbscan"] = False
config["use_amplitude"] = True
config["method"] = "BGMM"
if config["method"] == "BGMM":  ## BayesianGaussianMixture
    config["oversample_factor"] = 4
if config["method"] == "GMM":  ## GaussianMixture
    config["oversample_factor"] = 1

# Earthquake location
config["dims"] = ["x(km)", "y(km)", "z(km)"]
config["vel"] = {"p": 6.0, "s": 6.0 / 1.73}
config["x(km)"] = (np.array(config["xlim_degree"]) - np.array(config["center"][0])) * config["degree2km"]
config["y(km)"] = (np.array(config["ylim_degree"]) - np.array(config["center"][1])) * config["degree2km"]
config["z(km)"] = (0, 60)
config["bfgs_bounds"] = (
    (config["x(km)"][0] - 1, config["x(km)"][1] + 1),  # x
    (config["y(km)"][0] - 1, config["y(km)"][1] + 1),  # y
    (0, config["z(km)"][1] + 1),  # z
    (None, None),  # t
    )

# DBSCAN
config["dbscan_eps"] = 10  # second
config["dbscan_min_samples"] = 3  ## see DBSCAN

# Filtering
print(stations)
config["min_picks_per_eq"] = min(10, len(stations) // 3)
config["min_p_picks_per_eq"] = 0
config["min_s_picks_per_eq"] = 0
config["max_sigma11"] = 2.0  # s
config["max_sigma22"] = 2.0  # log10(m/s)
config["max_sigma12"] = 1.0  # covariance

# if use amplitude
if config["use_amplitude"]:
    picks = picks[picks["amp"] != -1]

# print(config)
event_idx0 = 1
assignments = []
for k, v in config.items():
    print(f"{k}: {v}")
    
#Run GaMMA association
catalogs, assignments = association(picks, stations, config, event_idx0, method=config["method"])
event_idx0 += len(catalogs)

## create catalog--------------------------------------------------------------------------------
catalogs = pd.DataFrame(
    catalogs,
    columns=["time"]
    + config["dims"]
    + [
        "magnitude",
        "sigma_time",
        "sigma_amp",
        "cov_time_amp",
        "event_index",
        "gamma_score",
    ],
    )

catalogs[["longitude", "latitude"]] = catalogs.apply(
    lambda x: pd.Series(proj(longitude=x["x(km)"], latitude=x["y(km)"], inverse=True)),
    axis=1,
    )
catalogs["depth(m)"] = catalogs["z(km)"].apply(lambda x: x * 1e3)

gamma_catalog_csv = 'gamma_catalog.csv'
catalogs.sort_values(by=["time"], inplace=True)
with open(gamma_catalog_csv, "w") as fp:
    catalogs.to_csv(
            fp,
            # sep="\t",
            index=False,
            float_format="%.3f",
            date_format="%Y-%m-%dT%H:%M:%S.%f",
            columns=[
                "time",
                "magnitude",
                "longitude",
                "latitude",
                "depth(m)",
                "sigma_time",
                "sigma_amp",
                "cov_time_amp",
                "gamma_score",
                "event_index",
            ],
        )

###  <font color='red'>Question 8 </font>

The above code block ran GaMMA and created the earthquake catalog `gamma_catalog.csv`. Let's take a look at what we found!

First, read in the catalog to a pandas dataframe, with `df = pd.read_csv`. How many earthquakes did you find?

Next, make a plot to summarize the earthquake statistics. Make a figure with 2 axes, that include histograms of the event magnitudes and event depths. You can do this with `plt.hist` (or `ax.hist` if you are using an axes object). Be sure to label axes. 

Hint: You can use the pandas dataframe in a similar way to a dictionary. For example, to select the array of magnitudes, you can use df['magnitude']. To see all of the variables in the catalog, you can type `df.keys()`.
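
The following sketch illustrates the hint on a small made-up table, so that it runs without the real catalog. It uses a few of the column names written by the GaMMA cell above; the values are invented for illustration.

```python
import io

import pandas as pd

# Made-up rows using a few of the column names written by the GaMMA cell above.
demo_csv = io.StringIO(
    "time,magnitude,depth(m)\n"
    "2020-03-01T00:00:01.000,1.5,3200.0\n"
    "2020-03-01T00:05:30.000,2.1,8700.0\n"
    "2020-03-01T00:09:12.000,0.8,1500.0\n"
)
df = pd.read_csv(demo_csv)

print(len(df))                 # number of rows (events)
print(list(df.keys()))         # column names
print(df["magnitude"].max())   # select a column like a dictionary entry
```

With the real catalog, replace `demo_csv` with the path to `gamma_catalog.csv`.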

In [None]:
#Answer Q8 here.

###  <font color='red'>Question 9 </font>
**Use cartopy to make a map of your seismicity catalog.** To make things more interesting, instead of plotting simple scatter points, color the points by the event depth (and include a colorbar). You can color the points by setting the argument `c` in `ax.scatter()` to the array of event depths. 

In [None]:
#Answer Question 9 here.


###  <font color='red'>Question 10 </font>
Lastly, visit the USGS Earthquake Map (https://earthquake.usgs.gov/earthquakes/map/) and compare your catalog with the USGS catalog. To search the catalog, click the cog-wheel icon in the top right-hand corner of the page, and scroll until you see the button labeled "SEARCH EARTHQUAKE CATALOG". This will allow you to perform a custom query. Your query should include events of all magnitudes during the same time frame as your analysis (i.e., starting 1 hour prior to the eruption and ending 3 hours after the beginning of the eruption). To set the region, use the "Draw Rectangle on Map" feature, and draw a box around the Big Island of Hawaii. 

What differences do you notice? How many earthquakes are in the USGS catalog compared to yours?

In [None]:
#Answer Q10 here.