# Part 1: Similarity and Distance 

This notebook covers the concepts of similarity and distance using k-nearest neighbors. First, we import our libraries with an alias (`import library as alias`) and load a couple of helper functions I wrote to speed things up. These modules (collections of functions) are located in the `code` directory.

In [3]:
# packages
import pandas as pd
import numpy as np
import scipy.spatial.distance as dist

# helper modules
import untapped_plot

## Introduction

Distance surfaces over and over in my data science workflows. It's relatively simple but incredibly useful for both data science and engineering work, and I think it is heavily underutilized.

Let's take a reservoir engineering example. You have a field of wells that are producing oil. You've just drilled a new well in the area and want to predict its production profile. Since this well is in a well-developed area, there are lots of wells in close proximity, but the geology is complex. You start by taking its closest neighbor using the northing and easting. Unfortunately, you discover that your well is much shallower than that neighbor and sits in different geology. So, you pull a couple of wells that are closer to the same depth and in similar rock. When looking into these, you discover that they are all pretty old and were completed using different technology. This prompts you to search further and find newer wells with similar geology. After you've spent a couple of hours picking similar wells, you feel satisfied and average their profiles to produce an average curve.

You've just executed a k-Nearest Neighbors (kNN) regression manually, slowly, and without any statistical underpinning. Effectively, you used an ad-hoc way of measuring an abstracted distance between features of your well (location, time, geology, etc.) to select the most appropriate examples for generating an average profile, also referred to as a type curve. But this process can be automated, evaluated against performance metrics, and optimized for statistical performance. Oh, and did I mention that it's incredibly fast and repeatable?

The kNN algorithm is one example of a distance-based data science technique. I hesitate to call it machine learning, because distance-based techniques don't really 'learn' by training a model through fitting and hyperparameter tuning. Yet this technique is incredibly robust and effective. There are also similar algorithms for unsupervised learning (e.g. clustering). In general, there are three steps to implementing a distance-based technique successfully, all of which were considered in the ad-hoc process above.

1. Choose a good distance metric with appropriate features
2. Rank potential candidates based on their distance to your target
3. Use a select number of these candidates to perform regression, classification, or clustering
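
The three steps above can be sketched in a few lines. This is a minimal illustration on made-up toy data, not the workflow used later in this notebook; the function name `knn_regress`, the feature values, and the choice of `k=3` are all assumptions for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_regress(X_train, y_train, x_target, k=3, metric="euclidean"):
    """Average the targets of the k training rows closest to x_target."""
    # Step 1: measure the distance from the target to every candidate
    d = cdist(X_train, np.atleast_2d(x_target), metric=metric).ravel()
    # Step 2: rank candidates from nearest to farthest and keep the first k
    nearest = np.argsort(d)[:k]
    # Step 3: use the selected candidates (here, a plain average)
    return y_train[nearest].mean(), nearest

# toy, made-up data: two features per well and one production value each
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([10.0, 12.0, 14.0, 50.0, 60.0])

pred, idx = knn_regress(X, y, x_target=[0.2, 0.1], k=3)
print(pred, idx)
```

Swapping the `metric` argument or the value of `k` changes which wells count as neighbors, which is exactly the choice the manual process was making implicitly.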

## Distance Metrics

So how do we measure distance? It's important to first understand that a) there are a lot of different ways to measure distance and b) each method is a mathematical construct for a specific understanding of distance, with its own pros and cons. For example, Manhattan, or taxicab, distance can be very useful when working with rasters or trying to evaluate connectivity of points. Euclidean distance is simple and provides the straight-line distance between two points, which is why it is the default in a lot of algorithms. There are even distance metrics for categorical and [binary data](https://en.wikipedia.org/wiki/Jaccard_index). The point of selecting a metric is to define our understanding of what constitutes 'near' and 'far' in multidimensional space. Don't underestimate the importance of this: the performance of any distance-based regression, classification, or clustering algorithm will ultimately depend on how you define your distance metric.
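
To make the difference between metrics concrete, here is a small sketch using `scipy.spatial.distance` on made-up vectors (the values are illustrative only and have nothing to do with the completions data).

```python
import numpy as np
import scipy.spatial.distance as dist

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

euclid = dist.euclidean(a, b)     # straight-line distance: sqrt(3**2 + 4**2)
manhattan = dist.cityblock(a, b)  # sum of absolute differences: 3 + 4

# Jaccard dissimilarity for binary vectors: the share of positions that
# disagree, counted only where at least one of the two vectors is True
u = np.array([True, True, False, False])
v = np.array([True, False, True, False])
jaccard = dist.jaccard(u, v)

print(euclid, manhattan, jaccard)
```

The same pair of points is 5 units apart under Euclidean distance and 7 under Manhattan distance, so 'near' and 'far' really do depend on the metric.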

So let's define a couple using our completions dataset.

In [47]:
# read in the completions data
well_data = pd.read_csv('../data/bcogc_well_comp_info.csv')

# parse the date columns as datetimes (unparseable values become NaT)
date_cols = ['frac_start_date','frac_end_date', 'on_prod_date', 'last_reported_date']
well_data[date_cols] = well_data[date_cols].apply(pd.to_datetime, errors='coerce')

well_data.head()

| Unnamed: 0 | unique_surv_id | wa_num | drilling_event | ground_elevtn | mean_ss_tvd | mean_ss_easting | mean_ss_northing | survey_well_type | on_prod_date | last_reported_date | ... | horiz_wells_in_5km | horiz_wells_in_10km | horiz_wells_in_25km | first_order_residual | isotherm | paleozoic_structure | raw_montney_top | third_order_residual | seismogenic | n_quakes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26709-0 | 26709 | 0 | 967.0 | -1038.286881 | 541258.8972 | 6312071.186 | horizontal | 2016-07-01 | 2019-09-01 | ... | 11 | 11 | 11 | -36.180738 | 85.046572 | -1248.991716 | 2018.9723 | 57.171039 | True | 4 |
| 1 | 26851-0 | 26851 | 0 | 1023.8 | -1066.814758 | 545866.1632 | 6316437.733 | horizontal | 2013-10-01 | 2019-09-01 | ... | 3 | 3 | 3 | -60.378993 | 78.536477 | -1176.143361 | 1891.19542 | 38.373928 | True | 2 |
| 2 | 27232-0 | 27232 | 0 | 1026.5 | -1041.326136 | 546102.0775 | 6313238.866 | horizontal | 2014-01-01 | 2019-09-01 | ... | 5 | 5 | 5 | -56.81433 | 84.212037 | -1198.455581 | 1904.202898 | 35.885353 | True | 1 |
| 3 | 27296-0 | 27296 | 0 | 989.6 | -1009.757363 | 545620.7248 | 6308451.059 | horizontal | 2013-07-01 | 2019-09-01 | ... | 1 | 1 | 1 | -6.258352 | 89.416457 | -1280.473172 | 1999.194236 | 78.190771 | False | 0 |
| 4 | 27302-0 | 27302 | 0 | 1045.0 | -1118.887041 | 542922.7749 | 6315105.076 | horizontal | 2013-10-01 | 2019-05-01 | ... | 6 | 6 | 6 | -37.993603 | 90.301766 | -1220.128996 | 2160.073261 | 60.405326 | False | 0 |


In [36]:
# select the numeric columns
num_well_data = well_data.select_dtypes(include=['int64','float64'])

# difference between frac start and end dates; parse first so the subtraction is not str - str
pd.to_datetime(well_data.frac_start_date, errors='coerce') - pd.to_datetime(well_data.frac_end_date, errors='coerce')

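In [11]:
# Canberra distance from every well to the first well; dist.canberra only takes two 1-D vectors, so use cdist for one-to-many
dist.cdist(num_well_data, num_well_data.loc[[0]], metric='canberra')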
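
The individual metric functions in `scipy.spatial.distance`, such as `canberra`, compare exactly two 1-D vectors, while `cdist` handles the one-to-many and many-to-many cases on 2-D inputs. A minimal sketch on a made-up feature matrix (the values are assumptions for illustration, not the well data):

```python
import numpy as np
import scipy.spatial.distance as dist

# toy, made-up feature matrix: one row per well
X = np.array([[1.0, 2.0], [2.0, 2.0], [3.0, 6.0]])

# one-to-one: two 1-D vectors in, one number out
d01 = dist.canberra(X[0], X[1])

# one-to-many: cdist expects 2-D inputs, so keep the target as a 1-row matrix
d_to_first = dist.cdist(X, X[[0]], metric="canberra").ravel()

print(d01, d_to_first)
```

Note that Canberra distance divides each absolute difference by the sum of the absolute values, so every feature contributes at most 1 regardless of its units.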

Numeric columns of the well data (first ten rows):

| Unnamed: 0 | wa_num | drilling_event | ground_elevtn | mean_ss_tvd | mean_ss_easting | mean_ss_northing | cum_gas_to_date_e3m3 | cum_oil_to_date_m3 | cum_water_to_date_m3 | cum_cond_to_date_m3 | ... | horiz_wells_in_1km | horiz_wells_in_5km | horiz_wells_in_10km | horiz_wells_in_25km | first_order_residual | isotherm | paleozoic_structure | raw_montney_top | third_order_residual | n_quakes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 26709 | 0 | 967.0 | -1038.286881 | 541258.8972 | 6312071.186 | 60090.8 | 0 | 3228.6 | 5.1 | ... | 3 | 11 | 11 | 11 | -36.180738 | 85.046572 | -1248.991716 | 2018.972300 | 57.171039 | 4 |
| 1 | 26851 | 0 | 1023.8 | -1066.814758 | 545866.1632 | 6316437.733 | 16501.3 | 0 | 1284.3 | 459.3 | ... | 1 | 3 | 3 | 3 | -60.378993 | 78.536477 | -1176.143361 | 1891.195420 | 38.373928 | 2 |
| 2 | 27232 | 0 | 1026.5 | -1041.326136 | 546102.0775 | 6313238.866 | 36857.6 | 0 | 2359.4 | 1008.1 | ... | 0 | 5 | 5 | 5 | -56.814330 | 84.212037 | -1198.455581 | 1904.202898 | 35.885353 | 1 |
| 3 | 27296 | 0 | 989.6 | -1009.757363 | 545620.7248 | 6308451.059 | 108532.8 | 0 | 2097.9 | 345.9 | ... | 0 | 1 | 1 | 1 | -6.258352 | 89.416457 | -1280.473172 | 1999.194236 | 78.190771 | 0 |
| 4 | 27302 | 0 | 1045.0 | -1118.887041 | 542922.7749 | 6315105.076 | 69402.6 | 0 | 3968.6 | 1112.0 | ... | 1 | 6 | 6 | 6 | -37.993603 | 90.301766 | -1220.128996 | 2160.073261 | 60.405326 | 0 |
| 5 | 27326 | 0 | 1100.9 | -1074.878942 | 541335.5276 | 6325353.003 | 72294.2 | 0 | 5514.8 | 927.8 | ... | 1 | 1 | 4 | 4 | -108.997979 | 85.167526 | -1093.196140 | 1957.981005 | 9.754361 | 0 |
| 6 | 27366 | 0 | 1020.2 | -1063.912279 | 544876.5772 | 6303945.383 | 138218.9 | 0 | 2188.3 | 5.9 | ... | 1 | 3 | 3 | 3 | 24.495887 | 94.506372 | -1342.719775 | 2080.839299 | 100.966933 | 0 |
| 7 | 27367 | 0 | 1020.1 | -1064.613386 | 544664.9623 | 6303620.220 | 206069.2 | 0 | 2185.7 | 5.2 | ... | 1 | 3 | 3 | 3 | 28.878601 | 94.877356 | -1349.984357 | 2089.210826 | 104.910890 | 0 |
| 8 | 27451 | 0 | 908.7 | -985.922762 | 544209.2403 | 6329951.104 | 26684.6 | 0 | 3335.8 | 1356.1 | ... | 1 | 1 | 1 | 1 | -79.951575 | 74.752051 | -1080.463243 | 1711.086822 | 45.287549 | 0 |
| 9 | 27487 | 0 | 1044.8 | -991.957500 | 542934.5776 | 6315105.723 | 96660.9 | 0 | 2842.1 | 2647.3 | ... | 1 | 6 | 6 | 6 | -38.125017 | 90.297489 | -1219.898580 | 2160.845496 | 60.312230 | 0 |


## Acknowledgments

This presentation wouldn't have been possible without all the support I've received from the following organizations:
<img src="../images/untapped_sponsors.jpg" alt="My amazing sponsors" style="width: 400px;"  align="left"/>
