# Repairing geographic locations, place names, and extracted data for the NWT Climate Explorer

## Issue
The location of Inuvik, NWT was found to be off by about ten degrees of longitude. A closer examination of all point locations used by the web tool found that many of the geographic coordinates needed refinement and that some place names needed to be updated.

## Fix
A revised spreadsheet of NWT geographic locations was produced (see https://github.com/ua-snap/geospatial-vector-veracity/blob/main/vector_data/point/nwt_point_locations.csv) and used to re-extract downscaled data for each location.

## Validation
This notebook will compare the existing data to the newly extracted data and check for data model integrity and for qualitative similarity. 

In [1]:
import subprocess
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict
from glob import glob

The previously extracted CSV files (incorrect for Inuvik) are still on branch `master`, while the freshly extracted data is on branch `inuvik-rextraction` (forgive the missing 'e'). I want to pull both sets of CSV files and compare them, so I'll use the `subprocess` module to check out the different git branches. This is probably not advisable and should only be done when both branches have clean working trees (no changes, nothing staged, etc.). Using git in the notebook is confusing (but fun!) because it is sensitive to whatever state the repository was left in from the terminal. Only two branches are relevant here, so I can be agnostic about which branch this "starts" on. This technique can also get stuck if this notebook is tracked in git, because git will see the notebook's modifications and complain about performing a `checkout`. The recommendation is to tell git to ignore notebooks while working here, and then re-track them when you are done and ready to commit.
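Since the branch switching depends on a clean working tree, it may help to check for one before switching. The following is a minimal sketch, assuming it is run from inside the repository; the helper names `is_clean` and `working_tree_is_clean` are hypothetical and not part of the original notebook.

```python
import subprocess

def is_clean(porcelain_output):
    """Return True if `git status --porcelain` output lists no changes."""
    # Porcelain output has one line per changed or untracked path,
    # so a clean tree produces an empty (or whitespace-only) string.
    return porcelain_output.strip() == ""

def working_tree_is_clean():
    """Run `git status --porcelain` in the current directory and interpret it."""
    result = subprocess.run(["git", "status", "--porcelain"],
                            capture_output=True, text=True, check=True)
    return is_clean(result.stdout)
```

Untracked files also count as changes in this sketch, which is stricter than `git checkout` itself requires.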

In [2]:
# Map each branch to the other one so we can toggle between the two
di_branches = {
    'master': 'inuvik-rextraction',
    'inuvik-rextraction': 'master',
}

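def which_branch():
    """Return the name of the currently checked-out git branch."""
    proc_branch = subprocess.run(["git", "branch"], stdout=subprocess.PIPE,
                                 universal_newlines=True, check=True)
    out_branch = proc_branch.stdout.splitlines()
    # `git branch` marks the current branch with a leading '*'
    current_branch = [x for x in out_branch if x.startswith('*')][0].split(' ')[-1]
    return current_branch

def switch_branch(di_branches):
    """Check out the other branch and wait for git to finish."""
    # check=True raises CalledProcessError if the checkout is refused,
    # instead of silently staying on the current branch
    subprocess.run(["git", "checkout", di_branches[which_branch()]],
                   stdout=subprocess.PIPE,
                   universal_newlines=True,
                   check=True)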

In [3]:
which_branch()

'inuvik-rextraction'

In [4]:
scenarios = ['historical', 'rcp45', 'rcp60', 'rcp85']

di_csv = {}

def qa_prep():
    """Read each scenario CSV on the current branch and record summary facts."""
    branch = which_branch()
    di_csv[branch] = {}
    for sc in scenarios:
        di_csv[branch][sc] = {}
        df = pd.read_csv(glob('../data/*' + sc + '*.csv')[0])
        di_csv[branch][sc]['csv'] = df
        di_csv[branch][sc]['shape'] = df.shape
        di_csv[branch][sc]['models'] = sorted(df.model.unique())
        # Use the year column here (the original read the model column again)
        di_csv[branch][sc]['years'] = sorted(df.year.unique())
        di_csv[branch][sc]['place_names'] = sorted(df.community.unique())

qa_prep()

In [5]:
switch_branch(di_branches)

In [6]:
which_branch()

'inuvik-rextraction'

In [7]:
qa_prep()

In [8]:
# Expectation: models and time spans (years) did not change across extractions
for sc in scenarios:
    model_check = di_csv['master'][sc]['models'] == di_csv['inuvik-rextraction'][sc]['models']  
    yr_check = di_csv['master'][sc]['years'] == di_csv['inuvik-rextraction'][sc]['years']
    print(sc, model_check, yr_check)    

KeyError: 'master'

However, the place names did change, and the freshly extracted data has one fewer community. Behchokǫ̀ was formerly represented by separate extractions for Rae and Edzo, two former communities that are only a few miles apart. They are now represented together (see encyclopedia article).
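The place-name comparison comes down to two set differences: names that disappeared and names that appeared. Here is a small sketch of that idea, assuming plain lists of names; the function `compare_place_names` and the sample lists are illustrative only and are not the real extraction output.

```python
def compare_place_names(old_places, new_places):
    """Return (removed, added) place names as sorted lists."""
    old_set, new_set = set(old_places), set(new_places)
    removed = sorted(old_set - new_set)  # in the old data only
    added = sorted(new_set - old_set)    # in the new data only
    return removed, added

# Illustrative lists only, not the actual community lists
old = ["Behchoko (Edzo)", "Behchoko (Rae)", "Inuvik"]
new = ["Behchokǫ̀", "Inuvik"]

removed, added = compare_place_names(old, new)
print("Removed:", removed)
print("Added:", added)
print("Net change in count:", len(new) - len(old))
```

A renamed community shows up once in each list, so a merge of two old names into one new name leaves two entries in `removed` and one in `added`.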

In [None]:

for sc in scenarios:
    print(sc)
    old_places = di_csv['master'][sc]['place_names']
    new_places = di_csv['inuvik-rextraction'][sc]['place_names']
    old_locs_not_in_new = list(set(old_places) - set(new_places))
    new_locs_not_in_old = list(set(new_places) - set(old_places))
    print(sorted(old_locs_not_in_new))
    print(sorted(new_locs_not_in_old))
    print("Old places: %d" % len(old_places))
    print("New places: %d" % len(new_places))



In [None]:
# OK so we reduced the number of communities by one
# so the shape of the newly extracted dataframes has changed
# how many rows per community for each scenario?
# That should be the difference in dataframe shapes
for sc in scenarios:
    print(sc)
    old_shape = di_csv['master'][sc]['shape']
    new_shape = di_csv['inuvik-rextraction'][sc]['shape']
    row_delta = old_shape[0] - new_shape[0]
    print("Old CSV Shape:", old_shape)
    print("New CSV Shape:", new_shape)
    print("Old - New Shape Difference (Number Rows):", row_delta)
    rows_per_location = di_csv['master'][sc]['csv'].query("community == 'Inuvik'").shape[0]
    print("Shape Difference Accounted for by reduction of point locations by one:", row_delta == rows_per_location)
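The row accounting above relies on every community contributing the same number of rows, so that Inuvik can stand in for any of them. The sketch below makes that assumption explicit on toy data; the frames, the community labels "A", "B", "C", and the helper `rows_for_place` are all hypothetical.

```python
import pandas as pd

def rows_for_place(df, place, column="community"):
    """Count the rows belonging to a single place."""
    return int((df[column] == place).sum())

# Toy data: three communities in the old frame, two in the new one
old_df = pd.DataFrame({"community": ["A", "A", "B", "B", "C", "C"],
                       "year": [2000, 2001] * 3})
new_df = pd.DataFrame({"community": ["A", "A", "B", "B"],
                       "year": [2000, 2001] * 2})

# Assumption being checked: all communities have the same row count
uniform = old_df.groupby("community").size().nunique() == 1

row_delta = old_df.shape[0] - new_df.shape[0]
accounted = row_delta == rows_for_place(old_df, "A")
print(uniform, row_delta, accounted)
```

If `uniform` were false, a matching row difference for one community would not by itself show that exactly one location was dropped.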


I am satisfied that the data is essentially intact between the two extractions. The same time ranges, models, and scenarios are all accounted for. Now I'll do a brief qualitative examination to look at a few changes in the actual data itself - and I will certainly look at Inuvik because it was known to be incorrect.

In [None]:
plt.figure(figsize=(8, 5))
old_inuvik = di_csv['master']['rcp85']['csv'].query("model == '5ModelAvg' and community == 'Inuvik'")
new_inuvik = di_csv['inuvik-rextraction']['rcp85']['csv'].query("model == '5ModelAvg' and community == 'Inuvik'")

plt.plot(new_inuvik[['year', 'tas']].groupby('year').mean(), label='NEW')
plt.plot(old_inuvik[['year', 'tas']].groupby('year').mean(), label='OLD')    
plt.legend()
plt.title('Inuvik')
plt.xlabel('Year')
plt.ylabel('temp (RCP 8.5)')

In [None]:
plt.figure(figsize=(8, 5))
old_inuvik = di_csv['master']['rcp60']['csv'].query("model == '5ModelAvg' and community == 'Inuvik'")
new_inuvik = di_csv['inuvik-rextraction']['rcp60']['csv'].query("model == '5ModelAvg' and community == 'Inuvik'")

plt.plot(new_inuvik[['year', 'pr']].groupby('year').mean(), label='NEW')
plt.plot(old_inuvik[['year', 'pr']].groupby('year').mean(), label='OLD')    
plt.legend()
plt.title('Inuvik')
plt.xlabel('Year')
plt.ylabel('Precip')

In [None]:
plt.figure(figsize=(8, 5))
old_behchoko = di_csv['master']['rcp60']['csv'].query("model == '5ModelAvg' and community == 'Behchoko (Edzo)'")
new_behchoko = di_csv['inuvik-rextraction']['rcp60']['csv'].query("model == '5ModelAvg' and community == 'Behchokǫ̀'")

plt.plot(new_behchoko[['year', 'tas']].groupby('year').mean(), label='NEW')
plt.plot(old_behchoko[['year', 'tas']].groupby('year').mean(), label='OLD')
plt.legend()
plt.title('Behchokǫ̀')
plt.xlabel('Year')
plt.ylabel('temp')

In [None]:
plt.figure(figsize=(8, 5))
old_behchoko = di_csv['master']['rcp60']['csv'].query("model == '5ModelAvg' and community == 'Behchoko (Edzo)'")
new_behchoko = di_csv['inuvik-rextraction']['rcp60']['csv'].query("model == '5ModelAvg' and community == 'Behchokǫ̀'")

plt.plot(new_behchoko[['year', 'pr']].groupby('year').mean(), label='NEW')
plt.plot(old_behchoko[['year', 'pr']].groupby('year').mean(), label='OLD')
plt.legend()
plt.title('Behchokǫ̀')
plt.xlabel('Year')
plt.ylabel('Precip')
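The plots give a visual impression of how much each location changed. A numeric summary of the same annual means could complement them. This is only a sketch on toy data: the helpers `annual_mean` and `annual_difference` are hypothetical, and the frames are assumed to have a `year` column plus the variable of interest (`tas` here), already filtered to one model and one community.

```python
import pandas as pd

def annual_mean(df, variable):
    """Average a variable over all rows that share a year."""
    return df.groupby("year")[variable].mean()

def annual_difference(old_df, new_df, variable):
    """Return new minus old annual means for years present in both frames."""
    old_mean = annual_mean(old_df, variable)
    new_mean = annual_mean(new_df, variable)
    # Subtraction aligns on the 'year' index; unmatched years become NaN
    return (new_mean - old_mean).dropna()

# Toy values only, not real extraction output
old = pd.DataFrame({"year": [2000, 2000, 2001, 2001],
                    "tas": [1.0, 3.0, 2.0, 4.0]})
new = pd.DataFrame({"year": [2000, 2000, 2001, 2001],
                    "tas": [2.0, 4.0, 2.0, 6.0]})

diff = annual_difference(old, new, "tas")
print(diff)
```

A difference that stays near zero across years would suggest the location barely moved on the grid, while a consistent offset would point to a different grid cell being sampled.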