In this notebook:
    - correlate gauges without predictions (y) with gauges that do have predictions (features)
    - use KNN based on lat, long, alt, basin to determine the 5-10 closest gages to each target gage
    - result should be a dataframe with the y USGS gage in the first column and, in each additional column, a feature USGS gage that has a NOAA prediction, OR a dictionary with each key pointing to the ten closest gages
    - set a threshold for how close a gage needs to be to be considered
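
As a rough sketch of the lookup described in the list above, the cell below builds the dictionary form of the result (each target gage pointing to its closest feature gages). It makes several assumptions that are not in the original plan: it is a brute-force search on great-circle distance using lat/long only (alt and basin are left out), not a fitted KNN model, and the 150 km cutoff is a made-up placeholder for the threshold. The names `haversine_km`, `closest_gages`, `k`, and `max_km` are invented here for illustration. The coordinates are a few rows copied from the tables shown later in this notebook.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_gages(targets, features, k=10, max_km=150.0):
    """Map each target gage to at most k feature gages, nearest first, within max_km."""
    result = {}
    for t_id, (t_lat, t_lng) in targets.items():
        dists = sorted(
            (haversine_km(t_lat, t_lng, f_lat, f_lng), f_id)
            for f_id, (f_lat, f_lng) in features.items()
        )
        # Keep the k nearest, then drop any that are beyond the (hypothetical) cutoff.
        result[t_id] = [f_id for d, f_id in dists[:k] if d <= max_km]
    return result


# Rows copied from the notebook's tables, with West longitudes negated.
features = {
    "09292500": (40.511893, -110.341549),  # Yellowstone, UT
    "09237450": (40.264261, -106.891767),  # Yampa, CO
    "09472050": (32.446183, -110.488418),  # San Pedro, AZ
}
targets = {
    "10149000": (40.118012, -111.314622),  # Spanish Fork
    "10155200": (40.554398, -111.433243),  # Provo
}

neighbors = closest_gages(targets, features, k=2, max_km=150.0)
print(neighbors)
```

With the full dataframes, the two dictionaries could be built from the `usgs`/`USGS`, `lat`, and `lng` columns. If alt and basin are added to the distance, the columns would need scaling first so that elevation in feet does not dominate degrees of latitude.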

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import os
import requests
import time
import re

In [2]:
import pickle
# raw string so the backslashes in the Windows path are not treated as escape sequences
path = r"C:\Springboard\Github\gauge_info"
os.chdir(path)

In [3]:
# load the dataframe of potential features that have NOAA predictions
with open("USGS_features.pkl", "rb") as f:
    df = pickle.load(f)
df

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs,lat,long,alt,basin
0,SPRA3,San Pedro,AZ,2820.0,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,09472050,32.446183,110.488418,2820.00,Lower San Pedro
5,AFHA3,Agua Fria,AZ,4400.0,17,http://waterdata.usgs.gov/az/nwis/uv?09512450,09512450,34.485278,112.237500,4400.00,Agua Fria
6,AFMA3,Agua Fria,AZ,3434.0,18,http://waterdata.usgs.gov/az/nwis/uv?09512500,09512500,34.315307,112.064046,3434.00,Agua Fria
7,AFRA3,Agua Fria,AZ,1800.0,19,http://waterdata.usgs.gov/az/nwis/uv?09512800,09512800,34.015589,112.167938,1800.00,Agua Fria
10,ATPA3,Altar Wash,AZ,2975.0,24,http://waterdata.usgs.gov/az/nwis/uv?09486800,09486800,31.838972,111.404269,2975.15,Brawley Wash
...,...,...,...,...,...,...,...,...,...,...,...
450,YASC2,Yampa,CO,7240.0,2,http://waterdata.usgs.gov/co/nwis/uv?09237450,09237450,40.264261,106.891767,7240.00,Upper Yampa
451,YDLC2,Yampa,CO,5600.0,17,http://waterdata.usgs.gov/co/nwis/uv?09260050,09260050,40.451634,108.525101,5600.00,Lower Yampa
452,YMSC2,Yampa,CO,7050.0,4,http://waterdata.usgs.gov/co/nwis/uv?09237500,09237500,40.286544,106.829056,7050.00,Upper Yampa
453,YLLU1,Yellowstone,UT,7430.0,20,http://waterdata.usgs.gov/ut/nwis/uv?09292500,09292500,40.511893,110.341549,7430.00,Duchesne


In [4]:
# load the dataframe of target gages that we want to model
with open("USGS_targets.pkl", "rb") as f:
    dt = pickle.load(f)
dt

Unnamed: 0,USGS,lat,long,alt,basin
0,10140700,41.231819,111.984497,4285.00,Lower Weber
1,10149000,40.118012,111.314622,6320.00,Spanish Fork
2,10155200,40.554398,111.433243,5691.59,Provo
3,10224000,39.481898,112.393834,4660.00,Lower Sevier
4,10308200,38.714627,119.764899,5400.00,Upper Carson
...,...,...,...,...,...
147,06892350,38.983337,94.964689,753.87,"Lower Kansas, Kansas"
148,07169800,37.375597,96.185548,897.30,Elk
149,06775900,41.778611,100.525278,2798.18,Dismal
150,06461500,42.902222,100.362222,2287.57,Middle Niobrara


One thing we didn't look at in the previous notebook is the number of distinct basins. If we are going to create dummy variables for the basin, we need to know how many dummies that means. Let's find this value for each dataframe.

In [5]:
dt['basin'].value_counts()

Upper Rio Grande               7
Upper North Platte             4
Animas                         4
Roaring Fork                   4
Bruneau                        3
                              ..
Blackfoot                      1
Yaak                           1
Middle Niobrara                1
Upper North Fork Clearwater    1
Eagle                          1
Name: basin, Length: 84, dtype: int64

In [6]:
df['basin'].value_counts()

Colorado Headwaters                  13
Upper Virgin                         11
Duchesne                             10
Rillito                               9
Upper Yampa                           9
                                     ..
Dirty Devil                           1
09519501                              1
Lower Gila-Painted Rock Reservoir     1
Black                                 1
Grand Canyon                          1
Name: basin, Length: 121, dtype: int64

So we have 84 basins in the targets list and 121 basins in the features dataframe. We'll probably have to make dummy variables for these. Note that one entry in the features list, 09519501, appears to be a gage number rather than a basin name, so that row likely needs cleaning first. We may want to consider throwing out target gages that have no NOAA gage within the same drainage basin. Let's plot those locations out.
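
Before plotting, the basin check itself can be sketched in pandas. This is only an illustration under stated assumptions: `features` and `targets` are small stand-ins for `df` and `dt`, the third target row is hypothetical (added so that one basin matches), and the column name `has_feature_in_basin` is invented here.

```python
import pandas as pd

# Stand-in for df: three rows copied from the features table above.
features = pd.DataFrame({
    "usgs": ["09237450", "09237500", "09292500"],
    "basin": ["Upper Yampa", "Upper Yampa", "Duchesne"],
})

# Stand-in for dt: two real rows plus one HYPOTHETICAL row for illustration.
targets = pd.DataFrame({
    "USGS": ["10149000", "10155200", "HYPOTHETICAL_ID"],
    "basin": ["Spanish Fork", "Provo", "Duchesne"],
})

# Flag target gages whose basin also contains at least one feature gage.
targets["has_feature_in_basin"] = targets["basin"].isin(set(features["basin"]))

# Targets we might throw out: no NOAA-predicted gage in the same basin.
uncovered = targets.loc[~targets["has_feature_in_basin"], "USGS"].tolist()

# Number of dummy columns a one-hot encoding over both frames would need.
all_basins = pd.concat([features["basin"], targets["basin"]]).nunique()

print(uncovered)
print(all_basins)
```

On the real data the same `isin` test would run directly on `dt['basin']` against `df['basin']`.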

In [7]:
# to make life easier, let's create a column that is consistent with Google's format (West is negative)
df['lng']= -1*df['long']
dt['lng']= -1*dt['long']
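
With signed longitudes available, a plain scatter plot of both sets of gages (no map underneath) might look roughly like the sketch below. It assumes matplotlib is installed. `df_demo` and `dt_demo` are stand-ins holding a few rows copied from the tables above, and the marker and label choices are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Stand-ins for df (features) and dt (targets), using rows from the tables above.
df_demo = pd.DataFrame({"lat": [40.511893, 40.264261], "long": [110.341549, 106.891767]})
dt_demo = pd.DataFrame({"lat": [40.118012, 40.554398], "long": [111.314622, 111.433243]})

# Same sign convention as the notebook: West longitudes are negative.
for frame in (df_demo, dt_demo):
    frame["lng"] = -1 * frame["long"]

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df_demo["lng"], df_demo["lat"], marker="o", label="feature gages (NOAA prediction)")
ax.scatter(dt_demo["lng"], dt_demo["lat"], marker="x", label="target gages")
ax.set_xlabel("Longitude (degrees, West negative)")
ax.set_ylabel("Latitude (degrees)")
ax.legend()
```

Raw degrees are not equal distances on the ground, so this view is only good for eyeballing which targets sit far from any feature gage.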

In [None]:
# let's just plot them on a normal graph without the map


Help on method scatter in module gmplot.gmplot:

scatter(lats, lngs, color=None, size=None, marker=True, c=None, s=None, symbol='o', **kwargs) method of gmplot.gmplot.GoogleMapPlotter instance
    Plot a collection of points.
    
    Args:
        lats ([float]): Latitudes.
        lngs ([float]): Longitudes.
    
    Optional:
    
    Args:
        color/c/edge_color/ec (str or [str]):
            Color of each point. Can be hex ('#00FFFF'), named ('cyan'), or matplotlib-like ('c'). Defaults to black.
        size/s (int or [int]): Size of each point, in meters (symbols only). Defaults to 40.
        marker (bool or [bool]): True to plot points as markers, False to plot them as symbols. Defaults to True.
        symbol (str or [str]): Shape of each point, as 'o', 'x', or '+' (symbols only). Defaults to 'o'.
        title (str or [str]): Hover-over title of each point (markers only).
        label (str or [str]): Label displayed on each point (markers only).
        precision (int or

In the next notebook:
    - pull USGS data for the length of time that each gage has been in operation
    - potentially cut time scale of data based on when the flow has never been above a certain threshold
    - each of the 125 gages should have a dataframe (list of dataframes)

In the notebook after that:
    - build 125 models (using Lasso Regression, we believe)
    - set threshold for accuracy in predicting flow
    - throw out models that do not meet that accuracy
    - throw out features in models that are no longer needed
    - do other prep for putting models into production (test in real time and have accuracy feedback)