In this notebook:
- Pull the CSV that correlates the NOAA forecast points to USGS gauges
- Pull the CSV of all the USGS gauges that we currently use in rivermaps.co (not the future version)
- Find any gauges on that list that have a NOAA prediction
- Find the USGS gauges that are in the current version but are not yet correlated with a NOAA prediction
- Add those to the NOAA-to-USGS CSV so we can expand our prediction reach

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import os
import requests
import time

In [2]:
import pickle
path = r"C:\Springboard\Github\gauge_info"  # raw string so backslashes are not treated as escapes
os.chdir(path)

## DataFrame that contains all of the NOAA abbreviations and their corresponding USGS gage

In [3]:
# load DF with NOAA and USGS for all gauges in Colorado River Basin that have predictions in NOAA
with open("NOAA_USGS.pkl", "rb") as f:  # close the file handle once loaded
    df = pickle.load(f)
df.head()

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs
0,SPRA3,San Pedro,AZ,2820,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,9472050
1,MAOA3,Acdc,AZ,1230,6,0,0
2,MHFA3,Acdc,AZ,1225,7,0,0
3,MSXA3,Acdc,AZ,1220,8,0,0
4,ACHA3,Agua Caliente Wash,AZ,2588,2,0,0


## CSV that correlates NOAA to USGS - currently used in production at /future

In [4]:
# load DF with NOAA and USGS that are CURRENTLY used in future forecast - these were put together manually
df2 = pd.read_csv("USGS_NOAA_new.csv", names=['USGS', 'NOAA'])
df2

Unnamed: 0,USGS,NOAA
0,09067020,EALC2
1,09057500,BGMC2
2,09066325,GRVC2
3,09070000,GPSC2
4,09070500,EGLC2
...,...,...
109,09504000,VDCA3
110,09508500,VDTA3
111,10130500,CLLU1
112,10128500,OAWU1
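
USGS site numbers carry leading zeros, so it can help to read the ID column as text explicitly. A minimal sketch, assuming a two-column headerless CSV like the one above; the inline sample is a stand-in for USGS_NOAA_new.csv and reuses two rows from the output above:

```python
import io
import pandas as pd

# Stand-in for USGS_NOAA_new.csv (two rows copied from the output above)
sample_csv = io.StringIO("09067020,EALC2\n10130500,CLLU1\n")

# dtype=str keeps '09067020' as text instead of parsing it as the integer 9067020
pairs = pd.read_csv(sample_csv, names=["USGS", "NOAA"], dtype=str)
print(pairs["USGS"].tolist())
```

Keeping the IDs as strings also matters later, because the membership checks compare them against strings read with the csv module.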


## List of all USGS (and CO Water) gages - currently used in production REAL-TIME

In [5]:
# load list from CSV of all USGS (and CO Water) gauges that are currently used
import csv
USGS_current = []
with open('USGS_list.csv', 'r') as f:
    readCSV = csv.reader(f, delimiter=',')
    for row in readCSV:
        for i in row:
            USGS_current.append(i)

In [6]:
len(USGS_current)

270
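
The nested loop above flattens every cell of the CSV into one list. An equivalent sketch that also drops empty cells and stray whitespace, assuming the file holds nothing but gauge IDs; the inline sample is a stand-in for USGS_list.csv, and `gauge_ids` is a name used only for this illustration:

```python
import csv
import io
from itertools import chain

# Stand-in for USGS_list.csv; the IDs are copied from outputs elsewhere in this notebook
sample = io.StringIO("09067020, 09057500\n09066325,\n")

# Flatten all rows into one list, skipping empty cells and trimming whitespace
gauge_ids = [
    cell.strip()
    for cell in chain.from_iterable(csv.reader(sample))
    if cell.strip()
]
print(len(gauge_ids))
```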

Before we proceed, let's review the data that we have:
1. 459 NOAA sites throughout the Colorado River Forecast Basin; we have the corresponding USGS gauge for just about all of them. These are stored in df.
2. 270 USGS (and CO Water) gauges that are currently used in the real-time display of water. These gauges are NOT just from the Colorado River Forecast Basin. These are stored in the list USGS_current.
3. 113 manually assembled USGS/NOAA pairs that are currently used for forecasts (rows 0 through 112 above). These are stored in df2.

Next, let's find all of the USGS sites (from the 270 currently used on the real-time page) that have a corresponding NOAA forecast.

In [7]:
USGS_in_NOAA = []
for g in df['usgs']:
    if g in USGS_current:
        USGS_in_NOAA.append(g)
len(USGS_in_NOAA)

111

That means there are 111 gauges that are in both my current list of gauges and the NOAA predictions. Since this is more than the 86 that I am currently using, I expect to gain 25 gauges (111 - 86) that could have predictions. Let's see if that checks out.
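
One caveat on the count: the loop counts rows of the NOAA table, so a USGS ID that appears on two NOAA rows is counted twice. A small sketch of the same overlap check with `isin`, using toy stand-ins for `df['usgs']` and `USGS_current` (the ID `09999999` is made up for illustration):

```python
import pandas as pd

# Toy stand-ins (hypothetical subsets) for df['usgs'] and USGS_current
noaa_df = pd.DataFrame({"usgs": ["09067020", "0", "09070000", "09067020"]})
current = ["09067020", "09070000", "09999999"]  # last ID is made up

# Rows of the NOAA table whose USGS ID is on the real-time list
overlap = noaa_df[noaa_df["usgs"].isin(set(current))]

row_count = len(overlap)                    # what the loop above counts
unique_count = overlap["usgs"].nunique()    # distinct gauges in the overlap
print(row_count, unique_count)
```

If the two numbers differ on the real data, the unique count is probably the one to compare against the number of gauges already in use.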

## Find the NOAA predictions that now need to be pulled for the new models.

In [8]:
# load the list of USGS gages used by the models
with open("model_gages.pkl", "rb") as f:
    USGS_models = pickle.load(f)

# build a list of (USGS, NOAA) tuples
model_USGS_NOAA = []

# look up each model gage in df to pull its NOAA name
# (.tolist()[0] raises IndexError if a gage has no row in df)
for g in USGS_models:
    model_USGS_NOAA.append((g, df[df['usgs']==g]['NOAA_gauge'].tolist()[0]))

# show the resulting pairs
print(model_USGS_NOAA)

[('09112500', 'ALEC2'), ('09124500', 'LFGC2'), ('09115500', 'TMCC2'), ('09067020', 'EALC2'), ('09024000', 'FRWC2'), ('09085000', 'GWSC2'), ('09073400', 'APNC2'), ('09070000', 'GPSC2'), ('09110000', 'ALTC2'), ('09342500', 'PSPC2'), ('09034250', 'CAWC2'), ('09085100', 'GCOC2'), ('09166500', 'DOLC2'), ('10105900', 'PRZU1'), ('09237500', 'YMSC2'), ('09107000', 'TRAC2'), ('09065100', 'CSSC2'), ('09081600', 'RCYC2'), ('09415000', 'VLTA3'), ('10140100', 'OPDU1'), ('09036000', 'WFLC2'), ('10141000', 'WWPU1')]
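
The lookup above filters the whole DataFrame once per gauge and fails if a gauge has no match. A hedged alternative is to build a dictionary once and separate matched from unmatched IDs. The frame below is a toy stand-in for `df` with two pairs from the output above, and `00000000` is a made-up ID with no match:

```python
import pandas as pd

# Toy stand-in for df, using two pairs from the output above
noaa_df = pd.DataFrame({
    "usgs": ["09112500", "09124500"],
    "NOAA_gauge": ["ALEC2", "LFGC2"],
})
model_gauges = ["09112500", "09124500", "00000000"]  # last ID is made up

# Build the lookup once; keep the first NOAA code if a USGS ID repeats
lookup = noaa_df.drop_duplicates("usgs").set_index("usgs")["NOAA_gauge"].to_dict()

matched = [(g, lookup[g]) for g in model_gauges if g in lookup]
unmatched = [g for g in model_gauges if g not in lookup]
print(matched, unmatched)
```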


In [10]:
# create dataframe from that list of tuples
df_new = pd.DataFrame(model_USGS_NOAA, columns =['USGS', 'NOAA'])
df_new

Unnamed: 0,USGS,NOAA
0,9112500,ALEC2
1,9124500,LFGC2
2,9115500,TMCC2
3,9067020,EALC2
4,9024000,FRWC2
5,9085000,GWSC2
6,9073400,APNC2
7,9070000,GPSC2
8,9110000,ALTC2
9,9342500,PSPC2


In [11]:
df2 = pd.concat([df2, df_new], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df2

Unnamed: 0,USGS,NOAA
0,09067020,EALC2
1,09057500,BGMC2
2,09066325,GRVC2
3,09070000,GPSC2
4,09070500,EGLC2
...,...,...
131,09081600,RCYC2
132,09415000,VLTA3
133,10140100,OPDU1
134,09036000,WFLC2


That is what we wanted. Let's make sure there were no duplicates.

In [15]:
df2['USGS'].value_counts()

09085000    2
09065100    2
09070000    2
09124500    2
09034250    2
           ..
10011500    1
09330500    1
10130500    1
09279000    1
09171100    1
Name: USGS, Length: 125, dtype: int64

This makes sense, as a few of the gages were already being used and pulled for predictions. Let's eliminate the duplicates.

In [16]:
df2.drop_duplicates(inplace=True)
df2

Unnamed: 0,USGS,NOAA
0,09067020,EALC2
1,09057500,BGMC2
2,09066325,GRVC2
3,09070000,GPSC2
4,09070500,EGLC2
...,...,...
128,09237500,YMSC2
129,09107000,TRAC2
132,09415000,VLTA3
134,09036000,WFLC2
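
Note that `drop_duplicates()` with no arguments compares whole rows, so a USGS ID paired with two different NOAA codes would survive it. A small sketch of how one might check for that case, using a toy stand-in for `df2`; the conflicting code `XXXX0` is made up for illustration:

```python
import pandas as pd

# Toy stand-in for df2 after the append; 'XXXX0' is a made-up conflicting code
pairs = pd.DataFrame({
    "USGS": ["09070000", "09070000", "09085000", "09085000"],
    "NOAA": ["GPSC2", "GPSC2", "GWSC2", "XXXX0"],
})

deduped = pairs.drop_duplicates()  # removes only fully identical rows

# USGS IDs that still map to more than one NOAA code after deduplication
conflicts = deduped[deduped["USGS"].duplicated(keep=False)]
print(conflicts)
```

An empty `conflicts` frame on the real data would confirm that every remaining USGS ID maps to exactly one NOAA code.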


This is excellent! Exactly what we wanted from this addition. These are the gages that we need for predicting with the new models and pulling directly for predictions.

## Export those USGS and NOAA to CSV that can be used on the website

In [17]:
df2.to_csv("USGS_NOAA_newer.csv", index=False, header=False)
# already exported this, so no need to do that again

## Find the list of USGS gauges that are NOT covered by the NOAA predictions - THIS IS ONLY MEANINGFUL IF WE LOADED THE NEW TARGET GAGES FOR THE MODEL. INSTEAD WE LOADED MODEL FEATURES.

In [18]:
USGS_missing = []
old_USGS = df2['USGS'].tolist()
for g in USGS_current:
    if g not in old_USGS:
        USGS_missing.append(g)
len(USGS_missing)

156

These 156 gauges are the ones that we don't have predictions for. We will hopefully be able to build models for some of them so that we can use the existing predictions.
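
The same "in the current list but not in the pairing table" check can be written as a left anti-join, which keeps everything in pandas. This is only a sketch with toy stand-ins for `USGS_current` and `df2`; `09999999` is a made-up ID with no NOAA pairing:

```python
import pandas as pd

# Toy stand-ins; '09999999' is a made-up ID with no NOAA pairing
current = pd.DataFrame({"USGS": ["09067020", "09070000", "09999999"]})
pairs = pd.DataFrame({"USGS": ["09067020", "09070000"], "NOAA": ["EALC2", "GPSC2"]})

# Left anti-join: keep real-time gauges that have no row in the pairing table
merged = current.merge(pairs, on="USGS", how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", "USGS"].tolist()
print(missing)
```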

In [19]:
df_missing = pd.DataFrame(USGS_missing, columns =['USGS'])

In [20]:
## Export DF so that we can use them in the future notebooks
df_missing.to_pickle("USGS_missing.pkl")
df_missing.to_csv("USGS_missing.csv")

In the future notebook:
    - go to the USGS page for each gauge
    - pull the longitude and latitude for that gauge; also pull the river
    - correlate it to nearby gauges
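
As a rough sketch of the last step, once latitude and longitude are available, nearby gauges could be ranked by great-circle distance. Everything here is an assumption for illustration: the gauge names and coordinates are placeholders rather than real gauge locations, and `haversine_km` and `nearest` are hypothetical helpers, not part of any existing notebook.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km is the mean Earth radius

# Placeholder coordinates, not real gauge locations
gauges = {"A": (39.0, -107.0), "B": (39.1, -107.1), "C": (36.0, -112.0)}

def nearest(name, coords):
    """Return the name of the closest other gauge to `name`."""
    here = coords[name]
    others = (k for k in coords if k != name)
    return min(others, key=lambda k: haversine_km(*here, *coords[k]))

print(nearest("A", gauges))
```

Straight-line distance ignores which river a gauge sits on, so the river name pulled in the previous step would likely still be needed to filter candidates.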