In this notebook:
    - go to the USGS page for each gauge
    - pull the longitude and latitude for that gauge; also pull the river and any other information that could be used to correlate gauges with each other


Looking at the gauge metadata for just a couple of gauges, we find a few really useful pieces of information:
    - latitude
    - longitude
    - altitude
    - Subbasin hydrologic unit (River System)

In [30]:
import pandas as pd
from bs4 import BeautifulSoup
import os
import requests
import time
import re

In [31]:
import pickle
path = r"C:\Springboard\Github\gauge_info"  # raw string so the backslashes are not read as escape sequences
os.chdir(path)

In [32]:
# load the DataFrame of NOAA and USGS IDs for all gauges in the Colorado River Basin that have NOAA predictions
with open("NOAA_USGS.pkl", "rb") as f:
    df = pickle.load(f)
df.head()

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs
0,SPRA3,San Pedro,AZ,2820,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,9472050
1,MAOA3,Acdc,AZ,1230,6,0,0
2,MHFA3,Acdc,AZ,1225,7,0,0
3,MSXA3,Acdc,AZ,1220,8,0,0
4,ACHA3,Agua Caliente Wash,AZ,2588,2,0,0


These gauges, which have NOAA predictions, are the potential features for predicting the water flow at the other gauges.

In [33]:
# load the DataFrame with the list of gauges that we would like to get predictions for
with open("USGS_missing.pkl", "rb") as f:
    dfm = pickle.load(f)
dfm

Unnamed: 0,USGS
0,10140700
1,10149000
2,10155200
3,10224000
4,10308200
...,...
151,09076300
152,RFBASALT
153,ESCCREEKCO
154,SFPAYETTEID


### Data loaded from previous notebook. Let's see if we can scrape websites for the appropriate data about each gauge:
    - latitude
    - longitude
    - altitude
    - Subbasin hydrologic unit (River System)

Looks like this is the general link that we need to scrape: <br>
https://waterdata.usgs.gov/monitoring-location/GAUGE NUMBER/#parameterCode=00060&period=P7D <br>
Let's try to pull the data from one to take a look

In [34]:
# try a single gauge first to get the parsing right
link = 'https://waterdata.usgs.gov/monitoring-location/09073400/#parameterCode=00060&period=P7D'
response = requests.get(link)
soup = BeautifulSoup(response.text, "html.parser")
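
Since the link differs only in the gauge number, it can be built by a small helper. The sketch below is a minimal example and not part of the original notebook: the name `gauge_url` is hypothetical, and padding numeric IDs to eight digits is an assumption, made because the `usgs` column above shows `9472050` where the matching USGS link shows `09472050`.

```python
BASE = "https://waterdata.usgs.gov/monitoring-location/"
SUFFIX = "/#parameterCode=00060&period=P7D"

def gauge_url(gauge_id):
    """Build the monitoring-location link for one gauge (hypothetical helper)."""
    text = str(gauge_id).strip()
    if text.isdigit():
        # assumption: numeric USGS numbers have at least 8 digits, so restore
        # a leading zero that was lost when the ID was stored as an integer
        text = text.zfill(8)
    return BASE + text + SUFFIX

print(gauge_url(9472050))
print(gauge_url("09073400"))
```

Non-numeric IDs such as `RFBASALT` are passed through unchanged, so the later check for a missing `site-summary` still decides whether the page is usable.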

In [35]:
site_sum = soup.find_all(id='site-summary')
print(len(site_sum))

1


That's the data that we want, but we need to break it down more; let's sort it by the table rows

In [36]:
rows_sum = site_sum[0].find_all('tr')
print(len(rows_sum))

43


That's a lot of rows. We could rely on the data being in the same row position on every page, but a more robust approach is to match each row's header cell (the first entry in the row) against the name of the field we are looking for.

In [37]:
print(rows_sum[7])

<tr>
<th scope="row">Decimal latitude
                                                    
                                                </th>
<td class="loc-metadata">
                                                    
                                                        39.17998755
                                                    
                                                    
                                                </td>
<td><sub>n/a</sub></td>
</tr>


In [38]:
# do an example of this first before writing the loop
rows_sum[7].th.get_text()


'Decimal latitude\n                                                    \n                                                '

In [39]:
# this is the text that we want to match (we change the case just in case somebody messed with that)
'Decimal latitude'.lower() in rows_sum[7].th.get_text().lower()

True

In [40]:
# this is the data that we want to pull from that row
lat = rows_sum[7].td.get_text()
lat

'\n                                                    \n                                                        39.17998755\n                                                    \n                                                    \n                                                '

That is the number that we want, but we need to separate it from the surrounding line breaks and whitespace; we only want the numeric value.

In [41]:
# let's try to extract just the digits
# note: the unescaped '.' matches any character, and a leading minus sign is not captured
float(re.findall(r'\d+.\d+', lat)[0])

39.17998755
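
The pattern `\d+.\d+` works for this value, but its `.` is unescaped (so it matches any character) and it has no place for a sign. A slightly stricter variant is sketched below; the helper name `extract_number` is hypothetical, and whether the USGS page prints a minus sign for western longitudes is not verified here. If it does, the original pattern would silently drop the sign.

```python
import re

# optional sign, digits, optional decimal part
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def extract_number(text):
    """Return the first number found in text as a float, or None if there is none."""
    match = NUMBER.search(text)
    return float(match.group()) if match else None

lat_text = "\n        39.17998755\n    \n"
print(extract_number(lat_text))
print(extract_number("-106.8019826"))
print(extract_number("n/a"))
```

Returning `None` instead of raising an `IndexError` leaves the decision about missing values to the caller.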

This is exactly the data we want!
Let's see if we can make this work for the other fields we are looking for.

In [42]:
for row in rows_sum:
    label = row.th.get_text().lower()  # compare header text case-insensitively
    if 'decimal latitude' in label:
        print("Latitude: ", float(re.findall(r'\d+.\d+', row.td.get_text())[0]))
    elif 'decimal longitude' in label:
        print("Longitude: ", float(re.findall(r'\d+.\d+', row.td.get_text())[0]))
    elif 'altitude of gage' in label:
        print("Altitude: ", float(re.findall(r'\d+.\d+', row.td.get_text())[0]))
    elif 'subbasin' in label:
        print('Subbasin: ', row.td.get_text())
        

Latitude:  39.17998755
Longitude:  106.8019826
Altitude:  8014.01
Subbasin:  
Roaring Fork



This looks good, as long as every gage page uses this text to label these values. I also need to clean up the text I get for the Subbasin, perhaps with a regex.

In [43]:
rows_sum[24].td.get_text()

'\nRoaring Fork\n'

That's actually not that hard. Dropping the '\n' characters should be enough.

In [44]:
rows_sum[24].td.get_text().replace('\n', '')

'Roaring Fork'

Excellent! Let's try that all at once!

In [45]:
for row in rows_sum:
    label = row.th.get_text().lower()  # compare header text case-insensitively
    if 'decimal latitude' in label:
        print("Latitude:", float(re.findall(r'\d+.\d+', row.td.get_text())[0]))
    elif 'decimal longitude' in label:
        print("Longitude:", float(re.findall(r'\d+.\d+', row.td.get_text())[0]))
    elif 'altitude of gage' in label:
        print("Altitude:", float(re.findall(r'\d+.\d+', row.td.get_text())[0]))
    elif 'subbasin' in label:
        print('Subbasin:', row.td.get_text().replace('\n', ''))

Latitude: 39.17998755
Longitude: 106.8019826
Altitude: 8014.01
Subbasin: Roaring Fork
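
The same matching logic can return a dictionary instead of printing, which makes it easier to reuse for every gage. This is only a sketch under assumptions: `FIELDS`, `NUMERIC`, and `parse_site_rows` are hypothetical names, and the function takes plain (header text, cell text) pairs so that it can be tried without a live page.

```python
import re

# hypothetical mapping: lowercase fragment of the row header -> output key
FIELDS = {
    "decimal latitude": "lat",
    "decimal longitude": "long",
    "altitude of gage": "alt",
    "subbasin": "basin",
}
NUMERIC = {"lat", "long", "alt"}

def parse_site_rows(pairs):
    """Turn (header_text, cell_text) pairs into a dict of the wanted fields."""
    result = {}
    for header, cell in pairs:
        label = header.strip().lower()
        for fragment, key in FIELDS.items():
            if fragment in label:
                text = " ".join(cell.split())  # collapse newlines and padding
                if key in NUMERIC:
                    match = re.search(r"[-+]?\d+(?:\.\d+)?", text)
                    result[key] = float(match.group()) if match else None
                else:
                    result[key] = text
                break
    return result

# text copied from the cells above, plus one row that should be ignored
pairs = [
    ("Decimal latitude\n      ", "\n      39.17998755\n      "),
    ("Decimal longitude", "\n 106.8019826 \n"),
    ("Altitude of gage", " 8014.01 "),
    ("Subbasin hydrologic unit", "\nRoaring Fork\n"),
    ("Some other field", "ignored"),
]
print(parse_site_rows(pairs))
```

With a parsed page, the pairs could be built as `[(row.th.get_text(), row.td.get_text()) for row in rows_sum if row.th and row.td]`.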


This is exactly the data that I was looking for. Let's iterate through the gages in each dataframe to make that work

In [52]:
df.head(12)

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs
0,SPRA3,San Pedro,AZ,2820,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,9472050
1,MAOA3,Acdc,AZ,1230,6,0,0
2,MHFA3,Acdc,AZ,1225,7,0,0
3,MSXA3,Acdc,AZ,1220,8,0,0
4,ACHA3,Agua Caliente Wash,AZ,2588,2,0,0
5,AFHA3,Agua Fria,AZ,4400,17,http://waterdata.usgs.gov/az/nwis/uv?09512450,9512450
6,AFMA3,Agua Fria,AZ,3434,18,http://waterdata.usgs.gov/az/nwis/uv?09512500,9512500
7,AFRA3,Agua Fria,AZ,1800,19,http://waterdata.usgs.gov/az/nwis/uv?09512800,9512800
8,AVOA3,Agua Fria,AZ,970,26,0,0
9,MAFA3,Agua Fria,AZ,1115,24,0,0


Before we iterate, let's drop the rows that have no corresponding USGS number

In [59]:
# replace 0 with NaN in the usgs column so that those rows can be dropped
import numpy as np
df['usgs'] = df['usgs'].replace(to_replace=0, value=np.nan)

In [61]:
df.dropna(subset=['usgs'], inplace=True)
df

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs
0,SPRA3,San Pedro,AZ,2820,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,09472050
5,AFHA3,Agua Fria,AZ,4400,17,http://waterdata.usgs.gov/az/nwis/uv?09512450,09512450
6,AFMA3,Agua Fria,AZ,3434,18,http://waterdata.usgs.gov/az/nwis/uv?09512500,09512500
7,AFRA3,Agua Fria,AZ,1800,19,http://waterdata.usgs.gov/az/nwis/uv?09512800,09512800
10,ATPA3,Altar Wash,AZ,2975,24,http://waterdata.usgs.gov/az/nwis/uv?09486800,09486800
...,...,...,...,...,...,...,...
450,YASC2,Yampa,CO,7240,2,http://waterdata.usgs.gov/co/nwis/uv?09237450,09237450
451,YDLC2,Yampa,CO,5600,17,http://waterdata.usgs.gov/co/nwis/uv?09260050,09260050
452,YMSC2,Yampa,CO,7050,4,http://waterdata.usgs.gov/co/nwis/uv?09237500,09237500
453,YLLU1,Yellowstone,UT,7430,20,http://waterdata.usgs.gov/ut/nwis/uv?09292500,09292500


In [66]:
# create placeholder columns in the DataFrame; the copied values are meant to be overwritten by scraped data
df['lat'] = df['Elevation']
df['long'] = df['Elevation']
df['alt'] = df['Elevation']
df['basin'] = df['usgs']
df

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs,lat,long,alt,basin
0,SPRA3,San Pedro,AZ,2820,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,09472050,2820,2820,2820,09472050
5,AFHA3,Agua Fria,AZ,4400,17,http://waterdata.usgs.gov/az/nwis/uv?09512450,09512450,4400,4400,4400,09512450
6,AFMA3,Agua Fria,AZ,3434,18,http://waterdata.usgs.gov/az/nwis/uv?09512500,09512500,3434,3434,3434,09512500
7,AFRA3,Agua Fria,AZ,1800,19,http://waterdata.usgs.gov/az/nwis/uv?09512800,09512800,1800,1800,1800,09512800
10,ATPA3,Altar Wash,AZ,2975,24,http://waterdata.usgs.gov/az/nwis/uv?09486800,09486800,2975,2975,2975,09486800
...,...,...,...,...,...,...,...,...,...,...,...
450,YASC2,Yampa,CO,7240,2,http://waterdata.usgs.gov/co/nwis/uv?09237450,09237450,7240,7240,7240,09237450
451,YDLC2,Yampa,CO,5600,17,http://waterdata.usgs.gov/co/nwis/uv?09260050,09260050,5600,5600,5600,09260050
452,YMSC2,Yampa,CO,7050,4,http://waterdata.usgs.gov/co/nwis/uv?09237500,09237500,7050,7050,7050,09237500
453,YLLU1,Yellowstone,UT,7430,20,http://waterdata.usgs.gov/ut/nwis/uv?09292500,09292500,7430,7430,7430,09292500


With those columns added, we are now ready to iterate through the USGS pages and pull the data that we need. We'll put it into the dataframe and see how it looks in a few minutes.

In [84]:
# found a missing page and will remove that gage from the DF
df.drop(labels=151, axis=0, inplace=True) # Flamingo Creek near Las Vegas

KeyError: '[151] not found in axis'

In [96]:
# found a missing page and will remove that gage from the DF
df.drop(labels=181, axis=0, inplace=True) # 

In [100]:
# found a missing page and will remove that gage from the DF
df.drop(labels=226, axis=0, inplace=True) # Las Vegas Wash

In [101]:
# iterate over the rows in the first dataframe (gauges that have NOAA predictions)
for i, r in df.iterrows():
    # skip rows without a USGS number; i > 226 resumes after the row where the previous run stopped
    if r['usgs'] != 0 and i > 226:
        # then get what we need from the website
        link = 'https://waterdata.usgs.gov/monitoring-location/' + r['usgs'] + '/#parameterCode=00060&period=P7D'
        response = requests.get(link)
        soup = BeautifulSoup(response.text, "html.parser")
        site_sum = soup.find_all(id='site-summary')
        # raises IndexError when the page has no site summary (the missing pages dropped above)
        rows_sum = site_sum[0].find_all('tr')
        # iterate through all of the rows from the page
        for row in rows_sum:
            label = row.th.get_text().lower()
            if 'decimal latitude' in label:
                # .loc writes into df itself instead of a possible temporary copy
                df.loc[i, 'lat'] = float(re.findall(r'\d+.\d+', row.td.get_text())[0])
            elif 'decimal longitude' in label:
                df.loc[i, 'long'] = float(re.findall(r'\d+.\d+', row.td.get_text())[0])
            # altitude scraping is disabled in this run
#             elif 'altitude of gage' in label:
#                 df.loc[i, 'alt'] = float(re.findall(r'\d+.\d+', row.td.get_text())[0])
            elif 'subbasin' in label:
                df.loc[i, 'basin'] = row.td.get_text().replace('\n', '')
        # standard 3 second delay in pulling data
        time.sleep(3)
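
Dropping rows by hand and restarting with `i > 226` works, but the loop could also skip failed pages on its own and remember what it already has. The sketch below illustrates that idea only; it is not the code that produced the results here. `scrape_gauges` and `fake_fetch` are hypothetical names, and the real request and parse step is replaced by a stand-in so the example runs offline.

```python
import time

def scrape_gauges(gauge_ids, fetch_summary, delay=3.0, done=None):
    """Collect one summary per gauge, skipping gauges that fail.

    fetch_summary is any callable that takes a gauge ID and returns a dict,
    or returns None / raises when the page has no usable site summary.
    """
    results = dict(done or {})  # summaries from an earlier run are kept
    failed = []
    for gauge_id in gauge_ids:
        if gauge_id in results:
            continue  # resume support: do not request the page again
        try:
            summary = fetch_summary(gauge_id)
        except Exception as exc:  # deliberately broad: one bad page should not end the run
            failed.append((gauge_id, type(exc).__name__))
        else:
            if summary:
                results[gauge_id] = summary
            else:
                failed.append((gauge_id, "no site summary"))
        time.sleep(delay)  # same polite delay as the loop above
    return results, failed

# stand-in for the real request + parse step
def fake_fetch(gauge_id):
    pages = {"09073400": {"lat": 39.17998755, "basin": "Roaring Fork"}}
    if gauge_id == "BROKEN":
        raise IndexError("site-summary not found")
    return pages.get(gauge_id)

results, failed = scrape_gauges(["09073400", "RFBASALT", "BROKEN"], fake_fetch, delay=0)
print(results)
print(failed)
```

Passing the previous `results` back in as `done` lets an interrupted run continue without repeating requests.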

In [102]:
df

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs,lat,long,alt,basin
0,SPRA3,San Pedro,AZ,2820,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,09472050,32.4462,110.488,2820,Lower San Pedro
5,AFHA3,Agua Fria,AZ,4400,17,http://waterdata.usgs.gov/az/nwis/uv?09512450,09512450,34.4853,112.237,4400,Agua Fria
6,AFMA3,Agua Fria,AZ,3434,18,http://waterdata.usgs.gov/az/nwis/uv?09512500,09512500,34.3153,112.064,3434,Agua Fria
7,AFRA3,Agua Fria,AZ,1800,19,http://waterdata.usgs.gov/az/nwis/uv?09512800,09512800,34.0156,112.168,1800,Agua Fria
10,ATPA3,Altar Wash,AZ,2975,24,http://waterdata.usgs.gov/az/nwis/uv?09486800,09486800,31.839,111.404,2975.15,Brawley Wash
...,...,...,...,...,...,...,...,...,...,...,...
450,YASC2,Yampa,CO,7240,2,http://waterdata.usgs.gov/co/nwis/uv?09237450,09237450,40.2643,106.892,7240,Upper Yampa
451,YDLC2,Yampa,CO,5600,17,http://waterdata.usgs.gov/co/nwis/uv?09260050,09260050,40.4516,108.525,5600,Lower Yampa
452,YMSC2,Yampa,CO,7050,4,http://waterdata.usgs.gov/co/nwis/uv?09237500,09237500,40.2865,106.829,7050,Upper Yampa
453,YLLU1,Yellowstone,UT,7430,20,http://waterdata.usgs.gov/ut/nwis/uv?09292500,09292500,40.5119,110.342,7430,Duchesne


In [104]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 373 entries, 0 to 454
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   NOAA_gauge  373 non-null    object
 1   River       373 non-null    object
 2   State       373 non-null    object
 3   Elevation   373 non-null    object
 4   Segment     373 non-null    object
 5   USGS_link   373 non-null    object
 6   usgs        373 non-null    object
 7   lat         373 non-null    object
 8   long        373 non-null    object
 9   alt         373 non-null    object
 10  basin       373 non-null    object
dtypes: object(11)
memory usage: 45.0+ KB


It is good that they are all non-null, but Elevation, lat, long, and alt should be int or float

In [105]:
df['Elevation'] = df['Elevation'].astype(float)

In [106]:
df['lat'] = df['lat'].astype(float)

In [107]:
df['long'] = df['long'].astype(float)

In [108]:
df['alt'] = df['alt'].astype(float)

In [109]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 373 entries, 0 to 454
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   NOAA_gauge  373 non-null    object 
 1   River       373 non-null    object 
 2   State       373 non-null    object 
 3   Elevation   373 non-null    float64
 4   Segment     373 non-null    object 
 5   USGS_link   373 non-null    object 
 6   usgs        373 non-null    object 
 7   lat         373 non-null    float64
 8   long        373 non-null    float64
 9   alt         373 non-null    float64
 10  basin       373 non-null    object 
dtypes: float64(4), object(7)
memory usage: 45.0+ KB


Much better!
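
The four conversions could also be written as one loop. The sketch below uses a toy frame rather than the real `df`, and it swaps in `pd.to_numeric` with `errors="coerce"`, which is not what the cells above use: it turns unparseable entries into NaN instead of raising, so it is only appropriate if silently losing bad values is acceptable.

```python
import pandas as pd

# toy stand-in for df, with object columns like the ones shown by df.info()
demo = pd.DataFrame(
    {
        "Elevation": ["2820", "4400"],
        "lat": [32.4462, 34.4853],
        "long": [110.488, 112.237],
        "alt": ["2820", "n/a"],  # one entry that cannot be parsed
    },
    dtype=object,
)

numeric_cols = ["Elevation", "lat", "long", "alt"]
for col in numeric_cols:
    # unparseable entries become NaN instead of raising an error
    demo[col] = pd.to_numeric(demo[col], errors="coerce")

print(demo.dtypes)
print(demo["alt"].isna().sum())
```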

Let's work with the other data frame now to pull those coordinates, altitude, and river basin

In [118]:
# placeholder values; rows that cannot be scraped keep them and are filtered out later
dfm['lat'] = 0.0
dfm['long'] = 0.0
dfm['alt'] = 0.0
dfm['basin'] = 'home'

In [119]:
dfm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156 entries, 0 to 155
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   USGS    156 non-null    object 
 1   lat     156 non-null    float64
 2   long    156 non-null    float64
 3   alt     156 non-null    float64
 4   basin   156 non-null    object 
dtypes: float64(3), object(2)
memory usage: 6.2+ KB


In [120]:
# iterate over the rows in the second dataframe (gauges that we want predictions for)
for i, r in dfm.iterrows():
    # then get what we need from the website
    link = 'https://waterdata.usgs.gov/monitoring-location/' + r['USGS'] + '/#parameterCode=00060&period=P7D'
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    site_sum = soup.find_all(id='site-summary')
    # pages without a site summary keep their placeholder values
    if len(site_sum) > 0:
        rows_sum = site_sum[0].find_all('tr')
        # iterate through all of the rows from the page
        for row in rows_sum:
            label = row.th.get_text().lower()
            if 'decimal latitude' in label:
                # .loc assigns into dfm directly, which avoids SettingWithCopyWarning
                dfm.loc[i, 'lat'] = float(re.findall(r'\d+.\d+', row.td.get_text())[0])
            elif 'decimal longitude' in label:
                dfm.loc[i, 'long'] = float(re.findall(r'\d+.\d+', row.td.get_text())[0])
            elif 'altitude of gage' in label:
                dfm.loc[i, 'alt'] = float(re.findall(r'\d+.\d+', row.td.get_text())[0])
            elif 'subbasin' in label:
                dfm.loc[i, 'basin'] = row.td.get_text().replace('\n', '')
    # standard 3 second delay in pulling data
    time.sleep(3)


In [121]:
dfm

Unnamed: 0,USGS,lat,long,alt,basin
0,10140700,41.231819,111.984497,4285.00,Lower Weber
1,10149000,40.118012,111.314622,6320.00,Spanish Fork
2,10155200,40.554398,111.433243,5691.59,Provo
3,10224000,39.481898,112.393834,4660.00,Lower Sevier
4,10308200,38.714627,119.764899,5400.00,Upper Carson
...,...,...,...,...,...
151,09076300,39.223200,106.857308,7575.00,Roaring Fork
152,RFBASALT,0.000000,0.000000,0.00,home
153,ESCCREEKCO,0.000000,0.000000,0.00,home
154,SFPAYETTEID,0.000000,0.000000,0.00,home


This looks pretty good. Let's focus on the USGS gages and drop the Colorado Division of Water gages from the dataframe. These are the rows that still have 0.0 for the latitude.

In [123]:
# keep only the rows where the scrape found data (lat is no longer the 0.0 placeholder)
dfn = dfm[dfm['lat'] != 0].copy()
dfn

Unnamed: 0,USGS,lat,long,alt,basin
0,10140700,41.231819,111.984497,4285.00,Lower Weber
1,10149000,40.118012,111.314622,6320.00,Spanish Fork
2,10155200,40.554398,111.433243,5691.59,Provo
3,10224000,39.481898,112.393834,4660.00,Lower Sevier
4,10308200,38.714627,119.764899,5400.00,Upper Carson
...,...,...,...,...,...
147,06892350,38.983337,94.964689,753.87,"Lower Kansas, Kansas"
148,07169800,37.375597,96.185548,897.30,Elk
149,06775900,41.778611,100.525278,2798.18,Dismal
150,06461500,42.902222,100.362222,2287.57,Middle Niobrara


In [124]:
dfn.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 125 entries, 0 to 151
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   USGS    125 non-null    object 
 1   lat     125 non-null    float64
 2   long    125 non-null    float64
 3   alt     125 non-null    float64
 4   basin   125 non-null    object 
dtypes: float64(3), object(2)
memory usage: 5.9+ KB


That looks really good. That is the list of gauges that we want to build models for. Let's review the data that we have now: <br>
    - dfn: dataframe with all 125 of the USGS gages that we want to build models for. This dataframe includes some details about the gages that we can use to determine potential features
    - df: dataframe with all 373 of the USGS gages that have NOAA predictions. These are potential features for predicting the flow in dfn

In [None]:
# export the dataframes 
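
The export cell is empty in this notebook. One way to fill it, mirroring how the inputs were loaded, is sketched below. The helper names and the file name `USGS_targets.pkl` are hypothetical, and a plain dictionary stands in for `df` or `dfn` so the round trip can be checked on its own.

```python
import pickle
import tempfile
from pathlib import Path

def save_pickle(obj, path):
    """Write obj to path, closing the file even if pickling fails."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)

def load_pickle(path):
    """Read back an object written by save_pickle."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# stand-in object; in the notebook this would be df or dfn
sample = {"USGS": ["10140700"], "basin": ["Lower Weber"]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "USGS_targets.pkl"  # hypothetical file name
    save_pickle(sample, target)
    restored = load_pickle(target)

print(restored == sample)
```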


In next notebook:
    - correlate gauges without predictions to gauges that do have predictions (KNN)
    - result should be a dataframe with first column as the y USGS gage and each additional column a feature USGS gage that has a NOAA prediction
    - set threshold for accuracy in predicting flow
    - set models up to use in production
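
One possible reading of the KNN step is to rank the gauges that have NOAA predictions by distance from each target gage. The sketch below does that with the great-circle (haversine) distance on the scraped coordinates. This is an assumption about the plan, not the method of the next notebook; `haversine_km` and `nearest_gauges` are hypothetical names, and basin and altitude are ignored. The coordinates are copied from the tables above, where longitudes are stored as positive numbers; distances stay consistent as long as every gage uses the same sign convention.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))

def nearest_gauges(target, candidates, k=2):
    """Return the IDs of the k candidates closest to target = (lat, long)."""
    lat, lon = target
    ranked = sorted(candidates, key=lambda c: haversine_km(lat, lon, c[1], c[2]))
    return [c[0] for c in ranked[:k]]

target = (39.2232, 106.857308)  # gage 09076300 from dfm
candidates = [                   # (usgs, lat, long) rows from df
    ("09472050", 32.4462, 110.488),
    ("09237450", 40.2643, 106.892),
    ("09260050", 40.4516, 108.525),
    ("09292500", 40.5119, 110.342),
]
print(nearest_gauges(target, candidates, k=2))
```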

In the notebook after that:
    - pull USGS data for the period that each gage has been in operation
    - potentially cut time scale of data based on when the flow has never been above a certain threshold
    - each of the 125 gages should have a dataframe (list of dataframes)

In the notebook after that:
    - build 125 models (using Lasso Regression, we believe)
    - set threshold for accuracy in predicting flow
    - throw out models that do not meet that accuracy
    - throw out features in models that are no longer needed
    - do other prep for putting models into production (test in real time and have accuracy feedback)