In this notebook:
    - go to the USGS page for each gauge
    - pull the longitude and latitude for that gauge, plus the river and any other information that could be used to correlate gauges with each other


Looking at the gauge metadata for just a couple of gauges, we find a few really useful pieces of information:
    - latitude
    - longitude
    - altitude
    - Subbasin hydrologic unit (River System)

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import os
import requests
import time

In [2]:
import pickle
# raw string so the backslashes in the Windows path are not read as escape sequences
path = r"C:\Springboard\Github\gauge_info"
os.chdir(path)

In [4]:
# load DF with NOAA and USGS for all gauges in Colorado River Basin that have predictions in NOAA
with open("NOAA_USGS.pkl", "rb") as f:  # close the file once the DataFrame is loaded
    df = pickle.load(f)
df.head()

Unnamed: 0,NOAA_gauge,River,State,Elevation,Segment,USGS_link,usgs
0,SPRA3,San Pedro,AZ,2820,7,http://waterdata.usgs.gov/az/nwis/uv?09472050,9472050
1,MAOA3,Acdc,AZ,1230,6,0,0
2,MHFA3,Acdc,AZ,1225,7,0,0
3,MSXA3,Acdc,AZ,1220,8,0,0
4,ACHA3,Agua Caliente Wash,AZ,2588,2,0,0


These columns are the potential features to use for predicting the water flow at the gauges.

In [6]:
# load DF with the list of gauges that we would like to get predictions on
with open("USGS_missing.pkl", "rb") as f:  # close the file once the DataFrame is loaded
    df = pickle.load(f)
df

Unnamed: 0,USGS
0,10140700
1,10149000
2,10155200
3,10224000
4,10308200
...,...
151,09076300
152,RFBASALT
153,ESCCREEKCO
154,SFPAYETTEID


### Data loaded from previous notebook. Let's see if we can scrape websites for the appropriate data about each gauge:
    - latitude
    - longitude
    - altitude
    - Subbasin hydrologic unit (River System)

Looks like this is the general link that we need to scrape: <br>
https://waterdata.usgs.gov/monitoring-location/GAUGE NUMBER/#parameterCode=00060&period=P7D <br>
Let's try to pull the data from one to take a look
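As a minimal sketch of how that link could be built for every gauge in the list, the hypothetical helpers `build_gauge_url` and `looks_like_usgs_id` below fill a gauge number into the pattern. The rule that only all-digit IDs are USGS site numbers (so entries such as `RFBASALT` are skipped) is an assumption made for illustration, not something confirmed by the USGS page.

```python
# Hypothetical helpers: fill a gauge number into the link pattern shown above.
URL_TEMPLATE = (
    "https://waterdata.usgs.gov/monitoring-location/{gauge}/"
    "#parameterCode=00060&period=P7D"
)

def build_gauge_url(gauge):
    """Return the monitoring-location link for one gauge ID."""
    return URL_TEMPLATE.format(gauge=str(gauge).strip())

def looks_like_usgs_id(gauge):
    """Assumption: USGS site numbers are all digits; names like 'RFBASALT' are not."""
    return str(gauge).strip().isdigit()

gauges = ["09073400", "10140700", "RFBASALT"]
links = {g: build_gauge_url(g) for g in gauges if looks_like_usgs_id(g)}
print(links["09073400"])
```

Keeping the IDs as strings matters here, because a value such as `09073400` loses its leading zero if it is converted to an integer.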

In [7]:
## try to pull on just the first gauge to get it correct
link = 'https://waterdata.usgs.gov/monitoring-location/09073400/#parameterCode=00060&period=P7D'
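response = requests.get(link, timeout=30)  # don't hang forever if the site is slow
response.raise_for_status()  # stop here on an HTTP error instead of parsing an error page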
soup = BeautifulSoup(response.text, "html.parser")

In [16]:
site_sum = soup.find_all(id='site-summary')
print(len(site_sum))

1


That's the data that we want, but we need to break it down further; let's split it into its table rows

In [18]:
rows_sum = site_sum[0].find_all('tr')
print(len(rows_sum))

43


That's a lot of rows. We could rely on the data we want being at the same position every time, but a more robust approach is to match on the first cell in each row (the header cell), which names the field we are looking for
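A rough sketch of that label-matching idea, run against a small hand-written stand-in for the `site-summary` table rather than the live page. The helper `extract_fields` and the sample HTML are hypothetical; only the `Decimal latitude` label is taken from the page, and the labels for the other fields would need to be read off the real table.

```python
from bs4 import BeautifulSoup

# Small stand-in for the 'site-summary' table; the real page has many more rows.
SAMPLE_HTML = """
<table id="site-summary">
  <tr><th scope="row">Decimal latitude
      </th><td class="loc-metadata">
          39.17998755
      </td><td><sub>n/a</sub></td></tr>
  <tr><th scope="row">Some other field</th><td>ignored</td></tr>
  <tr><td>row without a header cell</td></tr>
</table>
"""

def extract_fields(rows, wanted):
    """Map each wanted label to the stripped text of the first data cell in its row."""
    found = {}
    for row in rows:
        if row.th is None or row.td is None:
            continue  # skip rows that lack a header or a data cell
        label = row.th.get_text(strip=True)
        if label in wanted:
            found[label] = row.td.get_text(strip=True)
    return found

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = soup.find(id="site-summary").find_all("tr")
fields = extract_fields(rows, {"Decimal latitude"})
print(fields)
```

Matching on the label means the lookup keeps working if a gauge page has more or fewer rows than the one inspected here.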

In [21]:
print(rows_sum[7])

<tr>
<th scope="row">Decimal latitude
                                                    
                                                </th>
<td class="loc-metadata">
                                                    
                                                        39.17998755
                                                    
                                                    
                                                </td>
<td><sub>n/a</sub></td>
</tr>


In [23]:
# do an example of this first before writing the loop
rows_sum[7].th.get_text()
# this is the text that we want to match

'Decimal latitude\n                                                    \n                                                '

In [27]:
# this is the data that we want to pull from that row
lat = rows_sum[7].td.get_text()
lat

'\n                                                    \n                                                        39.17998755\n                                                    \n                                                    \n                                                '

That is the number we want, but we need to pull it out from all of the line breaks and padding; we just want the number and none of the surrounding whitespace

In [28]:
# let's try to extract just the digits
import re
re.findall(r'\d+', lat)

['39', '17998755']

# stopped here. Need to figure out Regex before proceeding
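One possible way past this point: `\d+` matches runs of digits only, so it splits the value at the decimal point. The sketch below shows two alternatives on a shortened copy of the cell text. Stripping the whitespace and calling `float` is the simpler of the two; the regex variant keeps an optional minus sign, which a longitude would need if the page reports it as a negative number (an assumption, since only the latitude row was inspected here).

```python
import re

# Shortened copy of the cell text returned by get_text(), padding included.
lat = "\n        \n            39.17998755\n        \n        \n    "

# Option 1: strip the surrounding whitespace and convert directly.
lat_value = float(lat.strip())

# Option 2: a pattern that keeps the optional sign and the decimal part together.
DECIMAL = re.compile(r"-?\d+(?:\.\d+)?")
matches = DECIMAL.findall(lat)

print(lat_value)  # 39.17998755
print(matches)    # ['39.17998755']
```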

In [None]:
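for row in rows_sum:
    # match on the header cell text rather than on the row index
    if row.th is not None and row.td is not None and 'Decimal latitude' in row.th.get_text():
        lat = float(row.td.get_text().strip())  # strip() drops the padding whitespace
        print(lat)
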

In the next notebook:
    - correlate gauges without predictions to gauges that do have predictions
    - set threshold for accuracy in predicting flow
    - set models up to use in production