# Capstone - Balloon Data Webscraping and Pre-Processing
Date started: 2021.10.26<br>
Date completed: 2021.10.28<br>
William Matthews

### Introduction

We believe windspeed and cloud cover in the alpine are going to be some of the primary factors driving lift opening times.  Wind has two primary effects.  The windspeed in the past 24 hrs is going to shift snow around the mountain and make avalanche control more difficult.  Windspeed on the day is going to affect lift operation directly.  The power (strength) of the wind increases with the [cube of the velocity](http://xn--drmstrre-64ad.dk/wp-content/wind/miller/windpower%20web/en/tour/wres/enrspeed.htm), so there is a threshold above which it is unsafe to operate the lifts.  Cloud cover on any given day is going to affect the ability of ski patrol to conduct avalanche control.  If the slopes are not safe, the lifts will not open, or may open late.

For wind data, there are two high-quality anemometers, one on top of each of Whistler and Blackcomb, but they are privately owned by Vail Resorts and we could not find the data from these weather stations.  On the advice of [David Jones](https://www.linkedin.com/in/david-jones-meteorologist-23913871/?originalSubdomain=ca) we are going to attempt to correlate data from two weather balloon stations located nearby to get an approximation of wind in the alpine.  Specifically, we are interested in the wind speed and wind direction at the 700 milli-bar (or approximately 3000m) elevation.

For cloud cover data, there are no easily available sources of daily cloud cover data.  Again, on the kind advice of David Jones, we will try to use a combination of relative humidity, temperature, and dew point at the 700 milli-bar elevation to construct an estimate of cloud cover. 

### Report Objectives

The purpose of this report is to outline the process of data retrieval for two weather balloon data sets, one from Port Hardy and one from Quillayute.  We will be scraping the data from the University of Wyoming's upper air sounding data website.  It contains the records for both balloon sites, which is handy because the sites are in two different countries and run by two different national organizations.

### Determining Request Format

The first 3 requests will be for Dec 2015, Jan 2016, and Feb 2016:

_http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2015&MONTH=12&FROM=0100&TO=3112&STNM=71109_<br>
_http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2016&MONTH=01&FROM=0100&TO=3112&STNM=71109_<br>
_http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2016&MONTH=02&FROM=0100&TO=2912&STNM=71109_<br>

This looks promising.  The fields for `YEAR`, `MONTH`, and `TO` appear to be our key variables, with `TO` representing an end time.  There is also `STNM`, representing the station climate id code.  We will need to change this to `72797` to access the station at Quillayute (Port Hardy is `71109`).  Further tinkering shows we will have to be careful with the ending time in the URL we pass to the `requests.get()` function.  The ending date must be specified correctly for each month, including February in leap vs. non-leap years.  An incorrect ending date for a month does not revert to the last date of the month but throws an error.  Let's do the following:
- Modify the above request and paste it to the browser to confirm we can get what we need
- Confirm input for `TO` parameter for months ending in 30 and 28.

Let's try and get data from 2021 March with the following url:

_http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2021&MONTH=03&FROM=0100&TO=3112&STNM=71109_<br>

Looks like that works.  Now we will manually run Feb 2017 for a month ending in 28 and April 2017 for a month ending in 30.  It looks like the format of `TO` is `ddhh` but we will run the following to be sure:

_http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2017&MONTH=02&FROM=0100&TO=2812&STNM=71109_<br>
_http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2017&MONTH=04&FROM=0100&TO=3012&STNM=71109_<br>

Both of those worked, so we have the format our requests need to take.  Specifically, we can use the url above with the following parameters:
- `YEAR` in the format `yyyy`
- `MONTH` in the format `mm`
- `TO` in the format `ddhh`
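
The parameter list above can be sketched as a small URL builder.  This is only an illustration under stated assumptions: `build_sounding_url` is a hypothetical helper name (not part of the notebook), and it uses the standard library's `calendar.monthrange` to look up the last day of the month so the `TO` value is always valid, including for leap years.

```python
import calendar
from urllib.parse import urlencode

BASE_URL = "http://weather.uwyo.edu/cgi-bin/sounding"


def build_sounding_url(year, month, stnm, last_hour=12):
    """Hypothetical helper: build a one-month request URL for one station."""
    # monthrange returns (weekday of first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    params = {
        "region": "naconf",
        "TYPE": "TEXT:LIST",                        # urlencode turns ':' into %3A
        "YEAR": f"{year:04d}",                      # yyyy
        "MONTH": f"{month:02d}",                    # mm
        "FROM": "0100",                             # ddhh, first flight of the month
        "TO": f"{last_day:02d}{last_hour:02d}",     # ddhh, last flight of the month
        "STNM": stnm,
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Feb 2016 is a leap-year February, so TO should come out as 2912
print(build_sounding_url(2016, 2, "71109"))
# http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST&YEAR=2016&MONTH=02&FROM=0100&TO=2912&STNM=71109
```

The printed URL matches the Feb 2016 request tested by hand earlier in this section.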

Next step is to get one page downloaded and start exploring how to access the information we need.

### Determining Website Format

Before we automate grabbing all the data we need, we have to determine the website format so we are extracting the information we want.  We will start by importing the libraries we will need and grabbing our first request and exploring its contents.

In [1]:
# libraries for getting websites and interpreting results
import requests
import lxml
import bs4
import time

# working with datetimes
from datetime import datetime
from datetime import timedelta
from datetime import date
from dateutil.relativedelta import relativedelta

# for extracting data from strings
import re

# dataframes for organizing data nicely for output
import pandas as pd

In [2]:
# set parameters for page request
year = '2015'
month = '01'
to = '3112'
stnm = '71109'

# make request
req = requests.get(f"http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST"
                   f"&YEAR={year}" 
                   f"&MONTH={month}"
                   f"&FROM=0100"
                   f"&TO={to}"
                   f"&STNM={stnm}")

In [3]:
# look at the raw text
req.text

'<HTML>\n<TITLE>University of Wyoming - Radiosonde Data</TITLE>\n<BODY BGCOLOR="white">\n<H2>71109 YZT Port Hardy Observations at 00Z 01 Jan 2015</H2>\n<PRE>\n-----------------------------------------------------------------------------\n   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV\n    hPa     m      C      C      %    g/kg    deg   knot     K      K      K \n-----------------------------------------------------------------------------\n 1033.0     17    4.6    1.1     78   4.03    290      2  275.2  286.3  275.9\n 1000.0    282    2.6   -0.6     79   3.68    315      6  275.8  286.0  276.4\n  997.1    305    2.4   -0.7     80   3.66    310      6  275.8  286.0  276.4\n  967.0    553    0.4   -1.6     86   3.53    306     11  276.2  286.1  276.8\n  960.1    610    0.2   -2.7     81   3.28    305     12  276.5  285.8  277.1\n  944.0    746   -0.3   -5.3     69   2.74    319     17  277.4  285.2  277.8\n  938.0    797    0.0   -5.0     69   2.82    324  

It looks like we can get what we want.  Now it is time to start exploring how to extract elements by getting this into a more readable format.

In [4]:
# utilize beautiful soup
soup = bs4.BeautifulSoup(req.text, "lxml")

# take a look
soup.text

'\nUniversity of Wyoming - Radiosonde Data\n\n71109 YZT Port Hardy Observations at 00Z 01 Jan 2015\n\n-----------------------------------------------------------------------------\n   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV\n    hPa     m      C      C      %    g/kg    deg   knot     K      K      K \n-----------------------------------------------------------------------------\n 1033.0     17    4.6    1.1     78   4.03    290      2  275.2  286.3  275.9\n 1000.0    282    2.6   -0.6     79   3.68    315      6  275.8  286.0  276.4\n  997.1    305    2.4   -0.7     80   3.66    310      6  275.8  286.0  276.4\n  967.0    553    0.4   -1.6     86   3.53    306     11  276.2  286.1  276.8\n  960.1    610    0.2   -2.7     81   3.28    305     12  276.5  285.8  277.1\n  944.0    746   -0.3   -5.3     69   2.74    319     17  277.4  285.2  277.8\n  938.0    797    0.0   -5.0     69   2.82    324     18  278.2  286.2  278.7\n  930.0    865    2.2  -10.8 

It looks like we thankfully have a simple website to extract the information from.  Our date and time information are contained inside the `<h2>` tag and the measurement data we are interested in is contained inside the `<pre>` tag.  Let's deal with the header tag first.

In [5]:
# get all <h2> tags
header_dates = soup.select('h2')

# check out how many headers there are
print(f'Number of headers {len(header_dates)}')

Number of headers 62


In [6]:
# show first 10 headers
for i in range(10):
    print(f"Header {i}: {header_dates[i]}")

Header 0: <h2>71109 YZT Port Hardy Observations at 00Z 01 Jan 2015</h2>
Header 1: <h2>71109 YZT Port Hardy Observations at 12Z 01 Jan 2015</h2>
Header 2: <h2>71109 YZT Port Hardy Observations at 00Z 02 Jan 2015</h2>
Header 3: <h2>71109 YZT Port Hardy Observations at 12Z 02 Jan 2015</h2>
Header 4: <h2>71109 YZT Port Hardy Observations at 00Z 03 Jan 2015</h2>
Header 5: <h2>71109 YZT Port Hardy Observations at 12Z 03 Jan 2015</h2>
Header 6: <h2>71109 YZT Port Hardy Observations at 00Z 04 Jan 2015</h2>
Header 7: <h2>71109 YZT Port Hardy Observations at 12Z 04 Jan 2015</h2>
Header 8: <h2>71109 YZT Port Hardy Observations at 00Z 05 Jan 2015</h2>
Header 9: <h2>71109 YZT Port Hardy Observations at 12Z 05 Jan 2015</h2>


From the above we are going to be able to extract all the datetime information we need.  One thing to note is that the times are given in Zulu time (UTC).  According to [NOAA](https://www.ready.noaa.gov/READYtime.php), 00Z is 4PM Pacific Standard Time and 12Z is 4AM Pacific Standard Time.  This means that all of the dates for 00Z need to be shifted backwards one day, as Jan 01 00Z is Dec 31 4PM PST.  Said simply, we are 8 hours behind England, which is where Zulu time is defined.

Let's write a function to extract the info we need and get it in a datetime format.

In [39]:
def zulu_to_datetime(zulu_time):
    """
    Takes a string containing zulu time and a date.  Converts zulu time to PST.
    Pushes day back one for 00Z to 07Z as PST is one day behind at that time.
    Returns a datetime object.
    ____________________
    
    Parameters:
                zulu_time: string ending in format 00Z 01 Jan 2015.
    ____________________
    
    Returns:
                a datetime object
    
    """
    # set default for flag to deal with 00Z-07Z time needing to go backwards a day
    subtract_day = False
    
    # get string into an array, split on spaces
    arr_zulu_time = zulu_time.split()
    
    # get components of zulu time
    year = arr_zulu_time[-1]
    month = arr_zulu_time[-2]
    day = arr_zulu_time[-3]
    
    # convert time to PST and flag for moving day backwards for 00Z
    hour = int(arr_zulu_time[-4][:2])
    if hour < 8:
        hour += 16
        subtract_day = True
    else:
        hour -= 8        
    
    # rebuild string to pass to datetime.strptime
    d_time = f"{month} {day} {year}"
    
    # create datetime
    time_stamp = datetime.strptime(d_time, '%b %d %Y')
        
    # update with correct hour
    time_stamp = time_stamp.replace(hour=hour)
    
    # if we need to subtract time, go backwards a day
    if subtract_day:
        time_stamp = time_stamp - timedelta(days = 1)
    
    return time_stamp
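
As a cross-check on the manual hour arithmetic above, the same conversion can be sketched with the standard library's timezone support.  This is an alternative illustration, not the function used in the rest of the notebook: `header_to_pst` is a hypothetical name, and it keeps the same assumption of a fixed 8-hour offset (it does not account for daylight saving time).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-8 offset, matching the notebook's assumption of year-round PST
PST = timezone(timedelta(hours=-8), name="PST")


def header_to_pst(header):
    """Hypothetical helper: parse the trailing 'HHZ dd Mon yyyy' and shift to PST."""
    # keep only the last four tokens, e.g. '00Z 01 Jan 2015'
    stamp = " ".join(header.split()[-4:])
    utc_time = datetime.strptime(stamp, "%HZ %d %b %Y").replace(tzinfo=timezone.utc)
    # drop tzinfo so the result is a naive datetime, like zulu_to_datetime returns
    return utc_time.astimezone(PST).replace(tzinfo=None)


print(header_to_pst("71109 YZT Port Hardy Observations at 00Z 01 Jan 2015"))
# 2014-12-31 16:00:00
```

Letting `datetime` handle the offset means the day rollover for 00Z flights falls out automatically, with no separate flag.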

In [34]:
# test function
for h in header_dates:
    print(h.getText())
    print(zulu_to_datetime(h.getText()))

71109 YZT Port Hardy Observations at 00Z 01 Jan 2015
2014-12-31 16:00:00
71109 YZT Port Hardy Observations at 12Z 01 Jan 2015
2015-01-01 04:00:00
71109 YZT Port Hardy Observations at 00Z 02 Jan 2015
2015-01-01 16:00:00
71109 YZT Port Hardy Observations at 12Z 02 Jan 2015
2015-01-02 04:00:00
71109 YZT Port Hardy Observations at 00Z 03 Jan 2015
2015-01-02 16:00:00
71109 YZT Port Hardy Observations at 12Z 03 Jan 2015
2015-01-03 04:00:00
71109 YZT Port Hardy Observations at 00Z 04 Jan 2015
2015-01-03 16:00:00
71109 YZT Port Hardy Observations at 12Z 04 Jan 2015
2015-01-04 04:00:00
71109 YZT Port Hardy Observations at 00Z 05 Jan 2015
2015-01-04 16:00:00
71109 YZT Port Hardy Observations at 12Z 05 Jan 2015
2015-01-05 04:00:00
71109 YZT Port Hardy Observations at 00Z 06 Jan 2015
2015-01-05 16:00:00
71109 YZT Port Hardy Observations at 12Z 06 Jan 2015
2015-01-06 04:00:00
71109 YZT Port Hardy Observations at 00Z 07 Jan 2015
2015-01-06 16:00:00
71109 YZT Port Hardy Observations at 12Z 07 Jan 201

It looks like the above is working well.  We will stop here with the timestamp as we will insert it as the first column in each row of data we generate.  Let's move onto the main data in the `<pre>` block and check out how many instances there are and how we will get them.

In [10]:
# get all <pre> tags
data = soup.select('pre')

# check out how many pre blocks there are
print(f'Number of pre blocks {len(data)}')
print()

# print first 5 to see what the pattern is
for i in range(5):
    print(f"Element number {i}")
    print(data[i])

Number of pre blocks 124

Element number 0
<pre>
-----------------------------------------------------------------------------
   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV
    hPa     m      C      C      %    g/kg    deg   knot     K      K      K 
-----------------------------------------------------------------------------
 1033.0     17    4.6    1.1     78   4.03    290      2  275.2  286.3  275.9
 1000.0    282    2.6   -0.6     79   3.68    315      6  275.8  286.0  276.4
  997.1    305    2.4   -0.7     80   3.66    310      6  275.8  286.0  276.4
  967.0    553    0.4   -1.6     86   3.53    306     11  276.2  286.1  276.8
  960.1    610    0.2   -2.7     81   3.28    305     12  276.5  285.8  277.1
  944.0    746   -0.3   -5.3     69   2.74    319     17  277.4  285.2  277.8
  938.0    797    0.0   -5.0     69   2.82    324     18  278.2  286.2  278.7
  930.0    865    2.2  -10.8     38   1.81    331     21  281.1  286.5  281.4
  928.0    883 

From the above it looks like the main block of data we are interested in is contained every second element starting with element 0.  That means there are 62 blocks of data, and that matches with the 62 dates we are extracting above.  Let's take a look at how this shows up in raw text format.
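
The pairing described above (one header per flight, with the measurement table in every second `<pre>` block) can be sketched as follows.  `pair_headers_with_soundings` is a hypothetical helper, and the length check is an added safeguard that the notebook itself does not perform.

```python
def pair_headers_with_soundings(headers, pre_blocks):
    """Hypothetical helper: match each header with its measurement block.

    Assumes the measurement table sits at even indices (0, 2, 4, ...) of
    pre_blocks, as observed for the page explored above.
    """
    soundings = pre_blocks[::2]  # every second element, starting at 0
    if len(headers) != len(soundings):
        raise ValueError(
            f"{len(headers)} headers but {len(soundings)} measurement blocks"
        )
    return list(zip(headers, soundings))


# toy stand-ins for the bs4 tags, just to show the pairing
headers = ["flight A", "flight B"]
pre_blocks = ["table A", "other A", "table B", "other B"]
print(pair_headers_with_soundings(headers, pre_blocks))
# [('flight A', 'table A'), ('flight B', 'table B')]
```

Failing loudly on a mismatch would flag a page whose layout differs from the one explored here, rather than silently pairing dates with the wrong data.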

In [11]:
data[0].getText()

'\n-----------------------------------------------------------------------------\n   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV\n    hPa     m      C      C      %    g/kg    deg   knot     K      K      K \n-----------------------------------------------------------------------------\n 1033.0     17    4.6    1.1     78   4.03    290      2  275.2  286.3  275.9\n 1000.0    282    2.6   -0.6     79   3.68    315      6  275.8  286.0  276.4\n  997.1    305    2.4   -0.7     80   3.66    310      6  275.8  286.0  276.4\n  967.0    553    0.4   -1.6     86   3.53    306     11  276.2  286.1  276.8\n  960.1    610    0.2   -2.7     81   3.28    305     12  276.5  285.8  277.1\n  944.0    746   -0.3   -5.3     69   2.74    319     17  277.4  285.2  277.8\n  938.0    797    0.0   -5.0     69   2.82    324     18  278.2  286.2  278.7\n  930.0    865    2.2  -10.8     38   1.81    331     21  281.1  286.5  281.4\n  928.0    883    2.2  -10.8     38   1.82    332

This looks a bit messy.  Let's start by trying to get it into an array, separated on newline (\n) characters.

In [12]:
# split the data on new line characters
data_arr = data[0].getText().split('\n')

# show result
data_arr

['',
 '-----------------------------------------------------------------------------',
 '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV',
 '    hPa     m      C      C      %    g/kg    deg   knot     K      K      K ',
 '-----------------------------------------------------------------------------',
 ' 1033.0     17    4.6    1.1     78   4.03    290      2  275.2  286.3  275.9',
 ' 1000.0    282    2.6   -0.6     79   3.68    315      6  275.8  286.0  276.4',
 '  997.1    305    2.4   -0.7     80   3.66    310      6  275.8  286.0  276.4',
 '  967.0    553    0.4   -1.6     86   3.53    306     11  276.2  286.1  276.8',
 '  960.1    610    0.2   -2.7     81   3.28    305     12  276.5  285.8  277.1',
 '  944.0    746   -0.3   -5.3     69   2.74    319     17  277.4  285.2  277.8',
 '  938.0    797    0.0   -5.0     69   2.82    324     18  278.2  286.2  278.7',
 '  930.0    865    2.2  -10.8     38   1.81    331     21  281.1  286.5  281.4',
 '  928.0  

That looks much better.  Let's build out a regular expression to get back all of the numeric data.  Credit to [this Stack Overflow post](https://stackoverflow.com/questions/4703390/how-to-extract-a-floating-number-from-a-string) for the regex pattern.

In [13]:
# get one row of data
data_row = data_arr[5]

# pass through regex
display(re.findall(r"[-+]?\d*\.\d+|\d+", data_row ))

# grab a row with negative values to confirm that works too
data_row = data_arr[6]
display(re.findall(r"[-+]?\d*\.\d+|\d+", data_row ))

['1033.0',
 '17',
 '4.6',
 '1.1',
 '78',
 '4.03',
 '290',
 '2',
 '275.2',
 '286.3',
 '275.9']

['1000.0',
 '282',
 '2.6',
 '-0.6',
 '79',
 '3.68',
 '315',
 '6',
 '275.8',
 '286.0',
 '276.4']
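
One caveat about the pattern above: the optional sign belongs only to the decimal alternative, so a negative integer would lose its sign.  In the rows shown here the integer columns (height, humidity, direction, speed) are non-negative, so the results are unaffected, but a grouped variant keeps the sign in both cases.  The sample string below is made up purely for illustration.

```python
import re

ORIGINAL_PATTERN = r"[-+]?\d*\.\d+|\d+"
# same alternatives, but grouped so the sign applies to both of them
GROUPED_PATTERN = r"[-+]?(?:\d*\.\d+|\d+)"

sample = "-5 -0.6 17"  # made-up values, not taken from the sounding data

print(re.findall(ORIGINAL_PATTERN, sample))  # ['5', '-0.6', '17']
print(re.findall(GROUPED_PATTERN, sample))   # ['-5', '-0.6', '17']
```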

With the regular expression working well and access to all of our data, let's start building out a series of functions to put everything together and loop through all our desired dates.  We will build the functions roughly working from the innermost part of the loop outwards and then test everything on a small subset of time to ensure the results are what we want.

The first piece of the puzzle is to have a function to cast a number to a float or int, depending if it contains a decimal point.

In [41]:
def cast_number(num):
    """
    Takes a string representation of a number.  If it contains a '.' cast it to a float,
    otherwise, cast to an int.
    """
    
    # if contains period, cast to float
    if num.count('.') > 0:
        return float(num)
    
    # otherwise, cast to int
    else:
        return int(num)



In [40]:
# test above function
x = cast_number('564')
display(type(x))
x = cast_number('-54.5')
display(type(x))

int

float

That works well.  Next we want to extract the flight measurements from each flight, restricted by a floor and ceiling pressure, as we are only interested in the data around 700 milli-bar (700 hPa).

In [15]:
def extract_flight_measurements(flight_data, floor_pressure = 2000.0, ceiling_pressure = 0.0):
    """
    Takes a lower and upper bound for pressure measurements that we want to extract
    from the data of one given flight of a weather balloon.
    ____________________
    
    Parameters:
                    flight_data: list of strings where each row represents one sounding
                                 at a given pressure elevation.
      floor_pressure (optional): float or int, the pressure (in hPa) at the lowest elevation
                                 to extract data for, default 2000.0 hPa (well below sea level)
    ceiling_pressure (optional): float or int, the pressure (in hPa) at the highest elevation
                                 to extract data for, default 0.0 hPa (space)
                                
                
    ____________________
    
    Returns:
            a list, where each element is a list representing one sounding at a given
            pressure elevation.
    
    """
    # define return list
    valid_records = []
    
    # define index of pressure measurement
    pres_index = 0
    
    # define regex pattern
    regex_pattern = r"[-+]?\d*\.\d+|\d+"
    
    # loop through each element in the array
    for record in flight_data:
        
        # confirm there is information in this record.  If not, skip
        if len(record) < 1:
            pass
        
        # otherwise, get the valid record
        else:
        
            # parse data with regex
            record = re.findall(regex_pattern, record)

            # cast each element to correct data type
            for i in range(len(record)):
                record[i] = cast_number(record[i])

            # if pressure less than/equal to floor_pressure:
            if record[pres_index] <= floor_pressure:

                # if pressure greater than/equal to ceiling_pressure:
                if record[pres_index] >= ceiling_pressure:

                    # append array element to return array
                    valid_records.append(record)

                # else we are past the upper limit and need no more data from this flight
                else:
                    # break out of for loop
                    break
                
    # return the list of elevation soundings
    return valid_records
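
The function above relies on every row containing all eleven values, since the regex returns numbers by position only.  If a row ever had a blank field, the remaining values would shift left into the wrong columns.  A possible safeguard, sketched below, is to slice each row by its fixed column width (7 characters per field in the rows shown above).  `parse_fixed_width_row` is a hypothetical helper, and the row with a blank field is constructed here for illustration rather than taken from the data.

```python
FIELD_WIDTH = 7
COLUMNS = ["PRES", "HGHT", "TEMP", "DWPT", "RELH", "MIXR",
           "DRCT", "SKNT", "THTA", "THTE", "THTV"]


def parse_fixed_width_row(line):
    """Hypothetical helper: slice one row into named fields, None where blank."""
    # pad short lines so every slice below is well defined
    padded = line.ljust(FIELD_WIDTH * len(COLUMNS))
    values = []
    for i in range(len(COLUMNS)):
        field = padded[i * FIELD_WIDTH:(i + 1) * FIELD_WIDTH].strip()
        values.append(float(field) if field else None)
    return dict(zip(COLUMNS, values))


# rebuild the first data row from the page, right-aligned in 7-character fields
fields = ["1033.0", "17", "4.6", "1.1", "78", "4.03",
          "290", "2", "275.2", "286.3", "275.9"]
row = "".join(f"{f:>7}" for f in fields)

# constructed example: blank out the DWPT field (characters 21 to 27)
row_with_gap = row[:21] + " " * FIELD_WIDTH + row[28:]

print(parse_fixed_width_row(row)["DWPT"])           # 1.1
print(parse_fixed_width_row(row_with_gap)["DWPT"])  # None
print(parse_fixed_width_row(row_with_gap)["RELH"])  # 78.0
```

Unlike the notebook's `cast_number`, this sketch casts everything to `float` for simplicity.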

In [16]:
# testing for function

# get all <pre> tags
data = soup.select('pre')
# split the data on new line characters
flight_record =  data[0].getText().split('\n')

# get valid records from row 5 onwards
valid_records = extract_flight_measurements(flight_record[5:], 730, 670)

for rec in flight_record[5:]:
    print(rec)

for rec in valid_records:
    print(rec)

 1033.0     17    4.6    1.1     78   4.03    290      2  275.2  286.3  275.9
 1000.0    282    2.6   -0.6     79   3.68    315      6  275.8  286.0  276.4
  997.1    305    2.4   -0.7     80   3.66    310      6  275.8  286.0  276.4
  967.0    553    0.4   -1.6     86   3.53    306     11  276.2  286.1  276.8
  960.1    610    0.2   -2.7     81   3.28    305     12  276.5  285.8  277.1
  944.0    746   -0.3   -5.3     69   2.74    319     17  277.4  285.2  277.8
  938.0    797    0.0   -5.0     69   2.82    324     18  278.2  286.2  278.7
  930.0    865    2.2  -10.8     38   1.81    331     21  281.1  286.5  281.4
  928.0    883    2.2  -10.8     38   1.82    332     21  281.3  286.7  281.6
  925.0    909    2.4  -15.6     25   1.23    335     22  281.8  285.5  282.0
  924.4    914    2.5  -16.2     24   1.18    335     22  281.9  285.5  282.1
  920.0    953    3.4  -20.6     15   0.81    337     23  283.2  285.8  283.4
  906.0   1078    5.2  -10.8     30   1.86    343     26  286.3 

Looks like extracting flight measurements is working well.  Next on the list is to build a loop that will repeat this for one month (one request) worth of data and store it into a list, paired with the date for each record.

In [36]:
def extract_one_month_data(date_data, measurement_data):

    """
    Takes an array of date data and an array of measurement data.  Returns an array where each
    element is an array of form [datetime, measurement1 ... measurementn]
    ____________________
    
    Parameters:
                    date_data: array of bs4.element.Tag where each contains the date and time
                               of a given balloon flight.
             measurement_data: array of bs4.element.Tag where each contains measurement data
                               of a given balloon flight
                
    ____________________
    
    Returns:
            a list, where each element is a list representing one sounding at a given
            pressure elevation on a given date.
    
    """
    # set constants for lower and upper bounds on milli-bar elevation (hPa)
    lower_bound_hPa = 730
    upper_bound_hPa = 670
    
    # setup array to store all values to be returned
    one_month_flight_records = []
    
    # loop through measurement array, getting every second element
    for i in range(0, len(measurement_data), 2):

        # get one date (integer-divide i by 2 as the date array is half the size)
        d_i = i // 2
        date_str = date_data[d_i]
        
        # get text from bs4 element
        date_str = date_str.getText()
        
        # extract datetime
        flight_datetime = zulu_to_datetime(date_str)
        
        # get one element of measurement data
        measurement_str = measurement_data[i]
        
        # get text from bs4 element
        measurement_str = measurement_str.getText()
        
        # split into lines and ditch header elements (first 5 rows)
        measurement_str_arr = measurement_str.split('\n')[5:]
        
        # extract flight measurements
        flight_measurements = extract_flight_measurements(measurement_str_arr,
                                                          lower_bound_hPa,
                                                          upper_bound_hPa)
        
        # put results into one array
        for sounding in flight_measurements:
            
            # array to store one sounding
            sounding_record = []
            
            # store date and sounding data
            sounding_record.append(flight_datetime)
            sounding_record.extend(sounding)
        
            # append this array to return array
            one_month_flight_records.append(sounding_record)
        
    # return this array
    return one_month_flight_records

In [42]:
# testing for function

# get all <h2> tags
header_dates = soup.select('h2')

# get all <pre> tags
data = soup.select('pre')

# test function
month_data = extract_one_month_data(header_dates, data)

# looking for approx 5 soundings per flight, 2 flights per day, 31 days for the
# month we were testing

print(f"Expected number of records: {5*2*31}")
print(f"Actual number of records: {len(month_data)}")

# look at output
for row in month_data:
    print(row)

Expected number of records: 310
Actual number of records: 287
[datetime.datetime(2014, 12, 31, 16, 0), 724.0, 2899, -1.7, -5.3, 76, 3.58, 351, 25, 297.7, 308.6, 298.3]
[datetime.datetime(2014, 12, 31, 16, 0), 714.0, 3010, -1.9, -6.6, 70, 3.28, 344, 25, 298.7, 308.8, 299.2]
[datetime.datetime(2014, 12, 31, 16, 0), 703.0, 3134, -1.1, -10.1, 50, 2.54, 337, 26, 300.9, 308.9, 301.3]
[datetime.datetime(2014, 12, 31, 16, 0), 700.0, 3168, -1.3, -11.3, 47, 2.31, 335, 26, 301.0, 308.4, 301.4]
[datetime.datetime(2014, 12, 31, 16, 0), 699.0, 3179, -1.3, -11.3, 47, 2.32, 333, 26, 301.1, 308.5, 301.6]
[datetime.datetime(2014, 12, 31, 16, 0), 683.9, 3353, -2.5, -10.0, 56, 2.63, 310, 22, 301.7, 310.0, 302.2]
[datetime.datetime(2015, 1, 1, 4, 0), 730.0, 2794, -0.3, -8.3, 55, 2.81, 347, 15, 298.5, 307.3, 299.0]
[datetime.datetime(2015, 1, 1, 4, 0), 700.0, 3128, -2.7, -7.7, 68, 3.07, 325, 18, 299.5, 309.0, 300.0]
[datetime.datetime(2015, 1, 1, 16, 0), 729.8, 2743, -1.7, -2.4, 95, 4.41, 280, 27, 297.0, 31

It looks like compiling the date and sounding data for each record is working well.  Let's move on to our second-to-last function, which scrapes one month of data.

In [44]:
def scrape_one_month(year, month, stnm):
    """
    Takes a year, month, and station name.  Requests the appropriate
    page, parses the data from it, and returns the data in a list where each element
    is a list representing one sounding elevation on a given flight.
    ____________________
    
    Parameters:
                year: an int, representing a year
               month: an int, representing a month
                stnm: a string, representing a station name (id number)
    ____________________
    
    Returns:
          month_data: a list of lists.  Each element is a record from one sounding on one flight
               error: a list containing year and month if an error occurred, blank list otherwise
    
    """
    # set last_flight time
    last_flight_time = '12'
    
    # set the code for the last flight of the month, format is ddhh where hh is in zulu time
    # get first day of the current month
    current_month = date(year, month, 1)
    # jump to first day of next month
    next_month = current_month + relativedelta(months = +1)
    # subtracting one day to get the last day of the current month
    last_day = (next_month - timedelta(days=1)).day
    to = f"{last_day}{last_flight_time}"
    
    # build page request 
    url = (f"http://weather.uwyo.edu/cgi-bin/sounding?region=naconf&TYPE=TEXT%3ALIST"
           f"&YEAR={year}" 
           f"&MONTH={month}"
           f"&FROM=0100"
           f"&TO={to}"
           f"&STNM={stnm}")
    
    # set a control variable to throw an error if we are told busy more than 5 times in a row
    # for a single request
    too_busy_count = 0
    
    # loop up to five times trying request.
    while too_busy_count < 5:
    
        # make page request
        req = requests.get(url)

        # parse page request using bs4
        soup = bs4.BeautifulSoup(req.text, "lxml")
        
        # look for a keyword that only shows up in a good request
        if soup.text.find('Observations') > -1:
            
            # parse respective classes of bs4 object, break loop
            header_dates = soup.select('h2')
            data = soup.select('pre')
            break
        
        # otherwise if we have not received a successful request yet
        else:
            # iterate counter
            too_busy_count += 1
            # wait 10 seconds to hopefully avoid being too busy next time
            time.sleep(10)
        
    # instantiate empty error list; it stays empty on success
    error = []
    # instantiate empty data list; it stays empty on error
    month_data = [] 
    
    # if the too busy count is 5, we have no data. write to the error list.
    if too_busy_count == 5:
        error.extend([year, month])
    else:
        month_data = extract_one_month_data(header_dates, data)
    
    # return one month of data, also return error log
    return month_data, error
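
The retry logic in the function above can be isolated into a small helper that is easy to test without touching the network.  This is a sketch under stated assumptions: `fetch_with_retries` and its parameters are hypothetical names, the fetch and validity checks are passed in as callables, and unlike the loop above it does not sleep after the final failed attempt.

```python
import time


def fetch_with_retries(fetch, is_valid, max_attempts=5, wait_seconds=10,
                       sleep=time.sleep):
    """Hypothetical helper: call fetch() until is_valid(result) or attempts run out.

    Returns the first valid result, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if is_valid(result):
            return result
        # wait before trying again, but not after the last attempt
        if attempt < max_attempts:
            sleep(wait_seconds)
    return None


# stand-in for a busy server: succeeds on the third call
responses = iter(["busy", "busy", "Port Hardy Observations at 00Z"])
page = fetch_with_retries(
    fetch=lambda: next(responses),
    is_valid=lambda text: "Observations" in text,
    sleep=lambda seconds: None,  # skip the real wait in this demo
)
print(page)
# Port Hardy Observations at 00Z
```

In the scraper, `fetch` could wrap the `requests.get(url).text` call and `is_valid` the check for the `Observations` keyword.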

In [45]:
# test function
month_data, error = scrape_one_month(2016, 5, '71109')

# looking for approx 5 soundings per flight, 2 flights per day, 31 days for the
# month we were testing

print(f"Expected number of records: {5*2*31}")
print(f"Actual number of records: {len(month_data)}")

# look at output
for row in month_data:
    print(row)

Expected number of records: 310
Actual number of records: 289
[datetime.datetime(2016, 4, 30, 16, 0), 709.5, 3048, 0.8, -6.3, 59, 3.39, 285, 15, 302.2, 312.8, 302.8]
[datetime.datetime(2016, 4, 30, 16, 0), 703.0, 3123, 0.6, -7.4, 55, 3.13, 285, 16, 302.8, 312.6, 303.3]
[datetime.datetime(2016, 4, 30, 16, 0), 700.0, 3157, 0.6, -6.4, 59, 3.4, 285, 16, 303.1, 313.8, 303.7]
[datetime.datetime(2016, 4, 30, 16, 0), 697.0, 3192, 0.4, -5.6, 64, 3.63, 287, 16, 303.3, 314.6, 303.9]
[datetime.datetime(2016, 4, 30, 16, 0), 683.0, 3353, -0.9, -5.8, 70, 3.65, 295, 16, 303.5, 315.0, 304.2]
[datetime.datetime(2016, 5, 1, 4, 0), 729.0, 2820, 3.0, -2.0, 70, 4.55, 263, 12, 302.2, 316.3, 303.1]
[datetime.datetime(2016, 5, 1, 4, 0), 700.0, 3148, 0.4, -3.6, 75, 4.21, 275, 14, 302.9, 315.9, 303.7]
[datetime.datetime(2016, 5, 1, 4, 0), 682.3, 3353, -1.1, -4.2, 80, 4.14, 260, 15, 303.4, 316.3, 304.2]
[datetime.datetime(2016, 5, 1, 4, 0), 678.0, 3403, -1.5, -4.3, 81, 4.12, 262, 15, 303.6, 316.4, 304.3]
[datetim

Looks like the above is returning the same kind of information as our previous function, which is exactly what we wanted.  Moving on to our final piece of the puzzle, we are going to scrape monthly data for our date range for both balloon stations, get everything into a data frame, and finally save that data frame as a csv.

In [38]:
# define stations
stations = [['Port Hardy', '71109'],
            ['Quillayute', '72797']]

# define start and end dates as per seasons we are working with
# as well as months we don't need data for
start_date = date(2015, 1, 1)
end_date = date(2021, 4, 18)
out_of_season_months = [5,6,7,8,9,10]

# loop through stations
for station in stations:
    station_name = station[0]
    stnm = station[1]
    
    # setup array to store all station records and one for errors
    station_records = []
    error_records = []
    
    # set the current date so we can work with it without altering
    # the start date (needed for next loop iteration)
    current_date = start_date
    
    # loop through months in years, skip out of season months
    while (current_date <= end_date):
        
        # proceed if we are in season, else skip
        if current_date.month not in out_of_season_months:
        
            # get year and month from current date as ints
            year = int(current_date.year)
            month = int(current_date.month)

            # get records from current station for given year and month
            station_record, error = scrape_one_month(year, month, stnm)  
            
            # add records to the ongoing tally of records
            station_records.extend(station_record)

            # if there was anything in the error log, append it to error records
            # also output message of success or failure for given year/month combo
            if len(error) != 0:
                error_records.append(error)
                print(f"Year: {year}   Month: {month} produced error")
            else:
                print(f"Year: {year}   Month: {month} success!")
        
        # increment counter to next month
        current_date += relativedelta(months = +1)
        
    # get data into df
    df = pd.DataFrame(station_records,
                      columns = ['DATE',
                                 'PRES',
                                 'HGHT',
                                 'TEMP',
                                 'DWPT',
                                 'RELH',
                                 'MIXR',
                                 'DRCT',
                                 'SKNT',
                                 'THTA',
                                 'THTE',
                                 'THTV'])
    # add station name column
    df['STNM'] = station_name
    
    # write to csv
    path_fname = f"./Data/BallonData/{station_name}.csv"
    df.to_csv(path_fname,
              index = False)
    
    # get errors into a df
    df = pd.DataFrame(error_records, columns = ['Year', 'Month'])
    
    # write error records to csv
    path_fname = f"./Data/BallonData/{station_name}-Errors.csv"
    df.to_csv(path_fname,
              index = False)    

Year: 2015   Month: 1 success!
Year: 2015   Month: 2 success!
Year: 2015   Month: 3 success!
Year: 2015   Month: 4 success!
Year: 2015   Month: 11 success!
Year: 2015   Month: 12 success!
Year: 2016   Month: 1 success!
Year: 2016   Month: 2 success!
Year: 2016   Month: 3 success!
Year: 2016   Month: 4 success!
Year: 2016   Month: 11 success!
Year: 2016   Month: 12 success!
Year: 2017   Month: 1 success!
Year: 2017   Month: 2 success!
Year: 2017   Month: 3 success!
Year: 2017   Month: 4 success!
Year: 2017   Month: 11 success!
Year: 2017   Month: 12 success!
Year: 2018   Month: 1 success!
Year: 2018   Month: 2 success!
Year: 2018   Month: 3 success!
Year: 2018   Month: 4 success!
Year: 2018   Month: 11 success!
Year: 2018   Month: 12 success!
Year: 2019   Month: 1 success!
Year: 2019   Month: 2 success!
Year: 2019   Month: 3 success!
Year: 2019   Month: 4 success!
Year: 2019   Month: 11 success!
Year: 2019   Month: 12 success!
Year: 2020   Month: 1 success!
Year: 2020   Month: 2 success
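
The month loop above can be checked without making any requests by listing the (year, month) pairs it visits.  `in_season_months` is a hypothetical helper written for this check; it uses the same start date, end date, and out-of-season months as the cell above.

```python
from datetime import date


def in_season_months(start, end, skip_months):
    """Hypothetical helper: list (year, month) pairs from start to end, inclusive,
    leaving out any month number found in skip_months."""
    pairs = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        if month not in skip_months:
            pairs.append((year, month))
        # step to the next calendar month
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return pairs


months = in_season_months(date(2015, 1, 1), date(2021, 4, 18), [5, 6, 7, 8, 9, 10])
print(len(months))            # 40
print(months[0], months[-1])  # (2015, 1) (2021, 4)
```

That gives 40 requests per station: six in-season months for each of 2015 through 2020, plus January through April 2021.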

So the above has worked well and we now have our weather balloon data!  The excitement is due to a lot of hiccups putting everything together.  More specifically, the testing on all the helper functions was not robust enough, but all is sorted now.  The good news is we also have the functionality to get more data in the future if needed.

_Personal Note: 12hrs to this point, including re-learning webscraping_