# Creating the wunder_scraper() function

This Jupyter Notebook creates the wunder_scraper() function we used to scrape historical weather data from Weather Underground (wunderground.com) to compare with our NYC Yellow Taxi dataset.

In [1]:
# import dependencies

from bs4 import BeautifulSoup as bs  # parses the rendered page HTML
import pandas as pd                  # holds the scraped table
from splinter import Browser         # drives Chrome through chromedriver

We created a function to scrape the "Daily Observations" table on the Weather Underground website, which lists historical hourly weather readings, so that we could pair weather data with our NYC taxi data. This weather data used to be available for free via API key but is now locked behind a paywall, so we scrape! First, wunder_scraper() retrieves the site HTML via splinter and chromedriver.exe. It then initializes an empty list for each column of the table, loops over the HTML looking for the tags and classes that mark each column's cells, and appends each value it finds to the matching list. Once all lists are complete, wunder_scraper() dumps them into a dataframe.
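The tag-and-class lookup described above can be tried offline before pointing a browser at the live site. The sketch below runs the same `find_all` / `find` pattern against a small hand-written HTML fragment. The fragment is a simplified stand-in that reuses the class names from our scraper, not the real page source, and `extract_column` is a hypothetical helper introduced only for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that imitates the table's structure
# (an assumption for illustration, not the real page source).
SAMPLE_HTML = """
<table class="mat-table">
  <tr>
    <td class="mat-column-temperature"><span class="wu-value">71</span></td>
    <td class="mat-column-dewPoint"><span class="wu-value">59</span></td>
  </tr>
  <tr>
    <td class="mat-column-temperature"><span class="wu-value">68</span></td>
    <td class="mat-column-dewPoint"><span class="wu-value">58</span></td>
  </tr>
</table>
"""


def extract_column(table, cell_class, span_class="wu-value"):
    """Return the text of the value span inside each cell of one column."""
    values = []
    for cell in table.find_all("td", class_=cell_class):
        span = cell.find("span", class_=span_class)
        # Store None when a cell lacks the expected span so columns stay aligned.
        values.append(span.text.strip() if span is not None else None)
    return values


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", class_="mat-table")
temperature = extract_column(table, "mat-column-temperature")
dew_point = extract_column(table, "mat-column-dewPoint")
print(temperature, dew_point)
```

Recording `None` for a missing span is a design choice of this sketch: it keeps every list the same length, which matters when the lists are later combined into one dataframe.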

In [2]:
# create web scraping function
def wunder_scraper(url):
    # connect to website
    executable_path = {"executable_path": "chromedriver.exe"}
    browser = Browser("chrome", **executable_path, headless=False)        
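    browser.visit(url)
    html = browser.html
    # close the browser once the page source is captured so windows don't pile up across calls
    browser.quit()
    soup = bs(html, "html.parser")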
    data = soup.find_all("table", class_="mat-table")
        
    # initialize lists
    col = []
    time = []
    temperature = []
    dew_point = [] 
    humidity = []
    wind = []
    wind_speed = []
    wind_gust = []
    pressure = []
    precipitation = []
    condition = []
        
    # scrape table data
    for d in data:
        headers = d.find_all("button", class_="mat-sort-header-button")
        for h in headers:
            col.append(h.text)
        times = d.find_all("td", "cdk-column-dateString")
        for tm in times:
            t = tm.find("span", "ng-star-inserted").text
            time.append(t)
        temps = d.find_all("td", class_="mat-column-temperature")
        for temp in temps:
            te = temp.find("span", class_="wu-value").text
            temperature.append(te)
        dew_pts = d.find_all("td", class_="mat-column-dewPoint")
        for dew in dew_pts:
            de = dew.find("span", class_="wu-value").text
            dew_point.append(de)
        humidities = d.find_all("span", class_="wu-unit-humidity")
        for humid in humidities:
            hu = humid.find("span", class_="wu-value").text
            humidity.append(hu)
        winds = d.find_all("td", class_="mat-column-windcardinal")
        for win in winds:
            w = win.find("span", class_="ng-star-inserted").text
            wind.append(w)
        wind_speeds = d.find_all("td", class_="mat-column-windSpeed")
        for wind_spd in wind_speeds:
            ws = wind_spd.find("span", class_="wu-value").text
            wind_speed.append(ws)
        wind_gusts = d.find_all("td", class_="mat-column-windGust")
        for wind_gst in wind_gusts:
            wg = wind_gst.find("span", class_="wu-value").text
            wind_gust.append(wg)        
        pressures = d.find_all("td", class_="mat-column-pressure")
        for press in pressures:
            ps = press.find("span", class_="wu-value").text
            pressure.append(ps)
        prcps = d.find_all("td", class_="mat-column-precipRate")
        for precip in prcps:
            pr = precip.find("span", class_="wu-value").text
            precipitation.append(pr)
        conditions = d.find_all("td", class_="mat-column-condition")
        for cond in conditions:
            c = cond.find("span").text
            condition.append(c)  
            
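    # read the day of the month from the end of the URL (e.g. ".../date/2019-6-15")
    # instead of relying on a global variable named `day`
    day = int(url.rsplit("-", 1)[-1])

    # dump into dataframe
    df = pd.DataFrame({'Day (June 2019)': day,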
                       'Time': time,
                       'Temperature (F)': temperature, 
                       'Dew Point (F)': dew_point,
                       'Humidity (%)': humidity, 
                       'Wind': wind,
                       'Wind Speed (mph)': wind_speed, 
                       'Wind Gust (mph)': wind_gust,
                       'Pressure (in)': pressure,
                       'Precipitation (in)': precipitation,
                       'Condition': condition})
    return df

Next, we initialized an empty dataframe. We then queried a centrally located NYC weather station for the month of June 2019 to populate it. Each webpage contains a single day's hourly weather table, so we created a for loop that scrapes the table, appends the result to our dataframe, and increments the day by 1 (via the URL) through June 30. Finally, we exported the dataframe to a CSV so that we can access the data without rerunning the scraper every time.
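The day-by-day URL pattern described above could be generalized to other months. The following is a minimal sketch that assumes the URL keeps the `date/<year>-<month>-<day>` ending for every date; `daily_urls` and `BASE_URL` are hypothetical names introduced here, and the number of days is looked up with the standard library rather than hard-coded.

```python
import calendar

# Station path for KNYNEWYO595, "East Village Station," NYC.
BASE_URL = "https://www.wunderground.com/history/daily/us/ny/new-york-city/KNYNEWYO595/date"


def daily_urls(year, month):
    """Yield (day, url) pairs for every day of the given month."""
    _, days_in_month = calendar.monthrange(year, month)
    for day in range(1, days_in_month + 1):
        yield day, f"{BASE_URL}/{year}-{month}-{day}"


june_urls = list(daily_urls(2019, 6))
print(len(june_urls))   # number of pages to scrape
print(june_urls[0][1])  # URL for the first day
```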

In [3]:
june_df = pd.DataFrame()

for day in range(1,31):
    # querying weather station KNYNEWYO595, "East Village Station," NYC
    url = f'https://www.wunderground.com/history/daily/us/ny/new-york-city/KNYNEWYO595/date/2019-6-{day}'
    df = wunder_scraper(url)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    june_df = pd.concat([june_df, df], ignore_index=True)

june_df.to_csv("Resources/wunderground_june_2019.csv", index=False)
june_df

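| | Day (June 2019) | Time | Temperature (F) | Dew Point (F) | Humidity (%) | Wind | Wind Speed (mph) | Wind Gust (mph) | Pressure (in) | Precipitation (in) | Condition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 11:51 PM | 71 | 59 | 66 | S | 3 | 0 | 29.78 | 0.0 | Fair |
| 1 | 1 | 12:51 AM | 68 | 58 | 70 | NW | 5 | 0 | 29.77 | 0.0 | Fair |
| 2 | 1 | 1:51 AM | 68 | 57 | 68 | NNW | 7 | 0 | 29.75 | 0.0 | Fair |
| 3 | 1 | 2:51 AM | 67 | 57 | 70 | N | 7 | 0 | 29.76 | 0.0 | Fair |
| 4 | 1 | 3:51 AM | 65 | 57 | 75 | E | 7 | 0 | 29.79 | 0.0 | Partly Cloudy |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 751 | 30 | 6:51 PM | 78 | 55 | 45 | NNW | 13 | 0 | 29.77 | 0.0 | Mostly Cloudy |
| 752 | 30 | 7:51 PM | 77 | 53 | 43 | NNW | 21 | 25 | 29.80 | 0.0 | Fair / Windy |
| 753 | 30 | 8:51 PM | 75 | 54 | 48 | NNW | 14 | 0 | 29.83 | 0.0 | Cloudy |
| 754 | 30 | 9:51 PM | 73 | 54 | 51 | NNW | 13 | 0 | 29.84 | 0.0 | Partly Cloudy |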
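Because every value is scraped as text, the columns likely need type conversion before they can be compared with the taxi data. The sketch below shows one possible approach on a three-row sample copied from the output above: numeric columns are converted with `pd.to_numeric`, and a `Timestamp` column (a name introduced here for illustration) is built from the day and time. It assumes every row belongs to June 2019 and that times follow the `H:MM AM/PM` pattern shown.

```python
import pandas as pd

# Three rows copied from the scraped output, with values kept as text.
sample = pd.DataFrame({
    "Day (June 2019)": [1, 1, 30],
    "Time": ["12:51 AM", "1:51 AM", "7:51 PM"],
    "Temperature (F)": ["68", "68", "77"],
    "Wind Gust (mph)": ["0", "0", "25"],
})

# Convert text columns to numbers; anything unparseable becomes NaN.
numeric_cols = ["Temperature (F)", "Wind Gust (mph)"]
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Combine the day and the 12-hour clock time into one datetime column
# (assumes all rows are from June 2019).
stamp_text = (
    "2019-06-"
    + sample["Day (June 2019)"].astype(str).str.zfill(2)
    + " "
    + sample["Time"]
)
sample["Timestamp"] = pd.to_datetime(stamp_text, format="%Y-%m-%d %I:%M %p")

print(sample[["Timestamp", "Temperature (F)", "Wind Gust (mph)"]])
```

A single datetime column makes it straightforward to align each weather reading with taxi pickups from the same hour.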
