<H1>Datascraping Weather Data from Web Search</H1>
<p><STRONG>Author:</STRONG> Siraj Sabihuddin, <STRONG>Date:</STRONG> July 27, 2021</p>
<p>The code below demonstrates the basics of data scraping. The goal is simple: query a search engine and grab weather data for a particular region from the search results. To help with this process I'll be referring to the following tutorials as well:</p>

<ol>
  <li><a href=https://aris-pattakos.medium.com/advanced-web-scraping-concepts-to-help-you-get-unstuck-17c0203de7ab>https://aris-pattakos.medium.com/advanced-web-scraping-concepts-to-help-you-get-unstuck-17c0203de7ab</a></li>
  <li><a href=https://www.geeksforgeeks.org/how-to-extract-weather-data-from-google-in-python/>https://www.geeksforgeeks.org/how-to-extract-weather-data-from-google-in-python/</a></li>
</ol> 

<HR>

The first step in this process is the imports, which bring in the basic tools we need for web scraping. The <STRONG>requests</STRONG> library provides tools for fetching HTML files from the web. It is combined with <STRONG>requests_file</STRONG>, a library installed directly from git, which allows a local HTML file to be loaded as well. You can find the requests-file repository at: https://github.com/dashea/requests-file. To install it follow the instructions at: https://medium.com/i-want-to-be-the-very-best/installing-packages-from-github-with-conda-commands-ebf10de396f4. Finally <STRONG>BeautifulSoup</STRONG> allows for parsing of HTML data.

In [1]:
import requests                                  # For fetching HTML files
from bs4 import BeautifulSoup                    # For Parsing HTML text data
from requests_file import FileAdapter            # For fetching local HTML files
import os                                        # For getting the directory path

At this point we can grab the input from the user: the location for which we want weather data. The format of this input matters, because spaces and other special characters must be encoded before they can appear in a URL. Likewise we need to add language and locality information to the query to make sure we are searching and getting results in the right language. Finally we need to send the right browser identifier (the User-Agent header) with the query.

In [2]:
# Enter the City Name
city = input("Enter the City Name: ")
search = "Weather+in+{}".format(city)

# Construct the query URL string for google. 
lang = "hl=en&gl=en"
url = f"http://www.google.com/search?q={search}&{lang}" 
#url = f"file:///G:/My%20Drive/Random/test.html"

# Setup the request so that it has the proper 
# browser identified and setup the header for the request
head = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36' }

Enter the City Name:  Taipei
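
The cell above inserts the city name into the URL exactly as typed, so a name containing spaces or accented characters is left unencoded. A minimal sketch of one way to do the conversion uses the standard library's urllib.parse.urlencode. The helper name build_weather_url and its parameters are hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

def build_weather_url(city, lang="en", region="en"):
    """Build a Google search URL for the weather in `city` (hypothetical helper)."""
    params = {"q": f"Weather in {city}", "hl": lang, "gl": region}
    # urlencode percent-encodes special characters and turns spaces into '+'
    return "http://www.google.com/search?" + urlencode(params)

print(build_weather_url("New York"))
# prints: http://www.google.com/search?q=Weather+in+New+York&hl=en&gl=en
```

The resulting string can be assigned to url in place of the f-string used above.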


Now we are ready to grab the web data. The traditional approach, shown below, uses requests directly rather than a headless browser. Once the request is made we store the response and use it later for BeautifulSoup parsing.

In [3]:
# Setup requests local file adapter for fetching local files
req_ = requests.Session()
req_.mount('file://', FileAdapter())

# Send HTTP request. 
# Pull HTTP data from internet or local file
req = req_.get(url, headers=head)

Once the data has been collected we use Beautiful Soup to parse the contents. We can also prettify them for display. The prettified version of the HTML fetched from Google in this way differs substantially from the Inspect Element version seen in the browser.

In [4]:
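# Parse the HTML received. Naming the parser explicitly avoids
# the warning bs4 gives when it has to guess one.
sor = BeautifulSoup(req.content, 'html.parser')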

# Clean up the html for viewing (this is for debugging 
# and figuring out the right class names etc. to extract from)
prettyhtml = sor.prettify()

We can do this comparison by saving the prettified version. To save the file we need to reconstruct the local directory base path and store it. You can find out how to do this here: https://www.makeuseof.com/how-to-get-the-current-directory-in-python/

In [5]:
# Construct a full path based on the operating system being used.
current_dir = os.getcwd()
prettyhtmlfile = r'extracted_html.html'
full_path = os.path.join(current_dir, prettyhtmlfile)

# Save the file with the constructed full path
# (the with block closes the file even if the write fails)
with open(full_path, 'w', encoding="utf-8") as save_file:
    save_file.write(prettyhtml)
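
The comparison itself can be sketched with the standard library's difflib. This is only one possible approach. The helper html_diff, the file label inspect_element.html and the two sample strings are assumptions made for illustration; in practice the two inputs would be the saved file and a copy of the HTML from the browser's Inspect Element view.

```python
import difflib

def html_diff(fetched_html, inspected_html, context=1):
    """Return unified-diff lines between two HTML strings (hypothetical helper)."""
    return list(difflib.unified_diff(
        fetched_html.splitlines(),
        inspected_html.splitlines(),
        fromfile="extracted_html.html",
        tofile="inspect_element.html",   # hypothetical label for the browser copy
        lineterm="",
        n=context,
    ))

# Made-up sample fragments, standing in for the two versions of the page
fetched = '<div>\n <div class="BNeawe">28</div>\n</div>'
inspected = '<div>\n <span id="wob_tm">28</span>\n</div>'

for line in html_diff(fetched, inspected):
    print(line)
```

Lines starting with "-" appear only in the fetched HTML and lines starting with "+" appear only in the Inspect Element copy.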

From here we can find the temperature using the id "wob_tm". Likewise we can inspect the HTML and find other data such as the precipitation ("wob_pp"), humidity ("wob_hm") and wind speed ("wob_ws"). If we fail to add the header, the delivered HTML is not the same as what we see when the header is included. Without the header we need to look for the BNeawe class, which is not present in the Inspect Element version of the HTML. This second class is therefore specific to the HTML fetched through the requests library.

In [6]:
# Find the temperature data in Celsius
temp = sor.find('span', attrs={'id': 'wob_tm'}).text 
# Find the precipitation chance data in %
precip = sor.find('span', attrs={'id': 'wob_pp'}).text
# Find the humidity in %
humid = sor.find('span', attrs={'id': 'wob_hm'}).text
# Find the wind speed in km/h
wind = sor.find('span', attrs={'id': 'wob_ws'}).text

# Output data
print('The temperature in {} is {} C'.format(city, temp))
print('The chance of precipitation in {} is {}'.format(city,precip))
print('The relative humidity in {} is {}'.format(city,humid))
print('The wind speed in {} is {}'.format(city,wind))

The temperature in Taipei is 28 C
The chance of precipitation in Taipei is 13%
The relative humidity in Taipei is 99%
The wind speed in Taipei is 3 km/h
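
If one of these ids is missing from the delivered HTML (for example when the header is left out), find returns None and the .text access raises an AttributeError. A minimal sketch of a more forgiving lookup follows. The helper text_by_id and the sample HTML string are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup

def text_by_id(soup, element_id, default=None):
    """Return the text of the element with the given id, or `default` if absent."""
    tag = soup.find(id=element_id)
    return tag.get_text(strip=True) if tag is not None else default

# Made-up fragment containing only two of the four weather ids
sample_html = '<div><span id="wob_tm">28</span><span id="wob_hm">99%</span></div>'
sample = BeautifulSoup(sample_html, "html.parser")

print(text_by_id(sample, "wob_tm"))                  # prints: 28
print(text_by_id(sample, "wob_pp", default="n/a"))   # prints: n/a
```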


There were some problems making GET requests for a URL. When the request is made without specifying the headers, it grabs the HTML delivered before any of the dynamic JavaScript content is loaded. In the case of Google, this HTML can have very different elements from the Inspect Element (browser function) version, so what <STRONG>requests.get()</STRONG> delivers does not match what the browser shows. To alleviate this problem, at the cost of speed, I'm using the <STRONG>selenium</STRONG> library to create a headless browser and control it directly to extract the final HTML. More details at: https://selenium-python.readthedocs.io/getting-started.html.

In [7]:
from selenium import webdriver                         # Using a headless web browser 
from selenium.webdriver.common.keys import Keys        # Provides keyboard input into headless browser

To use selenium, say with Chromium, I also need a driver for Chrome installed and placed in the appropriate path. The same is needed for Firefox, Safari, Edge, etc. The URL at: https://selenium-python.readthedocs.io/installation.html explains how to install selenium and get it running quickly.

In [12]:
chromeOptions = webdriver.ChromeOptions() 
chromeOptions.add_argument("--remote-debugging-port=9222")
driver = webdriver.Chrome(options=chromeOptions)

Alternatively, we can use the browser approach with selenium. Here we load the URL in the browser so that the elements can be looked up afterwards.

In [13]:
driver.get(url)

As with the previous direct attempt at data scraping, we can extract the temperature, precipitation, humidity and wind speed data through selenium. The difference is that selenium drives Chrome directly and the data is extracted through <STRONG>find_element_by_id</STRONG>. Note that Selenium 4.3 and later removed this helper; there the equivalent call is <STRONG>find_element(By.ID, ...)</STRONG>.

In [17]:
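from selenium.webdriver.common.by import By            # Locator strategies for find_element

# Find the temperature (Celsius), precipitation, humidity and wind speed.
# find_element(By.ID, ...) is the equivalent of find_element_by_id,
# which was removed in Selenium 4.3.
temp = driver.find_element(By.ID, "wob_tm").text
precip = driver.find_element(By.ID, "wob_pp").text
humid = driver.find_element(By.ID, "wob_hm").text
wind = driver.find_element(By.ID, "wob_ws").text

# Output data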
print('The temperature in {} is {} C'.format(city, temp))
print('The chance of precipitation in {} is {}'.format(city,precip))
print('The relative humidity in {} is {}'.format(city,humid))
print('The wind speed in {} is {}'.format(city,wind))

The temperature in Taipei is 28 C
The chance of precipitation in Taipei is 13%
The relative humidity in Taipei is 99%
The wind speed in Taipei is 3 km/h
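
The values printed above are strings such as "13%" and "3 km/h". If numbers are needed for later calculations, one possible conversion is sketched here. The helper leading_number is hypothetical and simply takes the first number it finds in the text.

```python
import re

def leading_number(text):
    """Return the first number in `text` as a float, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

print(leading_number("13%"))      # prints: 13.0
print(leading_number("3 km/h"))   # prints: 3.0
```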


Once the data has been captured we can close the selenium browser and end the program.

In [11]:
driver.close()