<img src = "https://escp.eu/sites/default/files/logo/ESCP-logo-white-misalign.svg" width = 400 style="background-color: #240085;">
<h1 align=center><font size = 6>ESCP Business School</font></h1>
<h3 align=center><font size = 5>SCOR Datathon</font><br/>
<font size = 3>The Data Science Challenge Bridging Indian Agricultural Protection Gap</font></h3>
<h6 align=center>Additional Data - Web Scraping (worldweatheronline.com)</h6>

Last Updated: February 15, 2022\
Author: Group 21 - Anniek Brink, Jeanne Dubois, and Resha Dirga

<h3>Chapter Objectives</h3>

<p>As part of the analysis, we use external climate data such as temperature, humidity, and sun hours. This information is obtained from publicly available websites, so this document automates its collection with web scraping.</p>

<p><i><u>Note:</u></i> Since every website is coded differently, this web scraping document is dedicated to retrieving information from a single website: <a href="https://www.worldweatheronline.com/">worldweatheronline.com</a></p>

<h3>Chapter 1: Import modules</h3>
<p>This chapter lists and imports all modules used in this document.</p>

In [1]:
# Install dependencies
# !pip install pandas
# !pip install beautifulsoup4
# !pip install selenium

# Load dependencies
import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup
import json
import time

import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

<h3>Chapter 2: Setups</h3>
<p>This chapter sets up the key function, the path to the driver used for web scraping, and the website links to retrieve.</p>

In [2]:
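# Create function to get the rendered page as a BeautifulSoup object
from selenium.webdriver.firefox.service import Service

def get_response(url):
    options = Options()
    options.add_argument('-headless')  # run Firefox without opening a window
    # Selenium 4 takes the driver location through a Service object
    service = Service(executable_path=geckodriver_path)
    driver = webdriver.Firefox(options=options, service=service)
    try:
        driver.get(url)
        time.sleep(15)  # give the page's scripts time to render
        html = driver.execute_script('return document.documentElement.outerHTML')
    finally:
        driver.quit()  # always close the browser, even if loading fails
    return BeautifulSoup(html, 'html.parser')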
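
The fixed 15-second pause in `get_response` can be traded for polling: check the page repeatedly and continue as soon as the content of interest is present. The sketch below is one possible way to do that, not the method used in this document. `fetch_rendered_html` and its `is_ready` callback are hypothetical names, and the driver is assumed to be any object exposing `get` and `page_source`, as Selenium drivers do.

```python
import time


def fetch_rendered_html(driver, url, is_ready, timeout=15.0, poll=0.5):
    """Load `url` and return the page source once `is_ready(driver)` is true.

    Raises TimeoutError if the page is not ready within `timeout` seconds.
    """
    driver.get(url)
    deadline = time.monotonic() + timeout
    while True:
        if is_ready(driver):
            return driver.page_source
        if time.monotonic() >= deadline:
            raise TimeoutError(f'{url} was not ready after {timeout} s')
        time.sleep(poll)  # wait a little before checking again
```

With a real driver, `is_ready` could be as simple as `lambda d: 'aria-label' in d.page_source`. Selenium's own `WebDriverWait`, already imported in Chapter 1, offers the same kind of polling.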

In [3]:
# Setup geckodriver_path
geckodriver_path = '/Users/admin/Downloads/geckodriver' # This path will need to be modified for each PC with the location of the geckodriver
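
Because the driver location differs from one machine to the next, it can be looked up instead of edited by hand. This is a minimal sketch under stated assumptions: the environment variable `GECKODRIVER_PATH` and the helper `resolve_geckodriver_path` are hypothetical and not part of the original setup.

```python
import os
import shutil


def resolve_geckodriver_path(explicit_path=None):
    """Return the first existing geckodriver location among several candidates."""
    candidates = [
        explicit_path,                          # path given by the caller
        os.environ.get('GECKODRIVER_PATH'),     # hypothetical environment variable
        shutil.which('geckodriver'),            # executable found on the system PATH
    ]
    for candidate in candidates:
        if candidate and os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError('geckodriver not found; pass its location explicitly')
```

The result could then be assigned to `geckodriver_path`, for example `geckodriver_path = resolve_geckodriver_path('/Users/admin/Downloads/geckodriver')`.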

In [4]:
# Read dataset containing links for webscraping
df = pd.read_csv('cities_climate/link_to_extract.csv')

In [5]:
# Get the links from worldweatheronline.com to be retrieved
list_link_region = df['Link'].dropna()
list_link_region = [k for k in list_link_region if 'worldweatheronline.com' in k]
link_to_retrieve = list_link_region
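
The substring test above keeps any link that merely contains the text `worldweatheronline.com`, including one that mentions it in a query string. A stricter variant checks the hostname. The sketch below is an illustration only: `is_wwo_link` and `select_wwo_links` are hypothetical helpers, and dropping duplicate links is an added assumption that the original cell does not make.

```python
from urllib.parse import urlparse

WWO_DOMAIN = 'worldweatheronline.com'


def is_wwo_link(link):
    """True if `link` is a string whose hostname belongs to worldweatheronline.com."""
    if not isinstance(link, str):
        return False  # covers NaN and other missing values
    host = urlparse(link.strip()).hostname or ''
    return host == WWO_DOMAIN or host.endswith('.' + WWO_DOMAIN)


def select_wwo_links(links):
    """Keep matching links in their original order, without duplicates."""
    seen = set()
    selected = []
    for link in links:
        if is_wwo_link(link) and link not in seen:
            seen.add(link)
            selected.append(link)
    return selected
```

Links without a scheme (no `https://` prefix) have no hostname for `urlparse` to find, so this version rejects them.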

<h3>Chapter 3: Web scraping</h3>
<p>This chapter scrapes the links defined in the Setups chapter.</p>
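
The extraction cell in this chapter reads its values from the `aria-label` text of the chart elements and cuts each line apart with `split`. The same step can be written with a regular expression. This sketch assumes each label line has the shape `<month>, <number> <parameter name>.`, which is inferred from how the cell splits the text and is not confirmed against the website. `parse_label_line` is a hypothetical helper, and the sample strings in the usage note are made up for illustration.

```python
import re

# Assumed shape of one label line: "<month>, <number> <parameter name>."
LABEL_PATTERN = re.compile(r'^[^,]*,\s*(-?\d+(?:\.\d+)?)\.?\s+(.*?)\.?\s*$')


def parse_label_line(line):
    """Return (value, parameter_name), or None if the line does not match."""
    match = LABEL_PATTERN.match(line.strip())
    if match is None:
        return None
    return float(match.group(1)), match.group(2)
```

For example, `parse_label_line('January, 31 Average High Temp (°c).')` gives `(31.0, 'Average High Temp (°c)')`, and a line that does not fit the pattern gives `None` instead of raising an error.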

In [6]:
# Extract data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

list_states = []
list_cities = []
list_months = []
list_avg_temps = []
list_min_temps = []
list_max_temps = []
list_rainfalls = []
list_humidities = []
list_rainy_days = []
list_sun_hours = []
list_index = []

for position, link in enumerate(link_to_retrieve, start=1):
    print(f'{position}/{len(link_to_retrieve)}')  # progress counter
    url = link
    try:
        response = get_response(url)
        response_zoom = response.select('div[class="box2"]')[0]
        response_cities = response_zoom.find_all('h1')[0].text
        cities = response_cities.split(' ',1)[0]
        cities = [cities] * len(months)
        response_states = response_zoom.find_all('p')[0].text
        states = response_states.split(',',1)[0]
        states = [states] * len(months)

        avg_temps = [np.nan] * len(months)
        humidities = [np.nan] * len(months)
        indexes = [link] * len(months)

        list_states = list_states + states
        list_cities = list_cities + cities
        list_months = list_months + months
        list_avg_temps = list_avg_temps + avg_temps
        list_humidities = list_humidities + humidities
        list_index = list_index + indexes

        response_zoom = response.select('path[aria-label]')
        for tag in response_zoom:
            aria_labels = BeautifulSoup(tag['aria-label'], 'html.parser').text.split('\n')
            temps = [labels for labels in aria_labels if 'Average' in labels]
            if len(temps):
                labels = temps[0]
                param = labels.split(',')[1]
                param_value = param.split(' ', 2)[1]
                param_value = param_value.split('.')[0]
                param_type = param.split(' ', 2)[2]

                if param_type == 'Average High Temp (°c).':
                    list_max_temps.append(int(param_value))
                elif param_type == 'Average Low Temp (°c).':
                    list_min_temps.append(int(param_value))
                elif param_type == 'Average Rainfall Days.':
                    list_rainy_days.append(int(param_value))

        response_zoom = response.select('rect[aria-label]')
        for tag in response_zoom:
            aria_labels = BeautifulSoup(tag['aria-label'], 'html.parser').text.split('\n')
            temps = [labels for labels in aria_labels if 'Sun Hours' in labels]
            if len(temps):
                labels = temps[0]
                param = labels.split(',')[1]
                param_value = param.split(' ', 2)[1]
                # Keep whole hours only; the fractional part is discarded
                list_sun_hours.append(int(param_value.split('.')[0]))

            temps = [labels for labels in aria_labels if 'Precipitation' in labels]
            if len(temps):
                labels = temps[0]
                param = labels.split(',')[1]
                param_value = param.split(' ', 2)[1]
                list_rainfalls.append(float(param_value))
                
        print('Extraction complete')
        
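    except Exception as exc:
        # Report which link failed and why, then continue with the next one
        print('Cannot extract data from ' + str(url) + ': ' + repr(exc))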
        
result = {
                'State': list_states,
                'City': list_cities,
                'Month': list_months,
                'Avg. Temperature °C (°F)': list_avg_temps,
                'Min. Temperature °C (°F)': list_min_temps,
                'Max. Temperature °C (°F)': list_max_temps,
                'Precipitation / Rainfall mm (in)': list_rainfalls,
                'Humidity(%)': list_humidities,
                'Rainy days (d)': list_rainy_days,
                'avg. Sun hours (hours)': list_sun_hours,
                'index': list_index
            }

df_result = pd.DataFrame(result)
print('Data extraction complete!')

1/1


Extraction complete
Data extraction complete!
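
The extraction cell appends to eleven shared lists, so a page that fails halfway or returns fewer than twelve values for one parameter leaves the lists with different lengths, and `pd.DataFrame(result)` then raises an error. One way to guard against this is to assemble each city separately and only keep it when every series is complete. The sketch below illustrates that idea. It is not the code that produced the output above, and `build_city_rows` is a hypothetical helper.

```python
MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']


def build_city_rows(state, city, link, series):
    """Turn per-parameter monthly lists into one row (dict) per month.

    `series` maps a column name to a list holding one value per month.
    Raises ValueError if any list does not have exactly twelve values.
    """
    for name, values in series.items():
        if len(values) != len(MONTHS):
            raise ValueError(f'{name}: expected {len(MONTHS)} values, got {len(values)}')
    rows = []
    for i, month in enumerate(MONTHS):
        row = {'State': state, 'City': city, 'Month': month, 'index': link}
        for name, values in series.items():
            row[name] = values[i]
        rows.append(row)
    return rows
```

Rows from all cities can be collected in a single list and passed to `pd.DataFrame`, which accepts a list of dictionaries. A city that raises `ValueError` can be skipped without affecting the others.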


In [7]:
# Export csv as checkpoint for further preprocessing
df_result.to_csv('cities_climate/SCOR_Cities_Climate_WWO.csv')