<img src = "https://escp.eu/sites/default/files/logo/ESCP-logo-white-misalign.svg" width = 400 style="background-color: #240085;">
<h1 align=center><font size = 6>ESCP Business School</font></h1>
<h3 align=center><font size = 5>SCOR Datathon</font><br/>
<font size = 3>The Data Science Challenge Bridging Indian Agricultural Protection Gap</font></h3>
<h6 align=center>Additional Data - Web Scraping (climate-data.org)</h6>

Last Updated: February 15, 2022\
Author: Group 21 - Anniek Brink, Jeanne Dubois, and Resha Dirga

<h3>Chapter Objectives</h3>

<p>As part of the analysis, we will use external data such as temperature, humidity, and sun hours. This information is obtained from publicly available websites. This document automates the retrieval of that information using web scraping.</p>

<p><i><u>Note:</u></i> Since every website is coded differently, each web scraping document targets a single website. This one retrieves information from <a href="https://en.climate-data.org/">en.climate-data.org</a>.</p>

<h3>Chapter 1: Import modules</h3>
<p>This chapter lists and imports all modules used in this document.</p>

In [1]:
# Install dependencies
# !pip install pandas
# !pip install beautifulsoup4
# !pip install selenium

# Load dependencies
import pandas as pd
import numpy as np

import requests
from bs4 import BeautifulSoup
import json
import time

import selenium
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

<h3>Chapter 2: Setups</h3>
<p>This chapter defines the key function, the location of the driver used to perform web scraping, and the website links to retrieve.</p>

In [2]:
# Render the page in headless Firefox and return it as parsed HTML
def get_response(url):
    options = Options()
    options.headless = True
    driver = webdriver.Firefox(options=options, executable_path=geckodriver_path)
    try:
        driver.get(url)
        html = driver.execute_script('return document.documentElement.outerHTML')
    finally:
        # Always close the browser, otherwise each call leaves a Firefox process running
        driver.quit()
    soup = BeautifulSoup(html, 'html.parser')
    return soup
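
<p>Page loads can fail intermittently, for example on a timeout or a driver start-up error. A minimal sketch of a retry wrapper follows. The helper name <code>fetch_with_retries</code> and its default values are assumptions made for illustration and are not part of the original notebook. It accepts any fetch function, such as <code>get_response</code>, and pauses between attempts.</p>

```python
import time


def fetch_with_retries(fetch, url, attempts=3, delay=2.0):
    """Call fetch(url) up to `attempts` times, sleeping `delay` seconds between tries.

    Hypothetical helper for illustration; `attempts` and `delay` are assumed defaults.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose: any driver or network failure
            last_error = error
            print(f'Attempt {attempt}/{attempts} failed for {url}: {error}')
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: surface the last error to the caller
    raise last_error
```

<p>Under these assumptions, a call would look like <code>fetch_with_retries(get_response, url)</code>.</p>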

In [3]:
# Setup geckodriver_path
geckodriver_path = '/Users/admin/Downloads/geckodriver' # This path will need to be modified for each PC with the location of the geckodriver

In [4]:
# Read dataset containing links for webscraping
df = pd.read_csv('cities_climate/link_to_extract.csv')

In [7]:
# Get the links from climate-data.org to be retrieved
list_link_region = df['Link'].dropna()
list_link_region = [k for k in list_link_region if 'climate-data.org' in k]
link_to_retrieve = list_link_region
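
<p>The substring test above also accepts any link that merely mentions <code>climate-data.org</code> somewhere in its text. A stricter variant, sketched here with the standard library, compares the hostname instead. The function name <code>is_climate_data_link</code> and the second example link are hypothetical.</p>

```python
from urllib.parse import urlparse


def is_climate_data_link(link, domain='climate-data.org'):
    """Return True when the link's hostname is `domain` or one of its subdomains."""
    if not isinstance(link, str):
        return False
    host = urlparse(link).hostname or ''
    return host == domain or host.endswith('.' + domain)


example_links = [
    'https://en.climate-data.org/asia/india/khammam/khammam-4940/',
    # Hypothetical link that a plain substring test would let through
    'https://example.com/?ref=climate-data.org',
]
filtered = [k for k in example_links if is_climate_data_link(k)]
print(filtered)
```

<p>Only the first link survives the filter, because the second one has <code>example.com</code> as its hostname.</p>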

In [11]:
link_to_retrieve

['https://en.climate-data.org/asia/india/warangal/warangal-968182/',
 'https://en.climate-data.org/asia/india/khammam/khammam-4940/']
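
<p>Each link ends with two path segments of the form <code>&lt;state&gt;/&lt;city&gt;-&lt;id&gt;/</code>, which is where the state and city names in the scraped data come from. A small sketch of that split, using a hypothetical helper <code>split_location</code> built on <code>urllib.parse</code> rather than on the notebook's own string splitting:</p>

```python
from urllib.parse import urlparse


def split_location(link):
    """Return (state, city) from a link shaped like .../<state>/<city>-<id>/."""
    # Drop empty segments so a missing trailing slash does not shift the positions
    segments = [s for s in urlparse(link).path.split('/') if s]
    state = segments[-2]
    # Strip the numeric id that follows the last hyphen
    city = segments[-1].rsplit('-', 1)[0]
    return state, city


print(split_location('https://en.climate-data.org/asia/india/warangal/warangal-968182/'))
```

<p>For the two links above, the state and city segments happen to be identical (<code>warangal</code> and <code>khammam</code>).</p>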

<h3>Chapter 3: Web scraping</h3>
<p>This chapter scrapes the links defined in the Setups chapter.</p>

In [12]:
# Extract data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

list_states = []
list_cities = []
list_months = []
list_avg_temps = []
list_min_temps = []
list_max_temps = []
list_rainfalls = []
list_humidities = []
list_rainy_days = []
list_sun_hours = []
list_index = []

# Map each row label of the weather table to the list that collects its values
row_targets = {
    'Avg. Temperature °C (°F)': list_avg_temps,
    'Min. Temperature °C (°F)': list_min_temps,
    'Max. Temperature °C (°F)': list_max_temps,
    'Precipitation / Rainfall mm (in)': list_rainfalls,
    'Humidity(%)': list_humidities,
    'Rainy days (d)': list_rainy_days,
    'avg. Sun hours (hours)': list_sun_hours,
}

for i, link in enumerate(link_to_retrieve, start=1):
    print(f'{i}/{len(link_to_retrieve)}')
    url = link
    try:
        response = get_response(url)
        response_zoom = response.select('table[id="weather_table"]')[0]
        response_zoom = response_zoom.find_all('tr')[1:]

        # Parse into a temporary dict first, so a failure midway
        # does not leave the result lists with unequal lengths
        scraped = {label: [] for label in row_targets}
        for tr in response_zoom:
            tds = tr.find_all('td')
            if not tds or tds[0].text not in scraped:
                continue
            for cell in tds[1:]:
                scraped[tds[0].text].append(cell.select('p')[0].decode_contents())

        # Link layout: .../<state>/<city>-<id>/
        city = url.rsplit('/', 2)[1].rsplit('-', 1)[0]
        state = url.rsplit('/', 3)[1]

        for label, target in row_targets.items():
            target.extend(scraped[label])
        list_states.extend([state] * len(months))
        list_cities.extend([city] * len(months))
        list_months.extend(months)
        list_index.extend([link] * len(months))

    except Exception as error:
        print(f'Cannot extract data from {url}: {error}')
        
result = {
    'State': list_states,
    'City': list_cities,
    'Month': list_months,
    'Avg. Temperature °C (°F)': list_avg_temps,
    'Min. Temperature °C (°F)': list_min_temps,
    'Max. Temperature °C (°F)': list_max_temps,
    'Precipitation / Rainfall mm (in)': list_rainfalls,
    'Humidity(%)': list_humidities,
    'Rainy days (d)': list_rainy_days,
    'avg. Sun hours (hours)': list_sun_hours,
    'index': list_index
}

df_result = pd.DataFrame(result)
print('Data extraction complete!')

1/2


2/2
Data extraction complete!
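
<p><code>pd.DataFrame</code> raises a <code>ValueError</code> when the lists in the result dictionary differ in length, which can happen if a page lacks one of the expected table rows. A minimal sketch of a pre-check that names the offending columns follows. The helper <code>find_length_mismatches</code> and the sample dictionary with placeholder values are hypothetical.</p>

```python
def find_length_mismatches(columns, expected):
    """Return {column name: length} for every column whose length differs from `expected`."""
    return {
        name: len(values)
        for name, values in columns.items()
        if len(values) != expected
    }


# Hypothetical miniature of the result dict: two cities, 12 months each,
# with the humidity row missing for one city (placeholder values only)
sample = {
    'City': ['warangal'] * 12 + ['khammam'] * 12,
    'Month': ['<month>'] * 24,
    'Humidity(%)': ['<value>'] * 12,
}
print(find_length_mismatches(sample, expected=24))
```

<p>With the notebook's own variables, the expected length would be <code>len(months)</code> times the number of links scraped successfully.</p>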


In [13]:
# Export csv as checkpoint for further preprocessing
df_result.to_csv('cities_climate/SCOR_Cities_Climate_CDO_add.csv')