<a href="https://colab.research.google.com/github/tophercollins/eifo-data-extraction/blob/main/eifo_webscraping_v1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# EIFO Country Risk & Cover Policies Web Scraping

## Task
*   Write a Python script to scrape data from the EIFO website (https://subdomain.eifo.dk/en/countries).
*   Scrape the Country Risk Classification & EIFO Cover Policies for Public Buyers, Private Buyers, and Banks for each country.
*   Export the data into an Output Sheet in *.xlsm format.

## Notes

* https://subdomain.eifo.dk/en/countries - Initial page has an input box with the list of all countries to scrape
* https://subdomain.eifo.dk/en/countries/germany - Country example, here Germany
* https://subdomain.eifo.dk/en/countries/COUNTRY_NAME_HERE - Country url format

Some countries have a single universal policy that applies to every situation/timeline, as shown below.
<img src='https://tophercollins.github.io/images/universal_policy.png'>

Other countries have contextual policies that differ by situation/timeline, as shown below.
<img src='https://tophercollins.github.io/images/contextual_policy.png'>

## Steps
1. Scrape list of countries.
  * Countries list found in input box class `autocomplete__search-input`.
  * The list only appears after interacting with the input box, so we use Selenium.
  * Each country found in div class `autocomplete-suggestions__item`.
  * Also contains info for rating/risk classification.
2. Scrape a single country.
  * Each set of policy info is found in div class `accordion-item`.
  * Type found in div class `accordion-header`.
  * Info found in div class `accordion-table`.
3. Scrape all countries.
  * Format country names containing symbols that won't convert to the URL.
    * Spaces and single quote marks `'` change to a dash `-`.
    * Periods `.` and round brackets `(` `)` are removed.
  * After scraping, format the policy information.
    * Provide the universal policy where one applies to all contexts.
    * For countries with multiple policies, provide each policy with its contexts in round brackets `( )`.
4. Export the DataFrame to an xlsm output file called `Final.xlsm` with an `Output` sheet.



## 0. Setup

In [None]:
# Install Packages
!pip install selenium
!apt-get update
!apt install -y chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin


In [None]:
# Import Packages
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time

# Selenium imports for driving a headless Chrome browser
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [None]:
# Base url variable
BASE_URL = "https://subdomain.eifo.dk/en/countries"

## 1. Scrape list of countries

In [None]:
# Set up the WebDriver
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1920,1080')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-extensions')
options.add_argument('--disable-popup-blocking')
options.add_argument('--disable-logging')

driver = webdriver.Chrome(options=options)

# Send driver to website
driver.get(BASE_URL)

try:
    # Wait for the input box to be present
    wait = WebDriverWait(driver, 10)
    input_box = wait.until(EC.presence_of_element_located((By.ID, "autocomplete__search-input")))

    # Interact with the input box (send keys to trigger the autocomplete)
    input_box.send_keys("a")
    time.sleep(5)  # Wait for the autocomplete results to load

    # Find all autocomplete suggestions
    autocomplete_suggestions = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "autocomplete-suggestions__item")))

    # Extract the text of each suggestion
    output = []
    for autocomplete_suggestion in autocomplete_suggestions:
        output.append(autocomplete_suggestion.text)

finally:
    # Close the WebDriver
    driver.quit()

In [None]:
# Check output
print(output[:5])

['azerbaijan\n4', 'andorra\n0', 'angola\n6', 'argentina\n7', 'austria\n0']


The initial scrape returns both the country names and their risk ratings, so there is no need to scrape the risk rating again from the individual country pages.

In [None]:
# Make countries and ratings lists
countries = []
ratings = []

for item in output:
  country, rating = item.split("\n")
  countries.append(country)
  ratings.append(int(rating))

print(countries[:5])
print(ratings[:5])

['azerbaijan', 'andorra', 'angola', 'argentina', 'austria']
[4, 0, 6, 7, 0]
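
The split above assumes every suggestion contains exactly one newline between the name and the rating. A more defensive variant might look like the sketch below, which tolerates surrounding whitespace and a missing rating. `parse_suggestion` is a hypothetical helper name, not part of the original notebook, and whether the site ever returns a suggestion without a rating is an assumption.

```python
from typing import Optional, Tuple


def parse_suggestion(text: str) -> Tuple[str, Optional[int]]:
    """Split a 'name<newline>rating' string into (name, rating).

    The rating is None when it is missing or not a plain integer.
    """
    # rpartition splits on the LAST newline in the stripped text.
    name, sep, rating = text.strip().rpartition("\n")
    if not sep:
        # No newline at all: rpartition puts the whole string in the tail.
        return rating.strip(), None
    rating = rating.strip()
    return name.strip(), int(rating) if rating.isdigit() else None


print(parse_suggestion("azerbaijan\n4"))
```

Rows with a `None` rating could then be inspected by hand instead of raising a `ValueError` part-way through the loop.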


Since we already have the ratings, we can build the pandas DataFrame now.

In [None]:
# Make a Pandas Dataframe with countries and ratings
df = pd.DataFrame({'Country_Name': countries, 'Country_Risk_Classification': ratings})
df.head()

| | Country_Name | Country_Risk_Classification |
|---|---|---|
| 0 | azerbaijan | 4 |
| 1 | andorra | 0 |
| 2 | angola | 6 |
| 3 | argentina | 7 |
| 4 | austria | 0 |


In [None]:
# Sort dataframe alphabetically and readjust index
df = df.sort_values(by='Country_Name')
df = df.reset_index(drop=True)
df.head()

| | Country_Name | Country_Risk_Classification |
|---|---|---|
| 0 | afghanistan | 7 |
| 1 | albania | 5 |
| 2 | algeria | 5 |
| 3 | andorra | 0 |
| 4 | angola | 6 |


## 2. Scrape a single country

In [None]:
# Make request and soup
SINGLE_COUNTRY_URL = BASE_URL + '/' + 'germany'
response = requests.get(SINGLE_COUNTRY_URL)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all policy groups
policy_groups = soup.find_all('div', class_='accordion-item')

# Initiate policy dictionaries
public_buyer_policy = {}
private_buyer_policy = {}
bank_policy = {}

# Loop through the policy groups and add their info to the dictionaries
for policy in policy_groups:
    heading = policy.find('div', class_='accordion-header').text
    content = policy.find_all('div', class_='accordion-table')
    for item in content:
        # Cells alternate between a context title and its policy text
        sentence = item.find_all('div', class_='cell')
        for title, value in zip(sentence[0::2], sentence[1::2]):
            if heading == 'Public buyer':
                public_buyer_policy[title.text] = value.text
            elif heading == 'Private buyer':
                private_buyer_policy[title.text] = value.text
            elif heading == 'Bank':
                bank_policy[title.text] = value.text
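
The `zip(sentence[0::2], sentence[1::2])` idiom above relies on the cells alternating between a context title and its policy text. The sketch below isolates that pairing on plain strings so it can be checked without a network request. `pair_cells` is a hypothetical helper, and the sample values are copied from the Germany bank policy.

```python
from typing import Dict, List


def pair_cells(cell_texts: List[str]) -> Dict[str, str]:
    """Pair alternating [title, value, title, value, ...] strings."""
    # Even positions hold titles, odd positions hold values. zip stops at
    # the shorter slice, so a trailing unpaired title is silently dropped.
    return dict(zip(cell_texts[0::2], cell_texts[1::2]))


cells = [
    "Guarantees without credit", "EIFO accepts creditworthy banks.",
    "Up to 1 year", "EIFO accepts creditworthy banks.",
]
print(pair_cells(cells))
```

If the page layout ever changed so that cells no longer alternated, this pairing would quietly produce wrong keys, so it may be worth spot-checking a few countries by eye.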

In [None]:
# Check dictionaries
print(public_buyer_policy)
print(private_buyer_policy)
print(bank_policy)

{'Guarantees without credit': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.', 'Up to 1 year': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.', '1-5 years': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.', 'Over 5 years': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.'}
{'Guarantees without credit': 'EIFO accepts all creditworthy buyers.', 'Up to 1 year': 'EIFO accepts all creditworthy buyers.', '1-5 years': 'EIFO accepts all creditworthy buyers.', 'Over 5 years': 'EIFO accepts all creditworthy buyers.'}
{'Guarantees without credit': 'EIFO accepts creditworthy banks.', 'Up to 1 year': 'EIFO accepts creditworthy banks.', '1-5 years': 'EIFO accepts creditworthy banks.', 'Over 5 years':

## 3. Scrape all countries

In [None]:
# Scrape all countries from df['Country_Name']
public_buyer_policies_all_countries = []
private_buyer_policies_all_countries = []
bank_policies_all_countries = []

for country in df['Country_Name']:

  # Adjust country names with symbols not suitable for urls
  country = country.replace(" ", "-")
  country = country.replace("'", "-")
  country = country.replace(".", "")
  country = country.replace("(", "")
  country = country.replace(")", "")

  # Make request and soup
  COUNTRY_URL = BASE_URL + '/' + country
  response = requests.get(COUNTRY_URL)
  soup = BeautifulSoup(response.content, 'html.parser')

  # Find all policy groups
  policy_groups = soup.find_all('div', class_='accordion-item')

  # Initiate policy dictionaries
  public_buyer_policy = {}
  private_buyer_policy = {}
  bank_policy = {}

  # Loop through soups to add info to dictionaries
  for policy in policy_groups:
    heading = policy.find('div', class_='accordion-header').text
    content = policy.find_all('div', class_='accordion-table')
    for item in content:
      sentence = item.find_all('div', class_='cell')
      for title, value in zip(sentence[0::2], sentence[1::2]):
        if heading == 'Public buyer':
          public_buyer_policy[title.text] = value.text
        elif heading == 'Private buyer':
          private_buyer_policy[title.text] = value.text
        elif heading == 'Bank':
          bank_policy[title.text] = value.text

  # Append dictionaries to lists
  public_buyer_policies_all_countries.append(public_buyer_policy)
  private_buyer_policies_all_countries.append(private_buyer_policy)
  bank_policies_all_countries.append(bank_policy)
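
The chain of `replace` calls in the loop above could also be written as a single translation table, sketched below. `country_to_slug` and `_SLUG_TABLE` are hypothetical names, and the rules are only the ones listed in the Steps section; any other symbol the site uses would need adding.

```python
# Characters mapped to a dash, and characters dropped entirely (None),
# mirroring the str.replace calls used in the scraping loop.
_SLUG_TABLE = str.maketrans({" ": "-", "'": "-", ".": None, "(": None, ")": None})


def country_to_slug(name: str) -> str:
    """Convert a country name as listed on the site into its URL segment."""
    return name.translate(_SLUG_TABLE)


print(country_to_slug("antigua and barbuda"))  # antigua-and-barbuda
```

Keeping the rules in one table makes it easier to see at a glance which characters are handled.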

In [None]:
# Check lists
print(public_buyer_policies_all_countries[:2])
print(private_buyer_policies_all_countries[:2])
print(bank_policies_all_countries[:2])

[{'Guarantees without credit': 'Off cover. EIFO provides no guarantee cover for business transactions with public-sector buyers.', 'Up to 1 year': 'Off cover. EIFO provides no guarantee cover for business transactions with public-sector buyers.', '1-5 years': 'Cover is suspended.', 'Over 5 years': 'Cover is suspended.'}, {'Guarantees without credit': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.', 'Up to 1 year': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.', '1-5 years': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.', 'Over 5 years': 'EIFO accepts guarantees from the Ministry of Finance. EIFO considers cover of other public buyers on a case-by-case basis.'}]
[{'Guarantees without credit': 'EIFO does not provide cover for private buyers – for information 

The policy information is repetitive, so we reformat it to group identical policies together.

Where a single policy applies across all contexts, we provide just that universal policy. Where there are multiple policies, we provide each one followed by its contexts in brackets.

In [None]:
# Reformat the policies
def format_policies(policy_list):

  formatted_policy_list = []

  for info in policy_list:
    # Flip the dictionaries so policies have grouped contexts
    unique_values = {}
    for key, value in info.items():
      if value in unique_values.keys():
        unique_values[value].append(key)
      else:
        unique_values[value] = [key]

    if len(unique_values) == 1:
      # Add single universal policy
      formatted_policy_list.append(list(unique_values.keys())[0])
    else:
      # Add policies with context in curved brackets
      combined_info = ''
      for key, value in unique_values.items():
        combined_info += f'{key} ({", ".join(value)}), '
      formatted_policy_list.append(combined_info[:-2]) # Remove the last ', '

  return formatted_policy_list
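
To see the effect of the grouping on a single country, the sketch below applies the same inversion to one dictionary. `summarise_policy` is a hypothetical single-country variant of `format_policies`, and the sample policy texts are shortened for illustration rather than taken verbatim from the site.

```python
from typing import Dict, List


def summarise_policy(policy: Dict[str, str]) -> str:
    """Collapse {context: policy_text} into one display string."""
    contexts_by_text: Dict[str, List[str]] = {}
    for context, text in policy.items():
        # setdefault creates the list the first time a policy text is seen
        contexts_by_text.setdefault(text, []).append(context)

    if len(contexts_by_text) == 1:
        # One policy text covers every context: return it on its own
        return next(iter(contexts_by_text))
    return ", ".join(
        f"{text} ({', '.join(contexts)})"
        for text, contexts in contexts_by_text.items()
    )


sample = {
    "Guarantees without credit": "Off cover.",
    "Up to 1 year": "Off cover.",
    "1-5 years": "Cover is suspended.",
    "Over 5 years": "Cover is suspended.",
}
print(summarise_policy(sample))
# Off cover. (Guarantees without credit, Up to 1 year), Cover is suspended. (1-5 years, Over 5 years)
```

Because dictionaries preserve insertion order, the policies appear in the order their first context was scraped.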

In [None]:
# Format policy lists
formatted_public_buyer_policies_all_countries = format_policies(public_buyer_policies_all_countries)
formatted_private_buyer_policies_all_countries = format_policies(private_buyer_policies_all_countries)
formatted_bank_policies_all_countries = format_policies(bank_policies_all_countries)

In [None]:
# Add formatted policy lists to dataframe
df['EIFOs_cover_policy(Public_Buyer)'] = formatted_public_buyer_policies_all_countries
df['EIFOs_cover_policy(Private_Buyer)'] = formatted_private_buyer_policies_all_countries
df['EIFOs_cover_policy (Bank)'] = formatted_bank_policies_all_countries

In [None]:
# Check dataframe
df.head(10)

| | Country_Name | Country_Risk_Classification | EIFOs_cover_policy(Public_Buyer) | EIFOs_cover_policy(Private_Buyer) | EIFOs_cover_policy (Bank) |
|---|---|---|---|---|---|
| 0 | afghanistan | 7 | Off cover. EIFO provides no guarantee cover fo... | EIFO does not provide cover for private buyers... | EIFO does not cover risks on banks in the coun... |
| 1 | albania | 5 | EIFO accepts guarantees from the Ministry of F... | EIFO accepts all creditworthy buyers. (Guarant... | EIFO only accepts the most solid banks. |
| 2 | algeria | 5 | EIFO accepts guarantees from the Ministry of F... | EIFO is open for private buyers with certain r... | EIFO accepts creditworthy banks. |
| 3 | andorra | 0 | EIFO accepts guarantees from the Ministry of F... | EIFO accepts all creditworthy buyers. | EIFO accepts creditworthy banks. |
| 4 | angola | 6 | EIFO accepts guarantees from the Ministry of F... | EIFO is open to creditworthy buyers for transa... | EIFO is open to creditworthy banks for transac... |
| 5 | antigua and barbuda | 7 | Not considered – contact EIFO. | Not considered – contact EIFO. | Not considered – contact EIFO. |
| 6 | argentina | 7 | Usually off cover - contact EIFO. | Usually off cover - contact EIFO. | Usually off cover - contact EIFO. (Guarantees ... |
| 7 | armenia | 6 | Cover of public sector buyers is subject to so... | EIFO accepts all creditworthy buyers. (Guarant... | EIFO accepts creditworthy banks. |
| 8 | aruba | 5 | EIFO accepts guarantees from the Ministry of F... | EIFO accepts all creditworthy buyers. | EIFO accepts creditworthy banks. |
| 9 | australia | 0 | EIFO accepts guarantees from the Ministry of F... | EIFO accepts all creditworthy buyers. | EIFO accepts creditworthy banks. |


## 4. Export data to xlsm

In [None]:
# Export data to xlsm
df.to_excel('Final.xlsm', sheet_name='Output', index=False)