# UNHCR Refugee Applications Data Scraper
This scraper collects, for each country or entity, its name and four figures: new refugee applications from people leaving that entity, applications accepted (recognitions or other protections), applications rejected, and decisions still pending. The UNHCR has data for 229 countries/entities.

Data are retrieved from this site: [Refugee Data Finder](https://www.unhcr.org/refugee-statistics/data-summaries?data_summaries%5Bregion%5D=&data_summaries%5Bcountry%5D=&data_summaries%5BwithinFrom%5D=from&data_summaries%5Bview%5D=asylum_applications_decisions&data_summaries%5Byear%5D=2024&data_summaries%5BpopType%5D=FDP&data_summaries%5B_mode%5D=country&data_summaries%5B_token%5D=6987cded38eae.zI1K2yKKVqlikInn8U5IcnnnT9iyimmU85vCiq_BvfE.jcsIkhanJ-gx1POrwAMZO0CiKeji2ii5gNKY4P-4y8S-_3uuEv4Tnw3RzA&data_summaries%5Bsubmit%5D=)


In [1]:
from bs4 import BeautifulSoup
import requests
import time
import csv

hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/133.0.0.0 Safari/537.36'}

# Main summary page to retrieve URLs
url = "https://www.unhcr.org/refugee-statistics/data-summaries?data_summaries%5Bregion%5D=&data_summaries%5Bcountry%5D=&data_summaries%5BwithinFrom%5D=from&data_summaries%5Bview%5D=asylum_applications_decisions&data_summaries%5Byear%5D=2024&data_summaries%5BpopType%5D=FDP&data_summaries%5B_mode%5D=country&data_summaries%5B_token%5D=6987cded38eae.zI1K2yKKVqlikInn8U5IcnnnT9iyimmU85vCiq_BvfE.jcsIkhanJ-gx1POrwAMZO0CiKeji2ii5gNKY4P-4y8S-_3uuEv4Tnw3RzA&data_summaries%5Bsubmit%5D="

## `get_urls(main_url)`
A function that reads each country's option value from a dropdown menu on the main page, inserts it into the main URL to build that country's URL, and returns all country URLs in a list.

Each country's page is reached by selecting that country from the dropdown menu, and the selection is carried in the URL as the country's option value. A country's data summary URL is identical to the main page's URL except that the option value fills the `data_summaries%5Bcountry%5D=` parameter, which is empty in the main URL. The returned list is meant to be looped over to request each country's page.
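
The URL construction described above can also be sketched with `urllib.parse.urlencode`, which percent-encodes the bracketed parameter names. This is only an illustration under stated assumptions, not the notebook's own method: `build_country_url`, the option value `"XXX"`, and the token value `"TOKEN"` are hypothetical placeholders, and the parameter names are copied from the main URL.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.unhcr.org/refugee-statistics/data-summaries"


def build_country_url(entity_value, token, year="2024"):
    """Hypothetical helper: build one entity's summary URL from its option value."""
    # Parameter names and fixed values mirror the main URL; only the
    # country value (and the caller-supplied token) vary.
    params = {
        "data_summaries[region]": "",
        "data_summaries[country]": entity_value,
        "data_summaries[withinFrom]": "from",
        "data_summaries[view]": "asylum_applications_decisions",
        "data_summaries[year]": year,
        "data_summaries[popType]": "FDP",
        "data_summaries[_mode]": "country",
        "data_summaries[_token]": token,
        "data_summaries[submit]": "",
    }
    # urlencode turns "[" and "]" into %5B and %5D and keeps the dict order
    return BASE_URL + "?" + urlencode(params)


# "XXX" and "TOKEN" are placeholders, not real option or token values
example_url = build_country_url("XXX", token="TOKEN")
print(example_url)
```

Building the query from a dictionary keeps the one varying parameter visible, at the cost of a little more code than string concatenation.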

In [6]:
def get_urls(main_url):
    page = requests.get(main_url, headers=hdr)
    # Stop early on an HTTP error instead of parsing an error page
    page.raise_for_status()
    soup = BeautifulSoup(page.text, 'html.parser')
    url_list = []

    # Get dropdown menu
    dropdown_menu = soup.find(id="data_summaries_country")
    # Get options in menu (every option is a country/entity)
    options = dropdown_menu.find_all("option")
    # Build a URL from each option value. No request is made inside this
    # loop, so no delay is needed here.
    for option in options:
        # The placeholder option has an empty (or missing) value; skip it
        entity_value = option.get('value', '')
        if entity_value != "":
            url = "https://www.unhcr.org/refugee-statistics/data-summaries?data_summaries%5Bregion%5D=&data_summaries%5Bcountry%5D=" + entity_value + "&data_summaries%5BwithinFrom%5D=from&data_summaries%5Bview%5D=asylum_applications_decisions&data_summaries%5Byear%5D=2024&data_summaries%5BpopType%5D=FDP&data_summaries%5B_mode%5D=country&data_summaries%5B_token%5D=6987cded38eae.zI1K2yKKVqlikInn8U5IcnnnT9iyimmU85vCiq_BvfE.jcsIkhanJ-gx1POrwAMZO0CiKeji2ii5gNKY4P-4y8S-_3uuEv4Tnw3RzA&data_summaries%5Bsubmit%5D="
            url_list.append(url)

    return url_list

# Call get_urls()
url_list = get_urls(url)

## `get_info(page_url)`
A function that retrieves the entity name and four data points from a summary page and returns them in a list.
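
The string cleaning applied to the heading and the figures can be tried on its own, without a network request. The sketch below assumes the heading has the form `Country: <name>` and that figures use commas as thousands separators; `clean_heading`, `clean_figure`, the entity name "Exampleland", and all sample numbers are made up for illustration.

```python
def clean_heading(heading_text):
    """Drop the leading label (everything up to the first space) from a heading."""
    text = heading_text.strip()
    _, sep, rest = text.partition(" ")
    # If the heading contains no space, keep it unchanged
    return rest if sep else text


def clean_figure(figure_text):
    """Strip whitespace and remove thousands separators from a figure."""
    return figure_text.strip().replace(",", "")


# Made-up sample values standing in for the page's heading and figure text
sample_heading = "  Country: Exampleland  "
sample_figures = [" 1,234 ", "56", "7,890", "0"]

row = [clean_heading(sample_heading)] + [clean_figure(f) for f in sample_figures]
print(row)
```

The resulting list has the same shape as a row of the CSV: the entity name followed by the four figures as digit-only strings.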

In [7]:
def get_info(page_url):
    page = requests.get(page_url, headers=hdr)
    time.sleep(3)
    soup = BeautifulSoup(page.text, 'html.parser')
    data_list = []

    summary = soup.find(id="data_summaries_content")
    country_hed = summary.find("h2")
    # Clean country heading
    country = country_hed.text.strip()
    # Find first space in the country heading
    trim = country.find(" ")
    # Trim off "Country: " in country heading
    country = country[trim+1:]
    # Append country name to list
    data_list.append(country)
    
    # Grab four data points
    info_figures = summary.find_all("h4")
    
    for figure in info_figures:
        # Strip whitespace and remove thousands separators ("1,234" -> "1234")
        fig_text = figure.text.strip().replace(",", "")
        data_list.append(fig_text)

    return data_list

## `write_csv(url_list)`
A function that writes one CSV row for every URL in a list of URLs (as returned by `get_urls()`). Each row holds the data points that `get_info()` returns for that URL.
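
The row layout can be checked without scraping by writing a few rows to an in-memory buffer. This is a minimal sketch: `rows_to_csv_text` and the sample rows are hypothetical, and only the column headings are taken from the notebook.

```python
import csv
import io

HEADER = ['entity', 'new applications', 'acceptances', 'rejections', 'pending']


def rows_to_csv_text(rows):
    """Hypothetical helper: write the header plus the given rows to a string."""
    # newline='' lets the csv module control line endings itself
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows; the second entity name contains a comma to show quoting
sample_rows = [
    ["Exampleland", "1234", "56", "7890", "0"],
    ["Testonia, Rep. of", "10", "2", "3", "5"],
]
csv_text = rows_to_csv_text(sample_rows)
print(csv_text)
```

Because `csv.writer` quotes any field that contains a comma, entity names with commas survive a round trip through `csv.reader`, which is why commas only need to be stripped from the numeric figures.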

In [8]:
def write_csv(url_list):
    # The with block closes the file even if a request or parsing step fails
    with open('refugee_apps.csv', 'w', newline='', encoding='utf-8') as csvfile:
        c = csv.writer(csvfile)

        # Write column headings in a row
        c.writerow(['entity', 'new applications', 'acceptances', 'rejections', 'pending'])

        # Loop through each entity's URL to retrieve info and write into a row
        for entity_url in url_list:
            entity_data = get_info(entity_url)
            # Pause between entities to avoid hammering the server
            time.sleep(1)
            c.writerow(entity_data)

# Run function to create CSV file
write_csv(url_list)