## Web Scraping bundes-klinik.atlas.de
This notebook scrapes data from the website and stores it in CSV files for further processing. <br>
Parts of the code are factored out into helper functions. The scraping process has three parts:
1) Opening the list of all hospitals available on the website
2) Retrieving general details from the hospital pages
3) Retrieving specific treatment details from the treatment pages

In [1]:
# load libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import sys
import os
from IPython.display import clear_output

# load functions
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)
sys.path.append(parent_dir)
from lib.functions_webscrape_atlas import *

# set display options
pd.set_option('display.max_columns', None)

## 1. Prepare List of Hospital Ids
### 1.1 Load Hospital Locations
The list of all hospitals available on the website is loaded. <br>
This information is necessary in order to target each hospital's individual pages.
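
The id is taken from the second-to-last element of the link after splitting on `/`. A minimal sketch of why that index is used, assuming links end with a trailing slash; the URL and the helper `extract_hospital_id` are hypothetical and not part of the notebook's library:

```python
# Hypothetical link: the id is the last path segment and the URL ends with "/".
link = "https://example.org/krankenhaus/771003/"

# Splitting on "/" leaves an empty string after the trailing slash,
# so the id sits at index -2 rather than -1.
parts = link.split("/")
hospital_id = parts[-2]


def extract_hospital_id(url: str) -> str:
    """Hypothetical variant that also tolerates links without a trailing slash."""
    return url.rstrip("/").split("/")[-1]


print(hospital_id, extract_hospital_id("https://example.org/krankenhaus/771003"))
```

If some links lack the trailing slash, index `-2` would return the parent path segment instead of the id, so checking a few raw links first may be worthwhile.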

In [2]:
# read hospital locations data
df_locations = pd.read_json('../data/in/raw/atlas/locations.json', dtype={'zip': str})

# create list of hospital ids
hospital_id_list = df_locations['link'].apply(lambda x: x.split('/')[-2]).tolist()

### 1.2 Clean Hospital Locations and add Column for Hospital Id
First create a column for the hospital id, then replace empty cells and save the data to a CSV file for possible further processing.
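
The replacement of empty cells can be illustrated on a toy frame. This is a sketch with made-up column names and values, not the actual locations data. It shows that `DataFrame.replace('', ...)` only touches cells that are exactly the empty string:

```python
import pandas as pd

# Toy data (hypothetical): one empty cell and one whitespace-only cell.
df = pd.DataFrame({
    "name": ["Klinik A", "Klinik B", "Klinik C"],
    "phone": ["030 123", "", " "],
})

# Only exact matches of '' are replaced; ' ' and real values stay as they are.
cleaned = df.replace("", "not reported")
print(cleaned)
```

Whitespace-only cells and missing values (`NaN`) are left untouched by this call and would need separate handling if they occur in the data.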

In [3]:
# create new column for hospital_id
df_locations['hospital_id'] = df_locations['link'].apply(lambda x: x.split('/')[-2])
# replace empty cells with 'not reported'
df_locations = df_locations.replace('', 'not reported')
# save locations preprocessed data
df_locations.to_csv('../data/in/staging/hospital_locations.csv', index=False)

## 2. Webscrape Departments, Certificates and other Details
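
The loop in this section pauses 15 seconds between requests because of the site's robots.txt. As a sketch of how such a delay could be read programmatically, the standard-library `urllib.robotparser` can parse the file. The robots.txt content and URL below are hypothetical stand-ins and do not reproduce the real file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (assumption, not the site's actual file).
robots_txt = """\
User-agent: *
Crawl-delay: 15
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Delay in seconds for the given user agent, or None if none is specified.
delay = parser.crawl_delay("*")

# Whether the given user agent may fetch a (hypothetical) hospital page.
allowed = parser.can_fetch("*", "https://example.org/krankenhaus/771003/")

print(delay, allowed)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the file instead of parsing a string.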

In [4]:

# initialize lists to store departments data
hospital_ids_departments = []
department_names = []
department_counts = []

# initialize lists to store certificates data
hospital_ids_certificates = []
certificates = []

# initialize lists to store details data
hospital_ids_details = []
total_treatments_counts = []
total_treatments_labels = []
nursing_quotient_counts = []
nursing_quotient_labels = []
nursing_counts = []
provider_types = []
bed_counts = []
semi_residential_counts = []
emergency_services = []

# loop through all hospital ids
k = 0
for hospital_id in hospital_id_list:
    k += 1
    print(f'{hospital_id} - {round(k/len(hospital_id_list)*100, 1)} %')
    
    # load website content for specific hospital
    soup = load_hospital_site(hospital_id)

    # extract department data from website content
    hospital_ids_, department_names_, department_counts_ = get_departments(soup, hospital_id)

    hospital_ids_departments.extend(hospital_ids_)
    department_names.extend(department_names_)
    department_counts.extend(department_counts_)

    # extract certificates data from website content
    hospital_ids_certificates_, certificates_ = get_certificates(soup, hospital_id)

    hospital_ids_certificates.extend(hospital_ids_certificates_)
    certificates.extend(certificates_)

    # extract details data from website content
    hospital_id, total_treatments_count, total_treatments_label, nursing_quotient_count, nursing_quotient_label, nursing_count, provider_type, bed_count, semi_residential_count, emergency_service = get_details(soup, hospital_id)

    hospital_ids_details.append(hospital_id)
    total_treatments_counts.append(total_treatments_count)
    total_treatments_labels.append(total_treatments_label)
    nursing_quotient_counts.append(nursing_quotient_count)
    nursing_quotient_labels.append(nursing_quotient_label)
    nursing_counts.append(nursing_count)
    provider_types.append(provider_type)
    bed_counts.append(bed_count)
    semi_residential_counts.append(semi_residential_count)
    emergency_services.append(emergency_service)

    # wait 15 sec between requests as requested by robots.txt of the website
    time.sleep(15)
    clear_output(wait=True)

# create dataframes from lists
df_departments = pd.DataFrame({'hospital_id': hospital_ids_departments, 'department_name': department_names, 'department_count': department_counts})
df_certificates = pd.DataFrame({'hospital_id': hospital_ids_certificates, 'certificate': certificates})
df_details = pd.DataFrame({
    'hospital_id': hospital_ids_details,
    'total_treatments': total_treatments_counts,
    'total_treatments_label': total_treatments_labels,
    'nursing_quotient': nursing_quotient_counts,
    'nursing_quotient_label': nursing_quotient_labels,
    'nursing_count': nursing_counts,
    'provider_type': provider_types,
    'bed_count': bed_counts,
    'semi_residential_count': semi_residential_counts,
    'emergency_service': emergency_services,
})

# save departments, certificates and details data for further processing
df_departments.to_csv('../data/in/staging/atlas_departments.csv', index=False, encoding='utf-8')
df_certificates.to_csv('../data/in/staging/atlas_certificates.csv', index=False, encoding='utf-8')
df_details.to_csv('../data/in/staging/atlas_details.csv', index=False, encoding='utf-8')

773870 - 100.0 %
no certificates found


## 3. Webscrape Specific Treatments Information
First, two treatment dictionaries are created that serve two purposes: one for (later) storing the treatment codes in the database, the other to provide the details needed for the web scraping process (e.g. each treatment has its own hash in the URL). The first dictionary is saved directly to a CSV file. Then the web scraping is carried out.
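
To make the URL handling concrete, the sketch below extracts the three parts from a made-up URL, once with plain string splitting and once with `urllib.parse`. The URL, the parameter prefix `tx_demo` and all values are hypothetical; only the key fragments `treatmentcode`, `searchlabel` and `cHash` come from the notebook:

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical treatment URL with percent-encoded brackets in the keys.
url = ("https://example.org/suche/?tx_demo%5Btreatmentcode%5D=1234"
       "&tx_demo%5Bsearchlabel%5D=H%C3%BCftprothese&cHash=0123abcd")

# Plain string splitting: values stay percent-encoded.
code = url.split('treatmentcode%5D=')[1].split('&')[0]
searchlabel_raw = url.split('searchlabel%5D=')[1].split('&')[0]
c_hash = url.split('cHash=')[1]

# Standard-library parsing: keys and values are decoded.
params = parse_qs(urlsplit(url).query)
searchlabel = next(v[0] for k, v in params.items() if k.endswith('[searchlabel]'))

print(code, searchlabel_raw, searchlabel, c_hash)
```

The splitting approach keeps the search label in its encoded form, which is convenient when the value is pasted back into a request URL. The decoded form is easier to read when it is stored or displayed.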

In [10]:
# Read dictionary with treatment names as keys and urls as values
url_dict = get_url_dict()

treatments_dictionary = {}
treatments_dict_for_db = {}

# Extract treatment code, searchlabel and cHash from urls and store them in a dictionary
for key, value in url_dict.items():
    treatment_name = key
    treatment_code = value.split('treatmentcode%5D=')[1].split('&')[0]
    treatment_searchlabel = value.split('searchlabel%5D=')[1].split('&')[0]
    treatment_cHash = value.split('cHash=')[1]
    treatments_dictionary[treatment_name] = {
        'code': treatment_code,
        'searchlabel': treatment_searchlabel,
        'cHash': treatment_cHash}
    treatments_dict_for_db[treatment_code] = treatment_name

In [14]:
# Save treatments dictionary to csv for database import
treatments_dict_df = pd.DataFrame({'treatment_code': list(treatments_dict_for_db.keys()), 'treatment_name': list(treatments_dict_for_db.values())})
treatments_dict_df.to_csv('../data/in/staging/treatments_dict.csv', index=False)

In [None]:
# Get treatments for all hospitals and save to csv in chunks of m hospitals
m = 50
start_chunk = 11  # resume offset: chunks before this index are skipped (set to 0 for a full run)
n_chunks = -(-len(hospital_id_list) // m)  # ceiling division so a final partial chunk is not dropped
for k in range(start_chunk, n_chunks):
    print('k:', k)
    print(f'{k*m}-{k*m+m-1}')
    # scrape treatment counts for the current slice of hospital ids
    list_for_df_hospital_id, list_for_df_treatment_code, list_for_df_count_number, list_for_df_count_label = get_treatments(hospital_id_list[k*m:k*m+m], treatments_dictionary)
    df_treatments = pd.DataFrame({'hospital_id': list_for_df_hospital_id, 'treatment_code': list_for_df_treatment_code, 'count_number': list_for_df_count_number, 'count_label': list_for_df_count_label})
    df_treatments.to_csv(f'../data/in/staging/treatments_chunks/atlas_treatments_sample_{k*m}-{k*m+m-1}.csv', index=False, encoding='utf-8')
    print(f'saved file {k*m}-{k*m+m-1}')
    del df_treatments
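
The chunked saving above slices the id list into consecutive blocks of `m`. A minimal sketch of that slicing on a toy list, including a final partial chunk and an optional resume index; the helper `chunk_bounds` is hypothetical and not part of the notebook's library:

```python
def chunk_bounds(n_items: int, chunk_size: int, start_chunk: int = 0):
    """Yield (chunk_index, start, stop) for consecutive slices of a list."""
    # Ceiling division keeps the last, possibly shorter, chunk.
    n_chunks = -(-n_items // chunk_size)
    for k in range(start_chunk, n_chunks):
        start = k * chunk_size
        stop = min(start + chunk_size, n_items)
        yield k, start, stop


# Toy ids (hypothetical): 7 items in chunks of 3 give 3 + 3 + 1.
ids = [f"id{i}" for i in range(7)]
chunks = [ids[start:stop] for _, start, stop in chunk_bounds(len(ids), 3)]
print(chunks)
```

Passing `start_chunk` resumes an interrupted run at a given chunk without repeating the requests for chunks that were already saved.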