# 05c – Data Cleansing: Hospital Admissions

**Author:** Roberto Chiaiese  
**Project:** CovidReporting Data Warehouse  

## Overview
This notebook cleans and reorganizes hospital and ICU admissions data into one analytical fact table:
- **factDailyHospitalAdmissions**

## Steps Performed
1. **Load data** from `staging.hospital_admissions`.  
2. **Select the daily rows** using the `indicator` field (`Daily hospital occupancy` and `Daily ICU occupancy`); rows with any other indicator are left out.  
3. **Standardize indicators** (ICU vs. hospital occupancy).  
4. **Pivot data** to align ICU and hospital values per country and time.  
5. **Map countries** to `dimCountry` and dates to `dimDate`.  
6. **Create final cleaned outputs** ready for loading into the core schema.

## Output Tables
**factDailyHospitalAdmissions**
- `fact_daily_hosp_id`, `country_id`, `date_id`, `source`, `icu_occupancy`, `hosp_occupancy`



In [35]:
import pandas as pd
import numpy as np 

In [72]:
df = pd.read_csv('/home/jovyan/datawarehouse/staging_layer/dataset/hospital_admissions.csv')
df = df.replace(':', np.nan)
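
The `:` placeholder can also be handled while the file is read. The following is a minimal sketch, assuming the staging file marks missing values with a bare `:` as the `replace` call above suggests; the inline CSV text is made-up sample data, not the real file.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the staging file; ':' stands for a missing value.
sample_csv = io.StringIO(
    "country,indicator,date,value\n"
    "Austria,Daily hospital occupancy,2020-04-02,1057\n"
    "Austria,Daily ICU occupancy,2020-04-02,:\n"
)

# na_values tells read_csv to parse ':' as NaN, so 'value' is numeric from the start.
sample = pd.read_csv(sample_csv, na_values=":")
print(sample["value"].isna().sum())
```

Parsing the placeholder at load time keeps `value` numeric instead of an object column that mixes strings and NaN.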

In [73]:
# Load dimCountry and dimDate, which are needed to build the daily hospital admissions fact table
dimCountry = pd.read_csv('/home/jovyan/datawarehouse/core_layer/processed/dimCountry.csv')
dimDate = pd.read_csv('/home/jovyan/datawarehouse/lookup/dim_date.csv')

In [74]:
# Fill the NaN values with 0 and convert the value column to integers
df['value'] = df['value'].fillna(0).astype('int64')

In [76]:
# Left-join dimCountry and dimDate onto the hospital admissions data
df_merged = pd.merge(df, dimCountry, on='country', how='left')
df_merged = pd.merge(df_merged, dimDate, on='date', how='left')
df_merged.head()

Unnamed: 0,country,indicator,date,year_week_x,value,source,url,country_id,country_code,population,...,year,month,day,day_name,day_of_year,week_of_month,week_of_year,month_name,year_month,year_week_y
0,Austria,Daily hospital occupancy,2020-04-02,2020-W14,1057,Surveillance,https://www.sozialministerium.at/Informationen...,AT,AUT,8858775,...,2020.0,4.0,2.0,Thursday,93.0,1.0,14.0,April,202004.0,202014.0
1,Austria,Daily hospital occupancy,2020-04-08,2020-W15,1096,Surveillance,https://www.sozialministerium.at/Informationen...,AT,AUT,8858775,...,2020.0,4.0,8.0,Wednesday,99.0,2.0,15.0,April,202004.0,202015.0
2,Austria,Daily hospital occupancy,2020-04-15,2020-W16,1001,Surveillance,https://info.gesundheitsministerium.at/dashboa...,AT,AUT,8858775,...,2020.0,4.0,15.0,Wednesday,106.0,3.0,16.0,April,202004.0,202016.0
3,Austria,Daily hospital occupancy,2020-04-16,2020-W16,967,Surveillance,https://www.sozialministerium.at/Informationen...,AT,AUT,8858775,...,2020.0,4.0,16.0,Thursday,107.0,3.0,16.0,April,202004.0,202016.0
4,Austria,Daily hospital occupancy,2020-04-17,2020-W16,909,Surveillance,https://www.sozialministerium.at/Informationen...,AT,AUT,8858775,...,2020.0,4.0,17.0,Friday,108.0,3.0,16.0,April,202004.0,202016.0
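
A left join keeps every staging row even when no dimension row matches, and the unmatched rows end up with NaN keys. The float-looking values in the date columns above (for example `2020.0`) may indicate that this happened for some dates. The sketch below shows one way to count unmatched rows; the two small frames are illustrative stand-ins, not the real `df` and `dimDate`.

```python
import pandas as pd

# Toy stand-ins for the staging data and dimDate (values are illustrative only).
facts = pd.DataFrame({
    "country": ["Austria", "Austria"],
    "date": ["2020-04-02", "2020-04-03"],
    "value": [1057, 1096],
})
dim_date = pd.DataFrame({"date": ["2020-04-02"], "date_key": [20200402]})

# indicator=True adds a '_merge' column: 'both' or 'left_only' for a left join.
checked = facts.merge(dim_date, on="date", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(unmatched), "row(s) without a dimDate match")
```

Rows flagged `left_only` would carry a missing `date_key`, and `pivot_table` drops index keys that are NaN by default, so it can be worth inspecting them before pivoting.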


In [77]:
# Extract the daily hospital and ICU occupancy rows
daily_indicators = ['Daily hospital occupancy', 'Daily ICU occupancy']
df_daily = df_merged[df_merged['indicator'].isin(daily_indicators)]

In [78]:
df_daily.columns

Index(['country', 'indicator', 'date', 'year_week_x', 'value', 'source', 'url',
       'country_id', 'country_code', 'population', 'age_0_to_14',
       'age_15_to_24', 'age_25_to_49', 'age_50_to_64', 'age_65_to_79',
       'age_80_to_MAX', 'date_key', 'year', 'month', 'day', 'day_name',
       'day_of_year', 'week_of_month', 'week_of_year', 'month_name',
       'year_month', 'year_week_y'],
      dtype='object')

In [79]:
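# Pivot the indicators into columns: one row per country, date and source,
# with ICU and hospital occupancy side by side.
# pivot_table aggregates duplicate keys with its default aggfunc (mean).
df_pivot_daily = df_daily.pivot_table(
    index=['country_id', 'date_key', 'source'],
    columns='indicator',
    values='value',
)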
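
Because `pivot_table` averages rows that share the same index and column keys, duplicate staging rows would be merged silently. The sketch below, on made-up rows, shows how such duplicates can be listed before pivoting and what the default aggregation does with them.

```python
import pandas as pd

# Made-up rows: two hospital readings share the same country, date and source.
toy = pd.DataFrame({
    "country_id": ["AT", "AT", "AT"],
    "date_key": [20200402, 20200402, 20200402],
    "source": ["Surveillance"] * 3,
    "indicator": [
        "Daily hospital occupancy",
        "Daily hospital occupancy",
        "Daily ICU occupancy",
    ],
    "value": [1000, 1100, 200],
})

# List every row whose full key appears more than once.
keys = ["country_id", "date_key", "source", "indicator"]
dupes = toy[toy.duplicated(subset=keys, keep=False)]
print(len(dupes), "duplicated row(s)")

# With the default aggfunc, the two hospital readings are averaged.
wide = toy.pivot_table(
    index=["country_id", "date_key", "source"],
    columns="indicator",
    values="value",
)
print(wide)
```

If duplicates are expected, passing an explicit `aggfunc` (for example `"max"` or `"last"`) makes the choice visible in the code.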


In [86]:
# Sort by country and by date (newest first), move the pivot index back into
# columns, add a surrogate key, and cast both occupancy columns to nullable integers
occupancy_cols = ['Daily ICU occupancy', 'Daily hospital occupancy']
df_daily_sorted = df_pivot_daily.sort_values(by=['country_id', 'date_key'], ascending=[True, False])
# The first reset_index() restores country_id, date_key and source as columns;
# the second turns the new 0..n-1 row index into an 'index' column, which is then renamed
df_daily_sorted = df_daily_sorted.reset_index().reset_index().rename(columns={'index': 'fact_daily_hosp_id'})
df_daily_sorted[occupancy_cols] = df_daily_sorted[occupancy_cols].astype('Int64')
factDailyHospitalAdmissions = df_daily_sorted
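
The capital-`I` `Int64` dtype is pandas' nullable integer type. After the pivot, a country and date that reported only one of the two indicators has NaN in the other column, and the plain `int64` dtype cannot hold that. A small sketch on illustrative values:

```python
import numpy as np
import pandas as pd

# Illustrative pivot result: the second row has no ICU reading.
wide = pd.DataFrame({
    "Daily ICU occupancy": [174.0, np.nan],
    "Daily hospital occupancy": [1225.0, 1177.0],
})

# Nullable Int64 keeps the gap as <NA> instead of raising or forcing floats.
as_int = wide.astype("Int64")
print(as_int.dtypes)
print(as_int)
```

The cast only succeeds when every non-missing float is a whole number, so averaged duplicates with fractional results would raise an error here.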

In [98]:
factDailyHospitalAdmissions = factDailyHospitalAdmissions.rename(
    columns={
        'date_key': 'date_id',
        'Daily ICU occupancy': 'icu_occupancy',
        'Daily hospital occupancy': 'hosp_occupancy',
    }
)

In [99]:
factDailyHospitalAdmissions.head()

indicator,fact_daily_hosp_id,country_id,date_id,source,icu_occupancy,hosp_occupancy
0.0,0.0,AT,20201025.0,Country_Website,174.0,1225.0
1.0,1.0,AT,20201024.0,Country_Website,175.0,1177.0
2.0,2.0,AT,20201023.0,Country_Website,158.0,1058.0
3.0,3.0,AT,20201022.0,Country_Website,161.0,1002.0
4.0,4.0,AT,20201021.0,Country_Website,147.0,960.0


In [None]:
factDailyHospitalAdmissions.to_csv('/home/jovyan/datawarehouse/core_layer/processed/factDailyHospitalAdmissions.csv', index=False)
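
Before the file is written, a quick grain check can catch duplicate or missing keys. The sketch below assumes the fact table is meant to hold one row per `country_id`, `date_id` and `source`, which is the pivot index used above; the sample rows are retyped from the displayed output.

```python
import pandas as pd

# First rows of the displayed output, retyped by hand as sample data.
fact = pd.DataFrame({
    "country_id": ["AT", "AT", "AT"],
    "date_id": [20201025, 20201024, 20201023],
    "source": ["Country_Website"] * 3,
    "icu_occupancy": [174, 175, 158],
    "hosp_occupancy": [1225, 1177, 1058],
})

grain = ["country_id", "date_id", "source"]  # assumed grain of the fact table
n_duplicates = int(fact.duplicated(subset=grain).sum())
n_missing_keys = int(fact[grain].isna().any(axis=1).sum())
print(n_duplicates, n_missing_keys)
```

Both counts should be zero for a table that respects this grain; a non-zero value points back to the merge or pivot steps.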