# 05d – Data Cleansing: Testing

**Author:** Roberto Chiaiese  
**Project:** CovidReporting Data Warehouse  

## Overview
This notebook transforms the COVID-19 testing dataset into the analytical fact table **factTesting**.  
It includes test counts, new cases, positivity rate, and testing rate by country and week.
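
The sample rows in this notebook are consistent with `positivity_rate` being new cases per 100 tests and `testing_rate` being tests per 100,000 inhabitants. The sketch below recomputes both rates from the raw counts under that assumption. The function name `compute_rates` and the `*_check` column names are illustrative and are not part of the notebook.

```python
import pandas as pd

def compute_rates(frame: pd.DataFrame) -> pd.DataFrame:
    """Recompute both rates from raw counts (formulas inferred from the sample rows)."""
    out = frame.copy()
    # Treat zero tests or zero population as missing to avoid division by zero.
    tests = out["tests_done"].where(out["tests_done"] > 0)
    population = out["population"].where(out["population"] > 0)
    out["positivity_rate_check"] = 100 * out["new_cases"] / tests
    out["testing_rate_check"] = 100_000 * out["tests_done"] / population
    return out

# First two Austrian rows from the staging file shown in this notebook.
sample = pd.DataFrame({
    "new_cases": [2041, 855],
    "tests_done": [12339, 58488],
    "population": [8858775, 8858775],
})
rates = compute_rates(sample)
print(rates[["positivity_rate_check", "testing_rate_check"]].round(6))
```

Comparing the recomputed columns with the reported ones is a cheap way to spot rows where the source rates and the raw counts disagree.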

## Steps Performed
1. **Load data** from `staging.testing`.  
2. **Clean missing or invalid records** and convert numeric fields.  
3. **Normalize reporting week** and link it to the date dimension (`dimDate`).  
4. **Map country codes** to `dimCountry`.  
5. **Select relevant columns** for analytical purposes.  
6. **Save final cleansed data** to populate the **factTesting** table.
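
Step 3 can be illustrated with a small sketch that turns a week label such as `2020-W15` into a calendar date. It assumes the labels follow ISO 8601 week numbering (weeks start on Monday); the helper name `week_start` is hypothetical.

```python
from datetime import date

import pandas as pd

def week_start(year_week: str) -> date:
    """Return the Monday of an ISO week label such as '2020-W15'."""
    year_part, week_part = year_week.split("-W")
    # date.fromisocalendar requires Python 3.8 or later.
    return date.fromisocalendar(int(year_part), int(week_part), 1)

weeks = pd.Series(["2020-W15", "2020-W16"])
week_start_dates = pd.to_datetime(weeks.map(week_start))
print(week_start_dates.dt.strftime("%Y-%m-%d").tolist())
# ['2020-04-06', '2020-04-13']
```

A date derived this way can then be matched against the date dimension, whichever key format `dimDate` actually uses.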

## Output Table
**factTesting**
- `test_id`, `country_key`, `date_id`, `new_cases`, `tests_done`, `testing_rate`, `positivity_rate`, `source`


In [1]:
import pandas as pd
import numpy as np

In [2]:
# read the testing CSV from the staging layer and create a DataFrame
df = pd.read_csv('/home/jovyan/datawarehouse/staging_layer/dataset/testing.csv')
df.head()

Unnamed: 0,country,country_code,year_week,new_cases,tests_done,population,testing_rate,positivity_rate,testing_data_source
0,Austria,AT,2020-W15,2041,12339,8858775,139.285624,16.541049,Manual webscraping
1,Austria,AT,2020-W16,855,58488,8858775,660.226724,1.461838,Manual webscraping
2,Austria,AT,2020-W17,472,33443,8858775,377.512692,1.411357,Manual webscraping
3,Austria,AT,2020-W18,336,26598,8858775,300.244673,1.263253,Country website
4,Austria,AT,2020-W19,307,42153,8858775,475.833284,0.728299,Country website
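
The cleaning step listed under "Steps Performed" (missing or invalid records, numeric conversion) has no dedicated cell in this notebook. A minimal sketch of what it could look like follows. The function `clean_testing`, the rule that negative counts are invalid, and the demo rows are all assumptions made for illustration.

```python
import pandas as pd

NUMERIC_COLUMNS = ["new_cases", "tests_done", "testing_rate", "positivity_rate"]

def clean_testing(frame: pd.DataFrame) -> pd.DataFrame:
    """Coerce numeric fields and drop rows that cannot be keyed (illustrative rules)."""
    out = frame.copy()
    for column in NUMERIC_COLUMNS:
        # Values that cannot be parsed become NaN instead of raising.
        out[column] = pd.to_numeric(out[column], errors="coerce")
    # A row without a country or a reporting week cannot be linked to a dimension.
    out = out.dropna(subset=["country", "year_week"])
    # Assumption: negative counts are invalid and are replaced with NaN.
    count_columns = ["new_cases", "tests_done"]
    out[count_columns] = out[count_columns].mask(out[count_columns] < 0)
    return out.reset_index(drop=True)

# Made-up rows that exercise each rule; they are not taken from the dataset.
raw = pd.DataFrame({
    "country": ["Austria", "Austria", None],
    "year_week": ["2020-W15", "2020-W16", "2020-W17"],
    "new_cases": ["2041", "n/a", "472"],
    "tests_done": [12339, -5, 33443],
    "testing_rate": [139.285624, 660.226724, 377.512692],
    "positivity_rate": [16.541049, 1.461838, 1.411357],
})
cleaned = clean_testing(raw)
print(cleaned)
```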


In [3]:
lookup_country = pd.read_csv('/home/jovyan/datawarehouse/lookup/country_lookup.csv')

In [4]:
# merge the testing data with the country lookup to add the 2- and 3-letter country codes

df_merged = pd.merge(df, lookup_country, on='country', how='left')
df_merged.head()

Unnamed: 0,country,country_code,year_week,new_cases,tests_done,population_x,testing_rate,positivity_rate,testing_data_source,country_code_2_digit,country_code_3_digit,continent,population_y
0,Austria,AT,2020-W15,2041,12339,8858775,139.285624,16.541049,Manual webscraping,AT,AUT,Europe,8858775
1,Austria,AT,2020-W16,855,58488,8858775,660.226724,1.461838,Manual webscraping,AT,AUT,Europe,8858775
2,Austria,AT,2020-W17,472,33443,8858775,377.512692,1.411357,Manual webscraping,AT,AUT,Europe,8858775
3,Austria,AT,2020-W18,336,26598,8858775,300.244673,1.263253,Country website,AT,AUT,Europe,8858775
4,Austria,AT,2020-W19,307,42153,8858775,475.833284,0.728299,Country website,AT,AUT,Europe,8858775
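
A left merge silently keeps rows whose country has no match in the lookup. The sketch below uses the `validate` and `indicator` parameters of `pd.merge` to surface such rows. The two small frames are stand-ins, and "Atlantis" is a made-up name used only to force a miss.

```python
import pandas as pd

testing = pd.DataFrame({
    "country": ["Austria", "Austria", "Atlantis"],
    "year_week": ["2020-W15", "2020-W16", "2020-W15"],
})
lookup = pd.DataFrame({
    "country": ["Austria"],
    "country_code_2_digit": ["AT"],
    "country_code_3_digit": ["AUT"],
})

checked = pd.merge(
    testing,
    lookup,
    on="country",
    how="left",
    validate="many_to_one",  # raises MergeError if the lookup lists a country twice
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "country"].unique())
print(unmatched)
# ['Atlantis']
```

Passing only the needed lookup columns to the merge would also avoid the `population_x` / `population_y` suffixes seen in the output above.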


In [5]:

# keep the lookup codes and the testing measures; drop the duplicated population_y
df_merged = df_merged[[
    'country', 'country_code_2_digit', 'country_code_3_digit',
    'year_week', 'new_cases', 'tests_done', 'population_x',
    'testing_rate', 'positivity_rate', 'testing_data_source',
]]

In [6]:
df_merged.head()

Unnamed: 0,country,country_code_2_digit,country_code_3_digit,year_week,new_cases,tests_done,population_x,testing_rate,positivity_rate,testing_data_source
0,Austria,AT,AUT,2020-W15,2041,12339,8858775,139.285624,16.541049,Manual webscraping
1,Austria,AT,AUT,2020-W16,855,58488,8858775,660.226724,1.461838,Manual webscraping
2,Austria,AT,AUT,2020-W17,472,33443,8858775,377.512692,1.411357,Manual webscraping
3,Austria,AT,AUT,2020-W18,336,26598,8858775,300.244673,1.263253,Country website
4,Austria,AT,AUT,2020-W19,307,42153,8858775,475.833284,0.728299,Country website


In [7]:
# rename the columns
df_merged = df_merged.rename(columns={'year_week':'reported_year_week','population_x':'population'})

In [8]:
df_merged.head()

Unnamed: 0,country,country_code_2_digit,country_code_3_digit,reported_year_week,new_cases,tests_done,population,testing_rate,positivity_rate,testing_data_source
0,Austria,AT,AUT,2020-W15,2041,12339,8858775,139.285624,16.541049,Manual webscraping
1,Austria,AT,AUT,2020-W16,855,58488,8858775,660.226724,1.461838,Manual webscraping
2,Austria,AT,AUT,2020-W17,472,33443,8858775,377.512692,1.411357,Manual webscraping
3,Austria,AT,AUT,2020-W18,336,26598,8858775,300.244673,1.263253,Country website
4,Austria,AT,AUT,2020-W19,307,42153,8858775,475.833284,0.728299,Country website


In [None]:
# drop the country, country_code_3_digit and 'Unnamed: 0' columns, then rename the index to test_id and country_code_2_digit to country_id
df_merged = df_merged.drop(columns=['country','country_code_3_digit','Unnamed: 0']).reset_index().rename(columns={'country_code_2_digit':'country_id', 'index':'test_id'})

In [None]:
# split the reporting week (e.g. '2020-W15') into year and week of year;
# the renamed column lives in df_merged, not in the original df
df_merged['year'] = df_merged['reported_year_week'].str[0:4].astype("Int64")
df_merged['week_of_year'] = df_merged['reported_year_week'].str[6:8].astype("Int64")
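# NOTE: `date_id` is not created in any cell shown here; it is expected to come
# from a join with dimDate on year and week_of_year.
# Keep the row with the earliest date_id for each record, then build the
# testing id from the country code plus the date id.
new_df = df_merged.sort_values('date_id').drop_duplicates('test_id').copy()
new_df['test_id'] = new_df['country_id'] + new_df['date_id'].astype('str')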
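
The join that supplies `date_id` is not shown in this notebook. The sketch below illustrates one way it could work, under loud assumptions: `dim_date` is a hypothetical daily calendar built in memory, its `date_id` is assumed to be an integer in `YYYYMMDD` form, and the first day of each ISO week is taken as the key for that week. The real `dimDate` table may differ on all three points.

```python
import pandas as pd

# Hypothetical stand-in for dimDate: one row per calendar day.
calendar = pd.DataFrame({"date": pd.date_range("2020-04-06", "2020-04-19", freq="D")})
iso = calendar["date"].dt.isocalendar()
dim_date = pd.DataFrame({
    "date_id": calendar["date"].dt.strftime("%Y%m%d").astype("int64"),
    "year": iso["year"].astype("int64"),
    "week_of_year": iso["week"].astype("int64"),
})

# One key per ISO week: the earliest date_id in that week.
week_keys = dim_date.groupby(["year", "week_of_year"], as_index=False)["date_id"].min()

testing = pd.DataFrame({
    "country_id": ["AT", "AT"],
    "reported_year_week": ["2020-W15", "2020-W16"],
})
testing["year"] = testing["reported_year_week"].str[0:4].astype("int64")
testing["week_of_year"] = testing["reported_year_week"].str[6:8].astype("int64")

# Attach the weekly key, then build the composite testing id.
fact = testing.merge(week_keys, on=["year", "week_of_year"], how="left")
fact["test_id"] = fact["country_id"] + fact["date_id"].astype("str")
print(fact[["test_id", "country_id", "date_id"]])
```

Reducing the calendar to one row per week before the merge keeps one fact row per country and week, so no deduplication is needed afterwards.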

In [72]:
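# select the fact table columns from new_df, which holds the composite test_id;
# the week label was renamed to reported_year_week earlier in the notebook
new_df = new_df[['test_id','country_id','date_id','new_cases','tests_done','testing_rate', 'positivity_rate','year','week_of_year', 'reported_year_week','testing_data_source']]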

In [85]:
new_df.to_csv('/home/jovyan/datawarehouse/core_layer/processed/factTesting.csv', index=False)
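
Before writing the file, it can help to confirm that the key column is usable as a primary key. The following is a minimal sketch of such a check. The function `check_fact_table` and the `REQUIRED_COLUMNS` list are hypothetical, and the demo frame is made up to trigger a failure.

```python
import pandas as pd

REQUIRED_COLUMNS = ["test_id", "country_id", "date_id", "new_cases", "tests_done"]

def check_fact_table(frame: pd.DataFrame, key: str = "test_id") -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    missing = [column for column in REQUIRED_COLUMNS if column not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    if frame[key].isna().any():
        problems.append(f"{key} contains missing values")
    if frame[key].duplicated().any():
        problems.append(f"{key} is not unique")
    return problems

# Made-up frame with a duplicated key.
demo = pd.DataFrame({
    "test_id": ["AT20200406", "AT20200406"],
    "country_id": ["AT", "AT"],
    "date_id": [20200406, 20200406],
    "new_cases": [2041, 2041],
    "tests_done": [12339, 12339],
})
problems = check_fact_table(demo)
print(problems)
# ['test_id is not unique']
```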