# Introduction

This Kernel shows how to process the raw data so that it is easier to work with in later analysis. 
It also generates the transformed dataset from the original file. 
Let's check the data files.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Analysis preparation

## Load packages

In [None]:
import pandas as pd

## Load the data

The data files are in TSV format. We read them with pandas, passing `sep='\t'` to `read_csv` so that tabs are used as the field separator.
We first demonstrate how to read and process the `demo_mexrt.tsv` file.

In [None]:
data_df = pd.read_csv("/kaggle/input/excess-mortality-in-europe-in-20202021/demo_mexrt.tsv", sep='\t')
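
To illustrate what reading a tab-separated file does, here is a minimal sketch on an in-memory sample. The header and the two rows are hypothetical values made up for illustration; they only mimic a composite first column followed by year/month columns and are not taken from the real file.

```python
import io
import pandas as pd

# Hypothetical sample (not real data): a composite first column,
# then one column per year/month label.
sample_tsv = (
    "unit,geo\\time\t2021M02 \t2021M01 \n"
    "NAC,AT\t3.1 p\t10.5 \n"
    "NAC,BE\t: \t-2.0 \n"
)

# sep="\t" tells pandas to split fields on tabs instead of commas,
# so the comma inside the first column is left untouched.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)
print(list(sample_df.columns))
```

Because the separator is a tab, the comma-joined first column arrives as a single string column and has to be split in a later step.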

Let's glimpse the data columns.

In [None]:
print(list(data_df.columns))

The first column is a composite one that holds two pieces of information (the unit and the geography). The remaining columns hold the year/month values, from the latest available month in 2021 back to the first month of 2020.

# Data pre-processing

We start by storing the name of the composite first column and the labels of the time columns.

In [None]:
pivot_data_col = data_df.columns[0]
time_columns = data_df.columns[1:]

Then, we split `pivot_data_col` into its two separate fields:
* unit (NAC only);
* geography.

In [None]:
data_df['unit']     = data_df[pivot_data_col].apply(lambda x: x.split(",")[0])
data_df['country'] = data_df[pivot_data_col].apply(lambda x: x.split(",")[1])
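
As an alternative to two `apply` calls, the same split can be sketched with the vectorized string accessor. This is only an illustration on a toy frame; the column name `unit,geo` and the values are hypothetical stand-ins for the composite column.

```python
import pandas as pd

# Toy frame; "unit,geo" is a hypothetical name for the composite column.
toy_df = pd.DataFrame({"unit,geo": ["NAC,AT", "NAC,BE"]})

# expand=True returns a DataFrame with one column per piece of the split.
parts = toy_df["unit,geo"].str.split(",", expand=True)

# Strip stray whitespace so later joins and filters match exactly.
toy_df["unit"] = parts[0].str.strip()
toy_df["country"] = parts[1].str.strip()

print(toy_df[["unit", "country"]])
```

Splitting once and assigning both pieces avoids parsing every string twice.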

We now keep only the new columns obtained by splitting `pivot_data_col`, together with the time columns.

In [None]:
selected_columns = ['unit', 'country'] + list(time_columns)
data_df = data_df[selected_columns]

Next, we unpivot the time columns with the pandas `melt` operation.  
We also convert `date` to a `datetime` set to the first day of each month (the column labels are year/month values).  
We convert `value` to a float, after removing the `p` flag and replacing ": " (used for N/A) with `NAN`.

In [None]:
data_tr_df = data_df.melt(id_vars=['unit', 'country'], 
        var_name="date", 
        value_name="value")
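# Remove the "p" flag attached to some values.
data_tr_df['value'] = data_tr_df['value'].apply(lambda x: str(x).replace("p", ""))
# ": " marks a missing value; "NAN" is a string that float() parses as NaN.
data_tr_df['value'] = data_tr_df['value'].apply(lambda x: str(x).replace(": ", "NAN"))
# float() ignores the surrounding whitespace left in the strings.
data_tr_df['value'] = data_tr_df['value'].apply(lambda x: float(x))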
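
The wide-to-long reshaping and the numeric cleanup can also be sketched with `pd.to_numeric`. The toy frame below uses hypothetical values, and `errors="coerce"` is a swapped-in technique, not the approach used in the cell above: it turns every unparseable token into NaN, not only the ":" marker.

```python
import pandas as pd

# Hypothetical wide frame: one column per year/month label.
wide_df = pd.DataFrame({
    "unit": ["NAC", "NAC"],
    "country": ["AT", "BE"],
    "2021M02 ": ["3.1 p", ": "],
    "2021M01 ": ["10.5 ", "-2.0 "],
})

# melt stacks the time columns into (date, value) pairs.
long_df = wide_df.melt(id_vars=["unit", "country"],
                       var_name="date",
                       value_name="value")

# Drop the "p" flag and surrounding whitespace, then coerce to float.
cleaned = long_df["value"].astype(str).str.replace("p", "", regex=False).str.strip()
long_df["value"] = pd.to_numeric(cleaned, errors="coerce")

print(long_df)
```

The trade-off is that coercion hides unexpected flags silently, so it is worth counting the NaN values before and after if the raw file may contain other markers.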

In [None]:
from datetime import datetime

def strip_date(date_string, test=False):
    # Column labels look like '2021M06 ': four-digit year, 'M', two-digit month.
    year, month = int(date_string[0:4]), int(date_string[5:7])
    if test:
        print(f"From: {date_string} -> Year: {year}, Month: {month}")
    try:
        # Represent each month by its first day.
        return datetime(year, month, 1)
    except ValueError:
        # Raised for impossible values such as month 13.
        print("Error, wrong data: ", year, month)
        return None


print("Tests:")
strip_date('2021M06 ', test=True)
strip_date('2019M13 ', test=True)  # invalid month: prints the error and returns None
strip_date('1971M01 ', test=True)

print("\nFull data processing...\n")
data_tr_df['date'] = data_tr_df['date'].apply(strip_date)
print("done.")
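
For larger tables, the same conversion can be sketched without a Python-level loop by using `pd.to_datetime` with an explicit format. This is an alternative under the assumption that every label follows the `YYYYMmm` pattern seen in the tests above; invalid labels become `NaT` instead of `None`.

```python
import pandas as pd

# The same three labels used in the tests above.
labels = pd.Series(["2021M06 ", "2019M13 ", "1971M01 "])

# %Y = four-digit year, literal "M", %m = month number.
# errors="coerce" maps labels that do not parse (month 13) to NaT.
parsed = pd.to_datetime(labels.str.strip(), format="%YM%m", errors="coerce")

print(parsed)
```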

Let's inspect the result.

In [None]:
print(f"Transformed data shape: {data_tr_df.shape} (rows/columns)")
data_tr_df.head()

In [None]:
data_tr_df.tail()

# A very preliminary exploratory data analysis

This is a very short exploratory data analysis. The purpose of this Kernel is only to show how to prepare the data for analysis, which we have already done.

In [None]:
import pandas_profiling
pandas_profiling.ProfileReport(data_tr_df)
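
If a full profiling report is more than needed, a few plain pandas calls give a quick first look. The sketch below runs on a hypothetical toy frame with the same `country` and `value` columns; the numbers are made up.

```python
import pandas as pd

# Hypothetical long-format sample with one missing value.
toy = pd.DataFrame({
    "country": ["AT", "AT", "BE", "BE"],
    "value": [3.1, 10.5, float("nan"), -2.0],
})

# Share of missing observations in the value column.
missing_share = toy["value"].isna().mean()

# Number of non-missing observations and their mean, per country.
per_country = toy.groupby("country")["value"].agg(["count", "mean"])

print(missing_share)
print(per_country)
```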

# Save transformed data

In [None]:
data_tr_df.to_csv("excess_mortality_eu.csv", index=False)
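
When the saved file is read back, the `date` column comes in as text unless it is parsed explicitly. A minimal round-trip sketch, using an in-memory buffer and a hypothetical one-row frame in place of the real output file:

```python
import io
import pandas as pd

# Hypothetical one-row frame with the same columns as the transformed data.
out_df = pd.DataFrame({
    "unit": ["NAC"],
    "country": ["AT"],
    "date": [pd.Timestamp(2021, 6, 1)],
    "value": [3.1],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
out_df.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime type of the date column.
back_df = pd.read_csv(buffer, parse_dates=["date"])

print(back_df.dtypes)
```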