# Introduction

This Kernel shows how to pre-process the data so that it is easier to analyze. Let's first check the data files.

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Analysis preparation

## Load packages

In [None]:
import pandas as pd

## Load the data

The data files are in TSV format. We read them with pandas, passing `sep='\t'` to `read_csv` so that fields are split on tabs.
We first demonstrate how to read and process the annual data.

In [None]:
data_df = pd.read_csv("/kaggle/input/access-to-education-of-disabled-people-in-europe/hlth_de040.tsv", sep='\t')
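
As a minimal, self-contained sketch of what `sep='\t'` does, the snippet below parses a tiny tab-separated sample from memory. The header text, sample rows, country codes, and values are invented for illustration and are assumptions about the layout, not content taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical two-row sample that imitates the layout of the TSV file:
# a composite key in the first column, then one column per country.
sample_tsv = (
    "unit,isced97,hlth_pb,age,sex,time\\geo\tAT\tBE\n"
    "PC,ED0-2,PB1040,Y15-24,F,2011\t10.5\t: \n"
    "PC,ED0-2,PB1040,Y15-24,M,2011\t12.1 u\t9.8\n"
)

# sep="\t" tells read_csv to split fields on tabs rather than commas,
# so the commas inside the composite key stay in a single column.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(sample_df.shape)
print(list(sample_df.columns))
```

Because the key itself contains commas, reading such a file with the default comma separator would split it in the wrong places.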

Let's glimpse the data columns.

In [None]:
print(list(data_df.columns))

Let's also take a look at a few of the rows.

In [None]:
data_df.head()

In [None]:
data_df.tail()

The first column is a composite one, containing 6 different pieces of information (unit, ISCED-97 education indicator, health problems, age, sex, time in years). The remaining columns hold the values for each country.

# Data pre-processing

We start by defining two working variables: the name of the composite column and the list of country columns.

In [None]:
pivot_data_col = data_df.columns[0]
geo_columns = data_df.columns[1:]

Then, we split `pivot_data_col` into its 6 separate fields:
* unit;
* isced97 - ISCED-97 educational attainment level classification;
* hlth_pb - health problem;
* age - this is actually an age group;
* sex (F, M or T);
* time - a single year, 2011.

In [None]:
# Split the composite key once, then assign each part to its own column.
key_parts = data_df[pivot_data_col].str.split(",")
for i, field in enumerate(['unit', 'isced97', 'hlth_pb', 'age', 'sex', 'time']):
    data_df[field] = key_parts.str[i]
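
The same split can also be sketched with `str.split(..., expand=True)`, which returns all parts as columns in one call. The keys below are hypothetical examples in the comma-separated layout described above; this is an alternative illustration, not the approach used in this Kernel.

```python
import pandas as pd

# Hypothetical composite keys in the same comma-separated layout.
keys = pd.Series([
    "PC,ED0-2,PB1040,Y15-24,F,2011",
    "PC,ED3_4,NPB1040,Y25-34,M,2011",
])
field_names = ["unit", "isced97", "hlth_pb", "age", "sex", "time"]

# expand=True returns one column per comma-separated part.
parts_df = keys.str.split(",", expand=True)
parts_df.columns = field_names
print(parts_df)
```

This variant assumes every key has exactly six parts; a key with a different number of parts would make the column assignment fail or leave missing cells.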

We now select only the new columns that resulted from splitting `pivot_data_col`, together with the country columns.

In [None]:
selected_columns = ['unit', 'isced97', 'hlth_pb', 'sex', 'age', 'time'] + list(geo_columns)
data_df = data_df[selected_columns]

Next, we unpivot the country columns into rows using the `melt` operation in pandas.  
We also make sure `geo` is a string.  
We set `value` to be a float, after removing the flag letters attached to some values and replacing ": " (for N/A) with `NAN`.

In [None]:
data_tr_df = data_df.melt(id_vars=['unit', 'isced97', 'hlth_pb', 'sex', 'age', 'time'],
        var_name="geo",
        value_name="value")
data_tr_df['geo'] = data_tr_df['geo'].astype(str)
# Remove the flag letters (bp, b, u, c, d) attached to some values.
for flag in ["bp", "b", "u", "c", "d"]:
    data_tr_df['value'] = data_tr_df['value'].apply(lambda x: str(x).replace(flag, ""))
# ": " marks a missing value; float("NAN") parses as NaN.
data_tr_df['value'] = data_tr_df['value'].apply(lambda x: float(str(x).replace(": ", "NAN")))
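
To make the reshaping and cleaning steps easier to follow, here is a small sketch on invented data. It uses a regular expression together with `pd.to_numeric(..., errors="coerce")` instead of the chained replacements; that is a swapped-in technique, and the table contents are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per key, one column per country.
wide_df = pd.DataFrame({
    "sex": ["F", "M"],
    "AT": ["10.5 ", "12.1 u"],
    "BE": [": ", "9.8 b"],
})

# melt turns the country columns into (geo, value) pairs.
long_df = wide_df.melt(id_vars=["sex"], var_name="geo", value_name="value")

# Keep only digits, the decimal point, and a minus sign, then let
# to_numeric map anything left unparsable (such as "") to NaN.
cleaned = long_df["value"].str.replace(r"[^0-9.\-]", "", regex=True)
long_df["value"] = pd.to_numeric(cleaned, errors="coerce")
print(long_df)
```

The regular expression drops any flag letter, not only the ones listed explicitly, so it would silently discard unexpected characters as well. Whether that is desirable depends on how strictly the input should be validated.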

Let's inspect the result.

In [None]:
print(f"Transformed data shape: {data_tr_df.shape} (rows/columns)")
data_tr_df.head()

In [None]:
data_tr_df.tail()

# A very preliminary exploratory data analysis

This is a very short exploratory data analysis. The role of this Kernel is only to show how to prepare the annual data for analysis, which is already done.

In [None]:
import pandas_profiling
pandas_profiling.ProfileReport(data_tr_df)
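
If a full profiling report is more than is needed, a couple of plain pandas calls give a quick first impression. The sketch below runs on a hypothetical long-format table with the same kind of `geo` and `value` columns as `data_tr_df`; the numbers are invented.

```python
import pandas as pd

# Hypothetical long-format data in the same shape as data_tr_df.
toy_df = pd.DataFrame({
    "geo": ["AT", "AT", "BE", "BE"],
    "sex": ["F", "M", "F", "M"],
    "value": [10.5, 12.1, float("nan"), 9.8],
})

# Share of missing values per country.
missing_share = toy_df["value"].isna().groupby(toy_df["geo"]).mean()
print(missing_share)

# Basic descriptive statistics of the numeric column (NaN is skipped).
print(toy_df["value"].describe())
```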

# Export data in CSV format

In [None]:
data_tr_df.to_csv("education_disbled_eu.csv", index=False)
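
A quick way to check an export like this is to read it back and compare. The sketch below does so on a hypothetical three-column table, writing to an in-memory buffer rather than a file so that it stays self-contained.

```python
import io
import pandas as pd

# Hypothetical long-format table with one missing value.
toy_df = pd.DataFrame({
    "geo": ["AT", "BE"],
    "time": [2011, 2011],
    "value": [10.5, float("nan")],
})

# Write to an in-memory buffer instead of a file.
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# NaN is written as an empty field and read back as NaN.
restored_df = pd.read_csv(buffer)
print(restored_df)
```

Passing `index=False` omits the row index, so it does not come back as an extra unnamed column when the file is read again.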