# Introduction

This Kernel shows how to reshape the raw data into a form that is easier to analyze. Let's start by checking the data files.

In [1]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Analysis preparation

## Load packages

In [1]:
import pandas as pd

## Load the data

The data files are in TSV format. We read them with pandas, passing `sep='\t'` to `read_csv` to indicate the tab separator.
We first demonstrate how to read and process the annual data.

In [1]:
data_df = pd.read_csv("/kaggle/input/production-in-industry-annual-data/sts_inpr_a.tsv", sep='\t')
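
To see what `sep='\t'` does without the Kaggle input files, here is a minimal sketch that reads a small in-memory TSV. The sample rows, their values, and the names `sample_tsv` and `sample_df` are made up for illustration; only the layout (one comma-separated key column followed by year columns) follows the description in this Kernel.

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout of the annual file:
# one comma-separated key column, then one column per year.
sample_tsv = (
    "indic_bt,nace_r2,s_adj,unit,geo\\time\t2019 \t2018 \n"
    "PROD,B,CA,I15,AT\t105.3 p\t101.2 \n"
    "PROD,B,CA,I15,BE\t: \t98.7 e\n"
)

# sep="\t" tells read_csv to split fields on tabs instead of commas,
# so the commas inside the first column are left untouched.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

print(sample_df.shape)
print(list(sample_df.columns))
```

The first column still holds the whole comma-separated key as a single string, which is why a separate splitting step is needed later.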

Let's glimpse the data columns.

In [1]:
print(list(data_df.columns))

Let's also take a look at a few of the rows.

In [1]:
data_df.head()

In [1]:
data_df.tail()

The first column is a composite one: it packs 5 different fields (indic_bt, nace_r2, s_adj, unit, geo\time) into one comma-separated string. The remaining columns hold the values, one column per year.

# Data pre-processing

We start by defining two helper variables: the name of the composite first column and the collection of year columns.

In [1]:
pivot_data_col = data_df.columns[0]
year_columns = data_df.columns[1:]

Then we split `pivot_data_col` into its 5 separate fields:
* indic_bt
* nace_r2
* s_adj
* unit
* geo

In [1]:
# Split the composite column once and assign the 5 resulting fields.
data_df[['indic_bt', 'nace_r2', 's_adj', 'unit', 'geo']] = (
    data_df[pivot_data_col].str.split(",", expand=True)
)
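
The splitting step can be tried on a toy frame. This is a minimal sketch under stated assumptions: the keys, the `toy_df` frame, and the `field_names` list are hypothetical, and the comma-count check is an extra safeguard that is not part of the original Kernel.

```python
import pandas as pd

# Hypothetical composite keys in the "indic_bt,nace_r2,s_adj,unit,geo" layout.
toy_df = pd.DataFrame({"key": ["PROD,B,CA,I15,AT", "PROD,C,NSA,PCH_SM,BE"]})
field_names = ["indic_bt", "nace_r2", "s_adj", "unit", "geo"]

# Sanity check: every key must contain exactly 4 commas (5 fields).
assert (toy_df["key"].str.count(",") == len(field_names) - 1).all()

# expand=True returns one column per field instead of a list per row.
parts = toy_df["key"].str.split(",", expand=True)
parts.columns = field_names
toy_df = pd.concat([toy_df, parts], axis=1)

print(toy_df[field_names])
```

Checking the number of separators first makes a malformed key fail loudly instead of silently shifting fields into the wrong columns.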

We now keep only the new columns obtained by splitting `pivot_data_col`, together with the year columns.

In [1]:
selected_columns = ['indic_bt', 'nace_r2', 's_adj', 'unit', 'geo'] + list(year_columns)
data_df = data_df[selected_columns]

Next, we unpivot the year columns into rows using the `melt` operation in pandas.  
We also convert `year` to an integer.  
We convert `value` to a float, after stripping the single-letter flags (e, b, u, c, d, z, p, s) attached to some values and replacing ": " (the N/A marker) with `NAN`.

In [1]:
data_tr_df = data_df.melt(id_vars=['indic_bt', 'nace_r2', 's_adj', 'unit', 'geo'],
        var_name="year",
        value_name="value")
data_tr_df['geo'] = data_tr_df['geo'].astype(str)
# Year headers may carry surrounding whitespace; strip it before converting to int.
data_tr_df['year'] = data_tr_df['year'].astype(str).str.strip().astype(int)
# Remove the single-letter flags (e, b, u, c, d, z, p, s) in one pass.
data_tr_df['value'] = data_tr_df['value'].astype(str).str.replace(r"[ebucdzps]", "", regex=True)
# ": " marks a missing value; "NAN" is parsed as a float NaN.
data_tr_df['value'] = data_tr_df['value'].str.replace(": ", "NAN", regex=False)
data_tr_df['value'] = data_tr_df['value'].astype(float)
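
The wide-to-long reshaping and the value cleaning can be sketched on a toy table. Note that this sketch swaps in a different cleaning technique from the cell above: instead of deleting flag letters one by one, it extracts the leading number with a regular expression and lets everything else become NaN. The data, the names `wide_df` and `long_df`, and the assumption that year headers carry trailing spaces are all illustrative.

```python
import pandas as pd

# Hypothetical wide table: one row per key, one column per year.
wide_df = pd.DataFrame({
    "geo": ["AT", "BE"],
    "2019 ": ["105.3 p", ": "],
    "2018 ": ["101.2 ", "98.7 e"],
})

# melt turns each (row, year column) pair into its own row.
long_df = wide_df.melt(id_vars=["geo"], var_name="year", value_name="value")

# Year headers are assumed to carry trailing spaces; strip before converting.
long_df["year"] = long_df["year"].str.strip().astype(int)

# Keep only the leading number; cells without one (such as ": ") become NaN.
number = long_df["value"].str.extract(r"^\s*(-?\d+(?:\.\d+)?)", expand=False)
long_df["value"] = pd.to_numeric(number, errors="coerce")

print(long_df)
```

Extracting the number avoids maintaining a list of flag letters, at the cost of silently turning any unexpected cell content into NaN.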

Let's inspect the result.

In [1]:
print(f"Transformed data shape: {data_tr_df.shape} (rows/columns)")
data_tr_df.head()

In [1]:
data_tr_df.tail()

# A very preliminary exploratory data analysis

This is a very short exploratory data analysis. The purpose of this Kernel is only to show how to prepare the annual data for analysis, which we have already done.

In [1]:
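# Note: newer releases of this package are published as `ydata-profiling`
# (import name `ydata_profiling`); adjust the import if `pandas_profiling` is unavailable.
import pandas_profiling
pandas_profiling.ProfileReport(data_tr_df)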
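
As a lighter-weight complement to the profiling report, a few plain pandas calls already give a first impression of the long-format data. The sketch below runs on a made-up frame (`toy_long_df`) with the same `geo`, `year`, and `value` columns; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data with the same columns as the transformed frame.
toy_long_df = pd.DataFrame({
    "geo": ["AT", "BE", "AT", "BE"],
    "year": [2019, 2019, 2018, 2018],
    "value": [105.3, np.nan, 101.2, 98.7],
})

# Share of missing values per year.
missing_share = toy_long_df["value"].isna().groupby(toy_long_df["year"]).mean()

# Basic distribution statistics of the non-missing values.
summary = toy_long_df["value"].describe()

print(missing_share)
print(summary)
```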

# Export data in CSV format

In [1]:
data_tr_df.to_csv("eu_production_in_industry_annual_data.csv", index=False)
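
To check how the exported file behaves when read back, the following sketch writes a made-up frame to an in-memory buffer rather than to disk. The frame and the names `toy_long_df`, `csv_text`, and `restored_df` are hypothetical; the `to_csv` call uses the same `index=False` argument as the cell above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical long-format data, including one missing value.
toy_long_df = pd.DataFrame({
    "geo": ["AT", "BE"],
    "year": [2019, 2019],
    "value": [105.3, np.nan],
})

# Write to an in-memory buffer; index=False omits the row index column.
buffer = io.StringIO()
toy_long_df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# NaN is written as an empty field and is read back as NaN.
restored_df = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored_df.dtypes)
```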