# Understanding unemployment in Europe
In this notebook I segment unemployment data from various European countries in order to understand which factors influence the unemployment rate.

Before we begin modelling, there are a few things to understand about the dataset (https://ec.europa.eu/eurostat/cache/metadata/en/une_rt_m_esms.htm#unit_measure1587452538133).

The *units* are:

- PC_ACT -> Percentage of the active population
- THS_PER -> Thousand persons

In the column *s_adj* I will keep only the seasonally adjusted data (SA) (see the link above).

In [None]:
import numpy as np 
import pandas as pd 
import plotly.express as px

import ipywidgets as widgets

Load the country codes dataset.

In [None]:
country_codes = pd.read_csv("../input/iso-country-codes-global/wikipedia-iso-country-codes.csv")
country_codes = country_codes.set_index("Alpha-2 code")["English short name lower case"]
country_codes.name = "country"

Split the packed first column of the unemployment data into separate columns and map the country codes to country names.
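
To make the splitting step concrete, here is a minimal sketch on a made-up sample. It assumes the first column packs several comma-separated keys under a comma-separated header, which is what the next cell relies on. The sample rows, the column order and the `split_packed` helper are hypothetical.

```python
# Hypothetical sample mimicking the packed first column of the TSV
header = r"s_adj,age,unit,sex,geo\time"
rows = ["SA,TOTAL,PC_ACT,T,PT", "NSA,Y_LT25,THS_PER,F,ES"]


def split_packed(header, rows):
    """Turn comma-packed keys into one dict per row, keyed by the header names."""
    cols = header.split(",")
    return [dict(zip(cols, row.split(","))) for row in rows]


records = split_packed(header, rows)
print(records[0])
```

Each dict becomes one row of the key columns once it is passed to `pd.DataFrame`.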

In [None]:
df = pd.read_csv("../input/unemployment-in-european-union/une_rt_m.tsv", sep="\t")
# The first column packs several keys separated by commas; its header names them
cols = df.columns[0].split(",")
keys = pd.DataFrame(dict(zip(cols, s.split(","))) for s in df.iloc[:, 0].values)
df = df.merge(keys, left_index=True, right_index=True)
df = df.iloc[:, 1:]  # Drop the packed column
# Map country codes to names; keep the raw code when there is no match
df["country"] = df[r"geo\time"].map(country_codes)
df.loc[df["country"].isna(), "country"] = df.loc[df["country"].isna(), r"geo\time"]
# Keep only the seasonally adjusted series
df = df[df["s_adj"] == "SA"]
df.drop([r"geo\time", "s_adj"], axis=1, inplace=True)

# Sanity check: no NaN values exist before the ":" placeholders are converted
assert not df.isna().any().any(), "Missing values found"

# Replace colons (the placeholder for unavailable data) with NaN
df.replace(r"\s*:\s*", np.nan, inplace=True, regex=True)

print(df.shape)
df.head()

# Turning the Data Frame into a time series
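
The month columns carry labels such as `2020M12`. As a small sketch of the parsing idea, assuming that label format (the sample labels and the `parse_month` helper are made up for illustration), the standard library can read them directly:

```python
from datetime import datetime


def parse_month(label):
    """Parse a month label such as '2020M12' into the first day of that month."""
    # strip() guards against padding spaces around the label
    return datetime.strptime(label.strip(), "%YM%m")


labels = ["2020M12 ", "2021M01"]
dates = [parse_month(s) for s in labels]
print(dates)
```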

In [None]:
ts_df = df.melt(id_vars=["age", "unit", "sex", "country"]).set_index("variable")
ts_df.index = pd.to_datetime([c.replace("M", " ") for c in ts_df.index])

# Changing the values to numeric
For some reason, the value column has the object dtype, and some entries carry a trailing letter, such as "24.4 p". These letters look like Eurostat status flags (for example, "p" for provisional) rather than part of the number, so I will strip them from this column and convert it to numeric.
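
As a quick illustration of the extraction idea, here is a sketch using the standard `re` module on made-up cells. The `to_number` helper is hypothetical, and the pattern is a close variant of the one used in the next cell.

```python
import re

# One run of digits with an optional decimal part
NUMBER = re.compile(r"(\d+\.?\d*)")


def to_number(cell):
    """Return the numeric part of a cell, or None when it contains no digits."""
    match = NUMBER.search(cell)
    return float(match.group(1)) if match else None


samples = ["24.4 p", "7.0", "13 ", ": "]  # made-up cells
values = [to_number(s) for s in samples]
print(values)
```

Cells without digits come back as `None` here; in pandas the same cells end up as NaN.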

In [None]:
# Keep the numeric part only; expand=False returns a Series instead of a DataFrame
ts_df["value"] = ts_df["value"].str.extract(r"(\d+\.?\d*)", expand=False).astype(float)

In [None]:
ts_df_pct = ts_df[ts_df["unit"] == "PC_ACT"].drop("unit", axis=1)  # Percentage Active Population
ts_df_th = ts_df[ts_df["unit"] == "THS_PER"].drop("unit", axis=1)  # Thousands

# Time series visualization
Use the menu to select the country.

The two immediate conclusions from the graphs below are that unemployment is much higher below the age of 25, and that the unemployment rate for women is slightly higher than for men.

In [None]:
def f(country):
    fig=px.line(ts_df_pct[ts_df_pct["country"] == country], y="value", color="sex", facet_col="age", title=f"Unemployment in {country}")
    return fig

w = widgets.interact(f, country=ts_df_pct["country"].unique())

In [None]:
px.box(ts_df_pct, x="age", color="sex", y="value", title="Unemployment Rate")

# Which countries were more affected by Covid?
For this, we quantify the difference between the highest and lowest unemployment rate in 2020 for every country.
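
The measure itself is just a per-group range. A minimal sketch on made-up numbers, assuming one row per month and hypothetical country labels `A` and `B`:

```python
import pandas as pd

# Made-up monthly rates for two hypothetical countries
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "value": [6.0, 9.5, 8.0, 4.0, 4.5, None],
})

# Range (max - min) per country; max() and min() skip missing values
spread = toy.groupby("country")["value"].agg(lambda s: s.max() - s.min())
print(spread)
```

A larger range means the rate moved more during the period, although it does not say in which direction.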

In [None]:
# Keep only the 2020 observations, as described above
covid_2020 = ts_df_pct[ts_df_pct.index.year == 2020]
# Range (max - min) of the rate per age group, sex and country; NaNs are skipped
covid_df = covid_2020.groupby(["age", "sex", "country"])["value"].agg(lambda s: s.max() - s.min()).reset_index()
px.bar(covid_df.sort_values("value"), x="country", y="value", color="sex", facet_col="age", title="Difference between Maximum and Minimum Unemployment Rate per country in 2020")

# Which classes were generally more affected by Covid?
Generally, it seems that men were more affected by Covid than women. Also, younger people were much more affected than older people.

In [None]:
px.box(covid_df, x="age", color="sex", y="value", title="Difference between Maximum and Minimum Unemployment by group in 2020")

# Correlation between unemployment rates (With Slider)
Now I want to check whether the unemployment rates of the different countries are correlated at all.
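
The idea is to reshape the long table so that each country becomes a column and then correlate the columns pairwise. A minimal sketch on made-up data, with hypothetical countries `A` and `B` and a hypothetical `month` column:

```python
import pandas as pd

# Made-up long-format data: one row per (month, country)
long_df = pd.DataFrame({
    "month": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-03-01"] * 2),
    "country": ["A"] * 3 + ["B"] * 3,
    "value": [5.0, 6.0, 7.0, 9.0, 8.0, 7.0],
})

# One row per month, one column per country
wide = long_df.pivot(index="month", columns="country", values="value")
# Pairwise Pearson correlation between the country columns
corr = wide.corr()
print(corr)
```

In this toy case the two series move in exactly opposite directions, so the off-diagonal entries are at the negative extreme.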

In [None]:
min_year = ts_df_pct.index.year.min()
max_year = ts_df_pct.index.year.max()
w = widgets.IntRangeSlider(
    value=[2000, 2020],
    min=min_year,
    max=max_year,
    step=1,
    description='Select year:',
    disabled=False,
    continuous_update=False,
    orientation='horizontal',
    readout=True,
    readout_format='d',
)

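def f(x):
    title = f"Correlation of unemployment rates between the years {x[0]} and {x[1]}"
    # Keep the selected years and the overall series (all ages, both sexes)
    years = ts_df_pct.index.year
    keep = (years >= x[0]) & (years <= x[1]) & (ts_df_pct["age"] == "TOTAL").to_numpy() & (ts_df_pct["sex"] == "T").to_numpy()
    # One row per month, one column per country, then pairwise correlation
    wide = ts_df_pct[keep].reset_index().pivot(index="index", columns="country", values="value")
    fig = px.imshow(wide.corr(), title=title, width=1000, height=1000)
    return fig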

a = widgets.interact(f, x=w)

# Conclusion
The most interesting aspect of this analysis, to me, is that some countries have strongly negative unemployment correlations with other countries. I would really like to cross this information with other macroeconomic data to figure out why (maybe I will in the future).