# Introduction

We analyze data on asylum applicants in Europe, broken down by citizenship.  
We also load the ISO country codes as auxiliary data.  
Sankey diagrams let us show the origin and destination countries of asylum applicants in a single graph.

# Analysis preparation

## Load packages

In [None]:
import pandas as pd
import os

## Load data

In [None]:
data_df = pd.read_csv("/kaggle/input/asylum-applicants-by-citizenship-in-europe/asylum_applicants_in_europe.csv")
country_codes_df = pd.read_csv("/kaggle/input/iso-country-codes-global/wikipedia-iso-country-codes.csv")

# Data exploration

## Glimpse the data

In [None]:
data_df.info()

In [None]:
country_codes_df.info()

In [None]:
data_df.head()

In [None]:
country_codes_df.head()

The country codes used in the asylum dataset correspond to the Alpha-2 codes in the ISO country code data. We merge the two datasets twice, once on the citizenship code and once on the destination code, to add the English short name of each country.

## Merge asylum data and country data

In [None]:
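# keep only the name and code columns (copied, so the renames below act on an independent frame)
cc_df = country_codes_df[['English short name lower case','Alpha-2 code','Alpha-3 code']].copy()
# first merge: country of citizenship (origin)
cc_df.columns = ['citizen_name', 'citizen', 'citizen_3']
data_c_df = data_df.merge(cc_df, on='citizen', how='left')
print(data_df.shape, data_c_df.shape)
# second merge: country receiving the application (destination)
cc_df.columns = ['geography_name', 'geography', 'geography_3']
data_c_df = data_c_df.merge(cc_df, on='geography', how='left')
print(data_c_df.shape)
data_c_df.head()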
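
A left merge keeps every row of the asylum data even when a code has no match in the ISO table; the name columns are then filled with NaN. The sketch below shows one way to spot such unmatched codes with the `indicator` option of `merge`. It runs on small made-up frames: `toy_data`, `toy_codes`, the counts and the code `XX` are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Toy stand-ins for data_df and cc_df; all values are made up for illustration
toy_data = pd.DataFrame({
    "citizen": ["SY", "AF", "XX"],   # "XX" is a deliberately unknown code
    "value": [10, 5, 1],
})
toy_codes = pd.DataFrame({
    "citizen_name": ["Syria", "Afghanistan"],
    "citizen": ["SY", "AF"],
})

# indicator=True adds a "_merge" column telling where each row was found
merged = toy_data.merge(toy_codes, on="citizen", how="left", indicator=True)

# rows found only in the left frame have no matching country name
unmatched = merged.loc[merged["_merge"] == "left_only", "citizen"].tolist()
print(unmatched)
```

Applying the same check to `data_df` and `cc_df` would list the codes whose country names are missing after the merge.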

## Top 10 countries receiving asylum applicants until 2007

In [None]:
agg_df = data_c_df.groupby(['geography', 'geography_name'])['value'].sum().reset_index()
agg_df.sort_values(["value"], inplace=True, ascending=False)
agg_df.head(10)

In [None]:
top_10_destination = agg_df.head(10).geography_name.values

## Top 10 countries of origin for asylum applicants until 2007

In [None]:
agg_df = data_c_df.groupby(['citizen', 'citizen_name'])['value'].sum().reset_index()
agg_df.sort_values(["value"], inplace=True, ascending=False)
agg_df.head(10)

In [None]:
top_10_origin = agg_df.head(10).citizen_name.values

## Sankey diagram for top 100 combinations {country of origin | country of destination}

We now aggregate on both the citizenship of the asylum applicants and the country receiving the applications. There are more than 5K combinations in total, so we keep only the 100 largest.

### Visualization function using Sankey diagram

In [None]:
from plotly.offline import init_notebook_mode, iplot

# load plotly.js into the notebook so that iplot() can render figures inline
init_notebook_mode(connected=True)

def genSankey(df,cat_cols=[],value_cols='',title='Sankey Diagram',param={"height":1000}):
    # maximum of 6 value cols -> 6 colors
    colorPalette = ['#4B8BBE', '#AF2346','#32CD32','#8B008B','#FFD43B','#646464']
    labelList = []
    colorNumList = []
    for catCol in cat_cols:
        labelListTemp =  list(set(df[catCol].values))
        colorNumList.append(len(labelListTemp))
        labelList = labelList + labelListTemp
        
    # remove duplicates from labelList
    labelList = list(dict.fromkeys(labelList))
    
    # define colors based on number of levels
    colorList = []
    for idx, colorNum in enumerate(colorNumList):
        colorList = colorList + [colorPalette[idx]]*colorNum
       
    # transform df into a source-target pair
    for i in range(len(cat_cols)-1):
        if i==0:
            sourceTargetDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            sourceTargetDf.columns = ['source','target','count']
        else:
            tempDf = df[[cat_cols[i],cat_cols[i+1],value_cols]]
            tempDf.columns = ['source','target','count']
            sourceTargetDf = pd.concat([sourceTargetDf,tempDf])
        sourceTargetDf = sourceTargetDf.groupby(['source','target']).agg({'count':'sum'}).reset_index()
        
    # add index for source-target pair
    sourceTargetDf['sourceID'] = sourceTargetDf['source'].apply(lambda x: labelList.index(x))
    sourceTargetDf['targetID'] = sourceTargetDf['target'].apply(lambda x: labelList.index(x))
    
    # creating the sankey diagram
    data = dict(
        type='sankey',
        node = dict(
          pad = 15,
          thickness = 20,
          line = dict(
            color = "black",
            width = 0.25
          ),
          label = labelList,
          color = colorList
        ),
        link = dict(
          source = sourceTargetDf['sourceID'],
          target = sourceTargetDf['targetID'],
          value = sourceTargetDf['count'],
        )
      )
    
    layout =  dict(
        title = title,
        font = dict(
          size = 10
        ),
        height=param["height"]
    )
       
    fig = dict(data=[data], layout=layout)
    return fig
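
The core of `genSankey` is the translation of node labels into integer positions, because the `source` and `target` arrays of a Sankey trace refer to nodes by their index in the label list. Below is a minimal sketch of that step on made-up flows. It uses a dictionary lookup instead of `list.index`, which is an alternative and not the function's own code; the `flows` frame and its values are illustrative assumptions.

```python
import pandas as pd

# Made-up flows; names and counts are illustrative only
flows = pd.DataFrame({
    "origin": ["A", "A", "B"],
    "destination": ["X", "Y", "X"],
    "total": [30, 20, 10],
})

# one node per distinct label, origins first, keeping first-seen order
labels = list(dict.fromkeys(list(flows["origin"]) + list(flows["destination"])))
index_of = {label: i for i, label in enumerate(labels)}

# each link is described by the integer positions of its two end nodes
flows["sourceID"] = flows["origin"].map(index_of)
flows["targetID"] = flows["destination"].map(index_of)

print(labels)
print(flows[["sourceID", "targetID", "total"]].values.tolist())
```

Because the labels are deduplicated, a country that appears in both the origin and the destination column maps to a single node, with links going both in and out of it.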

### Aggregate by origin and destination

In [None]:
agg_df = data_c_df.groupby(['citizen_name', 'geography_name'])['value'].sum().reset_index()
agg_df.columns = ["origin", "destination", "total"]
agg_df.sort_values(["total"], inplace=True, ascending=False)
print(f"All combinations: {agg_df.shape[0]}\nTop 10 combinations:")
agg_df.head(10)

### Top 100 combinations Sankey diagram

In [None]:
data_agg = agg_df.head(100)
fig = genSankey(data_agg, cat_cols=['origin', 'destination'],
                value_cols='total',
                title='Sankey Diagram for asylum applications (top 100 combinations): {country of origin -> country of destination}')
iplot(fig, validate=False)

Interestingly, two countries from Central and Eastern Europe, Hungary and Poland, appear both as destination and as origin countries.

Until 2007, Poland received asylum applicants mostly from Russia, while at the same time asylum applicants from Poland targeted Germany, Portugal and Austria.

Hungary received asylum applications mostly from Afghanistan, while asylum applicants from Hungary mostly targeted Germany and Austria.
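
One way to list the countries that play both roles is a set intersection over the two columns. The sketch below uses a made-up frame with the same `origin`/`destination`/`total` columns as `agg_df`. The pairs mirror the flows mentioned above, but the totals are placeholders and not results from the dataset.

```python
import pandas as pd

# Toy frame shaped like agg_df; the totals are placeholders
toy_agg = pd.DataFrame({
    "origin": ["Russia", "Poland", "Afghanistan", "Hungary"],
    "destination": ["Poland", "Germany", "Hungary", "Austria"],
    "total": [4, 3, 2, 1],
})

# countries present in both the origin and the destination column
both_roles = sorted(set(toy_agg["origin"]) & set(toy_agg["destination"]))
print(both_roles)
```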

## Top 10 origin and top 10 destination countries combined

Let's now keep only the flows between the top 10 origin countries and the top 10 destination countries, and show them in a Sankey diagram.

In [None]:
data_agg = agg_df.loc[agg_df.origin.isin(top_10_origin) & (agg_df.destination.isin(top_10_destination))]

In [None]:
fig = genSankey(data_agg, cat_cols=['origin', 'destination'],
                value_cols='total',
                title='Sankey Diagram for asylum applications (top 10 countries of origin x top 10 countries of destination)',
                param={"height": 600})
iplot(fig, validate=False)

## Time evolution of asylum applications (top 10 origin countries)

We now look at the trends in asylum applications for the top 10 origin countries.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_time_variation(df, c='citizen_name', y='value', is_log=False, title=""):
    f, ax = plt.subplots(1, 1, figsize=(16, 12))
    for country in df[c].unique():
        # copy the slice so that the shift below does not modify the caller's frame
        df_ = df[df[c] == country].copy()
        # shift by 1 so that zero counts can still be drawn on a log scale
        df_[y] = df_[y] + 1
        sns.lineplot(x="date", y=y, data=df_, label=country, ax=ax)
        # label each line at its last data point
        last_date = df_['date'].max()
        last_value = df_.loc[df_['date'] == last_date, y].iloc[0]
        ax.text(last_date, last_value, str(country))
    plt.xticks(rotation=90)
    plt.title(f'Total {title}, grouped by country/year')
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    if is_log:
        ax.set(yscale="log")
    ax.grid(color='black', linestyle='dotted', linewidth=0.75)
    plt.show()

In [None]:
filter_df = data_c_df.loc[data_c_df.citizen_name.isin(top_10_origin)]
filter_df = filter_df.groupby(['citizen_name', 'date'])['value'].sum().reset_index()
plot_time_variation(filter_df,is_log=True,title="asylum applications (country of origin)")
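
To inspect the same trends as numbers rather than lines, the long table can be pivoted into one column per country. This is a sketch on made-up rows: the `toy` frame and its integer `date` values are assumptions for illustration, and the real `date` column may have a different type.

```python
import pandas as pd

# Made-up long-format rows with the same column names as data_c_df
toy = pd.DataFrame({
    "citizen_name": ["A", "A", "B", "B", "B"],
    "date": [2008, 2009, 2008, 2008, 2009],
    "value": [1, 2, 3, 4, 5],
})

# same aggregation as above: total per country and date
yearly = toy.groupby(["citizen_name", "date"])["value"].sum().reset_index()

# one row per date, one column per country
wide = yearly.pivot(index="date", columns="citizen_name", values="value")
print(wide)
```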

In [None]:
filter_df = data_c_df.loc[data_c_df.geography_name.isin(top_10_destination)]
filter_df = filter_df.groupby(['geography_name', 'date'])['value'].sum().reset_index()
plot_time_variation(filter_df,c='geography_name',is_log=True,title="asylum applications (country of destination)")

In [None]:
filter_df = data_c_df.loc[data_c_df.geography_name.isin(['Germany']) & data_c_df.citizen_name.isin(top_10_origin)]
filter_df = filter_df.groupby(['citizen_name', 'date'])['value'].sum().reset_index()
plot_time_variation(filter_df,c='citizen_name',is_log=True,title="asylum applications for Germany (per country of origin)")

In [None]:
filter_df = data_c_df.loc[data_c_df.citizen_name.isin(['Romania']) & data_c_df.geography_name.isin(top_10_destination)]
filter_df = filter_df.groupby(['geography_name', 'date'])['value'].sum().reset_index()
plot_time_variation(filter_df,c='geography_name',is_log=True,title="asylum applications from Romania (per country of destination)")