<p style="border:2px solid black"> </p>
<span style="font-family:Lucida Bright;">
<p style="margin-bottom:1cm"></p>
<center>
<font size="7"><b>Social Data Analysis and Visualization</b></font>
<p style="margin-bottom:1cm"></p>
<font size="3"><b>Final Project</b></font>
<p style="margin-bottom:1cm"></p>
<font size="6"><b>Demographics of Copenhagen</b></font>
<p style="margin-bottom:0.8cm"></p>
<font size="3"><b>Wojciech Mazurkiewicz, DTU, 14 May 2021</b></font>
<p style="margin-bottom:1.5cm"></p>
<font size="6"><b>Data Visualization</b></font>
<br>
<font size="3"><b></b></font>
</center>
<p style="margin-bottom:0.7cm"></p>
<p style="border:2px solid black"> </p>

# How to read this notebook

<p style="border:2px solid black"> </p>

Please note that the pre-rendered outputs will only display properly once the notebook is __trusted__.

If you are viewing the HTML version of the notebook and would like to download the .ipynb file, you can do it [here](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/data_visualization.ipynb).

# Introduction

<p style="border:2px solid black"> </p>

This notebook describes the process of visualizing data about different demographic quantities for Copenhagen, including:

1. country of origin
1. citizenship (Danes vs. western and non-western non-Danes)
1. marital status
1. family type and children
1. income
1. life span
1. population movement data (immigration, births, deaths, etc.)

Each demographic quantity is described in its own section.


# Initialization

<p style="border:2px solid black"> </p>

The initialization procedure, including the definitions of many of the functions used to load and clean the data in this notebook, is defined in the [Initialization notebook](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/initialization.html). Let's run it now:

In [None]:
%run ./initialization.ipynb

# Load data

<p style="border:2px solid black"> </p>

Let's load the object that holds all the cleaned dataframes we created in the process of [loading and cleaning](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/data_loading_and_cleaning.html).

In [None]:
cph_data = CphData()

# Country of origin (without district information)

<p style="border:2px solid black"> </p>

## Load data

Let's load the data about the population of Copenhagen by the country of origin:

In [None]:
# Load the data containing information about the population
# of Copenhagen by the country of origin.
df_country = cph_data.country_of_origin

# Show dataframe.
display(df_country)

## Show data

Let's visualize the data. We will start by removing the data for 2008, which leaves 12 years and makes the plots more convenient to view.

In [None]:
df_country = df_country[df_country['Year'] != 2008]
display(df_country)

Now, let's create a dataframe where each row represents a year and each column holds the population for a given country of origin:

In [None]:
# Create a dataframe where each row represents a year and each column
# a country of origin.
df_country_vs_year = (
    df_country
    .loc[:,
         ['Year', 'Country of origin', 'Number of people']]
    .groupby(['Year', 'Country of origin'])
    .sum()
    .unstack(level=1)
    .droplevel(0, axis=1)
)

# Save the dataframe to pickle.
df_country_vs_year.to_pickle(path_data_clean_root / 'df_country_of_origin_vs_year.pkl')

# Show the results.
display(df_country_vs_year)
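
The reshaping step above can be tried in isolation on a small table. The following is a minimal sketch that assumes only pandas; the numbers are made up for illustration and are not taken from the Copenhagen data.

```python
import pandas as pd

# Toy long-format data (made-up numbers, not the Copenhagen dataset).
df_long = pd.DataFrame({
    'Year': [2019, 2019, 2019, 2020, 2020, 2020],
    'Country of origin': ['Denmark', 'Poland', 'Poland',
                          'Denmark', 'Poland', 'Poland'],
    'Number of people': [100, 5, 7, 110, 6, 9],
})

# Sum within each (year, country) pair, then move the country level
# of the row index into the columns.
df_wide = (
    df_long
    .groupby(['Year', 'Country of origin'])
    .sum()
    .unstack(level=1)
    .droplevel(0, axis=1)
)

print(df_wide)
```

Each row of `df_wide` is a year and each column a country of origin, which is the same layout as `df_country_vs_year`.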

### Plot the population of top 10 countries of origin for each year

We start by getting the list of countries, sorted by their total number of people over the years, and the list of years, from most recent to least recent:

In [None]:
# Get a list of countries represented in the dataframe,
# sorted by the number of people summed over all years.
countries_sorted_by_number_of_people = (
    df_country_vs_year
    .sum()
    .sort_values(ascending=False)
    .index
    .to_list()
)

# Show the countries sorted by population
# display(countries_sorted_by_number_of_people)

# Years from biggest to smallest.
years = df_country_vs_year.index.to_list()
years.sort(reverse=True)

# Show the results.
display(years)

#### By number of people

Now we can show the total population of each of the countries for all the years:

In [None]:
# Get the number of years.
n_years = len(years)

# Get the number of countries.
n_countries = len(countries_sorted_by_number_of_people)

# Get the number of plots (one per year).
n_plots = n_years

# Define the plot grid.
n_plot_columns = 3
n_plot_rows = int(np.ceil(n_plots / n_plot_columns))

# Create a figure for the plots.
figure, all_axes = plt.subplots(
    n_plot_rows, n_plot_columns,
    figsize=(5 * n_plot_columns + 2, 5 * n_plot_rows),
    gridspec_kw={'hspace': 0.3}
)

# Get the handles of the bottom axes.
bottom_axes = all_axes[-1, :]

# Define colors.
n_countries_to_map = 15
color_palette = sns.color_palette("hls",
                                  n_colors=n_countries_to_map)

# Map the colors to countries.
color_mapping = {country: color
                 for country, color
                 in zip(countries_sorted_by_number_of_people[:n_countries_to_map],
                        color_palette)}

# Plot.
for idx, (year, axes) in enumerate(zip(years, all_axes.ravel()[:n_plots])):

    total = df_country_vs_year.at[year, 'Total']

    sns.barplot(
        data=(
            df_country_vs_year
            .loc[year, ~df_country_vs_year.columns.isin(['Total'])]
            .sort_values(ascending=False)
            .head(10)
            .div(1e3)
            .reset_index()
        ),
        x='Country of origin',
        y=year,
        ax=axes,
        palette=color_mapping)

    # Set the title of the plot.
    axes.set_title(year, y=0.88)
    axes.set_xlabel('')
    axes.set_ylabel('')
    axes.set_ylim([axes.get_ylim()[0], total / 1e3 * 1.2])
    
    draw_threshold(total * 1e-3, axes, title=f'Total: {total:,.0f}')

    # Rotate x tick labels.
    plt.setp(
        axes.get_xticklabels(),
        rotation=45,
        ha='right',
        va='top',
    )

    # Apply the standard formatting.
    format_axes_annotation(axes)

# Annotate the figure.
# figure_x_label(figure, 'Day of week', y_position=0.06)
figure_y_label(figure, 'Population in Copenhagen region [thousands]', x_position=0.08)
figure.suptitle('10 most represented countries of origin in Copenhagen', 
                size=24,
                y=0.91)
figure.set_facecolor("white")
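
The grid arithmetic used for the subplots can be checked separately. The sketch below uses a hypothetical helper, `grid_shape`, that is not part of the notebook; it suggests why 12 years are convenient with three columns: the grid is filled exactly, whereas 13 years would leave two empty axes.

```python
import math


def grid_shape(n_plots, n_columns):
    """Return (rows, columns, unused cells) for a grid holding n_plots.

    Hypothetical helper for illustration; it mirrors the
    n_plot_rows = ceil(n_plots / n_plot_columns) computation above.
    """
    n_rows = math.ceil(n_plots / n_columns)
    n_unused = n_rows * n_columns - n_plots
    return n_rows, n_columns, n_unused


# 12 yearly plots in 3 columns (2008 removed).
print(grid_shape(12, 3))

# 13 yearly plots in 3 columns (2008 kept).
print(grid_shape(13, 3))
```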

#### By percentage of the total population of Copenhagen

Instead of showing the absolute numbers, it might be more interesting to see the populations as a fraction of the total population of Copenhagen:

In [None]:
# Years from biggest to smallest.
years = df_country_vs_year.index.to_list()
years.sort(reverse=True)

# Get the number of countries.
n_countries = len(countries_sorted_by_number_of_people)

# Get the number of plots.
n_plots = len(years)

# Define the plot grid.
n_plot_columns = 3
n_plot_rows = int(np.ceil(n_plots / n_plot_columns))

# Create a figure for the plots.
figure, all_axes = plt.subplots(
    n_plot_rows, n_plot_columns,
    sharey='all',
    figsize=(5 * n_plot_columns + 2, 5 * n_plot_rows),
    gridspec_kw={'hspace': 0.3}
)

# Get the handles of the bottom axes.
bottom_axes = all_axes[-1, :]

# Define colors.
n_countries_to_map = 15
color_palette = sns.color_palette("hls",
                                  n_colors=n_countries_to_map)

# Map the colors to countries.
color_mapping = {country: color
                 for country, color
                 in zip(countries_sorted_by_number_of_people[:n_countries_to_map],
                        color_palette)}

# Plot.
for idx, (year, axes) in enumerate(zip(years, all_axes.ravel()[:n_plots])):
    
    # The total number of people in Copenhagen.
    total = df_country_vs_year.at[year, 'Total']
    
    # Show the barplot.
    sns.barplot(
        data=(
            df_country_vs_year
            .loc[year, ~df_country_vs_year.columns.isin(['Total'])]
            .sort_values(ascending=False)
            .head(10)
            .mul(100 / total)
            .reset_index()
        ),
        x='Country of origin',
        y=year,
        ax=axes,
        palette=color_mapping)

    # Set the title of the plot.
    axes.set_title(year, y=0.9)
    axes.set_xlabel('')
    axes.set_ylabel('')

    # Rotate x tick labels.
    plt.setp(
        axes.get_xticklabels(),
        rotation=45,
        ha='right',
        va='top',
    )

    # Apply the standard formatting.
    format_axes_annotation(axes)

# Annotate the figure.
# figure_x_label(figure, 'Day of week', y_position=0.06)
figure_y_label(figure, r'% of population in Copenhagen region', x_position=0.08)
figure.suptitle('10 most represented countries of origin in Copenhagen', 
                size=24,
                y=0.91)

figure.set_facecolor("white")

figure.savefig(path_results_root / 'country_of_origin_10_most_represented.png', bbox_inches='tight')

#### Interactive bar graph of 2020 with plotly

We can also make the graph interactive by creating it using plotly:

In [None]:
# Define the year of interest.
year = 2020

# Create a dataframe that shows % of population by country of origin
# in the chosen year.
df_country_2020 = (
    df_country_vs_year
    .loc[year, ~df_country_vs_year.columns.isin(['Total'])]
    .sort_values(ascending=False)
    .head(10)
    .mul(100 / df_country_vs_year.at[year, 'Total'])
    .reset_index()
    .rename(columns={year: '% of population'})
)

# Create the plotly bar graph.
fig = px.bar(
    df_country_2020,
    x='Country of origin',
    y='% of population',
#     barmode='group',
    color='Country of origin',
    color_discrete_sequence=px.colors.qualitative.D3
)

# Update the layout of the figure.
fig.update_layout(
    dict(plot_bgcolor='rgb(255, 255, 255)',
         paper_bgcolor='rgb(255, 255, 255)',
         title=dict(text="Population of Copenhagen in 2020",
                    x=0.5,
                    xanchor='center',
                    yanchor='top'),
         xaxis=dict(tickmode='linear',
                    #                         tick0=0.5
                    dtick=1),
         showlegend=False
#          annotations=caption_attributes(caption, y),
#          margin=margin
         )
)

# Show the figure
fig.show()

### Plot the relationship between Danes and non-Danes 

Let's create a dataframe that divides all countries of origin into two categories, Danes and non-Danes:

In [None]:
# Get the number of countries.
n_countries = len(countries_sorted_by_number_of_people)

# Get the number of plots.
n_plots = len(years)

# Define the plot grid.
n_plot_columns = 3
n_plot_rows = int(np.ceil(n_plots / n_plot_columns))

# Create a dataframe with data for Danes vs. non-Danes.
df_danes_vs_non_danes = (
    df_country_vs_year
    .loc[:, 'Denmark']
    .to_frame('Danes')
)

# Everyone whose country of origin is not Denmark counts as a non-Dane.
df_danes_vs_non_danes['Non-danes'] = (
    df_country_vs_year
    .loc[:, ~df_country_vs_year.columns.isin(['Total', 'Denmark'])]
    .sum(axis=1)
)

df_danes_vs_non_danes[['% Danes', '% Non-danes']] = (
    df_danes_vs_non_danes[['Danes', 'Non-danes']]
    .div(df_danes_vs_non_danes.sum(axis=1), axis=0)
    .mul(100)
)


# Save the dataframe to a file
df_danes_vs_non_danes.to_pickle(
    path_data_clean_root / 'df_danes_vs_non_danes.pkl')


# Show the dataframe
display(df_danes_vs_non_danes)
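
The percentage columns above come from dividing each row by its own row total. A minimal sketch of that operation, assuming only pandas and using made-up numbers:

```python
import pandas as pd

# Toy counts per year (made-up numbers, not the Copenhagen dataset).
df = pd.DataFrame(
    {'Danes': [300, 320], 'Non-danes': [100, 80]},
    index=pd.Index([2019, 2020], name='Year'),
)

# Divide every row by its row total (axis=0 aligns the totals with
# the row index), then scale to percent.
df_pct = df.div(df.sum(axis=1), axis=0).mul(100)

print(df_pct)
```

Passing `axis=0` to `div` matters here: without it, pandas would try to align the row totals with the column labels instead of the years.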

####  By number of people

Let's show how the numbers of Danes and non-Danes have developed over the years.

In [None]:
# Create a figure for the plots.
figure, axes = plt.subplots(figsize=(15, 8))

# Create the barplot for each year.
sns.barplot(data=(df_danes_vs_non_danes[['Danes', 'Non-danes']]
                  .div(1e3)
                  .reset_index()
                  .melt(id_vars='Year',
                        var_name='Origin',
                        value_name='Number of people')),
            x='Year',
            y='Number of people',
            hue='Origin',
            ax=axes)

# Total population over the years.
total = (
    df_danes_vs_non_danes[['Danes', 'Non-danes']]
    .div(1e3)
    .sum(axis=1)
    .to_numpy()
)

# Plot total population (reusing the totals computed above).
axes.plot(total, color='red')

# Annotate.
axes.set_ylabel('Number of people [thousands]')
axes.text(0, total[0] + 20, 'Total population', rotation = 5, size=14)

# Format plot.
format_axes(axes)
format_axes_annotation(axes)
figure.set_facecolor("white")

# Save the plot.
figure.savefig(path_results_root / 'total_population_danes_vs_non_danes_absolute.png')


#### By percentage of the total population of Copenhagen

As earlier, it might be easier to get an intuition about the numbers if the sizes of the populations are expressed as fractions of the total population of Copenhagen:

In [None]:
# Create a figure for the plots.
figure, axes = plt.subplots(figsize=(15, 8))

# Create the barplot for each year.
sns.barplot(data=(df_danes_vs_non_danes[['% Danes', '% Non-danes']]
                  .rename(columns={'% Danes': 'Danes',
                                   '% Non-danes': 'Non-danes'})
                  .reset_index()
                  .melt(id_vars='Year',
                        var_name='Origin',
                        value_name='% of population')),
            x='Year',
            y='% of population',
            hue='Origin',
            ax=axes)

# Set axes limits (to make room for the legend)
axes.set_ylim((0, 90))
axes.set_title('Population of Copenhagen by country of origin')

format_axes(axes)
format_axes_annotation(axes)
figure.set_facecolor("white")

# Save the plot.
figure.savefig(path_results_root /
               'total_population_danes_vs_non_danes_pct.png')
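
Both bar plots rely on `melt` to turn the wide dataframe (one column per group) into the long format that the `hue` argument expects. A minimal sketch of that step, assuming only pandas and made-up percentages:

```python
import pandas as pd

# Toy wide-format data (made-up percentages).
df_wide = pd.DataFrame(
    {'Danes': [75.0, 80.0], 'Non-danes': [25.0, 20.0]},
    index=pd.Index([2019, 2020], name='Year'),
)

# One row per (year, origin) pair: 'Origin' holds the former column
# names and '% of population' holds the values.
df_long = (
    df_wide
    .reset_index()
    .melt(id_vars='Year',
          var_name='Origin',
          value_name='% of population')
)

print(df_long)
```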

#### Interactive graph with plotly

In [None]:
# Get a dataframe containing only the shares in %.
df_danes_non_danes_pct = (
    df_danes_vs_non_danes[['% Danes', '% Non-danes']]
    .rename(columns={'% Danes': 'Danes',
                     '% Non-danes': 'Non-danes'})
    .reset_index()
    .melt(id_vars='Year',
          var_name='Origin',
          value_name='% of population')
)

# Create a figure containing the plotly bar graph.
fig = px.bar(
    df_danes_non_danes_pct,
    x='Year',
    y='% of population',
    color='Origin',
    barmode='group',
    color_discrete_sequence=px.colors.qualitative.D3
)

# Adjust the layout.
fig.update_layout(
    dict(plot_bgcolor='rgb(255, 255, 255)',
         paper_bgcolor='rgb(255, 255, 255)',
         title=dict(text="Population of Copenhagen",
                    x=0.5,
                    xanchor='center',
                    yanchor='top'),
         xaxis=dict(tickmode='linear',
#                     tick0=0.5,
                    dtick=1)
         )
)

# Show the figure.
fig.show()

# Save as HTML
fig.write_html(str(
    path_results_root / 'total_population_danes_vs_non_danes_pct.html'))

# Citizenship
<p style="border:2px solid black"> </p>

## Load data

In [None]:
# Load the data
cph_data = CphData()

# Get the citizenship dataframe
df_citizenship = cph_data.citizenship
display(df_citizenship)

## Get percentage of district population by citizenship

In [None]:
# Get the district data by year.
df_citizenhip_by_district = (
    df_citizenship
    .loc[df_citizenship['District type'] == 'District']
    .groupby(get_df_columns(df_citizenship,
                            exclude=['Sex', 'Age', 'Number of people']),
             as_index=False)
    .sum()
    .drop(['Quarter', 'District type'], axis=1)
)

# Add the % of district population: each row's count divided by
# the total for its (year, district) group.
df_citizenhip_by_district['% of district population'] = (
    df_citizenhip_by_district['Number of people']
    .div(
        df_citizenhip_by_district
        .groupby(['Year', 'District'])['Number of people']
        .transform('sum')
    )
    .mul(100)
)

# Show the dataframe.
display(df_citizenhip_by_district)
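
The share of each citizenship group within its district is a within-group percentage. A minimal sketch of the idea, assuming only pandas; the district names and counts are made up for illustration:

```python
import pandas as pd

# Toy data (made-up districts and counts).
df = pd.DataFrame({
    'Year': [2020, 2020, 2020, 2020],
    'District': ['A', 'A', 'B', 'B'],
    'Citizenship': ['Denmark', 'Other', 'Denmark', 'Other'],
    'Number of people': [90, 10, 60, 40],
})

# transform('sum') returns the group total repeated on every row of
# the group, so it lines up with the original rows.
district_total = (
    df
    .groupby(['Year', 'District'])['Number of people']
    .transform('sum')
)

df['% of district population'] = (
    df['Number of people'].div(district_total).mul(100)
)

print(df)
```

Within each (year, district) group the percentages add up to 100, which is a quick sanity check on the real dataframe as well.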

## Show data for 2020

In [None]:
# Define the year of interest.
year = 2020

# Get the dataframe for the chosen year.
# Let's sort by % of district population.
df_citizenhip_by_district_2020 = (
    df_citizenhip_by_district
    .loc[df_citizenhip_by_district['Year'] == year]
    .sort_values(by=['Year', '% of district population'], ascending=False)
)

# Show the results
display(df_citizenhip_by_district_2020.head(15))

## Create a plotly barplot showing the percentages of population by citizenship for all districts in 2020

In [None]:
# Create the plotly bar graph.
fig = px.bar(
    df_citizenhip_by_district_2020,
    x='District',
    y='% of district population',
    barmode='group',
    color='Citizenship',
    color_discrete_sequence=px.colors.qualitative.D3
)

# Update the layout of the figure.
fig.update_layout(
    dict(plot_bgcolor='rgb(255, 255, 255)',
         paper_bgcolor='rgb(255, 255, 255)',
         title=dict(text="Population of Copenhagen by district and citizenship",
                    x=0.5,
                    xanchor='center',
                    yanchor='top'),
         xaxis=dict(tickmode='linear',
                    #                         tick0=0.5
                    dtick=1),
         showlegend=True,
         xaxis_title=None
#          annotations=caption_attributes(caption, y),
#          margin=margin
         )
)

# Show the figure
fig.show()

## Create a geoplot of the percentage of non-Danes by district over time

### All non-Danes

In [None]:
show_pct_non_danes_over_time(
    group='all',
    caption=None,
    caption_y=-0.05,
    margin=dict(r=0,
                t=30,
                l=0,
                b=0),
    range_color=(0, 25)
)

### Western non-Danes

In [None]:
show_pct_non_danes_over_time(
    group='western',
    caption=None,
    caption_y=-0.05,
    margin=dict(r=0,
                t=30,
                l=0,
                b=0),
    range_color=(0, 25)
)

### Non-western non-Danes

In [None]:
show_pct_non_danes_over_time(
    group='non-western',
    caption=None,
    caption_y=-0.05,
    margin=dict(r=0,
                t=30,
                l=0,
                b=0),
    range_color=(0, 25)
)

In [None]:
# Show the dataframe.
display(df_citizenhip_by_district)

Create a dataframe that shows the temporal development of the percentage of non-Danes in all districts of Copenhagen.

In [None]:
# Create a dataframe that shows the temporal development of the 
# percentage of non-Danes in all districts of Copenhagen.
df_citizenhip_by_district_non_danes = (
    df_citizenhip_by_district
    .loc[~df_citizenhip_by_district['Citizenship'].isin(['Denmark'])]
    .drop(['Number of people', 'Citizenship'], axis=1)
    .groupby(['Year', 'District'],
             as_index=False)
    .sum()
)

# Show the results.
display(df_citizenhip_by_district_non_danes)

In [None]:
show_cph_geoplot(
    df_citizenhip_by_district_non_danes,
    locations='District',
    values='% of district population',
    animation_frame='Year',
    range_color=(0, 100),
    title='Percentage of non-Danes by district of Copenhagen'
)

# Marital status
<p style="border:2px solid black"> </p>

# Family type and children
<p style="border:2px solid black"> </p>

# Income
<p style="border:2px solid black"> </p>

## Load demographic data

Load the dataframe containing all available data organized by year and district.

In [None]:
# Load the dataframe containing all available data organized by year and district.
df_superset = pd.read_pickle(path_data_clean_root / 'cph_clean_superset.pkl')

# Show the dataframe.
display(df_superset)

## Load geodata

In [None]:
# Load the geo shapes from the json file.
district_geo_shapes = load_cph_district_geoshapes()

## Get data about the average income

In [None]:
# Get the data about the average income.
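# Take an explicit copy so that the new column below is added to an
# independent dataframe.
df_income = (
    df_superset
    .loc[:, ['Year', 'District', 'Average income (kr.)']]
    .copy()
)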

# Add information about the difference of income between sexes.
df_income['Difference between men and women'] =  (
    df_income[('Average income (kr.)', 'Men')]
    - df_income[('Average income (kr.)', 'Women')]
)

# Flatten the multiindex of the columns.
df_income = flatten_multiindex(df_income)

# Show the result (the columns are already flattened).
display(df_income)
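
The income step can be sketched on a toy dataframe with two-level columns. The column layout below is an assumption about how `df_superset` is organized, and the list comprehension that flattens the columns is a hypothetical stand-in for `flatten_multiindex`, whose actual implementation lives in the initialization notebook. The numbers are made up.

```python
import pandas as pd

# Assumed layout: two-level columns with income split by sex.
columns = pd.MultiIndex.from_tuples([
    ('Year', ''),
    ('District', ''),
    ('Average income (kr.)', 'Men'),
    ('Average income (kr.)', 'Women'),
])

# Toy data (made-up numbers).
df = pd.DataFrame(
    [[2020, 'A', 350_000, 300_000],
     [2020, 'B', 280_000, 290_000]],
    columns=columns,
)

# Select each sex with a tuple key and subtract.
df['Difference between men and women'] = (
    df[('Average income (kr.)', 'Men')]
    - df[('Average income (kr.)', 'Women')]
)

# Hypothetical flattening: join the non-empty level names with a space.
df.columns = [' '.join(part for part in col if part) for col in df.columns]

print(df)
```

A negative difference means that women have the higher average income in that district and year.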

## Show geoplot

In [None]:
show_cph_geoplot(
    df_income,
    locations='District',
    values='Difference between men and women',
    animation_frame='Year',
    label='',
    title='Difference in income between men and women [kr]',
    caption=''
)

# Life span
<p style="border:2px solid black"> </p>

# Population movement data
<p style="border:2px solid black"> </p>

# Dwellings
<p style="border:2px solid black"> </p>

# Geoplots
<p style="border:2px solid black"> </p>

# Sandbox
<p style="border:2px solid black"> </p>