<p style="border:2px solid black"> </p>
<span style="font-family:Lucida Bright;">
<p style="margin-bottom:1cm"></p>
<center>
<font size="7"><b>Social Data Analysis and</b></font>
<p style="margin-bottom:0.3cm"></p>
<font size="7"><b>Visualization</b></font>
<p style="margin-bottom:1cm"></p>
<font size="3"><b>Final Project</b></font>
<p style="margin-bottom:1cm"></p>
<font size="6"><b>Demographics of Copenhagen</b></font>
<p style="margin-bottom:0.8cm"></p>
<font size="3"><b>Wojciech Mazurkiewicz, DTU, 14 May 2021</b></font>
<p style="margin-bottom:1.5cm"></p>
<font size="6"><b>Explainer Notebook</b></font>
<br>
<font size="3"><b></b></font>
</center>
<p style="margin-bottom:0.7cm"></p>
<p style="border:2px solid black"> </p>

# How to read this notebook

<p style="border:2px solid black"> </p>

Please note that the pre-rendered outputs will only display properly once the notebook is __trusted__.

If you are viewing the HTML version of the notebook and would like to download the .ipynb file, you can do so [here](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/data_loading_and_cleaning.ipynb).

# Initialization

<p style="border:2px solid black"> </p>

The initialization procedure, including the definitions of many of the functions used to load and clean the data in this notebook, is defined in the [Initialization notebook](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/initialization.html). Let's run it now:

In [None]:
%run ./initialization.ipynb

# Motivation

<p style="border:2px solid black"> </p>

> - What is your dataset?
> - Why did you choose this/these particular dataset(s)?
> - What was your goal for the end user's experience?

The work in this project is inspired by an article by [Martin Henriksen](https://da.wikipedia.org/wiki/Martin_Henriksen), in which he warns that the Danish population is being [replaced](https://ditoverblik.dk/martin-henriksen-advarer-befolkningen-udskiftes).

Now, the fact that there are roughly as many interpretations of what it means to be Danish as there are people in Denmark leaves plenty of room for joyful discussions, which I'm sure we will have the pleasure of witnessing everywhere from television to social media for many years to come. I, however, have already reached the final truth on this topic and therefore have little incentive to pursue this direction further.

I would, nonetheless, like to better understand the numbers that lead to interpretations like that of Mr. Henriksen, and have therefore reached out to Københavns Kommune to gain better access to their statistical databases. After a couple of phone calls with a gentleman called Martin, I succeeded in persuading them to raise the limit on how much data can be extracted at once from 10,000 to 50,000 cells, which made it possible to extract enough information for a more detailed analysis of how some of the demographic quantities in Copenhagen have developed over time.

When I started gathering the data, I didn't have a specific goal in mind regarding what I wanted to investigate and present to the reader. Rather, I wanted the ideas to form while I was uncovering the data. To me, that's the beauty of data science - it's an endless journey where each turn you take opens up new ideas for where to go next. So, instead of choosing a particular message to pass on to the reader, I would like to invite them on this journey with me, and to use whatever they find interesting and useful as a starting point for their own study.

For this reason I have attempted to document as much of the process as clearly as possible, and to leave the data in a form that will be easy to work with in the future. I have therefore spent as much time as is normally allotted to this entire course on gathering, cleaning, and organizing the data, and on documenting that work. For with good, clean data, the visualization work is a joy.

As I have no prior experience in webpage design, I have decided to go with the simplest possible design of a Jupyter Notebook, as learning to create webpages from scratch without any guidance would simply take too much time from the data analysis, which I would rather focus on in this course.

I have aimed to focus only on the data that make it possible to trace how different demographic quantities in different districts of Copenhagen change over time. These quantities include:

- citizenship (Danes vs. Western and non-Western non-Danes)
- marital status
- family type and number of children
- income
- life span
- population movement (immigration, births, deaths, etc.)

However, out of sheer interest, I have also included data on the entire population of Copenhagen by country of origin.

# Basic stats

<p style="border:2px solid black"> </p>

> Let's understand the dataset better
>
> - Write about your choices in data cleaning and preprocessing
> - Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

The data used for this project has been obtained from https://kk.statistikbank.dk and consists of 12.4 MB of demographic data and 2.6 MB of geographical data.


# Data Analysis

<p style="border:2px solid black"> </p>

- Describe your data analysis and explain what you've learned about the dataset.
- If relevant, talk about your machine-learning.

## Population of Copenhagen by **country of origin**

[The Danish population is being replaced](https://ditoverblik.dk/martin-henriksen-advarer-befolkningen-udskiftes)

Denmark's >independent< newspaper
Here's an independent article about how the Danish population is being exchanged:
https://ditoverblik.dk/martin-henriksen-advarer-befolkningen-udskiftes/
It's worrying, because that means that the Danish nation is disappearing at a horrifying rate, and if nothing is done, it will soon cease to exist
And think about it: foreigners, their children, and their children's children already make up 25% of the population of Copenhagen
so in 50 years many of the noble and pure real Danes will be dead
and the ONLY thing remaining will be the foreigners, their children, their children's children, their children's children's children, and their children's children's children's children.
So, basically, the foreigners will have taken over and there will be NOTHING left!
But people are blind, man! They don't notice this shit until it's too late...

### Population by country of origin vs year

Drop 2008 so that 12 years remain, which fills the 3-column plot grid used below evenly and makes the plots easier to view.

In [5]:
df_country = df_country[df_country['Year'] != 2008]
display(df_country)

Get the population for each country of origin by year (4th quarter):

In [None]:
df_country_vs_year = (
    df_country
    .loc[:,
         ['Year', 'Country of origin', 'Population']]
    .groupby(['Year', 'Country of origin'])
    .sum()
    .unstack(level=1)
    .droplevel(0, axis=1)
)

display(df_country_vs_year)
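
The reshaping step above can be tried out on a small toy table. The sketch below is only an illustration: the values are made up, and `toy_long` and `toy_wide` are hypothetical names that do not exist in this notebook. It selects the `Population` column before summing, which is a slightly different route from the cell above that avoids the extra column level and therefore the `droplevel` call.

```python
import pandas as pd

# Hypothetical toy data in the same long format as df_country
# (the real values come from kk.statistikbank.dk).
toy_long = pd.DataFrame({
    'Year': [2019, 2019, 2020, 2020],
    'Country of origin': ['Denmark', 'Poland', 'Denmark', 'Poland'],
    'Population': [100, 10, 110, 12],
})

# One row per year, one column per country of origin.
toy_wide = (
    toy_long
    .groupby(['Year', 'Country of origin'])['Population']
    .sum()
    .unstack(level='Country of origin')
)

print(toy_wide)
```

The result has the years as the index and the countries as columns, which is the layout the plots further down rely on.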

### Top 10 countries

In [None]:
# Get the list of countries represented in the dataframe,
# sorted by their population summed over all years.
# Note that the 'Total' column is part of this list as well.
countries_sorted_by_number_of_people = (
    df_country_vs_year
    .sum()
    .sort_values(ascending=False)
    .index
    .to_list()
)

# Show the countries sorted by population
# display(countries_sorted_by_number_of_people)

# Years in descending order (most recent first).
years = df_country_vs_year.index.to_list()
years.sort(reverse=True)
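
As a minimal sketch of the ranking step, assuming a made-up wide table (`toy_wide`, `ranking` and `years_descending` are hypothetical names), the countries can be ordered by their summed population like this. Unlike the cell above, the sketch drops the `Total` column before ranking; whether that is desirable for the color mapping is an assumption, not something the notebook states.

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per country.
toy_wide = pd.DataFrame(
    {'Denmark': [100, 110], 'Poland': [10, 12], 'Total': [110, 122]},
    index=pd.Index([2019, 2020], name='Year'),
)

# Rank the countries by their population summed over all years.
# 'Total' is dropped first so that it does not top the ranking.
ranking = (
    toy_wide
    .drop(columns='Total')
    .sum()
    .sort_values(ascending=False)
    .index
    .to_list()
)

# Years in descending order (most recent first).
years_descending = sorted(toy_wide.index, reverse=True)

print(ranking)
print(years_descending)
```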

#### ... by population

In [None]:
# Get the number of years.
n_years = len(years)

# Get the number of countries.
n_countries = len(countries_sorted_by_number_of_people)

# Get the number of plots.
n_plots = len(years)

# Define the plot grid.
n_plot_columns = 3
n_plot_rows = int(np.ceil(n_plots / n_plot_columns))

# Create a figure for the plots.
figure, all_axes = plt.subplots(
    n_plot_rows, n_plot_columns,
    figsize=(5 * n_plot_columns + 2, 5 * n_plot_rows),
    gridspec_kw={'hspace': 0.3}
)

# Get the handles of the bottom row of axes.
bottom_axes = all_axes[-1, :]

# Define colors.
n_countries_to_map = 15
color_palette = sns.color_palette("hls",
                                  n_colors=n_countries_to_map)

# Map the colors to countries.
color_mapping = {country: color
                 for country, color
                 in zip(countries_sorted_by_number_of_people[:n_countries_to_map],
                        color_palette)}

# Plot.
for idx, (year, axes) in enumerate(zip(years, all_axes.ravel()[:n_plots])):

    total = df_country_vs_year.at[year, 'Total']

    sns.barplot(
        data=(
            df_country_vs_year
            .loc[year, ~df_country_vs_year.columns.isin(['Total'])]
            .sort_values(ascending=False)
            .head(10)
            .div(1e3)
            .reset_index()
        ),
        x='Country of origin',
        y=year,
        ax=axes,
        palette=color_mapping)

    # Set the title of the plot.
    axes.set_title(year, y=0.9)
    axes.set_xlabel('')
    axes.set_ylabel('')
    axes.set_ylim([axes.get_ylim()[0], total / 1e3 * 1.2])

    draw_threshold(total * 1e-3, axes, title=f'Total: {total:,.0f}')

    # Rotate x tick labels.
    plt.setp(
        axes.get_xticklabels(),
        rotation=45,
        ha='right',
        va='top',
    )

    # Apply the standard formatting.
    format_axes_annotation(axes)

# Annotate the figure.
# figure_x_label(figure, 'Day of week', y_position=0.06)
figure_y_label(figure, 'Population in Copenhagen region [thousands]', x_position=0.08)
figure.suptitle('10 most represented countries of origin in Copenhagen',
                size=24,
                y=0.92)
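
The two calculations that drive the figure above, the grid shape and the per-year selection of the largest groups, can be checked without any plotting. This is a sketch on invented numbers; `toy_wide` and `top_countries` are hypothetical names, and only the top 2 are kept because the toy table has three countries.

```python
import math

import pandas as pd

# Grid shape: 12 yearly plots in 3 columns need 4 rows.
n_plots = 12
n_plot_columns = 3
n_plot_rows = math.ceil(n_plots / n_plot_columns)

# Hypothetical wide table: one row per year, one column per country.
toy_wide = pd.DataFrame(
    {'Denmark': [100_000, 110_000],
     'Germany': [8_000, 9_000],
     'Poland': [10_000, 12_000],
     'Total': [118_000, 131_000]},
    index=pd.Index([2019, 2020], name='Year'),
)

# Largest countries of origin in one year, in thousands,
# with the 'Total' column excluded from the ranking.
year = 2020
top_countries = (
    toy_wide
    .loc[year, ~toy_wide.columns.isin(['Total'])]
    .sort_values(ascending=False)
    .head(2)
    .div(1e3)
)

print(n_plot_rows)
print(top_countries)
```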

#### ... by percentage

In [None]:
# Get the number of countries.
n_countries = len(countries_sorted_by_number_of_people)

# Get the number of plots.
n_plots = len(years)

# Define the plot grid.
n_plot_columns = 3
n_plot_rows = int(np.ceil(n_plots / n_plot_columns))

# Create a figure for the plots.
figure, all_axes = plt.subplots(
    n_plot_rows, n_plot_columns,
    sharey='all',
    figsize=(5 * n_plot_columns + 2, 5 * n_plot_rows),
    gridspec_kw={'hspace': 0.3}
)

# Get the handles of the bottom row of axes.
bottom_axes = all_axes[-1, :]

# Define colors.
n_countries_to_map = 15
color_palette = sns.color_palette("hls",
                                  n_colors=n_countries_to_map)

# Map the colors to countries.
color_mapping = {country: color
                 for country, color
                 in zip(countries_sorted_by_number_of_people[:n_countries_to_map],
                        color_palette)}

# Plot.
for idx, (year, axes) in enumerate(zip(years, all_axes.ravel()[:n_plots])):

    # The total number of people in Copenhagen.
    total = df_country_vs_year.at[year, 'Total']

    # Show the barplot.
    sns.barplot(
        data=(
            df_country_vs_year
            .loc[year, ~df_country_vs_year.columns.isin(['Total'])]
            .sort_values(ascending=False)
            .head(10)
            .mul(100 / total)
            .reset_index()
        ),
        x='Country of origin',
        y=year,
        ax=axes,
        palette=color_mapping)

    # Set the title of the plot.
    axes.set_title(year, y=0.9)
    axes.set_xlabel('')
    axes.set_ylabel('')

    # Rotate x tick labels.
    plt.setp(
        axes.get_xticklabels(),
        rotation=45,
        ha='right',
        va='top',
    )

    # Apply the standard formatting.
    format_axes_annotation(axes)

# Annotate the figure.
# figure_x_label(figure, 'Day of week', y_position=0.06)
figure_y_label(figure, r'% of population in Copenhagen region', x_position=0.08)
figure.suptitle('10 most represented countries of origin in Copenhagen',
                size=24,
                y=0.91)
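
The percentage conversion used in the figure above scales each country's population by `100 / total` for the given year. A minimal sketch on invented numbers (`toy_wide` and `shares` are hypothetical names) shows that the shares of all countries then add up to 100:

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per country.
toy_wide = pd.DataFrame(
    {'Denmark': [150, 160],
     'Germany': [20, 15],
     'Poland': [30, 25],
     'Total': [200, 200]},
    index=pd.Index([2019, 2020], name='Year'),
)

year = 2020
total = toy_wide.at[year, 'Total']

# Share of each country of origin in the total population [%].
shares = (
    toy_wide
    .loc[year, ~toy_wide.columns.isin(['Total'])]
    .sort_values(ascending=False)
    .mul(100 / total)
)

print(shares)
```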

#### Danes vs non-danes

Prepare the dataframe showing the proportions of danes vs non-danes.

In [None]:
# Get the number of countries.
n_countries = len(countries_sorted_by_number_of_people)

# Get the number of plots.
n_plots = len(years)

# Define the plot grid.
n_plot_columns = 3
n_plot_rows = int(np.ceil(n_plots / n_plot_columns))

# Create a dataframe with data for danes vs non-danes.
df_danes_vs_non_danes = (
    df_country_vs_year
    .loc[:, 'Denmark']
    .to_frame('Danes')
)

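# The non-Danish population is the sum over all countries except Denmark
# (the 'Total' column is excluded as well).
df_danes_vs_non_danes['Non-danes'] = (
    df_country_vs_year
    .loc[:, ~df_country_vs_year.columns.isin(['Total', 'Denmark'])]
    .sum(axis=1)
)

# Express both groups as percentages of their combined population.
# Only the two count columns enter the row sum, so re-running the cell
# does not add the percentage columns to the denominator.
population_counts = df_danes_vs_non_danes[['Danes', 'Non-danes']]
df_danes_vs_non_danes[['Pct danes', 'Pct non-danes']] = (
    population_counts
    .div(population_counts.sum(axis=1), axis=0)
    .mul(100)
)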

# Show the dataframe
display(df_danes_vs_non_danes)
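
The row-wise normalization above relies on `div(..., axis=0)`, which divides every row by the matching entry of the row sums. A small sketch on invented numbers (`toy_wide`, `counts` and `percentages` are hypothetical names) makes this concrete:

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per country.
toy_wide = pd.DataFrame(
    {'Denmark': [150, 160],
     'Germany': [20, 15],
     'Poland': [30, 25],
     'Total': [200, 200]},
    index=pd.Index([2019, 2020], name='Year'),
)

# Danes vs. everyone else, leaving the 'Total' column out of the sum.
counts = pd.DataFrame({
    'Danes': toy_wide['Denmark'],
    'Non-danes': toy_wide.drop(columns=['Total', 'Denmark']).sum(axis=1),
})

# Divide each row by its own sum (axis=0 aligns on the year index).
percentages = counts.div(counts.sum(axis=1), axis=0).mul(100)

print(percentages)
```

Each row of `percentages` sums to 100, which is a quick sanity check worth running on the real table as well.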

Show absolute populations:

In [None]:
# Create a figure for the plots.
figure, axes = plt.subplots(figsize=(15, 8))

# Create the barplot for each year.
sns.barplot(data=(df_danes_vs_non_danes[['Danes', 'Non-danes']]
                  .div(1e3)
                  .reset_index()
                  .melt(id_vars='Year',
                        var_name='Origin',
                        value_name='Number of people')),
            x='Year',
            y='Number of people',
            hue='Origin',
            ax=axes)

# Total population over the years [thousands].
total = (
    df_danes_vs_non_danes[['Danes', 'Non-danes']]
    .div(1e3)
    .sum(axis=1)
    .to_numpy()
)

# Overlay the total population as a line on top of the bars.
axes.plot(total, color='red')

axes.set_ylabel('Number of people [thousands]')
axes.text(0, total[0] + 20, 'Total population', rotation=5, size=14)

format_axes(axes)
format_axes_annotation(axes)
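
The grouped bar plot above needs the data in long format, with one row per year and group, so that the group can be passed as `hue`. A minimal sketch of that `melt` step on invented numbers (`counts` and `long_format` are hypothetical names):

```python
import pandas as pd

# Hypothetical wide table: one column per group, indexed by year.
counts = pd.DataFrame(
    {'Danes': [150, 160], 'Non-danes': [50, 40]},
    index=pd.Index([2019, 2020], name='Year'),
)

# Wide to long: one row per (year, group) pair.
long_format = (
    counts
    .reset_index()
    .melt(id_vars='Year',
          var_name='Origin',
          value_name='Number of people')
)

print(long_format)
```

The two value columns are stacked on top of each other, so the result has four rows: both years for `Danes`, followed by both years for `Non-danes`.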

Show the proportions in terms of the percentages of the total population:

In [None]:
# Create a figure for the plots.
figure, axes = plt.subplots(figsize=(15, 8))

# Create the barplot for each year.
sns.barplot(data=(df_danes_vs_non_danes[['Pct danes', 'Pct non-danes']]
                  .reset_index()
                  .melt(id_vars='Year',
                        var_name='Origin',
                        value_name='% of population')),
            x='Year',
            y='% of population',
            hue='Origin',
            ax=axes)

# Set axes limits (to make room for the legend)
axes.set_ylim((0, 90))

format_axes(axes)
format_axes_annotation(axes)

# Genre

<p style="border:2px solid black"> </p>

## Which genre of data story did you use?

## Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer)? Why?

The 3 categories are: 
  
- Visual Structuring:
    - Establishing Shot / Splash Screen
    - Consistent Visual Platform
    - Progress Bar / Timebar
    - "Checklist" Progress Tracker


- Highlighting:
    - Close-Ups
    - Feature Distinction
    - Character Direction
    - Motion
    - Audio
    - Zooming


- Transition guidance:
    - Familiar Objects (but still cuts)
    - Viewing Angle
    - Viewer (Camera) Motion
    - Continuity Editing
    - Object Continuity
    - Animated Transitions!

## Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer)? Why?
  
The 3 categories are:
  
- Ordering:
    - Random Access
    - User Directed Path
    - Linear  


- Interactivity:
    - Hover Highlighting / Details
    - Filtering / Selection / Search
    - Navigation Buttons
    - Very Limited Interactivity
    - Explicit Instruction
    - Tacit Tutorial
    - Stimulating Default Views


- Messaging:
    - Captions / Headlines
    - Annotations
    - Accompanying Article
    - Multi-Messaging
    - Comment Repetition
    - Introductory Text
    - Summary / Synthesis
  

# Visualizations

<p style="border:2px solid black"> </p>

- Explain the visualizations you've chosen.
- Why are they right for the story you want to tell?


# Discussion

<p style="border:2px solid black"> </p>

Think critically about your creation

- What went well?
- What is still missing? What could be improved? Why?


# Contributions

<p style="border:2px solid black"> </p>

Who did what?

- You should write (just briefly) which group member was mainly responsible for which elements of the assignment. (I want you guys to understand every part of the assignment, but usually there is someone who took the lead role on certain portions of the work. That's what you should explain).
- It is not OK simply to write "All group members contributed equally".


# References

<p style="border:2px solid black"> </p>

Make sure that you use references when they're needed and follow academic standards.