<p style="border:2px solid black"> </p>
<span style="font-family:Lucida Bright;">
<p style="margin-bottom:1cm"></p>
<center>
<font size="7"><b>Social Data Analysis and Visualization</b></font>
<p style="margin-bottom:1cm"></p>
<font size="3"><b>Final Project</b></font>
<p style="margin-bottom:1cm"></p>
<font size="6"><b>Demographics of Copenhagen</b></font>
<p style="margin-bottom:0.8cm"></p>
<font size="3"><b>Wojciech Mazurkiewicz, DTU, 14 May 2021</b></font>
<p style="margin-bottom:1.5cm"></p>
<font size="6"><b>Loading and Cleaning of Data</b></font>
<br>
<font size="3"><b></b></font>
</center>
<p style="margin-bottom:0.7cm"></p>
<p style="border:2px solid black"> </p>

# How to read this notebook

<p style="border:2px solid black"> </p>

<span style="font-family:Arial;">

Please note that the pre-rendered outputs will only display properly once the notebook is __trusted__.

# Introduction

<p style="border:2px solid black"> </p>

This notebook describes how the data about different demographic quantities for Copenhagen is loaded and cleaned. Each demographic quantity is described in its own section and, at the end of each section, the clean data is saved to disk as a pickled Pandas dataframe.

The data about the different demographic quantities for Copenhagen has been obtained from https://kk.statistikbank.dk. As the web interface only permits a withdrawal of 50,000 cells at a time, it was often necessary to split the data into several smaller tables that then had to be reassembled. In addition, the format and resolution of the same quantities, such as age and time, were not always consistent across tables, so considerable effort went into cleaning and harmonizing the data.

I have focused only on data that makes it possible to trace how different demographic quantities change over time in the different districts of Copenhagen. These quantities include:

1. citizenship (Danes vs. Western and non-Western non-Danes)
2. marital status
3. family type and children
4. income
5. life span
6. population movement data (immigration, births, deaths, etc.)

However, out of sheer interest, I have also included the information about the entire population of Copenhagen by the country of origin.

# Initialization

<p style="border:2px solid black"> </p>

The initialization procedure, including the definitions of many of the functions used to load and clean the data in this notebook, is defined in the [Initialization notebook](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/initialization.html). Let's run it now:

In [3]:
%run ./initialization.ipynb

# Country of origin (without district information)

<p style="border:2px solid black"> </p>

##  Source

The data about the population of Copenhagen by the citizens' country of origin was obtained from: www.statistikbanken.dk/FOLK1C.

Due to the withdrawal limit of 50,000 cells, the data has been split into multiple files.

## Load data

Let's load the data:

In [4]:
# Define the columns on which the split tables will be merged.
merge_on_columns = ['region', 'sex', 'age', 'country of origin']

# Load the dataframe.
for idx, years in enumerate(
    ['2008', '2009-2010', '2011-2012', '2013-2014',
     '2015-2016', '2017-2018', '2019-2020']):

    # Define the path to the data file.
    path_csv = (
        path_data_without_district_info_root /
        ('cph_population_by_country_' + years + '.csv')
    )

    # Merge the dataframes.
    if idx == 0:
        df_country_raw = pd.read_csv(path_csv,
                                     sep='\t',
                                     skiprows=0,
                                     encoding='windows-1252')
    else:
        df_country_raw = df_country_raw.merge(
            pd.read_csv(path_csv,
                        sep='\t',
                        skiprows=0,
                        encoding='windows-1252'),
            left_on=merge_on_columns,
            right_on=merge_on_columns
        )

# Show the dataframe.
display(df_country_raw)

Unnamed: 0,region,sex,age,country of origin,2008Q1,2008Q2,2008Q3,2008Q4,2009Q1,2009Q2,2009Q3,2009Q4,2010Q1,2010Q2,2010Q3,2010Q4,2011Q1,2011Q2,2011Q3,2011Q4,2012Q1,2012Q2,2012Q3,2012Q4,2013Q1,2013Q2,2013Q3,2013Q4,2014Q1,2014Q2,2014Q3,2014Q4,2015Q1,2015Q2,2015Q3,2015Q4,2016Q1,2016Q2,2016Q3,2016Q4,2017Q1,2017Q2,2017Q3,2017Q4,2018Q1,2018Q2,2018Q3,2018Q4,2019Q1,2019Q2,2019Q3,2019Q4,2020Q1,2020Q2,2020Q3,2020Q4
0,Copenhagen,Men,0-4 years,Total,16516,16623,16708,16857,17122,17352,17503,17656,17935,18176,18314,18523,18730,18916,19039,19162,19186,19325,19463,19438,19508,19623,19721,19617,19581,19607,19640,19646,19591,19598,19602,19530,19500,19589,19726,19840,19935,20086,20151,20209,20184,20304,20452,20451,20459,20479,20660,20695,20663,20745,20740,20617
1,Copenhagen,Men,0-4 years,Denmark,13303,13416,13542,13699,13930,14136,14295,14446,14726,14949,15082,15244,15426,15589,15732,15824,15857,16004,16125,16144,16165,16289,16351,16229,16225,16262,16249,16206,16129,16096,16081,16016,15989,16048,16152,16253,16336,16434,16499,16539,16516,16629,16754,16729,16775,16794,16919,16961,16951,17054,17099,17004
2,Copenhagen,Men,0-4 years,Albania,0,0,0,0,0,0,0,0,0,1,1,0,1,1,1,1,1,1,1,1,3,4,4,4,4,6,6,6,6,5,6,6,5,5,5,5,5,5,6,6,6,5,4,5,6,5,7,9,10,10,11,10
3,Copenhagen,Men,0-4 years,Andorra,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,Copenhagen,Men,0-4 years,Belgium,1,2,1,1,1,2,2,4,5,4,4,4,5,5,6,6,6,7,8,9,7,7,7,7,7,6,9,8,7,6,8,6,5,7,7,10,10,10,11,12,14,15,15,12,11,14,10,9,7,9,10,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9991,Copenhagen,Women,100 years and over,Vanuatu,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9992,Copenhagen,Women,100 years and over,East Timor,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9993,Copenhagen,Women,100 years and over,Pacific Islands,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9994,Copenhagen,Women,100 years and over,Stateless,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Clean data

Let's start by dropping the information about the region, as it is always set to "Copenhagen":

In [5]:
# Drop region information (as they are all about Copenhagen).
df_country = df_country_raw.drop(
    ['region'], axis=1)

Now, we can clean the data so that:
- The ***Year***, ***Quarter***, and ***Number of people*** will appear in designated columns.
- The age is standardized to 10-year intervals.
- All the numerical values are formatted as floats except ***Year*** and ***Quarter***, which are formatted as integers.
- Columns ***Year***, ***Quarter***, ***District***, ***District type***, ***Age***, and ***Sex*** will appear in the front. All the other columns will be sorted alphabetically.

The cleaning is performed using the function `clean_cph_dataframe`, which is defined in the [Initialization notebook](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/initialization.html).
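
The first step in the list above, moving the quarterly columns into designated ***Year***, ***Quarter***, and ***Number of people*** columns, can be illustrated on a toy table. This is only a minimal sketch under assumed column names and values; the actual `clean_cph_dataframe` lives in the Initialization notebook and may work differently.

```python
import pandas as pd

# Toy wide table shaped like the raw export: one column per quarter.
# The values are made up for illustration.
df_wide = pd.DataFrame({
    'sex': ['Men', 'Women'],
    'age': ['0-4 years', '0-4 years'],
    '2008Q1': [10, 12],
    '2008Q4': [11, 13],
})

# Move the quarter columns into rows (wide -> long).
df_long = df_wide.melt(id_vars=['sex', 'age'],
                       var_name='period',
                       value_name='Number of people')

# Split labels such as '2008Q4' into a year part and a quarter part.
parts = df_long['period'].str.extract(
    r'^(?P<Year>\d{4})Q(?P<Quarter>[1-4])$')

# Year and Quarter become nullable integers, the counts become floats.
df_long['Year'] = pd.to_numeric(parts['Year']).astype('Int64')
df_long['Quarter'] = pd.to_numeric(parts['Quarter']).astype('Int64')
df_long['Number of people'] = df_long['Number of people'].astype(float)
df_long = df_long.drop(columns='period')
```

Keeping the time information in two integer columns makes it straightforward to filter on a single quarter per year later on.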

In [6]:
# Clean and standardize the dataframe
df_country = clean_cph_dataframe(df_country,
                                 value_name='Number of people',
                                 df_name='Country of origin')

# Ensure correct data types.
columns_int = ['Year', 'Quarter']
columns_str = ['Sex', 'Age', 'Country of origin']
df_country = set_data_types(df_country, columns_int, columns_str)

# Show dataframe.
display(df_country)

Unnamed: 0,Year,Quarter,Sex,Age,Country of origin,Number of people
0,2008,4,Men,0-9 years,Abu Dhabi,0.0
1,2009,4,Men,0-9 years,Abu Dhabi,0.0
2,2010,4,Men,0-9 years,Abu Dhabi,0.0
3,2011,4,Men,0-9 years,Abu Dhabi,0.0
4,2012,4,Men,0-9 years,Abu Dhabi,0.0
...,...,...,...,...,...,...
61875,2016,4,Women,>= 90 years,Zimbabwe,0.0
61876,2017,4,Women,>= 90 years,Zimbabwe,0.0
61877,2018,4,Women,>= 90 years,Zimbabwe,0.0
61878,2019,4,Women,>= 90 years,Zimbabwe,0.0


## Show basic statistics

Let's show the basic statistics about the data set:

In [13]:
show_stats(df_country)

Unnamed: 0,Data types
Year,Int64
Quarter,Int64
Sex,object
Age,object
Country of origin,object
Number of people,float64


Unnamed: 0,Number of missing values
Year,0
Quarter,0
Sex,0
Age,0
Country of origin,0
Number of people,0


Unnamed: 0,Year,Quarter,Sex,Age,Country of origin,Number of people
count,61880.0,61880.0,61880,61880,61880,61880.0
unique,,,2,10,238,
top,,,Women,0-9 years,Bulgaria,
freq,,,30940,6188,260,
mean,2014.0,4.0,,,,243.5
std,3.7,0.0,,,,2873.9
min,2008.0,4.0,,,,0.0
25%,2011.0,4.0,,,,0.0
50%,2014.0,4.0,,,,1.0
75%,2017.0,4.0,,,,11.0


We can see that no values are missing, and that the data types are as intended. We can also see that the ranges of the values in all columns seem reasonable. Therefore, we are happy with the result.
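
For readers without access to the Initialization notebook, the three tables above can be approximated with plain pandas calls. The snippet below is a hedged sketch on a made-up dataframe, not the actual implementation of `show_stats`.

```python
import pandas as pd

# Small stand-in for a cleaned dataframe (values are made up).
df = pd.DataFrame({
    'Year': pd.array([2008, 2009, 2010], dtype='Int64'),
    'Sex': ['Men', 'Women', 'Men'],
    'Number of people': [0.0, 1.0, 11.0],
})

# Table 1: the data type of each column.
dtypes = df.dtypes.astype(str).to_frame('Data types')

# Table 2: the number of missing values in each column.
missing = df.isna().sum().to_frame('Number of missing values')

# Table 3: descriptive statistics for numerical and categorical columns.
summary = df.describe(include='all')

# A simple programmatic version of the visual check.
has_missing_values = bool(missing['Number of missing values'].sum() > 0)
```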

## Save data

Save the dataframe containing the clean data to disk:

In [7]:
# Save the clean data.
df_country.to_pickle(path_data_clean_root /
                     'cph_population_by_country_of_origin_without_district.pkl')

# Citizenship

<p style="border:2px solid black"> </p>

## Source

The data about the population of Copenhagen by citizenship was obtained from: https://kk.statistikbank.dk/KKBEF8

Due to the withdrawal limit of 50,000 cells, the data has been split into multiple files.
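
Reassembling such split tables amounts to merging them on the columns they share. The snippet below is a minimal sketch of that idea on two toy tables; the key columns and values are illustrative, and the actual `load_split_dataframe` from the Initialization notebook may be implemented differently.

```python
from functools import reduce

import pandas as pd

# Columns that identify a row in every split table (assumed for this sketch).
key_columns = ['citizenship', 'district']

# Two toy splits covering the same rows but different time periods.
part_a = pd.DataFrame({'citizenship': ['Denmark', 'Denmark'],
                       'district': ['Indre By', 'Nørrebro'],
                       '1980Q1': [817, 1266]})
part_b = pd.DataFrame({'citizenship': ['Denmark', 'Denmark'],
                       'district': ['Indre By', 'Nørrebro'],
                       '1981Q1': [788, 1216]})

# Merge the splits one after another on the key columns.
# validate='one_to_one' raises if a key combination is duplicated.
df_full = reduce(
    lambda left, right: left.merge(right,
                                   on=key_columns,
                                   how='inner',
                                   validate='one_to_one'),
    [part_a, part_b]
)
```

An inner merge silently drops rows that are missing from one of the splits, so comparing `len(df_full)` with the length of each split is a cheap sanity check.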

## Load data

Let's load the data:

In [10]:
# Get the paths to all files containing splits of the dataframe.
paths_csv = [path_csv
             for path_csv in path_data_citizenship_root.glob('**/*.csv')
             if path_csv.is_file()]

# Load the dataframe from files.
df_citizenship_raw = load_split_dataframe(paths_csv)

# Show the loaded dataframe.
display(df_citizenship_raw)

Unnamed: 0,citizenship,age,sex,district,1980Q1,1981Q1,1982Q1,1983Q1,1984Q1,1985Q1,1986Q1,1987Q1,1988Q1,1989Q1,1990Q1,1991Q1,1992Q1,1993Q1,1994Q1,1995Q1,1996Q1,1997Q1,1998Q1,1999Q1,2000Q1,2001Q1,2002Q4,2003Q4,2004Q4,2005Q4,2006Q4,2007Q4,2008Q4,2009Q4,2010Q4,2011Q4,2012Q4,2013Q4,2014Q4,2015Q4,2016Q4,2017Q4,2018Q4,2019Q4,2020Q4
0,Denmark,0-4 years,Men,Copenhagen total,9146,8814,8733,8723,8700,8526,8390,8426,8690,9102,9632,9970,10443,11017,11338,11663,11951,12280,12348,12630,12870,13494,14296,14542,14497,14439,14508,14970,15388,16130,16866,17408,17641,17650,17523,17236,17549,17842,17992,18148,18158
1,Denmark,0-4 years,Men,District - Indre By,817,788,798,778,729,728,751,720,710,773,805,795,793,846,849,864,944,1018,1001,1065,1155,1207,1255,1283,1252,1259,1293,1271,1270,1276,1291,1325,1354,1342,1337,1337,1353,1316,1311,1300,1262
2,Denmark,0-4 years,Men,District - Østerbro,1245,1246,1244,1232,1196,1128,1113,1110,1128,1194,1263,1335,1368,1478,1570,1620,1670,1779,1820,1872,1854,1920,2012,2097,2103,2095,2027,2111,2173,2268,2319,2342,2332,2336,2290,2186,2145,2171,2161,2193,2154
3,Denmark,0-4 years,Men,District - Nørrebro,1266,1216,1216,1193,1183,1241,1203,1221,1279,1363,1443,1502,1607,1696,1684,1757,1787,1794,1872,1890,1945,2066,2312,2309,2304,2225,2144,2158,2105,2188,2260,2310,2307,2339,2384,2354,2467,2533,2604,2536,2501
4,Denmark,0-4 years,Men,District - Vesterbro/Kongens Enghave,990,939,929,920,935,842,788,768,796,812,888,954,952,976,1005,1039,1008,1048,1092,1164,1233,1346,1515,1617,1611,1616,1574,1675,1729,1880,2040,2153,2142,2135,2106,2093,2176,2213,2268,2343,2368
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7675,Non-western countries,95+years,Women,Polling area - 9. Syd,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7676,Non-western countries,95+years,Women,Polling area - 9. Øst,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7677,Non-western countries,95+years,Women,Polling area - 9. Vest,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0
7678,Non-western countries,95+years,Women,Polling area - 9. Midt,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## Clean data

Now, we can clean the data so that:
- The ***Year***, ***Quarter***, and ***Number of people*** will appear in designated columns.
- The age is standardized to 10-year intervals.
- All the numerical values are formatted as floats except ***Year*** and ***Quarter***, which are formatted as integers.
- Columns ***Year***, ***Quarter***, ***District***, ***District type***, ***Age***, and ***Sex*** will appear in the front. All the other columns will be sorted alphabetically.

The cleaning is performed using the function `clean_cph_dataframe`, which is defined in the [Initialization notebook](https://social-data-analysis-and-visualization-final-project.s3.eu-central-1.amazonaws.com/initialization.html).
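
The age standardization mentioned in the list above can be sketched as follows: each raw age label is mapped to a 10-year band, and the counts falling into the same band are summed. The helper `to_ten_year_band` is hypothetical and the intermediate band labels are assumed; the real logic is part of `clean_cph_dataframe`.

```python
import re

import pandas as pd


def to_ten_year_band(label: str) -> str:
    """Map a raw age label (e.g. '0-4 years', '95+years') to a 10-year band."""
    # The first number in the label is the lower bound of the raw interval.
    lower = int(re.match(r'\d+', label).group())
    if lower >= 90:
        return '>= 90 years'
    start = lower // 10 * 10
    return f'{start}-{start + 9} years'


# Toy data with 5-year bands (values are made up).
df = pd.DataFrame({
    'Age': ['0-4 years', '5-9 years', '90-94 years', '95+years'],
    'Number of people': [10.0, 20.0, 3.0, 1.0],
})

# Relabel the bands and sum the counts within each new band.
df['Age'] = df['Age'].map(to_ten_year_band)
df_banded = df.groupby('Age', as_index=False)['Number of people'].sum()
```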

In [11]:
# Add district type and clean the district and time data.
df_citizenship = clean_cph_dataframe(df_citizenship_raw,
                                     value_name='Number of people',
                                     df_name='Citizenship')

# Ensure correct data types.
columns_int = ['Year', 'Quarter']
columns_str = ['District', 'District type', 'Sex', 'Age', 'Citizenship']
df_citizenship = set_data_types(df_citizenship, columns_int, columns_str)

# Show the dataframe.
display(df_citizenship)

Unnamed: 0,Year,Quarter,District,District type,Sex,Age,Citizenship,Number of people
0,1980,1,1. Nord,Polling area,Men,0-9 years,Denmark,265.0
1,1981,1,1. Nord,Polling area,Men,0-9 years,Denmark,278.0
2,1982,1,1. Nord,Polling area,Men,0-9 years,Denmark,265.0
3,1983,1,1. Nord,Polling area,Men,0-9 years,Denmark,249.0
4,1984,1,1. Nord,Polling area,Men,0-9 years,Denmark,249.0
...,...,...,...,...,...,...,...,...
157435,2016,4,Østerbro,District,Women,>= 90 years,Western countries,3.0
157436,2017,4,Østerbro,District,Women,>= 90 years,Western countries,4.0
157437,2018,4,Østerbro,District,Women,>= 90 years,Western countries,5.0
157438,2019,4,Østerbro,District,Women,>= 90 years,Western countries,6.0


## Show basic statistics

Let's show the basic statistics about the data set:

In [15]:
show_stats(df_citizenship)

Unnamed: 0,Data types
Year,Int64
Quarter,Int64
District,object
District type,object
Sex,object
Age,object
Citizenship,object
Number of people,float64


Unnamed: 0,Number of missing values
Year,0
Quarter,0
District,0
District type,0
Sex,0
Age,0
Citizenship,0
Number of people,0


Unnamed: 0,Year,Quarter,District,District type,Sex,Age,Citizenship,Number of people
count,157440.0,157440.0,157440,157440,157440,157440,157440,157440.0
unique,,,64,3,2,10,3,
top,,,3. Nord,Polling area,Men,0-9 years,Non-western countries,
freq,,,2460,130380,78720,15744,52480,
mean,2000.0,2.4,,,,,,401.4
std,11.8,1.5,,,,,,2047.8
min,1980.0,1.0,,,,,,0.0
25%,1990.0,1.0,,,,,,7.0
50%,2000.0,1.0,,,,,,39.0
75%,2010.0,4.0,,,,,,287.0


We can see that no values are missing, and that the data types are as intended. We can also see that the ranges of the values in all columns seem reasonable. Therefore, we are happy with the result.
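
The data types checked above (`Int64` for ***Year*** and ***Quarter***, strings for the categorical columns) could be enforced roughly as in the sketch below. The function `set_types_sketch` is a hypothetical stand-in; the actual `set_data_types` is defined in the Initialization notebook and may behave differently.

```python
import pandas as pd


def set_types_sketch(df, columns_int, columns_str):
    """Return a copy of df with integer and string columns cast explicitly."""
    df = df.copy()
    for column in columns_int:
        # 'Int64' is the nullable integer type of pandas.
        df[column] = pd.to_numeric(df[column]).astype('Int64')
    for column in columns_str:
        df[column] = df[column].astype(str)
    return df


# Toy dataframe with badly typed columns (values are made up).
df_example = pd.DataFrame({
    'Year': ['1980', '1981'],
    'Quarter': [1.0, 4.0],
    'District': ['Østerbro', 'Nørrebro'],
})

df_typed = set_types_sketch(df_example,
                            columns_int=['Year', 'Quarter'],
                            columns_str=['District'])
```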

## Save data

Save the dataframe containing the clean data to disk:

In [14]:
# Save the clean data.
df_citizenship.to_pickle(path_data_clean_root /
                         'cph_population_by_citizenship.pkl')
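
One reason to pickle the clean dataframes rather than write them to CSV is that a pickle keeps the data types set during cleaning. The round trip below is a small sketch using a temporary directory and made-up values; the file name is only an example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy dataframe with a nullable integer column (values are made up).
df = pd.DataFrame({
    'Year': pd.array([1980, 1981], dtype='Int64'),
    'District': ['Østerbro', 'Nørrebro'],
    'Number of people': [3.0, 4.0],
})

# Write the dataframe to a pickle and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path_pickle = Path(tmp_dir) / 'example.pkl'
    df.to_pickle(path_pickle)
    df_restored = pd.read_pickle(path_pickle)

# DataFrame.equals compares both the values and the column data types.
round_trip_ok = df_restored.equals(df)
```

Note that pickles should only be loaded from trusted sources, as unpickling can execute arbitrary code.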

<p style="border:2px solid black"> </p>

# Marital status (with district)

## Source

The data was obtained from: https://kk.statistikbank.dk/KKBEF1

## Load data

In [None]:
# Get the paths to all files containing splits of the dataframe.
paths_csv = [path_csv
             for path_csv in path_data_marital_status_root.glob('**/*.csv')
             if path_csv.is_file()]

# Load the dataframe from files.
df_marital_status_raw = load_split_dataframe(paths_csv)

# Show the dataframe.
display(df_marital_status_raw)

## Clean data

In [None]:
# Add district type and clean the district and time data.
df_marital_status = clean_cph_dataframe(
    df_marital_status_raw,
    value_name='Number of people',
    df_name='Marital status w. district'
)

# Ensure correct data types.
columns_int = ['Year', 'Quarter']
columns_str = ['District', 'District type', 'Sex', 'Age', 'Marital status']
df_marital_status = set_data_types(df_marital_status, columns_int, columns_str)

# Save the clean data.
df_marital_status.to_pickle(path_data_clean_root /
                            'cph_population_by_marital_status.pkl')

# Show the dataframe.
display(df_marital_status)

## Show statistics

In [None]:
show_stats(df_marital_status)

<p style="border:2px solid black"> </p>

# Family type and children

## Source

The data was obtained from: https://kk.statistikbank.dk/KKFAM1

## Load data

In [None]:
# Define the path to the data file.
path_csv = path_data_root / 'cph_children_1998-2020.csv'

# Load the dataframe from the file.
df_children_raw = load_cph_df(path_csv)

# Show the dataframe.
display(df_children_raw)

## Clean data

In [None]:
# Add district type and clean the district and time data.
df_children = clean_cph_dataframe(
    df_children_raw,
    value_name='Number of families',
    df_name='Number of children'
)

# Ensure correct data types.
columns_int = ['Year', 'Quarter']
columns_str = ['District', 'District type', 'Family type', 'Number of children']
df_children = set_data_types(df_children, columns_int, columns_str)

# Save the clean data.
df_children.to_pickle(path_data_clean_root /
                      'cph_population_by_family_type_and_number_of_chidren.pkl')

# Show the dataframe.
display(df_children)

## Show statistics

In [None]:
show_stats(df_children)

<p style="border:2px solid black"> </p>

# Income

## Source

The data was obtained from: https://kk.statistikbank.dk/KKIND3

## Load data

In [None]:
# Define the path to the data file.
path_csv = path_data_root / 'cph_income_1987-2019.csv'

# Load the dataframe from the file.
df_income_raw = load_cph_df(path_csv)

# Show the dataframe.
display(df_income_raw.head(3))

## Clean data

In [None]:
# Add district type and clean the district and time data.
df_income = clean_cph_dataframe(df_income_raw,
                                value_name='Value',
                                df_name='Income')

# Use only total personal income, and use all unique values
# in the column "Unit" as column names.
df_income = df_sort_columns(
    df_income.loc[(df_income['Type of income']
                   .isin(['Personal income in total (ex. imputed rent and '
                          'before deductions of interest expenses)']))]
    .drop(['Type of income'], axis=1)
    .pivot_table(values='Value',
                 index=[column for column in list(df_income.columns)
                        if column not in ['Type of income', 'Unit', 'Value']],
                 columns=['Unit'],
                 aggfunc='first')
    .reset_index()
)

# Delete the name of the index of columns.
df_income.columns.name = ''

# The source reports the amount of income in thousands of kr.,
# so convert it to kr.
df_income['Amount of income (kr.)'] = df_income['Amount of income (kr.)'].mul(1000)

# Ensure correct data types.
columns_int = ['Year']
columns_str = ['District', 'District type', 'Sex']
df_income = set_data_types(df_income, columns_int, columns_str)

# Rename columns
df_income = df_income.rename(
    columns={'Amount of income (kr.)': 'Total income in district (kr.)',
             'Average income for people with the type of income (kr.)': 'Average income (kr.)',
             'People with the type of income (number)': 'Number of people'}
)

# Save the clean data.
df_income.to_pickle(path_data_clean_root /
                    'cph_income.pkl')

# Show the dataframe.
display(df_income)

## Show statistics

In [None]:
show_stats(df_income)
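
The reshaping used for the income table above, where each distinct value of the column "Unit" becomes its own column, can be illustrated on a toy table. This is a hedged sketch with made-up values; only the unit labels are taken from the cleaning code.

```python
import pandas as pd

# Toy long-format table: one row per (district, year, unit).
df_long = pd.DataFrame({
    'District': ['Indre By', 'Indre By', 'Nørrebro', 'Nørrebro'],
    'Year': [2019, 2019, 2019, 2019],
    'Unit': ['Amount of income (kr.)',
             'People with the type of income (number)'] * 2,
    'Value': [500.0, 2.0, 300.0, 3.0],
})

# Turn each unique unit into a column; 'first' keeps the single
# value that exists for every (district, year, unit) combination.
df_wide = (
    df_long
    .pivot_table(values='Value',
                 index=['District', 'Year'],
                 columns='Unit',
                 aggfunc='first')
    .reset_index()
)

# Delete the name of the index of columns.
df_wide.columns.name = ''

# Convert the amount from thousands of kr. to kr.
df_wide['Amount of income (kr.)'] = df_wide['Amount of income (kr.)'].mul(1000)
```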

<p style="border:2px solid black"> </p>

# Life span

## Source

The data was obtained from: https://kk.statistikbank.dk/KKBEF4

## Load data

In [None]:
# Define the path to the data file.
path_csv = path_data_root / 'cph_life_expectancy_5_years_average_2009-2020.csv'

# Load the dataframe from the file.
df_life_span_raw = load_cph_df(path_csv)

# Show the dataframe.
display(df_life_span_raw)

## Clean data

In [None]:
# Add district type and clean the district and time data.
df_life_span = clean_cph_dataframe(df_life_span_raw,
                                   value_name='Average life span',
                                   df_name='Life span')

# Ensure correct data types.
columns_int = ['Year']
columns_str = ['District', 'District type']
df_life_span = set_data_types(df_life_span, columns_int, columns_str)

# Save the clean data.
df_life_span.to_pickle(path_data_clean_root /
                       'cph_life_span.pkl')

# Show the dataframe.
display(df_life_span)

## Show statistics

In [None]:
show_stats(df_life_span)

<p style="border:2px solid black"> </p>

# Population movement data

## Source

The data was obtained from: https://kk.statistikbank.dk/KKBEF6

## Load data

In [None]:
# Define the path to the data file.
path_csv = path_data_root / 'cph_polulation_stats_summary_1975-2020.csv'

# Load the dataframe from the file.
df_movement_raw = load_cph_df(path_csv)

# Show the dataframe.
display(df_movement_raw)

## Clean data

In [None]:
# Add district type and clean the district and time data.
df_movement = clean_cph_dataframe(df_movement_raw,
                                  df_name='Population movement')


# Make each unique value from the column "Type of movement"
# into a column
df_movement = df_sort_columns(
    df_movement
    .pivot_table(values='Value',
                 index=['District', 'District type', 'Year'],
                 columns='Type of movement',
                 aggfunc='first')
    .reset_index()
)

# Delete the name of the index of columns.
df_movement.columns.name = ''

# Ensure correct data types.
columns_int = ['Year']
columns_str = ['District', 'District type']
df_movement = set_data_types(df_movement, columns_int, columns_str)

# Save the clean data.
df_movement.to_pickle(path_data_clean_root /
                      'cph_population_movement.pkl')

# Show the dataframe.
display(df_movement)

## Show statistics

In [None]:
show_stats(df_movement)

<p style="border:2px solid black"> </p>

# Dwellings

## Source

The data was obtained from: https://kk.statistikbank.dk/KKBOL2

## Load data

In [None]:
# Define the path to the data file.
path_csv = path_data_dwellings_root / 'cph_dwellings_1991-2021.csv'

# Load the dataframe from the file.
df_dwellings_raw = load_cph_df(path_csv)

# Show the dataframe.
display(df_dwellings_raw)

## Clean data

In [None]:
# Add district type and clean the district and time data.
df_dwellings = (
    clean_cph_dataframe(df_dwellings_raw,
                        value_name='Total square meters occupied dwellings',
                        df_name='Dwellings')
    .rename(columns={'Ownership': 'Dwelling ownership'})
    .drop(['Unit'], axis=1)
)

# Ensure correct data types.
columns_int = ['Year']
columns_str = ['District', 'District type', 'Dwelling ownership']
df_dwellings = set_data_types(df_dwellings, columns_int, columns_str)

# Save the clean data.
df_dwellings.to_pickle(path_data_clean_root /
                       'cph_dwellings.pkl')

# Show the dataframe.
display(df_dwellings)

## Show statistics

In [None]:
show_stats(df_dwellings)

<p style="border:2px solid black"> </p>

# Sandbox



## Outer join

In [None]:
from functools import reduce

df1 = pd.DataFrame({'district': ['a', 'b', 'c', 'd'],
                    'year': [1, 1, 1, 1],
                    'sex': ['M', 'F', 'M', 'F'],
                    'marital status': ['single', 'single', 'married', 'married'],
                    'number': [20, 30, 10, 20]})

df1.name = 'marital_status'

df2 = pd.DataFrame({'district': ['a', 'b', 'c', 'd'],
                    'year': [1, 1, 1, 1],
                    #                   'sex': ['M', 'F', 'M', 'F'],
                    'income': ['rich', 'poor', 'rich', 'poor'],
                    'number': [5, 15, 25, 45]})

df2.name = 'income'

df3 = pd.DataFrame({'district': ['a', 'a', 'a', 'a'],
                    'year': [1, 1, 1, 5],
                    'age': [7, 8, 2, 7],
                    'marital status': [222, 333, 444, 555]})

df3.name = 'df3'

df4 = pd.DataFrame({'district': ['a', 'd', 'e', 'f'],
                    'year': [1, 2, 3, 5],
                    'age': [7, 8, 2, 8],
                    'marital status': [22, 1, 9, 10]})

df4.name = 'df4'

dfs = [df1, df2]

for df in dfs:
    display(df)

# display(pd.concat([df1, df2, df3], axis=0, join='outer', ignore_index=False))
display(
    reduce(lambda left, right: pd.merge(left,
                                        right,
                                        suffixes=('_' + left.name,
                                                  '_' + right.name),
                                        how='inner',
                                        indicator=True
                                        ),
           dfs)
    .sort_values(by=['year', 'district'])
)


# display(pd.concat(dfs, ignore_index=True, sort=False))
display(
    pd.concat(dfs, ignore_index=True, sort=False)
    .sort_values(by=['year', 'district'])
)

In [None]:
df1 = pd.DataFrame({'district': ['a', 'b', 'c', 'd'],
                    'year': [1, 2, 3, 5],
                    'sex': [7, 8, 9, 10]})
df2 = pd.DataFrame({'district': ['a', 'd', 'e', 'f'],
                    'year': [1, 2, 3, 10],
                    'sex': [7, 8, 2, 7],
                    'age': [22, 1, 9, 10]})
df3 = pd.DataFrame({'district': ['a', 'd', 'e', 'f'],
                    'year': [1, 2, 3, 10],
                    'age': [7, 8, 2, 7],
                    'marital status': [22, 1, 9, 10]})

display(df1)
display(df2)

display(df1.merge(df2, how='inner'))
