# YOUR PROJECT TITLE

> **Note the following:** 
> 1. This is *not* meant to be an example of an actual **data analysis project**, just an example of how to structure such a project.
> 1. Remember the general advice on structuring and commenting your code.
> 1. The `dataproject.py` file includes a function which can be used multiple times in this notebook.

Install the package used to make map plots of Denmark:

In [34]:
#pip install git+https://github.com/sebastianbarfort/mapDK

Imports and set magics:

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import ipywidgets as widgets
from matplotlib_venn import venn2
from dstapi import DstApi


# autoreload modules when code is run
%load_ext autoreload
%autoreload 2

# user written modules
import own_dataproject


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Read and clean data

Import your data, either through an API or manually, and load it. 
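For the manual route, a minimal sketch could look like the following. It assumes a semicolon-separated file; the in-memory string stands in for a downloaded file, and the file name in the comment is hypothetical. The three rows are copied from the Aabenraa 1987 output shown later in this notebook.

```python
import io

import pandas as pd

# Stand-in for a manually downloaded file. In practice, pass a path such as
# "data/indkf132.csv" (hypothetical name) to pd.read_csv instead.
raw_csv = io.StringIO(
    "OMRÅDE;TID;FAMTYP;INDHOLD\n"
    "Aabenraa;1987;Couples, total;2645508\n"
    "Aabenraa;1987;Families, total;3525080\n"
    "Aabenraa;1987;Single people, total;879571\n"
)

# The separator is an assumption; check the actual file before loading it.
manual = pd.read_csv(raw_csv, sep=";")

# Basic sanity checks right after loading: dimensions and column types.
print(manual.shape)
print(manual.dtypes)
```

Checking the shape and dtypes immediately after loading makes it easier to spot a wrong separator or numbers that were read as text.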

In [36]:
kirke = DstApi('KM6')
inc = DstApi('INDKF132')

In [37]:
tabsum = inc.tablesummary(language='en')
display(tabsum)

Table INDKF132: Disposable family income by region, unit, family type, income interval and time
Last update: 2022-11-24T08:00:00


Unnamed: 0,variable name,# values,First value,First value label,Last value,Last value label,Time variable
0,OMRÅDE,99,000,All Denmark,851,Aalborg,False
1,ENHED,3,102,Families in the group (Number),117,Average income for families in the group (DKK),False
2,FAMTYP,3,FAIA,"Families, total",ENIA,"Single people, total",False
3,INDKINTB,11,99,Total,725,"1 million DKK, and more",False
4,Tid,35,1987,1987,2021,2021,True


In [38]:
# The available values for each variable:
for variable in tabsum['variable name']:
    print(variable+':')
    display(inc.variable_levels(variable, language='en'))

OMRÅDE:


Unnamed: 0,id,text
0,000,All Denmark
1,101,Copenhagen
2,147,Frederiksberg
3,155,Dragør
4,185,Tårnby
...,...,...
94,773,Morsø
95,840,Rebild
96,787,Thisted
97,820,Vesthimmerlands


ENHED:


Unnamed: 0,id,text
0,102,Families in the group (Number)
1,110,Amount of income (DKK 1.000)
2,117,Average income for families in the group (DKK)


FAMTYP:


Unnamed: 0,id,text
0,FAIA,"Families, total"
1,PAIA,"Couples, total"
2,ENIA,"Single people, total"


INDKINTB:


Unnamed: 0,id,text
0,99,Total
1,800,"Less than 200,000 DKK"
2,810,"200,000 - 299,999 DKK"
3,815,"300,000 - 399,999 DKK"
4,820,"400,000 - 499,999 DKK"
5,600,"500,000 - 599,999 DKK"
6,700,"600,000 - 699,999 DKK"
7,710,"700,000 - 799,000 DKK"
8,715,"800,000 - 899,000 DKK"
9,720,"900,000 - 999,000 DKK"


Tid:


Unnamed: 0,id,text
0,1987,1987
1,1988,1988
2,1989,1989
3,1990,1990
4,1991,1991
5,1992,1992
6,1993,1993
7,1994,1994
8,1995,1995
9,1996,1996


In [39]:
params = inc._define_base_params(language='en')
params

{'table': 'indkf132',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'ENHED', 'values': ['*']},
  {'code': 'FAMTYP', 'values': ['*']},
  {'code': 'INDKINTB', 'values': ['*']},
  {'code': 'Tid', 'values': ['*']}]}

In [40]:
params = {'table': 'indkf132',
 'format': 'BULK',
 'lang': 'en',
 'variables': [{'code': 'OMRÅDE', 'values': ['*']},
  {'code': 'ENHED', 'values': ['110']},
  {'code': 'FAMTYP', 'values': ['*']},
  {'code': 'INDKINTB', 'values': ['99']},
  {'code': 'Tid', 'values': ['*']}]}

In [41]:
inc_table = inc.get_data(params=params)
# CHR: pass the params dictionary created above to get_data to download the data
inc_table.head(5)

Unnamed: 0,OMRÅDE,ENHED,FAMTYP,INDKINTB,TID,INDHOLD
0,Rudersdal,Amount of income (DKK 1.000),"Single people, total",Total,2018,4556836
1,Rudersdal,Amount of income (DKK 1.000),"Families, total",Total,2018,18675308
2,Rudersdal,Amount of income (DKK 1.000),"Couples, total",Total,2018,14118472
3,Egedal,Amount of income (DKK 1.000),"Single people, total",Total,2018,2101655
4,Egedal,Amount of income (DKK 1.000),"Families, total",Total,2018,9560281


In [42]:
inc_table.sort_values(by=['OMRÅDE', 'TID', 'FAMTYP'], inplace=True)
inc_table.head(5)

Unnamed: 0,OMRÅDE,ENHED,FAMTYP,INDKINTB,TID,INDHOLD
4182,Aabenraa,Amount of income (DKK 1.000),"Couples, total",Total,1987,2645508
4183,Aabenraa,Amount of income (DKK 1.000),"Families, total",Total,1987,3525080
4181,Aabenraa,Amount of income (DKK 1.000),"Single people, total",Total,1987,879571
6195,Aabenraa,Amount of income (DKK 1.000),"Couples, total",Total,1988,2801653
6196,Aabenraa,Amount of income (DKK 1.000),"Families, total",Total,1988,3738808


In [43]:
inc_table_000=inc_table.loc[:, ['OMRÅDE', 'TID', 'FAMTYP','INDHOLD']]

In [44]:
inc_table_000

Unnamed: 0,OMRÅDE,TID,FAMTYP,INDHOLD
4182,Aabenraa,1987,"Couples, total",2645508
4183,Aabenraa,1987,"Families, total",3525080
4181,Aabenraa,1987,"Single people, total",879571
6195,Aabenraa,1988,"Couples, total",2801653
6196,Aabenraa,1988,"Families, total",3738808
...,...,...,...,...
997,Ærø,2020,"Families, total",1139427
996,Ærø,2020,"Single people, total",438829
8663,Ærø,2021,"Couples, total",730489
8662,Ærø,2021,"Families, total",1190369


In [45]:
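# Sum INDHOLD over all family types within each area and year.
# Note: 'Families, total' appears to already equal couples plus single people,
# so this sum is roughly twice the 'Families, total' figure.
inc_grouped = inc_table_000.groupby(['OMRÅDE', 'TID'])['INDHOLD'].sum()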

In [54]:
inc_grouped.info()

<class 'pandas.core.series.Series'>
MultiIndex: 3465 entries, ('Aabenraa', 1987) to ('Ærø', 2021)
Series name: INDHOLD
Non-Null Count  Dtype
--------------  -----
3465 non-null   int64
dtypes: int64(1)
memory usage: 166.1+ KB


In [50]:
I = inc_table_000.FAMTYP.str.contains('Families, total')
inc_table_010 = inc_table_000.loc[I, ['OMRÅDE', 'TID']]

In [51]:
inc_table_010

Unnamed: 0,OMRÅDE,TID
4183,Aabenraa,1987
6196,Aabenraa,1988
4669,Aabenraa,1989
3464,Aabenraa,1990
4003,Aabenraa,1991
...,...,...
2230,Ærø,2017
440,Ærø,2018
2025,Ærø,2019
997,Ærø,2020


In [53]:
inc_grouped

OMRÅDE    TID 
Aabenraa  1987    7050159
          1988    7477616
          1989    7987724
          1990    8327578
          1991    8823632
                   ...   
Ærø       2017    2150522
          2018    2151716
          2019    2185986
          2020    2278854
          2021    2380738
Name: INDHOLD, Length: 3465, dtype: int64

In [52]:
inc_table_020 = inc_table_010.set_index(['OMRÅDE', 'TID']).join(inc_grouped, how='left')
inc_table_020

OMRÅDE,TID,INDHOLD
Aabenraa,1987,7050159
Aabenraa,1988,7477616
Aabenraa,1989,7987724
Aabenraa,1990,8327578
Aabenraa,1991,8823632
...,...,...
Ærø,2017,2150522
Ærø,2018,2151716
Ærø,2019,2185986
Ærø,2020,2278854


## Explore each data set

To **explore the raw data**, you may provide **static** and **interactive plots** that show important developments over time.
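As one possible starting point, a static plot of a single area could be sketched as below. The helper `plot_area` is a hypothetical name, and the sample values are copied from the `inc_grouped` output for Aabenraa shown above; this is an illustration, not the intended analysis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook; remove inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Sample rows copied from the inc_grouped output (Aabenraa, 1987-1991).
sample = pd.DataFrame({
    "OMRÅDE": ["Aabenraa"] * 5,
    "TID": [1987, 1988, 1989, 1990, 1991],
    "INDHOLD": [7050159, 7477616, 7987724, 8327578, 8823632],
})


def plot_area(df, area):
    """Plot INDHOLD over time for one area (hypothetical helper)."""
    subset = df.loc[df["OMRÅDE"] == area].sort_values("TID")
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(subset["TID"], subset["INDHOLD"], marker="o")
    ax.set_title(area)
    ax.set_xlabel("Year")
    ax.set_ylabel("Summed INDHOLD (DKK 1,000)")
    ax.grid(True)
    return fig, ax


fig, ax = plot_area(sample, "Aabenraa")
```

In a notebook, the same function could be made interactive by passing it to `widgets.interact` together with a dropdown of area names, so that each selection redraws the figure.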

# Merge data sets

Now you create combinations of your loaded data sets. Remember the illustration of an (inner) **merge**:

In [None]:
plt.figure(figsize=(15,7))
v = venn2(subsets = (4, 4, 10), set_labels = ('Data X', 'Data Y'))
v.get_label_by_id('100').set_text('dropped')
v.get_label_by_id('010').set_text('dropped')
v.get_label_by_id('110').set_text('included')
plt.show()

Here we are dropping elements from both data set X and data set Y. A left join would instead keep all observations in data X intact and drop only the unmatched observations from Y.

Make sure that your resulting data sets have the correct number of rows and columns. That is, be clear about which observations are thrown away. 
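One way to check which observations are thrown away is to run an outer merge with `indicator=True` and inspect the `_merge` column. The sketch below uses two small made-up data frames (`x` and `y` are purely illustrative) to compare the row counts of an inner and a left join.

```python
import pandas as pd

# Toy data, purely illustrative: keys "b" and "c" exist in both frames.
x = pd.DataFrame({"key": ["a", "b", "c"], "x_val": [1, 2, 3]})
y = pd.DataFrame({"key": ["b", "c", "d"], "y_val": [10, 20, 30]})

# Inner merge keeps only keys found in both frames.
inner = pd.merge(x, y, on="key", how="inner")

# Left merge keeps every row of x; unmatched rows get NaN in y's columns.
left = pd.merge(x, y, on="key", how="left")

# Outer merge with indicator=True labels each row as
# 'left_only', 'right_only' or 'both'.
outer = pd.merge(x, y, on="key", how="outer", indicator=True)
print(outer["_merge"].value_counts())

# Keys that an inner merge would drop.
dropped_by_inner = sorted(outer.loc[outer["_merge"] != "both", "key"])
print(f"inner: {len(inner)} rows, left: {len(left)} rows, dropped: {dropped_by_inner}")
```

Comparing `len()` before and after each merge, together with the `_merge` counts, makes it explicit how many rows each join type discards.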

**Note:** Don't make Venn diagrams in your own data project. It is just for exposition. 

# Analysis

To get a quick overview of the data, we show some **summary statistics** on a meaningful aggregation. 
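A minimal sketch of such summary statistics, aggregated by area, might look like this. The sample values are copied from the `inc_grouped` output above (Aabenraa 1987-1991 and Ærø 2017-2021), so the two areas cover different years and the numbers only illustrate the mechanics.

```python
import pandas as pd

# Sample rows copied from the inc_grouped output shown earlier.
sample = pd.DataFrame({
    "OMRÅDE": ["Aabenraa"] * 5 + ["Ærø"] * 5,
    "TID": [1987, 1988, 1989, 1990, 1991, 2017, 2018, 2019, 2020, 2021],
    "INDHOLD": [7050159, 7477616, 7987724, 8327578, 8823632,
                2150522, 2151716, 2185986, 2278854, 2380738],
})

# One row per area with a few descriptive statistics for INDHOLD.
summary = sample.groupby("OMRÅDE")["INDHOLD"].agg(["count", "mean", "min", "max"])
print(summary)
```

With the full data set, the same `groupby(...).agg(...)` pattern could be applied per year or per family type instead, depending on which aggregation is meaningful for the question at hand.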

MAKE FURTHER ANALYSIS. EXPLAIN THE CODE BRIEFLY AND SUMMARIZE THE RESULTS.

# Conclusion

ADD CONCISE CONCLUSION.