---
# Merging exercises

---
These are introductory exercises in Pandas with a focus on **syntax, indexing, data selection, missing data, aggregations, visualizations**, **data cleaning**, **merging**, **concatenation**, **joining**, and **parsing HTML tables**.

Date: 25/10/2021

Check the [questions][link].

[link]: https://colab.research.google.com/github/kokchun/Databehandling-21/blob/main/Exercises/E02_merging.ipynb

---

## 1. Swedish demographic data (*)

Go to the Swedish-language Wikipedia page [Sveriges demografi](https://sv.wikipedia.org/wiki/Sveriges_demografi).

&nbsp; a) Read in the table under "Befolkningsstatistik sedan 1900" into a DataFrame

&nbsp; b) Do some EDA (exploratory data analysis) on this dataset and draw some relevant graphs.

&nbsp; c) Now we want to go back in time (before 1900) to see how the population of Sweden has changed. Read in the table under history and keep the "Folkmängd" data for 1570-1865.

| År   | Folkmängd |
| ---- | --------- | 
| 1570 | 900000    |     
| 1650 | 1225000   |
| 1700 | 1485000   |
| 1720 | 1350000   |
| 1755 | 1878000   |
| 1815 | 2465000   |
| 1865 | 4099000   |


&nbsp; d) Now concatenate this with the table from 1900 so that you have population data from 1570 to 2020. Note that you may need to clean the data for it to fit properly. There are several ways to do this.

&nbsp; e) Draw a graph of population data from 1570-2020.

<details>
<summary>Hint</summary>

Useful methods:
- append() (removed in pandas 2.0; use concat() instead)
- join()
- concat()
- merge()

</details>

<br/>

<details>

<summary>Answer</summary>

![Sweden population data 1952-2020](../assets/sverige_befolkning_tid.png)

</details>

In [18]:
import pandas as pd
import numpy as np

In [19]:
table = pd.read_html("https://sv.wikipedia.org/wiki/Sveriges_demografi", match="Födda", decimal=',', thousands='.')
table
#https://stackoverflow.com/questions/39412829/pandas-read-html-not-support-decimal-comma

[     Unnamed: 0   Folkmängd    Födda    Döda Befolkningsförändringar  \
 0          1900   5 117 000  138 139  86 146                  51 993   
 1          1901   5 156 000  139 370  82 772                  56 598   
 2          1902   5 187 000  137 364  79 722                  57 642   
 3          1903   5 210 000  133 896  78 610                  55 286   
 4          1904   5 241 000  134 952  80 152                  54 800   
 ..          ...         ...      ...     ...                     ...   
 116        2016   9 995 000  117 425  90 982                  26 443   
 117        2017  10 120 000  115 416  91 972                  23 444   
 118        2018  10 230 000  115 832  92 185                  23 647   
 119        2019  10 327 589  114 523  88 766                  28 727   
 120        2020  10 379 295  113 077  98 124                  14 953   
 
      Nativiteten (per 1000)  Dödstalen (per 1000)  \
 0                      27.0                  16.8   
 1            

In [20]:
df = table[0]
df.head()

|   | Unnamed: 0 | Folkmängd | Födda | Döda | Befolkningsförändringar | Nativiteten (per 1000) | Dödstalen (per 1000) | Befolkningsförändringar (per 1000) | Total fertilitet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1900 | 5 117 000 | 138 139 | 86 146 | 51 993 | 27.0 | 16.8 | 10.2 | 4.02 |
| 1 | 1901 | 5 156 000 | 139 370 | 82 772 | 56 598 | 27.0 | 16.1 | 11.0 | 4.04 |
| 2 | 1902 | 5 187 000 | 137 364 | 79 722 | 57 642 | 26.5 | 15.4 | 11.1 | 3.95 |
| 3 | 1903 | 5 210 000 | 133 896 | 78 610 | 55 286 | 25.7 | 15.1 | 10.6 | 3.82 |
| 4 | 1904 | 5 241 000 | 134 952 | 80 152 | 54 800 | 25.7 | 15.3 | 10.5 | 3.83 |


In [21]:
swe_demographi = df.rename(columns={"Unnamed: 0":"År"})
swe_demographi.head()

|   | År | Folkmängd | Födda | Döda | Befolkningsförändringar | Nativiteten (per 1000) | Dödstalen (per 1000) | Befolkningsförändringar (per 1000) | Total fertilitet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1900 | 5 117 000 | 138 139 | 86 146 | 51 993 | 27.0 | 16.8 | 10.2 | 4.02 |
| 1 | 1901 | 5 156 000 | 139 370 | 82 772 | 56 598 | 27.0 | 16.1 | 11.0 | 4.04 |
| 2 | 1902 | 5 187 000 | 137 364 | 79 722 | 57 642 | 26.5 | 15.4 | 11.1 | 3.95 |
| 3 | 1903 | 5 210 000 | 133 896 | 78 610 | 55 286 | 25.7 | 15.1 | 10.6 | 3.82 |
| 4 | 1904 | 5 241 000 | 134 952 | 80 152 | 54 800 | 25.7 | 15.3 | 10.5 | 3.83 |


In [22]:
swe_demographi.iloc[:, 1:5] 

|   | Folkmängd | Födda | Döda | Befolkningsförändringar |
| --- | --- | --- | --- | --- |
| 0 | 5 117 000 | 138 139 | 86 146 | 51 993 |
| 1 | 5 156 000 | 139 370 | 82 772 | 56 598 |
| 2 | 5 187 000 | 137 364 | 79 722 | 57 642 |
| 3 | 5 210 000 | 133 896 | 78 610 | 55 286 |
| 4 | 5 241 000 | 134 952 | 80 152 | 54 800 |
| ... | ... | ... | ... | ... |
| 116 | 9 995 000 | 117 425 | 90 982 | 26 443 |
| 117 | 10 120 000 | 115 416 | 91 972 | 23 444 |
| 118 | 10 230 000 | 115 832 | 92 185 | 23 647 |
| 119 | 10 327 589 | 114 523 | 88 766 | 28 727 |


In [31]:
swe_demographi.iloc[:, 1:5] = swe_demographi.iloc[:, 1:5].applymap(lambda x: x.replace(" ", "")).astype("int")

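On recent pandas versions, `applymap` is deprecated (it was renamed to `DataFrame.map` in pandas 2.1), and assigning through `iloc` may leave the columns with `object` dtype. The following is a minimal sketch of an alternative that replaces whole columns instead. The frame `demo` is a hypothetical stand-in that reuses the first two rows of the output above:

```python
import pandas as pd

# Small stand-in frame: counts stored as strings with space separators
demo = pd.DataFrame({
    "År": [1900, 1901],
    "Folkmängd": ["5 117 000", "5 156 000"],
    "Födda": ["138 139", "139 370"],
})

count_cols = ["Folkmängd", "Födda"]
# Replace whole columns (rather than assigning through iloc) so the new dtype is kept
demo[count_cols] = demo[count_cols].apply(
    lambda col: col.str.replace(" ", "", regex=False).astype("int64")
)
print(demo.dtypes)
```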
In [27]:
#swe_demographi["Folkmängd"] = swe_demographi["Folkmängd"].str.replace(" ", "")
#swe_demographi.astype({"Folkmängd": "int"})
#https://datagy.io/pandas-select-columns/

In [32]:
swe_demographi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 9 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   År                                  121 non-null    int64  
 1   Folkmängd                           121 non-null    int32  
 2   Födda                               121 non-null    int32  
 3   Döda                                121 non-null    int32  
 4   Befolkningsförändringar             121 non-null    int32  
 5   Nativiteten (per 1000)              121 non-null    float64
 6   Dödstalen (per 1000)                121 non-null    float64
 7   Befolkningsförändringar (per 1000)  121 non-null    float64
 8   Total fertilitet                    121 non-null    float64
dtypes: float64(4), int32(4), int64(1)
memory usage: 6.7 KB


In [9]:
swe_demographi.describe()

|   | År | Folkmängd | Födda | Döda | Befolkningsförändringar | Nativiteten (per 1000) | Dödstalen (per 1000) | Befolkningsförändringar (per 1000) | Total fertilitet |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 | 121.0 |
| mean | 1960.0 | 7436239.0 | 111215.92562 | 82021.132231 | 29219.338843 | 15.710744 | 11.296694 | 4.405785 | 2.253884 |
| std | 35.073732 | 1432004.0 | 14806.054124 | 9503.678548 | 18754.952133 | 4.640991 | 1.826743 | 3.370541 | 0.659067 |
| min | 1900.0 | 5117000.0 | 85020.0 | 63741.0 | -6553.0 | 10.0 | 8.7 | -0.7 | 1.5 |
| 25% | 1930.0 | 6131000.0 | 98463.0 | 73267.0 | 15171.0 | 11.9 | 10.0 | 2.2 | 1.79 |
| 50% | 1960.0 | 7480000.0 | 110192.0 | 80026.0 | 27126.0 | 14.4 | 10.8 | 3.5 | 2.0 |
| 75% | 1990.0 | 8559000.0 | 121679.0 | 91074.0 | 43908.0 | 18.1 | 11.8 | 6.4 | 2.43 |
| max | 2020.0 | 10379300.0 | 139505.0 | 104594.0 | 64967.0 | 27.0 | 18.0 | 11.9 | 4.04 |


In [33]:
# imports
import plotly.express as px
import pandas as pd
import numpy as np

fig = px.line(swe_demographi, x="År", y="Folkmängd", title="Total population in Sweden from 1900", labels={"År": "Year", "Folkmängd": "Population"},markers=True)
fig.show()

In [34]:
fig = px.line(swe_demographi, x="År", y="Total fertilitet", title="Average number of children per woman in Sweden from 1900", labels={"År": "Year", "Total fertilitet": "Number of children"}, markers=True)
fig.show()

In [35]:
fig = px.line(swe_demographi, x="År", y=["Födda", "Döda"], title="Births and deaths in Sweden from 1900", labels={"År": "Year"}, markers=True)
fig.show()

Import historical data (before 1900)

In [124]:
table = pd.read_html("https://sv.wikipedia.org/wiki/Sveriges_demografi", match="1570", decimal=',', thousands=' ', header=[1])
population_history = table[0]


In [125]:
population_history

|   | Vid utgången av år | Folkmängd | Totalt | Promille |
| --- | --- | --- | --- | --- |
| 0 | 1570 | 900 000 | — | — |
| 1 | 1650 | 1 225 000 | 4 063 | 3.86 |
| 2 | 1700 | 1 485 000 | 5 200 | 3.86 |
| 3 | 1720 | 1 350 000 | −6 750 | −4,75 |
| 4 | 1755 | 1 878 000 | 15 086 | 9.48 |
| 5 | 1815 | 2 465 000 | 9 783 | 4.54 |
| 6 | 1865 | 4 099 000 | 32 680 | 10.22 |
| 7 | 1900 | 5 140 000 | 29 743 | 6.48 |
| 8 | 2000 | 8 861 000 |  |  |
| 9 | 2020 | 10 379 000 |  |  |


In [107]:
population_history = population_history.iloc[0:7, 0:2].copy()  # copy so later column assignments do not act on a view
population_history

|   | Vid utgången av år | Folkmängd |
| --- | --- | --- |
| 0 | 1570 | 900 000 |
| 1 | 1650 | 1 225 000 |
| 2 | 1700 | 1 485 000 |
| 3 | 1720 | 1 350 000 |
| 4 | 1755 | 1 878 000 |
| 5 | 1815 | 2 465 000 |
| 6 | 1865 | 4 099 000 |


In [112]:
population_history["Folkmängd"].iloc[0]

'900\xa0000'

In [90]:
population_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Vid utgången av år  7 non-null      object
 1   Folkmängd           7 non-null      object
dtypes: object(2)
memory usage: 240.0+ bytes


In [95]:
population_history.columns

Index(['Vid utgången av år', 'Folkmängd'], dtype='object')

In [108]:
population_history["Folkmängd"]

0      900 000
1    1 225 000
2    1 485 000
3    1 350 000
4    1 878 000
5    2 465 000
6    4 099 000
Name: Folkmängd, dtype: object

In [116]:
population_history["Folkmängd"] = population_history["Folkmängd"].str.replace("\xa0", "")
population_history.head()


|   | Vid utgången av år | Folkmängd |
| --- | --- | --- |
| 0 | 1570 | 900000 |
| 1 | 1650 | 1225000 |
| 2 | 1700 | 1485000 |
| 3 | 1720 | 1350000 |
| 4 | 1755 | 1878000 |

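The scraped strings contain more than ordinary spaces: the output above shows a non-breaking space (`\xa0`), and the "Totalt" column uses a Unicode minus sign and a dash for missing values. The following is a minimal sketch of one way to turn such strings into numbers. The Series `raw` is a hypothetical sample built from values in the table above, and whether each separator is a regular or a non-breaking space is an assumption, so both are stripped:

```python
import pandas as pd

# Sample strings: non-breaking spaces, a Unicode minus, and a dash marking "missing"
raw = pd.Series(["900\xa0000", "1\xa0225\xa0000", "−6 750", "—"])

cleaned = (
    raw.str.replace("\xa0", "", regex=False)  # drop non-breaking spaces
       .str.replace(" ", "", regex=False)     # drop regular spaces
       .str.replace("−", "-", regex=False)    # Unicode minus (U+2212) -> ASCII hyphen-minus
)

# Entries that still cannot be parsed (such as the lone dash) become NaN
numbers = pd.to_numeric(cleaned, errors="coerce")
print(numbers)
```

Because of the `NaN`, the result is a float column; a column without missing values can be converted further with `.astype(int)`.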

In [71]:
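# rename() and reset_index() return new DataFrames; the results are not assigned here,
# so population_history itself is unchanged. Assign the result to keep the changes.
population_history.rename(columns={"Medelfolkmängd":"Folkmängd"})
population_history.reset_index(drop=True)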

|   | År | Medelfolkmängd |
| --- | --- | --- |
| 0 | 1571 | 639000 |
| 1 | 1620 | 854000 |
| 2 | 1631 | 916722 |
| 3 | 1632 | 922598 |
| 4 | 1633 | 922882 |
| 5 | 1634 | 923605 |
| 6 | 1635 | 927815 |
| 7 | 1636 | 934032 |
| 8 | 1637 | 944073 |
| 9 | 1638 | 957115 |

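Steps d) and e) are not worked through above. The following is a minimal sketch of the concatenation, assuming both frames have already been cleaned to integer columns named "År" and "Folkmängd". The names `history`, `modern` and `population` are hypothetical. The historical rows come from the table in exercise c), and `modern` holds only three rows (1900, 1901, 2020) from the output earlier in this notebook as a stand-in for the full table:

```python
import pandas as pd

# Historical rows, taken from the table in exercise c)
history = pd.DataFrame({
    "År": [1570, 1650, 1700, 1720, 1755, 1815, 1865],
    "Folkmängd": [900000, 1225000, 1485000, 1350000, 1878000, 2465000, 4099000],
})

# Three rows standing in for the cleaned 1900-2020 table
modern = pd.DataFrame({
    "År": [1900, 1901, 2020],
    "Folkmängd": [5117000, 5156000, 10379295],
})

# Stack the frames vertically; ignore_index gives a fresh 0..n-1 index
population = pd.concat([history, modern[["År", "Folkmängd"]]], ignore_index=True)
population = population.sort_values("År").reset_index(drop=True)
print(population)
```

With the full frames, `population` can be passed to `px.line(population, x="År", y="Folkmängd")` in the same way as in the plotting cells above.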

---
## 2. Denmark demographic data (*)

Go to the Danish-language Wikipedia page [Danmarks demografi](https://da.wikipedia.org/wiki/Danmarks_demografi).

&nbsp; a) Read in the table under "Demografiske data" into a DataFrame (*)

&nbsp; b) Clean the data and draw a graph of population against year from 1769-2020. (**)


---
## 3. Norwegian demographic data (*)

Go to the Swedish-language Wikipedia page [Norges demografi](https://sv.wikipedia.org/wiki/Norges_demografi).

&nbsp; a) Read in the table under "Befolkningsstatistik sedan 1900" into a DataFrame

&nbsp; b) You will see some missing data in the column "Total fertilitet". Go to the [English page](https://en.wikipedia.org/wiki/Demographics_of_Norway) and read in the data from "Vital statistics since 1900".

&nbsp; c) Pick out the fertility column from the dataset in b), merge it into the dataset in a), and clean the data so that you only have the columns "År", "Folkmängd", "Fertilitet".

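A minimal sketch of step c), assuming the two tables have already been read and cleaned. The frame names (`norway_sv`, `norway_en`), the English column names (`Year`, `Total fertility rate`) and all values are placeholders for illustration, not real statistics or the actual Wikipedia headers:

```python
import pandas as pd

# Placeholder stand-in for the Swedish-language table (toy values, fertility partly missing)
norway_sv = pd.DataFrame({
    "År": [2000, 2001, 2002],
    "Folkmängd": [100, 110, 120],
    "Total fertilitet": [1.5, None, None],
})

# Placeholder stand-in for the English-language table (hypothetical column names)
norway_en = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Total fertility rate": [1.5, 1.6, 1.7],
})

# Left-merge on the year columns so every row of the Swedish table is kept
merged = norway_sv.merge(norway_en, left_on="År", right_on="Year", how="left")

# Keep only the requested columns and rename the fertility column
norway = (
    merged[["År", "Folkmängd", "Total fertility rate"]]
    .rename(columns={"Total fertility rate": "Fertilitet"})
)
print(norway)
```

A left merge is used here so that years missing from the English table would still appear, with `NaN` in "Fertilitet".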

---
## 4. Merge Sweden-Norway (*)

Create a population graph and a fertility graph showing Sweden and Norway.

<details>

<summary>Answer</summary>

![Fertilitet Norge och Sverige](../assets/fertilitet_sv_no.png)

![Folkmängd Norge och Sverige](../assets/folkmangd_sverige_norge.png)

</details>

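One possible way to prepare the data for a two-country graph is to stack both frames into a single long table with a country column, which can then be passed to something like `px.line(combined, x="År", y="Folkmängd", color="Land")`. The sketch below uses hypothetical frame names, a hypothetical "Land" column, and toy placeholder values rather than real statistics:

```python
import pandas as pd

# Toy frames with placeholder values; in the exercise these come from the cleaned tables
sweden = pd.DataFrame({"År": [2000, 2001], "Folkmängd": [10, 11]})
norway = pd.DataFrame({"År": [2000, 2001], "Folkmängd": [5, 6]})

# Tag each frame with its country, then stack them into one long table
combined = pd.concat(
    [sweden.assign(Land="Sverige"), norway.assign(Land="Norge")],
    ignore_index=True,
)
print(combined)
```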
---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---