---
# Merging exercises

---
These are introductory exercises in Pandas with a focus on **syntax, indexing, data selection, missing data, aggregations, visualizations**, **data cleaning**, **merging**, **concatenation**, **joining**, and **parsing HTML tables**.

Date: 25/10/2021

Check the [questions][link].

[link]: https://colab.research.google.com/github/kokchun/Databehandling-21/blob/main/Exercises/E02_merging.ipynb

---

## 1. Swedish demographic data (*)

Go to the Swedish-language Wikipedia page [Sveriges demografi](https://sv.wikipedia.org/wiki/Sveriges_demografi).

&nbsp; a) Read in the table under "Befolkningsstatistik sedan 1900" into a DataFrame

&nbsp; b) Do some EDA (exploratory data analysis) on this dataset and draw some relevant graphs.

&nbsp; c) Now we want to go back in time (before 1900) to see how the population of Sweden has changed. Read in the table under history and keep the "Folkmängd" data from 1570 to 1865.

| År   | Folkmängd |
| ---- | --------- | 
| 1570 | 900000    |     
| 1650 | 1225000   |
| 1700 | 1485000   |
| 1720 | 1350000   |
| 1755 | 1878000   |
| 1815 | 2465000   |
| 1865 | 4099000   |


&nbsp; d) Now concatenate this with the table from 1900 so that you have population data from 1570 to 2020. Note that you may need to clean the data for it to fit properly, and that there are several ways to do this.

&nbsp; e) Draw a graph of population data from 1570-2020.

<details>
<summary>Hint</summary>

Useful methods:
- append() (deprecated in pandas 1.4 and removed in pandas 2.0; concat() covers the same use)
- join()
- concat()
- merge()

</details>
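
As a minimal sketch of part d), the snippet below stacks the historical rows from the table in c) on top of a few rows from the 1900 table using `concat()`. The three "modern" rows are only an illustrative subset, and the variable names (`history`, `modern`, `combined`) are made up for this example.

```python
import pandas as pd

# Historical rows, copied from the table in part c)
history = pd.DataFrame({
    "År": [1570, 1650, 1700, 1720, 1755, 1815, 1865],
    "Folkmängd": [900000, 1225000, 1485000, 1350000, 1878000, 2465000, 4099000],
})

# A small illustrative subset of the table from 1900 (extra column included on purpose)
modern = pd.DataFrame({
    "År": [1900, 1901, 1902],
    "Folkmängd": [5117000, 5156000, 5187000],
    "Födda": [138139, 139370, 137364],
})

# Keep only the columns both tables share, stack the rows,
# then sort by year and rebuild a clean 0..n-1 index
combined = (
    pd.concat([history, modern[["År", "Folkmängd"]]], ignore_index=True)
    .sort_values("År")
    .reset_index(drop=True)
)
print(combined)
```

Selecting the shared columns before concatenating avoids NaN-filled columns in the result; `merge()` with `how="outer"` is another possible route to the same table.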

<br/>

<details>

<summary>Answer</summary>

![Sweden population data 1952-2020](../assets/sverige_befolkning_tid.png)

</details>

In [1]:
import pandas as pd
import numpy as np

In [2]:
table = pd.read_html("https://sv.wikipedia.org/wiki/Sveriges_demografi", match="Födda", decimal=',', thousands=' ')[0]
table.head()

Unnamed: 0.1,Unnamed: 0,Folkmängd,Födda,Döda,Befolkningsförändringar,Nativiteten (per 1000),Dödstalen (per 1000),Befolkningsförändringar (per 1000),Total fertilitet
0,1900,5117000,138139,86146,51993,27.0,16.8,10.2,4.02
1,1901,5156000,139370,82772,56598,27.0,16.1,11.0,4.04
2,1902,5187000,137364,79722,57642,26.5,15.4,11.1,3.95
3,1903,5210000,133896,78610,55286,25.7,15.1,10.6,3.82
4,1904,5241000,134952,80152,54800,25.7,15.3,10.5,3.83


In [3]:
swe_demographi = table.rename(columns={"Unnamed: 0":"År"})
swe_demographi.head()

Unnamed: 0,År,Folkmängd,Födda,Döda,Befolkningsförändringar,Nativiteten (per 1000),Dödstalen (per 1000),Befolkningsförändringar (per 1000),Total fertilitet
0,1900,5117000,138139,86146,51993,27.0,16.8,10.2,4.02
1,1901,5156000,139370,82772,56598,27.0,16.1,11.0,4.04
2,1902,5187000,137364,79722,57642,26.5,15.4,11.1,3.95
3,1903,5210000,133896,78610,55286,25.7,15.1,10.6,3.82
4,1904,5241000,134952,80152,54800,25.7,15.3,10.5,3.83


In [4]:
swe_demographi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 121 entries, 0 to 120
Data columns (total 9 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   År                                  121 non-null    int64  
 1   Folkmängd                           121 non-null    int64  
 2   Födda                               121 non-null    int64  
 3   Döda                                121 non-null    int64  
 4   Befolkningsförändringar             121 non-null    int64  
 5   Nativiteten (per 1000)              121 non-null    float64
 6   Dödstalen (per 1000)                121 non-null    float64
 7   Befolkningsförändringar (per 1000)  121 non-null    float64
 8   Total fertilitet                    121 non-null    float64
dtypes: float64(4), int64(5)
memory usage: 8.6 KB


In [5]:
#https://datagy.io/pandas-select-columns/
#swe_demographi.iloc[:, 1:5] = swe_demographi.iloc[:, 1:5].applymap(lambda x: x.replace(" ", "")).astype("int")

In [6]:
#swe_demographi["Folkmängd"] = swe_demographi["Folkmängd"].str.replace(" ", "")
#swe_demographi.astype({"Folkmängd": "int"})


In [7]:
swe_demographi.describe()

Unnamed: 0,År,Folkmängd,Födda,Döda,Befolkningsförändringar,Nativiteten (per 1000),Dödstalen (per 1000),Befolkningsförändringar (per 1000),Total fertilitet
count,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0
mean,1960.0,7436239.0,111215.92562,82021.132231,29219.338843,15.710744,11.296694,4.405785,2.253884
std,35.073732,1432004.0,14806.054124,9503.678548,18754.952133,4.640991,1.826743,3.370541,0.659067
min,1900.0,5117000.0,85020.0,63741.0,-6553.0,10.0,8.7,-0.7,1.5
25%,1930.0,6131000.0,98463.0,73267.0,15171.0,11.9,10.0,2.2,1.79
50%,1960.0,7480000.0,110192.0,80026.0,27126.0,14.4,10.8,3.5,2.0
75%,1990.0,8559000.0,121679.0,91074.0,43908.0,18.1,11.8,6.4,2.43
max,2020.0,10379300.0,139505.0,104594.0,64967.0,27.0,18.0,11.9,4.04


In [8]:
# plotly express is used for the graphs; pandas and numpy are already imported in the first cell
import plotly.express as px

fig = px.line(swe_demographi, x="År", y="Folkmängd", title="Total population in Sweden from 1900", labels={"År": "Year", "Folkmängd": "Population"}, markers=True)
fig.show()

In [9]:
fig = px.line(swe_demographi, x="År", y="Total fertilitet", title="Average number of children per woman in Sweden from 1900", labels={"År": "Year", "Total fertilitet": "Number of children"}, markers=True)
fig.show()

In [10]:
fig = px.line(swe_demographi, x="År", y=["Födda", "Döda"], title="Births and deaths in Sweden from 1900", labels={"År": "Year"}, markers=True)
fig.show()

Import data from 1570

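- Without `thousands='\xa0'` in `read_html`, the numbers are not parsed: the table uses a non-breaking space (`\xa0`) as the thousands separator, so the values remain strings.
- Example: `population_history["Folkmängd"].iloc[0]`
- Result: `'900\xa0000'`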
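
If the table has already been read without the `thousands` argument, the strings can also be cleaned afterwards. The sketch below assumes the column holds strings like the `'900\xa0000'` shown above; the other two values are illustrative reconstructions in the same format, and `raw` / `cleaned` are hypothetical names.

```python
import pandas as pd

# Strings as they look when the non-breaking space is not treated as a separator
raw = pd.DataFrame({
    "År": ["1570", "1650", "1700"],
    "Folkmängd": ["900\xa0000", "1\xa0225\xa0000", "1\xa0485\xa0000"],
})

cleaned = raw.copy()
# Remove every non-breaking space, then convert the digits to integers
cleaned["Folkmängd"] = (
    cleaned["Folkmängd"]
    .str.replace("\xa0", "", regex=False)
    .astype("int64")
)
cleaned["År"] = cleaned["År"].astype("int64")
print(cleaned.dtypes)
```

Passing `thousands='\xa0'` to `read_html` remains the shorter option; the manual route is mainly useful when different columns use different separators.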

In [11]:
table = pd.read_html("https://sv.wikipedia.org/wiki/Sveriges_demografi", match="1570", decimal=',', thousands='\xa0', header=[1])
population_history = table[0]


In [12]:
population_history

Unnamed: 0,Vid utgången av år,Folkmängd,Totalt,Promille
0,1570,900000,—,—
1,1650,1225000,4063,3.86
2,1700,1485000,5200,3.86
3,1720,1350000,−6 750,"−4,75"
4,1755,1878000,15086,9.48
5,1815,2465000,9783,4.54
6,1865,4099000,32680,10.22
7,1900,5140000,29743,6.48
8,2000,8861000,,
9,2020,10379000,,


In [13]:
population_history = population_history.iloc[0:7, 0:2]
population_history

Unnamed: 0,Vid utgången av år,Folkmängd
0,1570,900000
1,1650,1225000
2,1700,1485000
3,1720,1350000
4,1755,1878000
5,1815,2465000
6,1865,4099000


In [14]:
population_history = population_history.rename(columns={"Vid utgången av år":"År"})

In [15]:
population_history = population_history.astype("int64")
# Alternative: population_history.astype(int), which may give int32 on some platforms

In [16]:
# https://stackoverflow.com/questions/15891038/change-column-type-in-pandas


In [17]:
population_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   År         7 non-null      int64
 1   Folkmängd  7 non-null      int64
dtypes: int64(2)
memory usage: 240.0 bytes


In [18]:
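# Outer merge on both columns keeps every row from both tables;
# the years do not overlap, so this behaves like stacking the rows
sweden_pop = pd.merge(swe_demographi, population_history, on=["År", "Folkmängd"], how="outer", indicator=True)
# Keep only "År" and "Folkmängd", then sort chronologically
sweden_pop = sweden_pop.iloc[:,0:2]
sweden_pop = sweden_pop.sort_values(by="År")
sweden_pop = sweden_pop.reset_index(drop=True)
sweden_pop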

Unnamed: 0,År,Folkmängd
0,1570,900000
1,1650,1225000
2,1700,1485000
3,1720,1350000
4,1755,1878000
...,...,...
123,2016,9995000
124,2017,10120000
125,2018,10230000
126,2019,10327589


In [19]:
fig = px.line(sweden_pop, x="År", y="Folkmängd", title="Total population in Sweden from 1570", labels={"År": "Year", "Folkmängd": "Population"},markers=True)
fig.show()

---
## 2. Denmark demographic data (*)

Go to the Danish-language Wikipedia page [Danmarks demografi](https://da.wikipedia.org/wiki/Danmarks_demografi).

&nbsp; a) Read in the table under "Demografiske data" into a DataFrame (*)

&nbsp; b) Clean the data and draw a graph of population against year from 1769-2020. (**)
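
The Danish table places two year/population pairs side by side. One way to clean this, sketched below under the assumption that the columns come in the order year, population, year, population, is to split the frame in two and stack the halves. The rows are a small illustrative subset, and the column names and variables (`wide`, `left`, `right`, `stacked`) are made up for this example.

```python
import pandas as pd

# Three illustrative rows in the side-by-side layout
wide = pd.DataFrame(
    [
        [1769, 797584, 1976, 5065313],
        [1787, 841806, 1977, 5079879],
        [1801, 929001, 1978, 5096959],
    ],
    columns=["År", "Befolkning", "År.1", "Befolkning.1"],
)

# Split into the left and right pair of columns
left = wide.iloc[:, :2]
# Give the right half the same column names so the rows line up when stacked
right = wide.iloc[:, 2:].set_axis(left.columns, axis=1)

# Stack the halves and sort chronologically
stacked = (
    pd.concat([left, right], ignore_index=True)
    .sort_values("År")
    .reset_index(drop=True)
)
print(stacked)
```

Renaming only the right half keeps the column names unique at every step, which makes later selections such as `stacked["År"]` unambiguous.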


In [20]:
table = pd.read_html("https://da.wikipedia.org/wiki/Danmarks_demografi", match="År", thousands="." )[0]
table = table.iloc[2:]
table

Unnamed: 0,År,Befolkning pr. 1. januar,År.1,Befolkning pr. 1. januar.1
2,1769,797584,1976.0,5065313
3,1787,841806,1977.0,5079879
4,1801,929001,1978.0,5096959
5,1834,1230964,1979.0,5111537
6,1840,1289075,1980.0,5122065
7,1845,1356877,1981.0,5123989
8,1850,1414648,1982.0,5119155
9,1855,1507222,1983.0,5116464
10,1860,1608362,1984.0,5112130
11,1870,1784741,1985.0,5111108


In [21]:
table = table.set_axis(["År", "Folkmängd", "År", "Folkmängd"], axis=1)

In [22]:
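# Strip any remaining "," characters, then go via float to int64
# (the second "År" column was read as float, e.g. 1976.0)
# Note: applymap is deprecated since pandas 2.1; DataFrame.map replaces it there
table = table.applymap(lambda x: str(x).replace(",", ""))
table = table.astype(float)
table = table.astype("int64")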

In [23]:
table.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 2 to 33
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   År         32 non-null     int64
 1   Folkmängd  32 non-null     int64
 2   År         32 non-null     int64
 3   Folkmängd  32 non-null     int64
dtypes: int64(4)
memory usage: 1.1 KB


In [24]:
table_1 = table.iloc[:, :2]
table_2 = table.iloc[:, 2:]

In [25]:
danmark_pop = pd.concat([table_1, table_2]).reset_index(drop=True)
danmark_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64 entries, 0 to 63
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   År         64 non-null     int64
 1   Folkmängd  64 non-null     int64
dtypes: int64(2)
memory usage: 1.1 KB


In [26]:
fig = px.line(danmark_pop, x="År", y="Folkmängd", title="Total population in Denmark from 1769", labels={"År": "Year", "Folkmängd": "Population"}, markers=True)
fig.show()

---
## 3. Norwegian demographic data (*)

Go to the Swedish-language Wikipedia page [Norges demografi](https://sv.wikipedia.org/wiki/Norges_demografi).

&nbsp; a) Read in the table under "Befolkningsstatistik sedan 1900" into a DataFrame

&nbsp; b) You will see some missing data in the column "Total fertilitet". Go to the [English page](https://en.wikipedia.org/wiki/Demographics_of_Norway) and read in the data from "Vital statistics since 1900".

&nbsp; c) Pick out the fertility column from the dataset in b), merge it into the dataset from a), and clean the data so that only the columns "År", "Folkmängd" and "Fertilitet" remain.
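
One possible approach to part c) is a left merge on the year followed by `fillna()`. The sketch below uses three illustrative rows; the column name `Fertilitet_en` and the variable names are assumptions made for this example, not names from the Wikipedia tables.

```python
import pandas as pd
import numpy as np

# Dataset from a): fertility is missing for the early years
norway = pd.DataFrame({
    "År": [1900, 1901, 1902],
    "Folkmängd": [2231000, 2255000, 2276000],
    "Total fertilitet": [np.nan, np.nan, np.nan],
})

# Fertility column picked out of the dataset from b)
fertility = pd.DataFrame({
    "År": [1900, 1901, 1902],
    "Fertilitet_en": [4.40, 4.37, 4.26],
})

# A left merge keeps every row of the a) dataset and matches on the year
merged = norway.merge(fertility, on="År", how="left")

# Prefer the existing value and fall back to the English table where it is missing
merged["Fertilitet"] = merged["Total fertilitet"].fillna(merged["Fertilitet_en"])

result = merged[["År", "Folkmängd", "Fertilitet"]]
print(result)
```

Because the rows are matched on "År" rather than on position, this does not depend on the two tables sharing the same index or sort order.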


In [38]:
norway_pop = pd.read_html("https://sv.wikipedia.org/wiki/Norges_demografi", match="Födda", thousands=" ", decimal=",")[0]
norway_pop = norway_pop.rename(columns={"Unnamed: 0": "År", "Befolkning i tusentals (x 1000)": "Folkmängd"})
norway_pop.head()

Unnamed: 0,År,Folkmängd,Födda,Döda,Naturlig förändring,Födelsetal per 1000 invånare,Dödstal per 1000 invånare,Naturlig förändring per 1000 invånare,Total fertilitet
0,1900,2231,66229,35345,30884,29.7,15.8,13.8,
1,1901,2255,67303,33821,33482,29.8,15.0,14.8,
2,1902,2276,66494,31670,34824,29.2,13.9,15.3,
3,1903,2288,65470,33847,31623,28.6,14.8,13.8,
4,1904,2298,64143,32895,31248,27.9,14.3,13.6,


In [39]:
norway_pop["Folkmängd"] = norway_pop["Folkmängd"]*1000
norway_pop["År"] = norway_pop["År"].astype("int64")
norway_pop.head()

Unnamed: 0,År,Folkmängd,Födda,Döda,Naturlig förändring,Födelsetal per 1000 invånare,Dödstal per 1000 invånare,Naturlig förändring per 1000 invånare,Total fertilitet
0,1900,2231000,66229,35345,30884,29.7,15.8,13.8,
1,1901,2255000,67303,33821,33482,29.8,15.0,14.8,
2,1902,2276000,66494,31670,34824,29.2,13.9,15.3,
3,1903,2288000,65470,33847,31623,28.6,14.8,13.8,
4,1904,2298000,64143,32895,31248,27.9,14.3,13.6,


In [40]:
norway_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 113 entries, 0 to 112
Data columns (total 9 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   År                                     113 non-null    int64  
 1   Folkmängd                              113 non-null    int64  
 2   Födda                                  113 non-null    int64  
 3   Döda                                   113 non-null    int64  
 4   Naturlig förändring                    113 non-null    int64  
 5   Födelsetal per 1000 invånare           113 non-null    float64
 6   Dödstal per 1000 invånare              113 non-null    float64
 7   Naturlig förändring per 1000 invånare  113 non-null    float64
 8   Total fertilitet                       85 non-null     float64
dtypes: float64(4), int64(5)
memory usage: 8.1 KB


In [41]:
table_2 = pd.read_html("https://en.wikipedia.org/wiki/Demographics_of_Norway", match="Total fertility")[0]
table_2 = table_2.iloc[:-1, [0, -1]]
table_2 = table_2.set_axis(["År", "Total fertilitet"], axis=1)
table_2 = table_2[table_2["År"]<= 1927]

In [42]:
table_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28 entries, 0 to 27
Data columns (total 2 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   År                28 non-null     int64  
 1   Total fertilitet  28 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 672.0 bytes


In [45]:
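# update() matches rows by index label, not by "År"; this works here
# because index 0-27 appears to hold 1900-1927 in both tables
norway_pop = norway_pop.sort_values(by="År")
table_2 = table_2.sort_values(by="År")
norway_pop.update(table_2)
# Cast "År" back to int64 in case update() changed its dtype
norway_pop["År"] = norway_pop["År"].astype("int64")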
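
Since `DataFrame.update()` aligns on index labels, a more defensive variant is to make the year the index before updating. The sketch below uses small, deliberately unsorted illustrative frames (`target` and `source` are hypothetical names) to show that the values still land on the right years.

```python
import pandas as pd
import numpy as np

# Target frame with missing fertility values, deliberately not sorted by year
target = pd.DataFrame({
    "År": [1902, 1900, 1901],
    "Total fertilitet": [np.nan, np.nan, np.nan],
})

# Source frame holding the values to fill in
source = pd.DataFrame({
    "År": [1900, 1901, 1902],
    "Total fertilitet": [4.40, 4.37, 4.26],
})

# Use the year as index so update() matches on year instead of row position
target = target.set_index("År")
target.update(source.set_index("År"))
target = target.reset_index()
print(target)
```

With the year as index, "År" itself is not touched by the update, so it does not need to be cast back afterwards.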

In [46]:
norway_pop.head()

Unnamed: 0,År,Folkmängd,Födda,Döda,Naturlig förändring,Födelsetal per 1000 invånare,Dödstal per 1000 invånare,Naturlig förändring per 1000 invånare,Total fertilitet
0,1900,2231000,66229,35345,30884,29.7,15.8,13.8,4.4
1,1901,2255000,67303,33821,33482,29.8,15.0,14.8,4.37
2,1902,2276000,66494,31670,34824,29.2,13.9,15.3,4.26
3,1903,2288000,65470,33847,31623,28.6,14.8,13.8,4.16
4,1904,2298000,64143,32895,31248,27.9,14.3,13.6,4.07


In [47]:
fig = px.line(norway_pop, x="År", y="Folkmängd", title="Total population in Norway from 1900", labels={"År": "Year", "Folkmängd": "Population"},markers=True)
fig.show()

In [48]:
fig = px.line(norway_pop, x="År", y="Total fertilitet", title="Average number of children per woman in Norway from 1900", labels={"År": "Year", "Total fertilitet": "Number of children"}, markers=True)
fig.show()

---
## 4. Merge Sweden-Norway (*)

Create a population graph and a fertility graph showing Sweden and Norway.

<details>

<summary>Answer</summary>

![Fertilitet Norge och Sverige](../assets/fertilitet_sv_no.png)

![Folkmängd Norge och Sverige](../assets/folkmangd_sverige_norge.png)

</details>
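
A minimal sketch of how the two countries could be combined for plotting is shown below, using three illustrative rows per country. The `Land` column, the `_SE` / `_NO` suffixes and the variable names are assumptions made for this example.

```python
import pandas as pd

# Illustrative subsets of the Swedish and Norwegian tables
sweden = pd.DataFrame({
    "År": [1900, 1901, 1902],
    "Folkmängd": [5117000, 5156000, 5187000],
    "Total fertilitet": [4.02, 4.04, 3.95],
})
norway = pd.DataFrame({
    "År": [1900, 1901, 1902],
    "Folkmängd": [2231000, 2255000, 2276000],
    "Total fertilitet": [4.40, 4.37, 4.26],
})

# Long format: one row per country and year, with a label column
combined = pd.concat(
    [sweden.assign(Land="Sverige"), norway.assign(Land="Norge")],
    ignore_index=True,
)

# Wide format: one row per year, with one set of columns per country
wide = sweden.merge(norway, on="År", suffixes=("_SE", "_NO"))

print(combined)
print(wide)
```

The long format fits Plotly Express well: a call such as `px.line(combined, x="År", y="Folkmängd", color="Land")` draws one line per country, and swapping `y` for `"Total fertilitet"` gives the fertility graph.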

---

Kokchun Giang

[LinkedIn][linkedIn_kokchun]

[GitHub portfolio][github_portfolio]

[linkedIn_kokchun]: https://www.linkedin.com/in/kokchungiang/
[github_portfolio]: https://github.com/kokchun/Portfolio-Kokchun-Giang

---