## PRE-PROCESSING SOCIO-ECONOMIC PAAVO FILES 

The Paavo database is maintained by Statistics Finland and contains socio-economic variables by postal code area.
All files used in this analysis were downloaded from: https://pxnet2.stat.fi/PXWeb/pxweb/fi/Postinumeroalueittainen_avoin_tieto

Variables chosen for analysis: 
- number of academic (lower + higher) degree holders in an area (divided by the adult population to get a percentage), 2018
- number of inhabitants belonging to the highest and lowest income categories (divided by total population), 2017
- number of inhabitants, 2018
- average and median incomes, 2017

The data was retrieved in late 2020, when these were the newest statistics available.

In [12]:
import pandas as pd

#import separate csv files, downloaded from https://pxnet2.stat.fi/PXWeb/pxweb/fi/Postinumeroalueittainen_avoin_tieto
edu = pd.read_csv("/Users/kosokoso/Desktop/YLLI/edu2.csv", encoding="latin8", skiprows=[0, 1], na_values=["..."], delimiter= ";")
house = pd.read_csv("/Users/kosokoso/Desktop/YLLI/households.csv", encoding="latin8", skiprows=[0, 1], na_values=["..."], delimiter= ";")
state = pd.read_csv("/Users/kosokoso/Desktop/YLLI/state_of_life_2018.csv", encoding="latin8", skiprows=[0, 1], na_values=["..."], delimiter= ";")
income = pd.read_csv("/Users/kosokoso/Desktop/YLLI/income2018.csv", encoding="latin8", skiprows=[0], na_values=["..."])
pop = pd.read_csv("/Users/kosokoso/Desktop/YLLI/pop2018.csv", encoding="latin8", skiprows=[0], na_values=["..."])
median_income = pd.read_csv("/Users/kosokoso/Desktop/YLLI/mediaanitulot.csv", encoding="latin8", skiprows=[0, 1], na_values=["..."])
ages = pd.read_csv("/Users/kosokoso/Desktop/YLLI/age_groups_13+.csv", encoding="latin8", skiprows=[0, 1], na_values=[".", ".."], delimiter= ";")
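The read calls above differ only in the number of title rows skipped, the delimiter and the missing-value marker. As a minimal sketch, those options could be wrapped in one helper. The name `read_paavo`, its parameters and the miniature in-memory file are hypothetical and only illustrate how `skiprows`, `sep` and `na_values` interact; for files on disk, `encoding` can be passed through as in the calls above.

```python
import io
import pandas as pd

def read_paavo(source, n_title_rows=2, sep=";", missing=("...",), **kwargs):
    """Hypothetical helper: skip the title rows and map missing-value markers to NaN."""
    return pd.read_csv(
        source,
        sep=sep,
        skiprows=list(range(n_title_rows)),
        na_values=list(missing),
        **kwargs,
    )

# Made-up miniature export: two title rows, then a semicolon-separated
# table in which "..." marks a suppressed value.
raw = (
    "Paavo open data by postal code area\n"
    "Example export\n"
    '"Postal code area";"Inhabitants, total, 2018 (HE)"\n'
    '"00100 Helsinki Keskusta - Etu-Töölö (Helsinki)";18427\n'
    '"02860 Siikajärvi (Espoo)";...\n'
)

demo = read_paavo(io.StringIO(raw))
print(demo.dtypes)
```

Because one value is missing, the count column is read as float rather than int, which is probably why some count columns in the outputs further down show a trailing `.0`.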

In [13]:
#convert ages dataframe data types back to float
for i in ages.columns:
    if i == "Postal code area":
        continue
    else: 
        ages[i] = ages[i].astype(float)
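The same conversion can be written without an explicit loop. The sketch below runs on a made-up two-row frame (`ages_demo` is not part of the analysis) and selects every column except the label column before casting.

```python
import pandas as pd

# Toy stand-in for the ages dataframe; the counts are made up.
ages_demo = pd.DataFrame({
    "Postal code area": ["00100 Helsinki Keskusta - Etu-Töölö (Helsinki)",
                         "00120 Punavuori (Helsinki)"],
    "18-19-vuotiaat, 2016 (HE)": [100, 50],
    "20-24-vuotiaat, 2016 (HE)": ["200", "75"],  # strings, as after a messy import
})

# Every column except the label column holds counts.
value_cols = ages_demo.columns.drop("Postal code area")
ages_demo[value_cols] = ages_demo[value_cols].astype(float)

print(ages_demo.dtypes)
```

Like the loop, `astype(float)` raises an error if a cell cannot be parsed as a number, which makes unexpected markers in the raw file visible early.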

In [14]:
#take away an extra empty space before ")"
ages["Postal code area"] = ages["Postal code area"].apply(lambda x: x[0:-2] + ")")
ages["Postal code area"] = ages["Postal code area"].str.strip()

#create a column for 18-49-year-olds (most active tweeters)
ages["18-49"] =  ages["18-19-vuotiaat, 2016 (HE)"] + \
ages["20-24-vuotiaat, 2016 (HE)"] + ages["25-29-vuotiaat, 2016 (HE)"] + ages["30-34-vuotiaat, 2016 (HE)"] + \
ages["35-39-vuotiaat, 2016 (HE)"] + ages["40-44-vuotiaat, 2016 (HE)"] + ages["45-49-vuotiaat, 2016 (HE)"] 

#create a column for 13+ year-olds (the population allowed on Twitter)
ages["13+"] = ages["13-15-vuotiaat, 2016 (HE)"] + ages["16-17-vuotiaat, 2016 (HE)"] + ages["18-49"] + \
ages["50-54-vuotiaat, 2016 (HE)"] + ages["55-59-vuotiaat, 2016 (HE)"] + \
ages["60-64-vuotiaat, 2016 (HE)"] + ages["65-69-vuotiaat, 2016 (HE)"] + ages["70-74-vuotiaat, 2016 (HE)"] + \
ages["75-79-vuotiaat, 2016 (HE)"] + ages["80-84-vuotiaat, 2016 (HE)"] + ages["85 vuotta täyttäneet, 2016 (HE)"]
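The long chains of `+` above can also be expressed as a row-wise sum over a list of column names. This is a sketch on made-up counts; the column names follow the pattern used above, and `skipna=False` is chosen so that a missing band yields NaN, as the `+` chain does, instead of being silently treated as zero.

```python
import numpy as np
import pandas as pd

# Build the column names of the 18-49 age bands from their labels.
bands_18_49 = ["18-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
cols_18_49 = [f"{band}-vuotiaat, 2016 (HE)" for band in bands_18_49]

# Toy data: two areas with made-up counts; one band is missing in the second area.
ages_demo = pd.DataFrame({col: [10.0, 20.0] for col in cols_18_49})
ages_demo.loc[1, cols_18_49[0]] = np.nan

# Row-wise sum; skipna=False propagates NaN like the explicit "+" chain.
ages_demo["18-49"] = ages_demo[cols_18_49].sum(axis=1, skipna=False)
print(ages_demo["18-49"])
```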

In [15]:
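#keep only the needed columns; .copy() avoids a SettingWithCopyWarning when a column is added later
ages = ages[["Postal code area", "18-49", "13+"]].copy()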
ages

Unnamed: 0,Postal code area,18-49,13+
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),10081.0,16392.0
1,00120 Punavuori (Helsinki),3823.0,6319.0
2,00130 Kaartinkaupunki (Helsinki),815.0,1370.0
3,00140 Kaivopuisto - Ullanlinna (Helsinki),4073.0,7111.0
4,00150 Eira - Hernesaari (Helsinki),5631.0,8667.0
...,...,...,...
162,02860 Siikajärvi (Espoo),220.0,558.0
163,02920 Niipperi (Espoo),2188.0,4079.0
164,02940 Lippajärvi-Järvenperä (Espoo),4191.0,8307.0
165,02970 Kalajärvi (Espoo),1459.0,2926.0


In [16]:
#merge the dataframes into one
df = edu.merge(income, on="Postal code area")
df1 = df.merge(pop, on="Postal code area")
data = df1.merge(median_income, on="Postal code area")
data = data.merge(house, on="Postal code area")
data = data.merge(state, on="Postal code area")



#create a new column that holds only the postcode (merge key),
#i.e. the first space-separated token of "Postal code area"
data["Posno"] = data["Postal code area"].str.split(" ").str[0]

#do the same for the ages dataframe
ages["Posno"] = ages["Postal code area"].str.split(" ").str[0]

data1 = data.merge(ages, on ="Posno")
data1

Unnamed: 0,Postal code area_x,"18 vuotta täyttäneet yhteensä, 2018 (KO)","Perusasteen suorittaneet, 2018 (KO)","Alemman korkeakoulututkinnon suorittaneet, 2018 (KO)","Ylemmän korkeakoulututkinnon suorittaneet, 2018 (KO)","Inhabintants belonging to the lowest income category, 2017 (HR)","Inhabintants belonging to the highest income category, 2017 (HR)","Inhabitants, total, 2018 (HE)",Inhabitants' average income 2017 (HR),Inhabitants' median income 2017 (HR),...,"Vuokra-asunnoissa asuvat taloudet, 2018 (TE)","Työlliset, 2018 (PT)","Työttömät, 2018 (PT)","Lapset 0-14 -vuotiaat, 2018 (PT)","Opiskelijat, 2018 (PT)","Eläkeläiset, 2018 (PT)",Posno,Postal code area_y,18-49,13+
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),16273,1659.0,2983.0,5988.0,2772.0,6609.0,18427,42196.0,27577.0,...,5144,10576.0,607.0,1846.0,1227.0,3420.0,00100,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),10081.0,16392.0
1,00120 Punavuori (Helsinki),6202,679.0,1040.0,2205.0,1096.0,2566.0,7161,41657.0,27523.0,...,1759,4081.0,241.0,829.0,419.0,1243.0,00120,00120 Punavuori (Helsinki),3823.0,6319.0
2,00130 Kaartinkaupunki (Helsinki),1319,131.0,190.0,519.0,246.0,618.0,1523,57766.0,30479.0,...,352,876.0,37.0,172.0,86.0,266.0,00130,00130 Kaartinkaupunki (Helsinki),815.0,1370.0
3,00140 Kaivopuisto - Ullanlinna (Helsinki),6800,713.0,1167.0,2480.0,1127.0,3078.0,7921,53555.0,29439.0,...,2023,4251.0,261.0,986.0,459.0,1530.0,00140,00140 Kaivopuisto - Ullanlinna (Helsinki),4073.0,7111.0
4,00150 Eira - Hernesaari (Helsinki),8304,981.0,1512.0,2793.0,1431.0,3224.0,9385,41564.0,26546.0,...,2708,5514.0,364.0,957.0,527.0,1544.0,00150,00150 Eira - Hernesaari (Helsinki),5631.0,8667.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162,02860 Siikajärvi (Espoo),489,101.0,48.0,70.0,91.0,164.0,623,28707.0,25119.0,...,15,286.0,19.0,104.0,38.0,156.0,02860,02860 Siikajärvi (Espoo),220.0,558.0
163,02920 Niipperi (Espoo),3825,715.0,595.0,689.0,548.0,1470.0,5577,34733.0,28034.0,...,513,2745.0,160.0,1463.0,437.0,633.0,02920,02920 Niipperi (Espoo),2188.0,4079.0
164,02940 Lippajärvi-Järvenperä (Espoo),7704,1462.0,1221.0,1557.0,1102.0,2838.0,10308,31043.0,27122.0,...,1686,5129.0,371.0,2156.0,771.0,1581.0,02940,02940 Lippajärvi-Järvenperä (Espoo),4191.0,8307.0
165,02970 Kalajärvi (Espoo),2755,597.0,399.0,423.0,414.0,1028.0,3824,30706.0,27202.0,...,332,1854.0,133.0,895.0,284.0,567.0,02970,02970 Kalajärvi (Espoo),1459.0,2926.0
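The merges above are inner joins, so a postal code missing from either table is dropped without notice. One way to check this, sketched here on two made-up three-row frames, is an outer merge with `indicator=True` and `validate="one_to_one"`. The frames `left` and `right` are hypothetical stand-ins for `data` and `ages`.

```python
import pandas as pd

# Toy stand-ins for the two tables; "00130" and "02860" each appear in only one of them.
left = pd.DataFrame({"Posno": ["00100", "00120", "00130"],
                     "pop": [18427, 7161, 1523]})
right = pd.DataFrame({"Posno": ["00100", "00120", "02860"],
                      "13+": [16392.0, 6319.0, 558.0]})

# validate raises if a postcode is duplicated on either side;
# indicator adds a "_merge" column telling where each row came from.
checked = left.merge(right, on="Posno", how="outer",
                     validate="one_to_one", indicator=True)

unmatched = checked[checked["_merge"] != "both"]
merged = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(unmatched[["Posno", "_merge"]])
```

If `unmatched` is empty for the real tables, the inner merge loses nothing.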


In [18]:
#create new columns with the percentage of the adult population with high and low education
data1["hi_ed"] = (data1["Alemman korkeakoulututkinnon suorittaneet, 2018 (KO)"].astype(float)+ data1["Ylemmän korkeakoulututkinnon suorittaneet, 2018 (KO)"].astype(float)) / data1["18 vuotta täyttäneet yhteensä, 2018 (KO)"] * 100
data1["low_ed"] = data1["Perusasteen suorittaneet, 2018 (KO)"] / data1["18 vuotta täyttäneet yhteensä, 2018 (KO)"] * 100

#create new columns for the percentage of people belonging to the highest and lowest income categories
data1["low_in"] = data1["Inhabintants belonging to the lowest income category, 2017 (HR)"] / data1["Inhabitants, total, 2018 (HE)"] * 100
data1["high_in"] = data1["Inhabintants belonging to the highest income category, 2017 (HR)"] / data1["Inhabitants, total, 2018 (HE)"] * 100

#create columns for percentages of people belonging to different population groups
data1["children"] = data1["Lapset 0-14 -vuotiaat, 2018 (PT)"] / data1["Inhabitants, total, 2018 (HE)"] * 100
data1["students"] = data1["Opiskelijat, 2018 (PT)"] / data1["Inhabitants, total, 2018 (HE)"] * 100
data1["pensioners"] = data1["Eläkeläiset, 2018 (PT)"] / data1["Inhabitants, total, 2018 (HE)"] * 100
data1["employed"] = data1["Työlliset, 2018 (PT)"] / data1["Inhabitants, total, 2018 (HE)"] * 100
data1["unemployed"] = data1["Työttömät, 2018 (PT)"] / data1["Inhabitants, total, 2018 (HE)"] * 100

#create columns for the percentages of households of different types
data1["own_house"] = data1["Omistusasunnoissa asuvat taloudet, 2018 (TE)"] / data1["Taloudet yhteensä, 2018 (TE)"] * 100
data1["rental"] = data1["Vuokra-asunnoissa asuvat taloudet, 2018 (TE)"] / data1["Taloudet yhteensä, 2018 (TE)"] * 100
data1["kids_house"] = data1["Lapsitaloudet, 2018 (TE)"] / data1["Taloudet yhteensä, 2018 (TE)"] * 100
data1["adult_house"] = data1["Aikuisten taloudet, 2018 (TE)"] / data1["Taloudet yhteensä, 2018 (TE)"] * 100
data1["pension_house"] = data1["Eläkeläisten taloudet, 2018 (TE)"] / data1["Taloudet yhteensä, 2018 (TE)"] * 100

data1

Unnamed: 0,Postal code area_x,"18 vuotta täyttäneet yhteensä, 2018 (KO)","Perusasteen suorittaneet, 2018 (KO)","Alemman korkeakoulututkinnon suorittaneet, 2018 (KO)","Ylemmän korkeakoulututkinnon suorittaneet, 2018 (KO)","Inhabintants belonging to the lowest income category, 2017 (HR)","Inhabintants belonging to the highest income category, 2017 (HR)","Inhabitants, total, 2018 (HE)",Inhabitants' average income 2017 (HR),Inhabitants' median income 2017 (HR),...,children,students,pensioners,employed,unemployed,own_house,rental,kids_house,adult_house,pension_house
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),16273,1659.0,2983.0,5988.0,2772.0,6609.0,18427,42196.0,27577.0,...,10.017909,6.658707,18.559722,57.394041,3.294079,46.277060,49.936899,12.998738,64.498592,22.890981
1,00120 Punavuori (Helsinki),6202,679.0,1040.0,2205.0,1096.0,2566.0,7161,41657.0,27523.0,...,11.576595,5.851138,17.357911,56.989247,3.365452,51.754607,44.407978,14.970967,63.216360,22.241858
2,00130 Kaartinkaupunki (Helsinki),1319,131.0,190.0,519.0,246.0,618.0,1523,57766.0,30479.0,...,11.293500,5.646750,17.465529,57.518056,2.429416,52.211302,43.243243,15.970516,61.425061,22.972973
3,00140 Kaivopuisto - Ullanlinna (Helsinki),6800,713.0,1167.0,2480.0,1127.0,3078.0,7921,53555.0,29439.0,...,12.447923,5.794723,19.315743,53.667466,3.295039,49.818758,45.831445,15.903942,59.537834,24.807431
4,00150 Eira - Hernesaari (Helsinki),8304,981.0,1512.0,2793.0,1431.0,3224.0,9385,41564.0,26546.0,...,10.197123,5.615344,16.451785,58.753330,3.878530,48.831939,47.210600,12.273361,68.688982,19.299163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162,02860 Siikajärvi (Espoo),489,101.0,48.0,70.0,91.0,164.0,623,28707.0,25119.0,...,16.693419,6.099518,25.040128,45.906902,3.049759,91.439689,5.836576,25.680934,40.856031,34.241245
163,02920 Niipperi (Espoo),3825,715.0,595.0,689.0,548.0,1470.0,5577,34733.0,28034.0,...,26.232742,7.835754,11.350188,49.220011,2.868926,71.311475,26.280738,45.901639,36.731557,18.032787
164,02940 Lippajärvi-Järvenperä (Espoo),7704,1462.0,1221.0,1557.0,1102.0,2838.0,10308,31043.0,27122.0,...,20.915794,7.479627,15.337602,49.757470,3.599146,60.920578,38.041516,32.626354,48.037004,19.584838
165,02970 Kalajärvi (Espoo),2755,597.0,399.0,423.0,414.0,1028.0,3824,30706.0,27202.0,...,23.404812,7.426778,14.827406,48.483264,3.478033,74.807018,23.298246,38.807018,38.526316,22.947368
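All derived columns above follow the same pattern: a count divided by a total, times 100. A small helper can make that explicit and guard against zero totals. This is a sketch, and `pct_share` is a hypothetical name. The first two rows reuse the student and inhabitant counts shown above for 00100 and 00120; the third row is made up to show the zero-total case.

```python
import pandas as pd

def pct_share(part, total):
    """Return part / total * 100, with NaN where the total is zero or missing."""
    part = pd.to_numeric(part, errors="coerce")
    total = pd.to_numeric(total, errors="coerce")
    # where() masks non-positive totals so the division gives NaN instead of inf
    return part / total.where(total > 0) * 100

demo = pd.DataFrame({
    "students": [1227.0, 419.0, 5.0],
    "inhabitants": [18427, 7161, 0],
})
demo["students_pct"] = pct_share(demo["students"], demo["inhabitants"])
print(demo)
```

The first row gives about 6.6587 %, matching the `students` value for 00100 in the output above.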


In [19]:
#drop unnecessary columns and rename long column names
data1 = data1.drop([
    "18 vuotta täyttäneet yhteensä, 2018 (KO)", "Perusasteen suorittaneet, 2018 (KO)",
    "Alemman korkeakoulututkinnon suorittaneet, 2018 (KO)", "Ylemmän korkeakoulututkinnon suorittaneet, 2018 (KO)",
    "Inhabintants belonging to the lowest income category, 2017 (HR)",
    "Inhabintants belonging to the highest income category, 2017 (HR)",
    "Postal code area_y", "Taloudet yhteensä, 2018 (TE)",
    "Lapsitaloudet, 2018 (TE)", "Aikuisten taloudet, 2018 (TE)", "Eläkeläisten taloudet, 2018 (TE)",
    "Omistusasunnoissa asuvat taloudet, 2018 (TE)", "Vuokra-asunnoissa asuvat taloudet, 2018 (TE)",
    "Työlliset, 2018 (PT)", "Työttömät, 2018 (PT)", "Lapset 0-14 -vuotiaat, 2018 (PT)",
    "Opiskelijat, 2018 (PT)", "Eläkeläiset, 2018 (PT)",
], axis=1)


data1.rename(columns = {"Inhabitants, total, 2018 (HE)":"pop", 
                              "Inhabitants' average income 2017 (HR)":"avg_income",
                              "Inhabitants' median income 2017 (HR)":"median_income", 
                               "Postal code area_x": "post_code_area", 
                               "Talouksien keskikoko, 2018 (TE)": "household_size"}, inplace = True)
data1

Unnamed: 0,post_code_area,pop,avg_income,median_income,household_size,Posno,18-49,13+,hi_ed,low_ed,...,children,students,pensioners,employed,unemployed,own_house,rental,kids_house,adult_house,pension_house
0,00100 Helsinki Keskusta - Etu-Töölö (Helsinki),18427,42196.0,27577.0,180,00100,10081.0,16392.0,55.128126,10.194801,...,10.017909,6.658707,18.559722,57.394041,3.294079,46.277060,49.936899,12.998738,64.498592,22.890981
1,00120 Punavuori (Helsinki),7161,41657.0,27523.0,180,00120,3823.0,6319.0,52.321832,10.948081,...,11.576595,5.851138,17.357911,56.989247,3.365452,51.754607,44.407978,14.970967,63.216360,22.241858
2,00130 Kaartinkaupunki (Helsinki),1523,57766.0,30479.0,190,00130,815.0,1370.0,53.752843,9.931766,...,11.293500,5.646750,17.465529,57.518056,2.429416,52.211302,43.243243,15.970516,61.425061,22.972973
3,00140 Kaivopuisto - Ullanlinna (Helsinki),7921,53555.0,29439.0,180,00140,4073.0,7111.0,53.632353,10.485294,...,12.447923,5.794723,19.315743,53.667466,3.295039,49.818758,45.831445,15.903942,59.537834,24.807431
4,00150 Eira - Hernesaari (Helsinki),9385,41564.0,26546.0,160,00150,5631.0,8667.0,51.842486,11.813584,...,10.197123,5.615344,16.451785,58.753330,3.878530,48.831939,47.210600,12.273361,68.688982,19.299163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
162,02860 Siikajärvi (Espoo),623,28707.0,25119.0,240,02860,220.0,558.0,24.130879,20.654397,...,16.693419,6.099518,25.040128,45.906902,3.049759,91.439689,5.836576,25.680934,40.856031,34.241245
163,02920 Niipperi (Espoo),5577,34733.0,28034.0,280,02920,2188.0,4079.0,33.568627,18.692810,...,26.232742,7.835754,11.350188,49.220011,2.868926,71.311475,26.280738,45.901639,36.731557,18.032787
164,02940 Lippajärvi-Järvenperä (Espoo),10308,31043.0,27122.0,230,02940,4191.0,8307.0,36.059190,18.977155,...,20.915794,7.479627,15.337602,49.757470,3.599146,60.920578,38.041516,32.626354,48.037004,19.584838
165,02970 Kalajärvi (Espoo),3824,30706.0,27202.0,270,02970,1459.0,2926.0,29.836661,21.669691,...,23.404812,7.426778,14.827406,48.483264,3.478033,74.807018,23.298246,38.807018,38.526316,22.947368


In [39]:
#check data types
data1.dtypes

post_code_area     object
pop                 int64
avg_income        float64
median_income     float64
household_size     object
Posno              object
18-49             float64
13+               float64
hi_ed             float64
low_ed            float64
low_in            float64
high_in           float64
children          float64
students          float64
pensioners        float64
employed          float64
unemployed        float64
own_house         float64
rental            float64
kids_house        float64
adult_house       float64
pension_house     float64
dtype: object
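Note that `household_size` is still of type `object`, so it would need converting before any arithmetic. The raw format of that column is not shown here, so the sketch below only assumes that the values are strings, possibly with a decimal comma. The sample values are made up and the real file should be inspected first.

```python
import pandas as pd

# Hypothetical raw values: decimal comma, decimal point, plain digits, missing marker.
household_size_raw = pd.Series(["1,8", "2.4", "180", "..."], dtype="object")

household_size_num = pd.to_numeric(
    household_size_raw.astype(str).str.strip().str.replace(",", ".", regex=False),
    errors="coerce",  # anything unparseable becomes NaN
)
print(household_size_num)
```

With `errors="coerce"`, unparseable cells become NaN instead of raising, so it is worth counting the NaN values after the conversion.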

In [38]:
#save to csv
data1.to_csv("/Users/kosokoso/Desktop/YLLI/sos_econ.csv")
#the saved file was then joined to a shapefile of postal code areas in QGIS
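One thing to watch when the saved file is read back with pandas: postcodes such as `00100` are parsed as integers by default and lose their leading zeros. The sketch below shows the effect and one way around it, using an in-memory buffer instead of the real path. The frame `demo` reuses two postcodes and population counts from the tables above.

```python
import io
import pandas as pd

demo = pd.DataFrame({"Posno": ["00100", "02860"], "pop": [18427, 623]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default parsing turns the postcodes into integers.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Forcing a string dtype keeps the leading zeros.
buffer.seek(0)
safe = pd.read_csv(buffer, dtype={"Posno": str})

print(naive["Posno"].tolist(), safe["Posno"].tolist())
```

The `to_csv` call above the note writes the row index as an extra unnamed column; `index=False`, as used in this sketch, leaves it out.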