
#Measure of Central Tendency || zip

In [0]:
from pandas import read_excel, Series

##Calculating the average zip code in Belgium

We read in the list of municipalities with their zip codes. Because data.gov.be refers to a broken data source, and the Belgian postal service (bpost) hosts a corrupt Excel file on its servers, we have to rely on the website of zorg-en-gezondheid.be for the data.

In [0]:
zorggezondheid_zip_url = 'https://www.zorg-en-gezondheid.be/sites/default/files/atoms/files/NIS-codes%20gemeenten.xls'
# Skip the first row of the sheet and use the first column as the index
zorggezondheid_zip = read_excel(zorggezondheid_zip_url, skiprows=1, index_col=0)

From inspection, we know that this list does not contain the interesting edge cases of zip codes for "special" customers, such as Saint Nicolas (0612). These kinds of zip codes are included in the data from bpost. Nonetheless, we can use this list to calculate an average zip code.

In [68]:
mean_zip_calc = zorggezondheid_zip["Postcode"].mean()
mean_zip_calc

5183.764525993884

Obviously, since zip codes are whole numbers, a decimal result is impossible, so we round to the nearest integer.

In [70]:
int(mean_zip_calc.round())

5184

##A mean zip code that is more true

Our observations above indicate that this simplistic calculation cannot yield the truest mean zip code. We have to enrich our calculations with additional contextual information.

Therefore, we read in the contextual information about the population in each Belgian municipality.

In [0]:
statbel_bevolking_url = 'https://statbel.fgov.be/sites/default/files/files/opendata/bevolking%20naar%20woonplaats%2C%20nationaliteit%20burgelijke%20staat%20%2C%20leeftijd%20en%20geslacht/TF_SOC_POP_STRUCT_2018.xlsx'
datagovbe_bevolking = read_excel(statbel_bevolking_url)

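We have to perform a complex data join to enrich the zip code data with the contextual information for each Belgian municipality: the `CD_MUNTY_REFNIS` column of the population table is matched against the index of the zip code table.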

In [0]:
# Left join (the pandas default): match CD_MUNTY_REFNIS against the index of the zip code table
df = datagovbe_bevolking.join(zorggezondheid_zip, on='CD_MUNTY_REFNIS')
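
To make the join concrete, here is a minimal sketch on toy data. The frames `population` and `zips`, the index name `NIS`, and all values are made up for illustration; only the column names `CD_MUNTY_REFNIS`, `MS_POPULATION` and `Postcode` come from the notebook. It assumes the zip code table is indexed by NIS code.

```python
import pandas as pd

# Toy stand-in for the population table (hypothetical values):
# several rows can share one municipality code.
population = pd.DataFrame({
    "CD_MUNTY_REFNIS": [71024, 71024, 71037, 99999],
    "MS_POPULATION": [42, 10, 24, 5],
})

# Toy stand-in for the zip code table, indexed by NIS code (assumption).
zips = pd.DataFrame(
    {"Postcode": [3540, 3560]},
    index=pd.Index([71024, 71037], name="NIS"),
)

# Match the CD_MUNTY_REFNIS column against the index of `zips`.
joined = population.join(zips, on="CD_MUNTY_REFNIS")

# Rows whose code has no counterpart in `zips` keep a missing Postcode.
unmatched = joined[joined["Postcode"].isna()]
print(joined)
print(f"{len(unmatched)} row(s) without a zip code")
```

Because `join` performs a left join by default, every population row survives, and rows without a matching code end up with a missing `Postcode`. Counting those rows before averaging is a cheap sanity check.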

A first, obvious attempt at improving the mean zip code is to weight the municipalities by their number of inhabitants.

In [72]:
datagovbe_bevolking.head()

Unnamed: 0,CD_MUNTY_REFNIS,TX_MUNTY_DESCR_NL,TX_MUNTY_DESCR_FR,CD_DSTR_REFNIS,TX_ADM_DSTR_DESCR_NL,TX_ADM_DSTR_DESCR_FR,CD_PROV_REFNIS,TX_PROV_DESCR_NL,TX_PROV_DESCR_FR,CD_RGN_REFNIS,...,CD_SEX,CD_NATLTY,TX_NATLTY_FR,TX_NATLTY_NL,CD_CIV_STS,TX_CIV_STS_FR,TX_CIV_STS_NL,CD_AGE,MS_POPULATION,CD_YEAR
0,71024,Herk-de-Stad,Herck-la-Ville,71000,Arrondissement Hasselt,Arrondissement de Hasselt,70000.0,Provincie Limburg,Province de Limbourg,2000,...,F,BEL,Belges,Belgen,20,Marié,Gehuwd,39,42,2018
1,71037,Lummen,Lummen,71000,Arrondissement Hasselt,Arrondissement de Hasselt,70000.0,Provincie Limburg,Province de Limbourg,2000,...,M,BEL,Belges,Belgen,20,Marié,Gehuwd,82,24,2018
2,71011,Diepenbeek,Diepenbeek,71000,Arrondissement Hasselt,Arrondissement de Hasselt,70000.0,Provincie Limburg,Province de Limbourg,2000,...,F,BEL,Belges,Belgen,20,Marié,Gehuwd,42,51,2018
3,71016,Genk,Genk,71000,Arrondissement Hasselt,Arrondissement de Hasselt,70000.0,Provincie Limburg,Province de Limbourg,2000,...,M,BEL,Belges,Belgen,20,Marié,Gehuwd,63,277,2018
4,71017,Gingelom,Gingelom,71000,Arrondissement Hasselt,Arrondissement de Hasselt,70000.0,Provincie Limburg,Province de Limbourg,2000,...,F,BEL,Belges,Belgen,20,Marié,Gehuwd,30,14,2018
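
As the preview shows, the population table holds one row per combination of municipality, sex, nationality, civil status and age, so `MS_POPULATION` has to be summed per municipality to get the number of inhabitants. The population weighting mentioned above could then be sketched as follows. This is an illustration on toy data, not the notebook's own computation: `df_toy` and its values are hypothetical, and it assumes one zip code per municipality.

```python
import pandas as pd

# Toy rows shaped like the joined table (hypothetical values):
# municipality 1 appears twice because it has two population sub-groups.
df_toy = pd.DataFrame({
    "CD_MUNTY_REFNIS": [1, 1, 2, 3],
    "Postcode": [1000, 1000, 2000, 9000],
    "MS_POPULATION": [60, 20, 15, 5],
})

# Sum the sub-groups to get the inhabitants per municipality.
per_municipality = df_toy.groupby(
    ["CD_MUNTY_REFNIS", "Postcode"], as_index=False
)["MS_POPULATION"].sum()

# Weighted mean: sum(zip * inhabitants) / sum(inhabitants).
weights = per_municipality["MS_POPULATION"]
weighted_mean_zip = (per_municipality["Postcode"] * weights).sum() / weights.sum()

# Unweighted mean over municipalities, for comparison.
unweighted_mean_zip = per_municipality["Postcode"].mean()

print(weighted_mean_zip, unweighted_mean_zip)
```

In this toy example the populous municipality pulls the weighted mean towards its own zip code, while the unweighted mean treats all three municipalities alike.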
