# Census 2022 Results for Germany - Accessing the Data API with Python/Pandas
Get your free account at  
https://ergebnisse2011.zensus2022.de/datenbank/online#modal=register  
API Documentation (latest update 2023-11-27)  
German
https://ergebnisse2011.zensus2022.de/datenbank/online/docs/ZENSUS-Webservices_Einfuehrung.pdf  
English
https://ergebnisse2011.zensus2022.de/datenbank/online/docs/ZENSUS-Webservices_Introduction.pdf

In [1]:
import pandas as pd
import requests
import json
import io
from dotenv import dotenv_values, load_dotenv
import datetime

In [2]:
pd.__version__

'2.1.1'

In [3]:
# convenience function for timestamps while logging
def tStamp():
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

In [4]:
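# Load credentials from .env file
# note: the unpacking below assumes the file holds exactly two entries,
# the username first and the password second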
load_dotenv()  
usr, pwd = dotenv_values().values()
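
Unpacking `dotenv_values().values()` depends on the order and number of entries in the `.env` file. A minimal sketch of a lookup by key name instead; the helper `pick_credentials` and the key names `ZENSUS_USER` and `ZENSUS_PASSWORD` are hypothetical and not part of the original notebook.

```python
def pick_credentials(env, user_key="ZENSUS_USER", password_key="ZENSUS_PASSWORD"):
    """Return (username, password) from a mapping such as dotenv_values().

    The default key names are placeholders; use whatever names your
    .env file actually contains.
    """
    missing = [key for key in (user_key, password_key) if not env.get(key)]
    if missing:
        raise KeyError("missing credential(s): " + ", ".join(missing))
    return env[user_key], env[password_key]


# stand-in for the mapping that dotenv_values() would return
example_env = {"ZENSUS_PASSWORD": "secret", "ZENSUS_USER": "alice"}
usr_example, pwd_example = pick_credentials(example_env)
```

In the notebook this could replace the positional unpacking with `usr, pwd = pick_credentials(dotenv_values())`, provided the `.env` file uses matching key names.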

## URL will likely change with 2022 results

In [5]:
# Set base path for API calls
BASE_URL = 'https://ergebnisse2011.zensus2022.de/api/rest/2020/'

## Check credentials, network and endpoint availability
The status may also return 'exceeded limit of parallel jobs', which can happen after too many failed requests and temporarily suspends the user.

In [6]:
try:
    hello = requests.get(BASE_URL + 'helloworld/logincheck', params={
        'username': usr,
        'password': pwd,
        'language': 'en'
        }, timeout=300)
    
    print(hello.json()["Status"])

except Exception as err:
    # network error, non-JSON reply or missing "Status" field
    print(f"{tStamp()} : {usr} failed ({err})")

You have been logged in and out successfully!


## Basic table download function
* be kind to the database
* only request the data-depth that you need
* download municipality data to disk, it changes only once a decade
* parallel downloads don't work
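
The points above suggest fetching tables strictly one after another and reusing files that are already on disk. A minimal sketch of that idea, assuming files are named `<table>_flat.csv` as in the download logs further down; the helper `fetch_all_cached`, its `fetch` argument and the pause length are hypothetical.

```python
import os
import time


def fetch_all_cached(table_names, fetch, folder="download", pause=1.0):
    """Fetch tables sequentially, skipping those already stored in `folder`.

    `fetch` is any callable that takes a table name and returns the raw
    bytes of the table, or None if the request failed.
    """
    os.makedirs(folder, exist_ok=True)
    results = {}
    for name in table_names:
        path = os.path.join(folder, name + "_flat.csv")
        if os.path.exists(path):
            # reuse the local copy instead of asking the database again
            with open(path, "rb") as f:
                results[name] = f.read()
            continue
        content = fetch(name)
        if content is not None:
            with open(path, "wb") as f:
                f.write(content)
            results[name] = content
        time.sleep(pause)  # one request at a time, with a short break
    return results
```

The cache key here is only the table name, so a table that was pruned with classifying variables would be mistaken for the full table; a real cache would need to include those parameters in the file name.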

In [7]:
# this takes a requests.Response object as input
# usually the reply of the 'data/tablefile' service
# the 'download/' folder must already exist

def response2disk(resp):

    filename = resp.headers["Content-Disposition"].split("=")[1]

    destination = "download/"+filename
            
    with open(destination, 'wb') as f:
        
        f.write(resp.content)

    print(tStamp(), filename, "download complete")
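
Splitting the `Content-Disposition` header on `=` works as long as the filename is unquoted and contains no `=` itself. A hedged alternative using only the standard library is sketched below; the helper name `filename_from_disposition`, the fallback name and the example header values are assumptions, since the notebook does not show the raw header.

```python
import os
from email.message import Message


def filename_from_disposition(header_value, fallback="download.bin"):
    """Extract the filename parameter from a Content-Disposition header value."""
    msg = Message()
    msg["Content-Disposition"] = header_value
    name = msg.get_filename()  # handles quoted and unquoted values
    if not name:
        return fallback
    # drop any directory part so the file cannot land outside the target folder
    return os.path.basename(name)


quoted = filename_from_disposition('attachment; filename="5000H-2005_flat.csv"')
unquoted = filename_from_disposition("attachment; filename=5000H-2005_flat.csv")
```

Both calls yield the same filename, so the result could be used in place of the `split("=")[1]` expression.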

## Method “tablefile“ starts a value retrieval and returns a table
* example function fetches municipal data for selected regions (SH, HH, NI, MV)
* this makes sense if your repeated table requests always cover the same region 
* make sure regionalvariable exists for table in question
* e.g. 2000S (sample) tables are only available for GEOGM3 (pop >10k)
* there can be up to 5 pairs of classifying-variable and -key
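
Since up to five pairs are allowed, the query parameters can be generated instead of spelled out. A minimal sketch, assuming the parameter names continue the numbering seen for pairs 1 and 2 (`classifyingvariable3`, `classifyingkey3`, and so on); the helper `classifying_params` is hypothetical.

```python
def classifying_params(pairs):
    """Turn [(variable, key), ...] into numbered query parameters."""
    if len(pairs) > 5:
        raise ValueError("at most 5 classifying variable/key pairs are allowed")
    params = {}
    for i, (variable, key) in enumerate(pairs, start=1):
        # assumed naming scheme: classifyingvariable1..5 / classifyingkey1..5
        params[f"classifyingvariable{i}"] = variable
        params[f"classifyingkey{i}"] = key
    return params


# the two pairs used for table 5000H-2005 later in this notebook
extra = classifying_params([("HSHGR2", "PERSON01"), ("WHGFL3", "WFL200BXXX")])
```

The resulting dictionary could be merged into the `params` of a `requests.get` call, for example with `{**base_params, **extra}`.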

In [8]:
def tab2download(tabl, classVar1="", classKey1="", classVar2="", classKey2=""):

    if tabl.find("S") == 4:

        regio = "GEOGM3" # municipalities, pop >10k
    
    else:

        regio = "GEOGM1" # all municipalities, about 11k across Germany
    
    try:

        response = requests.get(BASE_URL + 'data/tablefile', params={
            'username': usr,
            'password': pwd,
            'name': tabl,
            'regionalvariable': regio,       
            'regionalkey': "01*,02*,03*,13*",  # e.g. NDR Sendegebiet
            'classifyingvariable1': classVar1,
            'classifyingkey1': classKey1,           
            'classifyingvariable2': classVar2,
            'classifyingkey2': classKey2,           
            'format': "ffcsv",
            'quality': "on", # include quality symbols, see below
            'language': "de",
            'job': "false"   # get the data directly
            }, timeout=600)  # large tables may take some time

        try:
            response2disk(response)  # save to disk for re-use
            return(response.content) # use directly from memory

        except Exception:
        
            if response.status_code == 200:
                
                # here the api will tell you if your request could not be processed
                # e.g.
                # 'Code': 25, 'Content': 'Mindestens ein Parameter enthält ungültige Werte'
                # 'Code': 90, 'Content': 'Die angeforderte Tabelle ist nicht vorhanden.'
                
                try:
                    print(tStamp()+" : "+tabl+" : "+str(response.json()["Status"])[0:80])
                except Exception:
                    # in case the response isn't JSON formatted or has no "Status" field
                    print(tStamp()+" : "+tabl+" : "+str(response.text[0:300]))
            
            else:
                # log if api times out or otherwise disconnects (500, 404...)
                print(tStamp()+" : "+tabl+" http code "+str(response.status_code))

    except requests.exceptions.Timeout:

        # log if this request has hit its own timeout limit as set above
        print(tStamp()+" : "+tabl + " timed out")

### Explanation of symbols
`e`   Final value  
`-`   Exactly zero or adjusted to zero  
`()`  Limited information value because the numerical value may have been modified relatively strongly by the confidentiality procedure
#### For sample data
`/`   No data because the numerical value is not sufficiently reliable  
`-`   No figures

### ffcsv will be delivered as zip starting in Feb 2024

In [9]:
# reads the current csv from memory and returns a dataframe
# some assumptions regarding type conversion are applied
# when in doubt, read everything as string first
# reading from disk works the same way
# pandas reads zipped csv files directly when given a file path

def table2df(ffcsv):

    # BytesIO wraps the raw bytes (utf-8); zipped content may need compression="zip" passed explicitly
    csvInput = io.BytesIO(ffcsv)

    # decimal setting for german default
    df = pd.read_csv(csvInput, delimiter = ';', decimal = ",", 
                     # quality indicators (strings) may replace numerical values
                     na_values = ["...",".","-","/","x"],
                     # regional key (ARS, AGS) has leading zeroes, force as string
                     dtype = {"1_variable_attribute_code": str})

    return(df)
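
As far as the pandas documentation describes it, compression is inferred only from a file path, not from an in-memory buffer, so zipped bytes returned by the API may need to be unpacked first. A minimal sketch using the standard library, assuming the archive contains exactly one `.csv` file; the helper `unzip_if_needed` is hypothetical.

```python
import io
import zipfile


def unzip_if_needed(payload):
    """Return the csv bytes, unpacking them first if payload is a zip archive."""
    buffer = io.BytesIO(payload)
    if not zipfile.is_zipfile(buffer):
        return payload  # plain csv, pass through unchanged
    with zipfile.ZipFile(buffer) as archive:
        csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
        if len(csv_names) != 1:
            raise ValueError(
                f"expected exactly one csv in the archive, found {len(csv_names)}"
            )
        return archive.read(csv_names[0])
```

The result is plain csv bytes in either case, so it could be passed on as `table2df(unzip_if_needed(myTable))`.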

## Example Table "5000H-2005" 
#### Haushalte: Größe des privaten Haushalts - Ausstattung der Wohnung/Fläche der Wohnung (20 m²-Intervalle)/Räume
Note the slashes in the title: on the second axis, `/` separates alternative variables, of which the first is selected by default.  
It helps to try out the options in the web interface first, switching on 'code' wherever possible, and then come back here.  
https://ergebnisse2011.zensus2022.de/datenbank/online/statistic/5000H/table/5000H-2005

In [17]:
# download single table, deliberate typo, table doesn't exist
myTable = tab2download("5000H-2905")

2024-01-30 16:14:30 : 5000H-2905 : {'Code': 90, 'Content': 'Die angeforderte Tabelle ist nicht vorhanden. Bitte prü


In [11]:
# try again, now with correct name
myTable = tab2download("5000H-2005")

2024-01-30 16:09:35 5000H-2005_flat.csv download complete


#### Switching to "Fläche der Wohnung (20 m²-Intervalle)" instead of the default "Ausstattung der Wohnung"
Use classifying variable/key pairs to refine the table, e.g. only single-person households that occupy the largest floor space

In [12]:
# pruned table for a specific research question
myTable = tab2download("5000H-2005", "HSHGR2", "PERSON01", "WHGFL3", "WFL200BXXX")

2024-01-30 16:10:30 5000H-2005_flat.csv download complete


#### Example Table with Sample Data: 2000S-2029
Personen: Höchster beruflicher Abschluss (ausführlich) - Art der Wohnungsnutzung/Gebäudetyp (Bauweise)/Gebäudetyp (Größe)  
https://ergebnisse2011.zensus2022.de/datenbank/online/table/2000S-2029

In [13]:
# persons with "Ohne oder noch kein Schulabschluss" (no school-leaving qualification, or none yet) who live in detached single-family houses
myTable = tab2download("2000S-2029", "BILBA1", "ABSCH-X", "GEBTP2", "GEB-EIN-FREI")

2024-01-30 16:10:36 2000S-2029_flat.csv download complete


#### Example Table with Percentages: 1000A-1009
Personen: ...  
https://ergebnisse2011.zensus2022.de/datenbank/online/table/1000A-1009

In [14]:
# persons, restricted to marital status "Verwitwet" (widowed)
myTable = tab2download("1000A-1009", "FAMST1", "VERWITWET")

2024-01-30 16:10:55 1000A-1009_flat.csv download complete


# New flatfile csv (ffcsv) datastructure
* identical English variable labels in both en/de localizations
* semicolon `;` delimited for both de/en
* decimal point `.` in English but decimal comma `,` in German
* utf-8 via API (as always) and utf-8 with BOM via web
* currently not sorted by regional key (may change in the future)
* only one `value` column
* counts and percentages mixed in this value
* observe `value_unit` to see which is which
* see `value_q` for how to interpret the data (see explanation of symbols above)
* ffcsv always comes zip-compressed with matching filename of .zip and .csv
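
The points above can be sketched with a few pandas operations: sort by regional key, split rows by `value_unit`, and filter on `value_q`. The sample below is a shortened stand-in for a real ffcsv; the two percentage rows reuse regional keys and values from the output further down, while the count row and its unit label `Anzahl` are invented for illustration.

```python
import io

import pandas as pd

# shortened ffcsv-like sample in the German localization (semicolon, decimal comma)
sample = io.StringIO(
    "1_variable_attribute_code;value;value_unit;value_q\n"
    "031545403025;100,0;%;e\n"
    "010585864147;6,6;%;e\n"
    "010595990109;1234;Anzahl;e\n"  # invented count row, unit label is a guess
)

df = pd.read_csv(sample, delimiter=";", decimal=",",
                 dtype={"1_variable_attribute_code": str})

# the ffcsv arrives unsorted, so sort by regional key first
df = df.sort_values(by="1_variable_attribute_code").reset_index(drop=True)

# value_unit tells percentages apart from everything else
percentages = df[df["value_unit"] == "%"]
counts = df[df["value_unit"] != "%"]

# value_q holds the quality symbol; "e" marks final values
final_only = df[df["value_q"] == "e"]
```

Keeping counts and percentages in separate frames avoids accidentally summing or averaging across different units.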

In [16]:
# use the table directly while in memory
# ffcsv is unsorted right now

table2df(myTable) #.sort_values(by="1_variable_attribute_code")

Unnamed: 0,statistics_code,statistics_label,time_code,time_label,time,1_variable_code,1_variable_label,1_variable_attribute_code,1_variable_attribute_label,2_variable_code,2_variable_label,2_variable_attribute_code,2_variable_attribute_label,value,value_unit,value_variable_code,value_variable_label,value_q
0,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,010585864147,Schülp b. Nortorf,FAMST1,Familienstand (ausführlich),VERWITWET,Verwitwet,6.6,%,PRS018,Personen,e
1,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,031545403025,Warberg,FAMST1,Familienstand (ausführlich),,Insgesamt,100.0,%,PRS018,Personen,e
2,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,010595990109,Esgrus,FAMST1,Familienstand (ausführlich),,Insgesamt,100.0,%,PRS018,Personen,e
3,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,010595952151,Osterby (Kreis Schleswig-Flensburg),FAMST1,Familienstand (ausführlich),VERWITWET,Verwitwet,3.5,%,PRS018,Personen,e
4,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,010615138035,Heiligenstedtenerkamp,FAMST1,Familienstand (ausführlich),,Insgesamt,100.0,%,PRS018,Personen,e
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5895,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,130555517008,Brunn (Landkreis Mecklenburg-Strelitz),FAMST1,Familienstand (ausführlich),VERWITWET,Verwitwet,7.7,%,PRS018,Personen,e
5896,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,130545416001,Alt Krenzlin,FAMST1,Familienstand (ausführlich),VERWITWET,Verwitwet,12.2,%,PRS018,Personen,e
5897,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEOGM1,Gemeinden,033515404018,Nienhagen (Landkreis Celle),FAMST1,Familienstand (ausführlich),VERWITWET,Verwitwet,6.6,%,PRS018,Personen,e
5898,1000A,Bevölkerung kompakt,STAG,Stichtag,2011-05-09,GEODL1,Deutschland,DG,Deutschland,FAMST1,Familienstand (ausführlich),,Insgesamt,100.0,%,PRS001,Personen,e
