Both the budget of Berlin and a wealth of demographic data about this fine city
are openly and freely available on the interweb. This repository contains
several IPython notebooks that explore and connect various aspects of the
data, some of which are showcased on www.100-percent.net.
Extract, collect, and examine the budget of both the city and the state of Berlin from published data. Note that this covers only what the powers that be planned or plan to spend. Much less data is available on what they actually did spend (and that's not counting BBI Airport).
Based on swaths of demographic data, the original goal was to localize "gentrification" among Berlin's boroughs. After a wide range of metrics had been tried, one was identified that allowed clustering algorithms to divide Berlin into at least four different kinds of neighborhoods with distinct demographic characteristics. These are then plotted onto an interactive OpenStreetMap map (see ClusterMap.html for an example).
pyshp 1.2.10
folium 0.2.1
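The clustering step described above can be illustrated with a small, self-contained sketch. It is not the notebook's code: it assumes plain k-means as the algorithm, two clusters instead of four, and a made-up table with two demographic shares per neighborhood. The names `kmeans` and `features` are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Plain k-means; returns one cluster label per point plus the centroids."""
    # Assumption: a deterministic start (evenly spaced rows) keeps the sketch
    # reproducible. Real runs would typically use several random restarts.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster where it is
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    return labels, centroids


# Made-up rows: (share of residents under 30, share living there < 5 years).
features = [
    (0.20, 0.10), (0.22, 0.12), (0.18, 0.11),
    (0.45, 0.40), (0.47, 0.38), (0.44, 0.42),
]
labels, centroids = kmeans(features, k=2)
```

Each entry of `labels` could then be used to color the matching neighborhood polygon on the map.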
Knowing all these demographics, we can try our best to locate you on an interactive OpenStreetMap map (see PosteriorMap.html for one example), if you provide some details about yourself. This is done within a Naive Bayes framework.
pyshp 1.2.10
folium 0.2.1
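The Naive Bayes idea mentioned above boils down to multiplying a prior over areas by one likelihood per personal detail and normalizing the result. The following is a minimal sketch under stated assumptions: the area names, attributes, and all probabilities are invented for illustration, and `posterior` is a hypothetical helper, not a function from the notebook.

```python
def posterior(prior, likelihoods, observed):
    """P(area | observed details), treating the details as independent given the area."""
    unnormalised = {}
    for area, p_area in prior.items():
        score = p_area
        for feature, value in observed.items():
            # Naive assumption: each detail contributes its own factor.
            score *= likelihoods[area][feature][value]
        unnormalised[area] = score
    total = sum(unnormalised.values())
    return {area: score / total for area, score in unnormalised.items()}


# Invented numbers: the prior would come from population shares, the
# likelihoods from the demographic breakdown of each area.
prior = {"area_a": 0.5, "area_b": 0.5}
likelihoods = {
    "area_a": {
        "age": {"under_30": 0.4, "30_plus": 0.6},
        "household": {"single": 0.5, "family": 0.5},
    },
    "area_b": {
        "age": {"under_30": 0.2, "30_plus": 0.8},
        "household": {"single": 0.25, "family": 0.75},
    },
}
result = posterior(prior, likelihoods, {"age": "under_30", "household": "single"})
```

With these toy numbers, `result` puts roughly 0.8 on `area_a` and 0.2 on `area_b`. Values like these are what a posterior map would shade.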
Putting one and one together is harder than it sounds. Knowing the budget of Berlin and quite a bit about its demography, can we actually measure the infinite wisdom of the city fathers in their budget allocation? Can we discern clear-cut correlations (or their absence) between certain demographic groups and certain budget items? The answer is far from straightforward and this is very much work in progress. So far it is mostly an exercise in
- various flavors of (generalized) linear models,
- model selection (best-subset and forward),
- cross-validation,
- dimensionality reduction (PCA, PLS, and CCA).
A more elegant and more powerful way of approaching the same problem is then shown in the next notebook.
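To make the cross-validation and model-selection items above concrete, here is a small sketch that compares two candidate models by their held-out error. It is an illustration only: the data are invented, the models are deliberately tiny (a mean-only model versus a one-predictor line), and the names `fit_line` and `kfold_mse` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def kfold_mse(xs, ys, k, use_slope):
    """Mean squared prediction error over k interleaved held-out folds."""
    n = len(xs)
    squared_errors = []
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        if use_slope:
            intercept, slope = fit_line(train_x, train_y)
        else:
            # Baseline model: predict the training mean everywhere.
            intercept, slope = sum(train_y) / len(train_y), 0.0
        for i in test_idx:
            squared_errors.append((ys[i] - (intercept + slope * xs[i])) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Invented data: x could stand for a demographic share, y for a budget item.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9, 13.2, 14.8]
mse_mean_only = kfold_mse(xs, ys, k=4, use_slope=False)
mse_line = kfold_mse(xs, ys, k=4, use_slope=True)
best = "line" if mse_line < mse_mean_only else "mean only"
```

Forward selection repeats this comparison greedily, adding at each step the predictor that lowers the held-out error the most.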
Taking a probabilistic view on the same problem and employing Monte Carlo sampling in a Bayesian framework allows us to circumvent many of the problems we faced in assessing the wisdom of our city fathers in deploying their budget. Now that we have a clear picture, what do we see?
pymc3 3.1
theano 0.9
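The sampling idea can be shown without any of the packages listed above. The sketch below is a hand-rolled random-walk Metropolis sampler, used here as a simple stand-in for the samplers that pymc3 provides. It is not the notebook's model: the data, the known noise level, the Normal prior on the slope, and the function names are all assumptions made for illustration.

```python
import math
import random


def log_posterior(slope, xs, ys, noise_sd=1.0, prior_sd=10.0):
    """Unnormalised log posterior of the slope in y ~ Normal(slope * x, noise_sd)."""
    log_prior = -0.5 * (slope / prior_sd) ** 2
    log_likelihood = sum(
        -0.5 * ((y - slope * x) / noise_sd) ** 2 for x, y in zip(xs, ys)
    )
    return log_prior + log_likelihood


def metropolis(xs, ys, n_samples=5000, step=0.3, seed=1):
    """Draw correlated samples of the slope with a Gaussian random walk."""
    rng = random.Random(seed)
    slope = 0.0
    current = log_posterior(slope, xs, ys)
    samples = []
    for _ in range(n_samples):
        proposal = slope + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, xs, ys)
        # Accept with probability min(1, posterior ratio); 1 - random() avoids log(0).
        if math.log(1.0 - rng.random()) < candidate - current:
            slope, current = proposal, candidate
        samples.append(slope)
    return samples


# Invented data lying close to y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
samples = metropolis(xs, ys)
kept = samples[1000:]  # discard the first draws as burn-in
posterior_mean = sum(kept) / len(kept)
```

The payoff of the Bayesian view is that `kept` describes a whole distribution for the slope, so the uncertainty about a budget-demography relationship can be read off directly instead of resting on a single point estimate.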
Everything seems to run smoothly on a relatively fresh Anaconda 4.3 install
(this is for Python 3.6).
While the data are free and open to access for everyone, they are published
under a different license and therefore cannot be redistributed here.
The demographic data files have been mildly edited to homogenize column labels etc. The
budget data have additionally been split into individual years where
only biennial (two-year) budgets were available. The geographic data can be read in with the pyshp
package but have been converted to the GeoJSON format needed by folium
with the wonderful QGIS.