# Who owns Lausanne? 

## Installation

To run this notebook you will need several dependencies. We assume you have a running `conda` distribution.

 - `numpy` needs to be in version `>=1.15`
 - The JSON processing tool `jq` is needed. Install with your OS' package manager (e.g. `brew install jq`).
 - `pip install yq geopy shapely`
 - If you want to have a look at the raw geographical data you have to install [QGIS](https://www.qgis.org/en/site/), an open source geographic information system.

_If you are running an older version of macOS (e.g. 10.11) you might need to call `ulimit -n 1024` in the terminal before starting `jupyter`. This can avoid a bug with one of the preprocessing scripts._

## 1. Public data and owners

We obtained FTP access from the Lausanne office of cadastre. The data is a collection of ESRI shapefiles, describing roads, buildings, parcels, trees, waterbodies, and others.
Each shapefile is a collection of features, and each feature has an associated geometry (e.g. the shape of a land parcel) and associated attributes (e.g. the commune responsible for the parcel, the parcel number).

We can explore this dataset by using GIS software that supports shapefiles. We used QGIS to explore the dataset.
We hoped to find an attribute describing the parcel owner in the parcel shapefile layer, but it wasn't there.
We had to resort to web scraping to recover this attribute.

## 1.1 Scraping owners

### 1.1.1 Download XML files
We wanted to associate each parcel in Lausanne with an owner. To do this, we divided Lausanne's surface into rectangles and, for each rectangle, requested parcel information from a service that exposes the owners' names.
The code for the scraping is in [`/scraping/owners/scrape_owners_to_xml.py`](/edit/scraping/owners/scrape_owners_to_xml.py).
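
As a rough illustration of the tiling step, the sketch below splits a bounding box into a grid of rectangles and names each one after its top-left and bottom-right corners. The bounding box, the 2×2 grid and the helper names `tile_bbox` and `tile_filename` are assumptions made for this example; the actual logic lives in `scrape_owners_to_xml.py` and may differ.

```python
def tile_bbox(west, north, east, south, n_cols, n_rows):
    """Yield (left, top, right, bottom) rectangles covering the bounding box.

    Coordinates are assumed to be eastings/northings in metres,
    with north > south and east > west.
    """
    width = (east - west) / n_cols
    height = (north - south) / n_rows
    for row in range(n_rows):
        for col in range(n_cols):
            left = west + col * width
            top = north - row * height
            yield (left, top, left + width, top - height)


def tile_filename(tile):
    """Name a tile after its top-left and bottom-right corners."""
    left, top, right, bottom = tile
    return f"{left}_{top}_{right}_{bottom}.xml"


# Hypothetical bounding box, not the one used for the real scrape
tiles = list(tile_bbox(534000.0, 156000.0, 535000.0, 155000.0, n_cols=2, n_rows=2))
print(len(tiles))
print(tile_filename(tiles[0]))
```

Requesting one rectangle at a time keeps each response small and makes it easy to resume an interrupted scrape, since every file name records which area it covers.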

The result of owner scraping is a set of 400 xml files, each containing parcel information for a geographical rectangle. The data are saved in the following directory: `data/raw/owners/`.

For privacy reasons, we decided not to push any data to the online GitHub repository.

We start exploring the raw owner xml data:

In [None]:
!ls "data/raw/owners/" | wc -l

Each file is named after the coordinates, in the Swiss coordinate system, of the top-left and bottom-right points bounding the scraped rectangle.

In [None]:
!ls "data/raw/owners/" 2>/dev/null | head -10 # suppress error message by redirecting errors to null

One example file:

In [None]:
!head -n 20 "data/raw/owners/534810.4210526316_155847.0_535161.3710526315_155589.0.xml"

### 1.1.2 From XML to a single JSON
We use the `yq`, `xq` and `jq` programs to extract only the features we care about from the different XML files and save them as a list of objects in a single JSON file.
[`scraping/owners/multiple_xml_to_single_json.sh`](/edit/scraping/owners/multiple_xml_to_single_json.sh) is a small script leveraging the expressiveness of `jq` to efficiently concatenate the XML files into a single JSON, while also discarding all the attributes we have no interest in.

The output of the shell script is a JSON file, `data/owners/all_owners_dirty.json`.

In [None]:
# Convert all XML into one JSON file
!scraping/owners/multiple_xml_to_single_json.sh all_owners_dirty.json 

If you see errors of the form
```
jq: error (at <stdin>:1): Cannot iterate over null (null)
```
you don't have to worry. `jq` prints these errors when it reaches the end of a file; the script still works correctly.

### 1.1.3 Remove duplicates and clean the owners JSON
The generated `all_owners_dirty.json` file has duplicate entries, entries concerning communes other than Lausanne, and entries with missing owners. Furthermore, JSON is not the best format to handle tabular data. The code in [`scraping/owners/owner_json_to_clean_csv.py`](/edit/scraping/owners/owner_json_to_clean_csv.py) removes the duplicates and transforms the data into the `all_owners.csv` CSV file.

In [None]:
# Clean dirty data and write it to 'all_owners.csv' 
# import the submodule explicitly so that it is guaranteed to be loaded
from scraping.owners import owner_json_to_clean_csv
owner_json_to_clean_csv.main(
    './data/owners/all_owners_dirty.json',
    './data/owners/all_owners.csv'
)
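
To make the three cleaning rules concrete, here is a minimal standard-library sketch. The record keys (`commune`, `parcel`, `owner`), the commune number `1` and the owner names are placeholders invented for this example, and deduplicating on the (commune, parcel) pair is an assumption; `owner_json_to_clean_csv.py` may proceed differently.

```python
import csv
import io


def clean_owners(records, commune):
    """Keep one row per parcel of `commune` that has a known owner."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec.get("commune") != commune:  # entry from another commune
            continue
        if not rec.get("owner"):  # missing owner
            continue
        key = (rec["commune"], rec.get("parcel"))
        if key in seen:  # duplicate entry
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def to_csv(rows):
    """Serialize the cleaned rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["commune", "parcel", "owner"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up records, for illustration only
records = [
    {"commune": 1, "parcel": 10, "owner": "Owner A"},
    {"commune": 1, "parcel": 10, "owner": "Owner A"},  # duplicate
    {"commune": 2, "parcel": 10, "owner": "Owner C"},  # other commune
    {"commune": 1, "parcel": 11, "owner": None},       # missing owner
    {"commune": 1, "parcel": 12, "owner": "Owner B"},
]
cleaned = clean_owners(records, commune=1)
print(to_csv(cleaned))
```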

### 1.1.4 Joining the owners data with the cadastre shapefiles
The result of the previous preprocessing steps is a CSV file with three columns: commune number, parcel number, and the owner name:

In [None]:
import pandas as pd
all_owners = pd.read_csv('data/owners/all_owners.csv')
all_owners.head(3)

We would like to add the owner name to the attributes of the parcels shapefile that we obtained from the "Office du Cadastre". To do so, we import the CSV and the shapefile in QGIS, and we join these two "tables" by parcel number. The resulting geographical layer contains all the geographical features representing the parcels, and additionally the owner name for each parcel.

We can now export this layer as a GeoJSON, making sure to use `WGS-84` as the coordinate system, and continue our exploration.

The exported *geojson* file is saved at `data/owners/all_owners_parcelles.geojson`.
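
The join itself was done in QGIS, but the same idea can be sketched in a few lines of Python. The attribute names `NO_PARC` and `proprio` are the ones that appear in the exported file; the function name `join_owners`, the sample data and the choice of a left join (parcels without a match get `None`) are assumptions made for this illustration.

```python
def join_owners(geojson, owner_by_parcel):
    """Return a copy of the FeatureCollection with a `proprio` attribute per feature."""
    joined = dict(geojson)
    joined["features"] = []
    for feature in geojson["features"]:
        properties = dict(feature["properties"])
        # left join: parcels without a known owner get None
        properties["proprio"] = owner_by_parcel.get(properties["NO_PARC"])
        joined["features"].append({**feature, "properties": properties})
    return joined


# Tiny made-up layer and owner table
layer = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"NO_PARC": 10}, "geometry": None},
        {"type": "Feature", "properties": {"NO_PARC": 11}, "geometry": None},
    ],
}
owners_table = {10: "Owner A"}

joined = join_owners(layer, owners_table)
print([f["properties"]["proprio"] for f in joined["features"]])
```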

## 1.2 Cadastral data - data exploration

We now have a GeoJSON, containing parcel geometries and parcel owners.

In [None]:
import numpy as np
import json

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set()

%load_ext autoreload
%autoreload 2

import cleaning

We start by loading the `geojson` file:

In [None]:
with open('data/owners/all_owners_parcelles.geojson') as geojson:
    all_owners_parcelles = json.load(geojson)

In [None]:
all_owners_parcelles.keys()

The information we need is located in the `features` entry. It is a list of features, each of which contains the parcel attributes (`properties`, including the owner) and the location and shape of the parcel (`geometry`).

In [None]:
print(json.dumps(all_owners_parcelles['features'][6], indent=2))

We load the attributes of each geographical feature in a pandas dataframe:

In [None]:
# select 'number (id) of parcels' and owners name
features = [
    {'parc_num': feature['properties']['NO_PARC'],
     'owner': feature['properties']['proprio']} for feature in all_owners_parcelles['features']
]
parcels = pd.DataFrame.from_records(features)
parcels.head(3)

In [None]:
len(parcels)

### 1.2.1 Who are the biggest real estate owners?

We can now quickly answer questions such as who are the 30 biggest property owners in Lausanne, by using the number of parcels owned as a measure:

In [None]:
parcels_per_owner = parcels['owner'].value_counts()
parcels_per_owner.head(30)

We can see that most of the biggest owners are either corporations, pension funds, or public institutions. 

We note that there is not a single private person in the list. This tells us that our analysis will have to take into consideration kinds of owners other than private ones.

### 1.2.2 Unique owners

After dropping the non assigned values, we can see the total number of parcels owned and the unique owners:

In [None]:
owners = parcels['owner'].dropna()
print('Total parcels', len(owners))
print('Unique owners', len(owners.unique()))

There are almost 8'000 parcels in Lausanne. 

We are interested in how many people and companies own them. This number doesn't account for PPE (_propriété par étage_, single flats owned by private individuals). A lower bound on the number of owners can be estimated by discarding the PPE entries altogether:

In [None]:
lower_bounds_owners = len(owners[~owners.str.contains('PPE ')].unique())
print('Unique lower-bounds owners', lower_bounds_owners)

This is a lower bound on the number of owners. The real number is likely to be much higher, since it is improbable that most of these unique owners also own a PPE share.

### 1.2.3 Visualizing the distribution of missing values

In [None]:
# Compute portion of missing values
parcels['owner'].isna().mean()

22% of the parcels don't have owner information. Indeed, many parcels represent roads, and as such they didn't have an owner on the site we scraped. Also, we didn't scrape the values for the northern part of Lausanne, which is mostly farmland and woods.

Let's try to visualize the missing values on a map:

In [None]:
import folium
from tools import *

!mkdir -p export  # -p: no error if the directory already exists

In [None]:
m = getMap()

def style_function(feature):
    """Returns color red for missing values, blue for valid."""
    return {
        'fillColor':
        'red' if feature['properties']['proprio'] is None else 'blue', 
        'stroke': False
    }

geo_fol = folium.GeoJson(all_owners_parcelles, style_function=style_function)

m.add_child(geo_fol)
m.save('export/missing_values.html')
m

This visualization is not very snappy or legible, but we can interpret it as follows:

- Red areas are parcels for which the owner is not assigned, i.e. `None`. The northern parts of Lausanne were not scraped, since we didn't want to overload the scraped website and since they're mostly rural areas. It is expected that they are red.
- Zooming into central Lausanne, we see that roads have unknown owners. This is also expected.
- For some areas blue and red overlap, yielding purple parcels. This is because the dataset is slightly dirty and some bigger parcels with no owners _contain_ smaller parcels with known owners. Therefore the colors overlap.

Having established that the dataset is fairly sane, we can drop the features where the owner is `None`, since they will be of no use to us (roads) and would make the map drawing slower.

In [None]:
geo_parcels = all_owners_parcelles.copy()

# replace the list of features by filtering out the 
# features having None as proprio
geo_parcels['features'] = [
    feature for feature in geo_parcels['features']
    if feature['properties']['proprio'] is not None
]

features = [
    {'parc_num':feature['properties']['NO_PARC'],
    'owner':feature['properties']['proprio']} for feature in geo_parcels['features']
]
parcels = pd.DataFrame.from_records(features)
len(parcels)


###  1.2.4 Show parcels by owner type

The parcel owner format allows us to know the category of each owner. 
We will use similar categories as the statistical office of the city of Lausanne [here](https://www.lausanne.ch/officiel/statistique/quartiers/tableaux-donnees.html):

- privates
- public institutions
- companies (corporations)
- cooperatives
- pension funds
- foundations
- PPE

Companies are detected by having 'AG' or 'SA' in their name, and similarly for cooperatives, foundations, and pension funds. We display a map colored by owner category.

In [None]:
import re
def categorize(owner):
    regex_cats = [
            ('retraites|pension|prévoyance|prevoyance|BVK|'+\
             'anlagestiftung|fondation d\'investissement|fondation de placement|'+\
             'vorsorge|anlage stiftung', 'pension'),
            ('commune de lausanne|dfire|cff|domaine public|Etat de Vaud|Service du logement', 'public'),
            (r's\.a\.|\bsa\b|\bag\b|société anonyme|sàrl|\bBCV\b|SICAV', 'société'), # BCV société ou public?
            (r'fondation|\bstiftung|foundation|association|fédération', 'fondation/association'),
            (r'\bppe\b|copropriété|copropriete|parcelles', 'PPE'),
            ('société coopérative|societe cooperative', 'coop'),
            ('.*', 'private')
    ]
    for cat_re, category in regex_cats: 
        if re.search(cat_re, owner, flags=re.IGNORECASE):
            return category
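
Because the first matching rule wins, the order of the list matters: a pension fund incorporated as an "SA" is classified as a pension fund only because the pension rule comes first. The reduced rule list and the owner names below are invented for this illustration.

```python
import re

# Reduced, illustrative rule list; order matters because the first match wins
RULES = [
    (r"pension|prévoyance", "pension"),
    (r"\bsa\b|\bag\b", "société"),
    (r".*", "private"),  # catch-all, so the function never returns None
]


def first_match(owner):
    for pattern, category in RULES:
        if re.search(pattern, owner, flags=re.IGNORECASE):
            return category


print(first_match("Caisse de pensions Exemple SA"))  # pension
print(first_match("Exemple Immobilier SA"))          # société
print(first_match("Jeanne Exemple"))                 # private
```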

In [None]:
# work on a copy to avoid pandas' SettingWithCopyWarning
owners_categories = parcels[['owner']].copy()
owners_categories['category'] = owners_categories['owner'].apply(categorize)
owners_categories = owners_categories.drop_duplicates().set_index('owner')
owners_categories.head(5)

In [None]:
m = getMap()

def style_function(feature):
    colors = {
        'coop': 'yellow',
        'société' : 'red',
        'public' : 'green',
        'private': 'blue',
        'PPE': 'orange',
        'pension': 'purple',
        'fondation/association' : 'brown'
        
    }
    owner = feature['properties']['proprio']
    cat = owners_categories.loc[owner, 'category']  # label-based lookup of the category
    
    return {
        'stroke':False,
        'fillColor': colors[cat]
    }

folium.GeoJson(
    geo_parcels, 
    style_function=style_function,
    # show the owner at hover
    tooltip=folium.GeoJsonTooltip(['proprio'])
).add_to(m)
m

## 2. Rents data

### 2.1 Scraping

In order to analyse how ownership patterns influence prices, we needed to complement the owners dataset with rent prices.
Rent prices are generally not public, but we can scrape the current rent listings of real estate websites and extract the prices from there.

We scraped from [anibis.ch](https://www.anibis.ch/fr/default.aspx), [homegate.ch](https://www.homegate.ch/fr) and [tutti.ch](https://tutti.ch) and extracted up-to-date real estate announcements.

#### 2.1.1 Download the raw rents data

The scripts to download the data from the three portals are the following:

- for homegate: [`scraping/homegate/scrape_homegate.py`](/edit/scraping/homegate/scrape_homegate.py), to download and parse the data. The data are saved in `data/raw/rents` as `homegate.json`

- for anibis:
    1. [`scraping/anibis/anibis_scrape_listings.py`](/edit/scraping/anibis/anibis_scrape_listings.py) to download the index of results matching rents in Lausanne
    2. [`scraping/anibis/anibis_scrape_offers.py`](/edit/scraping/anibis/anibis_scrape_offers.py) to download each single rent offer, given a parsed index

- for tutti:
    1. [`scraping/tutti/tutti_scrape_listings.py`](/edit/scraping/tutti/tutti_scrape_listings.py) to download the index of results matching rents in Lausanne
    
 
The raw rents data are then saved in `data/raw`.

#### 2.1.2 Parse the rents data


Once downloaded, we parse the data into an agreed JSON format.

#### Tutti

In [None]:
from scraping.tutti import tutti_parse_listings
tutti_parse_listings.main()

#### Anibis

1. [`scraping/anibis/anibis_parse_listings.py`](/edit/scraping/anibis/anibis_parse_listings.py) to parse the listings index.
2. [`scraping/anibis/anibis_parse_offers.py`](/edit/scraping/anibis/anibis_parse_offers.py) to parse the pages for each offer.

### 2.2 Removing duplicates

Most rent listings are published on several websites. When merging the data sources, we first need to figure out which results are present in multiple datasets to avoid duplicate datapoints. We consider listings to be duplicates if they have the same address and the same price. The code is in [`cleaning/merge_rent_offers.py`](/edit/cleaning/merge_rent_offers.py).
In addition to removing duplicates, the merging script does cleaning of offers without prices, without addresses, or without surface area.
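
A minimal sketch of this merge rule, assuming each offer is a dictionary with `address`, `price` and `surface` keys. The function name `merge_offers`, the normalization (lower-casing and trimming the address) and the sample offers are assumptions for this example; `merge_rent_offers.py` may normalize addresses differently.

```python
def merge_offers(*sources):
    """Merge offer lists, dropping incomplete offers and cross-site duplicates."""
    merged = {}
    for offers in sources:
        for offer in offers:
            # drop offers without address, price or surface
            if not all(offer.get(field) for field in ("address", "price", "surface")):
                continue
            # two offers are duplicates if address and price both match
            key = (offer["address"].strip().lower(), float(offer["price"]))
            merged.setdefault(key, offer)  # keep the first occurrence
    return list(merged.values())


# Made-up offers from two hypothetical sources
site_a = [
    {"address": "Rue Exemple 1", "price": 1500, "surface": 50},
    {"address": "Rue Exemple 2", "price": 900, "surface": None},  # no surface
]
site_b = [
    {"address": "rue exemple 1 ", "price": "1500", "surface": 50},  # duplicate
    {"address": "Rue Exemple 3", "price": 2000, "surface": 80},
]
merged_offers = merge_offers(site_a, site_b)
print(len(merged_offers))
```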

In [None]:
!jq length data/raw/rents/tutti.json
!jq length data/raw/rents/anibis_with_streets.json
!jq length data/raw/rents/homegate.json

Before cleaning and merging, we have a total of ~1200 offers.

In [None]:
pd_tutti = pd.read_json("./data/raw/rents/tutti.json")
pd_tutti.replace('nan',np.nan, inplace=True)

pd_tutti.head(2)

In [None]:
pd_anibis = pd.read_json("./data/raw/rents/anibis_with_streets.json")
pd_anibis.head(2)

In [None]:
pd_homegate = pd.read_json("./data/raw/rents/homegate.json")
pd_homegate.head(2)

In [None]:
import cleaning.merge_rent_offers  # explicit submodule import; also binds the `cleaning` package
filenames = ["./data/raw/rents/tutti.json", 
             "./data/raw/rents/anibis_with_streets.json",
            "./data/raw/rents/homegate.json"]
merged = cleaning.merge_rent_offers.main(filenames)
len(merged)

### 2.3 Mapping street addresses to coordinates
The data cleaning up to now provided us with a list of json objects, each one representing a rent offer.

The address is in textual form, and more cleaning is clearly needed: the first address, for example, is a phone number instead of an actual address. The next script takes care of this sanitisation automatically.

To perform geographical queries on addresses, we need to convert them to coordinates. To do so, we use the cadastral layer of building addresses, provided by the cadastral office of Lausanne.
During merging of the three datasets, the addresses were standardized to use the format used by this cadastral layer.

To map an address to a coordinate pair, we iterate over all buildings in Lausanne and check whether the street name and the street number match those of our address. If there's a match, we extract the coordinates of the building from the cadastral layer. If there isn't, we drop the offer (like the phone number above), which also performs some cleaning:

In [None]:
rent_prices = cleaning.address_to_coords.main(merged)
len(rent_prices)
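
The matching step can be pictured as follows. This sketch swaps the linear scan over all buildings for a dictionary keyed by (street, number), which gives the same result with one lookup per offer. The building records, their keys and the helper names are hypothetical; the real cadastral layer has its own attribute names.

```python
def build_address_index(buildings):
    """Map a normalized (street, number) pair to the building coordinates."""
    return {
        (b["street"].lower(), str(b["number"]).lower()): b["position"]
        for b in buildings
    }


def locate(offers, index):
    """Attach coordinates to offers; offers without a matching building are dropped."""
    located = []
    for offer in offers:
        key = (offer["street"].lower(), str(offer["number"]).lower())
        position = index.get(key)
        if position is None:
            continue  # e.g. a phone number entered as the address
        located.append({**offer, "position": position})
    return located


# Made-up building layer and offers; positions are (long, lat)
buildings = [{"street": "Rue Exemple", "number": "1", "position": [6.63, 46.52]}]
offers = [
    {"street": "rue exemple", "number": 1, "price": 1500},
    {"street": "021 000 00 00", "number": "", "price": 900},
]
located = locate(offers, build_address_index(buildings))
print(located)
```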

In [None]:
pd_rent_prices = pd.DataFrame.from_dict(rent_prices)
# drop unused columns
pd_rent_prices.drop(['city', 'meuble'], axis='columns', inplace=True)

pd_rent_prices.head()

In [None]:
# Flag any remaining duplicates (same street, number and price)
# TODO: drop the flagged rows

pd_rent_prices['street'] = pd_rent_prices['street'].str.lower()
pd_rent_prices['price'] = pd_rent_prices['price'].astype(float)

duplicate = pd_rent_prices.groupby(['street','number','price'])['address'].transform('count') > 1

### 2.4 Visualizing the rents dataset

Finally, we can take a first look at the rental data in a cleaned form.

In [None]:
# load the geojson featuring borders for each quartier
with open('data/raw/maps/quartiers.geojson') as quartiers_file:
    quartiers = json.load(quartiers_file)

# compute the cost per square metre of the rent
for offer in rent_prices:
    offer['CHF/m2'] = float(offer['price'])/float(offer['surface'])
    
# draw a map showing the location of each vacancy, and the quartiers borders
m = getMap()
folium.GeoJson(quartiers).add_to(m)
for offer in rent_prices:
    coords = offer['position']
    # Marker wants first the N coordinate and then E
    folium.Marker((coords[1], coords[0]), tooltip=offer['CHF/m2']).add_to(m)
m

### 2.5 Mapping rent datapoints to quartiers
Each rent data-point has a pair of coordinates localizing it in space. _quartiers_ are polygons, whose perimeter is a list of coordinates. We can use the python library `shapely`, that allows us to perform geometrical queries, to find the _quartier_ for each rent offer.

In [None]:
#import the two data structures needed
from shapely.geometry import Point, Polygon

for offer in rent_prices:
    offer['quartier'] = None
    for quartier in quartiers['features']:
        
        # skip because we don't have owner data for forest areas
        if quartier['properties']['Name'] == '90 - Zones foraines':  
            continue
        
        offer_pos = Point(offer['position'])
        
        # we extract the list of coordinates of the polygon's vertices, 
        # discarding useless height
        quartier_vertices = [(east, north) for east, north, z in quartier['geometry']['coordinates'][0]]
        quartier_poly = Polygon(quartier_vertices)
        if quartier_poly.contains(offer_pos):
            offer['quartier'] = quartier['properties']['Name']
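
For readers curious about what a point-in-polygon query involves, here is a plain-Python ray-casting sketch. It is a stand-in for illustration, not how `shapely` implements `contains`, and points lying exactly on the boundary are not treated specially.

```python
def point_in_polygon(point, vertices):
    """Ray-casting test: True if `point` lies inside the polygon."""
    x, y = point
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # does the horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing flips inside/outside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))  # True
print(point_in_polygon((5, 2), square))  # False
```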

Let's sanity check by changing the color of the marker depending on the found _quartier_ and displaying all of it on a map.
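
A side note on deriving colors from names: Python salts the built-in `hash` of strings per process (unless `PYTHONHASHSEED` is set), so a color computed from `hash(name)` changes whenever the kernel is restarted. If reproducible colors matter, a digest from `hashlib` is one option. The helper name `stable_color` is an assumption for this sketch.

```python
import hashlib


def stable_color(name):
    """Return a '#rrggbb' color that is identical across Python sessions."""
    # str() also covers offers whose quartier is None
    digest = hashlib.md5(str(name).encode("utf-8")).hexdigest()
    return "#" + digest[:6]


print(stable_color("90 - Zones foraines"))
print(stable_color(None))
```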

In [None]:
m = getMap()
folium.GeoJson(quartiers).add_to(m)
for offer in rent_prices:
    coords = offer['position']
    
    # little hack to assign a different color to each quartier
    # calculate hex color from a hash of the name
    color = '%06x' % (hash(offer['quartier']) % (256**3))
    
    # Marker wants first the N coordinate and then E
    folium.CircleMarker(
        (coords[1], coords[0]),
        radius=5, fill_color='#'+color, weight=0, fill_opacity=1
    ).add_to(m)
m

This looks pretty good. We will now give a first statistic on rent prices. **However**, there are still fake offers (like offers for parking lots and the like) and there are still outliers in the dataset. The results are therefore not yet _real, clean means_. 

As a cheap mitigation we will display the median instead of the mean. Before analysing and building our mathematical model we will however clean those offers out.

In [None]:
# Display median price per neighborhood
rents_per_quartier = pd.DataFrame.from_dict(rent_prices)
rents_per_quartier[['CHF/m2', 'quartier']].groupby('quartier').agg('median')
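
To see why the median is the safer summary while outliers remain, consider a handful of made-up CHF/m2 values in which a single bogus listing dominates the mean:

```python
from statistics import mean, median

# Made-up CHF/m2 values; the last one mimics a bogus listing
rents = [28.0, 30.0, 31.0, 33.0, 300.0]

print(mean(rents))    # 84.4
print(median(rents))  # 31.0
```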

### 2.6 Map each offer to an owner

Each parcel has an owner and a geometry, which is a `MultiPolygon`. A `MultiPolygon` is a list of `Polygon`s. Every `Polygon` is a list of "linear rings". The first linear ring defines the outer perimeter of the polygon, and the next linear rings define holes in the polygon.
For all parcels the multipolygons are made of only 1 polygon, and every polygon only has the outer perimeter and no holes. We can use shapely polygons again to find whether an offer is within a polygon.

In [None]:
# for each parcel, construct a Polygon
def get_parcel_polygons():
    parcel_polygons = []
    for parcel in geo_parcels['features']:
        coords = parcel['geometry']['coordinates'][0][0]
        num_parc = parcel['properties']['NO_PARC']
        proprio = parcel['properties']['proprio']
        poly = Polygon(coords)
        parcel_polygons.append((num_parc, proprio, poly))
    return parcel_polygons
parcel_polygons = get_parcel_polygons()
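
The indexing `['coordinates'][0][0]` relies on every parcel having exactly one polygon without holes. A small helper can make that assumption explicit and fail loudly when it does not hold; the name `outer_ring` and the sample geometry are invented for this sketch.

```python
def outer_ring(geometry):
    """Return the outer ring of a single-polygon, hole-free MultiPolygon.

    Raises ValueError when that assumption does not hold, instead of silently
    ignoring extra polygons or holes.
    """
    if geometry["type"] != "MultiPolygon":
        raise ValueError("expected a MultiPolygon")
    polygons = geometry["coordinates"]  # list of polygons
    if len(polygons) != 1:
        raise ValueError(f"expected 1 polygon, found {len(polygons)}")
    rings = polygons[0]  # first ring is the perimeter, the others are holes
    if len(rings) != 1:
        raise ValueError(f"expected no holes, found {len(rings) - 1}")
    return rings[0]


# Made-up geometry: one square polygon, no holes
geometry = {
    "type": "MultiPolygon",
    "coordinates": [[[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]]],
}
print(outer_ring(geometry))
```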

In [None]:
# For each offer, assign an owner
for offer in rent_prices:
    offer_pos = Point(offer['position'])
    for parc_n, proprio, poly in parcel_polygons:
        if poly.contains(offer_pos):
            offer['proprio'] = proprio
    if 'proprio' not in offer:
        offer['proprio'] = None

In [None]:
# duplicate but how can one tell?
#[o for o in rent_prices if o['proprio'] is not None and 'Meuli' in o['proprio']]

In [None]:
# DANGER: CFF owns Avenue de Sévelin 13 but not 13 A-E. The address parsing needs to be
# improved to avoid such errors.
#[o for o in rent_prices if o['proprio'] is not None and 'CFF' in o['proprio']]

In [None]:
len([o for o in rent_prices if o['proprio'] is None])
# datapoints in zone foraine. We didn't scrape the owners for that quartier

In [None]:
len(rent_prices)

## 3. Linear model describing relation of ownership and price

The data is now in the form we need in order to apply our model. As our main goal is to understand the rent price composition, we will perform linear regression on the rent prices.

More precisely, we will try to predict the mean price of rents in each _quartier_ $q$, $price(q)$ (the dependent variable), based on the following features:

 - ownership proportion of each ownership type: ($f_{public}, f_{s.a.}, $...)
 - distance from the centre of the city: $dist$
 
The linear model is then:

$$
price(q) = \beta_1 f_{public}(q) + \beta_2  f_{s.a.}(q) + ~...~ + \beta_j  f_{privates}(q)+  \beta_k dist(q)
$$

We will apply linear regression to this model and interpret the parameters $\beta$. One potential problem is that the ownership pattern itself may depend on the distance, or vice versa. We can check this by predicting the distance from the ownership types:

$$
dist(q) = \beta_1 f_{public}(q) + \beta_2  f_{s.a.}(q) + ~...~ +\beta_j  f_{privates}(q)
$$

### 3.1 Influence of owner type on average price

In [None]:
from sklearn import linear_model
import scipy

covariates = pd.DataFrame.from_records(
    [
        {
            "owner_type": categorize(offer["proprio"]),
            "price/m2": offer["CHF/m2"],
        }
        for offer in rent_prices
        if offer["proprio"] is not None
    ]
)
covariates["owner_type"] = covariates["owner_type"].astype("category")
samples_per_cat = covariates["owner_type"].value_counts()

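# dtype=float keeps the design matrix numeric (recent pandas versions default to boolean dummies)
covariates = pd.get_dummies(covariates, dtype=float)

# drop one indicator to avoid perfect multicollinearity with the intercept
covariates = covariates.drop("owner_type_private", axis="columns")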

coeffs = np.empty(covariates.shape)

# columns: num of covariates minus the response (y) plus 1 for the intercept

# bootstrap confidence interval for linear regression coefficients
for i in range(covariates.shape[0]):
    sample = covariates.values[
        np.random.choice(
            covariates.shape[0], size=covariates.shape[0], replace=True
        )
    ]
    lm = linear_model.LinearRegression(fit_intercept=True)
    X = sample[:, 1:]
    y = sample[:, 0]
    lm.fit(X, y)
    coeffs[i, 0] = lm.intercept_
    coeffs[i, 1:] = lm.coef_

lower, upper = np.percentile(coeffs, q=(2.5, 97.5), axis=0)

print("%-40s\t%s\t%s\t%s" % ("feature", ".025 qtile", ".975 qtile", "n"))
print()
print("%-40s:\t%f\t%f" % ("intercept", lower[0], upper[0]))

for typ, lower_q, upper_q in zip(covariates.columns[1:], lower[1:], upper[1:]):
    print(
        "%-40s:\t%f\t%f\t%d"
        % (typ, lower_q, upper_q, samples_per_cat[typ.split("_")[-1]])
    )
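
The percentile bootstrap used above can be illustrated on a simpler problem with a single predictor. The sketch below is standard-library only and uses synthetic data with a true slope of 2; the function names, the fixed number of 1000 resamples and the seeds are choices made for this example (the cell above draws as many resamples as there are rows).

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slope_ci(xs, ys, n_resamples=1000, seed=0):
    """Approximate 95% percentile bootstrap interval for the slope."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        sample_x = [xs[i] for i in idx]
        if len(set(sample_x)) < 2:
            continue  # degenerate resample: the slope is undefined
        slopes.append(slope(sample_x, [ys[i] for i in idx]))
    slopes.sort()
    return slopes[int(0.025 * n_resamples)], slopes[int(0.975 * n_resamples) - 1]


# Synthetic data: y = 2x plus uniform noise
rng = random.Random(1)
xs = list(range(30))
ys = [2.0 * x + rng.uniform(-1.0, 1.0) for x in xs]
lower, upper = bootstrap_slope_ci(xs, ys)
print(lower, upper)
```

An interval that excludes zero suggests the coefficient is distinguishable from no effect at roughly the 5% level; with few samples in a category, as the `n` column shows, the interval can be wide.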


#### Linear regression on distance

In [None]:
from geopy.distance import great_circle

position = 46.50766, 6.62758

rent_positions = pd.DataFrame.from_records(
    [
        {
            "lat": offer["position"][1],
            "long": offer["position"][0],
            "CHF/m2": offer["CHF/m2"],
        }
        for offer in rent_prices
    ]
).dropna()

rent_distances = rent_positions.apply(
    lambda row: great_circle((row.lat, row.long), position).km, axis=1
)
rent_distances = rent_distances.to_frame("km")
rent_distances["CHF/m2"] = rent_positions["CHF/m2"]
rent_distances.head()
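
For reference, a great-circle distance on a spherical Earth can be computed with the haversine formula. The sketch below is a standard-library stand-in for illustration; the radius constant is assumed to be close to the mean Earth radius that `geopy` uses, so small differences from `great_circle` are possible.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.009  # mean Earth radius (assumption, see above)


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (lat, long) points in degrees."""
    lat1, lon1 = map(radians, point_a)
    lat2, lon2 = map(radians, point_b)
    h = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


# one degree of longitude along the equator
print(haversine_km((0.0, 0.0), (0.0, 1.0)))
```

Note the argument order: these functions take (lat, long), whereas `offer["position"]` stores (long, lat).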

In [None]:
sns.lmplot(x="km", y="CHF/m2", data=rent_distances);
plt.title("Distance from the station vs. rent prices");

In [None]:
distance_model = scipy.stats.linregress(
    rent_distances.km, rent_distances["CHF/m2"]
)

def print_model(model):
    for stat in ["intercept", "slope", "stderr", "pvalue"]:
        print(stat, ": ", getattr(model, stat))

        
print_model(distance_model)

In [None]:
lat_model = scipy.stats.linregress(
    rent_positions.lat ** 2, rent_positions["CHF/m2"]
)
long_model = scipy.stats.linregress(
    rent_positions.long ** 2, rent_positions["CHF/m2"]
)

print_model(lat_model)
print()
print_model(long_model)


In [None]:
rents_per_quartier = pd.DataFrame.from_records(
    [
        {
            "lat": offer["position"][1],
            "long": offer["position"][0],
            "CHF/m2": offer["CHF/m2"],
            "surface": float(offer["surface"]),
            "proprio": offer["proprio"],
            "quartier": offer["quartier"],
        }
        for offer in rent_prices
    ]
)

rents_per_quartier["distance"] = rent_distances.km
rents_per_quartier = rents_per_quartier.dropna()

print_model(
    scipy.stats.linregress(
        rents_per_quartier.surface, rents_per_quartier["CHF/m2"]
    )
)


In [None]:
sns.lmplot(x="surface", y="CHF/m2", data=rents_per_quartier);

## Heatmap
Can a heatmap show the spatial dependency of rent prices? Let's investigate.

In [None]:
import tools
tools.heatmap_prices_from_json(rent_prices)

## What comes next:

Please find more information about further ideas, our current state of work and our story in the `README`.

Because obtaining the data from all the different sources and with various methods is a huge amount of work in this project, we are not fully done with cleaning and analysing all of the datasets. As said before, the rental data still contains unwanted entries. And the data from tutti.ch has not yet been converted to the correct JSON format.

However, while working on this and discussing the progress of the project, we came up with a good guideline: understand how prices are determined. This is good because it lends itself well to writing a story, and it also naturally yields the rather simple mathematical model from above. We are therefore convinced that we will be able to carry out the analysis to its full extent and to come up with a good web data story in the end!

### Todos and future sections:

 - Adapt ownership categories
 - Clean fake offers and outliers
 - Linear regression on data and analysis
     - Tune model
     - extract parameters and CIs
     - intra-quartier effects
     - ...
 - Political analysis by hand
 - Check: do we answer the four RQs?
 - Graphics and maps production for web story
 
--> The outline for the story (our end result) can be found in the `README`.
 
 - Condense information to one line about finding affordable accommodation and why it is difficult (e.g. cheap parts are far from centre/university...)
 - Write the story
 - Design web page and animations (using a static generator and `JS`)
 - Deploy site to server (github pages or own)
 
**Milestone 3, 16.12.18**
 
 - Boil analysis down to 5 keypoints
 - Think about way of presenting this highly geographical data/problem
 - Write text for presentation and exercise!


# Machine Learning

### Extrapolate rents for all parcels with k-nearest-neighbor  

In [None]:
# Prepare the training data

pd_parcels_rent = pd.DataFrame.from_dict(rent_prices)
pd_parcels_rent.head(2)

train = pd_parcels_rent['position'].apply(lambda r: pd.Series(r))
train = pd.concat([pd_parcels_rent['CHF/m2'], train], axis='columns')
train.columns = ['target', 'long', 'lat']
train.head()

In [None]:
# Prepare the data to be predicted
polygons_df = pd.DataFrame.from_records(get_parcel_polygons())
polygons_df.columns = ('parc_no', 'proprio', 'poly')
polygons_df['owner_type'] = polygons_df['proprio'].apply(categorize)
polygons_df['x'] = polygons_df['poly'].apply(lambda poly: poly.centroid.x)
polygons_df['y'] = polygons_df['poly'].apply(lambda poly: poly.centroid.y)

# select 'id' and 'position' of parcels
features = polygons_df[["x", "y", "parc_no"]]
features.columns = ["long", "lat", "parc_no"]

predict = features[["lat", "long"]]
predict.head()

In [None]:
# Predict prices
from machine_learning import model_price_knn

prices, rmse, ks = model_price_knn(train, predict, np.arange(1, 50))

pd.DataFrame(rmse, ks).plot(legend=False);
plt.xlabel("k");
plt.ylabel("rmse");
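
The internals of `model_price_knn` are not shown here, so the following is only a generic sketch of k-nearest-neighbor regression and of the RMSE used to compare values of k: the prediction for a location is the mean target of its k closest training points. The function names and the data are invented, and plain Euclidean distance on (long, lat) is a simplification.

```python
def knn_predict(train, query, k):
    """Predict the target at `query` as the mean target of its k nearest points.

    `train` is a list of ((long, lat), target) pairs.
    """
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


def rmse(predictions, targets):
    """Root mean squared error between predictions and true values."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return (sum(squared) / len(squared)) ** 0.5


# Made-up training points: ((long, lat), CHF/m2)
train_points = [((0.0, 0.0), 10.0), ((1.0, 0.0), 20.0), ((10.0, 10.0), 100.0)]
print(knn_predict(train_points, (0.1, 0.0), k=1))  # 10.0
print(knn_predict(train_points, (0.1, 0.0), k=2))  # 15.0
```

A small k follows local price variations but is sensitive to single bad offers, while a large k smooths them out at the cost of blurring real differences between neighborhoods; the RMSE curve is a way to pick a compromise.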

# EXPORT DATA

In [None]:
# Build a DataFrame with position and price for all parcels
prices_for_all_parcels = features.copy()
prices_for_all_parcels['price'] = prices

prices_for_all_parcels.head()

In [None]:
# Save heatmap with prices for each parcelle
hm = tools.heatmap_prices_per_parcels(geo_parcels, prices_for_all_parcels)
hm.save('export/heatmap_prices_all_parcelles.html')

## Denoising owner type

We want to revisit the colorful owner types map by trying to find spatial clusters with a certain ownership type.
We will approach this problem as a denoising one: we attribute to each parcel the owner type that is most represented in its local neighborhood.
We therefore find the K nearest neighbors of a parcel (itself included) and assign the most represented type among them.

- Find a representative point for each parcel (here the centroid of the parcel)
- For each parcel, find its nearest neighbors
- Compute the distribution of ownership types in each neighborhood
- Assign to the parcel the most common ownership type in its neighborhood

In [None]:
K = 5
types = polygons_df[['parc_no', 'owner_type']].copy().set_index('parc_no').sort_index()['owner_type']
for (idx, (parc_no, proprio, _, owner_type, x, y)) in polygons_df.iterrows():
    distances2 = (polygons_df['x'] - x)**2 + (polygons_df['y'] - y)**2
    neigh = distances2.sort_values()[:K]
    neighbor_parcels = pd.concat((polygons_df, neigh), axis='columns', join='inner')
    types.loc[parc_no] = neighbor_parcels['owner_type'].value_counts().index[0]
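
The same majority-vote idea, stripped of pandas, fits in a few lines. The parcels below are made up: a lone "public" parcel sits in the middle of a "private" cluster and is relabelled by its neighbors, while the genuine "public" cluster further away is left untouched.

```python
from collections import Counter


def denoise_types(parcels, k):
    """parcels: list of (x, y, owner_type). Returns the smoothed owner types."""
    smoothed = []
    for x, y, _ in parcels:
        # k closest parcels by squared Euclidean distance, the parcel itself included
        nearest = sorted(parcels, key=lambda p: (p[0] - x) ** 2 + (p[1] - y) ** 2)[:k]
        votes = Counter(owner_type for _, _, owner_type in nearest)
        smoothed.append(votes.most_common(1)[0][0])
    return smoothed


parcels_demo = [
    (0, 0, "private"), (1, 0, "private"), (0, 1, "private"),
    (0.5, 0.5, "public"),  # isolated: will be outvoted
    (10, 10, "public"), (11, 10, "public"), (10, 11, "public"),
]
print(denoise_types(parcels_demo, k=3))
```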

In [None]:
polygons_df['owner_type'].value_counts().plot.pie();

In [None]:
types.value_counts().plot.pie();

In [None]:
m = getMap()
#TODO weight by area
def style_function(feature):
    colors = {
        'coop': 'yellow',
        'société' : 'red',
        'public' : 'green',
        'private': 'blue',
        'PPE': 'orange',
        'pension': 'purple',
        'fondation/association' : 'brown'
        
    }
    parc_num = feature['properties']['NO_PARC']
    cat = types[parc_num]
    
    return {
        'stroke':False,
        'fillColor': colors[cat]
    }

folium.GeoJson(
    geo_parcels, 
    style_function=style_function,
    # show the owner at hover
    tooltip=folium.GeoJsonTooltip(['proprio'])
).add_to(m)
m

### Price by quartier

To correctly execute this section, please install `geopandas`.

In [None]:
def get_quartier(lat, long):
    """Given latitude and longitude, return the quartier in Lausanne (or None)."""
    pos = Point(long, lat)
    for quartier in quartiers['features']:
        # skip (rather than stop) because we don't have owner data for forest areas
        if quartier['properties']['Name'] == '90 - Zones foraines':  
            continue

        quartier_vertices = [(east, north) for east, north, z in quartier['geometry']['coordinates'][0]]
        quartier_poly = Polygon(quartier_vertices)
        if quartier_poly.contains(pos):
            return quartier['properties']['Name']
    return None

prices_for_all_parcels_with_quartiers = prices_for_all_parcels.copy()

prices_for_all_parcels_with_quartiers['quartier'] = \
    prices_for_all_parcels_with_quartiers[['lat', 'long']] \
    .apply(lambda row: get_quartier(row['lat'], row['long']), axis='columns')

price_by_quartiers_all = prices_for_all_parcels_with_quartiers[['price', 'quartier']].groupby('quartier').agg('median')

In [None]:
from matplotlib import colors, cm
import branca.colormap as cmb

def style_function_quartiers(feature):    
    min_price, max_price = np.quantile(price_by_quartiers_all["price"], q=(0.05, 0.90))

    colormap = cmb.linear.RdYlGn_06.scale(min_price, max_price)
    
    # invert colors
    colormap.colors = colormap.colors[::-1]
    colormap.caption = "Rent price by quartiers"
    
    
    quartier_name = feature['properties']["Name"]
    if (quartier_name == '90 - Zones foraines'):
        price = 0
    else:
        price = price_by_quartiers_all.loc[quartier_name].values[0]
        
    return {"stroke": False, "fillColor": colormap(price), "fillOpacity": 0.75}

In [None]:
# Remove '17 - Beaulieu / Grey / Boisy' and '90 - Zones foraines' from quartiers
import geopandas
with open('data/raw/maps/quartiers.geojson') as quartiers_file:
    quartiers = json.load(quartiers_file)
quartiers_pd = geopandas.read_file('data/raw/maps/quartiers.geojson')
quartiers_to_drop = (quartiers_pd['Name'] == '17 - Beaulieu / Grey / Boisy') | \
                    (quartiers_pd['Name'] == '90 - Zones foraines')
quartiers_dropped_pd = quartiers_pd[~quartiers_to_drop]

quartiers_dropped_pd = quartiers_dropped_pd.merge(price_by_quartiers_all, left_on='Name', right_index=True)

In [None]:
map_ = getMap()
folium.GeoJson(quartiers_dropped_pd.to_json(), 
               style_function=style_function_quartiers,
               tooltip=folium.GeoJsonTooltip(['price', 'Name'])
              ).add_to(map_)
                
map_.save('export/price_by_quartiers_all_parcelles.html')
map_