Added a few extra packages (cartopy, seaborn, etc.) to the defaults given by the notebook starter blurb below (np, pd, os). The referenced Docker image defines which packages are available - e.g. the [image gets cartopy via conda](https://github.com/Kaggle/docker-python/blob/master/Dockerfile#L45), and [pandas via pip](https://github.com/Kaggle/docker-python/blob/master/Dockerfile#L55)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Goal
This notebook is trying out a simple linear model to predict real-estate prices, based on [the dataset provided in this challenge](https://www.kaggle.com/quantbruce/real-estate-price-prediction)

# Read in the data
Set "No" as the index, to avoid seeing it in analysis!

In [None]:
df = pd.read_csv("../input/real-estate-price-prediction/Real estate.csv", index_col="No")

# Initial exploration of data
## Dataset basic properties
What does dataset contain, and how big is it?

In [None]:
df.tail()

## Correlations in dataset
From above, we see the dataset has 414 entries. I'm assuming from the column names that, for the purposes of this challenge, [the XN are the explanatory variables, and Y is the explained variable](https://en.wikipedia.org/wiki/Dependent_and_independent_variables)
* So we've got 6 potential explanatory variables, and 1 explained variable.
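
The split above can be checked programmatically. A minimal sketch, assuming the columns follow the `X...`/`Y...` prefix convention; `split_columns` is a hypothetical helper, and the toy frame only stands in for `df` (its values are invented):

```python
import pandas as pd


def split_columns(frame):
    """Split column names into explanatory (X...) and explained (Y...) by prefix."""
    x_cols = [c for c in frame.columns if c.startswith("X")]
    y_cols = [c for c in frame.columns if c.startswith("Y")]
    return x_cols, y_cols


# Toy stand-in using the dataset's column names; the values are invented
toy = pd.DataFrame(
    [[2013.0, 10.0, 300.0, 5, 24.97, 121.54, 40.0]],
    columns=[
        "X1 transaction date",
        "X2 house age",
        "X3 distance to the nearest MRT station",
        "X4 number of convenience stores",
        "X5 latitude",
        "X6 longitude",
        "Y house price of unit area",
    ],
)
x_cols, y_cols = split_columns(toy)
print(toy.shape, len(x_cols), len(y_cols))
```

On the real data, `df.shape` and `split_columns(df)` should report the 414 rows and the 6/1 split noted above, if the prefix assumption holds.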

Let's have a quick look at the correlations between the different variables.

Seaborn's handy [pairplot function](https://seaborn.pydata.org/generated/seaborn.pairplot.html) helps us look at:
- the distributions of each explanatory variable XN - any evidence of things we'd need to be careful of (non-flat distributions etc)
- the basic single variable correlations present between XN and Y
- any [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity) between the XN we should be aware of (and hence be careful, as may cause issues!)

Irritatingly, pairplot doesn't show the counts on the distribution axes, so any counts quoted in the analysis are read off the handy visualisation of the .csv on [the original Kaggle dataset](https://www.kaggle.com/quantbruce/real-estate-price-prediction) page
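
If the counts are wanted without leaving the notebook, they could be computed directly. A minimal sketch using `np.histogram`; `bin_counts` is a hypothetical helper, the toy values are invented, and the bin edges will not necessarily match whatever binning pairplot or the Kaggle page chose:

```python
import numpy as np
import pandas as pd


def bin_counts(values, bins=10):
    """Return a table of histogram bin edges and the number of values in each bin."""
    counts, edges = np.histogram(np.asarray(values, dtype=float), bins=bins)
    return pd.DataFrame({"left": edges[:-1], "right": edges[1:], "count": counts})


# Toy values, invented for illustration
toy = [0, 0, 0, 1, 2, 3, 4, 10]
table = bin_counts(toy, bins=2)
print(table)
```

For example, `bin_counts(df["X1 transaction date"], bins=12)` would give one row per bin for the transaction dates.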

In [None]:
hpp = sns.pairplot(df, corner=True)  # corner: get rid of duplicate "noise" from upper triangle

### Correlations analysis
#### Individual variables and their distributions
The group's guesses at what each variable means (looks like [others have run into lack of metadata issue too](https://www.kaggle.com/quantbruce/real-estate-price-prediction/discussion/128822)), plus remarks on each distribution
* X1 transaction dates [float, decimal year]:
  * monthly data by looks of it
  * ~1 year's worth, mid-2012 to mid-2013
  * reasonably flat, bit biased to end of dataset
    * usually ~30/bin, though ~doubles (60-70) in final two bins (mid-2013)
* X2 house age [float, assume years old]:
  * bimodal distribution dominated by broad (10$\pm$10) cluster of recent houses
  * smaller, thinner cluster of older properties at ~35$\pm$7???
* X3 distance to the nearest MRT station [float, assume meters]
  * Steven H spotted in Taipei, so guessing distance to nearest [Taipei metro station](https://en.wikipedia.org/wiki/Taipei_Metro)
  * ~exponential distribution
  * dominated by 1st bin (~60% of properties are very near a station aka within 670 m???)
  * Widely spread after, so rest (other 40%) far lower, distribution generally falling off ~monotonically
  * Up to ~4 clusters - see X2:X3
* X4 number of convenience stores [int, assume within a certain distance of property]
  * Fairly flat, up to 10 (very convenient!)
* X5 latitude [float, assume degrees]
  * Looks like a city - unimodal distribution, asymmetric
    * Steven H spotted in Taipei!
* X6 longitude [float, assume degrees in -180:180 convention]
  * As per above, interesting hole (see below)
* Y house price of unit area [float, assume $TWD/m^2$]
  * Think this is our intended explained variable
    * Helpfully, looks like it's already been transformed to a nice "normalised" variable, removing need to look at nasty things like int-y "number of rooms" which would be poor proxy for this
  * ~normal distribution, centered on ~40, up to ~75
    * One outlier at ~120, which we may want to exclude...
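
Several of the eyeballed figures in the list above (e.g. the "~60% within 670 m" guess for X3) can be checked numerically. A minimal sketch; `share_below` is a hypothetical helper and the toy distances are invented:

```python
import pandas as pd


def share_below(values, threshold):
    """Fraction of values strictly below `threshold`."""
    values = pd.Series(values, dtype=float)
    return float((values < threshold).mean())


# Toy distances in metres, invented for illustration
toy_distance = [100, 250, 400, 600, 900, 2000, 4000, 6000, 300, 500]
share = share_below(toy_distance, 670)
print(share)
```

On the real data, `share_below(df["X3 distance to the nearest MRT station"], 670)` would test the 60% guess, and `df[df["Y house price of unit area"] > 100]` would list the suspected outlier.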

NB we might be missing other important explanatory variables - are there good schools nearby, is property damaged, etc. 
* But frankly who cares for now - anything like this will contribute to our residuals!
* And some things like this we may be able to infer - do residuals show hotspots geographically, hinting at local factors not captured

#### Individual explanatory variables and Y
Look at final row for this
* X1: Possible slight peak in 3rd quarter of dataset (economy? supply/demand?) - i.e. not simple linear
* X2: Trough at ~25 (poor construction practices then?) - again not simple linear
* X3: ~noisy exponential falloff (premium for convenient MRT nearby?) - again not simple linear
* X4: noisy ~linear increase (premium for convenience of convenience stores?) - yay, finally
* X5, X6: complex, but generally more expensive near center. Makes sense to look at in 2D
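
Since several of these relationships look monotonic but not linear, a rank-based correlation may describe them better than the default Pearson one. A minimal sketch, not something the notebook itself does: Spearman's coefficient is computed here as the Pearson correlation of the ranks, `pearson_and_spearman` is a hypothetical helper, and the toy data is an invented exponential falloff loosely in the spirit of X3:

```python
import numpy as np
import pandas as pd


def pearson_and_spearman(x, y):
    """Return (Pearson, Spearman) correlation coefficients for two sequences."""
    x = pd.Series(x, dtype=float)
    y = pd.Series(y, dtype=float)
    pearson = x.corr(y)  # strength of the linear association
    spearman = x.rank().corr(y.rank())  # Pearson on ranks, i.e. Spearman
    return float(pearson), float(spearman)


# Toy data, invented for illustration: a noise-free exponential falloff
toy_x = np.arange(1, 9)
toy_y = np.exp(-toy_x / 2.0)
pearson, spearman = pearson_and_spearman(toy_x, toy_y)
print(pearson, spearman)
```

A Spearman value much stronger than the Pearson one hints at a monotonic but non-linear relationship, which a plain linear term would only partly capture.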

#### Collinearity:
* X1: nothing obvious
* X2:
  * X2:X3 - some. Slight tendency for mid-age properties (~20 years old) to have larger upper bound (furthest from MRT)
* X3:
  * X3:X4 - some. All places with lots of convenience stores are close to MRT
  * X3:X5&X6 - strong, esp longitude. Distance to MRT clearly geographically dependent ([MRT map](https://en.wikipedia.org/wiki/Taipei_Metro#/media/File:Taipei_Metro_geographical_map.svg))
* X4:
  * X4:X5&X6 - strong. More convenience stores near centre.
* X5:
  * X5:X6 - strong. Sideways map! Let's have a closer look shortly.
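
The visual impressions above could be backed up with a number per variable. A minimal sketch using variance inflation factors (VIFs), a technique the notebook itself doesn't use; they are read here off the diagonal of the inverse correlation matrix. `variance_inflation` is a hypothetical helper and the toy frame is invented:

```python
import numpy as np
import pandas as pd


def variance_inflation(frame):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns)


# Toy frame, invented for illustration: two strongly related columns
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0], "b": [1.0, 2.0, 3.0, 5.0, 4.0]})
vif = variance_inflation(toy)
print(vif)
```

On the real data this would be `variance_inflation(df.drop(columns="Y house price of unit area"))`. A VIF near 1 means a variable is nearly independent of the others; a common rule of thumb treats values above roughly 5-10 as a warning sign.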


#### Correlation matrix
Look at correlation values (positive, red = correlated; negative, blue = anticorrelated) for another way of looking at:
* correlations between each XN and Y (lowest row)
* collinearity between XN (other entries)

Note here 0 = X1, 1 = X2, ... 6 = Y

Ignore diagonal (perfect correlation with itself) and upper triangle (duplicated info)
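
Rather than ignoring the diagonal and upper triangle by eye, they could be masked out before plotting. A minimal sketch; `lower_triangle` is a hypothetical helper and the toy frame is invented:

```python
import numpy as np
import pandas as pd


def lower_triangle(corr):
    """Keep entries strictly below the diagonal; set everything else to NaN."""
    keep = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    return corr.where(keep)


# Toy frame, invented for illustration
toy = pd.DataFrame(
    {"X1": [1.0, 2.0, 3.0, 4.0], "X2": [2.0, 1.0, 4.0, 3.0], "Y": [1.0, 3.0, 2.0, 5.0]}
)
masked = lower_triangle(toy.corr())
print(masked)
```

Passing `lower_triangle(df.corr())` to `plt.matshow` should leave the masked cells blank, since matplotlib does not colour NaN entries.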

In [None]:
hms = plt.matshow(df.corr(), cmap="RdBu_r", vmin=-1, vmax=1)
hcb = plt.colorbar(hms)

#### Mapping
Let's look at that geographical distribution a little more, in a "normal" orientation

When we do, we can see the distribution is strange: there's a big "hole" to the SW of the main cluster. From the lat/lon values and a comparison with Google Maps, I think the cluster is in [Xindian district](https://en.wikipedia.org/wiki/Xindian_District), and that we're looking at roughly [this bounding box](https://www.google.com/maps/place/Xindian+District,+New+Taipei+City,+Taiwan/@24.9665234,121.4967858,8124m/data=!3m1!1e3!4m5!3m4!1s0x346803de337f4fe1:0xe29baf27fbf0968f!8m2!3d24.978282!4d121.5394822), so the hole is possibly due to terrain (Maria A's point)

In [None]:
hjp = sns.jointplot(x="X6 longitude", y="X5 latitude", data=df)

# Proper mapping-based exploration of data
Let's have a look at our various variables in a lat-lon sense - i.e. on a "proper" map, to see what insights we can glean

In [None]:
varibs = [
    "Y house price of unit area",
    "X3 distance to the nearest MRT station",
    "X4 number of convenience stores",
    "X2 house age",
    "X1 transaction date",
]

fig, axes = plt.subplots(ncols=2, nrows=3, sharex=True, sharey=True, figsize=[10,8],
                         subplot_kw={'projection': ccrs.PlateCarree()})

# Set the background color of the fig and subplots to something that helps show the data
# over its full range without changing away from the default viridis cmap
back_col = "darkgrey"
fig.patch.set_facecolor(back_col)

# Name lats and lons for convenience
lons = df["X6 longitude"]
lats = df["X5 latitude"]

# Get the (tight) bounds
x0, x1 = lons.min(), lons.max()
y0, y1 = lats.min(), lats.max()
m = 0.01  # Add a margin - set_xmargin / plt.margins seem to be ignored :(

vaxes = axes.flatten()[:-1]  # 5 variables, but 6 axes - make our zip simpler by lopping off!
axes[-1,-1].set_visible(False)  # Hide unused final axis https://stackoverflow.com/a/10035974
for varib, vax in zip(varibs, vaxes):
    plt.sca(vax)
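    # background_patch is deprecated in newer cartopy in favour of the standard patch,
    # so fall back to that if it is missing (https://github.com/SciTools/cartopy/issues/880)
    getattr(vax, "background_patch", vax.patch).set_facecolor(back_col)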
    vax.set_extent([x0-m, x1+m, y0-m, y1+m], ccrs.PlateCarree())
    
    hs = plt.scatter(lons, lats, c=df[varib], transform=ccrs.PlateCarree(), s=1)
    plt.colorbar(hs)
    
    plt.title(varib)

## Analysis of map
Interesting! 
* Y: house prices generally max out in Xindian
* X3: fairly radial increase in "distance to nearest MRT" away from Xindian centre
  * Makes ~sense - looks like [Xindian MRT stop is a terminus](https://upload.wikimedia.org/wikipedia/commons/0/0a/Taipei_Metro_geographical_map.svg) - so would get a radial dependence away from that final node
* X4: number of convenience stores *within* Xindian does ~mimic the banana-like feature seen in house prices there
  * Unclear whether this is causative (house prices depend on # of convenience stores) or correlative (you get more convenience stores where people have disposable income; latter also correlates with house prices)
  * Distinction probably doesn't matter for modelling purposes - either way, given feature seems to show up, prob worth including this as explanatory variable!
* X2: oldest houses (~40 y/o) are in Xindian. Then medium age (~30-10 y/o) everywhere. Newest <10 y/o generally near Xindian
* X1: nothing obvious in this view

Does look like there's quite a bit of duplicated info here (X3:X6), so worth being parsimonious about what we throw at the regression. 

We really need a verification chain to work out what is worth including, and to think carefully about how to correct for the "dumb"/meaningless skill inflation obtained by simply adding another variable - aka overfitting.
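
One simple way to penalise skill gained purely by adding a variable is adjusted $R^2$. A minimal sketch, assuming an ordinary least-squares fit with an intercept; this is one candidate check, not the verification chain the notebook ends up with. `r2_and_adjusted_r2` is a hypothetical helper and the toy data is invented:

```python
import numpy as np


def r2_and_adjusted_r2(X, y):
    """Fit ordinary least squares with an intercept; return (R^2, adjusted R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])  # prepend the intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adjusted


# Toy data, invented for illustration: y depends on `useful` only
rng = np.random.default_rng(0)
useful = rng.normal(size=50)
junk = rng.normal(size=50)
y = 2.0 * useful + rng.normal(scale=0.5, size=50)

r2_one, adj_one = r2_and_adjusted_r2(useful[:, None], y)
r2_two, adj_two = r2_and_adjusted_r2(np.column_stack([useful, junk]), y)
print(r2_one, adj_one)
print(r2_two, adj_two)
```

Plain $R^2$ can never go down when a column is added, whereas adjusted $R^2$ charges for the extra parameter. Scoring on held-out data (e.g. k-fold cross-validation) would be a stronger check still.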

Prob not worth transforming any of the variables - only contender is poss X3, but that would be just throwing data away?
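
The notebook doesn't say which transform it has in mind for X3. Assuming it is a log transform (a common choice for ~exponential distributions), a minimal sketch with invented distances suggests nothing would be thrown away, since the transform is monotonic and invertible:

```python
import numpy as np

# Toy distances in metres, invented for illustration
toy_distance = np.array([25.0, 300.0, 670.0, 2000.0, 6500.0])

logged = np.log1p(toy_distance)  # log(1 + x): compresses the long tail
restored = np.expm1(logged)  # exact inverse of log1p

print(logged)
print(restored)
```

The ordering of the values is preserved and the originals are recovered, so the main cost would be interpretability of the fitted coefficient rather than lost data.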

Will stop here for now - need to get hands dirty with actual modelling next!