## Working with the AirBnB Data

This section is here to help if you'd like more practice working with the _full_ Airbnb data set. There are *over 100 columns* that you could work with, so there's plenty to sink your teeth into if you so desire.

<div style="color:red;border:dotted 1px red;padding:4px;margin-bottom:4px;margin-top:4px;background:rgb(239,205,205)"><b>This section is best undertaken <i>after</i> Week 4!</b></div>

### Randomness & Reproducibility

The rapid visualisation of the _full_ Airbnb data set using Geopandas/PySAL is hard: there's simply so much of it that visualisation is slow unless you're in a dedicated environment with lots of RAM. So, for the _exploratory_ part of our work we'd normally want to work with a _sample_ -- but what happens if every time we take a sample we get a _different_ sample? That obviously makes things a bit harder: it would be handy if we could get the _same_ random sample every time _while_ we're doing our testing and development, before expanding to the full data set.

That's where `random.seed` and `np.random.seed` come in: by setting a seed we ensure that any process based on random numbers or random sampling will be reproducible. In other words, we'll get the _same_ random sample each time. To understand why this happens you'd need to read up on pseudo-randomness and computers, but that's not really relevant here. Note, however, that we set the random seed in two places: in Python in general (`random`) and in numpy (`np.random`), because the latter is what Pandas usually uses.

In [None]:
import matplotlib as mpl
mpl.use('TkAgg')
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import os
import pandas as pd
import seaborn as sns
import geopandas as gpd

### Working with 'Random' on a Computer 

The next section is how we ensure that our sample is 'reproducible'. It's a bit hard to get your head around, but 'random' samples on computers are only *pseudo*-random, so if we give the *same* computer the same initial conditions (the 'seed' in the language of computer science) then we will get the *same* random sample. 
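
As a minimal sketch of the idea (using a made-up data frame called `toy`, not the Airbnb data), re-seeding numpy with the same value before each call to `sample` should give back exactly the same rows. Passing `random_state` to `sample` is another way to get the same effect for a single call:

```python
import numpy as np
import pandas as pd

# A made-up data frame standing in for the real data
toy = pd.DataFrame({'value': range(100)})

# Same seed before each draw -> same 'random' rows
np.random.seed(42)
first = toy.sample(frac=0.1)

np.random.seed(42)
second = toy.sample(frac=0.1)

print(first.index.equals(second.index))

# Alternatively, fix the seed for one call only via random_state
a = toy.sample(frac=0.1, random_state=42)
b = toy.sample(frac=0.1, random_state=42)
print(a.equals(b))
```

The seed value itself (42 here) is arbitrary: what matters is that it is the _same_ value each time.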

In [None]:
import numpy as np
np.random.seed(12345789) # For reproducibility

import random 
random.seed(123456789) # For reproducibility

### What's Going on Here?

See if you can figure out what this code does and why... Replace this text with text that helps you to remember how this works.

In [None]:
local_path  = os.path.join('airbnb.csv.gz')
remote_path = 'http://data.insideairbnb.com/united-kingdom/england/london/2019-09-14/data/listings.csv.gz'

if os.path.exists(local_path):
    df = pd.read_csv(local_path, compression='gzip', low_memory=False)
else:
    df = pd.read_csv(remote_path, compression='gzip', low_memory=False)
    # index=False stops pandas writing the row index as an extra unnamed column
    df.to_csv(local_path, compression='gzip', index=False)

print("Full data set shape is: " + ' by '.join(str(i) for i in df.shape))

In [None]:
sample = df.sample(frac=0.1)
print("Sample data set shape is: " + ', '.join(str(i) for i in sample.shape))
sample.describe()

In [None]:
sample.columns.values

### A Simple Sanity Check

You can _always_ do a 'quick and dirty' scatter plot as a map to see if the data seems vaguely sensible -- it's obviously limited as a geo-visualisation but it can give you an _idea_ of whether or not you've done the right thing with your data. For example...

In [None]:
sns.jointplot(x="???", y="???", data=sample)

You'll notice that this 'map' isn't particularly good, but it does tell us that the longitude and latitude values are reasonable: you'd expect to find more AirBnB listings towards the middle of the city and there's a _hint_ of the Thames and the Lee Valley in there (though this is a bit of a stretch). To actually _map_ the data we'll need to be a little more rigorous... 
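
A numeric check can complement the picture. The sketch below uses a made-up `toy` data frame and a deliberately rough bounding box around Greater London (the box limits are an assumption for illustration, not official boundaries) to flag rows whose coordinates fall outside the area you'd expect:

```python
import pandas as pd

# Made-up coordinates: three plausible London points and one that is clearly elsewhere
toy = pd.DataFrame({
    'longitude': [-0.1278, -0.0754, 2.3522, -0.3000],
    'latitude':  [51.5074, 51.5055, 48.8566, 51.4500],
})

# Rough limits (assumed for illustration only)
lon_min, lon_max = -0.6, 0.4
lat_min, lat_max = 51.2, 51.8

# True where both coordinates fall inside the box
in_box = toy.longitude.between(lon_min, lon_max) & toy.latitude.between(lat_min, lat_max)

# Rows that fail the check and deserve a closer look
print(toy[~in_box])
```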

### Data Cleaning

This is 'raw' data, meaning that it _definitely_ contains a lot of problematic fields. I would suggest that you select the fields that you're most interested in and examine them in more detail to see if they are as useful as you might think, or whether additional effort is required to *make* them useful.

One thing you *might* want to think about is whether there are fields that share the *same* data quality issues. In that case, you might want to think about grouping them together in lists so that you can apply the same cleaning rules to all of them. This will make your code easier to read and more elegant to boot! For example:
```python
currency_fields = ['price','weekly_price','monthly_price']
date_fields = ['host_since','calendar_updated']

for f in currency_fields:
    # ... do something consistent with processing currencies
    pass
for f in date_fields:
    # ... do something consistent with processing dates
    pass
```
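
To make that pattern a little more concrete, here is one possible sketch on a tiny made-up data frame (`toy` and its values are invented for illustration; the real fields may need different rules). Currency strings are stripped of `$` and `,` before conversion to numbers, and anything that can't be parsed as a date becomes `NaT`:

```python
import pandas as pd

# Invented values that mimic the kinds of problems found in raw data
toy = pd.DataFrame({
    'price':        ['$1,250.00', '$85.00', None],
    'weekly_price': ['$500.00', None, '$2,100.00'],
    'host_since':   ['2015-03-01', '2018-11-23', 'not a date'],
})

currency_fields = ['price', 'weekly_price']
date_fields = ['host_since']

for f in currency_fields:
    # regex=False so that '$' is treated as a literal character
    cleaned = toy[f].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
    toy[f] = pd.to_numeric(cleaned, errors='coerce')

for f in date_fields:
    # Unparseable values become NaT instead of raising an error
    toy[f] = pd.to_datetime(toy[f], errors='coerce')

print(toy)
print(toy.dtypes)
```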

In [None]:
sample.price.describe() # Here's an example of a problem field to get you started...

### Exploring the Data

Use the code box below (and create more as needed) to explore distributions and the potential utility of the fields of interest. There is *no* right answer: this is a chance to check that you can perform analysis on multiple columns after removing poor-quality data (is it invalid? likely to be incorrect? or simply 'not a number'? how do you handle these?) and to think, perhaps, about how things like price and type of accommodation vary by borough or neighbourhood...
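
As one possible starting point, the sketch below drops missing and zero prices from a made-up data frame and then summarises what is left by area and room type. The column names `borough` and `room_type` and all of the values are assumptions for illustration, so check `sample.columns` for the real field names:

```python
import numpy as np
import pandas as pd

# Invented listings, including one missing and one zero price
toy = pd.DataFrame({
    'borough':   ['Camden', 'Camden', 'Hackney', 'Hackney', 'Hackney'],
    'room_type': ['Entire home/apt', 'Private room', 'Private room',
                  'Entire home/apt', 'Private room'],
    'price':     [120.0, 45.0, np.nan, 0.0, 60.0],
})

# Keep only rows with a usable price
valid = toy[toy.price.notna() & (toy.price > 0)]

# Count and median price for each borough / room type combination
summary = valid.groupby(['borough', 'room_type']).price.agg(['count', 'median'])
print(summary)
```

Whether a zero price should be dropped, or investigated, is a judgement call that depends on what you think it means in the data.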

## Mapping the Data

<div style="color:red;border:dotted 1px red;padding:4px;margin-bottom:4px;margin-top:4px;background:rgb(239,205,205)"><b>This section is best undertaken <i>after</i> Week 7!</b></div>

Let's step through what's going on below:
1. We need to import the `Point` class so that we can move from separate x and y columns to a single 'point' that Geopandas can work with.
2. We then 'zip' up the x and y (i.e. longitude and latitude, in that order) into pairs -- think of this as a simple way to pair _each_ x and y based on their row position and this allows us to move from separate columns to actual points.
3. The next step is to tell Geopandas what projection our data is in -- raw lat and long are _usually_ recorded in WGS84 which has the EPSG identifier 4326 (_i.e._ epsg:4326).
4. You'll notice that to create a new `GeoDataFrame` we do so _slightly_ differently from how we created a new `DataFrame` last term: we pass in the existing `pandas` data frame (`sample`), the CRS (projection), and finally the `geometry` that we created from the `zip` process.
5. The last step is to reproject the geometry into OSGB (Ordnance Survey GB) which has the EPSG identifier 27700. 

You can see the results in the final step, where we print out the first few rows of the reprojected data: notice that the point coordinates are no longer in lat/long!

Some of this _might_ seem a little tedious, but it's incredibly useful to be able to automate this process: we can reproject a whole series of shapefiles (e.g. every single file in a directory!), we can convert CSV files into shapes that load automatically into QGIS instead of having to do this process manually...
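
If the 'zip' step in the list above feels opaque, this small sketch (on a made-up `toy` data frame) shows what it produces before any geometry is involved: one tuple per row, with x (longitude) first and y (latitude) second.

```python
import pandas as pd

# Three invented coordinate pairs
toy = pd.DataFrame({
    'longitude': [-0.1278, -0.0754, -0.3000],
    'latitude':  [51.5074, 51.5055, 51.4500],
})

# zip pairs values by row position: x (longitude) first, then y (latitude)
pairs = list(zip(toy.longitude, toy.latitude))

for x, y in pairs:
    print(x, y)
```

Each of these tuples is what gets handed to `Point(...)` in the code that follows.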

In [None]:
from shapely.geometry import Point
# Convert x,y to Points using zip(...)
geometry = [Point(xy) for xy in zip(sample.longitude, sample.latitude)]

eg_sz = 4 # How many rows of 'example' to show
print("From the data frame...")
print(sample.head(eg_sz)[['longitude','latitude']])
print("-" * 50)

print(" ")
print("From the geometry zip...")
print([", ".join([str(p.x), str(p.y)]) for p in geometry[:eg_sz]])
print("-" * 50)

print(" ")
# Initialise to WGS84 (EPSG:4326); the older {'init': 'epsg:4326'}
# dictionary form is deprecated in recent versions of Geopandas/pyproj
sdf = gpd.GeoDataFrame(sample, crs='EPSG:4326', geometry=geometry)

# Reproject into OSGB (EPSG:27700)
sdf = sdf.to_crs(epsg=27700)

# Check it worked (coordinates no longer in lat/long)
print(sdf.head(eg_sz)[['neighbourhood','geometry']])

# And save it as a new shapefile (creating the output directory if needed)
os.makedirs('shapes', exist_ok=True)
sdf.to_file(os.path.join('shapes','AirBnB-Sample.shp'))

print("Done.")

Plotting the AirBnB sample will take some time... so be patient! You'll notice that the results are now also reported in OSGB units, not lat/long, so this is one way in which GeoPandas is more 'knowledgeable' about geodata than pandas.

In [None]:
# Ensures that we work with the sample we just saved in case
# we want to adjust our processing and don't want to have 
# to re-run the entire analysis just because we've overwritten 
# a column (see next step)
sdf = gpd.read_file(os.path.join('shapes','AirBnB-Sample.shp'))

In [None]:
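# The format in the full data set is $1,250.00 so we need to deal with that.
# regex=False makes pandas treat '$' as a literal character rather than
# as the regular-expression 'end of string' anchor.
sdf['price'] = sdf.price.str.replace('$','', regex=False).str.replace(',','', regex=False).astype('float').fillna(0.0)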

In [None]:
# Check we've got something plot-able
sdf.price.describe()

In [None]:
# Check distribution (could probably use a transform)
# distplot is deprecated from seaborn 0.11 onwards; histplot is its replacement
sns.histplot(sdf.price, kde=True)

### WTF?

If you didn't fix it in the `pandas` section above, then I'd suggest that some of the data values in the price distribution are definitely problematic: what would you be paying \$20,000 for? Or even \$10,000? This requires investigation.
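
One way to start that investigation is sketched below on a made-up series of prices (the values are invented, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the 'right' threshold for this data):

```python
import pandas as pd

# Invented nightly prices with two implausibly large values
prices = pd.Series([35.0, 50.0, 65.0, 80.0, 95.0,
                    120.0, 150.0, 200.0, 10000.0, 20000.0])

# How far do the upper quantiles sit from the median?
print(prices.quantile([0.5, 0.9, 0.99]))

# Flag values above Q3 + 1.5 * IQR as suspect
q1, q3 = prices.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
suspect = prices[prices > upper]

print("Upper fence:", upper)
print(suspect)
```

Flagged values aren't necessarily wrong, but they are the ones worth looking at row by row before deciding whether to keep, cap, or drop them.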

In [None]:
# Investigate price distribution here if you 
# haven't already done so.

In [None]:
# And make a map
f, ax = plt.subplots(1, figsize=(15, 11))
sdf.plot(ax=ax, column='price', cmap='OrRd', scheme='quantiles', k=5, edgecolor=None, legend=True, s=1.5)
plt.axis('equal')