<div style="float:left">
            <h1 style="width:450px">Data Creation Notebook</h1>
    <h2 style="width:450px">Assemble Training Data for <i>Intro to Programming</i></h2>
</div>
<div style="float:right"><img width="100" src="https://github.com/jreades/i2p/raw/master/img/casa_logo.jpg" /></div>

This notebook creates the composite data set used in the Geocomputation module. You are welcome to add further data sets for the final assessment if you wish.

In [None]:
import matplotlib as mpl
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import pandas as pd
import geopandas as gpd

## Airbnb Listings

We take the large data set here since we want to give the students maximum flexibility to find attributes of interest in the data set.

Assemble the URL from separate parts so that it is easy to change the extract by date, geography, or file type.

In [None]:
import os
from urllib.parse import urlunsplit
date   = '2020-08-24'
fn     = 'listings.csv.gz'
scheme = os.environ.get("SCHEME", "http")
site   = os.environ.get("NETLOC", "data.insideairbnb.com")
# URL paths always use '/', so join with '/' rather than os.path.join
# (which would insert backslashes on Windows)
loc    = '/'.join(['united-kingdom','england','london'])
data   = '/'.join(['data',fn])
path   = f"{loc}/{date}/{data}"
url    = urlunsplit((scheme, site, path, "", ""))

# Where to write data (create the directory if it does not already exist)
opath  = os.path.join('data','src')
os.makedirs(opath, exist_ok=True)
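
As a minimal sketch of the same idea, the URL assembly can be wrapped in a small function so that changing the date or city is a one-argument change. The function name `build_listings_url` and its parameters are hypothetical and not part of the original notebook; the default host and file name are taken from the cell above.

```python
from urllib.parse import urlunsplit

def build_listings_url(date, city_path, fn="listings.csv.gz",
                       scheme="http", site="data.insideairbnb.com"):
    """Assemble an Inside Airbnb extract URL from its parts (hypothetical helper)."""
    # URL paths always use '/', regardless of the operating system
    path = "/".join([city_path.strip("/"), date, "data", fn])
    # Empty strings stand in for the unused query and fragment components
    return urlunsplit((scheme, site, path, "", ""))

url = build_listings_url("2020-08-24", "united-kingdom/england/london")
print(url)
```

Only the arguments change when a new extract is needed; whether a given date or city actually exists on the server is not checked here.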

Download and check the shape of the data set

In [None]:
df = pd.read_csv(url, low_memory=False) # low_memory=False parses the file in one pass, so each column gets one consistent dtype
print("Data frame shape: " + str(df.shape)) # What is the shape of the data?

Write it to a CSV. Pandas infers the compression scheme from the file extension given in 'fn'.

In [None]:
df.to_csv(os.path.join(opath,f"{date}-{fn}"), index=False)
print(df.columns.values)
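
The following sketch illustrates the inferred compression on a toy data frame (the values are invented for illustration). It assumes pandas' default `compression='infer'` behaviour, which picks gzip for a `.gz` extension on both write and read.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for the listings data (invented values)
toy = pd.DataFrame({"id": [1, 2, 3], "price": ["$10.00", "$25.50", "$8.00"]})

with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "2020-08-24-listings.csv.gz")
    toy.to_csv(out, index=False)      # compression is inferred from '.gz'
    with open(out, "rb") as f:
        magic = f.read(2)             # gzip files start with bytes 0x1f 0x8b
    back = pd.read_csv(out)           # decompression is inferred the same way

print(magic == b"\x1f\x8b", back.shape)
```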

Take a subsample for use in Weeks 2–4.

In [None]:
cols = ['id','name','description','host_id','host_name','host_since','latitude','longitude',
        'property_type','room_type','accommodates','bathrooms','bedrooms','beds','price',
       'minimum_nights','maximum_nights','availability_30','availability_60','availability_90','availability_365',
       'number_of_reviews','first_review','last_review','review_scores_rating','calculated_host_listings_count']
dfs = df.sample(100, random_state=42)[cols]
print("Data frame shape: " + str(dfs.shape))

dfs.to_csv(os.path.join(opath,f"{date}-sample-{fn}"), index=False)
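
Fixing `random_state` is what makes the subsample reproducible from one run to the next. A minimal sketch on an invented toy data frame, selecting a subset of columns in the same way as the cell above:

```python
import pandas as pd

# Toy data frame with one column we do not want to keep (invented values)
toy = pd.DataFrame({"id": range(1000), "price": range(1000), "notes": ["x"] * 1000})
keep = ["id", "price"]

# The same seed yields the same rows each time the cell is run
first = toy.sample(100, random_state=42)[keep]
second = toy.sample(100, random_state=42)[keep]

print(first.shape, first.equals(second))
```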

Now convert the entire data frame to a geo-data frame.

In [None]:
# Convert lat/long to points
from shapely.geometry import Point
geometry = [Point(xy) for xy in zip(df.longitude, df.latitude)]

# Drop these columns and reproject into OSGB (British National Grid, EPSG:27700)
df.drop(['longitude', 'latitude'], axis=1, inplace=True)
airbnb = gpd.GeoDataFrame(df, crs="EPSG:4326", geometry=geometry)
airbnb = airbnb.to_crs('EPSG:27700')
airbnb.plot(marker='*', color='green', markersize=1)
del df
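
As an alternative sketch (not the approach used in the cell above), GeoPandas also provides `points_from_xy`, which builds the point geometries in a single vectorised call. The example below uses one invented row with approximate coordinates for Trafalgar Square and checks that the reprojected values land in a plausible range for central London on the British National Grid.

```python
import pandas as pd
import geopandas as gpd

# One invented row; coordinates are approximate
toy = pd.DataFrame({"name": ["Trafalgar Square"],
                    "longitude": [-0.1281], "latitude": [51.5080]})

pts = gpd.GeoDataFrame(
    toy.drop(columns=["longitude", "latitude"]),
    geometry=gpd.points_from_xy(toy.longitude, toy.latitude),  # x=lon, y=lat
    crs="EPSG:4326",
).to_crs("EPSG:27700")  # reproject from WGS84 to British National Grid

easting, northing = pts.geometry.x.iloc[0], pts.geometry.y.iloc[0]
print(round(easting), round(northing))
```

Note the argument order: longitude is the x value and latitude is the y value, matching the `zip(df.longitude, df.latitude)` order used above.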

## LSOA Data

Since we don't really cover spatial joins in this class, I've appended the 2011 LSOA values here so that students can add other information of interest and/or aggregate the listings in useful ways.

In [None]:
lsoas = gpd.read_file('https://github.com/jreades/i2p/raw/master/data/src/LSOAs.gpkg', driver='GPKG')
lsoas.plot()

## Join the Data

Join the LSOA and Inside Airbnb data sets.

In [None]:
# Spatial join in a way that preserves the Inside Airbnb coordinates
# (GeoPandas >= 0.10 names this argument predicate=; older releases used op=)
gdf = gpd.sjoin(airbnb, lsoas, how="left", predicate='within', rsuffix='r')
gdf.drop(columns=['index_r','objectid','lsoa11nmw','st_lengths'], inplace=True)
gdf.plot(marker='*', color='green', markersize=0.5)
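
For readers new to spatial joins, here is a minimal sketch on invented geometries: two square zones and three points, one of which falls outside both zones. It assumes GeoPandas 0.10 or later, where the argument is called `predicate`. With `how="left"`, every point is kept, retains its own geometry, and gains the attributes of the zone that contains it.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two adjacent unit squares standing in for LSOA polygons (invented)
zones = gpd.GeoDataFrame({"zone": ["A", "B"]},
                         geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1)])

# Three points standing in for listings; the last lies outside both zones
pts = gpd.GeoDataFrame({"listing": [1, 2, 3],},
                       geometry=[Point(0.5, 0.5), Point(1.5, 0.5), Point(5, 5)])

# Left join: all points are kept; unmatched points get missing zone values
joined = gpd.sjoin(pts, zones, how="left", predicate="within")
print(joined[["listing", "zone"]])
```

The unmatched third point is the reason a left join is useful here: listings that fall outside every polygon remain in the output rather than being silently dropped.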

## Wrapping Up

Save the final output for use by students. In the interests of having some kind of versioning, we capture the date and file name from the Inside Airbnb web site. Changing the URL definition above should ensure that we don't mindlessly overwrite each year's data.

In [None]:
import os
gpkg = f"{date}-{fn}".replace('csv.gz','gpkg')

# And save.
gdf.to_file(os.path.join(opath,gpkg), driver='GPKG')
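
The naming scheme can be sketched as a small function. The name `versioned_name` is hypothetical and not part of the original notebook; it reproduces the date-prefixed file name built above and shows that a different extract date yields a different output file.

```python
def versioned_name(date, fn, ext="gpkg"):
    """Build '<date>-<base>.<ext>' from a source file name (hypothetical helper)."""
    # Keep everything before the first '.', e.g. 'listings' from 'listings.csv.gz'
    base = fn.split(".", 1)[0]
    return f"{date}-{base}.{ext}"

this_year = versioned_name("2020-08-24", "listings.csv.gz")
next_year = versioned_name("2021-08-24", "listings.csv.gz")
print(this_year, next_year)
```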