## Data Matching - EPC

This notebook matches EPC records to building polygons so that the EPC data can be geo-referenced. The methodology followed below is described in Section 3.5 of the dissertation.

In [1]:
import geopandas as gpd
import pandas as pd
import numpy as np
import glob
import matplotlib 
%matplotlib inline

In [2]:
# Read geojson file of buildings - this file was previously created in the Initial_cleaning.R document
buildings= gpd.read_file("Data/BuildingData/FinalBuildings.geojson")


In [7]:
buildings.columns

Index(['fid', 'featurecode', 'version', 'versiondate', 'theme',
       'calculatedareavalue', 'changedate', 'reasonforchange',
       'descriptivegroup', 'descriptiveterm', 'make', 'physicallevel',
       'physicalpresence', 'poly_broken', 'NUMPOINTS', 'geometry'],
      dtype='object')

In [8]:
#Remove unwanted columns
#columns to be deleted:
delete=["poly_broken","physicalpresence","physicallevel","make","descriptiveterm","descriptivegroup","reasonforchange","changedate","versiondate","theme"]

#drop those columns
buildings.drop(delete, axis=1, inplace=True)

### !!! Reading AddressBase 

The data matching process in this work relies on AddressBase, a product supplied by the Ordnance Survey under a special license. For this reason, the AddressBase files are not shared alongside the other data on GitHub. A license can be requested from the Ordnance Survey (https://www.ordnancesurvey.co.uk/business-government/products/addressbase). This work used AddressBase Core.

In [10]:
#read the AddressBase files and merge into a single dataframe
path = "./Data/AddressBase" # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename,  index_col=None, header=0)
    li.append(df)


AddressBase = pd.concat(li, axis=0, ignore_index=True)

In [11]:
#Let's try to join UPRN and buildings by TOID

AddressBaseSimple=AddressBase[['TOID','UPRN','SINGLE_LINE_ADDRESS']]

BuildingsUPRN=pd.merge(buildings, AddressBaseSimple, how='left', left_on='fid',right_on='TOID')

#Remove rows with a missing value in any column (this includes buildings with no AddressBase match)
BuildingsUPRN.dropna(inplace=True)
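
Because `dropna` discards every building without an AddressBase record, it can help to check how many polygons fail to match before dropping them. The following is a minimal sketch on toy data, not part of the original workflow: the identifiers and UPRN values are invented, and it relies on the `indicator` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the buildings and AddressBase tables (all values are invented)
toy_buildings = pd.DataFrame({"fid": ["osgb1", "osgb2", "osgb3"]})
toy_addressbase = pd.DataFrame({"TOID": ["osgb1", "osgb1", "osgb3"],
                                "UPRN": [100, 101, 102]})

# indicator=True adds a "_merge" column recording where each row came from
checked = pd.merge(toy_buildings, toy_addressbase, how="left",
                   left_on="fid", right_on="TOID", indicator=True)

# Buildings with no address record are flagged "left_only"
unmatched_fids = checked.loc[checked["_merge"] == "left_only", "fid"].tolist()
print(unmatched_fids)
```

Here only `osgb2` has no address record, so it is the only building that the subsequent `dropna` would remove.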

In [12]:
#Now try to flatten database and get UPRNS into a list
BuildingsUPRN.head(10)

#convert UPRN to string to match
BuildingsUPRN["UPRN"] = BuildingsUPRN["UPRN"].astype(np.int64)
BuildingsUPRN["UPRN"] = BuildingsUPRN["UPRN"].astype(str)

In [13]:
print(BuildingsUPRN.dtypes)

fid                      object
featurecode               int64
version                   int64
calculatedareavalue      object
NUMPOINTS               float64
geometry               geometry
TOID                     object
UPRN                     object
SINGLE_LINE_ADDRESS      object
dtype: object


In [14]:
#creates a list of UPRNs for each fid
fidUPRN = BuildingsUPRN.groupby(['fid'])['UPRN'].apply(' , '.join).reset_index()

In [15]:
#get count of individual properties in each unique building
countofproperties=BuildingsUPRN.groupby(['fid'])['UPRN'].agg(['count'])

In [16]:
#save as csv
countofproperties.to_csv('Data/BuildingData/countofproperties.csv')

### Matching EPC dataset

In [18]:
#Simplify AddressBase with what we want for address matching
AddressBase2=AddressBase[["UPRN","SINGLE_LINE_ADDRESS","SUB_BUILDING","BUILDING_NAME","BUILDING_NUMBER","STREET_NAME","POSTCODE","TOID"]]


In [23]:
#!!! Read the full EPC dataset for Westminster - obtained from https://epc.opendatacommunities.org/

EPC= pd.read_csv('Data/EnergyData/certificates/certificates_westminster.csv')
EPC.shape


(126400, 90)

In [24]:
#convert datatype
EPC["NUMBER_HABITABLE_ROOMS"] = EPC['NUMBER_HABITABLE_ROOMS'].astype('float64')

In [25]:
#clean up string data: replace missing values in text columns with empty strings
EPC_ob = EPC.select_dtypes(['object'])
EPC[EPC_ob.columns] = EPC_ob.fillna('')

EPC.head()

Unnamed: 0,LMK_KEY,ADDRESS1,ADDRESS2,ADDRESS3,POSTCODE,BUILDING_REFERENCE_NUMBER,CURRENT_ENERGY_RATING,POTENTIAL_ENERGY_RATING,CURRENT_ENERGY_EFFICIENCY,POTENTIAL_ENERGY_EFFICIENCY,...,MECHANICAL_VENTILATION,ADDRESS,LOCAL_AUTHORITY_LABEL,CONSTITUENCY_LABEL,POSTTOWN,CONSTRUCTION_AGE_BAND,LODGEMENT_DATETIME,TENURE,FIXED_LIGHTING_OUTLETS_COUNT,LOW_ENERGY_FIXED_LIGHT_COUNT
0,1414941369242016021822181047269808,Flat 23 Chalfont Court,Baker Street,,NW1 5RS,7273662478,D,C,60,78,...,natural,"Flat 23 Chalfont Court, Baker Street",Westminster,Westminster North,LONDON,England and Wales: before 1900,2016-02-18 22:18:10,owner-occupied,,
1,442870599502017040108551778237198,Clifton Cottage,101 Clifton Hill,,NW8 0JR,1549892768,D,A,57,96,...,natural,"Clifton Cottage, 101 Clifton Hill",Westminster,Westminster North,LONDON,England and Wales: 2007 onwards,2017-04-01 08:55:17,rental (private),,
2,1367006411532015092309384201278507,Flat 3,1-3 Brewer Street,,W1F 0RD,1781529378,C,B,74,81,...,natural,"Flat 3, 1-3 Brewer Street",Westminster,Cities of London and Westminster,LONDON,England and Wales: 1900-1929,2015-09-23 09:38:42,rental (private),,
3,168235820962008101710111102168878,Flat 26 Caroline House,Bayswater Road,,W2 4RQ,7037272568,D,D,57,59,...,natural,"Flat 26 Caroline House, Bayswater Road",Westminster,Westminster North,LONDON,England and Wales: 1950-1966,2008-10-17 10:11:11,rental (private),,
4,42239819922017061900423424838073,Flat 32,13 Craven Hill,,W2 3EN,7221144568,F,E,30,42,...,natural,"Flat 32, 13 Craven Hill",Westminster,Westminster North,LONDON,England and Wales: 1900-1929,2017-06-19 00:42:34,rental (private),,


In [26]:
#some properties have more than one certificate - keep only the latest certificate for each property
EPC=EPC.sort_values('INSPECTION_DATE').groupby('BUILDING_REFERENCE_NUMBER').tail(1)
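
The sort-then-`tail(1)` pattern can be checked on a small example. This is a sketch with invented certificate values, and it assumes the inspection dates are ISO-formatted strings (`YYYY-MM-DD`), which sort chronologically even without conversion to datetimes.

```python
import pandas as pd

# Toy certificates: property 1 has two certificates, property 2 has one (values invented)
toy_epc = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [1, 1, 2],
    "INSPECTION_DATE": ["2010-05-01", "2018-03-12", "2015-07-30"],
    "CURRENT_ENERGY_RATING": ["E", "C", "D"],
})

# Sort oldest to newest, then keep the last (most recent) row of each property
latest = (toy_epc.sort_values("INSPECTION_DATE")
                 .groupby("BUILDING_REFERENCE_NUMBER")
                 .tail(1))

latest_ratings = dict(zip(latest["BUILDING_REFERENCE_NUMBER"],
                          latest["CURRENT_ENERGY_RATING"]))
print(latest_ratings)
```

Property 1 keeps its 2018 certificate (rating C) and the 2010 certificate is dropped.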

In [27]:
#remove commas from the AddressBase addresses (assign returns a new DataFrame, which avoids the SettingWithCopyWarning)
AddressBase2 = AddressBase2.assign(SINGLE_LINE_ADDRESS=AddressBase2["SINGLE_LINE_ADDRESS"].str.replace(",", "", regex=False))


In [28]:
#convert EPC addresses to upper-case strings and strip whitespace from both ends of every text column
EPC['ADDRESS1'] = EPC['ADDRESS1'].astype(str).str.upper()
EPC['ADDRESS2'] = EPC['ADDRESS2'].astype(str).str.upper()
EPC['ADDRESS3'] = EPC['ADDRESS3'].astype(str).str.upper()

EPC_ob = EPC.select_dtypes(['object'])
EPC[EPC_ob.columns] = EPC_ob.apply(lambda x: x.str.strip())

In [29]:
#join EPC addresses
EPC['LINE_ADDRESS'] = EPC[['ADDRESS1', 'ADDRESS2', 'ADDRESS3']].agg(' '.join, axis=1)
EPC['LINE_ADDRESS']=EPC['LINE_ADDRESS'].str.replace(',','')
EPC['LINE_ADDRESS']=EPC['LINE_ADDRESS'].str.strip()

In [30]:
#Now we want to match the two datasets by postcode 
EPC_UPRN= pd.merge(EPC,AddressBase2,how='inner', on='POSTCODE')
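
The inner join on `POSTCODE` pairs every EPC record with every AddressBase address in the same postcode, so each postcode contributes (EPC records × addresses) candidate rows. A small sketch with invented addresses and UPRNs shows the effect:

```python
import pandas as pd

# Two EPC records and three AddressBase addresses (all values invented)
toy_epc = pd.DataFrame({"LINE_ADDRESS": ["FLAT 1 10 HIGH STREET", "12 HIGH STREET"],
                        "POSTCODE": ["W1 1AA", "W1 1AA"]})
toy_ab = pd.DataFrame({"UPRN": [100, 101, 102],
                       "POSTCODE": ["W1 1AA", "W1 1AA", "W2 2BB"]})

# Every EPC record is paired with every address sharing its postcode
candidates = pd.merge(toy_epc, toy_ab, how="inner", on="POSTCODE")
n_candidates = len(candidates)
print(n_candidates)
```

Two EPC records and two addresses share `W1 1AA`, giving four candidate pairs; the address in `W2 2BB` is never considered. EPC records whose postcode does not appear in AddressBase drop out at this step.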

In [42]:
#collapse any run of whitespace into a single space
EPC_UPRN['SINGLE_LINE_ADDRESS'] = EPC_UPRN['SINGLE_LINE_ADDRESS'].str.replace(r'\s+', ' ', regex=True)

In [31]:
#delete both AddressBase dataframes here to free up memory
del AddressBase2, AddressBase

In [33]:
#split addresses into separate words
EPC_UPRN["EPC_address"]=EPC_UPRN["LINE_ADDRESS"].str.split(" ")
EPC_UPRN["UPRN_address"]=EPC_UPRN["SINGLE_LINE_ADDRESS"].str.split(" ")

In [34]:
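#remove the last 3 elements of the single line address (presumably the post town and the two parts of the postcode, which the EPC address lines do not contain)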
EPC_UPRN['UPRN_address'] = EPC_UPRN['UPRN_address'].str[:-3]

In [35]:
#find the words the two addresses have in common
sets = [set(epc) & set(uprn)
        for epc, uprn in zip(EPC_UPRN['EPC_address'], EPC_UPRN['UPRN_address'])]

In [36]:
EPC_UPRN["sets"]=sets


In [37]:
#calculate similarity coefficient: number of shared words divided by the number of words in the EPC address
EPC_UPRN['coefficient']=EPC_UPRN['sets'].str.len()/EPC_UPRN['EPC_address'].str.len()
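
The coefficient can be reproduced for a single pair of addresses. The helper below is a hypothetical illustration (the function name is not from the notebook), and it assumes the AddressBase string has already had its trailing town and postcode tokens removed.

```python
def overlap_coefficient(epc_address: str, addressbase_address: str) -> float:
    """Share of EPC address words that also occur in the AddressBase address."""
    epc_tokens = epc_address.split(" ")
    ab_tokens = addressbase_address.split(" ")
    shared = set(epc_tokens) & set(ab_tokens)
    # Numerator counts distinct shared words; denominator counts all EPC words
    return len(shared) / len(epc_tokens)


# Identical addresses share all five words
score = overlap_coefficient("FLAT 3 1-3 BREWER STREET", "FLAT 3 1-3 BREWER STREET")

# A different flat number leaves four of the five EPC words in common
partial = overlap_coefficient("FLAT 3 1-3 BREWER STREET", "FLAT 4 1-3 BREWER STREET")
print(score, partial)
```

Note that a neighbouring flat in the same building still scores 0.8, which is why only the highest-scoring candidate for each certificate is kept in the steps that follow.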

In [38]:
#Group by and get the maximum coefficient for each EPC certificate
Group= EPC_UPRN.groupby(['BUILDING_REFERENCE_NUMBER'], sort=False)['coefficient'].max()

#Make a dataframe
Group=Group.to_frame().reset_index()

In [39]:

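#flag every candidate whose coefficient equals the maximum for its EPC certificate
idx = EPC_UPRN.groupby(['BUILDING_REFERENCE_NUMBER'])['coefficient'].transform('max') == EPC_UPRN['coefficient']

#keep only the best-scoring candidates
Group=EPC_UPRN[idx]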

In [40]:
#drop duplicates: where several candidates tie for the best score, keep the last one
Final_match=Group.drop_duplicates(subset=['BUILDING_REFERENCE_NUMBER'], keep='last')

In [41]:
#filter out ones that clearly don't have a match
Final_match = Final_match[Final_match['coefficient'] >= 0.5]
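
The selection of the best candidate and the 0.5 cut-off can be illustrated on toy data. The sketch below uses `idxmax` as a compact alternative to the transform-and-deduplicate steps above; it is not the notebook's own code, the values are invented, and on ties `idxmax` keeps the first candidate whereas the notebook keeps the last.

```python
import pandas as pd

# Candidate pairs for two certificates (all values invented)
toy_pairs = pd.DataFrame({
    "BUILDING_REFERENCE_NUMBER": [1, 1, 2, 2],
    "UPRN": [100, 101, 200, 201],
    "coefficient": [0.8, 1.0, 0.25, 0.4],
})

# Row label of the highest coefficient within each certificate
best_rows = toy_pairs.groupby("BUILDING_REFERENCE_NUMBER")["coefficient"].idxmax()
best = toy_pairs.loc[best_rows]

# Discard certificates whose best candidate shares fewer than half of the words
accepted = best[best["coefficient"] >= 0.5]
print(accepted["UPRN"].tolist())
```

Certificate 1 is matched to UPRN 101, while certificate 2 is rejected because its best candidate only reaches 0.4.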

In [42]:
Final_match.shape

(95864, 102)

In [43]:
#extract UPRN and Building Reference number
matched_EPC_UPRN= Final_match[['BUILDING_REFERENCE_NUMBER','UPRN']]

In [44]:
#convert UPRN to string so it matches the UPRN column in BuildingsUPRN (assign returns a new DataFrame, which avoids the SettingWithCopyWarning)
matched_EPC_UPRN = matched_EPC_UPRN.assign(UPRN=matched_EPC_UPRN["UPRN"].astype(str))


In [45]:
#save as csv
matched_EPC_UPRN.to_csv('Data/BuildingData/EPC_UPRN.csv')

In [46]:
#Now match this to the UPRN building polygons
Buildings_matched=pd.merge(BuildingsUPRN,matched_EPC_UPRN,how='inner', on='UPRN')

In [47]:
#merge with original certificate data
Buildings_matched_final=pd.merge(Buildings_matched,EPC,how='inner', on='BUILDING_REFERENCE_NUMBER')

In [48]:
#export as a final geojson to be used for modelling
Buildings_matched_final.to_file("Data/BuildingData/buildings_EPC.geojson", driver='GeoJSON')