# Web Scraping
In this notebook, I scrape building data from the carbonculture website for all 435 buildings reported on it. This data can be integrated into the Building Data Urban Genome 2 Project as a form of feature engineering to improve the accuracy of different models. My goal was to improve the performance of a classification model that predicts the energy rating of European buildings on a scale of A to G.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Here I read in the existing train and test datasets, which were created by performing EDA on the metadata dataset.

In [None]:
os.chdir('/kaggle/input/predicting-energy-rating-from-raw-data')
train_data = pd.read_csv('train_rating_eu.csv')
test_data = pd.read_csv('test_rating_eu.csv')

train_data = train_data.drop(['building_id', 'site_id', 'Unnamed: 0'], axis=1)
test_data = test_data.drop(['building_id', 'site_id', 'Unnamed: 0'], axis=1)

These are the main pages that I will be gathering information from. URL3 is the page that links to each location, URL1 is the list of buildings at one of the 12 locations included on this website, and URL2 is the page for one specific building.

In [None]:
import requests

URL1 = 'https://platform.carbonculture.net/communities/ucl/30/apps/assets/list/place/'
URL2 = 'https://platform.carbonculture.net/places/119-torrington-place/1155/'
URL3 = 'https://platform.carbonculture.net/about/'
page1 = requests.get(URL1)
page2 = requests.get(URL2)
page3 = requests.get(URL3)

In [None]:
# Save the raw HTML so the pages can be inspected offline.
# Using `with` closes each file, which the bare open() calls never did.
for name, page in [('output1', page1), ('output2', page2), ('output3', page3)]:
    with open('/kaggle/working/' + name, 'wb') as f:
        f.write(page.content)

In [None]:
from bs4 import BeautifulSoup

soup1 = BeautifulSoup(page1.content, 'html.parser')
soup2 = BeautifulSoup(page2.content, 'html.parser')
soup3 = BeautifulSoup(page3.content, 'html.parser')

The following creates a list of URLs corresponding to the 12 locations reported on this website. Each URL leads to a page listing all of the buildings reported for that location.

In [None]:
places_elems = soup3.find_all('a', href=True)
places_elems

In [None]:
places = []

# The location links are the anchors at positions 11 to 22 on the page.
for i, place in enumerate(places_elems):
    if 10 < i < 23 and place.text:
        places.append(place['href'])

places

In [None]:
urls=[]
base = 'https://platform.carbonculture.net'
end = 'apps/assets/list/place/'

for p in places:
    urls.append((base + p + end))
    
urls
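
The index window used above (`10 < i < 23`) depends on the order of links on the page, so it would break if the page layout changed. A minimal sketch of an alternative is to filter on the `href` itself, assuming every location link follows the `/communities/<slug>/<id>/` pattern seen in `URL1`. That pattern is inferred from a single URL, and the helper name `community_list_urls` and the sample hrefs are hypothetical.

```python
import re
from urllib.parse import urljoin

BASE = 'https://platform.carbonculture.net'
LIST_SUFFIX = 'apps/assets/list/place/'

# Assumed pattern: /communities/<slug>/<numeric id>/ (inferred from URL1 only).
COMMUNITY_HREF = re.compile(r'^/communities/[^/]+/\d+/$')

def community_list_urls(hrefs):
    """Return building-list URLs for hrefs that look like location pages."""
    seen = set()
    result = []
    for href in hrefs:
        if COMMUNITY_HREF.match(href) and href not in seen:
            seen.add(href)  # skip links that appear more than once
            result.append(urljoin(BASE, href + LIST_SUFFIX))
    return result

# Made-up hrefs for illustration only.
sample_hrefs = ['/about/', '/communities/ucl/30/', '/communities/ucl/30/', '/places/x/1/']
print(community_list_urls(sample_hrefs))
```

In the notebook this could be called as `community_list_urls([a['href'] for a in places_elems])`.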

In [None]:
soups=[]

for u in urls:
    p = requests.get(u)
    soups.append(BeautifulSoup(p.content, 'html.parser'))
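
The loop above sends requests back to back and does not check whether each one succeeded. The following is a sketch of a more defensive fetch, not what was run for this notebook: it adds a timeout, raises on HTTP errors, and pauses between requests. The function name `fetch_all` and the one-second delay are assumptions; `get` stands in for `requests.get`.

```python
import time

def fetch_all(urls, get, delay_seconds=1.0, timeout=10):
    """Fetch each URL in turn, pausing between requests.

    `get` is any callable that behaves like requests.get.
    """
    pages = []
    for n, url in enumerate(urls):
        if n > 0:
            time.sleep(delay_seconds)  # be polite to the server
        response = get(url, timeout=timeout)
        response.raise_for_status()  # stop early on a 4xx or 5xx response
        pages.append(response.content)
    return pages

# In the notebook: pages = fetch_all(urls, requests.get)
```

Each entry in `pages` can then be passed to `BeautifulSoup(..., 'html.parser')` as before.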

In [None]:
titles = []
titles_whole = []

for soup in soups:
    url_elems = soup.find_all(href=True)
    for elem in url_elems:
        if elem.text: 
            if elem['href'].find('places') == 1:
                titles.append(elem['href'])
                
for t in titles:
    titles_whole.append(base + t)

The following gathers data from the list of URLs created above. These URLs correspond to the individual buildings and were obtained by looping through the 12 location URLs and collecting each building URL into a new list. There are multiple buildings reported for each location. Specifically, this cell scrapes the year built, number of floors, number of occupants, and main heating type.

In [None]:
years = []
floors = []
heating = []
occupants = []
i = 0

for w in titles_whole:
    test = requests.get(w)
    soup_test = BeautifulSoup(test.content, 'html.parser')
    test_elems = soup_test.find_all('li', class_='assets-meta__list-item')
    if len(test_elems) == 0:
        print(w)
        years.append('-')
        floors.append(-1)
        heating.append('-')
        occupants.append(-1)
    for elem in test_elems:
        a = str(elem.find('span'))[6:-7]
        if i == 0:
            years.append(a)
        elif i == 1:
            floors.append(a)
        elif i == 3:
            heating.append(a)
        elif i == 4:
            occupants.append(a)
            i = -1
        i = i + 1
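
The loop above relies on a counter `i` that is shared across pages, so a page with a number of list items other than five would shift later values into the wrong lists. A minimal sketch of a per-page alternative maps the texts from one page to named fields by position. The positions are taken from the loop above; the helper name `parse_meta_items` and the single `'-'` placeholder for every missing field are assumptions.

```python
# Positions follow the loop above: item 0 is the year built, 1 the number of
# floors, 3 the main heating type, and 4 the number of occupants.
FIELD_POSITIONS = {
    'year built': 0,
    'floors': 1,
    'main heating type': 3,
    'no. occupants': 4,
}

def parse_meta_items(values, missing='-'):
    """Map one page's list-item texts to named fields, padding gaps with `missing`."""
    record = {}
    for field, position in FIELD_POSITIONS.items():
        record[field] = values[position] if position < len(values) else missing
    return record

# Made-up values for illustration only.
print(parse_meta_items(['1985', '4', 'x', 'Gas', '120']))
print(parse_meta_items([]))
```

Collecting one record per building in a list also keeps every field the same length, which `pd.DataFrame(records)` can consume directly.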

The following gathers data from the table on each general location page, which lists every building reported at that location. From this table I scraped the name of the building, annual energy consumption, annual energy consumption per area, rating, and usable floor area (sqm). The outer loop runs 12 times (once per location) and the nested loop runs through every table cell for the buildings reported at that location.

In [None]:
title = []
consumption = []
floor_area = []
consumption_area = []
rating = []

# Columns 2 to 4 hold numbers; each gets the same clean-up below.
numeric_columns = {2: consumption, 3: floor_area, 4: consumption_area}

col = 0  # position of the current cell within its 8-column table row
for soup in soups:
    for cell in soup.find_all('td'):
        text = str(cell)[4:-5]  # strip the surrounding <td> and </td>
        if col == 0:
            text = text[-11:-8]
            if text != 'N/A':
                text = text[2:]
            rating.append(text)
        elif col == 1:
            title.append(text)
        elif col in numeric_columns:
            text = text[:-9]  # drop the trailing 9 characters after the number
            if len(text) == 1:  # a single character marks a missing value
                text = '-1'
            numeric_columns[col].append(text.replace(',', ''))
        # Columns 5 to 7 are not used.
        col = (col + 1) % 8
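
The cell counter above implies that each table row has eight cells. As a sketch of the same idea made explicit, the flat list of cells from one page can first be split into rows, after which each column is read by index. The helper name `chunk_rows` is hypothetical, and the width of eight is an assumption taken from the counter reset.

```python
def chunk_rows(cells, width=8):
    """Split a flat list of cells into complete rows of `width` cells."""
    usable = len(cells) - len(cells) % width  # ignore a trailing partial row
    return [cells[start:start + width] for start in range(0, usable, width)]

# Made-up cells for illustration: two full rows plus one stray cell.
rows = chunk_rows(list(range(17)))
print(rows)
```

With rows in hand, `row[1]` would be the building name and `row[2]` the consumption, matching the column positions used above.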


The following block of code changes the 'occupants' and 'floors' data into a usable format: it removes any unnecessary text from the 'floors' values and converts missing values in both lists to -1 so that the lists can be converted to floats later on.

In [None]:
# occupants[320] and floors[33] are entries with no value reported;
# their text is used as the pattern for a missing value.
null = occupants[320]
for i, o in enumerate(occupants):
    if str(o) == null:
        occupants[i] = -1

null = floors[33]
for i, f in enumerate(floors):
    if str(f) == null:
        floors[i] = -1
    elif len(str(f)) > 3:
        # Longer entries contain extra text, so keep only the digits.
        floors[i] = ''.join(filter(str.isdigit, f))
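
Matching missing values through hard-coded positions such as `occupants[320]` only works while the scraped order stays the same. A minimal sketch of a position-independent clean-up pulls the first number out of each entry and falls back to NaN. The helper name `to_number` is hypothetical, and note that it keeps the first number rather than joining all digits as the cell above does.

```python
import math
import re

def to_number(text):
    """Return the first number found in `text` as a float, or NaN if there is none."""
    match = re.search(r'-?\d+(?:\.\d+)?', str(text).replace(',', ''))
    return float(match.group()) if match else math.nan

# Made-up entries for illustration only.
print(to_number('12 floors'))
print(to_number('1,250'))
print(to_number('Unknown'))
```

Applied as `[to_number(f) for f in floors]`, this would also make the later `float` conversion unnecessary.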

Converting all of the numeric data into float lists:

In [None]:
consumption = list(map(float, consumption))
consumption_area = list(map(float, consumption_area))
floor_area = list(map(float, floor_area))
occupants = list(map(float, occupants))
floors = list(map(float, floors))

From the above scraped data, we can make our new dataframe:

In [None]:
columns = ['sqm', 'building', 'energy consumption', 'energy consumption per area', 'year built', 'floors', 'no. occupants', 'main heating type', 'rating2']
df = pd.DataFrame(columns=columns)

In [None]:
df['sqm'] = floor_area
df['building'] = title
df['energy consumption'] = consumption
df['energy consumption per area'] = consumption_area
df['year built'] = years
df['floors'] = floors
df['no. occupants'] = occupants
df['main heating type'] = heating
df['rating2'] = rating

In [None]:
df['sqm'] = df['sqm'].replace(-1, np.nan)
df['energy consumption'] = df['energy consumption'].replace(-1, np.nan)
df['energy consumption per area'] = df['energy consumption per area'].replace(-1, np.nan)
df['rating2'] = df['rating2'].replace('N/A', np.nan)
df['floors'] = df['floors'].replace(-1, np.nan)
df['no. occupants'] = df['no. occupants'].replace(-1, np.nan)
df['year built'] = df['year built'].replace('-', np.nan)
df['main heating type'] = df['main heating type'].replace('-', np.nan)
df

In [None]:
df.info()

We can split this dataframe into train and test sets by separating the rows with a rating from the rows without one, respectively.

In [None]:
train1 = df[df['rating2'].notna()]
train1.info()

In [None]:
test1 = df[df['rating2'].isnull()]
test1 = test1.drop('rating2', axis=1)
test1.info()

Here I am merging our new train and test dataframes obtained from the above web scraping with our original train and test dataframes from the metadata dataset.

In [None]:
merge1 = pd.merge(left=train_data, right=train1, how='outer', left_on='sqm', right_on='sqm')
merge1.info()

In [None]:
merge2 = pd.merge(left=test_data, right=test1, how='outer', left_on='sqm', right_on='sqm')
merge2.info()
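
To see how many rows actually matched in an outer merge, `pd.merge` accepts `indicator=True`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses made-up toy frames, not the notebook's data, purely to show the mechanics.

```python
import pandas as pd

# Toy data for illustration only.
left = pd.DataFrame({'sqm': [100.0, 250.5, 300.0], 'yearbuilt': [1990, 2001, 1975]})
right = pd.DataFrame({'sqm': [250.5, 300.0, 999.0], 'building': ['A', 'B', 'C']})

merged = pd.merge(left, right, how='outer', on='sqm', indicator=True)

# Count rows found in both frames versus only one side.
match_counts = merged['_merge'].value_counts()
print(match_counts)
```

Because the merge key is a float, values that differ only by rounding will not match; rounding `sqm` on both sides before merging is one option to consider.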

The analysis of the merge below raises the following question: if the metadata dataset supposedly already includes data from this website, why do only 35 of the sqm values match for the train dataset and 12 for the test dataset?



In [None]:
sqm1 = train_data['sqm']
sqm2 = train1['sqm']
year1 = train_data['yearbuilt']
year2 = train1['year built']

train_intersection = list(set(sqm1) & set(sqm2))
year_intersection = list(set(year1) & set(year2))


print("There are ", len(train_intersection), " sqm matches!")
print("There are ", len(year_intersection), " year matches!")

In [None]:
sqm1 = test_data['sqm']
sqm2 = test1['sqm']
year1 = test_data['yearbuilt']
year2 = test1['year built']

test_intersection = list(set(sqm1) & set(sqm2))
year_intersection = list(set(year1) & set(year2))


print("There are ", len(test_intersection), " sqm matches!")
print("There are ", len(year_intersection), " year matches!")
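
One possible contributor to a low year-match count is a type mismatch: if `yearbuilt` in the metadata is numeric while the scraped `year built` values are strings, a plain set intersection finds nothing even when the years agree. This is an assumption about the column types, not something confirmed above. A minimal sketch that normalises both sides to integers before intersecting, with hypothetical helper names `to_year` and `year_matches`:

```python
def to_year(value):
    """Return `value` as an int year, or None if it cannot be read as one."""
    try:
        return int(float(str(value).strip()))
    except (ValueError, OverflowError):
        return None  # covers NaN, '-', and other non-numeric text

def year_matches(left, right):
    """Intersect two collections of years after normalising their types."""
    left_years = {to_year(v) for v in left} - {None}
    right_years = {to_year(v) for v in right} - {None}
    return left_years & right_years

# Made-up values for illustration only.
print(year_matches([1990.0, 2001.0, float('nan')], ['1990', '-', '1975']))
```

In the notebook this could be called as `year_matches(test_data['yearbuilt'], test1['year built'])`.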