**What I want to do:** Get suburb information from Microburbs (fantastic site btw) and work out which suburb would be the best place for us to move to next.

Microburbs aggregates useful information from several government sources and adds intelligent derivations of its own (pretty much what I would have done with the raw information, only more detailed and smarter!). However, the only way to see this information on the site is a map hover feature, which makes it really hard to compare the suburbs surrounding my suburb of interest. So I will collect the information for all of those suburbs and then simply compare them side by side.

## Admin stuff

In [1]:
import bs4
from bs4 import BeautifulSoup
from urllib2 import urlopen
import pandas as pd
import re
import numpy as np
from time import sleep
%matplotlib inline
import pylab as plt
import os
import warnings

## Source suburb html

I hovered over the inner west region of Sydney on microburbs.com.au and copied the areas-list element from the browser inspector. I will store this html in a variable and extract the suburb information from it.

In [2]:
baseid = 'https://www.microburbs.com.au'

In [3]:
# read the whole file into a single buffer; the with block closes the file afterwards
with open('dict_burb_html', 'r') as fd:
    cache = fd.read()

In [4]:
# split the html I've collected into one entry per location; each entry is then mined for burbids and burblinks
dict_burb_html = cache.replace('\n','').split(',')
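
The split above, together with the `burb.split('|')` calls used further down, implies a layout for the cached file: comma-separated entries, each of the form `location|html`. The sketch below parses that assumed layout on a made-up sample string. The helper name `parse_cache` and the sample are illustrative and not part of the original notebook. Note that splitting on commas only works while neither the location labels nor the html contain a comma.

```python
def parse_cache(cache):
    """Split a cached dump into (location, html) pairs.

    Assumed layout (inferred from the split calls in this notebook):
    entries separated by ',', each entry being 'location|html'.
    """
    pairs = []
    for entry in cache.replace('\n', '').split(','):
        if not entry.strip():
            continue  # skip empty entries such as one left by a trailing comma
        location, html = entry.split('|', 1)  # split on the first '|' only
        pairs.append((location.strip(), html))
    return pairs


# hypothetical sample, not real Microburbs markup
sample = ('Mosman near wharf|<a href="/NSW/a/1">1</a>,\n'
          'Redfern near station|<a href="/NSW/b/2">2</a>')
pairs = parse_cache(sample)
print(pairs[0][0])
```

Returning pairs keeps the label and its html together, so the loop that builds the dataframe would not need to split each entry twice.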

## Mine html and get the burbids/burblinks

In [5]:
# helper: collect every burbid (link text) and burblink (href) from one location's html
def get_burbid_and_burblink(location, html_codedump, baseid = 'https://www.microburbs.com.au'):
    soup = BeautifulSoup(html_codedump, "lxml")
    burbid=[]; burblink = []
    for a in soup.find_all('a', href=True):
        burbid.append(a.string.string.encode('utf-8'))
        burblink.append(a['href'])
    df_temp = pd.DataFrame({'location':[location for x in burbid],'burbid':burbid,'burblink':burblink})
    df_temp['burblink'] = df_temp['burblink'].apply(lambda x: baseid+x)
    return df_temp
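
To check the extraction logic without bs4 or lxml installed, the same idea can be sketched with the standard library's `html.parser`. This is a swapped-in technique rather than the function above, and it is written for Python 3. The class name `LinkCollector`, the helper `extract_links` and the sample html are all hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (link text, href) for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (text, href) pairs
        self._href = None    # href of the anchor currently open, if any
        self._text = []      # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get('href')
        if tag == 'a' and href is not None:
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            self.links.append((''.join(self._text).strip(), self._href))
            self._href = None


def extract_links(html, baseid='https://www.microburbs.com.au'):
    """Return (burbid, full burblink) pairs, prefixing each href with baseid."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [(text, baseid + href) for text, href in parser.links]


# made-up snippet; the <li> wrapper is an assumption about the markup
sample = '<li><a href="/NSW/Sydney/Mosman-Municipality/Mosman/1141554">1141554</a></li>'
print(extract_links(sample))
```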

In [126]:
df_burbs = pd.DataFrame()
for burb in dict_burb_html:
    df_burbs = pd.concat((df_burbs, get_burbid_and_burblink(burb.split('|')[0],burb.split('|')[1])))
    print ('{:}            : Completed'.format(burb.split('|')[0]))

Summer Hill near station            : Completed
Lewisham near station            : Completed
Petersham near station            : Completed
Newtown near station            : Completed
Macdonaldtown near station            : Completed
Redfern near station            : Completed
Macmohans poimnt near wharf            : Completed
Central_near_gym            : Completed
Stanmore_near_station            : Completed
Erko_near_station            : Completed
Mosman near wharf            : Completed


In [127]:
df_burbs.shape

(148, 3)

In [128]:
# use burbid as the dataframe index
df_burbs.set_index('burbid',inplace=True)

In [129]:
pd.options.display.max_colwidth = 100
df_burbs.tail()

| burbid | burblink | location |
|---|---|---|
| 1141559 | https://www.microburbs.com.au/NSW/Sydney/Mosman-Municipality/Mosman/1141559 | Mosman near wharf |
| 1141560 | https://www.microburbs.com.au/NSW/Sydney/Mosman-Municipality/Mosman/1141560 | Mosman near wharf |
| 1141549 | https://www.microburbs.com.au/NSW/Sydney/Mosman-Municipality/Mosman/1141549 | Mosman near wharf |
| 1141564 | https://www.microburbs.com.au/NSW/Sydney/Mosman-Municipality/Mosman/1141564 | Mosman near wharf |
| 1141554 | https://www.microburbs.com.au/NSW/Sydney/Mosman-Municipality/Mosman/1141554 | Mosman near wharf |


## Explore individual burb

In [123]:
burb_full_link = 'https://www.microburbs.com.au/NSW/Sydney/Mosman-Municipality/Mosman/1141554'
burbid=1141554

## Explore element inside code

In [23]:
html = urlopen(burb_full_link).read()  
soup = BeautifulSoup(html, 'lxml')

In [80]:
# this class has all the tabular info
tab = soup.findAll('div', attrs = {'class' : 'col-sm-6 col-lg-5'})

In [75]:
# print each score with the second css class on its span, which names the category (empty for some)
for tab in soup.findAll('div', attrs = {'class' : 'col-sm-6 col-lg-5'}):
    for score_span in tab.findAll('span', attrs = {'class' : 'human-score-value'}):
        print (score_span.text.strip().encode('utf-8'), score_span.attrs['class'][1])

('7', '')
('10', 'family')
('10', 'affluent')
('9', '')
('9', 'lifestyle')
('8', 'convenience')
('10', 'tranquility')
('9', 'internet')
('10', 'community')
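
The output above shows that the second css class names the category for seven of the nine scores and is empty for the other two. One possible way to label the scores is to use that class where it is present and fall back to a positional name otherwise. The sketch below works directly on the printed pairs. `label_scores` and `FALLBACK_NAMES` are hypothetical names, and the fallback labels are taken from the column names used for the score table later in this notebook, on the assumption that the scores always arrive in that order.

```python
# positional labels, taken from the column names used for the score table
FALLBACK_NAMES = ['Hip', 'Family', 'Affluence', 'Safety', 'Lifestyle',
                  'Convenience', 'Tranquility', 'Internet', 'Community']


def label_scores(pairs, fallback_names=FALLBACK_NAMES):
    """Map (score, css_class) pairs to a dict of label -> int score.

    The css class is used as the label when present; otherwise the
    positional fallback name is used (lower-cased for consistency).
    """
    if len(pairs) != len(fallback_names):
        raise ValueError('expected {} scores, got {}'.format(
            len(fallback_names), len(pairs)))
    labelled = {}
    for (score, css_class), fallback in zip(pairs, fallback_names):
        label = css_class if css_class else fallback.lower()
        labelled[label] = int(score)
    return labelled


# the pairs printed above for burb 1141554
pairs = [('7', ''), ('10', 'family'), ('10', 'affluent'), ('9', ''),
         ('9', 'lifestyle'), ('8', 'convenience'), ('10', 'tranquility'),
         ('9', 'internet'), ('10', 'community')]
print(label_scores(pairs))
```

The length check is there so that a page with a different number of scores fails loudly instead of being mislabelled.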


In [121]:
scores=[]
for targetElements in soup.findAll('div', attrs = {'class' : 'col-sm-6 col-lg-5'}):
    for individual_score in targetElements.findAll('span', attrs = {'class' : 'human-score-value'}):
        scores.append(int(individual_score.text.strip().encode('utf-8')))
scores

[7, 10, 10, 9, 9, 8, 10, 9, 10]

In [124]:
pd.DataFrame([scores], 
             columns=['Hip','Family','Affluence','Safety','Lifestyle','Convenience','Tranquility','Internet','Community'],
            index=[burbid])

| burbid | Hip | Family | Affluence | Safety | Lifestyle | Convenience | Tranquility | Internet | Community |
|---|---|---|---|---|---|---|---|---|---|
| 1141554 | 7 | 10 | 10 | 9 | 9 | 8 | 10 | 9 | 10 |
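
To compare suburbs side by side, the single-burb steps above would need to run for every link in df_burbs. A minimal sketch of such a loop follows, assuming a pause between requests is wanted (the notebook imports sleep, presumably for this). The names `collect_scores`, `fetch` and `parse` are hypothetical. `fetch` and `parse` are passed in so the loop can be tried with stand-ins before touching the site, and the second row of sample scores is made up purely for illustration.

```python
from time import sleep

# score names in the order used for the single-burb table
SCORE_NAMES = ['Hip', 'Family', 'Affluence', 'Safety', 'Lifestyle',
               'Convenience', 'Tranquility', 'Internet', 'Community']


def collect_scores(burblinks, fetch, parse, delay=1.0):
    """Return {burbid: {score name: value}} for every burb in burblinks.

    burblinks: dict of burbid -> url
    fetch:     callable url -> html (for example a wrapper around urlopen)
    parse:     callable html -> list of ints (the score extraction step)
    delay:     seconds to wait between requests, to go easy on the site
    """
    results = {}
    for i, (burbid, url) in enumerate(sorted(burblinks.items())):
        if i > 0:
            sleep(delay)
        scores = parse(fetch(url))
        if len(scores) != len(SCORE_NAMES):
            continue  # skip pages that do not yield the expected nine scores
        results[burbid] = dict(zip(SCORE_NAMES, scores))
    return results


# stand-ins so the sketch runs offline; the second row is made-up data
fake_pages = {'url-a': [7, 10, 10, 9, 9, 8, 10, 9, 10],
              'url-b': [9, 6, 7, 8, 9, 10, 5, 9, 8]}
table = collect_scores({'1141554': 'url-a', '0000001': 'url-b'},
                       fetch=lambda url: url,
                       parse=lambda html: fake_pages[html],
                       delay=0)
# rank burbs by one score, highest first
ranked = sorted(table, key=lambda b: table[b]['Convenience'], reverse=True)
print(ranked)
```

The resulting dict can be turned into a comparison table with `pd.DataFrame.from_dict(table, orient='index')`, which gives one row per burbid and one column per score.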
