As part of my current job, I had to migrate data from an HTML page to a different format. The page contained 112 rows, and 9 different components had to be extracted from each row, so I decided this would be an excellent opportunity to learn web page crawling and use it to get the job done. I used the Beautiful Soup library to parse the HTML tags and their content; it is extensive and quite easy to use. I also used the pandas library to store the data in a dataframe for further preprocessing, as you will see later in the script.

As there is a severe lack of good tutorials on web crawling (most of them only cover the basics and simple pages), I had to rely solely on Beautiful Soup's documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Luckily, the documentation is user friendly and easy to understand, so I was able to learn the library on the go and finish the script in a few hours. So here goes.

Since the page will eventually be taken down from the server after the migration, I have added the original HTML page to the repository as well.

Disclaimer: I am fairly new to pandas and the Beautiful Soup library, so a lot of things done in this notebook may not be the definitive way to do them. Any feedback is welcome. The data I am using is publicly available on a website, so there is no breach of privacy in any way.

In [62]:
# Import the necessary libraries
# requests library is used to send a GET request to the url specified to extract the required information
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

# Extract the required information from the given url
response = requests.get("http://www.chemeng.unimelb.edu.au/people/rhd.html#Abdellah")
content = response.content

# Parse the extracted information using Beautiful Soup library
parser = BeautifulSoup(content,'html.parser')

# Extract the content within the <body></body> tags from the HTML page
body = parser.body

#Uncomment the below variable and run this cell to view the output
#body 

In [63]:
# Extract the 'h2' tags from the body to get all the names and ids of the students from the page 
h2 = body.find_all("h2")
ids = []
names = []

for h in h2:
    ids.append(h['id'])
    names.append(h.text)
    
#ids
#names
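
To see what this loop collects, here is a minimal sketch on a made-up fragment. The markup and the names in it are illustrative only and are not taken from the real page. The sketch also uses `h.get("id")`, which returns None for a heading without an id instead of raising a KeyError the way `h['id']` would.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure described above
sample_html = """
<body>
  <h2 id="Doe">Jane Doe</h2>
  <h2 id="Roe">Richard Roe</h2>
  <h2>Heading without an id</h2>
</body>
"""

sample_body = BeautifulSoup(sample_html, "html.parser").body

sample_ids = []
sample_names = []
for h in sample_body.find_all("h2"):
    # .get() returns None for a missing attribute, whereas h['id'] would raise KeyError
    sample_ids.append(h.get("id"))
    sample_names.append(h.text.strip())

print(sample_ids)
print(sample_names)
```

Whether the extra safety is needed depends on the page; on the page used in this notebook every 'h2' tag appears to carry an id, since the loop above ran without errors.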

A dataframe is created containing the ids of all users present within the 'ids' list and saved under the 'ID' column in the dataframe. We will keep adding more columns to this as we extract more information from the page.

In [64]:
df = pd.DataFrame(ids,columns=['ID'])

# Add 'Names' column to the dataframe and pass the names list to it containing the names of all the students present on the page.
df['Names'] = names

All the other information about the students is present within divs with class 'col-2'. So we will find all of these divs and then extract the info from them one by one. 

As can be seen in the HTML page given, the images present in the page are saved in a div which not only has the class 'col-2' but also class 'first', so we will use the second class to further narrow down our search and extract the image sources from the page.

In [65]:
# Extract all the divisions with the class col-2
divs = body.find_all("div",class_="col-2")

img_srcs = []

# Extracting divs with class 'first' 
img_divs = body.find_all("div",class_="first") 


for imgs in img_divs:
    # Check whether the divs retrieved have <img> tags within them as some users may not have uploaded their images
    if imgs.img is not None:
        # Extract the 'src' value from within the <img> tags
        img_srcs.append(imgs.img['src'])
    else:
        img_srcs.append("None")

# Save the image-sources to 'Image Sources' column in the dataframe.
# The last entry is dropped so that the list has one value per row of the dataframe.
df['Image Sources'] = img_srcs[:-1]

You will observe in the HTML page that all the information for each user is contained within four divs. The first div contains the image, the second contains the thesis title, the third contains academic info such as supervisors and the fourth contains the contact information. Thus, every fourth div contains the same kind of information but for a different user.

So, we will try to extract each piece of information for all the users collectively, using the position of the div in the list. I admit that this may not be the most efficient way to do it, but it was the one I decided to go with at that moment.
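
The "every fourth div" idea can be sketched with plain list slicing. The list below is a stand-in made of strings; the result of `find_all` is also a list, so the same slices should apply to it. The names are hypothetical.

```python
# Stand-in for the list returned by find_all: four entries per student,
# in the order image, thesis, academic info, contact info
sample_divs = [
    "img-A", "thesis-A", "academic-A", "contact-A",
    "img-B", "thesis-B", "academic-B", "contact-B",
]

# A slice with a step of 4 picks the same kind of div for every student
thesis_divs = sample_divs[1::4]
academic_divs = sample_divs[2::4]
contact_divs = sample_divs[3::4]

# Alternatively, group the four divs that belong to one student
per_student = [sample_divs[i:i + 4] for i in range(0, len(sample_divs), 4)]

print(thesis_divs)
print(per_student[1])
```

If the real list has extra divs at the end that do not belong to any student, the slice would need an explicit stop value, which is the role the hard-coded limits play in the loops below.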

In [66]:
thesis_titles = []

# Every fourth div, starting at index 1, holds one student's thesis title
for i in range(1, 447, 4):
    # Check whether the student has a thesis title info
    if divs[i].p is not None:
        thesis_titles.append(divs[i].p.text)
    else:
        thesis_titles.append("None")

# Add the thesis titles to the dataframe
df['Thesis Title'] = thesis_titles
# df[:5]

In [67]:
telephone = []
email = []
room = []

# Every fourth div, starting at index 3, holds one student's contact info
for i in range(3, 451, 4):
    # Each student gets exactly one entry in every list (' ' when a value is missing),
    # so the lists stay aligned with the rows of the dataframe
    # If a <p> tag's string contains 'T:', add it to 'telephone' list
    temp = divs[i].find_all("p", string=re.compile("T:"))
    telephone.append(temp[0].text if temp else ' ')
    # If a <p> tag's string contains 'R' or 'O', add it to 'room' list
    temp_room = divs[i].find_all("p", string=re.compile("[RO]"))
    room.append(temp_room[0].text if temp_room else ' ')
    # The same pattern with 'E:' did not work for emails. A likely reason is that the
    # string filter is matched against a tag's .string, which is None when the <p>
    # has more than one child, so the full text of every <p> tag is checked instead
    temp_email = [t.text for t in divs[i].find_all("p") if "E:" in t.text]
    email.append(temp_email[0] if temp_email else ' ')

df['Telephone'] = telephone 
df['Email'] = email
df['Room'] = room
df[:5]

Unnamed: 0,ID,Names,Image Sources,Thesis Title,Telephone,Email,Room
0,Abdellah,Mohamed Hussein Ali Abdellah,/people/images/mohamed-abdellah.jpg,Solvent resistant nano-filtration for recovery...,T: 8344 8863,"E: sendstudentmail(""mabdellah"");","Room 215, Building 165"
1,Ali,Mr Suhaib Ali,images/suhaib-ali.jpg,,T: 8344 6640,"E: sendstudentmail(""s.ali4"");","Room B.07, C&BE Building 167"
2,Allison,Stephanie Allison,/people/images/stephanie-allison.jpg,,T: 8344 8678,"E: sendstudentmail(""s.allison"");","Room 5.01, Chemistry East Building 154"
3,Bai,Tianyi (Alisa) Bai,/people/images/tianyi-bai.jpg,Polymer Surfactant Interactions in Emulsions,T: 8344 6640,"E: sendstudentmail(""t.bai2"");","Room B07, Desk 4, C&BE Building 167"
4,Bhaskaran,Ayana Bhaskaran,/people/images/ayana-bhaskaran.jpg,Design of new artificial active sites,,"E: sendstudentmail(""abhaskaran"");","Room B07, Desk 8, C&BE Building 167"
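
The contact cell above falls back to checking the text of each 'p' tag for the email. A likely explanation is that the `string` filter of `find_all` is compared against a tag's `.string`, which is None when the tag has more than one child. The sketch below illustrates this on hypothetical markup (the real page's contact markup may differ, and the 'span' tag is only a placeholder for whatever is nested there).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the real page's contact markup may differ
sample = BeautifulSoup(
    '<div><p>T: 0000 0000</p><p>E: <span>jdoe</span></p></div>',
    "html.parser",
)

# The telephone <p> has a single text child, so the string filter finds it
phones = sample.find_all("p", string=re.compile("T:"))

# The email <p> has two children (text plus a tag), so its .string is None
# and the string filter skips it
emails_by_string = sample.find_all("p", string=re.compile("E:"))

# Checking the combined text of each <p> works regardless of nesting
emails_by_text = [p.text for p in sample.find_all("p") if "E:" in p.text]

print(len(phones), len(emails_by_string), emails_by_text)
```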


In [68]:
supervisors = []
siblings = []

# Every fourth div, starting at index 2, holds one student's academic info
for i in range(2, 449, 4):
    # Check whether the div has supervisor or group info
    if divs[i].p is not None:
        # Find the <h3> tags whose string contains 'Supervisor'
        temp = divs[i].find_all("h3", string=re.compile("Supervisor"))
        for t in temp:
            # t.next_siblings is a generator that yields every node after the <h3> tag
            # (tags as well as strings such as '\n'); add it to 'siblings' list
            siblings.append(t.next_siblings)

# Turn each generator into a list so that it can be indexed and reused
supervisors = [list(items) for items in siblings]
# supervisors[:3]
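
The '\n' entries that the next cell removes come from the whitespace between tags, which `next_siblings` yields along with the tags. As a possible alternative, `find_next_siblings()` returns only the tags. The sketch below compares the two on a made-up fragment; the names and links in it are placeholders.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the academic-info div described above
sample_html = """<div class="col-2">
<h3>Supervisor</h3>
<p><a href="staff.php?person_ID=1">Jane Roe</a></p>
<h3>Discipline Group</h3>
<p><a href="/example/">Example Group</a></p>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")
heading = soup.find("h3", string=re.compile("Supervisor"))

# next_siblings yields tags and the '\n' strings between them
raw_siblings = list(heading.next_siblings)

# find_next_siblings() keeps only the tags, so no '\n' cleanup is needed
tag_siblings = heading.find_next_siblings()

print(len(raw_siblings), [t.name for t in tag_siblings])
```

This is only a sketch; the cells in this notebook keep the two-step approach of collecting all siblings first and filtering afterwards.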

In [69]:
# Remove all the '\n' from the list of lists
supervisors_mod = []

for lists in supervisors:
    temp = []
    for item in lists:
        if item != '\n':
            temp.append(item)
    supervisors_mod.append(temp)
# supervisors_mod[:5]

Now we need to split the supervisors and groups into two separate lists so that we can save them in separate columns. This can be done by splitting on the h3 tag with the heading 'Discipline Group'. Everything before the h3 tag goes into the 'supervisors_new' list and everything after it goes into the 'groups' list.

In [70]:
supervisors_new = []
groups = []

for lists in supervisors_mod:
    # If there is no <h3> tag, treat every item as a supervisor and leave the groups empty
    index = len(lists)
    for item in lists:
        if item.name == 'h3':
            # Get the index of the h3 tag from 'supervisors_mod' list for the split
            index = lists.index(item)
    groups.append(lists[index+1:])
    supervisors_new.append(lists[:index])
# supervisors_new[:5]

In [71]:

supervisors_final = []

for lists in supervisors_new:
    temp = []
    for item in lists:
        # Get the content within the <p> tags and add it to 'supervisors_final' list
        # We keep the <a> tags as we need them in the new html file as it is
        temp.append(item.contents[0])
    supervisors_final.append(temp)

df['Supervisors'] = supervisors_final
# df
# supervisors_final[:5]

In [72]:
groups_final = []

for lists in groups:
    temp = []
    for item in lists:
        # Get the content within the <p> tags and add it to 'groups_final' list
        # We keep the <a> tags as we need them in the new html file as it is
        temp.append(item.contents[0])
    groups_final.append(temp)
    
df['Groups'] = groups_final
# df[:5]
# groups_final

In [73]:
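# Add +61 to the telephone numbers
# regex=True is required for '\s+' to be treated as a pattern; recent pandas versions
# treat the pattern as a literal string by default
df['Telephone'] = df['Telephone'].str.replace(r"T:\s+", "+61 ", regex=True)

# Get the image sources url in the proper format
df['Image Sources'] = df['Image Sources'].str.replace("/people/", "", regex=False)
df['Image Sources'] = df['Image Sources'].str.replace("images", "/people/images", regex=False)
# df[:5]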
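
The effect of these replacements can be checked on a small made-up dataframe. The values below are placeholders in the same shape as the scraped columns. The sketch also shows, as one possible variation, a single anchored pattern that handles both 'images/...' and '/people/images/...' in one step.

```python
import pandas as pd

# Made-up values in the same shape as the scraped columns
sample = pd.DataFrame({
    "Telephone": ["T: 0000 0000", "T:  1111 1111", " "],
    "Image Sources": ["/people/images/a.jpg", "images/b.jpg", "None"],
})

# regex=True is needed for \s+ to be treated as a pattern
sample["Telephone"] = sample["Telephone"].str.replace(r"T:\s+", "+61 ", regex=True)

# One anchored pattern covers both 'images/...' and '/people/images/...'
sample["Image Sources"] = sample["Image Sources"].str.replace(
    r"^(?:/people/)?images", "/people/images", regex=True
)

print(sample)
```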

In [74]:
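# Steps to extract emailids in the proper format
# regex=True makes pandas treat the patterns as regular expressions (needed for the escaped characters)
df['Email'] = df['Email'].str.replace(r'E: sendstudentmail\("', "", regex=True)
df['Email'] = df['Email'].str.replace(r'E: sendpgradmail\("', "", regex=True)
df['Email'] = df['Email'].str.replace(r'"\);', "", regex=True)
df['Email'] = df['Email'].str.replace(";", "", regex=False)
# df[:5]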

In [75]:
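# Errors found in the original html file on exploring the csv file generated later in the script
# A single .loc[row, column] call is used because df.loc[8]['Email'] = ... may assign to a temporary copy
df.loc[8, 'Email'] = df.loc[8, 'Email'].replace("d.biswas", "d.biviano")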
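
When correcting a single cell like this, it is generally safer to pass the row label and the column name to one `.loc` call, because an assignment through two chained lookups can end up writing to a temporary copy. A minimal sketch with made-up values:

```python
import pandas as pd

sample = pd.DataFrame({"Email": ["a.one", "b.two"]})

# A single .loc call with both the row label and the column name
# writes to the dataframe itself rather than to a temporary copy
sample.loc[1, "Email"] = sample.loc[1, "Email"].replace("b.two", "b.fixed")

print(sample["Email"].tolist())
```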

In [76]:
# Replace the 'None' values with empty strings
df = df.replace('None','')
df[:5]

Unnamed: 0,ID,Names,Image Sources,Thesis Title,Telephone,Email,Room,Supervisors,Groups
0,Abdellah,Mohamed Hussein Ali Abdellah,/people/images/mohamed-abdellah.jpg,Solvent resistant nano-filtration for recovery...,+61 8344 8863,mabdellah,"Room 215, Building 165","[<a href=""staff.php?person_ID=14055"">Sandra Ke...","[<a href=""http://www.co2crc.com.au/"">CO2CRC</a>]"
1,Ali,Mr Suhaib Ali,/people/images/suhaib-ali.jpg,,+61 8344 6640,s.ali4,"Room B.07, C&BE Building 167","[<a href=""staff.php?person_ID=456615"">Paul Web...","[<a href=""/webley/"">Clean Energy Laboratory</a>]"
2,Allison,Stephanie Allison,/people/images/stephanie-allison.jpg,,+61 8344 8678,s.allison,"Room 5.01, Chemistry East Building 154","[<a href=""staff.php?person_ID=1972"">Greg Qiao<...","[<a href=""/polymer-science/"">Polymer Science</a>]"
3,Bai,Tianyi (Alisa) Bai,/people/images/tianyi-bai.jpg,Polymer Surfactant Interactions in Emulsions,+61 8344 6640,t.bai2,"Room B07, Desk 4, C&BE Building 167","[<a href=""staff.php?person_ID=11455"">Ray Dagas...","[<a href=""/dagastine/"">Dagastine Group</a>]"
4,Bhaskaran,Ayana Bhaskaran,/people/images/ayana-bhaskaran.jpg,Design of new artificial active sites,,abhaskaran,"Room B07, Desk 8, C&BE Building 167","[<a href=""staff.php?person_ID=18324"">Luke Conn...","[<a href=""/connal/"">Connal Group</a>]"


In [77]:
# Write the data in a html file in the new format
filename = "rhd-new.html"
complete = ""

# Template for one student entry; the %s placeholders are filled in below
strs = """<li class="person">
            <div class="person__photo" style="background-image: url(http://www.chemeng.unimelb.edu.au%s);"></div>
            <div class="person__info">
              <div class="person__profile">
                <h3><a data-bound="true" id="%s" href="#">%s</a></h3>
                <dl class="chemeng-def-list">
                  <dt><strong>Thesis title:</strong></dt>
                  <dd>%s</dd>\n
                  <dt><strong>Supervisor:</strong></dt>
                  %s
                  <dt><strong>Discipline/Group:</strong></dt>
                  %s
                  <dt><strong>Location:</strong></dt>
                  <dd>%s</dd>
                </dl>
              </div>
              <div class="person__contact">
                <p class="person__phone"><a href="tel:%s">%s</a></p>
                <p class="person__email"><a href="mailto:%s@student.unimelb.edu.au">%s@student.unimelb.edu.au</a></p>
              </div>
            </div>
          </li>\n"""

# Template for a single <dd> entry (used for both supervisors and groups)
supers = """<dd>%s</dd>\n"""

for index in range(df.shape[0]):
    row = df.iloc[index]
    # str() on a Beautiful Soup tag gives its HTML, so the <a> tags are kept as they are
    supers_all = "".join(supers % str(supervisor) for supervisor in row['Supervisors'])
    groups_all = "".join(supers % str(group) for group in row['Groups'])
    complete += strs % (row['Image Sources'], row['ID'], row['Names'],
                        row['Thesis Title'], supers_all, groups_all, row['Room'],
                        row['Telephone'], row['Telephone'],
                        row['Email'], row['Email'])

# Python 3 strings are already Unicode, so the encoding is set once when the file is opened
with open(filename, 'w', encoding='utf-8') as f:
    f.write(complete)

In [78]:
# Write the dataframe to a csv file to ensure everything is correct and in the proper format
df.to_csv("rhd.csv",encoding='utf-8')

Outliers

This script extracted most of the content from the page, but there were a few exceptions:
* 4 users had '@pgrad.unimelb.edu.au' email ids instead of '@student.unimelb.edu.au' email ids
* 3 users had a "View Online Profile" link to their online profiles in the original HTML file

Although these issues could be fixed with a little more code, I decided against it, as these were minor and rare anomalies that could be fixed manually faster.
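
For reference, the first anomaly could be handled along the following lines. This is only a sketch and was not part of the migration. It assumes that the 'sendpgradmail' helper corresponds to the '@pgrad.unimelb.edu.au' domain, which the page suggests but which I have not verified; the function name `build_address` and the sample usernames are made up.

```python
import re

# Assumed mapping from the JavaScript helper name to the mail domain,
# based on the two helper names and two domains mentioned in this notebook
DOMAINS = {
    "sendstudentmail": "student.unimelb.edu.au",
    "sendpgradmail": "pgrad.unimelb.edu.au",
}

EMAIL_CALL = re.compile(r'(sendstudentmail|sendpgradmail)\("([^"]+)"\)')

def build_address(raw):
    """Return the full address for a raw 'E: send...mail("user");' string, or '' if none is found."""
    match = EMAIL_CALL.search(raw)
    if match is None:
        return ""
    helper, user = match.groups()
    return user + "@" + DOMAINS[helper]

print(build_address('E: sendstudentmail("jdoe");'))
print(build_address('E: sendpgradmail("rroe");'))
```

With something like this, the dataframe would store the full address, and the HTML template would no longer need the hard-coded '@student.unimelb.edu.au' suffix.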