The list of MSc programmes offered by the School of EECS is provided at the following URL: http://eecs.qmul.ac.uk/postgraduate/programmes/. Scrape the table at that URL and convert it into a pandas dataframe with one row for each programme of study shown on the webpage. The dataframe should include the following 5 columns: name of postgraduate degree programme (e.g. Advanced Electronic and Electrical Engineering), programme code for part-time study (e.g. H60C), programme code for full-time study (e.g. H60A), URL for part-time study programme details, URL for full-time study programme details. Perform data cleaning to remove unnecessary characters where needed. In the report, include the code that was used to scrape, convert and clean the table, and provide evidence that the table has been successfully scraped (e.g. by displaying the contents of the dataframe).

In [1]:
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

#Open the url
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)

#Create BeautifulSoup object from the html content
soup = BeautifulSoup(html, 'lxml')

# Create an empty list in which the table header is stored
header_list = []

# Find all <th> elements, which hold the table header cells
col_labels = soup.find_all('th')
# Convert the list of header tags into a single string
col_str = str(col_labels)
# Extract the text without the HTML tags
cleantext_header = BeautifulSoup(col_str, "lxml").get_text()
# Add the clean table header to the list
header_list.append(cleantext_header)

print(header_list)  # validate that the header is extracted correctly

['[Postgraduate degree programmes, Part-time(2 year), Full-time(1 year)]']
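The printed header is a single string wrapped in brackets because the list of tags was converted with str() before the text was extracted. As a minimal sketch of an alternative, assuming a small hypothetical HTML sample (sample_html) in place of the live page and the built-in "html.parser" backend, the text of each th element can be read individually, which gives one list item per column and no brackets to clean up later:

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mimics the header row of the programmes table
sample_html = """
<table>
  <tr>
    <th>Postgraduate degree programmes</th>
    <th>Part-time(2 year)</th>
    <th>Full-time(1 year)</th>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Read the text of each <th> separately instead of stringifying the whole list
header_cells = [th.get_text(strip=True) for th in sample_soup.find_all("th")]
print(header_cells)
```

With this approach the header is already a list of three column labels, so the later split and bracket-stripping steps would not be needed for it.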


In [2]:
# Get all the rows in the table with tr tags
rows = soup.find_all('tr')  
    
# Create an empty list where the table will be stored
table_list = []

# For every row in the table, collect the cell text and then append the link of each cell
for row in rows:
    # Get all data associated with td tags in the row
    row_td = row.find_all('td')
    # Convert the cells into a string
    row_cells = str(row_td)

    # Extract the text without HTML tags
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()

    # Append one comma-separated field per cell: the link target, or a blank if the cell has no link
    for td in row_td:
        link = td.find('a')
        if link is not None:
            row_cleantext = row_cleantext + "," + link.get('href')
        else:
            row_cleantext = row_cleantext + "," + ' '
    # Add the clean table row to the list
    table_list.append(row_cleantext)


#print(table_list) #validate if the table is extracted correctly
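Joining the cell text and links with commas works for this table, but it would misalign the columns if a programme name or URL ever contained a comma. The following is a minimal sketch of a cell-by-cell alternative, not the method used in this report. It assumes a hypothetical HTML sample (sample_html) with placeholder example.org links, hypothetical column names, and the built-in "html.parser" backend:

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical sample mimicking two rows of the programmes table
sample_html = """
<table>
  <tr><th>Postgraduate degree programmes</th><th>Part-time(2 year)</th><th>Full-time(1 year)</th></tr>
  <tr>
    <td>Big Data Science</td>
    <td><a href="https://example.org/big-data-science-msc/">H6J6</a></td>
    <td><a href="https://example.org/big-data-science-msc/">H6J7</a></td>
  </tr>
  <tr>
    <td>Computer Games</td>
    <td></td>
    <td><a href="https://example.org/computer-games-msc/">I4U4</a></td>
  </tr>
</table>
"""

def cell_link(td):
    """Return the href of the first link in a cell, or None if there is no link."""
    link = td.find("a")
    return link.get("href") if link is not None else None

records = []
for tr in BeautifulSoup(sample_html, "html.parser").find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) != 3:
        continue  # skip the header row, which has <th> cells only
    name_cell, part_time_cell, full_time_cell = cells
    records.append({
        "programme": name_cell.get_text(strip=True),
        "part_time_code": part_time_cell.get_text(strip=True) or None,
        "full_time_code": full_time_cell.get_text(strip=True) or None,
        "part_time_url": cell_link(part_time_cell),
        "full_time_url": cell_link(full_time_cell),
    })

df = pd.DataFrame(records)
print(df)
```

Because each value is stored under its own key, no delimiter splitting or bracket stripping is required, and missing part-time entries appear as missing values rather than blank strings.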

In [3]:
# Convert the header list into a dataframe
df_header = pd.DataFrame(header_list)
# Split on commas so that each header label gets its own column
df_header2 = df_header[0].str.split(',', expand=True)
display(df_header2)

Unnamed: 0,0,1,2
0,[Postgraduate degree programmes,Part-time(2 year),Full-time(1 year)]


In [4]:
# Remove '[' from the first header label and ']' from the last header label
df_header2[0] = df_header2[0].str.strip('[')
df_header2[2] = df_header2[2].str.strip(']')
display(df_header2)  #validate if the header is cleaned properly

Unnamed: 0,0,1,2
0,Postgraduate degree programmes,Part-time(2 year),Full-time(1 year)


In [5]:
# Convert the table list into a dataframe
df_table = pd.DataFrame(table_list)
# Split on commas so that each cell value and each URL gets its own column
df_table2 = df_table[0].str.split(',', expand=True)
#display(df_table2)

In [6]:
# Remove '[' from column 0 and ']' from column 2
df_table2[0] = df_table2[0].str.strip('[')
df_table2[2] = df_table2[2].str.strip(']')

# Drop the first row, which comes from the header row and holds no programme data
df_table2 = df_table2.iloc[1:]

# Drop column 3, the blank placeholder generated for the programme-name cell, which has no link
df_table2 = df_table2.drop(df_table2.columns[3], axis=1)

#display(df_table2)

In [13]:
# Concatenate the header and body dataframes 

#add the header and table as a list 
frames = [df_header2, df_table2]
#concatenate the frames list into a dataframe
df = pd.concat(frames)

# Name the columns
df_with_header = df.rename(columns={df.columns[0]: df.iloc[0][0],
                                    df.columns[1]: df.iloc[0][1],
                                    df.columns[2]: df.iloc[0][2],
                                    df.columns[3]: "URL for part-time study programme",
                                    df.columns[4]: "URL for full-time study programme"})

# Remove the replicated header from the dataframe
df_final = df_with_header.drop(df_with_header.index[0]) 

#print the column values fully
pd.options.display.max_colwidth = 999

#display the resulting dataframe
display(df_final)

Unnamed: 0,Postgraduate degree programmes,Part-time(2 year),Full-time(1 year),URL for part-time study programme,URL for full-time study programme
1,Advanced Electronic and Electrical Engineering,H60C,H60A,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
2,Artificial Intelligence,I4U2,I4U1,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/artificial-intelligence-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/artificial-intelligence-msc/
3,Big Data Science,H6J6,H6J7,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/big-data-science-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/big-data-science-msc/
4,Computer Games,,I4U4,,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-games-msc/
5,Computer Science,G4U2,G4U1,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-science-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-science-msc/
6,Computer Science by Research,G4Q2,G4Q1,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-science-by-research-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-science-by-research-msc/
7,Computing and Information Systems,G5U6,G5U5,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computing-and-information-systems-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computing-and-information-systems-msc/
8,Data Science and Artificial Intelligence by Conversion,,I4U5,,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/data-science-and-artificial-intelligence-msc/
9,Electronic Engineering by Research,H6T6,H6T5,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/electronic-engineering-by-research-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/electronic-engineering-by-research-msc/
10,Internet of Things (Data),I1T2,I1T0,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/internet-of-things-data-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/internet-of-things-data-msc/
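Beyond displaying the dataframe, a simple format check on the programme codes can serve as additional evidence that cleaning removed all stray characters. The sketch below is illustrative only: the pattern (an uppercase letter followed by three uppercase letters or digits) is an assumption inferred from the codes shown above, not an official specification, and the helper name invalid_codes is hypothetical.

```python
import re

# Assumed pattern, inferred from codes such as H60C and I4U2 in the output above
CODE_PATTERN = re.compile(r"[A-Z][A-Z0-9]{3}")

def invalid_codes(codes):
    """Return the non-blank codes that do not match the assumed pattern."""
    return [
        code for code in codes
        if code.strip() and not CODE_PATTERN.fullmatch(code.strip())
    ]

# A few codes copied from the dataframe above; " " stands for a blank part-time entry
part_time_codes = ["H60C", "I4U2", "H6J6", " ", "G4U2"]
full_time_codes = ["H60A", "I4U1", "H6J7", "I4U4", "G4U1"]

# An empty list means every non-blank code matches the assumed pattern
print(invalid_codes(part_time_codes + full_time_codes))
```

A leftover bracket, such as "H60A]", would be reported by this check, which makes it a quick way to confirm that the strip steps covered every row.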
