Used the following URL for web scraping: http://eecs.qmul.ac.uk/~emmanouilb/income_table.html. This webpage includes a table of individuals' income and shopping habits.

Using Beautiful Soup, scraped the table and converted it into a pandas DataFrame. Performed data cleaning where necessary to remove extra characters.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [None]:
url = "http://eecs.qmul.ac.uk/~emmanouilb/income_table.html"
html = urlopen(url)

In [None]:
soup = BeautifulSoup(html, 'lxml')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [None]:
# The 'th' tag in HTML denotes a table header cell

header_list=[]  
col_labels=soup.find_all('th') 
col_str = str(col_labels)
cleantext = BeautifulSoup(col_str, "lxml").get_text()
header_list.append(cleantext)

In [None]:
rows = soup.find_all('tr')
table_list = []

# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()  # extract the text without HTML tags
    table_list.append(row_cleantext)

In [None]:
df_header = pd.DataFrame(header_list)
# Split column 0 into multiple columns at each comma using the str.split method
df_header2 = df_header[0].str.split(',', expand=True)
# Remove the list brackets left over from str(), then the whitespace that follows each comma
df_header2[0] = df_header2[0].str.strip('[')
df_header2[3] = df_header2[3].str.strip(']')
df_header2 = df_header2.apply(lambda col: col.str.strip())
df_header2.head()

Unnamed: 0,0,1,2,3
0,Region,Age,Income,Online Shopper


In [None]:
df_table = pd.DataFrame(table_list)
# Split column 0 into multiple columns at each comma using the str.split method
df_table2 = df_table[0].str.split(',', expand=True)
# Remove the list brackets left over from str()
df_table2[0] = df_table2[0].str.strip('[]')
df_table2[3] = df_table2[3].str.strip(']')
# Drop the first row: the header row has no 'td' cells, so its remaining columns are NaN
df_table3 = df_table2.dropna(axis=0, how='any')

df_table3.head(11)

Unnamed: 0,0,1,2,3
1,India,49.0,86400.0,No
2,Brazil,32.0,57600.0,Yes
3,USA,35.0,64800.0,No
4,Brazil,43.0,73200.0,No
5,USA,45.0,,Yes
6,India,40.0,69600.0,Yes
7,Brazil,,62400.0,No
8,India,53.0,94800.0,Yes
9,USA,55.0,99600.0,No
10,India,42.0,80400.0,Yes


In [None]:
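# Concatenate the header row and the table rows
frames = [df_header2, df_table2]
df = pd.concat(frames)

df2 = df.rename(columns=df.iloc[0])  # Use the first row (the scraped header) as the column labels
# The header row and the empty first table row both carry index label 0, so this drops them together
df_table3 = df2.drop(df2.index[0])
df_table3.head(10)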

Unnamed: 0,Region,Age,Income,Online Shopper
1,India,49.0,86400.0,No
2,Brazil,32.0,57600.0,Yes
3,USA,35.0,64800.0,No
4,Brazil,43.0,73200.0,No
5,USA,45.0,,Yes
6,India,40.0,69600.0,Yes
7,Brazil,,62400.0,No
8,India,53.0,94800.0,Yes
9,USA,55.0,99600.0,No
10,India,42.0,80400.0,Yes


In [None]:
df_table3

Unnamed: 0,Region,Age,Income,Online Shopper
1,India,49.0,86400.0,No
2,Brazil,32.0,57600.0,Yes
3,USA,35.0,64800.0,No
4,Brazil,43.0,73200.0,No
5,USA,45.0,,Yes
6,India,40.0,69600.0,Yes
7,Brazil,,62400.0,No
8,India,53.0,94800.0,Yes
9,USA,55.0,99600.0,No
10,India,42.0,80400.0,Yes
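
The scrape-and-clean steps for the income table can also be sketched without the bracket and comma cleanup, by reading each cell's text directly. This is a minimal sketch, not the method used in the cells above: `sample_html` is a hypothetical stand-in for the downloaded page (its markup is assumed; the values are copied from the output above), `sample_soup` and `df_direct` are illustrative names, and `html.parser` is swapped in for `lxml` so that no extra parser is needed.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; markup is assumed,
# values are copied from the table output shown above
sample_html = """
<table>
  <tr><th>Region</th><th>Age</th><th>Income</th><th>Online Shopper</th></tr>
  <tr><td>India</td><td>49.0</td><td>86400.0</td><td>No</td></tr>
  <tr><td>USA</td><td>45.0</td><td></td><td>Yes</td></tr>
  <tr><td>Brazil</td><td></td><td>62400.0</td><td>No</td></tr>
</table>
"""

# html.parser ships with Python; the notebook itself uses 'lxml'
sample_soup = BeautifulSoup(sample_html, "html.parser")

# One string per header cell, already stripped of surrounding whitespace
columns = [th.get_text(strip=True) for th in sample_soup.find_all("th")]

# One list of cell strings per data row; rows without 'td' cells (the header row) are skipped
records = []
for tr in sample_soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        records.append(cells)

df_direct = pd.DataFrame(records, columns=columns)

# Convert the numeric columns; cells that cannot be parsed (empty strings) become NaN
for col in ["Age", "Income"]:
    df_direct[col] = pd.to_numeric(df_direct[col], errors="coerce")

print(df_direct)
```

Because every cell is read on its own, a comma inside a cell would not shift the columns, which is a risk when a whole row is converted to one string and split on commas.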


The list of MSc programmes offered by the School of EECS is available at the following URL: http://eecs.qmul.ac.uk/postgraduate/programmes/. Scraped the table at that URL and converted it into a pandas DataFrame with one row per programme of study, as shown on the webpage. The DataFrame includes the following 5 columns: name of the postgraduate degree programme (e.g. Advanced Electronic and Electrical Engineering), programme code for part-time study (e.g. H60C), programme code for full-time study (e.g. H60A), URL for part-time study programme details, and URL for full-time study programme details. Performed data cleaning to remove unnecessary characters where needed.

In [None]:
url_1 = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html_1 = urlopen(url_1)

In [None]:
soup_1 = BeautifulSoup(html_1, 'lxml')
print(type(soup_1))

<class 'bs4.BeautifulSoup'>


In [None]:
# Create an empty list where the table header will be stored
header_list1 = []

# Find the 'th' html tags which denote table header
col_labels = soup_1.find_all('th')
col_str = str(col_labels)
cleantext_header = BeautifulSoup(col_str, "lxml").get_text()  # extract the text without HTML tags
header_list1.append(cleantext_header) # Add the clean table header to the list

print(header_list1)

['[Postgraduate degree programmes, Part-time(2 year), Full-time(1 year)]']


In [None]:
# Find all the table rows (the 'tr' tag)
rows = soup_1.find_all('tr') 

In [None]:
# Create an empty list where the table will be stored
table_list = []

# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()  # extract the text without HTML tags

    # Append each cell's link target, or an empty string if the cell has no link
    for td in row_td:
        link = td.find('a')
        aurl = link.get('href') if link is not None else ''
        row_cleantext = row_cleantext + "," + aurl

    table_list.append(row_cleantext)  # Add the clean table row to the list

print(table_list)

['[]', '[Advanced Electronic and Electrical Engineering, H60C, H60A],,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/', '[Artificial Intelligence, I4U2\xa0, I4U1\xa0],,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/artificial-intelligence-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/artificial-intelligence-msc/', '[Big Data Science, H6J6, H6J7],,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/big-data-science-msc/,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/big-data-science-msc/', '[Computer Games, \xa0, I4U4],,,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-games-msc/', '[Computer Science, G4U2, G4U1],,https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-science-msc/,https://www.qmul
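
The printed list shows `\xa0` (a non-breaking space) after some programme codes and in empty cells. A minimal sketch of how such characters can be removed, assuming a few hypothetical cell strings shaped like the values above (`raw_cells` and `clean_cells` are illustrative names):

```python
# Hypothetical cell strings shaped like the scraped values printed above
raw_cells = ["I4U2\xa0", " H60A", "\xa0", "Big Data Science"]

# str.strip() with no arguments removes all leading and trailing Unicode
# whitespace, which includes the non-breaking space \xa0
clean_cells = [cell.strip() for cell in raw_cells]

print(clean_cells)  # ['I4U2', 'H60A', '', 'Big Data Science']
```

The same call is available column-wise in pandas as `Series.str.strip()`.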

In [None]:
df_header = pd.DataFrame(header_list1)
#split the "0" column into multiple columns at the comma position using str.split method
df_header2 = df_header[0].str.split(',', expand=True)
df_header2.head()

Unnamed: 0,0,1,2
0,[Postgraduate degree programmes,Part-time(2 year),Full-time(1 year)]


In [None]:
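df_table = pd.DataFrame(table_list)
# Split column 0 into multiple columns at each comma using the str.split method
df_table2 = df_table[0].str.split(',', expand=True)

# Remove unnecessary characters: the list brackets left over from str()
df_table2[0] = df_table2[0].str.strip('[')
df_table2[2] = df_table2[2].str.strip(']')
df_header2[0] = df_header2[0].str.strip('[')
df_header2[2] = df_header2[2].str.strip(']')

# Remove the first row of the table, which contains NaN values (the header row has no 'td' cells)
df_table2 = df_table2.iloc[1:]
# Drop column 3, the empty link slot produced by the programme-name cell
df_table2 = df_table2.drop(df_table2.columns[3], axis=1)

# Strip surrounding whitespace, including non-breaking spaces (\xa0)
df_table2 = df_table2.apply(lambda col: col.str.strip())
df_header2 = df_header2.apply(lambda col: col.str.strip())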




In [None]:
# Concatenate the header row and the table rows
frames = [df_header2, df_table2]
df = pd.concat(frames)
# Rename the columns: the first three names come from the scraped header row
df_head = df.rename(columns={df.columns[0]: df.iloc[0, 0],
                             df.columns[1]: df.iloc[0, 1],
                             df.columns[2]: df.iloc[0, 2],
                             df.columns[3]: "Part-time study programme URL",
                             df.columns[4]: "Full-time study programme URL"})

# Remove the duplicate header row
df3 = df_head.drop(df_head.index[0])
display(df3)

Unnamed: 0,Postgraduate degree programmes,Part-time(2 year),Full-time(1 year),Part-time study programme URL,Full-time study programme URL
1,Advanced Electronic and Electrical Engineering,H60C,H60A,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
2,Artificial Intelligence,I4U2,I4U1,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
3,Big Data Science,H6J6,H6J7,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
4,Computer Games,,I4U4,,https://www.qmul.ac.uk/postgraduate/taught/cou...
5,Computer Science,G4U2,G4U1,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
6,Computer Science by Research,G4Q2,G4Q1,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
7,Computing and Information Systems,G5U6,G5U5,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
8,Data Science and Artificial Intelligence by Co...,,I4U5,,https://www.qmul.ac.uk/postgraduate/taught/cou...
9,Electronic Engineering by Research,H6T6,H6T5,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
10,Internet of Things (Data),I1T2,I1T0,https://www.qmul.ac.uk/postgraduate/taught/cou...,https://www.qmul.ac.uk/postgraduate/taught/cou...
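
The five-column programme table can also be built cell by cell, reading each cell's text and link target directly instead of joining and splitting strings. This is a minimal sketch under stated assumptions, not the method used in the cells above: `sample_html` is a hypothetical stand-in for the programmes page (the markup is assumed; names, codes and URLs are copied from the output above), `cell_href`, `prog_soup` and `df_programmes` are illustrative names, and `html.parser` is swapped in for `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the programmes page; markup is assumed,
# names, codes and URLs are copied from the output above
sample_html = """
<table>
  <tr><th>Postgraduate degree programmes</th><th>Part-time(2 year)</th><th>Full-time(1 year)</th></tr>
  <tr><td>Big Data Science</td>
      <td><a href="https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/big-data-science-msc/">H6J6</a></td>
      <td><a href="https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/big-data-science-msc/">H6J7</a></td></tr>
  <tr><td>Computer Games</td>
      <td>&nbsp;</td>
      <td><a href="https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/computer-games-msc/">I4U4</a></td></tr>
</table>
"""

def cell_href(td):
    """Return the first link target in a cell, or '' if the cell has no link."""
    link = td.find("a")
    return link.get("href", "") if link is not None else ""

# html.parser ships with Python; the notebook itself uses 'lxml'
prog_soup = BeautifulSoup(sample_html, "html.parser")

records = []
for tr in prog_soup.find_all("tr"):
    tds = tr.find_all("td")
    if len(tds) != 3:  # skip the header row, which has no 'td' cells
        continue
    name, part_time, full_time = tds
    records.append({
        "Postgraduate degree programmes": name.get_text(strip=True),
        "Part-time(2 year)": part_time.get_text(strip=True),  # strip=True also removes \xa0
        "Full-time(1 year)": full_time.get_text(strip=True),
        "Part-time study programme URL": cell_href(part_time),
        "Full-time study programme URL": cell_href(full_time),
    })

df_programmes = pd.DataFrame(records)
print(df_programmes)
```

Building one dictionary per row keeps each value tied to its column name, so a programme without a part-time option simply gets empty strings rather than shifting the remaining columns.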
