# Assignment 1: Web Scraping

## Objective

Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Thus, web scraping is an essential skill that every data scientist should master. In this assignment, you will learn the following:


* How to use [requests](http://www.python-requests.org/en/master/) to download HTML pages from a website
* How to select content on a webpage with [lxml](http://lxml.de/)

You can use either Spark DataFrame or [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) to do the assignment. By comparison, pandas.DataFrame offers a richer API but runs on a single machine, so it is not suited to distributed computing.

## Preliminary

If this is your first time writing a web scraper, you will need some basic knowledge of HTML, the DOM, and XPath. I found this to be a good resource: [https://data-lessons.github.io](https://data-lessons.github.io/library-webscraping-DEPRECATED/). Please take a look at:

* [Selecting content on a web page with XPath](https://data-lessons.github.io/library-webscraping-DEPRECATED/xpath/)
* [Web scraping using Python: requests and lxml](https://data-lessons.github.io/library-webscraping-DEPRECATED/04-lxml/). 

Please let me know if you find a better resource. I'll share it with the other students.
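
As a quick illustration of the XPath ideas covered in those lessons, the following minimal sketch parses a small HTML snippet with lxml and selects text and attributes from it. The HTML string, the class name `person`, and the people in it are made up for illustration and do not reflect the real SFU page.

```python
from lxml import etree

# Hypothetical HTML snippet, invented for this example.
SAMPLE_HTML = """
<html><body>
  <div class="person">
    <h4>Jane Doe, Professor</h4>
    <a href="/people/jane.html">Profile</a>
  </div>
  <div class="person">
    <h4>John Smith, Associate Professor</h4>
    <a href="/people/john.html">Profile</a>
  </div>
</body></html>
"""

# Parse the string into an element tree.
tree = etree.HTML(SAMPLE_HTML)

# Select the text of every <h4> that is a child of a div with class "person".
headings = tree.xpath('//div[@class="person"]/h4/text()')

# Select the href attribute of every link whose text contains "Profile".
links = tree.xpath('//a[contains(text(), "Profile")]/@href')

print(headings)
print(links)
```

Both queries return plain Python lists of strings, which is why the results can be cleaned with ordinary list and string operations afterwards.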

## Overview

Imagine you are a data scientist working at SFU. One day, you want to analyze CS faculty data and answer two interesting questions:

1. Who are the CS faculty members?
2. What are their research interests?

To do so, the first thing is to figure out what data to collect.

## Task 1: SFU CS Faculty Members

You find that there is a web page on the CS school website that lists all the faculty members along with their basic information.

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [https://www.sfu.ca/computing/people/faculty.html](https://www.sfu.ca/computing/people/faculty.html).




### (a) Crawling Web Page

A web page is essentially a file stored on a remote machine (called a web server). You can use [requests](http://www.python-requests.org/en/master/) to fetch such a file and read its contents. Please complete the following code to download the HTML page and save it as a text file (like [this](./faculty.txt)).

In [11]:
import requests

# 1. Download the webpage
url = 'http://www.sfu.ca/computing/people/faculty.html'
r = requests.get(url)

# 2. Save it as a text file (named faculty.txt)
filename = 'faculty.txt'
with open(filename, mode='wb') as file:     
    file.write(r.content)
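
The download step above does not check whether the request actually succeeded. A slightly more defensive variant might look like the sketch below. The function name `download_page` and its `get` parameter are hypothetical, introduced here only so the function can be exercised without network access; `timeout` and `raise_for_status()` are part of requests.

```python
import requests


def download_page(url, path, get=requests.get, timeout=10):
    """Download `url` and save the raw bytes to `path`.

    `get` defaults to requests.get; passing another callable is a
    hypothetical hook that makes the function testable offline.
    Returns the number of bytes written.
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of saving an error page.
    response.raise_for_status()
    with open(path, mode='wb') as file:
        file.write(response.content)
    return len(response.content)
```

A call such as `download_page('https://www.sfu.ca/computing/people/faculty.html', 'faculty.txt')` would then either write the page or raise an exception that explains what went wrong.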


### (b) Extracting Structured Data

An HTML page can be represented with the Document Object Model (DOM), which models the page as a tree in which each node is an object representing a part of the page. The nodes can be searched and extracted programmatically using XPath. Please complete the following code to transform the above HTML page into a CSV file (like [this](./faculty_table.csv)).

In [12]:
from lxml import etree
import re
import pandas as pd

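# 1. Open faculty.txt (assumes the saved page is UTF-8 encoded)
# 2. Parse the HTML page as a tree structure
with open(filename, 'r', encoding='utf-8') as fileread:
    html = etree.HTML(fileread.read())

# Optional: serialize the tree back to HTML to inspect its structure
result = etree.tostring(html, pretty_print=True, method="html")
#print(result)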

# 3. Extract related content from the tree using XPath

# 3.1 Extract name and rank of each faculty
content_name_rank = html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[@class]/div[@class]/div/div[2]/h4[1]/text()[1]')
content_name_rank = [element.strip() for element in content_name_rank] # Strip surrounding whitespace (the earlier regex used JavaScript-style /.../ delimiters and never matched)
content_name = []
content_rank = []

# 3.2 Split 'name' and 'rank' of CS faculty into different lists
for element in content_name_rank:
    if element != '':
        name_rank = element.split(',')
        name = name_rank[0].strip(' ')
        rank = name_rank[1].strip(' ')
        # CS Faculty Name
        content_name.append(name)
        # CS Faculty Rank
        content_rank.append(rank)
#print(content_name)
#print(content_rank)

# 3.3 Extract 'area' of each faculty
total_faculty_left_column = html.xpath('count(//*[@id="page-content"]/section/div[2]/div[3]/div[1]/div)') # Total number of @class='textimage section' divs in the left column
total_faculty_right_column = html.xpath('count(//*[@id="page-content"]/section/div[2]/div[3]/div[2]/div)') # Total number of @class='textimage section' divs in the right column
content_area = []

for iter_val in range(int(total_faculty_left_column)): # Consider the faculty list from the left column of the page alone
    iter_val += 1
    if len(html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[1]/div[' + str(iter_val) + ']/div/div[2]//*[contains(text(),"Area")]/../text()')) > 0:
        content_area += html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[1]/div[' + str(iter_val) + ']/div/div[2]//b[contains(text(),"Area")]/../text()')
    else:
        content_area += html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[1]/div[' + str(iter_val) + ']/div/div[2]//b/../text()[2]')

for iter_val in range(int(total_faculty_right_column)): # Consider the faculty list from the right(second) column of the page alone
    iter_val += 1
    if len(html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[2]/div[' + str(iter_val) + ']/div/div[2]//*[contains(text(),"Area")]/../text()')) > 0:
        content_area += html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[2]/div[' + str(iter_val) + ']/div/div[2]//b[contains(text(),"Area")]/../text()')
    else:
        content_area += html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[2]/div[' + str(iter_val) + ']/div/div[2]//b/../text()[2]')

content_area = [re.sub(r'[^a-zA-Z0-9,&;.()\-/ ]+', '', element) for element in content_area] # To strip out junk characters (raw string avoids invalid-escape warnings)
area = []

for element in content_area: # To remove empty fields that got retrieved
    if element != '':
        content = element
        content = content.strip(' ')
        # CS Faculty Area
        area.append(content)
#print(area)

# 3.4 Extract 'profile' of each faculty
content_profile = html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[@class]/div[@class]/div/div[2]//*[contains(text(),"Profile")]/@href')
profile = []

for element in content_profile: # Add http://www.sfu.ca as prefix to profile links of faculty that turn out to be incomplete
    if str(element).startswith("http"):
        profile.append(element)
    else:
        element = "http://www.sfu.ca" + element
        profile.append(element)
#print(profile)

# 3.5 Extract 'homepage' of each faculty
homepage = []
for iter_val in range(int(total_faculty_left_column)): # Consider the faculty list from the left column of the page alone
    iter_val += 1
    if len(html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[1]/div[' + str(iter_val) + ']/div/div[2]//*[contains(text(),"Home")]/@href')) > 0:
        content_homepage = html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[1]/div[' + str(iter_val) + ']/div/div[2]//*[contains(text(),"Home")]/@href')
        homepage += content_homepage
    else:
        homepage.append('')

for iter_val in range(int(total_faculty_right_column)): # Consider the faculty list from the right(second) column of the page alone
    iter_val += 1
    if len(html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[2]/div[' + str(iter_val) + ']/div/div[2]//*[contains(text(),"Home")]/@href')) > 0:
        content_homepage = html.xpath('//*[@id="page-content"]/section/div[2]/div[3]/div[2]/div[' + str(iter_val) + ']/div/div[2]//*[contains(text(),"Home")]/@href') 
        homepage += content_homepage
    else:
        homepage.append('')
#print(homepage)

# 3.6 Create list of lists for each faculty
list_concat = [list(concat_elementwise) for concat_elementwise in  zip(content_name, content_rank, area, profile, homepage)]
#print(list_concat)

# 3.7 Convert list of lists to pandas DataFrame
df = pd.DataFrame(list_concat, columns = ["name", "rank", "area", "profile", "homepage"])
#print(df)

# 4. Save the extracted content as a CSV file (named faculty_table.csv)
filename = 'faculty_table.csv'
df.to_csv(filename, encoding='utf-8', index=False)
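
The extraction above builds each column in a separate pass, so a faculty entry with a missing field can shift later rows out of alignment when the lists are zipped together. One alternative, sketched here under assumed markup, is to loop over each faculty block once and run relative XPath queries (starting with `.`) inside it. The HTML snippet, the `textimage` class value, the people, and the helper names `first` and `parse_card` are assumptions for illustration; the real page structure may differ. `urljoin` from the standard library resolves relative profile links against the site root.

```python
from urllib.parse import urljoin

from lxml import etree

BASE_URL = 'http://www.sfu.ca'

# Hypothetical markup, loosely modelled on a faculty listing.
SAMPLE_HTML = """
<div id="page-content">
  <div class="textimage">
    <h4>Jane Doe, Professor</h4>
    <p><b>Area:</b> Databases</p>
    <a href="/computing/people/faculty/janedoe.html">Profile &amp; Contact Information</a>
    <a href="http://example.org/~jane">Home Page</a>
  </div>
  <div class="textimage">
    <h4>John Smith, Lecturer</h4>
    <p><b>Area:</b> Graphics</p>
    <a href="/computing/people/faculty/johnsmith.html">Profile &amp; Contact Information</a>
  </div>
</div>
"""


def first(values, default=''):
    """Return the first string in `values`, stripped, or `default` if empty."""
    return values[0].strip() if values else default


def parse_card(card):
    """Extract one row from a single faculty block using relative XPath."""
    name, _, rank = first(card.xpath('./h4/text()')).partition(',')
    profile = first(card.xpath('.//a[contains(text(), "Profile")]/@href'))
    return {
        'name': name.strip(),
        'rank': rank.strip(),
        'area': first(card.xpath('.//b[contains(text(), "Area")]/../text()')),
        # urljoin leaves absolute URLs untouched and resolves relative ones.
        'profile': urljoin(BASE_URL, profile) if profile else '',
        # A missing home page becomes an empty string in the same row.
        'homepage': first(card.xpath('.//a[contains(text(), "Home")]/@href')),
    }


tree = etree.HTML(SAMPLE_HTML)
rows = [parse_card(card) for card in tree.xpath('//div[@class="textimage"]')]
for row in rows:
    print(row)
```

Because every field of a row comes from the same block, a missing value stays in its own row instead of shifting the rows after it. A list of such dictionaries can be passed directly to `pd.DataFrame`.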


## Task 2: Research Interests

Suppose you want to know the research interests of each faculty member. However, the web page crawled above does not contain this information.

### (a) Crawling Web Page

You notice that this information can be found on the profile page of each faculty member. For example, you can find the research interests of Dr. Jiannan Wang at [http://www.sfu.ca/computing/people/faculty/jiannanwang.html](http://www.sfu.ca/computing/people/faculty/jiannanwang.html).


Please complete the following code to download the profile pages and save them as text files. There are 60 faculty members, so you need to download 60 web pages in total.

In [13]:
import requests # just in case we run this snippet separately

# 1. Download the profile pages of the 60 faculty members
for name, link in zip(content_name, profile):
    r = requests.get(link)
    filename = str(name) + '.txt' # Name each file after the respective faculty member

    # 2. Save each page as a text file
    with open(filename, mode='wb') as file:
        file.write(r.content)
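
Faculty names are used directly as file names above, which can cause trouble if a name contains characters that are awkward or invalid in paths (for example `/`). A minimal sketch of one way to derive safer file names follows. The helper name `safe_filename`, the fallback stem `unnamed`, and the choice of `_` as the replacement character are assumptions, not part of the assignment.

```python
import re


def safe_filename(name, extension='.txt'):
    """Turn a display name into a conservative file name.

    Runs of characters other than ASCII letters, digits, '-' and '.'
    become a single underscore; leading and trailing underscores are dropped.
    """
    cleaned = re.sub(r'[^A-Za-z0-9.\-]+', '_', name).strip('_')
    # Fall back to a fixed stem if nothing usable is left.
    return (cleaned or 'unnamed') + extension


print(safe_filename('Jane Doe'))
print(safe_filename('A/B  Testing'))
```

Note that this sketch also replaces accented letters, so two different names could in principle map to the same file. If a helper like this is used when saving the pages, the same helper has to be used when opening them again in part (b).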


### (b) Extracting Structured Data

Please complete the following code to extract the research interests of each faculty member, and generate a file like [this](./faculty_more_table.csv).

In [14]:
from lxml import etree # just in case we run this snippet separately
import re
import pandas as pd

# 1. Open each text file and parse it as a tree structure 
count = 0
content_research_interest = []
temp_res_intrst = []
res_interest = []

for item in profile:
    filename = str(content_name[count])+'.txt'
    with open(filename, 'r', encoding='utf-8') as fileread: # Assumes the saved page is UTF-8 encoded
        html = etree.HTML(fileread.read())

        # 2. Extract the research interests from each tree using XPath
        content_research_interest = html.xpath('//*[@id="page-content"]/section[1]//div[*[contains(text(),"Research")]]/*[contains(text(),"Research")]/../ul//text()[normalize-space()]')
        content_research_interest = [re.sub(r'[^a-zA-Z0-9,&;.:()\-/ ]+', '', element) for element in content_research_interest] # To strip out junk characters
        res = []

        if len(content_research_interest) == 0: # To get the research_interests of faculty that are not in the form of a list
            temp_res_intrst = html.xpath('//*[@id="page-content"]/section[1]//div[*[contains(text(),"Research")]]/*[contains(text(),"Research")]/../p//text()[normalize-space()]')
            content_research_interest = [re.sub(r'[^a-zA-Z0-9,&;.:()\-/ ]+', '', element) for element in temp_res_intrst] # To strip out junk characters
            
        for element in content_research_interest: # To remove empty fields that got retrieved
            if element != '':
                # CS Faculty Research interest
                res.append(element)
    count += 1
    res_interest.append(res)
#print(res_interest)

# 3. Add the extracted content to faculty_table.csv
list_merged = [list(concat_elementwise) for concat_elementwise in  zip(content_name, content_rank, area, profile, homepage, res_interest)] # Merge faculty_table content with faculty research interest data
#print(list_merged)

# 3.1 Convert list of lists to pandas DataFrame
df_res = pd.DataFrame(list_merged, columns = ["name", "rank", "area", "profile", "homepage", "research_interests"])
#print(df_res)

# 4. Generate a new CSV file, named faculty_more_table.csv
filename = 'faculty_more_table.csv'
df_res.to_csv(filename, encoding='utf-8', index=False)
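
Because `research_interests` holds a Python list per row, writing it to CSV with pandas stores the list's string representation (for example `['a', 'b']`), which is awkward to parse back. One option, sketched below with the standard `csv` module rather than pandas to keep the example dependency-free, is to join the items with a delimiter before writing and split them again after reading. The sample rows and the `'; '` delimiter are assumptions for illustration.

```python
import csv
import io

# Made-up rows standing in for the scraped data.
rows = [
    {'name': 'Jane Doe', 'research_interests': ['Databases', 'Data cleaning']},
    {'name': 'John Smith', 'research_interests': []},
]

DELIMITER = '; '

# Write to an in-memory buffer instead of a file so the example is self-contained.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['name', 'research_interests'])
writer.writeheader()
for row in rows:
    # Flatten the list into one delimited string per cell.
    writer.writerow({**row, 'research_interests': DELIMITER.join(row['research_interests'])})

# Read the CSV text back and split each cell into a list again.
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer):
    cell = record['research_interests']
    record['research_interests'] = cell.split(DELIMITER) if cell else []
    restored.append(record)

print(restored)
```

This only round-trips cleanly if no single interest contains the delimiter itself, so the delimiter should be chosen with the scraped text in mind.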


## Submission

Complete the code in this [notebook](A1.ipynb), and submit it to the CourSys activity `Assignment 1`.