# Assignment 1: Web Scraping

## Objective

Data scientists often need to crawl data from websites and turn the crawled data (HTML pages) into structured data (tables). Web scraping is therefore an essential skill for every data scientist. In this assignment, you will learn the following:


* How to download HTML pages from a website
* How to extract relevant content from an HTML page

Furthermore, you will gain a deeper understanding of the data science lifecycle.

**Requirements:**

1. Please use [pandas.DataFrame](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) rather than spark.DataFrame to manipulate data.

2. Please use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) rather than [lxml](http://lxml.de/) to parse an HTML page and extract data from the page.

3. Please follow the Python code style guide, [PEP 8](https://www.python.org/dev/peps/pep-0008/). If the TA finds your code hard to read, you will lose points. This requirement applies for the whole semester.

## Preliminary

If this is your first time writing a web scraper, you will need to learn the basics first. I found this to be a good resource: [Tutorial: Web Scraping and BeautifulSoup](https://www.dataquest.io/blog/web-scraping-beautifulsoup/).

Please let me know if you find a better resource. I'll share it with the other students.
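
As a quick orientation, here is a minimal sketch of what parsing with BeautifulSoup looks like. The HTML snippet, the `person` class name, and the people in it are made up for illustration; they are not taken from any SFU page, and the real pages will need different selectors.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, for illustration only.
html = """
<div class="person"><h4>Ada Lovelace, Professor</h4><a href="/people/ada.html">Profile</a></div>
<div class="person"><h4>Alan Turing, Lecturer</h4></div>
"""

soup = BeautifulSoup(html, "html.parser")

records = []
for div in soup.find_all("div", class_="person"):
    # "Name, Rank" -> split on the first comma only
    name, rank = [part.strip() for part in div.h4.text.split(",", 1)]
    link = div.find("a")  # None when the block has no <a> tag
    records.append({
        "name": name,
        "rank": rank,
        "profile": link["href"] if link is not None else "",
    })

print(records)
```

The pattern is the same for larger pages: find the repeating container, then pull each field out of it while handling missing tags explicitly.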

## Overview

Imagine you are a data scientist working at SFU. Your job is to extract insights from SFU data to answer questions. 

In this assignment, you will complete two tasks. Recall the high-level data science lifecycle shown below. While working on the assignment, keep in mind what data you collected and what questions you were trying to answer.


<img src="lifecycle.png" width="500">
<center><h4>Figure 1. Data Science Lifecycle</h4></center>

## Task 1: SFU CS Faculty Members

Sometimes you don't know what questions to ask. No worries. Start collecting data first. 

In Task 1, your job is to write a web scraper to extract the faculty information from this page: [https://www.sfu.ca/computing/people/faculty.html](https://www.sfu.ca/computing/people/faculty.html).




### (a) Crawl Web Page

A web page is essentially a file stored on a remote machine (called a web server). Please write code to download the HTML page and save it as a text file ("csfaculty.html").

In [1]:
from requests import get

url = 'https://www.sfu.ca/computing/people/faculty.html'
response = get(url)

def write_to_file(response_text, file_name):
    # the with-block closes the file even if the write fails
    with open(file_name, 'w') as output_file:
        output_file.write(response_text)

write_to_file(response.text, 'csfaculty.html')
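
A download can fail in ways the cell above does not check for, such as a 404 response or a server that never answers. The sketch below is one possible hardening, not a required solution: `fetch_with_requests` and `save_page` are hypothetical helper names, and the 10-second timeout is an arbitrary choice.

```python
from pathlib import Path


def fetch_with_requests(url, timeout=10):
    """Return the page body, raising on HTTP errors (4xx/5xx)."""
    import requests  # imported lazily so save_page has no hard dependency

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def save_page(url, file_name, fetch=fetch_with_requests):
    """Download `url` via `fetch` and write it to `file_name` as UTF-8 text."""
    text = fetch(url)
    # write_text opens, writes, and closes the file in one call
    Path(file_name).write_text(text, encoding="utf-8")
    return len(text)
```

With this in place, `save_page('https://www.sfu.ca/computing/people/faculty.html', 'csfaculty.html')` would do the same job as the cell above. Passing a different `fetch` function makes the helper easy to try out without touching the network.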

### (b) Extract Structured Data

Please write code to extract relevant content (name, rank, area, profile, homepage) from "csfaculty.html" and save them as a CSV file (like [faculty_table.csv](./faculty_table.csv)). 

In [2]:
from bs4 import BeautifulSoup
import pandas as pd

# the with-block closes the file once it has been read
with open('csfaculty.html', 'r') as file:
    file_content = file.read()

html_soup = BeautifulSoup(file_content, 'html.parser')
faculty_container = html_soup.find_all('div', class_ = 'textimage section')

faculty_names = []
faculty_ranks = []
faculty_areas = []
faculty_profiles = []
faculty_homepages = []

def prep_string(input_string):
    return ("_".join(input_string.split())).upper()
    
for faculty in faculty_container:
    # get faculty name and rank
    name_and_rank = faculty.h4.text
    
    # split name and rank
    name_and_rank = str.replace(name_and_rank, '\n', ',')
    split_name_and_rank = name_and_rank.split(',')
    
    # get faculty name
    faculty_name = split_name_and_rank[0]
    faculty_names.append(faculty_name.title())
    
    # get faculty rank
    faculty_rank = str.strip(split_name_and_rank[1])
    faculty_ranks.append(faculty_rank.title())

    # get areas
    area = ''
    area_paragraph = faculty.p
    if area_paragraph is not None:
        area = area_paragraph.text.title().replace('Area:', '').strip()
    area = area.replace(';', ',')
    faculty_areas.append(area)
    
    # get contact information and home page links
    links = faculty.find_all('a', class_ = '')
    
    profile_link = ''
    homepage_link = ''
    
    # iterate through links [because they don't have any specific html attribute]
    for link in links:
        
        # check if the href attribute is available
        if link.has_attr('href'):
            
            # check if it's contact information <a> 
            if (prep_string(link.text) == 'PROFILE_&_CONTACT_INFORMATION'):
                profile_link = link['href']
                
                # add http://www.sfu.ca to start of relative paths
                if not profile_link.startswith('http://www.sfu.ca/'):
                    profile_link = 'http://www.sfu.ca' + profile_link
                    
            # check if it's home page <a>
            elif (prep_string(link.text) == 'HOME_PAGE'):
                homepage_link = link['href']

    faculty_profiles.append(profile_link)
    faculty_homepages.append(homepage_link)


df_faculty = pd.DataFrame({'name': faculty_names, 
                           'rank': faculty_ranks,
                           'areas': faculty_areas,
                           'profile': faculty_profiles,
                           'homepage': faculty_homepages
                         })

df_faculty.to_csv('faculty_table.csv', index=False)

print('faculty_table.csv saved successfully.')

faculty_table.csv saved successfully.
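
The cell above turns relative profile links into absolute ones by prepending a fixed prefix. An alternative, shown here only as a sketch, is `urllib.parse.urljoin` from the standard library, which resolves relative paths and leaves absolute URLs unchanged. `to_absolute` is a hypothetical helper name and `example.org` is a placeholder.

```python
from urllib.parse import urljoin

BASE_URL = "http://www.sfu.ca"  # same prefix the cell above prepends


def to_absolute(href, base=BASE_URL):
    """Resolve `href` against `base`; absolute URLs are returned unchanged."""
    return urljoin(base, href)


print(to_absolute("/computing/people/faculty/jiannanwang.html"))
print(to_absolute("https://example.org/~someone/"))
```

Note that an absolute link keeps whatever scheme and host it already has, so any later code that assumes a fixed `http://www.sfu.ca/...` prefix would need to account for that.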


### (c) Interesting Finding

Note that you don't need to do anything for Task 1(c). The purpose of this part is to give you a sense of how to use exploratory data analysis (EDA) to come up with interesting questions about the data. EDA is an important topic in data science; you will learn about it soon in this course.


First, please install [dataprep](http://dataprep.ai).
Then, run the cell below. 
It shows a bar chart for every column. What interesting findings can you draw from these visualizations?

In [3]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_table.csv")

plot(df)

Below are some examples:

**Finding 1:** The number of Professors (29) is more than 3x the number of Associate Professors (9).

**Questions:** Why did this happen? Is it common across CS schools in Canada? Will the gap grow or shrink in the next five years? What actions can be taken to widen or narrow the gap?


**Finding 2:** Homepage has 20.3% missing values. 

**Questions:** Why are there so many missing values? Is it because many faculty members do not have their own homepages, or because they have not added them to the school page? What actions can be taken to prevent this from happening in the future?
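
Findings like these can also be checked numerically. The sketch below uses a tiny made-up table rather than the real `faculty_table.csv`, so the numbers it prints are not the ones quoted above; it only illustrates the two pandas calls involved.

```python
import pandas as pd

# Made-up rows for illustration; the real data comes from faculty_table.csv.
df = pd.DataFrame({
    "rank": ["Professor", "Professor", "Associate Professor", "Lecturer"],
    "homepage": ["http://a.example", None, "http://b.example", None],
})

# How many faculty members hold each rank
rank_counts = df["rank"].value_counts()

# Share of rows with no homepage, as a percentage
missing_pct = df["homepage"].isna().mean() * 100

print(rank_counts)
print(f"homepage missing: {missing_pct:.1f}%")
```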

## Task 2: Age Follows Normal Distribution?

In this task, you start with a question and then figure out what data to collect.

The question that you are interested in is `Does SFU CS faculty age follow a normal distribution?`

To estimate the age of a faculty member, you can collect the year in which they graduated from university (`gradyear`) and then estimate `age` using the following equation:

$$age \approx 2020+23 - gradyear$$

For example, if one graduates from a university in 1990, then the age is estimated as 2020+23-1990 = 53. 
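
The same estimate as a small Python function. The constant names are my own, and reading the 23 as a typical age at graduation is an interpretation of the equation rather than something the assignment states.

```python
BASE_YEAR = 2020
AGE_OFFSET = 23  # the constant from the equation; presumably a typical age at graduation


def estimate_age(gradyear):
    """Estimate age as 2020 + 23 - gradyear."""
    return BASE_YEAR + AGE_OFFSET - gradyear


print(estimate_age(1990))  # 53, matching the worked example
```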



### (a) Crawl Web Page

You notice that faculty profile pages contain graduation information. For example, you can see that Dr. Jiannan Wang graduated from Harbin Institute of Technology in 2008 at [http://www.sfu.ca/computing/people/faculty/jiannanwang.html](http://www.sfu.ca/computing/people/faculty/jiannanwang.html). 


Please write code to download the 64 profile pages and save each page as a text file. 

In [4]:
import pandas as pd
from requests import get
from time import sleep
from random import randint
from IPython.display import clear_output

# read generated faculty data
df_faculty = pd.read_csv('faculty_table.csv')


# function to save file
def write_to_file(response_text, file_name):
    # the with-block closes the file even if the write fails
    with open(file_name, 'w') as output_file:
        output_file.write(response_text)

    
# iterate through faculties
for index, faculty in df_faculty.iterrows():
    
    # check if faculty has profile link
    if (not pd.isnull(faculty['profile'])):
        
        # get profile page
        url = faculty['profile']
        response = get(url)

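        # write to output; the file name is the profile URL without its
        # 43-character prefix 'http://www.sfu.ca/computing/people/faculty/'
        write_to_file(response.text, faculty['profile'][43:])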
        
        clear_output()
        print('%i faculty pages saved successfully.' % (index + 1))
        
        # sleep for some seconds until next request
        sleep(randint(1, 4))
        
print('faculty pages saved successfully.')

64 faculty pages saved successfully.
faculty pages saved successfully.
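
The cell above derives each file name by slicing off the first 43 characters of the profile URL, which only works while every URL shares exactly the same prefix. A sketch of a less brittle alternative, assuming every profile URL ends in a file name, is to take the last path segment. `file_name_from_url` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlparse


def file_name_from_url(url):
    """Return the last path segment of `url`, e.g. 'jiannanwang.html'."""
    return posixpath.basename(urlparse(url).path)


print(file_name_from_url(
    "http://www.sfu.ca/computing/people/faculty/jiannanwang.html"
))
```

Unlike the slice, this gives the same result whether the URL starts with `http://` or `https://`.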


### (b) Extract Structured Data

Please write code to extract the earliest graduation year (e.g., 2008 for Dr. Jiannan Wang) from each profile page, and create a csv file like [faculty_grad_year.csv](./faculty_grad_year.csv). 

In [5]:
from bs4 import BeautifulSoup
import pandas as pd


# read generated faculty data
df_faculty = pd.read_csv('faculty_table.csv')


# this function finds the first graduation year from <li> tags
def get_min_gradyear_from_li(educations_list):
    grad_year_min = 9999
    
    # iterate through all graduations of prof. in <li> 
    for education in educations_list:
        
        # split parts using ','
        split_education = education.text.split(',')
        
        # get the last part after the split, which holds the graduation year
        grad_year = split_education[-1].strip()
        
        # get the last 4 characters, which should be the year
        grad_year = grad_year[-4:]
        
        # check if it's for sure only digits
        if (grad_year.isdigit()):
            grad_year = int(grad_year)
            
            # check for minimum graduation date
            if grad_year < grad_year_min:
                grad_year_min = grad_year

    # return value
    return grad_year_min


# this function finds the first graduation year from <p> tag
def get_min_gradyear_from_p(educations_paragraph):
    grad_year_min = 9999
    
    # split input paragraph by line breaks '\n'
    educations_list = educations_paragraph.text.lower().split('\n')
    
    # iterate through graduations
    for education in educations_list:
        
        # check if data is not empty
        if (education != ''):
            
            # remove . from data [some cases]
            education = education.replace('.', '')
            
            # split using ','
            split_education = education.split(',')
            
            # the last part after the split holds the graduation year
            grad_year = split_education[-1].strip()
            
            # get the last 4 characters, which should be the year
            grad_year = grad_year[-4:]
            
            # make sure its only digits
            if (grad_year.isdigit()):
                grad_year = int(grad_year)
                
                # find the minimum value
                if grad_year < grad_year_min:
                    grad_year_min = grad_year

    # return value
    return grad_year_min


# arrays of data to generate
faculty_names = []
faculty_gradyears = []


# iterate through faculties
for index, row in df_faculty.iterrows():
    
    # add faculty name to array
    faculty_names.append(row['name'])
    
    # set min grad year
    grad_year_min = 9999
    
    # if faculty has profile page then start process
    if (not pd.isnull(row.profile)):
        
        # open the saved file (the name is the profile URL minus its 43-character prefix)
        file_name = row['profile'][43:]
        with open(file_name, 'r') as file:
            file_content = file.read()

        # parse html
        html_soup = BeautifulSoup(file_content, 'html.parser')

        # read all div s
        sections = html_soup.find_all('div', 'text parbase section')
        
        # iterate through all divs
        for section in sections:

            # read <h2>
            headers = section.find_all('h2')

            # iterate through h2 to find Education title
            for header in headers:

                # if 'Education' found
                if (header.text.strip().title() == 'Education'):

                    # get the parent element of the matched <h2>
                    parent = header.parent

                    # find any <li> tags [some education entries are <li>, some are <p>]
                    educations = parent.find_all('li')

                    # if there's any <li>
                    if (len(educations) > 0):
                        grad_year_min = get_min_gradyear_from_li(educations)
                    else: # otherwise, data is in a paragraph <p>
                        grad_year_min = get_min_gradyear_from_p(parent.p)
    
    # add grad year to array
    faculty_gradyears.append(grad_year_min)
    

# create pandas data frame
df_gradyear = pd.DataFrame({'name': faculty_names, 
                            'gradyear': faculty_gradyears
                           })

# change 9999 to '' 
df_gradyear['gradyear'] = df_gradyear['gradyear'].apply(lambda year: year if year != 9999 else '')

# save dataframe to output
df_gradyear.to_csv('faculty_grad_year.csv')

# print job completed
print('data saved to faculty_grad_year.csv')

data saved to faculty_grad_year.csv
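
The cell above finds years by splitting each education entry on commas and taking the last four characters. A different technique, sketched here as an alternative rather than the notebook's method, is a regular expression that picks out every four-digit year in the text. `earliest_year` is a hypothetical helper, and the sample strings are made up.

```python
import re

# Matches standalone 4-digit numbers from 1900 to 2099
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def earliest_year(text):
    """Return the smallest 4-digit year (1900-2099) found in `text`, or None."""
    years = [int(match) for match in YEAR_PATTERN.findall(text)]
    return min(years) if years else None


print(earliest_year("B.Sc., Example University, 1995; Ph.D., Another University (2001)"))
print(earliest_year("No dates listed"))
```

This handles entries where the year is not the last token, but it will also pick up any other number in that range, so the text passed in should be limited to the Education section.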


### (c) Interesting Finding

Similar to Task 1(c), you don't need to do anything here. Just look at different visualizations w.r.t. age and give yourself an answer to the question: `Does SFU CS faculty age follow a normal distribution?`

In [6]:
from dataprep.eda import plot
import pandas as pd

df = pd.read_csv("faculty_grad_year.csv")
df["age"] = 2020+23-df["gradyear"]

plot(df, "age")
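
Besides looking at the plots, a rough numeric check is to compute skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below is an informal aid, not a formal normality test; the function name is hypothetical and the ages are made up.

```python
def skewness_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3


# Made-up ages for illustration; in the notebook you could pass
# list(df["age"].dropna()) instead.
ages = [35, 38, 41, 44, 47, 50, 53, 56, 59]
skew, kurt = skewness_and_excess_kurtosis(ages)
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

Values far from 0 suggest the distribution is lopsided or has heavier or lighter tails than a normal curve, though with only a few dozen faculty members the estimates are noisy.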

## Submission

Complete the code in this [notebook](https://github.com/sfu-db/bigdata-cmpt733/blob/master/Assignments/A1/A1.ipynb), and submit it to the CourSys activity `Assignment 1`.

<hr/>
## <span style="color:red">Bonus: Contribute to dataprep</span>

**Total Bonus: 0.2 + 0.3 = 0.5**

1. If you create an issue (i.e., report a bug or request a feature) at [link1](https://github.com/sfu-db/dataprep/issues) or [link2](https://github.com/sfu-db/DataConnectorConfigs/issues), you will get **0.2** bonus points.

2. Creating more issues will *not* give you more bonus points, but you are encouraged to do so to learn more.

3. If you send a pull request (i.e., fix a bug, implement a feature, or add a data connector for a new website) at [link1](https://github.com/sfu-db/dataprep/pulls) or [link2](https://github.com/sfu-db/DataConnectorConfigs/pulls) and the pull request gets merged into the repo, you will get **0.3** bonus points.

4. Sending more pull requests will *not* give you more bonus points, but you are encouraged to do so to learn more.

5. These bonus points will be added directly to your final grade.






**How to Submit**
* Submit GitHub link(s) to the CourSys activity `Bonus`.
* Due on `March 15, 2020`

If you love dataprep, please support it by simply clicking the **<span style="color:red">Star</span>** on [GitHub](https://github.com/sfu-db/dataprep).
<img src="dataprep-star.png" width="1000">