# ATS Workshops Info Scraping Program



## The Idea


**Three steps of web scraping**

   - Making requests
   - Connecting and dumping resources (HTML/XML/JSON)
   - Analyzing and extracting information

When we visit a website, our browser makes a request (a connection) to the web server, telling it who we are and which files we are asking for. The server then sends back the files our browser needs to render the page.
The returned files can be HTML, XML, JSON, images, CSS, and so on. Most of the time, the text we want is carried in HTML, XML, or JSON.




### Making requests

We mainly use the `urllib.request` module to build HTTP requests and then use those requests to open connections.

The third-party `requests` package can do the same job.
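As a rough sketch of the `requests` alternative, the block below builds a comparable request without sending it. The URL and the shortened User-Agent string are placeholders chosen for illustration, not values taken from this program.

```python
import requests

# Placeholder URL and User-Agent, chosen only for illustration.
EXAMPLE_URL = "https://example.com/events"
EXAMPLE_HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_get_request(url, headers):
    """Build a GET request and prepare it, without touching the network."""
    return requests.Request("GET", url, headers=headers).prepare()


prepared = build_get_request(EXAMPLE_URL, EXAMPLE_HEADERS)
print(prepared.method, prepared.headers["User-Agent"])

# To actually send it (requires network access):
# with requests.Session() as session:
#     response = session.send(prepared, timeout=10)
#     html = response.text
```

In everyday use, `requests.get(url, headers=...)` builds and sends the request in one call; the two-step form is used here only so the example runs offline.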

A request is built with `urllib.request` like this:

```
req = urllib.request.Request(
        my_url, 
        data=None, 
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
```

See the documentation for [`urllib.request.Request()`](https://docs.python.org/3/library/urllib.request.html#urllib.request.Request).

There are three parameters we should define: `url`, `data`, and `headers`.

> * `url` should be a string containing a valid URL.

> * `data` must be an object specifying additional data to send to the server, or None if no such data is needed. 

> * `headers` should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to “spoof” the User-Agent header value, which is used by a browser to identify itself – some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib’s default user agent string is "Python-urllib/2.6" (on Python 2.6).

A further parameter, `method`, sets the HTTP method, such as `GET` or `POST`. If it is omitted, the default is `GET` when `data` is `None` and `POST` otherwise.
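The parameters described above can be checked without sending anything. The following is a minimal sketch; the URL and header value are placeholders.

```python
import urllib.request

# Placeholder URL, used only for illustration; nothing is sent here.
EXAMPLE_URL = "https://example.com/events"

# No data: the method defaults to GET.
get_req = urllib.request.Request(
    EXAMPLE_URL,
    data=None,
    headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"},
)

# Bytes in `data`: the method defaults to POST.
post_req = urllib.request.Request(EXAMPLE_URL, data=b"key=value")

# The `method` parameter overrides the default explicitly.
head_req = urllib.request.Request(EXAMPLE_URL, method="HEAD")

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
print(head_req.get_method())  # HEAD

# Header names are stored capitalized, so the lookup key is "User-agent".
print(get_req.get_header("User-agent"))
```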


**Tips**

Sometimes the request method is not `GET`, especially when scraping data from tables on web pages. In that case the data is more likely transferred as JSON or XML through a `POST` request. Use your browser's developer tools to find which request actually carries the data, use that request's URL as your request URL, and make sure to use the same request method.
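A sketch of that situation, assuming the browser's network tab showed a form-encoded `POST` that returns JSON. The endpoint, the form field, and the response body are all made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint and form field, standing in for whatever the
# browser's network tab shows for the request that carries the data.
ENDPOINT = "https://example.com/api/table"
FORM_FIELDS = {"page": 1}


def build_post_request(url, fields):
    """Encode form fields as bytes; a non-None `data` makes this a POST."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)


def parse_json_body(raw_bytes):
    """Decode a JSON response body into Python objects."""
    return json.loads(raw_bytes.decode("utf-8"))


req = build_post_request(ENDPOINT, FORM_FIELDS)
print(req.get_method(), req.data)

# A made-up response body, in place of urllib.request.urlopen(req).read().
sample_body = b'{"rows": [{"title": "Workshop A"}]}'
print(parse_json_body(sample_body)["rows"][0]["title"])
```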





### Connecting and dumping resources

Once a request is made, we send it with `urllib.request.urlopen` to open the connection:

```
uClient = urllib.request.urlopen(req)
page_html = uClient.read()
uClient.close()
```

Calling `read()` dumps the resource. In Python 3 it returns the raw body as a `bytes` object with no structure, even when you already know the content is HTML, so the next step is to parse it. `BeautifulSoup` accepts these bytes directly.
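If plain text is needed rather than a parsed tree, the bytes from `read()` have to be decoded first. The sketch below assumes UTF-8 when the server declares no charset; `decode_body` and `fetch_text` are illustrative helper names, not part of this program.

```python
import urllib.request


def decode_body(raw_bytes, charset=None):
    """Turn the bytes from read() into str, defaulting to UTF-8."""
    return raw_bytes.decode(charset or "utf-8", errors="replace")


def fetch_text(req):
    """Open the request, read the body, and decode it (needs network access)."""
    with urllib.request.urlopen(req) as response:
        # The charset, if declared, comes from the Content-Type header.
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)


# Offline demonstration with bytes shaped like a read() result.
raw = "<p>café</p>".encode("utf-8")
print(type(raw).__name__)
print(decode_body(raw))
```

Using `urlopen` in a `with` block closes the connection automatically, which replaces the explicit `close()` call.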





### Analyzing and extracting information

If the response is an HTML file, parse it with `html.parser`:

```
page_soup = soup(page_html, "html.parser")
```

We can call this parsing procedure "souping". It returns a `BeautifulSoup` object.

The next step is extracting the information. The method is to walk the DOM and search for tags by name and attributes. `BeautifulSoup` objects offer `findAll()` and `select()` for this; this program mainly uses `findAll()`. Built-in string methods handle the remaining text cleanup. For details, check out the code below.
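The extraction pattern can be tried on a small inline document. The markup below is made up and only loosely shaped like an event listing, so the tag paths are assumptions rather than the real page structure. It uses `find_all()`, the current spelling of `findAll()`.

```python
from bs4 import BeautifulSoup

# Made-up markup loosely shaped like an event listing; the real page differs.
SAMPLE_HTML = """
<article>
  <header>
    <h2><a href="/event/intro">Intro to Scraping 5/15</a></h2>
  </header>
</article>
<article>
  <header>
    <h2><a href="/event/advanced">Advanced Parsing 6/2</a></h2>
  </header>
</article>
"""

page_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns every matching tag; attribute access walks down the tree.
titles = []
for article in page_soup.find_all("article"):
    raw_title = article.header.h2.a.text
    # rpartition(" ") splits on the last space: (title, " ", date).
    title, _, title_date = raw_title.rpartition(" ")
    titles.append((title, title_date))

# select() takes a CSS selector and reaches the same <a> tags.
links = [a["href"] for a in page_soup.select("article header h2 a")]

print(titles)
print(links)
```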


**Tips**

- To learn the structure of the target web page, use your browser's developer tools to inspect it.

- To store the scraped data, you can save it directly in a CSV file or create a `DataFrame` with `pandas` and manipulate the data with it.
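For the CSV route, one option is the standard-library `csv` module, which quotes fields that contain commas. This is a sketch with made-up rows and field names, and it writes to an in-memory buffer instead of a file.

```python
import csv
import io

# Made-up rows; the field names here are illustrative only.
FIELDS = ["title", "location"]
ROWS = [
    {"title": "Intro to Scraping", "location": "Room 101, Main Library"},
    {"title": "Advanced Parsing", "location": "Online"},
]


def rows_to_csv(rows, fieldnames):
    """Serialize dict rows to CSV text, quoting fields that contain commas."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(ROWS, FIELDS)
print(csv_text)

# Reading it back recovers the original values, commas included.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[0]["location"])
```

When writing to a real file, open it with `newline=""` so the `csv` module controls line endings. For the `pandas` route, a list of dicts like `ROWS` can be passed to `pandas.DataFrame`.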






## Code




### Dependencies

In [None]:
import urllib.request
import requests
from bs4 import BeautifulSoup as soup

import logging
logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logging.debug('This is a log message.')


### Customized reusable soup functions

In [None]:
def soup_a_page(req):
    # make connection
    uClient = urllib.request.urlopen(req)
    page_html = uClient.read()
    uClient.close()
    
    # html parser
    page_soup = soup(page_html, "html.parser")
    
    return page_soup

In [None]:
def soup_register_info(courseId):
    logging.debug("soup_register: " + str(courseId))
    post_data = {'intCourseId':courseId}
    return_data = requests.post("https://northeastern.gosignmeup.com/public/Course/CourseDetails",data=post_data)
    
    return_soup = soup(return_data.content, "html.parser")
    
    return return_soup




### Scrape function

In [None]:
def scrape(my_url, f):
    logging.debug('start scraping')
    
    # create a request with a browser-like User-Agent header
    req = urllib.request.Request(
        my_url, 
        data=None, 
        headers={
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
        }
    )
    
    # html parser
    page_soup = soup_a_page(req)
    
    # grab all articles
    articles = page_soup.findAll("article")
    
    #grab title, date, and time
    for article in articles:
        
        #title and title_date
        raw_title = article.header.h2.a.text
        title = raw_title.rpartition(" ")[0]
        title_date = raw_title.rpartition(" ")[2]
        
        #subtitle_date, start_time, end_time
        date_and_time = article.header.h5.text.strip().partition("\xa0\xa0•\xa0\n")
        subtitle_date = date_and_time[0].replace("," , "").replace(" ", "/").replace("May", "5").replace("Jun", "6").replace("Jul", "7").replace("Aug", "8")
        time = date_and_time[2].strip()
        start_time = time.partition(" - ")[0]
        end_time = time.partition(" - ")[2]
        
        #details page
        detail_url = article.div.findAll("p")[1].findAll(href=True)[0]['href']
        
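        #make new connection for detail page
        #(urlopen also accepts a plain URL string, so this request goes out with urllib's default User-Agent)
        detail_soup = soup_a_page(detail_url)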
        
        #grab location
        location = detail_soup.findAll("article")[0].header.findAll("h5")[2].text.partition(": ")[2]
        
        #grab register link and soup it
        register_url = detail_soup.findAll("article")[0].section.findAll("a")[0]['href']
        courseId_str = register_url.rsplit("=")[1]
        courseId_int = int(courseId_str)
        
        return_soup = soup_register_info(courseId_int)
        
        #course name
        reg_course_name = return_soup.div['data-course-name']
        
        #location
        reg_location = return_soup.find("input",{"id":"hdlocation"})['value']
        
        session_div = return_soup.find("div",{"id":"CourseDates_and_TimesContainerDet"}).findAll("div",{"style":"padding:5px; width:100%; min-height:20px; overflow: auto;"})[0]
        session_info_div = session_div.findAll("div", {"style": "padding:5px; height:20px; display: table-row; width:100%;"})[0]
        
        #date
        reg_date = session_info_div.findAll("div")[0].text.strip()
        
        #start time
        reg_start_time = session_info_div.findAll("div")[1].text.replace("\xa0","").replace("\r\n","").replace("(EST)","").strip().partition(" - ")[0]
        
        #end time
        reg_end_time = session_info_div.findAll("div")[1].text.replace("\xa0","").replace("\r\n","").replace("(EST)","").strip().partition(" - ")[2]
        
        instructors = []
        
        try:
            instructors_h2s = return_soup.findAll("div",{"id":"CourseInstructorContainerDet"})[0].findAll("h2")
        except IndexError:
            # findAll() found no container with the singular id; try the plural spelling
            instructors_h2s = return_soup.findAll("div",{"id":"CourseInstructorsContainerDet"})[0].findAll("h2")
        
        
        for instructor in instructors_h2s:
            name = instructor.b.text.replace("\xa0","")
            instructors.append(name)
            logging.debug("appended instructor: " + name)
        
        #instructors, joined with "|"
        reg_instructors = "|".join(instructors)
        
        
        #write info into csv
        f.write(title + "," + title_date + "," + subtitle_date + "," + start_time + "," + end_time
                + "," + location + "," + reg_course_name + "," + reg_date + "," + reg_start_time + ","
                + reg_end_time + "," + reg_location + "," + reg_instructors + "\n")
    




### Main function

In [None]:
def main():
    # test page url
    my_url = "https://www.northeastern.edu/ats/event/"
    #initialize output csv file
    fileName = "workshops.csv"
    # the with-block closes the file even if scraping raises an exception
    with open(fileName, "w", encoding='utf-8') as f:
        logging.debug('opened file:' + fileName)
        # no spaces after the commas, so column names match the data rows
        headers = "title,title_date,subtitle_date,start_time,end_time,location,reg_course_name,reg_date,reg_start_time,reg_end_time,reg_location,reg_instructors\n"
        f.write(headers)
        logging.debug('wrote headers')
        scrape(my_url, f)

if __name__ == '__main__':
    main() 






## Related reading material

   - [urllib.request Documentation](https://docs.python.org/3/library/urllib.request.html#module-urllib.request)
   - [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
   - [Python Web Scraping Tutorial using BeautifulSoup](https://www.dataquest.io/blog/web-scraping-tutorial-python/)