## Worksheet 09 - Getting started with Web Scraping
  
Your Name:   Christopher Truong
Your Class: INST 447  
Your Section: 0101 (MWF) | 0102 (TTh)  
Your favorite flavor of Frozen Yogurt, Ice Cream, Sorbet, or Other:  Chocolate

## Reading Reflection  

Write a 75 word (+/- 15 words) response about the two assigned readings: 


> “Bots Are Scraping Your Data For Cash Amid Murky Laws And Ethics.” Accessed March 15, 2018. https://www.fastcompany.com/40456140/bots-are-scraping-your-public-data-for-cash-amid-murky-laws-and-ethics-linkedin-hiq.

> Fiesler, Casey. “Law & Ethics of Scraping: What HiQ v LinkedIn Could Mean for Researchers Violating TOS.” Medium (blog), August 15, 2017. https://medium.com/@cfiesler/law-ethics-of-scraping-what-hiq-v-linkedin-could-mean-for-researchers-violating-tos-787bd3322540.

Prompting questions to get your brain moving:
<ol>
<li>What are your thoughts about the ethics of scraping?</li>
<li>How would you feel if you found that your content was being scraped?</li>
</ol>

I believe that it is wrong to take people's data without them knowing because it could contain personal information that they do not want shared. If I found out that my content was being scraped, I would not mind too much, considering that everything on the internet is essentially permanent. People are often cautioned to be wary of what they post; however, I feel that other people would be much more upset than I would be.  


In [48]:
import requests
from bs4 import BeautifulSoup
import re

# Scraping Testudo

URL: https://app.testudo.umd.edu/

## Check if we can scrape!

We need to answer all 3 of the questions below:
<ol>
<li>Is there a robots.txt file?</li>
<li>Is there an X-Robots-Tag in the header? (https://developers.google.com/search/reference/robots_meta_tag)</li>
<li>Is there a robots meta tag that prohibits scraping? (http://www.robotstxt.org/meta.html)</li>
</ol>

We will also check to see if there is a licensing agreement prohibiting the use of this data.

#### Step 1: Check for robots.txt

In [49]:
r = requests.get("https://app.testudo.umd.edu/robots.txt")

We now check the response code to see what the server has said.
So that you don't have to remember all the code numbers, the requests library has them as constants for you to reference.  
See: http://docs.python-requests.org/en/master/api/#status-code-lookup  
Notice that it is very complete:  

In [50]:
requests.codes.i_am_a_teapot

418
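
Python's standard library keeps a comparable lookup in `http.HTTPStatus`. The sketch below is separate from the `requests` workflow in this worksheet; it only illustrates that named constants compare equal to the raw integers a server returns. The helper name `describe_status` is made up for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '404 Not Found' for a numeric status code."""
    # HTTPStatus(code) raises ValueError for codes the enum does not know.
    status = HTTPStatus(code)
    return "%d %s" % (status.value, status.phrase)


# Named constants compare equal to plain integers.
print(HTTPStatus.OK == 200)
print(describe_status(404))
```

Comparing against a named constant, as the worksheet does with `requests.codes.ok`, tends to read more clearly than comparing against a bare `200`.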

If it is ok, then we have a robots.txt file to deal with. If not, then we don't have to worry about it.

In [51]:
r.status_code == requests.codes.ok

False

Nice, no robots.txt file.
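
Had a robots.txt file been present, its rules would need to be read before scraping. One possible approach, sketched here with the standard library's `urllib.robotparser`, is to parse the file's text and ask whether a given URL may be fetched. The robots.txt content, the `example.com` URLs, and the helper name `allowed_by_robots` are all hypothetical and are not taken from Testudo.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, written by hand for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_text, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is made here.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed_by_robots(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))
print(allowed_by_robots(SAMPLE_ROBOTS_TXT, "https://example.com/soc/201808/INST"))
```

In a real check, the text passed in would be `r.text` from the robots.txt request made above.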

#### Step 2: 

In [52]:
testudo_url = "https://app.testudo.umd.edu/"

r = requests.get(testudo_url)
r.status_code == requests.codes.ok

True

Is there an X-Robots-Tag in the header?  
  

r.headers is a dictionary, so I can check to see if the key exists in it and proceed from there.

In [53]:
if r.headers.get('x-robots-tag'):
    print(r.headers.get('x-robots-tag'))
else:
    print('No x-robots-tag found')

No x-robots-tag found
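
When the header does exist, its value is usually a comma-separated list of directives. The following is a minimal sketch of splitting that value into a list, assuming a plain dictionary of headers; `r.headers` from `requests` already ignores case, but an ordinary `dict` does not, so the sketch lower-cases the names itself. The function name `x_robots_directives` and the sample headers are invented for illustration, and values prefixed with a user agent (such as `googlebot: noindex`) are not handled.

```python
def x_robots_directives(headers):
    """Return the lower-cased directives from an X-Robots-Tag header, or []."""
    # Plain dicts are case-sensitive, so normalise the header names first.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("x-robots-tag", "")
    # Split on commas and drop surrounding whitespace and empty pieces.
    return [part.strip().lower() for part in raw.split(",") if part.strip()]


# Hypothetical headers, written by hand for illustration.
print(x_robots_directives({"X-Robots-Tag": "noindex, nofollow"}))
print(x_robots_directives({"Content-Type": "text/html"}))
```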


#### Step 3: 
Is there a robots meta tag?  

Meta tags can appear on any html page, so we should write a function to check. This will let us use it on any page...

In [54]:
parsed = BeautifulSoup(r.text, 'lxml')

In [55]:
def get_robots_meta(soup):
    # find the first meta tag whose name is 'robots'
    meta = soup.find('meta', attrs={'name': 'robots'})
    
    # if there is a robots meta tag
    if meta:
        # pull out its content and split the commands on commas,
        # stripping whitespace so 'noindex, nofollow' still matches later
        content = meta.attrs.get('content', '')
        return [c.strip().lower() for c in content.split(',') if c.strip()]
    else:
        # or return an empty list
        return [] 

Now we can look for those important terms:

In [56]:
robo_meta = get_robots_meta(parsed)

In [57]:
# build a list of the important terms to look for
stops = ['noindex', 'nofollow', 'none']

# this is a compact way to check if any items of one list are in another list
# but you could use any method that works for you
any(i in stops for i in robo_meta)

False
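
As one alternative to the `any(...)` expression, the same question can be asked with a set intersection, which also reports which directives matched. This is only a sketch: the name `blocking_directives` is made up, and the cleaning step assumes directives may arrive with stray spaces or capitals, for example after splitting `noindex, nofollow` on commas.

```python
STOPS = {"noindex", "nofollow", "none"}


def blocking_directives(directives):
    """Return the sorted directives that appear in STOPS, ignoring case and spaces."""
    cleaned = {d.strip().lower() for d in directives}
    # The intersection keeps only the items present in both sets.
    return sorted(cleaned & STOPS)


print(blocking_directives(["index", " NoFollow"]))
print(blocking_directives([]))
```

An empty result means none of the blocking terms were found, which matches the `False` produced above.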

#### Not programming, but looking:
Are there policies or licensing agreements that prevent or allow our scraping of the data?  



Not for this particular website. However, websites such as Craigslist forbid it.

## What iSchool Classes are listed on Testudo for Fall 2018?

It is getting close to registration time. Wouldn't it be nice to be able to have a way to be told automatically if they change the courses listed?  

The URL I have provided goes straight to the INST course listings for the Fall 2018 semester. We'll need to parse the page and get a list of the courses, the sections, and the times each section meets.

In [59]:
testudo_url = "https://app.testudo.umd.edu/soc/201808/INST"

### First get the page.

<ol><li>Get the page.</li>
<li>Check the response status.</li>
<li>Parse the response with BeautifulSoup</li>
<li>Check if there is a robots meta.</li>
<li>Check if there is a 'x-robots-tag' in the header response.</li></ol>

In [60]:
r = requests.get(testudo_url)
r.status_code == requests.codes.ok

True

In [61]:
parsed = BeautifulSoup(r.text, 'lxml')
# Check for the robots meta 
get_robots_meta(parsed)

[]

In [64]:
if r.headers.get('x-robots-tag'):
    print(r.headers.get('x-robots-tag'))
else:
    print('No x-robots-tag found')

No x-robots-tag found


#### Find the elements to grab.

In your browser, use the inspector to find the element that contains each course.

What element contains each course?

Each course is contained in a `div` element with the class `course`.

#### Now get a list of each of those elements from the parsed html

How many do you get?

In [65]:
courses = parsed.find_all('div', attrs={'class': 'course'})
len(courses)

53

Instead of using find_all, we could also use select. Select lets us use CSS-style selectors, so if that is what you feel comfortable with, go for it.

CSS Selectors reference: https://www.w3schools.com/cssref/css_selectors.asp

select is to find_all, as select_one is to find

In [66]:
len(parsed.select('.course'))

53

#### Let's test on the first course

Create a dictionary that contains:  
- Course ID
- Course Title
- Course Credits

In [67]:
first_course = courses[0]
course_dict = {'course_id': first_course.select_one('.course-id').text,
               'course_title': first_course.select_one('.course-title').text,
               # As select is to find_all, select_one is to find:
               'course_credits': first_course.select_one('.course-min-credits').text}

In [68]:
course_dict

{'course_credits': '3',
 'course_id': 'INST126',
 'course_title': 'Introduction to Programming for Information Science'}

#### Loading sections:
The sections are kept on a separate page and loaded with JavaScript:  

> https://app.testudo.umd.edu/soc/201808/sections?courseIds=<course-id\>  
    
You need to replace the <course-id\> with the course id of the course whose sections you want to lookup.


Make a request to get the sections for that first course that you worked with and parse that response.
We will then add the sections' information to the dictionary you created above for the course.

#### Make the request
With requests.get we can build the query string (the part after the '?') by using a dictionary as the second argument. This makes building complex queries much easier over time and prevents you from passing the same key multiple times.

We do this like:
> requests.get(url, {'key': 'value'})

In [69]:
r = requests.get('https://app.testudo.umd.edu/soc/201808/sections',
                 {'courseIds': 'INST126'})
r.status_code == requests.codes.ok

True
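
To see what the dictionary turns into, the standard library's `urllib.parse.urlencode` can build the same kind of query string by hand. This is only a sketch for inspection; `requests` does the encoding itself, and the variable names here are illustrative.

```python
from urllib.parse import urlencode

base_url = "https://app.testudo.umd.edu/soc/201808/sections"
params = {"courseIds": "INST126"}

# urlencode turns the dictionary into the text that follows the '?'.
query = urlencode(params)
full_url = base_url + "?" + query
print(full_url)
```

After a real request, `r.url` shows the final URL that `requests` assembled, which is a quick way to confirm the parameters were sent as intended.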

#### Create a parsed 'soup' object from the response with BeautifulSoup

In [70]:
sect_parsed = BeautifulSoup(r.text, 'lxml')

#### Get the section's container element

Go to your browser and find the container element that holds each section's info. Then create a list with each section in it:

In [71]:
sections = sect_parsed.select('div.delivery-f2f')

In [72]:
len(sections)

3

#### Now get the info for each section.
Save the section_id, instructor, and days with time that the class meets. 
Save each section into the dictionary that you used for the course.

You should end up with a data structure that looks like:
<pre>
[{'instructor': ['Instructor Name'],
  'meeting_place': 'BUILDING ROOM#',
  'meeting_time': 'DAYS TIME',
  'section_id': 'SECTION_ID'}]
</pre>
That is a list that contains a dictionary for each section. Note that the key 'instructor' is also a list because sometimes there are multiple instructors for a section.

In [73]:
sections_list = []
for s in sections:
    s_id = s.select_one('.section-id').text.strip()
    inst = []
    for name in s.select('.section-instructor'):
        inst.append(name.text.strip())
    # read from the current section s (not sections[0]) so each section gets its own values
    days = s.select_one('.section-days').text.strip()
    time = s.select_one('.class-start-time').text.strip() + ' - ' + s.select_one('.class-end-time').text.strip()
    meeting_place = s.select_one('.section-class-building-group').text.strip().replace('\n', ' ')
    sections_list.append({'section_id': s_id,
                          'instructor': inst,
                          'meeting_time': days + ' ' + time,
                          'meeting_place': meeting_place})

#### Add the sections to the course dictionary

Add the sections information to the course information so that you end up with a structure like:

<pre>
{'course_credits': 'NUM CREDITS',
 'course_id': 'COURSE_ID',
 'course_title': 'COURSE_TITLE',
 'sections': [{'instructor': ['INSTRUCTOR NAME'],
   'meeting_place': 'BUILDING ROOM#',
   'meeting_time': 'DAYS TIME',
   'section_id': 'SECTION_ID'}]}
</pre>

In [74]:
course_dict['sections'] = sections_list

### Collect them All!

Ok, so you've just collected the info for a single course. Now do it for all of the courses for INST.

The result should be a data structure that when printed looks like:

<pre>
[{'course_id': 'COURSE_ID',
 'course_title': 'COURSE_TITLE',
 'course_credits': 'NUM CREDITS',
 'sections': [{'instructor': ['INSTRUCTOR NAME'],
   'meeting_place': 'BUILDING ROOM#',
   'meeting_time': 'DAYS TIME',
   'section_id': 'SECTION_ID'}]}
]
</pre>
That is a list that contains a dictionary for each course. Each dictionary contains the information for the course and has a key called sections. The key sections contains a list of dictionaries that contain each section's information.
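
To make the nesting concrete, here is a small sketch that walks a structure of this shape and flattens it into one row per section, then writes the rows as CSV text. The sample values are borrowed from the INST126 output that appears later in this worksheet; the function name `flatten` and the choice to join instructors with `; ` are assumptions made for illustration.

```python
import csv
import io

# Sample data in the shape described above (values from the INST126 output).
courses = [
    {"course_id": "INST126",
     "course_title": "Introduction to Programming for Information Science",
     "course_credits": "3",
     "sections": [
         {"section_id": "0101", "instructor": ["Jonathan Brier"],
          "meeting_time": "MWF 2:00pm - 2:50pm", "meeting_place": "ATL 1113"},
         {"section_id": "0102", "instructor": ["Bill Kules"],
          "meeting_time": "TuTh 9:30am - 10:45am", "meeting_place": "ESJ 2212"},
     ]},
]


def flatten(courses):
    """Return one flat dict per section, repeating the course fields on each row."""
    rows = []
    for course in courses:
        for section in course["sections"]:
            rows.append({
                "course_id": course["course_id"],
                "course_title": course["course_title"],
                "course_credits": course["course_credits"],
                "section_id": section["section_id"],
                # instructor is a list, so join it into a single cell
                "instructor": "; ".join(section["instructor"]),
                "meeting_time": section["meeting_time"],
                "meeting_place": section["meeting_place"],
            })
    return rows


rows = flatten(courses)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```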

In [75]:
semester = '201808'
# courses
r = requests.get("https://app.testudo.umd.edu/soc/%s/INST" % semester)
r.status_code == requests.codes.ok

True

In [76]:
courses_soup = BeautifulSoup(r.text, 'lxml')

In [77]:
len(courses_list)

53

In [78]:
courses_list

[{'course_credits': '3',
  'course_id': 'INST126',
  'course_title': 'Introduction to Programming for Information Science',
  'sections': [{'instructor': ['Jonathan Brier'],
    'meeting_place': 'ATL 1113',
    'meeting_time': 'MWF 2:00pm - 2:50pm',
    'section_id': '0101'},
   {'instructor': ['Bill Kules'],
    'meeting_place': 'ESJ 2212',
    'meeting_time': 'TuTh 9:30am - 10:45am',
    'section_id': '0102'},
   {'instructor': ['Instructor: TBA'],
    'meeting_place': 'SQH 1117',
    'meeting_time': 'TuTh 3:30pm - 4:45pm',
    'section_id': '0103'}]},
 {'course_credits': '3',
  'course_id': 'INST201',
  'course_title': 'Introduction to Information Science',
  'sections': [{'instructor': ['Instructor: TBA'],
    'meeting_place': 'SQH 1120',
    'meeting_time': 'TuTh 12:30pm - 1:45pm',
    'section_id': '0101'},
   {'instructor': ['Instructor: TBA'],
    'meeting_place': 'TWS 0310',
    'meeting_time': 'MWF 10:00am - 10:50am',
    'section_id': '0102'},
   {'instructor': ['Kelly Hoffm

In [79]:
# sections
def get_sections(course_id, semester = '201808'):
    url = 'https://app.testudo.umd.edu/soc/%s/sections' % semester
    r = requests.get(url,
                     {'courseIds': course_id})
    if r.status_code == requests.codes.ok:
        sections_soup = BeautifulSoup(r.text, 'lxml')
        sections = sections_soup.select('div.delivery-f2f')
        sections_list = []
        for s in sections:
            s_id = s.select_one('.section-id').text.strip()
            inst = []
            for name in s.select('.section-instructor'):
                inst.append(name.text.strip())
            if s.select_one('.class-message'):
                class_message = s.select_one('.class-message').text.strip()
                meeting_time = class_message
                meeting_place = ''
            else:
                days = s.select_one('.section-days').text.strip()
                time = s.select_one('.class-start-time').text.strip() + ' - ' + s.select_one('.class-end-time').text.strip()
                meeting_time = days + ' ' + time
                meeting_place = s.select_one('.section-class-building-group').text.strip().replace('\n', ' ')
            sections_list.append({'section_id': s_id,
                                  'instructor': inst,
                                  'meeting_time': meeting_time,
                                  'meeting_place': meeting_place})
        return sections_list
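    else:
        # report which course failed and why, then return no sections
        print("Request for %s failed with status %s" % (course_id, r.status_code))
        return []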

In [80]:
get_sections('INST201')

[{'instructor': ['Instructor: TBA'],
  'meeting_place': 'SQH 1120',
  'meeting_time': 'TuTh 12:30pm - 1:45pm',
  'section_id': '0101'},
 {'instructor': ['Instructor: TBA'],
  'meeting_place': 'TWS 0310',
  'meeting_time': 'MWF 10:00am - 10:50am',
  'section_id': '0102'},
 {'instructor': ['Kelly Hoffman'],
  'meeting_place': 'SQH 1119',
  'meeting_time': 'TuTh 3:30pm - 4:45pm',
  'section_id': '0103'},
 {'instructor': ['Instructor: TBA'],
  'meeting_place': 'TYD 1132',
  'meeting_time': 'MW 5:00pm - 6:15pm',
  'section_id': '0104'}]

In [81]:
course_divs = courses_soup.select('.course')

courses_list = []
for course_div in course_divs:
    course_id = course_div.select_one('.course-id').text
    course_title = course_div.select_one('.course-title').text
    course_credits = course_div.select_one('.course-min-credits').text
    sections = get_sections(course_id)
    print(course_id)
    course_dict = {'course_id': course_id,
                   'course_title': course_title,
                   'course_credits': course_credits,
                   'sections': sections}
    courses_list.append(course_dict)

INST126
INST201
INST309
INST309B
INST311
INST314
INST326
INST327
INST335
INST346
INST352
INST354
INST362
INST377
INST414
INST447
INST462
INST466
INST490
INST604
INST610
INST613
INST615
INST620
INST627
INST630
INST631
INST641
INST643
INST647
INST650
INST651
INST652
INST660
INST701
INST706
INST709
INST714
INST728L
INST733
INST735
INST737
INST750
INST760
INST775
INST782
INST784
INST799
INST800
INST810
INST888
INST898
INST899
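
The motivating question for this section was being told when the listings change. One way this might be approached, sketched below, is to save the collected list on each run and compare the new snapshot with the previous one. The two snapshots here are written by hand; the changed meeting time in the second one is hypothetical, and the function names are invented for illustration.

```python
def section_keys(courses):
    """Return a set of (course_id, section_id, meeting_time) tuples for comparison."""
    return {
        (course["course_id"], section["section_id"], section["meeting_time"])
        for course in courses
        for section in course["sections"]
    }


def diff_listings(old, new):
    """Return (added, removed) section tuples between two snapshots, each sorted."""
    old_keys, new_keys = section_keys(old), section_keys(new)
    return sorted(new_keys - old_keys), sorted(old_keys - new_keys)


# Two hand-written snapshots; the second moves section 0102 to a hypothetical new time.
old = [{"course_id": "INST126",
        "sections": [{"section_id": "0101", "meeting_time": "MWF 2:00pm - 2:50pm"},
                     {"section_id": "0102", "meeting_time": "TuTh 9:30am - 10:45am"}]}]
new = [{"course_id": "INST126",
        "sections": [{"section_id": "0101", "meeting_time": "MWF 2:00pm - 2:50pm"},
                     {"section_id": "0102", "meeting_time": "TuTh 11:00am - 12:15pm"}]}]

added, removed = diff_listings(old, new)
print("added:", added)
print("removed:", removed)
```

A changed meeting time shows up as one removed tuple and one added tuple; unchanged sections appear in neither list.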
