# Using R to Gather Data From Websites

Both R and Python have packages that allow them to interact with your computer, and in particular to gather data from various places your computer can reach.

For example, if you have a set of PDF files, you can write scripts that go through those files and gather information. I could give you every syllabus the school has used in the last two decades, and we could write a script that goes through them all, tries to identify the statements about accommodations, and analyzes how the words instructors use have evolved over those 20 years.

Another common example is gathering data from websites. The standard example, if you google how to do this, is pulling weather data off one of the NOAA websites. More recently, however, I have seen this used in the following ways:

1. A research project by Chad Topaz studying biases in sentencing decisions by federal judges. They used R to automatically gather data from the federal courts website and compile the records into an analyzable dataset.

2. Marketing companies frequently do this to study the competition for a product. For example, one might imagine a consultant for UNC searching a large set of university websites for keywords like "Data Science".

3. Recently a reporter crawled the websites of the state of Missouri and discovered web pages where the Social Security numbers of teachers were in the HTML of the site: not displayed, but being used as tags. I suspect he found this by reading the data on the website as we do below and then searching for things formatted like an SSN.
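To make "searching for things formatted like an SSN" concrete, here is a minimal sketch in base R. The lines of HTML and the numbers in them are invented for illustration, and the pattern (three digits, two digits, four digits, separated by hyphens) is only my guess at how such a search might be done.

```r
# Hypothetical lines of HTML; the names and numbers are made up.
page_lines <- c(
  '<div class="teacher" id="123-45-6789">Jane Doe</div>',
  '<p>Office phone: 555-0100</p>',
  '<div class="teacher" id="987-65-4321">John Roe</div>'
)

# Three digits, two digits, four digits, separated by hyphens,
# with word boundaries so longer digit runs are not matched.
ssn_pattern <- "\\b[0-9]{3}-[0-9]{2}-[0-9]{4}\\b"

# Which lines contain something SSN-shaped?
hits <- grep(ssn_pattern, page_lines, perl = TRUE)

# Pull out the matching substrings themselves.
matches <- unlist(
  regmatches(page_lines, gregexpr(ssn_pattern, page_lines, perl = TRUE))
)

hits
matches
```

A pattern like this will also flag other numbers that happen to have the same shape, so anything it finds would still need to be checked by hand.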

Since we have questions about what prerequisites courses have, I thought it would be interesting to show how I would use R to pull the university course catalog into R, where we could then analyze various pieces of it.

This is a complicated problem, and it will require us to learn a little bit about how commercial websites are structured, so rather than have it pre-prepared, we are going to work through it together.

Note - I had hoped that the websites displaying university catalogs were of common enough stock that we would be able to use the same code to analyze different ones. Unfortunately the answer to that is no.

I'm going to show you two approaches. One is a brute-force method that just pulls everything off the website and then searches for the keywords we want. This is good, but if those words show up in multiple places, or if we need the structure of the website too, it might not be exactly what we need. However, because it does not use the structure of the website, it is more robust and will work nearly the same from site to site.

Method 2 is going to use the rvest library, and will involve making use of the structure of the website design to navigate to the parts of it we want to gather.

## Method 1

In [1]:
# Method 1 is to just read the website:
course_catalog = readLines("https://unco.smartcatalogiq.com/en/Current/Undergraduate-Catalog/Course-Descriptions")

“incomplete final line found on 'https://unco.smartcatalogiq.com/en/Current/Undergraduate-Catalog/Course-Descriptions'”


In [5]:
course_catalog[3]

In [6]:
grep('Linear Algebra', course_catalog)
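By default `grep()` only reports which line numbers matched. A small sketch of how one might look at the matching lines themselves and strip the HTML tags from them is below. The `catalog_lines` vector is an invented stand-in for the output of `readLines()`, and the course codes in it are placeholders, not real catalog entries.

```r
# Stand-in for the output of readLines(); these lines are invented.
catalog_lines <- c(
  '<h2 class="course-name"><a href="#">MATH 000 Linear Algebra</a></h2>',
  '<div class="desc">Vectors, matrices, and systems of equations.</div>',
  '<h2 class="course-name"><a href="#">MATH 001 Calculus</a></h2>'
)

# grep() returns line numbers by default; value = TRUE returns the lines.
hit_index <- grep("Linear Algebra", catalog_lines, fixed = TRUE)
hit_lines <- grep("Linear Algebra", catalog_lines, fixed = TRUE, value = TRUE)

# Crude tag removal: delete anything between < and >, then trim whitespace.
hit_text <- trimws(gsub("<[^>]+>", "", hit_lines))

hit_index
hit_text
```

The tag-stripping regular expression is deliberately crude. It is fine for a quick look, but it knows nothing about the structure of the page, which is the motivation for the second method.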

## Method 2

In [36]:
library(rvest)
library(xml2)

In [37]:
course_catalog = read_html("https://unco.smartcatalogiq.com/en/Current/Undergraduate-Catalog/Course-Descriptions")

In [38]:
# The structure of this thing is much more complicated:
course_catalog[2]

$doc
<pointer: 0x559d4554e1f0>


In [132]:
# Using "Inspect" in the browser, we see that the headings containing the course names have the class course-name
course_titles <- course_catalog %>% html_nodes(".course-name") %>% html_text(trim=TRUE)
course_titles

If you check, though, the same trick does not work well for gathering, say, the descriptions or prerequisites attached to each course. For one thing, they are not inside the element that carries the course title; for another, they sit in elements with the class "desc", which is used for multiple things on the page.
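The mismatch is easy to see on a toy page. The sketch below builds a tiny invented page with `minimal_html()` in which each course has more than one `desc` block, then counts what each selector returns. The course names and links are made up; only the class names come from the real catalog page.

```r
library(rvest)

# A tiny stand-in page; the course text and links are invented.
page <- minimal_html('
  <div class="courselist">
    <h2 class="course-name"><a href="/c/one">AAA 100 First Course</a></h2>
    <div class="desc">Description of the first course.</div>
    <div class="desc"></div>
    <h2 class="course-name"><a href="/c/two">AAA 200 Second Course</a></h2>
    <div class="desc">Description of the second course.</div>
    <div class="desc"></div>
  </div>')

titles <- page %>% html_elements(".course-name") %>% html_text(trim = TRUE)
descs  <- page %>% html_elements(".desc") %>% html_text(trim = TRUE)

# The two vectors have different lengths, so they cannot simply be
# placed side by side as title/description pairs.
length(titles)
length(descs)
```

Because the lengths differ, selecting by class alone loses track of which description belongs to which course.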

In [118]:
course_list <- course_catalog %>% html_element(".courselist") %>% html_children()

In [137]:
course_list

{xml_nodeset (15174)}
 [1] <h2 class="course-name">\r\n\t\t<a href="/en/Current/Undergraduate-Catal ...
 [2] <div class="desc">\r\n\t\tReviews the emergence of Africana Studies as a ...
 [3] <div class="sc-credithours">\r\n\t\t<div class="credits">\r\n\t\t\t3\r\n ...
 [4] <div class="desc">\r\n\r\n\t</div>\n
 [5] <div class="desc">\r\n\r\n\t</div>\n
 [6] <div class="desc">\r\n\r\n\t</div>\n
 [7] <div class="sc-extrafield">\r\n\t\t<h3>\r\n\t\t\tCourse Attribute\r\n\t\ ...
 [8] <h2 class="course-name">\r\n\t\t<a href="/en/Current/Undergraduate-Catal ...
 [9] <div class="desc">\r\n\t\tAddresses social conditions that led to format ...
[10] <div class="sc-credithours">\r\n\t\t<div class="credits">\r\n\t\t\t3\r\n ...
[11] <div class="desc">\r\n\r\n\t</div>\n
[12] <div class="desc">\r\n\r\n\t</div>\n
[13] <div class="desc">\r\n\r\n\t</div>\n
[14] <div class="sc-extrafield">\r\n\t\t<h3>\r\n\t\t\tCourse Attribute\r\n\t\ ...
[15] <h2 class="course-name">\r\n\t\t<a href="/en/Current/Undergraduat

In [141]:
course_list[[1]]

{html_node}
<h2 class="course-name">
[1] <a href="/en/Current/Undergraduate-Catalog/Course-Descriptions/AFS-Africa ...

In [145]:
course_list[[1]] %>% html_children()

{xml_nodeset (1)}
[1] <a href="/en/Current/Undergraduate-Catalog/Course-Descriptions/AFS-Africa ...

In [146]:
course_list[[1]] %>% html_children() %>% html_text()

In [154]:
link = course_list[[1]] %>% html_children() %>% html_attr("href")
link
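Once we can pull a title and a link out of one node, one possible next step is to group the flat list of siblings into one record per course. The sketch below is an assumption about how that could be done, not a finished solution: it runs on a small invented page, treats every `course-name` heading as the start of a new course, keeps the first non-empty `desc` block as the description, and resolves the relative links against the site root with `url_absolute()`.

```r
library(rvest)
library(xml2)

# Invented stand-in page with the same flat layout as the catalog.
page <- minimal_html('
  <div class="courselist">
    <h2 class="course-name"><a href="/c/one">AAA 100 First Course</a></h2>
    <div class="desc">Description of the first course.</div>
    <div class="desc"></div>
    <h2 class="course-name"><a href="/c/two">AAA 200 Second Course</a></h2>
    <div class="desc">Description of the second course.</div>
    <div class="desc"></div>
  </div>')

children <- page %>% html_element(".courselist") %>% html_children()
classes  <- html_attr(children, "class")

# Each heading starts a new course; cumsum() numbers the courses.
is_title  <- classes %in% "course-name"
course_id <- cumsum(is_title)

titles <- html_text(children[is_title], trim = TRUE)
links  <- children[is_title] %>% html_element("a") %>% html_attr("href")

# Keep the first non-empty "desc" block that follows each heading.
is_desc   <- classes %in% "desc"
desc_text <- html_text(children[is_desc], trim = TRUE)
desc_id   <- course_id[is_desc]
descriptions <- vapply(seq_along(titles), function(i) {
  d <- desc_text[desc_id == i & desc_text != ""]
  if (length(d) > 0) d[1] else NA_character_
}, character(1))

courses <- data.frame(
  title       = titles,
  description = descriptions,
  # The hrefs are relative, so resolve them against the site root.
  link        = url_absolute(links, "https://unco.smartcatalogiq.com/"),
  stringsAsFactors = FALSE
)
courses
```

Whether the first non-empty `desc` block is always the description on the real page is something we would have to check against the actual catalog.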

In [163]:
# If the website has a table we want, things are even easier:
course_catalog %>% html_node("table")

{xml_missing}
<NA>
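The catalog page has no `<table>` element, which is why the result above is missing. For a page that does have one, `html_table()` converts it directly into a data frame. A minimal sketch on an invented table:

```r
library(rvest)

# Invented page containing a simple table.
page <- minimal_html('
  <table>
    <tr><th>Course</th><th>Credits</th></tr>
    <tr><td>AAA 100</td><td>3</td></tr>
    <tr><td>AAA 200</td><td>4</td></tr>
  </table>')

# html_table() uses the header row for column names
# and, by default, converts numeric-looking columns.
credits_table <- page %>% html_element("table") %>% html_table()
credits_table
```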