<a href="https://colab.research.google.com/github/sjslack18/DH100-Project/blob/main/DH100_Notebook_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
library(rvest)
library(tidyverse)
library(lubridate)


# The following functions extract page IDs from pages on The APP website
# that contain a list of links, and extract a president's name from the
# title of a page.

# function to extract text from hyperlink (as character)
extract_link_title <- function(anchor_tag) {
  return(unlist(strsplit(unlist(strsplit(anchor_tag, '>'))[2], '<'))[1])
}
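# function to extract target URL from hyperlink (as character)
extract_link_page_id <- function(anchor_tag) {
  # parse the anchor tag string, then read its href attribute
  return(read_html(anchor_tag) %>% html_node('a') %>% html_attr('href'))
}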
# function to extract president's name from title
extract_president <- function(title) {
  return(unlist(strsplit(title, ':'))[1])
}

“running command 'timedatectl' had status 1”
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.3     ✔ purrr   0.3.4
✔ tibble  3.1.2     ✔ dplyr   1.0.6
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ readr::guess_encoding() masks rvest::guess_encoding()
✖ dplyr::lag()            masks stats::lag()

Attaching package: ‘lubridate’

The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




In [None]:
# function for date access
get_date <- function(page_data) {
  page_date <- page_data %>%
    html_nodes(xpath = "//*[@id='block-system-main']/div/div/div[1]/div[2]/span") %>%
    html_attr("content") %>%
    as_datetime()
  return(page_date)
}

In [None]:
# The deglaze() function will scrape content from the site, given a page ID,
# and do some minimal preprocessing (assemble title, date, and page text).

deglaze <- function(page_id) {
  page_data <- read_html(paste0('http://www.presidency.ucsb.edu', page_id))
  page_title <- page_data %>%
    html_node('title') %>%
    html_text()
  page_date <- page_data %>%
    get_date()
  page_text <- page_data %>%
    html_nodes('p') %>%
    html_text() %>%
    paste(collapse = ' ')
  return(as_tibble(cbind(title = page_title, text = page_text)) %>%
   mutate(date = page_date))
}

The code above is adapted from kshaffer's GitHub page. I ran into many problems with it and ended up writing most of the scraper from scratch. The first cell is mostly kshaffer's and is largely unused; the deglaze function is his except for the page_date code.

In [None]:
test_links <- read_html('https://www.presidency.ucsb.edu/documents/app-categories/pressmedia/press-briefings?items_per_page=60') %>%
  html_nodes('a') %>%
  html_attr("href") %>%
  as_tibble() %>%
  filter(grepl('/documents/press', value, fixed = TRUE))
test_links[1:5, 1]


This cell tests a different way of gathering the links to scrape: it reads the href attribute directly from each anchor node. It works for the links, but I could not find a way to get the titles from the same nodes.
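One possible way to keep titles next to the links is to read the link text and the href from the same anchor nodes, so the two columns stay aligned. The sketch below is only an illustration: it assumes the link text is a usable title, and the inline sample_html and the links_with_titles name are made up so that it runs without hitting the site.

```r
library(rvest)

# Hypothetical stand-in for the listing page, so the sketch runs offline.
sample_html <- '<html><body>
  <a href="/documents/press-briefing-example-one">Press Briefing Example One</a>
  <a href="/about">About</a>
  <a href="/documents/press-briefing-example-two">Press Briefing Example Two</a>
</body></html>'

anchors <- read_html(sample_html) %>% html_nodes('a')

# Read the link text and the href from the same nodes so rows stay aligned.
links_with_titles <- data.frame(
  title   = html_text(anchors),
  page_id = html_attr(anchors, 'href'),
  stringsAsFactors = FALSE
)

# Keep only press-related document links, using the same filter as the cell above.
links_with_titles <- links_with_titles[
  grepl('/documents/press', links_with_titles$page_id, fixed = TRUE), ]
```

Whether the link text on the real listing page matches the page title would still need to be checked against the site.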

In [None]:
# scraping one page for links

#having difficulties with this, will try another way

# pb_links <- read_html('https://www.presidency.ucsb.edu/documents/app-categories/pressmedia/press-briefings?items_per_page=60') %>%
# html_nodes('a') %>%
# as.character() %>%
# as_tibble() %>%
# unique() %>%
# filter(grepl('/documents/press',value,fixed=TRUE)) %>%
# mutate(title = mapply(extract_link_title, value),
#          page_id = value) %>%
#   select(title, page_id)
# test = pb_links$title[1]
# extract_link_page_id(test)

Unused for now. I had problems with the page_id column: the value did not seem to work with either regex or XML operations.

In [None]:
# # function to scrape all links from page into tibble (to be appended iteratively)
# link_scraper <- function(page_address) {
# link_tibble <- read_html(page_address) %>%
# html_nodes('a') %>%
# as.character() %>%
# as_tibble() %>%
# unique() %>%
# filter(grepl('/documents/press-briefing',value,fixed=TRUE)) %>%
# mutate(title = mapply(extract_link_title, value),
#          page_id = mapply(extract_link_page_id, value)) %>%
#   select(title, page_id)
# }

In [None]:
link_scraper <- function(page_address) {
  read_html(page_address) %>%
    html_nodes('a') %>%
    html_attr("href") %>%
    as_tibble() %>%
    filter(grepl('/documents/press', value, fixed = TRUE))
}


New link scraper function based on the test above. It returns only the relative link paths (the href values), not the titles.

In [None]:
# compile all pages to scrape links for
page_links <- paste0('https://www.presidency.ucsb.edu/documents/app-categories/pressmedia/press-briefings?items_per_page=60&page=', 1:105)

# create tibble for first page (address has different format)
pb_links <- link_scraper('https://www.presidency.ucsb.edu/documents/app-categories/pressmedia/press-briefings?items_per_page=60')

# scrape all links iteratively, appending to the tibble after each iteration
for (page in page_links) {
  new_links <- link_scraper(page)
  pb_links <- rbind(pb_links, new_links)
}



In [None]:
# testing deglaze function
# [[ ]] extracts the link as a character string rather than a 1x1 tibble
test_page <- pb_links[[367, 1]]
test_results <- deglaze(test_page)


In [None]:
# scrape all actual data in this step

# use [[ ]] throughout so deglaze() always receives a character string
scraped_data <- deglaze(pb_links[[1, 1]])
for (row in 2:nrow(pb_links)) {
  new_row <- deglaze(pb_links[[row, 1]])
  scraped_data <- rbind(scraped_data, new_row)
}

scraped_data[1:5, ]



Scraping and compiling the data into one very long tibble happens here. This cell takes a long time to run and still seems to be problematic, even though each individual step executes without error.
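A long loop like this can be made more tolerant of single-page failures by catching errors per page and binding all rows once at the end, instead of calling rbind() on every iteration. The sketch below is only an illustration of that pattern, not the notebook's method: fake_deglaze(), page_ids, results, failed, and scraped are hypothetical names, and fake_deglaze() stands in for deglaze() so that the example runs without network access.

```r
library(dplyr)

# Hypothetical stand-in for deglaze(): fails on one ID to mimic a bad page.
fake_deglaze <- function(page_id) {
  if (page_id == '/documents/bad-page') stop('could not read page')
  data.frame(title = paste('Title for', page_id), text = 'page text',
             stringsAsFactors = FALSE)
}

page_ids <- c('/documents/a', '/documents/bad-page', '/documents/b')

# Scrape each page; a failure yields NULL instead of stopping the whole run.
results <- lapply(page_ids, function(id) {
  tryCatch(fake_deglaze(id), error = function(e) NULL)
})

# Record which pages failed so they can be retried later.
failed <- page_ids[vapply(results, is.null, logical(1))]

# Bind all successful rows in one step; NULL entries are dropped.
scraped <- bind_rows(results)
```

With the real deglaze(), a short Sys.sleep() between requests may also be worth considering to avoid hammering the site.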

In [None]:
# Write to CSV
head(scraped_data)
write.csv(scraped_data, "scraped_data.csv", row.names = FALSE)
# from google.colab import files
# files.download("scraped_data.csv")

After the scraped data is written to a CSV and uploaded to Drive, a separate notebook can load it, so the cells above do not have to run again. This cuts down on runtime by avoiding a rescrape, and it lets me work in Python separately, which has better NLP tools.
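Before moving to the other notebook, it may be worth checking that the CSV round-trips cleanly. The sketch below assumes a small made-up data frame (sample_data) in place of scraped_data and a temporary file in place of scraped_data.csv. It shows that text containing commas and quotes survives, and that the date column comes back as text and has to be parsed again.

```r
# Hypothetical two-row stand-in for scraped_data.
sample_data <- data.frame(
  title = c('Press Briefing One', 'Press Briefing Two'),
  text  = c('Text with, a comma and "quotes".', 'Second text.'),
  date  = as.POSIXct(c('2021-01-20 12:00:00', '2021-01-21 12:00:00'), tz = 'UTC'),
  stringsAsFactors = FALSE
)

# Write to a temporary file rather than the real output path.
csv_path <- tempfile(fileext = '.csv')
write.csv(sample_data, csv_path, row.names = FALSE)

# Dates are read back as character strings, so parse them again after reading.
reloaded <- read.csv(csv_path, stringsAsFactors = FALSE)
reloaded$date <- as.POSIXct(reloaded$date, tz = 'UTC')
```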