#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science – Summer '22

# Notebook 7: Web Scraping and APIs

Web scraping is a big topic. There are many reasons someone might want to scrape web content, but the one that applies to this class is getting data that may be useful for answering questions about social phenomena.

People who provide web content are typically well aware of the tools that exist for web scraping. A site's Terms of Use (or equivalent) will often mention automated scraping, usually to prohibit it.

I'll just make two points here. First, the desirability of the data on a site is probably positively correlated with how difficult the site makes it to scrape, which can be unfortunate. Often, data that are interesting to us also have monetary value, and sites don't like to give them away for free.

Second, we should try to be clear about what we mean by "web scraping." We are typically referring to accessing a website's content in a way that is mediated by a tool or set of tools, which makes it qualitatively different from browsing the web normally. As we'll see in our first example using the `requests` library, this can be as simple as using a line of Python code to store the results of a web search in memory, rather than rendering them directly and immediately in a browser. We can then view what we've scraped (e.g., as rendered HTML), which wouldn't be much different from normal browsing. We could also save it, or save some feature or set of features we've extracted from it. Doing this repeatedly is typically where things become problematic.

At the most basic level, repeatedly scraping a site (or some part of it) means making repeated requests of the site's servers. That can be a problem in itself. The first point above just adds to this: sites may also want to protect their data, and may make it available subject to terms that prohibit automated scraping. Content is also served in different ways. Static websites are much easier to scrape than dynamic ones, which require a different approach.
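
To make "repeated requests" concrete, here is a minimal sketch of pacing a series of requests so that a server is not hit in rapid succession. The function name `fetch_politely`, its parameters, and the delay values are illustrative assumptions rather than part of any library. `fetch` stands in for whatever function actually retrieves a page (in practice it might be `requests.get`); the example uses a stand-in so that no network requests are made.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call `fetch(url)` for each URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns a result
    (hypothetical here; in practice it might be `requests.get`).
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results


# Example with a stand-in fetcher that makes no network requests
pages = fetch_politely(["page-1", "page-2"], fetch=str.upper, delay_seconds=0.1)
print(pages)
```

A pause like this does not make scraping permitted where a site's terms prohibit it; it only reduces the load that permitted requests place on the server.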

One compromise many sites make is to offer an application programming interface (API). In this notebook, we're going to keep our focus on getting data that may be useful for answering social research questions. Toward that end, we'll explore scraping static web content with an eye toward getting Twitter user handles for members of the US Senate. Finally, we'll use an API to access data archived from Reddit.

To clarify the purpose of this notebook, I want to draw attention to one of the early points made in the [Luscombe et al. (2022) reading](https://doi.org/10.1007/s11135-021-01164-0):

> In practice, scraping is often closer to an art than a science, and can take years of practice to master (Possler et al. 2019). At the same time, it is a craft that requires continuous learning and problem solving, particularly as website development evolves and becomes ever more complex and thereby less accessible using existing tools. (p. 1024)

Websites are structured very differently, so it is often the case that code for scraping must be tailored to a particular website. Additionally, websites change. Code that worked for a particular website at one point in time may stop working if the website changes. This notebook is meant to give students without any experience with web scraping a gentle introduction to scraping static content so that you can get a sense of whether it is worth the trouble. This notebook will *not* provide you with code (or permission!) to scrape any and all of the sites you might be interested in. We will also go through an example of using an API, which is more likely to be directly useful for the class.

In the short term (e.g., for class projects), it will be easier to simply download a corpus that is readily available. APIs may offer a middle ground between using a ready-made corpus and building a scraper, but some APIs are tricky to access and use. There are tools available for working with Twitter's API, for example, [but you must request access to the API for your project](https://developer.twitter.com/en/docs/twitter-api) and wait for approval, which isn't guaranteed. Reddit's official API also [has limitations relevant to this class](https://www.reddit.com/wiki/api-terms/) (for example, you must be of legal age to sign a contract).

If you are interested in Reddit data, one excellent resource is [pushshift.io](https://pushshift.io/). pushshift.io archives data from sites like Reddit and also provides an API for more specific searches. If you are interested in submissions to Reddit, for example, you can download them as bulk files for individual months throughout Reddit's history. However, the files from recent years have become quite large and may be difficult to work with if you don't have a lot of free disk space. The pushshift API can help you get content that is more directly related to your research question (for example, submissions to specific subreddits from within a specific period of time), but there are some downsides to working with it.

## Setup

For this notebook, you'll need to install `beautifulsoup4` and `psaw`.

If you use Anaconda, you can install `beautifulsoup4` by running the following line in the Anaconda Prompt (or a terminal):

```
conda install -c anaconda beautifulsoup4 
```

Otherwise, you can install it using `pip`. You will need to install `psaw` using `pip` regardless. (Depending on your setup, you may need to use `pip3` instead.)

```
pip3 install --user beautifulsoup4
pip3 install --user psaw
```

In [None]:
import datetime as dt
import pandas as pd
import requests
import time

from bs4 import BeautifulSoup
from IPython.display import display, HTML
from psaw import PushshiftAPI

## Web Scraping with Requests and BeautifulSoup

### Example 1. Rendering Search Results inside Jupyter

At its most basic level, "scraping the web" is just using a computer to access web content in a different way. The next two cells show how we can use the `requests` library to store the results of a web search in memory (in a variable we'll call <tt>results</tt>), which we can then render inside the notebook.

We'll use `requests.get()` to get the web content we want to examine. The [`requests` library](https://docs.python-requests.org/en/master/) enables us to make HTTP requests, even with authentication.

Running the second cell may change the way the notebook is displayed. You can comment it out and run the cell again if needed.

In [None]:
url = "https://www.google.com/search?q=weather+stanford"
results = requests.get(url)

In [None]:
display(HTML(results.text))

### Example 2. Scraping Quotes from a Scraping Sandbox

To get a sense of how scraping static content works, we'll start with a sandbox designed for this purpose. https://toscrape.com/ offers a couple of environments, including a [fictional bookstore](https://books.toscrape.com/). Since this is a class on text analysis, we're going to take a look at [another page](https://quotes.toscrape.com/), which displays quotes. When we make a request, we're hoping for a [response code of 200](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/200). You can read more about other response codes [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
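
For reference, here is a small sketch of how numeric response codes map to their standard names, using only Python's standard library. The helper names `describe_status` and `is_success` are illustrative assumptions, not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '200 OK' for a standard HTTP status code."""
    # HTTPStatus(code) raises ValueError if the code is not a standard one
    return f"{code} {HTTPStatus(code).phrase}"


def is_success(code):
    """True for 2xx codes, which indicate that the request succeeded."""
    return 200 <= code < 300


for code in (200, 404, 429):
    print(describe_status(code), "| success:", is_success(code))
```

Once a request has been made, `describe_status(quotes_page.status_code)` would label its status in the same way. `requests` also provides `Response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.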

In [None]:
url = "https://quotes.toscrape.com/"
quotes_page = requests.get(url)

In [None]:
quotes_page  # displays <Response [200]> when the request succeeds

The first thing to note is that the result is a `Response` object whose `.text` attribute holds the page's HTML as a string. If you type "quotes_page." (ending with a period) and press the `tab` key, Jupyter will list several attributes you can explore, like the status code and headers.

In [None]:
print(quotes_page.text[:500]) # first 500 characters

In [None]:
quotes_page.status_code

In [None]:
quotes_page.headers

We'll use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the result and find the content we are interested in.

In [None]:
soup = BeautifulSoup(quotes_page.text, "html.parser")

In [None]:
type(soup)

In [None]:
print(soup.prettify()[:500])

We can now search the <tt>soup</tt> for all kinds of content. If you type "soup." (ending with a period) in a Code cell and press the `tab` key, Jupyter will show different attributes or methods that are available.

At this stage, scraping benefits from some knowledge of HTML. BeautifulSoup will allow us to access pieces of the page we scraped using various tags.
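
As a small offline illustration of tag-based access, the sketch below parses a made-up HTML snippet (not taken from any real site) and pulls out a few pieces. The variable name `example_soup` is used so that it does not overwrite the <tt>soup</tt> from the quotes page.

```python
from bs4 import BeautifulSoup

# A made-up HTML snippet, used only for illustration
html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <h1>Favorite Links</h1>
    <p class="intro">Two links follow.</p>
    <a href="https://example.com/one">First</a>
    <a href="https://example.com/two">Second</a>
  </body>
</html>
"""

example_soup = BeautifulSoup(html, "html.parser")

print(example_soup.title.text)   # the text inside the <title> tag
print(example_soup.p["class"])   # attributes are accessed like dictionary keys
print(example_soup.a["href"])    # `.a` returns only the first <a> tag
print([a.text for a in example_soup.find_all("a")])  # `.find_all()` returns every <a> tag
```

Accessing a tag name as an attribute (such as `.a`) gives only the first match, which is why `.find_all()` is the tool to reach for when a page repeats the same structure many times.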

In [None]:
soup.title

In [None]:
soup.h1

In [None]:
soup.p

In [None]:
soup.a

In [None]:
soup.find_all("a")

Here we print one `div` section (a chunk of the HTML) that shows a single quote and the author.

In [None]:
print(soup.prettify()[600:1538])

The `.find_all()` method can be used for various types of content. Here we use it to get all of the `div` tags containing quotes. We then use `.find_all()` on each result to find the `span` tags nested inside. We use Python's `str.replace()` method to get rid of some unwanted text specific to this example and print the results.

In [None]:
# the first result for the "quote" class
for div in soup.find_all(class_="quote"):
    print(div)
    break

In [None]:
# the first span in the first div
for div in soup.find_all(class_="quote"):
    for span in div.find_all("span"):
        print(span)
        break
    break

In [None]:
# the text we want from inside each span in a quote-class div
for div in soup.find_all(class_="quote"):
    for span in div.find_all("span"):
        print(span.text.replace("(about)", ""))
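
Printing is fine for exploring, but for analysis we usually want structured records. The sketch below shows one way to collect each quote and its author into a list of dictionaries. It runs on simplified, made-up HTML modeled on the quote `div` printed earlier; the class names `text` and `author` are assumptions that should be checked against the real page before relying on them.

```python
from bs4 import BeautifulSoup

# Simplified, made-up HTML modeled on a quote div; the class names "text"
# and "author" are assumptions to verify against the real page.
html = """
<div class="quote">
  <span class="text">"A first quote."</span>
  <span>by <small class="author">Author One</small> <a href="/author/one">(about)</a></span>
</div>
<div class="quote">
  <span class="text">"A second quote."</span>
  <span>by <small class="author">Author Two</small> <a href="/author/two">(about)</a></span>
</div>
"""

sandbox_soup = BeautifulSoup(html, "html.parser")

records = []
for div in sandbox_soup.find_all("div", class_="quote"):
    records.append({
        "quote": div.find("span", class_="text").get_text(strip=True),
        "author": div.find("small", class_="author").get_text(strip=True),
    })

print(records)
```

Selecting elements by class avoids the need to strip out text like "(about)" afterward, at the cost of depending on class names that the site could change.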

### Example 3. Something Useful: Identifying Twitter Handles of Members of the Senate

As we've noted, at its most basic level scraping is just accessing a site. Here we will scrape a "real" website–but we are only going to make *one* request. Specifically, we'll get the Twitter handles (along with state and party) of each current US senator from a site maintained by the UC San Diego Library. (Please be respectful of the site and don't spam them with requests!)

In [None]:
url = "https://ucsd.libguides.com/congress_twitter/senators"

In [None]:
senate_page = requests.get(url)

In [None]:
print(senate_page.text)

In [None]:
soup = BeautifulSoup(senate_page.text, "html.parser")

You can compare the way the HTML is printed when using `.prettify()` on <tt>soup</tt> to printing the text from the original result from `requests`.

In [None]:
print(soup.prettify())

If you explore the site in a browser or just scroll through the <tt>soup</tt>, you can see that the names, states, parties, and Twitter handles of the senators are arranged in a table, which is convenient for us. We'll use `.find_all()` to identify the table.

In [None]:
len(soup.find_all("table"))

In [None]:
tables = soup.find_all("table")
for table in tables:
    print(type(table), len(table))

We can also see that the info we want is inside `tr` tags, which are rows.

In [None]:
print(str(tables[0])[:1000])

The information we want for each senator (name, handle, state, and party) is contained in one row. The handle is in the URL of the `a` tag, while the senator's name is in the text of that tag. The state and party are in additional `td` tags.

In [None]:
tables[0].find_all("tr")[1]

Here we use [`enumerate()` (guide)](https://pythonbasics.org/enumerate/) with a for loop just to look at the first few results.

This code finds all of the `tr` tags, ignores any without a link (e.g., to a Twitter account), finds all of the elements of the `ck_border` class, and prints the text. This prints the senator's name, state, and party. The `a` tag's attributes are like a dictionary, and the value for the key "href" is the URL to the senator's Twitter.

In [None]:
for i, result in enumerate(soup.find_all("tr")):
    if i < 4:
        if result.a:
            for element in result.find_all(class_="ck_border"):
                print(element.text)
            print(result.a.attrs["href"])
        print()

Now that we have figured out the way the information is structured, we will extract the name, state, party, and Twitter handle for each US senator. We'll create an empty list called <tt>senator_data</tt> to store the data initially. We'll use a nested for loop just like the one above, but this time we'll append each senator's name, state, party, and handle to a list called <tt>row</tt> before appending that row (one per senator) to <tt>senator_data</tt>.

In [None]:
senator_data = []

for result in soup.find_all("tr"):
    if result.a:
        row = []
        for element in result.find_all(class_="ck_border"):
            row.append(element.text)
        handle = result.a.attrs["href"]
        handle = handle.replace("https://twitter.com/", "")
        row.append(handle)
        senator_data.append(row)
    else:
        print(result) # show the rows that aren't added to the dataset we're making
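
The cell above removes the `https://twitter.com/` prefix with `str.replace()`, which works only if every link uses exactly that prefix. As a hedged alternative, the sketch below parses the URL with the standard library and takes the first path segment. The function name `handle_from_url` and the example handles are made up for illustration.

```python
from urllib.parse import urlparse


def handle_from_url(url):
    """Return the first path segment of a profile URL (e.g., a Twitter handle)."""
    path = urlparse(url).path  # e.g., "/ExampleSenator/"
    segments = [part for part in path.split("/") if part]
    return segments[0] if segments else ""


# Made-up handles, used only to show the behavior
print(handle_from_url("https://twitter.com/ExampleSenator"))
print(handle_from_url("http://www.twitter.com/ExampleSenator/"))
print(handle_from_url("https://twitter.com/ExampleSenator?lang=en"))
```

All three calls return the same handle, because the scheme, the `www.` prefix, the trailing slash, and the query string all fall outside the first path segment.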

In [None]:
senator_data[:5]

In [None]:
len(senator_data)

Now we will create a pandas dataframe from this list of lists. The `columns` argument lets us name the columns in the resulting dataframe.

In [None]:
df = pd.DataFrame(senator_data, columns=["senator", "state", "party", "twitter_handle"])

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.to_csv("senate_twitter_dataframe.csv", index=False)  # save the dataframe as a CSV without the index column

## Scraping Reddit Content using `psaw`

Another amazing resource for social media data is [pushshift.io](https://pushshift.io/), which archives vast amounts of data and makes it easily accessible. We'll use the [`psaw` library](https://github.com/dmarx/psaw) to access content from the pushshift.io Reddit API. Please be respectful of the service that pushshift.io offers. For example, if you want to use the API to get your own data, please request only a small amount of data first so that you can prototype everything, then request only as much data as you need and do so at a moderate pace. You may be temporarily blocked from using the API if you request too much too fast.
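
One common way to keep to a moderate pace is to wait longer after each failed attempt instead of retrying immediately. The sketch below is a generic illustration of that idea, not something `psaw` or pushshift.io prescribes. The helper `call_with_backoff` and the stand-in function `flaky_request` are hypothetical, and no network requests are made.

```python
import time


def call_with_backoff(request_fn, max_attempts=4, base_delay=1.0):
    """Call `request_fn()` and retry with exponentially growing pauses on failure.

    Hypothetical helper: `request_fn` is any zero-argument callable that
    raises an exception when a request fails.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:  # in real code, catch only the errors you expect
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay


# Stand-in for a request that fails twice before succeeding
attempts = {"count": 0}


def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("temporarily blocked")
    return "ok"


print(call_with_backoff(flaky_request, base_delay=0.1))
```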

For this example, we'll get posts to r/WallStreetBets from the last week of January 2021. During this time, there was a lot of excitement about the rise of GameStop stock, and then trading was halted on some platforms, [such as Robinhood](https://www.reuters.com/business/us-congress-hold-hearings-gamestop-trading-state-stock-markets-2021-01-28/).

First, create an instance of the `PushshiftAPI()` class.

In [None]:
api = PushshiftAPI()

We'll use the helper function <tt>get_results_from_pushshift()</tt> to turn the results we get into a list.

In [None]:
def get_results_from_pushshift(subreddit: str, start_epoch: int, before_epoch: int, limit: int = 10) -> list:
    """Fetch up to `limit` submissions to `subreddit` between `start_epoch` and `before_epoch`."""
    results = list(
        api.search_submissions(
            after=start_epoch,
            before=before_epoch,
            subreddit=subreddit,
            limit=limit,
        )
    )
    return results

In [None]:
%%time

wsb = []

year = 2021  # the GameStop surge described above took place in January 2021
month = 1
days = range(24, 31)  # January 24 through January 30

epochs = []

for day in days:
    start_epoch = int(dt.datetime(year, month, day).timestamp())
    try:
        before_epoch = int(dt.datetime(year, month, day + 1).timestamp())
    except ValueError:  # day + 1 does not exist in this month
        before_epoch = int(dt.datetime(year, month + 1, 1).timestamp())  # first day of next month
        
    epochs.append((start_epoch, before_epoch))
    results = get_results_from_pushshift("WallStreetBets", start_epoch, before_epoch)
    wsb.append(results)
    time.sleep(1)

In [None]:
wsb

In [None]:
wsb_flat = [post for sublist in wsb for post in sublist] # turn the list of lists into list of posts

In [None]:
len(wsb_flat)

In [None]:
wsb_df = pd.DataFrame([post.d_ for post in wsb_flat])

In [None]:
wsb_df.head()

In [None]:
wsb_df.shape

In [None]:
wsb_df[["author", "title", "selftext", "score"]]

This event spawned a number of related subreddits. I have not personally followed these closely, but there is a lot there that could be looked at sociologically, such as the dynamics of what could be considered a social movement or particular beliefs (and/or language) about certain stocks, companies, and regulatory agencies.

You can modify the code above to look at different subreddits or periods of time. The code cell we use to collect the data is only looking at one month (January) in one year (2021), so there is a single loop that iterates through specific days. If you want to look at multiple months or years, you can nest for loops and iterate through those. Just bear in mind the amount of data you are requesting!
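
As an alternative to nesting loops over years, months, and days, the sketch below generates one (start, before) pair of epochs per day using `datetime.timedelta`, which rolls over month and year boundaries on its own. The function name `daily_epoch_windows` is an illustrative assumption. Note that it computes timestamps in UTC, whereas the data collection cell above uses the computer's local time zone.

```python
import datetime as dt


def daily_epoch_windows(start_date, end_date):
    """Return (start_epoch, before_epoch) pairs, one per day, for start_date <= day < end_date.

    Timestamps are computed in UTC (an assumption made for this sketch).
    """
    windows = []
    day = start_date
    while day < end_date:
        next_day = day + dt.timedelta(days=1)  # rolls over months and years automatically
        start = dt.datetime(day.year, day.month, day.day, tzinfo=dt.timezone.utc)
        before = dt.datetime(next_day.year, next_day.month, next_day.day, tzinfo=dt.timezone.utc)
        windows.append((int(start.timestamp()), int(before.timestamp())))
        day = next_day
    return windows


# Three days that span a month boundary: January 30, January 31, and February 1
windows = daily_epoch_windows(dt.date(2021, 1, 30), dt.date(2021, 2, 2))
print(len(windows))
print(windows[0])
```

Each pair could be passed to <tt>get_results_from_pushshift()</tt> in place of the values computed inside the loop, with a pause between requests as before.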