#### Sociology 128D: Mining Culture Through Text Data: Introduction to Social Data Science

# Notebook 7: Web Scraping and APIs

Web scraping is a big topic. There are a lot of reasons someone might want to scrape web content, but the reason applicable to this class is to get data that may be useful for answering questions about social phenomena.

People who provide web content are typically aware of the tools that exist for web scraping. A site's Terms of Use (or equivalent) will often address automated scraping, usually to prohibit it.

I'll just make two points here. First, the more desirable the data on a site, the more likely the site is to restrict scraping it. Second, we should try to be clear about what we mean by "web scraping."

Regarding the second point, we are typically referring to accessing a website's content in a way that's mediated by a tool or set of tools that makes it qualitatively different from browsing the web normally. As we'll see in our first example using the `requests` library, this can be as simple as using a line of Python code to store a web search in memory, rather than rendering it directly in a browser. We can then view what we've scraped (e.g., rendered HTML), which wouldn't be much different from normal browsing. We could also save it, or save some feature or set of features we've extracted from it; and doing this a lot is typically where things become problematic.

At the most basic level, repeatedly scraping a site (or some part of it) means making repeated requests of the site's servers. That can be a problem in itself. The first point above just adds to this: sites may also want to protect their data, and may make it available subject to terms that prohibit automated scraping. Content is also served in different ways. Static websites are much easier to scrape than dynamic ones, which require a different approach.
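
One machine-readable signal of what a site permits is its `robots.txt` file. It is separate from the Terms of Use and does not replace reading them, but it is easy to check. The sketch below uses the standard library's `urllib.robotparser` on a made-up rule set (the rules and the `example.com` URLs are assumptions for illustration), so it runs without making any requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real file lives at https://<site>/robots.txt.
robots_lines = [
    "User-agent: *",
    "Disallow: /search",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a generic crawler ("*") may fetch each URL under these rules.
search_allowed = parser.can_fetch("*", "https://example.com/search?q=weather")
about_allowed = parser.can_fetch("*", "https://example.com/about")

# The requested pause between requests in seconds, if the file declares one.
delay = parser.crawl_delay("*")

print(search_allowed, about_allowed, delay)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` would download the real file instead of parsing a hard-coded list.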

One compromise many sites make is to offer an application programming interface (API). In this notebook, we're going to keep our focus on getting data that may be useful for answering social research questions. Toward that end, we'll explore scraping static web content with an eye toward getting Twitter user handles for members of the US Senate, and we'll then use those handles to get tweets. Finally, we'll use an API to access data from Reddit.

In [None]:
import datetime as dt
import nest_asyncio
import pandas as pd
import requests
import time
import twint

from bs4 import BeautifulSoup
from IPython.display import display, HTML  # public import path; IPython.core.display is deprecated for this
from psaw import PushshiftAPI

nest_asyncio.apply()

## Web Scraping with Requests and BeautifulSoup

### Example 1. Rendering Search Results inside Jupyter

At its most basic level, "scraping the web" is just using a computer to access web content in a different way. The next two cells show how we can use the `requests` library to store the results of a web search in memory (in a variable we'll call <tt>results</tt>), which we can then render inside the notebook.

We'll use `requests.get()` to get the web content we want to examine. The [`requests` library](https://docs.python-requests.org/en/master/) enables us to make HTTP requests, even with authentication.

Running the second cell may change the way the notebook is displayed. You can comment it out and run the cell again if needed.

In [None]:
url = "https://www.google.com/search?q=weather+stanford"
results = requests.get(url)
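
The query string in the URL above (`q=weather+stanford`) does not have to be typed by hand. As a small sketch, the standard library's `urlencode()` builds it from a dictionary, which helps when search terms contain spaces or punctuation. The variable names `base_url`, `params`, and `search_url` are just illustrative.

```python
from urllib.parse import urlencode

base_url = "https://www.google.com/search"
params = {"q": "weather stanford"}

# urlencode() escapes the values and joins key=value pairs with "&".
search_url = f"{base_url}?{urlencode(params)}"
print(search_url)
```

`requests.get()` also accepts a `params` argument, so passing the dictionary directly is another option.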

In [None]:
# display(HTML(results.text))

### Example 2. Scraping Quotes from a Scraping Sandbox

To get a sense of how scraping static content works, we'll start with a sandbox designed for this purpose. https://toscrape.com/ offers a couple of environments, including a [fictional bookstore](https://books.toscrape.com/). Since this is a class on text analysis, we're going to take a look at [another page](https://quotes.toscrape.com/), which displays quotes.

In [None]:
url = "https://quotes.toscrape.com/"
quotes_page = requests.get(url)

The first thing to note is that `requests.get()` returns a `Response` object rather than a plain string; the page's HTML is available as a string in its `.text` attribute. If you type "quotes_page." (ending with a period) and press the `tab` key, Jupyter will list several attributes you can explore, like the status code and headers.

In [None]:
print(quotes_page.text[:500])

In [None]:
print(quotes_page.text[:200])

In [None]:
quotes_page.status_code

In [None]:
quotes_page.headers

We'll use [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) to parse the text and find the content we are interested in.

In [None]:
soup = BeautifulSoup(quotes_page.text, "html.parser")

In [None]:
type(soup)

In [None]:
print(soup.prettify()[:500])

We can now search the <tt>soup</tt> for all kinds of content. If you type "soup." (ending with a period) in a Code cell and press the `tab` key, Jupyter will show different attributes or methods that are available.

In [None]:
soup.h1

In [None]:
soup.p

In [None]:
soup.a

In [None]:
soup.find_all("a")[:10]

Here we print one `div` section (a chunk of the HTML) that shows a single quote and the author.

In [None]:
print(soup.prettify()[600:1538])

The `.find_all()` method can be used for various types of content. Here we use it to get all of the `div` tags containing quotes. We then use `.find_all()` on each result to find the `span` tags nested inside. We use Python's `str.replace()` method to get rid of some unwanted text and print the results.

In [None]:
for div in soup.find_all(class_="quote"):
    for span in div.find_all("span"):
        print(span.text.replace("(about)", ""))

### Example 3. Something Useful: Identifying Twitter Handles of Members of the Senate

As we've noted, at its most basic level scraping is just accessing a site. Here we will scrape a "real" website--but we are only going to make *one* request. Specifically, we'll get the Twitter handles (along with state and party) of each current US senator from a site maintained by the UC San Diego Library.

In [None]:
url = "https://ucsd.libguides.com/congress_twitter/senators"

In [None]:
senate_page = requests.get(url)
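
Because the goal is to make only one request, it can help to save the page the first time it is fetched, so that rerunning the notebook does not hit the server again. The following is a minimal sketch of that idea. `fetch_cached` is a hypothetical helper, not part of `requests`, and the demonstration uses a stand-in fetcher so that nothing is downloaded.

```python
import hashlib
import tempfile
from pathlib import Path


def fetch_cached(url, fetch, cache_dir):
    """Return the page text for `url`, calling `fetch(url)` only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the cache file name is always safe for the file system.
    path = cache / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html")
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch(url)
    path.write_text(text, encoding="utf-8")
    return text


# Demonstration with a stand-in fetcher, so no network request is made here.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html><body>placeholder</body></html>"


with tempfile.TemporaryDirectory() as tmp:
    first = fetch_cached("https://example.com/page", fake_fetch, tmp)
    second = fetch_cached("https://example.com/page", fake_fetch, tmp)

print(len(calls))  # number of times the stand-in fetcher actually ran
```

In the notebook, the `fetch` argument could be something like `lambda u: requests.get(u).text`, with a permanent directory in place of the temporary one.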

In [None]:
print(senate_page.text[:500])

In [None]:
soup = BeautifulSoup(senate_page.text, "html.parser")

You can compare the way the HTML is printed when using `.prettify()` on <tt>soup</tt> to printing the text from the original result from `requests`.

In [None]:
# print(soup.prettify())

If you explore the site in a browser or just scroll through the <tt>soup</tt>, you can see that the names, states, parties, and Twitter handles of the senators are arranged in a table, which is convenient for us. We'll use `.find_all()` to identify the table.

In [None]:
len(soup.find_all("table"))

In [None]:
tables = soup.find_all("table")
for table in tables:
    print(type(table), len(table))

We can also see that the info we want is inside `tr` tags, which are rows.

In [None]:
print(str(tables[0])[:1000])

The information we want for each senator (name, handle, state, and party) is contained in one row. The handle is in the URL of the `a` tag, while the senator's name is in the text of that tag. The state and party are in additional `td` tags.

In [None]:
tables[0].find_all("tr")[1]
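
To make the row structure concrete without depending on the live page, here is a sketch that parses a made-up row shaped like the description above. The senator name, handle, and exact tag layout are invented for illustration; only the `ck_border` class name comes from the page.

```python
from bs4 import BeautifulSoup

# Made-up row imitating the structure described above; not copied from the site.
row_html = """
<table>
  <tr>
    <td class="ck_border"><a href="https://twitter.com/SenExample">Example, Pat</a></td>
    <td class="ck_border">CA</td>
    <td class="ck_border">D</td>
  </tr>
</table>
"""

row = BeautifulSoup(row_html, "html.parser").find("tr")

# Text of each cell: name, state, party.
cells = [td.get_text(strip=True) for td in row.find_all(class_="ck_border")]

# A tag's attributes can be read like dictionary entries.
link = row.a["href"]

print(cells, link)
```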

Here we use `enumerate()` with a for loop just to look at the first few results.

This code finds all of the `tr` tags, ignores any without a link (e.g., to a Twitter account), finds all of the elements of the `ck_border` class, and prints the text. This prints the senator's name, state, and party. The `a` tag's attributes are like a dictionary, and the value for the key "href" is the URL to the senator's Twitter.

In [None]:
for i, result in enumerate(soup.find_all("tr")):
    if i < 4:
        if result.a:
            for element in result.find_all(class_="ck_border"):
                print(element.text)
            print(result.a.attrs["href"])
        print()

Now that we have figured out the way the information is structured, we will extract the name, state, party, and Twitter handle for each US senator. We'll create an empty list called <tt>senator_data</tt> to store the data initially. We'll use a nested for loop just like the one above, but we'll append each senator's name, state, party, and handle to a list called <tt>row</tt> before appending that row--one per senator--to <tt>senator_data</tt>.

In [None]:
senator_data = []

for result in soup.find_all("tr"):
    if result.a:
        row = []
        for element in result.find_all(class_="ck_border"):
            row.append(element.text)
        handle = result.a.attrs["href"]
        handle = handle.replace("https://twitter.com/", "")
        row.append(handle)
        senator_data.append(row)
    else:
        print(result)
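
Stripping the `https://twitter.com/` prefix with `str.replace()` works when every link has exactly that form. If some links used `http`, a `www.` prefix, a trailing slash, or a query string, parsing the URL would be more forgiving. The sketch below shows one way to do that. `handle_from_url` is a hypothetical helper and the example URLs are invented.

```python
from urllib.parse import urlparse


def handle_from_url(url: str) -> str:
    """Return the first path segment of a profile URL, without a leading '@'."""
    path = urlparse(url.strip()).path
    segments = [segment for segment in path.split("/") if segment]
    return segments[0].lstrip("@") if segments else ""


# Invented variations on the same profile link.
examples = [
    "https://twitter.com/SenExample",
    "http://www.twitter.com/SenExample/",
    "https://twitter.com/SenExample?lang=en",
]
handles = [handle_from_url(u) for u in examples]
print(handles)
```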

In [None]:
senator_data[:5]

In [None]:
len(senator_data)

Now we will create a pandas dataframe from this list of lists. The `columns` argument lets us name the columns in the resulting dataframe.

In [None]:
df = pd.DataFrame(senator_data, columns=["senator", "state", "party", "twitter_handle"])

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.to_csv("senate_twitter_dataframe.csv", index=False)

## Scraping Tweets using `twint`

[`twint`](https://github.com/twintproject/twint) describes itself as "an advanced Twitter scraping tool written in Python that allows for scraping Tweets from Twitter profiles without using Twitter's API." `twint` has been featured in plenty of guides to scraping tweets, but it seems to have some problems, including the way it handles dates. One workaround is to handle some of the configuration in the search string itself using Twitter's search operators, rather than configuring `twint` as intended.

You can see Twitter's standard search operators [here](https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/search-operators).

[Here are some helpful thoughts](https://thoughtfaucet.com/search-twitter-by-location/) about using (and the limitations of) location data, including [tips for finding geocodes](https://thoughtfaucet.com/search-twitter-by-location/make-a-geocode-for-twitter-location-search/) and some examples of searching for tweets from [particular events](https://thoughtfaucet.com/search-twitter-by-location/examples/).

**Note:** I recommend [applying for a Twitter developer account](https://developer.twitter.com/en/apply-for-access) and accessing tweets through the official API. We will use `twint` for this example, but I do not recommend violating Twitter's terms by accessing excessive amounts of data (etc.). I've set the tweet limits low for this notebook for a reason.

First, we'll look at tweets from US senators on April 28, 2021, when President Biden [addressed a joint session of Congress](https://en.wikipedia.org/wiki/2021_Joe_Biden_speech_to_a_joint_session_of_Congress). Next, we'll look at geotagged tweets.



### Example 1. Tweets from US Senators

We'll use the dataframe we created in the previous section to identify the Twitter handles of current US senators.

In [None]:
df = pd.read_csv("senate_twitter_dataframe.csv")

In [None]:
c = twint.Config()
c.Hide_output = True
c.Store_csv = True
c.Output = "senate_tweets.csv"
c.Limit = 10

In [None]:
# Normalize the answer so "Y", "Yes ", etc. are also accepted.
run_twint = input("Scrape Twitter data? (y/n) ").strip().lower()

if run_twint in ["yes", "y"]:
    for handle in df.twitter_handle.values:
        searchstr = f"from:{handle} until:2021-04-29 since:2021-04-28"
        c.Search = searchstr
        twint.run.Search(c)
        time.sleep(1)
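
The search string in the loop above hard-codes both dates. As a sketch, the same one-day window can be built from a single date, assuming (as the window above implies) that `since:` includes the named day and `until:` stops just before it. `one_day_query` is a hypothetical helper and `SenExample` is an invented handle.

```python
import datetime as dt


def one_day_query(handle: str, day: dt.date) -> str:
    """Build a search string for one account's tweets on a single day."""
    # timedelta moves to the next day correctly across month and year ends.
    next_day = day + dt.timedelta(days=1)
    return f"from:{handle} until:{next_day.isoformat()} since:{day.isoformat()}"


query = one_day_query("SenExample", dt.date(2021, 4, 28))
month_end_query = one_day_query("SenExample", dt.date(2021, 4, 30))
print(query)
print(month_end_query)
```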

In [None]:
tweets_df = pd.read_csv("senate_tweets.csv")

In [None]:
tweets_df.date.min(), tweets_df.date.max(), tweets_df.shape

In [None]:
tweets_df.head()

In [None]:
tweets_df[["username", "name", "tweet"]].sample(10)

### Example 2. Geocoded Data

In [None]:
c = twint.Config()
c.Hide_output = True
c.Store_csv = True
c.Output = "geo_tweets.csv"
c.Limit = 100
searchstr = "until:2021-07-21 since:2021-07-19 geocode:43.045110,-87.915820,5km" # within 5km of Deer District
c.Search = searchstr
twint.run.Search(c)
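
The `geocode:` operator used above takes a latitude, a longitude, and a radius, separated by commas. The small sketch below assembles that piece of the search string and rejects coordinates that are out of range. `geocode_operator` is a hypothetical helper, not part of `twint`.

```python
def geocode_operator(lat: float, lon: float, radius_km: float) -> str:
    """Format a geocode search operator as 'geocode:lat,lon,radiuskm'."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    if radius_km <= 0:
        raise ValueError(f"radius must be positive: {radius_km}")
    # Six decimal places matches the precision of the coordinates used above.
    return f"geocode:{lat:.6f},{lon:.6f},{radius_km:g}km"


geo_part = geocode_operator(43.045110, -87.915820, 5)
print(geo_part)
```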

In [None]:
geo_df = pd.read_csv("geo_tweets.csv")

In [None]:
geo_df.date.min(), geo_df.date.max(), geo_df.shape

In [None]:
geo_df.head()

In [None]:
geo_df.loc[geo_df["date"] == "2021-07-20", "tweet"]

## Scraping Reddit Content using `psaw`

Another amazing resource for social media data is [pushshift.io](https://pushshift.io/), which archives vast amounts of data and makes it easily accessible. We'll use the [`psaw` library](https://github.com/dmarx/psaw) to access content from the pushshift.io Reddit API.

First, create an instance of the `PushshiftAPI()` class.

In [None]:
api = PushshiftAPI()

We'll use a helper function to turn the results we get into a list.

In [None]:
def get_results(subreddit: str, start_epoch, before_epoch, limit=10):
    res = list(api.search_submissions(after=start_epoch,
                                      before=before_epoch,
                                      subreddit=subreddit,
                                      limit=limit))
    return res

In [None]:
wsb = []

year = 2020
month = 1
days = range(24,31)

epochs = []

for day in days:
    # Build a one-day window. Adding a timedelta handles month and year
    # boundaries, so no try/except around an out-of-range day is needed.
    start = dt.datetime(year, month, day)
    before = start + dt.timedelta(days=1)
    start_epoch = int(start.timestamp())
    before_epoch = int(before.timestamp())

    epochs.append((start_epoch, before_epoch))
    res = get_results("WallStreetBets", start_epoch, before_epoch)
    wsb.append(res)
    time.sleep(1)  # pause between requests to go easy on the API
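
One detail worth knowing: calling `.timestamp()` on a naive `datetime` (one without time zone information) interprets it in the local time zone, so the day boundaries in the loop above fall at local midnight on whatever machine runs the notebook. If boundaries at midnight UTC are preferred, the windows can be built with an explicit time zone. This is a sketch, and `daily_windows_utc` is a hypothetical helper.

```python
import datetime as dt


def daily_windows_utc(first_day: dt.date, n_days: int):
    """Return (start_epoch, before_epoch) pairs for consecutive UTC days."""
    windows = []
    for offset in range(n_days):
        day = first_day + dt.timedelta(days=offset)
        # Midnight UTC at the start of the day, as a timezone-aware datetime.
        start = dt.datetime.combine(day, dt.time.min, tzinfo=dt.timezone.utc)
        before = start + dt.timedelta(days=1)
        windows.append((int(start.timestamp()), int(before.timestamp())))
    return windows


windows = daily_windows_utc(dt.date(2020, 1, 24), 8)
print(windows[0], windows[-1])
```

Each pair could be passed to `get_results()` in place of the `start_epoch` and `before_epoch` values computed in the loop.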


In [None]:
wsb_flat = [post for sublist in wsb for post in sublist]

In [None]:
len(wsb_flat)

In [None]:
wsb_df = pd.DataFrame([post.d_ for post in wsb_flat])

In [None]:
wsb_df.head()

In [None]:
wsb_df[["author", "title", "selftext"]]