## Fun with the Genius API

As we have already begun to discuss, many websites and organizations offer web APIs, which can be a rich source of textual data. We're going to go over how one real-world API works: the [Genius API](https://docs.genius.com/). Working through this API will teach you the tools you need to sign up for, query, and interpret APIs from other providers (as you will be asked to do in your first quiz for this course).

### Signing up for an API Key (aka Client Access Token)

Before you can use the Genius API, you need to sign up for a "client access token," which is another name for an API key, as was discussed in the homework. Do so by filling out the [New API Client form](https://genius.com/api-clients/new). If you don't yet have an account on Genius.com, you'll be prompted to register first. 

The next questions don't really apply to our use in class, but they're required to get your token. You'll be prompted to fill out a short form about the "App" that you need the Genius API for. You only need to fill out "App Name" and "App Website URL." You can enter any words you want in "App Name." Similarly, you can enter any URL in the "App Website URL," like so:

<img src="http://lklein.com/wp-content/uploads/2021/09/Screen-Shot-2021-09-15-at-11.21.25-AM.png" style="width:400px">

When you click "Save," you'll be given a series of API keys: a "Client ID" and a "Client Secret." To generate your "Client Access Token," which is the API key that we'll be using in this notebook, you need to click "Generate Access Token".

The token is just a string of letters and numbers. It'll look something like this:

    6617c28c371f0a138f7912a35365564afe538605
    
That's your "key" for that API. Whenever you make a request to that API, you'll need to include your key in the request. The exact method for including the key will be explained below. (Note: the key above is just something I made up; it's not a valid key; don't try using it in actual requests.)

In [None]:
# sign up for a client access token from Genius

Copy and paste your "Client Access Token" into the quotation marks below, and run the cell to save it as a variable.

In [None]:
client_access_token = "YOUR_CLIENT_ACCESS_TOKEN" # paste your own token here; never share a real token publicly

### Making an API Request

Remember: making an API request looks a lot like typing a specially-formatted URL. That's kind of what it is. But instead of getting a rendered HTML web page in return, you get some data in return.

There are a few different ways that we can query the Genius API, all of which are discussed in the [Genius API documentation](https://docs.genius.com/#/getting-started-h1). (In general, an API's documentation will explain how to use the API.) The method we're going to cover in this lesson is the [basic search](https://docs.genius.com/#songs-h2), which returns a bunch of Genius data about any artist or song that you search for. It looks something like this:

`http://api.genius.com/search?q={search_term}&access_token={client_access_token}`

Let's break it down. But first, we need to import the `requests` library:

In [None]:
import requests # requests again

Then we need the base URL for the Genius API. We'll assign it like this:

In [None]:
base_url = "http://api.genius.com" # this is the URL for the Genius API; we're just storing it as a string
base_url

'http://api.genius.com'

Up next, we add '/search', which is what we learned about from reading the documentation. It tells the Genius API that we want to do a basic search. We'll add it to the end of the base_url (which is just a string) like so:

In [None]:
search_url = base_url + "/search" 
search_url

'http://api.genius.com/search'

Next, we have '?q={search_term}'. 

The "q" is Genius's search parameter; it tells Genius that what follows is what we're searching _for_. Let's search for Beyoncé's new single, "Break My Soul."

In [None]:
search_term = "Break My Soul" 

Finally, we have '&access_token={client_access_token}'. You've already defined this term above with your own token!

We can put it all back together now:

In [None]:
genius_search_url = f'http://api.genius.com/search?q={search_term}&access_token={client_access_token}'

But wait! What's that 'f' doing in front of the URL? 

This is yet another way of formatting strings, known as a [formatted string literal or f-string](https://cito.github.io/blog/f-strings/). 

What it means is that, if you preface a string with an "f", any variables placed in curly braces ( `{}` ) will be interpreted inline. So in this case, {search_term} will be replaced by our search_term, and {client_access_token} will be replaced by our client_access_token.

Note that you could *also* do: 

In [None]:
genius_search_url2 = search_url + "?q=" + search_term + "&access_token=" + client_access_token

But in this case the f-string is a bit more legible.
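One detail worth knowing: our search term contains spaces, which are not strictly valid inside a URL. `requests` generally takes care of escaping them for us, but it can be safer to encode the query string explicitly. Here is a minimal sketch using the standard library's `urlencode`. The helper name `build_search_url` is hypothetical, and the token is a placeholder rather than a real key.

```python
from urllib.parse import urlencode

def build_search_url(search_term, token, base_url="http://api.genius.com"):
    """Return a Genius search URL with the query parameters safely encoded."""
    # urlencode turns spaces into '+' and escapes characters such as '&' or 'é'
    query = urlencode({"q": search_term, "access_token": token})
    return f"{base_url}/search?{query}"

example_url = build_search_url("Break My Soul", "PLACEHOLDER_TOKEN")
print(example_url)
# http://api.genius.com/search?q=Break+My+Soul&access_token=PLACEHOLDER_TOKEN
```

You can get the same effect by passing a dictionary to the `params` argument of `requests.get`, which encodes the values for you.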

So now here we go with the API call!

In [None]:
# and here's the API call
resp = requests.get(genius_search_url)
data = resp.json()

data

{'meta': {'status': 200},
 'response': {'hits': [{'highlights': [],
    'index': 'song',
    'type': 'song',
    'result': {'annotation_count': 6,
     'api_path': '/songs/8122123',
     'artist_names': 'Beyoncé',
     'full_title': 'BREAK MY SOUL by\xa0Beyoncé',
     'header_image_thumbnail_url': 'https://images.genius.com/f597d7ff44a041e8c9b10192123e147a.300x300x1.jpg',
     'header_image_url': 'https://images.genius.com/f597d7ff44a041e8c9b10192123e147a.1000x1000x1.jpg',
     'id': 8122123,
     'lyrics_owner_id': 5748418,
     'lyrics_state': 'complete',
     'path': '/Beyonce-break-my-soul-lyrics',
     'pyongs_count': 51,
     'relationships_index_url': 'https://genius.com/Beyonce-break-my-soul-sample',
     'release_date_components': {'year': 2022, 'month': 6, 'day': 21},
     'release_date_for_display': 'June 21, 2022',
     'song_art_image_thumbnail_url': 'https://images.genius.com/d904df8c008f003866de7825da2fcace.300x300x1.jpg',
     'song_art_image_url': 'https://images.geniu

This request is finding all songs that include the search string `Break My Soul`. 

As described in the [documentation](https://docs.genius.com/#/response-format-h1), the results take the form of a dictionary with two keys: `response`, which points to a dictionary containing a list of dictionaries (phew!), and `meta`, whose value is a dictionary with a single key, `status`, which gives you the HTTP status code for the response (i.e., whether the request was successful). 

Because the response is a dictionary, we can isolate the two top-level keys to get an overall view of the response:

In [None]:
data.keys()

dict_keys(['meta', 'response'])

The `meta` value we saw in the full output above reports a status of 200, so we know that the request was successful. 
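Rather than eyeballing the output, we can also check the status in code before digging into the hits. The sketch below assumes the payload has the `meta` / `response` shape shown above. `get_hits` is a hypothetical helper, and `sample_payload` is a trimmed-down stand-in for the real response, so this runs without calling the API.

```python
def get_hits(payload):
    """Return the list of search hits, or raise if the API reported a problem."""
    status = payload.get("meta", {}).get("status")
    if status != 200:
        raise ValueError(f"Genius API returned status {status}")
    return payload["response"]["hits"]

# A stand-in payload with the same top-level shape as the real response
sample_payload = {
    "meta": {"status": 200},
    "response": {"hits": [{"type": "song", "result": {"title": "BREAK MY SOUL"}}]},
}

hits = get_hits(sample_payload)
print(len(hits))  # 1
```

In the notebook, you would call `get_hits(data)` on the dictionary returned by `resp.json()`.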

But let's dig a little deeper into the `response` key. It itself is a dictionary, so we can look at _its_ keys.

In [None]:
data['response'].keys()

dict_keys(['hits'])

So there is only one key, `hits`, which I will tell you contains a _further_ list of dictionaries: one for each of the hits in the search result.

Let's take a look at the first result:

In [None]:
data['response']['hits'][0]

{'highlights': [],
 'index': 'song',
 'type': 'song',
 'result': {'annotation_count': 6,
  'api_path': '/songs/8122123',
  'artist_names': 'Beyoncé',
  'full_title': 'BREAK MY SOUL by\xa0Beyoncé',
  'header_image_thumbnail_url': 'https://images.genius.com/f597d7ff44a041e8c9b10192123e147a.300x300x1.jpg',
  'header_image_url': 'https://images.genius.com/f597d7ff44a041e8c9b10192123e147a.1000x1000x1.jpg',
  'id': 8122123,
  'lyrics_owner_id': 5748418,
  'lyrics_state': 'complete',
  'path': '/Beyonce-break-my-soul-lyrics',
  'pyongs_count': 51,
  'relationships_index_url': 'https://genius.com/Beyonce-break-my-soul-sample',
  'release_date_components': {'year': 2022, 'month': 6, 'day': 21},
  'release_date_for_display': 'June 21, 2022',
  'song_art_image_thumbnail_url': 'https://images.genius.com/d904df8c008f003866de7825da2fcace.300x300x1.jpg',
  'song_art_image_url': 'https://images.genius.com/d904df8c008f003866de7825da2fcace.1000x1000x1.jpg',
  'stats': {'unreviewed_annotations': 1,
   'c

So this is what we want: the dictionary for each of the search results.

But lo and behold, it contains an additional level of data, with four keys of its own! 

Three of the four (`highlights`, `index`, and `type`) hold very little: an empty list and two short strings.

But the `result` dictionary is where the good stuff is. 

Important items in this dictionary are the song title itself (`title`), the URL for the song lyrics (`url`), and the `primary_artist` key, which points to *another* dictionary with the name of the artist (`name`). 

The artist name could be used with a different API endpoint to get more detail about a particular artist. But this information is enough for our purposes at the moment.
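To make the nesting concrete, here is a small sketch that pulls those three pieces out of a single hit. `summarize_hit` is a hypothetical helper, and `sample_hit` is a hand-built stand-in that keeps only the keys discussed above (the real hit has many more).

```python
def summarize_hit(hit):
    """Flatten one search hit into a small dictionary of the fields we care about."""
    result = hit["result"]
    return {
        "title": result["title"],
        "url": result["url"],
        # the artist's name lives one level deeper, inside 'primary_artist'
        "artist": result["primary_artist"]["name"],
    }

sample_hit = {
    "type": "song",
    "result": {
        "title": "BREAK MY SOUL",
        "url": "https://genius.com/Beyonce-break-my-soul-lyrics",
        "primary_artist": {"name": "Beyoncé"},
    },
}

summary = summarize_hit(sample_hit)
print(summary["artist"])  # Beyoncé
```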

To get a more compact view of the results of our initial query, for song titles with "Break My Soul" in them, let's see if we can print out the song title for each search hit:

In [None]:
# Remember list comprehension format: [ output expression FOR temporary variable name IN source list ]

titles = [song['result']['title'] for song in data['response']['hits']]

titles

# This means, for each song in data['response']['hits'], add its ['result']['title'] to a new list called "titles"


['BREAK MY SOUL',
 'BREAK MY SOUL (THE QUEENS REMIX)',
 'Let Your Heart Go (Break My Soul)',
 'Beyoncé - BREAK MY SOUL (Tradução em Português)',
 'BREAK MY SOUL (Honey Dijon Remix)',
 'Break My Soul',
 'Beyoncé - BREAK MY SOUL (Türkçe Çeviri)',
 'Beyoncé - BREAK MY SOUL (Traducción al Español)',
 'You Can’t Break My Soul',
 'BREAK MY SOUL (ACAPELLA VERSION)']
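If the comprehension syntax still feels unfamiliar, it may help to see it next to the explicit loop it replaces. This sketch uses a two-item stand-in list, `sample_hits`, instead of the real API data.

```python
# Two stand-in hits with the same nesting as the real search results
sample_hits = [
    {"result": {"title": "BREAK MY SOUL"}},
    {"result": {"title": "BREAK MY SOUL (THE QUEENS REMIX)"}},
]

# Version 1: an explicit loop that appends to an empty list
titles_loop = []
for song in sample_hits:
    titles_loop.append(song["result"]["title"])

# Version 2: the same thing as a list comprehension
titles_comprehension = [song["result"]["title"] for song in sample_hits]

print(titles_loop == titles_comprehension)  # True
```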

**Question:** What key would we change to list the URLs for the lyrics of each of these songs?

In [None]:
# your code here

lyrics = [song['result']['url'] for song in data['response']['hits']]

lyrics


['https://genius.com/Beyonce-break-my-soul-lyrics',
 'https://genius.com/Beyonce-and-madonna-break-my-soul-the-queens-remix-lyrics',
 'https://genius.com/Ti-let-your-heart-go-break-my-soul-lyrics',
 'https://genius.com/Genius-brasil-traducoes-beyonce-break-my-soul-traducao-em-portugues-lyrics',
 'https://genius.com/Beyonce-and-honey-dijon-break-my-soul-honey-dijon-remix-lyrics',
 'https://genius.com/H-y-b-r-i-d-break-my-soul-lyrics',
 'https://genius.com/Genius-turkce-ceviri-beyonce-break-my-soul-turkce-ceviri-lyrics',
 'https://genius.com/Genius-traducciones-al-espanol-beyonce-break-my-soul-traduccion-al-espanol-lyrics',
 'https://genius.com/Anonymous-group-you-cant-break-my-soul-lyrics',
 'https://genius.com/Beyonce-break-my-soul-acapella-version-lyrics']

**Exercise:** Adapting the syntax above, list the name of the artist for each of these songs.
    
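**Hint:** Remember that the artist `name` is contained *within* the dictionary `primary_artist`. (The `result` dictionary also has an `artist_names` key, which lists every credited artist in a single string.)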

In [None]:
# your code here
artists =  [song['result']['artist_names'] for song in data['response']['hits']]
artists


['Beyoncé',
 'Beyoncé & Madonna',
 'T.I. (Ft. The-Dream)',
 'Genius Brasil Traduções',
 'Beyoncé & Honey Dijon',
 'H Y B R I D',
 'Genius Türkçe Çeviri',
 'Genius Traducciones al Español',
 'Anonymous (Group)',
 'Beyoncé']

### Working with responses

Now we have a response from the API, and we've parsed it into a Python data structure that we know how to use (a dictionary). But now what do we do with it?

In some cases, the response from the API contains all of the data that you need to create your dataset. But in other cases, you need to chain together additional information gained from an API call with another API call--or, in yet other cases, with some web scraping. 

In this case, you'll notice that the response contains an item, `url`, that points to the web page with the song's lyrics. But it doesn't actually provide the lyrics themselves. This has to do with the sad reality that most for-profit companies don't want to give away their most valuable data for free. 

It turns out that Genius has made their lyrics data increasingly difficult to access. But if we *wanted* to create a song lyrics dataset that contained "Break My Soul"... 

## Web Scraping on the Actual Web: 2022 Edition 

Let's start with what we know how to do: finding the URL for the lyrics for Beyoncé's "Break My Soul"

Remember that we've already got our `data` stored from our API call. 

Now let's collect the lyrics URLs for the hits whose primary artist is Beyoncé:

In [None]:
# Create a placeholder list to hold any matches
lyrics_url = []

# Use our data variable, populated with info from the API, to pull up the URL we need to scrape 
for song in data['response']['hits']:
    if song['result']['primary_artist']['name'] == "Beyoncé":
        lyrics_url.append(song['result']['url']) # appending to list 

lyrics_url

['https://genius.com/Beyonce-break-my-soul-lyrics',
 'https://genius.com/Beyonce-break-my-soul-acapella-version-lyrics']

Looks like there are two matches, but the first seems more authoritative. Let's use that.

In [None]:
url = lyrics_url[0]

url

'https://genius.com/Beyonce-break-my-soul-lyrics'

Next step: See if we can get the contents of the page at that URL

**Does anyone remember the first step?**

Hint: it involves the `requests` library

In [None]:
response = requests.get(url)

response


<Response [403]>

In [None]:
# note about colab, we'll eventually load the page locally instead 

*NB: In shifting over from Jupyter to Colab, I encountered my first issue with the platform: Genius.com does not like HTTP requests that originate from somewhere in the Google cloud, and suspects the worst. As a workaround for today's class, we'll just load a static version of the document to use for the rest of the lesson. In the future, if you are getting 403 errors and no one else on the internet who is trying to do the same thing seems to be getting them, try downloading your notebook and running it locally on your laptop. That worked for me!* 
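One way to make this kind of workaround explicit is to try a list of URLs in order and keep the first one that answers with status 200. The following is only a sketch. `first_successful`, `FakeResponse`, and `fake_fetch` are hypothetical names, and the fake fetcher simulates the 403 we saw, so the example runs without touching the network. In a real notebook you would pass `requests.get` as the `fetch` argument.

```python
from dataclasses import dataclass

@dataclass
class FakeResponse:
    """A stand-in for a requests response, with only the attributes we use."""
    status_code: int
    text: str = ""

def fake_fetch(url):
    # Pretend the live site blocks us but the static copy is reachable
    if url.startswith("https://genius.com"):
        return FakeResponse(403)
    return FakeResponse(200, "<html>static copy</html>")

def first_successful(urls, fetch):
    """Try each URL in order and return the first response whose status is 200."""
    for candidate in urls:
        response = fetch(candidate)
        if response.status_code == 200:
            return response
    raise RuntimeError("No URL returned a successful response")

page = first_successful(
    [
        "https://genius.com/Beyonce-break-my-soul-lyrics",
        "https://raw.githubusercontent.com/laurenfklein/QTM340-Fall22/main/corpora/lyrics/Beyonce-break-my-soul-lyrics.html",
    ],
    fake_fetch,
)
print(page.status_code)  # 200
```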

In [None]:
# load in the static html instead
response = requests.get('https://raw.githubusercontent.com/laurenfklein/QTM340-Fall22/main/corpora/lyrics/Beyonce-break-my-soul-lyrics.html')

Once we've gotten the contents of the page, it's a good idea to take a look.

**How can you print the response from the server as text?**

In [None]:
# your code here



Whoa! That's a lot more complicated than kittens! Let's go back to Chrome and [take a look](https://genius.com/Beyonce-break-my-soul-lyrics) using Developer Tools.


To get a lay of the (HTML) land, try doing a command-f for "intro: big freedia" in the developer window, since that's some text that seems to start the portion of the page that has the lyrics. 

**First round of questions for the class**:
* What is the tag enclosing the phrase "intro: big freedia"? 
* Does this tag have any attributes, and if so, what are they?

**Second set of questions:**
* Is there only one div with the attribute "data-lyrics-container=true", or are there many?
* What BeautifulSoup method should we use to ensure that we get the appropriate number of div tag(s)?

In [None]:
# need to import BeautifulSoup since we haven't yet used it in this notebook
from bs4 import BeautifulSoup

# now let's use BeautifulSoup to parse the html document that we got using requests just a few minutes ago
html_str = response.text
document = BeautifulSoup(html_str, "html.parser")

# and your BeautifulSoup query goes here... 
lyrics = document.find_all("div", attrs={"data-lyrics-container": "true"})


In [None]:
lyrics[0].string

In [None]:
lyrics[0].get_text()

"[Intro: Big Freedia & Beyoncé]I'm 'bout to explode, take off this loadBend it, bust it open, won't ya make it goYaka-yaka, yaka-yaka, yaka-yaka, yaka-yakaYaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)Yaka-yaka, yaka-yaka, yaka-yaka, yaka-yakaYaka-yaka, yaka-yaka, yaka-yaka, yaka-yaka (Release ya wiggle)La-la-la-la, la-la-la-la, la-la-la-laLa-la-la-la, la-la-la-la, la-la-la-la, la-la-la-la, la[Chorus: Beyoncé]You won't break my soulYou won't break my soulYou won't break my soulYou won't break my soulI'm tellin' everybodyEverybodyEverybodyEverybody[Verse 1: Beyoncé]Now, I just fell in loveAnd I just quit my jobI'm gonna find new driveDamn, they work me so damn hardWork by nineThen off past fiveAnd they work my nervesThat's why I cannot sleep at night[Pre-Chorus: Beyoncé]I'm lookin' for motivationI'm lookin' for a new foundation, yeahAnd I'm on that new vibrationI'm buildin' my own foundation, yeahHold up, oh, baby, baby[Chorus: Beyoncé]You won't break my soul (Na, na)You

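In [None]:
# does not work! .string comes back as None because the div has more than one child node
lyrics_divs = lyrics[0]
lyrics_divs.string

In [None]:
# does work, sorta! .get_text() joins all of the text in the div, but it drops the line breaks
lyrics_divs.get_text()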

It's not perfect, but it's good enough for now! 
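If you want to keep the line breaks, one option is the `separator` argument of `get_text()`, which inserts a string between the text fragments that tags like `<br/>` split apart. The sketch below runs on a tiny made-up HTML snippet (`sample_html`, with placeholder lines rather than real lyrics) so it works without downloading anything. The variable names are my own.

```python
from bs4 import BeautifulSoup

# A made-up snippet that imitates the structure of the lyrics containers
sample_html = """
<div data-lyrics-container="true">[Intro]<br/>Line one<br/>Line two</div>
<div data-lyrics-container="true">[Chorus]<br/>Line three</div>
"""

sample_doc = BeautifulSoup(sample_html, "html.parser")
containers = sample_doc.find_all("div", attrs={"data-lyrics-container": "true"})

# .string is None here because each div has several child nodes
print(containers[0].string)

# Put a newline between text fragments, then join the containers together
lyrics_text = "\n".join(div.get_text(separator="\n") for div in containers)
print(lyrics_text)
```

On the real page, you would run the same join over the `lyrics` list returned by `find_all`.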

## A Quick Note on API Wrappers

An API wrapper is a package that makes an API easier to use and/or extends the API’s functionality. 

For example, a data scientist named John Miller wrote a Python package called [LyricsGenius](https://github.com/johnwmillr/LyricsGenius), which makes working with the Genius API easier and adds functionality not offered by the Genius API, including scraping lyrics (but it also doesn't work via Colab).

The equivalent for the Twitter API is [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2/). 

And the best thing to use for Reddit data is the [PushShift API Wrapper](https://github.com/dmarx/psaw), or PSAW. 









## And a bit more on legal / ethical considerations, via Melanie Walsh

### Legal considerations

If internet data is publicly available (e.g., tweets from a public Twitter account), it is generally considered legal to collect this data, even if a particular platform says that you cannot. In 2019, the Ninth Circuit Court of Appeals ruled that scraping publicly accessible websites likely does not violate federal anti-hacking laws. You can read more about [this legal ruling](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data#:~:text=Linkedin%20Protects%20Scraping%20of%20Public%20Data,-Share%20It%20Share&text=In%20a%20long%2Dawaited%20decision,and%20Abuse%20Act%20(CFAA)) from the Electronic Frontier Foundation.

### Institutional Review Boards (IRBs)

Research that involves human participants (e.g., surveys, interviews, blood draws) needs to be approved by an Institutional Review Board (IRB). But research about publicly available internet data does not typically require IRB approval.

### Publishing, Privacy, and Citation 

Just because something is legal or gets approved by an IRB does not mean it is ethical. Collecting, sharing, and publishing internet data created by or about individuals can lead to unwanted public scrutiny, harm, and other negative consequences for those individuals. For these reasons, some researchers attempt to anonymize internet data before sharing it or before publishing an article that cites a post specifically. Yet anonymizing internet data also does not give credit to internet users as creators and authors.

There is no single, simple answer to the many difficult questions raised by internet data collection. It is important to develop an ethical framework that responds to the specifics of your particular research project or use case (e.g., the platform, the people involved, the context, the potential consequences, etc.).

In any published research, you may want to consider seeking explicit permission from internet users when you want to quote them in an article, or only share internet data that meets a certain threshold of publicness, such as tweets from verified Twitter accounts or Reddit posts with a certain number of upvotes. 

### Models & Examples of Social Media Data in Published Research

Below are a few examples of how researchers have approached social media data in published research:

### Paraphrasing Posts
In Maria Antoniak, David Mimno, and Karen Levy’s [article about a Reddit subcommunity dedicated to birth stories (r/BabyBumps)](https://maria-antoniak.github.io/resources/2019_cscw_birth_stories.pdf), which we will read later this semester, they paraphrased Reddit submissions discussed in the article and then deleted all collected Reddit data after the article was published.

### Linking to Posts & Using “Reasonably Public” Thresholds
In Deen Freelon, Charlton McIlwain, and Meredith D. Clark’s [report about the #BlackLivesMatter movement](https://cmsimpact.org/wp-content/uploads/2016/03/beyond_the_hashtags_2016.pdf), they included links to tweets rather than the full text of tweets and only linked to tweets with a minimum of 100 retweets published by Twitter users who had at least 3,000 followers or were verified. They embargoed their Twitter data for a year and then publicly released a list of tweet IDs. Tweet IDs can be used by third parties to re-download any tweets that have not yet been deleted.

### Direct Collaboration & Conversation with Users
In Emory alum Moya Bailey’s [article about the #GirlsLikeUs hashtag](http://www.digitalhumanities.org/dhq/vol/9/2/000209/000209.html), created by trans advocate Janet Mock, she asked for Mock’s permission to work on the project before it began and collaborated with Mock to develop research questions and determine the project’s direction.

## Additional Recommended Reading

* [Doc Now White Paper](https://www.docnow.io/docs/docnow-whitepaper-2018.pdf), Bergis Jules, Ed Summers, Dr. Vernon Mitchell, Jr.
* [No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service](https://cmci.colorado.edu/~cafi5706/ICWSM2020_datascraping.pdf), Casey Fiesler, Nathan Beard, Brian C. Keegan
* [#transform(ing)DH Writing and Research: An Autoethnography of Digital Humanities and Feminist Ethics](http://www.digitalhumanities.org/dhq/vol/9/2/000209/000209.html), Moya Bailey
* [The #TwitterEthics Manifesto](https://modelviewculture.com/pieces/the-twitterethics-manifesto), Dorothy Kim and Eunsong Kim

*I wrote version 1.0 of this notebook in Fall 2019. It has since been supplemented with material from Melanie Walsh's chapter [Song Genius API](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/Data-Collection/Genius-API.html) from her online textbook [_Introduction to Cultural Analytics & Python_](https://melaniewalsh.github.io/Intro-Cultural-Analytics/features/welcome.html) as well as from Prof. Dan Sinykin's 2020 iteration of QTM 340. I last revised this notebook in Fall 2022*.

