blog/phish_from_vt.Rmd

---
date: 2017-05-21
title: "Phish by the Numbers"
image: "blog/img/phish_cover.png"
showonlyimage: true
type: 'post'
---

```{r, include=FALSE}
knitr::opts_chunk$set(warning = F, message = F, error = F)
```


```{r, echo=FALSE}
library(tidyverse)
library(httr)
library(stringr)
library(lubridate)
library(forcats)
library(broom)
library(stargazer)
library(wesanderson)
library(viridis)
library(rvest)
library(knitr)
library(pander)

# Source functions
source(file = 'scripts/spotify_helpers.R')
source(file = 'scripts/phish_in_helpers.R')

# Spotify credentials
client_id <- '8ec1c22eb38d4e0b9e10e8b715a5f74e'
client_secret <- '5636a032ece24f38aad8cbd01b266ec2'

# Phish.net credentials
api_key <- 'B53003FE66383F8D79E7'
app_key <- '0A3D9E14542B42F37A5E'
end_point <- 'https://api.phish.net/v3/'

# Top jams to explore
top_jams <- c('You Enjoy Myself', 'Tweezer', 'Bathtub Gin', 'David Bowie',
              "Down with Disease", 'Ghost', "Mike's Song", 'Reba')
```

> "My geekiness is getting in the way of my nerdiness" - Patton Oswalt 

Phish is a quartet and stalwarts of the modern jam band scene they helped create in the late 80's. Like their Deadhead contemporaries, Phish fans (phans) have an almost unhealthy obsession with the music, speaking in reverential tones of near mythical versions of songs, like 2015's [Tweezer -> Prince Caspian](https://youtu.be/jPrQYYlFPA4), while endlessly debating the details (for the record, no, I do not think they went back into Tweezer). This devotion to the music, an unabiding urge to get a taste of IT, has numerous implications - on bank accounts, relationships ([Confessions of a Phish Wife](http://www.vogue.com/article/confessions-of-a-phish-wife)), livers, and, in this case, data. 

Coinciding with the emergence of the internet, phans have been compiling, debating, and analyzing Phish data online for literally decades. Whether it's in the pursuit of tracking your "stats" (shows attended, songs heard, etc.) or quantifying "bust outs" (songs not heard in X many shows), data is no stranger to a Phish discussion. So, from the moment I learned that there are *numerous* online sources of data about this band I love, the data monkey in my brain has been burning the banana at both ends thinking of ways to, like everything else in my life, over analyze Phish. 

## Soundcheck
The data for this analysis comes from the following sources who are kind enough to make their data available through application programming interfaces (APIs):

+ [Phish.net](https://api.phish.net) - A project of the non-profit [Mockingbird Foundation](http://mbird.org), an entirely volunteer run foundation started by Phish fans in 1996. If you enjoy this post, please consider making a donation!
+ [Phish.in](http://phish.in/api-docs) - A live Phish audio streaming site and my de facto homepage these days
+ [Spotify](https://developer.spotify.com/web-api/endpoint-reference/)

I access, analyze, and visualize all the data using [R](https://cran.r-project.org), an open source programming language original designed for statistical analyses but that can now make burritos, babysit your kid, and even moderate civil discussions between Trump supporters and Berkeley students. At least it seems that way. 

I've included certain interesting code blocks below but, for anyone interested in the full code, checkout my [GitHub page](https://github.com/tclavelle).

## Set 1: The Band
### Where does Phish play?
Phish gained popularity the old fashioned way, by making the faces of New England college students melt harder than Nectar's gravy fries while touring relentlessly in the late '80s and early '90s. By the mid '90s, the Phish from Vermont had built a loyal fan base that, I would argue, remains one of music's strongest to this day. Along the way, Phish played over hundreds of venues in almost every state. **So, which states have Phish shown the most love over the years?** 

To answer this question, I used data on the number of shows per venue available [here at phish.net](https://phish.net/venues). I couldn't find an access point for this data through the .net API, so instead I chose to use the R package `rvest` to scrape the data from the website itself.

```{r}
# Scrape table of show counts by venue from phish.net
url <- "https://phish.net/venues"
venues <- url %>%
  read_html() %>%
  html_nodes(xpath='/html/body/div[1]/div[2]/table') %>%
  html_table()
venues <- venues[[1]]
```

The `xpath='/html/body/div[1]/div[2]/table'` argument refers to the table element on the page that I'm interested in. You can find this info by left-clicking and using the handy `Inspect` feature in Chrome. For more about using `rvest` to scrape data from the web, see [this helpful post](http://blog.corynissen.com/2015/01/using-rvest-to-scrape-html-table.html) by Cory Nissen.

```{r}
# Get data for state map
states <- map_data(map = 'state') 
state_info <- data_frame(region = state.name,
                         state      = state.abb)

# Summarize shows by state and join with map data
state_shows <- venues %>%
  tbl_df() %>%
  group_by(State) %>%
  summarize(shows  = sum(`Times Played`, na.rm = T),
            venues = length(`Venue Name`)) %>%
  rename(state = State) %>%
  left_join(state_info) %>%
  mutate(region  = tolower(region),
         percent = 100 * shows / sum(shows, na.rm = T)) %>%
  arrange(desc(shows))

```

Next, I summarized the venue data to see how many shows and different venues Phish has played in each state (Figure 1). 

```{r, echo=F, fig.align='center', fig.cap='Map of Phish performances by State'}
# Plot shows by state
p1 <- states %>%
  left_join(state_shows) %>%
  ggplot(aes(x = long, y = lat, group = group, fill = shows, key = venues)) +
  geom_polygon() +
  scale_fill_viridis() +
  theme_bw()

p1
```

New York (`r state_shows$shows[1]` shows; `r state_shows$venues[1]` venues) and Vermont (`r state_shows$shows[2]` shows; `r state_shows$venues[2]` venues) are the clear winners and collectively home to `r round(sum(state_shows$shows[1], state_shows$shows[2]), digits = 2)`% of all Phish shows. California (`r state_shows$shows[3]` shows; `r state_shows$venues[3]` venues), Colorado (`r state_shows$shows[4]` shows; `r state_shows$venues[4]` venues), and Massachusetts (`r state_shows$shows[5]` shows; `r state_shows$venues[5]` venues) are also no stranger to a Phish show. Wyoming, the Dakotas, and Arkansas...not so much.

### What does Phish play?
One thing I always get a kick out of is watching the mental gymnastics people go through trying to comprehend how someone could see the same band multiple (13?) nights in a row, or 20, 40, 50, 100, 200+ times in their life. Some people get it, some people don't, and when people don't get it it's almost as if you told them you pour milk in the bowl before the cereal. Utter madness. However, when the band your seeing has a catalog larger than Donald Trump's supply of artificial tanner, every show is different. That being said, **which songs do Phish play the most?** Again, I turned to **phish.net** to find the number of performances of each song.   

```{r}
# Get data from Phish.net
songs <- read_html('http://phish.net/song/')

# Get show table 
songs_df <- songs %>%
  html_nodes('#song-list') %>%
  html_table(fill = T)
songs_df <- songs_df[[1]]

# Prep data for wordcloud
songs_df <- songs_df %>%
  mutate(Times = as.numeric(Times),
         word  = as.factor(`Song Name`)) %>%
  rename(freq = Times) %>%
  select(word, freq) %>%
  filter(is.na(freq)==F & freq > 0)
```

I'm a fan of using word clouds. I'm also a fan of the movie The Life Aquatic With Steve Zissou. So here's a word cloud of the most played Phish songs using Steve Zissou's favorite colors, courtesy of the wonderfully random `wesanderson` R package.

```{r, echo = FALSE, fig.align = "center", out.width='75%', fig.cap='Word cloud of Phish songs sized by number of performances'}
knitr::include_graphics("../img/phish_cloud.png")
# knitr::include_graphics("img/phish_cloud.png")
```

Not surprisingly, You Enjoy Myself, Phish's pièce de résistance, is also their most performed song. Staples like Possum, Mike's Song, Weekapaug Groove, Reba, Divided Sky, and Tweezer are also prominently displayed. Meanwhile, good luck finding Destiny Unbound in that thing. I guess even in nerd graphics she's a rare beauty.

```{r, echo=FALSE}
# Get data from phish.in API for song info
song_res <- GET('http://phish.in/api/v1/songs.json?per_page=10000')
song_raw <- song_res %>% content %>% .$data
song_counts <- map_df((seq_along(song_raw)), function(x) {
  list(song         = song_raw[[x]]$title,
       song_id      = song_raw[[x]]$id,
       performances = song_raw[[x]]$tracks_count)
})

# pull out all performances for every song in top jams
top_songs <- lapply(song_counts$song_id[song_counts$song %in% top_jams], song_info) %>%
  bind_rows()

# song summary (use to determine which jams to explore)
top_song_summary <- top_songs %>%
  group_by(song) %>%
  summarize(count = length(song),
            avg_length = mean(length, na.rm = T),
            sd_length  = sd(length, na.rm = T)) %>%
  filter(grepl('>', song) == F & is.na(sd_length)==F) %>%
  mutate(type_2 = avg_length + sd_length)
```

There's a lifetime of music represented in the word cloud above, but simply playing the song is not what continues to attract middle-aged accountants, millennial unpaid interns, and your friendly neighborhood purveyor of heady crystals. It's the jams. It's when our favorite guitarist does his best impression of a constipated dentist patient while Mike hits you with the brown note that has kept us coming back and (mostly) speaking their praises for decades.

***

## Set 2: The Jams
Not surprisingly, lists of noteworthy jams are carefully curated by the good folks at [Phish.net](http://phish.net/jamcharts). These "jamcharts", together with performance-specific data from **phish.in** and Spotify, provide a cool opportunity to explore several questions about what makes a good Phish jam.  

### How long does Phish jam?
Whether it's an extended psychedelic jam, a hard-hitting micro jam, or a unique rendition ([4/5/1998 Cavern anyone?](http://phish.in/1998-04-05/cavern)), *IT* can happen at any time. That being said, Phish has deservedly earned a reputation for **"playing the same song forever." - annonymous Chad**. I think it's a fair assumption that most phans crave that moment where Kuroda drops the lights and `[insert song here]` melts into uncharted **"Type 2"** territory. However, whether what emerges from the primal soup tickles your fancy more than when said song stays in it's lane and smothers itself with extra mustard is, like, just your opinion man. With that said, let's look at how the length of a song influences it's chances of being admitted to the hallowed halls of the [Phish.net jamcharts](http://phish.net/jamcharts).    

First, I extracted the jamcharts from Phish.net for a set of jam classics, including Tweezer, Bathtub Gin, David Bowie, and Down with Disease. 

```{r}
# Get table of all jamcharts
jc_all <- GET(paste0(end_point,'jamcharts/all?apikey=',api_key)) %>% content 
jc_all <- jc_all$response[2]$data

# Convert response to data frame
jc_all <- map_df(seq_len(length(jc_all)), function(x) {
  list(song = jc_all[[x]]$song,
       songid = jc_all[[x]]$songid,
       jams   = jc_all[[x]]$items)})

# Get info for top songs
top_jam_ids <- filter(jc_all, song %in% top_jams)

# Pull jamcharts for top_jam_ids
top_jams_charts <- lapply(top_jam_ids$songid, FUN = jam_chart) %>%
  bind_rows() %>%
  left_join(top_jam_ids)
```

Next, I joined this data set of ear candy with data from the **phish.in** API on song length and added labels for jamchart performances as well as for the coveted "Highly Recommended" jams.

```{r}
# join with Phish.in data on performances
jams <- top_songs %>%
  select(-slug) %>%
  left_join(top_jams_charts) %>%
  mutate(recommended = ifelse(is.na(highly_rec)==F, 'Yes', 'Sure')) %>%
  distinct()

# add label for highly recommended jams
jams$recommended[jams$recommended == 'Yes' & jams$highly_rec == 1] <- 'Highly'
```

Now let's look at what our complied data set shows us about the distribution of performances.

```{r, echo = FALSE, fig.align='center', fig.cap= 'Distributions of performance length and recommendation level'}
# plot distributions
jams %>%
  filter(song %in% top_jams) %>%
  left_join(top_song_summary) %>%
  mutate(is_type_2 = length >= type_2) %>%
  ggplot(aes(x = length, fill = fct_relevel(recommended, 'Highly', 'Yes', 'Sure'))) +
  geom_histogram(binwidth = 1) +
  geom_vline(data = filter(top_song_summary, song %in% top_jams),
             aes(xintercept = avg_length),
             linetype = 2) +
  scale_fill_manual(values = wesanderson::wes_palettes$Darjeeling[1:3]) +
  facet_wrap(~song, scales = 'free', ncol = 2) +
  labs(fill = 'Recommended',
       x    = 'Length (min)',
       y    = 'Frequency') +
  theme_bw()

jams %>%
  filter(song %in% 'Lawn Boy') %>%
  left_join(top_song_summary) %>%
  mutate(is_type_2 = length >= type_2) %>%
  ggplot(aes(x = length)) +
  geom_histogram(binwidth = 0.5) +
  
  geom_vline(data = filter(top_song_summary, song %in% 'Lawn Boy'),
             aes(xintercept = avg_length),
             linetype = 2) +
  labs(x    = 'Length (min)',
       y    = 'Frequency') +
  theme_bw()
```

Distinguishing type-1 and type-2 jams is more clear for certain songs than others. The clearest example is the dual peaks to the distribution for Ghost, where a version safely enters type-2 territory after 16 minutes. For other songs, like Bowie and Tweezer, the line between type-1 and type-2 is less clear, and for beloved classics like Reba and You Enjoy Myself you can usually count on surrendering to the flow for around 13 and 18 minutes, respectively (Table 1). Regardless, rest assured that, as each minute passed, you were increasingly likely to be witnessing something special. Also, don't freak out if that Tweezer you were so stoked to catch bails out after only a few minutes, you may be in store for a highly recommended segue-fest. Lastly, if you witness the boys break the 30 minute mark, congratulations; don't let your jaw break when it hit's the floor.

```{r, echo=F}
kable(top_song_summary %>%
        select(-type_2),
      digits = 2,
      col.names = c('Song ','Performances','Average length','Standard deviation'),
      caption = 'Number of performances, average length, and standard deviation of popular Phish songs',
      format = 'html')
```
***
What Figure 3 and Table 1 don't show is how performances of these songs changed over time. This is an easy extension, and Figure 4 plots the average performance length of a song by year, with the number of performances indicated by dot size and the dotted line representing the overall average length of each song. The background colors indicate the different Phish "eras". 

```{r, echo=FALSE, fig.align='center', fig.cap='Average performance length by year'}
# Average length per year lineplot
top_songs %>%
  filter(song %in% top_jams) %>%
  group_by(song, date) %>%
  summarize(length = sum(length, na.rm = T)) %>%
  separate(date, into = c('year', 'month', 'day'), sep = '-') %>%
  mutate(year = as.numeric(year)) %>%
  group_by(song, year) %>%
  summarize(avg_length = mean(length, na.rm = T),
            count      = length(song)) %>%
  complete(year=full_seq(year, period = 1), nesting(song), fill = list(0)) %>%
  group_by(song) %>%
  mutate(avg_length_all = mean(avg_length, na.rm = T)) %>%
  ungroup() %>%
  ggplot(aes(x = year, y = avg_length, group = song)) +
  geom_rect(aes(xmin = min(year, na.rm = T), xmax = 2000, 
                ymin = 0, ymax = max(avg_length, na.rm = T)),
            fill = 'lightgreen', alpha = 0.2) +
  geom_rect(aes(xmin = 2002, xmax = 2004, 
                ymin = 0, ymax = max(avg_length, na.rm = T)),
            fill = 'orange', alpha = 0.2) +
  geom_rect(aes(xmin = 2009, xmax = 2017, 
                ymin = 0, ymax = max(avg_length, na.rm = T)),
            fill = 'lightblue', alpha = 0.2) +
  geom_line() +
  geom_point(aes(size = count), alpha = 0.4) +
  scale_size_area(breaks = c(seq(10,60,by = 10))) +
  geom_line(aes(x = year, y = avg_length_all), linetype = 2) +
  facet_wrap(~song, ncol = 4) +
  labs(y = 'Average Length (minutes)',
       x = 'Year') +
  theme_bw()
```

It'd be a lie to say the Mike's Songs and David Bowie's of today are as exploratory as they were in the '90s. It's not all bad news, however, as Down with Disease and Tweezer continue to regularly mix it up, while it would appear the band has really honed their groove in Reba, You Enjoy Myself, and Bathtub Gin over the years.

### IT According to Spotify
For my final bit of Phish data nerdery (for now...) I decided to examine what Spotify can tell us about **what makes a good Phish jam?** (if you haven't already checked out Charlie Thompson's excellent analysis, [Finding Radiohead's most depressing song](http://rcharlie.com/2017-02-16-fitteR-happieR/), I highly recommend you do so). 
All Spotify tracks receive scores for numerous metrics that, I assume, Spotify uses in their noble quest to help you discover new music (seriously, the Discover Weekly, Your Daily Mix, and Release Radar playlists are fantastic). These metrics include things such as `energy`, `danceability`, `instrumentalness`, `liveness`, `valence` (how "happy" a song sounds), `key`, and `tempo`. 

```{r, eval=TRUE, echo=FALSE}
live_tracks <- read_csv(file = 'data/spotify_phish_tracks.csv')

# Convert duration from milliseconds to minutes
live_tracks <- mutate(live_tracks, minutes = duration_ms / 1000 / 60)

# find how many versions of each song
versions <- live_tracks %>%
  group_by(track_name) %>%
  summarize(versions = n())

live_tracks <- live_tracks %>%
  left_join(versions)
```

As a first step, I extracted and plotted the key metrics for every live Phish song available from the Spotify API (Figure 5).

```{r, echo=FALSE, fig.align='center', fig.cap='Spotify metrics for all Phish live performances available on Spotify'}
live_tracks %>%
  select(track_name, album_name, danceability, valence, instrumentalness, energy, liveness) %>%
  gather(key = 'metric', value = 'value', 3:7) %>%
  ggplot(aes(x = metric, y = value, fill = metric)) +
  geom_boxplot() +
  scale_fill_manual(values = wes_palette(name = 'Zissou', n = 5)) +
  guides(fill = F) +
  labs(x = 'Spotify Metric',
       y = 'Value') +
  theme_bw()
```

Now, I don't work for Spotify and don't know how they score these things, but when I see `instrumentalness` as the lowest scoring metric I immediately think alternative-facts. According to the documentation, which you can read [here](https://developer.spotify.com/web-api/get-audio-features/) if you're curious, `instrumentalness` scores are more of a measure of the presence of vocals and not instrumental prowess. Why they didn't go with a name like, oh I don't know, **vocals** is a bit of a mystery. You may also be wondering why I chose to include `liveness` considering all the tracks are, well, live. The `liveness` metric measures the presence of an audience in a track, which I decided could be a good way to gauge audience reaction to a jam. This likely has problems for set-opening and closing tracks, but oh well. 

Anyway, `instrumentalness` rant aside, the rest of the scores generally make sense to me - high `energy`, below average `danceability` (unless you're dancing style is classic white dude like mine), high audience engagement, and songs that take you on a trip across the entire range of emotions (`valence`). These patterns also seem to be generally consistent across different songs (Figure 6). 

```{r, echo=FALSE, fig.align='center', fig.cap='Spotify metrics for all performances available of select Phish songs'}
# filter out some jam classics to look at 
live_tracks %>%
  filter(track_name %in% c('Reba', 'You Enjoy Myself', 'Harry Hood', 'Tweezer', 'Bathtub Gin', 'Down With Disease')) %>%
  select(track_name, album_name, danceability, valence, instrumentalness, energy, liveness) %>%
  gather(key = 'metric', value = 'value', 3:7) %>%
  ggplot(aes(x = metric, y = value, fill = metric)) +
  geom_boxplot() +
  scale_fill_manual(values = wes_palette(name = 'Zissou', n = 5)) +
  geom_jitter(width = 0.25) +
  guides(fill = FALSE) +
  facet_wrap(~track_name, scales = 'free_x', ncol = 2) +
  labs(x = 'Spotify metric',
       y = 'Score (0-1)') +
  theme_bw() +
  theme(legend.position = 'bottom')

```

At this point, we can combine Spotify's track metrics with our previous data set of jamchart versions of songs to create a data set with which to ask the question **what makes a good Phish jam?** 

> **Disclaimer -** I'll preface this by saying the following analysis is more meaningless than truth and integrity are to Paul Ryan. This data monkey Phish phan knows that no amount of number crunching can capture what makes a particular jam special.

The first step is to flag all Spotify songs that are deemed *Highly Recommended*, which is made easy thanks to the awesome Spotify playlist [Phish.net "Highly Recommended"](https://open.spotify.com/user/1265043326/playlist/3FdZNyZ42vMB5szof9EA3N). If you're a Phish phan and a Spotify user, good luck listening to anything else for the next three months. Again using the Spotify API, I extracted the track list of all "Highly Recommended" songs. I then added a binary variable to the complete data set of Spotify metrics labeling these versions with a 1 and all others with a 0. 

```{r, echo=FALSE}
# Extract tracks from the Spotify playlist - Phish.net "Highly Recommended"
dot_net_a <- get_playlist(user_id = '1265043326', playlist_id = '3FdZNyZ42vMB5szof9EA3N', id = client_id, secret = client_secret, off = 0)

dot_net_b <- get_playlist(user_id = '1265043326', playlist_id = '3FdZNyZ42vMB5szof9EA3N', id = client_id, secret = client_secret, off = 100)

dot_net_c <- get_playlist(user_id = '1265043326', playlist_id = '3FdZNyZ42vMB5szof9EA3N', id = client_id, secret = client_secret, off = 200)

# Combine into single dataframe
dot_net <- bind_rows(dot_net_a, dot_net_b, dot_net_c) %>%
  mutate(track_uri = gsub('spotify:track:', '', track_uri)) # remove "spotify:track:" to match uri in live_tracks

# flag tracks in list of all Phish songs that are highly recommended 
live_tracks <- live_tracks %>%
  mutate(highly_rec = ifelse(track_uri %in% dot_net$track_uri, 1, 0))
```

Next, I constructed a series of binary logistic regression models, which can be used in situations where the observed outcome can have only two possible outcomes, in this case "Highly Recommended" (1) or not (0).  

```{r, echo=FALSE}
# Model 1
phishLM1 <- glm(formula = highly_rec ~ instrumentalness + energy + valence + minutes + liveness + danceability + key + tempo, family = 'binomial', data = live_tracks)
# Model 2
phishLM2 <- glm(formula = highly_rec ~ instrumentalness + energy + valence + minutes + liveness + danceability + key, family = 'binomial', data = live_tracks)
# Model 3
phishLM3 <- glm(formula = highly_rec ~ instrumentalness + energy + valence + minutes + liveness + danceability, family = 'binomial', data = live_tracks)
```

The results of all three models are included below in Table 2, and they make me very happy.

```{r, echo = FALSE, results='asis'}
stargazer(phishLM1, phishLM2, phishLM3, type = 'html')
```

***

So, what makes a good Phish jam? Well, according to Spotify, it's when the Phish from Vermont take a song for an extended and highly instrumental journey. The results also suggest that positive "bliss" jams put a smile on your face for a reason. This is not to say that the dark and nasty grooves don't get any love, just ask the [Dick's No Men In No Man's Land](http://phish.in/2016-09-02/no-men-in-no-mans-land). The `energy` and `danceability` of a song don't preclude it from becoming "Highly Recommended", which should make the lovers of spacey jams happy.

Well, there you have it, my data nerdery is finished getting in the way of my Phish geekery. If you're a data person and have questions about my methods or code, let me know. If you're a Phish phan, I hope you didn't find this to be too sacrilegious. Either way, let me know what you think in the comments. 

I'll end by saying that I strongly believe you can't capture IT in a statistic. Besides, my license plate is currently duck-taped to the bumper of my station wagon, so you probably shouldn't listen to me anyway.    

## Encore
`encore <- as.character(0)`