
In [0]:
#install.packages("quanteda")

In [0]:
library(dplyr)
library(quanteda)
library(stringi)

In [23]:
AFsongDF <- read.csv(
  "https://github.com/stelmanj/MusicAndLanguage/blob/master/AFsongDF.csv?raw=true",
  row.names = 1) 
head(AFsongDF)

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | ⋯ | track_href | analysis_url | duration_ms | time_signature | title | artist | lyrics | lang | langs | sid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` |
| 0 | 0.228 | 0.739 | 2 | -27.313 | 0 | 0.0555 | 0.328 | 0.99 | 0.407 | 0.0112 | ⋯ | https://api.spotify.com/v1/tracks/32AWEAYwQRky8gkAqKBgv1 | https://api.spotify.com/v1/audio-analysis/32AWEAYwQRky8gkAqKBgv1 | 63869 | 5 | Sharp Nighttime Thunder Storms Sounds | Thunderstorm Sound Bank | no lyrics found | | | 32AWEAYwQRky8gkAqKBgv1 |
| 1 | 0.708 | 0.608 | 10 | -8.78 | 0 | 0.32 | 0.0476 | 7.15e-06 | 0.127 | 0.22 | ⋯ | https://api.spotify.com/v1/tracks/3Cqj0VmGsjhmHgqXcnH6Zz | https://api.spotify.com/v1/audio-analysis/3Cqj0VmGsjhmHgqXcnH6Zz | 228625 | 4 | Pse Ke Ardh Ktu | Buta | no lyrics found | | | 3Cqj0VmGsjhmHgqXcnH6Zz |
| 2 | 0.816 | 0.6 | 6 | -9.506 | 0 | 0.253 | 0.195 | 0.0118 | 0.127 | 0.389 | ⋯ | https://api.spotify.com/v1/tracks/26EDvDMoB73VfJtT8pKTVS | https://api.spotify.com/v1/audio-analysis/26EDvDMoB73VfJtT8pKTVS | 168425 | 4 | 100 Ks | Butch | no lyrics found | | | 26EDvDMoB73VfJtT8pKTVS |
| 3 | 0.525 | 0.753 | 1 | -7.975 | 1 | 0.499 | 0.44 | 0.0 | 0.0753 | 0.552 | ⋯ | https://api.spotify.com/v1/tracks/0aW657knJXm5pUjG7x6YN1 | https://api.spotify.com/v1/audio-analysis/0aW657knJXm5pUjG7x6YN1 | 215466 | 4 | Ground Zero | Tenk | no lyrics found | | | 0aW657knJXm5pUjG7x6YN1 |
| 4 | 0.878 | 0.792 | 1 | -5.483 | 1 | 0.0645 | 0.181 | 0.0 | 0.119 | 0.513 | ⋯ | https://api.spotify.com/v1/tracks/5KZdZNWX2B0rLY7BEcgryb | https://api.spotify.com/v1/audio-analysis/5KZdZNWX2B0rLY7BEcgryb | 161250 | 4 | Bizele | Elinel | no lyrics found | | | 5KZdZNWX2B0rLY7BEcgryb |
| 5 | 0.453 | 0.656 | 5 | -8.205 | 0 | 0.425 | 0.247 | 0.0 | 0.0781 | 0.566 | ⋯ | https://api.spotify.com/v1/tracks/5lsGmbedEgu7ZxVuPByY1J | https://api.spotify.com/v1/audio-analysis/5lsGmbedEgu7ZxVuPByY1J | 203625 | 4 | K.P.T | Finem | no lyrics found | | | 5lsGmbedEgu7ZxVuPByY1J |


Constructing this dataset took some work. I won't go into much detail, but the data collection was done in Python and went roughly as follows.
1. [This page on everynoise.com](http://everynoise.com/thesoundsofcountries.html) links to over 100 Spotify playlists, each specific to one country. Through a combination of web scraping and requests to the [Spotify Web API](https://developer.spotify.com/documentation/web-api/), information was requested for every song on every playlist the page links to. The songs for which information was returned were collected into a data frame with one column per piece of Spotify metadata.
2. Through many requests to the [Musixmatch API](https://developer.musixmatch.com/), only a small portion of which were fruitful, excerpts of the songs' lyrics were collected and added to the data frame as well.
3. With the help of the [langdetect](https://github.com/Mimino666/langdetect) Python library, a couple thousand of the lyric excerpts were each labeled as one of several dozen languages.

We don't need every column: columns 5, 12 through 16, and 23 are dropped below.

Rows for songs whose lyrics couldn't be found, or whose lyrics' language couldn't be detected, will have missing values (`NA`) in some columns. That won't do.

In [0]:
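# drop columns 5, 12 through 16, and 23 by position, then drop every row that contains an NA
dat <- na.exclude(AFsongDF[, -c(5, 12:16, 23)])
dat$sid <- as.character(dat$sid) # The song id column holds character ids, not a factor, so make sure that is reflected.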
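
To make the column and row filtering concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the real dataset.

```r
# Toy stand-in for AFsongDF (hypothetical values)
toy <- data.frame(
  energy = c(0.7, 0.6, 0.8, 0.5),
  mode   = c(0, 1, 1, 0),
  lyrics = c("la la la", NA, "na na na", "no lyrics found"),
  lang   = c("en", NA, "es", ""),
  stringsAsFactors = FALSE
)

# Drop column 2 by position, then drop every row that still contains an NA
toy_clean <- na.exclude(toy[, -2])

print(toy_clean)
```

Note that `na.exclude()` removes only true `NA` values. Empty strings or placeholder text such as the last toy row survive this step, so rows like that have to be filtered out some other way.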

Get the encoding types of each lyric excerpt.

In [0]:
stri_lang_enc <- vapply(seq_len(nrow(dat)), function(x){
  # stri_enc_detect() ranks several guesses per text; keep only the top-ranked one (row 1)
  stri_enc_detect(
    as.character(dat$lyrics[x]), filter_angle_brackets = FALSE
    )[[1]][c('Language','Encoding','Confidence')][1,] %>%
    unlist()},
    # Language: the language stringi considers most likely for this text (if it has an opinion)
    # Encoding: the encoding stringi considers most likely for this text
    # Confidence: we probably won't use it, but how confident the detector is in this guess
  FUN.VALUE = character(3)) %>% t() %>%
  as.data.frame()

# check it out
head(stri_lang_enc)

Build the text document data frame for corpus analysis by filtering out the excerpts with unusual encodings and keeping only the ones we can handle.

Also, remove suspicious inconsistencies. Songs in other languages fairly often include passages in English, which would muddy the language labels. Hopefully this filter helps a little with removing those.

In [38]:
lyrics_df <- dat %>%
  # columns need to be renamed as doc_id and text so that later, the corpus function will know what to do with the data we give it
  select(doc_id = sid, text = lyrics, lang) %>%
  # attach the detected language and encoding columns
  cbind(stri_lang_enc) %>%
  mutate_if(is.factor, as.character) %>%
  # get rid of non-ISO encodings
  filter(Encoding %in% c("ISO-8859-1","ISO-8859-9","ISO-8859-2")) %>%
  # if langdetect said English but stringi said something else was more likely, get rid of it
  dplyr::filter(lang != "en" | lang == Language)

langs <- data.frame(lang = c('en', 'es', 'id', 'sw', 'nl'), stringsAsFactors = F)

# only use the languages in the 5 specified
lyrics_df <- lyrics_df %>% 
  inner_join(langs)

# clean up
rm(langs)

head(lyrics_df,2)

Joining, by = "lang"



| | doc_id | text | lang | Language | Encoding | Confidence |
|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | 5YPmo05u2Kd2HEGLFqaPXQ | Had lyom iss3id mbark ki wafitik a yamina ya taj lkhwdat khbark lmard libik hlkna w had lyom iss3id mbark ki wafitik a yamina ya taj lkhwdat khbark lmard libik hlkna mali tag 3lik nchofik we nssa9ssi f nass khyana mthawal galbi mndark kan lmola y9bl mna mali tag 3lik nchofik we nssa9ssi f nass khyana mthawal galbi mndark kan lmola y9bl mna hahahaa yamina w m3dm zinik al mglo3a hahahaa yamina w m3dm zinik a yamina blhkma chahrin nwdi hta dibt dwab lmlha kol lila w nhar nadi mayhlach nam lmha ra hobk ssfa mn jassdi w rabi 3ff w sabt raha nd3i lah ana njwjk ana chr3 w lh9 m3ana ... ******* This Lyrics is NOT for Commercial use ******* (1409618588767) | sw | hu | ISO-8859-2 | 0.18 |
| 2 | 5yqiverYlKEA0ly96HPq01 | Since 1992 there is a club which is making history. 7 years later, in 1999, it's still kicking: PONT AERI! When the stars begin to shine ... ******* This Lyrics is NOT for Commercial use ******* (1409618588767) | en | en | ISO-8859-1 | 0.87 |
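
The two filtering rules used in the cell above (keep only the three ISO encodings, and drop songs labeled English whose second language guess disagrees) can be sketched in base R on a toy data frame. The ids and detection results in `toy` are hypothetical.

```r
# Hypothetical detection results: `lang` from langdetect, `Language` from stringi
toy <- data.frame(
  doc_id   = c("a", "b", "c", "d"),
  lang     = c("en", "en", "es", "sw"),
  Language = c("en", "fr", "en", NA),
  Encoding = c("ISO-8859-1", "ISO-8859-1", "ISO-8859-1", "UTF-8"),
  stringsAsFactors = FALSE
)

iso <- c("ISO-8859-1", "ISO-8859-9", "ISO-8859-2")

# Rule 1: keep only the ISO encodings
keep_enc <- toy$Encoding %in% iso

# Rule 2: a song labeled "en" must also be "en" according to the second detector;
# songs with any other label pass regardless of the second opinion
keep_lang <- toy$lang != "en" | toy$lang == toy$Language

kept <- toy[keep_enc & keep_lang, ]
print(kept$doc_id)
```

Here song `b` is dropped because the two detectors disagree about English, and song `d` is dropped because of its encoding. Song `c` stays even though the second detector guessed English, since the rule only checks songs that langdetect labeled as English.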


Take a smaller, stratified random sample from the five languages: English, Spanish, Indonesian, Dutch, and Swahili. Each stratum is one language and contains 20 songs.

In [39]:
# take a smaller, evenly dispersed sample
set.seed(10)
## take an equal random sample from all five languages of interest
En_ids <- subset(x=lyrics_df,subset = lang == 'en')$doc_id
## n = 20 because there are exactly 22 songs in Dutch, the language of the 5 with the fewest songs
En_ids <- sample(En_ids,20)
eS_ids <- subset(x=lyrics_df,subset = lang == 'es')$doc_id
eS_ids <- sample(eS_ids,20)
Id_ids <- subset(x=lyrics_df,subset = lang == 'id')$doc_id
Id_ids <- sample(Id_ids,20)
Nl_ids <- subset(x=lyrics_df,subset = lang == 'nl')$doc_id
Nl_ids <- sample(Nl_ids,20)
sW_ids <- subset(x=lyrics_df,subset = lang == 'sw')$doc_id
sW_ids <- sample(sW_ids,20)

# replace a misclassified song with a random draw from the English songs not already sampled
En_ids_all <- subset(x=lyrics_df,subset = lang == 'en')$doc_id
En_ids[En_ids == "720ZYTSr4vSqcFYq2CTJKN"] <-
  sample(En_ids_all[!En_ids_all %in% En_ids],1)
rm(En_ids_all)
# replace another misclassified song, this time among the Indonesian songs
Id_ids_all <- subset(x=lyrics_df,subset = lang == 'id')$doc_id
Id_ids[Id_ids == "7mrxKs2fqNiKBE8zePEP2l"] <-
  sample(Id_ids_all[!Id_ids_all %in% Id_ids],1)
rm(Id_ids_all)

# keep only the songs selected, and the columns we need
lyrics_df2 <- lyrics_df %>% subset(select = names(lyrics_df),
                   subset = doc_id %in% c(
                     En_ids, eS_ids, Id_ids, Nl_ids, sW_ids)) %>%
  select(doc_id, text, lang) # we don't need any of that encoding stuff anymore

head(lyrics_df2,2)


| | doc_id | text | lang |
|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` |
| 69 | 5pz1Q9QFHWsUBZiJ73Jx3j | Ze is niet altijd even vrolijk en dat ligt ook wel eens aan mij. En een beetje aan de weerman, maar die maakt eigenlijk niemand blij. En al zijn miezerige buien, daar heeft zij geen boodschap aan. Zij wil alle dagen zon, en als het moet eens een orkaan. Maar net als hem blijf ik proberen, elke dag een flauwe mop. Plots is daar dan toch die glimlach en dan klaart alles, dan klaart alles hier weer op. Want als ze lacht, breekt de hele hemel open, echt ik waan me in de tropen 't is echt machtig als ze lacht. Als ze lacht, baad ik uren in de zon ik wou dat ik dat voor haar kon wat zij voor mij doet als ze lacht. 'k Ben ook niet altijd even vrolijk, maar dat ligt echt wel niet aan mij. ... ******* This Lyrics is NOT for Commercial use ******* (1409618588767) | nl |
| 70 | 6eVj2ARRKQiy2Gb9Za7jnf | Een lach, een groet, een blij gezicht Een vogel zwevend naar het licht. Oh het lijkt zo gewoon maar het is toch een wonder. Een kind dat lacht en naar je zwaait, een fietser die de hoek omdraait. Oh het lijkt zo gewoon maar het is toch een wonder. Het leven gaat zo snel voorbij, dat geld voor jou maar ook voor mij... Oh laat de zon in je hart Ze schijnt toch voor iedereen Geniet van het leven Want het duurt toch maar even. Oh aat de zon in je hart Ze schijnt toch voor iedereen Geniet van het leven ... ******* This Lyrics is NOT for Commercial use ******* (1409618588767) | nl |
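
The per-language sampling above is written out one language at a time. As an alternative, here is a minimal base R sketch of the same stratified idea using `split()` and `lapply()`. The data frame `pool`, its ids, and the pool sizes are hypothetical, and because the random draws happen in a different order, it would not reproduce the exact sample taken above.

```r
set.seed(10)

# Hypothetical pool: 30 song ids per language (the real pools differ in size)
pool <- data.frame(
  doc_id = sprintf("song%03d", 1:90),
  lang   = rep(c("en", "es", "nl"), each = 30),
  stringsAsFactors = FALSE
)

n_per_lang <- 20

# Split the ids by language, then draw n_per_lang ids without replacement
# from each stratum
ids_by_lang <- split(pool$doc_id, pool$lang)
sampled_ids <- unlist(lapply(ids_by_lang, sample, size = n_per_lang),
                      use.names = FALSE)

# Keep only the sampled songs
strat <- pool[pool$doc_id %in% sampled_ids, ]
print(table(strat$lang))
```

Each language ends up with exactly `n_per_lang` songs, which is what makes the sample balanced across strata.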


Create a corpus object out of the lyrics and calculate some metrics on these excerpts. 

In [34]:
# create a corpus of the lyrics
lyrics_corpus <-  corpus(lyrics_df2)

# Let's get some information about our corpus, shall we?

lyrics_sum <- summary(lyrics_corpus, tolower = T, n = 100) %>% 
  select(- Sentences) %>% # we don't care about sentences, song lyrics probably don't follow grammar protocols all that well
  mutate(Text = as.character(Text),
         TTR = Types/Tokens) %>% # add a column for Type-to-Token ratio
  select(-lang, lang) # and move lang to the end

head(lyrics_sum)

| | Text | Types | Tokens | TTR | lang |
|---|---|---|---|---|---|
| | `<chr>` | `<int>` | `<int>` | `<dbl>` | `<chr>` |
| 1 | 5YPmo05u2Kd2HEGLFqaPXQ | 79 | 139 | 0.5683453 | sw |
| 2 | 5yqiverYlKEA0ly96HPq01 | 39 | 58 | 0.6724138 | en |
| 3 | 5UBIHAsEPBH2elLRRWb1SM | 35 | 77 | 0.4545455 | en |
| 4 | 3XmpHN0jwCV1unk8ygs1Ku | 46 | 101 | 0.4554455 | es |
| 5 | 5ISajXj2M1yvkrC0KsgbR5 | 36 | 65 | 0.5538462 | es |
| 6 | 4bfuNyCaeMEMCdxy4lJdVd | 72 | 125 | 0.576 | en |
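
For intuition about the Types, Tokens, and TTR columns, here is a small base R sketch that computes them for a single string. The helper `ttr_sketch()` and its crude tokenizer are hypothetical, written only for illustration. quanteda tokenizes more carefully, so its counts for the same text can differ.

```r
# Crude tokenizer for illustration only: lowercase, then split on anything
# that is not a letter, digit, or apostrophe
ttr_sketch <- function(text) {
  tokens <- strsplit(tolower(text), "[^[:alnum:]']+")[[1]]
  tokens <- tokens[tokens != ""]
  n_types  <- length(unique(tokens))  # distinct words
  n_tokens <- length(tokens)          # all words, repeats included
  c(Types = n_types, Tokens = n_tokens, TTR = n_types / n_tokens)
}

ttr_sketch("Oh laat de zon in je hart, oh laat de zon")
```

The phrase has 11 tokens but only 7 distinct types, so its TTR is 7/11, or about 0.64. Repetition pushes the ratio down, which is why it is used here as a rough measure of lyrical variety.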


Let's join that back to the original data frame, keep only the columns we need, and make the data frame for Audio_Features_and_Lyric_Spread_Metrics.

In [35]:
# dat has a lot of audio features that we want to use, and lyrics_sum has a metric we want to include
esinw_df <- dat %>% 
  dplyr::select(sid, danceability, energy, key, loudness, speechiness, 
                acousticness, valence, tempo, lang) %>%   # we only need some features
  mutate(lang = as.character(lang)) %>%   # remove the factor levels of lang
  mutate(lang = factor(lang)) %>%   # so that they can be reset to have only 5 levels
  inner_join(lyrics_sum %>%   # inner join with lyrics_sum,
              mutate(sid = Text) %>%   # after copying the "Text" column into a new "sid" column
              select(sid, Types, Tokens, TTR),   # and keeping only the sid, Types, Tokens, and TTR columns
            by = "sid")

# let's see what we got now
head(esinw_df,5)

| | sid | danceability | energy | key | loudness | speechiness | acousticness | valence | tempo | lang | Types | Tokens | TTR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` | `<int>` | `<int>` | `<dbl>` |
| 1 | 5YPmo05u2Kd2HEGLFqaPXQ | 0.751 | 0.813 | 1 | -7.764 | 0.0388 | 0.512 | 0.872 | 94.793 | sw | 79 | 139 | 0.5683453 |
| 2 | 5yqiverYlKEA0ly96HPq01 | 0.665 | 0.967 | 0 | -7.776 | 0.0729 | 0.00518 | 0.389 | 157.979 | en | 39 | 58 | 0.6724138 |
| 3 | 5UBIHAsEPBH2elLRRWb1SM | 0.674 | 0.394 | 9 | -12.804 | 0.0409 | 0.48 | 0.859 | 139.184 | en | 35 | 77 | 0.4545455 |
| 4 | 3XmpHN0jwCV1unk8ygs1Ku | 0.778 | 0.813 | 5 | -6.486 | 0.0304 | 0.371 | 0.962 | 140.031 | es | 46 | 101 | 0.4554455 |
| 5 | 5ISajXj2M1yvkrC0KsgbR5 | 0.752 | 0.702 | 2 | -5.024 | 0.0279 | 0.498 | 0.853 | 100.017 | es | 36 | 65 | 0.5538462 |


In [0]:
# Let's save this to an .RData object so we can use it in Audio_Features_and_Lyric_Spread_Metrics
save(esinw_df, file = "AFaLSM.RData")

For A_Brief_Analysis_of_Lyric_Text_Metrics_in_5_Languages, we're not quite done yet. First, we need to do some aggregating.

In [46]:
# get all the averages and totals we might want, grouping by language
lyrics_sum_avgd <- lyrics_sum %>% group_by(lang) %>% 
  summarise("Avg Tokens per Lyric Excerpt" = round(mean(Tokens)),
            "Avg Types per Lyric Excerpt" = round(mean(Types),2), 
            "Avg Type/Token Ratio (TTR)" = round(mean(TTR),3)) %>%  ungroup() %>%
  select(Language = lang, `Avg Tokens per Lyric Excerpt`, `Avg Types per Lyric Excerpt`, `Avg Type/Token Ratio (TTR)`) %>%
  as.data.frame()

# Tada!
lyrics_sum_avgd

| Language | Avg Tokens per Lyric Excerpt | Avg Types per Lyric Excerpt | Avg Type/Token Ratio (TTR) |
|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` |
| en | 136 | 72.2 | 0.55 |
| es | 83 | 41.0 | 0.505 |
| id | 56 | 35.0 | 0.625 |
| nl | 124 | 68.25 | 0.56 |
| sw | 139 | 79.0 | 0.568 |


Once more, but this time, rather than aggregating and averaging, we'll start by concatenating all excerpts of the same language and building a corpus out of the result. Then we'll make a summary table for that pooled corpus so that we can later compare it to the averaged one.

In [43]:
# concatenate songs of the same language together
lyrics_df_cat <- lyrics_df2 %>% group_by(lang) %>%
  summarise("text" = paste(text, collapse = " ")) %>%
  select(doc_id = lang, text)

# And create a new corpus object based on this
lyrics_corpus_cat <- corpus(lyrics_df_cat)

# fetch a summary of the corpus composition
lyrics_sum_cat <- summary(lyrics_corpus_cat, tolower = T) %>% 
  select(- Sentences) %>%
  mutate(Text = as.character(Text),
         TTR = round(Types/Tokens,3)) %>%
  rename(Language = Text)

lyrics_sum_cat

| Language | Types | Tokens | TTR |
|---|---|---|---|
| `<chr>` | `<int>` | `<int>` | `<dbl>` |
| en | 588 | 2543 | 0.231 |
| es | 564 | 2167 | 0.26 |
| id | 506 | 1528 | 0.331 |
| nl | 650 | 2569 | 0.253 |
| sw | 919 | 2229 | 0.412 |
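
The pooled TTRs come out much lower than the averaged per-excerpt TTRs. A toy example suggests why: when excerpts share vocabulary, concatenating them adds tokens faster than it adds new types. The excerpts and the helpers `tokenize_sketch()` and `ttr_of()` below are hypothetical and use a whitespace split rather than quanteda's tokenizer.

```r
excerpts <- c("la la la oh baby", "oh baby baby la la")

# Whitespace tokenizer and TTR helper, for illustration only
tokenize_sketch <- function(x) strsplit(tolower(x), " +")[[1]]
ttr_of <- function(tokens) length(unique(tokens)) / length(tokens)

# Average of the per-excerpt TTRs: each excerpt has 3 types over 5 tokens
mean_ttr <- mean(vapply(excerpts, function(x) ttr_of(tokenize_sketch(x)),
                        numeric(1)))

# TTR of the concatenated text: still 3 types, but now over 10 tokens
pooled_ttr <- ttr_of(tokenize_sketch(paste(excerpts, collapse = " ")))

print(c(mean = mean_ttr, pooled = pooled_ttr))
```

Here the averaged TTR is 0.6 while the pooled TTR is 0.3. TTR depends on text length, so the two tables measure different things and should not be compared cell by cell.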


In [0]:
# Let's save these guys so we can refer back to them later
save(lyrics_sum_avgd, lyrics_sum_cat, file = "ABALTM5L.RData")
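
For reference, objects saved with `save()` come back under their original names when read with `load()`. The round trip below is a minimal sketch using a temporary file and a made-up data frame `demo_df`, not the real output files.

```r
# Hypothetical stand-in for the saved summary tables
demo_df <- data.frame(Language = c("en", "es"), TTR = c(0.55, 0.505))

# Save to a temporary file, remove the object, then restore it
path <- file.path(tempdir(), "demo.RData")
save(demo_df, file = path)
rm(demo_df)

restored <- load(path)  # load() returns the names of the objects it restored
print(restored)
print(demo_df)
```

Because `load()` silently overwrites any existing object with the same name, it can help to check the returned names before relying on the restored objects.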