<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introducción" data-toc-modified-id="Introducción-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Abrir-el-archivo" data-toc-modified-id="Abrir-el-archivo-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Open the file</a></span></li><li><span><a href="#Limpiar-caracteres-persistentes" data-toc-modified-id="Limpiar-caracteres-persistentes-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Clean up stray characters</a></span></li><li><span><a href="#Black-Metal:-Suecia-y-Noruega" data-toc-modified-id="Black-Metal:-Suecia-y-Noruega-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Black Metal: Sweden and Norway</a></span><ul class="toc-item"><li><span><a href="#Datasets" data-toc-modified-id="Datasets-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Datasets</a></span><ul class="toc-item"><li><span><a href="#Resumen-de-letras-por-país" data-toc-modified-id="Resumen-de-letras-por-país-4.1.1"><span class="toc-item-num">4.1.1&nbsp;&nbsp;</span>Lyrics summary by country</a></span></li><li><span><a href="#Black-metal:-Norway" data-toc-modified-id="Black-metal:-Norway-4.1.2"><span class="toc-item-num">4.1.2&nbsp;&nbsp;</span>Black metal: Norway</a></span></li><li><span><a href="#Eliminar-canciones-repetidas" data-toc-modified-id="Eliminar-canciones-repetidas-4.1.3"><span class="toc-item-num">4.1.3&nbsp;&nbsp;</span>Remove duplicate songs</a></span></li><li><span><a href="#Construir-corpus" data-toc-modified-id="Construir-corpus-4.1.4"><span class="toc-item-num">4.1.4&nbsp;&nbsp;</span>Build the corpus</a></span></li><li><span><a href="#Frequencies" data-toc-modified-id="Frequencies-4.1.5"><span class="toc-item-num">4.1.5&nbsp;&nbsp;</span>Frequencies</a></span></li></ul></li><li><span><a href="#Visualizaciones" data-toc-modified-id="Visualizaciones-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Visualizations</a></span><ul 
class="toc-item"><li><span><a href="#Log-Freq-vs-Scaled-F-Score" data-toc-modified-id="Log-Freq-vs-Scaled-F-Score-4.2.1"><span class="toc-item-num">4.2.1&nbsp;&nbsp;</span>Log Freq vs Scaled F-Score</a></span></li><li><span><a href="#Log-Frequency-v/s-Log-Odds-Ratio-w/-Uninformative-Prior-(alpha_w=0.01)" data-toc-modified-id="Log-Frequency-v/s-Log-Odds-Ratio-w/-Uninformative-Prior-(alpha_w=0.01)-4.2.2"><span class="toc-item-num">4.2.2&nbsp;&nbsp;</span>Log Frequency v/s Log Odds Ratio w/ Uninformative Prior (alpha_w=0.01)</a></span></li><li><span><a href="#Corner-scores" data-toc-modified-id="Corner-scores-4.2.3"><span class="toc-item-num">4.2.3&nbsp;&nbsp;</span>Corner scores</a></span></li></ul></li></ul></li></ul></div>

# Introduction

This notebook explores the lyrics of Norwegian and Swedish Black Metal. The subgenre became prominent in both countries, and the goal is to examine how the lyrical content of the same genre differs between them.

# Open the file

In [1]:
import pandas as pd
import en_core_web_sm
nlp = en_core_web_sm.load()
import time
import scattertext as st
from scattertext import produce_scattertext_explorer
import numpy as np
from pprint import pprint

In [2]:
df_en_genre_country = pd.read_csv('./from_scratch/df_en_genre_country_24964lyrics.csv')

# Clean up stray characters

In [4]:
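# Drop punctuation and underscores; regex=True is required in recent pandas, where str.replace defaults to literal matching
df_en_genre_country["lyric_clean"] = df_en_genre_country['lyric'].str.replace(r'[^\w\s]|_', '', regex=True)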

In [5]:
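# Remove leftover ' amp ' tokens (presumably from the HTML entity '&amp;') and ' metal ' tokens
df_en_genre_country["lyric_clean"] = df_en_genre_country['lyric_clean'].str.replace(' amp | metal ', '', regex=True)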

In [8]:
df_en_genre_country["song_title_clean"] = df_en_genre_country['song_title'].str.lstrip('0123456789.- ')
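
To make the effect of the cleanup steps concrete, here is a minimal sketch on a made-up lyric and title (both strings are hypothetical, not taken from the dataset). It uses the standard-library `re` module with the same kind of pattern passed to `str.replace` above, so it shows what happens to a single row.

```python
import re

# Hypothetical lyric and title, not taken from the dataset
lyric = "Fire & ice, under_the moon/sky!"
title = "03. - Winter Night"

# Drop everything that is not a word character or whitespace,
# plus underscores (which \w would otherwise keep)
lyric_clean = re.sub(r"[^\w\s]|_", "", lyric)

# Strip leading digits, dots, dashes, and spaces from the title
title_clean = title.lstrip("0123456789.- ")

print(repr(lyric_clean))
print(repr(title_clean))
```

Because matches are replaced with an empty string, words separated only by punctuation are joined (`moon/sky` becomes `moonsky`), and removing `&` leaves a double space. Replacing with a single space instead would keep such words apart, at the cost of changing the token counts reported below.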

# Black Metal: Sweden and Norway

## Datasets

### Lyrics summary by country

Number of Black Metal lyrics per country (Norway and Sweden):

Norway           2388
Sweden           1517

### Black metal: Norway

In [10]:
countries = ['Sweden','Norway']

In [11]:
df_en_black = df_en_genre_country.loc[df_en_genre_country['genre'] == 'Black Metal']
df_en_sweden_norway = df_en_black[df_en_black['country'].isin(countries)]

In [12]:
# Work on an explicit copy so later changes do not touch the filtered view
df_en_sweden_norway = df_en_sweden_norway.copy()

Shape of `df_en_sweden_norway` after filtering: (3905, 17)

### Remove duplicate songs

In [13]:
df_en_sweden_norway = df_en_sweden_norway.drop_duplicates(subset=['song_title_clean'])

Shape of `df_en_sweden_norway` after removing duplicates: (3581, 17)
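
One caveat about the deduplication step: dropping duplicates on `song_title_clean` alone also discards songs by different bands that happen to share a title. The sketch below uses hypothetical rows to compare that with deduplicating on band and title together. `band_name` is the column used later as metadata, and combining the two keys is an assumption, not what this notebook does.

```python
import pandas as pd

# Hypothetical rows: Band A has a repeated song, Band B shares the title
songs = pd.DataFrame({
    "band_name": ["Band A", "Band A", "Band B"],
    "song_title_clean": ["Winter", "Winter", "Winter"],
})

# Title only: every "Winter" after the first is dropped
by_title = songs.drop_duplicates(subset=["song_title_clean"])

# Band and title: only the true repeat within Band A is dropped
by_band_and_title = songs.drop_duplicates(subset=["band_name", "song_title_clean"])

print(len(by_title), len(by_band_and_title))
```

The counts reported in this notebook come from the title-only version, so switching keys would change every number downstream.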

### Build the corpus

In [14]:
start = time.time()

corpus = st.CorpusFromPandas(
    df_en_sweden_norway, category_col='country', text_col='lyric_clean',nlp=nlp).build()

end = time.time()
hours, rem = divmod(end-start, 3600)
minutes, seconds = divmod(rem, 60)
print('Time elapsed generating corpus:')
print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))

Time elapsed generating corpus:
00:01:54.30


Time elapsed generating corpus:
00:02:01.16
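
The elapsed-time formatting is repeated in several cells below. As a small sketch, the same `divmod` arithmetic can be wrapped in a helper; the name `format_elapsed` is hypothetical and does not appear in the notebook.

```python
def format_elapsed(seconds_total):
    """Format a duration in seconds as HH:MM:SS.ss."""
    # Split into whole hours, then whole minutes, leaving fractional seconds
    hours, rem = divmod(seconds_total, 3600)
    minutes, seconds = divmod(rem, 60)
    return "{:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds)

# 3725.5 s is 1 hour, 2 minutes, and 5.5 seconds
print(format_elapsed(3725.5))
```

Each cell could then end with `print(format_elapsed(time.time() - start))`.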

### Frequencies

In [15]:
term_freq_df = corpus.get_term_freq_df()

term_freq_df['Sweden'] = corpus.get_scaled_f_scores('Sweden')
term_freq_df['Norway'] = corpus.get_scaled_f_scores('Norway')


# The labels say "Top 10", but the slice [:80] lists the 80 highest-scoring terms
print("Top 10 Sweden")
pprint(list(term_freq_df.sort_values(by='Sweden', ascending=False).index[:80]))
print("Top 10 Norway")
pprint(list(term_freq_df.sort_values(by='Norway', ascending=False).index[:80]))

Top 10 Sweden
['lord',
 'flames',
 'of god',
 'angels',
 'of blood',
 'thee',
 'behold',
 'holy',
 'forth',
 'son',
 'legions',
 'in blood',
 'shall be',
 'satan',
 'the lord',
 'infernal',
 'gates',
 'breath',
 'the blood',
 'the gates',
 'skies',
 'no more',
 'blood and',
 'divine',
 'flame',
 'thy',
 'blood',
 'damnation',
 'ye',
 'kingdom',
 'angel',
 'wrath',
 'behold the',
 'demons',
 'come forth',
 'shall',
 'eternal',
 'burning',
 'the sky',
 'lord of',
 'wings',
 'god',
 'light of',
 'sky',
 'the black',
 'over the',
 'walk',
 'of death',
 'wide',
 'high',
 'heaven',
 'the flames',
 'shadow',
 'must',
 'oh',
 'light',
 'child',
 'eye',
 'christ',
 'great',
 'ones',
 'the light',
 'cries',
 'of hell',
 'die',
 'the night',
 'to die',
 'let',
 'sacrifice',
 'set',
 'my eyes',
 'rise',
 'burn',
 'over',
 'death',
 'soon',
 'shadows',
 'land',
 'the dark',
 'blood of']
Top 10 Norway
['yourself',
 'thoughts',
 'can not',
 'times',
 'it is',
 'og',
 'mountains',
 'did',
 'future',
 

## Visualizations

### Log Freq vs Scaled F-Score

In [16]:
def scale(ar): 
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

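# Sum only the raw count columns; term_freq_df now also holds the 'Sweden' and 'Norway' score columns
frequencies_scaled = scale(np.log(term_freq_df[['Norway freq', 'Sweden freq']].sum(axis=1).values))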
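
As a quick check of what the two scaling helpers do, the sketch below applies them to small made-up arrays (the input values are hypothetical). `scale` is plain min-max scaling to [0, 1], while `zero_centered_scale` scales positive and negative values separately and then maps the result to [0, 1] with zero at 0.5.

```python
import numpy as np

def scale(ar):
    # Min-max scaling to the range [0, 1]
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    # Scale positives and negatives separately, then shift to [0, 1]
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

# Counts spaced by powers of ten become evenly spaced after the log
log_scaled = scale(np.log(np.array([1, 10, 100, 1000])))
print(log_scaled)

centered = zero_centered_scale(np.array([-4., -2., 0., 1., 3.]))
print(centered)
```

Because each sign is min-max scaled on its own, the smallest positive and the smallest-magnitude negative value both land on 0.5, together with zero: the second array maps to 0, 0.5, 0.5, 0.5, 1.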

In [19]:
start = time.time()

html = produce_scattertext_explorer(corpus,
                                    category='Norway',
                                    category_name='Norway',
                                    not_category_name='Sweden',
                                    minimum_term_frequency=5,
                                    width_in_pixels=800,
                                    x_coords=frequencies_scaled,
                                    y_coords=corpus.get_scaled_f_scores('Norway', beta=0.5),
                                    scores=corpus.get_scaled_f_scores('Norway', beta=0.5),
                                    sort_by_dist=False,
                                    metadata=df_en_sweden_norway['band_name'],
                                    x_label='Log Frequency',
                                    y_label='Scaled F-Score')
file_name = './st_sweden_norway_01.html'
# Use a context manager so the file is closed after writing
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

end = time.time()
hours, rem = divmod(end-start, 3600)
minutes, seconds = divmod(rem, 60)
print('Time elapsed generating scattertext file:')
print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))

Time elapsed generating scattertext file:
00:00:04.07


Time elapsed generating scattertext file:
00:00:10.02

### Log Frequency v/s Log Odds Ratio w/ Uninformative Prior (alpha_w=0.01)

In [20]:
# y_norway / y_sweden hold the raw term counts for each country
freq_df = corpus.get_term_freq_df().rename(columns={'Norway freq': 'y_norway', 'Sweden freq': 'y_sweden'})
a_w = 0.01
y_i, y_j = freq_df['y_norway'].values, freq_df['y_sweden'].values

In [21]:
n_i, n_j = y_i.sum(), y_j.sum()
a_0 = len(freq_df) * a_w
delta_i_j = (  np.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
                 - np.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
# Each complement term uses n - y, mirroring the denominators in delta_i_j
var_delta_i_j = ( 1./(y_i + a_w) + 1./(n_i + a_0 - y_i - a_w)
                    + 1./(y_j + a_w) + 1./(n_j + a_0 - y_j - a_w))
zeta_i_j = delta_i_j/np.sqrt(var_delta_i_j)
max_abs_zeta = max(zeta_i_j.max(), -zeta_i_j.min())
zeta_scaled_for_charting = ((((zeta_i_j > 0).astype(float) * (zeta_i_j/max_abs_zeta))*0.5 + 0.5)
                            + ((zeta_i_j < 0).astype(float) * (zeta_i_j/max_abs_zeta) * 0.5))
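
To see how the z-scored log-odds ratio behaves, here is a minimal sketch on three hypothetical terms with made-up counts. It assumes the variance uses `n - y` in the complement terms, matching the denominators of `delta_i_j`. With equal category totals, a term used equally often in both categories should score zero, and mirrored counts should give mirrored scores.

```python
import numpy as np

# Hypothetical counts for three terms in two categories (not real data)
y_i = np.array([30., 5., 10.])   # counts in category i
y_j = np.array([5., 30., 10.])   # counts in category j

a_w = 0.01                        # uninformative prior per term
n_i, n_j = y_i.sum(), y_j.sum()   # total counts per category
a_0 = len(y_i) * a_w              # total prior mass

# Difference of smoothed log-odds between the two categories
delta = (np.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
         - np.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))

# Approximate variance of that difference (assumed form, see lead-in)
var = (1. / (y_i + a_w) + 1. / (n_i + a_0 - y_i - a_w)
       + 1. / (y_j + a_w) + 1. / (n_j + a_0 - y_j - a_w))

# z-score: positive favors category i, negative favors category j
zeta = delta / np.sqrt(var)
print(zeta)
```

Dividing by the standard deviation downweights rare terms, whose log-odds estimates are noisy, so the extremes of the chart are not dominated by words that appear only a handful of times.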

In [22]:
start = time.time()

html = produce_scattertext_explorer(corpus,
                                    category='Norway',
                                    category_name='Norway',
                                    not_category_name='Sweden',
                                    minimum_term_frequency=5,
                                    width_in_pixels=800,
                                    x_coords=frequencies_scaled,
                                    y_coords=zeta_scaled_for_charting,
                                    scores=zeta_i_j,
                                    sort_by_dist=False,
                                    metadata=df_en_sweden_norway['band_name'],
                                    x_label='Log Frequency',
                                    y_label='Log Odds Ratio w/ Uninformative Prior (alpha_w=0.01)')
file_name = './st_sweden_norway_02.html'
# Use a context manager so the file is closed after writing
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

end = time.time()
hours, rem = divmod(end-start, 3600)
minutes, seconds = divmod(rem, 60)
print('Time elapsed generating scattertext file:')
print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))

Time elapsed generating scattertext file:
00:00:03.54


### Corner scores

In [23]:
start = time.time()

corner_scores = corpus.get_corner_scores('Norway')
html = produce_scattertext_explorer(corpus,
                                    category='Norway',
                                    category_name='Norway',
                                    not_category_name='Sweden',
                                    minimum_term_frequency=5,
                                    width_in_pixels=800,
                                    x_coords=frequencies_scaled,
                                    y_coords=corner_scores,
                                    scores=corner_scores,
                                    sort_by_dist=False,
                                    metadata=df_en_sweden_norway['band_name'],
                                    x_label='Log Frequency',
                                    y_label='Corner Scores')
file_name = './st_sweden_norway_03.html'
# Use a context manager so the file is closed after writing
with open(file_name, 'wb') as f:
    f.write(html.encode('utf-8'))

end = time.time()
hours, rem = divmod(end-start, 3600)
minutes, seconds = divmod(rem, 60)
print('Time elapsed generating scattertext file:')
print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))

Time elapsed generating scattertext file:
00:00:03.94
