<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introducción" data-toc-modified-id="Introducción-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Abrir-el-archivo" data-toc-modified-id="Abrir-el-archivo-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Open the file</a></span></li><li><span><a href="#Identificar-número-de-canciones-de-Black-y-de-Power" data-toc-modified-id="Identificar-número-de-canciones-de-Black-y-de-Power-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Count the Black Metal and Power Metal songs</a></span></li><li><span><a href="#Abrir-cada-letra-en-una-lista-de-tokens,-usando-Spacy" data-toc-modified-id="Abrir-cada-letra-en-una-lista-de-tokens,-usando-Spacy-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Parse each lyric into a list of tokens with spaCy</a></span></li><li><span><a href="#Construir-corpus" data-toc-modified-id="Construir-corpus-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Build the corpus</a></span></li><li><span><a href="#Análisis-de-frecuencia-de-términos" data-toc-modified-id="Análisis-de-frecuencia-de-términos-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Term frequency analysis</a></span></li><li><span><a href="#Visualizaciones" data-toc-modified-id="Visualizaciones-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Visualizations</a></span><ul class="toc-item"><li><span><a href="#Frecuencias-brutas" data-toc-modified-id="Frecuencias-brutas-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Raw frequencies</a></span></li><li><span><a href="#Percentil-de-frecuencia" data-toc-modified-id="Percentil-de-frecuencia-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Frequency percentile</a></span></li><li><span><a href="#Coeficientes-de-regresión-L2-penalizados-v/s-log-frecuencia-de-términos." 
data-toc-modified-id="Coeficientes-de-regresión-L2-penalizados-v/s-log-frecuencia-de-términos.-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>L2-penalized regression coefficients vs. log term frequency</a></span></li><li><span><a href="#Método-anterior-v/s-método-de-F-escalado." data-toc-modified-id="Método-anterior-v/s-método-de-F-escalado.-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Previous method vs. the scaled F-score method</a></span></li><li><span><a href="#Razón-de-log-probabilidad-penalizada-(penalized-log-odds-ratio)." data-toc-modified-id="Razón-de-log-probabilidad-penalizada-(penalized-log-odds-ratio).-7.5"><span class="toc-item-num">7.5&nbsp;&nbsp;</span>Penalized log-odds ratio</a></span></li><li><span><a href="#Puntajes-&quot;corner-scores&quot;" data-toc-modified-id="Puntajes-&quot;corner-scores&quot;-7.6"><span class="toc-item-num">7.6&nbsp;&nbsp;</span>Corner scores</a></span></li></ul></li></ul></div>

# Introduction

This notebook explores song lyrics from two Metal sub-genres: Power Metal and Black Metal.

The two sub-genres differ in the mood of the music and in the content of the lyrics. Power Metal, sometimes called "Happy Metal", is charged with optimism, while Black Metal ventures into a darker region of the psyche and carries a strong pessimistic streak.

The visualizations and exercises follow the tutorial written by the author of the Scattertext library, available at https://github.com/JasonKessler/scattertext

# Open the file

In [1]:
import pandas as pd
import numpy as np
import en_core_web_sm
nlp = en_core_web_sm.load()

%matplotlib inline
import scattertext as st
import time
from scipy.stats import rankdata, hmean, norm
from pprint import pprint
from IPython.display import IFrame, display, HTML
from scattertext import CorpusFromPandas, produce_scattertext_explorer
import spacy

In [2]:
df_en = pd.read_csv('./from_scratch/df_en_genre_country_24964lyrics.csv')

# Count the Black Metal and Power Metal songs

In [3]:
print(df_en['genre'].value_counts())

Power Metal    13568
Black Metal    11396
Name: genre, dtype: int64


Power Metal:    13,568 <br>
Black Metal:    11,396

Prepare Power Metal and Black Metal as separate datasets.

In [5]:
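# Split the data by genre.
power = df_en.loc[df_en['genre'] == 'Power Metal']
black = df_en.loc[df_en['genre'] == 'Black Metal']

# The full dataset (both genres) is what gets parsed and turned into a corpus.
power_black = df_en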

# Parse each lyric into a list of tokens with spaCy

In [None]:
# Run the spaCy pipeline over every lyric; this is slow on a dataset of this size.
power_black['parsed'] = power_black.lyric.apply(nlp)

# Build the corpus

In [8]:
corpus = st.CorpusFromPandas(power_black, 
                             category_col='genre', 
                             text_col='lyric',
                             nlp=nlp).build()

# Term frequency analysis

This section lists the characteristic terms for each group (Power Metal and Black Metal). <br>The first score is the harmonic mean of a term's precision and its relative frequency within Black Metal.

In [10]:
term_freq_df = corpus.get_term_freq_df()

# Precision: share of a term's occurrences that come from Black Metal.
term_freq_df['black_precision'] = term_freq_df['Black Metal freq'] / (term_freq_df['Black Metal freq'] + term_freq_df['Power Metal freq'])

# Frequency: share of all Black Metal term occurrences taken up by this term.
term_freq_df['black_freq_pct'] = term_freq_df['Black Metal freq'] / term_freq_df['Black Metal freq'].sum()

# Harmonic mean of the two scores, defined as 0 when either one is 0.
term_freq_df['black_hmean'] = term_freq_df.apply(lambda x: (hmean([x['black_precision'], x['black_freq_pct']])
                                                                   if x['black_precision'] > 0 and x['black_freq_pct'] > 0
                                                                   else 0), axis=1)

term_freq_df.sort_values(by='black_hmean', ascending=False).iloc[:10]

term,Black Metal freq,Power Metal freq,black_precision,black_freq_pct,black_hmean
the,133407,168439,0.44197,0.037607,0.069316
of,65975,59355,0.52641,0.018598,0.035927
and,40961,54710,0.428144,0.011547,0.022487
i,39149,78722,0.332134,0.011036,0.021362
to,38019,65023,0.368966,0.010718,0.02083
in,33181,44556,0.426837,0.009354,0.018306
a,28109,46367,0.377424,0.007924,0.015522
my,22883,37076,0.381644,0.006451,0.012687
you,22958,63236,0.266353,0.006472,0.012637
is,20145,28852,0.411148,0.005679,0.011203
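
The ranking above is dominated by stopwords. A minimal sketch of why, using two rows copied from the notebook outputs (`the` from this table, `ov` from the scaled F-score table further down): the harmonic mean is pulled toward the smaller of its two inputs, and `black_freq_pct` is tiny for every term, so the most frequent words win almost regardless of precision. Only the standard library is used, and `harmonic_mean_pair` is an illustrative helper, not part of Scattertext.

```python
from statistics import harmonic_mean

def harmonic_mean_pair(precision, freq_pct):
    """Harmonic mean of two non-negative scores; 0 if either one is 0."""
    if precision <= 0 or freq_pct <= 0:
        return 0.0
    return harmonic_mean([precision, freq_pct])

# (black_precision, black_freq_pct) copied from the notebook output.
rows = {
    "the": (0.44197, 0.037607),   # stopword: middling precision, highest frequency
    "ov":  (1.000000, 0.000142),  # appears only in Black Metal, but is rare
}

for term, (precision, freq_pct) in rows.items():
    print(term, round(harmonic_mean_pair(precision, freq_pct), 6))
```

Even with a precision of 1.0, `ov` scores far below `the`, which is what motivates rescaling both inputs before combining them.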


In [11]:
def normcdf(x):
    # Map each value to its CDF under a normal fitted to the column's mean and std.
    return norm.cdf(x, x.mean(), x.std())
term_freq_df['black_precision_normcdf'] = normcdf(term_freq_df['black_precision'])
term_freq_df['black_freq_pct_normcdf'] = normcdf(term_freq_df['black_freq_pct'])
term_freq_df['black_scaled_f_score'] = hmean([term_freq_df['black_precision_normcdf'], term_freq_df['black_freq_pct_normcdf']])
term_freq_df.sort_values(by='black_scaled_f_score', ascending=False).iloc[:100]

term,Black Metal freq,Power Metal freq,black_precision,black_freq_pct,black_hmean,black_precision_normcdf,black_freq_pct_normcdf,black_scaled_f_score
ov,502,0,1.000000,0.000142,0.000283,0.841049,0.994059,0.911175
of satan,457,22,0.954071,0.000129,0.000258,0.815667,0.988939,0.893984
satan,1990,255,0.886414,0.000561,0.001121,0.773960,1.000000,0.872579
christian,343,38,0.900262,0.000097,0.000193,0.782906,0.956586,0.861076
nocturnal,309,28,0.916914,0.000087,0.000174,0.793387,0.938295,0.859778
...,...,...,...,...,...,...,...,...
depths,392,166,0.702509,0.000111,0.000221,0.637701,0.974997,0.771076
bestial,127,7,0.947761,0.000036,0.000072,0.811992,0.732763,0.770346
chaos,929,421,0.688148,0.000262,0.000524,0.625938,0.999999,0.769941
desecration,124,6,0.953846,0.000035,0.000070,0.815536,0.727755,0.769149
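
The scaled F-score first passes both columns through a normal CDF, which spreads each of them over (0, 1), and only then takes the harmonic mean. A minimal standard-library sketch of that idea on five toy rows: the values for `satan`, `ov` and `you` are copied from the outputs above, while `sword` and `dragon` are made up. Because the CDF here is fitted to five rows instead of the full vocabulary, the scores will not match the table.

```python
from statistics import NormalDist, harmonic_mean, mean, stdev

def normcdf(values):
    """CDF of each value under a normal fitted to the sample mean and sample std."""
    dist = NormalDist(mean(values), stdev(values))
    return [dist.cdf(v) for v in values]

# term -> (black_precision, black_freq_pct)
toy = {
    "satan":  (0.886414, 0.000561),
    "ov":     (1.000000, 0.000142),
    "you":    (0.266353, 0.006472),
    "sword":  (0.200000, 0.000300),  # hypothetical
    "dragon": (0.050000, 0.000400),  # hypothetical
}

terms = list(toy)
precision_cdf = normcdf([toy[t][0] for t in terms])
freq_cdf = normcdf([toy[t][1] for t in terms])

# Harmonic mean of the two rescaled columns, term by term.
scaled_f = {t: harmonic_mean([p, f]) for t, p, f in zip(terms, precision_cdf, freq_cdf)}
ranked = sorted(scaled_f, key=scaled_f.get, reverse=True)
for t in ranked:
    print(f"{t:<7} {scaled_f[t]:.3f}")
```

After rescaling, a term needs to be both fairly precise and fairly frequent relative to the other terms to rank first.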


In [12]:
term_freq_df['black_corner_score'] = corpus.get_corner_scores('Black Metal')
term_freq_df.sort_values(by='black_corner_score', ascending=False).iloc[:100]

term,Black Metal freq,Power Metal freq,black_precision,black_freq_pct,black_hmean,black_precision_normcdf,black_freq_pct_normcdf,black_scaled_f_score,black_corner_score
ov,502,0,1.0,0.000142,0.000283,0.841049,0.994059,0.911175,0.918921
_,229,0,1.0,0.000065,0.000129,0.841049,0.872120,0.856303,0.918919
satanas,190,0,1.0,0.000054,0.000107,0.841049,0.826251,0.833584,0.918918
_ _,160,0,1.0,0.000045,0.000090,0.841049,0.784623,0.811857,0.918917
for satan,96,0,1.0,0.000027,0.000054,0.841049,0.678887,0.751317,0.918910
...,...,...,...,...,...,...,...,...,...
diaboli,24,0,1.0,0.000007,0.000014,0.841049,0.540169,0.657838,0.918759
the mexican,24,0,1.0,0.000007,0.000014,0.841049,0.540169,0.657838,0.918759
culann,24,0,1.0,0.000007,0.000014,0.841049,0.540169,0.657838,0.918759
fra,24,0,1.0,0.000007,0.000014,0.841049,0.540169,0.657838,0.918759


In [13]:
term_freq_df = corpus.get_term_freq_df()
term_freq_df['Power Metal Score'] = corpus.get_scaled_f_scores('Power Metal')
term_freq_df['Black Metal Score'] = corpus.get_scaled_f_scores('Black Metal')
print("Top 100 Black Metal terms")
pprint(list(term_freq_df.sort_values(by='Black Metal Score', ascending=False).index[:100]))
print("Top 100 Power Metal terms")
pprint(list(term_freq_df.sort_values(by='Power Metal Score', ascending=False).index[:100]))

Top 100 Black Metal terms
['ov',
 'of satan',
 'satan',
 'thou',
 'infernal',
 'forth',
 'christ',
 'fucking',
 'funeral',
 '/',
 'thee',
 'christian',
 'existence',
 'worship',
 'blasphemy',
 'abyss',
 'thy',
 'flesh',
 'horns',
 'serpent',
 'art',
 'which',
 'lucifer',
 'o',
 'upon the',
 'wolves',
 'nocturnal',
 'whore',
 "satan 's",
 'storms',
 'behold',
 'soil',
 'of god',
 'fog',
 'forest',
 'of blood',
 'unto',
 'unholy',
 'the abyss',
 'the flesh',
 'wrath',
 'i shall',
 'chaos',
 'of death',
 'shall be',
 'shall',
 'jesus',
 'realm',
 'satanic',
 'upon',
 'north',
 'tongue',
 'mist',
 'lust',
 'rape',
 'torment',
 'woods',
 'corpse',
 'pale',
 'goat',
 'ancient',
 'blood of',
 'death',
 'towards',
 '&',
 'throne',
 'form',
 'cursed',
 'trees',
 'amp',
 '& amp',
 'depths',
 'behold the',
 'the ancient',
 'the blood',
 '-',
 'hatred',
 'of darkness',
 'winds',
 'total',
 'in blood',
 'yet',
 'drink',
 'ye',
 'fuck',
 'essence',
 'blood',
 'beneath',
 'damnation',
 'birth',
 'of c

# Visualizations

- When raw frequencies are plotted, a word used 10 times in Power Metal lands at position 10 on the x-axis.<br>
- This is not very useful: everything except the most frequent terms ends up crowded into the lower-left corner.<br>
- The top corner-distance scores are mostly stopwords.<br>
- By default, the color of each word represents its Scaled F-Score.
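
The crowding described in the list above follows from how word counts are distributed: a few terms are extremely frequent and most are rare. A small sketch, assuming a synthetic Zipf-like list of counts (not the lyrics data) and plain min-max scaling as a stand-in for a raw-frequency axis:

```python
def min_max_scale(values):
    """Linearly rescale values to the [0, 1] interval."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Synthetic Zipf-like counts: the r-th most frequent term occurs about 1000 / r times.
counts = [1000 // rank for rank in range(1, 201)]

positions = min_max_scale(counts)

# Share of terms that land in the first tenth of the axis.
crowded = sum(1 for p in positions if p < 0.1) / len(positions)
print(f"{crowded:.1%} of terms fall below 0.1 on the scaled axis")
```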

## Raw frequencies

In [18]:
html = produce_scattertext_explorer(corpus,
                                    category='Black Metal',
                                    category_name='Black Metal',
                                    not_category_name='Power Metal',
                                    width_in_pixels=800,
                                    minimum_term_frequency=5,
                                    transform=st.Scalers.scale,
                                    metadata=power_black['band_name'])
file_name = './new_power_black_v4_ugly.html'
open(file_name, 'wb').write(html.encode('utf-8'))

32466015

## Frequency percentile

Rank terms by their frequency percentile instead of their raw frequency. <br>
A term halfway along the x-axis is mentioned in Power Metal with median frequency. <br>
This spreads the terms more evenly across the plot. <br>
On its own, though, this would leave terms with identical frequencies in both classes stacked on top of each other. <br>
Only the first point would be visible; the others would sit "behind" it and could not be reached by mouseover.<br>
The preferred solution is to order terms of equal frequency alphabetically.
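
A minimal plain-Python sketch of percentile ranking with alphabetical tie-breaking. The helper `percentile_rank_alphabetical` and the toy counts are assumptions made for illustration; Scattertext's own scaler may differ in detail.

```python
def percentile_rank_alphabetical(freqs):
    """Map each term to a position in [0, 1] by frequency rank.

    Terms with the same frequency are ordered alphabetically, so every
    term gets its own position instead of stacking on a single point.
    """
    ordered = sorted(freqs, key=lambda term: (freqs[term], term))
    last = max(len(ordered) - 1, 1)
    return {term: rank / last for rank, term in enumerate(ordered)}

# Toy counts (hypothetical); "abyss", "fog" and "mist" are tied.
toy_freqs = {"the": 5000, "blood": 120, "mist": 7, "fog": 7, "abyss": 7}

positions = percentile_rank_alphabetical(toy_freqs)
for term, pos in sorted(positions.items(), key=lambda kv: kv[1]):
    print(f"{term:<6} {pos:.2f}")
```

The three tied terms end up at three distinct, evenly spaced positions, so each one remains reachable in the plot.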

In [19]:
html = produce_scattertext_explorer(corpus,
                                    category='Black Metal',
                                    category_name='Black Metal',
                                    not_category_name='Power Metal',
                                    width_in_pixels=800,
                                    minimum_term_frequency=5,
                                    metadata=power_black['band_name'],
                                    term_significance = st.LogOddsRatioUninformativeDirichletPrior())
file_name = './new_power_black_v4_pretty.html'
open(file_name, 'wb').write(html.encode('utf-8'))
#IFrame(src=file_name, width = 1152, height=800)

33072459

## L2-penalized regression coefficients vs. log term frequency


In [20]:
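def scale(ar):
    # Min-max scale an array to [0, 1].
    return (ar - ar.min()) / (ar.max() - ar.min())

def zero_centered_scale(ar):
    # Scale positive and negative values separately so that 0 maps to 0.5.
    scores = np.zeros(len(ar))
    scores[ar > 0] = scale(ar[ar > 0])
    scores[ar < 0] = -scale(-ar[ar < 0])
    return (scores + 1) / 2.

# Sum only the two count columns: term_freq_df also holds the score columns added earlier.
frequencies_scaled = scale(np.log(term_freq_df[['Black Metal freq', 'Power Metal freq']].sum(axis=1).values))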

In [21]:
from sklearn.linear_model import LogisticRegression
scores = corpus.get_logreg_coefs('Black Metal',
                                 LogisticRegression(penalty='l2', C=10, max_iter=10000, n_jobs=-1))
scores_scaled = zero_centered_scale(scores)

html = produce_scattertext_explorer(corpus,
                                    category='Black Metal',
                                    category_name='Black Metal',
                                    not_category_name='Power Metal',
                                    minimum_term_frequency=5,
                                    width_in_pixels=800,
                                    x_coords=frequencies_scaled,
                                    y_coords=scores_scaled,
                                    scores=scores,
                                    sort_by_dist=False,
                                    metadata=power_black['band_name'],
                                    x_label='Log frequency',
                                    y_label='L2-Penalized Log Reg Coef')
file_name = './new_L2vsLog_pretty_v2.html'
open(file_name, 'wb').write(html.encode('utf-8'))
#IFrame(src=file_name, width = 1200, height=800)

32433739

## Previous method vs. the scaled F-score method

In [22]:
html = produce_scattertext_explorer(corpus,
                                    category='Black Metal',
                                    category_name='Black Metal',
                                    not_category_name='Power Metal',
                                    minimum_term_frequency=5,
                                    width_in_pixels=800,
                                    x_coords=frequencies_scaled,
                                    y_coords=corpus.get_scaled_f_scores('Black Metal', beta=0.5),
                                    scores=corpus.get_scaled_f_scores('Black Metal', beta=0.5),
                                    sort_by_dist=False,
                                    metadata=power_black['band_name'],
                                    x_label='Log Frequency',
                                    y_label='Scaled F-Score')
file_name = './new_power_black_SFSvsLog_pretty_v2.html'
open(file_name, 'wb').write(html.encode('utf-8'))

32438881
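
Both score arguments in the cell above pass `beta=0.5`. Assuming this `beta` plays the same role as in the standard F-beta measure (a weighted harmonic mean; this is an assumption about Scattertext's internals, not something shown in this notebook), values below 1 weight precision more heavily than frequency. A plain-Python sketch of that weighting, with hypothetical inputs:

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta: weighted harmonic mean of precision and recall."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Two hypothetical terms with mirrored scores.
precise_but_rare = f_beta(0.9, 0.3, beta=0.5)
frequent_but_imprecise = f_beta(0.3, 0.9, beta=0.5)

print(round(precise_but_rare, 3), round(frequent_but_imprecise, 3))
```

With `beta=1.0` the function reduces to the ordinary harmonic mean, so the two mirrored terms would tie.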

## Penalized log-odds ratio

Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis.
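
For reference, the statistic from that paper, written with the names the code in this section uses: $y_w^{(i)}$ and $y_w^{(j)}$ are the counts of term $w$ in Black Metal and Power Metal (`y_i`, `y_j`), $n^{(i)}$ and $n^{(j)}$ are the total counts per genre (`n_i`, `n_j`), $\alpha_w$ is the per-term prior (`a_w`), and $\alpha_0 = \sum_w \alpha_w$ (`a_0`).

$$
\hat{\delta}_w^{(i-j)} = \log\frac{y_w^{(i)} + \alpha_w}{n^{(i)} + \alpha_0 - y_w^{(i)} - \alpha_w} - \log\frac{y_w^{(j)} + \alpha_w}{n^{(j)} + \alpha_0 - y_w^{(j)} - \alpha_w}
$$

$$
\sigma^2\!\left(\hat{\delta}_w^{(i-j)}\right) \approx \frac{1}{y_w^{(i)} + \alpha_w} + \frac{1}{n^{(i)} + \alpha_0 - y_w^{(i)} - \alpha_w} + \frac{1}{y_w^{(j)} + \alpha_w} + \frac{1}{n^{(j)} + \alpha_0 - y_w^{(j)} - \alpha_w}
$$

$$
\hat{\zeta}_w^{(i-j)} = \frac{\hat{\delta}_w^{(i-j)}}{\sqrt{\sigma^2\!\left(\hat{\delta}_w^{(i-j)}\right)}}
$$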

In [23]:
freq_df = corpus.get_term_freq_df().rename(columns={'Black Metal freq': 'y_bl', 'Power Metal freq': 'y_pw'})
a_w = 0.01
y_i, y_j = freq_df['y_bl'].values, freq_df['y_pw'].values

In [24]:
n_i, n_j = y_i.sum(), y_j.sum()
a_0 = len(freq_df) * a_w
delta_i_j = (  np.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
                 - np.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
var_delta_i_j = ( 1./(y_i + a_w) + 1./(n_i + a_0 - y_i - a_w)
                    + 1./(y_j + a_w) + 1./(n_j + a_0 - y_j - a_w))
zeta_i_j = delta_i_j/np.sqrt(var_delta_i_j)
max_abs_zeta = max(zeta_i_j.max(), -zeta_i_j.min())
zeta_scaled_for_charting = ((((zeta_i_j > 0).astype(float) * (zeta_i_j/max_abs_zeta))*0.5 + 0.5)
                            + ((zeta_i_j < 0).astype(float) * (zeta_i_j/max_abs_zeta) * 0.5))
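
As a sanity check on the statistic, here is a small standard-library sketch on a made-up three-term vocabulary (the counts are hypothetical). It follows the variance expression from Monroe et al. (2008), and `log_odds_z_scores` is an illustrative helper, not a Scattertext function.

```python
import math

def log_odds_z_scores(counts_i, counts_j, a_w=0.01):
    """Z-scored log-odds ratio with an uninformative Dirichlet prior.

    counts_i and counts_j map each term to its count in the two categories.
    """
    vocab = sorted(set(counts_i) | set(counts_j))
    n_i, n_j = sum(counts_i.values()), sum(counts_j.values())
    a_0 = len(vocab) * a_w
    scores = {}
    for term in vocab:
        y_i, y_j = counts_i.get(term, 0), counts_j.get(term, 0)
        delta = (math.log((y_i + a_w) / (n_i + a_0 - y_i - a_w))
                 - math.log((y_j + a_w) / (n_j + a_0 - y_j - a_w)))
        variance = (1 / (y_i + a_w) + 1 / (n_i + a_0 - y_i - a_w)
                    + 1 / (y_j + a_w) + 1 / (n_j + a_0 - y_j - a_w))
        scores[term] = delta / math.sqrt(variance)
    return scores

# Hypothetical counts, not taken from the lyrics data.
black = {"satan": 40, "the": 500, "dragon": 2}
power = {"satan": 3, "the": 520, "dragon": 45}

zeta = log_odds_z_scores(black, power)
for term, z in sorted(zeta.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<7} {z:+.2f}")
```

Terms typical of the first category get a positive score, terms typical of the second a negative one, and a term used about equally in both stays close to zero.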

In [25]:
html = produce_scattertext_explorer(corpus,
                                    category='Black Metal',
                                    category_name='Black Metal',
                                    not_category_name='Power Metal',
                                    minimum_term_frequency=5,
                                    width_in_pixels=800,
                                    x_coords=frequencies_scaled,
                                    y_coords=zeta_scaled_for_charting,
                                    scores=zeta_i_j,
                                    sort_by_dist=False,
                                    metadata=power_black['band_name'],
                                    x_label='Log Frequency',
                                    y_label='Log Odds Ratio w/ Uninformative Prior (alpha_w=0.01)')
file_name = './new_power_black_LOPriorvsLog_pretty_v2.html'
open(file_name, 'wb').write(html.encode('utf-8'))
#IFrame(src=file_name, width = 1200, height=800)

32419236

## Corner scores

In [26]:
corner_scores = corpus.get_corner_scores('Black Metal')
html = produce_scattertext_explorer(corpus,
                                    category='Black Metal',
                                    category_name='Black Metal',
                                    not_category_name='Power Metal',
                                    minimum_term_frequency=5,
                                    width_in_pixels=800,
                                    x_coords=frequencies_scaled,
                                    y_coords=corner_scores,
                                    scores=corner_scores,
                                    sort_by_dist=False,
                                    metadata=power_black['band_name'],
                                    x_label='Log Frequency',
                                    y_label='Corner Scores')
file_name = './new_power_black_CornervsLog_pretty_v1.html'
open(file_name, 'wb').write(html.encode('utf-8'))
#IFrame(src=file_name, width = 1200, height=800)

32431051