## Importing the data

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('beyonce_rihanna.csv', index_col=0)
df.head()

Unnamed: 0,Nome da Música,link,album,letra,artista
0,'03 Bonnie & Clyde,/beyonce/03-bonnie-clyde.html,I Am... Yours: An Intimate Performance at Wynn...,Jay-z Uh-uh-uh You ready b? Let's go get 'em. ...,Beyoncé
1,***Flawless (Feat. Chimamanda Ngozi Adichie),/beyonce/flawless-feat-chimamanda-ngozi-adichi...,BEYONCÉ,Your challengers are a young group from Housto...,Beyoncé
2,***Flawless (Feat. Nicki Minaj),/beyonce/flawless-feat-nicki-minaj.html,BEYONCÉ [Platinum Edition],"Dum-da-de-da Do, do, do, do, do, do (Coming do...",Beyoncé
3,1+1,/beyonce/11.html,BEYONCÉ [Platinum Edition],If I ain't got nothing I got you If I ain't go...,Beyoncé
4,6 Inch (Feat. The Weeknd),/beyonce/6-inch-feat-the-weeknd.html,LEMONADE,Six inch heels She walked in the club like nob...,Beyoncé


## Preprocessing

Preprocessing prepares the raw data so that it can be analyzed.

Let's take one song from the dataset as an example to walk through each preprocessing step:

In [None]:
# Using one song as an example
exemplo = df['letra'][10]
exemplo

"Here I am Looking in the mirror An open face, the pain erased Now the sky is clearer I can see the sun Now that all is, all is said and done, oh  There you are Always strong when I need you You let me give And now I live, fearless and protected With the one I will love After all is, all is said and done  I once believed that hearts were made to bleed (Inside I once believed that hearts were made to bleed, oh baby) But now I'm not afraid to say I need you, I need you so stay with me  These precious (precious) hours (yeah) Greet each dawn in open arms And dream, into tomorrow  Where there's only love After all is, all is said and done  (Yeah baby) Oh baby (Inside I once believed, That hearts were meant to bleed)  (I'll never) I'll never be afraid to say I need you, I need you, so here  Here we are in the still of this moment Fear is gone, hope lives on  We found our happy ending For there's only love (only love) And this sweet, sweet love After all is, all is said and done  Yeah baby af

### Tokenization
- Take each word of the text and store it in a list (the items of this list are called tokens);
- If a word appears more than once in the text, it appears the same number of times in the list.
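To make the two points above concrete, here is a minimal sketch that uses only the standard library. It is a rough approximation and not NLTK's algorithm; `simple_tokenize` is a hypothetical helper introduced for illustration.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Either a run of word characters or a single punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("all is, all is said and done")
print(tokens)

# Repeated words stay in the list, so token counts mirror the text.
counts = Counter(tokens)
print(counts["all"], counts["is"])
```

Because duplicates are preserved, the token list can later be turned into word counts without going back to the raw text.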

In [None]:
from nltk.tokenize import word_tokenize

In [None]:
# Tokenizing the example song
tokens = word_tokenize(exemplo)
tokens

['Here',
 'I',
 'am',
 'Looking',
 'in',
 'the',
 'mirror',
 'An',
 'open',
 'face',
 ',',
 'the',
 'pain',
 'erased',
 'Now',
 'the',
 'sky',
 'is',
 'clearer',
 'I',
 'can',
 'see',
 'the',
 'sun',
 'Now',
 'that',
 'all',
 'is',
 ',',
 'all',
 'is',
 'said',
 'and',
 'done',
 ',',
 'oh',
 'There',
 'you',
 'are',
 'Always',
 'strong',
 'when',
 'I',
 'need',
 'you',
 'You',
 'let',
 'me',
 'give',
 'And',
 'now',
 'I',
 'live',
 ',',
 'fearless',
 'and',
 'protected',
 'With',
 'the',
 'one',
 'I',
 'will',
 'love',
 'After',
 'all',
 'is',
 ',',
 'all',
 'is',
 'said',
 'and',
 'done',
 'I',
 'once',
 'believed',
 'that',
 'hearts',
 'were',
 'made',
 'to',
 'bleed',
 '(',
 'Inside',
 'I',
 'once',
 'believed',
 'that',
 'hearts',
 'were',
 'made',
 'to',
 'bleed',
 ',',
 'oh',
 'baby',
 ')',
 'But',
 'now',
 'I',
 "'m",
 'not',
 'afraid',
 'to',
 'say',
 'I',
 'need',
 'you',
 ',',
 'I',
 'need',
 'you',
 'so',
 'stay',
 'with',
 'me',
 'These',
 'precious',
 '(',
 'precious',
 ')

### Selecting only the letters

- Use a regex (regular expression) to keep only runs of letters;
    - re.findall does this and the tokenization in a single step;
- Convert to lowercase (.lower(), built into Python).
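One detail of character classes is worth checking by hand: the range `A-z` spans every ASCII code from `A` to `z`, which includes a few punctuation characters between the uppercase and lowercase letters. The sketch below compares it with explicit ranges on a made-up string (the sample text is an assumption, not taken from the dataset).

```python
import re

text = "shine_bright [like] a diamond, adiós"

# A-z covers ASCII 65..122, which also contains [ \ ] ^ _ and the backtick.
loose = re.findall(r"[A-z]+", text)

# Explicit ranges keep only letters, plus the accented range À-ú and ü.
strict = re.findall(r"[A-Za-zÀ-úü]+", text.lower())

print(loose)
print(strict)
```

With the explicit ranges, brackets and underscores act as separators and accented words such as "adiós" stay in one piece.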

In [None]:
import re  # regular expressions

# A-Za-z instead of A-z: the range A-z also matches [ \ ] ^ _ and `
letras = re.findall(r'\b[A-Za-zÀ-úü]+\b', exemplo.lower())
letras

['here',
 'i',
 'am',
 'looking',
 'in',
 'the',
 'mirror',
 'an',
 'open',
 'face',
 'the',
 'pain',
 'erased',
 'now',
 'the',
 'sky',
 'is',
 'clearer',
 'i',
 'can',
 'see',
 'the',
 'sun',
 'now',
 'that',
 'all',
 'is',
 'all',
 'is',
 'said',
 'and',
 'done',
 'oh',
 'there',
 'you',
 'are',
 'always',
 'strong',
 'when',
 'i',
 'need',
 'you',
 'you',
 'let',
 'me',
 'give',
 'and',
 'now',
 'i',
 'live',
 'fearless',
 'and',
 'protected',
 'with',
 'the',
 'one',
 'i',
 'will',
 'love',
 'after',
 'all',
 'is',
 'all',
 'is',
 'said',
 'and',
 'done',
 'i',
 'once',
 'believed',
 'that',
 'hearts',
 'were',
 'made',
 'to',
 'bleed',
 'inside',
 'i',
 'once',
 'believed',
 'that',
 'hearts',
 'were',
 'made',
 'to',
 'bleed',
 'oh',
 'baby',
 'but',
 'now',
 'i',
 'm',
 'not',
 'afraid',
 'to',
 'say',
 'i',
 'need',
 'you',
 'i',
 'need',
 'you',
 'so',
 'stay',
 'with',
 'me',
 'these',
 'precious',
 'precious',
 'hours',
 'yeah',
 'greet',
 'each',
 'dawn',
 'in',
 'open',
 

### Stopwords
- Stopwords are words that appear frequently in a text but carry little semantic weight, and they can hurt our model, so we will remove them.
- For example, the list of Portuguese stopwords:

In [None]:
import nltk

# Downloading the stopword lists and importing them
nltk.download('stopwords')
from nltk.corpus import stopwords

In [4]:
pt_stops = stopwords.words('portuguese')
pt_stops  # list of Portuguese stopwords

['de',
 'a',
 'o',
 'que',
 'e',
 'é',
 'do',
 'da',
 'em',
 'um',
 'para',
 'com',
 'não',
 'uma',
 'os',
 'no',
 'se',
 'na',
 'por',
 'mais',
 'as',
 'dos',
 'como',
 'mas',
 'ao',
 'ele',
 'das',
 'à',
 'seu',
 'sua',
 'ou',
 'quando',
 'muito',
 'nos',
 'já',
 'eu',
 'também',
 'só',
 'pelo',
 'pela',
 'até',
 'isso',
 'ela',
 'entre',
 'depois',
 'sem',
 'mesmo',
 'aos',
 'seus',
 'quem',
 'nas',
 'me',
 'esse',
 'eles',
 'você',
 'essa',
 'num',
 'nem',
 'suas',
 'meu',
 'às',
 'minha',
 'numa',
 'pelos',
 'elas',
 'qual',
 'nós',
 'lhe',
 'deles',
 'essas',
 'esses',
 'pelas',
 'este',
 'dele',
 'tu',
 'te',
 'vocês',
 'vos',
 'lhes',
 'meus',
 'minhas',
 'teu',
 'tua',
 'teus',
 'tuas',
 'nosso',
 'nossa',
 'nossos',
 'nossas',
 'dela',
 'delas',
 'esta',
 'estes',
 'estas',
 'aquele',
 'aquela',
 'aqueles',
 'aquelas',
 'isto',
 'aquilo',
 'estou',
 'está',
 'estamos',
 'estão',
 'estive',
 'esteve',
 'estivemos',
 'estiveram',
 'estava',
 'estávamos',
 'estavam',
 'estivera'

Back to our example:

In [None]:
stops = stopwords.words('english')
stops  # list of English stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

Removing the stopwords:

In [None]:
# Build a list without the stopwords
sem_stopwords = [palavra for palavra in letras if palavra not in stops]
# Join the tokens into a single string
palavras_importantes = " ".join(sem_stopwords)
palavras_importantes

'looking mirror open face pain erased sky clearer see sun said done oh always strong need let give live fearless protected one love said done believed hearts made bleed inside believed hearts made bleed oh baby afraid say need need stay precious precious hours yeah greet dawn open arms dream tomorrow love said done yeah baby oh baby inside believed hearts meant bleed never never afraid say need need still moment fear gone hope lives found happy ending love love sweet sweet love said done yeah baby said done'
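The filtering step can be sketched without NLTK as follows. The stopword set here is a tiny hand-picked assumption for illustration, not NLTK's English list, and `remove_stopwords` is a hypothetical helper. Storing the stopwords in a set is a small design choice: membership tests on a set take constant time on average, while a list is scanned item by item.

```python
# Hypothetical mini stopword set, for illustration only (not NLTK's list).
STOPS = frozenset({"here", "i", "am", "in", "the", "an", "now", "is"})

def remove_stopwords(tokens, stops=STOPS):
    # Keep every token that is not a stopword, preserving order and repeats.
    return [t for t in tokens if t not in stops]

tokens = "here i am looking in the mirror".split()
filtered = remove_stopwords(tokens)
print(" ".join(filtered))
```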

### Lemmatization
- Lexical simplification: take the lemma of each word, that is, its uninflected form;
    - Example: converting verbs to the infinitive;
- We will use the lemmatizer from the spaCy library.
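As a rough intuition for what a lemmatizer returns, the sketch below maps inflected forms to lemmas with a hand-written lookup table. This is only a toy: the table and the `lemmatize` helper are assumptions made for illustration, and the sketch does not claim to reproduce how spaCy works internally.

```python
# Hypothetical lookup table with a few hand-picked verb forms.
LEMMAS = {
    "looking": "look",
    "erased": "erase",
    "believed": "believe",
    "made": "make",
    "said": "say",
    "done": "do",
}

def lemmatize(tokens, table=LEMMAS):
    # Replace a token by its lemma when known, otherwise keep it unchanged.
    return [table.get(t, t) for t in tokens]

lemmatized = lemmatize("looking mirror said done hearts".split())
print(" ".join(lemmatized))
```

A real lemmatizer also uses the part of speech and the context of each token, which a plain dictionary cannot do.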

In [None]:
import spacy
spc = spacy.load('en_core_web_sm')

In [None]:
spc_letras = spc(palavras_importantes)  # run the spaCy pipeline to get a Doc object

# Build the lemmatized list (only verbs are replaced by their lemma)
lemmas = [token.lemma_ if token.pos_ == 'VERB' else str(token) for token in spc_letras]

texto_limpo = " ".join(lemmas)
print(texto_limpo)

look mirror open face pain erase sky clearer see sun say do oh always strong need let give live fearless protect one love say do believe hearts make bleed inside believe hearts make bleed oh baby afraid say need need stay precious precious hours yeah greet dawn open arms dream tomorrow love say do yeah baby oh baby inside believe hearts mean bleed never never afraid say need nee still moment fear go hope lives find happy end love love sweet sweet love say do yeah baby say do


### Putting it all together

Let's build a function that performs all the preprocessing steps:

In [None]:
def limpar_texto(texto):
    # keep only the letters and convert to lowercase
    letras = re.findall(r'\b[A-Za-zÀ-úü]+\b', texto.lower())

    # remove the stopwords
    # NOTE: this loads the Portuguese list, unlike the walkthrough above, so
    # English stopwords such as "you" and "the" survive in the outputs below;
    # pass 'english' here to match the walkthrough.
    stops = set(stopwords.words('portuguese'))
    palavras = [w for w in letras if w not in stops]
    palavras_importantes = " ".join(palavras)

    # lemmatization (only verbs are replaced by their lemma)
    spc_letras = spc(palavras_importantes)
    lemmas = [token.lemma_ if token.pos_ == 'VERB' else str(token) for token in spc_letras]
    texto_limpo = " ".join(lemmas)

    return texto_limpo

Now let's apply it to the lyrics in our dataset:

In [None]:
# Create a new column with the cleaned text
df['Texto Limpo'] = df['letra'].apply(limpar_texto)

In [None]:
df.head()

Unnamed: 0,Nome da Música,link,album,letra,artista,Texto Limpo
0,'03 Bonnie & Clyde,/beyonce/03-bonnie-clyde.html,I Am... Yours: An Intimate Performance at Wynn...,Jay-z Uh-uh-uh You ready b? Let's go get 'em. ...,Beyoncé,jay z uh uh uh you ready b let s go get look y...
1,***Flawless (Feat. Chimamanda Ngozi Adichie),/beyonce/flawless-feat-chimamanda-ngozi-adichi...,BEYONCÉ,Your challengers are a young group from Housto...,Beyoncé,your challengers are young group from houston ...
2,***Flawless (Feat. Nicki Minaj),/beyonce/flawless-feat-nicki-minaj.html,BEYONCÉ [Platinum Edition],"Dum-da-de-da Do, do, do, do, do, do (Coming do...",Beyoncé,dum come down drip candy on the ground it stay...
3,1+1,/beyonce/11.html,BEYONCÉ [Platinum Edition],If I ain't got nothing I got you If I ain't go...,Beyoncé,if i ain t get nothing i get you if i ain t ge...
4,6 Inch (Feat. The Weeknd),/beyonce/6-inch-feat-the-weeknd.html,LEMONADE,Six inch heels She walked in the club like nob...,Beyoncé,six inch heels she walk in the club like nobod...


## Feature Extraction
Before training our model, we need to turn our documents into features a computer can work with, so we have to convert the text into some numerical representation. For that, we will use Bag of Words.

### Bag of Words
**What is Bag of Words?** BoW is a text representation that describes the occurrence of words in a document. Word order is ignored: the representation only records which known words occur in the document and how often (literally a "bag" of words).

To implement Bag of Words, we need three things:
1. A vocabulary of known words
2. The number of occurrences of each of those words
3. Vectors built from the documents

**Example**

"to the left to the left everything you own in the box to the left"

1. Build the vocabulary

    ["to", "the", "left", "everything", "you", "own", "in", "box"]
    

2. Count the occurrences

    {"to": 3, "the": 4, "left": 3, "everything": 1, "you": 1, "own": 1, "in": 1, "box": 1}


3. Vectors

    Suppose our document were: "to the left to the left"

    Using the vocabulary built above, our vector would be:

    [2, 2, 2, 0, 0, 0, 0, 0]
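The three steps can be reproduced by hand with a few lines of plain Python. This is a minimal sketch under simple assumptions (whitespace tokenization, vocabulary in order of first appearance); `build_vocabulary` and `vectorize` are hypothetical helpers, not library functions.

```python
from collections import Counter

def build_vocabulary(text):
    # Unique words in order of first appearance.
    return list(dict.fromkeys(text.split()))

def vectorize(text, vocabulary):
    # One count per vocabulary word; words absent from the text count as 0.
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

corpus = "to the left to the left everything you own in the box to the left"

vocab = build_vocabulary(corpus)
occurrences = dict(Counter(corpus.split()))
vector = vectorize("to the left to the left", vocab)

print(vocab)
print(occurrences)
print(vector)
```

The vector always has one entry per vocabulary word, so its length is 8 here, and "the" is counted four times because it also appears in "in the box".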

### Count Vectorizer
Fortunately, scikit-learn provides CountVectorizer, which implements all of the steps above in a few lines:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer 

# Bag of words
count_vectorizer = CountVectorizer()
X = count_vectorizer.fit_transform(df['Texto Limpo'])

Looking at our vocabulary:

In [None]:
# Every word in our vocabulary (get_feature_names() was removed in scikit-learn 1.2)
count_vectorizer.get_feature_names_out().tolist()

['aa',
 'aaaaaah',
 'aaah',
 'aah',
 'aahhhh',
 'aaron',
 'abandon',
 'abanenkani',
 'abaziyo',
 'abit',
 'abita',
 'able',
 'aboard',
 'about',
 'above',
 'abrasive',
 'absolutely',
 'abstain',
 'abu',
 'abunch',
 'abuse',
 'acabado',
 'acabo',
 'acabó',
 'acaso',
 'accent',
 'accept',
 'accepte',
 'acceptin',
 'access',
 'accidentally',
 'accomodation',
 'accomplishments',
 'account',
 'accountant',
 'accule',
 'accusations',
 'ace',
 'ache',
 'achetant',
 'achieve',
 'achètera',
 'acompañarme',
 'across',
 'act',
 'actin',
 'acting',
 'action',
 'activité',
 'actor',
 'actress',
 'actual',
 'actually',
 'acuerdo',
 'ad',
 'add',
 'addict',
 'addicted',
 'addiction',
 'addictive',
 'address',
 'adichie',
 'adicto',
 'adiós',
 'adjust',
 'adlibs',
 'admire',
 'admit',
 'admittin',
 'adolescent',
 'adonde',
 'adoration',
 'adore',
 'adorent',
 'adrenaline',
 'adult',
 'advance',
 'advanced',
 'advantage',
 'advice',
 'advise',
 'affair',
 'affect',
 'affectin',
 'affection',
 'affectio

In [None]:
count_vectorizer.vocabulary_.get('love')

3543

A sample of our document-term matrix:

In [None]:
df_cv = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names_out())
df_cv.head()

Unnamed: 0,aa,aaaaaah,aaah,aah,aahhhh,aaron,abandon,abanenkani,abaziyo,abit,...,égaux,élever,élu,éléverons,état,été,évite,évoque,êt,única
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In the dataframe above, each column represents one word of our vocabulary and each row represents one of our documents, that is, one of our songs.

Since there are many columns, let's inspect just ten of them:
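The row and column layout can be illustrated on a toy corpus with plain Python. The three documents below are made up for the example, and the vocabulary is sorted alphabetically to mirror the listing shown earlier; this is a sketch of the idea, not of CountVectorizer's implementation.

```python
from collections import Counter

# Hypothetical toy corpus: one string per document.
docs = ["think about you", "dream about you all day", "shine bright"]

# Columns: every distinct word, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

# Rows: one count vector per document.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

# Reading one column gives the count of that word in every document.
about_column = [row[vocab.index("about")] for row in matrix]

print(vocab)
for row in matrix:
    print(row)
print(about_column)
```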

In [None]:
df_cv.iloc[:, 10:20]

Unnamed: 0,abita,able,aboard,about,above,abrasive,absolutely,abstain,abu,abunch
0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0
3,0,0,0,3,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
503,0,0,0,1,0,0,0,0,0,0
504,0,0,0,14,0,0,0,0,0,0
505,0,0,0,0,0,0,0,0,0,0
506,0,0,0,0,0,0,0,0,0,0


We can see that the word "about" appears 14 times in document 504. Let's investigate:

In [None]:
df.iloc[504, 5]

'you the one that i dream about all day you the one that i think about always you are the one so i make sure i behave my love is your love your love is my love baby i love you i need you here with all the time ime baby we mean to be you get smile all the time ime cause you know how to give that you know how to pull back when i go runnin runnin tryin to get away from love ya you know how to love hard i win t lie i m fall hard yep i m fall ya but there s nothin wrong with that you the one that i dream about all day you the one that i think about always you are the one so i make sure i behave my love is your love your love is my love you the one that i dream about all day you the one that i think about always you are the one so i make sure i behave my love is your love your love is mine baby come take now hold now make come alive i have you get the sweetest touch i m so happy you come in my life life cause you know how to give that you know how to pull back when i go runnin runnin tryin t

Indeed, this song repeats "about" throughout its chorus ("dream about all day", "think about always").

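## Splitting into Train and Test
- We hold out 20% of the songs as a test set, so the model is evaluated on lyrics it did not see during training.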
In [None]:
from sklearn.model_selection import train_test_split

X = X.toarray()
y = df['artista']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

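## Naive Bayes
- Multinomial Naive Bayes models each class by its distribution of word counts and assumes that words are independent of each other given the class, which makes it a natural fit for Bag of Words features.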
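For intuition, here is a compact sketch of a multinomial Naive Bayes classifier in plain Python, with add-one (Laplace) smoothing. The corpus, the labels and the helpers `train_nb` and `predict_nb` are assumptions made for illustration; scikit-learn's MultinomialNB is the tool to use in practice (it also defaults to `alpha=1.0`).

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    vocab = {word for doc in docs for word in doc.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_docs.items():
        # Smoothed denominator: total words in the class plus alpha per vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / total)
                         for w in vocab},
        }
    return model

def predict_nb(model, doc):
    scores = {}
    for label, params in model.items():
        # Sum log-probabilities; words outside the vocabulary are skipped.
        scores[label] = params["log_prior"] + sum(
            params["log_like"][w] for w in doc.split() if w in params["log_like"])
    return max(scores, key=scores.get)

# Hypothetical toy corpus, not taken from the dataset.
docs = ["halo halo love", "crazy love halo", "diamond shine bright", "umbrella shine diamond"]
labels = ["Beyoncé", "Beyoncé", "Rihanna", "Rihanna"]

model = train_nb(docs, labels)
pred_a = predict_nb(model, "halo love")
pred_b = predict_nb(model, "shine diamond")
print(pred_a, pred_b)
```

Working with log-probabilities avoids numeric underflow when many small probabilities are multiplied, and smoothing keeps a word unseen in one class from forcing that class's probability to zero.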

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Creating the Naive Bayes model
naive_bayes = MultinomialNB()

# Training the model
naive_bayes.fit(X_train, y_train)

# Making the predictions
naive_bayes_pred = naive_bayes.predict(X_test)

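## Metrics
- Accuracy is the fraction of test songs whose artist was predicted correctly. In the confusion matrix, rows are the true artists and columns are the predicted ones, in sorted label order (Beyoncé, then Rihanna). The run below classifies about 68% of the test songs correctly.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix

# Computing the accuracy (scikit-learn expects y_true first, then y_pred)
acc = accuracy_score(y_test, naive_bayes_pred)

# Confusion matrix (rows: true labels, columns: predicted labels)
cm = confusion_matrix(y_test, naive_bayes_pred)

print("Model accuracy", acc)
print("\nConfusion matrix: \n", cm)

Model accuracy 0.6764705882352942

Confusion matrix: 
 [[39 24]
 [ 9 30]]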
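Both metrics are simple enough to compute by hand, which helps when reading the numbers above. The sketch below uses made-up labels ("B" and "R") and hypothetical helpers `accuracy` and `confusion`; in this sketch rows are true labels and columns are predicted labels. It also recomputes the accuracy of the run above from the four cells of its confusion matrix, since accuracy is the diagonal sum divided by the total.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where the prediction equals the true label.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion(y_true, y_pred, labels):
    # matrix[i][j]: samples with true label i that were predicted as label j.
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical toy labels.
y_true = ["B", "B", "B", "R", "R"]
y_pred = ["B", "B", "R", "R", "B"]

toy_acc = accuracy(y_true, y_pred)
toy_cm = confusion(y_true, y_pred, labels=["B", "R"])
print(toy_acc)
print(toy_cm)

# Accuracy of the run above: correct predictions (diagonal) over all test songs.
run_accuracy = (39 + 30) / (39 + 9 + 24 + 30)
print(run_accuracy)
```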


## Evaluating the songs
- I held the songs "Drunk in Love" (Beyoncé) and "Diamonds" (Rihanna) out of the dataset so we can check by hand whether the model predicts the singers correctly:

In [None]:
frase_beyonce = ["i ve been drinking i ve been drinking i get filthy when that liquor gets into i ve been thinking i ve been thinking why can t i keep my fingers off you baby i want you why can t i keep my fingers off you baby i want you cigars on ice cigars on ice feeling like an animal with these cameras all in my grill flashing lights flashing lights you got faded faded faded baby i want you can t keep your eyes off my fatty daddy i want you drunk in love drunk in love we be all night last thing i remember is our beautiful bodies grinding off in that club drunk in love we be all night love love we be all night love we be all night and everything alright complaints cause my body so fluorescent under these lights boy i m drinking walking in my l assemblage i m grubbing on the rope grubbing if you scared call that reverend boy i m drinking get my brain right i m on the cognac gangster wife new sheets he d swear that i like washed rags he wet up boy i m drinking i m sinking on the mic til my boy toys then i fill the tub up halfway then ride it with my surfboard surfboard surfboard graining on that wood graining graining on that wood i m swerving on that swerving swerving on that big body benz serving all this swerv surfing all of this good good drunk in love we be all night last thing i remember is our beautiful bodies grinding off in that club drunk in love we be all night love love we be all night love love"] 
teste_b = count_vectorizer.transform(frase_beyonce)
pred_b = naive_bayes.predict(teste_b)
print(pred_b)

['Beyoncé']


In [None]:
frase_rihanna = ["shine bright like diamond shine bright like diamond find light in the beautiful sea i choose to be happy you and i you and i we re like diamonds in the sky you re shooting star i see vision of ecstasy when you hold i m alive we re like diamonds in the sky i knew that we d become one right away oh right away at first sight i felt the energy of sun rays i saw the life inside your eyes so shine bright tonight you and i we re beautiful like diamonds in the sky eye to eye so alive we re beautiful like diamonds in the sky shine bright like diamond shine bright like diamond shine bright like diamond we re beautiful like diamonds in the sky shine bright like diamond shine bright like diamond shine bright like diamond we re beautiful like diamonds in the sky palms rise to the universe we moonshine and molly feel the warmth we ll never die we re like diamonds in the sky you re shooting star i see vision of ecstasy when you hold i m alive we re like diamonds in the sky at first sight i felt the energy of sun rays i saw the life inside your eyes so shine bright tonight you and i we re beautiful like diamonds in the sky eye to eye so alive we re beautiful like diamonds in the sky shine bright like diamond shine bright like diamond shine bright like diamond we re beautiful like diamonds in the sky shine bright like diamond shine bright like diamond shine bright like diamond we re beautiful like diamonds in the sky shine bright like diamond shine bright like diamond shine bright like diamond we re beautiful like diamonds in the sky shine bright like diamond shine bright like diamond shine bright like diamond so shine bright tonight you and i we re beautiful like diamonds in the sky eye to eye so alive we re beautiful like diamonds in the sky shine bright like diamond shine bright like diamond shine bright like diamond shine bright like diamond shine bright like diamond shine bright like diamond shine bright like diamond"] 
teste_r = count_vectorizer.transform(frase_rihanna)
pred_r = naive_bayes.predict(teste_r)
print(pred_r)

['Beyoncé']
