<a href="https://colab.research.google.com/github/sooking87/NLP_Toy_Proj/blob/master/fine_tunning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset

The lyrics were fetched with the `search_song` method of the Genius API client. The resulting dataset (full_lyrics_polarity_dataset.csv) consists of the columns artist, title, year, lyrics, neg_sentiment, neu_sentiment, pos_sentiment, com_sentiment, polarity, and textblob_pol.
 

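- `lyrics`: the original lyrics (before tokenization)

- `polarity`: derived from VADER; 1 (pos) if com_sentiment >= 0.05, otherwise 0 (neg).

- `textblob_pol`: derived from the polarity score provided by TextBlob; 1 if the score is greater than 0, otherwise 0.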
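The labelling rule for `polarity` can be sketched in a few lines. This is a minimal illustration rather than the notebook's own code: the function name `compound_to_polarity` is hypothetical, and the sample compound scores are the ones shown for three rows of the dataset preview later in this notebook.

```python
def compound_to_polarity(compound, threshold=0.05):
    """Return 1 (pos) when the VADER compound score reaches the threshold, else 0 (neg)."""
    return 1 if float(compound) >= threshold else 0

# Compound scores taken from the dataset preview
for score in (-0.7506, 0.2732, 0.0112):
    print(score, compound_to_polarity(score))
```

A slightly positive score such as 0.0112 is still labelled 0, because everything below the threshold falls into the neg class, including the band that VADER's own documentation treats as neutral.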

## 1. Fetching lyrics with the Genius API and computing sentiment scores with VADER

In [None]:
!pip install lyricsgenius


In [None]:
!pip install vaderSentiment


In [None]:
import lyricsgenius
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import pandas as pd

# Function to return song lyrics
def get_lyrics(title, artist):
  try:
      return genius.search_song(title, artist).lyrics
  except Exception:  # e.g. no match (search_song returned None) or a request error
      return 'not found'

# Function to return sentiment score of each song
def get_lyric_sentiment(lyrics):
  sentiment = sid_obj.polarity_scores(lyrics)
  return sentiment


# Fetch the lyrics and sentiment scores using the artist and title from the source file

genius = lyricsgenius.Genius(
    "DuO42xKa4Ts70InLe_Y_strEpeL_CxowzCtXyAMaiNlbAOVOTfFpt2q5FdP4lo_U")
sid_obj = SentimentIntensityAnalyzer()

lyrics_sentiment_dataset = pd.read_csv('./tcc_ceds_music.csv', encoding='cp949')
lyrics_sentiment_dataset = lyrics_sentiment_dataset.drop(
    ['Unnamed: 0', 'genre',
    'lyrics', 'len', 'dating', 'violence', 'world/life', 'night/time',
    'shake the audience', 'family/gospel', 'romantic', 'communication',
    'obscene', 'music', 'movement/places', 'light/visual perceptions',
    'family/spiritual', 'like/girls', 'sadness', 'feelings', 'danceability',
    'loudness', 'acousticness', 'instrumentalness', 'valence', 'energy',
    'topic', 'age'], axis=1)

lyrics_sentiment_dataset.drop_duplicates(subset='track_name', inplace=True)
lyrics_sentiment_dataset = lyrics_sentiment_dataset.reset_index(drop=True)  # reset_index returns a new frame

# Fetch the song lyrics
lyrics = lyrics_sentiment_dataset.apply(lambda row: get_lyrics(
    row['track_name'], row['artist_name']), axis=1)
lyrics_sentiment_dataset['lyrics'] = lyrics

# Remove the rows marked 'not found'
lyrics_sentiment_dataset = lyrics_sentiment_dataset.drop(
    lyrics_sentiment_dataset[lyrics_sentiment_dataset['lyrics'] == 'not found'].index)

# Use get_lyric_sentiment to get sentiment score for all the song lyrics
sentiment = lyrics_sentiment_dataset.apply(
    lambda row: get_lyric_sentiment(row['lyrics']), axis=1)

for i in lyrics_sentiment_dataset.index.tolist():
    lyrics_sentiment_dataset.loc[i, 'neg_sentiment'] = sentiment[i]['neg']
    lyrics_sentiment_dataset.loc[i, 'neu_sentiment'] = sentiment[i]['neu']
    lyrics_sentiment_dataset.loc[i, 'pos_sentiment'] = sentiment[i]['pos']
    lyrics_sentiment_dataset.loc[i, 'com_sentiment'] = sentiment[i]['compound']

lyrics_sentiment_dataset.to_csv("lyrics_sentiment_dataset.csv", index=False)

## 2. Removing songs with any zero sentiment score or with lyrics that do not match the song

In [None]:
import pandas as pd

df = pd.read_csv('./lyrics_sentiment_dataset.csv')

# Drop a row if any of its scores is exactly 0
filtered_df = df[(df['neg_sentiment'] != 0.000) & (df['neu_sentiment'] != 0.000) & (df['pos_sentiment'] != 0.000) & (df['com_sentiment'] != 0.000)].copy()

# Add a lyrics_len column -> sort by length -> drop the entries that are too long
filtered_df['lyrics_len'] = filtered_df['lyrics'].str.len()

# Sort by lyrics_len in descending order
filtered_df = filtered_df.sort_values('lyrics_len', ascending=False)

# Checked up to where rows "must" be removed -> through position 90 of filtered_df
fin_filtered_df = filtered_df.iloc[91:]

fin_filtered_df.to_csv("full_lyrics_sentiment_dataset.csv")

## 3. Computing polarity with com_sentiment and TextBlob

- `polarity`: derived from VADER; 1 (pos) if com_sentiment >= 0.05, otherwise 0 (neg).

- `textblob_pol`: derived from the polarity score provided by TextBlob; 1 if the score is greater than 0, otherwise 0.

In [1]:
!pip install vaderSentiment


In [2]:
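import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

sid_obj = SentimentIntensityAnalyzer()

# Assumption: `df` is not loaded in this cell; the filtered file written in step 2 is assumed here.
df = pd.read_csv('./full_lyrics_sentiment_dataset.csv')

def sent_textblob_polarity(lyrics):
    # TextBlob polarity ranges from -1 to 1; subjectivity is a separate score
    polarity_score = TextBlob(lyrics).sentiment.polarity
    if float(polarity_score) > 0:
        return 1
    return 0

def sent_vader_compound(com):
    # 1 (pos) if the VADER compound score is at least 0.05, otherwise 0 (neg)
    if float(com) >= 0.05:
        return 1
    return 0

for i in df.index.tolist():
    lyrics = df.loc[i, "lyrics"]
    df.loc[i, "polarity"] = sent_vader_compound(df.loc[i, "com_sentiment"])
    df.loc[i, "textblob_pol"] = sent_textblob_polarity(lyrics)

df.to_csv("full_lyrics_polarity_dataset.csv", index=False)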

## 4. Tokenizing the data

In [5]:
pip install keras


In [4]:
import pandas as pd
from nltk import sent_tokenize
import tensorflow as tf
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import re

In [None]:
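nltk.download('stopwords')  # the stopword list must be available before stopwords.words() is called
new_stopwords = stopwords.words('english')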
add_stopwords = ["[", "]", "(", ")", ",", "lyrics",
                 "chorus", "!", "?", "``", "oh", "ha", "ah", "-", "yo", "yeah", "uh", "uhh", ":"]
new_stopwords.extend(add_stopwords)
nltk.download('omw-1.4')
nltk.download('wordnet')
text_to_word_sequence = tf.keras.preprocessing.text.text_to_word_sequence

In [None]:
# The code for this step is in the Jupyter notebook

# Default

To use the BERT model, the file containing the original lyrics (before tokenization) is used.

In [6]:
!pip install transformers


# Loading the data





In [None]:
import pandas as pd

df = pd.read_csv('./full_lyrics_polarity_dataset.csv')
df

| Unnamed: 0.1 | Unnamed: 0 | artist | title | year | lyrics | neg_sentiment | neu_sentiment | pos_sentiment | com_sentiment | polarity | textblob_pol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | frankie laine | i believe | 1950 | I Believe Lyrics\nI believe for every drop of ... | 0.089 | 0.879 | 0.032 | -0.7506 | 0.0 | 1.0 |
| 1 | 1 | johnnie ray | cry | 1950 | Cry Lyrics\nOoh-wah\nOoh-wah\nOoh-wah\n\n\nIf ... | 0.110 | 0.772 | 0.118 | 0.2732 | 1.0 | 1.0 |
| 2 | 2 | p?rez prado | patricia | 1950 | Patricia LyricsKiss her and your lips will alw... | 0.029 | 0.747 | 0.224 | 0.9807 | 1.0 | 1.0 |
| 3 | 3 | lefty frizzell | if you've got the money i've got the time | 1950 | If You’ve Got The Money I’ve Got The Time Lyri... | 0.038 | 0.920 | 0.042 | 0.0112 | 0.0 | 1.0 |
| 4 | 4 | lefty frizzell | i want to be with you always | 1950 | I Want To Be With You Always Lyrics\nI lose my... | 0.074 | 0.702 | 0.224 | 0.9835 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 21739 | 21739 | mack 10 | 10 million ways | 2019 | 10 Million Ways Lyrics\nCause if you fuck with... | 0.133 | 0.807 | 0.059 | -0.9936 | 0.0 | 1.0 |
| 21740 | 21740 | m.o.p. | ante up (robbin hoodz theory) | 2019 | Ante Up \nMotherfucker!\nYeah!23Embed | 0.180 | 0.730 | 0.090 | -0.9972 | 0.0 | 0.0 |
| 21741 | 21741 | nine | whutcha want? | 2019 | Whutcha Want? Lyrics\nI gets banned if I do ge... | 0.101 | 0.782 | 0.117 | -0.7261 | 0.0 | 1.0 |
| 21742 | 21742 | will smith | switch | 2019 | Switch Lyrics\nYo mic check, mic check, yeah h... | 0.041 | 0.888 | 0.071 | 0.9630 | 1.0 | 1.0 |


In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['lyrics'], df['polarity'], test_size = 0.25, random_state = 32)
X_train.shape

(16308,)

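# Tokenizing the lyrics

HuggingFace's transformers library is used.

BertTokenizer is used because it helps the model understand context. For that reason, the dataset of original lyrics is used rather than a dataset of already tokenized lyrics.

Each song's lyrics are rewritten in the form _[CLS] one line [SEP] next line [SEP] ... [SEP]_ and placed in a list. That list is then appended to train_lyrics, so train_lyrics is a two-dimensional list in which each song's lyrics sit in their own list.

After this step, the tokenizer splits each song's lyrics into word-level tokens.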

<br/>

BertTokenizer's `encode_plus` method has an option for adding the tags BERT expects. Lyrics, however, separate sentences with "\n" rather than ".", so placing the tags after each line had to be handled separately.
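The tagging scheme described above can be illustrated on a tiny input. This is a sketch under stated assumptions: `tag_lines` is a hypothetical helper, the sample text is made up, and the sketch puts a space on both sides of each [SEP] for readability.

```python
def tag_lines(lyrics):
    """Wrap raw lyrics as '[CLS] line [SEP] line [SEP] ...', one [SEP] per line."""
    lines = str(lyrics).split("\n")
    return "[CLS] " + " [SEP] ".join(lines) + " [SEP]"

# Made-up two-line sample
sample = "first line\nsecond line"
print(tag_lines(sample))
```

Blank lines in the lyrics produce consecutive [SEP] tags under this scheme, since every "\n" becomes one separator.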

### input set

In [None]:
from transformers import BertTokenizer

# Add the tags BERT expects: [CLS] at the start and [SEP] after every line
def add_bert_tags(lyrics):
  start = "[CLS] " + str(lyrics)
  split_lyrics = " [SEP]".join(start.split('\n'))
  split_lyrics += " [SEP]"
  return [split_lyrics]  # each song is wrapped in its own list

train_lyrics = [add_bert_tags(i) for i in X_train]
test_lyrics = [add_bert_tags(i) for i in X_test]


In [None]:
# Note: do_lower_case=True lowercases the input even though this checkpoint is cased
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased', do_lower_case=True)
# Alternative checkpoint: 'bert-base-uncased'

# The tagged lyrics form a 2-D list, so a nested loop reaches each song's text
tokenized_data = [tokenizer.tokenize(j) for each_lyrics in train_lyrics for j in each_lyrics]

tokenized_test_data = [tokenizer.tokenize(j) for each_lyrics in test_lyrics for j in each_lyrics]

### target set

In [None]:
y_train_list = y_train.to_list()
y_test_list = y_test.to_list()

# Modeling

The input data and target data are reshaped into the form the BERT model expects.

The tokenizer's encoding method `encode_plus` returns a BatchEncoding, which behaves like a dict.
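To make the shape of that dict concrete, the sketch below imitates in plain Python what padding to a fixed length produces. It does not call the transformers API: `pad_and_mask` is a hypothetical helper, `pad_id=0` is an assumption about the [PAD] token's ID, and the token IDs in the example are made up.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad token_ids to max_length and build the matching masks."""
    ids = list(token_ids)[:max_length]
    n_real = len(ids)
    n_pad = max_length - n_real
    return {
        "input_ids": ids + [pad_id] * n_pad,
        # a single-segment input uses segment 0 everywhere
        "token_type_ids": [0] * max_length,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * n_real + [0] * n_pad,
    }

# Illustrative IDs only
example = pad_and_mask([101, 1045, 2903, 102], max_length=6)
print(example)
```

All three lists have the same length, which is what allows them to be stacked into fixed-shape tensors for batching.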

In [None]:
import tensorflow as tf

def convert_example_to_feature(review):
  return tokenizer.encode_plus(review,
                add_special_tokens = False, # [CLS] and [SEP] were already added by hand
                max_length = 256, # sequences are padded or truncated to 256 tokens (BERT accepts up to 512)
                padding = 'max_length', # add [PAD] tokens up to max_length
                truncation = True, # cut inputs that are longer than max_length
                return_attention_mask = True, # add attention mask to not focus on pad tokens
              )
  
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
  return {
      "input_ids": input_ids,
      "token_type_ids": token_type_ids,
      "attention_mask": attention_masks,
  }, label

def encode_examples(data):
  # prepare list, so that we can build up final TensorFlow dataset from slices.
  input_ids_list = []
  token_type_ids_list = []
  attention_mask_list = []
  target_list = []
  
  for DATA_COL, LABEL_COL in data.to_numpy():
    bert_input = convert_example_to_feature(DATA_COL)
    input_ids_list.append(bert_input['input_ids'])
    token_type_ids_list.append(bert_input['token_type_ids'])
    attention_mask_list.append(bert_input['attention_mask'])
    target_list.append([LABEL_COL])
  return tf.data.Dataset.from_tensor_slices((input_ids_list, attention_mask_list, token_type_ids_list, target_list)).map(map_example_to_dict)

In [None]:
train = pd.DataFrame({'DATA_COL' : tokenized_data, 'LABEL_COL' : y_train_list})
test = pd.DataFrame({'DATA_COL' : tokenized_test_data, 'LABEL_COL' : y_test_list})

In [None]:
# train dataset
train_encoded = encode_examples(train).shuffle(100).batch(16)
# test dataset
test_encoded = encode_examples(test).batch(16)

### Initialize Model

In [None]:
from transformers import TFBertForSequenceClassification
import tensorflow as tf
# recommended learning rate for Adam 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# train for 3 epochs; more epochs might help as long as the model does not overfit
number_of_epochs = 3
# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-multilingual-cased')

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-multilingual-cased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# choosing Adam optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08)
# the labels are integers rather than one-hot vectors, so sparse categorical cross-entropy and accuracy are used
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

In [None]:
import torch, gc
gc.collect()
torch.cuda.empty_cache()

model.fit(train_encoded, epochs=number_of_epochs, validation_data=test_encoded)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7febee42c2d0>

# Model Predict

In [None]:
def model_predict(test_sentence):
  predict_input = tokenizer.encode(test_sentence, truncation=True, padding=True, return_tensors="tf")
  tf_output = model.predict(predict_input)[0]
  tf_prediction = tf.nn.softmax(tf_output, axis=1)
  labels = ['Negative','Positive'] # (0: negative, 1: positive)
  label = tf.argmax(tf_prediction, axis=1)
  label = label.numpy()
  tf_prediction = tf_prediction.numpy()

  print(test_sentence, ":: ", "{:.2f}%".format(abs(tf_prediction.take(label[0])) * 100), labels[label[0]])
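The percentage that `model_predict` prints is the softmax probability of the winning class. The sketch below reproduces that arithmetic in plain Python, without TensorFlow; `softmax` and `describe` are hypothetical helpers and the logits are made-up values, not outputs of the trained model.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def describe(logits, labels=("Negative", "Positive")):
    """Format the most probable class the same way model_predict does."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return "{:.2f}% {}".format(probs[best] * 100, labels[best])

# Made-up logits for illustration
print(describe([-1.0, 2.0]))
print(describe([0.0, 0.0]))
```

With two classes the printed value can never fall below 50%, so a figure only slightly above 50% means the two logits were nearly equal.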

In [None]:
test_sentence_1 = '''
There once was a king named Midas who did a good deed for a Satyr. 
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold. 
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
'''
model_predict(test_sentence_1)

test_sentence_2 = '''
But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
'''
model_predict(test_sentence_2)



There once was a king named Midas who did a good deed for a Satyr. 
And he was then granted a wish by Dionysus, the god of wine.
For his wish, Midas asked that whatever he touched would turn to gold. 
Despite Dionysus’ efforts to prevent it, Midas pleaded that this was a fantastic wish, and so, it was bestowed.
Excited about his newly-earned powers, Midas started touching all kinds of things, turning each item into pure gold.
 ::  96.57% Positive

But soon, Midas became hungry. As he picked up a piece of food, he found he couldn’t eat it. It had turned to gold in his hand.
Hungry, Midas groaned, “I’ll starve! Perhaps this was not such an excellent wish after all!”
Seeing his dismay, Midas’ beloved daughter threw her arms around him to comfort him, and she, too, turned to gold. “The golden touch is no blessing,” Midas cried.
 ::  55.15% Negative


In [None]:
lyrics = '''
And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
'''
model_predict(lyrics)


And I know we weren't perfect
But I've never felt this way for no one, oh
And I just can't imagine how you could be so okay now that I'm gone
I guess you didn't mean what you wrote in that song about me
'Cause you said forever, now I drive alone past your street
 ::  53.11% Positive
