# Text Classification on Song Lyrics

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# Part I : Web Scraping & HTML Parsing

## 1. Introduction

This project builds a text classification model on song lyrics: the task is to predict the artist from song text. Training such a model first requires collecting our own lyrics dataset. We focus on two artists from the "Heavy Metal" genre: Ronnie James Dio (Dio) and Ozzy Osbourne (Ozzy).

In this part of the project, we use the website http://www.darklyrics.com to collect the dataset. For each artist, we download an HTML page that links to the artist's albums and extract the album hyperlinks by HTML parsing. We then download the HTML page of each album and extract the song lyrics from it.

### 1.1 Load Packages

In [1]:
# data processing libraries
import numpy as np
import pandas as pd

In [2]:
# web scraping and HTML parsing libraries
import requests
import re
from bs4 import BeautifulSoup

In [3]:
# other libraries
import time

### 1.2 URL & Artist Link

In [4]:
# main url
url = 'http://www.darklyrics.com/'

# header for request
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}

# artist sublink
artist_sublink = {'Dio':'d/dio.html', 'Ozzy':'o/ozzyosbourne.html'}

## 2. User Defined Functions

In [5]:
# function for extracting album hyperlinks for an artist

def album_links(artist):
    """
    This function returns all the album sublinks of an artist. 
    """
    artist_link = url + artist_sublink[artist]
    html_content = requests.get(artist_link,headers=headers).text
    html_soup = BeautifulSoup(html_content,'html.parser')
    
    sublinks = []
    html_links = html_soup.find_all('div', class_= 'album')
    for html_link in html_links:
        sublink = html_link.a['href']
        # keep the 'lyrics/...' part of the href and drop the '#<track number>' anchor
        sublinks.append(re.findall(r'(lyrics.+)#\d{1,2}', sublink)[0])
    
    originals = [link for link in sublinks if 'tribute' not in link and 'live.html' not in link]

    df_album = pd.DataFrame(data={'sublink':originals})
    return df_album
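
The pattern used to strip the track anchor from each album link can be checked offline. The sketch below applies a pattern of the same kind to two made-up `href` values. The `../lyrics/<artist>/<album>.html#<track>` shape is an assumption inferred from the pattern itself, and `sample_hrefs` is a hypothetical name, so the real page markup may differ.

```python
import re

# Hypothetical href values (not taken from the site); they only mimic the
# '../lyrics/<artist>/<album>.html#<track>' shape assumed above.
sample_hrefs = [
    '../lyrics/example/firstalbum.html#1',
    '../lyrics/example/secondalbum.html#12',
]

# Capture everything from 'lyrics' up to the '#<track number>' anchor.
pattern = re.compile(r'(lyrics.+)#\d{1,2}')

cleaned = []
for href in sample_hrefs:
    match = pattern.search(href)
    if match is not None:          # skip hrefs that do not fit the assumed shape
        cleaned.append(match.group(1))

print(cleaned)
```

Checking for a missing match, as done here, avoids the `IndexError` that indexing an empty `re.findall` result with `[0]` would raise if a link did not carry an anchor.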

In [6]:
# function for collecting all the song lines of an album

def album_song_lines(album_link):
    """
    This function returns song lines for an album link 
    """
    album_content = requests.get(album_link,headers=headers).text
    album_soup = BeautifulSoup(album_content,'html.parser')

    lyrics_tn = album_soup.find('div', class_= 'lyrics').text
    
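    # remove the optional album note from the lyrics text
    note_div = album_soup.find('div', class_='note')
    if note_div is not None:
        # plain string replacement: the note may contain regex metacharacters
        lyrics_t = lyrics_tn.replace(note_div.text, '')
    else:
        lyrics_t = lyrics_tn

    # remove the optional "thanks" footer from the lyrics text
    thank_div = album_soup.find('div', class_='thanks')
    if thank_div is not None:
        lyrics = lyrics_t.replace(thank_div.text, '')
    else:
        lyrics = lyrics_t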

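    # song titles appear in the raw HTML as <h3><a name="N">title</a></h3>
    title_pattern = r'<h3><a name="\d{,2}">(.+)</a></h3>'
    title_list = re.findall(title_pattern, album_content)
    # keep non-empty lines that are not song titles and do not contain '['
    album_lines = [
        line for line in lyrics.split('\n')[:-2]
        if line not in title_list and len(line) > 0 and '[' not in line
    ]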

    return album_lines
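
Text scraped from a page can contain characters, such as parentheses, that have a special meaning in a regular expression. The sketch below uses a made-up lyrics string and note (both hypothetical) to compare removing the note as a regex pattern with removing it as literal text.

```python
import re

# Made-up example text; the note contains parentheses on purpose.
lyrics_text = 'First line\n(Bonus track)\nSecond line\n'
note = '(Bonus track)\n'

# As a regex, the parentheses form a group instead of matching literally,
# so the pattern does not match and the note stays in the text.
as_regex = re.sub(note, '', lyrics_text)

# As literal text, the note is removed exactly.
as_literal = lyrics_text.replace(note, '')

# Escaping the note first makes the regex version behave like the literal one.
as_escaped = re.sub(re.escape(note), '', lyrics_text)

print(repr(as_regex))
print(repr(as_literal))
print(repr(as_escaped))
```

For removing a known substring, `str.replace` is the simpler choice. `re.escape` is the option to reach for when the removal has to be combined with other regex features.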

In [7]:
# function for collecting all song lines from all albums of an artist

def all_song_lines(artist):
    """
    This function downloads all song lines of an artist
    """
    albums = album_links(artist)
    song_lines = []
    for sublink in albums['sublink']:
        album_link = url + sublink
        album_lines = album_song_lines(album_link)
        song_lines.extend(album_lines)
        time.sleep(5)
    
    return song_lines

## 3. Dataset Preparation

#### Song Line Collection

In [8]:
# Ronnie James Dio
dio_song_lines = all_song_lines('Dio')

# Ozzy Osbourne
ozzy_song_lines = all_song_lines('Ozzy')

In [11]:
# count respective total song lines
print(f'Dio: {len(dio_song_lines)}, Ozzy: {len(ozzy_song_lines)}')

Dio: 3612, Ozzy: 5399


#### Corpus Creation

In [12]:
# stack individual song lines
corpus = dio_song_lines.copy()
corpus.extend(ozzy_song_lines)
print(f'corpus: {len(corpus)} lines')

corpus: 9011 lines


#### Label Creation

In [13]:
# create individual labels
dio_label  = ['Dio' for _ in range(len(dio_song_lines))]
ozzy_label = ['Ozzy' for _ in range(len(ozzy_song_lines))]

# count respective labels
print(f'Dio: {len(dio_label)}, Ozzy: {len(ozzy_label)}')

Dio: 3612, Ozzy: 5399


In [14]:
# stack individual labels
artist = dio_label.copy()
artist.extend(ozzy_label)
print(f'label: {len(artist)} lines')

label: 9011 lines
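
The line counts printed above show that the two classes are not the same size. The short calculation below turns them into shares of the corpus. The names `counts`, `shares`, and `majority_share` are introduced here for illustration only.

```python
# Line counts copied from the printed output above.
counts = {'Dio': 3612, 'Ozzy': 5399}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

for name, share in shares.items():
    print(f'{name}: {share:.1%}')

# Share of the larger class: the accuracy of always predicting that class.
majority_share = max(shares.values())
```

Roughly 60% of the lines belong to Ozzy, so a model that always predicts Ozzy would already be right on about that fraction of lines. This is a useful reference point when judging a classifier trained on this dataset.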


#### Dataframe

In [15]:
# create dataframe
df = pd.DataFrame(data={'line':corpus, 'artist':artist})

In [16]:
# dataframe quick check: head
df.head()

| | line | artist |
|---|---|---|
| 0 | It's the same old song | Dio |
| 1 | you gotta be somewhere at sometime | Dio |
| 2 | and they'll never let you fly | Dio |
| 3 | It's like broken glass | Dio |
| 4 | you get cut before you see it | Dio |


In [17]:
# dataframe quick check: tail
df.tail()

| | line | artist |
|---|---|---|
| 9006 | And I don't walk on water (oh no) | Ozzy |
| 9007 | I don't walk on water (oh no) | Ozzy |
| 9008 | My dromedary dreams as wet as oceans | Ozzy |
| 9009 | With sand dunes bearing seeds she set in motion | Ozzy |
| 9010 | My dromedary dreams my dromedary dreams my dro... | Ozzy |


#### CSV File

In [19]:
# save dataset as csv file
df.to_csv('songlines.csv',index=False)

## 4. Fun Explorations

#### Total Dio albums

In [20]:
dio_albums = album_links('Dio')
print(f'Dio albums: {dio_albums.shape[0]}')

Dio albums: 14


#### Total Ozzy albums

In [21]:
ozzy_albums = album_links('Ozzy')
print(f'Ozzy albums: {ozzy_albums.shape[0]}')

Ozzy albums: 16


Comment: We excluded 'Live' and 'Tribute' albums to avoid song repetition.
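
The exclusion rule in `album_links` is a plain substring test on each album sublink. A minimal sketch with made-up sublinks (none of these paths are taken from the site) shows what it keeps and what it drops.

```python
# Hypothetical sublinks, for illustration only.
sublinks = [
    'lyrics/example/studioalbum.html',
    'lyrics/example/concertlive.html',       # file name ends in 'live'
    'lyrics/example/tributetoexample.html',  # contains 'tribute'
    'lyrics/example/liveinexample.html',     # 'live' is not followed by '.html'
]

# Same filter as in album_links.
originals = [link for link in sublinks
             if 'tribute' not in link and 'live.html' not in link]

print(originals)
```

Because the test looks for `live.html`, only sublinks whose file name ends in `live` are dropped. A live album named differently, like the last entry here, would pass through, so the resulting album list is worth a quick manual check.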