__FRONT_MATTER__

---
title: "Using Lyrics to Predict Genre"

we'll implement naive Bayes, to predict a genre given some lyrics.

---

---

## Contents

• [Getting the Data](#Getting-the-Data)<br/>
• [Loading the Data](#Loading-the-Data)<br/>
• [Splitting the Data](#Splitting-the-Data)<br/>
• [Training the Model](#Training-the-Model)<br/>
• [Top Hip Hop Songs](#Top-Hip-Hop-Songs)<br/>


## Problem Statement

Majority of the people are not aware of the various Genres of music. And even if they are are it very tough to actually determine the genre of a particular song.

We aim to Solve this issue by creating an app in which the user simply has to paste a part of the song lyrics , and they get the result.

## Business Impact
if combined with audio recognition aswell..it can be used in song Recommendation systems .By Showing songs of similar genres

## Problems faced
getting the data in the desired format was a task .
secondly selecting which all genres to consider for this project. cause many genres are very similar to each other which may cause problems with the predictions.


## Loading the Data

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('/Users/yashsawant/Downloads/cypher-master/notebooks/lyrics1.csv')

df['ranker_genre'] = np.where(
    (df['ranker_genre'] == 'screamo')|
    (df['ranker_genre'] == 'punk rock')|
    (df['ranker_genre'] == 'heavy metal'), 
    'alt rock', 
    df['ranker_genre']
)

The data is available as one lyric per row. To train our classifier, we'll need to transform it into one *song* per row. We'll also go ahead and convert the data to lowercase with `.apply(lambda x: x.lower())`. To do that, we do the following:

In [3]:
group = ['song', 'year', 'album', 'genre', 'artist', 'ranker_genre']
lyrics_by_song = df.sort_values(group)\
        .groupby(group).lyric\
        .apply(' '.join)\
        .apply(lambda x: x.lower())\
        .reset_index(name='lyric')

lyrics_by_song["lyric"] = lyrics_by_song['lyric'].str.replace(r'[^\w\s]','')

## Splitting the Data

Next we'll split our data into a training set and a testing set using only Country, Alt Rock, and Hip Hop.

In [4]:
from sklearn.utils import shuffle
from nltk.corpus import stopwords

genres = [
    'Country', 'alt rock', 'Hip Hop',
]

LYRIC_LEN = 400 # each song has to be > 400 characters
N = 6000 # number of records to pull from each genre
RANDOM_SEED = 200 # random seed to make results repeatable

train_df = pd.DataFrame()
test_df = pd.DataFrame()
for genre in genres: # loop over each genre
    subset = lyrics_by_song[ # create a subset 
        (lyrics_by_song.ranker_genre==genre) & 
        (lyrics_by_song.lyric.str.len() > LYRIC_LEN)
    ]
    train_set = subset.sample(n=N, random_state=RANDOM_SEED)
    test_set = subset.drop(train_set.index)
    train_df = train_df.append(train_set) # append subsets to the master sets
    test_df = test_df.append(test_set)
    
train_df = shuffle(train_df)
test_df = shuffle(test_df)

## Training the Model

Next, we'll train a model using word frequencies and `sklearn`'s `CountVectorizer`. The `CountVectorizer` is a quick and dirty way to train a language model by using simple word counts.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# define our model
text_clf = Pipeline(
    [('vect', CountVectorizer()),
     ('clf', MultinomialNB(alpha=0.1))])

# train our model on training data
text_clf.fit(train_df.lyric, train_df.ranker_genre)  

# score our model on testing data
predicted = text_clf.predict(test_df.lyric)
np.mean(predicted == test_df.ranker_genre)

0.8803284361531444


Word frequencies work fine here, but let's see if we can get a better model by using the `TfidfVectorizer`.

So let's train a model using `tf-idf` scores as features.

In [6]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

# define our model
text_clf = Pipeline(
    [('vect', TfidfVectorizer()),
     ('clf', MultinomialNB(alpha=0.1))])

# train our model on training data
text_clf.fit(train_df.lyric, train_df.ranker_genre)  

# score our model on testing data
predicted = text_clf.predict(test_df.lyric)
np.mean(predicted == test_df.ranker_genre)

0.8813351370056941

tuning a few hyperparameters, lemmatizing our data, customizing our tokenizer a bit, and filtering our words with `nltk`'s builtin stopword list.

In [7]:

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

stop = list(set(stopwords.words('english'))) # stopwords
wnl = WordNetLemmatizer() # lemmatizer

def tokenizer(x): # custom tokenizer
    return (
        wnl.lemmatize(w) 
        for w in word_tokenize(x) 
        if len(w) > 2 and w.isalnum() # only words that are > 2 characters
    )                                 # and is alpha-numeric

# define our model
text_clf = Pipeline(
    [('vect', TfidfVectorizer(
        ngram_range=(1, 2), # include bigrams
        tokenizer=tokenizer,
        stop_words=stop,
        max_df=0.4, # ignore terms that appear in more than 40% of documents
        min_df=4)), # ignore terms that appear in less than 4 documents
     ('tfidf', TfidfTransformer()),
     ('clf', MultinomialNB(alpha=0.1))])

# train our model on training data
text_clf.fit(train_df.lyric, train_df.ranker_genre)  

# score our model on testing data
predicted = text_clf.predict(test_df.lyric)
np.mean(predicted == test_df.ranker_genre)

  'stop_words.' % sorted(inconsistent))


0.8740680152263504

Confusion Matrix


In [8]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
mat = confusion_matrix(test_df.ranker_genre, predicted)
sns.heatmap(
    mat.T, square=True, annot=True, fmt='d', cbar=False,
    xticklabels=genres, 
    yticklabels=genres
)
plt.xlabel('true label')
plt.ylabel('predicted label');

Given this confusion matrix, we can calculate precision, recall, and f-score.

<b>Recall</b> is the ability of the classifier to find all the positive results. That is, to clasify a rap song *as* a rap song. 

<b>Precision</b> is the ability of the classifier to not label a negative result as a positive one. That is, to not classify a country song as a rap song.

<b>F-score</b> is the [harmonic mean](https://en.wikipedia.org/wiki/Harmonic_mean) of precision and recall.

To compute recall, precision, and f-score, we'll use `precision_recall_fscore_support` from `sklearn.metrics`.

In [9]:
from sklearn.metrics import precision_recall_fscore_support

precision, recall, fscore, support = precision_recall_fscore_support(test_df.ranker_genre, predicted)

for n,genre in enumerate(genres):
    genre = genre.upper()
    print(genre+'_precision: {}'.format(precision[n]))
    print(genre+'_recall: {}'.format(recall[n]))
    print(genre+'_fscore: {}'.format(fscore[n]))
    print(genre+'_support: {}'.format(support[n]))
    print()

COUNTRY_precision: 0.9411030826012372
COUNTRY_recall: 0.8900250725136424
COUNTRY_fscore: 0.9148516852797007
COUNTRY_support: 20341

ALT ROCK_precision: 0.26256983240223464
ALT ROCK_recall: 0.9454022988505747
ALT ROCK_fscore: 0.41099312929419113
ALT ROCK_support: 348

HIP HOP_precision: 0.8277418783747897
HIP HOP_recall: 0.8425842494143089
HIP HOP_fscore: 0.8350971198928332
HIP HOP_support: 11098



<b>Support</b> is the number of each class in the actual true set. And the first thing I notice is that there aren't many alt rock songs being scored. Adding more alt rock songs could possibly improve our model. 

We do a good job all around on classifying hip hop and country songs. For alt rock songs, the recall score is great; that is, when it's actually an alt rock song, the model classifies it as an alt rock song 94% of the time. But, as we can see from our alt rock precision score and confusion matrix, the model classifies many hip hop songs as alt rock (963, to be exact), which is the main reason this score is so low.

Let's throw some new data at our model and see how well it does predicting what genre these lyrics belong to. 

In [14]:
text_clf.predict(
    [
        "i stand for the red white and blue",
        "flow so smooth they say i rap in cursive", #bars *insert fire emoji*
        "take my heart and carve it out",
        "there is no end to the madness",
        "sitting on my front porch drinking sweet tea",
        "sitting on my front porch sippin on cognac",
        "dog died and my pick up truck wont start",
        "im invisible and the drugs wont help",
        "i hope you choke in your sleep thinking of me",
        "i wonder what genre a song about data science and naive bayes and hyper parameters and maybe a little scatter plots would be"
    ]
)

array(['Country', 'Hip Hop', 'Country', 'alt rock', 'Country', 'Hip Hop',
       'Country', 'alt rock', 'alt rock', 'Hip Hop'], dtype='<U8')

In [20]:
text = input()
text_clf.predict([text])

hi


array(['Country'], dtype='<U8')

In [13]:
import pickle
pickle_out= open("genrepredict.pkl","wb")
pickle.dump(text_clf, pickle_out)
pickle_out.close()



In [22]:
lr_model = open('genrepredict.pkl','rb')
classifier = pickle.load(lr_model)

In [23]:
classifier.predict(["hey"])

  'stop_words.' % sorted(inconsistent))


array(['alt rock'], dtype='<U8')

In [30]:
import streamlit as st


SyntaxError: invalid syntax (<ipython-input-30-1c8b826b6769>, line 2)

In [20]:
#Headings for Web Application
st.title("Genre Classification of songs Using Lyrics")
st.subheader("What type of NLP service would you like to use?")

2021-04-08 18:43:26.115 
  command:

    streamlit run /Users/yashsawant/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py [ARGUMENTS]


<streamlit.delta_generator.DeltaGenerator at 0x7fc06208e410>

In [22]:
text = st.text_input('Enter song lyrics')

In [23]:
st.header("Results")

<streamlit.delta_generator.DeltaGenerator at 0x7fc06208e410>

In [27]:
result = text_clf.predict([text])

''

In [28]:
st.write(result)

In [31]:
streamlit hello

SyntaxError: invalid syntax (<ipython-input-31-c7a1b683aa76>, line 1)