#Author: Srikaran Elakurthy

#Description
*   Import the preprocessed, cleaned lyrics dataset, split it into a training set and a test set, and store both as CSV files for later use in the Naive Bayes model.
*   Implement multiclass logistic regression on this dataset.
*   Perform cross-validation with 5 folds and compute the average accuracy.
*   Implement a class-weighted approach by setting class_weight='balanced', so that the model gives higher weight to minority classes and lower weight to majority classes.

A detailed description is provided throughout the program; please read the comments for the full picture.

# Command to Run 

> Open the ipynb notebook in Jupyter Lab and go to the menu bar on the top, click on 'Run' and from the dropdown select the 'Run All' option to run all the cells in the notebook.

# Inputs and Outputs

Inputs:

> processed_lyics.csv - It contains the preprocessed lyrics generated by the data pre-processing script.

Outputs:

>finallogiundersamp_results_class_weighted.txt - It contains the Classification Report generated by the trained model over the test set.

> reslogiclassweighted.csv - It contains the actual and predicted values of the test set stored into a csv.


> *The inputs to the program must be in the same folder as the script.

Input<- The input to the program is processed_lyics.csv
Output<-
report:
              precision    recall  f1-score   support

        Rock       0.60      0.26      0.37      2863
     Country       0.86      0.03      0.06      1407
         Pop       1.00      0.04      0.08       356
       Metal       0.85      0.76      0.80      4618
        Jazz       0.00      0.00      0.00       587
         R&B       0.53      0.16      0.24      1449
     Hip-Hop       0.74      0.56      0.64      4285
  Electronic       0.21      0.01      0.01       764
        Folk       0.52      0.32      0.40      6985
       Indie       0.77      0.03      0.05       651
       Other       0.59      0.90      0.71     20165

    accuracy                           0.62     44130
   macro avg       0.61      0.28      0.31     44130
weighted avg       0.62      0.62      0.57     44130

Importing the necessary python packages and reading the preprocessed lyrics dataset. 

In [None]:
# Standard library
import re
from statistics import mean

# Third-party packages
import emoji
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
#nltk.download('averaged_perceptron_tagger')
from langdetect import detect
from nltk import pos_tag
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import Normalizer
from sklearn.utils import class_weight, shuffle  # class_weight is used for the balanced weights later on

# Read the preprocessed lyrics dataset
df=pd.read_csv("processed_lyics.csv")




In [None]:
df.head()

Unnamed: 0.1,Unnamed: 0,artist,genre,index,is_eng,lyrics,song,year
0,0,beyonce-knowles,Pop,0,1.0,oh babi how you do you know i m gonna cut righ...,ego-remix,2009
1,1,beyonce-knowles,Pop,1,1.0,playin everyth so easi it s like you seem so s...,then-tell-me,2009
2,2,beyonce-knowles,Pop,2,1.0,if you search for tender it isn t hard to find...,honesty,2009
3,3,beyonce-knowles,Pop,3,1.0,oh oh oh i oh oh oh i vers if i wrote a book a...,you-are-my-rock,2009
4,4,beyonce-knowles,Pop,4,1.0,parti the peopl the peopl the parti it s pop n...,black-culture,2009


Removing the unnamed index columns that were created when the data was imported.

In [None]:
df.drop(df.columns[df.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)


Dividing the data into train and test sets, with the test set being 20% of the whole data, and storing them in two dataframes.

In [None]:
X=df['lyrics']
Y=df['genre']

In [None]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y, test_size = 0.20, random_state = 42)
print(Xtrain.shape,Ytrain.shape)
print(Xtest.shape,Ytest.shape)

(176520,) (176520,)
(44130,) (44130,)


In [None]:
frame = { 'lyrics': Xtrain, 'genre': Ytrain } 
traindf = pd.DataFrame(frame)

In [None]:
traindf.to_csv("nonsampling_train.csv")

In [None]:
# Build the test dataframe first so that it exists when it is written to CSV.
frame = { 'lyrics': Xtest, 'genre': Ytest }
testdf = pd.DataFrame(frame)

In [None]:
testdf.to_csv("nonsampling_test.csv")

In [None]:
lyrics=Xtrain
genre=Ytrain

In [None]:
genre

173586       Rock
192873    Country
196702    Hip-Hop
34320         Pop
77537        Jazz
           ...   
119879        Pop
103694       Rock
131932        Pop
146867       Rock
121958       Rock
Name: genre, Length: 176520, dtype: object

Performing cross-validation with 5 splits, fitting the TF-IDF vectorizer inside each fold to avoid data leakage.

*   The TF-IDF vectorizer uses stop_words="english", which removes common English stop words, and ngram_range=(1,2), which considers both unigrams and bigrams.
*   The logistic model uses multi_class='multinomial', since this is a multiclass problem, and the 'sag' solver, which tends to converge quickly on large datasets.
*   Storing the model of each fold in a pickle file using joblib, and the actual and predicted values in a CSV file.
*   Storing the accuracy of every cross-validation split in a list.
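The leakage point can also be sketched with a scikit-learn pipeline, which refits the vectorizer on the training part of every fold automatically. This is a minimal illustration on a made-up toy corpus, not the notebook's own procedure: `toy_lyrics` and `toy_genre` are hypothetical stand-ins for the real columns, and the default solver is used instead of 'sag' to keep the toy run stable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the lyrics and genre columns.
toy_lyrics = [
    "guitar riff loud drums", "heavy guitar solo stage", "loud drums guitar amp",
    "electric guitar loud crowd", "drums bass guitar riff",
    "rhyme flow beat street", "beat drop rhyme mic", "street flow mic beat",
    "mic check rhyme flow", "beat street rhyme hustle",
]
toy_genre = ["Rock"] * 5 + ["Hip-Hop"] * 5

# The vectorizer is part of the pipeline, so it is fitted only on the
# training portion of each fold and never sees the held-out documents.
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)

skf = StratifiedKFold(n_splits=5)
scores = cross_val_score(pipe, toy_lyrics, toy_genre, cv=skf, scoring="accuracy")
print(scores, scores.mean())
```

The explicit loop used in this notebook achieves the same separation by hand and additionally lets each fold's model and predictions be saved to disk.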





In [None]:
# random_state is omitted: it only has an effect together with shuffle=True,
# and recent scikit-learn versions raise an error when it is set without it.
skf = StratifiedKFold(n_splits=5)
logirep=[]
logiscore=[]
for i, (train_index, test_index) in enumerate(skf.split(lyrics, genre), start=1):
    print(train_index)
    print(test_index)
    x_train1, x_test1 = lyrics.iloc[train_index], lyrics.iloc[test_index]
    y_train, y_test = genre.iloc[train_index], genre.iloc[test_index]
    # Fit the vectorizer on the training fold only, then transform the held-out fold.
    tfidf = TfidfVectorizer(stop_words="english",ngram_range=(1,2))
    x_train = tfidf.fit_transform(x_train1)
    x_test = tfidf.transform(x_test1)
    print("tfidf"+str(i))

    clf = LogisticRegression(multi_class='multinomial',solver='sag')
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    # Persist the model of this fold.
    joblib.dump(clf, 'logi'+str(i)+'.pkl')
    score = accuracy_score(y_test, y_pred)
    # classification_report lists labels in sorted order; passing
    # df.genre.unique() (order of appearance) as target_names would mislabel the rows.
    rep=classification_report(y_test, y_pred)
    logirep.append(rep)
    # Save actual and predicted genres of this fold.
    dat={'Actual':y_test,'pred':y_pred}
    resdf=pd.DataFrame(dat)
    resdf.to_csv('reslogi'+str(i)+'.csv')
    logiscore.append(score)
    print(score)



[ 33443  33455  33468 ... 176517 176518 176519]
[    0     1     2 ... 36103 36246 36370]
tfidf1
0.6131033310673012
[     0      1      2 ... 176517 176518 176519]
[33443 33455 33468 ... 71608 71687 71705]
tfidf2
0.612338545207342
[     0      1      2 ... 176517 176518 176519]
[ 67218  67221  67329 ... 107216 107234 107296]
tfidf3
0.6135282121006118
[     0      1      2 ... 176517 176518 176519]
[102550 102551 102738 ... 145107 145206 145391]
tfidf4
0.6150011330160888
[     0      1      2 ... 145107 145206 145391]
[139006 139414 139416 ... 176517 176518 176519]
tfidf5
0.611941989576252


Creating a file that holds the classification reports of the 5 cross-validation models, then computing the average accuracy and writing it to the same file.

Each classification report is computed from the results CSV file that contains the actual and predicted values of the corresponding split.
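When reading such a report it helps to know how `classification_report` orders its rows: by default it uses the sorted unique labels, and `target_names` only renames those rows in that order. The sketch below, on made-up labels, shows how a name list in a different order (for example order of appearance) attaches names to the wrong rows; `zero_division=0` is used here so that classes without any predictions are reported as 0 without a warning.

```python
from sklearn.metrics import classification_report

# Made-up labels for illustration only.
y_true = ["Rock", "Pop", "Rock", "Jazz", "Pop", "Rock"]
y_pred = ["Rock", "Pop", "Rock", "Rock", "Pop", "Rock"]

# Explicit, sorted labels: every row is named after the class it describes.
labels = sorted(set(y_true))  # ['Jazz', 'Pop', 'Rock']
report = classification_report(
    y_true, y_pred, labels=labels, output_dict=True, zero_division=0
)

# Names given in order of appearance do not match the sorted label order,
# so the row called "Rock" actually holds the statistics of "Jazz".
wrong = classification_report(
    y_true, y_pred, target_names=["Rock", "Pop", "Jazz"],
    output_dict=True, zero_division=0,
)

print(report["Rock"]["support"])  # support of the true Rock class
print(wrong["Rock"]["support"])   # support of Jazz, shown under the name Rock
```

Omitting `target_names`, or passing `labels` explicitly, keeps names and rows aligned.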

In [None]:
# Append to the results file; the with-block closes it automatically.
with open("logiundersamp_results.txt","a") as f:
    for i in range(1,6):
        f.write("\n report"+str(i)+":\n")
        # Read back the actual and predicted values saved for this split.
        dfr=pd.read_csv("reslogi"+str(i)+".csv")
        dfr.drop(dfr.columns[dfr.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
        # Labels are reported in sorted order, so no target_names are passed.
        f.write(classification_report(dfr.Actual, dfr.pred))
    f.write("\nThe average score for this model is ")
    f.write(str(mean(logiscore)))



Declaring our logistic model 

In [None]:
finalmod = LogisticRegression(multi_class='multinomial',solver='sag')


Performing TF-IDF vectorization to extract the TF-IDF matrix from the whole training data, then using the fitted vectorizer to transform the test data into a TF-IDF matrix.

Tfidf matrix considers:
*   Removing stop words
*   Considering both unigrams and bigrams
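As a small illustration of these two settings, the sketch below fits the same vectorizer configuration on a two-line made-up corpus and prints the resulting vocabulary. The corpus is hypothetical; only the `TfidfVectorizer` parameters are taken from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus.
corpus = ["the party people dance", "people love the party"]

tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
matrix = tfidf.fit_transform(corpus)

# vocabulary_ maps every learned unigram and bigram to a column index.
vocab = sorted(tfidf.vocabulary_)
print(vocab)
print(matrix.shape)  # (number of documents, number of unigram and bigram features)
```

Stop words are dropped before the n-grams are formed, so "the" disappears and bigrams such as "party people" are built from the remaining tokens.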



In [None]:
tfidf = TfidfVectorizer(stop_words="english",ngram_range=(1,2))
trainvec = tfidf.fit_transform(Xtrain)
testvec = tfidf.transform(Xtest)

Fitting the logistic model with the TF-IDF train matrix and the train target variable (the genres).

In [None]:
finalmod.fit(trainvec, Ytrain)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='sag', tol=0.0001, verbose=0,
                   warm_start=False)

Predicting the genres of the test data, using the test TF-IDF matrix as input.

In [None]:
y_pred = finalmod.predict(testvec)

Computing the classification report and writing the results into a file.

In [None]:
# Append the test-set report to the results file.
with open("finallogiundersamp_results.txt","a") as f:
    f.write("\n report:\n")
    # Labels are reported in sorted order, so no target_names are passed.
    f.write(classification_report(Ytest, y_pred))



Computing the class weights to account for the class imbalance. Passing 'balanced' weights each class inversely to its frequency, so rarer genres receive larger weights and more frequent genres receive smaller ones.
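For reference, the 'balanced' heuristic in scikit-learn assigns each class the weight n_samples / (n_classes * class_count). The sketch below checks this against `compute_class_weight` on a small made-up label vector; the labels and counts are hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 6 Rock, 3 Pop, 1 Jazz.
y = np.array(["Rock"] * 6 + ["Pop"] * 3 + ["Jazz"] * 1)
classes = np.unique(y)  # sorted: ['Jazz', 'Pop', 'Rock']

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same values computed by hand: n_samples / (n_classes * count per class).
counts = np.array([(y == c).sum() for c in classes])
manual = len(y) / (len(classes) * counts)

print(dict(zip(classes, weights)))
print(np.allclose(weights, manual))
```

The weights come back in the order of `classes`, that is, in sorted label order.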

In [None]:
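from sklearn.utils.class_weight import compute_class_weight

# Recent scikit-learn versions require keyword arguments here.
# The returned weights follow the sorted order of the class labels.
class_weights = compute_class_weight(class_weight='balanced',
                                     classes=np.unique(Ytrain),
                                     y=Ytrain)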

In [None]:
class_weights

array([ 1.40889137,  2.86302814, 11.25334693,  0.87132935,  6.71996345,
        2.718034  ,  0.93065434,  4.97744191,  0.57653491,  5.94343434,
        0.19962026])

Fitting a logistic regression model with the previous parameters plus class_weight='balanced', which gives more weight to under-represented classes and less weight to over-represented ones.
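To make the link between the computed weights and this parameter concrete, the sketch below fits one model with class_weight='balanced' and one with an explicit dictionary of the same weights, then compares the coefficients. It uses a synthetic dataset from `make_classification` and the default solver rather than 'sag'; both are assumptions made to keep the example small and deterministic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, imbalanced 3-class data (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=0,
)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
weight_dict = {int(c): float(w) for c, w in zip(classes, weights)}

# One model with the explicit weights, one with the 'balanced' shortcut.
manual = LogisticRegression(class_weight=weight_dict, max_iter=1000).fit(X, y)
auto = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

same = np.allclose(manual.coef_, auto.coef_, atol=1e-6)
print(same)
```

Passing 'balanced' therefore saves the explicit computation; the separately computed array is mainly useful for inspecting the weights.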

In [None]:
finalmod = LogisticRegression(multi_class='multinomial',solver='sag',class_weight='balanced')

Fitting the logistic model

In [None]:
finalmod.fit(trainvec, Ytrain)





LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=100, multi_class='multinomial', n_jobs=None,
                   penalty='l2', random_state=None, solver='sag', tol=0.0001,
                   verbose=0, warm_start=False)

Predicting the test data

In [None]:
y_pred = finalmod.predict(testvec)

Computing the classification report and storing the results into a file

In [None]:
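# Append the class-weighted test-set report to the results file.
with open("finallogiundersamp_results_class_weighted.txt","a") as f:
    f.write("\n report:\n")
    # Labels are reported in sorted order, so no target_names are passed.
    f.write(classification_report(Ytest, y_pred))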

Storing the actual and predicted values into a csv file.

In [None]:
dat={'Actual':Ytest,'pred':y_pred}
resdf=pd.DataFrame(dat)
resdf.to_csv('reslogiclassweighted.csv')