# Author: Srikaran Elakurthy

# Description

Importing the train and test sets for use in the Naive Bayes model.
Implementing Multinomial Naive Bayes classification on that dataset.
Performing 5-fold cross validation and extracting the average accuracy.
Training the Naive Bayes model on the whole train set and testing it on the test set.
Storing the results in files.

>*Detailed descriptions accompany many parts of the code below. Please read through them.

# Command to Run 

> Open the ipynb notebook in Jupyter Lab and go to the menu bar on the top, click on 'Run' and from the dropdown select the 'Run All' option to run all the cells in the notebook.


# Input and Output

Input files: 
>nonsampling_train.csv - The preprocessed, unsampled version of the lyrics from the dataset. It is used only for training the models.

>nonsampling_test.csv - The preprocessed, unsampled version of the lyrics from the dataset. It is used only for testing the models.

Outputs: 
>finalmodelNaiveBayes_undersampled.pkl - Naive Bayes model trained on the whole train set.

>finalNaiveundersamp_results.txt - Classification report of the final model on the test set.

>NaiveBayes1.pkl to NaiveBayes5.pkl - Models generated during the K-Fold cross validation, one per split.

>resNaivebayes1.csv to resNaivebayes5.csv - Actual and predicted genres for each cross validation split.

>Naiveundersamp_results.txt - Cross validation metrics, with one classification report per split.

Input: nonsampling_train.csv and nonsampling_test.csv
Output: the classification report on the test set, as written to finalNaiveundersamp_results.txt:
                  precision    recall  f1-score   support

            Rock       0.00      0.00      0.00      2863
         Country       0.00      0.00      0.00      1407
         Hip-Hop       0.00      0.00      0.00       356
             Pop       0.93      0.24      0.38      4618
            Jazz       0.00      0.00      0.00       587
             R&B       0.00      0.00      0.00      1449
           Metal       1.00      0.00      0.01      4285
      Electronic       0.00      0.00      0.00       764
           Other       0.81      0.00      0.01      6985
            Folk       0.00      0.00      0.00       651
           Indie       0.47      1.00      0.64     20165

        accuracy                           0.48     44130
       macro avg       0.29      0.11      0.09     44130
    weighted avg       0.54      0.48      0.33     44130
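
The two summary rows follow directly from the per-class rows. As a rough check, the sketch below recomputes them from the rounded values printed in the report, so agreement is only expected to two decimals. The variable and function names are illustrative and not part of the notebook.

```python
# Per-class (precision, recall, f1, support) rows copied from the report,
# in the order in which they are printed.
rows = [
    (0.00, 0.00, 0.00, 2863),
    (0.00, 0.00, 0.00, 1407),
    (0.00, 0.00, 0.00, 356),
    (0.93, 0.24, 0.38, 4618),
    (0.00, 0.00, 0.00, 587),
    (0.00, 0.00, 0.00, 1449),
    (1.00, 0.00, 0.01, 4285),
    (0.00, 0.00, 0.00, 764),
    (0.81, 0.00, 0.01, 6985),
    (0.00, 0.00, 0.00, 651),
    (0.47, 1.00, 0.64, 20165),
]

total_support = sum(r[3] for r in rows)


def macro_avg(col):
    """Unweighted mean over classes: every genre counts equally."""
    return sum(r[col] for r in rows) / len(rows)


def weighted_avg(col):
    """Mean over classes, weighted by support (number of true samples)."""
    return sum(r[col] * r[3] for r in rows) / total_support


macro = [round(macro_avg(c), 2) for c in range(3)]
weighted = [round(weighted_avg(c), 2) for c in range(3)]
print(total_support, macro, weighted)
```

The gap between the weighted and the macro averages reflects the class imbalance: the largest class dominates the weighted rows, while the macro rows treat all eleven genres alike.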


Importing the required packages and the train and test data

In [None]:
import pandas as pd
# sklearn.externals.joblib was removed from scikit-learn; use the standalone joblib package
import joblib
from statistics import mean
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold

# Load the preprocessed, unsampled train and test sets
traindf=pd.read_csv("nonsampling_train.csv")
testdf=pd.read_csv("nonsampling_test.csv")



In [None]:
traindf.head()

Unnamed: 0.1,Unnamed: 0,lyrics,genre
0,173586,i wait for the pain it alway come again and i ...,Rock
1,192873,as i hear the mock bird i rememb the word when...,Country
2,196702,gab yeah yeah x lateef blackalici lateef the t...,Hip-Hop
3,34320,well the sky broke in two i found you danc alo...,Pop
4,77537,when madam pompadour wa on a ballroom floor sa...,Jazz


Dropping the 'Unnamed' columns, which pandas creates on import when a csv file was saved together with its index.

In [None]:
traindf.drop(traindf.columns[traindf.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)

testdf.drop(testdf.columns[testdf.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)



In [None]:
traindf.genre.unique()

array(['Rock', 'Country', 'Hip-Hop', 'Pop', 'Jazz', 'R&B', 'Metal',
       'Electronic', 'Other', 'Folk', 'Indie'], dtype=object)

In [None]:
lyrics=traindf['lyrics']
genre=traindf['genre']

Performing stratified cross validation with 5 splits, fitting the tfidf vectorizer inside each split so that no information from the held-out data leaks into training.
The tfidf vectorizer uses stop_words="english", which removes English stop words, and ngram_range=(1,2), which considers both unigrams and bigrams.
We use the Multinomial Naive Bayes model, which handles multi-class problems, with alpha=1, i.e. Laplace (add-one) smoothing.
Storing the models in pickle files using joblib, and the actual and predicted values of each split in a csv file.
Storing the accuracy of every cross validation split in a list.
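
To illustrate what "stratified" means here, the following is a minimal pure-Python sketch that assigns samples to folds class by class, so that every fold keeps roughly the same genre proportions. It is a simplified round-robin illustration, not scikit-learn's exact StratifiedKFold algorithm, and the function name and toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, class by class, so that every
    fold keeps roughly the same class proportions (simplified sketch)."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Hypothetical imbalanced toy labels: 10 Rock, 5 Pop, 5 Jazz.
labels = ["Rock"] * 10 + ["Pop"] * 5 + ["Jazz"] * 5
folds = stratified_folds(labels, n_splits=5)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    # Anything learned from data (vocabulary, IDF weights, the classifier)
    # would be fitted on train_idx only and then applied to test_idx.
    print(Counter(labels[i] for i in test_idx), len(train_idx))
```

Each held-out fold keeps the 2:1:1 ratio of the full toy set, which is the property that matters for an imbalanced genre distribution.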

In [None]:
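# shuffle defaults to False, so random_state must not be set (newer scikit-learn versions raise an error)
skf = StratifiedKFold(n_splits=5)
# classification_report sorts labels alphabetically unless labels is given;
# passing the same array as labels and target_names keeps names and rows aligned
genres = traindf.genre.unique()
i=0
logirep=[]    # classification report of every split
logiscore=[]  # accuracy of every split
for train_index, test_index in skf.split(lyrics, genre):
    print(train_index)
    print(test_index)
    x_train1, x_test1 = lyrics.iloc[train_index], lyrics.iloc[test_index]
    y_train, y_test = genre.iloc[train_index], genre.iloc[test_index]
    i=i+1
    # Fit the vectorizer on the training part only, then transform the held-out part
    tfidf = TfidfVectorizer(stop_words="english",ngram_range=(1,2))
    x_train = tfidf.fit_transform(x_train1)
    x_test = tfidf.transform(x_test1)
    print("tfidf"+str(i))
    
    clf = MultinomialNB(alpha=1.0)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    # Persist the model of this split
    joblib.dump(clf, 'NaiveBayes'+str(i)+'.pkl')
    score = accuracy_score(y_test, y_pred)
    rep=classification_report(y_test, y_pred, labels=genres, target_names=genres)
    logirep.append(rep)
    # Persist actual and predicted genres of this split
    dat={'Actual':y_test,'pred':y_pred}
    resdf=pd.DataFrame(dat)
    resdf.to_csv('resNaivebayes'+str(i)+'.csv')
    logiscore.append(score)
    print(score)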



[ 33443  33455  33468 ... 176517 176518 176519]
[    0     1     2 ... 36103 36246 36370]
tfidf1


0.4803138454566055
[     0      1      2 ... 176517 176518 176519]
[33443 33455 33468 ... 71608 71687 71705]
tfidf2


0.4797190120099705
[     0      1      2 ... 176517 176518 176519]
[ 67218  67221  67329 ... 107216 107234 107296]
tfidf3


0.47929413097665985
[     0      1      2 ... 176517 176518 176519]
[102550 102551 102738 ... 145107 145206 145391]
tfidf4


0.4795490595966463
[     0      1      2 ... 145107 145206 145391]
[139006 139414 139416 ... 176517 176518 176519]
tfidf5


0.4803421708588262


Creating a file for the classification reports of the 5 cross validation models, computing the average accuracy and writing both to the file.
The classification reports are computed from the results csv files, which contain the actual and predicted values of each split.
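
As a quick worked example, the average accuracy can be reproduced from the five scores printed by the cross validation cell. The variable names below are illustrative.

```python
from statistics import mean

# Accuracy of each split, copied from the cross validation output.
fold_scores = [
    0.4803138454566055,
    0.4797190120099705,
    0.47929413097665985,
    0.4795490595966463,
    0.4803421708588262,
]

average = mean(fold_scores)
spread = max(fold_scores) - min(fold_scores)
print(f"average accuracy: {average:.4f}, spread: {spread:.4f}")
```

The average is about 0.48 and the splits differ by roughly a tenth of a percentage point, so the estimate is stable across splits.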

In [None]:
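# Append the per-split reports and the average accuracy to the results file
genres = traindf.genre.unique()
with open("Naiveundersamp_results.txt","a") as f:
    for i in range(1,6):
        f.write("\n report"+str(i)+":\n")
        dfr=pd.read_csv("resNaivebayes"+str(i)+".csv")
        dfr.drop(dfr.columns[dfr.columns.str.contains('unnamed',case = False)],axis = 1, inplace = True)
        # labels fixes the row order so that target_names match the rows
        f.write(classification_report(dfr.Actual, dfr.pred, labels=genres, target_names=genres))
    # logiscore holds the accuracy of every cross validation split
    f.write("\nThe average score for this model is "+str(mean(logiscore)))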

Declaring the Naive Bayes model with alpha=1, i.e. Laplace (add-one) smoothing.
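
To make the role of alpha concrete, here is a small sketch of the smoothed likelihood estimate P(word | class) = (count + alpha) / (total + alpha * |V|). The vocabulary, tokens and function name are hypothetical toy values. In the notebook the "counts" are sums of tfidf weights rather than integer counts, but the same formula applies.

```python
from collections import Counter


def smoothed_likelihoods(class_tokens, vocabulary, alpha=1.0):
    """P(word | class) = (count + alpha) / (total + alpha * |V|)."""
    counts = Counter(class_tokens)
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}


# Hypothetical toy vocabulary and the tokens observed for one genre.
vocabulary = ["love", "road", "night", "guitar"]
rock_tokens = ["night", "guitar", "guitar", "road"]

probs = smoothed_likelihoods(rock_tokens, vocabulary, alpha=1.0)
print(probs)
```

The word "love" never occurs in this genre, yet it receives a probability of 1/8 instead of 0, so a single unseen word cannot zero out the whole class score.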

In [None]:
finalmod = MultinomialNB(alpha=1.0)

Performing tfidf vectorization to extract the tfidf matrix from the whole training data, and using the fitted vectorizer to transform the test data into a tfidf matrix.
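
The fit/transform split can be sketched by hand. The version below assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which to our understanding matches the TfidfVectorizer defaults. It is deliberately simplified (whitespace tokens, no stop words, no bigrams), and all names and documents are hypothetical.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def transform(doc, idf):
    """Weight term counts by the learned IDF, then L2-normalise.
    Terms that were not seen during fitting are dropped."""
    counts = Counter(t for t in doc.split() if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


# Hypothetical toy documents standing in for the lyrics column.
train_docs = ["love night road", "love guitar", "night guitar guitar"]
test_doc = "love spaceship guitar"

idf = fit_idf(train_docs)
test_vector = transform(test_doc, idf)
print(test_vector)
```

The word "spaceship" does not appear in the training documents, so it has no column in the test vector. The test data therefore never influences the vocabulary or the IDF weights.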

Tfidf matrix considers:
*   Removing stop words
*   Considering both unigrams and bigrams
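
The two settings above can be sketched as follows. This is a simplified stand-in for the vectorizer's word analyzer, with a hypothetical function name and a tiny hypothetical stop-word list; the real "english" list in scikit-learn is much longer.

```python
def word_ngrams(text, stop_words, ngram_range=(1, 2)):
    """Drop stop words first, then emit all n-grams in the given range."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Hypothetical, tiny stop-word list used only for this illustration.
stop_words = {"i", "for", "the", "it", "and"}
features = word_ngrams("i wait for the pain", stop_words)
print(features)
```

Because stop words are filtered before the n-grams are built, the bigram "wait pain" joins two words that were not adjacent in the original lyric.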

In [None]:
tfidf = TfidfVectorizer(stop_words="english",ngram_range=(1,2))
trainvec = tfidf.fit_transform(traindf['lyrics'])
testvec = tfidf.transform(testdf['lyrics'])

Fitting the model with the train tfidf matrix and the train target variable (genre)

In [None]:
finalmod.fit(trainvec, traindf['genre'])

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

Storing the model

In [None]:
joblib.dump(finalmod, 'finalmodelNaiveBayes_undersampled.pkl')

['finalmodelNaiveBayes_undersampled.pkl']

Predicting the test data using test tfidf matrix

In [None]:
y_pred = finalmod.predict(testvec)

Computing the classification report and writing the results into a file.
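
The per-class numbers in such a report can be computed by hand. The sketch below uses hypothetical toy labels and a hypothetical helper. It also shows why a genre that is never predicted ends up with a precision of 0.00: the denominator is zero, and the value is set to 0 by convention.

```python
def precision_recall(actual, predicted, label):
    """Per-class precision and recall; 0.0 when the denominator is zero."""
    tp = sum(a == label and p == label for a, p in zip(actual, predicted))
    predicted_pos = sum(p == label for p in predicted)
    actual_pos = sum(a == label for a in actual)
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    return precision, recall


# Hypothetical toy labels: the classifier predicts "Rock" for everything.
actual = ["Rock", "Rock", "Pop", "Jazz"]
predicted = ["Rock", "Rock", "Rock", "Rock"]

for genre in ["Rock", "Pop", "Jazz"]:
    print(genre, precision_recall(actual, predicted, genre))
```

A classifier that collapses onto the majority class gets perfect recall for that class and zeros everywhere else, which is the pattern seen in most rows of the report.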

In [None]:
# labels fixes the row order so that target_names match the rows
genres = traindf.genre.unique()
with open("finalNaiveundersamp_results.txt","a") as f:
    f.write("\n report:\n")
    f.write(classification_report(testdf['genre'], y_pred, labels=genres, target_names=genres))

