## Question 4: We are tasked with classifying text into one of two categories (1 or 0). We choose a `decision tree` and `multinomial naive Bayes` as our classifiers

The plan to do so is:
1. Put the text data into a dataframe
2. Clean the text data by removing stop words, numbers, and punctuation, lemmatizing each word, and converting everything to lower case
3. Convert each observation into a vector of length n, where n is the number of unique words in our dataset. Each element of the vector corresponds to one unique word and holds the frequency of that word in the given observation.
4. Split the dataset into train and test (80:20)
5. Train each classifier on the training set and predict on the test set
6. Calculate the performance of each classifier with accuracy, precision, recall, and F1
7. Time permitting, also try with step 3 replaced by word embeddings
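
Step 3 of the plan can be illustrated on a toy corpus. The sketch below uses only the standard library and three made-up documents (they are not taken from the review dataset); `to_count_vector` is a hypothetical helper name. The notebook itself uses scikit-learn's `CountVectorizer`, which builds the same kind of count matrix but, with default settings, also ignores single-character tokens.

```python
from collections import Counter

# Toy, made-up documents (not taken from the review dataset).
docs = ["good cable good price", "bad cable", "good amp"]

# Vocabulary: every unique word, sorted so the column order is deterministic.
vocab = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc, vocab):
    """Return the frequency of each vocabulary word in `doc`."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

# One row per document, one column per vocabulary word.
bow = [to_count_vector(doc, vocab) for doc in docs]

print(vocab)
for row in bow:
    print(row)
```

Each row has length n = 5 here, the number of unique words, and most entries are zero, which is why real bag-of-words matrices are usually stored in sparse form.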

In [210]:
import pandas as pd
import numpy as np
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

Step 1

In [211]:
data = pd.read_csv("Data/musical.tsv", sep="\t")

Step 2

In [212]:
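en_stops = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Remove stop words, then lowercase. Note: the stop-word test runs before
# lowercasing, so capitalized stop words ("This", "If", "It") are kept.
data["Review"] = data["Review"].\
    apply(lambda x: ' '.join([word.lower() for word in x.split() if word not in en_stops]))
# Remove everything except lowercase letters and whitespace
# (regex=True is required in pandas >= 2.0, where the default is a literal match)
data["Review"] = data["Review"].str.replace(r'[^a-z\s]', '', regex=True)
# Lemmatize each word
data["Review"] = data["Review"].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))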
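
The order of the cleaning steps matters. As a minimal sketch, assuming a tiny hypothetical stop-word list (`STOP_WORDS`, standing in for NLTK's English list) and a made-up sentence, the variant below lowercases first so that capitalized stop words are also matched. Lemmatization is left out to keep the example free of external dependencies.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's English stop-word list instead.
STOP_WORDS = {"this", "is", "the", "of", "a", "i", "it", "if"}

def clean(text, stop_words=STOP_WORDS):
    """Lowercase, strip everything but letters and whitespace, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(word for word in text.split() if word not in stop_words)

cleaned = clean("This is the second set of strap locks I've owned!")
print(cleaned)  # second set strap locks ive owned
```

In this sketch "I've" becomes "ive" because the apostrophe is stripped before the stop-word check, so a contraction would only be dropped if it were matched before punctuation removal.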



In [213]:
data

Unnamed: 0,Review,Score
0,this second set strap lock ive owned they litt...,1
1,first i want say i love tube amp distortion ov...,1
2,bought idea full version behringers sequence p...,0
3,if like me probably bought hook xlr microphone...,1
4,didnt know expect proved worth gamblethis cabl...,1
...,...,...
995,it really pain give anything star review bos p...,1
996,it decent unit stopped working completely mont...,0
997,i bought cable order able run longer cable run...,1
998,well made work should however seem getting lit...,1


Step 3

In [214]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
bow = vec.fit_transform(data["Review"])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow = pd.DataFrame(bow.toarray(), columns=vec.get_feature_names_out())



In [215]:
processed_data = pd.concat([data["Score"], bow], axis=1)

In [216]:
X = processed_data.loc[:, processed_data.columns != "Score" ]
y = processed_data["Score"]

In [217]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Naive Bayes

Multinomial naive Bayes is a classification algorithm that assigns an input to the class with the highest posterior probability. Bayes' rule gives $p(L_k|W)=\frac{p(L_k)\,p(W|L_k)}{p(W)}$, where $L_k$ is the kth label (a score of 0 or 1 in this case) and $W$ is an observation. This can also be thought of as $posterior=prior*\frac{likelihood}{evidence}$.
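
To make the rule concrete, here is a minimal from-scratch sketch on made-up training data (not the review dataset). It assumes the usual naive independence of words given the label and add-one (Laplace) smoothing with `alpha=1.0`, which is also the default of `MultinomialNB`. The function names `fit` and `predict` are illustrative only. Because the evidence $p(W)$ is the same for every class, it is dropped and the classes are compared by log prior plus log likelihood.

```python
import math
from collections import Counter

# Toy, made-up training data (not from the review dataset): (text, label).
train = [
    ("good cable good price", 1),
    ("great amp good sound", 1),
    ("bad cable broke", 0),
]
vocab = sorted({word for text, _ in train for word in text.split()})

def fit(train, vocab, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    log_prior, log_lik = {}, {}
    for k in sorted({label for _, label in train}):
        texts = [text for text, label in train if label == k]
        log_prior[k] = math.log(len(texts) / len(train))
        counts = Counter(word for text in texts for word in text.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[k] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(text, log_prior, log_lik):
    """Return the class with the largest unnormalized log posterior."""
    scores = {
        k: log_prior[k]
        + sum(log_lik[k][w] for w in text.split() if w in log_lik[k])
        for k in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_lik = fit(train, vocab)
print(predict("good sound", log_prior, log_lik))  # 1
print(predict("bad cable", log_prior, log_lik))   # 0
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied; words outside the training vocabulary are simply skipped in this sketch.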

In [218]:
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB()

In [219]:
nb_clf.fit(X_train, y_train)

MultinomialNB()

In [220]:
nb_preds = nb_clf.predict(X_test)

In [221]:
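def performance(preds, truths):
    """Print accuracy, precision, recall, and F1, treating label 1 as positive."""
    preds = np.asarray(preds)
    truths = np.asarray(truths)
    accuracy = (truths == preds).sum()/len(truths)
    # Confusion-matrix counts
    tp = ((truths == 1) & (preds == 1)).sum()
    fp = ((truths == 0) & (preds == 1)).sum()
    tn = ((truths == 0) & (preds == 0)).sum()
    fn = ((truths == 1) & (preds == 0)).sum()
    precision = tp/(tp+fp)  # share of predicted positives that are correct
    recall = tp/(tp+fn)     # share of actual positives that are found
    f1 = 2*precision*recall/(precision+recall)
    print(f"Accuracy: {accuracy}\nPrecision: {precision}\nRecall: {recall}\nF1 Score: {f1}")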

In [222]:
performance(nb_preds, y_test)

Accuracy: 0.805
Precision: 0.8504672897196262
Recall: 0.7982456140350878
F1 Score: 0.823529411764706


## Decision Tree

In [223]:
from sklearn import tree

In [224]:
tree_clf = tree.DecisionTreeClassifier()

In [225]:
tree_clf = tree_clf.fit(X_train, y_train)

In [226]:
tree_preds = tree_clf.predict(X_test)

Decision Tree Performance Metrics

In [227]:
performance(tree_preds, y_test)

Accuracy: 0.665
Precision: 0.7373737373737373
Recall: 0.6403508771929824
F1 Score: 0.6854460093896713


Naive Bayes Performance Metrics

In [228]:
performance(nb_preds, y_test)

Accuracy: 0.805
Precision: 0.8504672897196262
Recall: 0.7982456140350878
F1 Score: 0.823529411764706
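
The two sets of metrics can be read back as confusion-matrix counts. The notebook does not print these counts; the values below are inferred as integers consistent with the printed precision and recall, assuming a 200-row test set (20% of roughly 1,000 reviews). The helper `metrics` is a hypothetical name that applies the same formulas as `performance`.

```python
def metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Counts inferred from the printed metrics (assumption: 200 test rows).
nb = metrics(tp=91, fp=16, tn=70, fn=23)  # multinomial naive Bayes
dt = metrics(tp=73, fp=26, tn=60, fn=41)  # decision tree

# Accuracy of always predicting the positive class: (tp + fn) / total.
baseline_accuracy = (91 + 23) / 200

for name, m in [("Naive Bayes", nb), ("Decision tree", dt)]:
    print(name, {key: round(value, 3) for key, value in m.items()})
print("Always-positive baseline accuracy:", baseline_accuracy)
```

Under these assumptions, 114 of the 200 test reviews are positive, so always predicting 1 would score 0.57 accuracy. Both classifiers beat that baseline, and naive Bayes does so by a much wider margin than the decision tree.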
