# NLP Experiment

In [609]:
# This code makes plots appear inline in this document rather than in a new window
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (5, 4) # set default size of plots

# Some more magic so that the notebook will reload external python modules
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [610]:
# This line imports the Pandas Library
import pandas as pd

# This line imports defaultdict() from collections
from collections import defaultdict

import numpy as np

In [611]:
# Storing the data (index, text, label) in the DataFrame "df"
df = pd.read_excel('Data_file.xlsx', index_col=0, header=0, usecols="A:B, F", names = ["Index", "Text", "Label"], sheet_name="Data")

# Correcting the "NA" labels, since Pandas imports them as NaN objects of type float.
# fillna replaces them in one step and avoids chained assignment (df.Label[i] = ...).
df['Label'] = df['Label'].fillna("NA")
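
The "NA" correction above can be reproduced on a small stand-in table. This is a minimal sketch: the three rows are made up for illustration, and a CSV string is used instead of the spreadsheet so the example is self-contained. It shows the default parsing, the fillna repair, and the `keep_default_na=False` option, which `read_csv` and `read_excel` both accept.

```python
import io

import pandas as pd

# Hypothetical three-row table standing in for the spreadsheet
csv_text = "Text,Label\nloved the remake,MOV\nthanks a lot,NA\nnew tts api,TEC\n"

# Default parsing: the string "NA" is read as a missing value (NaN)
parsed = pd.read_csv(io.StringIO(csv_text))
n_missing = int(parsed["Label"].isna().sum())

# Option 1: restore the label after loading
restored = parsed["Label"].fillna("NA").tolist()

# Option 2: keep "NA" as plain text while loading
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)["Label"].tolist()

print(n_missing, restored, kept)
```

Option 2 turns off every default missing-value marker, so it is only appropriate when no column relies on them.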

In [612]:
# Labels that will be used
MOVIE_LABEL = "MOV" #Any content related to movies
TECHNOLOGY_LABEL = "TEC" #Any content related to technology: websites, NLP, text-to-speech, and more
COLLEGE_LABEL = "COL" #Anything related to UMass Amherst/College Life
NOT_APPLICABLE_LABEL = "NA" #Not Applicable 

## Tokenization

For tokenization, I am using the default tokenizer built into scikit-learn's TfidfVectorizer. Its default token pattern keeps runs of two or more alphanumeric characters and treats whitespace and punctuation as separators, so hyphenated words like "prize-winning" and "half-asleep" are split into their parts. Since we have a small data set, each part of a word joined by punctuation might hold its own value for labeling. For example, my dataset contains words like "pre-trained" and "Burton-y", where 'trained' by itself may tell us that the text is probably about technology and 'Burton' might indicate that it is probably about a movie. To also capture sequences of two or more words (joined by punctuation) like "Text-to-speech" that might hold value, I use n-gram tokenization; the vectorizer below is configured with ngram_range=(1, 2).
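
To make the tokenizer's behaviour concrete, here is a minimal sketch that runs the default tokenizer and a full analyzer on two made-up strings. The example strings are assumptions for illustration; `build_tokenizer` and `build_analyzer` are the vectorizer's own helper methods.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Default tokenizer only: no lowercasing, no stop-word removal
tokenize = TfidfVectorizer().build_tokenizer()
tokens = tokenize("pre-trained Burton-y text-to-speech")
print(tokens)  # the single-character "y" is dropped by the default pattern

# Full analyzer: lowercasing, stop-word removal, then unigrams and bigrams
analyze = TfidfVectorizer(
    lowercase=True, stop_words="english", ngram_range=(1, 2)
).build_analyzer()
ngrams = analyze("Text-to-speech is pre-trained")
print(ngrams)
```

Because stop words are removed before n-grams are formed, "Text-to-speech" contributes the bigram "text speech" rather than a three-word sequence.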

In [613]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(lowercase=True, analyzer='word',
                             stop_words='english', ngram_range=(1, 2), min_df=2, max_df=150)

## BOW LogReg

### Feature preprocessing 

For preprocessing, all words are lowercased so that a 'Capitalized' word at the beginning of a sentence matches its lowercase form elsewhere. I also remove English stop words, since they appear under almost every label and carry little signal. Words that occur in only one document are removed with min_df=2, since a word unique to a single document will probably not help with classification. Words that occur in more than 150 documents are removed with max_df=150: the largest label (NA) contains only 96 documents, so a word that frequent must be spread across several labels and is unlikely to be distinctive of any one of them.

In [614]:
tfidf = (vectorizer.fit_transform(df['Text'])).toarray()

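# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
words = vectorizer.get_feature_names_out().tolist()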
vocab = set(words)

In [615]:
print("First 20 Tokens after sorting:", words[:20])

First 20 Tokens after sorting: ['20', '200', 'actually', 'advice', 'ago', 'algebra', 'amazing', 'answer', 'api', 'ask', 'best', 'better', 'bit', 'book', 'called', 'calls', 'campus', 'class', 'classes', 'cola']


### Results

In [616]:
x = tfidf
# Integer-coded labels: 0 = NA (the default), 1 = MOV, 2 = TEC, 3 = COL
y = np.zeros(len(df))

for i in range(len(df.Label)):
    if df.Label[i]==MOVIE_LABEL:
        y[i] = 1.0
    elif df.Label[i]==TECHNOLOGY_LABEL:
        y[i] = 2.0
    elif df.Label[i]==COLLEGE_LABEL:
        y[i] = 3.0

In [617]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = (LogisticRegression(verbose=1, solver='lbfgs', penalty='l2', max_iter=500).fit(x, y))

accuracy_scores = cross_val_score(model, x, y, cv=5)

# Mean accuracy across the 5 folds
accuracy = accuracy_scores.mean()
print("Overall Accuracy:", accuracy)

Overall Accuracy: 0.48
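
One caveat of vectorizing before cross-validating is that the vocabulary and IDF statistics are computed from all documents, including each fold's held-out ones. A minimal sketch of an alternative, assuming a made-up eight-document corpus, wraps the vectorizer and the classifier in a pipeline so that both are refit inside every fold. It is an illustration of the technique, not a rerun of the experiment above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical corpus: 1 = movie, 3 = college (same integer coding as above)
texts = [
    "great movie tonight", "that movie remake", "movie with friends", "good film and movie",
    "umass class schedule", "umass major advice", "umass campus tour", "umass cs classes",
]
labels = [1, 1, 1, 1, 3, 3, 3, 3]

# The vectorizer is fit only on each fold's training documents
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=500))
scores = cross_val_score(pipeline, texts, labels, cv=2)
mean_accuracy = scores.mean()
print("Fold accuracies:", scores, "Mean:", mean_accuracy)
```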



In [618]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify=df['Label'])

In [619]:
# Note: `model` was fit on all of x earlier, so x_test overlaps its training data
# and the scores below are optimistic compared with the cross-validated accuracy.
confidence = model.predict_proba(x_test)

# Pick the most probable class for each row; the column index equals the
# integer label because the classes are 0, 1, 2, 3
y_predictions = np.argmax(confidence, axis=1)
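
Taking the most probable column of `predict_proba` gives the same labels as calling `predict` directly. The sketch below checks this on synthetic data; the data set from `make_classification` and its parameters are assumptions chosen only to give three classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem, used only for illustration
X, y = make_classification(
    n_samples=120, n_features=8, n_informative=4, n_classes=3, random_state=0
)
clf = LogisticRegression(max_iter=500).fit(X, y)

# Column j of predict_proba corresponds to clf.classes_[j]
proba = clf.predict_proba(X)
from_argmax = clf.classes_[np.argmax(proba, axis=1)]
from_predict = clf.predict(X)

same = bool(np.array_equal(from_argmax, from_predict))
print("argmax of predict_proba matches predict:", same)
```

Indexing through `classes_` matters when the labels are not already 0, 1, 2, and so on.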

In [620]:
from sklearn.metrics import classification_report

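target_names=[NOT_APPLICABLE_LABEL, MOVIE_LABEL, TECHNOLOGY_LABEL, COLLEGE_LABEL]
# classification_report expects (y_true, y_pred). With the predictions passed first, as here,
# the "precision" and "recall" columns are swapped and "support" counts predicted labels.
print(classification_report(y_predictions, y_test, target_names=target_names))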

              precision    recall  f1-score   support

          NA       1.00      0.59      0.75        32
         MOV       0.67      1.00      0.80        10
         TEC       0.44      1.00      0.62         4
         COL       0.57      1.00      0.73         4

    accuracy                           0.74        50
   macro avg       0.67      0.90      0.72        50
weighted avg       0.85      0.74      0.74        50
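
For comparison, here is a minimal sketch of a held-out evaluation in which the model sees only the training split and the true labels are passed first. The twelve documents, the 50/50 split, and `random_state=0` are assumptions made for the example. The last lines also check that swapping the two arguments turns precision into recall.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical corpus with two labels
texts = [
    "great movie tonight", "that movie remake", "movie with friends",
    "good film and movie", "watching a movie", "movie was amazing",
    "umass class schedule", "umass major advice", "umass campus tour",
    "umass cs classes", "umass student life", "umass frats again",
]
labels = ["MOV"] * 6 + ["COL"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=0
)

# Fit on the training documents only, then predict the unseen ones
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=500))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# True labels first, predictions second
report = classification_report(y_test, y_pred, zero_division=0)
print(report)

# Swapping the arguments exchanges precision and recall
precision = precision_score(y_test, y_pred, pos_label="MOV", zero_division=0)
recall_swapped = recall_score(y_pred, y_test, pos_label="MOV", zero_division=0)
```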



### Model Insight

In [621]:
bow = tfidf
# Sum each word's TF-IDF weight over all documents that share a label.
# defaultdict(float) starts every word at 0.0, so no membership check is needed.
word_weights = {MOVIE_LABEL: defaultdict(float),
             COLLEGE_LABEL: defaultdict(float),
             TECHNOLOGY_LABEL: defaultdict(float),
             NOT_APPLICABLE_LABEL: defaultdict(float)}

for i in range(len(bow)):
    label = df.Label[i]
    for j, word_weight in enumerate(bow[i]):
        word_weights[label][words[j]] += word_weight

for label in word_weights:
    word_weights[label] = sorted(word_weights[label].items(),key=lambda x: x[1], reverse=True)
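
The same per-label sums can also be computed with a pandas group-by instead of nested loops. This is a minimal sketch on a made-up 3x3 matrix; the names `toy_words`, `toy_tfidf`, and `toy_labels` are placeholders, not variables from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical TF-IDF matrix: one row per document, one column per word
toy_words = ["movie", "umass", "good"]
toy_tfidf = np.array([
    [0.8, 0.0, 0.6],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
])
toy_labels = pd.Series(["MOV", "COL", "MOV"], name="Label")

# One row per label, holding the summed weight of every word
per_label = pd.DataFrame(toy_tfidf, columns=toy_words).groupby(toy_labels).sum()

# Words of one label, highest summed weight first
top_mov = per_label.loc["MOV"].sort_values(ascending=False)
print(top_mov)
```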

In [622]:
def top_ten(label, label_dict):
    print("Top-10 highest weighted words for", label)
    print()
    for i in range(0, 10):
        print(f"{i+1} Word: {label_dict[i][0]}, Weight: {label_dict[i][1]}")

In [623]:
top_ten(MOVIE_LABEL, word_weights[MOVIE_LABEL])

Top-10 highest weighted words for MOV

1 Word: movie, Weight: 5.032352423717666
2 Word: movies, Weight: 4.516040889968569
3 Word: great, Weight: 4.0
4 Word: good, Weight: 2.867553228266578
5 Word: remake, Weight: 2.32902111955974
6 Word: watching, Weight: 2.146617289968964
7 Word: film, Weight: 2.1283315480618112
8 Word: just, Weight: 2.1168833703103616
9 Word: way, Weight: 2.0925094279870295
10 Word: ve, Weight: 1.8664992184525673


In [624]:
top_ten(COLLEGE_LABEL, word_weights[COLLEGE_LABEL])

Top-10 highest weighted words for COL

1 Word: umass, Weight: 3.831318937143369
2 Word: major, Weight: 2.6957913571321424
3 Word: frats, Weight: 2.354513921946645
4 Word: class, Weight: 1.918284212689718
5 Word: cs, Weight: 1.8805116547216971
6 Word: student, Weight: 1.7735471301073815
7 Word: classes, Weight: 1.72487490286209
8 Word: 200, Weight: 1.530413386131268
9 Word: coming, Weight: 1.5087539783908794
10 Word: getting, Weight: 1.4124444450940081


In [625]:
top_ten(TECHNOLOGY_LABEL, word_weights[TECHNOLOGY_LABEL])

Top-10 highest weighted words for TEC

1 Word: ve, Weight: 2.612578951359384
2 Word: tokens, Weight: 2.0
3 Word: nlp, Weight: 1.6633090166873892
4 Word: cola, Weight: 1.5983009867135576
5 Word: real, Weight: 1.5862947002085048
6 Word: faster, Weight: 1.5773502691896257
7 Word: tts, Weight: 1.5538077857179051
8 Word: word, Weight: 1.5495284403793137
9 Word: data, Weight: 1.4345414181697302
10 Word: python, Weight: 1.2844570503761734


In [626]:
top_ten(NOT_APPLICABLE_LABEL, word_weights[NOT_APPLICABLE_LABEL])

Top-10 highest weighted words for NA

1 Word: like, Weight: 4.981282367169192
2 Word: just, Weight: 3.915425096018624
3 Word: thanks, Weight: 3.652315376953827
4 Word: heard, Weight: 2.816410653164552
5 Word: good, Weight: 2.689898793878429
6 Word: pretty, Weight: 2.5601707836895775
7 Word: maybe, Weight: 2.5480692476890043
8 Word: does, Weight: 2.095094359716256
9 Word: time, Weight: 2.095044042684684
10 Word: writing, Weight: 1.9237925173309651


For the first three categories, the top 10 words mostly make sense: words such as "movie", "film", "umass", "classes", "nlp", and "python" are directly related to the label and so appear in much of the text under that label. Some generic words still slip in, for example "ve" and "just". The top 10 words for the NA (not applicable) category don't make much sense, probably because it is a general catch-all fourth category. I think this is also one reason the accuracy was so low: words such as "like", "heard", and "good" can easily appear when someone is talking about a movie or a class. Note that these weights are TF-IDF values summed over the documents of each label, so they show which words are most strongly associated with a label in the data rather than what the classifier itself learned.