# Natural Language Processing(NLP)

I observed that NLP has been a hot topic for quantitative finance these days. Hence, I want to do a small project to matches the market's demand

In [5]:
import nltk
from nltk.corpus import movie_reviews
import random
import numpy as np
import pandas as pd
#from sklearn.model_selection import KFold
import string
from nltk.stem import PorterStemmer
#from sklearn.model_selection import train_test_split
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

nltk is the main library associated with the NLP topic(at least in the domain of my knowledge). Today, we use the 'movie_reviews' data attached in the nltk library to analyze the sentiment of some movie reviews. The ultimate goal is to build a model that can predict sentiments if a piece of review is given.

In [18]:
nltk.download('movie_reviews')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Put the (review, sentiment) pair into a list. Also, pandas dataframe is pretty handy for data processing.

In [14]:
# input
documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

# dataframe of documents
documents_df = pd.DataFrame(documents, columns=['review', 'sentiment'])


Quick look of raw data

In [15]:
documents_df.head(10)

Unnamed: 0,review,sentiment
0,"[a, follow, -, up, to, disney, ', s, live, -, ...",neg
1,"[what, if, one, of, our, cities, became, the, ...",pos
2,"[contact, is, a, nobly, intentioned, but, ulti...",pos
3,"[vampire, lore, and, legend, has, always, been...",pos
4,"[it, ', s, a, rare, treat, when, a, quality, h...",neg
5,"[a, month, ago, i, wrote, that, speed, 2, was,...",neg
6,"[capsule, :, annoyingly, unentertaining, ,, ob...",neg
7,"[a, number, of, critics, have, decided, that, ...",neg
8,"[shakespeare, in, love, is, quite, possibly, t...",neg
9,"[the, "", italian, hitchcock, "", and, acknowled...",pos


The following functions clean the data by removing some unwanted elements of raw data, such as punctuations and stopwords. They do not have any effect on the sentiment.

In [16]:
# step 1: punctuation
def remove_punctuation(text):
    pun_lst = string.punctuation
    no_punct = [words for words in text if words not in pun_lst]
    #words_wo_punct=''.join(no_punct)
    return no_punct

documents_df['review'] = documents_df['review'].apply(lambda x: remove_punctuation(x))

In [19]:
# step 2: stopwords
stopword = nltk.corpus.stopwords.words('english')
def remove_stopwords(text):
    text = [word for word in text if word not in stopword]
    return text
  
documents_df['review'] = documents_df['review'].apply(lambda x: remove_stopwords(x))

In [20]:
# step 3: numbers
def remove_numbers(text):
    text = [word for word in text if not word.isnumeric()]
    return text

documents_df['review'] = documents_df['review'].apply(lambda x: remove_numbers(x))

My favorite step is to extract the 'root' of words. For example, 'have', 'had', 'having' will all be converted to 'have'. This step is not an intuitive one to me because there is no change of verbs in my first language, Chinese. Now I have learned this elegant step for processing words.

In [21]:
# step 4: stemming
ps =PorterStemmer()
def stemming(text):
    text = list(map(lambda x: ps.stem(x), text))
    # after stemming, remove duplicates
    text = list(np.unique(text))
    return text

documents_df['review'] = documents_df['review'].apply(lambda x: stemming(x))

Quick look at documents_df after transformation

In [23]:
documents_df.head(10)

Unnamed: 0,review,sentiment
0,"[--, act, action, actor, actress, ad, addit, a...",neg
1,"[--, abus, across, action, activ, actual, admi...",pos
2,"[abandon, acknowledg, across, actor, actress, ...",pos
3,"[action, ad, adam, adher, advantag, agre, aid,...",pos
4,"[abil, alli, allow, anim, apolog, attack, atta...",neg
5,"[--, action, adapt, ago, art, asid, attent, b,...",neg
6,"[--, actor, air, alreadi, alway, ammo, annoyin...",neg
7,"[act, actor, affair, alik, america, ampli, ano...",neg
8,"[act, actor, ado, affleck, almost, also, alway...",neg
9,"[acknowledg, aid, argento, band, beauti, becko...",pos


We can convert dataframe back to a list of tuples. Then, we pick the 2000 most frequent words in the movie reviews. We call these words 'features'. 

In [24]:
# turn dataframe into a list of tuples
documents_lst = [tuple(r) for r in documents_df.to_numpy()]

# # the rank of frequencies of all words
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# # we call the 2000 most frequent words 'features'
word_features = list(all_words)[:2000]

Check if a word of one review exists in features. 



In [26]:
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['has({})'.format(word)] = (word in document_words)
    return features

Lastly, we format the data so that it fits the Naive Bayes classifier. 

In [27]:
labelsets = [(document_features(d), c) for (d,c) in documents_lst]
# train test split
training_set = labelsets[:1500]
testing_set = labelsets[1500:]
# classifier
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MultinomialNB accuracy percent:",nltk.classify.accuracy(MNB_classifier, testing_set))


MultinomialNB accuracy percent: 0.808


According to the output, the MNB model can successfully predicts the sentiment of a given review 80% of the time. A possible improvement of this project is to add more data cleaning approaches. 