### Data cleaning and preprocessing

In [1]:
# load data and take a quick look
import pandas as pd
raw_data = pd.read_csv('coursework1_train.csv')
raw_data.head()

| Unnamed: 0.1 | Unnamed: 0 | text | sentiment |
|---|---|---|---|
| 0 | 0 | Enjoy the opening credits. They're the best th... | neg |
| 1 | 1 | Well, the Sci-Fi channel keeps churning these ... | neg |
| 2 | 2 | It takes guts to make a movie on Gandhi in Ind... | pos |
| 3 | 3 | The Nest is really just another 'nature run am... | neg |
| 4 | 4 | Waco: Rules of Engagement does a very good job... | pos |


In [2]:
# check the size of the data and its class distribution
all_text = raw_data['text'].tolist()
all_lables = raw_data['sentiment'].tolist()

print('number of reviews:', len(all_text))
print('number of pos reviews:', all_lables.count('pos'))
print('number of neg reviews:', all_lables.count('neg'))

number of reviews: 40000
number of pos reviews: 20000
number of neg reviews: 20000


The first preprocessing step I performed was converting the text to lowercase, so that stopwords can be matched if I remove them later (nltk's stopword list is entirely lowercase).

In [3]:
# convert text to lowercase
raw_data['text'] = raw_data['text'].apply(lambda x: " ".join(x.lower().split()))

Following this, I lemmatized the text so that word frequencies can be counted correctly later (e.g. movie and movies should not be counted as two separate tokens).

In [4]:
# lemmatize text
import nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
raw_data['text'] = raw_data['text'].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(y) for y in x.split()]))

I then removed all punctuation to produce clean, word-only tokens.

In [5]:
# remove punctuation
import re
# raw string so that \w and \s are passed to the regex engine unchanged
raw_data['text'] = raw_data['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
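
To see what this substitution does, and why its position relative to lemmatization matters, here is a small self-contained sketch. The `toy_lemmas` table is a hypothetical stand-in for `WordNetLemmatizer` (it is not nltk's real behavior), used only so the example runs without downloading any corpora.

```python
import re

# Same pattern as the cell above: drop everything that is not a
# word character or whitespace.
PUNCT = re.compile(r'[^\w\s]')

def strip_punct(text):
    return PUNCT.sub('', text)

# Hypothetical stand-in for a lemmatizer: a tiny lookup table.
toy_lemmas = {'credits': 'credit', 'movies': 'movie'}

def toy_lemmatize(text):
    return ' '.join(toy_lemmas.get(tok, tok) for tok in text.split())

sample = "they're second-rate movies, enjoy the credits."

# Order used in this notebook: lemmatize first, then strip punctuation.
lemma_then_strip = strip_punct(toy_lemmatize(sample))
# Alternative order: strip punctuation first, then lemmatize.
strip_then_lemma = toy_lemmatize(strip_punct(sample))

print(lemma_then_strip)  # theyre secondrate movies enjoy the credits
print(strip_then_lemma)  # theyre secondrate movie enjoy the credit
```

In the first ordering the lookup sees `movies,` and `credits.` with punctuation still attached, so neither is matched. A real lemmatizer applied to whitespace-split text can miss words for the same reason.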

The final step before analyzing the text is tokenizing it, so that I can find the most frequent words in the reviews.

In [6]:
# tokenize text
from nltk import word_tokenize
raw_data['review_tokens'] = raw_data['text'].apply(word_tokenize)

In [7]:
# take another quick look at the data to make sure preprocessing is working
raw_data.head()

| Unnamed: 0.1 | Unnamed: 0 | text | sentiment | review_tokens |
|---|---|---|---|---|
| 0 | 0 | enjoy the opening credits theyre the best thin... | neg | [enjoy, the, opening, credits, theyre, the, be... |
| 1 | 1 | well the scifi channel keep churning these tur... | neg | [well, the, scifi, channel, keep, churning, th... |
| 2 | 2 | it take gut to make a movie on gandhi in india... | pos | [it, take, gut, to, make, a, movie, on, gandhi... |
| 3 | 3 | the nest is really just another nature run amo... | neg | [the, nest, is, really, just, another, nature,... |
| 4 | 4 | waco rule of engagement doe a very good job of... | pos | [waco, rule, of, engagement, doe, a, very, goo... |


I decided to look at the most frequent words in a sample review to get a better picture of the data and to determine whether any further preprocessing is required.

In [8]:
# get frequency distribution of words
tokens = raw_data['review_tokens'].tolist()
fdist = nltk.FreqDist(tokens[10])
print(fdist.most_common(5))

[('the', 34), ('a', 25), ('and', 21), ('of', 19), ('to', 18)]


All of the most frequent words are stopwords! Removing them should reduce noise in the tf-idf features built later, and may help the machine learning model too.
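
The check above looks at a single review (`tokens[10]`). A corpus-wide count can be sketched with `collections.Counter` from the standard library. The `toy_tokens` list below is made up for illustration and stands in for `raw_data['review_tokens'].tolist()`.

```python
from collections import Counter

# Made-up tokenized reviews, standing in for the real token lists.
toy_tokens = [
    ['the', 'movie', 'is', 'the', 'best'],
    ['the', 'plot', 'is', 'thin'],
    ['this', 'is', 'a', 'movie', 'about', 'the', 'plot'],
]

# Accumulate counts over every review rather than a single one.
corpus_counts = Counter()
for review in toy_tokens:
    corpus_counts.update(review)

top_two = corpus_counts.most_common(2)
print(top_two)  # [('the', 4), ('is', 3)]
```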

In [9]:
# remove stopwords
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))

def remove_stopwords(text):
    return [x for x in text if x not in sw]

cleaned_text = raw_data['review_tokens'].apply(remove_stopwords)
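
One side effect of the step order is worth noting: punctuation was stripped before this filter, so a contraction such as `don't` has already become `dont` and no longer matches a stopword entry that contains an apostrophe (nltk's English list includes `don't`). A minimal sketch, using a hand-written `toy_sw` set (hypothetical, far smaller than nltk's list) and plain whitespace splitting instead of `word_tokenize`:

```python
import re

# Hypothetical miniature stopword list; nltk's real list is much longer.
toy_sw = {'the', 'i', "don't", 'this'}

def remove_stopwords(tokens, stopword_set):
    return [t for t in tokens if t not in stopword_set]

review = "i don't like the plot"

# Order used in this notebook: strip punctuation, then filter.
stripped_first = remove_stopwords(
    re.sub(r'[^\w\s]', '', review).split(), toy_sw)

# Alternative: filter the whitespace tokens first, then strip punctuation.
filtered_first = [
    re.sub(r'[^\w\s]', '', t)
    for t in remove_stopwords(review.split(), toy_sw)
]

print(stripped_first)  # ['dont', 'like', 'plot']
print(filtered_first)  # ['like', 'plot']
```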

In [10]:
# recreate the reviews in the correct format for developing the model
final_text = [' '.join(x) for x in cleaned_text]
final_text[0]

'enjoy opening credits theyre best thing secondrate inoffensive timekiller feature passable performance like eric robert martin kove main part however go newcomer tommy lee thomas look bit diminutive kind action nevertheless occasionally manages project bantyrooster kind belligerence first time see hes barechested sweaty engaged favorite beefcake activity chopping wood ha seven scene without shirt including one hes hanged wrist zapped electricity la mel gibson lethal weapon could use better script however since manner expose truth corruption violence inside prison never convincing theres also talk million dollar apparently tied investigation never explained pluses though sending john woodrow undercover john wilson amusing play presidential name costar jody ross nolan show promise inmate early proceedings shown hanged wrist getting punched burly guard one final note movies low budget painfully responsible lack extras despite impressive size prison seems hold 12 inmates note cast credit 

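The cleaned review looks suitable for analysis, with stopwords removed and most words lemmatized. A few artifacts remain: `ha` (from "has", because `WordNetLemmatizer` treats every word as a noun by default), contractions such as `theyre` and `hes`, and plurals such as `credits` and `extras` that were not lemmatized, probably because they still had punctuation attached at that stage.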

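Now it's time to split the data. I chose an 80/20 split for the train and test sets, which is a common choice in machine learning. The split below is positional (the first 32,000 reviews for training and the remaining 8,000 for testing), so it assumes the rows of the file are not ordered by sentiment.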

In [11]:
# split data into train and test sets
train_text = final_text[:32000]
train_labels = all_lables[:32000]
test_text = final_text[32000:]
test_labels = all_lables[32000:]
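
If the row order cannot be trusted, a shuffled and stratified split is a common alternative. The sketch below uses scikit-learn's `train_test_split` on made-up data (`toy_text` and `toy_labels` are illustrative only, and deliberately sorted by label) to contrast it with slicing by position. It is not the split used to produce the results in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up data, sorted by label on purpose to show the worst case.
toy_text = [f'review {i}' for i in range(20)]
toy_labels = ['pos'] * 10 + ['neg'] * 10

# Positional 80/20 split: the test set ends up with one class only.
positional_test_labels = toy_labels[16:]

# Shuffled split that preserves the class ratio in both parts.
tr_text, te_text, tr_labels, te_labels = train_test_split(
    toy_text, toy_labels,
    test_size=0.2, stratify=toy_labels, random_state=0)

print(Counter(positional_test_labels))  # only 'neg' labels
print(len(tr_text), len(te_text))       # 16 4
print(Counter(te_labels))               # two 'pos' and two 'neg'
```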

### Developing and training the model 

Tf-idf is a suitable method for feature extraction here, as it gives more weight to words that distinguish one review from another, which should help categorize the data correctly.
I initially decided not to set a maximum number of features, as limiting the features produced lower accuracy. However, I found that this led to overfitting, so I set `max_features` to 2000.
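
As a small illustration of what the feature cap does: `TfidfVectorizer(max_features=k)` keeps only the `k` terms with the highest term frequency across the corpus. The three `toy_docs` below are invented for this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented corpus: 'film' occurs 5 times, 'plot' 3 times, 'actor' once.
toy_docs = ['film film film plot', 'film plot actor', 'film plot']

full = TfidfVectorizer().fit(toy_docs)
capped = TfidfVectorizer(max_features=2).fit(toy_docs)

print(sorted(full.vocabulary_))    # ['actor', 'film', 'plot']
print(sorted(capped.vocabulary_))  # ['film', 'plot']
```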

I believe logistic regression is a good algorithm to use in this case because the labels are binary (pos and neg), and it is one of the most common methods for binary classification.

In [12]:
# tf-idf and logistic regression
from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 2000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_text)

from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(solver='lbfgs').fit(train_vecs, train_labels)

# make prediction
test_pred_lr = clf_lr.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc_lr = accuracy_score(test_labels, test_pred_lr)
pre_lr, rec_lr, f1_lr, _ = precision_recall_fscore_support(test_labels, test_pred_lr, average='macro')
print('accuracy:', acc_lr)
print('precision:', pre_lr)
print('rec:', rec_lr)
print('f1:', f1_lr)

accuracy: 0.86925
precision: 0.8692932727465921
rec: 0.8692857942885943
f1: 0.8692499264530836
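
A caveat about the vectorization in the cell above: building a second `TfidfVectorizer` with the training vocabulary and calling `fit_transform` on the test text keeps the same columns, but re-estimates the idf weights from the test reviews. The usual pattern is to call `transform` on the already fitted vectorizer, so the test features use the training idf weights. The toy sketch below (invented sentences, not the coursework data) shows that the two approaches give different feature values. The metrics reported above come from the original cell, not from this variant.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ['good movie', 'bad movie', 'good good plot']
toy_test = ['good plot', 'bad acting movie']

vec = TfidfVectorizer()
vec.fit(toy_train)

# Reuse the fitted vectorizer: vocabulary and idf both come from training.
reused = vec.transform(toy_test).toarray()

# Pattern from the cell above: same vocabulary, idf re-fitted on test text.
refit = TfidfVectorizer(vocabulary=vec.vocabulary_).fit_transform(toy_test).toarray()

print(reused.shape == refit.shape)  # True
print(np.allclose(reused, refit))   # False
```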


An accuracy of 0.87 is good for this data, but I'm also interested in how other algorithms perform, so next I will try tf-idf with Naive Bayes, another common algorithm for text classification. I use the same feature limit in tf-idf so that the two algorithms are compared on identical features.

In [13]:
# tf-idf and naive bayes
from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 2000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_text)


from sklearn.naive_bayes import MultinomialNB
clf_nb = MultinomialNB().fit(train_vecs, train_labels)

# make prediction
test_pred_nb = clf_nb.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc_nb = accuracy_score(test_labels, test_pred_nb)
pre_nb, rec_nb, f1_nb, _ = precision_recall_fscore_support(test_labels, test_pred_nb, average='macro')
print('accuracy:', acc_nb)
print('precision:', pre_nb)
print('rec:', rec_nb)
print('f1:', f1_nb)

accuracy: 0.841625
precision: 0.8416383585551622
rec: 0.8416470492937747
f1: 0.8416247005716995


With an accuracy of 0.84, Naive Bayes is slightly less accurate than logistic regression (0.87) on this test set. Logistic regression is therefore the better-performing algorithm for this data.
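
The comparison above rests on a single train/test split. One way to check that the gap is stable would be k-fold cross-validation, sketched below with scikit-learn pipelines on a tiny invented dataset. The names `toy_text`, `toy_labels`, `lr_pipe` and `nb_pipe` are hypothetical, and the scores on this toy data say nothing about the real reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented data: 10 positive and 10 negative mini-reviews.
toy_text = ([f'great wonderful film {i}' for i in range(10)]
            + [f'boring awful film {i}' for i in range(10)])
toy_labels = ['pos'] * 10 + ['neg'] * 10

# Each pipeline refits tf-idf inside every fold, so no fold sees
# statistics computed from its own held-out reviews.
lr_pipe = make_pipeline(TfidfVectorizer(max_features=2000),
                        LogisticRegression(solver='lbfgs'))
nb_pipe = make_pipeline(TfidfVectorizer(max_features=2000),
                        MultinomialNB())

lr_scores = cross_val_score(lr_pipe, toy_text, toy_labels, cv=5)
nb_scores = cross_val_score(nb_pipe, toy_text, toy_labels, cv=5)

print('logistic regression:', lr_scores.mean())
print('naive bayes:', nb_scores.mean())
```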

### Save the trained model 

In [14]:
import pickle

# save model and other necessary modules
info_save = {
    'model': clf_lr,
    'vectorizer': TfidfVectorizer(vocabulary=train_vectorizer.vocabulary_)
}
# use a context manager so the file is flushed and closed after writing
with open("trained_model.pickle", "wb") as save_file:
    pickle.dump(info_save, save_file)
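
For completeness, here is a sketch of a save-and-load round trip. It differs from the cell above in one assumption: it stores the fitted vectorizer itself, so that `transform` can be called directly after loading. The vectorizer saved above is built from the vocabulary only and has not been fitted, so it would need `fit_transform` on the new text. The toy data and the temporary file path are invented for this example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

toy_text = ['great film', 'wonderful film', 'boring film', 'awful film']
toy_labels = ['pos', 'pos', 'neg', 'neg']

vectorizer = TfidfVectorizer()
model = LogisticRegression(solver='lbfgs').fit(
    vectorizer.fit_transform(toy_text), toy_labels)

# Save the model together with the fitted vectorizer (assumption: this
# replaces the vocabulary-only vectorizer used in the notebook).
path = os.path.join(tempfile.mkdtemp(), 'toy_model.pickle')
with open(path, 'wb') as f:
    pickle.dump({'model': model, 'vectorizer': vectorizer}, f)

# Load it back and classify unseen text.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

new_review = ['great wonderful film']
new_vecs = loaded['vectorizer'].transform(new_review)
prediction = loaded['model'].predict(new_vecs)
print(prediction)
```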