# Data Preparation
In this exercise we will work with the IMDB sentiment dataset. This dataset contains movie reviews, each labeled with a positive or negative sentiment (encoded as 1 for positive and 0 for negative).

Reading and preprocessing the data
To import the TSV files, we recommend the pandas package. After loading the required packages and decompressing the archives, the provided files can be imported as follows:

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from keras.datasets import imdb

Decompress data

In [2]:
import zipfile
import os

path = os.getcwd()
# Extract both archives into the current working directory.
# The with-statement closes each archive automatically.
for archive in ("./labeledTestData.zip", "./labeledTrainData.zip"):
    with zipfile.ZipFile(archive, 'r') as f:
        f.extractall(path)

In [3]:
# load data as pandas dataframe
train = pd.read_csv('labeledTrainData.tsv', 
                    header=0,
                    delimiter="\t", 
                    quoting=3 )
test = pd.read_csv('labeledTestData.tsv',
                   header=0,
                   delimiter="\t",
                   quoting=3 )

Print the first 10 samples, which contain id, sentiment and review.
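
With the DataFrame loaded above, `train.head(10)` displays these rows. The value `quoting=3` passed to `read_csv` equals `csv.QUOTE_NONE`, so quote characters are kept as ordinary text instead of being interpreted as field delimiters. The sketch below illustrates this with the standard library only; the two-row TSV string is made up for illustration and is not taken from the real data.

```python
import csv
import io

# Made-up sample in the same column layout as labeledTrainData.tsv (illustration only)
sample_tsv = 'id\tsentiment\treview\n"1_8"\t1\t"A "great" film.<br />Loved it."\n"2_3"\t0\t"Dull and far too long."\n'

# quoting=3 in pandas.read_csv corresponds to csv.QUOTE_NONE
assert csv.QUOTE_NONE == 3

reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
rows = list(reader)

for row in rows[:10]:  # print at most the first 10 samples
    print(row["id"], row["sentiment"], row["review"])
```

Because the quotes are not stripped, each id and review keeps its surrounding double quotes, which is also why the review texts printed later start with a `"` character.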

The text strings contain HTML tags, which have to be removed. Then print two example reviews without HTML tags.

In [4]:
from bs4 import BeautifulSoup

example1 = BeautifulSoup(train['review'][0], 'lxml').get_text()
print(example1)

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 mi

In [5]:
example2 = BeautifulSoup(train['review'][2], 'lxml').get_text()
print(example2)

"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like ¨Jurassik Park¨, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against on
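
To make the tag removal less of a black box, here is a minimal sketch of the same idea using only the standard library `html.parser` module instead of BeautifulSoup. The class `TextExtractor` and the helper `strip_tags` are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


print(strip_tags("Visually impressive.<br /><br />Some may call it boring."))
```

Tags are dropped without inserting a separator, which is why sentences that were split by `<br />` run together in the outputs above (for example `m'kay.Visually`).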

Next, remove punctuation and numbers using the regular expression (re) package, and convert the text to lower case. Common words are removed in a later step.

In [6]:
import re
# Use a regular expression to do a find-and-replace
lowerletter_only = re.sub('[^a-zA-Z]',   # pattern: any character that is not a letter (a-z, A-Z)
                          ' ',           # replacement: a single space
                          example1       # text to search
                          ).lower()      # convert the result to lower case
print(lowerletter_only)

 with all this stuff going down at the moment with mj i ve started listening to his music  watching the odd documentary here and there  watched the wiz and watched moonwalker again  maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  some of it has subtle messages about mj s feeling towards the press and also the obvious message of drugs are bad m kay visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring  some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him the actual feature film bit when it finally starts is only on for    mi
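
The same substitution on a short, made-up sentence shows the effect more clearly. Every character that is not a letter is replaced by one space, so digits and punctuation leave runs of spaces behind, and contractions such as "it's" are split into two tokens.

```python
import re

# Made-up example sentence (not from the dataset)
text = "It's only on for 20 minutes, BUT it's great!"

# Replace every non-letter character with a space, then lower-case the result
cleaned = re.sub('[^a-zA-Z]', ' ', text).lower()
print(repr(cleaned))

# Splitting on whitespace collapses the runs of spaces again
tokens = cleaned.split()
print(tokens)
```

The leftover single letters such as `s` are harmless here, since the stop word list used below contains several of them.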

Next, split the string into individual words.

In [7]:
words = lowerletter_only.split()
print(words)

['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'course', 'this', 'is', 'all', 'about', 

Now, we'd like to remove common words that do not carry much meaning, such as 'a', 'the' or 'is'. These are often referred to as stop words.
A list of stop words can be obtained with the NLTK package:

In [8]:
import nltk 
nltk.download('stopwords')         # Download text data sets, including stop words
from nltk.corpus import stopwords   # Import the stop word list
stops = stopwords.words('english')
print(stops)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\96389\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
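
Filtering then amounts to a membership test per word. The sketch below uses a small hand-picked subset of the list above (`toy_stops`, a hypothetical name) so that it runs without downloading anything; in the notebook you would use the full `stops` list instead.

```python
# Hand-picked subset of the NLTK English stop words, hard-coded for illustration
toy_stops = {"with", "all", "this", "down", "at", "the"}

# First few words of the tokenized example review
words = ["with", "all", "this", "stuff", "going", "down", "at", "the", "moment"]

# Keep only the words that are not stop words
filtered = [word for word in words if word not in toy_stops]
print(filtered)  # ['stuff', 'going', 'moment']
```

Storing the stop words in a set rather than a list makes each lookup a constant-time operation, which matters when the filter runs over every word of every review.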


# Write a function review_prepro
The function should combine the above steps and return a preprocessed review.

In [9]:
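# function for preprocessing the data
def review_prepro(data):
    # 1. Strip HTML tags
    review_text = BeautifulSoup(data, 'lxml').get_text()
    # 2. Replace every non-letter character with a space
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)
    # 3. Convert to lower case and split into words
    all_words = letters_only.lower().split()
    # 4. Remove stop words (a set gives fast membership tests)
    en_stops = set(stopwords.words('english'))
    filtered_review = [word for word in all_words if word not in en_stops]
    # 5. Join the remaining words back into a single string
    return ' '.join(filtered_review)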

In [10]:
# preprocess train data
num_reviews = train['review'].size
filtered_train_reviews = []

for i in range(num_reviews):
    filtered_train_reviews.append(review_prepro(train['review'][i]))





# Creating Features from a Bag of Words

To generate a bag-of-words model, we will use the scikit-learn package. Use the following code:

In [11]:
from sklearn.feature_extraction.text import CountVectorizer

# define the vectorizer
vectorizer = CountVectorizer(analyzer = 'word', 
                             tokenizer = None,  
                             preprocessor = None,
                             stop_words = None,
                             max_features = 50 )

#fit the vectorizer to the data
train_data_features = vectorizer.fit_transform(filtered_train_reviews)
# convert to numpy array
train_data_features = train_data_features.toarray()
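
To illustrate what the resulting feature matrix contains, the following pure-Python sketch builds a bag of words for a tiny, made-up corpus. It is not scikit-learn's implementation and only mirrors the idea: keep the `max_features` most frequent words of the corpus as columns and count how often each one occurs in every document.

```python
from collections import Counter

# Tiny made-up corpus of already preprocessed reviews
docs = ["good good good movie", "bad movie plot", "good movie bad acting"]
max_features = 3

# Count how often each word occurs across the whole corpus
total_counts = Counter(word for doc in docs for word in doc.split())

# Keep the most frequent words and sort them alphabetically to fix the column order
vocab = sorted(word for word, _ in total_counts.most_common(max_features))

# One row per document, one column per vocabulary word
features = [[doc.split().count(word) for word in vocab] for doc in docs]

print(vocab)     # ['bad', 'good', 'movie']
print(features)  # [[0, 3, 1], [1, 0, 1], [1, 1, 1]]
```

Words outside the vocabulary ("plot", "acting") are simply dropped, so with `max_features = 50` each review in the notebook is reduced to 50 counts.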


# Black box classifier

Use a prebuilt classifier and train it on the training data, then evaluate the learned classifier on the test data. The test reviews are first preprocessed and vectorized in the same way as the training reviews.

In [12]:
num_test_reviews = test['review'].size
filtered_test_reviews = []
for i in range(num_test_reviews):
    filtered_test_reviews.append(review_prepro(test['review'][i]))

test_data_features = (vectorizer.transform(filtered_test_reviews)).toarray()

To train a classifier with logistic regression, use the following code:

In [13]:
# Method: logistic regression from sklearn.linear_model
from sklearn.linear_model import LogisticRegression as LR

logistic_model = LR()
logistic_model.fit(train_data_features, train['sentiment'])
p = logistic_model.predict_proba( test_data_features)[:,1] #Probability of sentiment from test data
output = pd.DataFrame( data={'id':test['id'], 'sentiment':p})
print(output[:10])

          id  sentiment
0  "12081_1"   0.101423
1   "3951_2"   0.726068
2  "10492_1"   0.678182
3   "3350_3"   0.477912
4   "9495_8"   0.293705
5   "4656_4"   0.485859
6   "9983_3"   0.392802
7   "4439_4"   0.496138
8   "8516_2"   0.891603
9  "11232_1"   0.340738
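
The probabilities in the `sentiment` column come from the logistic function applied to a weighted sum of the word counts. The sketch below shows that computation for a single review; the weights, bias and counts are hypothetical values chosen for illustration, not the parameters learned above.

```python
import math


def sigmoid(z):
    """Logistic function, maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for a vocabulary of three words
weights = [0.8, -1.2, 0.3]
bias = -0.1

# Hypothetical word counts of one review
counts = [2, 0, 1]

# Weighted sum of the counts, then squash to a probability
z = bias + sum(w * x for w, x in zip(weights, counts))
p_positive = sigmoid(z)
print(round(p_positive, 3))
```

A value above 0.5 means the model considers the review more likely positive than negative.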


# Evaluate result 

We evaluate the performance with two metrics: the area under the ROC curve (AUC, the curve of true positive rate against false positive rate) and the mean accuracy returned by the classifier's score method. An AUC of 0.5 corresponds to a random classifier; the closer the score is to 1, the better.

In [14]:
from sklearn.metrics import roc_auc_score as AUC
# score() of a scikit-learn classifier returns the mean accuracy on the given data
accuracy = logistic_model.score(test_data_features, test['sentiment'])
auc = AUC(test['sentiment'].values, p)
print('Accuracy:', accuracy)
print('AUC score:', auc)

Accuracy: 0.705
AUC score: 0.7743503436135539
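
The AUC can also be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The following sketch computes it that way on a small made-up example, next to the accuracy at a threshold of 0.5; the helper names `auc_by_pairs` and `accuracy_at` are hypothetical, and the pairwise loop is meant for illustration rather than for large datasets.

```python
def auc_by_pairs(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_at(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)


# Made-up labels and predicted probabilities
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(auc_by_pairs(labels, scores))  # 0.75
print(accuracy_at(labels, scores))   # 0.75
```

Unlike the accuracy, the AUC does not depend on a particular threshold: it only looks at how the scores rank positive reviews relative to negative ones.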
