# Bag of Words meet Bag of Popcorn

This is a solution of **Bag of Words Model** for The **$Kaggle$** competition **Bag of Words meet Bag of Popcorn**

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

# File descriptions

**labeledTrainData** - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.  

**testData** - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one.

**unlabeledTrainData** - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. 

**sampleSubmission** - A comma-delimited sample submission file in the correct format.


In [4]:
import pandas as pd

Let us first load the dataset. Dataset is provided in a .tsv format file so we will open dataset via pandas library.

In [6]:
train = pd.read_csv('labeledTrainData.tsv',delimiter = '\t')

In [7]:
train.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


Let's see the first review

In [8]:
train.review[0]

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

As we can see the review seems to be in html format. We need to convert the html format into plain text. We can do this via Regular Expression or via BeautifulSoup. BeautifulSoup should be preffered for such tasks because in regex we have to check various conditions.

In [9]:
from bs4 import BeautifulSoup

In [11]:
BeautifulSoup(train['review'][2]).get_text()

u"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like \xa8Jurassik Park\xa8, and some scientists resurrect one of nature's most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight aga

We can see that all the html tags have been removed. Now its time for us to remove all the '(' , '\', ',' etc. from the text since all these things have no meaning in our simple model of bag of words. To remove these we will use regular expression library called re ( we can't use beautiful soup because its for html tags) . Regular expression can be used for removing, replacing or substituting text.

In [25]:
text = BeautifulSoup(train['review'][2]).get_text()
import re
text = re.sub('[^a-zA-Z]',' ',text)
text

u'The film starts with a manager  Nicholas Bell  giving welcome investors  Robert Carradine  to Primal Park   A secret project mutating a primal animal using fossilized DNA  like  Jurassik Park   and some scientists resurrect one of nature s most fearsome predators  the Sabretooth tiger or Smilodon   Scientific ambition turns deadly  however  and when the high voltage fence is opened the creature escape and begins savagely stalking its prey   the human visitors   tourists and scientific Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre historical animals which are deadlier and bigger   In addition   a security agent  Stacy Haiduk  and her mate  Brian Wimmer  fight hardly against the carnivorous Smilodons  The Sabretooths  themselves   of course  are the real star stars and they are astounding terrifyingly though not convincing  The giant animals savagely are stalking its prey and the group run afoul and fight against o

The words like 'the', 'of','and' etc. are called **stopwords** . These words are generally removed because they occur in every scentece and have less significance. 

In [26]:
from nltk.corpus import stopwords
text = text.lower().split(' ')
text = [w for w in text if w not in stopwords.words('english')]
text

[u'film',
 u'starts',
 u'manager',
 u'',
 u'nicholas',
 u'bell',
 u'',
 u'giving',
 u'welcome',
 u'investors',
 u'',
 u'robert',
 u'carradine',
 u'',
 u'primal',
 u'park',
 u'',
 u'',
 u'secret',
 u'project',
 u'mutating',
 u'primal',
 u'animal',
 u'using',
 u'fossilized',
 u'dna',
 u'',
 u'like',
 u'',
 u'jurassik',
 u'park',
 u'',
 u'',
 u'scientists',
 u'resurrect',
 u'one',
 u'nature',
 u'fearsome',
 u'predators',
 u'',
 u'sabretooth',
 u'tiger',
 u'smilodon',
 u'',
 u'',
 u'scientific',
 u'ambition',
 u'turns',
 u'deadly',
 u'',
 u'however',
 u'',
 u'high',
 u'voltage',
 u'fence',
 u'opened',
 u'creature',
 u'escape',
 u'begins',
 u'savagely',
 u'stalking',
 u'prey',
 u'',
 u'',
 u'human',
 u'visitors',
 u'',
 u'',
 u'tourists',
 u'scientific',
 u'meanwhile',
 u'youngsters',
 u'enter',
 u'restricted',
 u'area',
 u'security',
 u'center',
 u'attacked',
 u'pack',
 u'large',
 u'pre',
 u'historical',
 u'animals',
 u'deadlier',
 u'bigger',
 u'',
 u'',
 u'addition',
 u'',
 u'',
 u'securi

Joining the text again we will get sentence which contain words other than stopwords

In [27]:
' '.join(text)

u'film starts manager  nicholas bell  giving welcome investors  robert carradine  primal park   secret project mutating primal animal using fossilized dna  like  jurassik park   scientists resurrect one nature fearsome predators  sabretooth tiger smilodon   scientific ambition turns deadly  however  high voltage fence opened creature escape begins savagely stalking prey   human visitors   tourists scientific meanwhile youngsters enter restricted area security center attacked pack large pre historical animals deadlier bigger   addition   security agent  stacy haiduk  mate  brian wimmer  fight hardly carnivorous smilodons  sabretooths    course  real star stars astounding terrifyingly though convincing  giant animals savagely stalking prey group run afoul fight one nature fearsome predators  furthermore third sabretooth dangerous slow stalks victims movie delivers goods lots blood gore beheading  hair raising chills full scares sabretooths appear mediocre special effects story provides e

Lets store everthing in function so that we can use it.

In [29]:
def data_cleaning(sentence):
    text = BeautifulSoup(sentence).get_text()
    text = re.sub('[^a-zA-Z]',' ',text)
    text = text.lower()
    text = [w for w in text.split(' ') if w not in stopwords.words('english')]
    text = ' '.join(text)
    return text

In [30]:
length = train.review.shape[0]
for i in range(length):
    train.review[i] = data_cleaning(train.review[i])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  app.launch_new_instance()


# Bag of Words

Bag of Words (BoW) is an algorithm that counts how many times a word appears in a document. The word count allow us to compare the documents and find their similarities.
Example:- There are 2 sentences
1. Ram is a boy
2. Shyam is a good boy.

Distinct words here are:-
{Ram,is,a,boy,Shyam,good}

Bag of word will count their occurances in each sentece
1. First Sentence :- {1,1,1,1,0,0}
2. Second Sentence :- {0,1,1,1,1,1}


In [32]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = "word",tokenizer = None,preprocessor = None,stop_words = None,max_features = 5000)


In [35]:
reviews_vectorized = vectorizer.fit_transform(train.review.values)

In [36]:
reviews_vectorized.shape

(25000, 5000)

Loading Test Set and converting it in same form as of Training set.

In [39]:
test = pd.read_csv('testData.tsv',delimiter='\t')
test.head()

Unnamed: 0,id,review
0,12311_10,Naturally in a film who's main themes are of m...
1,8348_2,This movie is a disaster within a disaster fil...
2,5828_4,"All in all, this is a movie for kids. We saw i..."
3,7186_2,Afraid of the Dark left me with the impression...
4,12128_7,A very accurate depiction of small time mob li...


In [40]:
length = train.review.shape[0]
for i in range(length):
    test.review[i] = data_cleaning(test.review[i])

In [41]:
test_reviews_vectorized = vectorizer.fit_transform(test.review.values)
test_reviews_vectorized.shape

(25000, 5000)

Now, We have shaped both training data as well as testing data. Its time to apply Machine learning now. Lets apply random Forest.
We will be using Sklearn library for machine learning.

In [42]:
from sklearn.ensemble import RandomForestClassifier

In [72]:
#First we create a classifier
classifier = RandomForestClassifier(n_estimators= 100,n_jobs=3)

In [53]:
#Lets fit the Random Forest Classifier to our training dataset
classifier.fit(reviews_vectorized,train.sentiment)
#Testing the classifier on test dataset
predictions = classifier.predict(test_reviews_vectorized)

In [55]:
output = pd.DataFrame( data={"id":test["id"], "sentiment":predictions} )
output.to_csv( "Bag_of_Words_model.csv", index=False, quoting=3 )

##### 60 percent accuracy is achieved.


# Lemmatization
To increase accuracy , lets do tokenization of the data

In [57]:
train.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,stuff going moment mj started listening music ...
1,2381_9,1,classic war worlds timothy hines entertaini...
2,7759_3,0,film starts manager nicholas bell giving wel...
3,3630_4,0,must assumed praised film greatest filmed op...
4,9495_8,1,superbly trashy wondrously unpretentious ex...


In [58]:
from spacy.en import English
parser = English()

In [65]:
length = train.review.shape[0]
for i in range(length):
    parsedData = parser(unicode(train.review[i]))
    word = [token.lemma_ for token in parsedData]
    sentence = ' '.join(word)
    train.review[i] = sentence

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [66]:
length = test.review.shape[0]
for i in range(length):
    parsedData = parser(unicode(test.review[i]))
    word = [token.lemma_ for token in parsedData]
    sentence = ' '.join(word)
    test.review[i] = sentence

In [69]:
reviews_vectorized = vectorizer.fit_transform(train.review.values)
test_reviews_vectorized = vectorizer.fit_transform(test.review.values)

In [73]:
#Lets fit the Random Forest Classifier to our training dataset
classifier.fit(reviews_vectorized,train.sentiment)
#Testing the classifier on test dataset
predictions = classifier.predict(test_reviews_vectorized)

In [75]:
output = pd.DataFrame( data={"id":test["id"], "sentiment":predictions} )
output.to_csv( "Bag_of_Words_model1.csv", index=False, quoting=3 )

#### 54% accuracy

This accuracy is less than the accuracy I achieved above which was 60 percent. This accuracy can be increased by selecting features via cross validation and applying Machine learning with better parameters.I will be working on that to increase accuracy and try different models.