In this notebook, I try to understand basic NLP techniques, such as the bag-of-words model, by working through a Kaggle tutorial.
The tutorial is at https://www.kaggle.com/c/word2vec-nlp-tutorial

In [1]:
# import libraries
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
from operator import itemgetter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

In [2]:
# load data using pandas
train = pd.read_csv('input/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
test = pd.read_csv('input/testData.tsv', header=0, delimiter='\t', quoting=3)
dataset = [train, test]
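
A note on `quoting=3` in the `read_csv` calls above: the value 3 is the standard library's `csv.QUOTE_NONE` constant, which tells the parser to treat double quotes as ordinary characters. The sketch below uses only the standard `csv` module on a made-up two-line TSV string (not the real data file) to show the effect, which is why the `id` and `review` values keep their surrounding quotes.

```python
import csv
import io

# pandas' quoting=3 is the same constant as csv.QUOTE_NONE
print(csv.QUOTE_NONE)  # 3

# Toy TSV text for illustration only (shortened, not read from the real file)
tsv_text = 'id\tsentiment\treview\n"5814_8"\t1\t"With all this stuff ..."\n'

# With QUOTE_NONE, quote characters are kept as part of each field
rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE))
header, first_row = rows
print(first_row)  # ['"5814_8"', '1', '"With all this stuff ..."']
```

Keeping the quotes untouched avoids parsing problems with reviews that contain unbalanced quotation marks; the stray quotes disappear later when non-letter characters are stripped.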

In [3]:
train.head()

   id        sentiment  review
0  "5814_8"  1          "With all this stuff going down at the moment ...
1  "2381_9"  1          "\"The Classic War of the Worlds\" by Timothy ...
2  "7759_3"  0          "The film starts with a manager (Nicholas Bell...
3  "3630_4"  0          "It must be assumed that those who praised thi...
4  "9495_8"  1          "Superbly trashy and wondrously unpretentious ...


In [4]:
test.head()

   id          review
0  "12311_10"  "Naturally in a film who's main themes are of ...
1  "8348_2"    "This movie is a disaster within a disaster fi...
2  "5828_4"    "All in all, this is a movie for kids. We saw ...
3  "7186_2"    "Afraid of the Dark left me with the impressio...
4  "12128_7"   "A very accurate depiction of small time mob l...


In [5]:
train.info()
print('=======================')
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 586.0+ KB
=======================
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
id        25000 non-null object
review    25000 non-null object
dtypes: object(2)
memory usage: 390.7+ KB


In [6]:
train['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [7]:
# define a function to clean a review

# build the stop word set once instead of on every call
# (requires a one-time nltk.download('stopwords'))
STOPS = set(stopwords.words('english'))

def review_to_words(raw_review):
    # input : raw review string (e.g. train['review'][0])
    # output : string of cleaned, space-separated words

    # remove HTML tags with BeautifulSoup; name the parser explicitly so
    # bs4 does not warn and guess one ('html.parser' is the built-in alternative)
    review_text = BeautifulSoup(raw_review, 'html5lib').get_text()

    # replace numbers and punctuation with spaces
    letters_only = re.sub('[^a-zA-Z]', ' ', review_text)

    # lowercase and split the string into a list of words
    words = letters_only.lower().split()

    # remove stop words
    meaningful_words = [word for word in words if word not in STOPS]

    # join the words into a single string and return it
    cleaned_review = ' '.join(meaningful_words)

    return cleaned_review

In [8]:
review_to_words(train['review'][0])





'stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate workin
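
To see what each cleaning step does on something shorter than a full review, here is a minimal standard-library sketch of the same pipeline. It is an approximation, not the function above: `toy_review_to_words` and `TOY_STOPS` are hypothetical names, the stop word list is a tiny hand-picked set instead of NLTK's, and tags are stripped with a crude regex instead of BeautifulSoup.

```python
import re

# Hypothetical, tiny stop word list for illustration only;
# the notebook uses NLTK's much larger English list.
TOY_STOPS = {"a", "the", "is", "of", "this", "and", "it", "i", "but"}

def toy_review_to_words(raw_review, stops=TOY_STOPS):
    # Crude tag stripping with a regex (the notebook uses BeautifulSoup instead)
    text = re.sub(r"<[^>]+>", " ", raw_review)
    # Replace every non-letter character (digits, punctuation) with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lowercase and split on whitespace
    words = letters_only.lower().split()
    # Drop stop words and rejoin into one string
    return " ".join(w for w in words if w not in stops)

sample = "This movie is GREAT!<br /><br />I watched it 2 times, but the ending..."
cleaned = toy_review_to_words(sample)
print(cleaned)  # movie great watched times ending
```

Because apostrophes are also replaced by spaces, contractions are split into pieces, which is why "m'kay" in the first review ends up as just "kay" in the cleaned output.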

In [9]:
# apply the function to all raw reviews in both train and test
print('cleaning up raw reviews...')
for data in dataset:
    data['cleaned_reviews'] = data['review'].apply(review_to_words)
print('finished')





cleaning up raw reviews...
finished


In [10]:
train['cleaned_reviews'].head()

0    stuff going moment mj started listening music ...
1    classic war worlds timothy hines entertaining ...
2    film starts manager nicholas bell giving welco...
3    must assumed praised film greatest filmed oper...
4    superbly trashy wondrously unpretentious explo...
Name: cleaned_reviews, dtype: object

In [11]:
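# bag of words: count word occurrences per review,
# keeping only the 5000 most frequent words as features
vectorizer = CountVectorizer(analyzer='word', tokenizer=None, 
                             preprocessor=None, stop_words=None, 
                             max_features=5000)

# learn the vocabulary and build the (reviews x words) count matrix
cleaned_reviews = train['cleaned_reviews'].tolist()
train_data_features = vectorizer.fit_transform(cleaned_reviews)
# convert the sparse matrix to a dense numpy array
train_data_features = train_data_features.toarray()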
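
To make the bag-of-words idea concrete, the following is a minimal pure-Python sketch of the kind of count matrix that `fit_transform` produces. The three toy documents are made up for illustration, and the sketch leaves out what `CountVectorizer` does in addition (its own tokenization rules and the `max_features` cut-off).

```python
from collections import Counter

# Toy corpus, invented for illustration
docs = ["good movie", "bad movie", "good good film"]

# Vocabulary: the sorted set of all words seen in the corpus
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word;
# each cell holds how often that word occurs in that document
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

# Column sums give the total count of each word over the whole corpus
totals = [sum(column) for column in zip(*matrix)]

print(vocab)   # ['bad', 'film', 'good', 'movie']
print(matrix)  # [[0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 2, 0]]
print(totals)  # [1, 1, 3, 2]
```

Word order within a review is discarded entirely; each review is represented only by how many times each vocabulary word appears in it.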

In [12]:
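# count how often each vocabulary word occurs across all reviews,
# then sort the [word, count] pairs in descending order of count

# note: get_feature_names() was removed in scikit-learn 1.2;
# on newer versions use get_feature_names_out() instead
vocab = vectorizer.get_feature_names()
counts = np.sum(train_data_features, axis=0)

word_count = [[word, count] for word, count in zip(vocab, counts)]
wordcount = sorted(word_count, key=itemgetter(1), reverse=True)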

In [13]:
wordcount[:5]

[['movie', 44031],
 ['film', 40147],
 ['one', 26788],
 ['like', 20274],
 ['good', 15140]]

In [14]:
# set the training features and labels
x_train = train_data_features
y_train = train['sentiment']

In [15]:
# random forest
rf = RandomForestClassifier(n_estimators=300, max_depth=6)
rf.fit(x_train, y_train)
# accuracy on the training data itself (not a held-out estimate)
rf.score(x_train, y_train)

0.83652000000000004
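
The score above is measured on the same reviews the forest was trained on, so it likely overstates how well the model will do on unseen reviews. One common way to get a more honest estimate is to hold out part of the labeled data for validation. The sketch below shows the pattern on synthetic data from `make_classification` (a stand-in, not the review features); the split ratio, forest size, and variable names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (x_train, y_train); not the notebook's review data
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Hold out 25% of the labeled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Small forest so the sketch runs quickly
model = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=0)
model.fit(X_fit, y_fit)

train_acc = model.score(X_fit, y_fit)  # accuracy on rows the model has seen
val_acc = model.score(X_val, y_val)    # accuracy on held-out rows

print('train accuracy:', train_acc)
print('validation accuracy:', val_acc)
```

Applied to this notebook, the same split would be done on `train_data_features` and `train['sentiment']` before fitting the forest.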

In [16]:
# transform (do not re-fit) the test reviews using the vocabulary learned from train
test_data_features = vectorizer.transform(test['cleaned_reviews'])
test_data_features = test_data_features.toarray()
x_test = test_data_features

In [17]:
prediction = rf.predict(x_test)

In [18]:
# build the submission file (id, predicted sentiment) without the row index
submission = pd.DataFrame({'id': test['id'],
                           'sentiment': prediction})
submission.to_csv('submission/submission_rf.csv', index=False, quoting=3)