Author : [Vu Tran](https://github.com/tranlyvu). Other info is on [github](https://github.com/tranlyvu/kaggle/tree/master/Bag%20of%20Words%20Meets%20Bags%20of%20Popcorn)


#Kaggle Competition: Bag of Words Meets Bags of Popcorn

*   Info from Competition Site
    *   Description
    *   Evaluation
    *   Data Set
*  First attempt: 
    *  Working with data: exploring labeled Data Set
    *  Feature 'review'
        *  Processing raw text
        *  Transforming feature 'review': bag-of-words model
        *  Extending bag-of-words with TF-IDF weights
        *  Dimensionality reduction
    *  Training Naive Bayes
    *  Predicting with Naive Bayes
    *  Preparing for kaggle submission
    *  Performance Evaluation 
        *  Splitting train data set
        *  Evaluating performance using splitted data set
        *  Plotting ROC curve
    *  Hyperparameters (in progress)

#Info from Competition Site

##[Description](https://www.kaggle.com/c/word2vec-nlp-tutorial)

![image](https://raw.githubusercontent.com/tranlyvu/kaggle/master/Bag%20of%20Words%20Meets%20Bags%20of%20Popcorn/image/popcorn%20cvc3.jpg)

  
##[Evaluation](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/evaluation)


Submissions are judged on area under the ROC curve. 

###Submission Instructions

You should submit a comma-separated file with 25,000 row plus a header row. There should be 2 columns: "id" and "sentiment", which contain your binary predictions: 1 for positive reviews, 0 for negative reviews. For an example, see "sampleSubmission.csv" on the Data page. 

id,sentiment
123_45,0 
678_90,1
12_34,0
...
 
##[Data Set](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

###Data Files

File Name 	      | Available Formats
------------------|------------------
sampleSubmission  |	.csv (276.17 kb)
unlabeledTrainData.tsv |	.zip (25.98 mb)
testData.tsv 	|.zip (12.64 mb)
labeledTrainData.tsv |	.zip (12.96 mb)

###Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.

###File descriptions

*  labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.  
*  testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one. 
*  unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. 
*  sampleSubmission - A comma-delimited sample submission file in the correct format.

###Data fields

*  id - Unique ID of each review
*  sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
*  review - Text of the review


#First attempt

##Working with Data

We will first explore  the labled DataSet.

In [2]:
import pandas as pd
#reading  labeled train dataset:
train_data=pd.read_csv("C:/Users/vutran/Desktop/github/kaggle/Bag of Words Meets Bags of Popcorn/data/labeledTrainData.tsv", header=0,delimiter="\t", quoting=3)
train_data.head()

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [3]:
train_data.tail()

Unnamed: 0,id,sentiment,review
24995,"""3453_3""",0,"""It seems like more consideration has gone int..."
24996,"""5064_1""",0,"""I don't believe they made this film. Complete..."
24997,"""10905_3""",0,"""Guy is a loser. Can't get girls, needs to bui..."
24998,"""10194_3""",0,"""This 30 minute documentary Buñuel made in the..."
24999,"""8478_8""",1,"""I saw this movie as a child and it broke my h..."


Notice that 'sentiment' is binary

In [4]:
train_data.dtypes

id           object
sentiment     int64
review       object
dtype: object

Type 'object' is a string for pandas. We shall later convert to number representation,maybe using typical bag-of-words or word2vec

Starting getting basic information of data:

In [5]:
train_data.info(0)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 781.2+ KB


Now that we already have general idea of Data Set. We next clean, transform data to create useful features for machine learning

##Feature 'review'

###Processing raw text

We will start wrting function for analyzing and cleaning the deature 'review', using first review as a point of illustration

In [6]:
train_data.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

Before we can transform text into number representation, we need to process raw text. Let's first remove HTML and puctuation

In [7]:
import nltk
from bs4 import BeautifulSoup
import re
soup=BeautifulSoup(train_data.review[0]).get_text()
letters_only = re.sub("[^a-zA-Z]"," ",soup )
letters_only

u' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    

Now we start stemming and lemmatizing the text, but it is generally better to first create the pos tagger as we only want to lemmatize verb and noum

In [8]:
tokens=nltk.word_tokenize(letters_only.lower())
tagged_words=nltk.pos_tag(tokens)
tagged_words[0:5]

[(u'with', 'IN'),
 (u'all', 'DT'),
 (u'this', 'DT'),
 (u'stuff', 'NN'),
 (u'going', 'VBG')]

Stemming the text: There are genrally 2 stemmers available in nltk, porter and lancaster

In [11]:
porter=nltk.PorterStemmer()
def lemmatize_with_potter(token,tag):
    if tag[0].lower in ['v','n']:
        return  porter.stem(token)
    return token
stemmed_text_with_potter=[lemmatize_with_potter(token,tag) for token,tag in tagged_words]

lancaster=nltk.LancasterStemmer()
def lemmatize_with_lancaster(token,tag):
    if tag[0].lower in ['v','n']:
        return  lancaster.stem(token)
    return token
stemmed_text_with_lancaster=[lemmatize_with_lancaster(token,tag) for token,tag in tagged_words]

In [12]:
stemmed_text_with_potter[0:10]

[u'with',
 u'all',
 u'this',
 u'stuff',
 u'going',
 u'down',
 u'at',
 u'the',
 u'moment',
 u'with']

In [13]:
stemmed_text_with_lancaster[0:10]

[u'with',
 u'all',
 u'this',
 u'stuff',
 u'going',
 u'down',
 u'at',
 u'the',
 u'moment',
 u'with']

Observing that the word 'going' has been stemmed with porter but not with lancaster, I'll choose porter for this task. 

let's lemmatizing

In [14]:
tagged_words_after_stem=nltk.pos_tag(stemmed_text_with_potter)
wnl = nltk.WordNetLemmatizer()
def lemmatize_with_WordNet(token,tag):
    if tag[0].lower in ['v','n']:
        return wnl.lemmatize(token)
    return token
stemmed_and_lemmatized_text=[lemmatize_with_WordNet(token,tag) for token,tag in tagged_words_after_stem]
stemmed_and_lemmatized_text[0:10]

[u'with',
 u'all',
 u'this',
 u'stuff',
 u'going',
 u'down',
 u'at',
 u'the',
 u'moment',
 u'with']

text cleanning summary

In [34]:
import re
from bs4 import BeautifulSoup
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

porter=nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

def stemmatize_with_potter(token,tag):
    if tag[0].lower() in ['v','n']:
        return  porter.stem(token)
    return token


def lemmatize_with_WordNet(token,tag):
    if tag[0].lower() in ['v','n']:
        return wnl.lemmatize(token)
    return token

def corpus_preprocessing(corpus):
    preprocessed_corpus = []
    for sentence in corpus:	
        #remove HTML and puctuation
        soup=BeautifulSoup(sentence).get_text()
        letters_only = re.sub("[^a-zA-Z]"," ",soup )

        #Stemming
        tokens=nltk.word_tokenize(letters_only.lower())
        tagged_words=nltk.pos_tag(tokens)
        stemmed_text_with_potter=[stemmatize_with_potter(token,tag) for token,tag in tagged_words]

        #lemmatization
        tagged_words_after_stem=nltk.pos_tag(stemmed_text_with_potter)
        stemmed_and_lemmatized_text=[lemmatize_with_WordNet(token,tag) for token,tag in tagged_words_after_stem]
        
        #join all the tokens
        clean_review=" ".join(w for w in  stemmed_and_lemmatized_text)
        preprocessed_corpus.append(clean_review)

    return preprocessed_corpus


###Transforming feature 'review': bag-of-words model

Let's transform feature 'review' into numerical representation to feed into machine learning. The common representation of text is the [bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model)

in Sklearn, we can use class CountVectorize to transform the data. We shall also use stop-words to reduce the dimension of feature space. Let's now first 5 data from train Dataset to be test_corpus

In [39]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer(stop_words='english')
test_corpus=train_data.review[0:5]
test_corpus= corpus_preprocessing(test_corpus)
test_corpus=vectorizer.fit_transform(test_corpus)
print test_corpus.todense()

[[0 0 1 ..., 0 0 0]
 [0 0 0 ..., 1 0 0]
 [0 1 0 ..., 0 1 1]
 [0 1 0 ..., 0 0 0]
 [1 1 1 ..., 1 0 0]]


###Extending bag-of-words with TF-IDF weights

We could extend the bag-of-words representation with [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to reflect how important a word to a document in a corpus

tdf-idf can be applied with class TfidfVectorizer in sklearn

In [46]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer= TfidfVectorizer(stop_words='english')
test_corpus=train_data.review[0:5]
test_corpus= corpus_preprocessing(test_corpus)
test_corpus=vectorizer.fit_transform(test_corpus)
print test_corpus.todense()

[[ 0.          0.          0.04186337 ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.07981727  0.          0.        ]
 [ 0.          0.03912565  0.         ...,  0.          0.05842164
   0.05842164]
 [ 0.          0.0451275   0.         ...,  0.          0.          0.        ]
 [ 0.06656027  0.04457619  0.0537004  ...,  0.0537004   0.          0.        ]]


###Dimensionality reduction

Using stop_words was one technique to reduce dimensionality. We can further reduce the dimensinality by using [latent sematic analysis](https://en.wikipedia.org/wiki/Latent_semantic_analysis)

In sklearn, we can apply class TruncatedSVD into tf-idf matrix 

In [47]:
from sklearn.decomposition import TruncatedSVD 
tsvd=TruncatedSVD(100)
tsvd.fit(corpus)
test_corpus=tsvd.transform(test_corpus)
test_corpus

array([[ 0.59302279,  0.10204894,  0.05459333, -0.20719533,  0.76941514],
       [ 0.42258876,  0.80330892,  0.20243974,  0.0516001 , -0.36396305],
       [ 0.46017006, -0.1727521 , -0.39830345,  0.77359306, -0.0361713 ],
       [ 0.48451063, -0.23827071, -0.49382131, -0.56305644, -0.38416724],
       [ 0.4196631 , -0.48859817,  0.72588024,  0.0426256 , -0.23756188]])

##Training Naive Bayes

Sklearn provides several kinds of Naives classifiers: GaussianNB, MultinomialNB and BernoulliNB. We will choose MultinomialNB for this task

In [None]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()

Fitting the training data

In [None]:
#features from train set
train_features=train_data.review

#pro-processing train features
train_features=corpus_preprocessing(train_features)
vectorizer= TfidfVectorizer(stop_words='english')
train_features=vectorizer.fit_transform(train_features)
tsvd=TruncatedSVD(100)
tsvd.fit(train_features)
train_features=tsvd.transform(train_features)

#target from train set 
train_target=train_data.sentiment

#fitting the model
model.fit(train_features,train_target)

##Predicting with Naive Bayes

In [None]:
#reading test data
test_data=train_data=pd.read_csv("https://github.com/tranlyvu/kaggle/tree/master/Bag%20of%20Words%20Meets%20Bags%20of%20Popcorn/data/data/testData.tsv", header=0,delimiter="\t", quoting=3)

#features from test data
test_features=test_data.review

#pre-processing test features
test_features=corpus_preprocessing(test_features)
test_features=vectorizer.transform(test_features)
test_features=tsvd.transform(test_features)

#predicting the sentiment for test set
prediction=model.predict(test_features)

##Preparing for kaggle submission

In [None]:
#writing out submission file
pd.DataFrame( data={"id":test_data["id"], "sentiment":prediction} ).to_csv("first_attempt.csv", index=False, quoting=3 )

##Performance Evaluation

A variety of metrics exist to evaluate the performance for binary classifiers, i.e accuracy, precision, recall, F1 measure, ROC AUC score. We shall use ROC AUC score for this task as specified by competition site.

### Splitting train data set

We first splitting the train data set for cross validation, let's choose 80% for split_train set and 20% for split test_set

In [None]:
from sklearn.cross_validation import train_test_split
# Split 80-20 train vs test data
split_train_features, split_test_features, split_train_target, split_test_target= train_test_split(train_features, 
                                                                                                   train_target, 
                                                                                                   test_size=0.20, 
                                                                                                   random_state=0)

###Evaluating model using splitted data set

[ ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) illustrates the classifier's performance for all values of the discrimination threshold. 

In [None]:
from sklearn.metric import roc_auc_score,roc_curve

#pre-processing split train 
vectorizer= TfidfVectorizer(stop_words='english')
split_train_features=corpus_preprocessing(split_train_features)
split_train_features=vectorizer.fit_transform(split_train_features)
tsvd=TruncatedSVD(100)
tsvd.fit(split_train_features)
split_train_features=tsvd.transform(split_train_features)

#pre-processing split test features
split_test_features=corpus_preprocessing(split_test_features)
split_test_features=vectorizer.transform(split_test_features)
split_test_features=tsvd.transform(split_test_features=)

#fit and predict using split data
model.fit(split_train_features,split_train_target)
split_prediction=model.predict(split_test_features)
score=roc_auc_score(split_test_target, split_predict)
print (score(split_test_target, split_predict))

###Plotting ROC curve

ROC curves plot the classifier's recall against its fall-out.

In [None]:
import matplotlib as plt

false_positive_rates ,recall,thresholds=roc_curve(split_test_target,split_prediction)
plt.title('Receiver Operating Charisteristic')
plt.plot(false_positive_rates,recall,'r', label='AUC = %0.2f' %score)
plt.legend(loc='lower right')
plt.ylable('Recall')
plt.xlable('False positive rate')
plt.show()

The source code of the first attempt can be found [here](https://github.com/tranlyvu/kaggle/blob/master/Bag%20of%20Words%20Meets%20Bags%20of%20Popcorn/main/first_attempt.py)