
#Kaggle Competition: Bag of Words Meets Bags of Popcorn

*   Info from Competition Site
    *   Description
    *   Evaluation
    *   Data Set
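*  First attempt: 
    *  Working with data: exploring labeled data set
    *  Feature 'review'
    *  Transforming feature 'review': bag-of-words model
    *  Extending bag-of-words with TF-IDF weights
    *  Features pre-processing summary
    *  Training Naive Bayes
    *  Predicting with Naive Bayes
    *  Preparing for Kaggle submission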

#Info from Competition Site

##[Description](https://www.kaggle.com/c/word2vec-nlp-tutorial)

![image](http://localhost:8888/files/image/popcorn%20cvc3.jpg)
###Use Google's Word2Vec for movie reviews

In this tutorial competition, we dig a little "deeper" into sentiment analysis. Google's Word2Vec is a deep-learning inspired method that focuses on the meaning of words. Word2Vec attempts to understand meaning and semantic relationships among words. It works in a way that is similar to deep approaches, such as recurrent neural nets or deep neural nets, but is computationally more efficient. This tutorial focuses on Word2Vec for sentiment analysis.

Sentiment analysis is a challenging subject in machine learning. People express their emotions in language that is often obscured by sarcasm, ambiguity, and plays on words, all of which could be very misleading for both humans and computers. There's another Kaggle competition for movie review sentiment analysis. In this tutorial we explore how Word2Vec can be applied to a similar problem.

Deep learning has been in the news a lot over the past few years, even making it to the front page of the New York Times. These machine learning techniques, inspired by the architecture of the human brain and made possible by recent advances in computing power, have been making waves via breakthrough results in image recognition, speech processing, and natural language tasks. Recently, deep learning approaches won several Kaggle competitions, including a drug discovery task, and cat and dog image recognition.

###Tutorial Overview

This tutorial will help you get started with Word2Vec for natural language processing. It has two goals: 

Basic Natural Language Processing: Part 1 of this tutorial is intended for beginners and covers basic natural language processing techniques, which are needed for later parts of the tutorial.

Deep Learning for Text Understanding: In Parts 2 and 3, we delve into how to train a model using Word2Vec and how to use the resulting word vectors for sentiment analysis.

Since deep learning is a rapidly evolving field, much of the work has not yet been published, or exists only as academic papers. Part 3 of the tutorial is more exploratory than prescriptive -- we experiment with several ways of using Word2Vec rather than giving you a recipe for using the output.

To achieve these goals, we rely on an IMDB sentiment analysis data set, which has 100,000 multi-paragraph movie reviews, both positive and negative. 

###Acknowledgements

This dataset was collected in association with the following publication:

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). "Learning Word Vectors for Sentiment Analysis." The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Please email the author of that paper if you use the data for any research applications. The tutorial was developed by Angela Chapman during her summer 2014 internship at Kaggle.
  
##[Evaluation](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/evaluation)


Submissions are judged on area under the ROC curve. 
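
To make the metric concrete, here is a minimal sketch of the area under the ROC curve computed from its pairwise-ranking interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition material.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (O(n_pos * n_neg))."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))

# Toy data (assumption): two negative and two positive reviews.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))  # 0.75: three of the four pairs are ordered correctly
```

Because only the ordering of the scores matters, hard 0/1 predictions produce many ties, each of which counts as half.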

###Submission Instructions

You should submit a comma-separated file with 25,000 rows plus a header row. There should be 2 columns: "id" and "sentiment", which contain your binary predictions: 1 for positive reviews, 0 for negative reviews. For an example, see "sampleSubmission.csv" on the Data page. 

id,sentiment
123_45,0 
678_90,1
12_34,0
...
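
A file in this shape can be produced with the standard library alone. The sketch below assumes Python 3; the helper name `write_submission` is hypothetical, and the ids and predictions are the example rows shown above, not real results.

```python
import csv
import io

def write_submission(ids, predictions, handle):
    """Write a header row followed by one 'id,sentiment' row per review."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["id", "sentiment"])
    for review_id, label in zip(ids, predictions):
        writer.writerow([review_id, int(label)])

# Write to an in-memory buffer here; a real run would pass an open file instead.
buffer = io.StringIO()
write_submission(["123_45", "678_90", "12_34"], [0, 1, 0], buffer)
print(buffer.getvalue())
```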
 
##[Data Set](https://www.kaggle.com/c/word2vec-nlp-tutorial/data)

###Data Files

File Name              | Available Formats
-----------------------|------------------
sampleSubmission       | .csv (276.17 kb)
unlabeledTrainData.tsv | .zip (25.98 mb)
testData.tsv           | .zip (12.64 mb)
labeledTrainData.tsv   | .zip (12.96 mb)

###Data Set

The labeled data set consists of 50,000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of reviews is binary, meaning the IMDB rating < 5 results in a sentiment score of 0, and rating >=7 have a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000 review labeled training set does not include any of the same movies as the 25,000 review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.
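
The labeling rule described above fits in a few lines. This is only a sketch of the rule as stated: the function name `rating_to_sentiment` is hypothetical, and returning `None` for ratings of 5 and 6 is an assumption, since the description assigns no label to them.

```python
def rating_to_sentiment(rating):
    """Map an IMDB rating (1-10) to the binary sentiment label."""
    if rating < 5:
        return 0      # negative review
    if rating >= 7:
        return 1      # positive review
    return None       # assumption: ratings of 5 and 6 receive no label

print([rating_to_sentiment(r) for r in (1, 4, 5, 6, 7, 10)])
# [0, 0, None, None, 1, 1]
```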

###File descriptions

*  labeledTrainData - The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review.  
*  testData - The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment for each one. 
*  unlabeledTrainData - An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. 
*  sampleSubmission - A comma-delimited sample submission file in the correct format.

###Data fields

*  id - Unique ID of each review
*  sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
*  review - Text of the review


#First attempt

##Working with Data

We will first explore the labeled data set, starting by reading the original labeled training file:

In [None]:
import pandas as pd
train_data=pd.read_csv("/data/labeledTrainData.tsv", header=0,delimiter="\t", quoting=3)
train_data.head()

In [None]:
train_data.tail()

Notice that 'sentiment' is binary

In [8]:
train_data.dtypes

id           object
sentiment     int64
review       object
dtype: object

In pandas, the dtype 'object' here means the column holds strings. We will later convert the text to a numerical representation, for example with bag-of-words or word2vec.

Next, we get some basic information about the data:

In [9]:
train_data.info(0)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 781.2+ KB


Now that we have a general idea of the data set, we next clean and transform the data to create useful features for machine learning.

##Feature 'review'

We will start by writing functions for analyzing and cleaning the feature 'review', using the first review as an illustration.

In [10]:
train_data.review[0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

Before we can transform the text into a numerical representation, we need to process the raw text. Let's first remove HTML and punctuation.

In [None]:
import nltk
from bs4 import BeautifulSoup
import re
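# strip HTML tags such as <br />; naming the parser explicitly avoids bs4's "no parser specified" warning
soup=BeautifulSoup(train_data.review[0], "html.parser").get_text()
# replace every character that is not a letter with a space
letters_only = re.sub("[^a-zA-Z]"," ",soup )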
letters_only
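
To see what the regular expression does in isolation, the sketch below applies it to a short made-up string (the sample text is an assumption, not taken from the data set). Every character that is not an ASCII letter, including apostrophes and digits, becomes a space, so a contraction such as "i've" is split into two tokens.

```python
import re

# Made-up sample text (assumption), chosen to contain an apostrophe, a digit and punctuation.
sample = "i've watched it 2 times. Great!"

# Same pattern as in the cell above: anything that is not a-z or A-Z becomes a space.
letters_only = re.sub("[^a-zA-Z]", " ", sample)

# Lowercase and split on whitespace to get the remaining tokens.
tokens = letters_only.lower().split()
print(tokens)  # ['i', 've', 'watched', 'it', 'times', 'great']
```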

Next we stem and lemmatize the text. It is generally better to run the POS tagger first, since we only want to stem and lemmatize verbs and nouns.

In [None]:
tokens=nltk.word_tokenize(letters_only.lower())
tagged_words=nltk.pos_tag(tokens)
tagged_words[0:5]
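
The tagger returns (token, tag) pairs whose tags follow the Penn Treebank convention, where noun tags start with N and verb tags start with V. The sketch below shows how a first-letter check selects those tokens. The pairs are written by hand for illustration (they are not actual tagger output), and the helper name `is_noun_or_verb` is hypothetical.

```python
def is_noun_or_verb(tag):
    """True when a Penn Treebank tag denotes a noun (N*) or a verb (V*)."""
    return tag[:1].upper() in ("N", "V")

# Hand-written example pairs (assumption), not real tagger output.
example_pairs = [("with", "IN"), ("all", "DT"), ("stuff", "NN"), ("going", "VBG")]

selected = [token for token, tag in example_pairs if is_noun_or_verb(tag)]
print(selected)  # ['stuff', 'going']
```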

Stemming the text: NLTK provides several stemmers; we compare two of them here, Porter and Lancaster.

In [None]:
# Despite the "lemmatize_" prefix, these two helpers stem tokens.
# Only tokens tagged as verbs (V*) or nouns (N*) are changed.
# The stemmers are created once instead of once per token.
porter=nltk.PorterStemmer()
lancaster=nltk.LancasterStemmer()

def lemmatize_with_potter(token,tag):
    if tag[0] in ['v','n','V','N']:
        return  porter.stem(token)
    return token
stemmed_text_with_potter=[lemmatize_with_potter(token,tag) for token,tag in tagged_words]

def lemmatize_with_lancaster(token,tag):
    if tag[0] in ['v','n','V','N']:
        return  lancaster.stem(token)
    return token
stemmed_text_with_lancaster=[lemmatize_with_lancaster(token,tag) for token,tag in tagged_words]

In [None]:
stemmed_text_with_potter[0:10]

In [None]:
stemmed_text_with_lancaster[0:10]

Observing that the word 'going' has been stemmed by Porter but not by Lancaster, I'll choose Porter for this task. 

Next, let's lemmatize:

In [None]:
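# re-tag the stemmed tokens, then lemmatize nouns and verbs with WordNet
tagged_words_after_stem=nltk.pos_tag(stemmed_text_with_potter)
wnl = nltk.WordNetLemmatizer()
def lemmatize_with_WordNet(token,tag):
    # tag is a string such as 'NN' or 'VBG', so tag[0] is its first letter
    if tag[0] in ['v','n','V','N']:
        # lemmatize() treats the token as a noun unless a pos argument is given
        return wnl.lemmatize(token)
    return token
stemmed_and_lemmatized_text=[lemmatize_with_WordNet(token,tag) for token,tag in tagged_words_after_stem]
stemmed_and_lemmatized_text[0:10]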

##Transforming feature 'review': bag-of-words model

Let's transform the feature 'review' into a numerical representation that can be fed into machine learning algorithms. A common representation of text is the [bag-of-words model](https://en.wikipedia.org/wiki/Bag-of-words_model).

In sklearn, we can use the class CountVectorizer to transform the data. We shall also remove stop words to reduce the dimension of the feature space.

In [13]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer(stop_words='english')
corpus=train_data.review[0:5]
print(vectorizer.fit_transform(corpus).todense())

[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [1 1 0 ..., 0 1 1]
 [0 0 1 ..., 0 0 0]
 [0 0 0 ..., 1 0 0]]
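
As a rough illustration of what such a matrix contains, the sketch below builds a bag-of-words matrix by hand for three made-up sentences. It is not how `CountVectorizer` is implemented: tokenization is a plain whitespace split, no stop words are removed, and the helper names `build_vocabulary` and `count_vector` are hypothetical. It does show the two steps involved: learning a vocabulary, then counting each document's words against it.

```python
from collections import Counter

def build_vocabulary(documents):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}

def count_vector(document, vocabulary):
    """Count occurrences of each vocabulary word; unknown words are ignored."""
    counts = Counter(document.lower().split())
    row = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            row[vocabulary[word]] = count
    return row

# Toy corpus (assumption): one row per document, one column per word.
docs = ["good movie", "bad movie", "good good plot"]
vocabulary = build_vocabulary(docs)
matrix = [count_vector(doc, vocabulary) for doc in docs]

print(sorted(vocabulary))  # ['bad', 'good', 'movie', 'plot']
for row in matrix:
    print(row)
# [0, 1, 1, 0]
# [1, 0, 1, 0]
# [0, 2, 0, 1]
```

Counting a new document against the same vocabulary, for example `count_vector("great movie", vocabulary)`, gives `[0, 0, 1, 0]`: the unseen word "great" has no column and is dropped. This is why new data has to be counted against the vocabulary learned from the training data.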


##Extending bag-of-words with TF-IDF weights

We could extend the bag-of-words representation with [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) weights to reflect how important a word is to a document in a corpus.

tf-idf can be applied with the class TfidfVectorizer in sklearn.

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer= TfidfVectorizer(stop_words='english')
corpus=train_data.review[0:5]
print(vectorizer.fit_transform(corpus).todense())

[[ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.06063824  0.06063824  0.         ...,  0.          0.06063824
   0.06063824]
 [ 0.          0.          0.0672503  ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.06649444  0.          0.        ]]
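
The sketch below works through the weighting by hand on a toy corpus. It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row, which is scikit-learn's documented default behaviour. The tokenization (whitespace split, no stop words), the toy corpus and the function name `tfidf_rows` are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return (vocabulary, rows) where each row holds L2-normalized tf-idf weights."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed inverse document frequency.
    idf = {w: math.log((1.0 + n_docs) / (1.0 + df[w])) + 1.0 for w in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)                       # raw term counts
        raw = [tf[w] * idf[w] for w in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw))  # L2 norm of the row
        rows.append([x / norm if norm else 0.0 for x in raw])
    return vocabulary, rows

vocabulary, rows = tfidf_rows(["good movie", "bad movie", "good good plot"])
print(vocabulary)
for row in rows:
    print([round(x, 3) for x in row])
```

In the second document, "bad" occurs in only one document while "movie" occurs in two, so "bad" receives the larger weight even though both words appear once.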


##Features pre-processing summary

In [None]:
import re
from bs4 import BeautifulSoup
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

porter=nltk.PorterStemmer()
wnl = nltk.WordNetLemmatizer()

def stemmatize_with_potter(token,tag):
    if tag[0].lower() in ['v','n']:
        return  porter.stem(token)
    return token


def lemmatize_with_WordNet(token,tag):
    if tag[0].lower() in ['v','n']:
        return wnl.lemmatize(token)
    return token

def corpus_preprocessing(corpus):
    preprocessed_corpus = []
    for sentence in corpus:
        #remove HTML and punctuation (the parser is named explicitly to avoid a bs4 warning)
        soup=BeautifulSoup(sentence, "html.parser").get_text()
        letters_only = re.sub("[^a-zA-Z]"," ",soup )

        #Stemming
        tokens=nltk.word_tokenize(letters_only.lower())
        tagged_words=nltk.pos_tag(tokens)
        stemmed_text_with_potter=[stemmatize_with_potter(token,tag) for token,tag in tagged_words]

        #lemmatization
        tagged_words_after_stem=nltk.pos_tag(stemmed_text_with_potter)
        stemmed_and_lemmatized_text=[lemmatize_with_WordNet(token,tag) for token,tag in tagged_words_after_stem]
        
        #join all the tokens back into a single string
        clean_review=" ".join(stemmed_and_lemmatized_text)
        preprocessed_corpus.append(clean_review)

    return preprocessed_corpus

#features from training set
train_features=train_data.review

#pre-processing train features
train_features=corpus_preprocessing(train_features)
vectorizer= TfidfVectorizer(stop_words='english')
train_features=vectorizer.fit_transform(train_features)


##Training Naive Bayes

Sklearn provides several kinds of Naive Bayes classifiers: GaussianNB, MultinomialNB and BernoulliNB. We will choose MultinomialNB for this task.
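
For intuition about what a multinomial Naive Bayes classifier estimates, here is a small hand-written sketch with Laplace smoothing. It is not sklearn's implementation: it works on raw word counts from a made-up four-review corpus rather than on TF-IDF features, and the function names `train_multinomial_nb` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log likelihoods per class."""
    vocabulary = sorted({w for doc in documents for w in doc.split()})
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_docs in class_counts.items():
        # Total word count of the class plus alpha for every vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocabulary)
        log_likelihood = {
            w: math.log((word_counts[label][w] + alpha) / total) for w in vocabulary
        }
        model[label] = (math.log(n_docs / float(len(labels))), log_likelihood)
    return model

def predict(model, document):
    """Return the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words that never occurred in training are skipped.
        scores[label] = log_prior + sum(
            log_likelihood[w] for w in document.split() if w in log_likelihood
        )
    return max(scores, key=scores.get)

# Toy corpus (assumption): 1 = positive, 0 = negative.
train_docs = ["great fun movie", "great acting", "boring plot", "boring dull movie"]
train_labels = [1, 1, 0, 0]
model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, "great movie"))  # 1
print(predict(model, "dull plot"))    # 0
```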

In [None]:
from sklearn.naive_bayes import MultinomialNB
model=MultinomialNB()

Fitting the training data

In [None]:
#target from training set
train_target=train_data.sentiment

#fitting the model
model.fit(train_features,train_target)

##Predicting with Naive Bayes

In [None]:
#reading test data (kept in its own variable so that train_data is not overwritten)
test_data=pd.read_csv("/data/testData.tsv", header=0,delimiter="\t", quoting=3)

#features from test data
test_features=test_data.review

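#pre-processing test features
test_features=corpus_preprocessing(test_features)
#reuse the vectorizer fitted on the training set: transform (not fit_transform)
#keeps the same vocabulary, so the columns match what the model was trained on
test_features=vectorizer.transform(test_features)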

#predicting the sentiment for test set
prediction=model.predict(test_features)

##Preparing for Kaggle submission

In [None]:
pd.DataFrame( data={"id":test_data["id"], "sentiment":prediction} ).to_csv("first_attempt.csv", index=False, quoting=3 )