# Using Random Forest + Bag of Words/Word2Vec on Bag of Words Meets Bags of Popcorn Competition
In this competition, I will be using two types of models to predict the sentiments of movie reviews:
1. The Bag of Words representation and a Random Forest model.
2. Word2Vec and the Word Vector representation, and a Random Forest model.

Here are the basic imports.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
!pip install beautifulsoup4
from bs4 import BeautifulSoup # for removing HTML tags
import re # removing punctuation
import nltk # removing stop words
import gensim # word vectors (word2vec)
import cython # speeding up training

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Data Imports
I'll first import the necessary data.

In [None]:
# Importing data (with column header, tab spacing, and ignoring double quotes)
train_data = pd.read_csv("../input/word2vec-nlp-tutorial/labeledTrainData.tsv.zip",
                         header=0, delimiter="\t", quoting=3)
test_data = pd.read_csv("../input/word2vec-nlp-tutorial/testData.tsv.zip",
                        header=0, delimiter="\t", quoting=3)
unlabeled_train_data = pd.read_csv("../input/word2vec-nlp-tutorial/unlabeledTrainData.tsv.zip",
                                   header=0, delimiter="\t", quoting=3)
print(train_data.shape, test_data.shape)
print(train_data.columns.values)
train_data.head()

## Data Cleaning
Here, I'll do the data preprocessing (removing HTML tags, removing punctuation, removing unnecessary newlines/spaces and lower casing, removing stop words, stemming, and lemmatization).
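
As a rough illustration of the punctuation, whitespace, and lower-casing steps only, here is a minimal standard-library sketch. The helper name `basic_tokens` and the sample string are made up for illustration. HTML stripping, stop-word removal, stemming, and lemmatization are left out because they depend on BeautifulSoup and NLTK.

```python
import re

def basic_tokens(text):
    """Replace non-alphanumeric characters with spaces, lower-case, and split."""
    # hypothetical helper, not part of the notebook itself
    letters_and_digits = re.sub("[^a-zA-Z0-9]", " ", text)
    # split() with no arguments also collapses repeated spaces and newlines
    return letters_and_digits.lower().split()

sample = "Great movie!!  10/10,\nwould watch again."
tokens = basic_tokens(sample)
print(tokens)
```

Note that this pattern keeps digits, so a rating such as "10/10" survives as two "10" tokens.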

In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
# inspired by tutorial
def review_to_wordlist(review, remove_stop_words=False):
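    '''Cleans a review by removing HTML tags and punctuation (digits are kept).
    Returns a list of lower-cased words, or, if remove_stop_words is True,
    a space-joined string of stemmed and lemmatized words without stop words.'''
    # remove HTML tags (naming the parser avoids BeautifulSoup's "no parser specified" warning)
    review1 = BeautifulSoup(review, "html.parser").get_text()
    # replace everything except letters and digits with spaces
    review2 = re.sub("[^a-zA-Z0-9]", " ", review1)
    # removing unnecessary newlines/spaces + lower case
    review3 = review2.lower().split()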
    if remove_stop_words:
        # removing stop words
        stops = set(stopwords.words("english")) # faster search through set than list
        review4 = [word for word in review3 if word not in stops]
        # lemmatization and stemming
        lemmatizer = WordNetLemmatizer()
        porterStemmer = PorterStemmer()
        review5 = [lemmatizer.lemmatize(porterStemmer.stem(word)) for \
                   word in review4]
        # return final review (joined by spaces)
        return " ".join(review5)
    else:
        return review3

In [None]:
# clean each data set in its own pass, so the code does not rely on
# train and test having the same number of rows
clean_train_reviews = []
for i, review in enumerate(train_data["review"]):
    clean_train_reviews.append(review_to_wordlist(review, True))
    if (i+1)%1000 == 0:
        print(f"{i+1} train reviews cleaned")
clean_test_reviews = [review_to_wordlist(review, True) for review in test_data["review"]]

## Creating Bag of Words Representation
Here, I will create the bag of words representation of the train and test data with CountVectorizer.
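
To make the idea concrete, here is a simplified pure-Python sketch of a bag of words with a capped vocabulary. It is not how CountVectorizer is implemented: the real class also lower-cases text, uses a token pattern that by default drops one-character tokens, and returns a sparse matrix. The function names and the three toy documents are hypothetical.

```python
from collections import Counter

def fit_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, in alphabetical column order."""
    totals = Counter(word for doc in documents for word in doc.split())
    most_common = [word for word, _ in totals.most_common(max_features)]
    return {word: column for column, word in enumerate(sorted(most_common))}

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector

docs = ["good movie good plot", "good movie", "good movie bad plot"]
vocab = fit_vocabulary(docs, max_features=3)
vectors = [to_count_vector(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

With `max_features=3`, the rare word "bad" falls out of the vocabulary, which is the same trade-off that the 5000-feature cap makes on the real reviews.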

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# creating the vectorizer for creating the bag of words
vectorizer = CountVectorizer(analyzer = "word", \
                             tokenizer = None, \
                             preprocessor = None, \
                             stop_words = None, \
                             max_features = 5000)

train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features = train_data_features.toarray()
test_data_features = vectorizer.transform(clean_test_reviews)
test_data_features = test_data_features.toarray()
train_data_features.shape

## Using RandomForestClassifier on Bag of Words
Finally, for the Bag of Words data, I will train a RandomForest model and use it to predict the outcomes on the test data.

In [None]:
from sklearn.ensemble import RandomForestClassifier

model_bow = RandomForestClassifier(n_estimators=100)
model_bow.fit(train_data_features, train_data["sentiment"])

In [None]:
# Final predictions with the bag of words and Random Forest model
predictions_bow = model_bow.predict(test_data_features)
output_bow = pd.DataFrame(data={"id":test_data["id"], "sentiment": predictions_bow})
output_bow.to_csv("submissionBoW.csv", index=False, quoting=3)
print("Submission file created!")

Based on the submission score, this model reaches a final accuracy of about 84.5%.

## Creating Word Vector Model
Word vectors are a way of representing words as a list of numbers, which allow computers to find useful relationships between words, such as (quoted from the tutorial):
> "king - man + woman = queen"

Here, I will be creating word vectors as the second type of representation.
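
To show what the analogy means in terms of vector arithmetic, here is a toy sketch with NumPy. The 3-dimensional vectors are made up by hand so that the arithmetic works out; real Word2Vec vectors are learned from text and only approximately satisfy such relationships.

```python
import numpy as np

# Made-up vectors, chosen by hand for illustration only
toy_vectors = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
    "movie": np.array([0.5, 0.5, 0.5]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

target = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
# exclude the query words themselves, as analogy lookups usually do
candidates = [w for w in toy_vectors if w not in ("king", "man", "woman")]
nearest = max(candidates, key=lambda w: cosine(toy_vectors[w], target))
print(nearest)
```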

First, I'll create a sentence splitter to get the data ready for Word2Vec.

In [None]:
# Load punkt tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

# Function to split review into sentences
def review_to_sentences(review, tokenizer, remove_stopwords=False):
    '''Creates a clean review split into sentences'''
    # Using tokenizer to split paragraph into sentences
    sentences1 = tokenizer.tokenize(review.strip())
    # go over each sentence
    sentences = []
    for sentence in sentences1:
        if len(sentence) > 0:
            # call review_to_wordlist
            sentences.append(review_to_wordlist(sentence, remove_stopwords))
    return sentences

Next, I'll apply the sentence splitter to the reviews to get the sentences to train the word2vec model.
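
The shape of the resulting data matters: Word2Vec expects a list of sentences, where each sentence is itself a list of words. The sketch below shows that shape using a naive regular-expression splitter as a hypothetical stand-in for the punkt tokenizer, which handles abbreviations and other edge cases far better.

```python
import re

def naive_sentences_to_wordlists(review):
    """Split on ., !, ? and tokenize each non-empty piece."""
    # hypothetical stand-in for the punkt tokenizer, for illustration only
    pieces = re.split(r"[.!?]+", review)
    wordlists = []
    for piece in pieces:
        words = re.sub("[^a-zA-Z0-9]", " ", piece).lower().split()
        if words:  # skip empty pieces, e.g. after the final punctuation mark
            wordlists.append(words)
    return wordlists

demo = naive_sentences_to_wordlists("I loved it! The plot was thin. Would I watch it again?")
print(demo)
```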

In [None]:
sentences = []

# adding train sentences together in a list for training the word2vec model (not removing stop words for better model training)
for i in range(0, train_data["review"].size):
    sentences += review_to_sentences(train_data["review"][i], tokenizer, False)
    if (i+1)%1000 == 0:
        print(f"{i+1} reviews finished for train data")
        
# adding unlabeled train sentences in a list for training the word2vec model
for i in range(0,unlabeled_train_data["review"].size):
    sentences += review_to_sentences(unlabeled_train_data["review"][i], tokenizer, False)
    if (i+1)%1000 == 0:
        print(f"{i+1} reviews finished for unlabeled train data")

In [None]:
# Taking a look at a sentence in sentences
print(sentences[0])

I'll be using the following final parameters for the word2vec model:
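1. Architecture: CBOW (the gensim default, sg=0; pass sg=1 for skip-gram)
2. Training Algorithm: negative sampling (the gensim default, hs=0 and negative=5; pass hs=1 for hierarchical softmax)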
3. Downsampling of Frequent Words: 0.0005
4. Word vector dimensionality: 400
5. Context/window size: 15
6. Worker threads: 5
7. Minimum word count: 40
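
To give a feel for the downsampling setting, the sketch below evaluates the keep-probability heuristic from the original word2vec C implementation, which I am assuming gensim's sample parameter follows as well. The function name and the example word frequencies are hypothetical.

```python
import math

def keep_probability(word_fraction, sample=0.0005):
    """Probability of keeping one occurrence of a word during training.

    word_fraction is the word's share of all tokens in the corpus.
    """
    p = (math.sqrt(word_fraction / sample) + 1) * (sample / word_fraction)
    return min(p, 1.0)  # rare words are always kept

for fraction in (0.05, 0.005, 0.0005, 0.00005):
    print(f"{fraction:.5f} -> {keep_probability(fraction):.3f}")
```

Under this heuristic, very frequent words (such as "the") are dropped most of the time, while words rarer than the threshold are never dropped.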

In [None]:
# Import the built-in logging module and configure it so that Word2Vec creates nice output messages (from tutorial)
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',\
                    level=logging.INFO)

# Setting values for parameters
num_features = 400     # Word vector dimensionality                      
min_word_count = 40    # Minimum word count                        
num_workers = 5        # Number of threads to run in parallel
context = 15           # Context window size                                                                                    
downsampling = 0.0005  # Downsample setting for frequent words

# initializing and training the model
from gensim.models import word2vec
model_w2v = word2vec.Word2Vec(sentences, workers=num_workers, vector_size=num_features, min_count=min_word_count, \
                              window=context, sample=downsampling)

In [None]:
# some fun experimentation with the methods of the model
print(model_w2v.wv.doesnt_match("where when what why who is".split())) # should be is
print(model_w2v.wv.doesnt_match("king queen servant".split())) # should be servant
print(model_w2v.wv.most_similar("move")) # mostly seem correct, either move as in walking or move as in pushing (though furiou doesn't really match)

## Using Average Word Vectors
Averaging the individual word vectors of a text produces a single fixed-length vector that usually works reasonably well as a training feature (though word order is lost). Therefore, I will use average word vectors to represent the reviews.
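
Here is a minimal sketch of the averaging step on made-up 2-dimensional vectors. The dictionary `toy_vectors` stands in for a trained model's vocabulary, and returning a zero vector when no word is known is my own assumption about how to handle that edge case.

```python
import numpy as np

# Made-up vectors standing in for a trained model's vocabulary
toy_vectors = {
    "good": np.array([1.0, 0.0], dtype="float32"),
    "movie": np.array([0.0, 1.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

avg = average_vector(["a", "good", "movie"], toy_vectors, 2)
empty = average_vector(["unknown"], toy_vectors, 2)
print(avg, empty)
```

Words outside the vocabulary (here "a") do not count toward the average.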

In [None]:
def review_to_vector(review, model, num_features):
    '''Function to average all word vectors in a review'''
    # initializing an empty np array
    feature_vector = np.zeros((num_features,),dtype="float32")
    num_words = 0
    for word in review:
        # key_to_index is a dict, so membership checks are fast
        if word in model.wv.key_to_index:
            num_words += 1
            feature_vector = np.add(feature_vector,model.wv[word])
    # avoid dividing by zero when no word of the review is in the vocabulary
    if num_words > 0:
        feature_vector = np.divide(feature_vector, num_words)
    return feature_vector

def get_feature_vectors(reviews, model, num_features):
    '''Function to create feature vectors for all reviews'''
    count = 0
    review_vectors = np.zeros((len(reviews),num_features),dtype="float32")
    for review in reviews:
        review_vectors[count] = review_to_vector(review, model, num_features)
        count += 1
        if count%1000 == 0:
            print(f"{count} reviews converted")
    return review_vectors

In [None]:
# clean each data set in its own pass, so the code does not rely on
# train and test having the same number of rows
clean_train_reviews_vec = []
for i, review in enumerate(train_data["review"]):
    clean_train_reviews_vec.append(review_to_wordlist(review, False))
    if (i+1)%1000 == 0:
        print(f"{i+1} train reviews cleaned")
clean_test_reviews_vec = [review_to_wordlist(review, False) for review in test_data["review"]]

train_data_vectors = get_feature_vectors(clean_train_reviews_vec, model_w2v, num_features)
test_data_vectors = get_feature_vectors(clean_test_reviews_vec, model_w2v, num_features)

## Using RandomForestClassifier on Word Vectors
Finally, for the Word Vector data, I will train a RandomForest model and use it to predict the outcomes on the test data.

In [None]:
model_vec = RandomForestClassifier(n_estimators = 100)
model_vec.fit(train_data_vectors, train_data["sentiment"])

In [None]:
predictions_vec = model_vec.predict(test_data_vectors)
output_vec = pd.DataFrame(data={"id":test_data["id"], "sentiment": predictions_vec})
output_vec.to_csv("submissionW2V.csv", index=False, quoting=3)
print("Submission file created!")

This representation gets a lower accuracy of about 77.2%. Possible reasons include:
1. The Word2Vec model was trained on relatively little data (far fewer than a billion words), so the word vectors may be undertrained.
2. The RandomForestClassifier may not have had enough estimators.
3. The feature vectors may have had too many dimensions (or other parameters may not have been well tuned).

### Thanks for reading!