# Bag of Words using Random Forest

### In this notebook we will learn how to train a Random Forest classifier using the bag-of-words approach.

<img src="https://3.bp.blogspot.com/-4pxORQAgAFI/XMNZhEssXtI/AAAAAAAAGmA/SuQGsp-GyT4jKlUZieg_A5lnTza_GujfwCLcBGAs/s1600/bag_of_words.png">

## Introduction
The bag-of-words model is a way of representing text data when modeling text with machine learning algorithms. The bag-of-words model is simple to understand and implement and has seen great success in problems such as language modeling and document classification.

**According to [Wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model#:~:text=The%20bag%2Dof%2Dwords%20model,word%20order%20but%20keeping%20multiplicity.):** The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.

In this tutorial, you will use the bag-of-words representation to train a Random Forest model that predicts the sentiment of a review.
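To make the "bag (multiset)" idea from the definition above concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the two sentences are made up for illustration; the rest of this notebook uses scikit-learn instead.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

# Hypothetical example sentences, not taken from the dataset
bag_a = bag_of_words("the movie was good and the cast was good")
bag_b = bag_of_words("good was the cast and good was the movie")

print(bag_a)
print(bag_a == bag_b)  # word order is discarded, word counts are kept
```

Both sentences map to the same bag because only the counts of the words matter, not their positions.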

##  Importing Libraries

<div style="text-align: justify;"><div style="font-size: 16px;">To understand the bag-of-words concept, let us set up the environment and import the necessary libraries:</div><br>
1. <b>pandas: </b> for reading and exploring the data,<br>
2. <b>numpy: </b> for numerical computations on the data,<br>
3. <b>BeautifulSoup: </b> for parsing HTML and XML, which lets us remove unnecessary tags and navigate, search, and modify the parse tree,<br>
4. <b>re: </b> the regular expression library, which we use to clean the data through pattern matching,<br>
5. <b>nltk: </b> the Natural Language Toolkit, a library for text processing tasks such as classification, tokenization, stemming, tagging, parsing, and semantic reasoning,<br>
6. <b>sklearn: </b> for the machine learning tasks; here we use it to train a Random Forest model and evaluate its performance.</div>


In [None]:
import pandas as pd     
import numpy as np
from bs4 import BeautifulSoup
import re
import nltk
# nltk.download("stopwords")  # uncomment on first run to download the stop word list
from nltk.corpus import stopwords # Import the stop word list

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Reading the data (Training & Testing data)

In [None]:
df_train = pd.read_csv("/kaggle/input/kumarmanoj-bag-of-words-meets-bags-of-popcorn/labeledTrainData.tsv", 
                              header=0, 
                              delimiter="\t", 
                              quoting=3)

df_test = pd.read_csv("/kaggle/input/kumarmanoj-bag-of-words-meets-bags-of-popcorn/testData.tsv",
                             header=0, 
                             delimiter="\t", 
                             quoting=3)
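The `quoting=3` argument used above is the numeric value of `csv.QUOTE_NONE`, which tells pandas to treat double quotes as ordinary characters. The sketch below illustrates this on a made-up one-row sample in the same tab-separated layout; the sample text is an assumption, not a row from the dataset.

```python
import csv
import io

import pandas as pd

# Hypothetical one-row sample with the same three tab-separated columns
sample_tsv = 'id\tsentiment\treview\n"5814_8"\t1\t"A "classic" film.<br />"\n'

sample_df = pd.read_csv(io.StringIO(sample_tsv),
                        header=0,
                        delimiter="\t",
                        quoting=csv.QUOTE_NONE)

print(csv.QUOTE_NONE)               # the constant behind quoting=3
print(sample_df.loc[0, "review"])   # the quotes survive as part of the text
```

Without this setting, the unbalanced quotes inside a review could confuse the parser, which is a common reason for disabling quote handling on free-text columns.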

In [None]:
print(df_train.shape)
print(df_test.shape)

In [None]:
df_train.info()

In [None]:
print(df_train.columns.values)
print(df_test.columns.values)

In [None]:
df_train['review'][0]

## Preprocessing data for one item
###### Stripping the HTML and XML markup from the text


In [None]:

bs_data = BeautifulSoup(df_train["review"][0], "html.parser")  # name the parser explicitly to avoid a warning
print(bs_data.get_text())

In [None]:
letters_only = re.sub("[^a-zA-Z]", " ", bs_data.get_text() )
print(letters_only)

In [None]:
lower_case = letters_only.lower()  
words = lower_case.split()  
print(words)

In [None]:
print(stopwords.words("english") )

In [None]:
stop_words = set(stopwords.words("english"))  # build the set once instead of reloading the list per word
words = [w for w in words if w not in stop_words]
print(words)
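The steps above can be traced on a short string to see exactly what each one removes. This is a small sketch with a made-up sentence and a tiny hand-picked stop-word set standing in for the much longer NLTK list.

```python
import re

# Hypothetical review text and a tiny stand-in stop-word set (assumptions for illustration)
sample_text = "It's a 10/10 film, and I'd watch it again!"
sample_stop_words = {"a", "and", "i", "it", "s", "d"}

sample_letters = re.sub("[^a-zA-Z]", " ", sample_text)  # digits and punctuation become spaces
sample_tokens = sample_letters.lower().split()          # lowercase, then split on whitespace
sample_kept = [w for w in sample_tokens if w not in sample_stop_words]

print(sample_tokens)
print(sample_kept)
```

Note that the letters-only step also splits contractions such as "It's" into "it" and "s", and drops numbers like "10/10" entirely, so any signal carried by ratings or punctuation is lost.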

## Preprocessing all of the training and testing data

In [None]:
training_data_size = df_train["review"].size
testing_data_size = df_test["review"].size

print(training_data_size)
print(testing_data_size)

In [None]:
STOP_WORDS = set(stopwords.words("english"))  # built once; set lookups are fast

def clean_text_data(data_point, index, data_size):
    # Strip HTML markup, keep letters only, lowercase, and tokenize
    review_text = BeautifulSoup(data_point, "html.parser").get_text()
    review_letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    review_words = review_letters_only.lower().split()
    meaningful_words = [x for x in review_words if x not in STOP_WORDS]

    # Report progress every 2000 reviews
    if index % 2000 == 0:
        print("Cleaned %d of %d data (%d %%)." % (index, data_size, index / data_size * 100))

    return " ".join(meaningful_words)


In [None]:
# clean_train_data_list = []
# clean_test_data_list = []

##### Cleaning training data

In [None]:
df_train.head()

In [None]:
# Assign the whole column at once instead of chained indexing (df["review"][i] = ...)
df_train["review"] = [clean_text_data(review, i, training_data_size)
                      for i, review in enumerate(df_train["review"])]
print("Cleaning training completed!")

##### Cleaning testing data

In [None]:
df_test["review"] = [clean_text_data(review, i, testing_data_size)
                     for i, review in enumerate(df_test["review"])]
print("Cleaning testing completed!")

## Preparing the features for training

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)  # keep only the 5000 most frequent words

In [None]:
X_train, X_cv, Y_train, Y_cv = train_test_split(df_train["review"], df_train["sentiment"], test_size = 0.3, random_state=42)

### Converting the train, validation and test data to vectors

In [None]:
X_train = vectorizer.fit_transform(X_train)
X_train = X_train.toarray()
print(X_train.shape)

In [None]:
X_cv = vectorizer.transform(X_cv)
X_cv = X_cv.toarray()
print(X_cv.shape)

In [None]:
X_test = vectorizer.transform(df_test["review"])
X_test = X_test.toarray()
print(X_test.shape)

In [None]:
vocab = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(f"Printing first 100 vocabulary samples:\n{vocab[:100]}")

In [None]:
distribution = np.sum(X_train, axis=0)

print("Printing first 100 vocab-dist pairs:")

for tag, count in zip(vocab[:100], distribution[:100]):
    print(count, tag)
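The loop above lists words in alphabetical order. If you would rather see the most frequent words first, one option is to sort the column totals, as in this sketch on a made-up count matrix (the `demo_` names and values are assumptions):

```python
import numpy as np

# Hypothetical vocabulary and count matrix (rows = documents, columns = words)
demo_vocab = np.array(["cast", "great", "movie", "plot", "terrible"])
demo_counts = np.array([[1, 2, 1, 0, 0],
                        [0, 0, 1, 0, 1],
                        [0, 1, 0, 1, 0]])

demo_totals = demo_counts.sum(axis=0)                  # total occurrences of each word
demo_order = np.argsort(-demo_totals, kind="stable")   # column indices, most frequent first

top_words = [(str(demo_vocab[j]), int(demo_totals[j])) for j in demo_order[:2]]
print(top_words)
```

The same two lines would work on `distribution` and `vocab` from the cells above.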

## Training Random Forest model

<img src="https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png">

In [None]:
forest = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducible results
forest = forest.fit(X_train, Y_train)
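As a side note, `RandomForestClassifier` also accepts SciPy sparse matrices, so the `.toarray()` conversion is optional and can be skipped when memory is tight. The sketch below shows this on a made-up four-review corpus; the texts, labels, and `toy_` names are assumptions for illustration.

```python
from scipy.sparse import issparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy reviews with 1 = positive and 0 = negative
toy_texts = ["great film", "loved great cast", "terrible film", "boring terrible plot"]
toy_labels = [1, 1, 0, 0]

toy_vec = CountVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)  # stays sparse, no toarray() call
print(issparse(toy_X), toy_X.shape)

toy_forest = RandomForestClassifier(n_estimators=100, random_state=0)
toy_forest.fit(toy_X, toy_labels)

toy_pred = toy_forest.predict(toy_vec.transform(["great cast"]))
print(toy_pred)
```

With only four training examples the prediction itself is not meaningful; the point is that the fit and predict calls work unchanged on sparse input.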

## Testing the model on the validation data

In [None]:
predictions = forest.predict(X_cv) 
print("Accuracy: ", accuracy_score(Y_cv, predictions))
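Accuracy is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels and compares the result with `accuracy_score`; it also prints a confusion matrix, which shows where the errors fall.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions for five reviews
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
library_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, library_accuracy)

# Rows are true classes, columns are predicted classes (class order: 0, 1)
demo_confusion = confusion_matrix(y_true, y_pred)
print(demo_confusion)
```

The same `confusion_matrix(Y_cv, predictions)` call could be applied to the validation predictions above to see whether the model errs more on positive or negative reviews.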

## Creating the output submission file

In [None]:
result = forest.predict(X_test) 
output = pd.DataFrame( data={"id":df_test["id"], "sentiment":result} )
output.to_csv( "submission.csv", index=False, quoting=3 )
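Before uploading, it can help to check what the file will look like. This sketch writes a made-up two-row frame to an in-memory buffer with the same `to_csv` arguments (`csv.QUOTE_NONE` is the named constant for `quoting=3`); the ids and predictions are hypothetical.

```python
import csv
import io

import pandas as pd

# Hypothetical ids and predictions
demo_output = pd.DataFrame({"id": ["12311_10", "8348_2"], "sentiment": [1, 0]})

buffer = io.StringIO()
demo_output.to_csv(buffer, index=False, quoting=csv.QUOTE_NONE)

submission_lines = buffer.getvalue().splitlines()
print(submission_lines)
```

The output should contain a header row followed by one `id,sentiment` pair per line, with no index column.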


### That marks the end of this notebook, hope it was worth reading!