# Part 1. For Beginners Bag of Words

[https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words](https://www.kaggle.com/c/word2vec-nlp-tutorial/overview/part-1-for-beginners-bag-of-words)

<br>

# What is NLP?

NLP (Natural Language Processing) is a set of techniques for approaching text problems. This page will help you get started with loading and cleaning the IMDB movie reviews, then applying a simple [Bag of Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model to get surprisingly accurate predictions of whether a review is thumbs-up or thumbs-down.

<br>

# Before you get started

This tutorial is in Python. If you haven't used Python before, we suggest heading over to the [Titanic Competition Python Tutorials](https://www.kaggle.com/c/titanic) to get your feet wet (check out the Random Forest intro while you're there).

If you are already comfortable with Python and with basic NLP techniques, you may want to skip to Part 2.

**This part of the tutorial is not platform dependent.** Throughout this tutorial we'll be using various Python modules for text processing, deep learning, random forests, and other applications. See the **Setting Up Your System** page for more details.

There are many good tutorials, and indeed [entire books](http://www.nltk.org/book/), written about NLP and text processing in Python. This tutorial is in no way meant to be exhaustive - just to help get you started with the movie reviews.

<br>

# Code

The tutorial code for Part 1 lives [here](https://github.com/wendykan/DeepLearningMovies/blob/master/BagOfWords.py).

<br>

# Reading the Data

The necessary files can be downloaded from the Data page. The first file that you'll need is **labeledTrainData**, which contains 25,000 IMDB movie reviews, each with a positive or negative sentiment label.

Next, read the tab-delimited file into Python. To do this, we can use the **pandas** package, introduced in the Titanic tutorial, which provides the `read_csv` function for easily reading and writing data files. If you haven't used pandas before, you may need to install it.

In [2]:
# Import the pandas package, then use the "read_csv" function to read
# the labeled training data
import pandas as pd

train = pd.read_csv("../input/labeledTrainData.tsv",
                    header=0,
                    delimiter="\t",
                    quoting=3)

Here `header=0` indicates that the first line of the file contains column names, and `delimiter="\t"` indicates that the fields are separated by tabs. `quoting=3` (the value of `csv.QUOTE_NONE`) tells pandas to treat double quotes as ordinary characters rather than as field delimiters; without it, you may encounter errors trying to read the file.
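To see what the quoting mode changes, here is a minimal sketch using only the standard library's `csv` module, whose `QUOTE_*` constants are the values that the `quoting` argument of `read_csv` accepts. The two-line `sample` string and the `read_rows` helper are hypothetical and exist only for illustration; they are not part of the tutorial code or the real data file.

```python
import csv
import io

# Hypothetical two-line TSV sample (not taken from the real data file).
# The id and review fields are wrapped in quotes, and the review
# contains doubled quotes.
sample = 'id\tsentiment\treview\n"1_8"\t1\t"A ""great"" film"\n'


def read_rows(text, quoting):
    """Parse tab-delimited text with the given csv quoting mode."""
    return list(csv.reader(io.StringIO(text), delimiter="\t", quoting=quoting))


# Default mode: quotes are interpreted, and "" collapses to a single ".
interpreted = read_rows(sample, csv.QUOTE_MINIMAL)

# QUOTE_NONE (the integer 3): quote characters are kept as ordinary text.
literal = read_rows(sample, csv.QUOTE_NONE)

print(csv.QUOTE_NONE)
print(interpreted[1])
print(literal[1])
```

With `QUOTE_NONE` the parser never has to decide where a quoted field ends, which is why unbalanced quotes inside a review cannot derail it. The trade-off is that any surrounding quote characters stay in the values.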

We can make sure that we read 25,000 rows and 3 columns as follows:

In [3]:
train.shape

(25000, 3)

In [4]:
train.columns.values

array(['id', 'sentiment', 'review'], dtype=object)

The three columns are called "id", "sentiment", and "review". Now that you've read the training set, take a look at a few reviews. The command below shows the first movie review in the column named "review"; you should see a review that starts like this:

In [5]:
print(train['review'][0])

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

The review contains HTML tags such as `<br />`, abbreviations, and punctuation, all common issues when processing online text. Take some time to look through other reviews in the training set while you're at it. The next section deals with how to tidy up the text for machine learning.
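Reading reviews one by one only goes so far, so a quick tally of tag-like substrings can help you judge how widespread the markup is. The sketch below is one possible way to do that survey. The `reviews` list is a hypothetical stand-in for `train['review']`, and `TAG_PATTERN` is a rough pattern meant for inspection only, not for removing markup.

```python
import re
from collections import Counter

# Hypothetical stand-ins for train['review']; swap in the real column.
reviews = [
    "Visually impressive.<br /><br />The actual feature film bit...",
    "A plain review with no markup at all.",
    "Great cast!<br />Weak script, though.",
]

# Rough pattern for spotting tag-like substrings. Good enough for a quick
# survey of the data, but not a robust way to strip markup.
TAG_PATTERN = re.compile(r"<[^<>]+>")

# Count every tag-like substring across all reviews.
tag_counts = Counter(
    tag for review in reviews for tag in TAG_PATTERN.findall(review)
)

# Count how many reviews contain at least one tag-like substring.
reviews_with_tags = sum(1 for review in reviews if TAG_PATTERN.search(review))

print(tag_counts.most_common())
print(reviews_with_tags, "of", len(reviews), "reviews contain tag-like text")
```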

<br>

# Data Cleaning and Text Preprocessing

## Removing HTML Markup: The BeautifulSoup Package

First, we'll remove the HTML tags. For this purpose, we'll use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) library. If you don't have BeautifulSoup installed, run:

```
$ sudo pip install BeautifulSoup4
```

from the command line (NOT from within Python). Then, from within Python, load the package and use it to extract the text from a review:

In [7]:
# Import BeautifulSoup into your workspace
from bs4 import BeautifulSoup

# Initialize the BeautifulSoup object on a single movie review
# (naming the parser explicitly avoids a "no parser specified" warning)
example1 = BeautifulSoup(train['review'][0], "html.parser")

# Print the raw review and then the output of get_text(), for comparison
print(train['review'][0])
print()
print(example1.get_text())

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

Calling `get_text()` gives you the text of the review, without tags or markup. If you browse the BeautifulSoup documentation, you'll see that it's a very powerful library - more powerful than we need for this dataset. However, it is not considered a reliable practice to remove markup using regular expressions, so even for an application as simple as this, it's usually best to use a package like BeautifulSoup.
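To illustrate why regular expressions are fragile here, the sketch below compares a naive regex with a real HTML parser on a hypothetical fragment whose attribute value contains a `>` character. So that it runs without extra installs, it uses the standard library's `html.parser` module as a stand-in for BeautifulSoup; the `TextExtractor` class, the two helper functions, and the `fragment` string are all assumptions made for this example, not part of the tutorial code.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_with_parser(html):
    """Return the text content, letting a real parser find the tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def strip_with_regex(html):
    """Naive approach: delete anything between '<' and the next '>'."""
    return re.sub(r"<[^>]+>", "", html)


# Hypothetical fragment: the attribute value contains a ">" character.
fragment = '<p title="rating > 5">Visually impressive.<br /><br />The film</p>'

print(repr(strip_with_regex(fragment)))
print(repr(strip_with_parser(fragment)))
```

The regex ends the first tag at the `>` inside the attribute value and leaves part of the attribute in the output, while the parser returns only the text. Note that neither approach inserts a space where the `<br />` tags were, so adjacent sentences run together.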