## Learning Objectives

At the end of the experiment, you will be able to:

* understand how the NLTK library is used
* perform text pre-processing with NLTK: stripping HTML and noisy text, removing special characters, lemmatization, stemming, tokenization, and removing stop words

## Introduction

**NLTK** (Natural Language Toolkit) is a leading platform for building Python programs that work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, as well as wrappers for industrial-strength NLP libraries.

It is a free, open source, community-driven project.

## Dataset Description

The IMDB movie review dataset can be downloaded from [here](http://ai.stanford.edu/~amaas/data/sentiment/). This dataset for binary sentiment classification contains around 50k movie reviews with the following attributes:

* **review:** text-based review of a movie
* **sentiment:** positive or negative sentiment value


### Setup Steps:

### Importing required packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk                                    # Natural Language Toolkit
nltk.download('stopwords')                     # stop word lists
nltk.download('punkt')                         # pre-trained Punkt sentence tokenizer models, used by word_tokenize
nltk.download('wordnet')                       # WordNet data, needed by WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')    # model for the part-of-speech tagger
# Note: NLTK 3.9 and later may instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
from nltk.corpus import stopwords              # NLTK stop words (articles, prepositions, and other low-information words)
from nltk.stem.porter import PorterStemmer     # reduces words to their stems by stripping suffixes
from nltk.tokenize import word_tokenize        # splits text into word and punctuation tokens
from bs4 import BeautifulSoup                  # library for pulling data out of HTML and XML files
import re                                      # regular expression operations
import string
import warnings
warnings.filterwarnings('ignore')

### Load the dataset

In [None]:
# read the dataset
df = # YOUR CODE HERE to read 'IMDB_Dataset.csv'
print(df.shape)
df.head(10)      # first 10 rows

In [None]:
# Let us view one of the reviews
# YOUR CODE HERE

### Exploratory Data Analysis

In [None]:
# summary of the dataset
# YOUR CODE HERE

Now, we will look at the sentiment count by category.

In [None]:
# sentiment count category-wise
# YOUR CODE HERE

In [None]:
# Visualize the positive and negative sentiments in a bar plot
# YOUR CODE HERE

We can see that the dataset is balanced.

Now, we will do the text cleaning of the reviews.



## Text Preprocessing

Data scraped from a website usually arrives as raw text. It needs to be cleaned before it is analyzed or used to fit a model, so that the features we want the machine learning system to pick up on are not buried under markup and noise.

**Removing noisy text**

Sample noise removal tasks could include:

* removing text file headers, footers
* removing HTML, XML, etc. markup and metadata
* extracting valuable data from other formats, such as JSON

In [None]:
# stripping the HTML tags
def strip_html(text):
    # BeautifulSoup is a useful library for extracting data from HTML and XML documents
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# removing the square brackets (the text inside them is kept)
def remove_between_square_brackets(text):
    text = text.replace('[', '')
    # YOUR CODE HERE to remove ']'
    return text
    # Alternative that also removes the text inside the brackets:
    # return re.sub(r'\[[^]]*\]', '', text)

# removing the noisy text
def denoise_text(text):
    # YOUR CODE HERE to update the text by applying strip_html() function
    # YOUR CODE HERE to update the text by applying remove_between_square_brackets() function
    return text
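
The two behaviours mentioned in `remove_between_square_brackets()` differ in what they keep. As a minimal sketch using only the standard `re` module, the first helper below deletes just the bracket characters, while the second deletes the brackets together with their contents. The helper names `drop_brackets` and `drop_bracketed_text` and the sample sentence are hypothetical, introduced only for illustration.

```python
import re

def drop_brackets(text):
    # Delete only the '[' and ']' characters; the enclosed text is kept
    return re.sub(r'[\[\]]', '', text)

def drop_bracketed_text(text):
    # Delete each '[...]' group, including the text inside it
    return re.sub(r'\[[^]]*\]', '', text)

sample = "A classic [spoiler ahead] with a twist."
kept = drop_brackets(sample)
removed = drop_bracketed_text(sample)
print(kept)
print(removed)
```

Note that the second variant leaves a double space where the bracketed group used to be, so a whitespace clean-up step may be needed afterwards.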

In [None]:
txt = """<body>
<h1>This is a test heading.</h1>
<p>Starting a paragraph here...</p>
</body>"""

# YOUR CODE HERE to call strip_html() on txt

In [None]:
txt = "I bought [apple, banana, orange] yesterday."

# YOUR CODE HERE to call remove_between_square_brackets() on txt

In [None]:
# apply denoise_text() function on review column
# YOUR CODE HERE

**Removing special characters**

Special characters are typically any characters that are not letters or digits, such as punctuation. Removing them leaves a string that contains only letters, digits, and whitespace.

We can use Python's built-in `re` module for regular expression operations.

To know more about Regular expressions, click [here](https://realpython.com/regex-python/).
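
To see which characters a pattern such as `[^a-zA-Z0-9\s]` matches, the sketch below lists the matches before deleting them. The class is negated (`^`), so it matches every character that is *not* a letter, a digit, or whitespace. The sample string is made up for illustration.

```python
import re

pattern = r'[^a-zA-Z0-9\s]'
sample = "Wasn't it great?! 10/10 <3"

# Characters the pattern matches, in order of appearance
removed_chars = re.findall(pattern, sample)

# The same string with every match deleted
cleaned = re.sub(pattern, '', sample)

print(removed_chars)
print(cleaned)
```

Apostrophes are removed as well, so a contraction such as "Wasn't" becomes "Wasnt". This is worth keeping in mind if a later step relies on the "n't" ending.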

In [None]:
# define function for removing special characters
def remove_special_characters(text):
    pattern = r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    return text

In [None]:
txt = "Hi There! How are you?"

# YOUR CODE HERE to call remove_special_characters() on txt

In [None]:
# apply remove_special_characters() function on review column
# YOUR CODE HERE

**Lemmatization**

Lemmatization is a text pre-processing technique that reduces a word to its dictionary form (called the lemma), so that different forms of the same word can be treated alike.

For example, a lemmatization algorithm would reduce the word ***better*** to its lemma, ***good***.

In [None]:
# Lemmatize word using WordNet's built-in function
# pos: The Part Of Speech tag.
#      Valid options are "n" for nouns, "v" for verbs, "a" for adjectives,
#                        "r" for adverbs and "s" for satellite adjectives.

lemmatizer = nltk.stem.WordNetLemmatizer()
print("better :", lemmatizer.lemmatize("better", pos="a"))

**Text Stemming**

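Stemming, also called suffix stripping, is a text normalization technique that maps related word forms to a common base form called the stem. Because several forms collapse into one stem, the vocabulary becomes smaller, which reduces the dimensionality of the text features. Unlike a lemma, a stem does not have to be a valid dictionary word.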
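
As a small sketch of what suffix stripping produces, the snippet below runs NLTK's `PorterStemmer` on a few hand-picked words. It assumes NLTK is installed; the stemmer itself needs no downloaded data. The word list is an arbitrary illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "movies", "likes", "easily"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Here "running" is reduced to "run", while "movies" becomes "movi", which is not an English word. That is expected: stems only need to be consistent across related forms.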

In [None]:
# stemming the text
def simple_stemmer(text):
    ps = PorterStemmer()                      # imported above from nltk.stem.porter
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

In [None]:
txt = "He likes lemons bananas and oranges. He goes to market."

# YOUR CODE HERE to call simple_stemmer() on txt

In [None]:
# apply function on review column
# YOUR CODE HERE to apply 'simple_stemmer' function on reviews

**Tokenization**

Tokenization is the process of splitting paragraphs and sentences into smaller units called tokens, typically words and punctuation marks.

For example:

In [None]:
tokens = word_tokenize('I enjoy playing football in the rain.')
tokens
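
To illustrate why a tokenizer is preferred over a plain whitespace split, here is a hedged sketch comparing `str.split()` with NLTK's rule-based `TreebankWordTokenizer`. This class is chosen only because it works without downloaded models; it is not necessarily what `word_tokenize` does internally.

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "I enjoy playing football in the rain."

# Whitespace split leaves the final period attached to the last word
whitespace_tokens = sentence.split()

# The rule-based tokenizer separates the period into its own token
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(whitespace_tokens)
print(treebank_tokens)
```

With the whitespace split the last token is "rain.", which would not match the plain word "rain" in later steps such as stop word removal or counting.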

**Part-of-speech tagging**

Part-of-speech (POS) tagging is the task of determining the word class (noun, verb, adjective, and so on) of each token. This is crucial for *disambiguation*, because the same word form can belong to different parts of speech; for example, "play" can be a noun or a verb.

In [None]:
tagged = nltk.pos_tag(tokens)
tagged

**Removing stop words**

Stop words are very common words that add little meaning to a sentence, such as "the", "he", and "have". They can usually be removed without changing what the sentence is about.

In [None]:
# loading the English stop words
stopword_list = nltk.corpus.stopwords.words('english')
print(stopword_list)

The list above also contains the word "not" and its contracted forms, such as "don't" and "didn't". We need to keep these words for correct sentiment classification.

For example, consider the negative review "*not a good movie*". If we remove "not" from it, it becomes the positive review "*a good movie*".
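
The effect can be sketched with a tiny, hand-written stop word list. This list is hypothetical and is not NLTK's list, and the whitespace split stands in for proper tokenization. The helper name `filter_tokens` is likewise introduced only for this illustration.

```python
# Hypothetical mini stop word list, for illustration only (not NLTK's list)
toy_stopwords = ["a", "the", "was", "not", "didn't"]

# Same list with 'not' and words ending in "n't" taken out, so negations survive
negation_safe = [w for w in toy_stopwords if w != "not" and not w.endswith("n't")]

def filter_tokens(text, stop_list):
    # Whitespace split is a simplification of proper tokenization
    return " ".join(t for t in text.split() if t.lower() not in stop_list)

review = "not a good movie"
with_full_list = filter_tokens(review, toy_stopwords)
with_safe_list = filter_tokens(review, negation_safe)
print(with_full_list)
print(with_safe_list)
```

With the full list the review collapses to "good movie", losing the negation; with the negation-safe list it stays "not good movie".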

In [None]:
# Exclude 'not' and its contracted forms (words ending in "n't") from the stop word list
updated_stopword_list = [
    word for word in stopword_list
    if word != 'not' and not word.endswith("n't")
]

print(updated_stopword_list)

In [None]:
# removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    # splitting strings into tokens (list of words)

    # YOUR CODE HERE to create 'tokens' variable by tokenizing 'text'

    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        # filtering out the stop words
        filtered_tokens = [token for token in tokens if token not in updated_stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in updated_stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

In [None]:
txt = "The movie was not that great"

# YOUR CODE HERE to call remove_stopwords() on txt

In [None]:
# apply remove_stopwords function on review column
# YOUR CODE HERE