# Natural Language Processing with `nltk`

`nltk` is the most popular Python package for Natural Language Processing. It provides algorithms for importing, cleaning, and pre-processing text data in human language, and for then applying computational linguistics algorithms such as sentiment analysis.

## Inspect the Movie Reviews Dataset

It also includes many easy-to-use datasets in the `nltk.corpus` package. For example, we can download the `movie_reviews` dataset using the `nltk.download` function:

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
import nltk

In [None]:
#nltk.download()

In [None]:
nltk.download("movie_reviews")

You can also list and download other datasets interactively by typing:

    nltk.download()

in the Jupyter Notebook. Note that this interactive downloader no longer seems to work inside the notebook.

Once the data have been downloaded, we can import them from `nltk.corpus`:

In [None]:
from nltk.corpus import movie_reviews

The `fileids` method provided by all the datasets in `nltk.corpus` gives access to a list of all the files available.

In particular, the `movie_reviews` dataset contains 2000 text files, each of them a review of a movie. They are already split into a `neg` folder for the negative reviews and a `pos` folder for the positive reviews:

In [None]:
len(movie_reviews.fileids())
type(movie_reviews.fileids())

In [None]:
movie_reviews.fileids()[:5]

In [None]:
movie_reviews.fileids()[-5:]

`fileids` can also filter the available files based on their category, which is the name of the subfolder they are located in. We can therefore get separate lists of positive and negative reviews.

In [None]:
negative_fileids = movie_reviews.fileids('neg')
positive_fileids = movie_reviews.fileids('pos')

In [None]:
len(negative_fileids), len(positive_fileids)

We can inspect one of the reviews using the `raw` method of `movie_reviews`. Each file is split into sentences, and the curators of this dataset also removed from each review any direct mention of the rating of the movie.

In [None]:
print(movie_reviews.raw(fileids=positive_fileids[0]))

## Build a bag-of-words: Tokenize Text into Words

In [None]:
romeo_text = """Why then, O brawling love! O loving hate!
O any thing, of nothing first create!
O heavy lightness, serious vanity,
Misshapen chaos of well-seeming forms,
Feather of lead, bright smoke, cold fire, sick health,
Still-waking sleep, that is not what it is!
This love feel I, that feel no love in this."""

The first step in Natural Language Processing is generally to split the text into words. This process might appear simple, but handling all the corner cases, such as punctuation, is tedious. If we just start with a split on whitespace, punctuation will be a problem.
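
As a small illustration of the problem, here is a sketch that splits one line of the passage on whitespace using plain `str.split`. The variable names are only illustrative.

```python
# One line of the passage used in this section.
text = "Why then, O brawling love! O loving hate!"

# str.split() with no arguments splits on any run of whitespace.
words = text.split()
print(words)

# Punctuation stays glued to the neighbouring word, e.g. 'then,' and 'love!',
# so 'love' and 'love!' would be treated as two different words.
```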

`nltk` has a sophisticated tokenizer for English that relies on the pre-trained `punkt` sentence-splitting model, so we first have to download its parameters:

Then we can use the `word_tokenize` function to properly tokenize this text. Compare the result to the whitespace splitting we used above:

In [None]:

# See http://text-processing.com/demo/tokenize/ for a comparison of different tokenizers

The good news is that the `movie_reviews` corpus already gives direct access to tokenized text through the `words` method:

## Build a bag-of-words: stopwords

The simplest model for analyzing text is to think of it as an unordered collection of words (bag-of-words). This is often enough to infer the category, the topic, or the sentiment of a text.

From the bag-of-words model we can build features to be used by a classifier. Here we assume that each word is a feature that can be either `True` or `False`.
We implement this in Python as a dictionary that associates `True` with each word in a sentence. A word that is missing from the dictionary is treated the same as a word assigned `False`.
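
One possible implementation of this idea is sketched below. The function name `build_bag_of_words_features` is the one used later in this notebook; the body shown here is an assumption about how it could be written, and the token list is taken by hand from one line of the passage.

```python
def build_bag_of_words_features(words):
    """Map every word in the sequence to True (a presence feature)."""
    return {word: True for word in words}

# Hand-tokenized line: "O any thing, of nothing first create!"
tokens = ["O", "any", "thing", ",", "of", "nothing", "first", "create", "!"]

features = build_bag_of_words_features(tokens)
print(features)

# A word that never appeared is simply absent, which we read as False.
print(features.get("hate", False))
```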

This is what we wanted, but we notice that punctuation like "!" and words that are useless for classification purposes, like "of" or "that", are also included.
Those words are called "stopwords", and `nltk` has a convenient corpus of them that we can download:

Using the Python `string.punctuation` constant and the English stopwords, we can build better features by filtering out the words that would not help in the classification:

First we want to filter out the `useless_words` defined in the previous section. This will reduce the length of the dataset:

## Build a bag-of-words: count

The `collections` module of the standard library contains a `Counter` class that is handy for counting the frequencies of the words in our list:
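
A small sketch of filtering followed by counting. The token list and the tiny `useless_words` set are made-up stand-ins for the corpus words and for the stopword and punctuation list built earlier.

```python
from collections import Counter

# Stand-in data (assumptions for this sketch, not the real corpus).
all_words = ["the", "film", ",", "the", "plot", "and", "the", "film", "!"]
useless_words = {"the", "and", ",", "!"}

# Drop the useless words, then count what is left.
filtered_words = [word for word in all_words if word not in useless_words]
word_counter = Counter(filtered_words)

print(len(all_words), len(filtered_words))
print(word_counter.most_common(2))
```

`most_common(n)` returns the `n` most frequent items as `(word, count)` pairs, sorted from most to least frequent.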

## Train a Classifier for Sentiment Analysis

Using our `build_bag_of_words_features` function, we can build the negative and the positive features separately.
For each of the 1000 negative and the 1000 positive reviews, we create one dictionary of its words and associate the label "neg" or "pos" with it.
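
The shape of the resulting data can be sketched on a toy example. The short token lists below are made up; in the notebook, each list would instead come from `movie_reviews.words(fileids=[f])` for a file id `f`.

```python
def build_bag_of_words_features(words):
    """Map every word in the sequence to True (a presence feature)."""
    return {word: True for word in words}

# Toy stand-ins for tokenized reviews (assumption for this sketch).
negative_reviews = [["boring", "plot"], ["dull", "film"]]
positive_reviews = [["great", "acting"], ["wonderful", "film"]]

# Each entry is a (feature dictionary, label) pair.
negative_features = [
    (build_bag_of_words_features(words), "neg") for words in negative_reviews
]
positive_features = [
    (build_bag_of_words_features(words), "pos") for words in positive_reviews
]

print(negative_features[0])
print(positive_features[0])
```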

One of the simplest supervised machine learning classifiers is the Naive Bayes classifier. It can be trained on 80% of the data to learn which words are generally associated with positive or with negative reviews.
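
A minimal sketch of the training step with `nltk`'s `NaiveBayesClassifier`, using ten made-up reviews instead of the real corpus so that it runs quickly. The variable names and the toy data are assumptions.

```python
from nltk.classify import NaiveBayesClassifier

def build_bag_of_words_features(words):
    return {word: True for word in words}

# Toy tokenized reviews (assumption for this sketch).
positive_reviews = [
    ["great", "acting", "fun"], ["great", "story", "moving"],
    ["wonderful", "fun", "cast"], ["great", "wonderful", "film"],
    ["great", "fun", "film"],
]
negative_reviews = [
    ["boring", "plot", "dull"], ["boring", "slow", "story"],
    ["awful", "dull", "cast"], ["boring", "awful", "film"],
    ["boring", "dull", "film"],
]
positive_features = [(build_bag_of_words_features(w), "pos") for w in positive_reviews]
negative_features = [(build_bag_of_words_features(w), "neg") for w in negative_reviews]

# Keep the first 80% of each class for training.
split = int(0.8 * len(positive_features))
train_set = positive_features[:split] + negative_features[:split]

sentiment_classifier = NaiveBayesClassifier.train(train_set)
print(sentiment_classifier.classify({"great": True, "fun": True}))
```

Splitting each class separately keeps the training set balanced between positive and negative examples.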

After training, we can check the accuracy on the training set, i.e. the same data used for training. We expect this to be a very high number because the algorithm has already "seen" those data. Accuracy is the fraction of the data that is classified correctly; we can turn it into a percentage:

The accuracy above is mostly a check that nothing went very wrong in the training. The real measure of accuracy is on the remaining 20% of the data that wasn't used in training, the test data:
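
Both measurements can be sketched with `nltk.classify.util.accuracy`. The data below are made-up feature dictionaries, so the numbers say nothing about the real corpus; on such a tiny, clean dataset both values come out perfect, which would not be expected on real reviews.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def features(*words):
    return {word: True for word in words}

# Toy labelled data (assumption for this sketch): 4 training + 1 test per class.
positive = [features("great", "acting", "fun"), features("great", "story", "moving"),
            features("wonderful", "fun", "cast"), features("great", "wonderful", "film"),
            features("great", "fun", "film")]
negative = [features("boring", "plot", "dull"), features("boring", "slow", "story"),
            features("awful", "dull", "cast"), features("boring", "awful", "film"),
            features("boring", "dull", "film")]

split = int(0.8 * len(positive))
train_set = [(f, "pos") for f in positive[:split]] + [(f, "neg") for f in negative[:split]]
test_set = [(f, "pos") for f in positive[split:]] + [(f, "neg") for f in negative[split:]]

classifier = NaiveBayesClassifier.train(train_set)

# accuracy() returns a fraction between 0 and 1; multiply by 100 for percent.
train_accuracy = accuracy(classifier, train_set)
test_accuracy = accuracy(classifier, test_set)
print(f"train: {train_accuracy * 100:.0f}%  test: {test_accuracy * 100:.0f}%")
```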

Accuracy here is around 70%, which is pretty good for such a simple model if we consider that the estimated accuracy for a person is about 80%.
Finally, we can print the most informative features, i.e. the words that most strongly identify a positive or a negative review:
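
A sketch of that last step on a tiny made-up training set. Two "mixed" examples are included on purpose so that the words "great" and "boring" occur in both classes with different frequencies; all data here are assumptions.

```python
from nltk.classify import NaiveBayesClassifier

# Toy training data (assumption for this sketch).
train_set = [
    ({"great": True, "film": True}, "pos"),
    ({"great": True, "cast": True}, "pos"),
    ({"great": True, "boring": True}, "pos"),   # mixed review, overall positive
    ({"boring": True, "film": True}, "neg"),
    ({"boring": True, "cast": True}, "neg"),
    ({"boring": True, "great": True}, "neg"),   # mixed review, overall negative
]
classifier = NaiveBayesClassifier.train(train_set)

# Prints a table of features with the ratio between the class likelihoods.
classifier.show_most_informative_features(2)

# The same ranking as a list of (feature name, feature value) pairs.
top_features = classifier.most_informative_features(2)
print(top_features)
```

Words such as "film" and "cast", which are equally frequent in both classes, carry no information and rank below "great" and "boring".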

## Word Cloud

In [None]:
!pip install wordcloud
# For more advanced wordcloud functions, install from source instead:
#   git clone https://github.com/amueller/word_cloud.git
#   cd word_cloud
#   pip install .

## Spam detection using bag-of-words approach

In [None]:
import pandas as pd
spam = pd.read_csv('spam.csv', encoding='ISO-8859-1')

In [None]:
spam.head()

In [None]:
# Tokenize each message in the 'sms' column
spam['tokenizedsms'] = spam['sms'].apply(nltk.word_tokenize)

In [None]:
spam.head()

In [None]:
#Remove stop words

In [None]:
#Stemming


In [None]:
#CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()


In [None]:
#Build a training model
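
One way the vectorizing and training steps outlined above could be completed is sketched here with scikit-learn's `CountVectorizer` and `MultinomialNB`. Since the contents of `spam.csv` are not shown, the example uses four made-up messages; the `label` column name and the choice of `MultinomialNB` are assumptions, not necessarily what the notebook intends.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up messages standing in for spam.csv; "label" is a hypothetical column.
spam = pd.DataFrame({
    "label": ["spam", "spam", "ham", "ham"],
    "sms": [
        "win a free prize now",
        "free cash prize call now",
        "are we meeting for lunch today",
        "see you at home tonight",
    ],
})

# Turn each message into a vector of word counts (the bag-of-words matrix).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(spam["sms"])

# Train a multinomial Naive Bayes model on the counts.
model = MultinomialNB()
model.fit(X, spam["label"])

# New messages must go through the same fitted vectorizer.
new_messages = ["free prize call now", "lunch at home today"]
predictions = model.predict(vectorizer.transform(new_messages))
print(list(predictions))
```

On the real dataset, part of the rows would be held out as test data, as in the sentiment analysis section.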