# NLP I: Tokenizing/Lemmatization and Sentiment Analysis
---

#### Before we begin, try running this:

In [None]:
import nltk

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [None]:
lemmatizer.lemmatize("cats")

If you ran into issues with the above:

1. Open a Jupyter notebook and run `import nltk`.
    - If this runs without issue, fantastic! Move to step 4.
    - If `import nltk` does not work, then move to step 2.
2. Run `nltk.download()`. A new screen will pop up outside your Jupyter notebook. (It may be hidden behind other windows.)
3. Once this box opens up, click `all`, then `download`. Once this is done, restart your Jupyter notebook and return to step 1.
4. Run:
```python
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("cats")
```

    - If this returns `cat`, then fantastic! You’re done. 
    - If not, head to http://www.nltk.org/install.html and follow instructions for your computer, then go back to step 1.

### Kick-Off

Generally when we get text data, strings aren't broken out into individual words or even sentences. We might have a full tweet, full chapter of a book, or full .pdf file all in one long string.

This evening, we're diving into the practical side of NLP - taking this data and breaking it out into words that we can then turn into $n$-grams or TF-IDF features.

A couple things to note before beginning:
1. NLP describes how we can get unstructured data into a more structured form. That does not mean the tools we use today work to the exclusion of other methods.
2. You can and should include other variables in your model!

#### Agenda
1. Pre-Processing
    - Break strings into words.
    - Combine words.
2. Sentiment Analysis

In [1]:
spam = 'Hello,\nI saw your contact information on LinkedIn. I have carefully read through your profile and you seem to have an outstanding personality. This is one major reason why I am in contact with you. My name is Mr. Valery Grayfer Chairman of the Board of Directors of PJSC "LUKOIL". I am 86 years old and I was diagnosed with cancer 2 years ago. I will be going in for an operation later this week. I decided to WILL/Donate the sum of 8,750,000.00 Euros(Eight Million Seven Hundred And Fifty Thousand Euros Only etc. etc.'

### Pre-Processing 

- Tokenizing
- Regular Expression (RegEx)
- Lemmatizing
- Stemming
- Additional Things (e.g. removing HTML)

#### Tokenizing

When we "tokenize" data, we take it and split it up into distinct chunks based on some pattern.

In [2]:
#Before we can lemmatize our spam string we need to tokenize it.

from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') ## We'll talk about this in a moment.
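
To make the idea concrete, here is a minimal sketch that uses only the standard-library `re` module. For a simple pattern like `\w+`, `re.findall` is assumed here to return the same tokens that `tokenizer.tokenize()` would. The `sample` string is a shortened, hypothetical stand-in for `spam`.

```python
import re

# Shortened, hypothetical stand-in for the `spam` string above.
sample = "Hello,\nI saw your contact information on LinkedIn. I am 86 years old."

# r"\w+" matches runs of letters, digits, and underscores, so punctuation
# and whitespace act as separators and are dropped from the output.
sample_tokens = re.findall(r"\w+", sample)
print(sample_tokens)
```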

#### Regular Expressions

Regular Expressions, or RegEx, are a very helpful way for us to detect patterns in text. 

In [None]:
import re  # Python's built-in regular expression module

RegEx in Python 3 uses `\d+` to match one or more numeric digits, so searching `spam_tokens` with that pattern tells us whether any token contains digits.
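
As a hedged sketch of that kind of digit search: the `spam_tokens` list below is hypothetical, written as if it came from tokenizing part of the spam message, and `digit_tokens` is a name introduced only for this example.

```python
import re

# Hypothetical tokens, as if produced by tokenizing part of the spam message.
spam_tokens = ["I", "am", "86", "years", "old", "8", "750", "000", "00", "Euros"]

# re.search returns a match object if the pattern occurs anywhere in the
# token, and None otherwise, so it works as a truthy/falsy filter.
digit_tokens = [tok for tok in spam_tokens if re.search(r"\d+", tok)]
print(digit_tokens)
```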

A `RegexpTokenizer` splits a string into substrings using a regular expression. (You could also say it uses RegEx to tokenize.)

The following example is pulled from [this site](http://www.nltk.org/_modules/nltk/tokenize/regexp.html).

Let's get a better sense of how regex works: Open [Regex101 Tester](http://www.regex101.com) and paste the content of string s (below) into the "test string" field on the website.

In [None]:
s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

In [None]:
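tokenizer_1 = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')  # raw string so the backslashes reach the regex engine

In [None]:
tokenizer_2 = RegexpTokenizer(r'\s+', gaps=True)  # gaps=True: the pattern matches the separators

In [None]:
capword_tokenizer = RegexpTokenizer(r'[A-Z]\w+')  # capitalized words only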
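
If NLTK is not available yet, the same three patterns can be explored with the standard-library `re` module. This is a sketch, not NLTK's implementation: `re.findall` stands in for an ordinary tokenizer and `re.split` stands in for `gaps=True`. The names `tokens_1`, `tokens_2`, and `tokens_3` are introduced only for this example.

```python
import re

s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."

# Pattern 1: words, OR dollar amounts, OR any other run of non-whitespace.
tokens_1 = re.findall(r"\w+|\$[\d\.]+|\S+", s)

# Pattern 2: runs of whitespace are the gaps between tokens, so split on
# them and drop any empty strings.
tokens_2 = [tok for tok in re.split(r"\s+", s) if tok]

# Pattern 3: a capital letter followed by one or more word characters.
tokens_3 = re.findall(r"[A-Z]\w+", s)

for tokens in (tokens_1, tokens_2, tokens_3):
    print(tokens)
```

Notice that pattern 1 separates the trailing period from `York`, while pattern 2 keeps `York.` as a single token.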

#### Lemmatizing

When we "lemmatize" data, we take words and attempt to return their *lemma*, or the base/dictionary form of a word.

We can also do this on individual words.

#### Stemming

When we "stem" data, we take words and attempt to return a base form of the word. Stemming tends to be cruder than lemmatization. The algorithm used below was [published by Porter in 1980](https://www.cs.toronto.edu/~frank/csc2501/Readings/R2_Porter/Porter-1980.pdf).

In [None]:
# Create p_stemmer of class PorterStemmer
from nltk.stem.porter import PorterStemmer
p_stemmer = PorterStemmer()

We can also do this on individual words.
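
To see why stemming is the cruder of the two, here is a toy comparison in plain Python. It is neither the Porter algorithm nor WordNet: the suffix rules, the lookup table, and the function names `toy_stem` and `toy_lemmatize` are all hypothetical and exist only to show the difference in approach.

```python
# Toy illustration only: these suffix rules and this lookup table are
# hypothetical and far smaller than anything NLTK uses.
SUFFIXES = ("ing", "ies", "es", "ed", "s")
LEMMA_TABLE = {"cats": "cat", "ponies": "pony", "running": "run", "better": "good"}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in a table of known forms; fall back to the word."""
    return LEMMA_TABLE.get(word, word)


for word in ["cats", "ponies", "running", "better"]:
    print(word, "-> stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The rule-based stem can produce strings that are not words, while the lookup-based lemma is always a dictionary form, at the cost of needing the dictionary.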

# Let's start with a very simple example

Let's build a function that can classify a small amount of text, such as a tweet, as positive or negative.

What words tell us whether certain text is positive?

In [None]:
theTweet = "We have some delightful new food in the cafeteria. Awesome!!!"

In [None]:
# Let's come up with a list of positive and negative words we might run into in one tweet

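positive_words = ['delightful', 'awesome', 'good', 'great', 'love']   # example words; add your own
negative_words = ['awful', 'bad', 'terrible', 'hate', 'gross']        # example words; add your own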

In [None]:
#Tokenize

#import re
theTokens = re.findall(r'\b\w[\w-]*\b', theTweet.lower())
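
One possible version of such a function is sketched below. The word sets are small hypothetical examples rather than a real sentiment lexicon, and `classify_sentiment` is a name made up for this sketch.

```python
import re

# Hypothetical word sets; a real lexicon would be much larger.
positive_words = {"delightful", "awesome", "good", "great", "love"}
negative_words = {"awful", "bad", "terrible", "hate", "gross"}


def classify_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = re.findall(r"\b\w[\w-]*\b", text.lower())
    n_positive = sum(tok in positive_words for tok in tokens)
    n_negative = sum(tok in negative_words for tok in tokens)
    if n_positive > n_negative:
        return "positive"
    if n_negative > n_positive:
        return "negative"
    return "neutral"


theTweet = "We have some delightful new food in the cafeteria. Awesome!!!"
print(classify_sentiment(theTweet))
```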

**Check:** What are some shortcomings of this method?

# Sorting Positive from Negative Reviews

The easiest way to do sentiment classification is to train a model on data we've already labeled.

Today we will begin by reviewing the basic NLP techniques we learned, then use them to build a sentiment analyzer from IMDb movie reviews. This code-along is adapted from Kaggle's tutorial, available [here](https://www.kaggle.com/c/word2vec-nlp-tutorial#part-1-for-beginners-bag-of-words).


## Step Zero: Import the Data

In [None]:
import pandas as pd       
train = pd.read_csv("labeledTrainData.tsv", header=0, 
                    delimiter="\t", quoting=3)

In [None]:
#What are we looking at? Someone describe the columns

There are a few steps we'll take to clean up the text data before it's ready for processing:

- Remove the HTML code artifacts from the text
- Remove punctuation
- Remove stopwords (what are these?)


## Step One: Remove HTML code artifacts

Fortunately, we can use Beautiful Soup to remove the HTML artifacts from our corpus.

In [None]:
from bs4 import BeautifulSoup             

# Initialize the BeautifulSoup object on a single movie review     
example1 = BeautifulSoup(train['review'][0], 'html.parser')  # name the parser explicitly to avoid a warning

# Print the raw review and then the output of get_text(), for 
# comparison
print(train['review'][0])
print('\n')
print(example1.get_text())

## Step Two: Remove Punctuation

Punctuation can be removed using regular expressions.

In [None]:
# Use regular expressions to do a find-and-replace
letters_only = re.sub("[^a-zA-Z]",           # The pattern to search for
                      " ",                   # The pattern to replace it with
                      example1.get_text() )  # The text to search
print(letters_only)

In [None]:
# Let's also take this time to convert everything to lowercase 
# and split into individual words.
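
A minimal sketch of this step, run on a short hypothetical string instead of the real `letters_only` so that it works on its own:

```python
# Hypothetical stand-in for `letters_only` from the previous cell.
letters_only_sample = "With all this stuff going down at the moment"

lower_case = letters_only_sample.lower()  # normalize capitalization
words = lower_case.split()                # split on any run of whitespace
print(words)
```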

## Step Three: Remove Stop Words

If you didn't complete the NLTK download you may run into some issues here.

In [None]:
import nltk
# nltk.download()  # Download text data sets, including stop words. Uncomment this if you did not download

In [None]:
# Remove stop words from "words"
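
The filtering itself is a one-line list comprehension. The sketch below uses a tiny hypothetical stop-word set so it runs without any downloads. With the NLTK corpus installed, `stopwords.words("english")` from `nltk.corpus` would supply a much longer list.

```python
# A tiny, hypothetical stop-word set; NLTK's English list is much longer.
stop_words = {"with", "all", "this", "at", "the", "a", "an", "and", "of", "is"}

# Hypothetical lowercased, split review text.
words = ["with", "all", "this", "stuff", "going", "down", "at", "the", "moment"]

# Keep only the words that are not in the stop-word set.
meaningful_words = [w for w in words if w not in stop_words]
print(meaningful_words)
```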


## Step Four: Combine our cleaning into one function

**Check**: Why should we do everything with one function?

In [None]:
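import re
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def review_to_words(raw_review):
    # Function to convert a raw review to a string of words
    # The input is a single string (a raw movie review), and 
    # the output is a single string (a preprocessed movie review)
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(raw_review, "html.parser").get_text()
    #
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", review_text)
    #
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()
    #
    # 4. In Python, searching a set is much faster than searching
    #   a list, so convert the stop words to a set
    stops = set(stopwords.words("english"))
    # 
    # 5. Remove stop words
    meaningful_words = [w for w in words if w not in stops]
    #
    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return " ".join(meaningful_words)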



## Step Five (Finally!): Applying Our Function

In [None]:
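# Get the number of reviews based on the dataframe column size
num_reviews = train["review"].size

# Initialize an empty list to hold the clean reviews
clean_train_reviews = []

# Clean each review in turn and collect the results
for i in range(num_reviews):
    clean_train_reviews.append(review_to_words(train["review"][i]))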

## Our data is finally ready.....

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer" object, which is scikit-learn's
# bag of words tool.  
vectorizer = CountVectorizer(analyzer="word",
                             tokenizer=None,
                             preprocessor=None,
                             stop_words=None,
                             max_features=5000)

# fit_transform() does two things: first, it fits the model
# and learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a list of
# strings.
train_data_features = vectorizer.fit_transform(clean_train_reviews)

# Numpy arrays are easy to work with, so convert the result to an 
# array
train_data_features = train_data_features.toarray()
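
To show what "learn the vocabulary, then count" means, here is a pure-Python sketch of the bag-of-words idea. It is not scikit-learn's implementation. The `docs` list is a hypothetical stand-in for `clean_train_reviews`, and `toy_vocab` and `toy_features` are names invented for this example.

```python
from collections import Counter

# Hypothetical cleaned reviews, standing in for clean_train_reviews.
docs = ["great movie great cast", "terrible movie", "great fun"]

# "Fit": learn the vocabulary (sorted so the column order is deterministic).
toy_vocab = sorted({word for doc in docs for word in doc.split()})

# "Transform": one row per document, one column per vocabulary word,
# where each cell holds how often that word appears in that document.
toy_features = []
for doc in docs:
    counts = Counter(doc.split())
    toy_features.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_features:
    print(row)
```

Each row has the same length as the vocabulary, which is why the real feature array has one column per retained word (at most `max_features`).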

In [None]:
print(train_data_features.shape)

In [None]:
vocab = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
print(vocab)

### Now we have an array that we can use for classification!

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
clf = KNeighborsClassifier(n_neighbors = 5)

In [None]:
clf.fit(train_data_features, train['sentiment'])
#this will take a while....

In [None]:
## In general, we want to score on the test data!
clf.score(train_data_features, train['sentiment'])
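
Scoring on the training rows tends to overstate how well the model generalizes. Below is a pure-Python sketch of holding out a test portion before fitting. The helpers `holdout_split` and `accuracy` and the toy data are hypothetical. In practice, scikit-learn's `train_test_split` (in `sklearn.model_selection`) does this job for arrays.

```python
import random


def holdout_split(rows, labels, test_fraction=0.2, seed=42):
    """Shuffle indices and split rows/labels into train and test portions."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy data: ten rows with alternating labels.
rows = [[i] for i in range(10)]
labels = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = holdout_split(rows, labels)
print(len(X_train), len(X_test))
```

A model would then be fit on `X_train` and `y_train` only, and scored on `X_test` and `y_test`.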