# Natural Language Pre-Processing

In [None]:
%load_ext autoreload
%autoreload 2

import os
import sys
module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)
    
import pandas as pd
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, word_tokenize, RegexpTokenizer
import matplotlib.pyplot as plt
import string
import re

## Learning Goals

SWBAT:
- describe the basic concepts of NLP
- use pre-processing methods for NLP 
    - tokenization
    - stopword removal

## 1. Overview of NLP

NLP allows computers to interact with text data in a structured and sensible way. In short, we will be breaking up a series of texts into individual words (or groups of words), and isolating the words with **semantic value**.  We will then compare texts with similar distributions of these words, and group them together.

In this section, we will discuss some steps and approaches to common text data analytic procedures. Some of the applications of natural language processing are:
- Chatbots 
- Speech recognition and audio processing 
- Classifying documents 

Here is an example that uses some of the tools we use in this notebook.  
  - [chi_justice_project](https://chicagojustice.org/research/justice-media-project/)  
  - [chicago_justice classifier](https://github.com/chicago-justice-project/article-tagging/blob/master/lib/notebooks/bag-of-words-count-stemmed-binary.ipynb)

We will introduce you to the preprocessing steps, feature engineering, and other steps you need to take in order to format text data for machine learning tasks. 

We will also introduce you to [**NLTK**](https://www.nltk.org/) (Natural Language Toolkit), which will be our main tool for engaging with textual data.

## NLP process

<img src="img/nlp_process.png" style="width:1000px;">

## 2. Preprocessing for NLP

In [None]:
# !pip install nltk
# !conda install -c anaconda nltk
# nltk.download('stopwords')  # one-time download of the stopword lists used below

We will be working with a dataset which includes both **satirical** (The Onion) and real news (Reuters) articles. 

We refer to the entire set of articles as the **corpus**.  

![the_onion](img/the_onion.jpeg) ![reuters](img/reuters.png)

In [None]:
corpus = pd.read_csv('../data/satire_nosatire.csv')
corpus.shape

In [None]:
corpus.head()

Our goal is to detect satire, so our target class of 1 is associated with The Onion articles.  

In [None]:
corpus.loc[10].body

In [None]:
corpus.loc[10].target

In [None]:
corpus.loc[502].body

In [None]:
corpus.loc[502].target

Each article in the corpus is referred to as a **document**.

It is a balanced dataset with 500 documents of each category. 

In [None]:
corpus.target.value_counts()

Let's think about our types of error and the use cases of being able to correctly separate satirical from authentic news. What type of error should we decide to optimize our models for?  

In [None]:
# Thoughts here



### Tokenization 

In order to convert the texts into data suitable for machine learning, we need to break down the documents into smaller parts. 

The first step in doing that is **tokenization**.

Tokenization is the process of splitting documents into units of observation. We usually represent the tokens as __n-grams__, where n represents the number of consecutive words occurring in a document that we will treat as a single unit. In the case of unigrams (one-word tokens), the sentence "David works here" would be tokenized into:

- "David", "works", "here";

If we also want to consider bigrams, we would add:

- "David works" and "works here".
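
As a minimal sketch of the idea, n-grams can be built by sliding a window of length n over a list of tokens. The helper name `make_ngrams` is hypothetical and not part of NLTK; it uses only plain Python.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "David works here".split()

unigrams = make_ngrams(tokens, 1)
bigrams = make_ngrams(tokens, 2)

print(unigrams)  # ['David', 'works', 'here']
print(bigrams)   # ['David works', 'works here']
```

Asking for an n larger than the number of tokens simply returns an empty list.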

Let's consider the first document in our corpus:

In [None]:
first_document = corpus.iloc[0].body

In [None]:
first_document

There are many ways to tokenize our document. 

It is a long string, so the first way we might consider is to split it by spaces.

In [None]:
# code



But this is not ideal. We are trying to create a set of tokens with **high semantic value**.  In other words, we want to isolate text which best represents the meaning in each document.
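
To see why a plain split falls short, here is a small sketch on a made-up sentence (the sentence is an assumption for illustration, not taken from the corpus). Splitting on spaces keeps capitalization and attached punctuation, so variants of the same word end up as distinct tokens.

```python
# Hypothetical sample text, not from the satire corpus
sample = "The Senate voted Tuesday. The vote, 52-48, surprised the senate!"

tokens = sample.split(" ")
print(tokens)

# Three spellings of one word: only two appear, and neither is the clean form
print("Senate" in tokens)   # True
print("senate!" in tokens)  # True
print("senate" in tokens)   # False
```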

### Common text cleaning tasks:  
  1. remove capitalization  
  2. remove punctuation  
  3. remove stopwords  
  4. remove numbers

We could manually perform all of these tasks with string operations.
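
The four tasks above can be sketched as one small function built from string operations. This is only an illustration under stated assumptions: `manual_clean` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for a real list such as the one NLTK provides.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set for illustration only
DEMO_STOPWORDS = {"the", "in", "a", "of", "is", "by"}


def manual_clean(text, stopword_set=DEMO_STOPWORDS):
    """Lowercase, strip punctuation and digits, then drop stopwords."""
    tokens = text.lower().split()                           # 1. capitalization
    table = str.maketrans("", "", string.punctuation)
    tokens = [t.translate(table) for t in tokens]           # 2. punctuation
    tokens = [re.sub(r"\d+", "", t) for t in tokens]        # 4. numbers
    # 3. stopwords (also drops tokens that became empty strings)
    return [t for t in tokens if t and t not in stopword_set]


cleaned = manual_clean("Turnout fell 12% in the 3 small counties.")
print(cleaned)  # ['turnout', 'fell', 'small', 'counties']
```

The sections below walk through each step separately on the first document.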

#### Capitalization

When we create our matrix of words associated with our corpus, **capital letters** will mess things up.  The semantic value of a word used at the beginning of a sentence is the same as that same word in the middle of the sentence.  In the two sentences:

sentence_one =  "Excessive gerrymandering in small counties suppresses turnout."   
sentence_two =  "Turnout is suppressed in small counties by excessive gerrymandering."  

'Excessive' and 'excessive' have the same semantic value, but they will be treated as different tokens because of the capital letter.

In [None]:
sentence_one =  "Excessive gerrymandering in small counties suppresses turnout." 
sentence_two =  "Turnout is suppressed in small counties by excessive gerrymandering."

Excessive = sentence_one.split(' ')[0]
excessive = sentence_two.split(' ')[-2]
print(excessive, Excessive)
excessive == Excessive

In [None]:
manual_cleanup = [word.lower() for word in first_document.split(' ')]

In [None]:
print(f"Our initial token set for our first document is {len(manual_cleanup)} words long")

In [None]:
print(f"Our initial token set for our first document has \
{len(set(first_document.split()))} unique words")

In [None]:
print(f"After removing capitals, our first document has \
{len(set(manual_cleanup))} unique words")

#### Punctuation

As with capitals, splitting on white space creates tokens that include punctuation, which will muck up our semantics.  

Returning to the above example, 'gerrymandering' and 'gerrymandering.' will be treated as different tokens.

In [None]:
no_punct = sentence_one.split(' ')[1]
punct = sentence_two.split(' ')[-1]
print(no_punct, punct)
no_punct == punct

In [None]:
## Manual removal of punctuation

string.punctuation

In [None]:
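# Build the translation table once, then strip punctuation from every token
punct_table = str.maketrans('', '', string.punctuation)
manual_cleanup = [s.translate(punct_table) for s in manual_cleanup]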

In [None]:
print(f"After removing punctuation, our first document has \
{len(set(manual_cleanup))} unique words")

#### Stopwords

Stopwords are the **filler** words in a language: prepositions, articles, conjunctions. They carry little semantic value on their own and are usually removed.  

Luckily, NLTK has lists of stopwords ready for our use.

In [None]:
stopwords.__dict__

In [None]:
stopwords.words('english')[:10]

In [None]:
stopwords.words('greek')[:10]

Let's see which stopwords are present in our first document.

In [None]:
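# Build the stopword set once instead of reloading the list for every token
english_sw = set(stopwords.words('english'))
stops = [token for token in manual_cleanup if token in english_sw]
stops[:10]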

In [None]:
print(f'There are {len(stops)} stopwords in the first document')

In [None]:
print(f'That is {len(stops)/len(manual_cleanup):.2%} of our text')

Let's also use the **FreqDist** tool to look at the makeup of our text before and after removal:

In [None]:
fdist = FreqDist(manual_cleanup)
plt.figure(figsize=(10, 10))
fdist.plot(30);

In [None]:
manual_cleanup = [token for token in manual_cleanup if\
                  token not in stopwords.words('english')]

In [None]:
# We can also customize our stopwords list

custom_sw = stopwords.words('english')
custom_sw.extend(["i'd", "say"])
custom_sw[-10:]

In [None]:
manual_cleanup = [token for token in manual_cleanup if token not in custom_sw]

In [None]:
print(f'After removing stopwords, there are {len(set(manual_cleanup))} unique words left')
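
One caveat about the order of the manual steps is worth a quick sketch. Stopword entries that contain an apostrophe (such as "don't", or the custom "i'd") can no longer match once punctuation has been stripped from the tokens. The stopword set and sentence below are assumptions for illustration only.

```python
import string

# Hypothetical mini stopword set containing contractions
demo_stopwords = {"i'd", "don't", "the"}
tokens = "I'd say the voters don't care".lower().split()

# Order A: strip punctuation first, then filter stopwords
table = str.maketrans("", "", string.punctuation)
stripped = [t.translate(table) for t in tokens]
survivors = [t for t in stripped if t not in demo_stopwords]
print(survivors)       # ['id', 'say', 'voters', 'dont', 'care']

# Order B: filter stopwords while the apostrophes are still present
filtered_first = [t for t in tokens if t not in demo_stopwords]
print(filtered_first)  # ['say', 'voters', 'care']
```

A tokenizer that keeps apostrophes inside words, like the regex-based one used later in this notebook, avoids the mismatch for straight apostrophes.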

In [None]:
fdist = FreqDist(manual_cleanup)
plt.figure(figsize=(10, 10))
fdist.plot(30);
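
For intuition about what the plots above are counting: NLTK's `FreqDist` is a subclass of Python's `collections.Counter`, so the same tallies can be sketched with the standard library alone. The token list here is made up for illustration.

```python
from collections import Counter

# Hypothetical tokens, standing in for a cleaned document
tokens = ["vote", "senate", "vote", "bill", "senate", "vote"]

counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)  # [('vote', 3), ('senate', 2)]
```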

#### Numbers

Numbers also usually have low semantic value. Their removal can help improve our models. 

To remove them, we will use regular expressions, a powerful tool which you may already have some familiarity with.

Regex allows us to match strings based on a pattern.  This pattern comes from a language of identifiers, which we can begin exploring on the cheatsheet found here:
  -   https://regexr.com/

A few key symbols:
  - . : matches any character (except a newline, by default)
  - \d, \w, \s : match a digit, a word character (letter, digit, or underscore), and a whitespace character  
  - *, ?, +: match 0 or more, 0 or 1, 1 or more of the preceding element  
  - [A-Z]: matches any capital letter  
  - [a-z]: matches lowercase letter  
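
The symbols above can be tried out with Python's built-in `re` module. The strings below are made-up examples, and each pattern is just one possible illustration of the symbol it uses.

```python
import re

text = "Call 555-0199 before 5 PM, or email Bob."

single_digits = re.findall(r"\d", text)         # one digit per match
digit_runs = re.findall(r"\d+", text)           # '+' groups consecutive digits
capitalized = re.findall(r"[A-Z][a-z]+", text)  # capital, then lowercase letters
optional_u = re.findall(r"colou?r", "color colour")  # '?' makes the 'u' optional
any_char = re.findall(r"b.b", "bob bib b b")    # '.' matches any one character
on_space = re.split(r"\s+", "a  b\tc")          # split on runs of whitespace

print(digit_runs)   # ['555', '0199', '5']
print(capitalized)  # ['Call', 'Bob']
print(optional_u)   # ['color', 'colour']
print(any_char)     # ['bob', 'bib', 'b b']
print(on_space)     # ['a', 'b', 'c']
```

Note that "PM" is not matched by `[A-Z][a-z]+`, because the pattern requires at least one lowercase letter after the capital.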

Other helpful resources:
  - https://regexcrossword.com/
  - https://www.regular-expressions.info/tutorial.html

We can use regex to isolate numbers:

In [None]:
first_document

In [None]:
pattern = '[0-9]'
number = re.findall(pattern, first_document)
number

In [None]:
pattern2 = '[0-9]+'
number2 = re.findall(pattern2, first_document)
number2
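
The cells above only locate the numbers. One way to actually remove them is `re.sub`, sketched here on a made-up sentence (an assumption for illustration, not the first document). A second substitution collapses the extra spaces left behind.

```python
import re

# Hypothetical sample text
sample = "The bill passed 52 to 48 after 3 hours of debate in 2019."

found = re.findall(r"[0-9]+", sample)
print(found)  # ['52', '48', '3', '2019']

no_numbers = re.sub(r"[0-9]+", "", sample)             # delete runs of digits
no_numbers = re.sub(r"\s+", " ", no_numbers).strip()   # tidy leftover whitespace
print(no_numbers)  # The bill passed to after hours of debate in .
```

The result also shows the trade-off: dropping numbers shrinks the vocabulary, but it can discard information a human reader would rely on.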

Sklearn and NLTK provide us with a suite of **tokenizers** for our text preprocessing convenience.

In [None]:
first_document

In [None]:
# Remember that a trailing '?' means 0 or 1 of what precedes it (here, the apostrophe group)

re.findall(r"([a-zA-Z]+(?:'[a-z]+)?)", "I'd")

In [None]:
# Allow both straight (') and curly (’) apostrophes inside contractions
pattern = r"([a-zA-Z]+(?:[’'][a-z]+)?)"
tokenizer = RegexpTokenizer(pattern)
first_doc = tokenizer.tokenize(first_document)
first_doc = [token.lower() for token in first_doc]
first_doc = [token for token in first_doc if token not in custom_sw]
first_doc[:10]

In [None]:
print(f'We are down to {len(set(first_doc))} unique words')