# Parsing Text (aka Prepping Text Data)

We're going to take our acquired data and parse it. That is, we'll break the text down into smaller components so that it is easier to understand and work with.

## Big Idea

We want to reduce the variability between words.

"math" and "Math" mean the same thing, so we lowercase the text to collapse case variants of the same term.

"Erdős", "Erdös", and "Erdos" refer to the same person, so we strip accents and other non-ASCII characters.

"run" and "runs" refer to the same thing, so we reduce words to a common base form.

Again, we're looking to reduce variability before we start searching for relationships between values.

### Workflow

To make this happen, we will establish a workflow to process our text data and prepare it for further use in exploration and modeling. 

This preprocessing is known as text **normalization**.

![image.png](attachment:image.png)

## Let's see it in action

In [1]:
#standard imports
import pandas as pd
import numpy as np
import unicodedata

### original text

In [2]:
original = "Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed \
a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), \
but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"
original

"Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

### 1. lowercase text

In [3]:
lowered = original.lower()

### 2. remove any accented characters and non-ASCII characters

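- `unicodedata.normalize` removes inconsistencies in Unicode character encoding; with the `'NFKD'` form, an accented character is split into its base letter plus a separate combining mark
- `.encode` converts the resulting string to the ASCII character set; passing `'ignore'` drops any character that has no ASCII equivalent
- `.decode` turns the resulting bytes object back into a string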

In [4]:
lowered

"paul erdős and george pólya were influential hungarian mathematicians who contributed a lot to the field. erdős's name contains the hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as erdos or erdös either by mistake or out of typographical necessity"

In [None]:
normalized = unicodedata.normalize('NFKD', lowered).encode('ascii', 'ignore').decode('utf-8')
normalized
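
As a quick check of the normalize, encode, decode chain, here is a minimal sketch that applies it to the three spellings from the Big Idea section. The helper name `strip_accents` is made up for illustration and is not part of `unicodedata`.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical helper: decompose accented characters, then drop non-ASCII."""
    # NFKD splits a character such as 'ő' into 'o' plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" discards the combining marks.
    return decomposed.encode("ascii", "ignore").decode("utf-8")

variants = ["erdős", "erdös", "erdos"]
normalized_variants = [strip_accents(v) for v in variants]
print(normalized_variants)
```

All three variants should come out as the same string, which is exactly the reduction in variability we are after.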

### 3. remove special characters

- remove anything that isn't a letter a-z, a digit, a single quote, or whitespace
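
One way to sketch this step is with a regular expression. The sample string is made up, and the sketch assumes the text has already been lowercased and converted to ASCII in the earlier steps.

```python
import re

# Assumed input: text that is already lowercase and ASCII-only.
text = "erdos's name contains the letter 'o' (with 2 dots), right?"

# Keep only a-z, 0-9, single quotes, and whitespace; delete everything else.
cleaned = re.sub(r"[^a-z0-9'\s]", "", text)
print(cleaned)
```

Single quotes are kept so that possessives and contractions such as "erdos's" survive intact.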

### 4. tokenize

Tokenization is the process of breaking something down into smaller, discrete units. These units are called tokens.

It's common to tokenize strings so that words and any leftover punctuation are split into discrete units.
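
To make the idea concrete, here is a minimal regex-based sketch using only the standard library. It is a stand-in for illustration, not the tokenizer this lesson necessarily uses; a library tokenizer will handle many more edge cases.

```python
import re

text = "erdos's name is often written as erdos, by mistake."

# A token is either a run of letters, digits, and single quotes (a word),
# or a single leftover non-whitespace character (punctuation).
tokens = re.findall(r"[a-z0-9']+|[^\sa-z0-9']", text)
print(tokens)
```

Note that the comma and the period become tokens of their own rather than staying attached to the preceding word.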

### 5. stem or lemmatize words (choose one!)

#### Stemming: reduce related words in your text to their common stem

- "calls", "called", and "calling" all share the base stem "call". When you are searching for a particular word in your text, it can be easier to search for the common stem than for every form of the word.
- Works by stripping suffixes
- Uses algorithmic rules (non-linguistic), so the resulting stem is not always a real word
- Fast and efficient
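
The suffix-stripping idea can be sketched with a deliberately naive toy function. This is not the Porter algorithm or any library stemmer; `naive_stem` and its suffix list are assumptions made for illustration only.

```python
def naive_stem(word, suffixes=("ing", "ed", "s")):
    """Toy stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [naive_stem(w) for w in ["calls", "called", "calling", "call"]]
print(stems)

# Rule-based stripping does not know about spelling, so stems need not be words.
print(naive_stem("running"))
```

The four forms of "call" collapse to one stem, while "running" becomes "runn", which shows why stems are not guaranteed to be dictionary words.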

#### Lemmatization: reduce words to their dictionary form (lemma)

- Similar to stemming, but the resulting root is a lexically valid word (one that is present in the dictionary)
- Slower than stemming
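
To contrast with stemming, here is a toy sketch of dictionary-based lookup. The `LEMMAS` table is a tiny hand-written stand-in invented for this example; a real lemmatizer consults a full lexical resource and usually takes the part of speech into account.

```python
# Hypothetical, hand-written lookup table standing in for a real dictionary.
LEMMAS = {
    "ran": "run",
    "runs": "run",
    "running": "run",
    "mice": "mouse",
}

def lookup_lemma(word):
    """Return the dictionary form if we know it, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

lemmas = [lookup_lemma(w) for w in ["ran", "runs", "running", "mice", "math"]]
print(lemmas)
```

Unlike suffix stripping, the lookup handles irregular forms such as "ran" and "mice", and every result is a real word. The cost is that a dictionary has to be stored and searched.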

In [105]:
# run once, the first time, to download the nltk data
# import nltk
# nltk.download('all')

### 6. remove stopwords

Words that have little or no significance, especially when constructing meaningful features from text, are known as stopwords.
- Examples: a, an, the, and the like

We will use a standard English-language stopword list from `nltk`.
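
The filtering itself is simple. The sketch below uses a tiny hand-picked `STOPWORDS` set as a stand-in so that it runs without any downloads; it is not the `nltk` list, which is much longer.

```python
# Hypothetical mini stopword set; the real list is much larger.
STOPWORDS = {"a", "an", "the", "and", "to", "of", "is"}

tokens = "paul erdos contributed a lot to the field".split()

# Keep only the tokens that are not stopwords.
kept = [t for t in tokens if t not in STOPWORDS]
print(kept)
```

Using a set rather than a list keeps each membership check fast, which matters once the text gets long.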

### 7. Store the clean text for exploration
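
As a rough sketch of this last step, the first three cleaning steps can be wrapped in one function and the result saved next to the original text. The function name `basic_clean`, the record layout, and the output file name are all assumptions for illustration; a pandas DataFrame with the same two columns would work just as well.

```python
import json
import re
import tempfile
import unicodedata
from pathlib import Path

def basic_clean(text):
    """Hypothetical helper: lowercase, strip accents, drop special characters."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    return re.sub(r"[^a-z0-9'\s]", "", text)

original_text = "Paul Erdős and George Pólya were influential Hungarian mathematicians."
record = {"original": original_text, "clean": basic_clean(original_text)}

# Assumed location: a JSON file in the system temp directory.
out_path = Path(tempfile.gettempdir()) / "clean_text.json"
out_path.write_text(json.dumps([record], ensure_ascii=False), encoding="utf-8")
print(record["clean"])
```

Keeping the original text alongside the cleaned version makes it easy to check during exploration what the cleaning removed.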