# Parsing Text (aka Prepping Text Data)

What is it?
- Breaking our text data into smaller components and reducing variability between words

Why do we care? 
- It lets us understand our data programmatically and gets us ready for exploration and modeling

Workflow

original text--->
1. lowercase text
2. remove accented and non-ASCII characters
3. remove special characters
4. tokenize the strings into discrete units
5. stem/lemmatize words
6. remove stopwords

ready for exploration!

## Let's see it in action

In [1]:
#standard imports
import pandas as pd
import numpy as np

### original text

In [2]:
original = "Paul Erdős and George Pólya were influential Hungarian mathematicians who contributed \
a lot to the field. Erdős's name contains the Hungarian letter 'ő' ('o' with double acute accent), \
but is often incorrectly written as Erdos or Erdös either by mistake or out of typographical necessity"

### 1. lowercase text

In [3]:
article = original.lower()
article

"paul erdős and george pólya were influential hungarian mathematicians who contributed a lot to the field. erdős's name contains the hungarian letter 'ő' ('o' with double acute accent), but is often incorrectly written as erdos or erdös either by mistake or out of typographical necessity"

### 2. remove any accented characters and non-ASCII characters

- `unicodedata.normalize` removes any inconsistencies in unicode character encoding
- `.encode` to convert the resulting string to the ASCII character set
- `.decode` to turn the resulting bytes object back into a string

Use `unicodedata.normalize().encode().decode()`

In [4]:
#import
import unicodedata

In [5]:
# normalize: split accented letters into a base letter plus combining marks
# encode: get rid of anything not in ascii
# decode: turn the bytes back into a string
article = unicodedata.normalize('NFKD', article).encode('ascii','ignore').decode('utf-8')
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field. erdos's name contains the hungarian letter 'o' ('o' with double acute accent), but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

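To see why the normalize step matters, here is a small sketch on a single word. The sample words are my own, not part of the article text. It shows that NFKD splits 'ő' into a plain 'o' plus a combining accent, and that a letter with no decomposition (like 'ø') is simply dropped by the ASCII step.

```python
import unicodedata

word = "erd\u0151s"  # "erdős", written with an escape so the code point is unambiguous

# NFKD splits each accented letter into a base letter plus combining marks
decomposed = unicodedata.normalize("NFKD", word)
print(len(word), len(decomposed))
print([unicodedata.name(ch) for ch in decomposed])

# encoding to ASCII with errors="ignore" drops the combining mark
ascii_word = decomposed.encode("ascii", "ignore").decode("ascii")
print(ascii_word)

# caveat: "ø" has no decomposition, so the whole letter disappears
dropped = unicodedata.normalize("NFKD", "s\u00f8ren").encode("ascii", "ignore").decode("ascii")
print(dropped)
```

So this step is lossy: it works well for accents, but some letters vanish instead of being mapped to a close ASCII letter.
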
### 3. remove special characters

- remove anything that isn't a lowercase letter (a-z), a digit, a single quote, or whitespace

In [6]:
#import regular expression operations
import re

In [7]:
#use re.sub to remove special characters
article = re.sub(r'[^a-z0-9\'\s]', '', article)
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos's name contains the hungarian letter 'o' 'o' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

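A quick sketch of the same pattern on a made-up sentence (the sample strings are assumptions, not from the article). It also shows why lowercasing has to come first: the character class only keeps lowercase letters.

```python
import re

sample = "erdos's proof (1932) cost $5.00, wow!"

# the ^ inside [...] negates the class: match every character that is NOT
# a lowercase letter, a digit, a single quote, or whitespace
cleaned = re.sub(r"[^a-z0-9'\s]", "", sample)
print(cleaned)

# caveat: uppercase letters are not in the class, so they get removed too
not_lowered = re.sub(r"[^a-z0-9'\s]", "", "Erdos")
print(not_lowered)
```

Note that "5.00" turns into "500" because the period is removed, so this pattern is a poor fit for text where decimals matter.
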
### 4. tokenize

Tokenization is the process of breaking something down into smaller, discrete units. These units are called tokens.

It's common to tokenize strings to break words and any leftover punctuation into discrete units.

Use `nltk.tokenize.ToktokTokenizer`

In [8]:
#import natural language toolkit
import nltk

In [9]:
#create the tokenizer
tokenize = nltk.tokenize.ToktokTokenizer()
tokenize

<nltk.tokenize.toktok.ToktokTokenizer at 0x15cb5eb20>

In [10]:
#use the tokenizer
article = tokenize.tokenize(article, return_str=True)
article

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necessity"

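To get a feel for what tokenizing does, here is a rough stand-in built with a regular expression. This is not Toktok's actual rule set, just a simplified sketch that happens to behave similarly on already-cleaned text like ours.

```python
import re

text = "erdos's name contains the letter 'o'"

# simplified stand-in for a tokenizer (not Toktok's rules):
# a token is either a run of letters/digits or a single quote
tokens = re.findall(r"[a-z0-9]+|'", text)
print(tokens)

# joining with spaces gives a string where every token stands alone
print(" ".join(tokens))
```

The single quote becomes its own token, which is why it shows up as a separate item later when we split the article into words.
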
### 5. stem or lemmatize words (choose one!)

Stemming
- **truncates** words to their "stem"
- algorithmic rules (non-linguistic)
- example: "calls", "called", "calling" --> "call"
- fast and efficient


Lemmatize
- **changes** words to their "root"
- it can reduce inflected forms to the base (dictionary) word
- example: "mouse", "mice" --> "mouse"
- slower than stemming
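
The difference between the two ideas can be sketched with toy versions. Both functions and the mini dictionary below are hypothetical and far simpler than the real Porter stemmer or WordNet lemmatizer; they only illustrate "chop off a suffix" versus "look the word up".

```python
def toy_stem(word):
    """Strip the first matching suffix; a toy rule set, far simpler than Porter."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# hypothetical mini dictionary standing in for a real lexical database
TOY_LEMMAS = {"mice": "mouse", "calls": "call", "called": "call", "calling": "call"}


def toy_lemmatize(word):
    """Look the word up; fall back to the word itself when it is unknown."""
    return TOY_LEMMAS.get(word, word)


sample_words = ["calls", "called", "calling", "mice"]
stemmed = [toy_stem(w) for w in sample_words]
lemmatized = [toy_lemmatize(w) for w in sample_words]
print(stemmed)
print(lemmatized)
```

The suffix rules have no way to connect "mice" to "mouse", while the lookup can, at the cost of needing a dictionary.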

#### stemmer

Use `nltk.porter.PorterStemmer`

In [11]:
#create porter stemmer
ps = nltk.porter.PorterStemmer()
ps

<PorterStemmer>

In [12]:
#test stemmer
ps.stem('calling'), ps.stem('calls'), ps.stem('called'), ps.stem('call')

('call', 'call', 'call', 'call')

In [13]:
ps.stem('mouse'), ps.stem('mice')

('mous', 'mice')

In [14]:
#calling stem on the whole string treats it as one word, so only the end of the string is stemmed
ps.stem(article)

"paul erdos and george polya were influential hungarian mathematicians who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written as erdos or erdos either by mistake or out of typographical necess"

In [15]:
# split all the words in the article
article.split()

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematicians',
 'who',
 'contributed',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdos',
 "'",
 's',
 'name',
 'contains',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'double',
 'acute',
 'accent',
 'but',
 'is',
 'often',
 'incorrectly',
 'written',
 'as',
 'erdos',
 'or',
 'erdos',
 'either',
 'by',
 'mistake',
 'or',
 'out',
 'of',
 'typographical',
 'necessity']

In [16]:
#use stemmer - apply stem to each word individually
stems = [ps.stem(word) for word in article.split()]
stems[:10]

['paul',
 'erdo',
 'and',
 'georg',
 'polya',
 'were',
 'influenti',
 'hungarian',
 'mathematician',
 'who']

In [17]:
#join words back together
article_stemmed = ' '.join(stems)
article_stemmed

"paul erdo and georg polya were influenti hungarian mathematician who contribut a lot to the field erdo ' s name contain the hungarian letter ' o ' ' o ' with doubl acut accent but is often incorrectli written as erdo or erdo either by mistak or out of typograph necess"

#### lemmatize

Use `nltk.stem.WordNetLemmatizer`

In [18]:
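# download the first time ('all' is large; the lemmatizer mainly needs the 'wordnet' corpus,
# and some nltk versions also ask for 'omw-1.4')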
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/rosendo/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /Users/rosendo/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /Users/rosendo/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /Users/rosendo/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     /Users/rosendo/nltk_data...
[nltk_data]    |   Package basque_grammars is already up-to-date!
[nltk_data]    | Downloading pa

True

In [19]:
#create the lemmatizer
wnl = nltk.stem.WordNetLemmatizer()
wnl

<WordNetLemmatizer>

In [20]:
#test lemmatizer
wnl.lemmatize('mouse'), wnl.lemmatize('mice')

('mouse', 'mouse')

In [21]:
#use lemmatizer - apply lemmatize to each word in our string
# wnl.lemmatize(article) would treat the whole string as a single word
lemma = [wnl.lemmatize(word) for word in article.split()]
lemma

['paul',
 'erdos',
 'and',
 'george',
 'polya',
 'were',
 'influential',
 'hungarian',
 'mathematician',
 'who',
 'contributed',
 'a',
 'lot',
 'to',
 'the',
 'field',
 'erdos',
 "'",
 's',
 'name',
 'contains',
 'the',
 'hungarian',
 'letter',
 "'",
 'o',
 "'",
 "'",
 'o',
 "'",
 'with',
 'double',
 'acute',
 'accent',
 'but',
 'is',
 'often',
 'incorrectly',
 'written',
 'a',
 'erdos',
 'or',
 'erdos',
 'either',
 'by',
 'mistake',
 'or',
 'out',
 'of',
 'typographical',
 'necessity']

In [22]:
#join words back together
article_lemma = ' '.join(lemma)
article_lemma

"paul erdos and george polya were influential hungarian mathematician who contributed a lot to the field erdos ' s name contains the hungarian letter ' o ' ' o ' with double acute accent but is often incorrectly written a erdos or erdos either by mistake or out of typographical necessity"

### 6. remove stopwords

Words that carry little or no significance, especially when constructing meaningful features from text, are known as stopwords.
- examples: a, an, the, and

We will use the standard English stopword list from nltk.

Use `nltk.corpus.stopwords`

In [23]:
#import stopwords list
from nltk.corpus import stopwords

In [24]:
#only need to do once
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rosendo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [25]:
#save stopwords
stopwords_ls = stopwords.words('english')
stopwords_ls[:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [26]:
stopwords_ls.sort()

In [27]:
stopwords_ls[:10]

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

In [28]:
#set a list to remove some stopwords
extra = ['all', 'about','after']
extra

['all', 'about', 'after']

In [29]:
set([1,5,5,2])

{1, 2, 5}

In [36]:
# remove extra words with set difference (-); sets do not support +,
# which raises TypeError: unsupported operand type(s) for +: 'set' and 'set'
set(stopwords_ls) - set(extra)

In [32]:
#add to stopword list
stopwords_ls.append('o')

In [33]:
len(stopwords_ls)

180

In [34]:
#remove from stopword list
stopwords_ls.remove('o')

In [35]:
len(stopwords_ls)

179

In [None]:
#split words in lemmatized article
words = article_lemma.split()
words[:10]

In [None]:
#add to stopword list
stopwords_ls.append("'")
stopwords_ls.append("o")

In [None]:
#remove stopwords from list of words
filtered = [word for word in words if word not in stopwords_ls]
filtered

In [None]:
#show how many words we removed
len(words) - len(filtered)
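
The add/remove steps above can be wrapped into one helper. This is a minimal sketch: `remove_stopwords`, `extra_words`, and `exclude_words` are names I made up for illustration, and the short stopword list stands in for `stopwords.words('english')`.

```python
def remove_stopwords(words, stopword_list, extra_words=(), exclude_words=()):
    """Drop stopwords from a list of tokens.

    extra_words are added to the stopword list; exclude_words are taken out of it.
    """
    # set union (|) and difference (-) return new sets; + is not defined for sets
    active = (set(stopword_list) | set(extra_words)) - set(exclude_words)
    return [word for word in words if word not in active]


# small hypothetical stopword list, standing in for stopwords.words('english')
sample_stopwords = ["a", "about", "after", "all", "and", "the"]
tokens = ["the", "letter", "'", "o", "'", "and", "all", "about", "accents"]

filtered_tokens = remove_stopwords(
    tokens, sample_stopwords, extra_words=["'", "o"], exclude_words=["all", "about"]
)
print(filtered_tokens)

# how many tokens were removed
print(len(tokens) - len(filtered_tokens))
```

Using a set for the lookup also keeps the `not in` check fast when the word list is long.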

#### ready for exploration!

In [None]:
original

In [None]:
def basic_clean(article):
    """
    Lowercase, strip accents and special characters, tokenize, lemmatize,
    and remove stopwords from a string. Returns the cleaned string.
    """
    # 1. lowercase text
    article = article.lower()

    # 2. remove accented and non-ASCII characters:
    # normalize, drop anything not in ascii, turn the bytes back into a string
    article = unicodedata.normalize('NFKD', article).encode('ascii','ignore').decode('utf-8')

    # 3. remove special characters (keep a-z, digits, single quotes, whitespace)
    article = re.sub(r'[^a-z0-9\'\s]', '', article)

    # 4. tokenize: break the string into discrete units (tokens)
    tokenize = nltk.tokenize.ToktokTokenizer()
    article = tokenize.tokenize(article, return_str=True)

    # 5. lemmatize each word (slower than stemming, but returns real words)
    wnl = nltk.stem.WordNetLemmatizer()
    lemma = [wnl.lemmatize(word) for word in article.split()]

    # 6. remove stopwords
    stopwords_ls = stopwords.words('english')

    # add the leftover single quote token to the stopword list
    stopwords_ls.append("'")

    # optional: keep some words by taking them out of the stopword list
    # extra = ['all', 'about', 'after']
    # stopwords_ls = list(set(stopwords_ls) - set(extra))

    # remove stopwords from list of words
    filtered = [word for word in lemma if word not in stopwords_ls]

    # join words back together
    parsed_article = ' '.join(filtered)

    return parsed_article

In [None]:
#join words back together
parsed_article = ' '.join(filtered)
parsed_article

In [None]:
parsed_article = basic_clean(original)
parsed_article