# TOC

  __Chapter 2 - Text wrangling and processing__

1. [Import](#Import)
1. [Text wrangling](#Text-wrangling)
1. [Tokenization](#Tokenization)
1. [Stemming](#Stemming)
1. [Lemmatization](#Lemmatization)
1. [Stop word removal](#Stop-word-removal)
1. [Spelling correction](#Spelling-correction)

# Import

<a id = 'Import'></a>

In [1]:
# Standard library and settings
import os
import sys
import importlib
import itertools
import warnings; warnings.simplefilter('ignore')
from IPython.display import display, HTML; display(HTML("<style>.container { width:95% !important; }</style>"))

# Data extensions and settings
import numpy as np
np.set_printoptions(threshold = np.inf, suppress = True)
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.options.display.float_format = '{:,.6f}'.format

# Modeling extensions
import nltk

# Visualization extensions and settings
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set_style('whitegrid')


# Text wrangling


<a id = 'Text-wrangling'></a>

In [3]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [5]:
# split into sentences using sent_tokenize
from nltk.tokenize import sent_tokenize
inputstring = 'this is an example sent. the sentence splitter will split on sent markers. Ohh really!!'

all_sent = sent_tokenize(inputstring)
print(all_sent)


['this is an example sent.', 'the sentence splitter will split on sent markers.', 'Ohh really!', '!']


In [6]:
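# instantiate a Punkt sentence tokenizer; training text can be passed to the
# constructor so that it learns sentence boundaries from a custom corpus
from nltk.tokenize import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()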
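
For comparison, a rule-based sentence splitter can be sketched with the standard library alone. The function name `naive_sent_split` and its regular expression are illustrative assumptions, not part of NLTK, and unlike Punkt this rule does not handle abbreviations such as 'Dr.'.

```python
import re

def naive_sent_split(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    # The lookbehind keeps the punctuation attached to the sentence it ends.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    # Drop empty strings, which appear when the input is empty.
    return [part for part in parts if part]

inputstring = 'this is an example sent. the sentence splitter will split on sent markers. Ohh really!!'
print(naive_sent_split(inputstring))
```

Because the split only happens at whitespace, the trailing '!!' stays attached to the last sentence.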


# Tokenization

A token is the smallest unit of text that a program evaluates and processes. It is most often a word, though punctuation marks and numbers can be tokens too. Tokenization is the process of splitting text into a collection of these individual tokens.


<a id = 'Tokenization'></a>

In [7]:
# simple split using basic Python
s = 'Hi everyone! hola gr8'
print(s.split())


['Hi', 'everyone!', 'hola', 'gr8']


In [8]:
# simple split using nltk's word_tokenize, which separates punctuation from words
from nltk.tokenize import word_tokenize
word_tokenize(s)


['Hi', 'everyone', '!', 'hola', 'gr8']

In [9]:
# basic examples with various tokenizers
from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
# raw strings keep the backslashes in the regex patterns intact
print(regexp_tokenize(s, pattern = r'\w+'))
print(regexp_tokenize(s, pattern = r'\d+'))
print(wordpunct_tokenize(s))
print(blankline_tokenize(s))


['Hi', 'everyone', 'hola', 'gr8']
['8']
['Hi', 'everyone', '!', 'hola', 'gr8']
['Hi everyone! hola gr8']
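
The regular-expression tokenizers above behave much like `re.findall` from the standard library. The following is a minimal sketch of that idea, assuming the pattern `\w+|[^\w\s]+` that NLTK documents for `wordpunct_tokenize`; the variable names are illustrative only.

```python
import re

s = 'Hi everyone! hola gr8'

# Runs of word characters (letters, digits, underscore).
word_tokens = re.findall(r'\w+', s)

# Runs of digits only.
digit_tokens = re.findall(r'\d+', s)

# Runs of word characters OR runs of punctuation, so '!' becomes its own token.
wordpunct_tokens = re.findall(r'\w+|[^\w\s]+', s)

print(word_tokens)
print(digit_tokens)
print(wordpunct_tokens)
```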


# Stemming

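Stemming is the process of reducing a token to its stem by stripping suffixes, e.g. reducing 'eating' to 'eat'. The rules are heuristic, so the resulting stem is not guaranteed to be a dictionary word.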


<a id = 'Stemming'></a>

In [12]:
# basic stemming examples
from nltk.stem import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

pst = PorterStemmer()
lst = LancasterStemmer()
print(lst.stem('eating'))
print(pst.stem('shopping'))


eat
shop
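
To make the idea of suffix stripping concrete, here is a toy stemmer written from scratch. It is not the Porter or Lancaster algorithm, which apply many more rules; the function `toy_stem` and its three suffixes are assumptions chosen only for illustration.

```python
def toy_stem(word):
    """Strip one common suffix; a rough illustration, not the Porter algorithm."""
    for suffix in ('ing', 'ed', 's'):
        # Require at least three remaining characters so 'sing' is left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[:-len(suffix)]
            # Undo consonant doubling ('shopp' -> 'shop'), but keep
            # endings such as 'll' and 'ss' ('fall', 'miss').
            if (suffix in ('ing', 'ed')
                    and stem[-1] == stem[-2]
                    and stem[-1] not in 'aeioulsz'):
                stem = stem[:-1]
            return stem
    return word

for token in ('eating', 'shopping', 'falling', 'sing'):
    print(token, '->', toy_stem(token))
```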


# Lemmatization

Lemmatization is a more precise way of converting tokens to their roots. It uses context and part of speech to determine the root, known as the lemma.


<a id = 'Lemmatization'></a>

In [20]:
# lemmatization that uses wordnet, a semantic dictionary for performing lookups
from nltk.stem import WordNetLemmatizer
wlem = WordNetLemmatizer()
wlem.lemmatize("dogs")


'dog'
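
The part of speech matters because the same surface form can have different lemmas. NLTK's `WordNetLemmatizer.lemmatize` accepts a `pos` argument that defaults to noun. The sketch below mimics that behaviour with a tiny hand-written lookup table; `TOY_LEMMAS` and `toy_lemmatize` are hypothetical stand-ins for WordNet, not part of NLTK.

```python
# Hypothetical lookup table standing in for WordNet: (word, part of speech) -> lemma.
TOY_LEMMAS = {
    ('dogs', 'n'): 'dog',
    ('meeting', 'n'): 'meeting',  # the noun, as in "a meeting", is its own lemma
    ('meeting', 'v'): 'meet',     # the verb form reduces to "meet"
    ('better', 'a'): 'good',
}

def toy_lemmatize(word, pos='n'):
    """Return the lemma for (word, pos), or the word unchanged if unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize('dogs'))
print(toy_lemmatize('meeting', pos='n'))
print(toy_lemmatize('meeting', pos='v'))
print(toy_lemmatize('better', pos='a'))
```

A stemmer has no way to map 'better' to 'good', which is why a dictionary-backed lookup is described as more precise.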

# Stop word removal

Stop word removal is the process of removing words that occur commonly across documents and generally carry little significance. Stop word lists are typically hand-curated.


<a id = 'Stop-word-removal'></a>

In [24]:
# remove stop words from a sample sentence
from nltk.corpus import stopwords
stoplist = stopwords.words('english')
text = 'this is just a test, only a test'
cleanwords = [word for word in text.split() if word not in stoplist]
print(cleanwords)


['test,', 'test']
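
Note that 'test,' keeps its trailing comma in the output above because `str.split` only splits on whitespace. One way to avoid this is to tokenize with a regular expression before filtering. The sketch below uses only the standard library; `TOY_STOPLIST` is a small hand-picked set assumed for illustration, not NLTK's English stop word list.

```python
import re

# Hypothetical, hand-picked stop words; NLTK's English list is much longer.
TOY_STOPLIST = {'this', 'is', 'just', 'a', 'only'}

text = 'this is just a test, only a test'

# Lowercase first, then keep runs of word characters so punctuation is dropped.
tokens = re.findall(r'\w+', text.lower())
cleanwords = [token for token in tokens if token not in TOY_STOPLIST]
print(cleanwords)
```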


# Spelling correction

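NLTK includes an edit_distance function that computes the Levenshtein distance between two strings: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. This distance can be used to perform fuzzy string matching.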


<a id = 'Spelling-correction'></a>

In [29]:
# calculate Levenshtein distance between two words
from nltk.metrics import edit_distance
edit_distance('rain','shine')


3
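
To show how an edit distance can drive spelling correction, here is a minimal from-scratch sketch. The `levenshtein` function is a plain dynamic-programming implementation written for illustration (it is not NLTK's code), and `closest_word` and the three-word `vocabulary` are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def closest_word(word, vocabulary):
    """Pick the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: levenshtein(word, candidate))

vocabulary = ['rain', 'shine', 'train']  # hypothetical word list
print(levenshtein('rain', 'shine'))
print(closest_word('rainn', vocabulary))
```

A practical spelling corrector would also weigh word frequency, since several candidates are often tied at the same distance.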