## Text Preprocessing 
- Tokenization
- Stemming and Lemmatization
- Stop words
- Part-of-speech (POS) tagging

In [1]:
import nltk
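# Run once to fetch the tokenizer models and stop-word list; a LookupError names any missing resource
#nltk.download()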

In [2]:
paragraph = """Wildfire status, Being Held. Good news: overnight, parts of Jasper received over 30.5 mm of rain! 
Rain is expected to continue into tonight which, coupled with cooler weather, will further decrease fire activity. 
Caution: Heavy periods of rain can cause slopes and burnt trees to become unstable. 
Residents should use caution as gusty or shifting wind conditions can cause fire-weakened trees with loose roots to fall. 
Danger tree assessments adjacent to park highways have been completed to be safe for vehicles ONLY. 
This is an important distinction as only roads within the townsite and Pyramid Lake Road have been assessed to a safe standard for bicycling, running, and walking.
We appreciate your patience in respecting park closures and restrictions, put in place to keep people safe throughout the fire-affected areas of the park.
Due to the encouraging progress that fire crews have made within the Jasper Wildfire Complex, daily information updates will shift to weekly updates, unless this situation changes, and the fire becomes more active.
Recovery update
The Joint Recovery Coordination Centre (JRCC) is a partnership between the Municipality of Jasper and Parks Canada focused on helping our community navigate the recovery process. 
As we move ahead together, the Municipality of Jasper and Parks Canada will continue to keep residents and businesses up to date as progress is made towards recovery."""

### Tokenization
Tokenization converts a sequence of text into smaller parts, known as tokens. Here the tokens are sentences and words.

In [3]:
# Tokenizing sentences
sentences = nltk.sent_tokenize(paragraph)

# Tokenizing words
words = nltk.word_tokenize(paragraph)

In [4]:
sentences[0:5]

['Wildfire status, Being Held.',
 'Good news: overnight, parts of Jasper received over 30.5 mm of rain!',
 'Rain is expected to continue into tonight which, coupled with cooler weather, will further decrease fire activity.',
 'Caution: Heavy periods of rain can cause slopes and burnt trees to become unstable.',
 'Residents should use caution as gusty or shifting wind conditions can cause fire-weakened trees with loose roots to fall.']

In [5]:
words[0:5]

['Wildfire', 'status', ',', 'Being', 'Held']

In [6]:
len(sentences)

11
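To make the idea of a token concrete, here is a rough regex-based word splitter. It is only a sketch: `simple_word_tokenize` is a hypothetical helper, not NLTK's algorithm, and it handles far fewer edge cases than `nltk.word_tokenize`.

```python
import re

def simple_word_tokenize(text):
    """Split text into word and punctuation tokens (rough approximation only)."""
    # Match, in order: decimal numbers, words (optionally joined by - or '),
    # or any single non-space punctuation character.
    return re.findall(r"\d+\.\d+|\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = simple_word_tokenize("Wildfire status, Being Held.")
print(tokens)
```

Punctuation marks come out as separate tokens, which matches the `','` seen in `words[0:5]` above.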

### Stemming and Lemmatization

Stemming is the process of reducing inflected words to their word stem (base form) by stripping suffixes with simple rules. The stem need not be a dictionary word. E.g. with the Porter stemmer: history -- histori; going -- go.

Lemmatization, on the other hand, converts words into their dictionary form (lemma), which is a human-readable word. E.g. history -- history (not histori); going, goes, gone -- go.

Note that lemmatization usually takes more time than stemming, because it relies on dictionary lookups rather than simple suffix rules.
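The contrast between rule-based stemming and dictionary-based lemmatization can be sketched with two toy functions. Everything here is an assumption made for illustration: `SUFFIXES`, `LEMMAS`, `toy_stem`, and `toy_lemmatize` are hypothetical and much cruder than NLTK's `PorterStemmer` or a real lemmatizer.

```python
# Toy stand-ins only; neither function is NLTK's implementation.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical, far from complete
LEMMAS = {"going": "go", "goes": "go", "gone": "go", "went": "go"}  # hypothetical lookup table

def toy_stem(word):
    """Strip the first matching suffix by rule; the result may not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least two characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary; unknown words are returned unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["going", "goes", "gone", "went"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The suffix rules leave irregular forms such as gone and went untouched, while the lookup maps all four words to go. That lookup is where the extra cost of lemmatization comes from.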

### Stop words
Stop words are words that are so widely used that they carry very little useful information, e.g. the, is, of.

In [7]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [8]:
stemmer = PorterStemmer()

In [9]:
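# Build the stop-word set once instead of once per word
stop_words = set(stopwords.words('english'))
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # Drop stop words (case-sensitive check, so 'Being' is kept), then stem the rest
    words = [stemmer.stem(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)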
    

In [10]:
sentences[0:5]

['wildfir statu , be held .',
 'good news : overnight , part jasper receiv 30.5 mm rain !',
 'rain expect continu tonight , coupl cooler weather , decreas fire activ .',
 'caution : heavi period rain caus slope burnt tree becom unstabl .',
 'resid use caution gusti shift wind condit caus fire-weaken tree loos root fall .']
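
In the first output line, `be` survives most likely because the token `Being` is capitalized while the stop-word list is lowercase, so the membership test misses it. A minimal sketch of a case-insensitive filter follows. `STOP_WORDS` is a hypothetical mini list for illustration and `remove_stop_words` is a hypothetical helper; the notebook itself uses `stopwords.words('english')`.

```python
# Hypothetical mini stop list; a real run would use stopwords.words('english').
STOP_WORDS = {"being", "is", "to", "of", "the", "and", "with"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is a stop word."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["Wildfire", "status", ",", "Being", "Held", "."]
print(remove_stop_words(tokens))
```

Comparing `tok.lower()` rather than `tok` removes `Being` before stemming, so the stemmed sentence would no longer contain `be`.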