This notebook performs basic text processing with NLTK. The steps are:
1. Load text file
2. Split text into sentences
3. Split text into words/tokens
4. Filter punctuation tokens
5. Filter stop words
6. Stemming

### Read text file

In [2]:
import nltk

In [3]:
filename = 'FP.txt'

In [4]:
file = open(filename, 'rt')

In [5]:
text = file.read()

In [6]:
file.close()

In [7]:
text

'A financial crisis is any of a broad variety of situations in which some financial assets suddenly lose a large part of their nominal value. In the 19th and early 20th centuries, many financial crises were associated with banking panics, and many recessions coincided with these panics. Other situations that are often called financial crises include stock market crashes and the bursting of other financial bubbles, currency crises, and sovereign defaults.[1][2] Financial crises directly result in a loss of paper wealth but do not necessarily result in significant changes in the real economy (e.g. the crisis resulting from the famous tulip mania bubble in the 17th century).\n\nMany economists have offered theories about how financial crises develop and how they could be prevented. There is no consensus, however, and financial crises continue to occur from time to time.'
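The open/read/close sequence above can also be written with a context manager, which closes the file even if reading fails. The following is a minimal sketch, not part of the original notebook: the helper name `read_text` and the explicit UTF-8 encoding are assumptions, and a throwaway file stands in for FP.txt so the cell runs on its own.

```python
from pathlib import Path
import tempfile


def read_text(path, encoding="utf-8"):
    """Return the full contents of a text file; the file is closed automatically."""
    with open(path, "rt", encoding=encoding) as handle:
        return handle.read()


# Demonstrate on a temporary file so the sketch does not depend on FP.txt.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("A financial crisis.\n\nMany economists.", encoding="utf-8")
    sample_text = read_text(sample_path)
```

With the notebook's own file, the call would be `text = read_text(filename)`.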

### Split text into sentences

In [8]:
from nltk import sent_tokenize

In [9]:
sentences = sent_tokenize(text)

In [13]:
len(sentences)

7

In [12]:
for sent in sentences:
    print(sent,'\n')

A financial crisis is any of a broad variety of situations in which some financial assets suddenly lose a large part of their nominal value. 

In the 19th and early 20th centuries, many financial crises were associated with banking panics, and many recessions coincided with these panics. 

Other situations that are often called financial crises include stock market crashes and the bursting of other financial bubbles, currency crises, and sovereign defaults. 

[1][2] Financial crises directly result in a loss of paper wealth but do not necessarily result in significant changes in the real economy (e.g. 

the crisis resulting from the famous tulip mania bubble in the 17th century). 

Many economists have offered theories about how financial crises develop and how they could be prevented. 

There is no consensus, however, and financial crises continue to occur from time to time. 
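The tokenizer reports 7 sentences although the text contains 6: the citation markers [1][2] are attached to the start of the fourth sentence, and that sentence is split after "(e.g.". One possible way to handle the citation markers is to remove them before tokenizing. This is a sketch under the assumption that markers always look like a number in square brackets; the helper name `strip_citation_markers` is hypothetical.

```python
import re

# Assumption: citation markers are digits wrapped in square brackets, e.g. [1] or [12].
CITATION_MARKER = re.compile(r"\[\d+\]")


def strip_citation_markers(raw_text):
    """Remove bracketed numeric citation markers from the text."""
    return CITATION_MARKER.sub("", raw_text)


sample = "currency crises, and sovereign defaults.[1][2] Financial crises directly result in a loss of paper wealth"
cleaned = strip_citation_markers(sample)
```

The cleaned text could then be passed to `sent_tokenize`. This does not address the split after "(e.g.", which is a separate abbreviation-handling issue.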



### Split text into words/tokens

In [18]:
from nltk import word_tokenize

In [15]:
tokens = word_tokenize(text)

In [17]:
len(tokens)

159

In [21]:
# word_tokenize splits on whitespace and separates punctuation marks into their own tokens
print(tokens)

['A', 'financial', 'crisis', 'is', 'any', 'of', 'a', 'broad', 'variety', 'of', 'situations', 'in', 'which', 'some', 'financial', 'assets', 'suddenly', 'lose', 'a', 'large', 'part', 'of', 'their', 'nominal', 'value', '.', 'In', 'the', '19th', 'and', 'early', '20th', 'centuries', ',', 'many', 'financial', 'crises', 'were', 'associated', 'with', 'banking', 'panics', ',', 'and', 'many', 'recessions', 'coincided', 'with', 'these', 'panics', '.', 'Other', 'situations', 'that', 'are', 'often', 'called', 'financial', 'crises', 'include', 'stock', 'market', 'crashes', 'and', 'the', 'bursting', 'of', 'other', 'financial', 'bubbles', ',', 'currency', 'crises', ',', 'and', 'sovereign', 'defaults', '.', '[', '1', ']', '[', '2', ']', 'Financial', 'crises', 'directly', 'result', 'in', 'a', 'loss', 'of', 'paper', 'wealth', 'but', 'do', 'not', 'necessarily', 'result', 'in', 'significant', 'changes', 'in', 'the', 'real', 'economy', '(', 'e.g', '.', 'the', 'crisis', 'resulting', 'from', 'the', 'famous', 
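Note that `word_tokenize` is not a plain whitespace-and-punctuation split: in the output above, 'e.g' stays together as one token followed by '.'. For comparison, a naive regular-expression split breaks that abbreviation apart. The function name `simple_tokenize` and the pattern are illustrative assumptions, not what NLTK does internally.

```python
import re


def simple_tokenize(raw_text):
    """Naive tokenizer: runs of word characters, or single punctuation characters."""
    return re.findall(r"\w+|[^\w\s]", raw_text)


naive_tokens = simple_tokenize("the real economy (e.g. the crisis)")
print(naive_tokens)
# ['the', 'real', 'economy', '(', 'e', '.', 'g', '.', 'the', 'crisis', ')']
```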

### Filter punctuation tokens

In [25]:
alpha_tokens = [token for token in tokens if token.isalpha()]

In [32]:
len(alpha_tokens)

134

In [26]:
print(alpha_tokens)

['A', 'financial', 'crisis', 'is', 'any', 'of', 'a', 'broad', 'variety', 'of', 'situations', 'in', 'which', 'some', 'financial', 'assets', 'suddenly', 'lose', 'a', 'large', 'part', 'of', 'their', 'nominal', 'value', 'In', 'the', 'and', 'early', 'centuries', 'many', 'financial', 'crises', 'were', 'associated', 'with', 'banking', 'panics', 'and', 'many', 'recessions', 'coincided', 'with', 'these', 'panics', 'Other', 'situations', 'that', 'are', 'often', 'called', 'financial', 'crises', 'include', 'stock', 'market', 'crashes', 'and', 'the', 'bursting', 'of', 'other', 'financial', 'bubbles', 'currency', 'crises', 'and', 'sovereign', 'defaults', 'Financial', 'crises', 'directly', 'result', 'in', 'a', 'loss', 'of', 'paper', 'wealth', 'but', 'do', 'not', 'necessarily', 'result', 'in', 'significant', 'changes', 'in', 'the', 'real', 'economy', 'the', 'crisis', 'resulting', 'from', 'the', 'famous', 'tulip', 'mania', 'bubble', 'in', 'the', 'century', 'Many', 'economists', 'have', 'offered', 'theo
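The `isalpha()` filter removes more than punctuation: tokens that mix letters with digits or dots, such as '19th', '20th' and 'e.g', are dropped as well. If those tokens should be kept, one alternative is to keep any token that contains at least one alphanumeric character. This is a sketch on a small hand-picked sample of the tokens above, not a replacement for the original cell.

```python
# A small sample of tokens taken from the word_tokenize output above.
sample_tokens = ['In', 'the', '19th', 'and', 'early', '20th', 'centuries',
                 ',', '(', 'e.g', '.', '[', '1', ']']

# Original filter: letters only.
alpha_only = [token for token in sample_tokens if token.isalpha()]

# Alternative: keep tokens with at least one letter or digit.
has_alnum = [token for token in sample_tokens
             if any(ch.isalnum() for ch in token)]

print(alpha_only)  # ['In', 'the', 'and', 'early', 'centuries']
print(has_alnum)   # ['In', 'the', '19th', 'and', 'early', '20th', 'centuries', 'e.g', '1']
```

The alternative also keeps bare citation numbers such as '1', so whether it is preferable depends on what the later steps need.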

### Filter stop words

In [28]:
from nltk.corpus import stopwords

In [29]:
stop_words = stopwords.words('english')

In [31]:
len(stop_words)

179

In [30]:
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [33]:
filtered_tokens = [token for token in alpha_tokens if token not in stop_words]

In [34]:
len(filtered_tokens)

82

In [35]:
print(filtered_tokens)

['A', 'financial', 'crisis', 'broad', 'variety', 'situations', 'financial', 'assets', 'suddenly', 'lose', 'large', 'part', 'nominal', 'value', 'In', 'early', 'centuries', 'many', 'financial', 'crises', 'associated', 'banking', 'panics', 'many', 'recessions', 'coincided', 'panics', 'Other', 'situations', 'often', 'called', 'financial', 'crises', 'include', 'stock', 'market', 'crashes', 'bursting', 'financial', 'bubbles', 'currency', 'crises', 'sovereign', 'defaults', 'Financial', 'crises', 'directly', 'result', 'loss', 'paper', 'wealth', 'necessarily', 'result', 'significant', 'changes', 'real', 'economy', 'crisis', 'resulting', 'famous', 'tulip', 'mania', 'bubble', 'century', 'Many', 'economists', 'offered', 'theories', 'financial', 'crises', 'develop', 'could', 'prevented', 'There', 'consensus', 'however', 'financial', 'crises', 'continue', 'occur', 'time', 'time']


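The original text has 159 tokens. Keeping only the tokens for which isalpha() is true leaves 134; this drops punctuation but also tokens such as '19th', '20th', '17th' and 'e.g'. Removing stop words then leaves 82 tokens. Because the stop-word list is lowercase and the comparison is case-sensitive, capitalized stop words such as 'A', 'In', 'Other' and 'There' are still present.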
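One way to catch capitalized stop words such as 'A' and 'In' is to lowercase each token before the membership test, and to store the stop words in a set for faster lookups. The sketch below uses a tiny hand-written subset of the stop-word list printed above so that it runs without downloading the NLTK corpus; the variable names are assumptions.

```python
# Illustrative subset of the NLTK English stop words shown above.
stop_set = {'a', 'in', 'other', 'there', 'of', 'the'}

# A few tokens taken from the filtered output above.
sample_tokens = ['A', 'financial', 'crisis', 'In', 'early', 'centuries',
                 'Other', 'situations', 'There', 'consensus']

# Compare the lowercased token, but keep the original casing in the result.
case_insensitive = [token for token in sample_tokens
                    if token.lower() not in stop_set]

print(case_insensitive)
# ['financial', 'crisis', 'early', 'centuries', 'situations', 'consensus']
```

In the notebook itself, `stop_set` would be `set(stopwords.words('english'))` and the comprehension would run over `alpha_tokens`.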

### Stemming

In [55]:
from nltk.stem.porter import PorterStemmer

In [56]:
porter = PorterStemmer()

In [57]:
stem_tokens = [porter.stem(token) for token in filtered_tokens]

In [58]:
len(stem_tokens)

82

In [59]:
print(stem_tokens)

['A', 'financi', 'crisi', 'broad', 'varieti', 'situat', 'financi', 'asset', 'suddenli', 'lose', 'larg', 'part', 'nomin', 'valu', 'In', 'earli', 'centuri', 'mani', 'financi', 'crise', 'associ', 'bank', 'panic', 'mani', 'recess', 'coincid', 'panic', 'other', 'situat', 'often', 'call', 'financi', 'crise', 'includ', 'stock', 'market', 'crash', 'burst', 'financi', 'bubbl', 'currenc', 'crise', 'sovereign', 'default', 'financi', 'crise', 'directli', 'result', 'loss', 'paper', 'wealth', 'necessarili', 'result', 'signific', 'chang', 'real', 'economi', 'crisi', 'result', 'famou', 'tulip', 'mania', 'bubbl', 'centuri', 'mani', 'economist', 'offer', 'theori', 'financi', 'crise', 'develop', 'could', 'prevent', 'there', 'consensu', 'howev', 'financi', 'crise', 'continu', 'occur', 'time', 'time']


As shown above, the Porter stemmer performed poorly on this text: it stripped off suffixes so aggressively that many stems, such as 'financi', 'crisi' and 'famou', are no longer recognizable words.
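To see what the stemmer did to related words, it can help to group the original tokens by their stem. The sketch below hard-codes a few (token, stem) pairs copied from the outputs above, so it runs without NLTK; the variable names are assumptions.

```python
from collections import defaultdict

# (token, stem) pairs copied from filtered_tokens and stem_tokens above.
token_stem_pairs = [
    ("financial", "financi"),
    ("Financial", "financi"),
    ("crisis", "crisi"),
    ("crises", "crise"),
    ("result", "result"),
    ("resulting", "result"),
]

# Map each stem to the distinct tokens that produced it.
words_by_stem = defaultdict(list)
for token, stem in token_stem_pairs:
    if token not in words_by_stem[stem]:
        words_by_stem[stem].append(token)

for stem, words in words_by_stem.items():
    print(stem, "<-", words)
```

With the full lists, the pairs would come from `zip(filtered_tokens, stem_tokens)`. The grouping shows that 'result' and 'resulting' share a stem, while 'crisis' and 'crises' end up with different stems ('crisi' and 'crise').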