# PG AI - Natural Language Processing and Speech Recognition
# Implement Text Processing Using Stemming and Regular Expression after Noise Removal and Convert It into List of Phrases

DESCRIPTION

In this exercise, we process text using stemming and regular expressions after noise removal and convert it into a list of phrases using NLP.<br>
The goal of stemming is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. Regular expressions serve a different purpose: they match character patterns, which makes them useful for removing noise and filtering tokens. We use the stemmer that ships with NLTK.<br>
The Natural Language Toolkit, or NLTK for short, is a Python library for working with and modeling text.<br>
It provides good tools for loading and cleaning text, which we can use to get our data ready for machine learning and deep learning algorithms.<br>
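
As a quick illustration of what stemming does, the sketch below runs NLTK's `PorterStemmer` on a few inflected forms that appear in the sample text. The word list is picked here for illustration only, and the snippet assumes NLTK is installed.

```python
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

# Illustrative word list (an assumption, picked from the sample text):
# singular/plural pairs should collapse to the same stem.
samples = ['loan', 'loans', 'bank', 'banks', 'borrowers']
stems = {word: porter.stem(word) for word in samples}

for word, stem in stems.items():
    print(f'{word} -> {stem}')
```

Note that a stem is not always a dictionary word; the stemmer only guarantees that related forms map to the same string.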

Steps to be followed:
1. Import NLTK
2. Split into Sentences
3. Split into Words
4. Filter Out Punctuation
5. Filter Out Stop Words (and Pipeline)
6. Stem Words

By Edson Teixeira<br>
teixeiraedson252@gmail.com <br>
December 29th 2021

In [1]:
# Step 1: Import NLTK
import nltk

In [2]:
# Step 2: Split into Sentences
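# Load data
filename = 'FP.txt'
# A context manager closes the file automatically. The file appears to start
# with a byte order mark (shown as '\ufeff' in the token output); passing
# encoding='utf-8-sig' to open() should strip it.
with open(filename, 'rt') as file:
    text = file.read()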

# Split into Sentences
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences[0])

﻿This is financial Report of 2019-2020



The share of large borrowers in Indian banks’ total loan portfolios stood at 53% as on March 2019

The gross non-performing assets (NPAs) as a percentage of total loans stood at 9.3% as on March 2019

Mumbai: Indian banks continue to see an improvement in asset quality with bad loans as a percentage of total loans expected to fall to 9% by March 2020, according to the Financial Stability Report released by the Reserve Bank of India (RBI) on Thursday.
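
The title mentions noise removal, which the cell above does not perform. A minimal sketch of one way to do it with the standard `re` module follows. The helper name `remove_noise` and the sample string are assumptions made for illustration; the sample mimics the first lines of the output above.

```python
import re

def remove_noise(raw_text):
    """Strip a leading byte order mark and collapse runs of whitespace."""
    # '\ufeff' is not whitespace, so it has to be stripped explicitly.
    cleaned = raw_text.lstrip('\ufeff')
    # Replace any run of spaces, tabs, or newlines with a single space.
    cleaned = re.sub(r'\s+', ' ', cleaned)
    return cleaned.strip()

# Hypothetical sample modeled on the start of the file.
sample = '\ufeffThis is financial Report of 2019-2020\n\n\n\nThe share of large borrowers'
print(remove_noise(sample))
```

Collapsing newlines joins heading lines that have no final period, so whether to apply this before or after sentence splitting depends on the data.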


In [3]:
# Step 3: Split into Words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
print(tokens[:100])

['\ufeffThis', 'is', 'financial', 'Report', 'of', '2019-2020', 'The', 'share', 'of', 'large', 'borrowers', 'in', 'Indian', 'banks', '’', 'total', 'loan', 'portfolios', 'stood', 'at', '53', '%', 'as', 'on', 'March', '2019', 'The', 'gross', 'non-performing', 'assets', '(', 'NPAs', ')', 'as', 'a', 'percentage', 'of', 'total', 'loans', 'stood', 'at', '9.3', '%', 'as', 'on', 'March', '2019', 'Mumbai', ':', 'Indian', 'banks', 'continue', 'to', 'see', 'an', 'improvement', 'in', 'asset', 'quality', 'with', 'bad', 'loans', 'as', 'a', 'percentage', 'of', 'total', 'loans', 'expected', 'to', 'fall', 'to', '9', '%', 'by', 'March', '2020', ',', 'according', 'to', 'the', 'Financial', 'Stability', 'Report', 'released', 'by', 'the', 'Reserve', 'Bank', 'of', 'India', '(', 'RBI', ')', 'on', 'Thursday', '.', 'The', 'gross', 'non-performing']
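
For comparison, word splitting can also be approximated with a regular expression. The pattern below is a hypothetical one written for this sample, not the rule set `word_tokenize` uses internally; it keeps hyphenated words and decimal numbers together and emits each punctuation mark as its own token.

```python
import re

# Hypothetical pattern: a word or number, optionally continued by
# hyphen- or dot-joined parts, or else a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-.]\w+)*|[^\w\s]")

def regex_tokenize(raw_text):
    """Return all pattern matches in order of appearance."""
    return TOKEN_PATTERN.findall(raw_text)

sample = ('The gross non-performing assets (NPAs) as a percentage of '
          'total loans stood at 9.3% as on March 2019')
print(regex_tokenize(sample))
```

On this sentence the result matches the corresponding tokens in the output above, but the two approaches can diverge on contractions and other edge cases.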


In [4]:
# Step 4: Filter Out Punctuation
# Remove all tokens that are not alphabetic
words = [word for word in tokens if word.isalpha()]
print(words[:100])

['is', 'financial', 'Report', 'of', 'The', 'share', 'of', 'large', 'borrowers', 'in', 'Indian', 'banks', 'total', 'loan', 'portfolios', 'stood', 'at', 'as', 'on', 'March', 'The', 'gross', 'assets', 'NPAs', 'as', 'a', 'percentage', 'of', 'total', 'loans', 'stood', 'at', 'as', 'on', 'March', 'Mumbai', 'Indian', 'banks', 'continue', 'to', 'see', 'an', 'improvement', 'in', 'asset', 'quality', 'with', 'bad', 'loans', 'as', 'a', 'percentage', 'of', 'total', 'loans', 'expected', 'to', 'fall', 'to', 'by', 'March', 'according', 'to', 'the', 'Financial', 'Stability', 'Report', 'released', 'by', 'the', 'Reserve', 'Bank', 'of', 'India', 'RBI', 'on', 'Thursday', 'The', 'gross', 'assets', 'NPAs', 'as', 'a', 'percentage', 'of', 'total', 'loans', 'stood', 'at', 'as', 'on', 'March', 'According', 'to', 'the', 'report', 'stress', 'tests', 'done', 'on']
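
The `isalpha()` filter also drops hyphenated words such as 'non-performing', as the output above shows. If those should be kept, a regular expression is one option. This is a sketch under that assumption; the pattern and the short token list are illustrative.

```python
import re

# Assumed pattern: alphabetic chunks optionally joined by single hyphens.
WORD_PATTERN = re.compile(r'[A-Za-z]+(?:-[A-Za-z]+)*')

# Illustrative tokens taken from the Step 3 output.
tokens = ['The', 'gross', 'non-performing', 'assets', '(', 'NPAs', ')',
          'stood', 'at', '9.3', '%', 'as', 'on', 'March', '2019']

# fullmatch() requires the whole token to fit the pattern,
# so numbers and punctuation are still removed.
words = [tok for tok in tokens if WORD_PATTERN.fullmatch(tok)]
print(words)
```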


In [5]:
# Step 5: Filter Out Stop Words (and Pipeline)
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '
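
The cell above only prints the stop word list. A minimal sketch of the filtering step itself could look like the following. To stay runnable without the NLTK corpus, it hardcodes a small subset of the stop words printed above; in the notebook, `set(stop_words)` would take its place.

```python
# A small subset of the printed stop words, hardcoded for illustration.
stop_words = {'is', 'of', 'the', 'in', 'at', 'as', 'on', 'a', 'to', 'by', 'with', 'an'}

# Illustrative input: the first few words from the Step 4 output.
words = ['is', 'financial', 'Report', 'of', 'The', 'share', 'of', 'large',
         'borrowers', 'in', 'Indian', 'banks']

# Lowercase before comparing: the stop word list is all lowercase,
# so 'The' would otherwise slip through.
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)
```

Using a set rather than a list makes each membership check constant time, which matters on longer texts.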

In [6]:
# Step 6: Stem Words
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
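# Note: this stems the raw tokens from Step 3, so punctuation and numbers
# are still present; stem `words` instead to use the Step 4 output.
stemmed = [porter.stem(word) for word in tokens]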
print(stemmed[:100])

['\ufeffthi', 'is', 'financi', 'report', 'of', '2019-2020', 'the', 'share', 'of', 'larg', 'borrow', 'in', 'indian', 'bank', '’', 'total', 'loan', 'portfolio', 'stood', 'at', '53', '%', 'as', 'on', 'march', '2019', 'the', 'gross', 'non-perform', 'asset', '(', 'npa', ')', 'as', 'a', 'percentag', 'of', 'total', 'loan', 'stood', 'at', '9.3', '%', 'as', 'on', 'march', '2019', 'mumbai', ':', 'indian', 'bank', 'continu', 'to', 'see', 'an', 'improv', 'in', 'asset', 'qualiti', 'with', 'bad', 'loan', 'as', 'a', 'percentag', 'of', 'total', 'loan', 'expect', 'to', 'fall', 'to', '9', '%', 'by', 'march', '2020', ',', 'accord', 'to', 'the', 'financi', 'stabil', 'report', 'releas', 'by', 'the', 'reserv', 'bank', 'of', 'india', '(', 'rbi', ')', 'on', 'thursday', '.', 'the', 'gross', 'non-perform']
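
The steps can be chained into a single pipeline so that only cleaned words reach the stemmer. The sketch below is one possible arrangement, not the notebook's own code: the function name `clean_and_stem`, the sample tokens, and the reduced stop word set are all assumptions made for illustration.

```python
from nltk.stem.porter import PorterStemmer

def clean_and_stem(tokens, stop_words):
    """Keep alphabetic tokens, drop stop words, and return lowercase stems."""
    porter = PorterStemmer()
    words = [tok for tok in tokens if tok.isalpha()]
    content = [w for w in words if w.lower() not in stop_words]
    return [porter.stem(w) for w in content]

# Illustrative inputs: tokens from the Step 3 output and a hardcoded
# subset of the NLTK English stop words.
sample_tokens = ['The', 'share', 'of', 'large', 'borrowers', 'in', 'Indian',
                 'banks', '’', 'total', 'loan', 'portfolios', 'stood', 'at',
                 '53', '%']
sample_stop_words = {'the', 'of', 'in', 'at'}

print(clean_and_stem(sample_tokens, sample_stop_words))
```

Filtering stop words before stemming avoids comparing stems against a list of unstemmed words.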
