# Basic Text Processing

This project aims to:
+ Explore regular expression patterns and functionality
+ Strip HTML tags, images, and scripts
+ Tokenize words and sentences
+ Lemmatize word tokens
+ Assign Part-of-Speech (POS) tags

# Use the urllib or requests package to read HTML

In [1]:
from urllib import request

In [2]:
# Read the raw HTML of the article with urllib
html = request.urlopen('https://www.cnbc.com/2020/12/16/coronavirus-stimulus-update-congress-may-offer-900-billion-relief-plan.html').read()
# Range used to limit amount of output
html[:2000]

b'<!DOCTYPE html><html lang="en" prefix="og=https://ogp.me/ns#" itemType="https://schema.org/WebPage"><head><script src="//fm.cnbc.com/applications/cnbc.com/resources/newrelic/agent.js" defer=""></script><link rel="preload" as="script" href="https://sb.scorecardresearch.com/beacon.js"/><title itemProp="name">Coronavirus stimulus update: Congress working on $900 billion relief plan</title><meta name="viewport" content="initial-scale=1.0, width=device-width"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta property="AssetType" content="cnbcnewsstory"/><meta property="pageNodeId" content="106812089"/><meta itemProp="description" name="description" content="The coronavirus relief deal would reportedly include direct payments but not liability protections or state and local relief."/><link itemProp="url" rel="canonical" href="https://www.cnbc.com/2020/12/16/coronavirus-stimulus-update-congress-may-offer-900-billion-relief-plan.html"/><link rel="icon" type="image/png" href="/favi
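
The value returned by `read()` is a `bytes` object, which is why the output above starts with `b'`. BeautifulSoup accepts bytes directly, but other string operations need a `str`. The following is a minimal sketch of the decoding step, assuming the page is UTF-8; the helper name `decode_html` and the Latin-1 fallback are illustrative choices, not part of the original notebook.

```python
def decode_html(raw, encoding='utf-8'):
    """Decode raw response bytes to str (hypothetical helper)."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: fall back to Latin-1, which can decode any byte sequence
        return raw.decode('latin-1')

# Hypothetical sample bytes standing in for a downloaded page
sample_bytes = '<title>Caf\u00e9 prices</title>'.encode('utf-8')
decoded = decode_html(sample_bytes)
print(type(decoded).__name__, decoded)
```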

# Use BeautifulSoup or another HTML parsing package to extract text from the article

In [3]:
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

def text_from_html(body):
    soup = BeautifulSoup(body, 'html.parser')
    # find_all(string=True) replaces the deprecated findAll(text=True)
    texts = soup.find_all(string=True)
    visible_texts = filter(tag_visible, texts)
    return " ".join(t.strip() for t in visible_texts)

In [4]:
# Extract text from the article.
text = text_from_html(html)
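
The filtering idea behind `tag_visible` can also be illustrated without BeautifulSoup. The sketch below swaps in the standard library's `html.parser.HTMLParser` and skips text nested inside a few non-visible tags. The class `VisibleTextParser`, the function `visible_text`, and the sample markup are hypothetical and only approximate what the notebook's functions do.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collect text that is not nested inside skipped tags (hypothetical helper)."""
    SKIPPED = {'style', 'script', 'head', 'title'}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # how many skipped tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped automatically
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(markup):
    parser = VisibleTextParser()
    parser.feed(markup)
    parser.close()
    return ' '.join(parser.chunks)

sample = ('<html><head><title>T</title><script>var x = 1;</script></head>'
          '<body><!-- note --><p>Hello</p><p>world</p></body></html>')
print(visible_text(sample))
```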

# Use re (regular expression) package

In [5]:
import re
# Find all matches of $ amounts in the article.
# A raw string avoids invalid-escape warnings for \$ and \d.
# Note: the pattern has no support for thousands separators, so a value
# such as '$1,200' would be matched as '$1'.
amount_pattern = r'\$\d*\.?\d+?'
amount = re.findall(amount_pattern, text)

print('The $ amounts in the article: \n'+ str(amount))

# Record the start index of each match
amount_indexes = [m.start() for m in re.finditer(amount_pattern, text)]

print('Matched positions of the $ amounts: \n' + str(amount_indexes))


The $ amounts in the article: 
['$900', '$900', '$900', '$900', '$300', '$900', '$900', '$600', '$1', '$900', '$600', '$600', '$900']
Matched positions of the $ amounts: 
[1722, 1921, 2394, 2494, 3684, 3893, 5030, 5337, 5604, 6000, 6215, 6234, 7056]
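
The pattern used above stops at the first comma, which may explain a short match such as `'$1'`. As a hedged sketch, the pattern below also accepts comma-separated thousands groups and an optional decimal part. The sample sentence and the pattern are assumptions made for illustration and are not taken from the article.

```python
import re

# Hypothetical sample sentence, not taken from the article
sample = 'Checks of $1,200 or $600, a $900 billion plan, and $1.8 trillion in total.'

# Assumed pattern: digits, optional ",ddd" groups, optional decimal part
amount_pattern = re.compile(r'\$\d+(?:,\d{3})*(?:\.\d+)?')

# Keep both the matched text and its start index
matches = [(m.group(), m.start()) for m in amount_pattern.finditer(sample)]
amounts = [value for value, _ in matches]
print(amounts)
print([pos for _, pos in matches])
```

Using `finditer` returns the value and its position in one pass, so the two lists cannot drift apart.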


In [6]:
# Substitute every digit with # and print the output
subs_output = re.sub(r'[0-9]','#',text)
print(subs_output[:2000])

Skip Navigation Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Business Economy Finance Health & Science Media Real Estate Energy Transportation Industrials Retail Wealth Life Small Business Investing Invest In You Personal Finance Fintech Financial Advisors Trading Nation Options Action ETF Street Buffett Archive Earnings Trader Talk Tech Cybersecurity Enterprise Internet Media Mobile Social Media Venture Capital Tech Guide Politics White House Policy Defense Congress #### Elections CNBC TV Live TV Live Audio Business Day Shows The News with Shepard Smith Entertainment Shows Full Episodes Latest Video Top Video CEO Interviews CNBC Documentaries CNBC Podcasts CNBC World Digital Originals Live TV Schedule Watchlist PRO PRO News PRO Live Subscribe Sign In Menu Make It Select USA INTL Search quotes, news & videos SIGN IN Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Business Economy 
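
Because the character class `[0-9]` matches one digit at a time, `2020` becomes `####` in the output above. A small sketch of the difference between replacing each digit and replacing each run of digits, using a phrase from that output:

```python
import re

sample = 'Congress 2020 Elections'  # phrase taken from the output above

per_digit = re.sub(r'[0-9]', '#', sample)    # one '#' per digit
per_number = re.sub(r'[0-9]+', '#', sample)  # one '#' per run of digits
print(per_digit)
print(per_number)
```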

In [7]:
# Count (using regular expressions) "Democrats" and "Congress" mentions
print('Total number mentions of Democrats: '+str(len(re.findall('Democrats', text, flags=0))))
print('Total number mentions of Congress: '+str(len(re.findall('Congress', text, flags=0))))

Total number mentions of Democrats: 4
Total number mentions of Congress: 11
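
A plain pattern such as `'Congress'` also matches inside longer words. If only whole-word mentions should count, a word boundary `\b` can be added. The sketch below compares both counts on a hypothetical sample sentence that is not taken from the article.

```python
import re

# Hypothetical sample text, not taken from the article
sample = 'Congress met. Congressional aides and Congressmen left.'

substring_count = len(re.findall('Congress', sample))        # matches inside longer words too
whole_word_count = len(re.findall(r'\bCongress\b', sample))  # whole-word matches only
print(substring_count, whole_word_count)
```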


# Use NLTK and/or spaCy tokenization features

In [8]:
import nltk
from nltk import word_tokenize, sent_tokenize, ngrams, pos_tag, RegexpParser
from collections import Counter

In [9]:
#Tokenize sentences
sentences = sent_tokenize(text)
for sentence in sentences[:15]:
    print(sentence)

Skip Navigation Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Business Economy Finance Health & Science Media Real Estate Energy Transportation Industrials Retail Wealth Life Small Business Investing Invest In You Personal Finance Fintech Financial Advisors Trading Nation Options Action ETF Street Buffett Archive Earnings Trader Talk Tech Cybersecurity Enterprise Internet Media Mobile Social Media Venture Capital Tech Guide Politics White House Policy Defense Congress 2020 Elections CNBC TV Live TV Live Audio Business Day Shows The News with Shepard Smith Entertainment Shows Full Episodes Latest Video Top Video CEO Interviews CNBC Documentaries CNBC Podcasts CNBC World Digital Originals Live TV Schedule Watchlist PRO PRO News PRO Live Subscribe Sign In Menu Make It Select USA INTL Search quotes, news & videos SIGN IN Markets Pre-Markets U.S. Markets Currencies Cryptocurrency Futures & Commodities Bonds Funds & ETFs Business Economy 
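
In the output above, `U.S. Markets` stays inside one sentence. `sent_tokenize` relies on a trained Punkt model rather than a plain punctuation rule. For contrast, here is a naive regex splitter (a hypothetical helper, not part of NLTK) that breaks on every period followed by whitespace and therefore splits the abbreviation:

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Hypothetical sample text
sample = 'U.S. Markets rose. Congress met.'
print(naive_sentences(sample))
```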

In [10]:
# Tokenize words from the text
words = word_tokenize(text)
for word in words[:10]:
    print(word)

Skip
Navigation
Markets
Pre-Markets
U.S.
Markets
Currencies
Cryptocurrency
Futures
&


In [11]:
# Remove all English stop words (the comparison is case-sensitive, so capitalized tokens such as 'The' are kept)
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english')) 
filtered_words = [] 

for w in words: 
    if w not in stop_words: 
        filtered_words.append(w) 

print('Removing all English stop words: \n',filtered_words,'\n')

Removing all English stop words: 
 ['Skip', 'Navigation', 'Markets', 'Pre-Markets', 'U.S.', 'Markets', 'Currencies', 'Cryptocurrency', 'Futures', '&', 'Commodities', 'Bonds', 'Funds', '&', 'ETFs', 'Business', 'Economy', 'Finance', 'Health', '&', 'Science', 'Media', 'Real', 'Estate', 'Energy', 'Transportation', 'Industrials', 'Retail', 'Wealth', 'Life', 'Small', 'Business', 'Investing', 'Invest', 'In', 'You', 'Personal', 'Finance', 'Fintech', 'Financial', 'Advisors', 'Trading', 'Nation', 'Options', 'Action', 'ETF', 'Street', 'Buffett', 'Archive', 'Earnings', 'Trader', 'Talk', 'Tech', 'Cybersecurity', 'Enterprise', 'Internet', 'Media', 'Mobile', 'Social', 'Media', 'Venture', 'Capital', 'Tech', 'Guide', 'Politics', 'White', 'House', 'Policy', 'Defense', 'Congress', '2020', 'Elections', 'CNBC', 'TV', 'Live', 'TV', 'Live', 'Audio', 'Business', 'Day', 'Shows', 'The', 'News', 'Shepard', 'Smith', 'Entertainment', 'Shows', 'Full', 'Episodes', 'Latest', 'Video', 'Top', 'Video', 'CEO', 'Interview

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tramh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
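
The filtered list above still contains tokens such as `'The'`, `'In'`, and `'You'`, because NLTK's English stop words are lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. It uses a tiny illustrative stop-word set rather than NLTK's full list so that it runs without any downloads.

```python
# Assumption: a tiny illustrative stop-word set, not NLTK's full English list
stop_words = {'the', 'with', 'in', 'you'}
tokens = ['The', 'News', 'with', 'Shepard', 'Smith', 'In', 'You']

# Case-sensitive test keeps capitalized stop words
case_sensitive = [w for w in tokens if w not in stop_words]
# Lowercasing each token before the test removes them as well
case_insensitive = [w for w in tokens if w.lower() not in stop_words]
print(case_sensitive)
print(case_insensitive)
```

Whether lowercasing is appropriate depends on the task, since it also removes title-case words that happen to be stop words, such as the "In" of "Invest In You".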


In [12]:
#List and count n-grams for any given input n
def ngram_listAndCount(tokens, n_grams):
    # Build the list and the counts locally so repeated calls do not accumulate
    ngram_list = list(ngrams(tokens, n_grams))
    ngram_count = {}
    # Count of n_grams
    for gram in ngram_list:
        ngram_count[gram] = ngram_count.get(gram, 0) + 1
    return ngram_count

In [13]:
# Example of list and count n-grams for any given input n
ngram_listAndCount(filtered_words, 4)

{('Skip', 'Navigation', 'Markets', 'Pre-Markets'): 1,
 ('Navigation', 'Markets', 'Pre-Markets', 'U.S.'): 1,
 ('Markets', 'Pre-Markets', 'U.S.', 'Markets'): 2,
 ('Pre-Markets', 'U.S.', 'Markets', 'Currencies'): 2,
 ('U.S.', 'Markets', 'Currencies', 'Cryptocurrency'): 2,
 ('Markets', 'Currencies', 'Cryptocurrency', 'Futures'): 2,
 ('Currencies', 'Cryptocurrency', 'Futures', '&'): 2,
 ('Cryptocurrency', 'Futures', '&', 'Commodities'): 2,
 ('Futures', '&', 'Commodities', 'Bonds'): 2,
 ('&', 'Commodities', 'Bonds', 'Funds'): 2,
 ('Commodities', 'Bonds', 'Funds', '&'): 2,
 ('Bonds', 'Funds', '&', 'ETFs'): 2,
 ('Funds', '&', 'ETFs', 'Business'): 2,
 ('&', 'ETFs', 'Business', 'Economy'): 2,
 ('ETFs', 'Business', 'Economy', 'Finance'): 2,
 ('Business', 'Economy', 'Finance', 'Health'): 2,
 ('Economy', 'Finance', 'Health', '&'): 2,
 ('Finance', 'Health', '&', 'Science'): 2,
 ('Health', '&', 'Science', 'Media'): 2,
 ('&', 'Science', 'Media', 'Real'): 2,
 ('Science', 'Media', 'Real', 'Estate'): 2,
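
The same counting can be sketched with only the standard library: `zip` over shifted views of the token list yields the n-grams, and `collections.Counter` tallies them. The helper name `count_ngrams` and the short token list are assumptions made for illustration.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count n-grams by zipping n shifted views of the token list (hypothetical helper)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical token list in the style of the navigation text above
tokens = ['Markets', 'Pre-Markets', 'U.S.', 'Markets', 'Pre-Markets', 'U.S.']
bigram_counts = count_ngrams(tokens, 2)
print(bigram_counts)
```

A `Counter` also offers `most_common(k)`, which is convenient for inspecting the most frequent n-grams.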


In [14]:
#Lemmatize and deduplicate unigrams into a vocabulary of terms.
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tramh\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [15]:
# using Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
deduplicate_lem = []
seen = set()
# Lemmatize and deduplicate unigrams into a vocabulary of terms, keeping first-seen order.
for w in words:
    lemma = wordnet_lemmatizer.lemmatize(w)
    if lemma not in seen:
        seen.add(lemma)
        deduplicate_lem.append(lemma)

In [16]:
# Example of lemmatize and deduplicate unigrams into a vocabulary of terms.
print(deduplicate_lem[:200])
print(words[:200])

['Skip', 'Navigation', 'Markets', 'Pre-Markets', 'U.S.', 'Currencies', 'Cryptocurrency', 'Futures', '&', 'Commodities', 'Bonds', 'Funds', 'ETFs', 'Business', 'Economy', 'Finance', 'Health', 'Science', 'Media', 'Real', 'Estate', 'Energy', 'Transportation', 'Industrials', 'Retail', 'Wealth', 'Life', 'Small', 'Investing', 'Invest', 'In', 'You', 'Personal', 'Fintech', 'Financial', 'Advisors', 'Trading', 'Nation', 'Options', 'Action', 'ETF', 'Street', 'Buffett', 'Archive', 'Earnings', 'Trader', 'Talk', 'Tech', 'Cybersecurity', 'Enterprise', 'Internet', 'Mobile', 'Social', 'Venture', 'Capital', 'Guide', 'Politics', 'White', 'House', 'Policy', 'Defense', 'Congress', '2020', 'Elections', 'CNBC', 'TV', 'Live', 'Audio', 'Day', 'Shows', 'The', 'News', 'with', 'Shepard', 'Smith', 'Entertainment', 'Full', 'Episodes', 'Latest', 'Video', 'Top', 'CEO', 'Interviews', 'Documentaries', 'Podcasts', 'World', 'Digital', 'Originals', 'Schedule', 'Watchlist', 'PRO', 'Subscribe', 'Sign', 'Menu', 'Make', 'It', 

In [17]:
from nltk.util import ngrams
#Create the list of first 5 sentences
sentences_5 = sentences[:5]

#Function to return ngrams 
def grams_output(sent, n):
    words = word_tokenize(sent)
    grams_return = ngrams(words, n)
    return(list(grams_return))

In [18]:
#Print bigrams and trigrams in the first 5 sentences
for i, sentence in enumerate(sentences_5, start=1):
    print("Bigrams and trigrams for sentence " + str(i))
    print(grams_output(sentence, 2))
    print(grams_output(sentence, 3))
    print('\n')

Bigrams and trigrams for sentence 1
[('Skip', 'Navigation'), ('Navigation', 'Markets'), ('Markets', 'Pre-Markets'), ('Pre-Markets', 'U.S.'), ('U.S.', 'Markets'), ('Markets', 'Currencies'), ('Currencies', 'Cryptocurrency'), ('Cryptocurrency', 'Futures'), ('Futures', '&'), ('&', 'Commodities'), ('Commodities', 'Bonds'), ('Bonds', 'Funds'), ('Funds', '&'), ('&', 'ETFs'), ('ETFs', 'Business'), ('Business', 'Economy'), ('Economy', 'Finance'), ('Finance', 'Health'), ('Health', '&'), ('&', 'Science'), ('Science', 'Media'), ('Media', 'Real'), ('Real', 'Estate'), ('Estate', 'Energy'), ('Energy', 'Transportation'), ('Transportation', 'Industrials'), ('Industrials', 'Retail'), ('Retail', 'Wealth'), ('Wealth', 'Life'), ('Life', 'Small'), ('Small', 'Business'), ('Business', 'Investing'), ('Investing', 'Invest'), ('Invest', 'In'), ('In', 'You'), ('You', 'Personal'), ('Personal', 'Finance'), ('Finance', 'Fintech'), ('Fintech', 'Financial'), ('Financial', 'Advisors'), ('Advisors', 'Trading'), ('Trading

In [19]:
nltk.download('averaged_perceptron_tagger')
# Print POS tags in the first 5 sentences
for i, sentence in enumerate(sentences_5, start=1):
    tokens = word_tokenize(sentence)
    print("POS tags for sentence " + str(i))
    sentence_pos = pos_tag(tokens)
    print(sentence_pos)

POS tags for sentence 1
[('Skip', 'JJ'), ('Navigation', 'NNP'), ('Markets', 'NNP'), ('Pre-Markets', 'NNP'), ('U.S.', 'NNP'), ('Markets', 'NNP'), ('Currencies', 'NNP'), ('Cryptocurrency', 'NNP'), ('Futures', 'NNP'), ('&', 'CC'), ('Commodities', 'NNP'), ('Bonds', 'NNP'), ('Funds', 'NNP'), ('&', 'CC'), ('ETFs', 'NNP'), ('Business', 'NNP'), ('Economy', 'NNP'), ('Finance', 'NNP'), ('Health', 'NNP'), ('&', 'CC'), ('Science', 'NNP'), ('Media', 'NNP'), ('Real', 'NNP'), ('Estate', 'NNP'), ('Energy', 'NNP'), ('Transportation', 'NNP'), ('Industrials', 'NNP'), ('Retail', 'NNP'), ('Wealth', 'NNP'), ('Life', 'NNP'), ('Small', 'NNP'), ('Business', 'NNP'), ('Investing', 'NNP'), ('Invest', 'NNP'), ('In', 'IN'), ('You', 'PRP'), ('Personal', 'NNP'), ('Finance', 'NNP'), ('Fintech', 'NNP'), ('Financial', 'NNP'), ('Advisors', 'NNPS'), ('Trading', 'NNP'), ('Nation', 'NN'), ('Options', 'NNP'), ('Action', 'NNP'), ('ETF', 'NNP'), ('Street', 'NNP'), ('Buffett', 'NNP'), ('Archive', 'NNP'), ('Earnings', 'NNP'), ('T

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\tramh\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
