# Text Processing

## Capturing Text Data

### Plain Text

In [22]:
import os

# Read in a plain text file
with open(os.path.join("data", "hieroglyph.txt"), "r") as f:
    text = f.read()
    print(text)

Hieroglyphic writing dates from c. 3000 BC, and is composed of hundreds of symbols. A hieroglyph can represent a word, a sound, or a silent determinative; and the same symbol can serve different purposes in different contexts. Hieroglyphs were a formal script, used on stone monuments and in tombs, that could be as detailed as individual works of art.



### Tabular Data

In [23]:
import pandas as pd

# Extract text column from a dataframe
df = pd.read_csv(os.path.join("data", "news.csv"))
df.head()[['publisher', 'title']]

# Convert text column to lowercase
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]

|   | publisher | title |
|---|---|---|
| 0 | Livemint | fed's charles plosser sees high bar for change... |
| 1 | IFA Magazine | us open: stocks fall after fed official hints ... |
| 2 | IFA Magazine | fed risks falling 'behind the curve', charles ... |
| 3 | Moneynews | fed's plosser: nasty weather has curbed job gr... |
| 4 | NASDAQ | plosser: fed may have to accelerate tapering pace |


### Online Resource

In [24]:
import requests
import json

# Fetch data from a REST API
r = requests.get(
    "https://quotes.rest/qod.json")
res = r.json()
print(json.dumps(res, indent=4))

# Extract relevant object and field
q = res["contents"]["quotes"][0]
print(q["quote"], "\n--", q["author"])

{
    "success": {
        "total": 1
    },
    "contents": {
        "quotes": [
            {
                "quote": "Stop complaining. Start creating.",
                "author": "Dale Patridge",
                "length": null,
                "tags": [
                    "complain",
                    "create",
                    "inspire"
                ],
                "category": "inspire",
                "title": "Inspiring Quote of the day",
                "date": "2019-03-23",
                "id": null
            }
        ],
        "copyright": "2017-19 theysaidso.com"
    }
}
Stop complaining. Start creating. 
-- Dale Patridge
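
Indexing straight into `res["contents"]["quotes"][0]` raises an exception whenever the service returns a different shape, for example an error object. The sketch below is one hedged way to guard the lookup. The `payload` literal is abbreviated from the response printed above, and both the helper name `first_quote` and the error-shaped dictionary are assumptions made for illustration, not part of the API.

```python
import json

# Abbreviated copy of the response printed above (assumed shape)
payload = json.loads("""
{
    "success": {"total": 1},
    "contents": {
        "quotes": [
            {"quote": "Stop complaining. Start creating.", "author": "Dale Patridge"}
        ]
    }
}
""")


def first_quote(res):
    """Return (quote, author) for the first quote, or None if the shape differs."""
    quotes = res.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    q = quotes[0]
    return q.get("quote"), q.get("author")


print(first_quote(payload))
# Hypothetical error-shaped response: no "contents" key, so the helper returns None
print(first_quote({"error": {"code": 429}}))
```

Returning `None` instead of raising keeps the calling cell simple; whether that is preferable to an explicit exception depends on how the result is used downstream.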


## Cleaning

In [25]:
import requests

# Fetch a web page
r = requests.get("https://news.ycombinator.com")
print(r.text)

<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?SQfqA2xbTzsMRwoDFk4N">
            <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">past</a> | <a href="newco

In [26]:
import re

# Remove HTML tags using RegEx
pattern = re.compile(r'<.*?>')  # tags look like <...>
print(pattern.sub('', r.text))  # replace them with an empty string


            
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      Calculating the mean of a list of numbers (2016) (hypothesis.works)
        153 points by GregBuchholz 4 hours ago  | hide | 60&nbsp;comments              
      
                
      2.      Machine Learning: Full-Text Search in JavaScript – Relevance Scoring (burakkanber.com)
        62 points by octosphere 3 hours ago  | hide | 10&nbsp;comments              
      
                
      3.      An Archive of 55k Boxing Matches on VHS (nytimes.com)
        46 points by typographer 3 hours ago  | hide | 16&nbsp;comments              
      
                
      4.      Fuck The Vessel (thebaffler.com)
        21 points by portobello 1 hour ago  | hide | 6&nbsp;comments              
      
                
      5. 
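
The regex pass removes tags but leaves HTML entities such as `&nbsp;` in the text, as the `60&nbsp;comments` fragments above show. A minimal sketch of one way to handle this with the standard library's `html.unescape` follows. The `snippet` string and the helper name `strip_tags` are made up for illustration; they are not taken from the fetched page.

```python
import html
import re

# Hypothetical one-line snippet in the style of the page fetched above
snippet = '<td class="subtext">153 points | <a href="item?id=1">60&nbsp;comments</a></td>'


def strip_tags(markup):
    """Remove tags with a regex, then decode entities left behind."""
    no_tags = re.sub(r"<.*?>", "", markup)    # same pattern as in the cell above
    decoded = html.unescape(no_tags)          # &nbsp; becomes U+00A0, &amp; becomes &
    return decoded.replace("\u00a0", " ")     # turn no-break spaces into plain spaces


cleaned = strip_tags(snippet)
print(cleaned)  # 153 points | 60 comments
```

This still inherits the usual limits of regex-based tag stripping (comments, scripts, tags split across lines), which is why a real parser is generally the safer choice.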

In [27]:
from bs4 import BeautifulSoup

# Remove HTML tags using Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())


            
          
        Hacker News
        
                  Hacker News
              new | past | comments | ask | show | jobs | submit            
                              login
                          
              

              
      1.      Calculating the mean of a list of numbers (2016) (hypothesis.works)
        153 points by GregBuchholz 4 hours ago  | hide | 60 comments              
      
                
      2.      Machine Learning: Full-Text Search in JavaScript – Relevance Scoring (burakkanber.com)
        62 points by octosphere 3 hours ago  | hide | 10 comments              
      
                
      3.      An Archive of 55k Boxing Matches on VHS (nytimes.com)
        46 points by typographer 3 hours ago  | hide | 16 comments              
      
                
      4.      Fuck The Vessel (thebaffler.com)
        21 points by portobello 1 hour ago  | hide | 6 comments              
      
                
      5.      Replete 2.0 (Cl

In [28]:
# Find all articles
summaries = soup.find_all("tr", class_="athing")
summaries[0]

<tr class="athing" id="19470945">
      <td align="right" class="title" valign="top"><span class="rank">1.</span></td>      <td class="votelinks" valign="top"><center><a href="vote?id=19470945&amp;how=up&amp;goto=news" id="up_19470945"><div class="votearrow" title="upvote"></div></a></center></td><td class="title"><a class="storylink" href="https://hypothesis.works/articles/calculating-the-mean/">Calculating the mean of a list of numbers (2016)</a><span class="sitebit comhead"> (<a href="from?site=hypothesis.works"><span class="sitestr">hypothesis.works</span></a>)</span></td></tr>

In [29]:
# Extract title
summaries[0].find("a", class_="storylink").get_text().strip()

'Calculating the mean of a list of numbers (2016)'

In [30]:
# Find all articles, extract titles
articles = []
summaries = soup.find_all("tr", class_="athing")
for summary in summaries:
    title = summary.find("a", class_="storylink").get_text().strip()
    articles.append(title)

print(len(articles), "Article summaries found. Sample:")
print(articles[0])

30 Article summaries found. Sample:
Calculating the mean of a list of numbers (2016)


## Normalization

### Case Normalization

In [3]:
# Sample text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"
print(text)

The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?


In [4]:
# Convert to lowercase
text = text.lower() 
print(text)

the first time you see the second renaissance it may look boring. look at it at least twice and definitely watch part 2. it will change your view of the matrix. are the human people the ones who started the war ? is ai a bad thing ?


### Punctuation Removal

In [5]:
import re

# Remove punctuation characters
text = re.sub(r"[^a-zA-Z0-9]", " ", text) 
print(text)

the first time you see the second renaissance it may look boring  look at it at least twice and definitely watch part 2  it will change your view of the matrix  are the human people the ones who started the war   is ai a bad thing  
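
The pattern `[^a-zA-Z0-9]` treats every character outside ASCII letters and digits as punctuation, so accented letters are removed too. The sketch below compares it with a Unicode-aware alternative, `[^\w\s]`, on a made-up sentence (the `sample` text is an assumption for illustration). Note that `\w` also keeps the underscore.

```python
import re

# Hypothetical sample with accented letters, written with escapes for clarity
sample = "Zo\u00eb's caf\u00e9 isn't far, is it?"

# ASCII-only pattern from the cell above: accented letters are replaced as well
ascii_only = re.sub(r"[^a-zA-Z0-9]", " ", sample)

# Unicode-aware pattern: keeps any word character, replaces only punctuation
unicode_aware = re.sub(r"[^\w\s]", " ", sample)

print(ascii_only.split())     # 'Zo' and 'caf' have lost their accented letters
print(unicode_aware.split())  # accented letters are preserved
```

Both variants still split contractions such as "isn't" into two tokens, so neither is a substitute for a proper tokenizer.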


## Tokenization

In [6]:
# Split text into tokens (words)
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


### NLTK: Natural Language ToolKit

In [7]:
import os
import nltk
nltk.data.path.append(os.path.join(os.getcwd(), "nltk_data"))

In [8]:
# Another sample text
text = "Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers."
print(text)

Dr. Smith graduated from the University of Washington. He later started an analytics firm called Lux, which catered to enterprise customers.


In [9]:
from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)

['Dr.', 'Smith', 'graduated', 'from', 'the', 'University', 'of', 'Washington', '.', 'He', 'later', 'started', 'an', 'analytics', 'firm', 'called', 'Lux', ',', 'which', 'catered', 'to', 'enterprise', 'customers', '.']


In [10]:
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)

['Dr. Smith graduated from the University of Washington.', 'He later started an analytics firm called Lux, which catered to enterprise customers.']


In [11]:
# List stop words
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [12]:
# Reset text
text = "The first time you see The Second Renaissance it may look boring. Look at it at least twice and definitely watch part 2. It will change your view of the matrix. Are the human people the ones who started the war ? Is AI a bad thing ?"

# Normalize it
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize it
words = text.split()
print(words)

['the', 'first', 'time', 'you', 'see', 'the', 'second', 'renaissance', 'it', 'may', 'look', 'boring', 'look', 'at', 'it', 'at', 'least', 'twice', 'and', 'definitely', 'watch', 'part', '2', 'it', 'will', 'change', 'your', 'view', 'of', 'the', 'matrix', 'are', 'the', 'human', 'people', 'the', 'ones', 'who', 'started', 'the', 'war', 'is', 'ai', 'a', 'bad', 'thing']


In [13]:
# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'ones', 'started', 'war', 'ai', 'bad', 'thing']


### Sentence Parsing

In [14]:
import nltk

# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
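
The two trees above reflect a genuine ambiguity in the grammar: the prepositional phrase can attach to the verb phrase or to the noun phrase. As a cross-check that does not depend on NLTK, here is a minimal sketch that counts the derivations with a memoized recursive function. It is a swapped-in technique, not how `nltk.ChartParser` works internally, and the names `GRAMMAR`, `WORDS`, `count`, and `count_seq` are illustrative assumptions.

```python
from functools import lru_cache

# The same grammar as above, written as plain Python data
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}
WORDS = tuple("I shot an elephant in my pajamas".split())


@lru_cache(maxsize=None)
def count(symbol, i, j):
    """Number of distinct trees in which symbol derives WORDS[i:j]."""
    if symbol not in GRAMMAR:  # terminal: must match exactly one word
        return int(j - i == 1 and WORDS[i] == symbol)
    return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[symbol])


@lru_cache(maxsize=None)
def count_seq(rhs, i, j):
    """Number of ways the symbol sequence rhs derives WORDS[i:j]."""
    if len(rhs) == 1:
        return count(rhs[0], i, j)
    # The first symbol covers WORDS[i:k]; the rest cover WORDS[k:j].
    # Every symbol must cover at least one word, so k stops early enough.
    return sum(
        count(rhs[0], i, k) * count_seq(rhs[1:], k, j)
        for k in range(i + 1, j - len(rhs) + 2)
    )


print(count("S", 0, len(WORDS)))  # 2
```

The left-recursive rule `VP -> VP PP` terminates here because each recursive call covers a strictly shorter span; this relies on the grammar having no empty productions.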


In [15]:
for tree in parser.parse(sentence):
    tree.draw()

## Named Entity Recognition

In [16]:
import nltk
from nltk import pos_tag, ne_chunk
from nltk.tokenize import word_tokenize
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt')
# Recognize named entities in a tagged sentence
ne_chunk(pos_tag(word_tokenize('Yulia joined Udacity Inc. in California.')))

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/yudzhi/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /Users/yudzhi/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /Users/yudzhi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


LookupError: 

===========================================================================
NLTK was unable to find the gs file!
Use software specific configuration paramaters or set the PATH environment variable.
===========================================================================

Tree('S', [Tree('PERSON', [('Yulia', 'NNS')]), ('joined', 'VBD'), Tree('ORGANIZATION', [('Udacity', 'NNP'), ('Inc.', 'NNP')]), ('in', 'IN'), Tree('GPE', [('California', 'NNP')]), ('.', '.')])

## Stemming & Lemmatization

### Stemming

In [17]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems
stemmer = PorterStemmer()  # create the stemmer once and reuse it
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['first', 'time', 'see', 'second', 'renaiss', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definit', 'watch', 'part', '2', 'chang', 'view', 'matrix', 'human', 'peopl', 'one', 'start', 'war', 'ai', 'bad', 'thing']
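
Stems such as `renaiss` and `definit` are not dictionary words because stemming works by rule-based suffix stripping rather than vocabulary lookup. The toy function below illustrates that idea with a handful of hand-picked suffixes. It is deliberately naive and is not the Porter algorithm; the rule list `SUFFIXES` and the `min_stem` threshold are assumptions chosen for the example.

```python
# Toy suffix rules, checked in order; NOT the Porter algorithm
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


sample_words = ["started", "boring", "ones", "definitely", "thing", "matrix"]
print([naive_stem(w) for w in sample_words])
# ['start', 'bor', 'one', 'definite', 'thing', 'matrix']
```

Even this crude version shows the trade-off: it is fast and needs no dictionary, but it produces non-words like `bor`, which is the kind of result lemmatization avoids.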


### Lemmatization

In [18]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form
lemmatizer = WordNetLemmatizer()  # create the lemmatizer once and reuse it
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']


In [19]:
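# Lemmatize as verbs into a separate list
# Note: this cell prints `lemmed`, not `lemmed_verb`, so the output is the same as in the previous cell
lemmed_verb = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
print(lemmed)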

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'boring', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'started', 'war', 'ai', 'bad', 'thing']


In [45]:
# Lemmatize verbs by specifying pos
lemmed = [WordNetLemmatizer().lemmatize(w, pos='v') for w in lemmed]
print(lemmed)

['first', 'time', 'see', 'second', 'renaissance', 'may', 'look', 'bore', 'look', 'least', 'twice', 'definitely', 'watch', 'part', '2', 'change', 'view', 'matrix', 'human', 'people', 'one', 'start', 'war', 'ai', 'bad', 'thing']


In [47]:
# Fetch the course catalog page, following the lecture example
r = requests.get("https://www.udacity.com/courses/all")
# The page is very large, so printing it is left commented out
# print(r.text)

In [49]:
# Parse with Beautiful Soup directly; the comparison with the other methods is skipped because they do not work well on this page
soup = BeautifulSoup(r.text, "html5lib")
# get_text() keeps all the JavaScript and whitespace, so printing it is left commented out
# print(soup.get_text())

In [52]:
# Inspect the page source to identify the tag and class to search for,
# then find all course summaries
div_summaries = soup.find_all("div", class_="course-summary-card")
div_summaries[0]

<div _ngcontent-sc272="" class="course-summary-card row row-gap-medium catalog-card nanodegree-card ng-star-inserted"><ir-catalog-card _ngcontent-sc272="" _nghost-sc275=""><div _ngcontent-sc275="" class="card-wrapper is-collapsed"><div _ngcontent-sc275="" class="card__inner card mb-0"><div _ngcontent-sc275="" class="card__inner--upper"><div _ngcontent-sc275="" class="image_wrapper hidden-md-down"><a _ngcontent-sc275="" href="/course/data-engineer-nanodegree--nd027"><!----><div _ngcontent-sc275="" class="image-container ng-star-inserted" style="background-image:url(https://d20vrrgs8k4bvw.cloudfront.net/images/degrees/nd027/nd-card.jpg);"><div _ngcontent-sc275="" class="image-overlay"></div></div></a><!----></div><div _ngcontent-sc275="" class="card-content"><!----><span _ngcontent-sc275="" class="tag tag--new card ng-star-inserted">New</span><!----><div _ngcontent-sc275="" class="category-wrapper"><span _ngcontent-sc275="" class="mobile-icon"></span><!----><h4 _ngcontent-sc275="" class=

In [54]:
# Using CSS selector, finds only the first tag that matches a selector
div_summaries[0].select_one("h3 a")

<a _ngcontent-sc275="" class="capitalize" href="/course/data-engineer-nanodegree--nd027">Data Engineer</a>

In [56]:
# Extract title text
div_summaries[0].select_one("h3 a").get_text()

'Data Engineer'

In [57]:
# Strip extra white space from the title
div_summaries[0].select_one("h3 a").get_text().strip()

'Data Engineer'

In [76]:
# Extract description
div_summaries[0].select_one("div.card__expander")

<div _ngcontent-sc275="" class="card__expander"><div _ngcontent-sc275="" class="card__expander--summary mb-1"><!----><span _ngcontent-sc275="" class="ng-star-inserted">Data Engineering is the foundation for the new world of Big Data. Enroll now to build production-ready data infrastructure, an essential skill for advancing your data career.</span></div><hr _ngcontent-sc275=""/><div _ngcontent-sc275="" class="card__expander--details"><div _ngcontent-sc275="" class="rating"><!----></div><a _ngcontent-sc275="" class="button--primary btn" href="/course/data-engineer-nanodegree--nd027">Learn More</a></div></div>

In [77]:
# The element holds more than the description, so get_text() also picks up the "Learn More" button label
div_summaries[0].select_one("div.card__expander").get_text()

'Data Engineering is the foundation for the new world of Big Data. Enroll now to build production-ready data infrastructure, an essential skill for advancing your data career.Learn More'

In [81]:
# Find all course summaries, extract name and description
courses = []
div_summaries = soup.find_all("div", class_="course-summary-card")
for summary in div_summaries:
    title = summary.select_one("h3 a").get_text().strip()
    description = summary.select_one("div.card__expander").get_text().strip()
    courses.append((title, description))

print(len(courses), "Course summaries found. Sample:")
print(courses[0][0])
print(courses[0][1])

233 Course summaries found. Sample:
Data Engineer
Data Engineering is the foundation for the new world of Big Data. Enroll now to build production-ready data infrastructure, an essential skill for advancing your data career.Learn More


## Part-of-Speech Tagging

Note: Part-of-speech tagging with a predefined grammar is a simple but limited solution. It becomes tedious and error-prone for a large corpus of text, since you have to account for every possible sentence structure and tag.

More advanced forms of POS tagging learn sentence structures and tags from data, including Hidden Markov Models (HMMs) and Recurrent Neural Networks (RNNs).
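
To make the HMM idea concrete, here is a minimal sketch of Viterbi decoding over a four-tag model. All probabilities in `START`, `TRANS`, and `EMIT` are hand-picked assumptions for illustration; a real tagger would estimate them from a tagged corpus. The point is only to show how the same word ("lie") can receive different tags depending on its context.

```python
TAGS = ["PRP", "VB", "DT", "NN"]

# Hand-picked toy probabilities (assumptions, not learned from data)
START = {"PRP": 0.6, "VB": 0.1, "DT": 0.2, "NN": 0.1}
TRANS = {
    "PRP": {"PRP": 0.05, "VB": 0.8, "DT": 0.1, "NN": 0.05},
    "VB": {"PRP": 0.1, "VB": 0.1, "DT": 0.6, "NN": 0.2},
    "DT": {"PRP": 0.05, "VB": 0.05, "DT": 0.05, "NN": 0.85},
    "NN": {"PRP": 0.1, "VB": 0.5, "DT": 0.2, "NN": 0.2},
}
EMIT = {
    "PRP": {"i": 1.0},
    "VB": {"tell": 0.5, "lie": 0.5},
    "DT": {"a": 1.0},
    "NN": {"lie": 1.0},
}


def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    # best[t][tag] = (probability of the best path ending in tag at position t,
    #                 previous tag on that path)
    best = [{t: (START[t] * EMIT[t].get(words[0], 0.0), None) for t in TAGS}]
    for word in words[1:]:
        column = {}
        for tag in TAGS:
            column[tag] = max(
                (best[-1][p][0] * TRANS[p][tag] * EMIT[tag].get(word, 0.0), p)
                for p in TAGS
            )
        best.append(column)
    # Backtrack from the most probable final tag
    tag = max(TAGS, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return list(reversed(path))


print(viterbi(["i", "lie"]))                # ['PRP', 'VB']
print(viterbi(["i", "tell", "a", "lie"]))   # ['PRP', 'VB', 'DT', 'NN']
```

This version multiplies raw probabilities, which is fine for short sentences; longer inputs would normally use log probabilities to avoid underflow, and unseen words would need smoothing.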

In [84]:
import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
nltk.download('averaged_perceptron_tagger')

# Tag parts of speech (PoS)
sentence = word_tokenize('I always lie down to tell a lie.')
pos_tag(sentence)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/yudzhi/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('I', 'PRP'),
 ('always', 'RB'),
 ('lie', 'VBP'),
 ('down', 'RP'),
 ('to', 'TO'),
 ('tell', 'VB'),
 ('a', 'DT'),
 ('lie', 'NN'),
 ('.', '.')]