
DS&A LAB-12
20SW041
SECTION-I


# Text Processing

## Capturing Text Data

### Plain Text

In [None]:
import os

# Read in a plain text file
with open(os.path.join("data", "hieroglyph.txt"), "r") as f:
    text = f.read()
    print(text)

### Tabular Data

In [None]:
import pandas as pd

# Extract text column from a dataframe
df = pd.read_csv(os.path.join("data", "news.csv"))
df.head()[['publisher', 'title']]

# Convert text column to lowercase
df['title'] = df['title'].str.lower()
df.head()[['publisher', 'title']]

### Online Resource

In [None]:
import requests
import json

# Fetch data from a REST API
r = requests.get("https://quotes.rest/qod.json", timeout=10)
r.raise_for_status()  # stop early on an HTTP error status
res = r.json()
print(json.dumps(res, indent=4))

# Extract relevant object and field
q = res["contents"]["quotes"][0]
print(q["quote"], "\n--", q["author"])
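
The lookup above assumes the response always contains `contents` → `quotes` → first item. A minimal sketch of a more defensive extraction follows. The sample payload is hard-coded and hypothetical (it only mirrors the shape indexed above, it is not real API output), and `first_quote` is an illustrative helper name.

```python
import json

# Hypothetical sample payload; it only mirrors the shape indexed above.
sample = json.loads("""
{"contents": {"quotes": [{"quote": "Stay curious.", "author": "Anonymous"}]}}
""")


def first_quote(payload):
    """Return (quote, author) from the payload, or None if the shape differs."""
    quotes = payload.get("contents", {}).get("quotes", [])
    if not quotes:
        return None
    first = quotes[0]
    return first.get("quote"), first.get("author")


print(first_quote(sample))                     # a (quote, author) tuple
print(first_quote({"error": {"code": 429}}))   # None: no quotes present
```

Returning `None` instead of raising `KeyError` lets the caller decide how to handle an error response or a changed schema.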

## Cleaning

In [None]:
import requests

# Fetch a web page
r = requests.get("https://news.ycombinator.com", timeout=10)
print(r.text)

In [None]:
import re

# Remove HTML tags using RegEx
pattern = re.compile(r'<.*?>')  # tags look like <...>
print(pattern.sub('', r.text))  # replace them with blank
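
The pattern `<.*?>` has two limits worth noting: by default `.` does not match a newline, so a tag that spans several lines survives, and the contents of `<script>` or `<style>` blocks are left in the output. A minimal sketch of one way to handle both, run on a small hypothetical HTML snippet so it works offline:

```python
import re

# Small hypothetical HTML snippet, so the example runs offline.
html = """<html><head><script>
var x = 1;
</script></head>
<body><p class="a"
   id="b">Hello <b>world</b></p></body></html>"""

# Drop <script>/<style> blocks together with their contents first.
no_scripts = re.sub(r"<(script|style)\b.*?</\1>", "", html,
                    flags=re.DOTALL | re.IGNORECASE)

# With re.DOTALL, '.' also matches newlines, so multi-line tags are removed.
tag_pattern = re.compile(r"<.*?>", flags=re.DOTALL)
text_only = tag_pattern.sub("", no_scripts)

# Collapse the whitespace left behind.
clean = " ".join(text_only.split())
print(clean)
```

Regular expressions remain a rough tool for HTML; a real parser such as Beautiful Soup is usually the safer choice.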

In [None]:
from bs4 import BeautifulSoup

# Remove HTML tags using Beautiful Soup library
soup = BeautifulSoup(r.text, "html5lib")
print(soup.get_text())

In [None]:
# Find all articles
summaries = soup.find_all("tr", class_="athing")
summaries[0]

In [None]:
# Extract title
summaries[0].find("a", class_="storylink").get_text().strip()

In [None]:
# Find all articles, extract titles
articles = []
summaries = soup.find_all("tr", class_="athing")
for summary in summaries:
    title = summary.find("a", class_="storylink").get_text().strip()
    articles.append(title)

print(len(articles), "Article summaries found. Sample:")
print(articles[0])
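
The same extraction idea can be sketched with the standard library's `html.parser` instead of Beautiful Soup. This is a swapped-in technique for illustration only. The markup below is hypothetical and merely mimics the `athing`/`storylink` structure queried above; `TitleParser` is an illustrative class name.

```python
from html.parser import HTMLParser

# Hypothetical markup that mimics the structure queried above.
SAMPLE = """
<table>
  <tr class="athing"><td><a class="storylink" href="#1"> First story </a></td></tr>
  <tr class="athing"><td><a class="storylink" href="#2">Second story</a></td></tr>
  <tr class="spacer"><td><a href="#3">Not a story</a></td></tr>
</table>
"""


class TitleParser(HTMLParser):
    """Collect the text of every <a class="storylink"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "storylink" in classes:
            self._in_link = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_link:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link = False
            self.titles[-1] = self.titles[-1].strip()


parser = TitleParser()
parser.feed(SAMPLE)
print(parser.titles)
```

If the live page uses different class names than the ones assumed here, both this sketch and the Beautiful Soup cells above would find no titles.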

## Normalization

### Case Normalization

In [None]:
# Sample text
text = "In today's business world, smart data-driven decisions are the number one priority. For this reason, companies track, monitor, and record information 24/7. The good news is there is plenty of public data on servers that can help businesses stay competitive.\xa0The process of extracting data from web pages manually can be tiring, time-consuming, error-prone, and sometimes even impossible. That is why most web data analysis efforts use automated tools.\xa0Web scraping is an automated method of collecting data from web pages. Data is extracted from web pages using software called web scrapers, which are basically web bots."
print(text)

In today's business world, smart data-driven decisions are the number one priority. For this reason, companies track, monitor, and record information 24/7. The good news is there is plenty of public data on servers that can help businesses stay competitive. The process of extracting data from web pages manually can be tiring, time-consuming, error-prone, and sometimes even impossible. That is why most web data analysis efforts use automated tools. Web scraping is an automated method of collecting data from web pages. Data is extracted from web pages using software called web scrapers, which are basically web bots.


In [None]:
# Convert to lowercase
text = text.lower()
print(text)

in today's business world, smart data-driven decisions are the number one priority. for this reason, companies track, monitor, and record information 24/7. the good news is there is plenty of public data on servers that can help businesses stay competitive. the process of extracting data from web pages manually can be tiring, time-consuming, error-prone, and sometimes even impossible. that is why most web data analysis efforts use automated tools. web scraping is an automated method of collecting data from web pages. data is extracted from web pages using software called web scrapers, which are basically web bots.


### Punctuation Removal

In [None]:
import re

# Replace every non-alphanumeric character (punctuation included) with a space
text = re.sub(r"[^a-zA-Z0-9]", " ", text)
print(text)

in today s business world  smart data driven decisions are the number one priority  for this reason  companies track  monitor  and record information 24 7  the good news is there is plenty of public data on servers that can help businesses stay competitive  the process of extracting data from web pages manually can be tiring  time consuming  error prone  and sometimes even impossible  that is why most web data analysis efforts use automated tools  web scraping is an automated method of collecting data from web pages  data is extracted from web pages using software called web scrapers  which are basically web bots 
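
As the output shows, this substitution also splits "today's" into "today s" and "data-driven" into "data driven". A minimal sketch of one alternative that keeps apostrophes and hyphens when they sit inside a word; the sample sentence is a shortened, already lowercased variant of the text above.

```python
import re

sample = "in today's business world, smart data-driven decisions matter 24/7."

# Replace every character that is not a letter, digit, apostrophe, or hyphen.
kept = re.sub(r"[^a-z0-9'\-]", " ", sample)

# Drop apostrophes/hyphens that are not between two alphanumerics.
kept = re.sub(r"(?<![a-z0-9])['\-]|['\-](?![a-z0-9])", " ", kept)

# Collapse runs of spaces.
kept = " ".join(kept.split())
print(kept)
```

Whether to keep such characters depends on the downstream task; the rest of this notebook continues with the simpler substitution.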


## Tokenization

In [None]:
# Split text into tokens (words)
words = text.split()
print(words)

['in', 'today', 's', 'business', 'world', 'smart', 'data', 'driven', 'decisions', 'are', 'the', 'number', 'one', 'priority', 'for', 'this', 'reason', 'companies', 'track', 'monitor', 'and', 'record', 'information', '24', '7', 'the', 'good', 'news', 'is', 'there', 'is', 'plenty', 'of', 'public', 'data', 'on', 'servers', 'that', 'can', 'help', 'businesses', 'stay', 'competitive', 'the', 'process', 'of', 'extracting', 'data', 'from', 'web', 'pages', 'manually', 'can', 'be', 'tiring', 'time', 'consuming', 'error', 'prone', 'and', 'sometimes', 'even', 'impossible', 'that', 'is', 'why', 'most', 'web', 'data', 'analysis', 'efforts', 'use', 'automated', 'tools', 'web', 'scraping', 'is', 'an', 'automated', 'method', 'of', 'collecting', 'data', 'from', 'web', 'pages', 'data', 'is', 'extracted', 'from', 'web', 'pages', 'using', 'software', 'called', 'web', 'scrapers', 'which', 'are', 'basically', 'web', 'bots']


### NLTK: Natural Language ToolKit

In [None]:
import os
import nltk
nltk.data.path.append(os.path.join(os.getcwd(), "nltk_data"))

In [None]:
# Another sample text
text = "Data is extracted from web pages using software called web scrapers, which are basically web bots."
print(text)

Data is extracted from web pages using software called web scrapers, which are basically web bots.


In [None]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk.tokenize import word_tokenize

# Split text into words using NLTK
words = word_tokenize(text)
print(words)

['Data', 'is', 'extracted', 'from', 'web', 'pages', 'using', 'software', 'called', 'web', 'scrapers', ',', 'which', 'are', 'basically', 'web', 'bots', '.']


In [None]:
from nltk.tokenize import sent_tokenize

# Split text into sentences
sentences = sent_tokenize(text)
print(sentences)

['Data is extracted from web pages using software called web scrapers, which are basically web bots.']
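
The sample text holds a single sentence, so `sent_tokenize` returns a one-element list. To show a split across several sentences without any download, here is a naive regex-based splitter. It is not NLTK's Punkt model and would mis-split abbreviations such as "e.g."; the sample string is made up for the example.

```python
import re

sample = ("Web scraping is an automated method. "
          "Is the data public? It often is!")

# Split after '.', '!' or '?' when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", sample.strip())
print(sentences)
```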


In [None]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
# List stop words
from nltk.corpus import stopwords
print(stopwords.words("english"))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# Reset text
text = "In today's business world, smart data-driven decisions are the number one priority. For this reason, companies track, monitor, and record information 24/7. The good news is there is plenty of public data on servers that can help businesses stay competitive. "

# Normalize it
text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Tokenize it
words = text.split()
print(words)

['in', 'today', 's', 'business', 'world', 'smart', 'data', 'driven', 'decisions', 'are', 'the', 'number', 'one', 'priority', 'for', 'this', 'reason', 'companies', 'track', 'monitor', 'and', 'record', 'information', '24', '7', 'the', 'good', 'news', 'is', 'there', 'is', 'plenty', 'of', 'public', 'data', 'on', 'servers', 'that', 'can', 'help', 'businesses', 'stay', 'competitive']


In [None]:
# Remove stop words (build the set once for fast membership tests)
stop_words = set(stopwords.words("english"))
words = [w for w in words if w not in stop_words]
print(words)

['today', 'business', 'world', 'smart', 'data', 'driven', 'decisions', 'number', 'one', 'priority', 'reason', 'companies', 'track', 'monitor', 'record', 'information', '24', '7', 'good', 'news', 'plenty', 'public', 'data', 'servers', 'help', 'businesses', 'stay', 'competitive']
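
The English stop word list printed earlier includes negations such as 'no', 'nor' and 'not'. For tasks where negation carries meaning, it can help to keep them. A minimal sketch, using a small hand-picked subset of that list (not the full NLTK list) so it runs without any download:

```python
# Small hand-picked subset of the list printed above (not the full NLTK list).
base_stop_words = {"is", "the", "this", "not", "no", "a", "of"}

# Words to keep even though they appear in the stop list.
negations = {"not", "no", "nor"}
stop_words = base_stop_words - negations

tokens = "this is not the result of a good decision".split()
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

With the full list, the same set difference can be applied to `set(stopwords.words("english"))`.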


### Sentence Parsing

In [None]:
import nltk

# Define a custom grammar
my_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
parser = nltk.ChartParser(my_grammar)

# Parse a sentence
sentence = word_tokenize("I shot an elephant in my pajamas")
for tree in parser.parse(sentence):
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


## Stemming & Lemmatization

### Stemming

In [None]:
from nltk.stem.porter import PorterStemmer

# Reduce words to their stems (create the stemmer once and reuse it)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print(stemmed)

['today', 'busi', 'world', 'smart', 'data', 'driven', 'decis', 'number', 'one', 'prioriti', 'reason', 'compani', 'track', 'monitor', 'record', 'inform', '24', '7', 'good', 'news', 'plenti', 'public', 'data', 'server', 'help', 'busi', 'stay', 'competit']
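
Stems such as 'busi' and 'decis' are not dictionary words because stemming works by rule-based suffix removal. The toy function below illustrates only that general idea. It is not the Porter algorithm, and both the suffix list and the `naive_stem` name are assumptions made for this example.

```python
SUFFIXES = ("ing", "ies", "es", "ed", "s")  # checked in this order


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


sample_words = ["decisions", "companies", "servers", "news", "tracking"]
naive = [naive_stem(w) for w in sample_words]
print(naive)
```

Blind suffix removal turns "news" into "new" and "companies" into "compan", which is why practical stemmers add further rules and conditions.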


### Lemmatization

In [None]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

# Reduce words to their root form (create the lemmatizer once and reuse it)
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in words]
print(lemmed)

['today', 'business', 'world', 'smart', 'data', 'driven', 'decision', 'number', 'one', 'priority', 'reason', 'company', 'track', 'monitor', 'record', 'information', '24', '7', 'good', 'news', 'plenty', 'public', 'data', 'server', 'help', 'business', 'stay', 'competitive']


In [None]:
# Lemmatize verbs by specifying pos
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w, pos='v') for w in lemmed]
print(lemmed)

['today', 'business', 'world', 'smart', 'data', 'drive', 'decision', 'number', 'one', 'priority', 'reason', 'company', 'track', 'monitor', 'record', 'information', '24', '7', 'good', 'news', 'plenty', 'public', 'data', 'server', 'help', 'business', 'stay', 'competitive']
