
#Experiment - 7

Aim: Perform the steps involved in Text Analytics in Python & R


Tasks to be performed:
* Explore Top-5 Text Analytics Libraries in Python (w.r.t. Features & Applications)
* Explore Top-5 Text Analytics Libraries in R (w.r.t. Features & Applications)
* Perform the following experiments using Python & R
  * Tokenization (Sentence & Word)
  * Frequency Distribution
  * Remove stopwords & punctuation
  * Lexicon Normalization (Stemming, Lemmatization)
  * Part of Speech tagging
  * Named Entity Recognition
  * Scrape data from a website

#Scrape data from a website

In [4]:
import requests
from bs4 import BeautifulSoup

# Specify the URL of the webpage to scrape
url = 'https://toscrape.com/'
x = ''  # accumulates the text of every <p> tag for the later cells
# Send an HTTP GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the webpage
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find and extract specific elements or data from the webpage
    # Example: Extract all <p> tags
    paragraphs = soup.find_all('p')

    # Print the extracted data
    for paragraph in paragraphs:
        print(paragraph.text)
        x += paragraph.text  # no separator is added, so adjacent paragraphs run together
else:
    print('Failed to retrieve the webpage. Status code:', response.status_code)

A fictional bookstore that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: books.toscrape.com
A website that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.
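
Because the loop appends each paragraph to `x` without a separator, the last word of one paragraph fuses with the first word of the next (this is where the token `books.toscrape.comA` in the later cells comes from). The following is a minimal sketch of one way to keep the boundary, not the notebook's own code: it uses the standard-library `html.parser` on an inline HTML snippet so that it runs without network access. The class name `ParagraphCollector` and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects the text of each <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._buffer = None    # text pieces of the <p> currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


# Inline sample standing in for the downloaded page (assumed markup).
sample_html = "<p>Available at: books.toscrape.com</p><p>A website that lists quotes.</p>"

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()

# Joining with a space keeps the paragraph boundary visible to the tokenizer.
joined = " ".join(collector.paragraphs)
print(joined)
```

In the BeautifulSoup loop above, the equivalent change would be to append `paragraph.text + ' '` instead of `paragraph.text`.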


##Tokenization (Sentence & Word)

In [5]:
# Python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
text = x
sentences = sent_tokenize(text)
words = word_tokenize(text)
print("Sentences:", sentences)
print("Words:", words)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Sentences: ['A fictional bookstore that desperately wants to be scraped.', "It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well.", 'Available at: books.toscrape.comA website that lists quotes from famous people.', 'It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.']
Words: ['A', 'fictional', 'bookstore', 'that', 'desperately', 'wants', 'to', 'be', 'scraped', '.', 'It', "'s", 'a', 'safe', 'place', 'for', 'beginners', 'learning', 'web', 'scraping', 'and', 'for', 'developers', 'validating', 'their', 'scraping', 'technologies', 'as', 'well', '.', 'Available', 'at', ':', 'books.toscrape.comA', 'website', 'that', 'lists', 'quotes', 'from', 'famous', 'people', '.', 'It', 'has', 'many', 'endpoints', 'showing', 'the', 'quotes', 'in', 'many', 'different', 'ways', ',', 'each', 'of', 'them', 'including', 'new', 'scraping', 'challenges', 'for
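
To make the difference between sentence and word tokenization concrete, here is a deliberately naive regex-based sketch. It is a stand-in for illustration only, not how NLTK's Punkt or `word_tokenize` work internally; the function names `simple_sentences` and `simple_words` are hypothetical.

```python
import re


def simple_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive stand-in for Punkt)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def simple_words(text):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


sample = "It has many endpoints. Each adds a new challenge!"
print(simple_sentences(sample))
print(simple_words(sample))
```

Unlike `word_tokenize`, this sketch would break `It's` into three tokens and split `books.toscrape.com` at every dot, which is one reason to prefer a trained tokenizer for real text.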

##Frequency Distribution

In [6]:
# Python
from nltk.probability import FreqDist

# 'words' is the token list produced by word_tokenize in the previous cell
fdist = FreqDist(words)

# Print frequency counts for each word
for word, frequency in fdist.items():
    print(f"{word}: {frequency}")


A: 1
fictional: 1
bookstore: 1
that: 2
desperately: 1
wants: 1
to: 1
be: 1
scraped: 1
.: 4
It: 2
's: 1
a: 1
safe: 1
place: 1
for: 3
beginners: 1
learning: 1
web: 1
scraping: 3
and: 1
developers: 1
validating: 1
their: 1
technologies: 1
as: 2
well: 1
Available: 1
at: 1
:: 1
books.toscrape.comA: 1
website: 1
lists: 1
quotes: 2
from: 1
famous: 1
people: 1
has: 1
many: 2
endpoints: 1
showing: 1
the: 1
in: 1
different: 1
ways: 1
,: 2
each: 1
of: 1
them: 1
including: 1
new: 1
challenges: 1
you: 1
described: 1
below: 1
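
The loop above prints counts in first-seen order, which makes the dominant tokens hard to spot. `FreqDist` is a subclass of `collections.Counter`, so `most_common` is available on it as well. The sketch below uses a plain `Counter` on a short slice of the tokens so that it runs without NLTK; the variable names are illustrative.

```python
from collections import Counter

# A few tokens copied from the word list above; the full list works the same way.
tokens = ["for", "beginners", "learning", "web", "scraping", "and", "for",
          "developers", "validating", "their", "scraping", "technologies"]

counts = Counter(tokens)

# most_common(n) returns the n highest counts; ties keep first-seen order.
top_two = counts.most_common(2)
print(top_two)
```

On the notebook's own data, `fdist.most_common(5)` would list the five most frequent tokens in the same way.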


##Remove stopwords & punctuation


In [7]:
# Python
from nltk.corpus import stopwords
import string
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words and word not in string.punctuation]
print("Filtered Words:", filtered_words)


Filtered Words: ['fictional', 'bookstore', 'desperately', 'wants', 'scraped', "'s", 'safe', 'place', 'beginners', 'learning', 'web', 'scraping', 'developers', 'validating', 'scraping', 'technologies', 'well', 'Available', 'books.toscrape.comA', 'website', 'lists', 'quotes', 'famous', 'people', 'many', 'endpoints', 'showing', 'quotes', 'many', 'different', 'ways', 'including', 'new', 'scraping', 'challenges', 'described']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
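
Note that `word not in string.punctuation` is a substring test against one long string, so it only catches tokens that happen to appear inside it; multi-character punctuation tokens such as `...` pass through. A hedged sketch of a stricter check follows. The tiny `STOP_WORDS` set and the helper names are assumptions for illustration; the notebook itself uses NLTK's English stop-word list.

```python
import string

PUNCT = set(string.punctuation)

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "that", "to", "be", "it"}


def is_punctuation(token):
    """True when the token is non-empty and every character is punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)


def filter_tokens(tokens):
    """Drop stop words (case-insensitively) and pure-punctuation tokens."""
    return [t for t in tokens
            if t.lower() not in STOP_WORDS and not is_punctuation(t)]


tokens = ["A", "fictional", "bookstore", "that", "wants", "to", "be", "scraped", ".", "``", "..."]
print(filter_tokens(tokens))
```

A token such as `'s` still survives this filter, because it contains a letter; removing it would need an explicit rule or an entry in the stop-word set.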


##Lexicon Normalization (Stemming, Lemmatization)

In [8]:
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string

# Download WordNet corpus
nltk.download('wordnet')



# Tokenize the text and remove punctuation
words = [word for word in word_tokenize(text) if word not in string.punctuation]

# Initialize stemmer and lemmatizer
porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Perform stemming and lemmatization
stemmed_words = [porter.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word.lower()) for word in words]

# Print the results
print("Stemmed Words:", stemmed_words)
print("Lemmatized Words:", lemmatized_words)


[nltk_data] Downloading package wordnet to /root/nltk_data...


Stemmed Words: ['a', 'fiction', 'bookstor', 'that', 'desper', 'want', 'to', 'be', 'scrape', 'it', "'s", 'a', 'safe', 'place', 'for', 'beginn', 'learn', 'web', 'scrape', 'and', 'for', 'develop', 'valid', 'their', 'scrape', 'technolog', 'as', 'well', 'avail', 'at', 'books.toscrape.coma', 'websit', 'that', 'list', 'quot', 'from', 'famou', 'peopl', 'it', 'ha', 'mani', 'endpoint', 'show', 'the', 'quot', 'in', 'mani', 'differ', 'way', 'each', 'of', 'them', 'includ', 'new', 'scrape', 'challeng', 'for', 'you', 'as', 'describ', 'below']
Lemmatized Words: ['a', 'fictional', 'bookstore', 'that', 'desperately', 'want', 'to', 'be', 'scraped', 'it', "'s", 'a', 'safe', 'place', 'for', 'beginner', 'learning', 'web', 'scraping', 'and', 'for', 'developer', 'validating', 'their', 'scraping', 'technology', 'a', 'well', 'available', 'at', 'books.toscrape.coma', 'website', 'that', 'list', 'quote', 'from', 'famous', 'people', 'it', 'ha', 'many', 'endpoint', 'showing', 'the', 'quote', 'in', 'many', 'different
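
The lemmatized list contains `'a'` where the text has "as" and `'ha'` where it has "has". A likely cause is that `WordNetLemmatizer.lemmatize` takes an optional `pos` argument that defaults to noun (`'n'`), so every token is treated as a noun here. One common remedy is to pass a part of speech derived from a tagger. The helper below is a hypothetical sketch that maps Penn Treebank tags to WordNet's single-letter codes; it is not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, also the fallback for every other tag


for tag in ["VBZ", "JJ", "RB", "NNS", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With `(word, tag)` pairs from `nltk.pos_tag`, the call would become `lemmatizer.lemmatize(word.lower(), pos=penn_to_wordnet(tag))`, so verb forms are looked up as verbs rather than nouns.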

##Part of Speech tagging

In [9]:
nltk.download('averaged_perceptron_tagger')
pos_tags = nltk.pos_tag(filtered_words)
print("Part of Speech Tags:", pos_tags)


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Part of Speech Tags: [('fictional', 'JJ'), ('bookstore', 'NN'), ('desperately', 'RB'), ('wants', 'VBZ'), ('scraped', 'VBD'), ("'s", 'POS'), ('safe', 'JJ'), ('place', 'NN'), ('beginners', 'NNS'), ('learning', 'VBG'), ('web', 'JJ'), ('scraping', 'VBG'), ('developers', 'NNS'), ('validating', 'VBG'), ('scraping', 'NN'), ('technologies', 'NNS'), ('well', 'RB'), ('Available', 'NNP'), ('books.toscrape.comA', 'NN'), ('website', 'NN'), ('lists', 'NNS'), ('quotes', 'VBZ'), ('famous', 'JJ'), ('people', 'NNS'), ('many', 'JJ'), ('endpoints', 'NNS'), ('showing', 'VBG'), ('quotes', 'NNS'), ('many', 'JJ'), ('different', 'JJ'), ('ways', 'NNS'), ('including', 'VBG'), ('new', 'JJ'), ('scraping', 'NN'), ('challenges', 'NNS'), ('described', 'VBD')]
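
The cell above tags `filtered_words`, that is, the tokens left after stop-word removal. Taggers rely on neighbouring words for context, so tagging the complete token sequence first and filtering afterwards generally gives the tagger more to work with. The sketch below shows only the filtering step; the `(word, tag)` pairs are written by hand for illustration and are not tagger output, and `filter_tagged` is a hypothetical helper.

```python
def filter_tagged(tagged_tokens, stop_words):
    """Keep (word, tag) pairs whose word is not a stop word."""
    return [(word, tag) for word, tag in tagged_tokens
            if word.lower() not in stop_words]


# Hand-written (word, tag) pairs for illustration; in the notebook these
# would come from nltk.pos_tag(words) on the unfiltered token list.
tagged = [("It", "PRP"), ("has", "VBZ"), ("many", "JJ"), ("endpoints", "NNS")]
stop_words = {"it", "has"}  # illustrative subset

print(filter_tagged(tagged, stop_words))
```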


##Named Entity Recognition

In [10]:
# Python
import spacy

nlp = spacy.load("en_core_web_sm")
text2="New York City, often simply referred to as New York, is the most populous city in the United States. It is located in the northeastern region of the country and is known for its iconic landmarks such as the Statue of Liberty, Times Square, and Central Park. The city is a major hub for finance, culture, and entertainment, attracting millions of tourists every year. Some of the world's leading companies and institutions are headquartered in New York City, making it a global center for commerce and innovation."
doc = nlp(text2)
entities = [(entity.text, entity.label_) for entity in doc.ents]
print("Named Entities:", entities)

Named Entities: [('New York City', 'GPE'), ('New York', 'GPE'), ('the United States', 'GPE'), ('the Statue of Liberty', 'FAC'), ('Times Square', 'FAC'), ('Central Park', 'LOC'), ('millions', 'CARDINAL'), ('New York City', 'GPE')]
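
In spaCy's label scheme, `GPE` marks countries, cities and states, `FAC` marks facilities such as buildings and monuments, `LOC` marks other locations, and `CARDINAL` marks numerals. A flat list of pairs is easier to read once it is grouped by label. The sketch below does that with the standard library, using the entity list copied from the output above; the variable names are illustrative.

```python
from collections import defaultdict

# Entity list copied from the output above.
entities = [("New York City", "GPE"), ("New York", "GPE"), ("the United States", "GPE"),
            ("the Statue of Liberty", "FAC"), ("Times Square", "FAC"),
            ("Central Park", "LOC"), ("millions", "CARDINAL"), ("New York City", "GPE")]

by_label = defaultdict(list)
for text, label in entities:
    if text not in by_label[label]:   # skip repeated mentions
        by_label[label].append(text)

for label, names in by_label.items():
    print(f"{label}: {names}")
```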


## IN R

##Tokenization (Sentence & Word)

In [3]:
install.packages("tokenizers")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘Rcpp’, ‘SnowballC’




In [4]:
library(tokenizers)

text <- readline(prompt = "Enter text: ")

word_tokens <- unlist(tokenize_words(text))
sentence_tokens <- unlist(tokenize_sentences(text))

cat("\nTokenized words:\n")
print(word_tokens)

cat("\nTokenized sentences:\n")
print(sentence_tokens)

Enter text: A fictional bookstore that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well.

Tokenized words:
 [1] "a"            "fictional"    "bookstore"    "that"         "desperately" 
 [6] "wants"        "to"           "be"           "scraped"      "it's"        
[11] "a"            "safe"         "place"        "for"          "beginners"   
[16] "learning"     "web"          "scraping"     "and"          "for"         
[21] "developers"   "validating"   "their"        "scraping"     "technologies"
[26] "as"           "well"        

Tokenized sentences:
[1] "A fictional bookstore that desperately wants to be scraped."                                                             
[2] "It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well."


##Frequency Distribution

In [5]:
word_freq <- table(word_tokens)

print("Most common words:")
print(head(sort(word_freq, decreasing = TRUE), 2))

print("Frequency of each word:")
print(word_freq)

[1] "Most common words:"
word_tokens
  a for 
  2   2 
[1] "Frequency of each word:"
word_tokens
           a          and           as           be    beginners    bookstore 
           2            1            1            1            1            1 
 desperately   developers    fictional          for         it's     learning 
           1            1            1            2            1            1 
       place         safe      scraped     scraping technologies         that 
           1            1            1            2            1            1 
       their           to   validating        wants          web         well 
           1            1            1            1            1            1 


##Remove stopwords & punctuation



In [7]:
install.packages("tm")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘NLP’, ‘slam’, ‘BH’




In [8]:
library(tm)

filtered_tokens <- word_tokens[!word_tokens %in% stopwords("en")]

print("Filtered Tokens:")
print(filtered_tokens)

Loading required package: NLP



[1] "Filtered Tokens:"
 [1] "fictional"    "bookstore"    "desperately"  "wants"        "scraped"     
 [6] "safe"         "place"        "beginners"    "learning"     "web"         
[11] "scraping"     "developers"   "validating"   "scraping"     "technologies"
[16] "well"        


##Lexicon Normalization (Stemming, Lemmatization)

In [9]:
stemming <- function(text) {
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}

stemmed_corpus <- stemming(filtered_tokens)

print("Stemmed Tokens:")
print(unlist(sapply(stemmed_corpus, as.character)))

“transformation drops documents”


[1] "Stemmed Tokens:"
 [1] "fiction"   "bookstor"  "desper"    "want"      "scrape"    "safe"     
 [7] "place"     "beginn"    "learn"     "web"       "scrape"    "develop"  
[13] "valid"     "scrape"    "technolog" "well"     


In [12]:
install.packages("textstem")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘zoo’, ‘dtt’, ‘ISOcodes’, ‘sylly.en’, ‘sylly’, ‘syuzhet’, ‘fastmatch’, ‘RcppParallel’, ‘stopwords’, ‘RcppArmadillo’, ‘english’, ‘mgsub’, ‘qdapRegex’, ‘koRpus.lang.en’, ‘hunspell’, ‘koRpus’, ‘lexicon’, ‘quanteda’, ‘textclean’, ‘textshape’




In [14]:
library(textstem)
lemmatization <- function(text) {
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, lemmatize_strings)
  return(corpus)
}

# Note: this lemmatizes the raw input text as one document, not filtered_tokens
lemmatized_corpus <- lemmatization(text)

print("Lemmatized Tokens:")
print(unlist(sapply(lemmatized_corpus, as.character)))

Loading required package: koRpus.lang.en

Loading required package: koRpus

Loading required package: sylly

For information on available language packages for 'koRpus', run

  available.koRpus.lang()

and see ?install.koRpus.lang()



Attaching package: ‘koRpus’


The following object is masked from ‘package:tm’:

    readTagged


“transformation drops documents”


[1] "Lemmatized Tokens:"
[1] "A fictional bookstore that desperately want to be scrape. It's a safe place for beginner learn web scrape and for developer validate their scrape technology as good."


##Scrape data from a website

In [15]:
# Install and load required libraries
install.packages("rvest")
library(rvest)

# Function to scrape text within <p> tags from a website
scrape_website <- function(url) {
  webpage <- read_html(url)
  paragraphs <- html_nodes(webpage, "p")  # Select only <p> tags
  text <- html_text(paragraphs)
  return(text)
}

# URL of the website to scrape
url <- "https://toscrape.com/"

# Scrape data from the website
paragraphs_text <- scrape_website(url)

# Print the scraped text within <p> tags
cat(paragraphs_text)


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



A fictional bookstore that desperately wants to be scraped. It's a safe place for beginners learning web scraping and for developers validating their scraping technologies as well. Available at: books.toscrape.com A website that lists quotes from famous people. It has many endpoints showing the quotes in many different ways, each of them including new scraping challenges for you, as described below.

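Conclusion: Web scraping, tokenization, frequency distribution, stopword and punctuation removal, stemming and lemmatization were performed in both Python and R; part-of-speech tagging and named entity recognition were performed in Python.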