
**Experiment - 7:** Perform the steps involved in Text Analytics in Python & R

**Tasks to be performed:**

Explore Top-5 Text Analytics Libraries in Python (w.r.t Features & Applications)

Explore Top-5 Text Analytics Libraries in R (w.r.t Features & Applications)

Perform the following experiments using Python & R

Tokenization (Sentence & Word)

Frequency Distribution

Remove stopwords & punctuation

Lexicon Normalization (Stemming, Lemmatization)

Part of Speech tagging

Named Entity Recognition

Scrape data from a website

Prepare a document with the Aim, Tasks performed, Program, Output, and Conclusion.

**Program:**

In [None]:
import nltk

nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
text = '''If you're visiting this page, you're likely here because you're searching for a random sentence.'''

In [None]:
#Tokenizing into sentences
sentences = nltk.tokenize.sent_tokenize(text)

print(sentences)

["If you're visiting this page, you're likely here because you're searching for a random sentence."]


In [None]:
#Tokenizing into words
words = nltk.tokenize.word_tokenize(text)

print(words)

['If', 'you', "'re", 'visiting', 'this', 'page', ',', 'you', "'re", 'likely', 'here', 'because', 'you', "'re", 'searching', 'for', 'a', 'random', 'sentence', '.']


In [None]:
#Frequency Distribution of words
freq_dis = nltk.FreqDist(words)

print(freq_dis.most_common(5))

[('you', 3), ("'re", 3), ('If', 1), ('visiting', 1), ('this', 1)]
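`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the same counts can be reproduced with the standard library alone. The sketch below hard-codes the token list printed earlier (an assumption made so that it runs without NLTK) and only illustrates what the frequency distribution computes.

```python
from collections import Counter

# Tokens copied from the word_tokenize output above (hard-coded so the
# example does not need NLTK or its data files).
words = ['If', 'you', "'re", 'visiting', 'this', 'page', ',', 'you', "'re",
         'likely', 'here', 'because', 'you', "'re", 'searching', 'for', 'a',
         'random', 'sentence', '.']

# Count how often each token occurs.
freq = Counter(words)

# Five most frequent tokens; tokens with equal counts keep first-seen order.
top5 = freq.most_common(5)
print(top5)

# Relative frequency of one token: its count divided by the total token count.
total = sum(freq.values())
print(freq['you'] / total)
```

Because punctuation and contraction pieces are still in the list at this stage, they are counted like any other token.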


In [None]:
nltk.download('stopwords')

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

print(stop_words)

{'very', 'with', 'from', 'between', 'weren', 'where', "couldn't", 'will', 'couldn', 'now', 'is', 'once', 'isn', 'down', 'yourselves', 'then', 'only', 'wouldn', 'them', 'through', 'further', 'more', 'or', 'who', 'at', 'me', 'same', 'this', 'so', 'myself', 'needn', 'nor', "won't", 'about', "shouldn't", 'ain', 'we', 'won', 'what', 'am', 'both', 'some', 'doing', 'again', "hadn't", 'i', 'but', 'all', "aren't", 'him', "weren't", 're', 'by', 'an', 'my', 'such', 'your', 'its', 'aren', "mightn't", 'her', 'does', 'of', 'hadn', 'himself', 'it', 'to', 'our', 'while', 'other', 'as', 'own', "that'll", 'were', 'and', 'which', 'any', "you'll", 'she', 'you', 's', 'are', 've', 'that', 'in', 'for', "you're", 'can', 'no', 'their', 'should', 'didn', 'the', 'whom', 'themselves', 'why', 'y', "haven't", 'over', 'theirs', 'm', 'herself', 'mustn', "needn't", 'll', 'below', 't', 'shouldn', "you've", 'before', 'few', 'don', 'did', 'ourselves', 'most', 'when', "wouldn't", 'just', 'after', 'have', 'those', 'into', 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
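#Filtering Stopwords
#Note: the lookup is case-sensitive and the NLTK list is lowercase, so 'If' is kept;
#the list contains 're' but not "'re", so the contraction pieces are kept as well

filtered_words = []

for i in words:
  if i not in stop_words:
    filtered_words.append(i)

print(filtered_words)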

['If', "'re", 'visiting', 'page', ',', "'re", 'likely', "'re", 'searching', 'random', 'sentence', '.']
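The output above still contains 'If' because the stopword lookup is case-sensitive. One possible variant lowercases each token before the lookup. The sketch below uses a small hand-picked stopword subset (an assumption for illustration; the notebook uses the full NLTK English list) so that it runs without NLTK.

```python
# A small hand-picked stopword subset (an assumption for this sketch; the
# notebook itself uses stopwords.words("english") from NLTK).
stop_words = {'if', 'you', 'this', 'here', 'because', 'for', 'a'}

# Tokens copied from the word_tokenize output shown earlier.
words = ['If', 'you', "'re", 'visiting', 'this', 'page', ',', 'you', "'re",
         'likely', 'here', 'because', 'you', "'re", 'searching', 'for', 'a',
         'random', 'sentence', '.']

# Lowercase each token only for the lookup, so 'If' matches 'if' while the
# surviving tokens keep their original casing.
filtered = [w for w in words if w.lower() not in stop_words]
print(filtered)
```

The contraction piece "'re" still survives here, since it matches no entry in the stopword set.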


In [None]:
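#Filter out Punctuation
import string

#Only tokens that are exactly one punctuation character are dropped,
#so contraction pieces such as "'re" remain
punctuations = list(string.punctuation)

filtered_words2 = []

for i in filtered_words:
    if i not in punctuations:
        filtered_words2.append(i)

print(filtered_words2)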

['If', "'re", 'visiting', 'page', "'re", 'likely', "'re", 'searching', 'random', 'sentence']
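The membership test above only removes tokens that are a single punctuation character. Tokenizers can also emit multi-character punctuation tokens (for example the quote markers `` and '' or an ellipsis), which would slip through. A minimal sketch of a stricter check, using an illustrative token list that is not taken from the notebook:

```python
import string

PUNCT = set(string.punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and made up only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Illustrative tokens (an assumption for this sketch), including
# multi-character punctuation tokens.
tokens = ['If', "'re", 'page', ',', '``', 'random', "''", '...', '.']

# Keep every token that contains at least one non-punctuation character.
cleaned = [t for t in tokens if not is_punctuation(t)]
print(cleaned)
```

Tokens such as "'re" are kept because they contain letters; removing them would need a separate rule.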


In [None]:
#Stemming

from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

stemmed_words = []

for i in filtered_words2:
  stemmed_words.append(stemmer.stem(i))

print(stemmed_words)

['if', "'re", 'visit', 'page', "'re", 'like', "'re", 'search', 'random', 'sentenc']


In [None]:
#Lemmatization

from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')

lem = WordNetLemmatizer()
word = "flying"

print("Lemmatized Word:",lem.lemmatize(word,"v"))

[nltk_data] Downloading package wordnet to /root/nltk_data...


Lemmatized Word: fly


In [None]:
#POS Tagging

from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')

sent = "Albert Einstein was born in Ulm, Germany in 1879."

tokens=word_tokenize(sent)
pos_=pos_tag(tokens)

print("PoS tags:",pos_)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


PoS tags: [('Albert', 'NNP'), ('Einstein', 'NNP'), ('was', 'VBD'), ('born', 'VBN'), ('in', 'IN'), ('Ulm', 'NNP'), (',', ','), ('Germany', 'NNP'), ('in', 'IN'), ('1879', 'CD'), ('.', '.')]


In [None]:
#Named Entity Recognition

from nltk import ne_chunk
nltk.download('maxent_ne_chunker')
nltk.download('words')

sent="New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases."

for chunk in ne_chunk(pos_tag(word_tokenize(sent))):
    #Named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples
    if hasattr(chunk, 'label'):
        print(chunk.label(), ' '.join(c[0] for c in chunk))


[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


GPE New York City


In [None]:
!pip install beautifulsoup4
!pip install requests



In [None]:
from bs4 import BeautifulSoup
import requests
from requests import HTTPError

In [None]:
url = 'https://github.com/LifnaJos/ADL601-Data-Analytics-and-Visualization-Lab/blob/main/Experiments/Experiment_7.md'

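try:
  response = requests.get(url, timeout=10)
  #requests.get does not raise on 4xx/5xx status codes by itself
  response.raise_for_status()
except HTTPError as e:
  print(e)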

In [None]:
soup = BeautifulSoup(response.text, 'html.parser')

In [None]:
links = soup.find_all('a')

for link in links:
    print(link.get('href'))

\"#experiment---7-perform-the-steps-involved-in-text-analytics-in-python--r\"
\"#lab-outcomes-lo\"
\"#task-to-be-performed-\"
\"#tools--libraries-to-be-explored\"
\"#theory-to-be-written\"
\"#outcome-\"
\"#online-resources\"
\"https://github.com/LifnaJos/ADC601-Data-Analytics-Visualization/blob/DAV_Colab_Notebooks/Data_Preprocessing_techniques.ipynb\"
\"https://guides.library.upenn.edu/penntdm/python\"
\"https://machinelearninggeek.com/text-analytics-for-beginners-using-python-nltk/\"
\"https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/\"
\"https://www.kdnuggets.com/2020/05/text-mining-python-steps-examples.html\"
\"https://www.youtube.com/watch?v=bZoC-UW50sI&list=PLH6mU1kedUy-xjgiuvqMkVn8npK0TGAv5\"
\"https://www.analyticsvidhya.com/blog/2022/07/sentiment-analysis-using-python/\"

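The loop above prints the raw `href` values, including fragment-only links that are meaningful only relative to the page. As a rough sketch, the same link extraction can be done with the standard library's `html.parser`, resolving each link against a base URL with `urljoin`. The HTML snippet, the base URL and the `LinkCollector` class are made up for illustration and are not part of the notebook.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href:  # skip anchors without an href attribute
            self.links.append(urljoin(self.base_url, href))

# Hypothetical page content and URL, used instead of a live request.
sample_html = '''
<p><a href="#outcome">Outcome</a>
<a href="/docs/page.html">Docs</a>
<a href="https://example.org/x">External</a>
<a>No href</a></p>
'''

collector = LinkCollector('https://example.com/lab/exp7')
collector.feed(sample_html)
print(collector.links)
```

Resolving the links this way turns fragment-only and site-relative values into absolute URLs that can be followed directly.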

**R**

In [None]:
install.packages("tokenizers")
install.packages("tm")
install.packages("udpipe")
install.packages("spacyr")
install.packages("rvest")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘Rcpp’, ‘SnowballC’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘NLP’, ‘slam’, ‘BH’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘RcppTOML’, ‘here’, ‘png’, ‘reticulate’


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(tokenizers)
library(tm)
library(udpipe)
library(spacyr)
library(rvest)

Loading required package: NLP



In [None]:
#Tokenization
text <- 'Suave, charming and volatile, Reggie Kray and his unstable twin brother Ronnie start to leave their mark on the London underworld in the 1960s. Using violence to get what they want, the siblings orchestrate robberies and murders while running nightclubs and protection rackets. With police Detective Leonard "Nipper" Read hot on their heels, the brothers continue their rapid rise to power and achieve tabloid notoriety.'

word_tokens <- tokenize_words(text)
sent_tokens <- tokenize_sentences(text)

print("Sentence Tokens:")
print(sent_tokens)
print("Word Tokens:")
print(word_tokens)

[1] "Sentence Tokens:"
[[1]]
[1] "Suave, charming and volatile, Reggie Kray and his unstable twin brother Ronnie start to leave their mark on the London underworld in the 1960s." 
[2] "Using violence to get what they want, the siblings orchestrate robberies and murders while running nightclubs and protection rackets."           
[3] "With police Detective Leonard \"Nipper\" Read hot on their heels, the brothers continue their rapid rise to power and achieve tabloid notoriety."

[1] "Word Tokens:"
[[1]]
 [1] "suave"       "charming"    "and"         "volatile"    "reggie"     
 [6] "kray"        "and"         "his"         "unstable"    "twin"       
[11] "brother"     "ronnie"      "start"       "to"          "leave"      
[16] "their"       "mark"        "on"          "the"         "london"     
[21] "underworld"  "in"          "the"         "1960s"       "using"      
[26] "violence"    "to"          "get"         "what"        "they"       
[31] "want"        "the"         "siblings

In [None]:
#Word Frequency Distribution
word_freq <- table(word_tokens)

print("Word Frequency Distribution:")
print(word_freq)

[1] "Word Frequency Distribution:"
word_tokens
      1960s     achieve         and     brother    brothers    charming 
          1           1           5           1           1           1 
   continue   detective         get       heels         his         hot 
          1           1           1           1           1           1 
         in        kray       leave     leonard      london        mark 
          1           1           1           1           1           1 
    murders  nightclubs      nipper   notoriety          on orchestrate 
          1           1           1           1           2           1 
     police       power  protection     rackets       rapid        read 
          1           1           1           1           1           1 
     reggie        rise   robberies      ronnie     running    siblings 
          1           1           1           1           1           1 
      start       suave     tabloid         the       their        they 
    

In [None]:
sorted_freq <- sort(word_freq, decreasing = TRUE)

print("Top 5 most used words:")
print(sorted_freq[1:5])

[1] "Top 5 most used words:"
word_tokens
  and   the their    to    on 
    5     4     3     3     2 


In [None]:
#Punctuation and Stop word removal
corpus <- Corpus(VectorSource(text))

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

cleaned_text <- sapply(corpus, as.character)
print(cleaned_text)

“transformation drops documents”
“transformation drops documents”
“transformation drops documents”


[1] "suave charming  volatile reggie kray   unstable twin brother ronnie start  leave  mark   london underworld   1960s using violence  get   want  siblings orchestrate robberies  murders  running nightclubs  protection rackets  police detective leonard nipper read hot   heels  brothers continue  rapid rise  power  achieve tabloid notoriety"


In [None]:
#Lexicon Normalization
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to /content/english-ewt-ud-2.5-191206.udpipe

 - This model has been trained on version 2.5 of data from https://universaldependencies.org

 - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0

 - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.

 - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')

Downloading finished, model stored at '/content/english-ewt-ud-2.5-191206.udpipe'



In [None]:
tokens <- udpipe_annotate(ud_model, x = text)

lemmas <- as.data.frame(tokens)$lemma
cat("Lemmas:", lemmas, "\n")

#The token column holds the original surface forms; the udpipe annotation
#has a lemma column but no stem column, so these are tokens, not stems
original_tokens <- as.data.frame(tokens)$token
cat("Tokens:", original_tokens, "\n")

Lemmas: Suave , charming and volatile , Reggie Kray and he unstable twin brother Ronnie start to leave they mark on the London underworld in the 1960 . use violence to get what they want , the sibling orchestrate robbery and murder while run nightclub and protection racket . with police detective Leonard " Nipper " read hot on they heel , the brother continue they rapid rise to power and achieve tabloid notoriety . 
Tokens: Suave , charming and volatile , Reggie Kray and his unstable twin brother Ronnie start to leave their mark on the London underworld in the 1960s . Using violence to get what they want , the siblings orchestrate robberies and murders while running nightclubs and protection rackets . With police Detective Leonard " Nipper " Read hot on their heels , the brothers continue their rapid rise to power and achieve tabloid notoriety . 


In [None]:
#POS tagging
pos_tags <- as.data.frame(tokens)$upos
cat("POS tags:", pos_tags, "\n")

POS tags: PROPN PUNCT NOUN CCONJ NOUN PUNCT PROPN PROPN CCONJ PRON ADJ NOUN NOUN PROPN VERB PART VERB PRON NOUN ADP DET PROPN NOUN ADP DET NOUN PUNCT VERB NOUN PART VERB PRON PRON VERB PUNCT DET NOUN ADJ NOUN CCONJ NOUN SCONJ VERB NOUN CCONJ NOUN NOUN PUNCT ADP NOUN PROPN PROPN PUNCT PROPN PUNCT VERB ADJ ADP PRON NOUN PUNCT DET NOUN VERB PRON ADJ NOUN ADP NOUN CCONJ NOUN NOUN NOUN PUNCT 


In [None]:
#Web Scraping
url <- "https://en.wikipedia.org/wiki/Web_scraping"
webpage <- read_html(url)

#Extract the text of every element with the CSS class "text"
text_nodes <- webpage %>%
  html_nodes(".text") %>%
  html_text()

print(text_nodes)

 [1] "\"Web scraping\""                                                                                                                   
 [2] "news"                                                                                                                               
 [3] "newspapers"                                                                                                                         
 [4] "books"                                                                                                                              
 [5] "scholar"                                                                                                                            
 [6] "JSTOR"                                                                                                                              
 [7] "improve this section"                                                                                                               
 [8] "\"SASSCAL WebSAPI: A 

**CONCLUSION:** Identified the text analytics libraries available in Python and R, and performed simple text analytics experiments with them in both languages.