# ASSIGNMENT 8

## NAMED ENTITY RECOGNITION

## LOAD SPACY MODELS FOR ENGLISH AND FRENCH

spaCy is a Python library for natural language processing (NLP) designed to process large volumes of text efficiently. It provides tokenization, part-of-speech tagging, dependency parsing, and named entity recognition, which makes it a popular choice for NLP tasks.

In [5]:
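# Load spaCy models for processing text in English and French.

import spacy

# Keep each pipeline under its own name. Assigning both to `nlp` would
# silently replace the English model with the French one.
nlp_en = spacy.load("en_core_web_lg")
nlp_fr = spacy.load("fr_core_news_lg")

# The outputs recorded in this notebook come from the French pipeline
# (its entity labels are LOC, MISC, ORG, PER). Use nlp_en for English text.
nlp = nlp_fr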
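
When more than one language is in play, it helps to keep each pipeline under its own key so that text is always processed by the matching model. The sketch below is illustrative only: `PIPELINES` and `tokenize` are hypothetical names, and it uses `spacy.blank` (tokenizer-only pipelines) so that it runs without downloading the large models. With the models installed, `spacy.load("en_core_web_lg")` and `spacy.load("fr_core_news_lg")` could take their place.

```python
import spacy

# Tokenizer-only pipelines; stand-ins for the large models (an assumption
# made so the sketch runs without any model download).
PIPELINES = {
    "en": spacy.blank("en"),
    "fr": spacy.blank("fr"),
}

def tokenize(text, lang):
    """Tokenize `text` with the pipeline registered for `lang`."""
    nlp_for_lang = PIPELINES[lang]
    return [token.text for token in nlp_for_lang(text)]

print(tokenize("The sunrise painted the sky.", "en"))
print(tokenize("Le soleil se lève.", "fr"))
```

Looking a pipeline up by language code makes the choice of model explicit at every call site.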

## TOKENIZE TEXT

In [6]:
text = "The sunrise painted the sky in hues of pink and orange, heralding a new day full of possibilities"
doc = nlp(text)
for token in doc:
    print(token, end=" | ")

The | sunrise | painted | the | sky | in | hues | of | pink | and | orange | , | heralding | a | new | day | full | of | possibilities | 

## GENERATE DATAFRAME FOR TOKEN VISUALIZATION

In [9]:
# Generate a dataframe for visualizing spaCy tokens with options to include or exclude punctuation.

import pandas as pd

def display_nlp(doc, include_punct=False):
    """Generate data frame for visualization of spaCy tokens."""
    rows = []
    for i, t in enumerate(doc):
        if not t.is_punct or include_punct:
            row = {'token': i,  'text': t.text, 'lemma_': t.lemma_, 
                   'is_stop': t.is_stop, 'is_alpha': t.is_alpha,
                   'pos_': t.pos_, 'dep_': t.dep_, 
                   'ent_type_': t.ent_type_, 'ent_iob_': t.ent_iob_}
            rows.append(row)
    
    df = pd.DataFrame(rows).set_index('token')
    df.index.name = None
    return df
display_nlp(doc)

|   | text | lemma_ | is_stop | is_alpha | pos_ | dep_ | ent_type_ | ent_iob_ |
|---|---|---|---|---|---|---|---|---|
| 0 | The | The | False | True | X | nsubj | MISC | B |
| 1 | sunrise | sunrise | False | True | X | flat:foreign | MISC | I |
| 2 | painted | painted | False | True | X | flat:foreign | MISC | I |
| 3 | the | the | False | True | X | flat:foreign | MISC | I |
| 4 | sky | sky | False | True | X | flat:foreign | MISC | I |
| 5 | in | in | False | True | X | flat:foreign | MISC | I |
| 6 | hues | huer | False | True | X | flat:foreign | MISC | I |
| 7 | of | of | False | True | ADP | case | MISC | I |
| 8 | pink | pink | False | True | PROPN | flat:foreign | MISC | I |
| 9 | and | and | False | True | X | flat:foreign | MISC | I |
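
The `ent_iob_` column uses the IOB scheme: `B` marks the first token of an entity, `I` a token inside one, and `O` a token outside any entity. The sketch below shows how such tags group tokens into entity spans. `iob_to_spans` is a hypothetical helper written for illustration, not part of spaCy, and the sample rows are a toy sequence rather than model output.

```python
def iob_to_spans(rows):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    spans = []
    current_words, current_label = [], None
    for text, iob, label in rows:
        if iob == "B" or (iob == "I" and current_label != label):
            # A new entity starts; flush the previous one if there is one.
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [text], label
        elif iob == "I":
            current_words.append(text)
        else:
            # "O" (or an empty tag) closes any open entity.
            if current_words:
                spans.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        spans.append((" ".join(current_words), current_label))
    return spans

rows = [
    ("The", "B", "MISC"), ("sunrise", "I", "MISC"), ("painted", "I", "MISC"),
    ("over", "O", ""), ("Tamil", "B", "LOC"), ("Nadu", "I", "LOC"),
]
print(iob_to_spans(rows))
# [('The sunrise painted', 'MISC'), ('Tamil Nadu', 'LOC')]
```

Joining words with a single space is a simplification; spaCy's own `doc.ents` preserves the original spacing.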


## FILTER OUT STOP WORDS AND PUNCTUATION

Stop words such as "and," "the," and "is" occur so frequently that they are often omitted from text analysis in order to concentrate on content-bearing words. Punctuation marks such as exclamation points, commas, and periods organize written language and help convey meaning; in text processing tasks they are often removed or handled separately.

In [10]:
# Extract non-stop words and non-punctuation tokens from the given text.

text = "Nestled in the heart of the forest, the ancient oak tree stood tall, its branches whispering secrets of the past to those who would listen"
doc = nlp(text)

non_stop = [t for t in doc if not t.is_stop and not t.is_punct]
print(non_stop)

[Nestled, in, the, heart, of, the, forest, the, ancient, oak, tree, stood, tall, its, branches, whispering, secrets, of, the, past, to, those, who, would, listen]
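
Stop-word lists are language specific: `is_stop` consults the list that belongs to the pipeline's language. The following is a minimal sketch, assuming a tokenizer-only English pipeline from `spacy.blank("en")` (chosen here so that no model download is needed). `content_tokens` is a hypothetical helper.

```python
import spacy

nlp_en = spacy.blank("en")  # tokenizer plus English lexical attributes

def content_tokens(text, nlp_pipeline):
    """Return token texts that are neither stop words nor punctuation."""
    doc = nlp_pipeline(text)
    return [t.text for t in doc if not t.is_stop and not t.is_punct]

text = "The ancient oak tree stood tall, its branches whispering secrets of the past."
print(content_tokens(text, nlp_en))
```

Passing the pipeline as an argument makes it easy to compare how different language pipelines treat the same sentence.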


## EXTRACT NOUNS FROM TEXT

In [11]:
# Extract nouns and proper nouns from the given text.

text = "Nestled in the heart of the forest, the ancient oak tree stood tall, its branches whispering secrets of the past to those who would listen"
doc = nlp(text)

nouns = [t for t in doc if t.pos_ in ['NOUN', 'PROPN']]
print(nouns)

[Nestled, stood, tall, branches]


## IDENTIFY ENTITIES IN TEXT

In [13]:
# Print identified entities along with their labels.

text = "Nestled in the heart of the forest, the ancient oak tree stood tall, its branches whispering secrets of the past to those who would listen"
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")

(Nestled in the heart, MISC) (oak tree stood tall, ORG) (its branches whispering secrets of the past to those who would listen, MISC) 

## IDENTIFY ENTITIES IN TEXT

Identifying entities in text means recognizing and classifying specific pieces of information, such as the names of people, organizations, and places, as well as dates and numerical expressions. Named entity recognition (NER) extracts and labels these entities automatically from a text corpus, which supports deeper semantic analysis and information retrieval.

In [14]:
# Print identified entities along with their labels.

text = "The gentle hum of the city at night was a comforting reminder of the interconnected lives and stories unfolding within it."
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")

(The gentle hum of the city at night was, MISC) (comforting, PER) 

## VISUALIZE ENTITIES IN TEXT

In [15]:
# Render a visualization of the identified entities in the text.

from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)
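
Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below assumes a tokenizer-only pipeline and a hand-set entity (both chosen for illustration so the example runs without a trained model). With a trained pipeline, `doc.ents` would be filled automatically.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc_demo = nlp_blank("Heavy rain is expected in Tamil Nadu this week.")

# Hand-set entity for illustration: tokens 5 and 6 are "Tamil" and "Nadu".
doc_demo.ents = [Span(doc_demo, 5, 7, label="LOC")]

# jupyter=False returns the HTML as a string; page=True wraps it in a full page.
html = displacy.render(doc_demo, style="ent", jupyter=False, page=True)
print(len(html), "characters of HTML")
```

The returned string can be written to a file and opened in a browser.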

## CONVERT URL TO TEXT AND COUNT ENTITIES

In [97]:
# Convert the content of a given URL into text and count the identified entities.

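from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    """Download a page and return its visible text as one string."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(res.text, 'html.parser')
    # Drop elements whose content is not article text.
    for element in soup(["script", "style", 'aside']):
        element.extract()
    # Collapse runs of newlines and tabs into single spaces.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

page_text = url_to_string('https://www.thehansindia.com/')
article = nlp(page_text)
len(article.ents)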

190
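
`soup.get_text()` concatenates adjacent text nodes without a separator, which can fuse words from neighbouring elements. The following is a hedged variant that uses BeautifulSoup's `get_text(separator=..., strip=...)` parameters and works on an inline HTML string rather than a live URL. `html_to_text` and the sample markup are illustrative only.

```python
from bs4 import BeautifulSoup

def html_to_text(html):
    """Strip script/style/aside elements and return space-separated text."""
    soup = BeautifulSoup(html, "html.parser")
    for element in soup(["script", "style", "aside"]):
        element.decompose()  # remove the element and its contents
    # A separator keeps text from neighbouring elements apart.
    text = soup.get_text(separator=" ", strip=True)
    return " ".join(text.split())

sample = (
    "<html><body><script>track()</script>"
    "<h2>Not a chance</h2><h2>Weekly Love Horoscope</h2>"
    "</body></html>"
)
print(html_to_text(sample))
# Not a chance Weekly Love Horoscope
```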

## VISUALIZE ENTITIES IN TEXT

In [98]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

## COUNT ENTITY LABELS

In [99]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)


Counter({'MISC': 109, 'PER': 33, 'ORG': 28, 'LOC': 20})
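
Raw counts are easier to compare across pages once they are expressed as shares of the total. The sketch below reuses the counts printed above. `label_shares` is a hypothetical helper.

```python
from collections import Counter

# Counts copied from the output above.
label_counts = Counter({"MISC": 109, "PER": 33, "ORG": 28, "LOC": 20})

def label_shares(counts):
    """Return (label, share) pairs, largest count first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(label, n / total) for label, n in counts.most_common()]

for label, share in label_shares(label_counts):
    print(f"{label:<5}{share:6.1%}")
```

Here `MISC` accounts for 109 of the 190 entities, a little over half.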

## COUNT MOST COMMON ENTITIES

In [100]:
# Count the most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(5)

[('March 2024', 4),
 ('Kanna Rao', 3),
 ("Kejriwal'", 3),
 ('Gold rates in', 3),
 ('Steel and', 2)]

## PRINT SPECIFIC SENTENCE

In [101]:
# Print the 21st sentence from the extracted article text.

sentences = [x for x in article.sents]
print(sentences[20])

In KCR’s nephew Kanna Rao arrested for land-grabbing 


## VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [102]:
# Render a visualization of the identified entities in the 21st sentence of the extracted article text.

displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

## EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [103]:
# Extract words along with their parts of speech and lemmas from the 21st sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[20]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('In', 'X', 'In'),
 ('\xa0', 'SPACE', '\xa0'),
 ('KCR’', 'X', 'KCR’'),
 ('s', 'X', 's'),
 ('nephew', 'PROPN', 'nephew'),
 ('Kanna', 'X', 'Kanna'),
 ('Rao', 'X', 'Rao'),
 ('arrested', 'X', 'arrested'),
 ('for', 'X', 'for'),
 ('land', 'X', 'land'),
 ('-', 'ADJ', '-'),
 ('grabbing', 'NOUN', 'grabbing'),
 ('\xa0', 'SPACE', '\xa0')]
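
The `'\xa0'` entries tagged `SPACE` are non-breaking spaces carried over from the HTML. The following is a minimal sketch of normalizing whitespace before the text reaches the pipeline, assuming that collapsing every whitespace run to a single space is acceptable for the analysis. `normalize_whitespace` is a hypothetical helper, and the sample string is modelled on the sentence above.

```python
import re

def normalize_whitespace(text):
    """Collapse whitespace runs, including non-breaking spaces, to one space."""
    # For Python 3 str patterns, \s matches Unicode whitespace such as \xa0.
    return re.sub(r"\s+", " ", text).strip()

raw = "In\xa0KCR’s nephew Kanna Rao arrested for land-grabbing\xa0"
print(repr(normalize_whitespace(raw)))
# 'In KCR’s nephew Kanna Rao arrested for land-grabbing'
```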

## VISUALIZE DEPENDENCY PARSING

In natural language processing (NLP), dependency parsing analyzes the grammatical structure of a sentence by identifying relationships between words. Each relationship is a directed edge from a head token to a dependent token, and together the edges form a dependency tree. Exposing this hierarchical, syntactic structure supports tasks such as semantic analysis, question answering, and machine translation.

In [104]:
# Render a visualization of the dependency parsing for the 21st sentence of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})
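
The arcs that displaCy draws are also available programmatically: each token exposes its head through `token.head` and its relation through `token.dep_`. The sketch below builds a `Doc` with a hand-written parse, so it runs without a trained parser. The words, heads, and labels are illustrative assumptions, not model output.

```python
import spacy
from spacy.tokens import Doc

nlp_blank = spacy.blank("en")

# Hand-annotated parse of a three-word sentence (not produced by a model).
words = ["Police", "arrested", "Rao"]
heads = [1, 1, 1]                 # absolute index of each token's head
deps = ["nsubj", "ROOT", "dobj"]  # relation between each token and its head

parsed = Doc(nlp_blank.vocab, words=words, heads=heads, deps=deps)

for token in parsed:
    print(f"{token.text:<10}{token.dep_:<7}head={token.head.text}")

# The root is the token whose relation is ROOT; its children are its dependents.
root = next(t for t in parsed if t.dep_ == "ROOT")
print("children of root:", [child.text for child in root.children])
```

With a trained pipeline, the same attributes are filled in by the parser, and the loop can be applied to any sentence in `article.sents`.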

## CONVERT URL TO TEXT AND COUNT ENTITIES

In [105]:
# Convert the content of a given URL into text and count the identified entities

# url_to_string is defined in the earlier cell and reused here.
ny_bb = url_to_string('https://www.thehindu.com/')
article = nlp(ny_bb)
len(article.ents)

97

## VISUALIZE ENTITIES IN TEXT

In [106]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

## COUNT ENTITY LABELS

In [107]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'MISC': 52, 'ORG': 23, 'LOC': 16, 'PER': 6})

## COUNT MOST COMMON ENTITIES

In [108]:
# Count the 25 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(25)

[('Tamil Nadu', 3),
 ('TRIALGIFT', 2),
 ('Subscription Subscribe', 2),
 ('Shorts Data Health', 2),
 ('Delhi Police', 2),
 ('Cricket IPL', 2),
 ('The Hindu', 2),
 ('Breaking News Today', 1),
 ('Top Headlines', 1),
 ('India World Opinion Elections e-Paper Shorts Data Health', 1),
 ('Open in The Hindu', 1),
 ('Subscription ACCOUNT', 1),
 ('India World Opinion', 1),
 ('SEARCH News Business Entertainment Life & Style Society', 1),
 ('Movies Food Children Data Kochi Books Brandhub', 1),
 ('ShowcaseSubscribe to NewslettersCrossword+CONNECT', 1),
 ('USIndia', 1),
 ('Luxury Luxury', 1),
 ('EnvironmentGoing green | Myriad hues for festival of colours', 1),
 ('How bad is the humanitarian crisis in Gaza', 1),
 ('Explained AAP calls for protest against Kejriwal’', 1),
 ('ITO', 1),
 ('PTILive LS', 1),
 ('BJP likely to release 4th list of candidates today', 1),
 ('Congress U.P.', 1)]

## PRINT SPECIFIC SENTENCE

In [109]:
# Print the 4th sentence from the extracted article text.

sentences = [x for x in article.sents]
print(sentences[3])


| The Hindu   India World Opinion Elections e-Paper Shorts Data Health


## VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [110]:
# Render a visualization of the identified entities in the 4th sentence of the extracted article text.

displacy.render(nlp(str(sentences[3])), jupyter=True, style='ent')

## EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [111]:
# Extract words along with their parts of speech and lemmas from the 4th sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[3]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('The', 'X', 'The'),
 ('Hindu', 'X', 'Hindu'),
 ('  ', 'SPACE', '  '),
 ('India', 'NOUN', 'india'),
 ('World', 'X', 'World'),
 ('Opinion', 'NOUN', 'opinion'),
 ('Elections', 'NOUN', 'election'),
 ('e-Paper', 'NOUN', 'e-paper'),
 ('Shorts', 'NOUN', 'short'),
 ('Data', 'X', 'Data'),
 ('Health', 'X', 'Health')]

## VISUALIZE DEPENDENCY PARSING

In [112]:
# Render a visualization of the dependency parsing for the 4th sentence of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[3])), style='dep', jupyter = True, options = {'distance': 120})

## CONVERT URL TO TEXT AND COUNT ENTITIES

In [113]:
# Convert the content of a given URL into text and count the identified entities.

# url_to_string is defined in the earlier cell and reused here.
ny_bb = url_to_string('https://timesofindia.indiatimes.com/us')
article = nlp(ny_bb)
len(article.ents)

322

## VISUALIZE ENTITIES IN TEXT

In [114]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

## COUNT ENTITY LABELS

In [115]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'MISC': 166, 'PER': 81, 'ORG': 49, 'LOC': 26})

## COUNT MOST COMMON ENTITIES

In [116]:
# Count the 25 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(25)

[('March 24', 3),
 ('IPL', 3),
 ('Read your weekly', 2),
 ('Facebook & Whatsapp', 2),
 ('RCB IPL', 2),
 ("Mustafizur Rahman'", 2),
 ('world class', 2),
 ('Yemen by Houthi', 2),
 ('KKR', 2),
 ('Pradesh', 2),
 ('CM Kejriwal', 2),
 ('by Bollywood', 2),
 ('How to make', 2),
 ('US', 2),
 ('Consulate', 2),
 ('News', 1),
 ('US News', 1),
 ('Top News in India', 1),
 ('US election news', 1),
 ('Business news', 1),
 ('Sports & International News | Times of IndiaEditionUSUSINSun', 1),
 ('Mar 24', 1),
 ('Updated', 1),
 ('ISTRead', 1),
 ('InCityMetro CitiesmumbaidelhibengaluruHyderabadkolkatachennaiOther CitiesCityagraagartalaahmedabadajmerallahabadamaravatiamritsaraurangabadbareillybhubaneswarbhopalchandigarhcoimbatorecuttackdehradunerodefaridabadghaziabadgoagurgaonguwahatihubballiimphalindoreitanagarjaipurjammujamshedpurjodhpurkanpurkochikohimakolhapurkozhikodeludhianalucknowmaduraimangalurumeerutmumbai',
  1)]
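
Several of the most frequent "entities" are navigation text or fused strings rather than names. One possible cleanup is sketched below, under the assumption that plausible entity names are short. The thresholds and the helper name `plausible_entities` are hypothetical and would need tuning, and the sample strings are modelled on the output above.

```python
from collections import Counter

def plausible_entities(texts, max_words=4, max_word_len=20):
    """Keep entity strings that are short enough to be plausible names."""
    kept = []
    for text in texts:
        words = text.split()
        if not words or len(words) > max_words:
            continue  # empty, or too many words for a name
        if any(len(word) > max_word_len for word in words):
            continue  # likely several words fused together
        kept.append(text)
    return kept

sample_items = [
    "IPL", "IPL", "CM Kejriwal",
    "Sports & International News | Times of IndiaEditionUSUSINSun",
    "InCityMetroCitiesmumbaidelhibengaluru",
]
print(Counter(plausible_entities(sample_items)).most_common(3))
# [('IPL', 2), ('CM Kejriwal', 1)]
```

A length filter is crude: it cannot tell a short menu label such as "News" from a real name, so it complements rather than replaces better text extraction.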

## PRINT SPECIFIC SENTENCE

In [117]:
# Print the 26th sentence from the extracted article text.

sentences = [x for x in article.sents]
print(sentences[25])

Not a chanceWeekly Love Horoscope, March 24


## VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [119]:
# Render a visualization of the identified entities in the 26th sentence of the extracted article text.

displacy.render(nlp(str(sentences[25])), jupyter=True, style='ent')

## EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [120]:
# Extract words along with their parts of speech and lemmas from the 26th sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[25]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('Not', 'PROPN', 'Not'),
 ('chanceWeekly', 'VERB', 'chanceweekly'),
 ('Love', 'PROPN', 'Love'),
 ('Horoscope', 'PROPN', 'Horoscope'),
 ('March', 'PROPN', 'March'),
 ('24', 'NUM', '24')]

## VISUALIZE DEPENDENCY PARSING

In [121]:
# Render a visualization of the dependency parsing for the 26th sentence of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[25])), style='dep', jupyter = True, options = {'distance': 120})