## ASSIGNMENT 9
## NAMED ENTITY RECOGNITION

### LOAD SPACY MODELS FOR ENGLISH AND FRENCH

spaCy is an open-source Python library for natural language processing (NLP), designed for fast processing of large volumes of text. Its pipelines provide tokenization, part-of-speech tagging, named entity recognition, and dependency parsing, which makes it a common choice for NLP tasks.

In [334]:
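# Load spaCy pipelines for processing text in English and French.
# Each pipeline gets its own name; assigning both to `nlp` would discard the first.
# `nlp` is bound to the French pipeline, which produced the outputs shown in the cells below.

import spacy
nlp_en = spacy.load("en_core_web_lg")
nlp_fr = spacy.load("fr_core_news_lg")
nlp = nlp_fr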

### TOKENIZE TEXT

In [335]:
# Tokenize the given text and print each token, separated by " | ".

text = "The sun dipped below the horizon, casting hues of orange and pink across the sky"
doc = nlp(text)
for token in doc:
    print(token, end=" | ")

The | sun | dipped | below | the | horizon | , | casting | hues | of | orange | and | pink | across | the | sky | 
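To make the tokenization step more concrete, the following is a minimal sketch that approximates it with a regular expression. It is not spaCy's tokenizer, which applies language-specific rules and exceptions; the pattern and the helper name `simple_tokenize` are assumptions made for illustration only.

```python
import re

# Simplified stand-in for a tokenizer: a token is either a run of word
# characters or a single non-space punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word and punctuation tokens (illustrative only)."""
    return TOKEN_PATTERN.findall(text)

text = "The sun dipped below the horizon, casting hues of orange and pink across the sky"
tokens = simple_tokenize(text)
print(" | ".join(tokens))
```

For this sentence the sketch separates the comma from "horizon", as the output above does, but it would diverge from spaCy on contractions, abbreviations, and URLs.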

### GENERATE DATAFRAME FOR TOKEN VISUALIZATION

In [336]:
# Generate a dataframe for visualizing spaCy tokens with options to include or exclude punctuation.

import pandas as pd

def display_nlp(doc, include_punct=False):
    """Generate data frame for visualization of spaCy tokens."""
    rows = []
    for i, t in enumerate(doc):
        if not t.is_punct or include_punct:
            row = {'token': i,  'text': t.text, 'lemma_': t.lemma_, 
                   'is_stop': t.is_stop, 'is_alpha': t.is_alpha,
                   'pos_': t.pos_, 'dep_': t.dep_, 
                   'ent_type_': t.ent_type_, 'ent_iob_': t.ent_iob_}
            rows.append(row)
    
    df = pd.DataFrame(rows).set_index('token')
    df.index.name = None
    return df
display_nlp(doc)

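| | text | lemma_ | is_stop | is_alpha | pos_ | dep_ | ent_type_ | ent_iob_ |
|---|---|---|---|---|---|---|---|---|
| 0 | The | The | False | True | X | dep | | O |
| 1 | sun | sun | False | True | X | ROOT | | O |
| 2 | dipped | dipped | False | True | NOUN | flat:foreign | | O |
| 3 | below | below | False | True | X | flat:foreign | | O |
| 4 | the | the | False | True | X | dep | | O |
| 5 | horizon | horizon | False | True | X | flat:foreign | | O |
| 7 | casting | casting | False | True | NOUN | appos | MISC | B |
| 8 | hues | hue | False | True | ADJ | amod | MISC | I |
| 9 | of | of | False | True | ADP | case | MISC | I |
| 10 | orange | orange | False | True | NOUN | nmod | MISC | I |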
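The `ent_iob_` column marks each token as beginning an entity (`B`), continuing one (`I`), or lying outside any entity (`O`). The sketch below shows one way to turn such per-token tags back into entity spans. The helper `iob_to_spans` is hypothetical, and the rows are copied from the last six rows of the table above, which stops at token 10, so the span it recovers may be shorter than the entity spaCy actually reported.

```python
def iob_to_spans(rows):
    """Group (text, iob, ent_type) rows into (entity_text, label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for text, iob, ent_type in rows:
        if iob == "I" and current_tokens:
            # Continue the entity that is currently open.
            current_tokens.append(text)
            continue
        if current_tokens:
            # A "B" or "O" tag closes the entity that was open.
            spans.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None
        if iob in ("B", "I"):
            # Start a new entity (a stray "I" is treated like "B").
            current_tokens, current_label = [text], ent_type
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

# Rows taken from the table above: (text, ent_iob_, ent_type_).
rows = [
    ("the", "O", ""),
    ("horizon", "O", ""),
    ("casting", "B", "MISC"),
    ("hues", "I", "MISC"),
    ("of", "I", "MISC"),
    ("orange", "I", "MISC"),
]
spans = iob_to_spans(rows)
print(spans)
```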


### FILTER OUT STOP WORDS AND PUNCTUATION

Stop words are high-frequency words such as "and", "the", and "is" that are often filtered out during text analysis so that the remaining tokens carry more of the content. Punctuation marks such as commas, periods, and exclamation points structure written language and are often removed or handled separately in text processing. spaCy exposes both as token attributes (`is_stop` and `is_punct`), and the stop word list is specific to the language of the loaded pipeline.

In [337]:
# Extract non-stop words and non-punctuation tokens from the given text.

text = "As the waves crashed against the shore, seagulls soared overhead."
doc = nlp(text)

non_stop = [t for t in doc if not t.is_stop and not t.is_punct]
print(non_stop)

[the, waves, crashed, against, the, shore, seagulls, soared, overhead]
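The output above still contains "the" and "against", which suggests that the stop word list in use does not match the language of the sentence. The following sketch illustrates that effect under stated assumptions: the two stop lists are tiny, hand-picked sets rather than spaCy's real lists, and `remove_stop_words` is a hypothetical helper.

```python
# Tiny hand-picked stop lists for illustration; real lists are far larger.
STOP_WORDS = {
    "en": {"as", "the", "against", "and", "is"},
    "fr": {"le", "la", "les", "et", "est", "as"},
}

def remove_stop_words(tokens, lang):
    """Drop tokens whose lowercase form is in the stop list for `lang`."""
    stops = STOP_WORDS[lang]
    return [t for t in tokens if t.lower() not in stops]

tokens = ["As", "the", "waves", "crashed", "against", "the",
          "shore", "seagulls", "soared", "overhead"]

english_filtered = remove_stop_words(tokens, "en")
french_filtered = remove_stop_words(tokens, "fr")
print(english_filtered)
print(french_filtered)
```

With the French list only "As" is dropped, which matches the shape of the output above; with the English list the articles and the preposition are removed as well.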


### EXTRACT NOUNS FROM TEXT

In [338]:
# Extract nouns and proper nouns from the given text.

text = "As the waves crashed against the shore, seagulls soared overhead."
doc = nlp(text)

nouns = [t for t in doc if t.pos_ in ['NOUN', 'PROPN']]
print(nouns)

[As, soared]


### IDENTIFY ENTITIES IN TEXT

In [339]:
# Print identified entities along with their labels.

text = "As the waves crashed against the shore, seagulls soared overhead."
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")

(As the waves crashed against the shore, MISC) 

### IDENTIFY ENTITIES IN TEXT

Named entity recognition (NER) locates spans of text that refer to specific things, such as people, organizations, locations, dates, and numerical expressions, and assigns each span a category label. The extracted entities support downstream tasks such as semantic analysis and information retrieval.

In [340]:
# Print identified entities along with their labels.

text = "James O'Neill, chairman of World Cargo Inc, lives in San Francisco." 
doc = nlp(text)

for ent in doc.ents:
    print(f"({ent.text}, {ent.label_})", end=" ")

(James O'Neill, PER) (chairman of World Cargo Inc, MISC) 
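Once entities are available as text and label pairs, a common next step is to group them by label. The sketch below uses the two pairs printed above as literal data; with a real `Doc` the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`. The helper name `group_by_label` is an assumption for illustration.

```python
from collections import defaultdict

def group_by_label(entities):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# Pairs copied from the output above.
entities = [
    ("James O'Neill", "PER"),
    ("chairman of World Cargo Inc", "MISC"),
]
by_label = group_by_label(entities)
print(by_label)
```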

### VISUALIZE ENTITIES IN TEXT

In [341]:
# Render a visualization of the identified entities in the text.

from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

### CONVERT URL TO TEXT AND COUNT ENTITIES

In [342]:
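# Convert the content of a given URL into text and count the identified entities.

from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    """Download a page and return its visible text as a single string."""
    res = requests.get(url)
    res.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    soup = BeautifulSoup(res.text, 'html.parser')
    # Drop elements whose contents are not part of the readable page text.
    for element in soup(["script", "style", "aside"]):
        element.extract()
    # Replace each run of newlines and tabs with a single space.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

page_text = url_to_string('https://fr.wikipedia.org/wiki/Déterminants_et_articles_en_français')
article = nlp(page_text)
len(article.ents)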

143

### VISUALIZE ENTITIES IN TEXT

In [343]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

### COUNT ENTITY LABELS

In [344]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'MISC': 70, 'LOC': 34, 'PER': 28, 'ORG': 11})
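Raw counts are easier to compare across articles when expressed as shares of the total. The sketch below computes percentages from the counts shown above, entered as literal data; the variable names are illustrative.

```python
from collections import Counter

# Label counts copied from the output above.
label_counts = Counter({'MISC': 70, 'LOC': 34, 'PER': 28, 'ORG': 11})

total = sum(label_counts.values())

# Percentage of all entities carrying each label, most frequent first.
shares = {
    label: round(100 * count / total, 1)
    for label, count in label_counts.most_common()
}

for label, share in shares.items():
    print(f"{label}: {share}%")
```

The total of the four counts equals the number of entities reported for the article, so the shares cover every entity.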

### COUNT MOST COMMON ENTITIES

In [345]:
# Count the five most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(5)

[('Amédée', 3),
 ('Adjectif', 3),
 ('Wikipédia', 2),
 ('exclamatifs', 2),
 ('Paris', 2)]

### PRINT SPECIFIC SENTENCE

In [346]:
# Print the 21st sentence from the extracted article text.

sentences = list(article.sents)
print(sentences[20])

Modifier les liens ArticleDiscussion français LireModifierModifier
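Fused strings such as "LireModifierModifier" are what one would expect when the text of adjacent HTML elements is joined without a separator, which is the default behavior of BeautifulSoup's `get_text()`. The sketch below reproduces the effect with the standard library's `HTMLParser` on a made-up fragment; the HTML and the class name `TextCollector` are assumptions, not the actual Wikipedia markup.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)

# Hypothetical navigation markup, for illustration only.
html = "<ul><li>Lire</li><li>Modifier</li><li>Modifier</li></ul>"

parser = TextCollector()
parser.feed(html)

fused = "".join(parser.chunks)    # no separator between elements
spaced = " ".join(parser.chunks)  # one space between elements
print(fused)
print(spaced)
```

Joining the chunks with a space keeps the menu labels as separate words, which would give the tokenizer and the entity recognizer cleaner input.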


### VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [347]:
# Render a visualization of the identified entities in the 21st sentence of the extracted article text.

displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

### EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [348]:
# Extract words along with their parts of speech and lemmas from the 21st sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[20]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('Modifier', 'VERB', 'modifier'),
 ('liens', 'NOUN', 'lien'),
 ('ArticleDiscussion', 'NOUN', 'articlediscussion'),
 ('français', 'ADJ', 'français'),
 ('LireModifierModifier', 'ADJ', 'liremodifiermodifier')]

### VISUALIZE DEPENDENCY PARSING

Dependency parsing is a technique in natural language processing (NLP) that analyzes the grammatical structure of a sentence by determining the relationships between words, represented as directed edges between tokens in a dependency tree. It helps uncover the syntactic dependencies and hierarchical structure within sentences, facilitating tasks like semantic analysis, question answering, and machine translation.

In [349]:
# Render a visualization of the dependency parsing for the 21st sentence of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[20])), style='dep', jupyter=True, options={'distance': 120})
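The tree structure described above can be represented very simply: each token stores the index of its head, and the root points to itself (the convention spaCy uses for `token.head`). The sketch below uses a hand-written parse of a short English sentence; the head indices and Universal Dependencies style labels are assumptions chosen for illustration, not parser output.

```python
# Hand-annotated parse of "The sun dipped below the horizon".
# Each entry is (token, index of head token, dependency label).
parse = [
    ("The", 1, "det"),
    ("sun", 2, "nsubj"),
    ("dipped", 2, "ROOT"),   # the root is its own head
    ("below", 5, "case"),
    ("the", 5, "det"),
    ("horizon", 2, "obl"),
]

def find_root(parse):
    """Return the index of the token that is its own head."""
    for i, (_, head, _) in enumerate(parse):
        if head == i:
            return i
    return None

def children_of(parse, head_index):
    """Return the tokens directly attached to the token at head_index."""
    return [
        token
        for i, (token, head, _) in enumerate(parse)
        if head == head_index and i != head_index
    ]

root = find_root(parse)
print(parse[root][0], "->", children_of(parse, root))
print("horizon ->", children_of(parse, 5))
```

Each directed edge drawn by the visualizer corresponds to one head and child pair in such a structure.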

### CONVERT URL TO TEXT AND COUNT ENTITIES

In [350]:
# Convert the content of a given URL into text and count the identified entities.
# Reuses url_to_string() as defined in the earlier cell.

page_text = url_to_string('https://www.nationalreview.com/corner/trucks-will-still-get-to-new-york-city/')
article = nlp(page_text)
len(article.ents)

2

### VISUALIZE ENTITIES IN TEXT

In [351]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

### COUNT ENTITY LABELS

In [352]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'ORG': 1, 'PER': 1})

### COUNT MOST COMMON ENTITIES

In [353]:
# Count the 25 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(25)

[('Trucks', 1), ('Will Still', 1)]

### PRINT SPECIFIC SENTENCE

In [354]:
# Print the 4th sentence from the extracted article text.

sentences = list(article.sents)
print(sentences[3])

Will Still Get to New York City | National Review     


### VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [355]:
# Render a visualization of the identified entities in the 4th sentence of the extracted article text.

displacy.render(nlp(str(sentences[3])), jupyter=True, style='ent')

### EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [356]:
# Extract words along with their parts of speech and lemmas from the 4th sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[3]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('Will', 'PROPN', 'Will'),
 ('Still', 'PROPN', 'Still'),
 ('Get', 'X', 'Get'),
 ('to', 'X', 'to'),
 ('New', 'X', 'New'),
 ('York', 'X', 'York'),
 ('City', 'X', 'City'),
 ('|', 'PROPN', '|'),
 ('National', 'ADJ', 'national'),
 ('Review', 'PROPN', 'Review'),
 ('    ', 'SPACE', '    ')]

### VISUALIZE DEPENDENCY PARSING

In [357]:
# Render a visualization of the dependency parsing for the 4th sentence of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[3])), style='dep', jupyter=True, options={'distance': 120})

### CONVERT URL TO TEXT AND COUNT ENTITIES

In [358]:
# Convert the content of a given URL into text and count the identified entities.
# Reuses url_to_string() as defined in the earlier cell.

page_text = url_to_string('https://www.cnn.com/2024/03/22/business/trump-truth-social-dwac-shares/index.html')
article = nlp(page_text)
len(article.ents)

321

### VISUALIZE ENTITIES IN TEXT

In [359]:
# Render a visualization of the identified entities in the extracted article text.

displacy.render(article, style='ent', jupyter=True)

### COUNT ENTITY LABELS

In [360]:
# Count the occurrence of each entity label in the extracted article text.

from collections import Counter

labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'MISC': 203, 'ORG': 64, 'PER': 34, 'LOC': 20})

### COUNT MOST COMMON ENTITIES

In [361]:
# Count the 25 most common entities in the extracted article text.

items = [x.text for x in article.ents]
Counter(items).most_common(25)

[('Now playing', 15),
 ('CNN Video', 8),
 ('Trump', 7),
 ('CNN Video Ad Feedback', 6),
 ('Sign in to your CNN', 4),
 ('CNN', 4),
 ('Listen', 3),
 ('Topics You Follow                      Sign Out          Your CNN', 2),
 ('Energy + Environment', 2),
 ('Extreme Weather', 2),
 ('Space + Science                      World                      ', 2),
 ('Americas', 2),
 ('Asia', 2),
 ('Australia', 2),
 ('China', 2),
 ('Europe', 2),
 ('Middle East', 2),
 ('United Kingdom                      Politics                      ', 2),
 ('Congress                        Facts', 2),
 ('After-Hours                        ', 2),
 ('Market Movers                        ', 2),
 ('Fear & Greed', 2),
 ('World Markets                        ', 2),
 ('Investing                        ', 2),
 ('Before the Bell', 2)]
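Several entries above carry long runs of trailing or embedded spaces, because `url_to_string` collapses only newlines and tabs. A possible cleanup, sketched here under the assumption that the padding carries no meaning, is to collapse all whitespace before counting. The sample strings are illustrative, modeled on the entries above, and `normalize_whitespace` is a hypothetical helper.

```python
import re
from collections import Counter

def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()

# Illustrative entity strings: padded and unpadded variants of the same text.
raw_entities = [
    "World Markets                        ",
    "World Markets",
    "Investing                        ",
    "Investing",
]

raw_counts = Counter(raw_entities)
clean_counts = Counter(normalize_whitespace(e) for e in raw_entities)
print(len(raw_counts), "distinct entries before cleanup")
print(clean_counts.most_common())
```

Applying the same normalization to the page text before it is passed to `nlp` would be an alternative, and might also change which spans the model marks as entities.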

### PRINT SPECIFIC SENTENCE

In [362]:
# Print the 26th sentence from the extracted article text.

sentences = list(article.sents)
print(sentences[25])

China                        Europe                        India                        Middle East                        United Kingdom                      Politics                      SCOTUS                        Congress                        Facts


### VISUALIZE ENTITIES IN SPECIFIC SENTENCE

In [363]:
# Render a visualization of the identified entities in the 26th sentence of the extracted article text.

displacy.render(nlp(str(sentences[25])), jupyter=True, style='ent')

### EXTRACT WORDS WITH PARTS OF SPEECH AND LEMMAS

In [364]:
# Extract words along with their parts of speech and lemmas from the 26th sentence of the extracted article text, excluding stop words and punctuation.

[(t.orth_, t.pos_, t.lemma_)
 for t in nlp(str(sentences[25]))
 if not t.is_stop and t.pos_ != 'PUNCT']

[('China', 'PROPN', 'China'),
 ('                       ', 'SPACE', '                       '),
 ('Europe', 'NOUN', 'europe'),
 ('                       ', 'SPACE', '                       '),
 ('India', 'NOUN', 'india'),
 ('                       ', 'SPACE', '                       '),
 ('Middle', 'PROPN', 'Middle'),
 ('East', 'PROPN', 'East'),
 ('                       ', 'SPACE', '                       '),
 ('United', 'X', 'United'),
 ('Kingdom', 'PROPN', 'Kingdom'),
 ('                     ', 'SPACE', '                     '),
 ('Politics', 'PROPN', 'Politics'),
 ('                     ', 'SPACE', '                     '),
 ('SCOTUS', 'PROPN', 'SCOTUS'),
 ('                       ', 'SPACE', '                       '),
 ('Congress', 'X', 'Congress'),
 ('                       ', 'SPACE', '                       '),
 ('Facts', 'X', 'Facts')]

### VISUALIZE DEPENDENCY PARSING

In [365]:
# Render a visualization of the dependency parsing for the 26th sentence of the extracted article text with adjusted distance between words.

displacy.render(nlp(str(sentences[25])), style='dep', jupyter=True, options={'distance': 120})