***Web Scraping and NLP: Extracting and Analyzing News Articles***

In [None]:
#1.1 Copy the code examples to scrape the webpage in BeautifulSoup

import bs4 as bs
import urllib.request

url = 'https://www.usatoday.com/story/sponsor-story/everroot/2024/01/29/supplements-arent-just-for-humans-they-can-help-dogs-too/72326994007/?mvt=i&mvn=213ed4a2e7f944f0b8768d713940d4c7&mvp=NA-USAT-11238597&mvl=Key-tangent-tile%20AdUnit-%2F7103%2Fusatoday%2Fnative-front_tile%2F*%20%5BTangent%20Desktop%20Tile%5D&utm_campaign=d1f2d4e86d804ad28f5f2ac6f01d1537&utm_source=polar&utm_medium=cpc'

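def text_extracted(url):
    """Return the text of all <p> elements at url, joined by spaces."""
    try:
        # Download the raw HTML; the with-block closes the connection afterwards
        with urllib.request.urlopen(url) as response:
            textdata = response.read()
        # Parse the HTML and keep only the paragraph text
        parsed_textdata = bs.BeautifulSoup(textdata, 'html.parser')
        paragraphs = parsed_textdata.find_all('p')
        formatted_text = " ".join(para.text for para in paragraphs)
        return formatted_text
    except Exception as e:
        # The error message comes back as ordinary text, so check the result before analyzing it
        return f"An error occurred: {e}"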

text = text_extracted(url)
print(text)

Every dog deserves to feel their best, but like humans, they all have unique needs. Regular vet visits, proper diet and exercise can all help, but sometimes our canine pals need a little extra support above and beyond their normal diet. That’s where EverRoot dog supplements, powered by Purina come in: Developed by Purina’s experts in animal nutrition, EverRoot dog supplements come in multiple forms, covering a variety of key health benefit areas. This helps pet parents further personalize their dog’s supplement plan to meet their unique needs and preferences. With a focus on natural and organic ingredients and a commitment to traceable sourcing, EverRoot stands out for making sure that their dog supplements have everything your dog needs, without the things they don’t. To help dogs get the personalized nutrition they need, the brand teamed up with top athlete and health enthusiast Laila Ali to launch EverRoot’s new line of soft chew dog supplements. As a fitness expert, Ali believes in

Interpretation: The code above extracts the text of a sponsored USA Today story about how supplements can help dogs much as they help humans. It uses the BeautifulSoup library to parse the HTML and joins the text of every paragraph (<p>) tag into a single string. If an error occurs, the function returns the error message as a string instead of the article text.
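To make the "collect text within paragraph tags" step concrete, here is a minimal sketch that runs without network access. It swaps BeautifulSoup for the standard library's html.parser so that it is self-contained; the class name ParagraphText, the helper paragraphs_to_text, and the sample HTML are all assumptions made for illustration, not part of the notebook.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements (stdlib stand-in for find_all('p'))."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # greater than 0 while inside a <p> element
        self.paragraphs = []  # one string per paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.depth += 1
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Text inside nested inline tags such as <b> is still inside the <p>
        if self.depth:
            self.paragraphs[-1] += data


def paragraphs_to_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs)


# Hypothetical page fragment: only the two <p> elements should survive.
sample_html = (
    "<html><body><h1>Headline</h1>"
    "<p>First <b>paragraph</b>.</p>"
    "<div>Navigation</div>"
    "<p>Second paragraph.</p></body></html>"
)
print(paragraphs_to_text(sample_html))
```

The headline and the navigation block are dropped because they sit outside any paragraph tag, which is also why the scraped article text contains the story body but not the page chrome.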

In [None]:
#1.2.1 Count all the named entities in the document

import spacy
from pprint import pprint
from collections import Counter

# Load the SpaCy model
nlp = spacy.load("en_core_web_sm")

article_text = text

# Processing the text with SpaCy
doc = nlp(article_text)

# Extract and print named entities and their labels
entities = [(entity.text, entity.label_) for entity in doc.ents]
pprint(entities)

# Count every named entity by their label
entity_counts = Counter([entity.label_ for entity in doc.ents])
print("\nCount of named entity types:")
pprint(entity_counts)

[('Purina', 'GPE'),
 ('Purina', 'ORG'),
 ('EverRoot', 'PRODUCT'),
 ('Laila Ali', 'PERSON'),
 ('EverRoot', 'PRODUCT'),
 ('Ali', 'PERSON'),
 ('Malibu', 'GPE'),
 ('Buddy', 'ORG'),
 ('Ali', 'PERSON'),
 ('Muhammad Ali', 'PERSON'),
 ('Ali', 'PERSON'),
 ('four', 'CARDINAL'),
 ('Laila Ali Lifestyle', 'PERSON'),
 ('Soft Chews', 'PERSON'),
 ('Ali', 'PERSON'),
 ('Buddy the EverRoot', 'PRODUCT'),
 ('EverRoot', 'PRODUCT'),
 ('Malibu', 'GPE'),
 ('Ali', 'PERSON'),
 ('EverRoot', 'PRODUCT'),
 ('more than a century', 'DATE'),
 ('Purina', 'GPE'),
 ('Soft Chews', 'PERSON'),
 ('EverRoot', 'PRODUCT'),
 ('Sri Lankan', 'PERSON'),
 ('Alaskan', 'NORP'),
 ('Marine Stewardship Council', 'ORG'),
 ('one', 'CARDINAL'),
 ('EverRoot', 'PRODUCT'),
 ('New Zealand’s', 'GPE'),
 ('today', 'DATE'),
 ('EverRoot', 'PRODUCT'),
 ('Annie Valuska', 'PERSON'),
 ('Purina', 'GPE'),
 ('EverRoot', 'PRODUCT'),
 ('meaty treats', 'PERSON'),
 ('Valuska', 'ORG'),
 ('EverRoot', 'PRODUCT'),
 ('Soft Chews', 'PERSON'),
 ('EverRoot', 'PRODUCT')

Interpretation: The code above uses spaCy to find the named entities in the article (people, organizations, products, and so on), prints each entity with its label, and counts the entities by label. It picks up brands ('Purina', 'EverRoot'), individuals ('Laila Ali', 'Annie Valuska'), and places ('New Zealand'), which shows that the text discusses specific products, notable people, and geographical areas. The small model also mislabels several spans: 'Purina' is tagged as both GPE and ORG, 'Soft Chews' and 'meaty treats' are tagged as PERSON, and the dogs' names 'Malibu' and 'Buddy' are tagged as GPE and ORG. The counts are therefore approximate.
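A quick way to check how consistent the labels are is to group them by entity text. The sketch below uses only the first eight (text, label) pairs from the printed list, copied by hand, so it runs without spaCy; the variable names are assumptions for illustration.

```python
from collections import Counter

# First eight (text, label) pairs copied from the printed entity list above.
sample_entities = [
    ("Purina", "GPE"),
    ("Purina", "ORG"),
    ("EverRoot", "PRODUCT"),
    ("Laila Ali", "PERSON"),
    ("EverRoot", "PRODUCT"),
    ("Ali", "PERSON"),
    ("Malibu", "GPE"),
    ("Buddy", "ORG"),
]

# Count by label, as the cell does for the whole document.
label_counts = Counter(label for _, label in sample_entities)

# Collect the set of labels each entity text received, to spot disagreements.
labels_by_text = {}
for surface, label in sample_entities:
    labels_by_text.setdefault(surface, set()).add(label)
inconsistent = {s: sorted(ls) for s, ls in labels_by_text.items() if len(ls) > 1}

print(label_counts)
print(inconsistent)
```

In the notebook itself, the same grouping could be applied to the full `entities` list rather than this hand-copied sample.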

In [None]:
#1.2.2 Count the most frequent tokens for the entire document

# Count the most frequent tokens in the document
tokens = [token.text.lower() for token in doc if not token.is_stop and not token.is_punct and token.is_alpha]
most_common_tokens = Counter(tokens).most_common(6)

print("\nMost frequent tokens in the document:")
pprint(most_common_tokens)


Most frequent tokens in the document:
[('everroot', 19),
 ('dog', 15),
 ('dogs', 10),
 ('soft', 10),
 ('supplements', 9),
 ('health', 8)]


Interpretation: This code counts the most frequent meaningful words in the document, excluding stop words (very common words that carry little meaning), punctuation, and non-alphabetic tokens. The top six are "everroot," "dog," "dogs," "soft," "supplements," and "health," which shows that the document focuses on dog health and supplements, and on the EverRoot brand in particular. Because the tokens are lowercased but not lemmatized, "dog" and "dogs" are counted separately.
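The effect of counting word forms separately can be seen by merging the printed counts under a shared base form. This is only a sketch: the lemma_of mapping is hand-written and hypothetical (in the notebook, token.lemma_ would supply the base form), and it uses just the six counts shown above, so the true lemma totals for the whole document may be higher.

```python
from collections import Counter

# Counts copied from the printed output above.
token_counts = Counter(
    {"everroot": 19, "dog": 15, "dogs": 10, "soft": 10, "supplements": 9, "health": 8}
)

# Hypothetical hand-written lemma map, standing in for token.lemma_.
lemma_of = {"dogs": "dog", "supplements": "supplement"}

# Add each word's count to the total for its base form.
lemma_counts = Counter()
for word, count in token_counts.items():
    lemma_counts[lemma_of.get(word, word)] += count

print(lemma_counts.most_common(3))
```

Merged this way, "dog" (15 + 10 = 25) overtakes "everroot" (19) as the most frequent term among the six.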

In [None]:
#1.2.3 Pick a random integer K using Python random module, then pick three consecutive sentences starting with Kth, and print these sentences. Note that you must make sure all picked sentences are in the document.

import spacy
import random

nlp = spacy.load("en_core_web_sm")
doc = nlp(article_text)  # Ensure 'article_text' contains your text

# Converting document into a list of sentences
sentences = list(doc.sents)

# Pick a random integer K
k = random.randint(0, len(sentences) - 3)

# Select three consecutive sentences starting from the Kth sentence
selected_sentences = sentences[k:k+3]

# Print the selected sentences
for sentence in selected_sentences:
    print(sentence.text)

To help dogs get the personalized nutrition they need, the brand teamed up with top athlete and health enthusiast Laila Ali to launch EverRoot’s new line of soft chew dog supplements.
As a fitness expert, Ali believes in extending the same care she takes for her own health to her beloved pups, Malibu and Buddy.
“I want them to be around for the long haul, so it feels good to know I’m giving my dogs the best chance to look and feel their best with tasty soft chews they love,” said Ali, who is the daughter of the late great professional boxer and humanitarian Muhammad Ali.


Interpretation: This code picks a random start index K and prints three consecutive sentences beginning at sentence K. Drawing K from 0 to len(sentences) - 3 guarantees that all three sentences fall inside the document. In this run, the sample covers Laila Ali's collaboration with EverRoot on soft chew dog supplements and her reasons for it.
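The bounds logic can be isolated in a small helper that also handles documents with fewer than three sentences. This is a minimal sketch: pick_consecutive is a hypothetical helper name, and the list of strings stands in for list(doc.sents).

```python
import random


def pick_consecutive(items, n=3, rng=random):
    """Return (k, items[k:k+n]) for a random valid start index k."""
    if len(items) < n:
        raise ValueError(f"need at least {n} items, got {len(items)}")
    k = rng.randint(0, len(items) - n)  # randint is inclusive on both ends
    return k, items[k:k + n]


# Hypothetical stand-in for list(doc.sents).
sample_sentences = ["S0.", "S1.", "S2.", "S3.", "S4."]
k, picked = pick_consecutive(sample_sentences, rng=random.Random(0))
print(k, picked)
```

Passing a seeded random.Random instance makes the choice reproducible, which helps when the later cells need to refer to the same K.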

In [None]:
#1.2.4 Extract part-of-speech and lemmatize these consecutive sentences

# Extracting part-of-speech and lemmatize for each token in the selected sentences
for sentence in selected_sentences:
    print("\nProcessing sentence:", sentence.text)
    for token in sentence:
        pos_tag = token.pos_
        lemma = token.lemma_
        print(f"Token: {token.text}, POS: {pos_tag}, Lemma: {lemma}")


Processing sentence: To help dogs get the personalized nutrition they need, the brand teamed up with top athlete and health enthusiast Laila Ali to launch EverRoot’s new line of soft chew dog supplements.
Token: To, POS: PART, Lemma: to
Token: help, POS: VERB, Lemma: help
Token: dogs, POS: NOUN, Lemma: dog
Token: get, POS: VERB, Lemma: get
Token: the, POS: DET, Lemma: the
Token: personalized, POS: ADJ, Lemma: personalized
Token: nutrition, POS: NOUN, Lemma: nutrition
Token: they, POS: PRON, Lemma: they
Token: need, POS: VERB, Lemma: need
Token: ,, POS: PUNCT, Lemma: ,
Token: the, POS: DET, Lemma: the
Token: brand, POS: NOUN, Lemma: brand
Token: teamed, POS: VERB, Lemma: team
Token: up, POS: ADP, Lemma: up
Token: with, POS: ADP, Lemma: with
Token: top, POS: ADJ, Lemma: top
Token: athlete, POS: NOUN, Lemma: athlete
Token: and, POS: CCONJ, Lemma: and
Token: health, POS: NOUN, Lemma: health
Token: enthusiast, POS: NOUN, Lemma: enthusiast
Token: Laila, POS: PROPN, Lemma: Laila
Token: Ali, 

Interpretation: This code uses the SpaCy library to analyze sentences, identifying each word's part of speech and its base form (lemma). It processes three consecutive sentences from a text, highlighting the grammatical role and the root form of each word, which aids in understanding the sentence structure and content at a deeper linguistic level.

In [None]:
#1.2.5 Get and print the entity annotation for each token of the Kth sentence

kth_sentence = selected_sentences[0]

print("Entity annotation for each token in the Kth sentence:")
for token in kth_sentence:
    entity = token.ent_type_ if token.ent_type_ else 'No Entity'
    print(f"Token: {token.text}, Entity: {entity}")

Entity annotation for each token in the Kth sentence:
Token: To, Entity: No Entity
Token: help, Entity: No Entity
Token: dogs, Entity: No Entity
Token: get, Entity: No Entity
Token: the, Entity: No Entity
Token: personalized, Entity: No Entity
Token: nutrition, Entity: No Entity
Token: they, Entity: No Entity
Token: need, Entity: No Entity
Token: ,, Entity: No Entity
Token: the, Entity: No Entity
Token: brand, Entity: No Entity
Token: teamed, Entity: No Entity
Token: up, Entity: No Entity
Token: with, Entity: No Entity
Token: top, Entity: No Entity
Token: athlete, Entity: No Entity
Token: and, Entity: No Entity
Token: health, Entity: No Entity
Token: enthusiast, Entity: No Entity
Token: Laila, Entity: PERSON
Token: Ali, Entity: PERSON
Token: to, Entity: No Entity
Token: launch, Entity: No Entity
Token: EverRoot, Entity: PRODUCT
Token: ’s, Entity: No Entity
Token: new, Entity: No Entity
Token: line, Entity: No Entity
Token: of, Entity: No Entity
Token: soft, Entity: No Entity
Token: che

Interpretation: This code prints the entity type (such as PERSON, ORG, or PRODUCT) of each token in the Kth sentence, using spaCy's named entity recognition. The Kth sentence is the first of the three selected sentences. Tokens that do not belong to any entity have an empty entity type and are printed as "No Entity". For instance, "Laila" and "Ali" are each tagged PERSON, while "EverRoot" is tagged PRODUCT.
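Token-level annotations can be folded back into multi-token entities by merging runs of adjacent tokens with the same type. The sketch below does this with itertools.groupby on a few pairs copied from the output above, with "No Entity" written as an empty string, which is what token.ent_type_ holds for such tokens. One caveat under this approach: two different entities of the same type that sit side by side would be merged into one; spaCy's token.ent_iob_ attribute marks where each entity begins and could be used to keep them apart.

```python
from itertools import groupby

# (token, entity type) pairs copied from the printed output above; "" means no entity.
annotated = [
    ("enthusiast", ""),
    ("Laila", "PERSON"),
    ("Ali", "PERSON"),
    ("to", ""),
    ("launch", ""),
    ("EverRoot", "PRODUCT"),
    ("’s", ""),
]

# Merge runs of adjacent tokens that share the same non-empty entity type.
spans = [
    (" ".join(tok for tok, _ in run), label)
    for label, run in groupby(annotated, key=lambda pair: pair[1])
    if label
]
print(spans)
```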

In [None]:
#1.2.6 Visualize the entities and dependencies of Kth sentence

from spacy import displacy

# Visualizing the entities in the Kth sentence
displacy.render(kth_sentence, style='ent', jupyter=True)

# Visualizing the dependency parse of the Kth sentence
displacy.render(kth_sentence, style='dep', jupyter=True, options={'distance': 100})

Interpretation: The code uses SpaCy's displacy module to visually display named entities and the grammatical structure of a sentence. It highlights entities like "Laila Ali" as a PERSON and "EverRoot" as a PRODUCT and shows the grammatical connections between words in the sentence, illustrating their roles and relationships.

In [None]:
#1.2.7 Visualize all the entities in the document

from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

Interpretation: The code uses spaCy and its visualization tool displacy to display the named entities present in a processed text document within a Jupyter notebook, showing entities like people, organizations, and products for quick analysis.

In [None]:
#2 De-Identification:
#2.1 De-identify all person names (PERSON) in the webpage document with [REDACTED] and visualize them.

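redacted_text = article_text
# Walk the tokens backwards and splice by character offset, so earlier offsets
# stay valid and only the tagged token itself is replaced (str.replace would
# also change every other occurrence of the same substring).
for token in reversed(list(doc)):
    if token.ent_type_ == 'PERSON':
        start, end = token.idx, token.idx + len(token.text)
        redacted_text = redacted_text[:start] + '[REDACTED]' + redacted_text[end:]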

print(redacted_text)

Every dog deserves to feel their best, but like humans, they all have unique needs. Regular vet visits, proper diet and exercise can all help, but sometimes our canine pals need a little extra support above and beyond their normal diet. That’s where EverRoot dog supplements, powered by Purina come in: Developed by Purina’s experts in animal nutrition, EverRoot dog supplements come in multiple forms, covering a variety of key health benefit areas. This helps pet parents further personalize their dog’s supplement plan to meet their unique needs and preferences. With a focus on natural and organic ingredients and a commitment to traceable sourcing, EverRoot stands out for making sure that their dog supplements have everything your dog needs, without the things they don’t. To help dogs get the personalized nutrition they need, the brand teamed up with top athlete and health enthusiast [REDACTED] [REDACTED] to launch EverRoot’s new line of soft chew dog supplements. As a fitness expert, [RE

In [None]:
displacy.render(nlp(redacted_text), style='ent', jupyter=True)

Interpretation: This section replaces every token that spaCy tags as PERSON with [REDACTED], then runs displacy on the redacted text to visualize the result. Because the redaction follows the model's predictions, tokens mislabeled as PERSON (such as 'Soft Chews' in the entity list above) are redacted too, and any name the model fails to tag would remain visible.
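Redacting by character offsets rather than by substring matters when a name also appears inside other words. Here is a minimal sketch of the difference, assuming the entity positions are already known: redact_spans is a hypothetical helper, the sentence is made up for illustration and is not from the article, and the hard-coded span stands in for the values spaCy exposes as ent.start_char and ent.end_char.

```python
def redact_spans(text, spans, mask="[REDACTED]"):
    """Replace each (start, end) character span in text with mask."""
    # Work from the last span to the first so earlier offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


# Made-up sentence for illustration; it is not from the article.
sample = "Ali flew to Alicante."
# Hypothetical span covering the one PERSON entity, "Ali".
person_spans = [(0, 3)]

by_substring = sample.replace("Ali", "[REDACTED]")  # also hits "Alicante"
by_offset = redact_spans(sample, person_spans)      # touches only the span

print(by_substring)
print(by_offset)
```

Substring replacement damages the place name, while the offset-based version changes only the characters that were tagged.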