In [3]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.gutenberg.org/files/1342/1342-h/1342-h.htm'
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    print("Successfully fetched and parsed the web page.")
else:
    print(f"Failed to fetch the web page. Status code: {response.status_code}")
    soup = None # Assign None to soup in case of failure

Successfully fetched and parsed the web page.


In [4]:
if soup is not None:
    text_content = soup.get_text()
    print("Text content extracted successfully.")
else:
    text_content = None
    print("Soup object is None. Text extraction not possible.")

Text content extracted successfully.


In [5]:
import spacy

nlp = spacy.load('en_core_web_sm')
print("spaCy language model loaded successfully.")

spaCy language model loaded successfully.


In [6]:
if text_content is not None:
    doc = nlp(text_content)
    print("Text processed successfully with spaCy.")
else:
    print("Cannot process text as text_content is None.")

Text processed successfully with spaCy.


In [7]:
if doc is not None:
    print("\n--- POS Tagging Results ---")
    for token in doc:
        print(f"{token.text}: {token.pos_}")

    print("\n--- Named Entity Recognition Results ---")
    for ent in doc.ents:
        print(f"{ent.text}: {ent.label_}")
else:
    print("Cannot display results because the text was not processed.")

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
45: CARDINAL
five: CARDINAL
two: CARDINAL
half-past six: DATE
Elizabeth: PERSON
Bingley: PERSON
Jane: PERSON
three: CARDINAL
four: CARDINAL
Jane: PERSON
Elizabeth: PERSON
Jane: PERSON
Bingley: PERSON
Darcy: PERSON
Elizabeth: PERSON
Jane: PERSON
Miss Bingley: PERSON
Hurst: PERSON
this morning: TIME
Louisa: PERSON
six inches: QUANTITY
Louisa: PERSON
Bingley: PERSON
Elizabeth Bennet: PERSON
this morning: TIME
Darcy: PERSON
Miss Bingley: PERSON
three miles: QUANTITY
four miles: QUANTITY
five miles: QUANTITY
Bingley: PERSON
Darcy: PERSON
Miss Bingley: PERSON
a half: CARDINAL
Hurst: PERSON
Jane: PERSON
Cheapside: PRODUCT
Bingley: ORG
Darcy: PERSON
Bingley: PERSON
friendâs vulgar relations: ORG
Elizabeth: PERSON
Hurst: PERSON
âDo: GPE
Eliza Bennet: ORG
Miss Bingley: PERSON
Elizabeth: PERSON
Bingley: PERSON
Elizabeth: PERSON
Elizabeth: PERSON
Miss Bingley: PERSON
Darcy!â: PERSON
such days: DATE
Charles: PERSON
âI: DATE
Pe

## Summary:

### Data Analysis Key Findings

*   The necessary libraries (`beautifulsoup4`, `requests`, and `spacy`) were already installed.
*   The `en_core_web_sm` spaCy language model was successfully downloaded and installed.
*   The content of the specified web page was successfully fetched and parsed using `BeautifulSoup`, indicated by a status code of 200.
*   The text content was successfully extracted from the parsed web page.
*   The `en_core_web_sm` spaCy language model was successfully loaded.
*   The extracted text was successfully processed by the loaded spaCy model, creating a `Doc` object.
*   Part-of-Speech (POS) tags and Named Entity Recognition (NER) results were successfully extracted and displayed from the processed `Doc` object.

### Insights or Next Steps

*   The current process extracts and displays all POS tags and entities, which can be extensive. Consider filtering the output to show only specific POS tags or entity types for a more focused analysis.
*   The extracted text can be noisy due to HTML artifacts not fully removed by `get_text()`. Further text cleaning steps could be implemented before spaCy processing to improve accuracy.
