
In [2]:
import requests
from bs4 import BeautifulSoup

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


* Url 0: https://icanhazdadjoke.com/j/R7UfaahVfFd
* Url 1: https://www.nature.com/articles/d41586-023-00103-3

In [4]:
print('Input the URL:')
url = input().strip()  # input() already returns a str; strip stray whitespace

Input the URL:
https://www.nature.com/articles/d41586-023-00103-3


In [6]:
try:
    # Request JSON explicitly via the Accept header.
    response = requests.get(url,
                            headers={'Accept': 'application/json',
                                     'Accept-Language': 'en-US,en;q=0.5'})
    if response.status_code == 200:
        data = response.json()
        if 'joke' in data:
            print(data['joke'])
        else:
            print('Invalid resource!')
    else:
        print('Invalid resource!')
# Older requests versions raise a plain ValueError when the body is not JSON.
except (requests.exceptions.RequestException, ValueError):
    print('Invalid resource!')

My dog used to chase people on a bike a lot. It got so bad I had to take his bike away.


In [7]:
if 'nature.com/articles/' in url:
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            title = soup.find('title').get_text() if soup.find('title') else 'N/A'
            description_meta = soup.find('meta', {'name': 'description'})
            description = description_meta['content'] if description_meta and 'content' in description_meta.attrs else 'N/A'
            print({"title": title, "description": description})
        else:
            print('Invalid page!')
    except requests.exceptions.RequestException:
        print('Invalid page!')
else:
    print('Invalid page!')

Input the URL:
 https://www.nature.com/articles/d41586-023-00103-3
{'title': 'Night skies are brightening — and dimming the outlook for astronomy', 'description': 'Fewer stars are visible worldwide now than a decade ago, according to measurements by community scientists.'}


In [7]:
from http import HTTPStatus

path = '/content/drive/MyDrive/Colab Data/'
filename = 'source.html'
path_filename = path + filename

try:
    response = requests.get(url)
    if response.status_code == HTTPStatus.OK:
        with open(path_filename, 'wb') as f:
            f.write(response.content)
        print(f'Content saved at {path_filename}.')
    else:
        print(f'The URL returned {response.status_code}!')
except requests.exceptions.RequestException:
    print('Invalid URL or network error.')

Content saved at /content/drive/MyDrive/Colab Data/source.html.


# Task
Create a Python program that downloads the content of the URL "https://www.nature.com/nature/articles?sort=PubDate&year=2020&page=3", parses it to find links to "News" articles, downloads each of those news articles, extracts the article body from the `<p>` tag with the attribute `{"class": "article__teaser"}`, formats the article title by replacing spaces with underscores and removing punctuation, and saves the extracted article body to a text file named after the formatted title.

## Fetch and parse the main page

### Subtask:
Download the content of the provided nature.com URL and parse it using BeautifulSoup.


**Reasoning**:
The first step is to define the URL and download its content using requests, then parse the content with BeautifulSoup if the request is successful.



In [9]:
url = "https://www.nature.com/nature/articles?sort=PubDate&year=2020&page=3"
soup = None  # reset so a failed request cannot leave an earlier cell's soup in use
try:
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')
    else:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")

## Find news article links

### Subtask:
Iterate through the parsed content to find all `<article>` tags. Within each article tag, identify the article type from the `<span>` tag with the `data-test="article.type"` attribute. If the type is "News", extract the link to the article content from the `<a>` tag with the `data-track-action="view article"` attribute. Store these links.


**Reasoning**:
Loop over the `<article>` tags, keep those whose type label is "News", and build an absolute URL from each matching link.



In [10]:
article_tags = soup.find_all('article')
news_article_urls = []

for article in article_tags:
    article_type_tag = article.find('span', {'data-test': 'article.type'})
    if article_type_tag and article_type_tag.get_text(strip=True) == 'News':
        article_link_tag = article.find('a', {'data-track-action': 'view article'})
        if article_link_tag and 'href' in article_link_tag.attrs:
            news_article_urls.append("https://www.nature.com" + article_link_tag['href'])

print(f"Found {len(news_article_urls)} news article URLs.")
print(news_article_urls)

Found 3 news article URLs.
['https://www.nature.com/articles/d41586-020-03621-6', 'https://www.nature.com/articles/d41586-020-03593-7', 'https://www.nature.com/articles/d41586-020-03561-1']
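
The cell above builds absolute URLs by prepending the site root to each `href`. A minimal sketch of an alternative, assuming an `href` may be either site-relative or already absolute, resolves it with `urllib.parse.urljoin` from the standard library. The names `BASE_URL` and `to_absolute` are illustrative and not part of the notebook.

```python
from urllib.parse import urljoin

# Assumption: the listing page URL serves as the base for resolving links.
BASE_URL = "https://www.nature.com/nature/articles?sort=PubDate&year=2020&page=3"


def to_absolute(href, base=BASE_URL):
    """Resolve an href against the base URL; absolute hrefs pass through unchanged."""
    return urljoin(base, href)


print(to_absolute("/articles/d41586-020-03621-6"))
print(to_absolute("https://www.nature.com/articles/d41586-020-03593-7"))
```

Plain concatenation works for the site-relative links seen here, but it would produce a malformed URL if a link were already absolute; `urljoin` handles both cases.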


## Fetch and parse each news article page

### Subtask:
For each news article link found in the previous step, download the content of that article's page and parse it using BeautifulSoup.


**Reasoning**:
Iterate through the list of news article URLs, download the content of each page using requests, and parse it with BeautifulSoup, storing the parsed content along with the original URL.



In [11]:
parsed_articles = []

for url in news_article_urls:
    try:
        response = requests.get(url)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, 'html.parser')
            parsed_articles.append({"url": url, "soup": soup})
        else:
            print(f"Failed to retrieve the page {url}. Status code: {response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"An error occurred while processing {url}: {e}")

print(f"Successfully parsed {len(parsed_articles)} news articles.")

Successfully parsed 3 news articles.


## Extract article body

### Subtask:
From the parsed article page, find the `<p>` tag with the attribute `{"class": "article__teaser"}` and extract its text content.


**Reasoning**:
Iterate through the parsed articles, find the article body tag, extract its text, and add it to the article dictionary.



In [13]:
for article in parsed_articles:
    soup = article['soup']
    body_tag = soup.find('p', {"class": "article__teaser"})
    if body_tag:
        article['body'] = body_tag.get_text(strip=True)
    else:
        article['body'] = 'Body not found'

for article in parsed_articles:
    print(f"URL: {article['url']}")
    print(f"Body: {article['body']}")
    print("-" * 20)

URL: https://www.nature.com/articles/d41586-020-03621-6
Body: US president-elect Joe Biden has nominated Michael Regan, North Carolina’s top environmental regulator, to lead the country’s Environmental Protection Agency (EPA) — and scientists and environmentalists are optimistic.
--------------------
URL: https://www.nature.com/articles/d41586-020-03593-7
Body: A week after granting an emergency-use authorization for the country’s first COVID-19 vaccine, US regulators have followed with a second: another RNA vaccine, this one made by Moderna of Cambridge, Massachusetts.
--------------------
URL: https://www.nature.com/articles/d41586-020-03561-1
Body: Lightning is striking the Arctic many times more often than it did a decade ago, a study suggests — and the rate could soon double. The findings demonstrate yet another way Earth’s climate could be changing as the planet warms, although not all researchers agree that the trend is real.
--------------------


## Format article title and save content

### Subtask:
Get the article title, format it by replacing spaces with underscores and removing punctuation. Save the extracted article body text to a file named after the formatted title with a `.txt` extension.


**Reasoning**:
Iterate through the parsed articles, extract and format the title, and save the body to a file.



In [15]:
import re
import os

path = '/content/drive/MyDrive/Colab Data/'

# Create the directory if it doesn't exist
os.makedirs(path, exist_ok=True)

for article in parsed_articles:
    soup = article['soup']
    title_tag = soup.find('title')
    title = title_tag.get_text() if title_tag else 'Untitled'
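    # Drop everything except word characters, whitespace, and hyphens,
    # then replace each space with an underscore.
    formatted_title = re.sub(r'[^\w\s-]', '', title).replace(' ', '_')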
    filename = f"{formatted_title}.txt"
    filepath = os.path.join(path, filename)  # Use os.path.join for creating the full path
    body = article['body']

    try:
        with open(filepath, 'wb') as f:
            f.write(body.encode('utf-8'))
        print(f"Content saved to {filepath}")
    except IOError as e:
        print(f"Error saving content to {filepath}: {e}")

Content saved to /content/drive/MyDrive/Colab Data/Bidens_pick_to_head_US_environment_agency_heartens_scientists.txt
Content saved to /content/drive/MyDrive/Colab Data/Moderna_COVID_vaccine_becomes_second_to_get_US_authorization.txt
Content saved to /content/drive/MyDrive/Colab Data/Is_lightning_striking_the_Arctic_more_than_ever_before.txt
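
The regular expression in the cell above keeps hyphens and only converts literal spaces. As a hedged alternative sketch, the same formatting can be done with `string.punctuation` and `str.translate`; the helper name `format_title` is hypothetical, and this variant handles ASCII punctuation only.

```python
import string


def format_title(title):
    """Strip ASCII punctuation, then join the remaining words with underscores."""
    # string.punctuation covers ASCII marks only; typographic characters such as
    # a curly apostrophe would need to be added to the table explicitly.
    no_punct = title.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments collapses runs of whitespace, including newlines.
    return "_".join(no_punct.split())


print(format_title("Is lightning striking the Arctic more than ever before?"))
```

For the third article title this yields `Is_lightning_striking_the_Arctic_more_than_ever_before`, the same stem as the file saved above.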


## Summary:

### Data Analysis Key Findings

*   The initial web page at "https://www.nature.com/nature/articles?sort=PubDate&year=2020&page=3" was successfully downloaded and parsed.
*   Three news article URLs were identified from the main page.
*   The content of all three identified news articles was successfully downloaded and parsed.
*   For each news article, the body text was successfully extracted from the `<p>` tag with the class "article__teaser".
*   The title of each article was extracted and formatted by replacing spaces with underscores and removing punctuation.
*   The extracted article body of each news article was saved to a text file named after the formatted article title.

### Insights or Next Steps

*   Articles without an `article__teaser` paragraph currently get the placeholder text "Body not found" written to their file; consider logging and skipping them instead.
*   The process could be extended to extract more information from each article page, such as author, publication date, or other sections of the article body.
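
One way to approach the logging suggestion above is sketched below. It assumes each record is a dict with `url` and `body` keys, as in the notebook, and that a missing body is marked with the placeholder string from the extraction cell. The logger name, the `articles_with_body` helper, and the sample records are illustrative only.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("nature_scraper")  # hypothetical logger name

# Assumption: mirrors the placeholder string used in the extraction cell.
MISSING_BODY = "Body not found"


def articles_with_body(articles):
    """Return only the articles whose body was extracted, logging the rest."""
    kept = []
    for article in articles:
        body = article.get("body")
        if not body or body == MISSING_BODY:
            logger.warning("No teaser paragraph found for %s; skipping.",
                           article.get("url"))
            continue
        kept.append(article)
    return kept


# Illustrative records only; the second URL is a made-up placeholder.
sample = [
    {"url": "https://www.nature.com/articles/d41586-020-03621-6", "body": "Some text."},
    {"url": "https://example.com/no-teaser", "body": MISSING_BODY},
]
print(len(articles_with_body(sample)))
```

Filtering before the save loop would keep placeholder text out of the output files while leaving a warning for each skipped URL.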
