# Let's get scraping!
We're going to import the libraries we need for scraping the headline of CNR's website. Then, we'll run this on a schedule using GitHub Actions.

In [15]:
import requests
import bs4

Now, let's specify the URL we want to scrape, then use the `requests` library's `get` function to fetch the page along with its response headers.

In [16]:
url = 'https://www.rferl.org/'

In [17]:
rferl_homepage = requests.get(url)

We will want to scrape the page we retrieved. That's what we're using [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for. Let's load the page into Beautiful Soup for scraping.

In [20]:
soup = bs4.BeautifulSoup(rferl_homepage.text, 'html.parser')

Grab the front page headline by looking for the first appearance of the `h4` HTML tag. If we were grabbing the top headline from multiple services we would probably have to do this differently (maybe via RSS?) since page layouts are different from service to service.

In [23]:
headline = soup.find('h4')
print(headline.text)


Polish, Baltic Leaders Head To Kyiv To Bolster Zelenskiy, As Putin Vows To Press Military Campaign
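
The RSS route mentioned above could look something like the sketch below. It is only an illustration: the feed is a made-up sample embedded as a string, and `first_rss_title` is a hypothetical helper, not part of this notebook. It assumes a standard RSS 2.0 layout (`rss > channel > item > title`). A real service's feed would have to be fetched and inspected first.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, standing in for a real service's RSS (assumption)
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Service</title>
    <item><title> First example headline </title></item>
    <item><title>Second example headline</title></item>
  </channel>
</rss>"""


def first_rss_title(feed_xml):
    """Return the title of the first <item> in an RSS 2.0 feed, or None."""
    root = ET.fromstring(feed_xml)  # root is the <rss> element
    title = root.find('./channel/item/title')  # first match in document order
    if title is None or title.text is None:
        return None
    return title.text.strip()


print(first_rss_title(SAMPLE_FEED))
```

Because every feed would share the same element names, one function like this could serve several services, which is the appeal over per-page HTML selectors.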



Let's also add the date and time of retrieval. We'll put the headline and the timestamp into a list so we can easily write it as a row to a CSV. `headline.text` gives us a text-only version of the headline (without HTML tags), and `.strip()` removes leading and trailing whitespace, including newline characters.

**NOTE** We are using `timezone.utc` so that every timestamp is in UTC, because that is what GitHub Actions uses.

In [36]:
from datetime import datetime, timezone

current_time = datetime.now(timezone.utc).strftime("%m/%d/%Y, %H:%M") # no pun intended

row = [headline.text.strip(), current_time]
print(row)

['Polish, Baltic Leaders Head To Kyiv To Bolster Zelenskiy, As Putin Vows To Press Military Campaign', '04/13/2022, 10:39']
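
To see what that format string produces without depending on the current time, here is a small sketch using a fixed UTC datetime. `format_timestamp` and `TIME_FORMAT` are hypothetical names introduced for illustration, and the date is simply the one from the output above. It also shows that the string can be parsed back with `strptime`. The parsed value carries no timezone information, so UTC has to be re-attached.

```python
from datetime import datetime, timezone

TIME_FORMAT = "%m/%d/%Y, %H:%M"  # same format string as in the cell above


def format_timestamp(moment):
    """Format a timezone-aware datetime as a UTC timestamp string."""
    return moment.astimezone(timezone.utc).strftime(TIME_FORMAT)


fixed = datetime(2022, 4, 13, 10, 39, tzinfo=timezone.utc)
stamp = format_timestamp(fixed)
print(stamp)  # 04/13/2022, 10:39

# Parsing gives a naive datetime, so re-attach UTC explicitly
parsed = datetime.strptime(stamp, TIME_FORMAT).replace(tzinfo=timezone.utc)
print(parsed == fixed)  # True
```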


Time to write the result to a CSV. For that we'll need Python's `csv` module, plus `os` to check whether the file already exists.

In [25]:
import os
import csv

And we'll also prepare the headers for the CSV file, and the filename.

In [37]:
HEADERS = ['headline','datetime']
FILENAME = 'cnr_headlines.csv'

Let's check whether the file already exists. If it does, we'll just append a row. If not, we'll create the file from scratch and write the headers first.

In [40]:
if not os.path.isfile(FILENAME):
    with open(FILENAME, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(HEADERS)
        writer.writerow(row)
else:  # else it exists so append without writing the header
    with open(FILENAME, 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(row)
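
The same create-or-append logic can be wrapped in a function so it is easier to reuse and to try out. This is a sketch, not the notebook's own code: `append_row` is a hypothetical helper that opens the file once in append mode and writes the header only when the file is missing or empty. The `encoding='utf-8'` argument is an assumption about how the file should be stored. The demo uses made-up rows and writes to a temporary directory, so it doesn't touch `cnr_headlines.csv`.

```python
import csv
import os
import tempfile


def append_row(filename, headers, row):
    """Append row to a CSV file, writing headers first if the file is new or empty."""
    needs_header = not os.path.isfile(filename) or os.path.getsize(filename) == 0
    with open(filename, 'a', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        if needs_header:
            writer.writerow(headers)
        writer.writerow(row)


# Demo in a temporary directory with made-up rows
with tempfile.TemporaryDirectory() as tmpdir:
    demo_file = os.path.join(tmpdir, 'demo_headlines.csv')
    append_row(demo_file, ['headline', 'datetime'], ['First example', '04/13/2022, 10:39'])
    append_row(demo_file, ['headline', 'datetime'], ['Second example', '04/13/2022, 11:39'])
    with open(demo_file, newline='', encoding='utf-8') as infile:
        rows = list(csv.reader(infile))

print(rows)
```

Calling it twice leaves one header row followed by two data rows, which is the behavior a scheduled run relies on.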

Done! 👏