# Guide on how to scrape news stories

## Roadmap:
  1. Get the sitemap and collect the story URLs
  2. Get the information we want from each story
  
## Start with the sitemap

  The sitemap is a document (or series of documents) that websites use to guide search engines around their website. Sitemaps generally list every link available on the site, and that's exactly what we want.

  We'll use Reuters as an example.
  
  First we go to reuters.com/robots.txt :
  <img src="reuters_robots.png">
  
  There we find the link to the sitemap and go there next:
  <img src="reuters_xml.PNG">
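
The same lookup can be done in code instead of by eye. This is a minimal sketch that assumes the robots.txt file has already been downloaded as text. The helper name `find_sitemaps` and the sample file contents are made up for illustration and are not the real Reuters file.

```python
def find_sitemaps(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt body."""
    sitemaps = []
    for line in robots_txt.splitlines():
        # split on the first colon only, since the URL itself contains colons
        key, _, value = line.partition(':')
        if key.strip().lower() == 'sitemap':
            sitemaps.append(value.strip())
    return sitemaps

# made-up robots.txt body for illustration (not the real Reuters file)
sample_robots = """User-agent: *
Disallow: /search
Sitemap: https://www.example.com/sitemap_index.xml
"""

found_sitemaps = find_sitemaps(sample_robots)
print(found_sitemaps)
```

A site can list more than one `Sitemap:` line, which is why the helper returns a list.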
  
  From here we can see that Reuters has a sitemap for every day (they have a lot of stories). NYT does monthly (example in main script) and others just count up. Let's import some packages we'll use and then build a function to collect all the links from this site. 

In [91]:
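import gzip # for reading compressed (.xml.gz) sitemaps
import urllib.error # for catching failed downloads (HTTPError)
import urllib.request # what we use to download html into python
from bs4 import BeautifulSoup # for parsing the html once we download
from tqdm import tqdm # useful little package for creating progress bars
from datetime import datetime, timedelta, date # for managing date formats
from time import sleep # for pausing between requests
import re # for regular expressions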

Below you'll find a function to use once you get to an xml file that looks like this: 

  <img src="reuters_links.PNG">

It'll pull every link on the site and return a list of them. 

In [92]:
def read_sitemap(site_url, compressed=False):
    """
    Pulls all the links from a .xml url

    Keyword arguments:
    site_url -- the address of the .xml file
    compressed -- if the address is a .xml.gz
    """
    # call the website
    response = urllib.request.urlopen(site_url)
    # open with gzip if it's compressed (like the NYT is)
    if compressed:
        gunzip_response = gzip.GzipFile(fileobj=response)
        content = gunzip_response.read()
        c_d = content.decode('utf-8')
    else:
        c_d = response.read()
    # parse in BeautifulSoup
    re_soup = BeautifulSoup(c_d, 'lxml')
    # pull out all the urls
    urls = [loc.string for loc in re_soup.find_all('loc')]
    # sleep for a second so as to not accidentally DDoS
    # sleep(1)
    return urls
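
To check the `<loc>` extraction without downloading anything, here is a small offline sketch. It swaps in the standard library's `xml.etree.ElementTree` in place of BeautifulSoup and runs on a made-up sitemap string. The name `extract_locs` and the example.com URLs are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# namespace that sitemap files declare on their root element
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + 'loc') if loc.text]

# made-up sitemap for illustration
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/story-one</loc></url>
  <url><loc>https://www.example.com/story-two</loc></url>
</urlset>"""

sample_urls = extract_locs(sample_sitemap)
print(sample_urls)
```

Unlike the BeautifulSoup version, this one needs the namespace spelled out, which is one reason the more forgiving parser is convenient for scraping.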

Now we want to make a system for creating the sitemap urls that we can feed into read_sitemap. This next function is **Reuters**-specific; you'll have to write your own for other sites. 

In [93]:
def reuters_gen(start, end):
    """
    Creates all the Reuters sitemap links within a set date range

    Keyword arguments:
    start -- first day to collect (datetime object)
    end -- last day to collect (datetime object)
    """
    # get the date ranges
    times = [start]
    # since reuters does daily sitemaps, this one steps one day at a time
    while start <= end:
        start += timedelta(days=1)
        times.append(start)
    alpha_omega = {}
    for ii in range(0, len(times)-1):
        alpha_omega[times[ii]] = times[ii+1]
    # create the links
    hold_links = [] # list to hold the urls from the loop
    for k, v in alpha_omega.items():
        # changes out datetime object to a string in the same format as the Reuters url
        date1 = k.strftime("%Y%m%d")
        date2 = v.strftime("%Y%m%d")
        # Add the sitemap url to our list
        hold_links.append(f'https://www.reuters.com/sitemap_{date1}-{date2}.xml')
    return hold_links
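
To see what the generated links look like for a short range, here is a compact sketch of the same day-by-day stepping. `daily_sitemap_urls` and its `template` parameter are hypothetical names introduced for illustration. The URL pattern is the one used in `reuters_gen`.

```python
from datetime import date, timedelta

def daily_sitemap_urls(start, end, template):
    """Build one sitemap URL per day from start to end (inclusive)."""
    urls = []
    day = start
    while day <= end:
        next_day = day + timedelta(days=1)
        # each daily sitemap is named after the day and the day after it
        urls.append(template.format(date1=day.strftime('%Y%m%d'),
                                    date2=next_day.strftime('%Y%m%d')))
        day = next_day
    return urls

reuters_template = 'https://www.reuters.com/sitemap_{date1}-{date2}.xml'
example_urls = daily_sitemap_urls(date(2018, 12, 1), date(2018, 12, 2), reuters_template)
for example_url in example_urls:
    print(example_url)
```

Passing the pattern in as a template means a site with a different naming scheme only needs a different string, as long as it also steps one day at a time.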

Great. We have a function to read xml files and a system for creating the xml urls. Now we combine the two in a function that collects the links and saves them to a .txt file. 

In [94]:
def reuters_collect():
    """
    main function for collecting reuters links
    """
    # sitemaps start in 2006
    reu_sitemaps = reuters_gen(datetime(2006, 9, 22), datetime(2018, 12, 1))
    # collect the data
    for sm in tqdm(reu_sitemaps):
        try:
            urls = read_sitemap(sm)
            # write to a file
            with open('links/reuters.txt', 'a') as f:
                for url in urls:
                    f.write(f'{url}\n')
        except urllib.error.HTTPError:
            pass
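
Because the file is opened in append mode, running the collection twice writes every link twice. One way to clean that up when reading the links back is sketched below. `unique_links` is a hypothetical helper, and the sample lines stand in for the contents of the links file.

```python
def unique_links(lines):
    """Strip whitespace, drop blanks and repeats, and keep first-seen order."""
    cleaned = (line.strip() for line in lines)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(link for link in cleaned if link))

# made-up lines standing in for a links file that was written twice
sample_lines = [
    'https://www.example.com/story-one\n',
    'https://www.example.com/story-two\n',
    'https://www.example.com/story-one\n',
    '\n',
]

deduped = unique_links(sample_lines)
print(deduped)
```

An open file object can be passed in directly, since iterating over it yields one line at a time.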

## Story features

Ok, now we completely change course. Up above we were just collecting links. Now we're going to go through how to process a *single* story. 

Here's what we want from each story:
  1. Title
  2. Authors
  3. Publishing date
  4. Publication section (ex: World)
  5. Text
  6. Images
  7. Image captions
  8. Location - but we'll come back to this later

Let's use https://www.reuters.com/article/us-nigeria-election/nigerian-opposition-candidate-absent-from-election-accord-ceremony-idUSKBN1OA2H7 

We'll start by pulling the html from that site to use in our example. 

In [95]:
url = 'https://www.reuters.com/article/us-nigeria-election/nigerian-opposition-candidate-absent-from-election-accord-ceremony-idUSKBN1OA2H7'
reuters_html = urllib.request.urlopen(url)

Now take the raw html and pull it into BeautifulSoup to parse:

In [96]:
soup = BeautifulSoup(reuters_html, 'lxml') # name the parser explicitly

From here we want to pull the element tags for the things we want. Let's start with the things that are generally listed at the top of the article. 

I open the page, right click on the title, and select "Inspect Element":

  <img src="reuters_header.PNG">

Here we see that the title, authors, date, and section are all in a tag called `<div class="ArticleHeader_container">`. To make sure that I don't accidentally pull author names or dates from other stories mentioned or linked on the page, I'll start by looking for those elements inside the header. 

In [97]:
header = soup.find('div', {'class':'ArticleHeader_container'})

### Title

Inside the header I note that the title is marked with the tag `<h1 class="ArticleHeader_headline">`:

  <img src="reuters_headline.PNG">
  
To pull that from our soupified html I do the following:

In [98]:
header.find('h1', {'class':'ArticleHeader_headline'}).text

'Nigerian opposition candidate absent from election accord ceremony'

### Authors

And now the authors (make sure your code can handle multiple authors). First we'll select the byline with all the authors listed, then pull each individual separately. Both authors are listed under the `<div class="BylineBar_byline">` tag. 

  <img src="reuters_byline.PNG">

Inside the byline the authors are identified with the `a` tag. We'll use a list comprehension to pull all individuals from the byline. 

  <img src="reuters_author.PNG">


In [99]:
# Get the whole byline
authors = header.find('div', {'class':'BylineBar_byline'})
# Pull out the individual authors
[author.text for author in authors.find_all('a')]

['Felix Onuah', 'Camillus Eboh']

### Date
Also in the header under the `<div class="ArticleHeader_date">` tag:

  <img src="reuters_date.PNG">

In [100]:
header.find('div', {'class':'ArticleHeader_date'}).text

'December 11, 2018 /  9:37 PM / Updated 5 hours ago'
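
The raw string mixes the publication date, the time, and an "Updated" note. Below is a minimal sketch for turning the first two parts into a `datetime`, assuming every story uses the same `date / time / note` layout as the example above. `parse_reuters_date` is a hypothetical helper. The page does not state a time zone, so the result is a naive datetime, and `%B` and `%p` assume an English locale.

```python
from datetime import datetime

def parse_reuters_date(raw):
    """Parse the 'Month day, year / H:MM AM|PM / ...' header string."""
    # split on the slashes and trim the stray spaces around each part
    parts = [part.strip() for part in raw.split('/')]
    # the first part is the date and the second is the time; the rest is ignored
    return datetime.strptime(f'{parts[0]} {parts[1]}', '%B %d, %Y %I:%M %p')

parsed_date = parse_reuters_date('December 11, 2018 /  9:37 PM / Updated 5 hours ago')
print(parsed_date.isoformat())
```

If a story uses a different layout, `strptime` raises a `ValueError`, which is a useful signal that the assumption no longer holds.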

### Section
The section is under `<div class="ArticleHeader_channel">`:

  <img src="reuters_section.PNG">

In [101]:
header.find('div', {'class':'ArticleHeader_channel'}).text

'World News'

### Text 

Time to get the text. We find a container `<div class="StandardArticleBody_body">` that holds all our text (note that we're no longer inside the header). We'll select that and then pull every paragraph from it into a list where each item is a paragraph:

  <img src="reuters_text.PNG">

In [102]:
body = soup.find('div', {'class':'StandardArticleBody_body'})
text = [paragraph.text for paragraph in body.find_all('p')]
text

['ABUJA (Reuters) - Nigeria’s main opposition candidate did not attend an event on Tuesday to sign an election agreement stating a commitment to hold a peaceful election early next year due to a “communication lapse”, his party said. ',
 'The opposition People’s Democratic Party (PDP) confirmed in an emailed statement on Tuesday that its candidate, Atiku Abubakar, had not participated in the signing ahead of February’s election. President Muhammadu Buhari attended the event in the capital, Abuja.  ',
 'The peace accord ceremony was held days after the PDP said authorities had frozen the bank accounts of its vice presidential candidate, Peter Obi. ',
 'Elections to choose the leader of Africa’s most populous country - the continent’s top oil producer and by many measures its largest economy - have in the past been marred by violence, vote-rigging and voter intimidation. ',
 'The ceremony was an attempt to mirror the signing of an acclaimed deal ahead of voting in 2015, when Buhari came 

You'll notice that this method leaves us with an additional attribution paragraph at the bottom. We don't want that but we also can't just cut the bottom paragraph because that extra bit isn't always included. We can get around that using the HTML or the text itself. The HTML method is preferred but I'll demonstrate both. 

Looking at the HTML, I notice that the text of the story just has plain `<p>` element tags with no other attributes:

  <img src="reuters_plain.PNG">

While the ending paragraph has a `<div class="Attribution_container">` wrapper:

  <img src="reuters_attribution.PNG">

We can use that to identify the text we want. 

In the list comprehension above we pulled the text from the BeautifulSoup search (`paragraph.text`). Now we'll keep the BeautifulSoup element tags themselves. 

In [103]:
tags = [paragraph for paragraph in body.find_all('p')]

# see the difference:
print(text[-1])
print(type(text[-1]))

print(tags[-1])
print(type(tags[-1]))

Additional reporting by Abraham Achirga; writing by Paul Carsten; editing by Alexis Akwagyiram
<class 'str'>
<p class="Attribution_content">Additional reporting by Abraham Achirga; writing by Paul Carsten; editing by Alexis Akwagyiram</p>
<class 'bs4.element.Tag'>


If every tag had a class we could filter them using `['class']`, but since only the things we don't want have a class we'll do the following:

In [104]:
text_better = [paragraph.text.strip() for paragraph in body.find_all('p') if not paragraph.has_attr('class')]
text_better

['ABUJA (Reuters) - Nigeria’s main opposition candidate did not attend an event on Tuesday to sign an election agreement stating a commitment to hold a peaceful election early next year due to a “communication lapse”, his party said.',
 'The opposition People’s Democratic Party (PDP) confirmed in an emailed statement on Tuesday that its candidate, Atiku Abubakar, had not participated in the signing ahead of February’s election. President Muhammadu Buhari attended the event in the capital, Abuja.',
 'The peace accord ceremony was held days after the PDP said authorities had frozen the bank accounts of its vice presidential candidate, Peter Obi.',
 'Elections to choose the leader of Africa’s most populous country - the continent’s top oil producer and by many measures its largest economy - have in the past been marred by violence, vote-rigging and voter intimidation.',
 'The ceremony was an attempt to mirror the signing of an acclaimed deal ahead of voting in 2015, when Buhari came to po
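
The text-based alternative mentioned earlier can be sketched as follows. It drops any paragraph that opens with a phrase like "Reporting by" or "Additional reporting by". The pattern is a guess based on the single attribution line seen in this story, and `drop_attribution` and the shortened sample paragraphs are made up for illustration.

```python
import re

# assumed attribution openers, based on the one example in this story
ATTRIBUTION_RE = re.compile(
    r'^(?:additional\s+)?(?:reporting|writing|editing)\s+by\b',
    re.IGNORECASE,
)

def drop_attribution(paragraphs):
    """Return the paragraphs that do not look like an attribution line."""
    return [p for p in paragraphs if not ATTRIBUTION_RE.match(p.strip())]

# shortened stand-ins for the real paragraphs
sample_paragraphs = [
    'ABUJA (Reuters) - First paragraph of the story.',
    'Second paragraph of the story.',
    'Additional reporting by Abraham Achirga; writing by Paul Carsten; editing by Alexis Akwagyiram',
]

cleaned = drop_attribution(sample_paragraphs)
print(cleaned)
```

This is more fragile than the HTML method: a story paragraph that happens to start with "Writing by hand..." would be dropped too, and an attribution worded differently would slip through.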

### Images 

A page could have a lot of extraneous images so we need to make sure to only look inside our article. With Reuters, this is possible with the same `body` tag we used for the text above but this will not always be the case, especially for sites that feature photos above the story. 

We'll start with the image urls, then get the caption. 

We see that the image and caption are available under `<div class="Image_container">`:

  <img src="reuters_imagecontainer.PNG">
  
Inside the image containers we find `<div class="LazyImage_image LazyImage_cover LazyImage_fallback">`. Watch out for very small versions of images, which are often used as thumbnails.

  <img src="reuters_image.PNG">
    
And the caption can be found under `<div class="Image_caption">`:

  <img src="reuters_caption.PNG">

There's only one in this story, but we need to build the code to capture multiple image urls if they exist. 

In [105]:
# image containers first
image_containers = body.find_all('div', {'class':'Image_container'})
# then I get the images. Note that I use a regular expression to match only
# the first part of the class name. Not necessary, but useful when class
# names are mostly consistent with slight variations
images = [image.find('div', {'class':re.compile('LazyImage_image')}) for 
          image in image_containers]

Looks like the actual url we want is buried inside some parentheses (this usually won't be the case). Either way, we'll use a regular expression to get it out. 

In [106]:
[re.search(r'\((.*?)\)', image['style']).group(1) for image in images]

['//s4.reutersmedia.net/resources/r/?m=02&d=20181211&t=2&i=1334162460&r=LYNXMPEEBA1LR&w=20']
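
The url that comes back starts with `//`, meaning it has no scheme of its own and borrows the one used by the page it sits on. If you want to download the image later you will probably need a full url. Here is a minimal sketch using `urllib.parse.urljoin` from the standard library. `absolute_image_url` is a hypothetical helper name, and using `https://www.reuters.com/` as the page url is an assumption.

```python
from urllib.parse import urljoin

def absolute_image_url(page_url, image_url):
    """Resolve a relative or scheme-less image url against the page url."""
    return urljoin(page_url, image_url)

scraped = '//s4.reutersmedia.net/resources/r/?m=02&d=20181211&t=2&i=1334162460&r=LYNXMPEEBA1LR&w=20'
full_url = absolute_image_url('https://www.reuters.com/', scraped)
print(full_url)
```

Urls that are already absolute pass through `urljoin` unchanged, so the helper is safe to apply to every image url.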

In [107]:
# captions
[image.find('div', {'class':'Image_caption'}).text for image in image_containers]

["FILE PHOTO: Atiku Abubakar, a former vice president, attends the national convention of Nigeria's opposition People's Democratic Party (PDP), in the southern city of Port Harcourt in the Niger Delta, Nigeria October 6, 2018. REUTERS/Tife Owolabi/File Photo"]

## Put it all together

In [108]:
def reuters_story(html):
    """
    Function to pull the information we want from Reuters stories
    """
    # create a dictionary to hold everything in
    hold_dict = {}
    # first turn the html into BeautifulSoup
    soup = BeautifulSoup(html, 'lxml')
    # pull the data I want
    # all the initial data I want is in the header so I restrict my search
    # so as to not accidentally pull in other data
    header = soup.find('div', {'class':'ArticleHeader_container'})
    # title
    hold_dict['title'] = header.find('h1', {'class':'ArticleHeader_headline'}).text
    # authors
    authors = header.find('div', {'class':'BylineBar_byline'})
    hold_dict['authors'] = [author.text for author in authors.find_all('a')]
    # date
    hold_dict['date'] = header.find('div', {'class':'ArticleHeader_date'}).text
    # section
    hold_dict['section'] = header.find('div', {'class':'ArticleHeader_channel'}).text
    # text
    body = soup.find('div', {'class':'StandardArticleBody_body'})
    hold_dict['text'] = [paragraph.text.strip() for paragraph in body.find_all('p')
                         if not paragraph.has_attr('class')]
    # images (in order)
    image_containers = body.find_all('div', {'class':'Image_container'})
    images = [image.find('div', {'class':re.compile('LazyImage_image')}) for 
              image in image_containers]
    hold_dict['image_urls'] = [re.search(r'\((.*?)\)', image['style']).group(1) for
                               image in images]
    # captions (in order)
    hold_dict['image_captions'] = [image.find('div', {'class':'Image_caption'}).text
                                   for image in image_containers]
    # return
    return hold_dict

Let's try it out now:

In [109]:
url = 'https://www.reuters.com/article/uk-nigeria-election/nigerian-opposition-candidate-absent-from-election-accord-ceremony-idUSKBN1OA2HU'
html = urllib.request.urlopen(url).read()
reuters = reuters_story(html)

Let's see how it worked:

In [110]:
reuters['authors']

['Felix Onuah', 'Camillus Eboh']

In [111]:
reuters['title']

'Nigerian opposition candidate absent from election accord ceremony'

In [112]:
reuters['date']

'December 11, 2018 /  9:37 PM / Updated 5 hours ago'

In [113]:
reuters['section']

'World News'

In [114]:
reuters['text']

['ABUJA (Reuters) - Nigeria’s main opposition candidate did not attend an event on Tuesday to sign an election agreement stating a commitment to hold a peaceful election early next year due to a “communication lapse”, his party said.',
 'The opposition People’s Democratic Party (PDP) confirmed in an emailed statement on Tuesday that its candidate, Atiku Abubakar, had not participated in the signing ahead of February’s election. President Muhammadu Buhari attended the event in the capital, Abuja.',
 'The peace accord ceremony was held days after the PDP said authorities had frozen the bank accounts of its vice presidential candidate, Peter Obi.',
 'Elections to choose the leader of Africa’s most populous country - the continent’s top oil producer and by many measures its largest economy - have in the past been marred by violence, vote-rigging and voter intimidation.',
 'The ceremony was an attempt to mirror the signing of an acclaimed deal ahead of voting in 2015, when Buhari came to po

In [115]:
reuters['image_urls']

['//s4.reutersmedia.net/resources/r/?m=02&d=20181211&t=2&i=1334162460&r=LYNXMPEEBA1LR&w=20']

In [116]:
reuters['image_captions']

["FILE PHOTO: Atiku Abubakar, a former vice president, attends the national convention of Nigeria's opposition People's Democratic Party (PDP), in the southern city of Port Harcourt in the Niger Delta, Nigeria October 6, 2018. REUTERS/Tife Owolabi/File Photo"]