## ABS-CBN Latest News Scraper
Scrapes news.abs-cbn.com for the latest news.

In [1]:
import requests
from bs4 import BeautifulSoup
import time

### Get initial list of news
This portion retrieves the list of **More Stories** articles from the [ABS-CBN news page](https://news.abs-cbn.com/news) and saves the link to each article for navigation later.
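
The page-by-page collection described above can be sketched without touching the network. This is only an illustration: `FAKE_PAGES`, `fetch_fake`, `collect_links`, and the `max_pages` cap are hypothetical names, and the HTML parsing step is replaced by an in-memory lookup.

```python
# Hypothetical in-memory stand-in for the site:
# url -> (article links on that page, next-page url or None)
FAKE_PAGES = {
    "/news": (["/news/a", "/news/b"], "/news?page=2"),
    "/news?page=2": (["/news/c"], None),
}


def fetch_fake(url):
    """Pretend to download and parse one listing page (assumption: no real HTTP)."""
    return FAKE_PAGES[url]


def collect_links(start_url, fetch, max_pages=50):
    """Follow 'next' links from start_url and gather every article link in order."""
    links = []
    url = start_url
    pages_seen = 0
    # Stop when there is no next page, or when the safety cap is reached
    while url is not None and pages_seen < max_pages:
        page_links, url = fetch(url)
        links.extend(page_links)
        pages_seen += 1
    return links


print(collect_links("/news", fetch_fake))
```

Passing the fetch step in as a function keeps the pagination logic testable on its own; a real run would supply a function that downloads and parses each page.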

In [6]:
# Initialize the starting link
URL = "https://news.abs-cbn.com/news"
all_links = []  # list to store article links
next_page = [URL]  # non-empty so the loop runs at least once
# Get the links of all articles in 'More Stories', from the first page to the last
while len(next_page) > 0:
    page=requests.get(URL)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    #get more stories div
    latest_news_div=soup.find(class_='section-more-stories')
    #get each article
    latest_news_list = latest_news_div.find_all('article')
    #save article links
    news_links_list = list(map(lambda x: x.find('a')['href'], latest_news_list))
    all_links = all_links + news_links_list
     
    print(f'Finished {URL}')
    next_page = soup.select('a[title="Next"]')
    if next_page:
        URL = 'https://news.abs-cbn.com' + next_page[0]['href']

    time.sleep(11)  # robots.txt asks for a 10-second delay; 11 adds a small margin

print('Done!')

Finished https://news.abs-cbn.com/news
Finished https://news.abs-cbn.com/news?page=2
Finished https://news.abs-cbn.com/news?page=3
Finished https://news.abs-cbn.com/news?page=4
Finished https://news.abs-cbn.com/news?page=5
Finished https://news.abs-cbn.com/news?page=6
Finished https://news.abs-cbn.com/news?page=7
Finished https://news.abs-cbn.com/news?page=8
Done!
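
The delay between requests comes from the site's robots.txt. As a hedged sketch, the standard library's `urllib.robotparser` can read a crawl delay instead of hard-coding it. `SAMPLE_ROBOTS` is a made-up file for illustration, not the site's actual robots.txt, and `polite_delay` with its `margin` and `default` parameters is a hypothetical helper.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for this example
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Crawl-delay: 10",
]


def polite_delay(robots_lines, user_agent="*", margin=1, default=10):
    """Return the number of seconds to sleep between requests.

    Uses the Crawl-delay for user_agent when present, otherwise `default`,
    and adds `margin` seconds on top.
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse lines directly, no network access
    delay = parser.crawl_delay(user_agent)
    if delay is None:
        delay = default
    return delay + margin


print(polite_delay(SAMPLE_ROBOTS))
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` would fetch the real file in place of the `parse` call.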


In [7]:
#view all article links
all_links

['/news/03/19/21/highest-ever-ph-reports-record-high-7103-new-covid-19-cases',
 '/news/03/19/21/manila-halts-holy-week-activities-as-covid-19-cases-surge',
 '/news/03/19/21/journey-for-history-pinoy-scientist-says-emden-deep-trip-a-homage-to-philippine-heritage',
 '/news/03/19/21/octa-group-suggests-hard-gcq-metro-manila-covid-19-surge',
 '/news/03/19/21/half-of-metro-manila-covid-19-quarantine-facilities-in-use-official',
 '/news/03/19/21/duque-says-eyeing-borrowing-vaccine-supplies-from-areas-with-low-covid-19-infection',
 '/news/03/19/21/fda-says-3-coronavirus-vaccine-makers-yet-to-file-emergency-use-application',
 '/video/news/03/19/21/just-a-joke-roque-says-of-duterte-quip-that-his-cough-might-be-cancer',
 '/news/03/19/21/sws-65-pct-of-filipinos-say-dangerous-to-publish-critical-news-vs-duterte-administration',
 '/news/03/19/21/sws-65-pct-of-filipinos-say-dangerous-to-publish-critical-news-vs-duterte-administration',
 '/news/03/19/21/palace-hits-query-on-slow-vaccine-procurement-d

### Get individual news article details
Visits each article link collected earlier and extracts the article's title, author, date, and content.
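
The collected links are relative paths that start with `/`, so they have to be joined to the site root before they can be requested. A minimal sketch using `urllib.parse.urljoin` follows. `absolute_links` is a hypothetical helper, and dropping repeated links is an optional extra step that the notebook itself does not perform.

```python
from urllib.parse import urljoin

BASE_URL = "https://news.abs-cbn.com"


def absolute_links(relative_links, base=BASE_URL):
    """Join each relative link to the base and drop repeats, keeping first-seen order."""
    unique = dict.fromkeys(relative_links)  # dicts preserve insertion order
    return [urljoin(base, link) for link in unique]


# Sample links taken from the listing output; the second one appears twice there
sample = [
    "/news/03/19/21/manila-halts-holy-week-activities-as-covid-19-cases-surge",
    "/news/03/19/21/sws-65-pct-of-filipinos-say-dangerous-to-publish-critical-news-vs-duterte-administration",
    "/news/03/19/21/sws-65-pct-of-filipinos-say-dangerous-to-publish-critical-news-vs-duterte-administration",
]

for url in absolute_links(sample):
    print(url)
```

`urljoin` avoids the doubled slash that plain string concatenation produces when the base ends with `/` and the path also starts with `/`.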

In [8]:
# list to store news data
news_list_json = []

multimedia = False
for link in all_links:
    # build the full article URL (links already start with '/')
    news_link = 'https://news.abs-cbn.com' + link
    
    #get news article page
    page = requests.get(news_link)
    soup = BeautifulSoup(page.content, 'html.parser')
    
    #get details
    title=soup.find(class_='news-title').contents[0].strip()
    
    #author div
    author_block = soup.find(class_='author-block') 
    author = author_block.find(class_='editor').text.strip()
    date = author_block.find(class_='date-posted').text.strip()
    
    #full article contents
    article_content = soup.find(class_='article-content')
    
    # for media articles
    if article_content is None:
        article_content = soup.find(class_='media-caption')
    
    # for articles with a different DOM structure
    if article_content is None:
        article_content = soup.find(class_='block-content')
        multimedia = True

    if multimedia:  # articles with a different DOM structure
        article = article_content.find_all('p')  # get each paragraph <p> of the article
        full_article = article[-1].text.strip()  # keep only the last paragraph
    else:
        article_paragraphs = article_content.find_all('p')  # get each paragraph <p> of the article
        clean_paragraphs = [p.text.strip() for p in article_paragraphs]  # get the text of each <p>
        
        # remove a trailing 'related videos:' line if present
        if clean_paragraphs and clean_paragraphs[-1].lower() in ('related videos:', 'related video:'):
            clean_paragraphs = clean_paragraphs[:-1]
        full_article = ' '.join(clean_paragraphs)  # join paragraphs into one text
    
    #save to array
    news_list_json.append({
        'title': title,
        'author': author,
        'date': date,
        'content': full_article
    })
    
    multimedia = False
    print(f'Saved article {title}')
    time.sleep(11)  # robots.txt asks for a 10-second delay; 11 adds a small margin

print('Done!')

Saved article HIGHEST EVER: PH reports record-high 7,103 new COVID-19 cases
Saved article Manila halts Holy Week activities as COVID-19 cases surge
Saved article Journey for history: Pinoy scientist says Emden Deep trip a homage to Philippine heritage
Saved article ‘Hard GCQ’ sought as OCTA says new COVID-19 surge ‘more serious’ than 2020 crisis
Saved article Half of Metro Manila COVID-19 quarantine facilities in use— official
Saved article Duque says eyeing 'borrowing' vaccine supplies from areas with low COVID-19 infection
Saved article FDA says 3 coronavirus vaccine makers yet to file emergency use application
Saved article Just a joke, Roque says of Duterte quip that his cough 'might be cancer'
Saved article SWS: 65 pct of Filipinos say it's dangerous to publish critical news vs Duterte administration
Saved article SWS: 65 pct of Filipinos say it's dangerous to publish critical news vs Duterte administration
Saved article 'Nasaan ka bakuna?': Palace hits query on slow vaccine procu

Saved article Group blasts arrest of 'red-tagged' Butuan vice principal, demands release
Saved article 'Wala sa critical zone': Number of COVID-19 cases in Cebu 'stable' - DOH
Saved article Trade chief Lopez tests positive for COVID-19 again
Saved article Kaso ng COVID-19 sa Jose Reyes Memorial Medical Center, nasa 26 na
Saved article PH likely to sign COVID-19 jab supply deal with Johnson & Johnson soon
Saved article Ospital ng Sampaloc nears full capacity for COVID-19 patients
Saved article DOH: Majority of P.3 variant cases came from Region 7
Saved article PH seeks 'steady supply' of COVID-19 vaccines for 'better Christmas'
Saved article Bacoor tallies 494 COVID-19 cases in half a month: mayor
Saved article Lawmakers seek probe into police profiling of activists' lawyers
Saved article Hot, humid weather to persist until weekend: PAGASA
Saved article Curfew violators sa Baclaran, pinaawit ng ‘Lupang Hinirang’
Saved article 2 arestado sa pagbebenta ng marijuana sa Maynila
Saved articl

In [9]:
#check contents
news_list_json

[{'title': 'HIGHEST EVER: PH reports record-high 7,103 new COVID-19 cases',
  'author': 'Job Manahan, ABS-CBN News',
  'date': 'Mar 19 2021 04:02 PM',
  'content': 'MANILA - The Philippines on Friday reported 7,103 new COVID-19 cases, the highest recorded daily tally in the country since the pandemic began over a year ago, with the number of active infections also at its highest in nearly 7 months. This raises the country’s total number of infections to 648,066. The day’s figure is considered the highest ever since the coronavirus outbreak hit the country, surpassing the previous record of 6,861 cases announced by the Department of Health (DOH) on Aug. 10 last year. Friday’s tally still does not include results from 5 accredited laboratories after failing to submit data on time. Active cases reached 73,264, with 97.3 percent of those still battling the disease experiencing mild symptoms or are asymptomatic. The ABS-CBN Investigative and Research Group also noted that the number of rema

In [10]:
#export
import json

with open('news_data.json', 'w', encoding='utf-8') as outfile:
    json.dump(news_list_json, outfile) 
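
As a quick check on the export, the file can be read back and compared with the in-memory list. The sketch below is an illustration only: `export_and_reload` is a hypothetical helper, the sample record is abbreviated from the output above, and `ensure_ascii=False` and `indent=2` are optional `json.dump` settings that keep characters such as ‘ ’ readable in the file instead of writing them as `\u` escapes.

```python
import json
import os
import tempfile


def export_and_reload(records, path):
    """Write records to path as UTF-8 JSON, then load them back for comparison."""
    with open(path, "w", encoding="utf-8") as outfile:
        json.dump(records, outfile, ensure_ascii=False, indent=2)
    with open(path, "r", encoding="utf-8") as infile:
        return json.load(infile)


# Abbreviated sample record for illustration
sample_records = [
    {
        "title": "‘Hard GCQ’ sought as OCTA says new COVID-19 surge ‘more serious’ than 2020 crisis",
        "author": "ABS-CBN News",
        "date": "Mar 19 2021",
        "content": "sample content",
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "news_data.json")
    reloaded = export_and_reload(sample_records, out_path)

print(reloaded == sample_records)
```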