# Web Scraping of **News Articles** from ABS-CBN News

The code uses the article list at https://news.abs-cbn.com/news?page=1.

It iterates through the list pages and scrapes each one to find the articles published within the target date range (Mar 11-12).

### Import requests library

In [3]:
import requests
list_page = 1
URL="https://news.abs-cbn.com/news?page=" + str(list_page)

## Load the list of news on ABS-CBN News to analyze the website's structure

In [4]:
page=requests.get(URL)
print(page.content)



In [5]:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, 'html.parser')

Identifying the last page of the list of articles

In [6]:
# The "Last page" link ends in "page=<n>"; take the number after the final "=".
last = soup.find("a", {"title": "Last page"})['href'].split('=')
last_page = int(last[-1])
last_page

8
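
Splitting the `href` on `=` works as long as `page` is the last query parameter. As a hedged alternative, the sketch below reads the `page` parameter by name with the standard library. The helper name `page_number_from_href` and the sample `href` are assumptions made for illustration; the real link text on the site may differ.

```python
from urllib.parse import urlparse, parse_qs

def page_number_from_href(href: str) -> int:
    """Return the integer value of the `page` query parameter in href."""
    query = parse_qs(urlparse(href).query)
    return int(query["page"][0])

# Hypothetical href, shaped like the "Last page" link read in the cell above.
sample_href = "/news?page=8"
print(page_number_from_href(sample_href))
```

Reading the parameter by name keeps working if the site later appends other query parameters after `page`.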

### Print the details of the first (most recent) article

In [None]:
more_stories = soup.find_all("article", {"class": "clearfix"})
more_stories[0]

In [94]:
more_stories = soup.find_all("article", {"class": "clearfix"})
title = more_stories[0].div.p.text.strip()
date = more_stories[0].div.findChildren('span', {"class": "datetime"})[0].text.strip()
author = more_stories[0].div.findChildren('span', {"class": "author"})[0].text.strip()
article_link='https://news.abs-cbn.com/' + more_stories[0].div.findChildren('a')[0]['href']

# Fetch the article page and parse it.
article_page=requests.get(article_link)
soup1 = BeautifulSoup(article_page.content, 'html.parser')

# Collect the article's paragraphs and join them into a single string.
article_text = soup1.find("div", {"class": "article-content"}).findChildren('p')
full_text = " ".join([elem.text.strip()+"\n\n" for elem in article_text])

print(f'Title: {title}\nDate: {date}\nAuthor: {author}\nLink: {article_link}\nArticle:\n {full_text}')

Title: Philippines to receive around 2.3 million COVID-19 vaccines by April: Galvez
Date: Mar 15 10:54 PM
Author: ABS-CBN News
Link: https://news.abs-cbn.com//news/03/15/21/philippines-to-receive-around-23-million-covid-19-vaccines-by-april-galvez
Article:
 MANILA (UPDATE) - The Philippines is set to receive around 2.3 million additional COVID-19 vaccine doses this month or early April, the country's vaccine czar said Monday, amid a spike in new cases.

 National COVID-19 task force chief implementer Sec. Carlito Galvez Jr. expects 1.4 million more vaccine doses from China-based Sinovac, and almost a million doses from UK-based AstraZeneca.

 "Within this month or early April, 979,200 AstraZeneca [shots will arrive], with a total of 2,379,200 doses by March or early April," he said.

 President Rodrigo Duterte meanwhile said the Philippines does not yet have enough stocks of vaccines, amid criticisms that the country's COVID-19 vaccination rollout has been slow since it started on Marc

Build a function that extracts the required information from an article

In [140]:
from urllib.parse import urljoin

def getInfo(article):
  '''Returns the title, date, author, and text of an article from the list page.'''
  title = article.div.p.text.strip()
  date = article.div.findChildren('span', {"class": "datetime"})[0].text.strip()
  author = article.div.findChildren('span', {"class": "author"})[0].text.strip()
  # urljoin avoids the doubled slash that plain concatenation produces
  # when the href already starts with "/".
  article_link = urljoin('https://news.abs-cbn.com/', article.div.findChildren('a')[0]['href'])

  # Open the article page to get the article text.
  article_page = requests.get(article_link)
  soup1 = BeautifulSoup(article_page.content, 'html.parser')
  full_text = ""

  # Check whether the news article is text or a video:
  # text -> get the article paragraphs, video -> get the video caption.
  content = soup1.find("div", {"class": "article-content"})
  # find()'s first positional argument is the tag name, so the class
  # filter has to be passed through attrs.
  caption = soup1.find(attrs={"class": "media-caption"})
  if content is not None:
    full_text = " ".join([elem.text.strip()+"\n\n" for elem in content.findChildren('p')])
  elif caption is not None:
    full_text = " ".join([elem.text.strip()+"\n\n" for elem in caption.findChildren('p')])

  element = {
            "title": title,
            "datetime": date,
            "author": author,
            "article": full_text
        }

  return element

In [143]:
list_page = 1
article_json = []

# Loop through the pages of the article list.
while True:
  print(f"LIST PAGE: {list_page}")
  URL="https://news.abs-cbn.com/news?page=" + str(list_page)
  page=requests.get(URL)

  soup = BeautifulSoup(page.content, 'html.parser')
  more_stories = soup.find_all("article", {"class": "clearfix"})

  # Fetch the first and last article of the page once and reuse them below.
  first_info = getInfo(more_stories[0])
  last_info = getInfo(more_stories[-1])

  # Skip the page if it does not contain articles within the date range.
  if "Mar 12" not in last_info['datetime'] and "Mar 11" not in last_info['datetime'] and "Mar 11" not in first_info['datetime']:
    list_page = list_page + 1

  # Break the loop if all the articles on the page are older than the date range.
  elif "Mar 10" in first_info['datetime']:
    break

  # The page contains articles within the date range.
  else:
    print("page has mar 12 or 11")
    print(f"First date {first_info}")
    print(f"Last date {last_info}")

    # Loop through the list of articles.
    for article in more_stories:
      article_detail = getInfo(article)

      # If the article is within the date range, save its info to the list.
      if ("Mar 12" in article_detail['datetime'] or "Mar 11" in article_detail['datetime']):
        article_json.append(article_detail)
    list_page = list_page + 1

  # Break the loop once the last page has been processed.
  if list_page > last_page:
    break


import json
with open('articles.json', 'w') as outfile:
    json.dump(article_json, outfile)

len(article_json)

LIST PAGE: 1
LIST PAGE: 2
LIST PAGE: 3
LIST PAGE: 4
LIST PAGE: 5
LIST PAGE: 6
LIST PAGE: 7
page has mar 12 or 11
First date {'title': 'Lacson laments slow pace of vaccination rollout in Philippines', 'datetime': 'Mar 13 03:42 AM', 'author': 'ABS-CBN News', 'article': ''}
Last date {'title': 'Lalaki timbog sa tangkang pagsunog sa dating pinapasukang restoran', 'datetime': 'Mar 12 06:51 PM', 'author': 'ABS-CBN News', 'article': 'MALOLOS CITY, Bulacan — Timbog ang isang lalaki sa lungsod na ito matapos tangkaing sunugin ang dating pinagtatrabahuhang restoran noong Linggo ng madaling araw.\n\n Sa inisyal na imbestigasyon ng Malolos City police, nasunog ang ilang bahagi ng bubungan ng establisimyento sa Barangay Mojon pero naagapan din ito matapos ang ilang minuto.\n\n Kinilala ang suspek na si Louie Mangabat, na nakuhanan pa sa CCTV na naghagis ng tila bote na may apoy at mabilis na tumakas paalis angkas ng isang motorsiklo.\n\n Kinumpirma rin ng guard on duty na si Mangabat ang nakita niy

47
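
The loop above selects articles by checking whether the strings "Mar 11" or "Mar 12" appear in the listing timestamp. As a hedged alternative, the sketch below parses the timestamp and compares real dates, which would extend to longer ranges without adding more string checks. The listing timestamps carry no year, so the year 2021 is an assumption inferred from the `/news/03/15/21/` path in the article link. The function names are illustrative, and the "Mar 10" sample is made up for the demonstration.

```python
from datetime import date, datetime

def parse_listing_datetime(text: str, year: int) -> datetime:
    """Parse a listing timestamp such as 'Mar 12 06:51 PM' for the given year."""
    return datetime.strptime(f"{text.strip()} {year}", "%b %d %I:%M %p %Y")

def in_date_range(text: str, start: date, end: date) -> bool:
    """Return True if the timestamp falls between start and end, inclusive."""
    return start <= parse_listing_datetime(text, start.year).date() <= end

# Assumed target range: Mar 11-12, 2021.
START, END = date(2021, 3, 11), date(2021, 3, 12)

# The first two stamps come from the output above; the third is hypothetical.
for stamp in ["Mar 13 03:42 AM", "Mar 12 06:51 PM", "Mar 10 11:59 PM"]:
    print(stamp, in_date_range(stamp, START, END))
```

With parsed dates, the "page is entirely older than the range" check could also become a direct comparison against `START` instead of a search for "Mar 10".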