Sharmaine Gaw

* [Pokemon](#Pokemon)
* [News](#News-Sites)

# Pokemon

**Task:** Bulbapedia: Get the list of the next generations of Pokemons. Make sure you get the following information:
- kdex
- ndex
- name
- types
- URL to the Pokemon's wiki page

In [28]:
import requests
from bs4 import BeautifulSoup

Get the page using `requests` and `BeautifulSoup`.

In [29]:
URL = "https://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

### Find all tables that contain Pokemon details

Get the element with id "mw-content-text", as it contains the tables for all generations of Pokemon.

In [30]:
poke_content = soup.find(id='mw-content-text')

Retrieve all `table` elements.

In [31]:
# Get all <table> elements
poke_tables = poke_content.find_all('table')

The page lists only 8 generations of Pokemon. On closer inspection, the first table and everything after the ninth table do not contain the information we need, so we skip the first table and keep only the next eight.

We store the result in the list `all_generations`, where each item is the table for one generation.

In [32]:
all_generations = poke_tables[1:9]

### Get Pokemon Information

We define a function `get_pokemon_info` that, for each Pokemon row in a table, extracts the `kdex`, `ndex`, `name`, `type1`, and `URL` (and `type2` if applicable) from the odd-indexed children of the row. We use `strip()` to remove surrounding whitespace and `replace()` to remove the "#" character.

We store the extracted information for a single Pokemon in a JSON object and return the JSON object.

In [33]:
def get_pokemon_info(contents, has_multiple_types):
    json_object = {
                "kdex": contents[1].text.strip().replace("#", ""),
                "ndex": contents[3].text.strip().replace("#", ""),
                "name": contents[7].text.strip(),
                "type1": contents[9].text.strip(),
                "URL": contents[5].find('a')['href']
            }
    
    if has_multiple_types:
        json_object['type2'] = contents[11].text.strip()
    return json_object
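The odd indices used in `get_pokemon_info` follow from how BeautifulSoup exposes a row's children: `.contents` includes the whitespace text nodes between the `<td>` tags. The sketch below illustrates this on a hypothetical, simplified row (the real Bulbapedia markup is richer, and `sample_html` is made up for illustration). It also shows `find_all("td")` as a possible alternative that skips the whitespace nodes.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical row, simplified from the general shape of a Pokemon table row.
sample_html = """<table><tr>
<td>#001</td>
<td>#001</td>
<td><a href="/wiki/Bulbasaur_(Pok%C3%A9mon)">img</a></td>
<td>Bulbasaur</td>
<td>Grass</td>
<td>Poison</td>
</tr></table>"""

row = BeautifulSoup(sample_html, "html.parser").find("tr")

# .contents keeps the newline text nodes, so the <td> cells
# end up at every other position in the list.
cell_positions = [
    i for i, node in enumerate(row.contents)
    if isinstance(node, Tag) and node.name == "td"
]
print(cell_positions)

# Asking for the <td> tags directly avoids the whitespace nodes entirely.
cells = row.find_all("td")
print([cell.text.strip() for cell in cells])
```

With `find_all("td")`, the cells can be indexed 0, 1, 2, ... and the number of type columns can be read from `len(cells)` rather than from the length of `.contents`.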

We define another function `gen_list_to_json` that takes a generation table as input, iterates through the rows in the table, and calls the previous function on each row to build the JSON objects. We return the list of JSON objects.

In [34]:
def gen_list_to_json(gen_list, start_index):
    json = []
    
    contents = gen_list.contents
    
    for i in range(start_index, len(contents), 2):
        poke_info = contents[i].contents
        json.append(get_pokemon_info(poke_info, len(poke_info) > 10))
    return json

Test if it works. We define `start_index` = 3 as the first row in each table is the table header, not a Pokemon.

In [35]:
gen_list_to_json(all_generations[0], 3)

[{'kdex': '001',
  'ndex': '001',
  'name': 'Bulbasaur',
  'type1': 'Grass',
  'URL': '/wiki/Bulbasaur_(Pok%C3%A9mon)',
  'type2': 'Poison'},
 {'kdex': '002',
  'ndex': '002',
  'name': 'Ivysaur',
  'type1': 'Grass',
  'URL': '/wiki/Ivysaur_(Pok%C3%A9mon)',
  'type2': 'Poison'},
 {'kdex': '003',
  'ndex': '003',
  'name': 'Venusaur',
  'type1': 'Grass',
  'URL': '/wiki/Venusaur_(Pok%C3%A9mon)',
  'type2': 'Poison'},
 {'kdex': '004',
  'ndex': '004',
  'name': 'Charmander',
  'type1': 'Fire',
  'URL': '/wiki/Charmander_(Pok%C3%A9mon)'},
 {'kdex': '005',
  'ndex': '005',
  'name': 'Charmeleon',
  'type1': 'Fire',
  'URL': '/wiki/Charmeleon_(Pok%C3%A9mon)'},
 {'kdex': '006',
  'ndex': '006',
  'name': 'Charizard',
  'type1': 'Fire',
  'URL': '/wiki/Charizard_(Pok%C3%A9mon)',
  'type2': 'Flying'},
 {'kdex': '007',
  'ndex': '007',
  'name': 'Squirtle',
  'type1': 'Water',
  'URL': '/wiki/Squirtle_(Pok%C3%A9mon)'},
 {'kdex': '008',
  'ndex': '008',
  'name': 'Wartortle',
  'type1': 'Water',

We loop through `all_generations`, which contains the tables for all generations of Pokemon, and save a JSON file for each generation. We store all the JSON files inside the folder "pokemon_data".

In [36]:
import json
import os

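# Create the output folder (the same one we write to) if it does not exist yet
os.makedirs("pokemon_data", exist_ok=True)

path = os.path.abspath("pokemon_data")

for index, generation in enumerate(all_generations):
    # os.path.join keeps the path portable across operating systems
    with open(os.path.join(path, "generation_" + str(index + 1) + ".json"), 'w') as f:
        json.dump(gen_list_to_json(generation, 3), f)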
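The `URL` values extracted above are site-relative (for example `/wiki/Bulbasaur_(Pok%C3%A9mon)`). If absolute links are wanted, one option is to resolve them against the site root with `urllib.parse.urljoin`. This is a minimal sketch; `BASE_URL` and `absolute_url` are illustrative names that do not appear in the notebook.

```python
from urllib.parse import urljoin

# Site root, taken from the page URL scraped earlier in this notebook.
BASE_URL = "https://bulbapedia.bulbagarden.net"

def absolute_url(relative_href):
    """Resolve a site-relative href such as '/wiki/...' against the site root."""
    return urljoin(BASE_URL, relative_href)

print(absolute_url("/wiki/Bulbasaur_(Pok%C3%A9mon)"))
```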

---

# News Sites

**Task:** Get the news articles published on March 11-12. Decide which news site you want to get articles from. Make sure you get the following information:

- date
- title
- full article
- author

The news site I chose is [Manila Bulletin](https://mb.com.ph/) after inspecting the robots.txt and the Privacy Policy of the site.

One reason I chose this site is that the articles published on a given day can be accessed easily by modifying the URL. For example, the archive for March 11 can be accessed through https://mb.com.ph/2021/03/11.

In addition, the HTML of the website is well-structured and its classes are well defined, as we'll see later on.
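The date-based archive URL can be built from a `datetime.date` rather than by hand. This is a minimal sketch, assuming the `/page/N` suffix used for pagination later in this notebook; `archive_url` is a hypothetical helper name.

```python
from datetime import date

def archive_url(day, page=1):
    """Build the archive URL for a given day, optionally for a later page."""
    # date objects accept strftime codes inside an f-string format spec
    url = f"https://mb.com.ph/{day:%Y/%m/%d}"
    if page > 1:
        url += f"/page/{page}"
    return url

print(archive_url(date(2021, 3, 11)))
print(archive_url(date(2021, 3, 12), page=2))
```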

---

First import `requests` and `BeautifulSoup`.

In [1]:
import requests
from bs4 import BeautifulSoup

We first define a function that fetches a URL with `requests` and parses the response with `BeautifulSoup`. We send a `User-Agent` header with the request; without it, the site refuses access to the page (i.e., it returns 403).

The function returns the page content as well as the status code.

In [2]:
def get_page(URL):
    headers = {
        'User-Agent': "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Mobile Safari/537.36"
    }
    page = requests.get(URL, headers=headers)
    return [BeautifulSoup(page.content.decode(), 'html.parser'), page.status_code]

Test if it's working for March 11, 2021.

In [3]:
day_URL = "https://mb.com.ph/2021/03/11/page/1"
day_page = get_page(day_URL)
day_page

[<!DOCTYPE html>
 
 <html lang="en-US" prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb#">
 <head>
 <meta charset="utf-8"/>
 <link href="https://gmpg.org/xfn/11" rel="profile"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <meta content="yes" name="mobile-web-app-capable"/>
 <meta content="#25387A" name="theme-color"/>
 <meta content="#25387A" name="msapplication-navbutton-color"/>
 <meta content="yes" name="apple-mobile-web-app-capable"/>
 <meta content="black-translucent" name="apple-mobile-web-app-status-bar-style"/>
 <meta content="Manila Bulletin The Nation's Leading Newspaper" name="apple-mobile-web-app-title"/>
 <meta content="telephone=no" name="format-detection"/>
 <title>March 11, 2021 – Manila Bulletin</title>
 <link href="//cdnjs.cloudflare.com" rel="dns-prefetch"/>
 <link href="//maxcdn.bootstrapcdn.com" rel="dns-prefetch"/>
 <link href="//static.addtoany.com" rel="dns-prefetch"/>
 <link href="//fonts.googleapis.com" rel="dns-prefetch"/>
 <l

### Extracting links from an archive page

Next, we define a function `get_page_links` for getting the href links of the articles in a single archive page. We can get the links by getting the href attribute in the anchor tags nested within h4 tags with the class "title".

We store the links inside a list and return that list.

In [4]:
def get_page_links(day_page):
    return [article.find("a")["href"] for article in day_page.find_all("h4", class_ = "title")]

Test it out for March 11, 2021.

In [5]:
march_11_links = get_page_links(get_page(day_URL)[0])
march_11_links

['https://mb.com.ph/2021/03/11/hong-kongs-new-electoral-system-to-provide-socio-political-stability-benefit-ofws-chinese-envoy-to-ph/',
 'https://mb.com.ph/2021/03/11/introducing-the-real-enteng-and-joey-from-the-song-spoliarium/',
 'https://mb.com.ph/2021/03/11/abalos-no-need-to-give-special-pass-to-essential-workers-during-uniform-metro-curfew/',
 'https://mb.com.ph/2021/03/11/dilg-to-issue-memo-reminding-lgus-of-the-relaxed-travel-requirements/',
 'https://mb.com.ph/2021/03/11/where-and-when-sinas-learned-that-hes-covid-positive/',
 'https://mb.com.ph/2021/03/11/uniform-curfew-to-be-imposed-in-metro-manila-starting-march-15-abalos/',
 'https://mb.com.ph/2021/03/11/govt-assures-availability-of-covid-19-vaccine-second-dose-for-healthcare-workers/',
 'https://mb.com.ph/2021/03/11/variants-of-concern-not-yet-dominant-in-ph-up-pgc-official/',
 'https://mb.com.ph/2021/03/11/dfa-confirms-four-more-overseas-filipinos-infected-with-covid-19/',
 'https://mb.com.ph/2021/03/11/senators-call-for

### Extracting information from an article page

Next, for each article, we're tasked to extract the following: date, title, full article, author.

We define a function `get_article_text` that extracts the text of every p tag nested within the section tag with class "article-content", and returns the extracted texts in a list.

We define another function `split_article_text` that splits each paragraph into sentences on ". " so the data is easier to clean later on.

In [6]:
def get_article_text(article_page):
    # Guard against pages that lack the expected <section class="article-content">
    section = article_page.find("section", class_ = "article-content")
    if section is None:
        return []
    return [paragraph.text.strip() for paragraph in section.find_all("p") if paragraph.text.strip()]

def split_article_text(article_text):
    return [sentence for article in article_text for sentence in article.split(". ")]
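Splitting on ". " drops the period from every sentence except the last one in a paragraph. If keeping the punctuation matters, a regular expression with a lookbehind is one alternative. This is a rough sketch with a hypothetical `split_sentences` helper; like the original, it still splits after abbreviations such as "Dr.".

```python
import re

def split_sentences(paragraph):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]

sample = "Huang spoke on Thursday. He praised the reform."
print(sample.split(". "))       # the period of the first sentence is lost
print(split_sentences(sample))  # both sentences keep their period
```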

We define `get_article_title`, which gets the article title from the first tag with class "title".

In [7]:
def get_article_title(article_page):
    return article_page.find(class_ = "title").text.strip()

We define `get_article_author`, which gets the article author from the text of an anchor tag nested within a paragraph tag with class "author".

The text comes back in the format "by *author name*", so we call `split()` once more to remove "by ".

In [8]:
def get_article_author(article_page):
    temp1 = article_page.find(class_ = "author")
    if temp1:
        temp2 = temp1.find("a")
        if temp2:
            return temp2.text.strip().split(' ', 1)[1]
    return ""

We define `get_article_date`, which gets the article date from a paragraph tag with class "published". Again, we use `split()` to remove the first word of the string, "Published".

In [9]:
def get_article_date(article_page):
    if article_page.find(class_ = "published"):
        return article_page.find(class_ = "published").text.strip().split(' ', 1)[1]
    return ""
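The date is kept as a plain string. If a sortable value is needed later, it can be parsed with `datetime.strptime`. This is a minimal sketch, assuming every date has the same shape as the one printed in this notebook ("March 11, 2021, 11:47 PM") and that the default English locale is in use; `parse_published` is a hypothetical helper.

```python
from datetime import datetime

def parse_published(date_text):
    """Parse a string such as 'March 11, 2021, 11:47 PM' into a datetime."""
    return datetime.strptime(date_text, "%B %d, %Y, %I:%M %p")

parsed = parse_published("March 11, 2021, 11:47 PM")
print(parsed.isoformat())
```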

Test it out.

In [10]:
article_page = get_page(march_11_links[0])[0]
article_page

print(get_article_title(article_page))
print(get_article_author(article_page))
print(get_article_date(article_page))
print(split_article_text(get_article_text(article_page)))

Hong Kong’s new electoral system to provide socio-political stability, benefit OFWs – Chinese envoy to PH
Roy Mabasa
March 11, 2021, 11:47 PM
['China’s passage of electoral reforms in Hong Kong will also benefit thousands of Filipinos working there, who will be provided with a “more stable socio-political environment”, and predictable business climate, Chinese Ambassador to the Philippines Huang Xilian said on Thursday.', 'Huang made this comment several hours after the parliament in Beijing passed a resolution overhauling Hong Kong’s electoral system, and implementing what it described as “patriots governing Hong Kong” that would ensure long-term stability, prosperity, and steady implementation of “One Country, Two Systems”.', '“I believe that the improving of the electoral system of the HKSAR (Hong Kong Special Administrative Region) will not only provide a more peaceful and stable social environment for the Filipinos in Hong Kong but also create a more stable political environment a

### Store into a JSON object

We use the functions above and store the information for each article in a single JSON object.

In [11]:
def article_to_json(article_page):
    return {
                "date": get_article_date(article_page),
                "title": get_article_title(article_page),
                "full_article": split_article_text(get_article_text(article_page)),
                "author": get_article_author(article_page)
            }

Lastly, we define another function `json_articles_per_day` that crawls through every article in the archive for a given day. It goes through the pagination one page at a time until the status of a page is not 200 (i.e., the page does not exist).

In [12]:
def json_articles_per_day(year, month, date):
    json = []
    page = 1
    while True:
        day_URL = "https://mb.com.ph/" + year + "/"+ month + "/" + date
        if page > 1:
            day_URL += "/page/" + str(page)
        day_page = get_page(day_URL)

        if day_page[1] != 200:
            return json

        article_links = get_page_links(day_page[0])
        for link in article_links:
            json.append(article_to_json(get_page(link)[0]))

        page += 1
            
    return json
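The stop-on-non-200 pagination logic can be tried out without touching the network. In the sketch below, `crawl_pages` and the stub `fake_fetch` are hypothetical names, and the page cap and optional delay between requests are additions that the function above does not have.

```python
import time

def crawl_pages(fetch_page, max_pages=100, delay_seconds=0.0):
    """Collect items page by page until fetch_page reports a non-200 status.

    fetch_page(page_number) must return a tuple (items, status_code).
    """
    collected = []
    for page_number in range(1, max_pages + 1):
        items, status = fetch_page(page_number)
        if status != 200:
            break  # the archive has no more pages
        collected.extend(items)
        time.sleep(delay_seconds)  # optional pause between requests
    return collected

# Stub that stands in for the real site: two pages exist, the third is missing.
def fake_fetch(page_number):
    pages = {1: ["a", "b"], 2: ["c"]}
    if page_number in pages:
        return pages[page_number], 200
    return [], 404

result = crawl_pages(fake_fetch)
print(result)
```

The cap on pages guards against an endless loop if a site answers 200 for pages that do not exist.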

Get the JSON objects for all articles from March 11 and March 12 and store them in variables.

In [13]:
march_11 = json_articles_per_day(year = "2021", month = "03", date = "11")
march_11

[{'date': 'March 11, 2021, 11:47 PM',
  'title': 'Hong Kong’s new electoral system to provide socio-political stability, benefit OFWs – Chinese envoy to PH',
  'full_article': ['China’s passage of electoral reforms in Hong Kong will also benefit thousands of Filipinos working there, who will be provided with a “more stable socio-political environment”, and predictable business climate, Chinese Ambassador to the Philippines Huang Xilian said on Thursday.',
   'Huang made this comment several hours after the parliament in Beijing passed a resolution overhauling Hong Kong’s electoral system, and implementing what it described as “patriots governing Hong Kong” that would ensure long-term stability, prosperity, and steady implementation of “One Country, Two Systems”.',
   '“I believe that the improving of the electoral system of the HKSAR (Hong Kong Special Administrative Region) will not only provide a more peaceful and stable social environment for the Filipinos in Hong Kong but also crea

In [15]:
march_12 = json_articles_per_day(year = "2021", month = "03", date = "12")
march_12

[{'date': 'March 12, 2021, 11:31 PM',
  'title': 'Tourist destinations in Rizal province seen to boost tourism recovery in the new normal',
  'full_article': ['Many tourist attractions in the province of Rizal, from churches to restaurants, to resorts and art galleries are being groomed to boost domestic tourism in the Southern Tagalog Region.',
   'During a recent visit of Department of Tourism Secretary Bernadette Romulo-Puyat in many parts of Rizal, including Angono and Antipolo, the secretary has emphasized the importance of the so-called Green Corridor Initiative in Rizal as one of the key strategies to hasten tourism recovery in the new normal.',
   'Puyat visited the wellness activity at Luljetta’s Garden Suites at the Loreland Farm Resort in Antipolo City, the Blanco Family Museum, and the Nemiranda Art Gallery in Angono.',
   'Prior to the recent tour in the province, the DOT secretary also had experienced Rizal Province’s Faith, Food, Art, Adventure and Nature Experience, whi

In [19]:
print(len(march_11), len(march_12))

252 262


### Export as JSON

We create a folder "mb" and store the JSON files inside. 

In [16]:
import json
import os

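# Create the output folder if it does not exist yet
os.makedirs("mb", exist_ok=True)

path = os.path.abspath("mb")

# os.path.join keeps the paths portable across operating systems
with open(os.path.join(path, "mb_march_11.json"), 'w') as f1:
    json.dump(march_11, f1)
    
with open(os.path.join(path, "mb_march_12.json"), 'w') as f2:
    json.dump(march_12, f2)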
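The article text contains non-ASCII characters such as curly quotes, which `json.dump` escapes as `\uXXXX` sequences by default. If readable files are preferred, `ensure_ascii=False` with an explicit UTF-8 encoding is one option. The sketch below round-trips a shortened sample record through a temporary folder; the `articles` data and file name are for illustration only.

```python
import json
import os
import tempfile

# Shortened sample record; the real lists hold one dict per article.
articles = [{"title": "Hong Kong’s new electoral system", "author": "Roy Mabasa"}]

with tempfile.TemporaryDirectory() as folder:
    file_path = os.path.join(folder, "sample.json")

    # ensure_ascii=False writes the curly quote as-is instead of an escape
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(articles, f, ensure_ascii=False, indent=2)

    with open(file_path, encoding="utf-8") as f:
        raw = f.read()

loaded = json.loads(raw)
print(loaded == articles)
```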