# Scraping "What's in Hong Kong's New Law" from ABC News

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# Review the below for each website scraped, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Record the Python and package versions for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'https://www.abc.net.au/news/2020-07-01/what-is-in-hong-kongs-new-china-imposed-national-security-law/12409024'
content = requests.get(url).content
url

'https://www.abc.net.au/news/2020-07-01/what-is-in-hong-kongs-new-china-imposed-national-security-law/12409024'

In [5]:
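# This rebinds `soup` from the BeautifulSoup class to the parsed document,
# so re-run the import cell before re-running this one.
soup = soup(content,'html.parser')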
# bsobj
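
The request above parses whatever the server returns, including an error page. A minimal sketch of a more defensive fetch is below. The helper names `fetch_html` and `parse_html` and the 10-second timeout are assumptions for illustration and are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into requests.HTTPError instead of parsing an error page
    response.raise_for_status()
    return response.content


def parse_html(raw_html):
    """Parse raw HTML with Python's built-in parser, as the notebook does."""
    return BeautifulSoup(raw_html, 'html.parser')
```

With these helpers, `page = parse_html(fetch_html(url))` would stand in for the two cells above while leaving the imported `soup` name untouched.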

**Grab | Date**<a id='2.4_scrape_date'></a>

In [7]:
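# Iterating over the <time> tag yields its child nodes; the loop variable `date`
# also shadows the `date` class imported from datetime above.
for date in soup.find('time',{'class':'_21SmZ _3_Aqg _1hGzz _1-RZJ P8HGV'}):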
    print(date.text.strip())

Posted
WedWednesday 1 JulJuly 2020 at 5:16am
WedWednesday 1 JulJuly 2020 at 5:16am


In [8]:
date

<span class="_3I9MJ" hidden=""><span data-component="Abbreviation"><abbr aria-hidden="true" class="_2t5cr" title="Wednesday">Wed</abbr><span class="_1gVuJ" data-component="ScreenReaderOnly">Wednesday</span></span> <!-- -->1<!-- --> <span data-component="Abbreviation"><abbr aria-hidden="true" class="_2t5cr" title="July">Jul</abbr><span class="_1gVuJ" data-component="ScreenReaderOnly">July</span></span> <!-- -->2020<!-- --> at <!-- -->5:16am</span>

The date is in there but needs to be cleaned.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [9]:
for title in soup.findAll('h1'):
    print(title.text.strip())

What's in Hong Kong's new national security law imposed by China, and why is it so controversial?


In [10]:
title

<h1 class="_2aMR4 Yui2m _2sflh jwLlj _1Yxlo _1GKnS _2o9MN _1-RZJ" data-component="Heading">What's in Hong Kong's new national security law imposed by China, and why is it so controversial?</h1>

In [11]:
TAG_RE = re.compile(r'<[^>]+>')

In [12]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [13]:
remove_tags(str(title))

"What's in Hong Kong's new national security law imposed by China, and why is it so controversial?"

Confirmed: that is the title, though stripping the tags with a regex is a roundabout way to get it.
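
As a possible shortcut, BeautifulSoup can return the text of a tag directly, which avoids converting the tag to a string and stripping markup with a regex. The sketch below uses a trimmed copy of the `<h1>` shown above as sample input. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the heading markup from the page (class list omitted for brevity)
sample_html = (
    "<h1 data-component='Heading'>What's in Hong Kong's new national security "
    "law imposed by China, and why is it so controversial?</h1>"
)

doc = BeautifulSoup(sample_html, 'html.parser')
heading = doc.find('h1')

# get_text() returns only the text nodes; strip=True trims surrounding whitespace
title_text = heading.get_text(strip=True)
print(title_text)
```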

**Grab | Content**<a id='2.4_scrape_content'></a>

In [14]:
for bodies in soup.find_all('div',{'class':'_3b5Y5 _1BraJ'}):
    print(bodies.text.strip())

The full details of the controversial national security law thrust upon Hong Kong by Beijing have been released, and it goes much further than had previously been predicted.Key points:Life sentences are possible for serious anti-state crimesThe law covers many of the activities of Hong Kong's protest movementAustralia is among around two-dozen countries criticising the lawWhile there has been growing concern both in Hong Kong and around the world about the changes, the details of the legislation were only made public after it came into effect at 11:00pm yesterday, hours after Beijing passed it into law by decree.We now know the laws will punish crimes of secession, subversion, terrorism and collusion with foreign forces with up to life in prison.Even asking foreign countries to sanction or take any form of action against Hong Kong or China could be considered as collusion with foreign forces under the law.Hong Kong police said they arrested a man holding a flag advocating for independe

In [37]:
len(bodies)

90

The text is there, but the sentences run together and the result still needs cleaning.
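
One way to stop sentences from running together is to ask BeautifulSoup to join the text nodes with a separator. The sketch below runs on a shortened, hand-written sample modelled on the article markup, not on the live page, so the exact result on the real article is not shown here.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the article container (inner class names omitted)
sample_html = (
    '<div class="_3b5Y5 _1BraJ"><div>'
    '<p>The full details have been released.</p>'
    '<h2>Key points:</h2>'
    '<ul>'
    '<li>Life sentences are possible for serious anti-state crimes</li>'
    "<li>The law covers many of the activities of Hong Kong's protest movement</li>"
    '</ul>'
    '</div></div>'
)

doc = BeautifulSoup(sample_html, 'html.parser')
container = doc.find('div', {'class': '_3b5Y5 _1BraJ'})

# separator=' ' puts a space between adjacent text nodes;
# strip=True trims each node and skips whitespace-only ones
body_text = container.get_text(separator=' ', strip=True)
print(body_text)
```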

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [15]:
for date in soup.find('time',{'class':'_21SmZ _3_Aqg _1hGzz _1-RZJ P8HGV'}):
    date.text.strip()

In [16]:
date

<span class="_3I9MJ" hidden=""><span data-component="Abbreviation"><abbr aria-hidden="true" class="_2t5cr" title="Wednesday">Wed</abbr><span class="_1gVuJ" data-component="ScreenReaderOnly">Wednesday</span></span> <!-- -->1<!-- --> <span data-component="Abbreviation"><abbr aria-hidden="true" class="_2t5cr" title="July">Jul</abbr><span class="_1gVuJ" data-component="ScreenReaderOnly">July</span></span> <!-- -->2020<!-- --> at <!-- -->5:16am</span>

In [17]:
date = remove_tags(str(date))
date

'WedWednesday 1 JulJuly 2020 at 5:16am'

In [20]:
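# These substitutions are hard-coded for this article's date string
day_pub = re.sub('JulJuly','07',date)          # month name -> month number
day_pub = re.sub('WedWednesday ','',day_pub)   # drop the weekday
day_pub = re.sub(' at 5:16am','',day_pub)      # drop the time of day
day_pub = re.sub(' ','-',day_pub)              # '1 07 2020' -> '1-07-2020'
day_pub = datetime.strptime(day_pub, '%d-%m-%Y').date()
day_pub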

datetime.date(2020, 7, 1)
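
The substitutions above only work for this one article. A more general sketch is below, assuming other pages use the same "abbreviation glued to full name" pattern (for example `JulJuly`). That has not been checked against other ABC articles. The function name `parse_posted_date` is hypothetical, and `datetime` is imported as `dt` so it does not clash with the notebook's `date` variable.

```python
import datetime as dt
import re

MONTHS = [
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December',
]

# Day number, then an abbreviation glued to the full month name, then a 4-digit year
POSTED_RE = re.compile(r'(\d{1,2})\s+[A-Za-z]*?([A-Z][a-z]+)\s+(\d{4})')


def parse_posted_date(text):
    """Extract a date from a string such as 'WedWednesday 1 JulJuly 2020 at 5:16am'."""
    match = POSTED_RE.search(text)
    if match is None:
        raise ValueError(f'no date found in {text!r}')
    day, month_name, year = match.groups()
    # list.index raises ValueError if the captured word is not a month name
    month = MONTHS.index(month_name) + 1
    return dt.date(int(year), month, int(day))


print(parse_posted_date('WedWednesday 1 JulJuly 2020 at 5:16am'))  # 2020-07-01
```

Looking the month up in a fixed list keeps the result independent of the machine's locale, which `strptime` with `%B` would not be.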

In [21]:
df_date = pd.DataFrame([day_pub])

In [22]:
type(df_date)

pandas.core.frame.DataFrame

In [23]:
df_date

Unnamed: 0,0
0,2020-07-01


Now the title.

In [24]:
for title in soup.findAll('h1'):
    print(title.text.strip())

What's in Hong Kong's new national security law imposed by China, and why is it so controversial?


In [25]:
title

<h1 class="_2aMR4 Yui2m _2sflh jwLlj _1Yxlo _1GKnS _2o9MN _1-RZJ" data-component="Heading">What's in Hong Kong's new national security law imposed by China, and why is it so controversial?</h1>

In [26]:
title = remove_tags(str(title))

In [27]:
title

"What's in Hong Kong's new national security law imposed by China, and why is it so controversial?"

In [28]:
df_title = pd.DataFrame([title])

In [29]:
df_title

Unnamed: 0,0
0,What's in Hong Kong's new national security la...


In [30]:
type(df_title)

pandas.core.frame.DataFrame

These items are manually added.

In [31]:
# abc.net.au is the Australian Broadcasting Corporation, so 'AU' may be the intended value here
country = 'US'
df_country = pd.DataFrame([country])
source = 'ABC News'
df_source = pd.DataFrame([source])
file_name = 'abc_6'
df_file_name = pd.DataFrame([file_name])


Finally, the article body.

In [49]:
for bodies in soup.find_all('div',{'class':'_3b5Y5 _1BraJ'}):
    bodies.text.strip()

In [50]:
bodies

<div class="_3b5Y5 _1BraJ" data-component="LayoutContainer"><div><p class="_1HzXw">The full details of the controversial national security law thrust upon Hong Kong by Beijing have been released, and it goes much further than had previously been predicted.</p><section aria-label="key points" class="vB11C _1tnlo _1w6Cw _1pc-9 _2EbnQ" data-component="KeyPoints" data-uri="coremedia://teaser/12410876" role="contentinfo"><h2 class="_3mduI _2gTdF _1deB8 jwLlj _1O6ck _1GKnS _2o9MN _1-RZJ" data-component="Heading">Key points:</h2><ul class="W8pqX" data-component="List" role="list"><li class="" data-component="ListItem"><span class="XoNws _1Hwjq"></span>Life sentences are possible for serious anti-state crimes</li><li class="" data-component="ListItem"><span class="XoNws _1Hwjq"></span>The law covers many of the activities of Hong Kong's protest movement</li><li class="" data-component="ListItem"><span class="XoNws _1Hwjq"></span>Australia is among around two-dozen countries criticising the law

In [51]:
body = remove_tags(str(bodies))
body

'The full details of the controversial national security law thrust upon Hong Kong by Beijing have been released, and it goes much further than had previously been predicted.Key points:Life sentences are possible for serious anti-state crimesThe law covers many of the activities of Hong Kong\'s protest movementAustralia is among around two-dozen countries criticising the lawWhile there has been growing concern both in Hong Kong and around the world about the changes, the details of the legislation were only made public after it came into effect at 11:00pm yesterday, hours after Beijing passed it into law by decree.We now know the laws will punish crimes of secession, subversion, terrorism and collusion with foreign forces with up to life in prison.Even asking foreign countries to sanction or take any form of action against Hong Kong or China could be considered as collusion with foreign forces under the law.Hong Kong police said they arrested a man holding a flag advocating for indepen

In [52]:
# Keep only the text that comes before the 'ABC/WiresPosted' marker
body = body.split('ABC/WiresPosted')
body = [body[0]]

In [53]:
df_news = pd.DataFrame()

In [54]:
df_news['article_body'] = body

In [55]:
df_news.head(2)

Unnamed: 0,article_body
0,The full details of the controversial national...


In [56]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [57]:
df_news = df_news.article_body[0]

In [58]:
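# df_news is a plain str here and str.replace matches literally (no regex),
# so strip any stray backslashes directly
df_news = df_news.replace('\\','')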

In [59]:
df_news = pd.DataFrame([df_news])

In [60]:
type(df_news)

pandas.core.frame.DataFrame

In [61]:
df_news.columns = ['Article']

In [62]:
df_news.head()

Unnamed: 0,Article
0,The full details of the controversial national...


**Bringing it together.**<a id='2.5_bit'></a>

In [63]:
df_6_abc = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [64]:
df_6_abc.columns = ['file_name','date','source','country','title','article']

In [65]:
df_6_abc.head()

Unnamed: 0,file_name,date,source,country,title,article
0,abc_6,2020-07-01,ABC News,US,What's in Hong Kong's new national security la...,The full details of the controversial national...
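
The same one-row table can also be built in a single step from a dictionary, which avoids creating and concatenating six separate DataFrames. This is a sketch of an alternative, not the notebook's method. The values are copied from the cells above, with the title and article shortened to placeholders.

```python
import datetime as dt

import pandas as pd

# One record per article; keys become the column names in this order
record = {
    'file_name': 'abc_6',
    'date': dt.date(2020, 7, 1),
    'source': 'ABC News',
    'country': 'US',
    'title': 'placeholder title',        # shortened for the example
    'article': 'placeholder article',    # shortened for the example
}

df_row = pd.DataFrame([record])
print(df_row.shape)
```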


**Saving**<a id='2.6_save'></a>

In [66]:
cd

C:\Users\rands


Saving it to CSV.

In [67]:
# df = pd.DataFrame(reviewlist)

# index=False keeps the DataFrame index out of the file; the spreadsheet's own row numbers are enough
df_6_abc.to_csv('./_Capstone_Two_NLP/data/_news/abc_6.csv', index=False)

print('Complete')

Complete
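
The save above relies on the working directory set by `cd` and on the target folder already existing. A hedged sketch of a helper that creates the folder first is below. The name `save_article` is hypothetical, and the demonstration writes to a temporary directory instead of the real project path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_article(df, out_dir, file_name):
    """Write df to <out_dir>/<file_name>.csv, creating out_dir if needed."""
    out_path = Path(out_dir) / f'{file_name}.csv'
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)  # index=False omits the DataFrame index column
    return out_path


# Demonstration in a throwaway directory so nothing is written to the project
with tempfile.TemporaryDirectory() as tmp:
    demo_df = pd.DataFrame([{'file_name': 'abc_6', 'source': 'ABC News'}])
    saved = save_article(demo_df, Path(tmp) / 'data' / '_news', 'abc_6')
    reloaded = pd.read_csv(saved)

print(saved.name)
```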
