# Scraping the "CE welcomes passage" press release from the Hong Kong Government

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [2]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [56]:
# Today's date as a formatted string; review the format for each website, since date formats differ

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Versions used, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'https://www.info.gov.hk/gia/general/202006/30/P2020063000767.htm'
content = requests.get(url).content
url

'https://www.info.gov.hk/gia/general/202006/30/P2020063000767.htm'
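A slightly more defensive version of the request above could set a timeout and check the HTTP status before parsing. The sketch below assumes a helper named `fetch_html`, which is not part of the original notebook. Its `get` parameter is a hypothetical hook that defaults to `requests.get` and exists only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the response body as bytes, raising on HTTP error statuses.

    `get` and `timeout` are illustrative parameters, not part of the notebook.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises an exception on 4xx/5xx responses
    return response.content


# Usage in the notebook would look like:
# content = fetch_html('https://www.info.gov.hk/gia/general/202006/30/P2020063000767.htm')
```

With this in place, a failed download stops the notebook at the request cell instead of producing empty results in the later scraping cells.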

In [5]:
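# Note: this rebinds `soup` from the BeautifulSoup class to the parsed document,
# so re-run the import cell before re-running this one
soup = soup(content, 'html.parser')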

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
for date in soup.find_all('div',{'class':'mB15 f15'}):
    print(date.text.strip())

Ends/Tuesday, June 30, 2020
				
				 Issued at HKT 18:49


That is the publication date, but the string needs cleaning before it can be used.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [38]:
for title in soup.find_all('div',{'class':'fontSize1','id':'PRHeadline'}):
    print(title.text.strip())

CE welcomes passage of The Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region by NPCSC


In [40]:
title.get_text()

"\nCE welcomes passage of The Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region by NPCSC\n"

In [41]:
title

<div class="fontSize1" id="PRHeadline">
<span id="PRHeadlineSpan"><span>CE welcomes passage of The Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region by NPCSC</span></span>
</div>

In [43]:
TAG_RE = re.compile(r'<[^>]+>')

In [44]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [45]:
remove_tags(str(title))

"\nCE welcomes passage of The Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region by NPCSC\n"

That confirms the title, though it took a few more steps to get there.
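As a possible shortcut, BeautifulSoup's `get_text(strip=True)` returns the text of a tag without its markup, so the regex step may not be needed. The sketch below uses a small hard-coded HTML snippet with a placeholder headline (an assumption, standing in for the live page) and hypothetical variable names.

```python
from bs4 import BeautifulSoup

# Placeholder markup mirroring the structure of the headline <div> on the page
html = (
    '<div class="fontSize1" id="PRHeadline">\n'
    '<span id="PRHeadlineSpan"><span>Example headline</span></span>\n'
    '</div>'
)
page = BeautifulSoup(html, 'html.parser')

# find() returns the first match or None, so guard against a missing headline
headline_tag = page.find('div', {'class': 'fontSize1', 'id': 'PRHeadline'})
headline = headline_tag.get_text(strip=True) if headline_tag is not None else None
print(headline)
```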

**Grab | Content**<a id='2.4_scrape_content'></a>

In [47]:
for bodies in soup.find_all('span',{'id':'pressrelease'}):
    print(bodies.text.strip())

In response to the passage of The Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region (the National Security Law) by the Standing Committee of the National People's Congress (NPCSC) today (June 30), the Chief Executive, Mrs Carrie Lam, made the following statement:
 
     Safeguarding national security is the constitutional duty of the Hong Kong Special Administrative Region (HKSAR). The HKSAR Government welcomes the passage of the National Security Law by the NPCSC today. This national law has been listed in Annex III of the Basic Law in accordance with Article 18 of the Basic Law after consulting the NPCSC's Committee for the Basic Law of the HKSAR and the HKSAR Government.  
 
     The HKSAR is an inalienable part of the People's Republic of China and a local administrative region which enjoys a high degree of autonomy and comes directly under the Central People's Government. Safeguarding national sovereignty, se

In [48]:
len(bodies)

35

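`len(bodies)` returns 35, which is the number of child nodes inside the `pressrelease` span, not the number of characters in the article text.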
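To illustrate what `len()` measures on a BeautifulSoup tag, the sketch below uses a made-up snippet (not the real press release) and compares the number of child nodes with the number of characters of text. All variable names are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the real <span id="pressrelease"> element
html = '<span id="pressrelease">First paragraph.<br/><br/>Second paragraph.</span>'
page = BeautifulSoup(html, 'html.parser')
body_tag = page.find('span', {'id': 'pressrelease'})

child_count = len(body_tag)            # direct children: two strings and two <br/> tags
char_count = len(body_tag.get_text())  # characters in the extracted text
print(child_count, char_count)
```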

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime.date` object.

In [49]:
for date in soup.find_all('div',{'class':'mB15 f15'}):
    print(date.text.strip())

Ends/Tuesday, June 30, 2020
				
				 Issued at HKT 18:49


In [50]:
date

<div class="mB15 f15">Ends/Tuesday, June 30, 2020
				<br/>
				 Issued at HKT 18:49
				</div>

In [51]:
date = remove_tags(str(date))
date

'Ends/Tuesday, June 30, 2020\n\t\t\t\t\n\t\t\t\t Issued at HKT 18:49\n\t\t\t\t'

In [52]:
len(date)

62

In [54]:
day_pub = date[14:27]
day_pub

'June 30, 2020'

In [59]:
# Remove the comma
day_pub = re.sub(',','',day_pub)
# Replace the month name with its number (hard-coded for June)
day_pub = re.sub('June','06',day_pub)
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()

day_pub

datetime.date(2020, 6, 30)
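The slice `date[14:27]` and the hard-coded `'June'` replacement are tied to this one page. One possible generalization is to locate the date with a regular expression and let `strptime` read the month name. The sketch below assumes that every release uses the same "Month D, YYYY" wording and that the interpreter runs with an English locale (needed for `%B`). `DATE_RE` and `parse_publication_date` are hypothetical names.

```python
import re
from datetime import datetime

# Matches dates written like "June 30, 2020" (assumed wording of the page footer)
DATE_RE = re.compile(r'([A-Z][a-z]+ \d{1,2}, \d{4})')


def parse_publication_date(text):
    """Extract the first 'Month D, YYYY' date from text and return a date object."""
    match = DATE_RE.search(text)
    if match is None:
        raise ValueError('no date found in: %r' % text)
    # %B parses the full month name, so no manual month substitution is needed
    return datetime.strptime(match.group(1), '%B %d, %Y').date()


raw_date = 'Ends/Tuesday, June 30, 2020\n\t\t\t\t\n\t\t\t\t Issued at HKT 18:49\n\t\t\t\t'
print(parse_publication_date(raw_date))
```

Because the pattern requires a space after the capitalized word, "Tuesday," is skipped and the match starts at the month name.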

In [60]:
df_date = pd.DataFrame([day_pub])

In [61]:
type(df_date)

pandas.core.frame.DataFrame

In [62]:
df_date

            0
0  2020-06-30


Now the title.

In [70]:
for title in soup.find_all('div',{'class':'fontSize1','id':'PRHeadline'}):
    title = title.text.strip()

In [71]:
title

"CE welcomes passage of The Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region by NPCSC"

In [72]:
title = remove_tags(str(title))

In [73]:
title

"CE welcomes passage of The Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region by NPCSC"

In [74]:
df_title = pd.DataFrame([title])

In [75]:
df_title

                                                   0
0  CE welcomes passage of The Law of the People's...


In [76]:
type(df_title)

pandas.core.frame.DataFrame

The country, source, and file name are set manually.

In [78]:
country = 'Hong Kong'
df_country = pd.DataFrame([country])
source = "Hong Kong Gov't"
df_source = pd.DataFrame([source])
file_name = 'hkgovt_15'
df_file_name = pd.DataFrame([file_name])

Finally, the article body.

In [79]:
news1 = []
for bodies in soup.find_all('span',{'id':'pressrelease'}):
    news1.append(bodies.text.strip())

In [80]:
news1

['In response to the passage of The Law of the People\'s Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region (the National Security Law) by the Standing Committee of the National People\'s Congress (NPCSC) today (June 30), the Chief Executive, Mrs Carrie Lam, made the following statement:\r\n\xa0\r\n\xa0\xa0\xa0\xa0\xa0Safeguarding national security is the constitutional duty of the Hong Kong Special Administrative Region (HKSAR). The HKSAR Government welcomes the passage of the National Security Law by the NPCSC today. This national law has been listed in Annex III of the Basic Law in accordance with Article 18 of the Basic Law after consulting the NPCSC\'s Committee for the Basic Law of the HKSAR and the HKSAR Government.\xa0\xa0\r\n\xa0\r\n\xa0\xa0\xa0\xa0 The HKSAR is an inalienable part of the People\'s Republic of China and a local administrative region which enjoys a high degree of autonomy and comes directly under the Central Peopl

In [81]:
df_news = pd.DataFrame()

In [82]:
df_news['article_body'] = news1

In [83]:
df_news.head(2)

                                        article_body
0  In response to the passage of The Law of the P...


In [84]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [85]:
df_news = df_news.article_body[0]

In [86]:
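# df_news is a plain str here, so replace() is literal (not a regex):
# it only removes the exact characters \\? if they occur in the text
df_news = df_news.replace(r'\\?','')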

In [87]:
df_news = pd.DataFrame([df_news])

In [88]:
type(df_news)

pandas.core.frame.DataFrame

In [89]:
df_news.columns = ['Article']

In [90]:
df_news.head()

                                             Article
0  In response to the passage of The Law of the P...
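The article text still contains `\r\n` line breaks and `\xa0` non-breaking spaces, as the `news1` output shows. Below is a minimal sketch of one way to collapse them, assuming single spaces are acceptable as separators. `normalize_whitespace` is a hypothetical helper, and the sample string is shortened from the output above.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (including \\r\\n and \\xa0) into one space."""
    # str.split() with no arguments splits on any Unicode whitespace
    return ' '.join(text.split())


sample = (
    'made the following statement:\r\n\xa0\r\n\xa0\xa0\xa0\xa0\xa0'
    'Safeguarding national security is the constitutional duty'
)
print(normalize_whitespace(sample))
```

Paragraph breaks are lost with this approach, so it fits best when the text is headed for tokenization rather than display.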


**Bringing it together.**<a id='2.5_bit'></a>

In [91]:
df_15_hkgovt = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [92]:
df_15_hkgovt.columns = ['file_name','date','source','country','title','article']

In [93]:
df_15_hkgovt.head()

   file_name        date           source    country                                              title                                            article
0  hkgovt_15  2020-06-30  Hong Kong Gov't  Hong Kong  CE welcomes passage of The Law of the People's...  In response to the passage of The Law of the P...
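As an alternative to building six one-cell DataFrames and concatenating them, the same row could be assembled from a single dictionary. This is a sketch with shortened placeholder strings for the title and article. `record` and `df_row` are hypothetical names.

```python
from datetime import date

import pandas as pd

# One dictionary per article; the keys become the column names, in this order
record = {
    'file_name': 'hkgovt_15',
    'date': date(2020, 6, 30),
    'source': "Hong Kong Gov't",
    'country': 'Hong Kong',
    'title': 'CE welcomes passage of ...',            # placeholder, shortened
    'article': 'In response to the passage of ...',   # placeholder, shortened
}
df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
```

This also removes the separate step of renaming the columns after the concatenation.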


**Saving**<a id='2.6_save'></a>

In [94]:
cd

C:\Users\rands


Saving it to CSV.

In [95]:
# index=False so that the DataFrame index is not written as an extra column in the CSV
df_15_hkgovt.to_csv('./_Capstone_Two_NLP/data/_news/hkgovt_15.csv', index=False)

print('Complete')

Complete
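`to_csv` fails if the target folder does not exist, and the relative path above depends on the current working directory. A small sketch of one way to guard against the first problem follows. `prepare_output_path` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent folders of `path` if they are missing and return it as a Path."""
    out_path = Path(path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    return out_path


# Usage in the notebook (df_15_hkgovt is the frame built above):
# df_15_hkgovt.to_csv(prepare_output_path('./_Capstone_Two_NLP/data/_news/hkgovt_15.csv'), index=False)
```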
