# Scraping "Wide support for proposed national security law" from China Daily

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# Review the date handling below for each website, notably the time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'https://www.chinadailyhk.com/article/134455'
content = requests.get(url).content
url

'https://www.chinadailyhk.com/article/134455'
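
The request above has no timeout and does not check the HTTP status, so a hanging or failed request could pass an error page to the parser. A minimal sketch of a more defensive fetch follows; `fetch_html`, its `timeout` default, and the injectable `get` parameter are illustrative names, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the raw response body, raising on HTTP error statuses.

    `get` defaults to requests.get; it is a parameter only so the
    function can be exercised without network access (an assumption
    made for this sketch).
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

It could be called as `content = fetch_html(url)` in place of `requests.get(url).content`.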

In [5]:
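# Use the class under its own name so that this cell can be re-run;
# `soup = soup(...)` rebinds the imported alias to the parsed document.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')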

**Grab | Date**<a id='2.4_scrape_date'></a>

In [18]:
for date_tag in soup.find_all('span',{'class':'news-date'}):
    print(date_tag.text.strip())

Saturday, June 20, 2020, 23:33
Thursday, January 01, 1970, 08:00


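The first match is the publication date. The second, `Thursday, January 01, 1970, 08:00`, is the Unix epoch shown in UTC+8, which suggests an empty placeholder timestamp elsewhere on the page.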
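
Since only the first `news-date` match is wanted, `find` (which returns the first match or `None`) avoids the loop. A small sketch on a stand-in HTML fragment follows; the fragment is modeled on the article's `news-info` block, and the position of the second, placeholder span is an assumption.

```python
from bs4 import BeautifulSoup

# Stand-in fragment: the first span mirrors the article's date markup;
# the position of the second (placeholder) span is assumed for illustration.
html = """
<div class="news-info cl am-hide-lg-only">
<span class="news-date">Saturday, June 20, 2020, 23:33</span>
<span class="news-edit-p">By China Daily</span>
</div>
<span class="news-date">Thursday, January 01, 1970, 08:00</span>
"""

page = BeautifulSoup(html, 'html.parser')

# find() returns the first matching tag, or None if nothing matches
first_date = page.find('span', {'class': 'news-date'})
date_text = first_date.get_text(strip=True) if first_date is not None else None
print(date_text)
```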

**Grab | Header**<a id='2.4_scrape_header'></a>

In [10]:
for title in soup.find_all('h5',{'class':'title_txt'}):
    print(title.text.strip())

Wide support for proposed national security law



Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [15]:
for bodies in soup.find_all('div',{'class':'news-cut'}):
    print(bodies.text.strip())

Residents sign to support the national security legislation in Hong Kong, May 24, 2020. (EDMOND TANG/CHINA DAILY)HONG KONG - Various sectors of Hong Kong voiced strong support for the proposed national security law for the special administrative region as details of the draft were released on Saturday following a three-day session of the Standing Committee of the National People’s Congress in Beijing. READ MORE: Top legislature reviews draft HK national security lawTam Yiu-chung, Hong Kong’s delegate to the Standing Committee, said in an interview live-streamed on Hong Kong’s i-Cable News that the legislators gave full consideration to suggestions for the draft from Hong Kong, such as stating that Hong Kong’s government will take charge of enforcement and prosecutions under the law except in certain circumstances.Hong Kong’s common law system has been taken into consideration in writing the draft law, said Tam Yiu-chung, Hong Kong’s delegate to the NPC Standing CommitteeThe draft clear

In [16]:
len(bodies)

23

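The printed text runs to the bottom of the article. Note that `len(bodies)` counts the direct children of the last matched tag (23 here), not the number of characters.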
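
In the printed text some sentences run together (for example `lawTam` and `circumstances.Hong`), most likely because `.text` concatenates the strings of adjacent child tags with no separator. One possible remedy, sketched on a hypothetical two-paragraph fragment, is `get_text` with a `separator`:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the real article markup may differ.
html = '<div class="news-cut"><p>First sentence.</p><p>Second sentence.</p></div>'
body = BeautifulSoup(html, 'html.parser').find('div', {'class': 'news-cut'})

# .text joins child strings directly, so the sentences touch
run_together = body.text.strip()

# get_text(separator=' ', strip=True) strips each string and joins with a space
separated = body.get_text(separator=' ', strip=True)

print(run_together)
print(separated)
```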

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime.date` object.

In [34]:
for date in soup.find_all('div',{'class':'news-info cl am-hide-lg-only'}):
    print(date.text.strip())

Saturday, June 20, 2020, 23:33
By China Daily


In [35]:
date

<div class="news-info cl am-hide-lg-only">
<span class="news-date">Saturday, June 20, 2020, 23:33</span>
<span class="news-edit-p">By China Daily</span>
</div>

In [36]:
# remove_tags is not defined in this notebook; Tag.get_text() strips the markup instead
date = date.get_text()
date

'\nSaturday, June 20, 2020, 23:33\nBy China Daily\n'
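
When only the HTML string is at hand, markup can also be stripped with a regular expression. The sketch below is a hypothetical stand-in for a `remove_tags`-style helper (the name `strip_markup` and the pattern are assumptions, not the author's definition); it is adequate for simple fragments like this one but is not a general HTML parser.

```python
import re

# Matches anything that looks like a tag: '<', then non-'>' characters, then '>'
TAG_PATTERN = re.compile(r'<[^>]+>')


def strip_markup(html):
    """Remove tag-like substrings, leaving text and whitespace intact."""
    return TAG_PATTERN.sub('', html)


fragment = (
    '<div class="news-info cl am-hide-lg-only">\n'
    '<span class="news-date">Saturday, June 20, 2020, 23:33</span>\n'
    '<span class="news-edit-p">By China Daily</span>\n'
    '</div>'
)
print(repr(strip_markup(fragment)))
```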

In [37]:
len(date)

47

In [38]:
day_pub = date[11:24]
day_pub

'June 20, 2020'

In [39]:
# Remove the comma
day_pub = re.sub(',','',day_pub)
# Replace the month name with its number (this handles June only)
day_pub = re.sub('June','06',day_pub)
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()

day_pub

datetime.date(2020, 6, 20)
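
The slice-and-substitute steps above work for this article but hard-code both the character offsets and the month (`June`). As an alternative sketch, `datetime.strptime` can parse the whole first line in one step; this assumes every article uses the same `Weekday, Month DD, YYYY, HH:MM` layout and that month and weekday names are in English under the active locale.

```python
from datetime import datetime

raw = '\nSaturday, June 20, 2020, 23:33\nBy China Daily\n'

# The first non-empty line holds the timestamp
first_line = raw.strip().split('\n')[0]

# %A = weekday name, %B = month name, %d = day, %Y = year, %H:%M = 24-hour time
published = datetime.strptime(first_line, '%A, %B %d, %Y, %H:%M')
day_pub = published.date()
print(day_pub)
```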

In [40]:
df_date = pd.DataFrame([day_pub])

In [41]:
type(df_date)

pandas.core.frame.DataFrame

In [42]:
df_date

Unnamed: 0,0
0,2020-06-20


Now the title.

In [93]:
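# find_all expects one tag name plus one attrs dict; the original call passed two
# (name, attrs) tuples and printed nothing. Assumption: the header block is div.news-hd.
title = soup.find('div', {'class': 'news-hd'})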

In [97]:
# remove_tags is not defined in this notebook; Tag.get_text() strips the markup instead
title = title.get_text()
title

'\n\nSaturday, June 20, 2020, 23:33\nPDF View\nRevert\n\nPDF View\nWide support for proposed national security law\n\nBy China Daily\n\n\n\n\n\n\n\n\n\nSaturday, June 20, 2020, 23:33\nBy China Daily\n\n'

In [101]:
title_new = title.split('\n')
title = title_new[7]
title

'Wide support for proposed national security law'
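
Indexing the split list at position 7 depends on the exact number of line breaks in the header block. Because the `h5` tag with class `title_txt` was shown earlier to contain only the headline, a less fragile sketch reads it directly. The fragment below is a simplified, hypothetical version of the header markup.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the header block; the real markup has more elements.
html = """
<div class="news-hd">
<h5 class="title_txt">
Wide support for proposed national security law
</h5>
<span class="news-edit-p">By China Daily</span>
</div>
"""

page = BeautifulSoup(html, 'html.parser')

# Read the headline tag directly instead of counting line breaks
headline = page.find('h5', {'class': 'title_txt'})
title = headline.get_text(strip=True) if headline is not None else ''
print(title)
```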

In [103]:
df_title = pd.DataFrame([title])

In [104]:
df_title

Unnamed: 0,0
0,Wide support for proposed national security law


In [105]:
type(df_title)

pandas.core.frame.DataFrame

These items are manually added.

In [106]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'China Daily'
df_source = pd.DataFrame([source])
file_name = 'china_daily_14'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [107]:
news1 = []
for bodies in soup.find_all('div',{'class':'news-cut'}):
    news1.append(bodies.text.strip())

In [108]:
news1

["Residents sign to support the national security legislation in Hong Kong, May 24, 2020. (EDMOND TANG/CHINA DAILY)HONG KONG - Various sectors of Hong Kong voiced strong support for the proposed national security law for the special administrative region as details of the draft were released on Saturday following a three-day session of the Standing Committee of the National People’s Congress in Beijing.\xa0READ MORE:\xa0Top legislature reviews draft HK national security lawTam Yiu-chung, Hong Kong’s delegate to the Standing Committee, said in an interview live-streamed on Hong Kong’s i-Cable News that the legislators gave full consideration to suggestions for the draft from Hong Kong, such as stating that Hong Kong’s government will take charge of enforcement and prosecutions under the law except in certain circumstances.Hong Kong’s common law system has been taken into consideration in writing the draft law, said\xa0Tam Yiu-chung, Hong Kong’s delegate to the NPC Standing CommitteeThe 

In [109]:
df_news = pd.DataFrame()

In [110]:
df_news['article_body'] = news1

In [111]:
df_news.head(2)

Unnamed: 0,article_body
0,Residents sign to support the national securit...


In [112]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [113]:
df_news = df_news.article_body[0]

In [114]:
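# str.replace matches literally, so r'\\?' was treated as plain text, not a regex;
# replace the non-breaking spaces seen in the scraped text instead
df_news = df_news.replace('\xa0', ' ')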

In [115]:
df_news = pd.DataFrame([df_news])

In [116]:
type(df_news)

pandas.core.frame.DataFrame

In [117]:
df_news.columns = ['Article']

In [118]:
df_news.head()

Unnamed: 0,Article
0,Residents sign to support the national securit...
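
The scraped text contains non-breaking spaces (`\xa0`), visible in the `news1` output. A minimal sketch of whitespace normalization with plain string methods follows; `normalize_whitespace` is a hypothetical helper name.

```python
def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to one space."""
    # Swap non-breaking spaces for regular ones, then split on whitespace
    # and rejoin so that every run becomes a single space
    return ' '.join(text.replace('\xa0', ' ').split())


sample = 'in Beijing.\xa0READ MORE:\xa0Top legislature reviews draft'
print(normalize_whitespace(sample))
```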


**Bringing it together.**<a id='2.5_bit'></a>

In [119]:
df_14_china_daily = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [120]:
df_14_china_daily.columns = ['file_name','date','source','country','title','article']

In [121]:
df_14_china_daily.head()

Unnamed: 0,file_name,date,source,country,title,article
0,china_daily_14,2020-06-20,China Daily,China,Wide support for proposed national security law,Residents sign to support the national securit...
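
The same one-row table can also be built in a single step from a dictionary, which names the columns at construction time instead of concatenating six one-cell frames. A sketch with the values from this notebook (the article text is abbreviated to a placeholder here, and `df_article` is an illustrative name):

```python
from datetime import date

import pandas as pd

record = {
    'file_name': 'china_daily_14',
    'date': date(2020, 6, 20),
    'source': 'China Daily',
    'country': 'China',
    'title': 'Wide support for proposed national security law',
    # Placeholder: the full scraped article text would go here
    'article': 'Residents sign to support the national security legislation ...',
}

# A list holding one dict yields a one-row DataFrame; keys become column names
df_article = pd.DataFrame([record])
print(df_article.shape)
```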


**Saving**<a id='2.6_save'></a>

In [122]:
cd

C:\Users\rands


Saving it to CSV.

In [123]:
# index=False omits the DataFrame index column from the CSV
df_14_china_daily.to_csv('./_Capstone_Two_NLP/data/_news/china_daily_14.csv', index=False)

print('Complete')

Complete
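
`to_csv` does not create missing folders, so the write fails if the target directory is absent. A hedged sketch that creates the directory first follows; `save_csv` is a hypothetical helper, and the `utf-8` encoding is an assumption chosen to preserve characters such as the curly apostrophes in the article text.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path as CSV, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False omits the DataFrame index column from the file
    df.to_csv(path, index=False, encoding='utf-8')
    return path


# Demonstration in a temporary directory rather than the project folder
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'data' / '_news' / 'example.csv'
    saved = save_csv(pd.DataFrame({'title': ['Example']}), target)
    round_trip = pd.read_csv(saved)

print(round_trip)
```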
