# Scraping Safeguarding long-term prosperity from People’s Daily

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Bringing It Together](#2.5_bit)
* [2.8 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# Review the block below for each website, notably the date/time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'http://en.people.cn/n3/2020/0702/c90000-9706242.html'
content = requests.get(url).content
url

'http://en.people.cn/n3/2020/0702/c90000-9706242.html'
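
The request above has no timeout or status check, so a failed fetch could hand an error page to the parser. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the raw response body for `url`, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.content

# Usage (performs a network request, so it is left commented out here):
# content = fetch_html('http://en.people.cn/n3/2020/0702/c90000-9706242.html')
```

Any failure (bad URL, timeout, HTTP error) surfaces as a subclass of `requests.exceptions.RequestException`, which can be caught in one place when looping over many articles.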

In [5]:
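# Note: this rebinds the name `soup` from the BeautifulSoup class (imported above)
# to the parsed page, so re-run the import cell before re-running this one
soup = soup(content,'html.parser')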

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
# Use a name other than `date` so the `date` class imported from datetime is not shadowed
for date_div in soup.find_all('div',{'class':'w980 wb_10 clear'}):
    print(date_div.text.strip())

HKSAR national security law to put HK back on track: official
 (Xinhua)    09:52, July 02, 2020


That block contains the date, but it still needs cleaning up.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [7]:
for title in soup.find_all('div',{'class':'w980 wb_10 clear'}):
    print(title)

<div class="w980 wb_10 clear">
<h1>HKSAR national security law to put HK back on track: official</h1>
<div> (<a href="http://www.xinhuanet.com/english/">Xinhua</a>)    09:52, July 02, 2020<em><a href="https://apple.news/TkJFGBPvhTwyCvquJc-AqRQ" target="_blank"><img alt="" src="/img/FOREIGN/2015/03/212677/images/icon36.gif"/></a></em></div>
</div>


In [8]:
TAG_RE = re.compile(r'<[^>]+>')

In [9]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [10]:
remove_tags(str(title))

'\nHKSAR national security law to put HK back on track: official\n (Xinhua)\xa0\xa0\xa0\xa009:52, July 02, 2020\n'

Confirmed: that's the title, though it takes a few more steps to isolate it.
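
The `remove_tags` output above still carries newlines and non-breaking spaces (`\xa0`). A small sketch of one way to tidy that up in the same pass; the function name `remove_tags_and_tidy` and the shortened `sample` string are hypothetical and used only for illustration.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def remove_tags_and_tidy(text):
    """Strip HTML tags, then collapse runs of whitespace into single spaces."""
    no_tags = TAG_RE.sub('', text)
    # For str patterns, \s also matches the non-breaking space used in the byline
    return re.sub(r'\s+', ' ', no_tags).strip()

# Shortened stand-in for the header div printed above
sample = ('<div>\n<h1>HKSAR national security law</h1>\n'
          '<div> (Xinhua)\xa0\xa0\xa0\xa009:52, July 02, 2020</div>\n</div>')
print(remove_tags_and_tidy(sample))
```

A regex tag stripper is adequate for simple fragments like this one; for anything nested or malformed, the parser's own text extraction is usually the safer route.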

**Grab | Content**<a id='2.4_scrape_content'></a>

In [11]:
for bodies in soup.find_all('div',{'class':'wb_12 clear'}):
    print(bodies.text.strip())

Zhang Xiaoming, deputy director of the Hong Kong and Macao Affairs Office of the State Council, speaks during a press conference held by the State Council Information Office about the Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region (HKSAR) in Beijing, capital of China, July 1, 2020. (Xinhua/Jin Liangkuai)

	BEIJING, July 1 (Xinhua) -- The law on safeguarding national security in the Hong Kong Special Administrative Region (HKSAR) marks a turning point and will bring the region back on track of its development, an official said Wednesday.

	The law is designed to bring tranquility to Hong Kong, Zhang Xiaoming, deputy director of the Hong Kong and Macao Affairs Office of the State Council, made the remarks at a press conference in Beijing.

	The law targets only a tiny group of criminals who endanger national security and will be a "sharp sword" hanging over their heads, Zhang said, adding that it will also serve as a

In [12]:
len(bodies)

24

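Confirmed that's the body, but the title and numbers still need to be pulled out. Note that `len(bodies)` counts the tag's direct children, not the characters in its text.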
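
The body text printed above is separated by blank lines and leading tabs. One possible way to flatten it into a single string is sketched below; `raw_body` is an abbreviated stand-in for the scraped text, and the variable names are illustrative assumptions.

```python
# Abbreviated stand-in for bodies.text; the "..." marks text omitted here
raw_body = (
    "Zhang Xiaoming ... (Xinhua/Jin Liangkuai)\n\n"
    "\tBEIJING, July 1 (Xinhua) -- The law ... an official said Wednesday.\n\n"
    "\tThe law is designed to bring tranquility to Hong Kong ..."
)

# Keep non-empty lines only, trimming the leading tabs and stray spaces
paragraphs = [line.strip() for line in raw_body.splitlines() if line.strip()]

# Join the paragraphs into one string, as the notebook later stores a single article field
article = ' '.join(paragraphs)
print(len(paragraphs))
print(article)
```

Keeping the intermediate `paragraphs` list makes it easy to drop the photo caption (the first entry here) before joining, if that is wanted.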

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a datetime object.

In [13]:
for day in soup.find_all('div',{'class':'w980 wb_10 clear'}):
    print(day.text.strip())

HKSAR national security law to put HK back on track: official
 (Xinhua)    09:52, July 02, 2020


In [14]:
day

<div class="w980 wb_10 clear">
<h1>HKSAR national security law to put HK back on track: official</h1>
<div> (<a href="http://www.xinhuanet.com/english/">Xinhua</a>)    09:52, July 02, 2020<em><a href="https://apple.news/TkJFGBPvhTwyCvquJc-AqRQ" target="_blank"><img alt="" src="/img/FOREIGN/2015/03/212677/images/icon36.gif"/></a></em></div>
</div>

In [15]:
day = remove_tags(str(day))

In [16]:
len(day)

97

In [17]:
# Fixed offsets that depend on this article's headline length; re-check for other pages
day_pub = day[83:96]
day_pub

'July 02, 2020'

In [18]:
# Remove the comma
day_pub = re.sub(',','',day_pub)
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Replace the month name with its number to match the datetime format
# (hard-coded for July; other months need their own substitution)
day_pub = re.sub('July','07',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()
day_pub

datetime.date(2020, 7, 2)
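
The slice `day[83:96]` and the `'July'` substitution above are tied to this one article. A hedged alternative is to locate the date with a regular expression and let `strptime` parse the month name via `%B`. The pattern `DATE_RE` and the helper `parse_pub_date` are assumptions for illustration, and `%B` depends on an English locale.

```python
import re
from datetime import datetime

# Assumed pattern: capitalised month name, 1-2 digit day, comma, 4-digit year
DATE_RE = re.compile(r'[A-Z][a-z]+ \d{1,2}, \d{4}')

def parse_pub_date(text):
    """Return the first 'Month DD, YYYY' date found in `text`, or None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # %B parses the full month name (locale-dependent; English assumed here)
    return datetime.strptime(match.group(0), '%B %d, %Y').date()

# The tag-stripped header string shown earlier in the notebook
byline = ('\nHKSAR national security law to put HK back on track: official\n'
          ' (Xinhua)\xa0\xa0\xa0\xa009:52, July 02, 2020\n')
print(parse_pub_date(byline))
```

If a headline happened to contain a capitalised word followed by a number, comma, and year, `strptime` would raise `ValueError`, so a production version would likely want to catch that as well.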

In [19]:
df_date = pd.DataFrame([day_pub])

In [20]:
type(df_date)

pandas.core.frame.DataFrame

In [21]:
df_date

| | 0 |
|---|---|
| 0 | 2020-07-02 |


Now the title.

In [22]:
for title in soup.find_all('div',{'class':'w980 wb_10 clear'}):
    print(title)

<div class="w980 wb_10 clear">
<h1>HKSAR national security law to put HK back on track: official</h1>
<div> (<a href="http://www.xinhuanet.com/english/">Xinhua</a>)    09:52, July 02, 2020<em><a href="https://apple.news/TkJFGBPvhTwyCvquJc-AqRQ" target="_blank"><img alt="" src="/img/FOREIGN/2015/03/212677/images/icon36.gif"/></a></em></div>
</div>


In [23]:
# Put the div's children in a list to work with
title_list = list(title)
# Keep only the <h1> element (index 1; index 0 is a newline text node) and reassign
title = title_list[1]
# Remove the HTML tags
title = remove_tags(str(title))
title

'HKSAR national security law to put HK back on track: official'
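
Indexing into the div's children works here, but it relies on the headline sitting at position 1. A sketch of selecting the `<h1>` by tag name instead; the `html` string is an abbreviated copy of the div printed above, and the names `doc`, `header_div`, and `headline` are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Abbreviated copy of the header div printed above
html = '''<div class="w980 wb_10 clear">
<h1>HKSAR national security law to put HK back on track: official</h1>
<div> (<a href="http://www.xinhuanet.com/english/">Xinhua</a>)    09:52, July 02, 2020</div>
</div>'''

doc = BeautifulSoup(html, 'html.parser')
header_div = doc.find('div', {'class': 'w980 wb_10 clear'})

# Select the <h1> by tag name rather than by its position among the div's children
headline = header_div.find('h1').get_text(strip=True)
print(headline)
```

Because `get_text` returns plain text, no separate tag-stripping step is needed for the headline.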

In [24]:
df_title = pd.DataFrame([title])

In [25]:
df_title

| | 0 |
|---|---|
| 0 | HKSAR national security law to put HK back on ... |


In [26]:
type(df_title)

pandas.core.frame.DataFrame

These items are manually added.

In [27]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'People\'s Daily'
df_source = pd.DataFrame([source])
file_name = 'peoples_daily_5'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [28]:
news1 = []
for bodies in soup.find_all('div',{'class':'wb_12 clear'}):
    news1.append(bodies.text.strip())

In [29]:
df_news = pd.DataFrame()

In [30]:
df_news['article_body'] = news1

In [31]:
df_news.head(2)

| | article_body |
|---|---|
| 0 | Zhang Xiaoming, deputy director of the Hong Ko... |


In [32]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [33]:
df_news = df_news.article_body[0]

In [34]:
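# str.replace matches literally (not as a regex), so this only removes the exact
# three-character sequence \\? ; use re.sub if a regex match is intended
df_news = df_news.replace(r'\\?','')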

In [35]:
df_news = pd.DataFrame([df_news])

In [36]:
type(df_news)

pandas.core.frame.DataFrame

In [37]:
df_news.columns = ['Article']

In [38]:
df_news.head()

| | Article |
|---|---|
| 0 | Zhang Xiaoming, deputy director of the Hong Ko... |


**Bringing it together.**<a id='2.5_bit'></a>

In [39]:
df_5_peoples_daily = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [40]:
df_5_peoples_daily.columns = ['file_name','date','source','country','title','article']

In [41]:
df_5_peoples_daily.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | peoples_daily_5 | 2020-07-02 | People's Daily | China | HKSAR national security law to put HK back on ... | Zhang Xiaoming, deputy director of the Hong Ko... |
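
The same one-row frame can also be built in a single step from a dictionary, which avoids concatenating six single-cell DataFrames and renaming the columns afterwards. This is only a sketch: the values in `record` are placeholders standing in for the variables built in the cells above, and the article text is abbreviated.

```python
from datetime import date
import pandas as pd

# Placeholder values standing in for the variables built in the cells above
record = {
    'file_name': 'peoples_daily_5',
    'date': date(2020, 7, 2),
    'source': "People's Daily",
    'country': 'China',
    'title': 'HKSAR national security law to put HK back on track: official',
    'article': 'Zhang Xiaoming, deputy director of the Hong Kong and Macao Affairs Office ...',
}

# A list containing one dict yields a one-row frame; the dict keys become column names
df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
print(df_row.shape)
```

Collecting one such dictionary per article and passing the whole list to `pd.DataFrame` would extend this to many articles at once.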


**Saving**<a id='2.6_save'></a>

In [42]:
cd

C:\Users\rands


Saving it to CSV.

In [43]:
# index=False so the DataFrame index is not written as an extra column in the CSV
df_5_peoples_daily.to_csv('./_Capstone_Two_NLP/data/_news/peoples_daily_5.csv', index=False)

print('Complete')

Complete
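
The save step above assumes the target folder already exists relative to the current working directory. A minimal sketch of a save helper that creates the folder first; the name `save_frame` is a hypothetical addition, not part of the original notebook.

```python
from pathlib import Path
import pandas as pd

def save_frame(df, csv_path):
    """Write `df` to `csv_path`, creating parent folders if they are missing."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the DataFrame index out of the file
    df.to_csv(csv_path, index=False)
    return csv_path
```

Called as `save_frame(df_5_peoples_daily, './_Capstone_Two_NLP/data/_news/peoples_daily_5.csv')`, this would write the same file as the cell above without depending on the folder having been created beforehand.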
