# Scraping Safeguarding long-term prosperity from People’s Daily

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [92]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [93]:
# Review the block below for each website, notably the date/time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [94]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [95]:
url = 'http://en.people.cn/n3/2020/0702/c90000-9706131.html'
content = requests.get(url).content
url

'http://en.people.cn/n3/2020/0702/c90000-9706131.html'
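The request above has no timeout and does not check the HTTP status, so a slow or failed response would pass silently into the parser. A minimal sketch of a more defensive fetch could look like the following. The helper name `fetch_html`, its `timeout` default, and the optional `session` parameter are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the raw response body for `url`, raising on HTTP errors.

    `session` is a hypothetical hook: any object with a requests-style
    `get(url, timeout=...)` method can be passed in, e.g. a requests.Session.
    """
    http = session or requests
    response = http.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.content
```

In the notebook this would replace the direct call, for example `content = fetch_html(url)`.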

In [96]:
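# Use the class's own name so that re-running this cell does not call
# the already-parsed `soup` object as if it were the BeautifulSoup class
from bs4 import BeautifulSoup
soup = BeautifulSoup(content, 'html.parser')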

**Grab | Date**<a id='2.4_scrape_date'></a>

In [97]:
# The loop variable is named `header` so it does not shadow the imported `date` class
for header in soup.find_all('div', {'class': 'w980 wb_10 clear'}):
    print(header.text.strip())

Law and order dawns in Hong Kong as new law takes effect
 (Xinhua)    08:42, July 02, 2020


That block contains the date, but it still needs cleaning up.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [98]:
# attrs is a dict mapping the attribute name to its value
for title in soup.find_all('div', {'class': 'w980 wb_10 clear'}):
    print(title)

<div class="w980 wb_10 clear">
<h1>Law and order dawns in Hong Kong as new law takes effect</h1>
<div> (<a href="http://www.xinhuanet.com/english/">Xinhua</a>)    08:42, July 02, 2020<em><a href="https://apple.news/TkJFGBPvhTwyCvquJc-AqRQ" target="_blank"><img alt="" src="/img/FOREIGN/2015/03/212677/images/icon36.gif"/></a></em></div>
</div>


In [99]:
TAG_RE = re.compile(r'<[^>]+>')

In [100]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [101]:
remove_tags(str(title))

'\nLaw and order dawns in Hong Kong as new law takes effect\n (Xinhua)\xa0\xa0\xa0\xa008:42, July 02, 2020\n'

That confirms the title is in there, though it takes a few more steps to isolate it.
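One caveat with stripping tags by regex is that HTML entities are left encoded, whereas BeautifulSoup's `get_text()` returns decoded text. The sketch below compares the two on a made-up snippet; the `snippet` string is a hypothetical example, not markup from the scraped page.

```python
import re

from bs4 import BeautifulSoup

TAG_RE = re.compile(r'<[^>]+>')

# Hypothetical markup containing an HTML entity
snippet = '<div><h1>Law &amp; order</h1></div>'

# Regex stripping removes the tags but leaves the entity untouched
regex_text = TAG_RE.sub('', snippet)

# get_text() walks the parsed tree and returns decoded text
soup_text = BeautifulSoup(snippet, 'html.parser').get_text()

print(repr(regex_text))
print(repr(soup_text))
```

For this page the regex approach is sufficient, since the headline and byline contain no entities other than the non-breaking spaces visible in the output above.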

**Grab | Content**<a id='2.4_scrape_content'></a>

In [102]:
for bodies in soup.find_all('div', {'class': 'wb_12 clear'}):
    print(bodies.text.strip())

-- The festive mood across Hong Kong on the 23rd anniversary of its return to the motherland stood in sharp contrast to the scenes a year ago when rioters stormed the Legislative Council complex and wreaked havoc inside.

-- The newly enacted law on safeguarding national security in Hong Kong would help restore stability, said HKSAR Chief Executive Carrie Lam.

-- The promulgation of the law marks a significant turning point for Hong Kong to move from turmoil to stability, and a major milestone for the practice of "one country, two systems," said Luo Huining, director of the Liaison Office of the Central People's Government in the HKSAR.

-- "If it were not for the national security legislation, I would decide to leave Hong Kong at once," said Angelo Giuliano, a Swiss expatriate.

	HONG KONG, July 1 (Xinhua) -- Hong Kong reached a significant turning point on the 23rd anniversary of its return to the motherland as a law on safeguarding national security came into force.

	Celebrations 

In [103]:
len(bodies)  # number of child nodes in the last matched <div>, not its character count

97

That confirms the article body is there, but the title and numbers still need to be pulled out.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime.date` object.

In [104]:
for day in soup.find_all('div', {'class': 'w980 wb_10 clear'}):
    print(day.text.strip())

Law and order dawns in Hong Kong as new law takes effect
 (Xinhua)    08:42, July 02, 2020


In [105]:
day

<div class="w980 wb_10 clear">
<h1>Law and order dawns in Hong Kong as new law takes effect</h1>
<div> (<a href="http://www.xinhuanet.com/english/">Xinhua</a>)    08:42, July 02, 2020<em><a href="https://apple.news/TkJFGBPvhTwyCvquJc-AqRQ" target="_blank"><img alt="" src="/img/FOREIGN/2015/03/212677/images/icon36.gif"/></a></em></div>
</div>

In [106]:
day = remove_tags(str(day))

In [107]:
len(day)

92

In [108]:
# Fixed offsets: these depend on the exact length of this article's headline
day_pub = day[78:91]
day_pub

'July 02, 2020'

In [109]:
# Remove the comma
day_pub = re.sub(',', '', day_pub)
# Replace the spaces with hyphens
day_pub = re.sub(' ', '-', day_pub)
# Swap the month name for its number to match the format string (handles July only)
day_pub = re.sub('July', '07', day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()
day_pub

datetime.date(2020, 7, 2)
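Slicing by fixed offsets and substituting the month name by hand only works for this headline length and for July. A more general sketch, assuming the byline always ends in a date of the form `Month DD, YYYY`, is to locate the date with a regular expression and let `strptime` parse the month name via `%B`. The pattern name `DATE_RE` is an illustrative choice, and `%B` assumes an English locale.

```python
import re
from datetime import datetime

# Text as produced by remove_tags(str(day)) in the cell above
text = ('\nLaw and order dawns in Hong Kong as new law takes effect\n'
        ' (Xinhua)\xa0\xa0\xa0\xa008:42, July 02, 2020\n')

# Capitalised month name, one or two digit day, comma, four digit year
DATE_RE = re.compile(r'([A-Z][a-z]+ \d{1,2}, \d{4})')

match = DATE_RE.search(text)
if match is None:
    raise ValueError('No publication date found in header text')

# %B parses the full month name (locale dependent; English assumed here)
day_pub = datetime.strptime(match.group(1), '%B %d, %Y').date()
print(day_pub)
```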

In [110]:
df_date = pd.DataFrame([day_pub])

In [111]:
type(df_date)

pandas.core.frame.DataFrame

In [112]:
df_date

|   | 0 |
|---|---|
| 0 | 2020-07-02 |


Now the title.

In [113]:
for title in soup.find_all('div', {'class': 'w980 wb_10 clear'}):
    print(title)

<div class="w980 wb_10 clear">
<h1>Law and order dawns in Hong Kong as new law takes effect</h1>
<div> (<a href="http://www.xinhuanet.com/english/">Xinhua</a>)    08:42, July 02, 2020<em><a href="https://apple.news/TkJFGBPvhTwyCvquJc-AqRQ" target="_blank"><img alt="" src="/img/FOREIGN/2015/03/212677/images/icon36.gif"/></a></em></div>
</div>


In [114]:
# Put the tag's child nodes in a list to work with
title_list = list(title)
# Keep only the <h1> (index 1, after the leading newline) and reassign
title = title_list[1]
# Remove the HTML tags
title = remove_tags(str(title))
title

'Law and order dawns in Hong Kong as new law takes effect'
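Indexing into the list of child nodes depends on the exact whitespace between tags. As a sketch of a less position-dependent route, the `<h1>` can be looked up by tag name and its text read with `get_text`. The `html` string below reuses the markup printed above with the byline shortened; the variable names `page`, `header`, and `headline` are illustrative.

```python
from bs4 import BeautifulSoup

# Markup taken from the notebook output above (byline links trimmed for brevity)
html = (
    '<div class="w980 wb_10 clear">\n'
    '<h1>Law and order dawns in Hong Kong as new law takes effect</h1>\n'
    '<div> (Xinhua)    08:42, July 02, 2020</div>\n'
    '</div>'
)

page = BeautifulSoup(html, 'html.parser')
header = page.find('div', {'class': 'w980 wb_10 clear'})

# Look the headline up by tag name instead of by position among the children
headline = header.find('h1').get_text(strip=True)
print(headline)
```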

In [115]:
df_title = pd.DataFrame([title])

In [116]:
df_title

|   | 0 |
|---|---|
| 0 | Law and order dawns in Hong Kong as new law ta... |


In [117]:
type(df_title)

pandas.core.frame.DataFrame

These fields are added manually.

In [118]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'People\'s Daily'
df_source = pd.DataFrame([source])
file_name = 'peoples_daily_3'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [124]:
news1 = []
for bodies in soup.find_all('div',{'class','wb_12 clear'}):
    news1.append(bodies.text.strip())

In [125]:
df_news = pd.DataFrame()

In [126]:
df_news['article_body'] = news1

In [127]:
df_news.head(2)

|   | article_body |
|---|---|
| 0 | -- The festive mood across Hong Kong on the 23... |


In [128]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [129]:
df_news = df_news.article_body[0]

In [130]:
# str.replace matches literally (no regex), so remove literal backslashes directly
df_news = df_news.replace('\\', '')

In [131]:
df_news = pd.DataFrame([df_news])

In [132]:
type(df_news)

pandas.core.frame.DataFrame

In [133]:
df_news.columns = ['Article']

In [134]:
df_news.head()

|   | Article |
|---|---|
| 0 | -- The festive mood across Hong Kong on the 23... |
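The round trip through a DataFrame column, `str.cat`, and back to a one-row DataFrame can be shortened by joining the list of text blocks directly. This is a minimal sketch of that idea; the strings in `news1` are placeholders standing in for the scraped blocks, and `article` is an illustrative name.

```python
# Placeholder blocks standing in for the scraped <div> texts
news1 = ['  First block of text.  ', '', 'Second block of text.']

# Strip each block, skip empty ones, and join with single spaces
article = ' '.join(part.strip() for part in news1 if part.strip())
print(article)
```

The resulting string can then be wrapped once, for example with `pd.DataFrame([article], columns=['Article'])`.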


**Bringing it together.**<a id='2.5_bit'></a>

In [135]:
df_3_peoples_daily = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [136]:
df_3_peoples_daily.columns = ['file_name','date','source','country','title','article']

In [137]:
df_3_peoples_daily.head()

|   | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | peoples_daily_3 | 2020-07-02 | People's Daily | China | Law and order dawns in Hong Kong as new law ta... | -- The festive mood across Hong Kong on the 23... |
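An alternative to concatenating six single-cell DataFrames and renaming the columns afterwards is to build the row from one dictionary, which keeps each value next to its column name. The sketch below assumes the same field values as above; the `article` value is a placeholder for the joined body text, and `record` and `df_article` are illustrative names.

```python
from datetime import date

import pandas as pd

record = {
    'file_name': 'peoples_daily_3',
    'date': date(2020, 7, 2),
    'source': "People's Daily",
    'country': 'China',
    'title': 'Law and order dawns in Hong Kong as new law takes effect',
    'article': 'Placeholder article text',  # stand-in for the joined body text
}

# A list containing one dict becomes a one-row DataFrame; keys become columns
df_article = pd.DataFrame([record])
print(df_article.columns.tolist())
```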


**Saving**<a id='2.6_save'></a>

In [138]:
cd

C:\Users\rands


Saving it to CSV.

In [139]:
# df = pd.DataFrame(reviewlist)

# index=False keeps the DataFrame index out of the CSV file
df_3_peoples_daily.to_csv('./_Capstone_Two_NLP/data/_news/peoples_daily_3.csv', index=False)

print('Complete')

Complete
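The save step relies on the current working directory and on the target folder already existing. A hedged sketch of a small helper that builds the path with `pathlib`, creates missing folders, and writes UTF-8 explicitly is shown below. The function name `save_article` and its parameters are assumptions for illustration only.

```python
from pathlib import Path

import pandas as pd


def save_article(df, out_dir, file_name):
    """Write `df` to <out_dir>/<file_name>.csv and return the path written."""
    out_path = Path(out_dir) / f'{file_name}.csv'
    # Create the target folder (and any parents) if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the DataFrame index out of the file
    df.to_csv(out_path, index=False, encoding='utf-8')
    return out_path
```

With the notebook's names this would be called as `save_article(df_3_peoples_daily, './_Capstone_Two_NLP/data/_news', 'peoples_daily_3')`.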
