# Scraping Long-term stability from People’s Daily

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# The date handling below needs to be reviewed for each website, notably the time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

For reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'http://en.people.cn/n3/2020/0702/c90000-9706124.html'
content = requests.get(url).content
url

'http://en.people.cn/n3/2020/0702/c90000-9706124.html'
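The request above has no timeout or status check, so a slow or failed response would pass straight into the parser. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, the 10-second default, and the injectable `get` argument are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the raw bytes of a page, raising on HTTP errors.

    `fetch_html`, the 10-second default, and the `get` argument are
    hypothetical choices made for this sketch.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content
```

With this helper, the cell above could be written as `content = fetch_html(url)`.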

In [5]:
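# Note: this rebinds `soup` from the BeautifulSoup class (imported above) to the
# parsed document, so re-running this cell will not re-parse the page.
soup = soup(content,'html.parser')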

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
for date in soup.find_all('div',{'class':'wb_1 clear'}):
    print(date.text.strip())

(People's Daily)    08:39, July 02, 2020


That is the date, but it needs some cleaning up.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [7]:
for title in soup.find('h2'):
    print(title.text.strip())

Hong Kong national security law helps ensure long-term stability of "one country, two systems"


In [8]:
title

<span id="p_title">Hong Kong national security law helps ensure long-term stability of "one country, two systems"</span>

In [9]:
TAG_RE = re.compile(r'<[^>]+>')

In [10]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [11]:
remove_tags(str(title))

'Hong Kong national security law helps ensure long-term stability of "one country, two systems"'

That confirms the title, although it took a few extra steps to get there.
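The title can also be read without a regex by asking Beautiful Soup for the element's text directly. A minimal sketch, assuming the `<h2>` wraps a single `<span id="p_title">` as in the output above. The HTML string here is a stand-in for the real page, and the names `page`, `heading`, and `title_text` are illustrative.

```python
from bs4 import BeautifulSoup

# Stand-in HTML mirroring the structure shown above; on the real page the
# parsed document from the earlier cell would be used instead.
sample_html = (
    '<h2><span id="p_title">Hong Kong national security law helps ensure '
    'long-term stability of "one country, two systems"</span></h2>'
)

page = BeautifulSoup(sample_html, "html.parser")
heading = page.find("h2")

# find() returns None when no <h2> exists, so guard before reading the text.
title_text = heading.get_text(strip=True) if heading is not None else None
print(title_text)
```

This avoids iterating over the children of the `<h2>`, which only works while the heading has exactly one child element.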

**Grab | Content**<a id='2.4_scrape_content'></a>

In [12]:
for bodies in soup.find_all('div',{'class':'wb_12 wb_12b clear'}):
    print(bodies.text.strip())

On June 30, the Standing Committee of the 13th National People's Congress (NPC) passed the Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region (HKSAR) and adopted a decision to list the law in Annex III to the HKSAR Basic Law.

	As a major move of the central government to manage Hong Kong affairs since its return to the motherland in 1997, the law will fully and faithfully implement the principle of “one country, two systems” and the HKSAR Basic Law, help safeguard national sovereignty, security and development interests, maintain Hong Kong's lasting prosperity and stability, and ensure the long-term stability of "one country, two systems". It bears both practical and historical significance.

	The practice of "one country, two systems" has achieved a universally recognized success in Hong Kong since its return to the motherland. However, it has also encountered new circumstances and problems.

	Especially since the di

In [13]:
len(bodies)

51

`bodies` still refers to the last matching `div`, and its 51 child nodes suggest the capture runs to the bottom of the article.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime` object.

In [14]:
for date in soup.find_all('div',{'class':'wb_1 clear'}):
    print(date.text.strip())

(People's Daily)    08:39, July 02, 2020


In [15]:
date

<div class="wb_1 clear"> (<a href="http://en.people.cn">People's Daily</a>)    08:39, July 02, 2020<em><a href="https://apple.news/TkJFGBPvhTwyCvquJc-AqRQ" target="_blank"><img alt="" src="/img/FOREIGN/2015/03/212677/images/icon36.gif"/></a></em></div>

In [16]:
date = remove_tags(str(date))
date

" (People's Daily)\xa0\xa0\xa0\xa008:39, July 02, 2020"

In [17]:
len(date)

41

In [18]:
# Fixed-position slice: valid only for this byline's exact layout
day_pub = date[28:41]
day_pub

'July 02, 2020'

In [19]:
# Parse the month name, day, and year directly; %B matches the full month
# name, so this also works for articles published in other months.
day_pub = datetime.strptime(day_pub, '%B %d, %Y').date()

day_pub

datetime.date(2020, 7, 2)
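Slicing the byline at fixed positions depends on this particular page's layout. A minimal sketch of a more general approach follows, assuming bylines keep the `Month DD, YYYY` pattern seen above and that month names are in English. The helper name `parse_pub_date` and the pattern `DATE_RE` are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "July 02, 2020": month name, day, comma, four-digit year.
DATE_RE = re.compile(r"([A-Z][a-z]+ \d{1,2}, \d{4})")


def parse_pub_date(byline):
    """Extract the publication date from a byline string.

    Returns a datetime.date, or None if no date-like text is found.
    """
    match = DATE_RE.search(byline)
    if match is None:
        return None
    # %B parses the full month name (locale-dependent; English assumed here).
    return datetime.strptime(match.group(1), "%B %d, %Y").date()


byline = " (People's Daily)\xa0\xa0\xa0\xa008:39, July 02, 2020"
print(parse_pub_date(byline))
```

Because the pattern searches for the date rather than counting characters, it tolerates changes in the source label or the number of non-breaking spaces before the timestamp.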

In [20]:
df_date = pd.DataFrame([day_pub])

In [21]:
type(df_date)

pandas.core.frame.DataFrame

In [22]:
df_date

Unnamed: 0,0
0,2020-07-02


Now the title.

In [23]:
for title in soup.find('h2'):
    print(title.text.strip())

Hong Kong national security law helps ensure long-term stability of "one country, two systems"


In [24]:
title

<span id="p_title">Hong Kong national security law helps ensure long-term stability of "one country, two systems"</span>

In [25]:
title = remove_tags(str(title))

In [26]:
title

'Hong Kong national security law helps ensure long-term stability of "one country, two systems"'

In [27]:
df_title = pd.DataFrame([title])

In [28]:
df_title

Unnamed: 0,0
0,Hong Kong national security law helps ensure l...


In [29]:
type(df_title)

pandas.core.frame.DataFrame

The country, source, and file name are added manually.

In [30]:
country = 'China'
df_country = pd.DataFrame([country])
source = "People's Daily"
df_source = pd.DataFrame([source])
file_name = 'peoples_daily_2'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [31]:
news1 = []
for bodies in soup.find_all('div',{'class':'wb_12 wb_12b clear'}):
    news1.append(bodies.text.strip())

In [32]:
news1

['On June 30, the Standing Committee of the 13th National People\'s Congress (NPC) passed the Law of the People\'s Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region (HKSAR) and adopted a decision to list the law in Annex III to the HKSAR Basic Law.\n\n\tAs a major move of the central government to manage Hong Kong affairs since its return to the motherland in 1997, the law will fully and faithfully implement the principle of “one country, two systems” and the HKSAR Basic Law, help safeguard national sovereignty, security and development interests, maintain Hong Kong\'s lasting prosperity and stability, and ensure the long-term stability of "one country, two systems". It bears both practical and historical significance.\n\n\tThe practice of "one country, two systems" has achieved a universally recognized success in Hong Kong since its return to the motherland. However, it has also encountered new circumstances and problems.\n\n\tEspeciall

In [33]:
df_news = pd.DataFrame()

In [34]:
df_news['article_body'] = news1

In [35]:
df_news.head(2)

Unnamed: 0,article_body
0,"On June 30, the Standing Committee of the 13th..."


In [36]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [37]:
df_news = df_news.article_body[0]

In [38]:
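# df_news is a plain str here, and str.replace matches only literal text,
# so use re.sub to strip any stray backslashes.
df_news = re.sub(r'\\', '', df_news)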
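The article text printed earlier still contains `\n\n\t` separators between paragraphs. If single-spaced text is wanted for later processing, a minimal sketch of whitespace normalization could look like the following. The helper name `normalize_whitespace` and the sample string are hypothetical.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    # For str patterns, \s matches Unicode whitespace, including newlines,
    # tabs, and non-breaking spaces (\xa0).
    return re.sub(r"\s+", " ", text).strip()


sample = "First paragraph.\n\n\tSecond paragraph.\xa0\xa0Third."
print(normalize_whitespace(sample))
```

Whether to flatten paragraph breaks depends on the downstream task, so this step is optional.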

In [39]:
df_news = pd.DataFrame([df_news])

In [40]:
type(df_news)

pandas.core.frame.DataFrame

In [41]:
df_news.columns = ['Article']

In [42]:
df_news.head()

Unnamed: 0,Article
0,"On June 30, the Standing Committee of the 13th..."


**Bringing it together.**<a id='2.5_bit'></a>

In [43]:
df_2_peoples_daily = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [44]:
df_2_peoples_daily.columns = ['file_name','date','source','country','title','article']

In [45]:
df_2_peoples_daily.head()

Unnamed: 0,file_name,date,source,country,title,article
0,peoples_daily_2,2020-07-02,People's Daily,China,Hong Kong national security law helps ensure l...,"On June 30, the Standing Committee of the 13th..."
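The same one-row table can be built in a single step from a dictionary instead of concatenating six one-cell DataFrames and renaming the columns afterwards. A minimal sketch, with values copied from the cells above and the article text shortened for illustration. The names `record` and `df_row` are hypothetical.

```python
from datetime import date

import pandas as pd

# Values copied from the cells above; the article text is shortened here
# purely for illustration.
record = {
    "file_name": "peoples_daily_2",
    "date": date(2020, 7, 2),
    "source": "People's Daily",
    "country": "China",
    "title": 'Hong Kong national security law helps ensure long-term '
             'stability of "one country, two systems"',
    "article": "On June 30, the Standing Committee of the 13th...",
}

# A list containing one dict yields a one-row DataFrame whose columns are
# the dict keys, in insertion order.
df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
```

Naming each field where its value is assigned removes the risk of the column list drifting out of order relative to the concatenated frames.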


**Saving**<a id='2.6_save'></a>

In [46]:
cd

C:\Users\rands


Saving it to CSV.

In [47]:
# index=False leaves the DataFrame index out of the CSV file
df_2_peoples_daily.to_csv('./_Capstone_Two_NLP/data/_news/peoples_daily_2.csv', index=False)

print('Complete')

Complete
