# Scraping Safeguarding long-term prosperity from People’s Daily

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# Review the date handling below for each website; the time format in particular varies by source

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'http://en.people.cn/n3/2020/0702/c90000-9706374.html'
content = requests.get(url).content
url

'http://en.people.cn/n3/2020/0702/c90000-9706374.html'
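The request above has no timeout and does not check the HTTP status, so a slow or failing server could hang the cell or hand an error page to the parser. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html`, the 10-second default, and the optional `session` argument are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the raw response body for `url`, raising on HTTP errors.

    `fetch_html`, `timeout=10`, and `session` are hypothetical names and
    defaults chosen for this sketch.
    """
    # Fall back to the module-level requests API when no session is passed.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

Under these assumptions, `content = fetch_html(url)` would be a drop-in replacement for `requests.get(url).content`.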

In [5]:
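# Note: this rebinds `soup` from the BeautifulSoup class to the parsed page,
# so re-run the import cell before re-running this one.
soup = soup(content,'html.parser')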

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
for date in soup.find_all('div',{'class':'wb_1 clear'}):
    print(date.text.strip())

By Md Enamul Hassan (People's Daily Online)    16:02, July 02, 2020


That line contains the date, but it still needs cleaning up.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [7]:
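# attrs must be a dict ({'id': 'p_title'}, not a set), and get_text is a method that needs parentheses
for title in soup.find_all('span',{'id':'p_title'}):
    print(title.get_text().strip())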

In [8]:
for title in soup.find_all('span',{'id':'p_title'}):
    print(title)

<span id="p_title">National Security Law paves the way for more prosperous Hong Kong</span>


In [9]:
TAG_RE = re.compile(r'<[^>]+>')

In [10]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [11]:
remove_tags(str(title))

'National Security Law paves the way for more prosperous Hong Kong'

That confirms the title, although it took a few extra steps to extract it.
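The regex-based `remove_tags` route works, but BeautifulSoup can also return the text of a tag directly. The sketch below assumes the same `<span id="p_title">` markup printed above and uses a stand-in HTML string rather than the live page; the names `sample_html`, `page`, `title_tag`, and `title_text` are illustrative only.

```python
from bs4 import BeautifulSoup

# Stand-in markup copied from the <span> printed above; the real page has more around it.
sample_html = (
    '<span id="p_title">National Security Law paves the way '
    'for more prosperous Hong Kong</span>'
)
page = BeautifulSoup(sample_html, 'html.parser')

# Look the tag up by its id attribute.
title_tag = page.find('span', id='p_title')
# find() returns None when nothing matches, so guard before reading the text.
title_text = title_tag.get_text(strip=True) if title_tag is not None else None
print(title_text)
```

Using `find` with a guard avoids relying on a loop variable that keeps its last value after the loop ends.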

**Grab | Content**<a id='2.4_scrape_content'></a>

In [12]:
for bodies in soup.find_all('div',{'class':'wb_12 wb_12b clear'}):
    print(bodies.text.strip())

Hong Kong, the popular Asian city, returned to its motherland China after a century of British colonial rule on this very day 23 years ago, on July 1, 1997. Hong Kong was an inseparable part of China, but became a British colony after the Qing dynasty ceded it to the British Empire in 1842 through the Treaty of Nanjing, ending the First Opium War.



Hong Kong citizens celebrate the passage of the Law of the People's Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region (HKSAR) in Causeway Bay of south China's Hong Kong, June 30, 2020. The law was passed at the 20th session of the Standing Committee of the 13th National People's Congress (NPC). (Xinhua/Wang Shen)

	To commemorate its return to the motherland, Hongkongers are celebrating the historic occasion in a befitting manner today. The central government of China made the celebrations very special by passing the National Security Law for Hong Kong on June 30. Hongkongers have warmly wel

In [13]:
len(bodies)

37

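`len(bodies)` counts the child nodes of the last matching `div` (37 here), which suggests the loop reached the bottom of the article.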

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime.date` object.

In [14]:
for date in soup.find_all('div',{'class':'wb_1 clear'}):
    print(date.text.strip())

By Md Enamul Hassan (People's Daily Online)    16:02, July 02, 2020


In [15]:
date = str(date)

In [16]:
date = remove_tags(date)
date

"By Md\xa0Enamul\xa0Hassan (People's Daily Online)\xa0\xa0\xa0\xa016:02, July 02, 2020"

In [17]:
len(date)

67

In [18]:
day_pub = date[54:67]
day_pub

'July 02, 2020'

In [19]:
# Parse the month name, day, and year directly; %B matches the full month name,
# so this also works for months other than July.
day_pub = datetime.strptime(day_pub, '%B %d, %Y').date()

day_pub

datetime.date(2020, 7, 2)
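The fixed slice `date[54:67]` only works while the byline has exactly this length, so a longer author name would shift the date out of range. A minimal sketch of a position-independent alternative follows, assuming the date always appears as "Month DD, YYYY" somewhere in the byline; `DATE_RE` and `parse_pub_date` are hypothetical names, not part of the original notebook.

```python
import re
from datetime import datetime

# Byline text as shown in the notebook output, including the non-breaking spaces.
byline = (
    "By Md\xa0Enamul\xa0Hassan (People's Daily Online)"
    "\xa0\xa0\xa0\xa016:02, July 02, 2020"
)

# Match "<Month name> <day>, <4-digit year>" anywhere in the string.
DATE_RE = re.compile(r'([A-Z][a-z]+ \d{1,2}, \d{4})')


def parse_pub_date(text):
    """Return the publication date found in `text`, or None if absent."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    # %B parses the full month name (English names in the default C locale).
    return datetime.strptime(match.group(1), '%B %d, %Y').date()


print(parse_pub_date(byline))
```

Returning `None` when no date is found lets the caller decide how to handle pages with a different byline layout.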

In [20]:
df_date = pd.DataFrame([day_pub])

In [21]:
type(df_date)

pandas.core.frame.DataFrame

In [22]:
df_date

| | 0 |
|---|---|
| 0 | 2020-07-02 |


Now the title.

In [23]:
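# attrs must be a dict ({'id': 'p_title'}, not a set), and get_text is a method that needs parentheses
for title in soup.find_all('span',{'id':'p_title'}):
    print(title.get_text().strip())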

In [24]:
title

<span id="p_title">National Security Law paves the way for more prosperous Hong Kong</span>

In [25]:
title = remove_tags(str(title))

In [26]:
title

'National Security Law paves the way for more prosperous Hong Kong'

In [27]:
df_title = pd.DataFrame([title])

In [28]:
df_title

| | 0 |
|---|---|
| 0 | National Security Law paves the way for more p... |


In [29]:
type(df_title)

pandas.core.frame.DataFrame

The country, source, and file name are entered by hand.

In [30]:
country = 'China'
df_country = pd.DataFrame([country])
source = "People's Daily"
df_source = pd.DataFrame([source])
file_name = 'peoples_daily_4'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [31]:
news1 = []
for bodies in soup.find_all('div',{'class':'wb_12 wb_12b clear'}):
    news1.append(bodies.text.strip())

In [32]:
df_news = pd.DataFrame()

In [33]:
df_news['article_body'] = news1

In [34]:
df_news.head(2)

| | article_body |
|---|---|
| 0 | Hong Kong, the popular Asian city, returned to... |


In [35]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [36]:
df_news = df_news.article_body[0]

In [37]:
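# df_news is a plain string here, and str.replace matches text literally (not as a regex),
# so the pattern r'\\?' would only remove that exact character sequence.
# Assuming the intent was to strip stray backslashes:
df_news = df_news.replace('\\', '')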

In [38]:
df_news = pd.DataFrame([df_news])

In [39]:
type(df_news)

pandas.core.frame.DataFrame

In [40]:
df_news.columns = ['Article']

In [41]:
df_news.head()

| | Article |
|---|---|
| 0 | Hong Kong, the popular Asian city, returned to... |


**Bringing it together.**<a id='2.5_bit'></a>

In [42]:
df_4_peoples_daily = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [43]:
df_4_peoples_daily.columns = ['file_name','date','source','country','title','article']

In [44]:
df_4_peoples_daily.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | peoples_daily_4 | 2020-07-02 | People's Daily | China | National Security Law paves the way for more p... | Hong Kong, the popular Asian city, returned to... |
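Concatenating six one-cell DataFrames and then renaming the columns works, but the same one-row frame can be built in a single step from a dictionary. The sketch below is one possible alternative, not the notebook's method; `record` and `df_row` are hypothetical names, and the article text is a shortened placeholder rather than the full body.

```python
from datetime import date

import pandas as pd

# Values copied from the cells above; the article text is shortened for illustration.
record = {
    'file_name': 'peoples_daily_4',
    'date': date(2020, 7, 2),
    'source': "People's Daily",
    'country': 'China',
    'title': 'National Security Law paves the way for more prosperous Hong Kong',
    'article': 'Hong Kong, the popular Asian city, returned to...',
}

# A list holding one dict yields a one-row frame whose columns follow the dict keys.
df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
```

Because the column names come from the dictionary keys, there is no separate renaming step to keep in sync with the concatenation order.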


**Saving**<a id='2.6_save'></a>

In [45]:
cd

C:\Users\rands


Saving it to CSV.

In [46]:
# index=False keeps the DataFrame index out of the file; the spreadsheet's own row numbers are enough
df_4_peoples_daily.to_csv('./_Capstone_Two_NLP/data/_news/peoples_daily_4.csv', index=False)

print('Complete')

Complete
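`to_csv` does not create missing folders, so the save fails if the target directory does not exist yet. A minimal sketch of creating the folder first and reading the file back as a check follows; it writes to a temporary directory with a stand-in frame (`df_demo`, `out_path`, and `df_check` are hypothetical names) so that it does not touch the notebook's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frame; in the notebook this would be df_4_peoples_daily.
df_demo = pd.DataFrame([{'file_name': 'peoples_daily_4', 'country': 'China'}])

# A temporary directory keeps the sketch self-contained; the notebook's real
# target is './_Capstone_Two_NLP/data/_news/peoples_daily_4.csv'.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'data' / '_news' / 'peoples_daily_4.csv'
    # Create the parent folders before writing.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df_demo.to_csv(out_path, index=False)
    # Read the file back to confirm what was written.
    df_check = pd.read_csv(out_path)

print(df_check.shape)
```

Reading the file back is a stronger confirmation than printing a fixed message, since it shows the rows and columns that actually reached disk.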
