# Scraping "IPCC's report comprehensive, objective, fact-based, weighty" from Xinhua

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# The date handling below needs to be reviewed for each website, notably the time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'http://www.xinhuanet.com/english/2020-05/16/c_139060570.htm'
content = requests.get(url).content
url

'http://www.xinhuanet.com/english/2020-05/16/c_139060570.htm'
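A bare `requests.get(url)` has no default timeout and does not raise on 4xx/5xx responses, so a stalled server or an error page can slip through unnoticed. A minimal sketch of a more defensive fetch, assuming a 10-second timeout is acceptable; `fetch_html` and its `session` parameter are hypothetical names, not part of the original notebook:

```python
import requests

def fetch_html(url, timeout=10.0, session=None):
    """Return the raw response body, raising on HTTP errors or timeouts."""
    client = session or requests          # any object with a requests-style .get()
    response = client.get(url, timeout=timeout)
    response.raise_for_status()           # turn 4xx/5xx responses into exceptions
    return response.content

# Usage (needs network access):
# content = fetch_html('http://www.xinhuanet.com/english/2020-05/16/c_139060570.htm')
```

The optional `session` argument is only there so a `requests.Session` (or a stub in tests) can be passed in; leaving it out falls back to the module-level `requests.get`.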

In [5]:
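# Rebinds `soup` from the BeautifulSoup alias to the parsed page; re-run the import cell before re-running this one
soup = soup(content,'html.parser')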

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
for byline in soup.find_all('div',{'class':'wzzy'}):
    print(byline.text.strip())

Source: Xinhua| 2020-05-16 02:17:00|Editor: huaxia


In [7]:
byline

<div class="wzzy">
<i class="source">Source: Xinhua</i>|<i class="time"> 2020-05-16 02:17:00</i>|<i class="editor">Editor: huaxia</i>
</div>

That element contains the publication date, but it is embedded in the byline and still needs cleaning.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [8]:
for title in soup.find_all('h1',{'class':'Btitle'}):
    print(title)

<h1 class="Btitle">
IPCC's report comprehensive, objective, fact-based, weighty: HKSAR chief executive 
</h1>


In [9]:
TAG_RE = re.compile(r'<[^>]+>')

In [10]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [11]:
remove_tags(str(title))

"\r\nIPCC's report comprehensive, objective, fact-based, weighty: HKSAR chief executive \r\n"

That confirms the title, though it took an extra step to strip the tags.
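The `\r\n` padding in the result above can be removed in the same step. A minimal sketch using only the standard library, assuming the title markup is as simple as the `<h1>` shown; `clean_title` is a hypothetical helper:

```python
import re

TAG_RE = re.compile(r'<[^>]+>')

def clean_title(html):
    """Strip tags, then collapse runs of whitespace into single spaces."""
    text = TAG_RE.sub('', html)
    return ' '.join(text.split())

raw = ('<h1 class="Btitle">\r\nIPCC\'s report comprehensive, objective, '
       'fact-based, weighty: HKSAR chief executive \r\n</h1>')
print(clean_title(raw))
```

A regex like this is adequate for a single flat element; for anything nested, the parser's own `.text` attribute is the safer route.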

**Grab | Content**<a id='2.4_scrape_content'></a>

In [12]:
for bodies in soup.find_all('div',{'class':'content'}):
    print(bodies.text.strip())

Video PlayerClose



Carrie Lam, chief executive of China's Hong Kong Special Administrative Region (HKSAR), attends a press conference in Hong Kong, south China, May 15, 2020. (Xinhua/Lui Siu Wai)
HONG KONG, May 15 (Xinhua) -- Carrie Lam, chief executive of China's Hong Kong Special Administrative Region (HKSAR) said Friday that the report released by the Independent Police Complaints Council (IPCC) was comprehensive, objective, fact-based and weighty.
The IPCC, Hong Kong's police watchdog, released on Friday a report in regard to the social unrest in Hong Kong since last June.
The IPCC submitted to Lam its "Thematic Study Report on the Public Order Events arising from the Fugitive Offenders Bill since June 2019 and the Police Actions in Response" on Friday, which is available for viewing by members of the public.
"The IPCC has examined a large volume of information and has made detailed and objective representation of facts in the report," Lam said.
She condemned the violence amid th

In [13]:
len(bodies)

33

That confirms the article body, but it needs cleaning. Note that `len(bodies)` counts the tag's direct children, not characters.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime.date` object.

In [14]:
for day in soup.find_all('div',{'class':'wzzy'}):
    print(day.text.strip())

Source: Xinhua| 2020-05-16 02:17:00|Editor: huaxia


In [15]:
day

<div class="wzzy">
<i class="source">Source: Xinhua</i>|<i class="time"> 2020-05-16 02:17:00</i>|<i class="editor">Editor: huaxia</i>
</div>

In [16]:
day = remove_tags(str(day))

In [17]:
len(day)

52

In [18]:
day

'\nSource: Xinhua| 2020-05-16 02:17:00|Editor: huaxia\n'

In [19]:
day_pub = day[17:27]
day_pub

'2020-05-16'

In [20]:
# Converting to datetime format
day_pub = datetime.strptime(day_pub, '%Y-%m-%d').date()
day_pub

datetime.date(2020, 5, 16)
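The slice `day[17:27]` works only while the text before the date keeps the same length; a different source label would shift it. A minimal sketch of a pattern-based alternative, assuming the byline always contains a `YYYY-MM-DD` date; `extract_pub_date` is a hypothetical helper:

```python
import re
from datetime import datetime

DATE_RE = re.compile(r'\d{4}-\d{2}-\d{2}')

def extract_pub_date(byline):
    """Return the first YYYY-MM-DD date found in the byline as a date object."""
    match = DATE_RE.search(byline)
    if match is None:
        raise ValueError(f'no date found in {byline!r}')
    return datetime.strptime(match.group(), '%Y-%m-%d').date()

byline = '\nSource: Xinhua| 2020-05-16 02:17:00|Editor: huaxia\n'
print(extract_pub_date(byline))  # 2020-05-16
```

Raising on a missing date makes a layout change on the site fail loudly instead of producing a wrong slice.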

In [21]:
df_date = pd.DataFrame([day_pub])

In [22]:
type(df_date)

pandas.core.frame.DataFrame

In [23]:
df_date

Unnamed: 0,0
0,2020-05-16


Now the title.

In [24]:
for title in soup.find_all('h1',{'class':'Btitle'}):
    title = title.text.strip()

In [25]:
title

"IPCC's report comprehensive, objective, fact-based, weighty: HKSAR chief executive"

In [26]:
title_list = [title]

In [27]:
df_title = pd.DataFrame([title_list])

In [28]:
df_title

Unnamed: 0,0
0,"IPCC's report comprehensive, objective, fact-b..."


In [29]:
type(df_title)

pandas.core.frame.DataFrame

The following items are added manually.

In [30]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'Xinhua'
df_source = pd.DataFrame([source])
file_name = 'xinhua_11'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [31]:
news1 = []
for body_other in soup.find_all('div',{'class':'content'}):
    news1.append(body_other.text.strip())

In [32]:
len(news1)

1

In [33]:
df_news = pd.DataFrame()

In [34]:
df_news['article_body'] = news1

In [35]:
df_news.head(2)

Unnamed: 0,article_body
0,"Video PlayerClose\n\n\n\nCarrie Lam, chief exe..."


In [36]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [37]:
df_news = df_news.article_body[0]

In [38]:
# df_news is a plain str here, so str.replace() matches r'\\?' literally rather than as a regex
df_news = df_news.replace(r'\\?','')

In [39]:
df_news = pd.DataFrame([df_news])

In [40]:
type(df_news)

pandas.core.frame.DataFrame

In [41]:
df_news.columns = ['Article']

In [42]:
df_news.head()

Unnamed: 0,Article
0,"Video PlayerClose\n\n\n\nCarrie Lam, chief exe..."
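The preview above still shows the `Video PlayerClose` widget text and runs of newlines. A minimal sketch of one way to tidy the string, assuming that prefix is video-player residue rather than article text; `clean_article` and `WIDGET_PREFIX` are hypothetical names:

```python
import re

WIDGET_PREFIX = 'Video PlayerClose'  # assumption: player residue, not article text

def clean_article(text):
    """Drop the widget prefix and collapse all whitespace runs to single spaces."""
    text = text.strip()
    if text.startswith(WIDGET_PREFIX):
        text = text[len(WIDGET_PREFIX):]
    return re.sub(r'\s+', ' ', text).strip()

# Shortened sample in the shape of the scraped body
sample = 'Video PlayerClose\n\n\n\nCarrie Lam, chief executive\nHONG KONG, May 15 (Xinhua)'
print(clean_article(sample))
```

Collapsing newlines also removes paragraph breaks; if paragraphs matter downstream, split on `\n` first and clean each piece.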


**Bringing it together.**<a id='2.5_bit'></a>

In [43]:
df_11_xinhua = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [44]:
df_11_xinhua.columns = ['file_name','date','source','country','title','article']

In [45]:
df_11_xinhua.head()

Unnamed: 0,file_name,date,source,country,title,article
0,xinhua_11,2020-05-16,Xinhua,China,"IPCC's report comprehensive, objective, fact-b...","Video PlayerClose\n\n\n\nCarrie Lam, chief exe..."
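Concatenating six one-cell DataFrames can be replaced by building the row from a dictionary, which also names the columns in one step. A minimal sketch, assuming pandas is available and reusing the values scraped above; `record` and `df_row` are hypothetical names:

```python
from datetime import date
import pandas as pd

record = {
    'file_name': 'xinhua_11',
    'date': date(2020, 5, 16),
    'source': 'Xinhua',
    'country': 'China',
    'title': "IPCC's report comprehensive, objective, fact-based, weighty: HKSAR chief executive",
    'article': 'Video PlayerClose ...',  # placeholder for the scraped body text
}

# A list with one dict yields a one-row frame whose columns follow the dict keys.
df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
```

Keeping each article as a dict also makes it straightforward to collect many records in a list and build one DataFrame at the end.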


**Saving**<a id='2.6_save'></a>

In [46]:
cd

C:\Users\rands


Saving it to CSV.

In [47]:
# index=False keeps the DataFrame index out of the CSV file
df_11_xinhua.to_csv('./_Capstone_Two_NLP/data/_news/xinhua_11.csv', index=False)

print('Complete')

Complete
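`to_csv` fails if the target folder does not exist, and the relative path above depends on the working directory set by `cd`. A minimal sketch that builds the path explicitly and creates the folder first; `save_frame` is a hypothetical helper, and the directory passed to it is an assumption about the project layout:

```python
from pathlib import Path
import pandas as pd

def save_frame(df, out_dir, file_name):
    """Write df to <out_dir>/<file_name>.csv, creating out_dir if needed."""
    out_path = Path(out_dir) / f'{file_name}.csv'
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)  # index=False keeps the row index out of the file
    return out_path

# Usage with the notebook's names:
# save_frame(df_11_xinhua, './_Capstone_Two_NLP/data/_news', 'xinhua_11')
```

Returning the path makes it easy to log or verify where the file actually landed.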
