# Scraping Safeguarding long-term prosperity from Ministry of Foreign Affairs

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
#the below needs to be reviewed for all websites; notably the time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

For reference, the environment used for this run:

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'https://www.fmprc.gov.cn/mfa_eng/wjb_663304/zwjg_665342/zwbd_665378/t1798919.shtml'
content = requests.get(url).content
url

'https://www.fmprc.gov.cn/mfa_eng/wjb_663304/zwjg_665342/zwbd_665378/t1798919.shtml'

In [5]:
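# Parse the raw HTML. This rebinds the name `soup` from the BeautifulSoup
# class (see the import cell) to the parsed document, so re-run the import
# cell before running this cell again.
soup = soup(content,'html.parser')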
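
The request above has no timeout or status check, so a slow server could hang the cell and an error page could be handed to the parser unnoticed. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html` and the 30-second default are illustrative assumptions, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=30):
    """Return the raw response body for `url`, raising on HTTP errors.

    `fetch_html` and the 30-second default are hypothetical choices
    made for this sketch only.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return response.content
```

It could then be used as `content = fetch_html(url)` ahead of the parsing step.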

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
for date in soup.find_all('div',{'class':'date'}):
    print(date.text.strip())

2020/07/20


Beautiful.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [7]:
for title in soup.find_all('h2',{'class':'title'}):
    print(title)

<h2 class="title">Ambassador Liu Xiaoming Gives Exclusive Live Interview on BBC's Andrew Marr Show</h2>


In [8]:
TAG_RE = re.compile(r'<[^>]+>')

In [9]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [10]:
remove_tags(str(title))

"Ambassador Liu Xiaoming Gives Exclusive Live Interview on BBC's Andrew Marr Show"

That confirms the title, although stripping the tags with a regular expression is a slightly roundabout way to get it.
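
As a shorter route, BeautifulSoup can return the text of a tag directly, which avoids the regular expression. The sketch below parses a hard-coded copy of the `h2` element rather than the live page, and the names `doc`, `title_tag`, and `title_text` are illustrative.

```python
from bs4 import BeautifulSoup

# Hard-coded copy of the element printed above, so the sketch runs offline
html = """<h2 class="title">Ambassador Liu Xiaoming Gives Exclusive Live Interview on BBC's Andrew Marr Show</h2>"""
doc = BeautifulSoup(html, 'html.parser')

# find() returns the first match, or None if nothing matches
title_tag = doc.find('h2', {'class': 'title'})
title_text = title_tag.get_text(strip=True) if title_tag is not None else None
print(title_text)
```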

**Grab | Content**<a id='2.4_scrape_content'></a>

In [11]:
for bodies in soup.find_all('div',{'class':'content'}):
    print(bodies.text.strip())

On 19 July 2020, H.E. Ambassador Liu Xiaoming gave an exclusive live interview on BBC ONE's Andrew Marr Show about Hong Kong, Huawei, and Xinjiang. The full text is as follows:

Marr: Ambassador, welcome. Can I, first of all, ask you about Hong Kong? Are rights of dissent and freedom of speech still valued in Hong Kong?
Ambassador Liu: Fully respected. I think people talk about this National Security Law. National Security law is about restoring order and protecting the rights of the majority of people. It's targeted on a very small group of criminals who intend to endanger the national security.
Marr: But let me remind people what the National Security Law actually says. It says that Beijing now decides what breaks the law, not Hong Kong itself. Protesters can be arrested just for using placards. Police can search buildings without warrants and trials can be held in secret without a jury. Surely those laws break that "One Country, Two Systems" promise China originally made.
Ambassador

In [12]:
len(bodies)

115

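The last matched `div` has 115 child nodes (`len()` on a tag counts its direct children), which confirms the loop reached the bottom of the article.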
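
To make the meaning of `len()` on a tag concrete, here is a small sketch on made-up HTML (the three-paragraph `div` is an assumption for illustration, not the real page). It also passes the attribute filter as a dictionary, `{'class': 'content'}`.

```python
from bs4 import BeautifulSoup

# Toy markup standing in for the article body
html = '<div class="content"><p>First</p><p>Second</p><p>Third</p></div>'
doc = BeautifulSoup(html, 'html.parser')
body = doc.find('div', {'class': 'content'})

n_children = len(body)           # number of direct child nodes
n_chars = len(body.get_text())   # number of characters of text
paragraphs = [p.get_text(strip=True) for p in body.find_all('p')]
print(n_children, n_chars, paragraphs)
```

The two lengths answer different questions: the first counts nodes, the second counts characters of text.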

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime.date` object.

In [13]:
for date in soup.find_all('div',{'class':'date'}):
    print(date.text.strip())

2020/07/20


In [14]:
date

<div class="date" id="News_Body_Time">2020/07/20</div>

In [15]:
date = str(date)

In [16]:
day_pub = remove_tags(date)
day_pub

'2020/07/20'

In [17]:
# Replace the slashes with hyphens
day_pub = re.sub('/','-',day_pub)
# Convert to a datetime.date object
day_pub = datetime.strptime(day_pub, '%Y-%m-%d').date()

day_pub

datetime.date(2020, 7, 20)
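
The substitution step is optional, since `strptime` can match the slashes directly if the format string contains them. A minimal sketch, with `raw_date` standing in for the scraped text:

```python
from datetime import datetime

raw_date = '2020/07/20'  # stands in for the text scraped from the date <div>

# The format string mirrors the separators used on the page
day_pub = datetime.strptime(raw_date, '%Y/%m/%d').date()
print(day_pub.isoformat())
```

If another site uses a different layout, only the format string should need to change.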

In [18]:
df_date = pd.DataFrame([day_pub])

In [19]:
type(df_date)

pandas.core.frame.DataFrame

In [20]:
df_date

| | 0 |
|---|---|
| 0 | 2020-07-20 |


Now the title.

In [21]:
for title in soup.find_all('h2',{'class':'title'}):
    print(title)

<h2 class="title">Ambassador Liu Xiaoming Gives Exclusive Live Interview on BBC's Andrew Marr Show</h2>


In [22]:
title = remove_tags(str(title))

In [23]:
title

"Ambassador Liu Xiaoming Gives Exclusive Live Interview on BBC's Andrew Marr Show"

In [24]:
df_title = pd.DataFrame([title])

In [25]:
df_title

| | 0 |
|---|---|
| 0 | Ambassador Liu Xiaoming Gives Exclusive Live I... |


In [26]:
type(df_title)

pandas.core.frame.DataFrame

The following fields are set manually.

In [27]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'Ministry of Foreign Affairs'
df_source = pd.DataFrame([source])
file_name = 'min_foreign_aff_9'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [28]:
news1 = []
for bodies in soup.find_all('div',{'class':'content'}):
    news1.append(bodies.text.strip())

In [29]:
df_news = pd.DataFrame()

In [30]:
df_news['article_body'] = news1

In [31]:
df_news.head(2)

| | article_body |
|---|---|
| 0 | On 19 July 2020, H.E. Ambassador Liu Xiaoming ... |


In [32]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [33]:
df_news = df_news.article_body[0]

In [34]:
# df_news is a plain string here, so str.replace matches literally; strip stray backslashes
df_news = df_news.replace('\\','')
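
Because `df_news` is a plain string at this point, `.replace` here is `str.replace`, which treats its first argument as literal text rather than as a regular expression. A small sketch of the difference, using a made-up sample string:

```python
import re

# Made-up sample containing stray backslashes
sample = r'One Country\, Two Systems\?'

# Literal search for backslash-backslash-question-mark: not present, so nothing changes
literal_attempt = sample.replace(r'\\?', '')

# Two ways to actually remove every backslash
cleaned = sample.replace('\\', '')
regex_cleaned = re.sub(r'\\', '', sample)

print(literal_attempt)
print(cleaned)
```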

In [35]:
df_news = pd.DataFrame([df_news])

In [36]:
type(df_news)

pandas.core.frame.DataFrame

In [37]:
df_news.columns = ['Article']

In [38]:
df_news.head()

| | Article |
|---|---|
| 0 | On 19 July 2020, H.E. Ambassador Liu Xiaoming ... |


**Bringing it together.**<a id='2.5_bit'></a>

In [39]:
df_9_min_foreign_aff = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [40]:
df_9_min_foreign_aff.columns = ['file_name','date','source','country','title','article']

In [41]:
df_9_min_foreign_aff.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | min_foreign_aff_9 | 2020-07-20 | Ministry of Foreign Affairs | China | Ambassador Liu Xiaoming Gives Exclusive Live I... | On 19 July 2020, H.E. Ambassador Liu Xiaoming ... |
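
The same one-row frame can also be built in a single step from a dictionary, which skips the six intermediate DataFrames and the separate column-renaming cell. This is only a sketch: the name `df_record` is illustrative, and the article text is a truncated placeholder rather than the scraped body.

```python
from datetime import date

import pandas as pd

record = {
    'file_name': 'min_foreign_aff_9',
    'date': date(2020, 7, 20),
    'source': 'Ministry of Foreign Affairs',
    'country': 'China',
    'title': "Ambassador Liu Xiaoming Gives Exclusive Live Interview on BBC's Andrew Marr Show",
    # Placeholder only; the real notebook would pass the full scraped text
    'article': 'On 19 July 2020, H.E. Ambassador Liu Xiaoming ...',
}

# A list holding one dict yields one row, with the dict keys as column names
df_record = pd.DataFrame([record])
print(df_record.shape)
```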


**Saving**<a id='2.6_save'></a>

In [42]:
cd

C:\Users\rands


Saving it to CSV.

In [43]:
# index=False so the DataFrame index is not written as an extra column in the CSV
df_9_min_foreign_aff.to_csv('./_Capstone_Two_NLP/data/_news/min_foreign_aff_9.csv', index=False)

print('Complete')

Complete
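
`to_csv` does not create missing folders, so the save depends on the target directory already existing. The sketch below shows one way to guard against that and to confirm the file reads back. It writes to a temporary directory, and the two-column `record` frame is a stand-in rather than the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in frame; the notebook would use df_9_min_foreign_aff instead
record = pd.DataFrame([{'file_name': 'min_foreign_aff_9', 'title': 'Example title'}])

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'data' / '_news' / 'min_foreign_aff_9.csv'
    # Create the folder tree if it is missing
    out_path.parent.mkdir(parents=True, exist_ok=True)
    record.to_csv(out_path, index=False)
    # Read the file back to check the round trip
    restored = pd.read_csv(out_path)

print(restored.to_dict('records'))
```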
