# Scraping Safeguarding long-term prosperity from Ministry of Foreign Affairs

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# Review the block below for each website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Package versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'https://www.fmprc.gov.cn/mfa_eng/xwfw_665399/s2510_665401/2511_665403/t1793861.shtml'
content = requests.get(url).content
url

'https://www.fmprc.gov.cn/mfa_eng/xwfw_665399/s2510_665401/2511_665403/t1793861.shtml'

In [5]:
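# Note: this rebinds the name `soup` from the BeautifulSoup class to the parsed page,
# so re-run the import cell before re-running this one.
soup = soup(content,'html.parser')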

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
for date in soup.find_all('h2',{'class':'title'}):
    print(date.text.strip())

Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020


The heading holds both the title and the date, so this one selector serves both fields.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [7]:
for title in soup.find_all('h2',{'class':'title'}):
    print(title.text.strip())

Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020


In [8]:
TAG_RE = re.compile(r'<[^>]+>')

In [9]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [10]:
remove_tags(str(title))

"Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020"

Confirmed that's the title; need to remove the date.
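
One way to drop the date from the title is to cut the trailing " on Month D, YYYY" phrase with a regular expression. This is a minimal sketch that assumes every press-conference heading ends with that pattern; `strip_trailing_date` and `TRAILING_DATE_RE` are hypothetical names, not part of the notebook.

```python
import re

# Assumed pattern: the heading ends with " on <Month> <day>, <year>".
TRAILING_DATE_RE = re.compile(r"\s+on\s+[A-Z][a-z]+\s+\d{1,2},\s+\d{4}\s*$")


def strip_trailing_date(heading: str) -> str:
    """Return the heading without its trailing ' on Month D, YYYY' phrase."""
    return TRAILING_DATE_RE.sub("", heading)


heading = "Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020"
title_only = strip_trailing_date(heading)
print(title_only)
```

A heading that does not end with a date is returned unchanged, so the helper is safe to apply to every page.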

**Grab | Content**<a id='2.4_scrape_content'></a>

In [11]:
for bodies in soup.find_all('div',{'class':'content'}):
    print(bodies.text.strip())

In recent years, the US government has placed unwarranted restrictions on Chinese media agencies and personnel in the US, purposely made things difficult for their normal reporting assignments, and subjected them to growing discrimination and politically-motivated oppression. On February 18, 2020, the US designated five Chinese media agencies, namely, Xinhua News Agency, China Daily Distribution Corporation, China Global Television Network, China Radio International, and distributor for People's Daily in the US Hai Tian Development USA, as foreign missions. In the spirit of reciprocity, China demanded on March 18 that the China-based branches of Voice of America, the New York Times, the Wall Street Journal, Time, and the Washington Post declare in written form information about their staff, finance, operation and real estate in China. 
On June 22, the US issued a new determination to designate four additional Chinese media agencies, namely, China Central Television, the People's Daily,

In [12]:
len(bodies)

81

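`bodies` now holds the last matched `div`, which has 81 child nodes; that confirms the loop reached the bottom of the page content.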

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime` object.

In [13]:
for date in soup.find_all('h2',{'class':'title'}):
    print(date.text.strip())

Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020


In [14]:
date

<h2 class="title">Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020</h2>

In [15]:
date = str(date)

In [16]:
day_pub = remove_tags(date)
day_pub

"Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020"

In [17]:
len(day_pub)

84

In [18]:
day_pub = day_pub[72:84]
day_pub

'July 1, 2020'

In [19]:
day_pub = re.sub(',','',day_pub)
day_pub = re.sub(' ','-',day_pub)
day_pub = re.sub('July','07',day_pub)
day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()
day_pub

datetime.date(2020, 7, 1)
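
The slice `[72:84]` and the hard-coded `'July'` replacement only fit this particular heading. A minimal sketch of a more general approach, assuming the heading always contains one date written as `Month D, YYYY` in English: locate it with a regex and let `strptime` read the month name through `%B`. `parse_heading_date` is a hypothetical helper, and `%B` follows the current locale, so an English locale is assumed.

```python
import re
from datetime import date, datetime

# Assumed date shape: capitalised month name, day, comma, four-digit year.
DATE_RE = re.compile(r"[A-Z][a-z]+ \d{1,2}, \d{4}")


def parse_heading_date(heading: str) -> date:
    """Find the first 'Month D, YYYY' date in the heading and parse it."""
    match = DATE_RE.search(heading)
    if match is None:
        raise ValueError(f"no date found in heading: {heading!r}")
    # %B parses the full month name (locale-dependent).
    return datetime.strptime(match.group(0), "%B %d, %Y").date()


heading = "Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020"
day_pub = parse_heading_date(heading)
print(day_pub)
```

Raising `ValueError` on a heading without a date makes a changed page layout fail loudly instead of storing a wrong date.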

In [20]:
df_date = pd.DataFrame([day_pub])

In [21]:
type(df_date)

pandas.core.frame.DataFrame

In [22]:
df_date

|   | 0          |
|---|------------|
| 0 | 2020-07-01 |


Now the title.

In [23]:
for title in soup.find_all('h2',{'class':'title'}):
    print(title)

<h2 class="title">Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020</h2>


In [24]:
title = remove_tags(str(title))

In [25]:
title

"Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020"

In [26]:
df_title = pd.DataFrame([title])

In [27]:
df_title

|   | 0                                                 |
|---|---------------------------------------------------|
| 0 | Foreign Ministry Spokesperson Zhao Lijian's Re... |


In [28]:
type(df_title)

pandas.core.frame.DataFrame

These items are manually added.

In [29]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'Ministry of Foreign Affairs'
df_source = pd.DataFrame([source])
file_name = 'min_foreign_aff_10'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [30]:
news1 = []
for bodies in soup.find_all('div',{'class':'content'}):
    news1.append(bodies.text.strip())

In [31]:
df_news = pd.DataFrame()

In [32]:
df_news['article_body'] = news1

In [33]:
df_news.head(2)

|   | article_body                                      |
|---|---------------------------------------------------|
| 0 | In recent years, the US government has placed ... |


In [34]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [35]:
df_news = df_news.article_body[0]

In [36]:
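# Note: df_news is a plain str here, so replace() matches literally (not as a regex);
# only the exact three-character sequence \\? would be removed.
df_news = df_news.replace(r'\\?','')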

In [37]:
df_news = pd.DataFrame([df_news])

In [38]:
type(df_news)

pandas.core.frame.DataFrame

In [39]:
df_news.columns = ['Article']

In [40]:
df_news.head()

|   | Article                                           |
|---|---------------------------------------------------|
| 0 | In recent years, the US government has placed ... |
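
The join-and-clean steps above can be condensed into one function. This is a sketch under two assumptions the source does not state: that the `replace` call is meant to drop stray backslashes, and that collapsing runs of whitespace is acceptable. `clean_article_text` is a hypothetical name.

```python
def clean_article_text(chunks):
    """Join scraped text chunks into one string and tidy it up."""
    joined = " ".join(chunks)
    # Assumption: stray backslashes are unwanted. str.replace is literal.
    without_backslashes = joined.replace("\\", "")
    # Optional extra step: collapse whitespace runs (including newlines).
    return " ".join(without_backslashes.split())


sample_chunks = ["First paragraph.\\ \n", "  Second paragraph."]
article_text = clean_article_text(sample_chunks)
print(article_text)
```

The result is a single string, ready to be wrapped with `pd.DataFrame([article_text])` as in the cells above.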


**Bringing it together.**<a id='2.5_bit'></a>

In [41]:
df_10_min_foreign_aff = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [42]:
df_10_min_foreign_aff.columns = ['file_name','date','source','country','title','article']

In [43]:
df_10_min_foreign_aff.head()

|   | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | min_foreign_aff_10 | 2020-07-01 | Ministry of Foreign Affairs | China | Foreign Ministry Spokesperson Zhao Lijian's Re... | In recent years, the US government has placed ... |
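
Since every field is a single value, the same one-row table can also be built from a dictionary in one step, which avoids concatenating six single-cell DataFrames and renaming the columns afterwards. A sketch, with the article text shortened to a placeholder; `record` and `df_record` are illustrative names.

```python
from datetime import date

import pandas as pd

# One dictionary per scraped page; keys become the column names.
record = {
    "file_name": "min_foreign_aff_10",
    "date": date(2020, 7, 1),
    "source": "Ministry of Foreign Affairs",
    "country": "China",
    "title": "Foreign Ministry Spokesperson Zhao Lijian's Regular Press Conference on July 1, 2020",
    "article": "In recent years, ...",  # placeholder for the scraped body
}

# A list with one dict yields a one-row DataFrame, columns in key order.
df_record = pd.DataFrame([record])
print(df_record.shape)
```

Appending more dictionaries to the list would give one row per page if several pages are scraped in a single run.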


**Saving**<a id='2.6_save'></a>

In [44]:
cd

C:\Users\rands


Saving it to CSV.

In [45]:
# index=False so the DataFrame index is not written as an extra column
df_10_min_foreign_aff.to_csv('./_Capstone_Two_NLP/data/_news/min_foreign_aff_10.csv', index=False)

print('Complete')

Complete
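
`to_csv` does not create missing folders, so the relative path above only works if `./_Capstone_Two_NLP/data/_news/` already exists under the current directory. A small sketch of a guard, using a hypothetical helper `ensure_parent_dir`:

```python
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed; return it as a Path."""
    target = Path(path)
    # parents=True builds intermediate folders; exist_ok=True avoids an
    # error when the folder is already there.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target
```

It could be used as `df_10_min_foreign_aff.to_csv(ensure_parent_dir('./_Capstone_Two_NLP/data/_news/min_foreign_aff_10.csv'), index=False)`, since `to_csv` accepts path-like objects.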
