# Scraping "Biased tone, misinformation ‘major mistakes’ by BBC on HK riot" from Global Times

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [163]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [164]:
# Review the date handling below for each website, notably the time format.

from datetime import date, datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [165]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [166]:
url = 'https://www.globaltimes.cn/page/202104/1221482.shtml'  # fixed article URL; no date placeholder to fill
url

'https://www.globaltimes.cn/page/202104/1221482.shtml'

In [167]:
html = requests.get(url, timeout=30)
html.raise_for_status()  # stop early on an HTTP error instead of parsing an error page

In [168]:
bsobj = soup(html.content,'lxml')
# bsobj

**Grab | Date**<a id='2.4_scrape_date'></a>

In [169]:
# loop variable renamed so it does not shadow the imported `date` class
for pub_time in bsobj.find('span',{'class':'pub_time'}):
    print(pub_time.strip())

Published: Apr 19, 2021 09:39 PM


Needs minor work: the "Published: " prefix and the time of day are removed in the cleaning step.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [170]:
for title in bsobj.findAll('div',{'class':'article_title'}):
    print(title.text)  # .text is already a string, so format() is not needed

Biased tone, misinformation ‘major mistakes’ by BBC on HK riot: Chinese scholar


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [171]:
for news in bsobj.findAll('div',{'class':'article_right'}):
    print(news.text.strip())

Victor Gao (center), chair professor of Soochow University and vice president of the Center for China and Globalization, was invited to the BBC's Newsnight to debate Hong Kong secessionist Nathan Law (left) on Saturday. Screenshot from BBC News YouTube channel.There were some major mistakes that the BBC made in its latest TV program Newsnight, including a biased tone and misinformation on the illegal protests led by radical anti-government figures who were wrongfully deemed "pro-democracy activists," a Chinese scholar who recently engaged in an online debate with Hong Kong secessionist Nathan Law told the Global Times. Victor Gao, chair professor of Soochow University and vice president of the Center for China and Globalization, was invited to BBC's Newsnight to debate with Law, after other opposition figures such as Martin Lee Chu-ming and secessionists like Jimmy Lai Chee-ying were sentenced, which the BBC host described as 60- and- 70-something men being thrown into jail for partici

Confirmed that the output runs to the bottom of the article.
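The three grabs above assume that each element exists on the page. `find` returns `None` when nothing matches, and looping over `None` raises a `TypeError`. A minimal sketch of a guard, assuming the same class names as this page; the helper name `first_text` and the sample HTML are hypothetical, and `html.parser` is used here only so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

def first_text(page, tag, class_name):
    """Return the stripped text of the first matching element, or None if absent."""
    node = page.find(tag, {'class': class_name})
    return node.get_text(strip=True) if node is not None else None

# Hypothetical stand-in for the downloaded page
sample_html = """
<div class="article_title"> Sample title </div>
<span class="pub_time">Published: Apr 19, 2021 09:39 PM</span>
"""
page = BeautifulSoup(sample_html, 'html.parser')

print(first_text(page, 'div', 'article_title'))   # Sample title
print(first_text(page, 'span', 'pub_time'))       # Published: Apr 19, 2021 09:39 PM
print(first_text(page, 'div', 'article_right'))   # None, because the sample has no body div
```

A `None` result is a cue to check whether the site changed its markup before any cleaning is attempted.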

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a date object.

In [172]:
for day in bsobj.find('span',{'class':'pub_time'}):
    day_pub = day.strip()

In [173]:
day_pub

'Published: Apr 19, 2021 09:39 PM'

In [174]:
# parse the full timestamp in one step, then keep only the calendar date
# (%b expects English month abbreviations such as 'Apr')
day_pub = datetime.strptime(day_pub, 'Published: %b %d, %Y %I:%M %p').date()
day_pub

datetime.date(2021, 4, 19)
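To reuse this step on other articles, the parsing can be wrapped in a small helper. This is a minimal sketch, assuming every `pub_time` string has the shape `Published: Mon DD, YYYY HH:MM AM/PM` and that the locale uses English month abbreviations; the names `PUB_TIME_FORMAT` and `parse_pub_time` are hypothetical.

```python
from datetime import datetime

# Assumed shape of the scraped string: 'Published: Apr 19, 2021 09:39 PM'
PUB_TIME_FORMAT = 'Published: %b %d, %Y %I:%M %p'

def parse_pub_time(raw):
    """Return the publication date from a pub_time string (hypothetical helper)."""
    return datetime.strptime(raw.strip(), PUB_TIME_FORMAT).date()

example = parse_pub_time('Published: Apr 19, 2021 09:39 PM')
print(example)  # 2021-04-19
```

A string in any other shape raises `ValueError`, which makes a changed time format visible instead of letting a wrong date through.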

In [175]:
df_date = pd.DataFrame([day_pub])

In [176]:
type(df_date)

pandas.core.frame.DataFrame

In [177]:
df_date

            0
0  2021-04-19


Now the title.

In [178]:
for title_s in bsobj.findAll('div',{'class':'article_title'}):
    title_list = title_s.text  # a plain string despite the name; the last matching div wins

In [179]:
df_title = pd.DataFrame([title_list])

In [180]:
df_title

                                                   0
0  Biased tone, misinformation ‘major mistakes’ b...


In [181]:
type(df_title)

pandas.core.frame.DataFrame

The following metadata fields are set manually.

In [182]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'Global Times'
df_source = pd.DataFrame([source])
file_name = 'globaltimes_6'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [183]:
news1 = []
for news in bsobj.findAll('div',{'class':'article_right'}):
    news1.append(news.text.strip())

In [184]:
df_news = pd.DataFrame()

In [185]:
df_news['article_body'] = news1

In [186]:
df_news.head(2)

                                        article_body
0  Victor Gao (center), chair professor of Soocho...


In [187]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [188]:
df_news = df_news.article_body[0]

In [189]:
df_news = df_news.replace('\\','')  # str.replace is literal, not regex: remove stray backslashes

In [190]:
df_news = pd.DataFrame([df_news])

In [191]:
type(df_news)

pandas.core.frame.DataFrame

In [192]:
df_news.columns = ['Article']

In [193]:
df_news.head()

                                             Article
0  Victor Gao (center), chair professor of Soocho...
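The body cleanup can also be done on plain strings, without passing through a DataFrame first. This is a minimal sketch, assuming the goal is to join the scraped text blocks with single spaces and drop stray backslashes; the helper name `join_blocks` and the sample strings are hypothetical.

```python
def join_blocks(blocks):
    """Join text blocks with single spaces, skipping empty ones, and drop backslashes."""
    cleaned = (block.strip() for block in blocks)
    text = ' '.join(block for block in cleaned if block)
    return text.replace('\\', '')

# Hypothetical stand-in for the scraped text blocks
sample_blocks = ['  First block. ', '', 'Second\\ block.']
print(join_blocks(sample_blocks))  # First block. Second block.
```

The resulting string can be wrapped in a one-row DataFrame in the same way as the date and the title.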


**Bringing it together.**<a id='2.5_bit'></a>

In [194]:
df_6_global_times = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [195]:
df_6_global_times.columns = ['file_name','date','source','country','title','article']

In [196]:
df_6_global_times.head()

       file_name        date        source country                                              title                                            article
0  globaltimes_6  2021-04-19  Global Times   China  Biased tone, misinformation ‘major mistakes’ b...  Victor Gao (center), chair professor of Soocho...
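An alternative to concatenating six one-cell frames is to build the row from a single dictionary, which names the columns in the same step. This is a minimal sketch rather than the method used above; the title and article values are placeholders, not the scraped text.

```python
from datetime import date

import pandas as pd

# One record per article; keys become the column names
record = {
    'file_name': 'globaltimes_6',
    'date': date(2021, 4, 19),
    'source': 'Global Times',
    'country': 'China',
    'title': 'placeholder title',    # stands in for the scraped title
    'article': 'placeholder body',   # stands in for the scraped article text
}
df_row = pd.DataFrame([record])

print(df_row.shape)             # (1, 6)
print(df_row.columns.tolist())
```

With this layout, several articles can be collected as a list of dictionaries and turned into one frame at the end.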


**Saving**<a id='2.6_save'></a>

In [197]:
cd

C:\Users\rands


Saving it to CSV.

In [198]:
# index=False keeps the DataFrame index out of the CSV file
df_6_global_times.to_csv('./_Capstone_Two_NLP/data/_news/globaltimes_6.csv', index=False)

print('Complete')

Complete
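A quick way to confirm the save is to read the file back and check its shape and columns. The sketch below writes to a temporary directory so it can run anywhere; the frame contents are placeholders, and creating the parent folder with `mkdir` is an added precaution, not something the cell above does.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Placeholder frame standing in for df_6_global_times
df = pd.DataFrame([{'file_name': 'globaltimes_6',
                    'date': '2021-04-19',
                    'title': 'placeholder title'}])

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'data' / '_news' / 'globaltimes_6.csv'
    out_path.parent.mkdir(parents=True, exist_ok=True)  # to_csv does not create folders
    df.to_csv(out_path, index=False)
    # parse_dates restores the date column, which CSV stores as text
    df_check = pd.read_csv(out_path, parse_dates=['date'])

print(df_check.shape)             # (1, 3)
print(df_check.columns.tolist())  # ['file_name', 'date', 'title']
```

Because the file was written with `index=False`, no extra `Unnamed: 0` column appears when it is read back.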
