# Scraping UK bluffs from Global Times

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# Review the date format below for each website; formats differ between sources

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [5]:
url = 'https://www.globaltimes.cn/content/1193696.shtml'
url

'https://www.globaltimes.cn/content/1193696.shtml'

In [6]:
html = requests.get(url)

In [9]:
bsobj = soup(html.content,'lxml')
# bsobj

**Grab | Date**<a id='2.4_scrape_date'></a>

In [13]:
# Loop variable is named pub_line so it does not shadow the imported datetime.date
for pub_line in bsobj.find('div',{'class':'span8 text-left'}):
    print(pub_line.strip())

By Yang Sheng and Shen Weiduo Source:Global Times Published: 2020/7/6 22:23:40


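The byline, source, and timestamp arrive as one string, so the date still has to be isolated and parsed. This is done in the cleaning section.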

**Grab | Header**<a id='2.4_scrape_header'></a>

In [17]:
for title in bsobj.find_all('div',{'class':'row-fluid article-title'}):
    print(title.text)


 UK bluffs over Hong Kong 

 London’s measures meaningless, ‘bluff rather than bite’ 


There are two titles here: a headline and a subtitle that explains it.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [16]:
for content in bsobj.find_all('div',{'class':'span12 row-content'}):
    print(content.text)

 
A Huawei store stands next to a Globe Telecom booth in Makati City, the Philippines on April 14, 2019. Photo: cnsphotoThe UK, a country with massive untold interests in Hong Kong and wants to retain its colonial influence in the city as much as possible, is now acting tough against China's national security law for its Hong Kong Special Administrative Region (HKSAR). Chinese analysts said on Monday that the actions of the UK are more of a bluff since they can't harm China, but will only damage itself.The measures that the UK would take, including phasing out the use of Chinese firm's technology in its 5G network and offering up to 3 million Hong Kong residents the chance to settle in UK, will cost a huge amount of money and resources for the UK, rather than harm China, experts said.British Prime Minister Boris Johnson declared on Wednesday that China's new national security law in Hong Kong was a "clear and serious breach" of the 1984 Sino-British Joint Declaration. It was among the 

The output above is cut short; the div continues past the end of the article and includes text from several other news items. The trailing material will need to be removed.
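One possible way to drop the trailing material is to cut the text at the first phrase that marks the end of the article. The sketch below assumes such a phrase exists; `truncate_at_marker` and the marker `'RELATED ARTICLES:'` are hypothetical and would need to be replaced after inspecting the real page.

```python
def truncate_at_marker(text, markers):
    """Cut text at the earliest occurrence of any marker.

    Returns the text unchanged if none of the markers are found.
    """
    positions = [text.find(marker) for marker in markers]
    positions = [pos for pos in positions if pos != -1]
    if not positions:
        return text
    return text[:min(positions)].rstrip()


# Hypothetical sample and marker, for illustration only
sample = 'Experts said the measures are a bluff. RELATED ARTICLES: Another story headline'
trimmed = truncate_at_marker(sample, ['RELATED ARTICLES:'])
print(trimmed)
```

If no marker matches, the function returns the text untouched, so it is safe to run on articles that do not carry extra items.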

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `datetime.date` object.

In [22]:
for day in bsobj.find('div',{'class':'span8 text-left'}):
    day_pub = day.strip()

In [23]:
day_pub

'By Yang Sheng and Shen Weiduo Source:Global Times Published: 2020/7/6 22:23:40'

In [25]:
# Remove the byline prefix (hard-coded for this article)
day_pub = re.sub('By Yang Sheng and Shen Weiduo Source:Global Times Published: ','',day_pub)
# Remove the time of day (also hard-coded)
day_pub = re.sub(' 22:23:40','',day_pub)
day_pub = re.sub('/','-',day_pub)
# Convert to a datetime.date object
day_pub = datetime.strptime(day_pub, '%Y-%m-%d').date()
day_pub

datetime.date(2020, 7, 6)
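The substitutions above are tied to this article's authors and timestamp. As a rough alternative, a single pattern can capture whatever date follows `Published:`. This is a sketch, assuming other Global Times bylines use the same `Published: YYYY/M/D` layout; `extract_pub_date` is a hypothetical helper name.

```python
import re
from datetime import date, datetime


def extract_pub_date(byline):
    """Return the date that follows 'Published:' in a byline string."""
    match = re.search(r'Published:\s*(\d{4}/\d{1,2}/\d{1,2})', byline)
    if match is None:
        raise ValueError(f'no publication date found in: {byline!r}')
    # strptime accepts months and days without zero padding
    return datetime.strptime(match.group(1), '%Y/%m/%d').date()


byline = 'By Yang Sheng and Shen Weiduo Source:Global Times Published: 2020/7/6 22:23:40'
print(extract_pub_date(byline))  # 2020-07-06
```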

In [26]:
df_date = pd.DataFrame([day_pub])

In [27]:
type(df_date)

pandas.core.frame.DataFrame

In [28]:
df_date

| | 0 |
|---|---|
| 0 | 2020-07-06 |


Now the title.

In [32]:
for title_s in bsobj.findAll('div',{'class':'row-fluid article-title'}):
    # Overwritten on each pass, so only the last matching title is kept
    title_list = title_s.text
title_list

' London’s measures meaningless, ‘bluff rather than bite’ '

Only the second title (the subtitle) is kept, because the loop overwrites `title_list` on each pass. It captures the story well on its own, so we will work with it.

In [33]:
# Strip the leading and trailing spaces
title_list = title_list.strip()
title_list

'London’s measures meaningless, ‘bluff rather than bite’'
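If both the headline and the subtitle are wanted, the title texts can be collected into a list and cleaned together. The sketch below works on plain strings that mimic the scraped output; `clean_titles` is a hypothetical helper, and the newline padding in `raw` is an assumption about how the text arrives.

```python
def clean_titles(raw_titles):
    """Collapse runs of whitespace in each title and drop empty entries."""
    cleaned = [' '.join(title.split()) for title in raw_titles]
    return [title for title in cleaned if title]


# Stand-in for [t.text for t in bsobj.findAll('div', {'class': 'row-fluid article-title'})]
raw = [
    '\n UK bluffs over Hong Kong \n',
    '\n London’s measures meaningless, ‘bluff rather than bite’ \n',
]
headline, subtitle = clean_titles(raw)
print(headline)
print(subtitle)
```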

In [34]:
df_title = pd.DataFrame([title_list])

In [35]:
df_title

| | 0 |
|---|---|
| 0 | London’s measures meaningless, ‘bluff rather t... |


In [36]:
type(df_title)

pandas.core.frame.DataFrame

These items are manually added.

In [37]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'Global Times'
df_source = pd.DataFrame([source])
file_name = 'globaltimes_7'
df_file_name = pd.DataFrame([file_name])
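The source is typed in by hand here, but the byline scraped earlier already carries it, along with the author names. A minimal sketch of reading both from that string follows, assuming the `By ... Source:... Published:` layout holds for other articles; `parse_byline` and `BYLINE_PATTERN` are hypothetical names.

```python
import re

# Assumed layout: 'By <authors> Source:<source> Published: <timestamp>'
BYLINE_PATTERN = re.compile(r'^By (?P<authors>.+?) Source:\s*(?P<source>.+?) Published:')


def parse_byline(byline):
    """Return a dict with 'authors' and 'source', or None if the layout does not match."""
    match = BYLINE_PATTERN.search(byline)
    if match is None:
        return None
    # Authors are separated by commas or the word 'and'
    names = re.split(r',| and ', match.group('authors'))
    authors = [name.strip() for name in names if name.strip()]
    return {'authors': authors, 'source': match.group('source').strip()}


byline = 'By Yang Sheng and Shen Weiduo Source:Global Times Published: 2020/7/6 22:23:40'
parsed = parse_byline(byline)
print(parsed)
```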


Finally, the news.

In [38]:
news1 = []
for news in bsobj.findAll('div',{'class':'span12 row-content'}):
    news1.append(news.text.strip())

In [40]:
len(news1)

1
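The scraped text runs paragraphs together (for example `cnsphotoThe UK` in the output shown earlier), because `.text` joins text nodes with no separator. Beautiful Soup's `get_text` takes a `separator` argument that can keep them apart. The sketch below uses a made-up HTML snippet, since the real article markup may differ.

```python
from bs4 import BeautifulSoup

# Stand-in HTML, for illustration only; the real article markup may differ
html_snippet = (
    '<div class="span12 row-content">'
    'Photo: cnsphoto<br>The UK is acting tough.<br><br>Experts said it is a bluff.'
    '</div>'
)
node = BeautifulSoup(html_snippet, 'html.parser').find('div', {'class': 'span12 row-content'})

# .text concatenates the text nodes directly
fused = node.text

# get_text can insert a separator between text nodes and strip each one
separated = node.get_text(separator='\n', strip=True)
paragraphs = separated.split('\n')
print(paragraphs)
```

Splitting into paragraphs first may also make it easier to drop the photo caption or any trailing items before joining the article text back together.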

In [41]:
df_news = pd.DataFrame()

In [42]:
df_news['article_body'] = news1

In [43]:
df_news.head(2)

| | article_body |
|---|---|
| 0 | A Huawei store stands next to a Globe Telecom ... |


In [44]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [45]:
df_news = df_news.article_body[0]

In [46]:
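# df_news is a plain string here, and str.replace matches literally (no regex).
# Assuming the intent was to drop stray backslashes:
df_news = df_news.replace('\\', '')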

In [47]:
df_news = pd.DataFrame([df_news])

In [48]:
type(df_news)

pandas.core.frame.DataFrame

In [49]:
df_news.columns = ['Article']

In [50]:
df_news.head()

| | Article |
|---|---|
| 0 | A Huawei store stands next to a Globe Telecom ... |


**Bringing it together.**<a id='2.5_bit'></a>

In [51]:
df_7_global_times = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [52]:
df_7_global_times.columns = ['file_name','date','source','country','title','article']

In [53]:
df_7_global_times.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | globaltimes_7 | 2020-07-06 | Global Times | China | London’s measures meaningless, ‘bluff rather t... | A Huawei store stands next to a Globe Telecom ... |
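Concatenating six single-cell DataFrames works, but the same row can also be built in one step from a dictionary, which names the columns at the same time. The sketch below uses the values from this notebook; the `article` entry is a shortened placeholder for the full text, and `record` and `df_row` are illustrative names.

```python
from datetime import date

import pandas as pd

record = {
    'file_name': 'globaltimes_7',
    'date': date(2020, 7, 6),
    'source': 'Global Times',
    'country': 'China',
    'title': 'London’s measures meaningless, ‘bluff rather than bite’',
    # Placeholder: the full article string would go here
    'article': 'A Huawei store stands next to a Globe Telecom ...',
}

# A list with one dict gives a one-row DataFrame; the keys become the column names
df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
```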


**Saving**<a id='2.6_save'></a>

In [54]:
cd

C:\Users\rands


Saving it to CSV.

In [55]:
# index=False so the DataFrame index is not written as an extra column
df_7_global_times.to_csv('./_Capstone_Two_NLP/data/_news/globaltimes_7.csv', index=False)

print('Complete')

Complete
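The save step depends on the current working directory and on the target folder already existing. A slightly more defensive version might create the folder first and read the file back as a check. This is a sketch: `save_csv` is a hypothetical helper, and a temporary directory stands in for the real project path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path as UTF-8 CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False, encoding='utf-8')
    return path


# Small stand-in frame; the real call would pass df_7_global_times
demo = pd.DataFrame([{'file_name': 'globaltimes_7', 'title': 'London’s measures meaningless'}])

with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(demo, Path(tmp) / 'data' / '_news' / 'globaltimes_7.csv')
    # Read the file back to confirm the curly quotes survived the round trip
    round_trip = pd.read_csv(out, encoding='utf-8')

print(round_trip.loc[0, 'title'])
```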
