# Scraping "National Security Law to protect HK democracy, freedom" from Global Times

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [99]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [100]:
# Review the block below for each website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

For reference, the versions used in this run:

In [101]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [102]:
# The URL has no {} placeholder, so no .format() call is needed
url = 'https://www.globaltimes.cn/content/1193131.shtml'
url

'https://www.globaltimes.cn/content/1193131.shtml'
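
`str.format` only substitutes values where the string contains a `{}` placeholder. If the URL ever needs to be built from a variable, a sketch might look like the following. It assumes other articles follow the same `/content/<id>.shtml` pattern, which this notebook does not verify; `ARTICLE_URL_TEMPLATE` and `build_article_url` are hypothetical names.

```python
# Hypothetical template: assumes other articles share this URL pattern.
ARTICLE_URL_TEMPLATE = 'https://www.globaltimes.cn/content/{}.shtml'


def build_article_url(article_id):
    """Return the article URL for a numeric Global Times content ID."""
    return ARTICLE_URL_TEMPLATE.format(article_id)


url = build_article_url(1193131)
print(url)
```

Keeping the template in one constant means only the ID changes from notebook to notebook.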

In [103]:
html = requests.get(url)

In [104]:
bsobj = soup(html.content,'lxml')
# bsobj

**Grab | Date**<a id='2.4_scrape_date'></a>

In [105]:
# Named `byline` so the loop does not shadow the imported `date` class
for byline in bsobj.findAll('div',{'class':'span8 text-left'}):
    print(byline.text.strip())

Source:Global Times Published: 2020/6/30 23:55:07


The date needs a moderate amount of cleaning: the source prefix and the time of day have to be stripped.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [106]:
for header in bsobj.findAll("h3"):
    print(header.text.strip())

National Security Law to protect HK democracy, freedom: Global Times editorial


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [107]:
for news in bsobj.findAll('div',{'class':'span12 row-content'}):
    print(news.text.strip())

Hong Kong citizens on Tuesday gather to support the National Security Law for Hong Kong. Photo: cnsphotoThe Standing Committee of the National People's Congress (NPC) passed the National Security Law for Hong Kong on Tuesday. The law took effect at 11 pm on Tuesday. The full text of the law shows that the law's goal is in line with national security laws across the world. There is nothing in it that suppresses democracy and freedom in Hong Kong. The four categories of crimes the law strikes have nothing to do with freedom of speech, assembly and association. Claims that the law was enacted to strengthen control on Hong Kong society are either prejudiced interpretations or ill-intentioned propaganda. Hong Kong needs a law to safeguard national security. This is the principle established in the Basic Law. Article 23 of the Basic Law stipulates that the Hong Kong Special Administrative Region "shall enact laws on its own" for the sake of national security. But there had been a vacuum in r

Confirmed that's near the bottom; one line to remove.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [108]:
# find() returns a single tag, so read its text directly instead of looping over its children
day_pub = bsobj.find('div',{'class':'span8 text-left'}).get_text().strip()

In [109]:
day_pub

'Source:Global Times Published: 2020/6/30 23:55:07'

In [110]:
# Extract the date with a pattern instead of hard-coding this article's timestamp
day_pub = re.search(r'\d{4}/\d{1,2}/\d{1,2}', day_pub).group()
day_pub = datetime.strptime(day_pub, '%Y/%m/%d').date()
day_pub

datetime.date(2020, 6, 30)
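
For reuse across articles, the same cleanup can be wrapped in a helper that returns `None` when no date is found. This is a minimal sketch, assuming the byline always carries a `YYYY/M/D` date as it does here; `DATE_PATTERN` and `parse_published_date` are hypothetical names, not part of the notebook.

```python
import re
from datetime import datetime

# Matches a date such as 2020/6/30 anywhere in the byline text.
DATE_PATTERN = re.compile(r'\d{4}/\d{1,2}/\d{1,2}')


def parse_published_date(byline):
    """Return the publication date found in a byline, or None if absent."""
    match = DATE_PATTERN.search(byline)
    if match is None:
        return None
    return datetime.strptime(match.group(), '%Y/%m/%d').date()


print(parse_published_date('Source:Global Times Published: 2020/6/30 23:55:07'))
```

Because the pattern ignores everything around the date, neither the source prefix nor the time of day has to be removed by hand.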

In [111]:
df_date = pd.DataFrame([day_pub])

In [112]:
type(df_date)

pandas.core.frame.DataFrame

In [113]:
df_date

| | 0 |
|---|---|
| 0 | 2020-06-30 |


Now the title.

In [114]:
# The page has a single <h3>, so take it directly
title_list = bsobj.find("h3").text.strip()

In [115]:
df_title = pd.DataFrame([title_list])

In [116]:
df_title

| | 0 |
|---|---|
| 0 | National Security Law to protect HK democracy... |


In [117]:
type(df_title)

pandas.core.frame.DataFrame

The country, source, and file name are added manually.

In [118]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'Global Times'
df_source = pd.DataFrame([source])
file_name = 'globaltimes_8'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [119]:
news1 = []
for news in bsobj.findAll('div',{'class':'span12 row-content'}):
    news1.append(news.text.strip())

In [120]:
df_news = pd.DataFrame()

In [121]:
df_news['article_body'] = news1

In [122]:
df_news.head(2)

| | article_body |
|---|---|
| 0 | Hong Kong citizens on Tuesday gather to suppor... |


In [123]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [124]:
df_news = df_news.article_body[0]

In [125]:
# df_news is a plain string here, so replace() matches literally (not as a regex)
df_news = df_news.replace('\\','')

In [126]:
df_news = pd.DataFrame([df_news])

In [127]:
type(df_news)

pandas.core.frame.DataFrame

In [128]:
df_news.columns = ['Article']

In [129]:
df_news.head()

| | Article |
|---|---|
| 0 | Hong Kong citizens on Tuesday gather to suppor... |
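
The cleaning in this part can also be done by one small function on the raw string, before it goes into a DataFrame. The sketch below is not the notebook's own method: it removes backslashes with a literal `str.replace` and, as an added assumption, collapses repeated whitespace. `clean_article_text` and the sample text are hypothetical.

```python
import re


def clean_article_text(text):
    """Remove backslashes and collapse runs of whitespace into single spaces."""
    text = text.replace('\\', '')     # literal match: drops every backslash
    text = re.sub(r'\s+', ' ', text)  # newlines, tabs, repeated spaces -> one space
    return text.strip()


# Made-up sample containing a stray backslash and uneven spacing.
sample = 'Hong Kong \\needs a law\n\n to   safeguard national security. '
print(clean_article_text(sample))
```

Note that `str.replace` on a Python string never interprets its argument as a regular expression; pattern-based cleanup needs `re.sub` or the pandas `.str.replace(..., regex=True)` accessor.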


**Bringing it together.**<a id='2.5_bit'></a>

In [130]:
df_8_globaltimes = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [131]:
df_8_globaltimes.columns = ['file_name','date','source','country','title','article']

In [132]:
df_8_globaltimes.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | globaltimes_8 | 2020-06-30 | Global Times | China | National Security Law to protect HK democracy... | Hong Kong citizens on Tuesday gather to suppor... |


**Saving**<a id='2.6_save'></a>

In [133]:
cd

C:\Users\rands


Saving it to CSV.

In [134]:
# index=False leaves out the DataFrame index, so the CSV holds only the data columns
df_8_globaltimes.to_csv('./_Capstone_Two_NLP/data/_news/globaltimes_8.csv', index=False)

print('Complete')

Complete
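
`to_csv` does not create missing folders, so the save fails if the target directory is absent. As an alternative sketch, the same single-row record can be written with the standard library's `csv` module after creating the folder with `pathlib`. This swaps pandas for `csv.DictWriter` purely for illustration; `save_record` is a hypothetical helper, and the demo writes to a temporary directory in place of the real data folder.

```python
import csv
import tempfile
from pathlib import Path


def save_record(record, path):
    """Write one dict as a single-row CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open('w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(record))
        writer.writeheader()
        writer.writerow(record)
    return path


# Values from the row built above; only the article's first sentence is kept here.
record = {
    'file_name': 'globaltimes_8',
    'date': '2020-06-30',
    'source': 'Global Times',
    'country': 'China',
    'title': 'National Security Law to protect HK democracy, freedom: Global Times editorial',
    'article': 'Hong Kong citizens on Tuesday gather to support the National Security Law for Hong Kong.',
}

with tempfile.TemporaryDirectory() as tmp:
    out = save_record(record, Path(tmp) / 'data' / '_news' / 'globaltimes_8.csv')
    print(out.read_text(encoding='utf-8').splitlines()[0])
```

The same `mkdir(parents=True, exist_ok=True)` call can be placed before `df_8_globaltimes.to_csv(...)` to keep the pandas workflow unchanged.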
