# Scraping "Beijing security office opens in Hong Kong" from BBC

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [120]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [121]:
# Review the block below for each website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [122]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [123]:
url = 'https://www.bbc.com/news/world-asia-china-53330650'
url

'https://www.bbc.com/news/world-asia-china-53330650'

In [124]:
html = requests.get(url)

In [125]:
bsobj = soup(html.content,'lxml')
# bsobj

**Grab | Date**<a id='2.4_scrape_date'></a>

In [126]:
# date_part (not date) avoids shadowing datetime.date imported above
for date_part in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    print(date_part.text.strip())

8 July 2020


Correct.
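
The class string used above (`ssrcss-8g95ls-MetadataSnippet ecn1o5v2`) looks auto-generated and may change when the site is rebuilt. A minimal sketch of a less brittle lookup, assuming only the `MetadataSnippet` fragment stays stable: Beautiful Soup accepts a compiled regular expression for `class_` and tests it against each CSS class of a tag. The inline HTML and the helper name `find_text_by_class_fragment` are hypothetical stand-ins, not part of the original notebook.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; it only reuses the class
# string seen in the cell above.
sample_html = '<span class="ssrcss-8g95ls-MetadataSnippet ecn1o5v2">8 July 2020</span>'
page = BeautifulSoup(sample_html, "html.parser")


def find_text_by_class_fragment(page, tag_name, fragment):
    """Return the stripped text of the first tag whose class contains fragment."""
    # A compiled regex passed as class_ is searched against each class of a tag.
    tag = page.find(tag_name, class_=re.compile(re.escape(fragment)))
    return tag.get_text(strip=True) if tag is not None else None


date_text = find_text_by_class_fragment(page, "span", "MetadataSnippet")
print(date_text)  # 8 July 2020
```

Returning `None` when nothing matches makes a changed page layout visible immediately instead of raising an error deep inside a loop.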

**Grab | Header**<a id='2.4_scrape_header'></a>

In [127]:
for title in bsobj.findAll("h1"):
    print(title.text)

Hong Kong security law: Beijing security office opens in Hong Kong


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [128]:
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    print(news.text.strip())

A new national security office has been officially opened in Hong Kong, placing mainland Chinese agents in the heart of the territory for the first time.
The office is one element of a sweeping new law which outlaws criticism of China's government.
Hong Kong was, until the law was passed, the only part of China not subject to such policies.
The law has caused alarm in Hong Kong but officials say it will restore stability after violent protests.
Chief Executive Carrie Lam said on Tuesday that it was "actually relatively mild as far as national security laws are concerned" and would enable Hong Kongers to "exercise their rights and freedoms without being intimidated and attacked".
The temporary base of the new office is a hotel in Causeway Bay, the commercial district next to Victoria Park, which had long been the focal point of pro-democracy protest marches and rallies in Hong Kong.
An opening ceremony was held on Wednesday morning, attended by dignitaries including Chief Executive Carr

Approximately `4` trailing entries need to be removed from the scraped content; this is done below.

The article should end at `TikTok has said it is pulling out of Hong Kong entirely`.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, the date is converted from a string to a `date` object.

In [129]:
for day in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    day_pub = day.text.strip()

In [130]:
day_pub

'8 July 2020'

In [131]:
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Replace the month name with its number to match the datetime format
day_pub = re.sub('July','07',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%d-%m-%Y').date()

In [132]:
day_pub

datetime.date(2020, 7, 8)
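
The substitution above hard-codes `July`, so it would need editing for an article published in any other month. A minimal sketch of an alternative, assuming the page always shows dates in the `8 July 2020` style and that the interpreter runs with an English locale (the `%B` month name is locale-dependent). The function name `parse_article_date` is hypothetical.

```python
from datetime import date, datetime


def parse_article_date(text: str) -> date:
    """Parse a date such as '8 July 2020' into a datetime.date."""
    # %d accepts one- or two-digit days; %B is the full month name.
    return datetime.strptime(text.strip(), "%d %B %Y").date()


print(parse_article_date("8 July 2020"))  # 2020-07-08
```

A string in any other format raises `ValueError`, which is a useful signal that the date format of the site needs to be reviewed.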

In [133]:
df_date = pd.DataFrame([day_pub])

In [134]:
type(df_date)

pandas.core.frame.DataFrame

In [135]:
df_date

| | 0 |
|---|---|
| 0 | 2020-07-08 |


Now the title.

In [136]:
for title_s in bsobj.findAll("h1"):
    title_list = title_s.text

In [137]:
df_title = pd.DataFrame([title_list])

In [138]:
df_title

| | 0 |
|---|---|
| 0 | Hong Kong security law: Beijing security offic... |


In [139]:
type(df_title)

pandas.core.frame.DataFrame

These items are manually added.

In [140]:
country = 'UK'
df_country = pd.DataFrame([country])
source = 'BBC'
df_source = pd.DataFrame([source])
file_name = 'bbc_20'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [141]:
news1 = []
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    news1.append(news.text.strip())

In [142]:
news1

['A new national security office has been officially opened in Hong Kong, placing mainland Chinese agents in the heart of the territory for the first time.',
 "The office is one element of a sweeping new law which outlaws criticism of China's government.",
 'Hong Kong was, until the law was passed, the only part of China not subject to such policies.',
 'The law has caused alarm in Hong Kong but officials say it will restore stability after violent protests.',
 'Chief Executive Carrie Lam said on Tuesday that it was "actually relatively mild as far as national security laws are concerned" and would enable Hong Kongers to "exercise their rights and freedoms without being intimidated and attacked".',
 'The temporary base of the new office is a hotel in Causeway Bay, the commercial district next to Victoria Park, which had long been the focal point of pro-democracy protest marches and rallies in Hong Kong.',
 'An opening ceremony was held on Wednesday morning, attended by dignitaries incl

In [143]:
len(news1)

19

In [144]:
n = 4
news1 = news1[:len(news1)-n]

In [145]:
len(news1)

15

In [146]:
news1[-1]

'Several social media companies have said they will stop co-operating with the Hong Kong police on requests for user data over concerns about how it will be used, while  TikTok has said it is pulling out of Hong Kong entirely.'

Complete.
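
Dropping a fixed `n = 4` entries works for this page but depends on counting the trailing items by hand. A minimal sketch of a marker-based alternative, assuming the closing sentence of the article is known in advance: it keeps everything up to and including the first paragraph that contains the marker. The helper name `trim_after_marker` and the sample list are hypothetical.

```python
def trim_after_marker(paragraphs, marker):
    """Return paragraphs up to and including the first one containing marker."""
    for index, paragraph in enumerate(paragraphs):
        if marker in paragraph:
            return paragraphs[: index + 1]
    # Marker not found: fail loudly rather than silently keep trailing items.
    raise ValueError(f"marker not found: {marker!r}")


# Hypothetical, shortened sample; the last entry stands in for non-article text.
sample = [
    "A new national security office has been officially opened in Hong Kong.",
    "TikTok has said it is pulling out of Hong Kong entirely.",
    "Trailing entry that is not part of the article.",
]
trimmed = trim_after_marker(sample, "TikTok has said it is pulling out of Hong Kong entirely")
print(len(trimmed))  # 2
```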

In [147]:
df_news = pd.DataFrame()

In [148]:
df_news['article_body'] = news1

In [149]:
df_news.head(2)

| | article_body |
|---|---|
| 0 | A new national security office has been offici... |
| 1 | The office is one element of a sweeping new la... |


In [150]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [151]:
df_news = df_news.article_body[0]

In [152]:
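# df_news is a plain str here and str.replace matches literally (no regex),
# so stray backslashes are removed directly
df_news = df_news.replace('\\','')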

In [153]:
df_news = pd.DataFrame([df_news])

In [154]:
type(df_news)

pandas.core.frame.DataFrame

In [155]:
df_news.columns = ['Article']

In [156]:
df_news.head()

| | Article |
|---|---|
| 0 | A new national security office has been offici... |
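
The cells above pass the paragraph list through a DataFrame column only to concatenate it into one string. A minimal sketch of the same step in plain Python, assuming the paragraphs are already a list of strings such as `news1`. It also collapses repeated whitespace, such as the double space before `TikTok` in the last paragraph. The function name `join_paragraphs` is hypothetical.

```python
import re


def join_paragraphs(paragraphs):
    """Join paragraph strings into one article string separated by single spaces."""
    joined = " ".join(paragraphs)
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", joined).strip()


article_text = join_paragraphs(["First  paragraph.", " Second paragraph. "])
print(article_text)  # First paragraph. Second paragraph.
```

The resulting string can be wrapped with `pd.DataFrame([article_text])` exactly as in the cell above.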


**Bringing it together.**<a id='2.5_bit'></a>

In [157]:
df_20_bbc = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [158]:
df_20_bbc.columns = ['file_name','date','source','country','title','article']

In [159]:
df_20_bbc.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | bbc_20 | 2020-07-08 | BBC | UK | Hong Kong security law: Beijing security offic... | A new national security office has been offici... |
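
Concatenating six one-cell DataFrames and renaming the columns afterwards gives the right result, but the column order has to be kept in sync by hand. A minimal sketch of an alternative, assuming the same six fields: build one dictionary per article and let pandas create the single-row frame from it. The function name `build_article_frame` is hypothetical; the values reuse those shown above, with the article body shortened to its first sentence.

```python
from datetime import date

import pandas as pd


def build_article_frame(file_name, published, source, country, title, article):
    """Return a one-row DataFrame with the same columns as df_20_bbc."""
    record = {
        "file_name": file_name,
        "date": published,
        "source": source,
        "country": country,
        "title": title,
        "article": article,
    }
    # A list with one dict yields one row; the dict keys become the column names.
    return pd.DataFrame([record])


df_row = build_article_frame(
    "bbc_20",
    date(2020, 7, 8),
    "BBC",
    "UK",
    "Hong Kong security law: Beijing security office opens in Hong Kong",
    "A new national security office has been officially opened in Hong Kong, "
    "placing mainland Chinese agents in the heart of the territory for the first time.",
)
print(df_row.shape)  # (1, 6)
```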


**Saving**<a id='2.6_save'></a>

In [160]:
cd

C:\Users\rands


Saving it to CSV.

In [161]:
# index=False so that the DataFrame index is not written as an extra column
df_20_bbc.to_csv('./_Capstone_Two_NLP/data/_news/bbc_20.csv', index=False)

print('Complete')

Complete
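
The relative path passed to `to_csv` only resolves if the working directory is right and the target folder already exists. A minimal sketch of a more defensive save, assuming it is acceptable to create missing folders: the helper name `save_frame` is hypothetical, and the example writes to a temporary directory rather than the real project path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frame(df, path):
    """Write df to path as CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the DataFrame index out of the file.
    df.to_csv(path, index=False, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame([{"file_name": "bbc_20", "source": "BBC"}])
    out = save_frame(demo, Path(tmp) / "data" / "_news" / "bbc_20.csv")
    saved_text = out.read_text(encoding="utf-8")

print(saved_text)
```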
