# Scraping Minutes after new law from BBC

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Bringing It Together](#2.5_bit)
* [2.8 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [132]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [133]:
#the below needs to be reviewed for all websites; notably the time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Package versions, for reference.

In [134]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [135]:
# The URL has no placeholders, so no .format() call is needed
url = 'https://www.bbc.com/news/world-asia-china-53231158'
url

'https://www.bbc.com/news/world-asia-china-53231158'

In [136]:
html = requests.get(url)

In [137]:
bsobj = soup(html.content,'lxml')
# bsobj
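
The request above is parsed without checking whether the server answered successfully. A minimal sketch of a more defensive version follows; the helper name `fetch_soup`, the 10-second timeout, and the `parser` argument are illustrative assumptions, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, parser="lxml"):
    """Download a page and parse it, raising if the request fails.

    `fetch_soup`, the timeout value and the `parser` argument are
    hypothetical choices made for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses instead of
    # silently parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, parser)
```

Calling `fetch_soup(url)` would then take the place of the separate `requests.get` and `soup(...)` cells.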

**Grab | Date**<a id='2.4_scrape_date'></a>

In [138]:
# Loop variable is named date_tag so it does not shadow the imported `date` class
for date_tag in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    print(date_tag.text.strip())

30 June 2020


Correct.
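
Class names such as `ssrcss-8g95ls-MetadataSnippet ecn1o5v2` look auto-generated and may change between site builds. One possible way to make the lookup less brittle is to match only the readable part of the class name with a CSS attribute selector. The HTML below is a made-up stand-in for the page, so the real BBC markup may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page markup; the real BBC HTML may differ.
sample_html = """
<div>
  <span class="ssrcss-8g95ls-MetadataSnippet ecn1o5v2">
    <time>30 June 2020</time>
  </span>
</div>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# [class*="..."] matches elements whose class attribute contains the substring.
date_tag = sample.select_one('span[class*="MetadataSnippet"]')

# Guard against a missing element instead of failing on None.
date_text = date_tag.get_text(strip=True) if date_tag is not None else None
print(date_text)
```

The same selector could be tried against `bsobj`, on the assumption that the `MetadataSnippet` part of the class name stays stable.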

**Grab | Header**<a id='2.4_scrape_header'></a>

In [139]:
for title in bsobj.findAll("h1"):
    # title.text is already a string, so no format() call is needed
    print(title.text)

Hong Kong security law: Minutes after new law, pro-democracy voices quit


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [140]:
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    print(news.text.strip())

On Tuesday morning, the news started to break from Beijing: China had passed a new security law in Hong Kong.
The law criminalises any act of secession, subversion, terrorism or collusion with foreign forces.
And within minutes, the effect was obvious. Pro-democracy activists in Hong Kong began to quit, fearful of the new law, and the punishment it allows.
Here is some of the reaction from them, other governments, and campaign groups.
Secretary-general and founding member of pro-democracy group Demosisto, and key figure in 2014 Umbrella movement
"It [the law] marks the end of Hong Kong that the world knew before," said Mr Wong, after announcing he was quitting Demosisto.
"From now on, Hong Kong enters a new era of reign of terror, just like Taiwan's White Terror, with arbitrary prosecutions, black jails, secret trials, forced confessions, media clampdowns and political censorship.
"With sweeping powers and ill-defined law, the city will turn into a secret police state. Hong Kong protes

The last line of the output needs to be removed during cleaning.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a date object.

In [141]:
for day in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    day_pub = day.text.strip()

In [142]:
day_pub

'30 June 2020'

In [143]:
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Replace the month name with its number to match the datetime format
day_pub = re.sub('June','06',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%d-%m-%Y').date()

In [144]:
day_pub

datetime.date(2020, 6, 30)
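
The substitution above only handles articles published in June. As a sketch of a more general approach, `strptime` can read the month name directly through the `%B` directive. Note that `%B` depends on the locale; this assumes month names are in English.

```python
from datetime import datetime

raw_date = "30 June 2020"  # string as scraped from the page

# %d = day of month, %B = full month name, %Y = four-digit year.
# %B is locale-dependent; an English locale is assumed here.
published = datetime.strptime(raw_date, "%d %B %Y").date()
print(published)  # 2020-06-30
```

With this format string, no `re.sub` calls are needed before parsing.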

In [145]:
df_date = pd.DataFrame([day_pub])

In [146]:
type(df_date)

pandas.core.frame.DataFrame

In [147]:
df_date

Unnamed: 0,0
0,2020-06-30


Now the title.

In [148]:
for title_s in bsobj.findAll("h1"):
    # Despite its name, title_list holds a single string (the last <h1> text)
    title_list = title_s.text

In [149]:
df_title = pd.DataFrame([title_list])

In [150]:
df_title

Unnamed: 0,0
0,"Hong Kong security law: Minutes after new law,..."


In [151]:
type(df_title)

pandas.core.frame.DataFrame

These three items are added manually.

In [152]:
country = 'UK'
df_country = pd.DataFrame([country])
source = 'BBC'
df_source = pd.DataFrame([source])
file_name = 'bbc_19'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [153]:
news1 = []
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    news1.append(news.text.strip())

In [154]:
type(news1)

list

In [155]:
len(news1)

27

In [156]:
del news1[-1]

In [157]:
len(news1)

26

Confirming that the last element was removed.

In [158]:
news1

['On Tuesday morning, the news started to break from Beijing: China had passed a new security law in Hong Kong.',
 'The law criminalises any act of secession, subversion, terrorism or collusion with foreign forces.',
 'And within minutes, the effect was obvious. Pro-democracy activists in Hong Kong began to quit, fearful of the new law, and the punishment it allows.',
 'Here is some of the reaction from them, other governments, and campaign groups.',
 'Secretary-general and founding member of pro-democracy group Demosisto, and key figure in 2014 Umbrella movement',
 '"It [the law] marks the end of Hong Kong that the world knew before," said Mr Wong, after announcing he was quitting Demosisto.',
 '"From now on, Hong Kong enters a new era of reign of terror, just like Taiwan\'s White Terror, with arbitrary prosecutions, black jails, secret trials, forced confessions, media clampdowns and political censorship.',
 '"With sweeping powers and ill-defined law, the city will turn into a secret

Confirmed.

In [159]:
df_news = pd.DataFrame()

In [160]:
df_news['article_body'] = news1

In [161]:
df_news.head(2)

Unnamed: 0,article_body
0,"On Tuesday morning, the news started to break ..."
1,"The law criminalises any act of secession, sub..."


In [162]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [163]:
df_news = df_news.article_body[0]

In [164]:
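# str.replace matches literally (no regex), so r'\\?' would only match the
# exact characters \\? in the text. Strip any literal backslashes instead.
# The \' shown in the list output above is only repr escaping.
df_news = df_news.replace('\\','')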

In [165]:
df_news = pd.DataFrame([df_news])

In [166]:
type(df_news)

pandas.core.frame.DataFrame

In [167]:
df_news.columns = ['Article']

In [168]:
df_news.head()

Unnamed: 0,Article
0,"On Tuesday morning, the news started to break ..."
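
The cells above collapse the list of paragraphs into one string by way of a DataFrame column. As a sketch, plain `str.join` gives the same kind of result. The short list below is a hypothetical stand-in for `news1`, with a placeholder for the unwanted trailing element.

```python
# Hypothetical stand-in for the scraped paragraph list; the last item
# mimics the unwanted trailing element.
paragraphs = [
    "First paragraph.",
    "Second paragraph.",
    "TRAILING ELEMENT TO DROP",
]

# Slice off the last element, then join the rest with single spaces.
article_text = " ".join(paragraphs[:-1])
print(article_text)
```

Slicing with `[:-1]` leaves the original list untouched, unlike `del`, so the cell can be re-run without removing an extra paragraph each time.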


**Bringing it together.**<a id='2.5_bit'></a>

In [169]:
df_19_bbc = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [170]:
df_19_bbc.columns = ['file_name','date','source','country','title','article']

In [171]:
df_19_bbc.head()

Unnamed: 0,file_name,date,source,country,title,article
0,bbc_19,2020-06-30,BBC,UK,"Hong Kong security law: Minutes after new law,...","On Tuesday morning, the news started to break ..."
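
As an alternative sketch, the same one-row table can be built in a single step from a dictionary, which avoids creating and concatenating six separate DataFrames. The values mirror those used in this notebook. Only the first sentence of the article is included here for brevity, and `row` and `df_row` are illustrative names.

```python
from datetime import date

import pandas as pd

# Values mirror the ones used in this notebook; the article text is
# cut to its first sentence for illustration.
row = {
    "file_name": "bbc_19",
    "date": date(2020, 6, 30),
    "source": "BBC",
    "country": "UK",
    "title": "Hong Kong security law: Minutes after new law, pro-democracy voices quit",
    "article": "On Tuesday morning, the news started to break from Beijing: "
               "China had passed a new security law in Hong Kong.",
}

# A list containing one dict yields a one-row DataFrame whose columns
# follow the dict's key order.
df_row = pd.DataFrame([row])
print(df_row.shape)
```

Because the column names come from the dictionary keys, a separate `columns = [...]` assignment is not needed.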


**Saving**<a id='2.6_save'></a>

Saving it to a CSV file.

In [172]:
cd

C:\Users\rands


In [173]:
# index=False so the DataFrame index is not written as an extra column in the CSV
df_19_bbc.to_csv('./_Capstone_Two_NLP/data/_news/bbc_19.csv', index=False)

print('Complete')

Complete
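
The save step relies on the working directory and on the target folder already existing; `to_csv` raises an error when the folder is missing. A minimal sketch that creates the folder first is shown below. A temporary directory stands in for the project folder so the example runs anywhere, and `df_example`, `out_dir`, and `out_path` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_example = pd.DataFrame([{"file_name": "bbc_19", "source": "BBC"}])

# A temporary directory stands in for the project folder (assumption);
# swap in the real data directory when adapting this.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "_news"
    # Create the folder, and any missing parents, before writing.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "bbc_19.csv"
    df_example.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip.
    restored = pd.read_csv(out_path)

print(restored)
```

Building the path with `pathlib` also keeps the code independent of the platform's path separator.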
