# Scraping Pro-democracy books from BBC

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [69]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [70]:
# Review the block below for each website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

For reference.

In [71]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [72]:
url = 'https://www.bbc.com/news/world-asia-china-53296810'
url

'https://www.bbc.com/news/world-asia-china-53296810'

In [73]:
html = requests.get(url)

In [74]:
bsobj = soup(html.content,'lxml')
# bsobj
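The request above assumes the page loads successfully. A minimal sketch of a more defensive fetch follows. The helper name `fetch_html`, its `session` parameter, and the 10-second timeout are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the raw page bytes, raising on HTTP errors.

    `fetch_html`, `session`, and the 10-second default are illustrative
    assumptions, not part of the original notebook.
    """
    session = session or requests.Session()
    response = session.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.content
```

With this helper, the parsing cell could read `bsobj = soup(fetch_html(url), 'lxml')`.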

**Grab | Date**<a id='2.4_scrape_date'></a>

In [75]:
# Use a name other than `date` so the datetime.date import above is not shadowed
for date_tag in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    print(date_tag.text.strip())

5 July 2020


Correct.
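The class string used above looks auto-generated, so it may change when the site is rebuilt. A hedged sketch of matching on only part of the class name is shown here. The sample HTML is a hypothetical stand-in for the real page, and the idea that `MetadataSnippet` is the stable part of the class is an assumption.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the BBC markup; the real page is much larger.
sample_html = """
<span class="ssrcss-8g95ls-MetadataSnippet ecn1o5v2">
  <time datetime="2020-07-05">5 July 2020</time>
</span>
"""

sample = BeautifulSoup(sample_html, "html.parser")

# A compiled regex passed as class_ is searched against each CSS class,
# so the hashed prefix and suffix do not need to be spelled out.
span = sample.find("span", class_=re.compile("MetadataSnippet"))

# find() returns None when nothing matches, so guard before reading text.
published = span.get_text(strip=True) if span is not None else None
print(published)
```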

**Grab | Header**<a id='2.4_scrape_header'></a>

In [76]:
for title in bsobj.findAll("h1"):
    print(title.text)

Hong Kong security law: Pro-democracy books pulled from libraries


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [77]:
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    print(news.text.strip())

Books by pro-democracy figures have been removed from public libraries in Hong Kong in the wake of a controversial new security law.
The works will be reviewed to see if they violate the new law, the authority which runs the libraries said.
The legislation targets secession, subversion and terrorism with punishments of up to life in prison.
Opponents say it erodes the territory's freedoms as a semi-autonomous region of China. Beijing rejects this.
Hong Kong's sovereignty was handed back to China by Britain in 1997 and certain rights were supposed to be guaranteed for at least 50 years under the "one country, two systems" agreement.
Since the security law came into effect on Tuesday, several leading pro-democracy activists have stepped down from their roles. One of them - one-time student leader and local legislator Nathan Law - has fled the territory.
At least nine books have become unavailable or marked as "under review", according to the South China Morning Post newspaper. They inclu

`Approximately 1` line needs to be removed from the end of the scraped content.

It should end at `There are also concerns over online freedom as internet providers might have to hand over data if requested by police.`

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [78]:
for day in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    day_pub = day.text.strip()

In [79]:
day_pub

'5 July 2020'

In [80]:
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Replace the month name with its number to match the datetime format
day_pub = re.sub('July','07',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%d-%m-%Y').date()

In [81]:
day_pub

datetime.date(2020, 7, 5)
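The substitution above only handles articles published in July. As a sketch of a more general approach, `strptime` can parse the month name directly with `%B`. This assumes the month names are in English and the process locale parses English month names (the default in most setups). The variable names are illustrative.

```python
from datetime import datetime

day_text = "5 July 2020"

# %d = day of month, %B = full month name, %Y = four-digit year
parsed = datetime.strptime(day_text, "%d %B %Y").date()
print(parsed)
```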

In [82]:
df_date = pd.DataFrame([day_pub])

In [83]:
type(df_date)

pandas.core.frame.DataFrame

In [84]:
df_date

|   | 0 |
|---|---|
| 0 | 2020-07-05 |


Now the title.

In [85]:
for title_s in bsobj.findAll("h1"):
    title_list = title_s.text

In [86]:
df_title = pd.DataFrame([title_list])

In [87]:
df_title

|   | 0 |
|---|---|
| 0 | Hong Kong security law: Pro-democracy books pu... |


In [88]:
type(df_title)

pandas.core.frame.DataFrame

The country and source are added manually.

In [89]:
country = 'UK'
df_country = pd.DataFrame([country])
source = 'BBC'
df_source = pd.DataFrame([source])

Finally, the news.

In [90]:
news1 = []
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    news1.append(news.text.strip())

In [91]:
news1

['Books by pro-democracy figures have been removed from public libraries in Hong Kong in the wake of a controversial new security law.',
 'The works will be reviewed to see if they violate the new law, the authority which runs the libraries said.',
 'The legislation targets secession, subversion and terrorism with punishments of up to life in prison.',
 "Opponents say it erodes the territory's freedoms as a semi-autonomous region of China. Beijing rejects this.",
 'Hong Kong\'s sovereignty was handed back to China by Britain in 1997 and certain rights were supposed to be guaranteed for at least 50 years under the "one country, two systems" agreement.',
 'Since the security law came into effect on Tuesday, several leading pro-democracy activists have stepped down from their roles. One of them - one-time student leader and local legislator Nathan Law - has fled the territory.',
 'At least nine books have become unavailable or marked as "under review", according to the South China Morning

In [92]:
len(news1)

23

In [93]:
news1[-1]

"THE TEXT: What it is and why Hong Kong is worriedWHAT COULD HAPPEN: Life sentences for breaking the law and moreRESIDENTS REACT: 'End of one country, two systems'"

In [94]:
n = 1  # number of trailing non-article blocks to drop
news1 = news1[:-n]

In [95]:
len(news1)

22

In [96]:
news1[-1]

'There are also concerns over online freedom as internet providers might have to hand over data if requested by police.'

Complete.
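Dropping a fixed number of trailing items works for this page, but the count may differ on other articles. Since the expected final sentence is known here, a sketch of trimming by content instead is shown below. The helper `trim_after` and the short sample list are hypothetical.

```python
def trim_after(paragraphs, last_sentence):
    """Keep paragraphs up to and including the one equal to last_sentence."""
    try:
        end = paragraphs.index(last_sentence)
    except ValueError:
        # Marker not found: return an unchanged copy rather than guess.
        return list(paragraphs)
    return paragraphs[: end + 1]


# Hypothetical shortened stand-in for the scraped list.
sample = [
    "Books by pro-democracy figures have been removed.",
    "There are also concerns over online freedom.",
    "THE TEXT: What it is and why Hong Kong is worried",
]

trimmed = trim_after(sample, "There are also concerns over online freedom.")
print(len(trimmed))
```

If the marker sentence is missing, the list comes back untouched, which makes the mismatch easy to spot in a later length check.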

In [97]:
df_news = pd.DataFrame()

In [98]:
df_news['article_body'] = news1

In [99]:
df_news.head(2)

|   | article_body |
|---|---|
| 0 | Books by pro-democracy figures have been remov... |
| 1 | The works will be reviewed to see if they viol... |


In [100]:
df_news['article_body'] = df_news.article_body.str.cat(sep='')

In [101]:
df_news = df_news.article_body[0]

In [102]:
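# df_news is a plain string here, and str.replace matches literally (not as a regex),
# so strip any literal backslashes directly
df_news = df_news.replace('\\', '')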

In [103]:
df_news = pd.DataFrame([df_news])

In [104]:
type(df_news)

pandas.core.frame.DataFrame

In [105]:
df_news.columns = ['Article']

In [106]:
df_news.head()

|   | Article |
|---|---|
| 0 | Books by pro-democracy figures have been remov... |
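The cells above join the paragraphs with an empty separator, so the last word of one paragraph runs straight into the first word of the next. A minimal sketch of joining the list directly with a space is given here. Whether a space is the right separator for the downstream NLP step is an assumption. The two paragraphs are the first two from the scraped output.

```python
paragraphs = [
    "Books by pro-democracy figures have been removed from public libraries "
    "in Hong Kong in the wake of a controversial new security law.",
    "The works will be reviewed to see if they violate the new law, "
    "the authority which runs the libraries said.",
]

# Strip each paragraph, skip empty ones, and join with a single space.
article = " ".join(p.strip() for p in paragraphs if p.strip())
print(article[:60])
```

This produces one string without the intermediate DataFrame round trip.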


**Bringing it together.**<a id='2.5_bit'></a>

In [107]:
df_21_bbc = pd.concat([df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [108]:
df_21_bbc.columns = ['date','source','country','title','article']

In [109]:
df_21_bbc.head()

|   | date | source | country | title | article |
|---|---|---|---|---|---|
| 0 | 2020-07-05 | BBC | UK | Hong Kong security law: Pro-democracy books pu... | Books by pro-democracy figures have been remov... |
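As an alternative to concatenating five one-cell DataFrames, the same single-row frame can be sketched from one dictionary. The names `row` and `df_row` are illustrative, and the article value here is only the first scraped paragraph rather than the full text.

```python
from datetime import date

import pandas as pd

row = {
    "date": date(2020, 7, 5),
    "source": "BBC",
    "country": "UK",
    "title": "Hong Kong security law: Pro-democracy books pulled from libraries",
    # First paragraph only; the notebook stores the full joined article.
    "article": "Books by pro-democracy figures have been removed from public "
               "libraries in Hong Kong in the wake of a controversial new security law.",
}

# One dict per row; the columns argument fixes the column order explicitly.
df_row = pd.DataFrame([row], columns=["date", "source", "country", "title", "article"])
print(df_row.shape)
```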


**Saving**<a id='2.6_save'></a>

In [110]:
cd

C:\Users\rands


Saving it to CSV.

In [111]:
# index=False so that the DataFrame index is not written as an extra column in the CSV
df_21_bbc.to_csv('./_Capstone_Two_NLP/data/_news/bbc_21.csv', index=False)

print('Complete')

Complete
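The save above depends on the working directory and on the target folder already existing. A hedged sketch that creates the folder first is shown below. The helper `save_csv` is hypothetical, and the demo writes to a temporary directory rather than the project path.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path as CSV, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Demo in a throwaway directory so nothing is written to the project tree.
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"source": ["BBC"], "country": ["UK"]})
    out = save_csv(demo, Path(tmp) / "data" / "_news" / "demo.csv")
    round_trip = pd.read_csv(out)

print(round_trip.shape)
```

Reading the file back, as in the demo, is a quick check that the columns survived the round trip.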
