# Scraping Life sentences from BBC

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Bringing It Together](#2.5_bit)
* [2.8 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [43]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [44]:
# NOTE: review the code below for each website scraped, especially the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

For reference, the environment used to run this notebook:

In [45]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [46]:
url = 'https://www.bbc.com/news/world-asia-china-53238004'
url

'https://www.bbc.com/news/world-asia-china-53238004'

In [47]:
html = requests.get(url)

In [48]:
bsobj = soup(html.content,'lxml')
# bsobj
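The request above has no timeout and does not check the HTTP status, so a failed download would be parsed as if it were the article. A minimal sketch of a more defensive fetch follows; the helper name `fetch_soup` and its default arguments are hypothetical and not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download a page and return it as a BeautifulSoup object.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, parser)
```

With this helper, the two cells above would reduce to `bsobj = fetch_soup(url, parser='lxml')`.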

**Grab | Date**<a id='2.4_scrape_date'></a>

In [49]:
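# find() returns a single tag, so read its text directly (this also avoids shadowing the imported `date`)
date_tag = bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'})
print(date_tag.text.strip())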

30 June 2020


Correct: this matches the publication date shown on the article.
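The class name used in the lookup above looks auto-generated, so it may change when the site is redeployed, in which case `find` returns `None`. The sketch below shows one way to guard against that. The `first_text` helper and the sample markup are assumptions for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page
sample_html = (
    '<html><body>'
    '<span class="ssrcss-8g95ls-MetadataSnippet ecn1o5v2">30 June 2020</span>'
    '</body></html>'
)
sample_soup = BeautifulSoup(sample_html, "html.parser")


def first_text(soup_obj, name, attrs=None):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = soup_obj.find(name, attrs or {})
    return tag.get_text(strip=True) if tag is not None else None


found = first_text(sample_soup, "span", {"class": "ssrcss-8g95ls-MetadataSnippet ecn1o5v2"})
missing = first_text(sample_soup, "span", {"class": "no-such-class"})
print(found)
print(missing)
```

Checking for `None` before using the result makes a changed class name show up as a clear, early failure rather than an obscure error later in the notebook.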

**Grab | Header**<a id='2.4_scrape_header'></a>

In [50]:
for title in bsobj.find_all("h1"):
    print(title.text)

Hong Kong security law: Life sentences for breaking China-imposed law


Confirmed: that is the article's title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [51]:
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    print(news.text.strip())

People in Hong Kong could face life in jail for breaking a controversial and sweeping new security law imposed by China.
The legislation came into force on Tuesday but the full text was only revealed hours afterwards.
It was brought in by Beijing following increasing unrest and a widening pro-democracy movement.
Critics say the new law effectively curtails protest and undermines Hong Kong's freedoms.
The territory was handed back to China from British control in 1997, but under a unique agreement supposed to protect certain freedoms that people in mainland China do not enjoy - including freedom of speech.
Hong Kong's leader, Carrie Lam, defended the law, saying it filled a "gaping hole" in national security.
Details have been closely guarded and the Beijing-backed politician admitted she had not seen the draft before commenting.
Why are there protests in Hong Kong? All the context you needDo protests ever work in China?
But Ted Hui, an opposition legislator, told the BBC: "Our rights a

Confirmed: the output runs through to the bottom of the article.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [52]:
# find() returns a single tag, so take its text directly
day_pub = bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}).text.strip()

In [53]:
day_pub

'30 June 2020'

In [54]:
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Replace the month name with its number (hard-coded for June; other months need their own substitution)
day_pub = re.sub('June','06',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%d-%m-%Y').date()

In [55]:
day_pub

datetime.date(2020, 6, 30)
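The substitution above only handles articles published in June. As a sketch of a more general approach, `strptime` can parse the month name directly with the `%B` directive. The function name `parse_long_date` is hypothetical, and the sketch assumes an English locale, since `%B` is locale-dependent.

```python
from datetime import datetime


def parse_long_date(text):
    """Parse a date written like '30 June 2020' into a datetime.date."""
    # %B matches the full month name in the current locale (English assumed here)
    return datetime.strptime(text.strip(), "%d %B %Y").date()


print(parse_long_date("30 June 2020"))
```

This removes the need for the two `re.sub` calls, at the cost of depending on the locale of the machine running the notebook.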

In [56]:
df_date = pd.DataFrame([day_pub])

In [57]:
type(df_date)

pandas.core.frame.DataFrame

In [58]:
df_date

|   | 0 |
|---|---|
| 0 | 2020-06-30 |


Now the title.

In [59]:
for title_s in bsobj.find_all("h1"):
    title_list = title_s.text  # a plain string, despite the name

In [60]:
df_title = pd.DataFrame([title_list])

In [61]:
df_title

|   | 0 |
|---|---|
| 0 | Hong Kong security law: Life sentences for bre... |


In [62]:
type(df_title)

pandas.core.frame.DataFrame

The country and source are not scraped; they are added manually.

In [63]:
country = 'UK'
df_country = pd.DataFrame([country])
source = 'BBC'
df_source = pd.DataFrame([source])

Finally, the news.

In [64]:
news1 = []
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    news1.append(news.text.strip())

In [65]:
df_news = pd.DataFrame()

In [66]:
df_news['article_body'] = news1

In [67]:
df_news.head(2)

|   | article_body |
|---|---|
| 0 | People in Hong Kong could face life in jail fo... |
| 1 | The legislation came into force on Tuesday but... |


In [68]:
# Join all paragraphs into one string; a space separator keeps sentences from running together
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [69]:
df_news = df_news.article_body[0]

In [70]:
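# df_news is a plain string here, so str.replace matches literally (not as a regex); strip stray backslashes
df_news = df_news.replace('\\','')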

In [71]:
df_news = pd.DataFrame([df_news])

In [72]:
type(df_news)

pandas.core.frame.DataFrame

In [73]:
df_news.columns = ['Article']

In [74]:
df_news.head()

|   | Article |
|---|---|
| 0 | People in Hong Kong could face life in jail fo... |
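The cells above pass the paragraphs through a DataFrame column only to collapse them back into a single string. A plain-Python sketch of the same step is shown below; `join_paragraphs` and the sample list are hypothetical and stand in for `news1`.

```python
def join_paragraphs(paragraphs, sep=" "):
    """Join scraped paragraphs into one string, skipping empty entries."""
    cleaned = (p.strip() for p in paragraphs)
    return sep.join(p for p in cleaned if p)


# Placeholder paragraphs standing in for the scraped list
sample = ["  First paragraph.", "", "Second paragraph.  "]
article_text = join_paragraphs(sample)
print(article_text)
```

The resulting string could then be wrapped with `pd.DataFrame([article_text], columns=['Article'])` to produce the same one-row frame.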


**Bringing it together.**<a id='2.5_bit'></a>

In [75]:
df_22_bbc = pd.concat([df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [76]:
df_22_bbc.columns = ['date','source','country','title','article']

In [77]:
df_22_bbc.head()

|   | date | source | country | title | article |
|---|---|---|---|---|---|
| 0 | 2020-06-30 | BBC | UK | Hong Kong security law: Life sentences for bre... | People in Hong Kong could face life in jail fo... |
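As an alternative to concatenating five single-cell DataFrames, the row can be built in one step from a dictionary. This is only a sketch: the values below are placeholders standing in for `day_pub`, `title_list` and the joined article text, and `df_example` is a hypothetical name.

```python
from datetime import date

import pandas as pd

# Placeholder values standing in for the scraped fields
record = {
    "date": date(2020, 6, 30),
    "source": "BBC",
    "country": "UK",
    "title": "Example title",
    "article": "Example article text.",
}

# A list holding one dict yields a one-row DataFrame with named columns
df_example = pd.DataFrame([record])
print(df_example.columns.tolist())
```

Because the column names come from the dictionary keys, the separate step of assigning `columns` is no longer needed.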


**Saving**<a id='2.6_save'></a>

In [78]:
cd

C:\Users\rands


Saving it to CSV.

In [79]:
# index=False keeps the DataFrame index out of the CSV file
df_22_bbc.to_csv('./_Capstone_Two_NLP/data/_news/bbc_22.csv', index=False)

print('Complete')

Complete
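`to_csv` fails if the target folder does not exist, and the relative path above depends on the current working directory. The sketch below creates the folder first and reads the file back as a check. It writes to a temporary directory, and the file name `example.csv` and the sample frame are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_example = pd.DataFrame([{"source": "BBC", "country": "UK"}])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "data" / "_news" / "example.csv"
    # Create the missing parent folders so to_csv does not fail
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df_example.to_csv(out_path, index=False)
    # Read the file back to confirm that no index column was written
    df_check = pd.read_csv(out_path)

print(df_check.columns.tolist())
```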
