# Scraping "Why it scares people" from the BBC

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Bringing It Together](#2.5_bit)
* [2.8 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [87]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [88]:
# Review the date handling below for each website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment details, for reference.

In [89]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [90]:
# The URL has no {} placeholder, so no .format() call is needed
url = 'https://www.bbc.com/news/world-asia-china-53256034'
url

'https://www.bbc.com/news/world-asia-china-53256034'

In [91]:
html = requests.get(url)

In [92]:
bsobj = soup(html.content,'lxml')
# bsobj
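
A bare `requests.get` call can hang indefinitely or return an error page that then gets parsed as if it were the article. The sketch below adds a timeout and a status check. The helper names `fetch_html` and `parse_html` and the 10-second default are assumptions for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return response.content


def parse_html(markup, parser="lxml"):
    """Parse markup into a BeautifulSoup tree; 'lxml' matches the cell above."""
    return BeautifulSoup(markup, parser)
```

With these in place, the two cells above could be written as `bsobj = parse_html(fetch_html(url))`.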

**Grab | Date**<a id='2.4_scrape_date'></a>

In [93]:
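# find() returns a single tag, so read its text directly; the name date_tag avoids shadowing the imported `date`
date_tag = bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'})
print(date_tag.text.strip())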

1 July 2020


Correct.
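
Class names such as `ssrcss-8g95ls-MetadataSnippet ecn1o5v2` look machine-generated, so they may change when the site is rebuilt (an assumption, not something the notebook verifies). A minimal sketch of a looser match, using a stand-in snippet of markup rather than the live page:

```python
import re
from bs4 import BeautifulSoup

# Stand-in markup for illustration only; the real page is far larger.
sample = '<span class="ssrcss-8g95ls-MetadataSnippet ecn1o5v2">1 July 2020</span>'
tree = BeautifulSoup(sample, "html.parser")

# Match any span with a class that contains "MetadataSnippet",
# ignoring the generated prefix and the second class.
date_tag = tree.find("span", class_=re.compile("MetadataSnippet"))

# find() returns None when nothing matches, so guard before reading the text.
date_text = date_tag.get_text(strip=True) if date_tag is not None else None
print(date_text)  # 1 July 2020
```

The `None` guard matters because a changed page layout would otherwise surface as an `AttributeError` rather than a clear missing-value signal.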

**Grab | Header**<a id='2.4_scrape_header'></a>

In [94]:
for title in bsobj.find_all("h1"):
    print(title.text)

Hong Kong's new security law: Why it scares people


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [95]:
for news in bsobj.find_all('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    print(news.text.strip())

China has introduced a new national security law for Hong Kong. The BBC's Michael Bristow takes a closer look at the detail, and what it will mean in practice.
Lawyers and legal experts have said China's national security law for Hong Kong will fundamentally change the territory's legal system.
It introduces new crimes with severe penalties - up to life in prison - and allows mainland security personnel to legally operate in Hong Kong with impunity.
The legislation gives Beijing extensive powers it has never had before to shape life in the territory far beyond the legal system.
Analysis of the law by NPC Observer, a team of legal experts from the United States and Hong Kong, identified what they consider a number of worrying aspects.
"Its criminal provisions are worded in such a broad manner as to encompass a swath of what has so far been considered protected speech," said a posting on its website.
Article 29 is perhaps an example of this broad wording.
It states that anyone who conspi

Confirmed that's the bottom.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [96]:
# find() returns a single tag, so no loop is needed
day_pub = bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}).text.strip()

In [97]:
day_pub

'1 July 2020'

In [98]:
# Parse day, full month name, and year in one step, so any month works, not only July
# Note: %B matches English month names only under an English (or default C) locale
day_pub = datetime.strptime(day_pub, '%d %B %Y').date()

In [99]:
day_pub

datetime.date(2020, 7, 1)
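
The conversion above targets a single date layout, and the earlier comment notes that the date format needs review for each website. One way to handle that is to try a short list of formats in turn, as sketched below. The helper name `parse_pub_date` and the candidate formats are assumptions for illustration.

```python
from datetime import datetime

# Candidate formats are assumptions; extend the tuple for each new site.
DATE_FORMATS = ("%d %B %Y", "%B %d, %Y", "%Y-%m-%d")


def parse_pub_date(text, formats=DATE_FORMATS):
    """Return a datetime.date for the first format that fits, else raise ValueError."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            # This format did not fit; try the next one.
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


print(parse_pub_date("1 July 2020"))  # 2020-07-01
```

Raising on an unknown format, rather than returning `None`, makes a changed page layout fail loudly at scrape time.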

In [100]:
df_date = pd.DataFrame([day_pub])

In [101]:
type(df_date)

pandas.core.frame.DataFrame

In [102]:
df_date

|   | 0 |
|---|---|
| 0 | 2020-07-01 |


Now the title.

In [103]:
for title_s in bsobj.find_all("h1"):
    title_list = title_s.text

In [104]:
df_title = pd.DataFrame([title_list])

In [105]:
df_title

|   | 0 |
|---|---|
| 0 | Hong Kong's new security law: Why it scares pe... |


In [106]:
type(df_title)

pandas.core.frame.DataFrame

The following fields are set manually.

In [109]:
country = 'UK'
df_country = pd.DataFrame([country])
source = 'BBC'
df_source = pd.DataFrame([source])
file_name = 'bbc_2'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [110]:
news1 = []
for news in bsobj.find_all('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    news1.append(news.text.strip())

In [111]:
df_news = pd.DataFrame()

In [112]:
df_news['article_body'] = news1

In [113]:
df_news.head(2)

|   | article_body |
|---|---|
| 0 | China has introduced a new national security l... |
| 1 | Lawyers and legal experts have said China's na... |


In [114]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [115]:
df_news = df_news.article_body[0]

In [116]:
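# Note: df_news is a plain str here, and str.replace is literal (not regex),
# so this removes only the exact character sequence \\? if it occurs
df_news = df_news.replace(r'\\?','')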

In [117]:
df_news = pd.DataFrame([df_news])

In [118]:
type(df_news)

pandas.core.frame.DataFrame

In [119]:
df_news.columns = ['Article']

In [120]:
df_news.head()

|   | Article |
|---|---|
| 0 | China has introduced a new national security l... |
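
The cells above go from a list to a DataFrame, then to one concatenated string, and back to a DataFrame. The concatenation itself can be done on the list directly. A minimal sketch, using a stand-in list in place of the scraped `news1`:

```python
# Stand-in for the scraped paragraph list (news1 in the notebook).
paragraphs = ["  First paragraph. ", "Second paragraph."]

# Strip each paragraph, then join with single spaces into one article string.
article_text = " ".join(p.strip() for p in paragraphs)
print(article_text)  # First paragraph. Second paragraph.
```

The resulting string can then be wrapped in a one-row DataFrame in the same way as `pd.DataFrame([df_news])`.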


**Bringing it together.**<a id='2.5_bit'></a>

In [121]:
df_2_bbc = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [122]:
df_2_bbc.columns = ['file_name','date','source','country','title','article']

In [123]:
df_2_bbc.head()

|   | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | bbc_2 | 2020-07-01 | BBC | UK | Hong Kong's new security law: Why it scares pe... | China has introduced a new national security l... |
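
Concatenating six single-cell DataFrames and then renaming the columns works, but the same row can be built in one step from a dict. The sketch below uses placeholder values where the notebook would use the scraped ones. The names `record` and `df_row` are assumptions for illustration.

```python
from datetime import date

import pandas as pd

# Placeholder values standing in for the scraped results above.
record = {
    "file_name": "bbc_2",
    "date": date(2020, 7, 1),
    "source": "BBC",
    "country": "UK",
    "title": "Hong Kong's new security law: Why it scares people",
    "article": "Full article text goes here.",
}

# One dict per article; wrapping it in a list gives a one-row DataFrame
# whose columns are already named, so no separate rename step is needed.
df_row = pd.DataFrame([record])
print(list(df_row.columns))
# ['file_name', 'date', 'source', 'country', 'title', 'article']
```

Keeping each article as a dict also makes it simple to collect several into a list and build a multi-row DataFrame later.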


**Saving**<a id='2.6_save'></a>

In [124]:
cd

C:\Users\rands


Saving it to CSV.

In [125]:
# index=False omits the DataFrame index column from the CSV file
df_2_bbc.to_csv('./_Capstone_Two_NLP/data/_news/bbc_2.csv', index=False)

print('Complete')

Complete
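
The save step above depends on the current working directory and on the target folder already existing. A hedged sketch of a small wrapper that creates the folder first is shown below. The helper name `save_article` and the explicit UTF-8 encoding are assumptions, not the notebook's own code.

```python
from pathlib import Path

import pandas as pd


def save_article(df, out_path):
    """Write df to CSV, creating parent folders first (hypothetical helper)."""
    out_path = Path(out_path)
    # Create any missing parent folders; do nothing if they already exist.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # index=False omits the DataFrame index; utf-8 keeps non-ASCII text intact.
    df.to_csv(out_path, index=False, encoding="utf-8")
    return out_path
```

A call such as `save_article(df_2_bbc, './_Capstone_Two_NLP/data/_news/bbc_2.csv')` would then replace the direct `to_csv` call.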
