# Scraping "Anger as China's Xi signs legislation" from BBC

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [80]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [81]:
# Review the date handling below for every website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [82]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [83]:
url = 'https://www.bbc.com/news/world-asia-china-53234255'
url

'https://www.bbc.com/news/world-asia-china-53234255'

In [84]:
html = requests.get(url)

In [85]:
bsobj = soup(html.content,'lxml')
# bsobj
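The request above parses whatever comes back, including an error page. A minimal sketch of a more defensive fetch follows; `fetch_html` and the User-Agent string are hypothetical names chosen for illustration, and the 10-second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the raw response body for `url`, raising on HTTP errors.

    `fetch_html` and the User-Agent value are hypothetical, for illustration.
    """
    headers = {"User-Agent": "news-scraper-notebook/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never parsed as if it were the article.
    response.raise_for_status()
    return response.content
```

With this helper the parsing cell would become `bsobj = soup(fetch_html(url), 'lxml')`.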

**Grab | Date**<a id='2.4_scrape_date'></a>

In [86]:
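# find() returns a single tag, so read its text directly instead of looping
# (a loop variable named `date` would also shadow the datetime.date import)
pub_date_tag = bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'})
print(pub_date_tag.text.strip())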

30 June 2020


Correct: this is the article's publication date.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [87]:
for title in bsobj.findAll("h1"):
    print(title.text)

Hong Kong security law: Anger as China's Xi signs legislation


Confirmed: that is the article title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [88]:
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    print(news.text.strip())

The UK, EU and Nato have expressed concern and anger after China passed a controversial security law giving it new powers over Hong Kong.
The law makes secession, subversion of the central government, terrorism or collusion with foreign forces punishable by up to life in prison.
It took effect from 2300 local time (1500 GMT) on Tuesday.
Hong Kong's leader Carrie Lam defended the law, saying it filled a "gaping hole" in national security.
One key pro-democracy group said it was now ceasing all operations.
Demosisto announced the move on Facebook after Joshua Wong, one of Hong Kong's most prominent activists, said he was leaving the group, which he had spearheaded.
Minutes after new law, pro-democracy voices quit
The law has come into effect just a day before the 23rd anniversary of the return of sovereignty to China - a day that usually draws large pro-democracy protests.
China says the law is needed to tackle unrest and instability linked to a broadening pro-democracy movement.
China's

Confirmed: the output runs to the bottom of the article.
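The full class strings used above (for example `ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84`) look machine-generated, so they may change when the site is rebuilt. A hedged sketch of matching only the stable-looking part of the class name is shown below. The sample markup and its hash fragments are invented for the example, `extract_paragraphs` is a hypothetical helper, and `html.parser` is used only to keep the sketch free of the `lxml` dependency.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure scraped above; the hash
# fragments in the class names are invented for this example.
SAMPLE_HTML = """
<article>
  <div class="ssrcss-aaaaaa-RichTextComponentWrapper x1">First paragraph.</div>
  <div class="ssrcss-bbbbbb-SomethingElse x2">Not body text.</div>
  <div class="ssrcss-cccccc-RichTextComponentWrapper x3"> Second paragraph. </div>
</article>
"""


def extract_paragraphs(html):
    """Collect the text of every div whose class mentions RichTextComponentWrapper."""
    parsed = BeautifulSoup(html, "html.parser")
    # A compiled regex passed as class_ is searched against each class token.
    blocks = parsed.find_all("div", class_=re.compile(r"RichTextComponentWrapper"))
    return [block.get_text().strip() for block in blocks]


paragraphs = extract_paragraphs(SAMPLE_HTML)
print(paragraphs)
```

Whether the `RichTextComponentWrapper` fragment itself stays stable on the live site is an assumption that needs checking per page.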

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [89]:
# find() returns a single tag; take its text directly
day_pub = bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}).text.strip()

In [90]:
day_pub

'30 June 2020'

In [91]:
# Parse '30 June 2020' directly: %d = day, %B = full month name, %Y = year.
# This works for any month, not only June.
day_pub = datetime.strptime(day_pub, '%d %B %Y').date()

In [92]:
day_pub

datetime.date(2020, 6, 30)
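Because the date format has to be reviewed for every website, one option is a small helper that tries several formats in turn. This is a sketch only: `parse_pub_date` and `CANDIDATE_FORMATS` are hypothetical names, only the first format is confirmed by this page, and `%B` assumes English month names in the active locale.

```python
from datetime import date, datetime

# Formats to try, in order. Only the first is confirmed by this page;
# the others are guesses at what other sites might use.
CANDIDATE_FORMATS = ("%d %B %Y", "%B %d, %Y", "%Y-%m-%d")


def parse_pub_date(text, formats=CANDIDATE_FORMATS):
    """Return a datetime.date for `text`, or raise ValueError if no format fits."""
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"Unrecognised date format: {text!r}")


print(parse_pub_date("30 June 2020"))
```

Raising on an unknown format, rather than returning `None`, makes a changed page layout fail loudly instead of writing an empty date to the output file.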

In [93]:
df_date = pd.DataFrame([day_pub])

In [94]:
type(df_date)

pandas.core.frame.DataFrame

In [95]:
df_date

| | 0 |
|---|---|
| 0 | 2020-06-30 |


Now the title.

In [96]:
for title_s in bsobj.findAll("h1"):
    title_list = title_s.text  # a plain string, despite the variable name

In [97]:
df_title = pd.DataFrame([title_list])

In [98]:
df_title

| | 0 |
|---|---|
| 0 | Hong Kong security law: Anger as China's Xi si... |


In [99]:
type(df_title)

pandas.core.frame.DataFrame

These three items are added manually.

In [100]:
country = 'UK'
df_country = pd.DataFrame([country])
source = 'BBC'
df_source = pd.DataFrame([source])
file_name = 'bbc_18'
df_file_name = pd.DataFrame([file_name])

Finally, the news.

In [101]:
news1 = []
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    news1.append(news.text.strip())

In [102]:
df_news = pd.DataFrame()

In [103]:
df_news['article_body'] = news1

In [104]:
df_news.head(2)

| | article_body |
|---|---|
| 0 | The UK, EU and Nato have expressed concern and... |
| 1 | The law makes secession, subversion of the cen... |


In [105]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [106]:
df_news = df_news.article_body[0]

In [107]:
# str.replace matches literally, not as a regex; strip any stray backslashes
df_news = df_news.replace('\\','')

In [108]:
df_news = pd.DataFrame([df_news])

In [109]:
type(df_news)

pandas.core.frame.DataFrame

In [110]:
df_news.columns = ['Article']

In [111]:
df_news.head()

| | Article |
|---|---|
| 0 | The UK, EU and Nato have expressed concern and... |
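The paragraph list can also be flattened into one string without a round trip through pandas. A minimal sketch, with `join_paragraphs` as a hypothetical helper; collapsing internal whitespace is an extra step that the cells above do not perform.

```python
import re


def join_paragraphs(paragraphs):
    """Join paragraph strings into one article string separated by single spaces."""
    # Skip empty entries and trim each paragraph before joining.
    joined = " ".join(p.strip() for p in paragraphs if p.strip())
    # Collapse any run of whitespace (newlines, tabs, double spaces) to one space.
    return re.sub(r"\s+", " ", joined)
```

Applied to the list built earlier, `pd.DataFrame({'Article': [join_paragraphs(news1)]})` would give a one-row frame with the same column name as above.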


**Bringing it together.**<a id='2.5_bit'></a>

In [115]:
df_18_bbc = pd.concat([df_file_name, df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [117]:
df_18_bbc.columns = ['file_name','date','source','country','title','article']

In [118]:
df_18_bbc.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | bbc_18 | 2020-06-30 | BBC | UK | Hong Kong security law: Anger as China's Xi si... | The UK, EU and Nato have expressed concern and... |
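The same one-row frame can be built from a single dictionary instead of concatenating six one-cell DataFrames. The sketch below is an alternative, not the approach used above; the `article` value is a shortened placeholder rather than the scraped text.

```python
from datetime import date

import pandas as pd

# One record per article; keys become column names in insertion order.
record = {
    "file_name": "bbc_18",
    "date": date(2020, 6, 30),
    "source": "BBC",
    "country": "UK",
    "title": "Hong Kong security law: Anger as China's Xi signs legislation",
    # Placeholder: in the notebook this would be the joined article text.
    "article": "The UK, EU and Nato have expressed concern and anger ...",
}

df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
```

Keeping the fields in one dictionary also makes it easier to collect many articles in a list and build the frame once at the end.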


**Saving**<a id='2.6_save'></a>

In [119]:
cd

C:\Users\rands


Saving it to CSV.

In [120]:
# index=False keeps the DataFrame index out of the CSV file
df_18_bbc.to_csv('./_Capstone_Two_NLP/data/_news/bbc_18.csv', index=False)

print('Complete')

Complete
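The relative path above only works when the notebook's working directory is the one shown by `cd`, and the target folder must already exist. A hedged sketch of a save step that creates the folder and reads the file back as a check is given below; `save_article` is a hypothetical helper, and the demonstration writes to a temporary folder rather than the project path.

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_article(df, out_dir, file_name):
    """Write `df` to <out_dir>/<file_name>.csv, creating the folder if needed."""
    out_path = Path(out_dir) / f"{file_name}.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path


# Demonstration in a temporary folder so nothing real is overwritten.
demo = pd.DataFrame([{"file_name": "bbc_18", "source": "BBC", "country": "UK"}])
with tempfile.TemporaryDirectory() as tmp:
    written = save_article(demo, Path(tmp) / "data" / "_news", "bbc_18")
    round_trip = pd.read_csv(written)  # read back to confirm the write

print(round_trip.shape)
```

In the notebook the call would be something like `save_article(df_18_bbc, './_Capstone_Two_NLP/data/_news', file_name)`.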
