# Scraping _______ from _______

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [74]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [75]:
# Review the block below for each website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [76]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [77]:
# The URL has no {} placeholder, so the earlier .format(d) call had no effect and is dropped
url = 'https://www.bbc.com/news/world-asia-china-52765838'
url

'https://www.bbc.com/news/world-asia-china-52765838'

In [78]:
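html = requests.get(url, timeout=10)  # avoid hanging indefinitely on a slow response
html.raise_for_status()  # stop early on an HTTP error status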

In [79]:
bsobj = soup(html.content,'lxml')
# bsobj

**Grab | Date**<a id='2.4_scrape_date'></a>

In [80]:
# Loop variable renamed from `date` so it no longer shadows datetime.date imported above
for date_el in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    print(date_el.text.strip())

30 June 2020


Correct.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [81]:
for title in bsobj.findAll("h1"):
    print(title.text)

Hong Kong security law: What is it and is it worrying?


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [82]:
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    print(news.text.strip())

China has passed a wide-ranging new security law for Hong Kong which makes it easier to punish protesters and reduces the city's autonomy.
Critics have called it "the end of Hong Kong" - so what do we know, and what do people fear the most?
Hong Kong was always meant to have a security law, but could never pass one because it was so unpopular. So this is about China stepping in to ensure the city has a legal framework to deal with what it sees as serious challenges to its authority.
The details of the law's 66 articles were kept secret until after it was passed. It criminalises any act of:
secession - breaking away from the countrysubversion - undermining the power or authority of the central governmentterrorism - using violence or intimidation against peoplecollusion with foreign or external forces
The law came into effect at 23:00 local time on 30 June, an hour before the 23rd anniversary of the city's handover to China from British rule.
It gives Beijing powers to shape life in Hong

Confirmed that's the bottom.
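
The class strings used above (for example `ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84`) look auto-generated and may change when the site is rebuilt. A minimal sketch of a less brittle lookup, assuming the readable fragment (`MetadataSnippet`, `RichTextComponentWrapper`) stays stable: match on that fragment with a regular expression. The helper name `texts_by_class_fragment` and the sample HTML are hypothetical, and `html.parser` is used only so the sketch runs without `lxml`.

```python
import re
from bs4 import BeautifulSoup

def texts_by_class_fragment(page, tag, fragment):
    """Return the stripped text of every <tag> whose class contains `fragment`."""
    pattern = re.compile(re.escape(fragment))
    return [el.text.strip() for el in page.find_all(tag, class_=pattern)]

# Hypothetical stand-in for the downloaded page
sample_html = """
<html><body>
  <span class="ssrcss-8g95ls-MetadataSnippet ecn1o5v2">30 June 2020</span>
  <div class="ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84"><p>First paragraph.</p></div>
  <div class="ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84"><p>Second paragraph.</p></div>
  <div class="unrelated">Ignored.</div>
</body></html>
"""

page = BeautifulSoup(sample_html, "html.parser")
dates = texts_by_class_fragment(page, "span", "MetadataSnippet")
paragraphs = texts_by_class_fragment(page, "div", "RichTextComponentWrapper")
print(dates)
print(paragraphs)
```

An empty list from the helper signals that the page layout has changed, which is easier to check than a `None` from `find`.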

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a date object.

In [83]:
for day in bsobj.find('span',{'class':'ssrcss-8g95ls-MetadataSnippet ecn1o5v2'}):
    day_pub = day.text.strip()

In [84]:
day_pub

'30 June 2020'

In [85]:
# Replace the spaces with hyphens
day_pub = re.sub(' ','-',day_pub)
# Replace the month name with its number (handles June only)
day_pub = re.sub('June','06',day_pub)
# Convert to a date object
day_pub = datetime.strptime(day_pub, '%d-%m-%Y').date()

In [86]:
day_pub

datetime.date(2020, 6, 30)
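
The substitution above handles June only. As a sketch of a more general approach, `strptime` can read the full month name directly with the `%B` directive, assuming the page keeps the "day month year" layout and the month names are in English (the directive depends on the active locale). The helper name `parse_pub_date` is hypothetical.

```python
from datetime import date, datetime

def parse_pub_date(text):
    """Parse a string such as '30 June 2020' into a datetime.date."""
    # %d = day of month, %B = full month name (locale-dependent), %Y = four-digit year
    return datetime.strptime(text.strip(), "%d %B %Y").date()

print(parse_pub_date("30 June 2020"))
```

A string in any other layout raises `ValueError`, which flags a site whose date format needs its own pattern.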

In [87]:
df_date = pd.DataFrame([day_pub])

In [88]:
type(df_date)

pandas.core.frame.DataFrame

In [89]:
df_date

Unnamed: 0,0
0,2020-06-30


Now the title.

In [90]:
for title_s in bsobj.findAll("h1"):
    title_list = title_s.text  # a single string, despite the name

In [91]:
df_title = pd.DataFrame([title_list])

In [92]:
df_title

Unnamed: 0,0
0,Hong Kong security law: What is it and is it w...


In [93]:
type(df_title)

pandas.core.frame.DataFrame

These three items are added manually.

In [94]:
country = 'UK'
df_country = pd.DataFrame([country])
source = 'BBC'
df_source = pd.DataFrame([source])
file_name = 'was_not_updated'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [95]:
news1 = []
for news in bsobj.findAll('div',{'class':'ssrcss-uf6wea-RichTextComponentWrapper e1xue1i84'}):
    news1.append(news.text.strip())

In [96]:
df_news = pd.DataFrame()

In [97]:
df_news['article_body'] = news1

In [98]:
df_news.head(2)

Unnamed: 0,article_body
0,China has passed a wide-ranging new security l...
1,"Critics have called it ""the end of Hong Kong"" ..."


In [99]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [100]:
df_news = df_news.article_body[0]

In [101]:
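# df_news is a plain string here, so str.replace treats its argument literally, not as a regex;
# remove stray backslashes directly
df_news = df_news.replace('\\', '')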

In [102]:
df_news = pd.DataFrame([df_news])

In [103]:
type(df_news)

pandas.core.frame.DataFrame

In [104]:
df_news.columns = ['Article']

In [105]:
df_news.head()

Unnamed: 0,Article
0,China has passed a wide-ranging new security l...
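
The round trip above (list to DataFrame column, concatenate, extract, wrap again) can also be done on plain strings before any DataFrame is built. A minimal sketch, assuming the paragraphs are already collected in a list like `news1`; the helper name `join_paragraphs` and the sample strings are hypothetical.

```python
def join_paragraphs(paragraphs):
    """Join scraped paragraphs into one string and drop stray backslashes."""
    # Skip empty or whitespace-only entries, then join with single spaces
    body = " ".join(p.strip() for p in paragraphs if p.strip())
    return body.replace("\\", "")

# Hypothetical stand-in for news1
sample = ["First paragraph.", 'Second "quoted" paragraph.', "   "]
article = join_paragraphs(sample)
print(article)
```

The result can then be passed to `pd.DataFrame([article], columns=['Article'])` in a single step.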


**Bringing it together.**<a id='2.5_bit'></a>

In [106]:
df_1_bbc = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [107]:
df_1_bbc.columns = ['file_name','date','source','country','title','article']

In [108]:
df_1_bbc.head()

Unnamed: 0,file_name,date,source,country,title,article
0,was_not_updated,2020-06-30,BBC,UK,Hong Kong security law: What is it and is it w...,China has passed a wide-ranging new security l...
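
As an alternative to concatenating six single-cell DataFrames, the same row can be sketched from one dictionary, which keeps each value next to its column name. The values below are copied from the outputs above, except the article text, which is a shortened placeholder; the names `record` and `df_row` are hypothetical.

```python
from datetime import date
import pandas as pd

# One dictionary per article: keys become column names
record = {
    "file_name": "was_not_updated",
    "date": date(2020, 6, 30),
    "source": "BBC",
    "country": "UK",
    "title": "Hong Kong security law: What is it and is it worrying?",
    "article": "China has passed a wide-ranging new security law ...",  # placeholder text
}

df_row = pd.DataFrame([record])
print(df_row.shape)
```

Several articles can be collected as a list of such dictionaries and passed to `pd.DataFrame` in one call.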


**Saving**<a id='2.6_save'></a>

In [36]:
cd

C:\Users\rands


Saving it to CSV.

In [37]:
# df = pd.DataFrame(reviewlist)

# index=False so the DataFrame index is not written as an extra column in the CSV
df_1_bbc.to_csv('./_Capstone_Two_NLP/data/_news/bbc_1.csv', index=False)

print('Complete')

Complete
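
The save above relies on the current working directory and on the target folder already existing. A minimal sketch of building the output path explicitly, assuming a date-stamped file name is wanted (the stamp format matches `d` from the top of the notebook, but the naming scheme and the helper `build_output_path` are hypothetical):

```python
from datetime import date
from pathlib import Path

def build_output_path(base_dir, source, index, stamp=None):
    """Create base_dir if needed and return a path like <base_dir>/bbc_1_06-30-20.csv."""
    # Default to today's date in the same mm-dd-yy format used for `d`
    stamp = stamp or date.today().strftime("%m-%d-%y")
    out_dir = Path(base_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    return out_dir / f"{source}_{index}_{stamp}.csv"
```

The returned path can be passed straight to `df_1_bbc.to_csv(path, index=False)`.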
