# Scraping First Arrests from Forbes

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [219]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [220]:
# Review this cell for each website scraped; the date format in particular varies by site.

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [221]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [222]:
# The URL has no {} placeholders, so the earlier .format(d) call had no effect and is dropped
url = 'https://www.forbes.com/sites/carlieporterfield/2020/07/01/hong-kong-makes-first-arrests-under-beijings-new-national-security-law/?sh=a149a385e529'
url

'https://www.forbes.com/sites/carlieporterfield/2020/07/01/hong-kong-makes-first-arrests-under-beijings-new-national-security-law/?sh=a149a385e529'

In [223]:
html = requests.get(url)

In [224]:
bsobj = soup(html.content,'lxml')
# bsobj

**Grab | Date**<a id='2.4_scrape_date'></a>

In [225]:
# Use a loop name other than `date` so the imported datetime.date is not shadowed
for stamp in bsobj.findAll('div',{'class':'topline-updated-timestamp'}):
    print(stamp.text.strip())

Updated Jul 1, 2020, 01:51pm EDT


The timestamp still needs cleaning; that is done in the Clean section below.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [226]:
for title in bsobj.findAll("h1"):
    # title.text is already a string, so wrapping it in format() is unnecessary
    print(title.text.strip())

Hong Kong Makes First Arrests Under Beijing’s New National Security Law


Confirmed that's the title.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [227]:
for news in bsobj.findAll('div',{'class':'article-body fs-article fs-responsive-text current-article article-topline'}):
    print(news.text.strip())

Share to FacebookShare to TwitterShare to LinkedinTOPLINE
 Authorities in Hong Kong made the first arrests under the new controversial national security law Wednesday after thousands gathered for an annual pro-democracy rally to mark the anniversary of Hong Kong’s handover to China 23 years ago.






A protester is detained by police during a rally against in Hong Kong Wednesday.

AFP via Getty Images



KEY FACTS


Police arrested 370 people Wednesday during the city’s largest protest in months in defiance of Covid-19 restrictions; of those arrests, 10 were reportedly arrested for breaking the new law, according to the Wall Street Journal. 


The national security law, adopted Tuesday in Beijing, passed after the widespread protests in Hong Kong last year and allows authorities to crack down on dissenters with long jail sentences for charges that experts say are defined in vague terms that could allow for broad interpretation by China.



Critics of the new law say it effectively end

In [228]:
type(news)

bs4.element.Tag

The text needs cleaning (share-widget labels, runs of blank lines), but that is easy to handle.
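The three grabs above can also be collected in one pass. The sketch below is illustrative only: it runs on a small hypothetical HTML snippet rather than the live page, the helper name `extract_fields` is made up, and it assumes the article text sits in `<p>` tags inside the body `div`, which has not been checked against the real Forbes markup. It matches on the single class `article-body` instead of the full class string, because passing the whole string requires the classes to appear in exactly that order.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (not the real Forbes markup)
sample_html = """
<html><body>
<h1>Example headline</h1>
<div class="topline-updated-timestamp">Updated Jul 1, 2020, 01:51pm EDT</div>
<div class="article-body fs-article current-article">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""

def extract_fields(page):
    """Return the date, title and article text from a parsed page as a dict."""
    def text_of(tag):
        # Guard against a missing element instead of raising AttributeError
        return tag.get_text(strip=True) if tag is not None else None

    # class_='article-body' matches any div that carries that class,
    # regardless of which other classes are present or their order
    body = page.find('div', class_='article-body')
    paragraphs = [p.get_text(strip=True) for p in body.find_all('p')] if body else []
    return {
        'date': text_of(page.find('div', class_='topline-updated-timestamp')),
        'title': text_of(page.find('h1')),
        'article': '\n'.join(paragraphs),
    }

fields = extract_fields(BeautifulSoup(sample_html, 'html.parser'))
print(fields['title'])
```

Pulling paragraphs individually, where the markup allows it, would also avoid picking up the share-widget labels that appear at the start of the raw text.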

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [229]:
for day in bsobj.findAll('div',{'class':'topline-updated-timestamp'}):
    day_pub = day.text.strip()

In [230]:
day_pub

'Updated Jul 1, 2020, 01:51pm EDT'

In [231]:
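# Note: these substitutions are hard-coded for this article's timestamp
# (the ', 01:51pm EDT' suffix and the month 'Jul'); adjust them for other pages.
day_pub = re.sub('Updated ','',day_pub)
day_pub = re.sub(', 01:51pm EDT','',day_pub)
day_pub = re.sub(', ',' ',day_pub)
day_pub = re.sub('Jul','07',day_pub)
day_pub = re.sub(' ','-',day_pub)
day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()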
day_pub

datetime.date(2020, 7, 1)
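The substitutions above are tied to this one timestamp. A more general approach, sketched here on the assumption that other pages use the same `Updated Mon D, YYYY, HH:MMam/pm TZ` layout, is to let `strptime` parse the string directly. The helper name `parse_forbes_timestamp` is hypothetical, and `%b` assumes an English locale for the month abbreviation.

```python
import re
from datetime import datetime

def parse_forbes_timestamp(raw):
    """Turn a string like 'Updated Jul 1, 2020, 01:51pm EDT' into a date."""
    # Drop the optional leading 'Updated' label
    cleaned = re.sub(r'^Updated\s+', '', raw.strip())
    # Drop a trailing timezone abbreviation such as 'EDT'; strptime's %Z
    # only recognises a few zone names, so it is removed rather than parsed
    cleaned = re.sub(r'\s+[A-Z]{2,4}$', '', cleaned)
    return datetime.strptime(cleaned, '%b %d, %Y, %I:%M%p').date()

parsed = parse_forbes_timestamp('Updated Jul 1, 2020, 01:51pm EDT')
print(parsed)  # 2020-07-01
```

This discards the time of day and the timezone, as the original cell does.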

In [232]:
df_date = pd.DataFrame([day_pub])

In [233]:
type(df_date)

pandas.core.frame.DataFrame

In [234]:
df_date

            0
0  2020-07-01


Now the title.

In [235]:
for title_s in bsobj.findAll("h1"):
    # Despite the name, title_list holds a single string (the last <h1> found)
    title_list = title_s.text.strip()

In [236]:
df_title = pd.DataFrame([title_list])

In [237]:
df_title

                                                   0
0  Hong Kong Makes First Arrests Under Beijing’s ...


In [238]:
type(df_title)

pandas.core.frame.DataFrame

The following fields are added manually.

In [240]:
country = 'US'
df_country = pd.DataFrame([country])
source = 'Forbes'
df_source = pd.DataFrame([source])
file_name = 'forbes_6'
df_file_name = pd.DataFrame([file_name])


Finally, the news.

In [241]:
news1 = []
for news in bsobj.findAll('div',{'class':'article-body fs-article fs-responsive-text current-article article-topline'}):
    news1.append(news.text.strip())

In [242]:
news1

['Share to FacebookShare to TwitterShare to LinkedinTOPLINE\n Authorities in Hong Kong made the first arrests under the new controversial national security law Wednesday after thousands gathered for an annual pro-democracy rally to mark the anniversary of Hong Kong’s handover to China 23 years ago.\n\n\n\n\n\n\nA protester is detained by police during a rally against in Hong Kong Wednesday.\n\nAFP via Getty Images\n\n\n\nKEY FACTS\n\n\nPolice arrested 370 people Wednesday during the city’s largest protest in months in defiance of Covid-19 restrictions; of those arrests, 10 were reportedly arrested for breaking the new law, according to the Wall Street Journal.\xa0\n\n\nThe national security law, adopted Tuesday in Beijing, passed after the widespread protests in Hong Kong last year and allows authorities to crack down on dissenters with long jail sentences for charges that experts say are defined in vague terms that could allow for broad interpretation by China.\n\n\n\nCritics of the n

In [243]:
df_news = pd.DataFrame(news1)

In [244]:
df_news.head(2)

                                                   0
0  Share to FacebookShare to TwitterShare to Link...


In [245]:
# df_news['article_body'] = df_news.article_body.str.cat(sep='')

In [246]:
# df_news = df_news.article_body[0]

In [247]:
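# regex=True is required for pattern replacement; without it, replace() only
# matches whole cell values and this line does nothing. Here it strips backslashes.
df_news = df_news.replace(r'\\', '', regex=True)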
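The raw article text shown earlier still carries share-widget labels, non-breaking spaces (`\xa0`) and runs of blank lines. A minimal cleaning sketch follows, assuming those are the only artifacts to remove; the helper name `clean_article_text` is hypothetical, and the sample string is a shortened stand-in for the scraped text.

```python
import re

def clean_article_text(raw):
    """Remove share-widget labels, non-breaking spaces and blank lines."""
    # Replace non-breaking spaces with ordinary spaces
    text = raw.replace('\xa0', ' ').strip()
    # Remove the run of share labels at the very start of the text
    text = re.sub(r'^(?:Share to (?:Facebook|Twitter|Linkedin))+', '', text)
    # Keep only non-empty lines, trimmed of surrounding whitespace
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    return '\n'.join(lines)

sample = ('Share to FacebookShare to TwitterShare to LinkedinTOPLINE\n'
          ' Authorities in Hong Kong made the first arrests.\xa0\n\n\nKEY FACTS\n')
cleaned = clean_article_text(sample)
print(cleaned)
```

Applied to the frame, this could be run with `df_news[0].map(clean_article_text)` before the column is renamed.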

In [248]:
type(df_news)

pandas.core.frame.DataFrame

In [249]:
df_news.columns = ['Article']

In [250]:
df_news.head()

                                             Article
0  Share to FacebookShare to TwitterShare to Link...


**Bringing it together.**<a id='2.5_bit'></a>

In [251]:
df_6_forbes = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [252]:
df_6_forbes.columns = ['file_name','date','source','country','title','article']

In [253]:
df_6_forbes.head()

   file_name        date  source  country                                              title                                            article
0   forbes_6  2020-07-01  Forbes       US  Hong Kong Makes First Arrests Under Beijing’s ...  Share to FacebookShare to TwitterShare to Link...
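Building six one-cell frames and concatenating them works, but the same single-row result can be produced from one dictionary. This is an alternative sketch, not the approach used above; `record` and `df_row` are illustrative names, and the `article` value is a placeholder rather than the scraped text.

```python
from datetime import date

import pandas as pd

# One dict per scraped article; keys become the column names in this order
record = {
    'file_name': 'forbes_6',
    'date': date(2020, 7, 1),
    'source': 'Forbes',
    'country': 'US',
    'title': 'Hong Kong Makes First Arrests Under Beijing’s New National Security Law',
    'article': 'placeholder article text',
}

# A list holding one dict gives a one-row DataFrame
df_row = pd.DataFrame([record])
print(df_row.shape)
```

With this layout, several articles could be collected as a list of such dicts and turned into one frame in a single call.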


**Saving**<a id='2.6_save'></a>

In [254]:
cd

C:\Users\rands


Saving it to CSV.

In [255]:
# df = pd.DataFrame(reviewlist)

# index=False keeps the DataFrame index out of the CSV file, so no extra index column is written
df_6_forbes.to_csv('./_Capstone_Two_NLP/data/_news/forbes_6.csv', index=False)

print('Complete')

Complete
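The save step relies on the working directory set by `cd` and on the target folder already existing. A sketch that builds the path explicitly and creates the folder if needed is shown below; `save_frame` is a hypothetical helper, and the demo writes to a temporary directory rather than the project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_frame(df, out_dir, file_name):
    """Write df to <out_dir>/<file_name>.csv, creating out_dir if missing."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f'{file_name}.csv'
    # index=False so the DataFrame index is not written as a column
    df.to_csv(out_path, index=False, encoding='utf-8')
    return out_path

demo = pd.DataFrame([{'file_name': 'forbes_6', 'source': 'Forbes'}])
with tempfile.TemporaryDirectory() as tmp:
    saved = save_frame(demo, Path(tmp) / 'data' / '_news', 'forbes_6')
    # Read the file back to confirm it was written as expected
    round_trip = pd.read_csv(saved)
print(round_trip.shape)
```

Returning the path makes it easy to log where each file ended up.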
