# Scraping HK police arrest over 300 from NY Post

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [114]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [115]:
# Review the block below for each website, notably the date format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [116]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [117]:
# The URL has no {} placeholder, so no .format() call is needed
url = 'https://nypost.com/2020/07/01/hong-kong-police-arrest-nearly-200-in-first-protest-under-new-security-law/'
url

'https://nypost.com/2020/07/01/hong-kong-police-arrest-nearly-200-in-first-protest-under-new-security-law/'

In [118]:
html = requests.get(url)

In [None]:
bsobj = soup(html.content,'lxml')
# bsobj

**Grab | Date**<a id='2.4_scrape_date'></a>

In [133]:
# Use date_tag so the loop variable does not shadow the imported datetime.date
for date_tag in bsobj.findAll('p',{'class':'byline-date'}):
    print(date_tag.text.strip())

July 1, 2020 | 9:26am				| Updated July 1, 2020 | 1:03pm


Lots of cleaning to do.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [11]:
for title in bsobj.findAll("h1"):
    print(title.text)


				Hong Kong police arrest over 300 in first protest under new security law			


Looks like there are some `\n` and `\t` characters to remove.

**Grab | Content**<a id='2.4_scrape_content'></a>

In [29]:
for news in bsobj.findAll('div',{'class':'entry-content entry-content-read-more'}):
    print(news.text.strip())

Hong Kong police arrested more than 300 people Wednesday as thousands of defiant demonstrators gathered in Hong Kong to protest Beijing’s new national security law — while the UK offered a new path to citizenship for almost 3 million eligible residents of its former colony.
Hong Kong police fired water cannon and tear gas as protesters took to the streets to vent against the sweeping security legislation introduced by China that they say is aimed at stifling dissent.
Protesters remained undaunted.
“We are on street to [protest] against national security law. We shall never surrender. Now is not the time to give up,” tweeted Joshua Wong, a Hong Kong activist with more than 600,000 followers on Twitter.’
Reuters
On the first full day the tough new law was in place, Hong Kong Chief Executive Carrie Lam admitted the civil unrest that had rocked the city for months last year was sparked by past failures, but claimed that the national security law showed “Beijing’s confidence” in the city, t

That confirms we have the article down to the bottom; the only thing left to remove is the author credit.

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a datetime object.

In [147]:
# Use date_tag so the loop variable does not shadow the imported datetime.date
for date_tag in bsobj.findAll('p',{'class':'byline-date'}):
    day_pub = date_tag.text.strip()
day_pub

'July 1, 2020 | 9:26am\t\t\t\t| Updated July 1, 2020 | 1:03pm'

In [148]:
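# An unescaped '|' is the regex alternation operator and matches the empty string,
# so it must be escaped to remove the literal pipe characters
day_pub = re.sub(r'\|','',day_pub)

day_pub

'July 1, 2020  9:26am\t\t\t\t Updated July 1, 2020  1:03pm'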

In [145]:
# day_pub = re.sub(' | 9:26am				| Updated July 1, 2020 | 1:03pm','',day_pub)
# day_pub = re.sub('Updated June 30, 2020','',day_pub)
# day_pub = re.sub(',','',day_pub)
# day_pub = re.sub('June','06',day_pub)
# day_pub = re.sub(' ','-',day_pub)
# day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()
# day_pub
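
Instead of substituting the exact time stamp by hand, the byline string can be cut at the word `Updated` and parsed directly. The following is a minimal sketch under a few assumptions: the byline always has the shape shown in the output above, the month name is in English, and `parse_byline` is a hypothetical helper name, not part of the original notebook.

```python
import re
from datetime import datetime

# Raw byline text as scraped above
raw = 'July 1, 2020 | 9:26am\t\t\t\t| Updated July 1, 2020 | 1:03pm'

def parse_byline(text):
    """Return the original publication time from a byline string (hypothetical helper)."""
    # Keep only the part before the 'Updated' stamp
    published = text.split('Updated')[0]
    # Swap the pipes for spaces, then collapse tabs and repeated spaces
    published = published.replace('|', ' ')
    published = re.sub(r'\s+', ' ', published).strip()   # 'July 1, 2020 9:26am'
    # %B = full month name, %I = 12-hour clock, %p = am/pm
    return datetime.strptime(published, '%B %d, %Y %I:%M%p')

pub_dt = parse_byline(raw)
day_pub_parsed = pub_dt.date()
print(day_pub_parsed)   # 2020-07-01
```

Because nothing in the helper is tied to a specific time, the same cell should carry over to other articles on the site as long as the byline layout stays the same.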

In [17]:
df_date = pd.DataFrame([day_pub])

In [18]:
type(df_date)

pandas.core.frame.DataFrame

In [19]:
df_date

Unnamed: 0,0
0,2020-06-28


Now the title.

In [30]:
for title_s in bsobj.findAll("h1"):
    title_list = title_s.text

In [31]:
title_list = re.sub('\n\t\t\t\t','',title_list)
title_list = re.sub('\t\t\t','',title_list)

In [32]:
title_list

'Hong Kong police arrest over 300 in first protest under new security law'
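
The two `re.sub` calls above depend on the exact number of tabs around the title. A more forgiving option, sketched below, is to split on any whitespace and rejoin with single spaces. The `raw_title` value is an assumption that approximates the header output shown earlier; the exact whitespace on the page may differ.

```python
# Header text roughly as it came back from the page (assumed whitespace)
raw_title = '\n\t\t\t\tHong Kong police arrest over 300 in first protest under new security law\t\t\t\n'

# split() with no arguments splits on any run of whitespace;
# join() then puts a single space between the words
clean_title = ' '.join(raw_title.split())
print(clean_title)
```

This also collapses any doubled spaces inside the title, which plain `.strip()` would leave alone.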

In [33]:
df_title = pd.DataFrame([title_list])

In [34]:
df_title

Unnamed: 0,0
0,Hong Kong police arrest over 300 in first prot...


In [35]:
type(df_title)

pandas.core.frame.DataFrame

The country and source are added manually.

In [36]:
country = 'US'
df_country = pd.DataFrame([country])
source = 'NY Post'
df_source = pd.DataFrame([source])

Finally, the news.

In [91]:
news1 = []
for news in bsobj.findAll('div',{'class':'entry-content entry-content-read-more'}):
    news1.append(news.text.strip())

In [92]:
news1



In [93]:
news1 = str(news1)

In [94]:
news1 = re.sub('With Reuters','',news1)

In [95]:
news1



In [96]:
df_news = pd.DataFrame([news1])

In [97]:
df_news.head(2)

Unnamed: 0,0
0,['Hong Kong police arrested more than 300 peop...
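
The leading `['` in the output above comes from `str(news1)`, which stores the printed form of the list (brackets and quotes included) rather than the text itself. A possible alternative, sketched here, joins the paragraphs into one string and drops credit lines along the way. The sample text is a shortened stand-in for the scraped list, and `CREDIT_LINES` and `join_article` are hypothetical names introduced for illustration.

```python
# Stand-in for the scraped list: one string per matching <div>,
# with paragraphs separated by newlines
news_blocks = [
    'Hong Kong police arrested more than 300 people Wednesday ...\n'
    'Protesters remained undaunted.\n'
    'Reuters\n'
    'With Reuters'
]

# Lines treated as credits rather than article text (assumed list)
CREDIT_LINES = {'Reuters', 'With Reuters'}

def join_article(blocks):
    """Flatten a list of text blocks into one string, skipping blank and credit lines."""
    lines = []
    for block in blocks:
        for line in block.splitlines():
            line = line.strip()
            if line and line not in CREDIT_LINES:
                lines.append(line)
    return ' '.join(lines)

article_text = join_article(news_blocks)
print(article_text)
```

Only lines that consist entirely of a credit are dropped, so a sentence that merely mentions Reuters stays in the article.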


In [101]:
# df_news['article_body'] = df_news.article_body.str.cat(sep='')

In [103]:
# df_news = df_news.article_body[0]

In [104]:
df_news = df_news.replace(r'\\?','')

In [107]:
# df_news = pd.DataFrame([df_news])

In [108]:
type(df_news)

pandas.core.frame.DataFrame

In [109]:
df_news.columns = ['Article']

In [110]:
df_news.head()

Unnamed: 0,Article
0,['Hong Kong police arrested more than 300 peop...


**Bringing it together.**<a id='2.5_bit'></a>

In [112]:
df_5_nytimes = pd.concat([df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [76]:
df_5_nytimes.columns = ['date','source','country','title','article']

In [77]:
df_5_nytimes.head()

Unnamed: 0,date,source,country,title,article
0,2020-06-28,NY Times,US,What China’s New National Security Law Means f...,Chinese lawmakers have approved a national sec...
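
As an alternative to concatenating five single-cell frames and renaming the columns afterwards, the row can be built from one dictionary so each value is tied to its column name from the start. This is only a sketch: the values below are illustrative stand-ins for the variables cleaned earlier, and `row` and `df_row` are hypothetical names.

```python
import pandas as pd

# Illustrative values; in the notebook these would be the cleaned variables
row = {
    'date': '2020-07-01',
    'source': 'NY Post',
    'country': 'US',
    'title': 'Hong Kong police arrest over 300 in first protest under new security law',
    'article': 'Hong Kong police arrested more than 300 people Wednesday ...',
}

# A list with one dict yields a one-row frame whose columns follow the dict keys
df_row = pd.DataFrame([row])
print(df_row.shape)   # (1, 5)
```

Keeping names and values together makes it harder for a column label to drift away from the data it describes.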


**Saving**<a id='2.6_save'></a>

In [74]:
cd

C:\Users\rands


Saving it to Excel.

In [78]:
# df = pd.DataFrame(reviewlist)

# index=False below so that we don't get the dataframe index on the side; we just use the excel index
df_5_nytimes.to_excel('./_Capstone_Two_NLP/data/_news/nytimes_5.xlsx', index=False)

print('Complete')

Complete
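
The relative path in the save cell only resolves if the notebook's working directory is the expected one and the folders already exist. A small sketch of building the path explicitly follows; `news_output_path` is a hypothetical helper, and the demonstration writes into a temporary directory rather than the real project folder.

```python
from pathlib import Path
import tempfile

def news_output_path(base_dir, filename):
    """Build <base_dir>/_Capstone_Two_NLP/data/_news/<filename>, creating folders as needed."""
    out_dir = Path(base_dir) / '_Capstone_Two_NLP' / 'data' / '_news'
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / filename

# Demonstrated against a throwaway directory; in the notebook,
# base_dir would be the working directory shown by the cd cell
demo_base = tempfile.mkdtemp()
out_file = news_output_path(demo_base, 'nytimes_5.xlsx')
print(out_file.parent.is_dir())   # True
```

The returned path can be passed straight to `to_excel(..., index=False)` in place of the hard-coded string.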
