# Scraping "Hong Kong police arrest over 300" from the NY Post

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
# Review the block below for each website, notably the date format.

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'https://nypost.com/2020/07/01/hong-kong-police-arrest-nearly-200-in-first-protest-under-new-security-law/'
content = requests.get(url).content
url

'https://nypost.com/2020/07/01/hong-kong-police-arrest-nearly-200-in-first-protest-under-new-security-law/'

In [5]:
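# Parse the downloaded HTML. This rebinds `soup` from the BeautifulSoup class
# to the parsed document, so re-run the import cell before re-running this one.
soup = soup(content, 'html.parser')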
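
A slightly more defensive version of the request step might look like the sketch below. The helper names `fetch_html` and `parse_html`, the 10-second timeout, and the sample markup are illustrative assumptions, not part of the original notebook. Importing the class under its own name, `BeautifulSoup`, keeps it available after the parsed page is stored in another variable.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return response.content


def parse_html(html):
    """Parse raw HTML with Python's built-in 'html.parser' backend."""
    return BeautifulSoup(html, 'html.parser')


# Offline demonstration on a made-up snippet (no network call is made here).
sample_page = parse_html('<html><body><h1> Example headline </h1></body></html>')
sample_headline = sample_page.h1.get_text(strip=True)
print(sample_headline)
```

For the real page, `parse_html(fetch_html(url))` would replace the two lines in the cells above.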

**Grab | Date**<a id='2.4_scrape_date'></a>

In [45]:
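# 'p class' is not an attribute name; match on the class itself
# (assumes the byline element carries class="byline-date").
for date in soup.find_all(class_='byline-date'):
    print(date.text.strip())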

In [46]:
date



The date is in there but needs to be cleaned.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [22]:
for title in soup.findAll('h1'):
    print(title.text.strip())

Hong Kong police arrest over 300 in first protest under new security law


In [23]:
title

<h1 class="postid-15918875">
				Hong Kong police arrest over 300 in first protest under new security law			</h1>

In [24]:
TAG_RE = re.compile(r'<[^>]+>')

In [25]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [26]:
remove_tags(str(title))

'\n\t\t\t\tHong Kong police arrest over 300 in first protest under new security law\t\t\t'

That confirms the title, although the regex is a roundabout way to get it.
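
The tag-stripping regex works, but BeautifulSoup can return the same text without it. A minimal sketch, using a hand-written copy of the `<h1>` shown above as sample input (the variable names are illustrative):

```python
import re

from bs4 import BeautifulSoup

# Hand-copied from the notebook output above, including the tab padding.
sample_html = ('<h1 class="postid-15918875">\n\t\t\t\tHong Kong police arrest over 300 '
               'in first protest under new security law\t\t\t</h1>')
h1 = BeautifulSoup(sample_html, 'html.parser').find('h1')

# Route 1: the regex approach from the notebook, plus strip() to drop the padding.
TAG_RE = re.compile(r'<[^>]+>')
via_regex = TAG_RE.sub('', str(h1)).strip()

# Route 2: let BeautifulSoup collect and strip the text itself.
via_bs4 = h1.get_text(strip=True)

print(via_bs4)
print(via_regex == via_bs4)
```

Both routes give the same string; `get_text(strip=True)` simply skips the round trip through `str()` and a regex.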

**Grab | Content**<a id='2.4_scrape_content'></a>

In [27]:
# Pass the class filter as a dict; {'class', ...} with a comma is a set.
for bodies in soup.find_all('div', {'class': 'box article modal-enabled'}):
    print(bodies.text.strip())

News
Share this:FacebookTwitter
Flipboard
WhatsAppEmailCopy
Hong Kong police arrest over 300 in first protest under new security law
By Bob Fredericks
View author archive
email the author
Get author RSS feed
Most Popular Today
1
Parents at elite NYC school enraged over 'masturbation' videos for first graders
2
NYC’s iconic Washington Square Park now a ‘drug den’ that’s terrifying neighbors
3
NBA recklessly gives LeBron James another pass
4
Inside Sinead O'Connor's horrifying and downright bizarre night with Prince
5
Piers Morgan says 'Good Morning Britain' wants him back after Meghan Markle comments
Name(required)
Email(required)
Comment(required)
Submit
July 1, 2020 | 9:26am | Updated July 1, 2020 | 1:03pm
Enlarge Image
A police officer raises his pepper spray handgun as he detains a man during a march against the national security law at the anniversary of Hong Kong's handov

In [28]:
len(bodies)

15

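The block mixes the article text with share widgets, sidebar headlines, and the comment form, so it needs cleaning before it can be used.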
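
One way to cut down on that noise is to collect only the paragraph tags inside the container instead of its full text. The sketch below runs on invented markup; whether the real NY Post page keeps its story in `<p>` tags, and the `entry-content` and `share` class names, are assumptions that would need checking against the actual HTML.

```python
from bs4 import BeautifulSoup

# Invented markup that mimics the mix of widgets and article text seen above.
sample_html = """
<div class="box article modal-enabled">
  <ul class="share"><li>Facebook</li><li>Twitter</li></ul>
  <h1>Example headline</h1>
  <div class="entry-content">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <form><label>Name(required)</label></form>
</div>
"""

page = BeautifulSoup(sample_html, 'html.parser')
container = page.find('div', class_='box article modal-enabled')

# Keep only <p> text so share links, headings, and form labels are skipped.
paragraphs = [p.get_text(strip=True) for p in container.find_all('p')]
article_text = ' '.join(paragraphs)
print(article_text)
```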

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [56]:
# Store the stripped byline text (the original loop discarded it).
for date in soup.find_all(class_='byline-date'):
    date = date.text.strip()

In [57]:
type(date)

str

In [74]:
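# Keep the text after 'Updated ' and take its first 12 characters.
# Note: this is the updated date, which here matches the publication date.
date_new = date.split('Updated ')
date_new = date_new[1]
day_pub = date_new[0:12]
day_pub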

'July 1, 2020'

In [75]:
# Parse the month name directly instead of substituting '07' by hand,
# so the cell also works for articles from other months.
day_pub = datetime.strptime(day_pub, '%B %d, %Y').date()
day_pub

datetime.date(2020, 7, 1)
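
The splitting and parsing steps can also be folded into one small function. The sketch below is one way to do it, assuming the byline always follows the pattern seen in the scraped text (`date | time | Updated date | time`); `parse_byline` is a hypothetical helper, not something from the original notebook.

```python
from datetime import datetime


def parse_byline(byline):
    """Return (published, updated) dates from a NY Post style byline string.

    `updated` is None when the byline carries no 'Updated' part.
    """
    # Split on the pipes and drop the surrounding spaces and tabs.
    parts = [part.strip() for part in byline.split('|')]
    published = datetime.strptime(parts[0], '%B %d, %Y').date()

    updated = None
    for part in parts[1:]:
        if part.startswith('Updated '):
            updated = datetime.strptime(part[len('Updated '):], '%B %d, %Y').date()
    return published, updated


byline = 'July 1, 2020 | 9:26am\t\t\t\t| Updated July 1, 2020 | 1:03pm'
published, updated = parse_byline(byline)
print(published, updated)
```

Returning both dates keeps the publication date and the update date apart, which matters when an article is revised on a later day.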

In [76]:
df_date = pd.DataFrame([day_pub])

In [77]:
type(df_date)

pandas.core.frame.DataFrame

In [78]:
df_date

|   | 0 |
|---|---|
| 0 | 2020-07-01 |


Now the title.

In [84]:
for title in soup.findAll('h1'):
    print(title.text.strip())

Hong Kong police arrest over 300 in first protest under new security law


In [85]:
title = title.text.strip()
title = [title]

In [86]:
type(title)

list

In [87]:
df_title = pd.DataFrame([title])

In [88]:
df_title

|   | 0 |
|---|---|
| 0 | Hong Kong police arrest over 300 in first prot... |


In [89]:
type(df_title)

pandas.core.frame.DataFrame

These fields are entered by hand.

In [90]:
country = 'US'
df_country = pd.DataFrame([country])
source = 'NY Post'
df_source = pd.DataFrame([source])
file_name = 'nypost_8'
df_file_name = pd.DataFrame([file_name])


Finally, the article body.

In [187]:
# The loop only binds `bodies` to the last matching <div>; the class filter is a dict.
for bodies in soup.find_all('div', {'class': 'box article modal-enabled'}):
    bodies.text.strip()

In [212]:
body = remove_tags(str(bodies))

In [190]:
body = body.split('\n')

In [191]:
check_beg = body.index('Hong Kong police arrested more than 300 people Wednesday as thousands of defiant demonstrators gathered in Hong Kong to protest Beijing’s new national security law — while the UK offered a new path to citizenship for almost 3 million eligible residents of its former colony.')
check_beg

153

In [192]:
check_end = body.index('The US also imposed visa restrictions on Communist Party officials involved in the crackdown.')
check_end

176

In [193]:
# + 1 keeps the closing sentence, which body[153:176] leaves out.
body = body[check_beg:check_end + 1]
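
The two `index` lookups and the slice can be wrapped in one helper that includes the end line. This is a sketch on made-up sample lines; `slice_between` is a hypothetical name, and the first and last sentences still have to be supplied for each article.

```python
def slice_between(lines, first_line, last_line):
    """Return the lines from first_line through last_line, both included."""
    start = lines.index(first_line)
    # Search for the end marker only from the start marker onward.
    end = lines.index(last_line, start)
    return lines[start:end + 1]


# Made-up stand-in for the page text after splitting on newlines.
sample_lines = ['Share this:', '', 'Opening sentence.', '', 'Middle sentence.',
                'Closing sentence.', 'Most Popular Today']
article_lines = slice_between(sample_lines, 'Opening sentence.', 'Closing sentence.')

# Drop the empty strings left by blank lines, then join into one string.
article_text = ' '.join(line.strip() for line in article_lines if line.strip())
print(article_text)
```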

In [194]:
type(body)

list

In [195]:
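# Join the non-empty lines with spaces; ', ' would add stray commas between paragraphs.
body = ' '.join(line for line in body if line.strip())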

In [200]:
body_lists = [body]

In [213]:
df_news = pd.DataFrame()

In [214]:
df_news['article_body'] = body_lists

In [215]:
df_news.head()

|   | article_body |
|---|---|
| 0 | Hong Kong police arrested more than 300 people... |


In [216]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [217]:
df_news = df_news.article_body[0]

In [218]:
# str.replace is literal (not a regex), so this removes only the exact characters \\?
df_news = df_news.replace(r'\\?','')

In [219]:
df_news = pd.DataFrame([df_news])

In [220]:
type(df_news)

pandas.core.frame.DataFrame

In [221]:
df_news.columns = ['Article']

In [222]:
df_news.head()

|   | Article |
|---|---|
| 0 | Hong Kong police arrested more than 300 people... |


**Bringing it together.**<a id='2.5_bit'></a>

In [223]:
df_8_nypost = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [224]:
df_8_nypost.columns = ['file_name','date','source','country','title','article']

In [225]:
df_8_nypost.head()

|   | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | nypost_8 | 2020-07-01 | NY Post | US | Hong Kong police arrest over 300 in first prot... | Hong Kong police arrested more than 300 people... |
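
The same one-row frame can also be built in a single step from a dictionary, which avoids creating six small frames and naming the columns afterwards. A minimal sketch, with the two long text fields shortened to placeholder strings:

```python
from datetime import date

import pandas as pd

# Values mirror the notebook; title and article are shortened placeholders.
record = {
    'file_name': 'nypost_8',
    'date': date(2020, 7, 1),
    'source': 'NY Post',
    'country': 'US',
    'title': 'Hong Kong police arrest over 300 ...',
    'article': 'Hong Kong police arrested more than 300 people ...',
}

# A list holding one dict produces a one-row DataFrame whose columns are the keys.
df_row = pd.DataFrame([record])
print(df_row.shape)
print(list(df_row.columns))
```

Further articles could be added as more dictionaries in the same list, giving one row per article.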


**Saving**<a id='2.6_save'></a>

In [226]:
cd

C:\Users\rands


Saving it to CSV.

In [227]:
# index=False leaves the DataFrame index out of the CSV file.
df_8_nypost.to_csv('./_Capstone_Two_NLP/data/_news/nypost_8.csv', index=False)

print('Complete')

Complete
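
`to_csv` fails if the target folder does not exist, so it can help to create the folder first. The sketch below writes to a temporary directory so it runs anywhere; the `save_frame` helper and the demo frame are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frame(df, folder, name):
    """Write df to <folder>/<name>.csv without the index and return the path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # create missing parent folders
    path = folder / f'{name}.csv'
    df.to_csv(path, index=False)
    return path


demo = pd.DataFrame({'file_name': ['nypost_8'], 'source': ['NY Post']})
with tempfile.TemporaryDirectory() as tmp:
    out_path = save_frame(demo, Path(tmp) / 'data' / '_news', 'nypost_8')
    # Read the file back to confirm that no index column was written.
    reloaded = pd.read_csv(out_path)

print(list(reloaded.columns))
```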
