# Scraping "Hong Kong Through Water and Fire" from The Diplomat

## Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [96]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle
from pprint import pprint

%reload_ext watermark

In [97]:
# Review this cell for each website, especially the date format string.

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [98]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [99]:
url = 'https://thediplomat.com/2020/07/hong-kong-through-water-and-fire/'
content = requests.get(url).content
url

'https://thediplomat.com/2020/07/hong-kong-through-water-and-fire/'
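
The request above has no timeout and does not check the HTTP status. A minimal sketch of a more defensive fetch follows. The helper names `build_session` and `fetch_html` and the User-Agent string are hypothetical and not part of the original notebook.

```python
import requests

# Hypothetical User-Agent string; replace with one that identifies your project.
USER_AGENT = "news-scraper-sketch/0.1"


def build_session(user_agent=USER_AGENT):
    """Return a requests.Session that sends a descriptive User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.content


session = build_session()
print(session.headers["User-Agent"])
```

Calling `fetch_html(session, url)` would then stand in for `requests.get(url).content`, assuming the site accepts the chosen User-Agent.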

In [100]:
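# Note: this rebinds the name `soup` from the BeautifulSoup class to the parsed
# document. Re-running this cell without re-running the import will not
# re-parse the page, because calling a parsed document is a find_all() search.
soup = soup(content,'html.parser')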

**Grab | Date**<a id='2.4_scrape_date'></a>

In [101]:
for date in soup.find_all('div',{'class':'td-date'}):
    print(date.text.strip())

July 01, 2020


The date comes back as a clean string that is easy to parse.
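
Because only one date is expected per page, `find` can replace the loop. The sketch below runs on the date markup that this notebook prints in a later cell. The helper `first_text` and the variable name `page` are illustrative assumptions, not part of the original code.

```python
from bs4 import BeautifulSoup

# Markup copied from the date cell's output later in this notebook.
SAMPLE_HTML = (
    '<div class="td-date"><span itemprop="datePublished">July 01, 2020</span>\n'
    '<meta content="2020-07-02T02:35:50Z" itemprop="dateModified">\n'
    '</meta></div>'
)


def first_text(page, name, attrs):
    """Return the stripped text of the first matching tag, or None if absent."""
    tag = page.find(name, attrs)
    return tag.get_text(strip=True) if tag is not None else None


page = BeautifulSoup(SAMPLE_HTML, "html.parser")
published = first_text(page, "div", {"class": "td-date"})
missing = first_text(page, "div", {"class": "no-such-class"})
modified = page.find("meta", {"itemprop": "dateModified"})["content"]
print(published)
print(missing)
print(modified)
```

Returning `None` for a missing tag makes a layout change on the site visible immediately, rather than leaving a stale loop variable behind.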

**Grab | Header**<a id='2.4_scrape_header'></a>

In [102]:
for title in soup.findAll('h1',{'id':'td-headline', 'itemprop':'headline'}):
    print(title.text.strip())

Hong Kong Through Water and Fire


In [103]:
title = title.text.strip()

In [104]:
TAG_RE = re.compile(r'<[^>]+>')

In [105]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [106]:
remove_tags(str(title))

'Hong Kong Through Water and Fire'

Confirmed: the title contains no leftover markup.
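
The regex helper only changes strings that still contain markup. Text already extracted with `.text` passes through untouched. The sketch below illustrates this and one known limitation of the pattern. The `<h1>` sample string is constructed from the attributes used in the search above, and the `<a>` example is hypothetical.

```python
import re

TAG_RE = re.compile(r'<[^>]+>')


def remove_tags(text):
    """Strip anything that looks like an HTML tag from text."""
    return TAG_RE.sub('', text)


# Constructed sample: raw markup versus text that was already extracted.
raw = '<h1 id="td-headline" itemprop="headline">Hong Kong Through Water and Fire</h1>'
plain = 'Hong Kong Through Water and Fire'
print(remove_tags(raw) == plain)    # markup is removed
print(remove_tags(plain) == plain)  # plain text is unchanged

# Limitation: a '>' inside an attribute value ends the match too early,
# so part of the attribute leaks into the result.
tricky = '<a title="a > b">x</a>'
print(repr(remove_tags(tricky)))
```

For markup that may contain such attributes, the parser's own text extraction is the safer choice.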

**Grab | Content**<a id='2.4_scrape_content'></a>

In [107]:
for bodies in soup.findAll('article',{'id':'td-story'}):
    print(bodies.text.strip())

Features | Politics | Society | East Asia 

						Hong Kong Through Water and Fire                    
From the mass protests of 2019 to the national security law of 2020.





By Sebastian Veg for The Diplomat July 01, 2020





 

 

 








Police detain protesters after a protest in Causeway Bay before the annual handover march in Hong Kong, July. 1, 2020. 
Credit: AP Photo/Kin CheungAdvertisementIt has been a year since the beginning of the anti-extradition law movement, and it seems safe to say that Hong Kong will never be the same. The movement began as a massive pushback across Hong Kong society to resist the government’s proposed bill allowing extradition to mainland China and more generally the growing erosion of Hong Kong’s constitutionally guaranteed “high degree of autonomy.” It therefore began as a “reactive” (in the framework of Charles Tilly) movement, whereas the 2014 Umbrella Movement was arguably more “proactive” in trying to advance a deeper form of democratizatio

In [108]:
len(bodies)

10

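This needs a lot of cleaning: the text still includes the section labels, the byline, the photo caption, and an "Advertisement" marker. Note that `len(bodies)` counts the tag's child nodes, not characters.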

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [112]:
for date in soup.find_all('div',{'class':'td-date'}):
    print(date.text.strip())

July 01, 2020


In [113]:
date

<div class="td-date"><span itemprop="datePublished">July 01, 2020</span>
<meta content="2020-07-02T02:35:50Z" itemprop="dateModified">
</meta></div>

In [114]:
date = remove_tags(str(date))
date

'July 01, 2020\n\n'

In [115]:
len(date)

15

In [116]:
day_pub = date.strip()  # drop the trailing newlines instead of slicing by position
day_pub

'July 01, 2020'

In [117]:
# Parse the "Month DD, YYYY" string directly. %B matches the full month name,
# so no manual substitution of "July" is needed.
day_pub = datetime.strptime(day_pub, '%B %d, %Y').date()

day_pub

datetime.date(2020, 7, 1)
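
Since the date format has to be reviewed for each website, one option is a small helper that tries several formats in turn. This is a sketch only. The function name `parse_pub_date` is hypothetical, and the two formats listed are the ones visible on this page (the displayed date and the `dateModified` meta value). Other sites would need their own entries.

```python
from datetime import datetime

# Formats observed on this page; entries for other sites would be assumptions.
DATE_FORMATS = ("%B %d, %Y", "%Y-%m-%dT%H:%M:%SZ")


def parse_pub_date(raw, formats=DATE_FORMATS):
    """Try each format in turn and return a datetime.date, or raise ValueError."""
    cleaned = raw.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {raw!r}")


print(parse_pub_date("July 01, 2020\n\n"))
print(parse_pub_date("2020-07-02T02:35:50Z"))
```

Note that `%B` depends on the active locale, so this assumes English month names.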

In [118]:
df_date = pd.DataFrame([day_pub])

In [119]:
type(df_date)

pandas.core.frame.DataFrame

In [120]:
df_date

| | 0 |
|---|---|
| 0 | 2020-07-01 |


Now the title.

In [135]:
for title in soup.findAll('h1',{'id':'td-headline', 'itemprop':'headline'}):
    print(title.text.strip())

Hong Kong Through Water and Fire


In [136]:
df_title = pd.DataFrame([title.text.strip()])

In [137]:
df_title

| | 0 |
|---|---|
| 0 | Hong Kong Through Water and Fire |


In [138]:
type(df_title)

pandas.core.frame.DataFrame

The following fields are set manually.

In [139]:
country = 'US'
df_country = pd.DataFrame([country])
source = 'the Diplomat'
df_source = pd.DataFrame([source])
file_name = 'theDiplomat_23'
df_file_name = pd.DataFrame([file_name])


Finally, the article body.

In [190]:
for bodies in soup.findAll('article',{'id':'td-story'}):
    body = (bodies.text.strip())

In [191]:
body

'Features\xa0|\xa0Politics\xa0|\xa0Society\xa0|\xa0East Asia \n\n\t\t\t\t\t\tHong Kong Through Water and Fire                    \nFrom the mass protests of 2019 to the national security law of 2020.\n\n\n\n\n\nBy Sebastian Veg for The Diplomat July 01, 2020\n\n\n\n\n\n \n\n \n\n \n\n\n\n\n\n\n\n\nPolice detain protesters after a protest in Causeway Bay before the annual handover march in Hong Kong, July. 1, 2020. \nCredit: AP Photo/Kin CheungAdvertisementIt has been a year since the beginning of the anti-extradition law movement, and it seems safe to say that Hong Kong will never be the same. The movement began as a massive pushback across Hong Kong society to resist the government’s proposed bill allowing extradition to mainland China and more generally the growing erosion of Hong Kong’s constitutionally guaranteed “high degree of autonomy.” It therefore began as a “reactive” (in the framework of Charles Tilly) movement, whereas the 2014 Umbrella Movement was arguably more “proactive

In [192]:
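body_new = body.split('\n')
# Keep everything from the photo-credit line onward; the index 29 is
# specific to this page's layout and must be checked for other articles.
body_new = body_new[29:]
# Drop the final line.
del body_new[-1]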
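
The fixed index depends on how many blank lines the page happens to contain. A hedged alternative is to normalize whitespace and drop empty lines first, so that the remaining positions are easier to inspect. The helper `clean_lines` is hypothetical, and the sample is a shortened excerpt of the scraped text shown above.

```python
def clean_lines(text):
    """Split text into lines, normalize whitespace, and drop empty lines."""
    lines = []
    for line in text.replace("\xa0", " ").splitlines():
        stripped = " ".join(line.split())  # collapse tabs and repeated spaces
        if stripped:
            lines.append(stripped)
    return lines


# A shortened excerpt of the scraped text shown above.
sample = (
    "Features\xa0|\xa0Politics\xa0|\xa0Society\xa0|\xa0East Asia \n\n"
    "\t\t\t\t\t\tHong Kong Through Water and Fire                    \n"
    "From the mass protests of 2019 to the national security law of 2020.\n\n\n"
    "By Sebastian Veg for The Diplomat July 01, 2020\n"
)
for line in clean_lines(sample):
    print(line)
```

This does not remove the photo caption or the "Advertisement" marker. Those would still need page-specific handling.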

In [193]:
df_news = pd.DataFrame()

In [194]:
df_news['article_body'] = body_new

In [203]:
df_news.head()

| | Article |
|---|---|
| 0 | Credit: AP Photo/Kin CheungAdvertisementIt has... |


In [196]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [197]:
df_news = df_news.article_body[0]

In [198]:
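# str.replace treats its pattern literally, so r'\\?' would only match the
# exact characters "\\?". Use re.sub to strip any backslashes instead.
df_news = re.sub(r'\\', '', df_news)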

In [199]:
df_news = pd.DataFrame([df_news])

In [200]:
type(df_news)

pandas.core.frame.DataFrame

In [201]:
df_news.columns = ['Article']

In [202]:
df_news.head()

| | Article |
|---|---|
| 0 | Credit: AP Photo/Kin CheungAdvertisementIt has... |


**Bringing it together.**<a id='2.5_bit'></a>

In [204]:
df_23_theDiplomat = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [205]:
df_23_theDiplomat.columns = ['file_name','date','source','country','title','article']

In [206]:
df_23_theDiplomat.head()

| | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | theDiplomat_23 | 2020-07-01 | the Diplomat | US | Hong Kong Through Water and Fire | Credit: AP Photo/Kin CheungAdvertisementIt has... |
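
The same one-row frame can also be built from a single dictionary, which keeps each value next to its column name and avoids renaming columns afterwards. This is a sketch under stated assumptions. The name `df_record` is hypothetical and the article text is a placeholder rather than the scraped body.

```python
from datetime import date

import pandas as pd

# Values taken from the cells above; the article text is a placeholder.
record = {
    "file_name": "theDiplomat_23",
    "date": date(2020, 7, 1),
    "source": "the Diplomat",
    "country": "US",
    "title": "Hong Kong Through Water and Fire",
    "article": "placeholder article text",
}

# A list holding one dict yields a DataFrame with one row.
df_record = pd.DataFrame([record])
print(df_record.columns.tolist())
print(df_record.shape)
```

Collecting one such dictionary per article and passing the whole list to `pd.DataFrame` would extend this to many articles at once.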


**Saving**<a id='2.6_save'></a>

In [207]:
cd

C:\Users\rands


Saving it to CSV.

In [208]:
# index=False keeps the DataFrame index out of the CSV file.
df_23_theDiplomat.to_csv('./_Capstone_Two_NLP/data/_news/theDiplomat_23.csv', index=False)

print('Complete')

Complete
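
The relative path above only resolves if the working directory is the expected one and the target folder already exists. A minimal sketch of a save helper that creates missing folders and reads the file back as a check follows. The helper name `save_csv` is hypothetical, and a temporary directory stands in for the real project folder.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(df, path):
    """Write df to path without the index, creating parent folders if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return path


# Demonstration in a temporary directory so no real project folder is touched.
demo = pd.DataFrame(
    [{"file_name": "theDiplomat_23", "title": "Hong Kong Through Water and Fire"}]
)
with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(demo, Path(tmp) / "data" / "_news" / "theDiplomat_23.csv")
    reloaded = pd.read_csv(out)

print(reloaded.columns.tolist())
```

Reading the file back is a cheap way to confirm that the columns survived the round trip before moving on to the next article.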
