# Scraping "Hong Kong’s Security Law: How Should Japan Respond?" from Nippon.com

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Bringing It Together](#2.5_bit)
* [2.8 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [10]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [11]:
# Review the block below for each website, especially the date format.

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Package versions, for reference.

In [12]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [13]:
url = 'https://www.nippon.com/en/japan-topics/g00896/'
content = requests.get(url).content
url

'https://www.nippon.com/en/japan-topics/g00896/'

In [14]:
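# Parse the HTML. This rebinds the name `soup` from the imported BeautifulSoup alias
# to the parsed document, so do not re-run this cell without re-running the import cell first.
soup = soup(content,'html.parser')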
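
A slightly more defensive version of the request step might look like the sketch below. The helper name `fetch_page` and the 10-second timeout are assumptions made for illustration; neither comes from the notebook. Importing `BeautifulSoup` under its own name also keeps the class available after the page has been parsed.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url, timeout=10):
    """Download a page and return it parsed with html.parser.

    `fetch_page` and the 10-second default timeout are illustrative
    assumptions, not part of the original notebook.
    """
    response = requests.get(url, timeout=timeout)
    # Raise an HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, 'html.parser')


# Example call (performs a network request):
# page = fetch_page('https://www.nippon.com/en/japan-topics/g00896/')
```
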

**Grab | Date**<a id='2.4_scrape_date'></a>

In [16]:
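# Pass the class filter as a dict, and avoid reusing the name `date` imported from datetime.
for date_tag in soup.find_all('div',{'class': 'c-date'}):
    print(date_tag.text.strip())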

The date is present in the markup but needs substantial cleaning before it can be parsed.
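
Beautiful Soup accepts a class filter in more than one form, and a set literal such as `{'class', 'c-date'}` is easy to type in place of a dict. The sketch below shows the two explicit forms on a small, hypothetical snippet of markup; the HTML is invented for illustration and is not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on the class name used in this notebook.
html = '<div class="c-date">Aug 14, 2020</div><div class="other">skip me</div>'
page = BeautifulSoup(html, 'html.parser')

# Form 1: attribute dictionary (the colon makes this a dict, not a set).
by_dict = page.find_all('div', {'class': 'c-date'})

# Form 2: the class_ keyword argument.
by_keyword = page.find_all('div', class_='c-date')

dates = [tag.get_text(strip=True) for tag in by_dict]
print(dates)
```
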

**Grab | Header**<a id='2.4_scrape_header'></a>

In [18]:
for title in soup.findAll('h1'):
    print(title.text.strip())

Hong Kong’s Security Law: How Should Japan Respond?


In [19]:
title

<h1 class="c-h1">Hong Kong’s Security Law: How Should Japan Respond?</h1>

In [20]:
TAG_RE = re.compile(r'<[^>]+>')

In [21]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [22]:
remove_tags(str(title))

'Hong Kong’s Security Law: How Should Japan Respond?'

Confirmed: the helper strips the tags and leaves only the headline text.
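
As a cross-check, Beautiful Soup's own `get_text` should give the same headline without converting the tag to a string first. The sketch below re-creates the `<h1>` shown above from a literal string rather than from the live page.

```python
import re
from bs4 import BeautifulSoup

TAG_RE = re.compile(r'<[^>]+>')


def remove_tags(text):
    """Strip anything that looks like an HTML tag from a string."""
    return TAG_RE.sub('', text)


markup = '<h1 class="c-h1">Hong Kong’s Security Law: How Should Japan Respond?</h1>'
h1 = BeautifulSoup(markup, 'html.parser').h1

via_regex = remove_tags(str(h1))      # regex applied to the serialized tag
via_bs4 = h1.get_text(strip=True)     # text extracted by the parser itself

print(via_regex == via_bs4)
```

The two approaches can diverge when the text contains characters such as `&`, because `str(tag)` serializes them as HTML entities while `get_text` returns the decoded text.
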

**Grab | Content**<a id='2.4_scrape_content'></a>

In [23]:
# Pass the class filter as a dict.
for bodies in soup.find_all('div',{'class': 'editArea'}):
    print(bodies.text.strip())

Growing Private and Governmental Support for Hong Kong Democracy
Members of the Standing Committee of the National People’s Congress of China in Beijing held a vote on June 30 on the Law of the People’s Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region, and it passed unanimously. This National Security Law was immediately added to Annex III of Hong Kong’s Basic Law. At a press conference the next day, Japan’s Chief Cabinet Secretary Suga Yoshihide called the law “regrettable.”
However, others in Japan have offered more direct support on the issue. As Hong Kong marked one year from the million-strong protest against an amended extradition bill, which would have allowed suspects in the territory to be sent to China for trial, on June 9, various nongovernmental organizations hosted an online symposium from Tokyo’s National Diet Building’s Chamber of the House of Representatives, titled “Global Solidarity: From Hong Kong to the International

In [24]:
len(bodies)

55

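Here `len(bodies)` counts the direct children of the last matching `div`, not characters or paragraphs, so the body still needs to be split and cleaned.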
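
To make the meaning of that number concrete, the sketch below compares three different "lengths" of a small, hypothetical container: its direct children, its paragraphs, and its characters. The markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the article container.
html = ('<div class="editArea"><h2>Heading</h2>'
        '<p>First paragraph.</p><p>Second paragraph.</p></div>')
container = BeautifulSoup(html, 'html.parser').find('div', class_='editArea')

# len() on a Tag counts its direct children: here h2, p, p.
child_count = len(container)

# Collect the paragraph text explicitly when paragraphs are what is wanted.
paragraphs = [p.get_text(strip=True) for p in container.find_all('p')]

# Character count of the extracted text.
char_count = len(container.get_text())

print(child_count, len(paragraphs), char_count)
```
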

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [131]:
# The loop body discards its result; the loop only leaves `day` bound to the last matching div.
for day in soup.find_all('div',{'class': 'c-content'}):
    day.text.strip()

In [132]:
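day_str = remove_tags(str(day))
day_str = day_str.split('\n')
day_str = day_str[3]  # the fourth line of the stripped block holds the publication date
# The month substitution is hard-coded for August; other months need their own mapping.
day_pub = re.sub('Aug','08',day_str)
day_pub = re.sub(',','',day_pub)
day_pub = re.sub(' ','-',day_pub)
day_pub = datetime.strptime(day_pub, '%m-%d-%Y').date()
day_pub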

datetime.date(2020, 8, 14)
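
The month substitution can be avoided by letting `strptime` read the abbreviated month name directly with `%b`. This is a minimal sketch that assumes the date string looks like `'Aug 14, 2020'`, which is what the substitutions above imply; `parse_pub_date` is a hypothetical helper name.

```python
from datetime import datetime


def parse_pub_date(text):
    """Parse a date such as 'Aug 14, 2020' into a datetime.date.

    Assumes English abbreviated month names, which is what %b matches
    under the default C/English locale.
    """
    return datetime.strptime(text.strip(), '%b %d, %Y').date()


print(parse_pub_date('Aug 14, 2020'))
```

Because `%b` covers all twelve months, the same call should work for articles published outside August, provided the site keeps this date layout.
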

In [133]:
df_date = pd.DataFrame([day_pub])

In [134]:
type(df_date)

pandas.core.frame.DataFrame

In [135]:
df_date

|  | 0 |
|---|---|
| 0 | 2020-08-14 |


Now the title.

In [65]:
for title in soup.findAll('h1'):
    print(title.text.strip())

Hong Kong’s Security Law: How Should Japan Respond?


In [66]:
title = title.text.strip()
title = [title]

In [67]:
type(title)

list

In [68]:
df_title = pd.DataFrame([title])

In [69]:
df_title

|  | 0 |
|---|---|
| 0 | Hong Kong’s Security Law: How Should Japan Res... |


In [70]:
type(df_title)

pandas.core.frame.DataFrame

These items are manually added.

In [71]:
country = 'Japan'
df_country = pd.DataFrame([country])
source = 'Nippon'
df_source = pd.DataFrame([source])
file_name = 'nippon_9'
df_file_name = pd.DataFrame([file_name])


Finally, the article body.

In [72]:
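# The loop body discards its result; the loop only leaves `bodies` bound to the last matching div.
for bodies in soup.find_all('div',{'class': 'editArea'}):
    bodies.text.strip()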

In [73]:
body = remove_tags(str(bodies))

In [74]:
body

'\nGrowing Private and Governmental Support for Hong Kong Democracy\nMembers of the Standing Committee of the National People’s Congress of China in Beijing held a vote on June 30 on the Law of the People’s Republic of China on Safeguarding National Security in the Hong Kong Special Administrative Region, and it passed unanimously. This National Security Law was immediately added to Annex III of Hong Kong’s Basic Law. At a press conference the next day, Japan’s Chief Cabinet Secretary Suga Yoshihide called the law “regrettable.”\nHowever, others in Japan have offered more direct support on the issue. As Hong Kong marked one year from the million-strong protest against an amended extradition bill, which would have allowed suspects in the territory to be sent to China for trial, on June 9, various nongovernmental organizations hosted an online symposium from Tokyo’s National Diet Building’s Chamber of the House of Representatives, titled “Global Solidarity: From Hong Kong to the Internat

In [75]:
body = body.split('\n')

In [83]:
body_core = body[1:27]  # slice bounds are specific to this article and were chosen by inspection
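
A hard-coded slice has to be re-checked for every article. One alternative, sketched below on a hypothetical sample string, is to keep every non-blank line. This is an assumption about what the slice is for; it will not drop trailing material such as author notes, which the slice may have been removing on purpose.

```python
# Hypothetical sample in the same shape as `body` before splitting:
# one string with newline-separated paragraphs and some blank lines.
raw_body = '\nHeading\nFirst paragraph.\n\n  \nSecond paragraph.\n'

# Keep every non-blank line, trimmed of surrounding whitespace.
lines = [line.strip() for line in raw_body.split('\n') if line.strip()]

# Join with single spaces so the paragraphs read as continuous text.
article_text = ' '.join(lines)

print(lines)
```
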

In [89]:
type(body_core)

list

In [90]:
body_core = ', '.join(body_core)

In [91]:
body_lists = [body_core]

In [92]:
df_news = pd.DataFrame()

In [93]:
df_news['article_body'] = body_lists

In [94]:
df_news.head()

|  | article_body |
|---|---|
| 0 | Growing Private and Governmental Support for H... |


In [95]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [96]:
df_news = df_news.article_body[0]

In [97]:
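# At this point df_news is a plain str. str.replace treats its first argument as literal text,
# not a regular expression, so this only removes the exact sequence \\? if it occurs.
df_news = df_news.replace(r'\\?','')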
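
The difference between a literal replacement and the likely intent can be seen on a small sample. The sketch below assumes the goal was to remove stray backslashes; the notebook does not say so, and the sample string is invented.

```python
# Hypothetical sample containing escaped quotation marks.
sample = r'He called the law \"regrettable.\" Why?'

# Literal search for backslash, backslash, question mark: not present, so nothing changes.
unchanged = sample.replace(r'\\?', '')

# Literal search for a single backslash: removes every backslash.
no_backslashes = sample.replace('\\', '')

print(unchanged == sample)
print(no_backslashes)
```
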

In [98]:
df_news = pd.DataFrame([df_news])

In [99]:
type(df_news)

pandas.core.frame.DataFrame

In [100]:
df_news.columns = ['Article']

In [101]:
df_news.head()

|  | Article |
|---|---|
| 0 | Growing Private and Governmental Support for H... |


**Bringing it together.**<a id='2.5_bit'></a>

In [136]:
df_9_nippon = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [137]:
df_9_nippon.columns = ['file_name','date','source','country','title','article']

In [138]:
df_9_nippon.head()

|  | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | nippon_9 | 2020-08-14 | Nippon | Japan | Hong Kong’s Security Law: How Should Japan Res... | Growing Private and Governmental Support for H... |
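
The same one-row frame can also be built in a single step from a dictionary, which names the columns up front instead of concatenating six single-cell frames. This is a sketch of an alternative, not the notebook's method; the article text is abbreviated here with a placeholder.

```python
import pandas as pd
from datetime import date

# Values copied from the cells above; the article text is a shortened placeholder.
record = {
    'file_name': 'nippon_9',
    'date': date(2020, 8, 14),
    'source': 'Nippon',
    'country': 'Japan',
    'title': 'Hong Kong’s Security Law: How Should Japan Respond?',
    'article': 'Growing Private and Governmental Support for Hong Kong Democracy ...',
}

# A list containing one dict produces a one-row DataFrame with named columns.
df_row = pd.DataFrame([record])

print(df_row.shape)
```
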


**Saving**<a id='2.6_save'></a>

In [139]:
cd

C:\Users\rands


Saving it to CSV.

In [140]:
# index=False keeps the DataFrame index out of the file, so no extra index column is written
df_9_nippon.to_csv('./_Capstone_Two_NLP/data/_news/nippon_9.csv', index=False)

print('Complete')

Complete
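
The effect of `index=False` can be checked with an in-memory round trip. The sketch below uses a small stand-in frame and a hypothetical helper, `round_trip`, rather than the saved file; when the index is written, pandas reads it back as a column labelled `Unnamed: 0`.

```python
import io
import pandas as pd

# Small stand-in frame; the real frame has six columns.
df = pd.DataFrame({'file_name': ['nippon_9'], 'source': ['Nippon']})


def round_trip(frame, **to_csv_kwargs):
    """Write a frame to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(df)                  # index written as an unnamed first column
without_index = round_trip(df, index=False)  # only the data columns are written

print(list(with_index.columns))
print(list(without_index.columns))
```
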
