# Scraping "Toxic Apple" from the Standard

## 2.1 Contents<a id='2.1_Contents'></a>
* [2.1 Importing Relevant Tools](#2.1_Importing)
* [2.2 Defining the Request](#2.2_URL)
* [2.3 Grab | Date](#2.4_scrape_date)
* [2.4 Grab | Header](#2.4_scrape_header)
* [2.5 Grab | Content](#2.4_scrape_content)
* [2.6 Clean | Send to DataFrame](#2.5_review)
* [2.7 Bringing It Together](#2.5_bit)
* [2.8 Save](#2.6_save)


**Importing Relevant Tools**<a id='2.1_Importing'></a>

In [1]:
import json
import pandas as pd
from bs4 import BeautifulSoup as soup
import requests
import csv
import re
import pickle

%reload_ext watermark

In [2]:
#the below needs to be reviewed for all websites; notably the time format

from datetime import date
from datetime import datetime
today = date.today()
d = today.strftime("%m-%d-%y")

Environment versions, for reference.

In [3]:
%watermark -d -t -v -p pandas

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

pandas: 1.1.3



**Defining the Request**<a id='2.2_URL'></a>

In [4]:
url = 'https://www.thestandard.com.hk/breaking-news/section/4/153941/State-media-says-Hong-Kong-has-a-toxic-%E2%80%98Apple%E2%80%99'
content = requests.get(url).content
url

'https://www.thestandard.com.hk/breaking-news/section/4/153941/State-media-says-Hong-Kong-has-a-toxic-%E2%80%98Apple%E2%80%99'
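
A bare `requests.get(url)` has no timeout by default and returns the body even when the server answers with an error status. A minimal sketch of a more defensive fetch follows; the helper name `fetch_html`, its `get` parameter, and the 10-second timeout are illustrative assumptions, not part of the notebook.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the raw response body, raising on HTTP error statuses.

    `fetch_html` and `get` are hypothetical names; `get` exists only so the
    function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.content
```

With this in place, `content = fetch_html(url)` could stand in for the direct `requests.get(url).content` call.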

In [5]:
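# Note: this rebinds the name `soup` from the BeautifulSoup class to the parsed page,
# so the cell only works once per import; re-run the import cell before re-running it.
soup = soup(content,'html.parser')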

**Grab | Date**<a id='2.4_scrape_date'></a>

In [6]:
# date_tag rather than date, so datetime.date imported above is not shadowed
for date_tag in soup.find_all('span',{'class':'pull-left'}):
    print(date_tag.text.strip())

Local | 26 Aug 2020 4:55 pm


In [7]:
date_tag

<span class="pull-left"><a href="https://www.thestandard.com.hk/section-news-list/section/local/">Local</a> | 26 Aug 2020 4:55 pm</span>

That is the date, but it still carries the section label and the time, so it needs cleaning.

**Grab | Header**<a id='2.4_scrape_header'></a>

In [8]:
# Iterating over the <h1> Tag yields its children; here that is the single headline string
for title in soup.find('h1'):
    print(title)

State media says Hong Kong has a toxic ‘Apple’


In [9]:
TAG_RE = re.compile(r'<[^>]+>')

In [10]:
def remove_tags(text):
    return TAG_RE.sub('', text)

In [11]:
remove_tags(str(title))

'State media says Hong Kong has a toxic ‘Apple’'

Confirmed that's the title.
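
The regex in `remove_tags` does the job here, but BeautifulSoup can also return the text of a tag directly. A small sketch on an inline HTML snippet follows; the snippet and the names `page` and `headline` are made up for illustration, and the real page markup may differ. It also keeps the `BeautifulSoup` class name separate from the parsed document.

```python
from bs4 import BeautifulSoup

# Stand-in HTML for illustration only.
html = "<html><body><h1> State media says Hong Kong has a toxic ‘Apple’ </h1></body></html>"

page = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag, or None if there is no <h1>.
h1 = page.find("h1")
headline = h1.get_text().strip() if h1 is not None else None
print(headline)
```

Checking for `None` avoids an `AttributeError` on pages that lack an `<h1>`.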

**Grab | Content**<a id='2.4_scrape_content'></a>

In [12]:
# attrs should be a dict mapping attribute name to value, not a set
for bodies in soup.find_all('div',{'class':'content'}):
    print(bodies.text.strip())

State media People’s Daily today published an article, saying Hong Kong is currently under siege by two viruses, the Covid-19 virus and Apple Daily’s ‘political virus.’It criticized that Apple Daily is not only propaganda by the opposition party, it may also be a dangerous political organization.Apple Daily’s boss Jimmy Lai Chee-ying and his two sons, along with four senior staff from Next Digital were arrested earlier this month over alleged violations of the national security law.People’s Daily critic pointed out that Apple Daily has gained advantage of Hong Kong’s third wave of Covid-19 cases to create social division. The critic added any moves rolled out by the government to fight the pandemic were smeared by them. 	The Standard Channel	        IOS  Android       IOS  Android         More>>


In [13]:
# len() of a Tag counts its child nodes, not its characters
len(bodies)

23

Confirmed that's the body, but it ends with trailing non-article text ("The Standard Channel", "IOS", "Android", "More>>").
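
One way to drop that trailing text is to cut the string at the first occurrence of a known marker. The sketch below assumes the marker is always "The Standard Channel", which is true for this one article but not verified for others; `strip_trailing_widget` is a hypothetical helper, not something the notebook defines.

```python
def strip_trailing_widget(text, marker="The Standard Channel"):
    """Drop everything from the first occurrence of `marker` onward.

    If the marker is absent, the text is returned unchanged apart from
    surrounding whitespace.
    """
    head, _sep, _tail = text.partition(marker)
    return head.strip()


sample = (
    "The critic added any moves were smeared by them. \t"
    "The Standard Channel\t        IOS  Android       IOS  Android         More>>"
)
cleaned = strip_trailing_widget(sample)
print(cleaned)
```

An article that mentions the marker phrase in its own text would be cut short, so the result is worth checking by eye.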

**Clean | Send to DataFrame**<a id='2.5_review'></a>

First, convert the date from a string to a `date` object.

In [14]:
for day in soup.find_all('span',{'class':'pull-left'}):
    print(day.text.strip())

Local | 26 Aug 2020 4:55 pm


In [15]:
day

<span class="pull-left"><a href="https://www.thestandard.com.hk/section-news-list/section/local/">Local</a> | 26 Aug 2020 4:55 pm</span>

In [16]:
day = remove_tags(str(day))

In [17]:
len(day)

27

In [18]:
day

'Local | 26 Aug 2020 4:55 pm'

In [19]:
day_pub = day[8:19]
day_pub

'26 Aug 2020'

In [20]:
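# Make the string parseable: month name -> number, spaces -> hyphens
# (only 'Aug' is handled here; other months need their own substitution)
day_pub = re.sub('Aug','08',day_pub)
day_pub = re.sub(' ','-',day_pub)

# Convert to a date object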
day_pub = datetime.strptime(day_pub, '%d-%m-%Y').date()
day_pub

datetime.date(2020, 8, 26)
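
The slice `day[8:19]` depends on the section label being exactly as long as "Local", and the substitution only covers August. A possible alternative, sketched under the assumption that every byline contains a date shaped like `26 Aug 2020`, is to find the date with a regex and let `strptime` read the month abbreviation via `%b`. The names `DATE_RE` and `parse_pub_date` are illustrative.

```python
import re
from datetime import datetime

# One or two digits, a three-letter month abbreviation, a four-digit year.
DATE_RE = re.compile(r"\d{1,2} [A-Za-z]{3} \d{4}")


def parse_pub_date(raw):
    """Extract a 'DD Mon YYYY' date from the byline and return a datetime.date."""
    match = DATE_RE.search(raw)
    if match is None:
        raise ValueError(f"no date found in {raw!r}")
    # %b matches abbreviated month names in the current locale (English assumed).
    return datetime.strptime(match.group(), "%d %b %Y").date()


print(parse_pub_date("Local | 26 Aug 2020 4:55 pm"))
```

Because `%b` is locale-dependent, this relies on the process running under an English (or C) locale.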

In [21]:
df_date = pd.DataFrame([day_pub])

In [22]:
type(df_date)

pandas.core.frame.DataFrame

In [23]:
df_date

|   | 0 |
|---|---|
| 0 | 2020-08-26 |


Now the title.

In [24]:
# The loop body does nothing; it only leaves `title` bound to the last child of <h1>
for title in soup.find('h1'):
    title

In [25]:
title

'State media says Hong Kong has a toxic ‘Apple’'

In [26]:
title_list = [title]

In [27]:
df_title = pd.DataFrame([title_list])

In [28]:
df_title

|   | 0 |
|---|---|
| 0 | State media says Hong Kong has a toxic ‘Apple’ |


In [29]:
type(df_title)

pandas.core.frame.DataFrame

These items are added manually.

In [30]:
country = 'China'
df_country = pd.DataFrame([country])
source = 'the Standard'
df_source = pd.DataFrame([source])
file_name = 'theStandard_13'
df_file_name = pd.DataFrame([file_name])


Finally, the article body.

In [31]:
news1 = []
# attrs should be a dict mapping attribute name to value, not a set
for body_other in soup.find_all('div',{'class':'content'}):
    news1.append(body_other.text.strip())

In [32]:
len(news1)

1

In [33]:
df_news = pd.DataFrame()

In [34]:
df_news['article_body'] = news1

In [35]:
df_news.head(2)

|   | article_body |
|---|---|
| 0 | State media People’s Daily today published an ... |


In [36]:
df_news['article_body'] = df_news.article_body.str.cat(sep=' ')

In [37]:
df_news = df_news.article_body[0]

In [38]:
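# Note: df_news is a plain str here, so str.replace() looks for the literal
# text r'\\?' (two backslashes and a question mark), not a regex pattern
df_news = df_news.replace(r'\\?','')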

In [39]:
df_news = pd.DataFrame([df_news])

In [40]:
type(df_news)

pandas.core.frame.DataFrame

In [41]:
df_news.columns = ['Article']

In [42]:
df_news.head()

|   | Article |
|---|---|
| 0 | State media People’s Daily today published an ... |


**Bringing it together.**<a id='2.5_bit'></a>

In [43]:
df_13_theStandard = pd.concat([df_file_name,df_date,df_source,df_country,df_title,df_news],axis = 1, ignore_index=False)

In [44]:
df_13_theStandard.columns = ['file_name','date','source','country','title','article']

In [45]:
df_13_theStandard.head()

|   | file_name | date | source | country | title | article |
|---|---|---|---|---|---|---|
| 0 | theStandard_13 | 2020-08-26 | the Standard | China | State media says Hong Kong has a toxic ‘Apple’ | State media People’s Daily today published an ... |
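
The same one-row table can also be built in a single step from a dictionary, which avoids creating six intermediate DataFrames and renaming columns afterwards. This is only a sketch: the values are copied from the cells above, the `article` value is the truncated display text used as a placeholder, and `record` and `df_row` are illustrative names.

```python
from datetime import date

import pandas as pd

# In the notebook these values would come from the existing variables
# (file_name, day_pub, source, country, title and the cleaned article text).
record = {
    "file_name": "theStandard_13",
    "date": date(2020, 8, 26),
    "source": "the Standard",
    "country": "China",
    "title": "State media says Hong Kong has a toxic ‘Apple’",
    "article": "State media People’s Daily today published an ...",  # placeholder
}

# A list containing one dict yields a one-row DataFrame with named columns.
df_row = pd.DataFrame([record])
print(df_row.columns.tolist())
```

The dictionary keys become the column names, in insertion order.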


**Saving**<a id='2.6_save'></a>

In [46]:
cd

C:\Users\rands


Saving it to CSV.

In [47]:
# index=False keeps the DataFrame index out of the CSV file
df_13_theStandard.to_csv('./_Capstone_Two_NLP/data/_news/theStandard_13.csv', index=False)

print('Complete')

Complete
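
The relative path passed to `to_csv` only resolves if the working directory is the one shown by `cd`, and the write fails when the target folder does not exist. A hedged sketch of creating the folder first with `pathlib` follows; a temporary directory stands in for the notebook's data folder, and `df_example`, `out_dir`, and `out_path` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_example = pd.DataFrame([{"file_name": "theStandard_13", "source": "the Standard"}])

# A temporary directory stands in for the notebook's data folder.
out_dir = Path(tempfile.mkdtemp()) / "data" / "_news"
out_dir.mkdir(parents=True, exist_ok=True)  # create missing parent folders

out_path = out_dir / "theStandard_13.csv"
df_example.to_csv(out_path, index=False, encoding="utf-8")

# Read the file back to confirm the row survived the round trip.
round_trip = pd.read_csv(out_path)
print(round_trip.shape)
```

Reading the file back is a cheap check that the save produced what was expected.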
