<font face="Verdana, cursive, sans-serif" >
<h1><center>Webpage Scraping with Python</center></h1>

<b>Author: Siraprapa Watakit</b>
<br><b>Last edited: November 2018</b>
<p><b>Synopsis:</b></p>

<p>This Python notebook is part of the working paper <b>"The Public and Private Life of Big Data", Pavabutr and Watakit (2018)</b>. It demonstrates, step by step, how to scrape company news from Reuters.</p>

<p>The following libraries are required for this data scraping project:</p>
<ul>
   <li><code>pandas, numpy</code>&nbsp; For dataframe, vector, and matrix processing</li>
   <li><code>BeautifulSoup</code> &nbsp;  For parsing HTML data</li>
   <li><code>urllib.request</code> &nbsp; For requesting HTML pages</li>
   <li><code>os, time, datetime</code> &nbsp;   Miscellaneous helper libraries for file path, date, and datetime manipulation</li>
</ul>

<p>Note that if you have installed Anaconda 3, the above libraries should already be installed. Otherwise, please refer to the <code>pip</code> installation manual.</p>

<p><b>Pre-requisite knowledge and self-paced learning references:</b></p>
<p>It is recommended that the reader have basic knowledge of Python, HTML/CSS, and BeautifulSoup. Free learning materials are listed below:
<ul>
    <li> <a href="https://www.youtube.com/watch?v=YYXdXT2l-Gg&list=PL-osiE80TeTt2d9bfVyTiXJA-UTHn6WwU">Python Tutorial for Beginners</a>  </li>
    <li> <a href="https://www.youtube.com/watch?v=UB1O30fR-EE&list=PLillGF-RfqbZTASqIqdvm1R5mLrQq79CU&index=1">HTML and CSS for Beginners</a>  </li>
    <li> <a href="https://www.youtube.com/watch?v=XQgXKtPSzUI">Web Scraping with Python and Beautiful Soup</a>  </li>
</ul>

<p><b>Disclaimer:</b> This material is for informational and educational purposes only.</p>



<font face="Verdana, cursive, sans-serif" >
<h2><center>End-to-End Process of Web Scraping</center></h2>
<p>Company news is extracted from the Reuters news website. By specifying a ticker and a date, we can retrieve the news for that stock on that date. A webpage is a text-based HTML document in which the content and styles are structured and controlled by HTML elements. We will use BeautifulSoup to extract specific news content from specific HTML elements.</p>

<img src="./images/rtnews.png" >
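
<p>As a minimal sketch of this extraction step, the snippet below parses a small hand-written HTML fragment rather than a live Reuters page. The fragment, its headlines, and the variable names are made up for illustration; only the <code>div</code> with class <code>topStory</code> mirrors the element targeted later in this notebook.</p>

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, written by hand to imitate the structure of a news page.
sample_html = """
<html><body>
  <div class="topStory">
    <h2>Example headline one</h2>
    <p>Example body text, first story.</p>
  </div>
  <div class="feature">
    <h2>Unrelated block</h2>
    <p>This block is ignored.</p>
  </div>
  <div class="topStory">
    <h2>Example headline two</h2>
    <p>Example body text, second story.</p>
  </div>
</body></html>
"""

parsed = BeautifulSoup(sample_html, "html.parser")

# Keep only the <div> elements whose class is "topStory".
stories = parsed.find_all("div", {"class": "topStory"})

# Pull the headline (<h2>) and the first paragraph (<p>) out of each story.
extracted = [(s.h2.get_text(strip=True), s.p.get_text(strip=True)) for s in stories]

for title, body in extracted:
    print(title, "|", body)
```

<p>The <code>div</code> with class <code>feature</code> is skipped because <code>find_all</code> only returns elements that match the requested class.</p>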

<p>In this program, we will programmatically retrieve and parse company news for Thailand SET50 stocks over a specific period. Note that the ticker list and the period are configurable.</p>


<font face="Verdana, cursive, sans-serif">
<b>Step 1 Import libraries</b>


In [None]:
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen 
import datetime
import time
import os

<font face="Verdana, cursive, sans-serif">
<b>Step 2 Define utility functions</b>
<ul>
    <li><code>gen_daterange()</code> generates a sequence of daily dates, counting back from a given base date</li>
    <li><code>parse_rtnews()</code> requests an HTML page for a given ticker and date. Upon receiving the HTML, it parses the content and extracts only the news items wrapped in the HTML element <b>div</b> with <b>class:</b> <code>topStory</code>. Note that this function is a parser specific to the Reuters news webpage.</li>
</ul>


In [None]:

def gen_daterange(year,month,day,numdays):
    """Generate N days until now, e.g., [20151231, 20151230,..]."""
    base = datetime.datetime(year,month,day)
    date_range = [base - datetime.timedelta(days=x) for x in range(0, numdays)]
    return [x.strftime("%Y%m%d") for x in date_range]

failed_tickers=[]
def parse_rtnews(url,ticker,date):
    """Parse topstory company news by tiecker and date"""
    #initialize
    ticker_list=[]
    date_list=[]
    title_list=[]
    body_list=[]
    url_ticker=url + ticker +"?date="+date
    
    #there maybe some hickups while open url, if so retry a few times, else just skip this one
    for i in range(5):
        try:
            url_client = urlopen(url_ticker)
            html = url_client.read()
            url_client.close()
            soupHtml = soup(html,"html.parser") 
            content=soupHtml.find_all("div", {'class': 'topStory'})
            if [ticker,date] in failed_tickers: failed_tickers.remove([ticker,date])
            break
        except Exception as e:
            if not [ticker,date] in failed_tickers: failed_tickers.append([ticker,date])
            print(e)
            print('retry...')
            time.sleep(np.random.poisson(2))
            continue
    
    #if there is news, let's parse it, if not, the function will return empty lists
    if len(content):
        print("Parsing company news: {}, {}".format(ticker,date))
        for i in range(len(content)):
            ticker_list.append(ticker)
            date_list.append(date)
            title = content[i].h2.get_text().replace(",", " ").replace("\n", " ")
            title_list.append(title)
            body = content[i].p.get_text().replace(",", " ").replace("\n", " ")
            body_list.append(body)    
    
    return ticker_list,date_list,title_list,body_list
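
<p>The retry logic inside <code>parse_rtnews()</code> can also be isolated into a small helper. The sketch below is one possible way to do that, not the notebook's own code: <code>fetch_with_retry</code> and <code>flaky_fetch</code> are hypothetical names, the fetch function is passed in so the example runs without network access, and a fixed wait replaces the Poisson-distributed pause used above.</p>

```python
import time

def fetch_with_retry(fetch, url, attempts=5, wait_seconds=1.0):
    """Call fetch(url) up to `attempts` times; return its result, or None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            print("attempt {} failed: {}".format(attempt, error))
            if attempt < attempts:
                time.sleep(wait_seconds)
    return None

# Stand-in for a real HTTP call (hypothetical): fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("temporary failure")
    return "<html>ok</html>"

result = fetch_with_retry(flaky_fetch, "https://example.com/page", wait_seconds=0)
print(result, calls["count"])
```

<p>Returning <code>None</code> after the last failed attempt gives the caller an explicit value to check, instead of relying on a variable that may never have been assigned.</p>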

<font face="Verdana, cursive, sans-serif">
<b>Step 3 Configure the tickers  and dates</b>


In [None]:
tickers = pd.read_csv("./input/tickersSET.csv")
tickers_list=tickers["TICKER"].tolist()
url = "https://www.reuters.com/finance/stocks/company-news/"
outfile="./output/rtnews/rtnews_"
daterange=gen_daterange(2017,12,31,numdays=365) #starting from this yyyy/mm/dd, crawl back numdays days
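
<p>To see what this configuration produces, the standalone sketch below rebuilds a three-day date range and the request URL for one ticker. The helpers <code>dates_back</code> and <code>build_url</code> and the ticker <code>PTT.BK</code> are assumptions for illustration; the actual tickers come from the <code>TICKER</code> column of the input file.</p>

```python
import datetime

base_url = "https://www.reuters.com/finance/stocks/company-news/"

def dates_back(year, month, day, numdays):
    """Return numdays date strings (YYYYMMDD), starting at the base date and going backwards."""
    base = datetime.date(year, month, day)
    return [(base - datetime.timedelta(days=x)).strftime("%Y%m%d") for x in range(numdays)]

def build_url(base_url, ticker, yyyymmdd):
    """Compose the request URL; the date is converted from YYYYMMDD to MMDDYYYY."""
    mmddyyyy = datetime.datetime.strptime(yyyymmdd, "%Y%m%d").strftime("%m%d%Y")
    return base_url + ticker + "?date=" + mmddyyyy

sample_dates = dates_back(2017, 12, 31, 3)
sample_url = build_url(base_url, "PTT.BK", sample_dates[0])  # hypothetical ticker

print(sample_dates)  # ['20171231', '20171230', '20171229']
print(sample_url)    # https://www.reuters.com/finance/stocks/company-news/PTT.BK?date=12312017
```

<p>Each ticker and date pair results in one page request, so a list of 50 tickers over 365 dates would mean 18,250 requests.</p>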

<font face="Verdana, cursive, sans-serif">
<b>Step 4 Looping through tickers and dates</b>
<p>The output of this step is one CSV file per date.</p>


In [None]:

for date in daterange:
    #initialize
    ticker_list_global=[]
    date_list_global=[]
    title_list_global=[]
    body_list_global=[]
    print("Crawling news date:{}".format(date))
    for ticker in tickers_list:
        print("\t" + ticker)
        datetimeobject = datetime.datetime.strptime(date,'%Y%m%d')
        newformat = datetimeobject.strftime('%m%d%Y')
        ticker_list,date_list,title_list,body_list=parse_rtnews(url,ticker,newformat)        
        if len(ticker_list):
            ticker_list_global.extend(ticker_list)
            date_list_global.extend(date_list)
            title_list_global.extend(title_list)
            body_list_global.extend(body_list)

    #if has news, save file by date
    if len(ticker_list_global)>0:
        print("Saving : {}".format(outfile+date+".csv"))
        headers=["ticker","date","title","body"]
        data=pd.DataFrame([ticker_list_global,date_list_global,title_list_global,body_list_global])
        data=data.transpose()
        data.to_csv(outfile+date+".csv",encoding='utf-8-sig',index=False,header=headers,sep="|")

#%% Checking - are there any failed attempts? If there are, we may want to rerun some dates/tickers later
if len(failed_tickers):
    print("There are some failed attempts")
    print(failed_tickers)
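
<p>If some attempts did fail, one way to plan a rerun is to group the failed pairs by date. The sketch below assumes the <code>[ticker, date]</code> layout used by <code>failed_tickers</code>, with dates in the MMDDYYYY form sent in the request URL; the sample values and the names <code>failed_example</code> and <code>retry_plan</code> are made up.</p>

```python
from collections import defaultdict

# Hypothetical contents of failed_tickers after a run: [ticker, date] pairs,
# with dates in the MMDDYYYY form used in the request URL.
failed_example = [["AAA.BK", "12312017"], ["BBB.BK", "12312017"], ["AAA.BK", "12302017"]]

# Map each date to the tickers that still need to be fetched for it.
retry_plan = defaultdict(list)
for ticker, date in failed_example:
    retry_plan[date].append(ticker)

for date, tickers_to_retry in retry_plan.items():
    print(date, tickers_to_retry)
```

<p>Each entry of <code>retry_plan</code> can then drive a second pass of the Step 4 loop for that date only.</p>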

<font face="Verdana, cursive, sans-serif">
<b>Step 5 Cleanup and aggregate all csv files</b>


In [None]:
outfile_path="./output/rtnews/"
filename_final_csv=outfile_path+"_rtnews_all_.csv"
files = [file for file in os.listdir(outfile_path) if file.endswith(".csv")]

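#DataFrame.append was removed in pandas 2.0, so collect the frames and concatenate them once
#read the date column as text so that leading zeros (e.g. 01022017) are preserved
frames = [pd.read_csv(outfile_path + file, sep='|',encoding='utf-8-sig',dtype={'date': str}) for file in files]
if frames:
    df = pd.concat(frames, ignore_index=True)
else:
    df = pd.DataFrame(columns=["ticker","date","title","body"])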

#%% save the concatenated result to one file
df=df.drop_duplicates(subset=['ticker','title','body'],keep='last')
df.to_csv(filename_final_csv,sep='|',index=False,encoding='utf-8-sig')
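
<p>The aggregation and de-duplication can be tried on a tiny in-memory example before running it on real output. The two pipe-separated "files" below are hypothetical and held in <code>io.StringIO</code> objects instead of on disk; the column names follow the headers written in Step 4.</p>

```python
import io
import pandas as pd

# Two hypothetical per-date files, held in memory instead of on disk.
file_a = io.StringIO("ticker|date|title|body\nAAA.BK|12312017|Title one|Body one\n")
file_b = io.StringIO(
    "ticker|date|title|body\n"
    "AAA.BK|12302017|Title one|Body one\n"
    "BBB.BK|12302017|Title two|Body two\n"
)

# dtype=str keeps the date column as text, so leading zeros would survive.
frames = [pd.read_csv(f, sep="|", dtype=str) for f in (file_a, file_b)]
combined = pd.concat(frames, ignore_index=True)

# The same story listed under two dates counts as a duplicate;
# keep="last" retains the row that appears last in the combined frame.
deduped = combined.drop_duplicates(subset=["ticker", "title", "body"], keep="last")
print(deduped)
```

<p>Because the date is not part of the <code>subset</code>, a story that Reuters lists on several days is kept only once.</p>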