# Web Scraper
### Shell internship summer 2018 - Sanja Simonovikj

**Goal:** Build a web scraper to extract text from articles from certain websites.

**The websites we are going to scrape are the following:**
- Energy Intelligence: http://www.energyintel.com
- IHS Markit - Energy (Energy News on Demand): https://my.ihs.com/Energy
- World Energy News: https://www.worldenergynews.com/

Other websites to consider:
- CNBC: https://www.cnbc.com/
    - energy: https://www.cnbc.com/energy/
    - oil and gas: https://www.cnbc.com/oil-gas/
    - utilities: https://www.cnbc.com/utilities/
    - renewable energy: https://www.cnbc.com/renewable-energy/
- Economic Times: https://economictimes.indiatimes.com/
    - Power: https://economictimes.indiatimes.com/industry/energy/power/articlelist/msid-13358361,contenttype-a.cms
    - Oil and gas: https://economictimes.indiatimes.com/industry/energy/oil-gas/articlelist/msid-13358368,contenttype-a.cms
   
- Wall Street Journal - Energy: https://www.wsj.com/news/business/energy-oil-gas
- Reuters - Global Energy News: https://in.reuters.com/news/archive/globalEnergyNews
- Oil price: https://oilprice.com/


### Selenium Python

Selenium Python is an API for automating interaction with a browser from a Python script. Selenium is generally used for automated testing, but it works equally well for web scraping. We prefer Selenium over other APIs because it provides an easy way to log in to websites that require authentication. The one exception is websites that use CAPTCHA, since CAPTCHA's purpose is to prevent automated bots from navigating the website. To use Selenium, one needs Python, the `selenium` Python package, and a webdriver (such as chromedriver).

- Documentation:  http://selenium-python.readthedocs.io/installation.html#introduction
- Tutorial: http://www.marinamele.com/selenium-tutorial-web-scraping-with-selenium-and-python
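
Before running any of the scripts, it can help to confirm that the prerequisites listed above are in place. The following is a minimal sketch using only the standard library; the function name `check_prerequisites` and its parameters are illustrative and not part of the original scripts. It assumes the driver file is called `chromedriver` (on Windows the file would typically be `chromedriver.exe`).

```python
import importlib.util
import os
import shutil


def check_prerequisites(driver_name="chromedriver", folder="."):
    """Report whether the selenium package and a webdriver executable can be found.

    driver_name and folder are illustrative defaults, not taken from the scripts.
    """
    # find_spec returns None when the package is not installed.
    selenium_installed = importlib.util.find_spec("selenium") is not None
    # Look in the given folder first, then fall back to the PATH.
    local_driver = os.path.isfile(os.path.join(folder, driver_name))
    driver_on_path = shutil.which(driver_name) is not None
    return {
        "selenium": selenium_installed,
        "webdriver": local_driver or driver_on_path,
    }


for name, found in check_prerequisites().items():
    print(name, "found" if found else "MISSING")
```

If either entry is reported as missing, the scraping scripts below will most likely fail at import time or when the browser is started.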

### Automated scraping
One can run all the Python scripts simultaneously from a bash script and get the text files within minutes. This can also be scheduled to run periodically, so that nobody has to start the process manually. The runtime varies from website to website; the exact runtime is printed on the last line of each script's output (given below). As of now, scraping the homepage of a website takes roughly 1 to 6 minutes.
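
As an alternative to a bash script, the same idea can be sketched in Python with the standard `subprocess` module. This is only an illustration: the helper name `run_scripts` is made up here, and the script file names used in the usage note are placeholders for whatever the scraping scripts are actually called.

```python
import subprocess
import sys
import time


def run_scripts(scripts):
    """Start every script at once, wait for all of them, and return {script: exit code}."""
    start = time.time()
    # Popen returns immediately, so all scripts run side by side.
    processes = {script: subprocess.Popen([sys.executable, script]) for script in scripts}
    # wait() blocks until that script has finished and returns its exit code.
    exit_codes = {script: process.wait() for script, process in processes.items()}
    print("All scripts finished in %.1f seconds" % (time.time() - start))
    return exit_codes
```

For example, `run_scripts(["scrape_EI.py", "scrape_IHS.py", "scrape_WEN.py"])` (hypothetical file names) would start three browsers in parallel; a non-zero exit code in the returned dictionary indicates that a script failed. To run the launcher periodically, it could be registered with a scheduler such as cron or the Windows Task Scheduler.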

## Code for scraping various websites

### Energy Intelligence
>  http://www.energyintel.com

```python
# Script for scraping text from articles from Energy Intelligence website
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from datetime import datetime


startTime = datetime.now()

URL = "https://www.energyintel.com/pages/non-subscriber.aspx" # initial page with the login form
TIMEOUT = 40 # maximum number of seconds to wait for an element to appear; increasing this might prevent timeout errors
DATA_FOLDER = "scraped_EI" # folder to save the scraped articles
os.makedirs(DATA_FOLDER, exist_ok=True) # create the folder if it does not exist yet
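# Credentials are read from environment variables (the variable names are placeholders) so they are not stored in the script
USERNAME = os.environ["EI_USERNAME"]
PASSWORD = os.environ["EI_PASSWORD"]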

browser = webdriver.Chrome(executable_path=os.path.abspath("chromedriver")) #replace with .Firefox(), or with the browser of your choice
# The chromedriver executable needs to be in the same folder as this script
browser.get(URL) # navigate to the page

username = browser.find_element_by_id("ctl00_Header_uc_login1_UserName") #username form field
password = browser.find_element_by_id("ctl00_Header_uc_login1_pwd") #password form field


username.send_keys(USERNAME) # populate the form
if password.is_displayed():
	# if password is not visible (cached) this will not be executed
	password.send_keys(PASSWORD)


submitButton = browser.find_element_by_id("ctl00_Header_uc_login1_imglogin") 
submitButton.click()  # submit the login info


articles = browser.find_elements_by_class_name('newsletter_headline')
hrefs = [article.get_attribute('href') for article in articles]
urls = [href for href in hrefs if href and ".pdf" not in href] # urls of articles, skipping missing links and PDFs

print("Number of articles to be scraped: ", len(urls))

i=0
# iterate through article links
for url in urls:
	browser.get(url)  
	scraped_text = ""
	try:
		WebDriverWait(browser, TIMEOUT).until(EC.visibility_of_element_located((By.XPATH,'//*[@id="divArticleHpContent"]/table/tbody/tr[6]/td')))
	except TimeoutException:
		print("Timed out waiting for page to load, skipping: ", url)
		continue # move on to the next article; quitting the browser here would break the rest of the loop

	title = browser.find_element_by_xpath('//*[@id="divArticleHpContent"]/table/tbody/tr[6]/td')
	print("Article " + str(i+1) + " currently scraped: ", title.text)
	scraped_text += title.text + "\n"
	records = browser.find_elements_by_xpath('//*[@id="divArticleHpContent"]/table/tbody/tr[7]/td/div')
	for record in records:
		for paragraph in record.find_elements_by_xpath('.//p'):
			scraped_text += paragraph.text + "\n"
	timestr = time.strftime("%Y%m%d-%H%M%S")
	with open(os.path.join(DATA_FOLDER,"doc_" + timestr + "_" + str(i+1) + ".txt"), 'w', encoding='utf-8') as d: # the article number keeps file names unique within one second
		d.write(scraped_text) # write the scraped text in a document

	i+=1


print("Time report: It took " +str((datetime.now() - startTime).total_seconds())+ " seconds to run this script")
time.sleep(1)
browser.quit()
```

Output:
```
Number of articles to be scraped:  17
Article 1 currently scraped:  US Oil Lobby Backs Bill to Check Trump's Trade Authority
Article 2 currently scraped:  Denmark Plays Coy on Nord Stream 2 Permit Approval
Article 3 currently scraped:  Kuwait Energy Mulls Sale Options Amid Financial Pressures
Article 4 currently scraped:  US Crude Exports Could Suffer From Syncrude Outage
Article 5 currently scraped:  Al-Kaabi: Qatar Prepared to Comply With EU Regulations
Article 6 currently scraped:  Oil Prices Rise as Iran Doubt Overshadows Saudi Pledge
Article 7 currently scraped:  States Sue EPA Over Changes to Emissions Rules
Article 8 currently scraped:  EIA: Lower 48 Oil Output Ascends to 10.5 Million b/d
Article 9 currently scraped:  Exxon, Chevron CEOs Assail Trump Trade Policies
Article 10 currently scraped:  Judge Throws Out Landmark Climate Case
Article 11 currently scraped:  Erdogan Victory Buoys Turkey in Russia, Caspian Gas Games
Article 12 currently scraped:  Solar, Wind Quickly Catching Up on Cost
Article 13 currently scraped:  Security Scare Rattles Mozambique LNG
Article 14 currently scraped:  Editorial: Opec Makes a Move
Article 15 currently scraped:  European Offshore Wind Players Vie for US Northeast Leases
Article 16 currently scraped:  Mounting Uncertainties Require Opec to Stay Nimble
Article 17 currently scraped:  Super-Laterals, Cost Controls Help Appalachian E&Ps Thrive in $3 Market
Time report: It took 300.118485 seconds to run this script
```
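
Output file names built only from a timestamp with one-second resolution can collide when two articles are saved within the same second. The following is a minimal sketch of a naming helper that adds the article number and a shortened title; the function `article_filename` and the date in the example are illustrative and not part of the original script.

```python
import re
from datetime import datetime


def article_filename(index, title, now=None):
    """Build a file name from a timestamp, the article number and a short title slug."""
    now = now or datetime.now()
    timestr = now.strftime("%Y%m%d-%H%M%S")
    # Keep letters and digits, turn every other run of characters into one hyphen.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).lower()[:40].strip("-")
    return "doc_%s_%03d_%s.txt" % (timestr, index, slug)


# The date below is arbitrary and only makes the example reproducible.
print(article_filename(3, "Solar, Wind Quickly Catching Up on Cost", datetime(2018, 6, 27, 10, 15, 0)))
# doc_20180627-101500_003_solar-wind-quickly-catching-up-on-cost.txt
```

Including the title in the file name also makes it easier to find a specific article in the data folder without opening each file.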

### IHS Markit - Energy
> https://my.ihs.com/Energy

```python
# Script for scraping text from articles from IHS Markit-Energy website
import os
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from datetime import datetime

startTime = datetime.now()

URL = "https://my.ihs.com/Energy?ForceLogin=True" # initial page with the login form
DATA_FOLDER = "scraped_IHS" # folder to save the scraped articles
os.makedirs(DATA_FOLDER, exist_ok=True) # create the folder if it does not exist yet
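# Credentials are read from environment variables (the variable names are placeholders) so they are not stored in the script
USERNAME = os.environ["IHS_USERNAME"]
PASSWORD = os.environ["IHS_PASSWORD"]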


browser = webdriver.Chrome(executable_path=os.path.abspath("chromedriver")) #replace with .Firefox(), or with the browser of your choice
# The chromedriver executable needs to be in the same folder as this script
browser.get(URL) # navigate to the page

username = browser.find_element_by_id("txtUsername") #username form field
password = browser.find_element_by_id("txtPassword") #password form field


username.send_keys(USERNAME) # populate the form
if password.is_displayed():
	# if password is not visible (cached) this will not be executed
	password.send_keys(PASSWORD)


submitButton = browser.find_element_by_id("btnSubmit") 
submitButton.click()  # submit the login info

#Navigate to Energy News on Demand page
browser.get("https://penod.ihsenergy.com/ENOD/Home#")


articles = browser.find_elements_by_class_name('visitedlink')
#print("articles found: ", articles)
hrefs = [article.get_attribute('href') for article in articles]
urls = [href for href in hrefs if href and ".pdf" not in href] # urls of articles, skipping missing links and PDFs

print("Number of articles to be scraped: ", len(urls))

i=0
# iterate through article links
for url in urls:

	browser.get(url)  
	scraped_text = ""
	title=browser.find_element_by_xpath("/html/body/div[3]/h1")
	print("Article " + str(i+1) + " currently scraped: ", title.text)
	scraped_text += title.text + "\n"
	paragraphs = browser.find_elements_by_class_name("MsoNormal")
	for p in paragraphs:
		scraped_text += p.text +"\n"

	timestr = time.strftime("%Y%m%d-%H%M%S")
	with open(os.path.join(DATA_FOLDER,"doc_"+ timestr + "_" + str(i+1) + ".txt"), 'w', encoding='utf-8') as d: # the article number keeps file names unique within one second
		d.write(scraped_text) # write the scraped text in a document

	i+=1


print("Time report: It took " +str((datetime.now() - startTime).total_seconds())+ " seconds to run this script")
time.sleep(1)
browser.quit()
```

Output: 
```
Number of articles to be scraped:  10
Article 1 currently scraped:  New horizontal drilling slated for Arkansas Pennsylvanian pool By Ed Marker
Article 2 currently scraped:  Great Plains stakes southern Cambridge Arch basement test By Ed Marker
Article 3 currently scraped:  New horizontal drilling slated for Arkansas Pennsylvanian pool By Ed Marker
Article 4 currently scraped:  Suemaur adds to northern Kansas exploration program By Ed Marker
Article 5 currently scraped:  Two Woodford tests staked on northwestern McClain County pad By Ed Marker
Article 6 currently scraped:  Strand Energy sets pipe at 9610-ft Smith County wildcat By Marc Eckhardt, Jeff Gosmano
Article 7 currently scraped:  Cobra Oil & Gas drills apparent producer in Angelina County By Jeff Gosmano
Article 8 currently scraped:  W&T Offshore to bypass Viosca Knoll test in Virgo field By Marc Eckhardt, Jeff Gosmano
Article 9 currently scraped:  U.S. Energy stakes 4500-ft eastern Williamson County test By Jeff Gosmano
Article 10 currently scraped:  EnLink plans new Delaware Basin crude oil gathering system By Marc Eckhardt
Time report: It took 89.284134 seconds to run this script

```
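
In the output above, articles 1 and 3 have the same title, which suggests that the same link was collected twice from the page. A minimal sketch of removing repeated links before scraping is shown below; the helper name `unique_urls` and the example links are made up for illustration.

```python
def unique_urls(urls):
    """Return urls without missing entries or repeats, keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(url for url in urls if url))


links = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",
    None,
]
print(unique_urls(links))
# ['https://example.com/a', 'https://example.com/b']
```

Applying such a helper to the `urls` list before the loop would avoid scraping and saving the same article more than once.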

### World Energy News
> https://www.worldenergynews.com/

```python
# Script for scraping text from articles from World Energy News website
import os
import time
from selenium import webdriver
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException
from datetime import datetime

startTime = datetime.now()

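URL = "https://www.worldenergynews.com/" # home page with the article links (this site needs no login)
DATA_FOLDER = "scraped_WEN" # folder to save the scraped articles
os.makedirs(DATA_FOLDER, exist_ok=True) # create the folder if it does not exist yet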
TIMEOUT = 40

browser = webdriver.Chrome(executable_path=os.path.abspath("chromedriver")) #replace with .Firefox(), or with the browser of your choice
# The chromedriver executable needs to be in the same folder as this script
browser.get(URL) # navigate to the page

urls = []
latest_news = browser.find_elements_by_class_name('grayscale')
for article in latest_news:
	#print(article.get_attribute('innerHTML'))
	a = article.find_element_by_tag_name("a")
	urls.append(a.get_attribute('href'))

articles = browser.find_elements_by_class_name("a")
for article in articles:
	urls.append(article.get_attribute('href'))

print("Number of articles to be scraped: ", len(urls))
i=0
# iterate through article links
for url in urls:

	browser.get(url)  
	scraped_text = ""
	try:
		WebDriverWait(browser, TIMEOUT).until(EC.visibility_of_element_located((By.XPATH,'//*[@id="wrapper"]/div[2]/div[1]/div/div/article/h1')))
	except TimeoutException:
		print("Timed out waiting for page to load, skipping: ", url)
		continue # move on to the next article; quitting the browser here would break the rest of the loop

	title=browser.find_element_by_xpath('//*[@id="wrapper"]/div[2]/div[1]/div/div/article/h1')
	print("Article " + str(i+1) + " currently scraped: ", title.text)
	scraped_text += title.text + "\n"
	paragraphs = browser.find_elements_by_xpath("//div[@itemprop='text']")

	for p in paragraphs:
		scraped_text += p.text +"\n"

	timestr = time.strftime("%Y%m%d-%H%M%S")
	with open(os.path.join(DATA_FOLDER,"doc_"+ timestr+ "_" + str(i+1) + ".txt"), 'w', encoding='utf-8') as d: # the article number keeps file names unique within one second
		d.write(scraped_text) # write the scraped text in a document

	i+=1


print("Time report: It took " +str((datetime.now() - startTime).total_seconds())+ " seconds to run this script")
time.sleep(1)
browser.quit()
```

Output:

```
Number of articles to be scraped:  44
Article 1 currently scraped:  EnerMech Obtains $24m Service Contract
Article 2 currently scraped:  The Culture Clash Behind GE's Exit from Baker Hughes
Article 3 currently scraped:  Goldman Sachs Investment Division Upbeat on Oil
Article 4 currently scraped:  Gas to Overtake Oil as Primary Energy Source by Mid-2030s
Article 5 currently scraped:  Canada Dreams of Oil Exports to Asia, but California Beckons
Article 6 currently scraped:  MAN D&T Rebrands ‘MAN Energy Solutions’
Article 7 currently scraped:  Trelleborg Solutions to FSRU-Based LNG Terminal
Article 8 currently scraped:  MOL's FSRU Challenger for Hong Kong LNG Terminal Project
Article 9 currently scraped:  Offshore Energy: LR Supports Mozambique FLNG Project
Article 10 currently scraped:  Idemitsu Founding Family to Accept Showa Shell Merger
Article 11 currently scraped:  Oil Rises on Supply Losses, U.S. Push to Isolate Iran
Article 12 currently scraped:  GE to Divest Baker Hughes Stake
Article 13 currently scraped:  U.S. Court Dismisses Climate Change Lawsuits Against Oil Majors
Article 14 currently scraped:  Maritime Decarbonization: The Path Starts in Norway
Article 15 currently scraped:  EPA Proposes 2019 Biofuel Requirements
Article 16 currently scraped:  U.S. Coal Cargo Reaches China, Beating Import Tariff Deadline
Article 17 currently scraped:  Equinor Installs World’s First Battery for Offshore Wind
Article 18 currently scraped:  Woodside Mulls Texas Sempra LNG Exit
Article 19 currently scraped:  ExxonMobil Mulls "multi-billion" Dollar Singapore Refinery Expansion
Article 20 currently scraped:  Libya's Internationally Recognized Government Pans Oil Port Decision
Article 21 currently scraped:  Airborne, Subsea 7 Launch TCP Riser Program
Article 22 currently scraped:  Oil Steady as Outages Balance Trade Dispute, OPEC
Article 23 currently scraped:  Europe Distillates-Cracks Rise Despite Wave of Imports
Article 24 currently scraped:  PLAT-I: Renewable Energy for Canada
Article 25 currently scraped:  Oil Stocks Drop by Nearly 10 mln Barrels - EIA
Article 26 currently scraped:  Chevron, Exxon CEOs Worry Global Trade Conflict Could Harm Economy
Article 27 currently scraped:  Oil Rises as Outages Balance Trade Dispute, OPEC
Article 28 currently scraped:  Oil Producer Deal May Be Short of What's Needed
Article 29 currently scraped:  Total, Singapore's Pavilion Energy Sign LNG Ship Fuel Supply Chain Deal
Article 30 currently scraped:  Russia's Novatek, S.Korea's Kogas Sign LNG Agreement
Article 31 currently scraped:  Croatia Reopens Bidding for LNG Terminal Capacity
Article 32 currently scraped:  LNG to Benefit Philippines, Says ADB Energy Expert
Article 33 currently scraped:  Oil Rises on Supply Losses, U.S. Push to Isolate Iran
Article 34 currently scraped:  Macron Approves Offshore Windpower Projects
Article 35 currently scraped:  EU Agrees Final Energy Saving, Renewables Targets
Article 36 currently scraped:  China Ramping Up Renewable Power
Article 37 currently scraped:  DNV GL Unveils Renewable Energy Data Platform
Article 38 currently scraped:  US Announces $18.5 Mln for Offshore Wind Research
Article 39 currently scraped:  GTT Studying Gravity Based Systems for LNG Projects
Article 40 currently scraped:  New Clamp Technology for Subsea Well Intervention
Article 41 currently scraped:  U.S. Pushes Allies to Halt Iran Oil Imports
Article 42 currently scraped:  Idemitsu Founding Family to Accept Showa Shell Merger
Article 43 currently scraped:  Minnesota Regulators Question Enbridge Pipeline Expansion
Article 44 currently scraped:  Venezuela's Creditors Call for Unity
Time report: It took 245.9154 seconds to run this script

```
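
With dozens of articles per run, a single page that fails to load should not stop the whole script. One possible pattern, sketched below under the assumption that the per-article work is wrapped in a function, is to catch errors per URL and report the failures at the end. The names `scrape_all` and `scrape_one` are hypothetical, and `fake_scrape` only stands in for the Selenium code.

```python
def scrape_all(urls, scrape_one):
    """Call scrape_one(url) for every url; collect failures instead of stopping."""
    results, failures = {}, []
    for url in urls:
        try:
            results[url] = scrape_one(url)
        except Exception as error:  # for example selenium's TimeoutException
            print("Skipping", url, "because of", type(error).__name__)
            failures.append(url)
    return results, failures


def fake_scrape(url):
    """Stand-in for the real scraping code: fails for one URL on purpose."""
    if "broken" in url:
        raise ValueError("page did not load")
    return "text of " + url


results, failures = scrape_all(["page-1", "broken-page", "page-2"], fake_scrape)
print(len(results), "scraped;", len(failures), "failed")
# 2 scraped; 1 failed
```

The list of failed URLs can then be retried later or logged for inspection.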