# Web scraping stock market news for Sentiment Analysis

## 1. Introduction

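This notebook collects stock market news articles about AstraZeneca, published between 2014 and 2021, from [Investing.com](https://uk.investing.com/equities/astrazeneca-news). The pages are scraped dynamically: Selenium automates the browser so that the full HTML is loaded, and Beautiful Soup then extracts the article links from it. Each article is downloaded with newspaper3k and scored with NLTK's VADER sentiment analyser.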




## 2. Install/import libraries

In [1]:
!pip install htmldate
!pip install twython
!pip3 install newspaper3k


In [4]:
# lxml.html.clean, which newspaper3k imports, lives in a separate package since lxml 5.2
!pip install lxml_html_clean

import time
import warnings

import nltk
import numpy as np
import pandas as pd
import requests
import twython
from bs4 import BeautifulSoup
from htmldate import find_date
from newspaper import Article
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from tqdm import tqdm

warnings.filterwarnings('ignore')

# Download the lexicon used by the VADER sentiment analyser
nltk.download('vader_lexicon')

Collecting lxml_html_clean
  Downloading lxml_html_clean-0.3.1-py3-none-any.whl.metadata (2.4 kB)
Downloading lxml_html_clean-0.3.1-py3-none-any.whl (13 kB)
Installing collected packages: lxml_html_clean
Successfully installed lxml_html_clean-0.3.1


[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/yogev.na/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## 3. Data collection



In [17]:
# Set up Selenium
!pip install selenium

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Initialize Chrome options
chrome_options = Options()
chrome_options.add_argument('--headless')  # Run without opening a browser window
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

# Selenium 4.6+ ships with Selenium Manager, which locates or downloads a ChromeDriver
# matching the installed Chrome, so no explicit driver path is needed.
# The apt-get / chromium-chromedriver steps only apply to Debian-based hosts (e.g. Colab),
# and a Windows chromedriver.exe cannot run on macOS (it raises "Exec format error").
driver = webdriver.Chrome(options=chrome_options)

In [None]:
from selenium.webdriver.common.by import By

def get_newslinks(company, page_number):
    """Load one page of a company's news listing, scroll to the bottom so that all
    articles are rendered, and collect the URLs linked from each article.

    :param company: name of the company as it appears in the URL, e.g. 'astrazeneca'
    :param page_number: page number of the news listing to scrape

    :return: sorted array of unique article URLs
    """

    url = f"https://uk.investing.com/equities/{company}-news/{page_number}"
    driver.get(url)

    get_position_js = "return window.pageYOffset;"
    scroll_to_bottom_js = (
        "var scrollingElement = (document.scrollingElement || document.body);"
        "scrollingElement.scrollTop = scrollingElement.scrollHeight;")

    # Scroll to the bottom repeatedly until the scroll position stops changing
    old_position = 0
    new_position = None

    while new_position != old_position:
        # Get old scroll position
        old_position = driver.execute_script(get_position_js)
        # Sleep and scroll
        time.sleep(1)
        driver.execute_script(scroll_to_bottom_js)
        # Get new position
        new_position = driver.execute_script(get_position_js)

    cleaned_links = []

    # Iterate through the articles on the page (the XPath index runs from 1 to 10)
    for article_number in range(1, 11):
        # find_element_by_xpath was removed in Selenium 4.3; use find_element(By.XPATH, ...)
        article = driver.find_element(
            By.XPATH, f'/html/body/div[5]/section/div[8]/article[{article_number}]')
        article_html = article.get_attribute('innerHTML')
        soup = BeautifulSoup(article_html, "lxml")
        for link in soup.find_all('a'):
            # Get the href; skip anchors that have none
            partial_link = link.get('href')
            if not partial_link:
                continue
            if 'https' in partial_link:
                cleaned_links.append(partial_link)
            # Site-relative links start with '/' and need the site address prepended
            # (the result contains '//' after the host because both parts carry a slash)
            elif partial_link[0] == '/':
                cleaned_links.append('https://uk.investing.com/' + partial_link)

    return np.unique(cleaned_links)
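
The link clean-up at the end of `get_newslinks` can also be written as a standalone function, which makes it easy to check without a browser. The sketch below is one possible variant, not the notebook's method: `normalise_links` and `BASE_URL` are hypothetical names, and it uses `urllib.parse.urljoin` from the standard library instead of string concatenation. Unlike the original it keeps first-seen order rather than sorting, and it also accepts plain `http` links.

```python
from urllib.parse import urljoin

BASE_URL = "https://uk.investing.com/"  # assumption: same host as in get_newslinks


def normalise_links(hrefs, base_url=BASE_URL):
    """Return absolute, de-duplicated URLs in first-seen order (hypothetical helper)."""
    seen = set()
    cleaned = []
    for href in hrefs:
        # <a> tags without an href yield None; fragment-only links are not articles
        if not href or href.startswith("#"):
            continue
        # Absolute URLs pass through unchanged; site-relative ones get the host prepended
        absolute = urljoin(base_url, href)
        if absolute.startswith("http") and absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


links = normalise_links([
    "https://invst.ly/tb-6x",
    "/news/stock-market-news/uk-stocksfactors-to-watch-on-jan-4-2274638",
    None,
    "https://invst.ly/tb-6x",
])
print(links)
```

Because `urljoin` resolves the leading `/` against the host, the resulting URLs contain a single slash after `uk.investing.com`.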

In [None]:
# Create empty list to append URLs 

all_company_urls = []
for page in range(1,119):
    results = get_newslinks('astrazeneca', page)
    all_company_urls.extend(results)
all_company_urls

['https://invst.ly/tb-6x',
 'https://invst.ly/tb-n4',
 'https://invst.ly/tbon0',
 'https://invst.ly/tbxxc',
 'https://invst.ly/tbyb8',
 'https://invst.ly/tbzzy',
 'https://invst.ly/tc0ay',
 'https://invst.ly/tc2id',
 'https://invst.ly/tcjd5',
 'https://uk.investing.com//news/stock-market-news/uk-stocksfactors-to-watch-on-jan-4-2274638',
 'https://invst.ly/ta-bv',
 'https://invst.ly/tavnc',
 'https://invst.ly/tavzk',
 'https://invst.ly/tazh-',
 'https://invst.ly/tb0n-',
 'https://invst.ly/tb0po',
 'https://invst.ly/tb0pu',
 'https://invst.ly/tbgq4',
 'https://uk.investing.com//news/stock-market-news/uk-stocksfactors-to-watch-on-dec-31-2273851',
 'https://invst.ly/tao66',
 'https://invst.ly/tapfh',
 'https://invst.ly/tapq5',
 'https://invst.ly/taq6h',
 'https://invst.ly/tar6p',
 'https://invst.ly/tara2',
 'https://invst.ly/targ5',
 'https://invst.ly/tatz3',
 'https://invst.ly/tav9t',
 'https://uk.investing.com//news/economy/asian-shares-pause-recent-rally-euro-near-212year-high-2273311',
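
A run over more than a hundred pages can be interrupted by a single page that fails to load. As a hedged sketch of one way to make the page loop more tolerant, the hypothetical function `collect_links` below takes the page-scraping function as an argument, records pages that raise instead of stopping, and removes duplicates across pages. `fake_fetch` is a stand-in for `get_newslinks` so that the example runs without a browser; the `example.com` URLs are placeholders.

```python
import time


def collect_links(fetch_page, company, pages, delay_seconds=0.0):
    """Call fetch_page(company, page) for each page and merge the results.

    Pages that raise are recorded in `failed` instead of stopping the run.
    """
    urls, failed, seen = [], [], set()
    for page in pages:
        try:
            page_links = fetch_page(company, page)
        except Exception as exc:  # e.g. a missing element on a short last page
            failed.append((page, repr(exc)))
            continue
        for url in page_links:
            url = str(url)  # np.unique returns NumPy strings
            if url not in seen:
                seen.add(url)
                urls.append(url)
        time.sleep(delay_seconds)  # optional pause between pages
    return urls, failed


# Stand-in for get_newslinks so the sketch runs without a browser
def fake_fetch(company, page):
    if page == 2:
        raise RuntimeError("page failed to load")
    return [f"https://example.com/{company}/{page}", "https://example.com/shared"]


urls, failed = collect_links(fake_fetch, "astrazeneca", range(1, 4))
print(urls)
print(failed)
```

With the real scraper the call would look like `collect_links(get_newslinks, 'astrazeneca', range(1, 119), delay_seconds=1.0)`.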

In [None]:
# AstraZeneca stock ticker
ticker = 'AZN.L'

# Initialise the sentiment analyser once, outside the loop
sid = SentimentIntensityAnalyzer()

# Collect one dict per article, then build the DataFrame in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []

# Loop over all the articles
for link in all_company_urls:
    article = Article(link)

    try:
        article.download()
        article.parse()
        text = article.text
    except Exception:
        print("I didn't get this")
        continue

    # Get negative, neutral, positive and compound scores
    polarity = sid.polarity_scores(text)

    row = {'ticker': ticker, 'publish_date': find_date(link), 'title': article.title, 'body_text': text, 'url': link}
    # Add the polarity scores to the row
    row.update(polarity)
    rows.append(row)

# Build the DataFrame with a fixed column order
article_sentiments = pd.DataFrame(rows, columns=['ticker', 'publish_date', 'title', 'body_text',
                                                 'url', 'neg', 'neu', 'pos', 'compound'])

I didn't get this
I didn't get this
I didn't get this
I didn't get this
I didn't get this
I didn't get this
I didn't get this
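
The `compound` value returned by `polarity_scores` is a single normalised score between -1 and 1. A common convention, suggested in the VADER documentation, treats scores of 0.05 and above as positive and -0.05 and below as negative. The sketch below applies that convention; `label_compound` is a hypothetical helper that is not part of the notebook, and the threshold is an assumption that may need tuning for long news articles.

```python
def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label (hypothetical helper)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Example scores: the first is a compound value of the size VADER gives long articles
labels = [label_compound(score) for score in (0.9796, -0.3, 0.0)]
print(labels)
```

On the DataFrame built above, the same function could be applied with `article_sentiments['compound'].apply(label_compound)`.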


In [None]:
# Show DataFrame of article sentiments data

article_sentiments

| | ticker | publish_date | title | body_text | url | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AZN.L | 2021-01-04 | Coronavirus: UK starts rollout of AstraZeneca/... | The FTSE 100 firm has provided 530,000 doses r... | https://invst.ly/tb-6x | 0.020 | 0.877 | 0.103 | 0.9796 |
| 1 | AZN.L | | Optimism continues into new year as European s... | European stocks rose on Monday, with investors... | https://invst.ly/tb-n4 | 0.037 | 0.877 | 0.086 | 0.9545 |
| 2 | AZN.L | 2021-01-03 | AstraZeneca’s COVID-19 Vaccine Wins Emergency ... | The COVID-19 vaccine developed by AstraZeneca ... | https://invst.ly/tbon0 | 0.031 | 0.859 | 0.110 | 0.9769 |
| 3 | AZN.L | 2021-01-04 | AstraZeneca completes sale of heart failure tr... | The buyer is German pharma company Cheplapharm... | https://invst.ly/tbxxc | 0.038 | 0.907 | 0.055 | 0.5859 |
| 4 | AZN.L | 2021-01-04 | UK rolls out Oxford-AstraZeneca vaccine as it ... | LONDON — The U.K. has started rolling out the ... | https://invst.ly/tbyb8 | 0.018 | 0.928 | 0.054 | 0.9247 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1209 | AZN.L | 2014-04-30 | GSK says staying on sidelines in Astra, Pfizer... | By Ben Hirschler\n\nLONDON (Reuters) - GlaxoSm... | https://uk.investing.com//news/stock-market-ne... | 0.021 | 0.805 | 0.174 | 0.9972 |
| 1210 | AZN.L | 2014-05-01 | Pfizer prepares sweeter bid for AstraZeneca - ... | Pfizer prepares sweeter bid for AstraZeneca - ... | https://uk.investing.com//news/stock-market-ne... | 0.011 | 0.942 | 0.048 | 0.5274 |
| 1211 | AZN.L | 2014-05-01 | Pfizer’s designs on AstraZeneca stir tax envy ... | By Olivia Oran and Soyoung Kim\n\nNEW YORK (Re... | https://uk.investing.com//news/stock-market-ne... | 0.038 | 0.910 | 0.052 | 0.8085 |
| 1212 | AZN.L | 2014-05-01 | R&D site marks political frontline in $100 bil... | By Ben Hirschler\n\nCAMBRIDGE England (Reuters... | https://uk.investing.com//news/stock-market-ne... | 0.060 | 0.831 | 0.109 | 0.9920 |


In [None]:
# Save DataFrame 

article_sentiments.to_pickle("azn_article_sentiments_20210105.pkl")

In [None]:
article_sentiments.to_csv("azn_article_sentiments_20210105.csv", sep=',', encoding='utf-8', header=True)

In [None]:
# Save URLS to text file

with open('azn_urls_20210105.txt', 'w') as f:
    for link in all_company_urls:
        f.write("%s\n" % link)
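
Saving the URLs makes it possible to rerun the sentiment step later without scraping again. The following is a minimal sketch of the round trip, assuming one URL per line as written above; `save_urls` and `load_urls` are hypothetical helper names, and the demo writes to a temporary directory rather than to `azn_urls_20210105.txt`.

```python
import tempfile
from pathlib import Path


def save_urls(urls, path):
    """Write one URL per line, matching the format used in the cell above."""
    Path(path).write_text("".join(f"{url}\n" for url in urls), encoding="utf-8")


def load_urls(path):
    """Read the URLs back, ignoring blank lines and surrounding whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Round-trip demo in a temporary directory
sample = ["https://invst.ly/tb-6x", "https://invst.ly/tb-n4"]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_urls.txt"
    save_urls(sample, demo_path)
    restored = load_urls(demo_path)
print(restored)
```

In a later session, `all_company_urls = load_urls('azn_urls_20210105.txt')` would restore the list saved by the notebook.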