# Homework exercise 2

__Name 1:__ Serhii Horbachov, 12026116

__Name 2:__ Eyad Yusuf Solieman, 01556757

__Yahoo Finance News Crawling__

Your task is to collect and organize articles from Yahoo Finance News, available at https://finance.yahoo.com/news

This will require you to use Selenium for at least two reasons:

* The site initially loads only part of the news stream, so scrolling is necessary to access additional news articles
* Long news articles are initially shown (and downloadable) only in an abbreviated form. Browser automation is necessary to click the button that displays the whole article.

There are about 200 news articles available at a time, stemming from various sources (e.g. Bloomberg, Reuters), and covering different topic areas (e.g. Business, World).

Note that you will need to use some features of Selenium - in particular scrolling - that are not discussed in the class notebook but documented elsewhere. 

You are asked to

* Download all of the approximately 200 news articles available from Yahoo Finance News
* Exclude the advertisements listed in between the news articles
* Try to remove everything that is not part of the article itself (e.g. the sidebar containing a list of popular articles and other content) and save the text of each article in a text file
* Create a DataFrame containing the following information (to the extent it is available) for each article
    * Title
    * Source (e.g. Bloomberg)
    * Time and date of publication
    * Name and path of the file containing the article's text that you saved
    
Please save this DataFrame so that you can later reuse the data set you created.

In [1]:
#!pip install selenium
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import time
import json
from selenium import webdriver
import re
import random
import os

In [2]:
%%time
# %%time reports the wall time: the time elapsed from the start of the cell to its end

from selenium.webdriver.chrome.options import Options   # Chrome options, used here for headless browsing

options = Options()                                     # create the options object
options.add_argument('--headless')                      # run Chrome without a visible window
driver = webdriver.Chrome(options=options)              # start Chrome with the headless option

driver.get('https://finance.yahoo.com/news')
driver.implicitly_wait(3)                               # element lookups retry for up to 3 seconds before failing

driver.find_element('xpath', "//button[@type='submit']").click()  # find the first button of type 'submit' and click it
                                                                   # ('xpath' is the value of By.XPATH)


SCROLL_PAUSE_TIME = 3                                   # wait for scrolling 3 seconds


last_height = driver.execute_script("return document.documentElement.scrollHeight")    # current scroll height of the page

while True:
    # scroll down to the bottom: window.scrollTo(x, y) with x = 0 and y = the full page height
    driver.execute_script("window.scrollTo(0,document.documentElement.scrollHeight);")

    time.sleep(SCROLL_PAUSE_TIME)                       # wait to load page (3 seconds)

    # calculate the new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.documentElement.scrollHeight")
    if new_height == last_height:
        break

    last_height = new_height


html = driver.page_source                               # the site's HTML code
soup = BeautifulSoup(html, 'html.parser')               # name the parser explicitly to avoid a bs4 warning

Wall time: 56.8 s
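
The scroll loop above stops only when the page height no longer changes. A minimal sketch of the same idea with an upper bound on the number of rounds is shown below, so that a stream that keeps growing cannot block the notebook forever. The function name `scroll_until_stable` and its parameters are hypothetical and not part of Selenium; the two callables are assumed to wrap the `driver.execute_script` calls used in the cell above.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=3.0,
                        max_rounds=50, sleep=time.sleep):
    """Scroll until the page height stops growing or max_rounds is reached.

    get_height and scroll_to_bottom are callables supplied by the caller
    (hypothetical helpers, e.g. thin wrappers around driver.execute_script).
    Returns the number of scroll rounds that were performed.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_to_bottom()
        sleep(pause)                      # give the page time to load new items
        new_height = get_height()
        if new_height == last_height:     # nothing new was loaded: stop
            return rounds
        last_height = new_height
    return max_rounds                     # safety cap reached


# Assumed usage with the driver from the cell above:
# scroll_until_stable(
#     lambda: driver.execute_script("return document.documentElement.scrollHeight"),
#     lambda: driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);"),
# )
```

Passing the two browser actions in as callables keeps the stopping logic independent of the browser, so it can be checked without starting Chrome.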


In [3]:
# get article links

BASE_URL = 'https://finance.yahoo.com/news'

class_name = 'js-stream-content'                                # class that marks the items of the news stream
articles_links = soup.find_all('li', {'class': class_name})     # find all 'li' tags that carry this class

links = []                                                      # collected article links

for article in articles_links:
    try:
        a_tag = article.h3.a                                    # the 'a' tag inside the item's 'h3' heading
        link = BASE_URL + a_tag['href']                         # makes the links of ads inaccessible, but
                                                                # keeps Yahoo articles accessible

        links.append(link)                                      # add the link to the list
    except (AttributeError, TypeError, KeyError):               # item without an 'h3' heading or a link: skip it
        pass

print("Articles found:", len(links))

Articles found: 186
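
The cell above builds each link by string concatenation. As an alternative sketch (not the method used in this notebook), the links can be resolved with `urllib.parse.urljoin` and filtered by host, which drops advertisements that point to other sites. The helper name `resolve_article_links`, the `allowed_host` parameter, and the sample hrefs in the usage lines are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlparse


def resolve_article_links(page_url, hrefs, allowed_host="finance.yahoo.com"):
    """Turn raw href values into absolute URLs on allowed_host.

    Hypothetical helper: relative hrefs are resolved against page_url,
    links to other hosts (assumed here to be ads) are dropped, and
    duplicates are removed while the original order is kept.
    """
    links = []
    for href in hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == allowed_host and absolute not in links:
            links.append(absolute)
    return links


# Made-up href values, for illustration only
sample = ["/news/example-article.html", "https://ads.example.com/offer"]
print(resolve_article_links("https://finance.yahoo.com/news", sample))
```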


In [4]:
from selenium.webdriver import ActionChains
import shutil

In [7]:
articles_dir = 'yahoo articles'     # create the path for the file containing the article's text

if os.path.exists(articles_dir):    # looking if the path already exists
    shutil.rmtree(articles_dir)     # if the path already exists, it will be deleted
os.makedirs(articles_dir)           # create all unavailable/missing directories

In [8]:
for index, link in enumerate(links):

    options = Options()
    options.add_argument('--headless')              # headless browsing
    driver = webdriver.Chrome(options=options)

    driver.get(link)                                # open the link from the links list
    driver.implicitly_wait(3)                       # element lookups retry for up to 3 seconds

    accept_cookies_btn = 'agree'

    try:
        button = driver.find_element('xpath', f'//button[@value="{accept_cookies_btn}"]')  # find a button with value="agree"
                                                                                           # ('xpath' is the value of By.XPATH)
    except Exception:
        button = None                               # no 'accept cookies' button, do nothing

    if button:                                      # the cookie banner is present
        print("cookie banner is displayed")
        button.click()                              # accept the cookies

    try:
        button = driver.find_element('xpath', '//button[text()="Story continues"]')        # button that expands the article
    except Exception:
        button = None                               # no such button: the article is already complete

    if button:                                      # scroll to the button
        actions = ActionChains(driver)              # define an action chain
        actions.move_to_element(button).perform()   # move the pointer to the button

        button.click()                              # click the button
   
    # parse article
    html = driver.page_source                       # HTML code of the article
    soup = BeautifulSoup(html, 'html.parser')

    title = soup.find('h1', {'data-test-locator': 'headline'}).text  # 'h1' tag whose data-test-locator
                                                                     # attribute has the value 'headline'
    pub_time = soup.find('time').text                                # text of the first 'time' tag; the name 'time'
                                                                     # is avoided because it would shadow the module
    source = soup.find('div', {'class': 'caas-logo'}).a.text         # text of the 'a' tag inside the caas-logo div
    article_div = soup.find('div', {'class': 'caas-body'})           # the tag that contains the article

    text = ""                                       # create a string
    for p in article_div.find_all('p'):             # every 'p' tag inside 'article_div'
        text += p.text                              # append the text of the 'p' tag

    # save the article text in a file named by its index
    filename = f'{index}.txt'

    # the explicit UTF-8 encoding avoids a UnicodeEncodeError
    with open(os.path.join('yahoo articles', filename), 'w', encoding="utf-8") as file:
        file.write('\t'.join([title, source, pub_time, text]))       # title, source, time and text, tab-separated,
                                                                     # in one txt file

    driver.quit()                                                    # close the browser

cookie banner is displayed

In [10]:
# read a file with article text
import csv

# create a dataframe with the following columns: # title
                                                 # source
                                                 # time and date of publication
                                                 # path to the file

            
files = os.listdir('yahoo articles')             # return a list of all the entries in the path

files = [file for file in files if file.endswith('.txt')]             # keep only the article text files; this skips
                                                                      # Jupyter's '.ipynb_checkpoints' folder and a
                                                                      # CSV export left over from an earlier run

frames = []                                                           # one single-row DataFrame per article
for file in files:
    file_df = pd.read_csv(os.path.join('yahoo articles', file),       # read from the text file
                          header=None,                                # the files have no header row
                          names=['title', 'source', 'time', 'text'],  # names of the columns
                          quoting=csv.QUOTE_NONE,                     # treat quote characters as ordinary text
                          sep='\t', encoding="utf-8")                 # the fields are separated by tabs
    file_df['path'] = os.path.join('yahoo articles', file)            # add another column with the path of the file
    frames.append(file_df)

df = pd.concat(frames, ignore_index=True)                             # stack the rows and renumber the index
                                                                      # (DataFrame.append was removed in pandas 2.0)
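
Reading the files back with `sep='\t'` and `csv.QUOTE_NONE` only works if no field contains a tab or a line break. A minimal sketch of a clean-up step that could be applied before writing each article is given below; the helper names `clean_field` and `to_tsv_line` are hypothetical and are not used elsewhere in this notebook.

```python
import re


def clean_field(value):
    """Collapse every run of whitespace (tabs, newlines, spaces) into one space."""
    return re.sub(r"\s+", " ", value).strip()


def to_tsv_line(title, source, published, text):
    """Join the four cleaned fields with tabs, giving exactly one record per line."""
    return "\t".join(clean_field(field) for field in (title, source, published, text))


line = to_tsv_line("A\ttitle", "Reuters", "Fri, May 28, 2021, 2:00 PM", "first\nsecond")
print(line.count("\t"))  # three separators for four fields
```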

In [11]:
df

Unnamed: 0,title,source,time,text,path
0,Lower global mobility will continue to weigh d...,Yahoo Finance,"Fri, May 28, 2021, 6:20 PM",A Barclays report released earlier this week f...,yahoo articles\0.txt
1,U.S. Looks at Travel Pass; NYC Hits Test Miles...,Bloomberg,"Fri, May 28, 2021, 6:16 PM",(Bloomberg) -- Secretary of Homeland Security ...,yahoo articles\1.txt
2,Yahoo U: When is an economy overheating?,Yahoo Finance Video,"Fri, May 28, 2021, 5:49 PM",Yahoo Finance’s Brian Cheung explains overheat...,yahoo articles\10.txt
3,WTO says goods trade rising at accelerated pace,Reuters,"Fri, May 28, 2021, 2:00 PM","GENEVA, May 28 (Reuters) - The World Trade Org...",yahoo articles\100.txt
4,"RPT-After Colonial attack, energy companies ru...",Reuters,"Fri, May 28, 2021, 2:00 PM",(Repeats for more subscribers.)By Laura Sanico...,yahoo articles\101.txt
...,...,...,...,...,...
181,U.S. says $11.6 billion NYC-area tunnel projec...,Reuters,"Fri, May 28, 2021, 2:07 PM",By David ShepardsonWASHINGTON (Reuters) -Two U...,yahoo articles\95.txt
182,Bitcoin Slumps as Traders Brace for a Volatile...,Bloomberg,"Fri, May 28, 2021, 2:05 PM","(Bloomberg) --Bitcoin slumped 7% to near $35,5...",yahoo articles\96.txt
183,Algerian medics fear new infections as borders...,Reuters,"Fri, May 28, 2021, 2:04 PM","By Lamine ChikhiALGIERS, May 28 (Reuters) - Al...",yahoo articles\97.txt
184,Ukraine says Belarus has imposed trade barrier...,Reuters,"Fri, May 28, 2021, 2:04 PM","KYIV, May 28 (Reuters) - Belarus, under global...",yahoo articles\98.txt
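
The `time` column holds strings such as "Fri, May 28, 2021, 6:20 PM". A small sketch of converting them to `datetime` objects follows. The format string is inferred from the rows shown above and assumes an English locale and full month names, which the May sample cannot confirm; the names `YAHOO_TIME_FORMAT` and `parse_publication_time` are hypothetical.

```python
from datetime import datetime

# Assumed format, inferred from the sample rows; "%B" expects the full month name
YAHOO_TIME_FORMAT = "%a, %B %d, %Y, %I:%M %p"


def parse_publication_time(value):
    """Parse a timestamp string like 'Fri, May 28, 2021, 6:20 PM'."""
    return datetime.strptime(value, YAHOO_TIME_FORMAT)


print(parse_publication_time("Fri, May 28, 2021, 6:20 PM"))
```

Under the same assumption, the whole column could be converted with `pd.to_datetime(df['time'], format=YAHOO_TIME_FORMAT)`.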


In [12]:
# Saving this DataFrame to reuse the data set

df.to_csv(os.path.join('yahoo articles', 'export_dataframe.csv'), index=False, header=True)

print(df)

                                                 title               source  \
0    Lower global mobility will continue to weigh d...        Yahoo Finance   
1    U.S. Looks at Travel Pass; NYC Hits Test Miles...            Bloomberg   
2             Yahoo U: When is an economy overheating?  Yahoo Finance Video   
3      WTO says goods trade rising at accelerated pace              Reuters   
4    RPT-After Colonial attack, energy companies ru...              Reuters   
..                                                 ...                  ...   
181  U.S. says $11.6 billion NYC-area tunnel projec...              Reuters   
182  Bitcoin Slumps as Traders Brace for a Volatile...            Bloomberg   
183  Algerian medics fear new infections as borders...              Reuters   
184  Ukraine says Belarus has imposed trade barrier...              Reuters   
185  Lira Falls to Record as Local FX Demand Adds t...            Bloomberg   

                           time  \
0    Fri, May 28