# Homework exercise 2
## Deadline: upload to Moodle by 31 May 18:00 h

__Please submit your homework either as a Jupyter Notebook or using .py files.__

If you use .py files, please also include a PDF containing the output of your code and your explanations. Either way, the code needs to be in a form that can be easily run on another computer.

__Name 1:__ Sebastian Ertner

__Name 2:__ Sebastian Krendl

__Name 3:__ Dushan Trajkovski


The name of the file that you upload should be named *Homework1_YourLastName_YourStudentID*.

Reminder: you are required to attend class on 1 June to earn points for this homework exercise unless you have a valid reason for your absence.

You are encouraged to work on this exercise in teams of up to three students. If any part of the questions is unclear, please ask on the Moodle forum.

#### Selenium


__Yahoo Finance News Crawling__

Your task is to collect and organize articles from Yahoo Finance News, available at https://finance.yahoo.com/news

This will require you to use Selenium for at least two reasons:

* The site initially loads only partially such that scrolling is necessary to access additional news articles
* The news articles themselves, if they are long, are initially shown (and downloadable) in an abbreviated form. Browser navigation is necessary to click on a button such that the whole article is displayed.

There are about 200 news articles available at a time, stemming from various sources (e.g. Bloomberg, Reuters), and covering different topic areas (e.g. Business, World).

Note that you will need to use some features of Selenium - in particular scrolling - that are not discussed in the class notebook but documented elsewhere. 

You are asked to

* Download all of the about 200 news articles available from Yahoo Finance News
* Exclude the advertisements listed in between the news articles
* Try and remove everything that is not part of the article itself (e.g. the sidebar containing a list of popular articles and other content) and save the text of each article in a text file
* Create a DataFrame containing the following information (to the extent it is available) for each article
    * Title
    * Source (e.g. Bloomberg)
    * Time and date of publication
    * Name and path of the file containing the article's text that you saved
    
Please save this DataFrame so that you can later reuse the data set you created.



## Solution

### Imports and setup

In [1]:
from bs4 import BeautifulSoup
import os
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

Create directory to save article text

In [2]:
dir_save="yahoofinance"
try:
    os.mkdir(dir_save)
    print("directory created")
except:
    print("directory exists")

directory exists


Start the webdriver

In [3]:
main_url = "https://finance.yahoo.com/news"
driver = webdriver.Chrome()
driver.get(main_url)
driver

<selenium.webdriver.chrome.webdriver.WebDriver (session="462c8560c208ea4d88c602298f2dfe51")>

Accept cookie policy if necessary

In [4]:
try:
    cookie_acc_btn = driver.find_element_by_name("agree")
    cookie_acc_btn.click()
except:
    pass

Now we are on the main page.  
The next step is to make sure the page is fully loaded.  
We do this by scrolling down to the end of the page.  
We continually send the END key until the number of articles does not change any more.

Articles currently loaded on the page

In [5]:
main_column = driver.find_element(By.ID, 'Main')
articles = main_column.find_elements(By.TAG_NAME, 'li')
len(articles)

12

Scroll down until the number of articles remains unchanged

In [6]:
wait_for_page = 1.5  # this number must be increased if the page loads too slowly
art_num = 0
while True:
    driver.find_element_by_tag_name("body").send_keys(Keys.END)
    time.sleep(wait_for_page)
    articles = main_column.find_elements(By.TAG_NAME, 'li')
    print(len(articles), end=" ")
    if len(articles) > art_num:
        art_num = len(articles)
    else:
        break

48 58 68 78 88 98 108 118 128 138 148 158 168 178 178 

Filter out articles which are ads

In [7]:
articles_no_ads = []
for article in articles:
    article_class = article.find_element(By.TAG_NAME, 'div').get_attribute('class')
    if not re.search(r'gemini-ad|native-ad', article_class):
        articles_no_ads.append(article)
len(articles_no_ads)

170

Take a look at some articles

In [8]:
for article in articles_no_ads[:3]:
    print(article.text[:100]+"\n")

BusinessBloomberg•7 minutes ago
Match, IAC Covered Up Sexual Misconduct, Legal Filing Says
(Bloomber

BusinessBloomberg•11 minutes ago
Atlantia Backs Landmark Sale of Autostrade to Italy State Lender
(B

WorldReuters•38 minutes ago
Italy reports 82 coronavirus deaths on Monday, 1,820 new cases
Italy rep




### Test the procedure for the first article

Open article in new tab  
https://stackoverflow.com/questions/27775759/send-keys-control-click-in-selenium-with-python-bindings

In [9]:
ActionChains(driver).key_down(Keys.CONTROL).click(articles_no_ads[0]).key_up(Keys.CONTROL).perform()

Switch to new tab  
https://stackoverflow.com/questions/28715942/how-do-i-switch-to-the-active-tab-in-selenium

In [10]:
driver.switch_to.window(driver.window_handles[1])
driver.current_url

'https://finance.yahoo.com/news/match-iac-covered-sexual-misconduct-182726610.html'

Press story continues button if it exists

In [11]:
try:
    driver.find_element_by_class_name("link.rapid-noclick-resp.caas-button.collapse-button").click()
except:
    print("Do Nothing")

Create beautifulSoup object from article source

In [12]:
soup = BeautifulSoup(driver.page_source, 'html.parser')

Get article data

In [13]:
title = soup.h1.text
title

'Match, IAC Covered Up Sexual Misconduct, Legal Filing Says'

In [14]:
source = soup.find('span', class_='caas-attr-provider').text
source

'Bloomberg'

In [15]:
dt = soup.find('div', class_='caas-attr-time-style').time['datetime']
dt

'2021-05-31T15:38:02.000Z'

In [16]:
art_text = soup.find('div', class_='caas-body')
art_text.prettify()[:100]

'<div class="caas-body">\n <p>\n  (Bloomberg) -- IAC/InterActiveCorp. and Match Group Inc. are being ac'

Close tab  
https://stackoverflow.com/questions/25951968/open-and-close-new-tab-with-selenium-webdriver-in-os-x

In [17]:
driver.close()

And switch back to main page

In [18]:
driver.switch_to.window(driver.window_handles[0])
driver.current_url

'https://finance.yahoo.com/news?guccounter=1'

### Full loop and collect data in files

In [23]:
titles = []
sources = []
dts = []
filenames = []
filepaths = []
cur_dir = os.path.join(os.getcwd(), '').replace(os.sep, '/')  # current directory for full path of text file 

for i in range(len(articles_no_ads)):
    # open article in new tab
    ActionChains(driver).key_down(Keys.CONTROL).click(articles_no_ads[i]).key_up(Keys.CONTROL).perform()
    
    # switch to new tab
    driver.switch_to.window(driver.window_handles[1])
    
    # press 'Story Continues' button if it exists
    try:
        driver.find_element_by_class_name("link.rapid-noclick-resp.caas-button.collapse-button").click()
        cont_clicked = True
    except:
        cont_clicked = False
        
    # create BS object from page source
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    # collect info from article
    titles.append(soup.h1.text)
    sources.append(soup.find('span', class_='caas-attr-provider').text)
    dts.append(soup.find('div', class_='caas-attr-time-style').time['datetime'])  # datetime in ISO 8601 format
    filename = f"article_{i+1:03d}.txt"
    filenames.append(filename)
    filepaths.append(cur_dir+filename)
    art_text = soup.find('div', class_='caas-body')
    
    # remove story continues button if necessary
    if cont_clicked:
        art_text.find('button').decompose()
    
    # save article text to file (utf-8 encoding to prevent errors)
    with open("./yahoofinance/"+filename, "w", encoding='utf-8') as f:
        for string in art_text.stripped_strings:
            f.write(string+'\n')
    
    # close tab and return to main page
    driver.close()
    driver.switch_to.window(driver.window_handles[0])
    
# create and save dataframe with article meta data
df = pd.DataFrame({'title': titles, 'source': sources, 'datetime': dts,
                   'filename': filenames, 'filepath': filepaths})
df.to_csv('./yahoofinance/_index.csv', index=False)  # save as csv in utf-8 encoding

Take a look at the dataframe

In [24]:
df.head()

Unnamed: 0,title,source,datetime,filename,filepath
0,"Match, IAC Covered Up Sexual Misconduct, Legal...",Bloomberg,2021-05-31T15:38:02.000Z,article_001.txt,C:/Users/Ert/Documents/GitHub/pff2_hw2/article...
1,Atlantia Backs Landmark Sale of Autostrade to ...,Bloomberg,2021-05-31T15:34:27.000Z,article_002.txt,C:/Users/Ert/Documents/GitHub/pff2_hw2/article...
2,"Italy reports 82 coronavirus deaths on Monday,...",Reuters,2021-05-31T15:07:02.000Z,article_003.txt,C:/Users/Ert/Documents/GitHub/pff2_hw2/article...
3,Norway's wealth fund unlikely to reproduce sam...,Reuters,2021-05-31T15:05:50.000Z,article_004.txt,C:/Users/Ert/Documents/GitHub/pff2_hw2/article...
4,Retailers and unions agree on 3-month extensio...,Reuters,2021-05-31T14:50:11.000Z,article_005.txt,C:/Users/Ert/Documents/GitHub/pff2_hw2/article...
