# Homework exercise 2
## Deadline: upload to Moodle by 31 May 18:00 h

__Please submit your homework either as a Jupyter Notebook or using .py files.__

If you use .py files, please also include a PDF containing the output of your code and your explanations. Either way, the code needs to be in a form that can be easily run on another computer.

__Name 1: Lisa Maria Egger 01503839__

__Name 2: Josef Wieser 01309182__

__Name 3: Wögerer David 01453054__


The file that you upload should be named *Homework1_YourLastName_YourStudentID*.

Reminder: you are required to attend class on 1 June to earn points for this homework exercise unless you have a valid reason for your absence.

You are encouraged to work on this exercise in teams of up to three students. If any part of the questions is unclear, please ask on the Moodle forum.

#### Selenium


__Yahoo Finance News Crawling__

Your task is to collect and organize articles from Yahoo Finance News, available at https://finance.yahoo.com/news

This will require you to use Selenium for at least two reasons:

* The site initially loads only partially, so scrolling is necessary to access additional news articles.
* Long news articles are initially shown (and downloadable) only in an abbreviated form. Browser navigation is necessary to click a button that displays the whole article.

There are about 200 news articles available at a time, stemming from various sources (e.g. Bloomberg, Reuters), and covering different topic areas (e.g. Business, World).

Note that you will need to use some features of Selenium - in particular scrolling - that are not discussed in the class notebook but documented elsewhere. 
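
One common way to scroll with Selenium, sketched here as an assumption rather than as the method expected in class, is to run a small JavaScript snippet through `driver.execute_script` and to repeat this until the page height stops growing. The helper name `scroll_to_bottom` and its parameters `pause` and `max_rounds` are hypothetical; `driver` stands for any Selenium WebDriver that has already loaded the page.

```python
import time


def scroll_to_bottom(driver, pause=0.5, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        # Jump to the current bottom of the page to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the site time to append further articles
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_number  # nothing new was loaded, so we stop
        last_height = new_height
    return max_rounds  # safety limit reached while the page was still growing


# Hypothetical usage, assuming a driver that has already opened the news page:
# scroll_to_bottom(driver, pause=1.0)
```

Stopping once the height no longer changes avoids guessing a fixed number of scrolls, although a slow connection may need a longer `pause`.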

You are asked to

* Download all of the roughly 200 news articles available from Yahoo Finance News
* Exclude the advertisements listed in between the news articles
* Try to remove everything that is not part of the article itself (e.g. the sidebar containing a list of popular articles and other content) and save the text of each article in a text file
* Create a DataFrame containing the following information (to the extent it is available) for each article
    * Title
    * Source (e.g. Bloomberg)
    * Time and date of publication
    * Name and path of the file containing the article's text that you saved
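
For the last item, the file name can be derived from the article URL itself. The following is a minimal sketch using only the standard library; the function name `article_file_path`, the `articles` directory and the example URL are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def article_file_path(url, directory="."):
    """Build a .txt path from the last URL segment, ignoring any query string."""
    # urlparse separates the path from the query, so '?.tsrc=...' is dropped here.
    url_path = urlparse(url).path
    # .stem removes the final extension: 'example-article-123.html' -> 'example-article-123'
    slug = PurePosixPath(url_path).stem
    return Path(directory) / f"{slug}.txt"


# Hypothetical article URL, used for illustration only.
example = article_file_path(
    "https://finance.yahoo.com/news/example-article-123.html?.tsrc=fin-srch",
    "articles",
)
```

Building the path with `pathlib` instead of joining strings with `\\` keeps the result valid on Windows, macOS and Linux.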
    
Please save this DataFrame so that you can later reuse the data set you created.
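
One possible way to save and reload such a DataFrame is a CSV file. This is only a sketch: the file name `yahoo_finance_news.csv` and the single example row are hypothetical, and other formats (e.g. pickle) would work as well.

```python
import pandas as pd

# Hypothetical one-row data set with the four requested columns.
df = pd.DataFrame({
    "Title": ["Example headline"],
    "Source": ["Example source"],
    "Time & Date": ["2021-05-31T07:01:31.000Z"],
    "Name & Path": ["articles/example-headline.txt"],
})

# index=False keeps the row index out of the file, so reloading
# does not create an extra 'Unnamed: 0' column.
df.to_csv("yahoo_finance_news.csv", index=False, encoding="utf-8")

# Later, e.g. in another notebook, the data set can be read back.
reloaded = pd.read_csv("yahoo_finance_news.csv")
```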

In [10]:
import re
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

In [11]:
## 1. Press the "End" key repeatedly until the whole page is loaded, then collect the news links

url = 'https://finance.yahoo.com/news'

options = Options()
options.add_argument('-headless') ## run Firefox without a visible window
driver = webdriver.Firefox(options = options)

driver.maximize_window()
driver.implicitly_wait(1)

driver.get(url)
driver.find_element(By.TAG_NAME, 'button').click() ## click the first button, which accepts all cookies

actions = ActionChains(driver)
for _ in range(25): ## found by trial and error: 25 presses are sufficient to load the whole site
    actions.send_keys(Keys.END)
    actions.pause(.5) ## short pause so the site can load after each press
actions.perform()

find_tag_link = driver.find_elements(By.TAG_NAME, 'a')
links_set = set([link.get_attribute('href') for link in find_tag_link]) ## using a set so we don't have duplicates
news_links = re.findall(r'https://finance\.yahoo\.com/news/[\w/!?.-]*', str(links_set))

#print(news_links)
print('Number of articles found:', len(news_links))

#driver.close()
## not closing the driver since we need it in the following steps

Number of articles found: 210
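
The regex above works on the string representation of the whole set. As an alternative sketch, the links could be filtered one by one with `urllib.parse`, which also merges links that differ only in their query string (e.g. `...html` and `...html?.tsrc=...`). The names `NEWS_PREFIX` and `clean_news_links` are hypothetical, and the number of links found this way may differ from the count printed above.

```python
from urllib.parse import urlsplit, urlunsplit

NEWS_PREFIX = "https://finance.yahoo.com/news/"


def clean_news_links(hrefs):
    """Keep article links only, drop query strings and fragments, remove duplicates."""
    cleaned = []
    seen = set()
    for href in hrefs:
        if not href or not href.startswith(NEWS_PREFIX):
            continue  # skip missing hrefs and links outside the news section
        parts = urlsplit(href)
        # Rebuild the URL without query string and fragment.
        url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if url != NEWS_PREFIX and url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


# Hypothetical usage with the anchor elements collected by Selenium:
# news_links = clean_news_links(link.get_attribute('href') for link in find_tag_link)
```

Unlike a set, the returned list keeps the order in which the links appear on the page.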


In [12]:
## 2. Extract specific information from each article

from bs4 import BeautifulSoup
import pandas as pd
import os

## lists which we will later merge into a dataframe
title = []
source = []
timedate = []
namepath = []

# get path of our working directory
path = os.getcwd()

for link in news_links[:20]: ## the loop runs only over the first 20 links, which is enough to show that the code works
    driver.get(link)
    #driver.find_element_by_tag_name('button').click()
    ## accept cookies and click button to show the whole article --> not necessary in our case

    content = driver.page_source
    soup = BeautifulSoup(content, 'lxml')

    # append the title to the title list
    title_text = soup.find('h1', {"data-test-locator" : "headline"}).get_text()
    title.append(title_text)
    # grab the source from the logo image in the top left corner
    source.append(soup.find('img', {'class' : 'caas-img caas-loaded'})['alt'])

    # # using the date string
    # timedate.append(soup.find('time').string)
    # ##### OR use the machine-readable timestamp
    timedate.append(soup.find('time')['datetime'])

    # a CSS selector in Selenium would also be possible, but the text is interrupted by a div at one point,
    # so find_elements_by_tag_name only grabs the p-text up to the first interruption
    # therefore BeautifulSoup is used
    text_parent_tag = soup.find('div', {'class': 'caas-body'})

    # some articles have sub headlines, therefore we are also looking for h2 and h3
    plain_text = text_parent_tag.find_all(['p', 'h2', 'h3'])
    # title first, then one paragraph or sub headline per line
    article_text = '\n'.join([title_text] + [tag.get_text() for tag in plain_text])

    # get the file name from the url, use grouping in regex to get rid of the .html
    # the pattern also handles links of the form .../blabla.html?.tsrc
    file_name = re.search(r'([\w\-]*)\.*[a-zA-Z?]*\.[a-zA-Z]*$', str(link))

    # drop the .html and keep only the name by using the first capture group
    txt_name = file_name.group(1) + '.txt'
    with open(txt_name, 'w', encoding='utf-8') as doc: ## the with block closes the file, so the text is reliably written to disk
        doc.write(article_text)

    # name and path in one go: the name is the last part of the path; the article is saved in the current working directory
    namepath.append(os.path.join(path, txt_name))

driver.quit()

## Save the lists into a new dataframe
df = pd.DataFrame({'Title': title, 'Source': source, 'Time & Date': timedate, 'Name & Path': namepath})
df

,Title,Source,Time & Date,Name & Path
0,Ford Foundation president: ‘We need a new form...,Yahoo Finance,2021-05-28T20:44:33.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
1,Stock market news live updates: Stocks turn mi...,Yahoo Finance,2021-05-21T20:03:50.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
2,Goldman Sachs to double property investments i...,Reuters,2021-05-31T07:01:31.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
3,Iran Transfers Oil From Pipeline Skirting Trou...,Bloomberg,2021-05-30T15:15:14.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
4,Corporations donated millions of dollars to Re...,Yahoo Finance,2021-05-19T18:35:33.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
5,Oil Climbs Toward $67 With Market Focused on O...,Bloomberg,2021-05-31T05:29:32.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
6,UPDATE 1-U.S. drawing up targeted sanctions on...,Reuters,2021-05-29T01:58:35.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
7,Australian Copper Miner 29Metals Launches $471...,Bloomberg,2021-05-31T02:37:29.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
8,"UPDATE 1-Mexico accuses Zara, Anthropologie & ...",Reuters,2021-05-31T02:19:24.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
9,"Stocks, Futures Steady in Asia With Data in Fo...",Bloomberg,2021-05-31T06:09:40.000Z,C:\Users\david\Desktop\Uni\Python for Finance ...
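
The values in the `Time & Date` column are ISO 8601 strings whose trailing `Z` denotes UTC. A minimal sketch for turning them into `datetime` objects, e.g. to sort the articles by publication time, could look like this. The function name `parse_published` is hypothetical, and the format string assumes that every timestamp has exactly the shape shown in the table.

```python
from datetime import datetime, timezone


def parse_published(stamp):
    """Convert a string like '2021-05-31T07:01:31.000Z' to a timezone-aware UTC datetime."""
    # %f reads the fractional seconds; the literal Z is matched as plain text.
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    # strptime returns a naive datetime here, so the UTC zone is attached explicitly.
    return parsed.replace(tzinfo=timezone.utc)


# Two timestamps taken from the table above, sorted from oldest to newest.
stamps = ["2021-05-31T07:01:31.000Z", "2021-05-21T20:03:50.000Z"]
ordered = sorted(stamps, key=parse_published)
```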
