# Week 3: Webscraping
Today we will scrape multiple pages, try out Selenium, and pull headlines from Google News. (Make sure you have installed selenium and pygooglenews.)

Let's look at the website first. <a href="http://quotes.toscrape.com/"> Quotes! </a>

In [1]:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

quote_list = []
author_list = []

for i in range(1, 11): # from 1 to 10
    url = f'http://quotes.toscrape.com/page/{i}/'
    response = requests.get(url)
    html = response.text
    soup = bs(html, 'html.parser')

    quotes = soup.find_all('span', itemprop='text')
    authors = soup.find_all('small', class_='author')

    for quote in quotes:
        quote_list.append(quote.get_text())

    for author in authors:
        author_list.append(author.get_text())

# Build the DataFrame and save it once, after all pages have been scraped
df1 = pd.DataFrame(data=quote_list, columns=['Quotes'])
df2 = pd.DataFrame(data=author_list, columns=['Authors'])
quotes_df = pd.concat([df1, df2], axis=1)
quotes_df.to_excel('quotes_week2.xlsx')
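
The two find_all calls above collect quotes and authors in separate lists, which stay aligned only if every quote has exactly one author element. A minimal sketch of an alternative, assuming each quote sits in its own div with class "quote" (the HTML below is a made-up, simplified snippet, not the real page, and parse_quotes is a hypothetical helper name), reads both fields from the same container:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified HTML in the assumed shape of one results page
SAMPLE_HTML = """
<div class="quote">
  <span class="text" itemprop="text">First quote.</span>
  <span>by <small class="author" itemprop="author">Author A</small></span>
</div>
<div class="quote">
  <span class="text" itemprop="text">Second quote.</span>
  <span>by <small class="author" itemprop="author">Author B</small></span>
</div>
"""


def parse_quotes(html):
    """Return one dict per quote container so text and author cannot drift apart."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for block in soup.find_all('div', class_='quote'):
        text = block.find('span', itemprop='text')
        author = block.find('small', class_='author')
        rows.append({
            'Quotes': text.get_text(strip=True) if text else None,
            'Authors': author.get_text(strip=True) if author else None,
        })
    return rows


rows = parse_quotes(SAMPLE_HTML)
print(rows)
```

A list of dicts like this can be passed straight to pd.DataFrame(rows), so no concat step is needed.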

# Selenium!

In [2]:
from selenium import webdriver
from selenium.webdriver.common.by import By

In [3]:
driver = webdriver.Chrome()
driver.get('https://www.youtube.com/c/brettconti/videos')
# time.sleep(5)  # a fixed pause: always waits the full 5 seconds while the page loads
driver.implicitly_wait(10)  # waits up to 10 seconds when looking up elements, but continues as soon as they are found (unlike time.sleep())
videos = driver.find_elements(By.CLASS_NAME, "style-scope ytd-grid-video-renderer")
# use a class name that occurs multiple times together with find_elements to get a list back

video_list = []
for video in videos:
    title = video.find_element(By.XPATH, './/*[@id="video-title"]').text  # the leading dot makes the XPath relative to this video element
    views = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[1]').text
    posted = video.find_element(By.XPATH, './/*[@id="metadata-line"]/span[2]').text
    video_list.append({'title': title, 'views': views, 'posted': posted})

df = pd.DataFrame(video_list)

In [4]:
driver.quit()  # closes every browser window and ends the WebDriver session (driver.close() only closes the current tab) -> always do this!

# Google News!

In [5]:
from pygooglenews import GoogleNews
gn = GoogleNews()

In [6]:
top = gn.top_news()

In [7]:
business = gn.topic_headlines('business')

In [8]:
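# Note: this pre-encoded string appears to be URL-encoded a second time (see %2522 in the feed link below), so the literal text is what gets searched
search = gn.search('%22New+York%22')  # optional: when='6m'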

In [9]:
print(search.keys())

dict_keys(['feed', 'entries'])


In [10]:
print(search['feed'])

{'generator_detail': {'name': 'NFE/5.0'}, 'generator': 'NFE/5.0', 'title': '"%22New+York%22" - Google News', 'title_detail': {'type': 'text/plain', 'language': None, 'base': '', 'value': '"%22New+York%22" - Google News'}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://news.google.com/search?q=%2522New%2BYork%2522&ceid=US:en&hl=en-US&gl=US'}], 'link': 'https://news.google.com/search?q=%2522New%2BYork%2522&ceid=US:en&hl=en-US&gl=US', 'language': 'en-US', 'publisher': 'news-webmaster@google.com', 'publisher_detail': {'email': 'news-webmaster@google.com'}, 'rights': '2022 Google Inc.', 'rights_detail': {'type': 'text/plain', 'language': None, 'base': '', 'value': '2022 Google Inc.'}, 'updated': 'Mon, 17 Oct 2022 09:15:30 GMT', 'updated_parsed': time.struct_time(tm_year=2022, tm_mon=10, tm_mday=17, tm_hour=9, tm_min=15, tm_sec=30, tm_wday=0, tm_yday=290, tm_isdst=0), 'subtitle': 'Google News', 'subtitle_detail': {'type': 'text/html', 'language': None, 'base': '', 'valu
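
The feed link above contains q=%2522New%2BYork%2522, which suggests the already-encoded query was URL-encoded once more before the request was sent. A small sketch with the standard library's quote_plus reproduces that string and shows what a plain quoted phrase encodes to (whether pygooglenews uses exactly this function internally is an assumption here):

```python
from urllib.parse import quote_plus

raw_query = '"New York"'          # plain text with literal double quotes
pre_encoded = '%22New+York%22'    # the string passed to gn.search() above

print(quote_plus(raw_query))      # %22New+York%22
print(quote_plus(pre_encoded))    # %2522New%2BYork%2522
```

This would also explain why several of the headlines contain "22 New" rather than "New York": the search ran on the literal characters of the encoded string.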

In [11]:
print(search['entries'])

[{'title': "The eighth subway murder this year shows NYC's public safety in deep decline - New York Post", 'title_detail': {'type': 'text/plain', 'language': None, 'base': '', 'value': "The eighth subway murder this year shows NYC's public safety in deep decline - New York Post"}, 'links': [{'rel': 'alternate', 'type': 'text/html', 'href': 'https://nypost.com/2022/10/16/the-eighth-subway-murder-this-year-shows-nycs-public-safety-in-deep-decline/'}], 'link': 'https://nypost.com/2022/10/16/the-eighth-subway-murder-this-year-shows-nycs-public-safety-in-deep-decline/', 'id': 'CAIiEFP2X9xSuHHJNL81OjEt_c4qGAgEKg8IACoHCAowhK-LAjD4ySww7NG0BQ', 'guidislink': False, 'published': 'Sun, 16 Oct 2022 23:07:00 GMT', 'published_parsed': time.struct_time(tm_year=2022, tm_mon=10, tm_mday=16, tm_hour=23, tm_min=7, tm_sec=0, tm_wday=6, tm_yday=289, tm_isdst=0), 'summary': '<a href="https://nypost.com/2022/10/16/the-eighth-subway-murder-this-year-shows-nycs-public-safety-in-deep-decline/" target="_blank">T

In [12]:
for item in search['entries']:
    print(item['title'])

The eighth subway murder this year shows NYC's public safety in deep decline - New York Post
New York City’s incredible shrinking starter home market - The Real Deal
Where the best fall leaves will be in New York this year - WXXI News
Miami Dolphins News 10/10/22: Jets Dominate Dolphins 40-17 - The Phinsider
Russia's Alrosa Discovers 22 New Diamond Deposits in Zimbabwe - Bloomberg
PODCAST: Giants are winning games, but is it sustainable? - Giants Wire
Judging by the Cover – 09/28/22 new releases • AIPT - AIPT
New York Yankees vs. Boston Red Sox Live Stream, TV Channel, Start Time - Sports Illustrated
Photos from the 'Terrifier 2' Red Carpet Premiere in New York! - Bloody Disgusting
New York City opening emergency centres for asylum seekers - Al Jazeera English
Over $1.7 million allocated from NYS Budget to upgrade some CNY libraries - The New York State Senate
Rookie Camp Roundup, Analysis & Observations - New York Islanders Hockey Now - New York Hockey Now
IMB trustees appoint 22 new 

In [13]:
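def get_title(query):
    """Search Google News for `query` and return a list of {'title', 'link'} dicts."""
    stories = []
    results = gn.search(query)  # separate name, so the argument is not overwritten
    for item in results['entries']:
        stories.append({'title': item.title, 'link': item.link})
    return stories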
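
The function above mixes the network call with the field extraction. A minimal sketch that separates the extraction step, so it can be tried on plain dictionaries without calling Google News, could look like this. entries_to_rows is a hypothetical helper, the sample entries are made up, and the field names title, link and published are the ones visible in the printed entry above.

```python
def entries_to_rows(entries, fields=('title', 'link', 'published')):
    """Keep only the chosen fields of each entry; missing fields become None."""
    return [{field: entry.get(field) for field in fields} for entry in entries]


# Made-up entries in the same dict-like shape as search['entries']
sample_entries = [
    {'title': 'Example headline - Example Source',
     'link': 'https://example.com/a',
     'published': 'Sun, 16 Oct 2022 23:07:00 GMT'},
    {'title': 'Another headline - Other Source',
     'link': 'https://example.com/b'},  # no 'published' field
]

rows = entries_to_rows(sample_entries)
print(rows)
```

With real results the call would be entries_to_rows(search['entries']), and the returned list can be passed to pd.DataFrame as in the cells that follow.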

In [14]:
a = get_title('weather')

In [15]:
weather = pd.DataFrame(a)

In [16]:
weather

| | title | link |
|---|---|---|
| 0 | Sunday Weather Briefing: Much Colder Air Ahead... | https://www.alabamawx.com/?p=247372 |
| 1 | Where Can I Find A Warm-Weather Vacation Near ... | https://www.forbes.com/sites/christopherelliot... |
| 2 | Winter Weather Alerts in effect in advance of ... | https://www.foxweather.com/weather-news/octobe... |
| 3 | Northeast Ohio Monday weather forecast: Colder... | https://www.cleveland.com/weather/2022/10/nort... |
| 4 | Cooler weather ahead for WNY - 13WHAM-TV | https://13wham.com/news/local/cooler-weather-a... |
| ... | ... | ... |
| 94 | City of Aiken Severe Weather Preparation Tips ... | https://www.cityofaikensc.gov/city-of-aiken-se... |
| 95 | Lehigh Valley weather: Thunderstorms arrive Th... | https://www.lehighvalleylive.com/weather/2022/... |
| 96 | Orlando weather forecast: Beautiful, dry, slig... | https://www.fox35orlando.com/news/orlando-weat... |
| 97 | Burlington NC Weather and Radar - WGHP FOX8 Gr... | https://myfox8.com/weather/burlington/ |


In [17]:
weather['Title'] = weather['title'].str.split(' - ', expand=True)[0]

In [18]:
weather['Source'] = weather['title'].str.split(' - ', expand=True)[1]
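
Taking parts [0] and [1] of a split on ' - ' assumes the separator appears exactly once per title. When the headline itself contains ' - ', part [1] is no longer the source. A sketch of one way around this, splitting only on the last ' - ' with str.rsplit and n=1 (the two sample titles are taken from the outputs above):

```python
import pandas as pd

titles = pd.Series([
    'Cooler weather ahead for WNY - 13WHAM-TV',
    'Rookie Camp Roundup, Analysis & Observations - New York Islanders Hockey Now - New York Hockey Now',
])

# n=1 with rsplit splits once, starting from the right-hand side
parts = titles.str.rsplit(' - ', n=1, expand=True)
parts.columns = ['Title', 'Source']
print(parts)
```

Here the second row keeps "Rookie Camp Roundup, Analysis & Observations - New York Islanders Hockey Now" as the title and "New York Hockey Now" as the source.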

In [19]:
weather.drop(['title'], axis=1)  # returns a new DataFrame for display; weather itself still has the 'title' column

| | link | Title | Source |
|---|---|---|---|
| 0 | https://www.alabamawx.com/?p=247372 | Sunday Weather Briefing: Much Colder Air Ahead | alabamawx.com |
| 1 | https://www.forbes.com/sites/christopherelliot... | Where Can I Find A Warm-Weather Vacation Near Me? | Forbes |
| 2 | https://www.foxweather.com/weather-news/octobe... | Winter Weather Alerts in effect in advance of ... | Fox Weather |
| 3 | https://www.cleveland.com/weather/2022/10/nort... | Northeast Ohio Monday weather forecast: Colder... | cleveland.com |
| 4 | https://13wham.com/news/local/cooler-weather-a... | Cooler weather ahead for WNY | 13WHAM-TV |
| ... | ... | ... | ... |
| 94 | https://www.cityofaikensc.gov/city-of-aiken-se... | City of Aiken Severe Weather Preparation Tips | City of Aiken, SC (.gov) |
| 95 | https://www.lehighvalleylive.com/weather/2022/... | Lehigh Valley weather: Thunderstorms arrive Th... | lehighvalleylive.com |
| 96 | https://www.fox35orlando.com/news/orlando-weat... | Orlando weather forecast: Beautiful, dry, slig... | FOX 35 Orlando |
| 97 | https://myfox8.com/weather/burlington/ | Burlington NC Weather and Radar | WGHP FOX8 Greensboro |


In [20]:
weather.rename(columns={'link': 'Link'}, inplace=True)

Let's reorder the columns.

In [21]:
weather = weather[['Title', 'Source', 'Link']]
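
The drop, rename and reorder steps can also be written as a single chain that leaves the original DataFrame untouched. This is only a sketch on a one-row stand-in for the weather table; the link value is a placeholder and the name tidy is chosen here for illustration.

```python
import pandas as pd

# One-row stand-in for the weather DataFrame after the Title/Source split
weather = pd.DataFrame({
    'title': ['Cooler weather ahead for WNY - 13WHAM-TV'],
    'link': ['https://example.com/story'],  # placeholder URL
    'Title': ['Cooler weather ahead for WNY'],
    'Source': ['13WHAM-TV'],
})

tidy = (
    weather
    .drop(columns=['title'])            # returns a copy without the raw title
    .rename(columns={'link': 'Link'})   # returns a copy with the renamed column
    [['Title', 'Source', 'Link']]       # select the columns in the desired order
)
print(tidy)
```

Because each step returns a new DataFrame, nothing changes unless the result is assigned to a name, which is why the chain is wrapped in one assignment.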

In [22]:
weather.to_excel('weather.xlsx')