# Web Scraping from Scratch with Selenium & AWS

## Objective
1. Scrape the top 10 trending videos on YouTube using Selenium
2. Set up a recurring job on AWS Lambda to scrape every 30 minutes
3. Send the result as a CSV attachment over email (or to a spreadsheet)

### Step 1 - Create a GitHub repository
* Create a repository at https://github.com/new
* Add README, gitignore (Python) and license 
* (Optional) Clone the repository locally
* References:
    * Introduction to GitHub: https://lab.github.com/githubtraining... 
    * Git & GitHub tutorial: Git and GitHub fo...


### Step 2 - Launch the repository on Replit
* Connect Replit with your GitHub account
* Launch the repository as a Replit project
* Set up the language and run command
* Create and execute a Python script
* Attempt to scrape the page using requests & Beautiful Soup
* References:
    * Introduction to Replit: https://docs.replit.com/tutorials/01-... 
    * Replit + GitHub: https://docs.replit.com/tutorials/06-... 
    * YouTube trending feed: https://www.youtube.com/feed/trending 
    * Beautiful Soup tutorial: https://blog.jovian.ai/web-scraping-u... 


### Step 3 - Extract information using Selenium
* Install Selenium and create a browser driver
* Load the page and extract information
* Create a CSV of results using Pandas
* References:
    * Selenium tutorial: https://www.browserstack.com/guide/py...
    * Pandas tutorial: https://jovian.ai/learn/data-analysis...


### Step 4 - Set up a recurring job on AWS Lambda
* Create an AWS Lambda Python function
* Deploy a sample script and observe the output
* Add layers for Selenium and Chromium
* Set up a recurring job using AWS CloudWatch
* References:
    * Python on AWS Lambda tutorial: https://stackify.com/aws-lambda-with-... 
    * Chromium & Selenium on AWS Lambda: https://dev.to/awscommunity-asean/cre...
    * Recurring AWS Lambda functions: https://docs.aws.amazon.com/lambda/la... 
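
The steps above can be sketched as a small handler. This is a minimal sketch, not the deployed function: it assumes the default handler name `lambda_handler`, and `scrape_trending` is a hypothetical placeholder that returns fixed sample data so the handler can run without Selenium or Chromium layers.

```python
import json


def scrape_trending():
    """Hypothetical placeholder for the Selenium scraping logic.

    Returns fixed sample data so the handler can be exercised without a
    browser; a real deployment would call the scraper here instead.
    """
    return [{"title": "Sample video", "url": "https://example.com/watch"}]


def lambda_handler(event, context):
    """Entry point that AWS Lambda invokes on every (scheduled) run."""
    videos = scrape_trending()
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(videos), "videos": videos}),
    }
```

For the 30-minute schedule, a rule with the schedule expression `rate(30 minutes)` should be enough to trigger the function repeatedly.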

### Step 5 - Send results over email using SMTP
* Create an email client using smtplib
* Set up SSL/TLS and authenticate with a password
* Send a sample email with just text
* Send an email with text and attachment
* References:
    * Sending Email with Python: https://stackabuse.com/how-to-send-em...
    * Send email using Python: https://www.geeksforgeeks.org/send-ma...
    * Environment variables on Replit: https://docs.replit.com/programming-i...
    * https://docs.aws.amazon.com/lambda/la... 
    * Update Google sheets using Python: https://www.analyticsvidhya.com/blog/...
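
The email steps above could look roughly like the sketch below, which uses only the standard library. The function names `build_message` and `send_message`, and every address, host and credential passed to them, are assumptions for illustration; building the message is kept separate from sending it so the attachment logic can be checked without a mail server.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, csv_text, filename="trending.csv"):
    """Create an email with a plain-text body and one CSV attachment."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = subject
    message.set_content(body)
    # Attach the CSV as bytes with an explicit text/csv MIME type
    message.add_attachment(
        csv_text.encode("utf-8"),
        maintype="text",
        subtype="csv",
        filename=filename,
    )
    return message


def send_message(message, host, port, username, password):
    """Send the message over an SSL-encrypted SMTP connection."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        server.login(username, password)
        server.send_message(message)
```

The password is best read from an environment variable (for example with `os.environ`) rather than written into the script, which is why the references above include environment variables on Replit.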

In [4]:
# Install the Beautiful Soup, requests and Selenium libraries.
# Selenium is a Python API for controlling a web browser; it still needs a
# web driver (e.g. chromedriver) to connect to the browser.
%pip install bs4 requests selenium

In [6]:
pip install selenium

Collecting selenium
  Downloading selenium-4.8.2-py3-none-any.whl (6.9 MB)
Collecting trio~=0.17
  Downloading trio-0.22.0-py3-none-any.whl (384 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting exceptiongroup>=1.0.0rc9
  Downloading exceptiongroup-1.1.0-py3-none-any.whl (14 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting attrs>=19.2.0
  Downloading attrs-22.2.0-py3-none-any.whl (60 kB)

In [5]:
pip install pandas

Collecting pandas
  Downloading pandas-1.5.3-cp38-cp38-macosx_10_9_x86_64.whl (11.9 MB)
Collecting numpy>=1.20.3
  Downloading numpy-1.24.2-cp38-cp38-macosx_10_9_x86_64.whl (19.8 MB)
Collecting pytz>=2020.1
  Downloading pytz-2022.7.1-py2.py3-none-any.whl (499 kB)
Installing collected packages: pytz, numpy, pandas
Successfully installed numpy-1.24.2 pandas-1.5.3 pytz-2022.7.1
You should consider upgrading via the '/Users/mac/.pyenv/versions/3.8.15/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated package

In [2]:
import requests
from bs4 import BeautifulSoup

YOUTUBE_TRENDING_URL = 'https://www.youtube.com/feed/trending'

# requests.get downloads the page's HTML and JavaScript but does not execute the
# JavaScript, so the videos, which are loaded dynamically, are missing from the initial page
response = requests.get(YOUTUBE_TRENDING_URL)

# A status code of 200 means success; 404 means the page was not found
print('Status Code', response.status_code)

# Optionally save the HTML for inspection ('w' opens the file for writing)
# with open('trending.html', 'w') as f:
#     f.write(response.text)

doc = BeautifulSoup(response.text, 'html.parser')

print('Page title', doc.title.text)

# Find all the video divs
video_divs = doc.find_all('div', class_='ytd-video-renderer')

print(f'Found {len(video_divs)} videos')

Status Code 200
Page title มาแรง - YouTube
Found 0 videos
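
The result of 0 videos can be reproduced offline. The sketch below uses a hypothetical stand-in for the raw response (not YouTube's actual markup) in which the video element is only created by a script. A static parser never runs that script, so the element is never found. It uses the standard library's `HTMLParser` so that it runs without extra packages.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times a given start tag appears in an HTML string."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


# Hypothetical stand-in for the raw response: the video element is only
# mentioned inside a script that a browser would run, not in the markup.
STATIC_HTML = """
<html>
  <body>
    <div id="contents"></div>
    <script>
      const el = document.createElement('ytd-video-renderer');
      document.getElementById('contents').appendChild(el);
    </script>
  </body>
</html>
"""

print(count_tags(STATIC_HTML, "ytd-video-renderer"))  # 0: the script never ran
print(count_tags(STATIC_HTML, "div"))                 # 1: present in the markup
```

A real browser, which is what Selenium drives, executes the script first, so the element exists by the time the page is queried.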


In [3]:
# A leading "!" runs the line as a shell command instead of Python code
# (assumes chromedriver is on the PATH)
!chromedriver --version
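
The Selenium script that follows pauses for a fixed five seconds before looking for the videos. A fixed pause can be too short on a slow connection and wasteful on a fast one. One alternative is to poll until the elements appear, sketched below in plain Python so that it runs without a browser. The helper name `wait_until` and its default values are assumptions for illustration.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With a driver, the condition could be `lambda: driver.find_elements(By.TAG_NAME, 'ytd-video-renderer')`, which returns an empty (falsy) list until the videos are rendered. Selenium also provides `WebDriverWait` for the same purpose.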

In [6]:
# Start fresh after installing Selenium, for simplicity.
# If the page has not finished loading, some videos may be missed, so the
# script calls time.sleep(5) to let the page load before locating the elements.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import pandas as pd
from selenium.webdriver.chrome.service import Service

YOUTUBE_TRENDING_URL = 'https://www.youtube.com/feed/trending'

def get_driver():
    # Create a Chrome driver; you can think of the driver as the browser
    options = Options()
    # Run Chrome without a visible window
    options.add_argument("--headless=new")
    # Selenium 4 expects the chromedriver path to be wrapped in a Service object
    service = Service("./chromedriver")
    driver = webdriver.Chrome(service=service, options=options)
    return driver

def get_videos(driver):
    # Continuing the browser metaphor: open the URL in the browser
    driver.get(YOUTUBE_TRENDING_URL)

    # Give the page time to finish loading its dynamic content
    time.sleep(5)

    # Each trending video is rendered as a <ytd-video-renderer> element.
    # find_elements(By...) replaces the deprecated find_elements_by_* methods:
    # https://stackoverflow.com/questions/69875125/find-element-by-commands-are-deprecated-in-selenium
    video_div_tag = 'ytd-video-renderer'
    video_divs = driver.find_elements(By.TAG_NAME, video_div_tag)
    return video_divs

def parse_video(video):
    title_tag = video.find_element(By.ID, 'video-title')
    title = title_tag.text
    
    url = title_tag.get_attribute('href')

    thumbnail_tag = video.find_element(By.TAG_NAME, 'img')
    thumbnail_url = thumbnail_tag.get_attribute('src')

    channel_div = video.find_element(By.CLASS_NAME, 'ytd-channel-name')
    channel_name = channel_div.text

    description = video.find_element(By.ID, 'description-text').text

    return {
        'title': title,
        'url': url,
        'thumbnail_url': thumbnail_url,
        'channel_name': channel_name,
        'description_name': description
    }

if __name__ == "__main__":

    print('Creating Driver')
    driver = get_driver()

    # Load the trending page and collect the video elements
    print('Fetching trending video')
    videos = get_videos(driver)
    
    print(f'Found {len(videos)} videos')

    print('Parsing the top 10 videos')
    videos_data = [parse_video(video) for video in videos[:10]]
    # Extracted fields: title, url, thumbnail_url, channel_name, description
    # (views and upload date are not extracted yet)

    print(videos_data)

    videos_df = pd.DataFrame(videos_data)
    print('Printing Dataframe')
    print(videos_df)
    print('Saving the data to a CSV file')
    videos_df.to_csv('trending.csv', index=False)

    # Close the browser and release the driver process
    driver.quit()



Creating Driver
Fetching trending video
Found 97 videos
Parsing the top 10 videos
[{'title': 'ILLSLICK - My Dad [Official Music Video]', 'url': 'https://www.youtube.com/watch?v=T3Fv4IoatxE', 'thumbnail_url': 'https://i.ytimg.com/vi/T3Fv4IoatxE/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLDS29ON86LjwAHM3JjsLaZ_PncYMQ', 'channel_name': 'Illslick thelegandary', 'description_name': 'Mastered : Warut "Sevendogs" Rintranukul Music video by Jettana Cast : Pakpoom Jongmanwattana, Pappim Suthinbut Director : jirat SOMPAKDEE Assistant Director : Raksaphon Ruamwong Director...'}, {'title': '🔴Live! ถ่ายทอดสดหวย 1 มีนาคม 2566 ถ่ายทอดสดสลากกินแบ่งรัฐบาล', 'url': 'https://www.youtube.com/watch?v=qfqXyQzGixU', 'thumbnail_url': 'https://i.ytimg.com/vi/qfqXyQzGixU/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLA1QcjCL0s3ozLLuhmxwRHoI8Rkow', 'channel_name': 'คนอ่านข่าว - TRNTV', 'description_name': 'การถ่ายทอดสดหวย 1 มีนาคม 2566 ถ่ายทอดสดสลากกินแบ่งรัฐบาล ผลการ