# Web Scraping from Scratch with Selenium & AWS

## Objective
1. Scrape the top 10 trending videos on YouTube using Selenium
2. Set up a recurring job on AWS Lambda to scrape every 30 minutes
3. Send the result as a CSV attachment over email (or to a spreadsheet)

### Step 1 - Create a GitHub repository
* Create a repository at https://github.com/new
* Add README, gitignore (Python) and license 
* (Optional) Clone the repository locally
* References:
    * Introduction to GitHub: https://lab.github.com/githubtraining... 
    * Git & GitHub tutorial: Git and GitHub fo...

### Step 2 - Launch the repository on Replit
* Connect Replit with your GitHub account
* Launch the repository as a Replit project
* Set up the language and run command
* Create and execute a Python script
* Attempt to scrape the page using requests & Beautiful Soup
* References:
    * Introduction to Replit: https://docs.replit.com/tutorials/01-... 
    * Replit + GitHub: https://docs.replit.com/tutorials/06-... 
    * YouTube trending feed: https://www.youtube.com/feed/trending 
    * Beautiful soup tutorial: https://blog.jovian.ai/web-scraping-u... 


### Step 3 - Extract information using Selenium
* Install selenium and create a browser driver
* Load the page and extract information
* Create a CSV of results using Pandas
* References:
    * Selenium tutorial: https://www.browserstack.com/guide/py...
    * Pandas tutorial: https://jovian.ai/learn/data-analysis...


### Step 4 - Set up a recurring job on AWS Lambda
* Create an AWS Lambda Python function
* Deploy a sample script and observe the output
* Add layers for Selenium and Chromium
* Set up recurring job using AWS CloudWatch
* References:
    * Python on AWS Lambda tutorial: https://stackify.com/aws-lambda-with-... 
    * Chromium & Selenium on AWS Lambda: https://dev.to/awscommunity-asean/cre...
    * Recurring AWS Lambda functions: https://docs.aws.amazon.com/lambda/la... 
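
The Lambda steps above can be sketched as a handler function. This is a minimal sketch, not the deployed code: `scrape_trending()` is a hypothetical stand-in for the Selenium scraper built later in this notebook, and the returned dictionary shape is an assumption. `lambda_handler(event, context)` is the usual handler signature for Python functions on AWS Lambda. For the recurring job, a schedule expression such as `rate(30 minutes)` on the CloudWatch rule would match the 30-minute objective.

```python
import json


def scrape_trending():
    """Hypothetical stand-in for the Selenium scraper; returns a list of dicts."""
    return [{"title": "Example video", "url": "https://www.youtube.com/watch?v=EXAMPLE"}]


def lambda_handler(event, context):
    """Entry point that AWS Lambda calls on every scheduled invocation."""
    videos = scrape_trending()
    # Lambda serialises the return value to JSON, so keep it to plain types.
    return {
        "statusCode": 200,
        "body": json.dumps({"count": len(videos), "videos": videos}),
    }


if __name__ == "__main__":
    # Local smoke test: the handler does not read anything from the event.
    print(lambda_handler({}, None))
```

Running the file locally before deploying is a quick way to check the handler without touching AWS.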

### Step 5 - Send results over email using SMTP
* Create email client using smtplib
* Set up SSL, TLS and authenticate with password
* Send a sample email with just text
* Send an email with text and attachment
* References:
    * Sending Email with Python: https://stackabuse.com/how-to-send-em...
    * Send email using Python: https://www.geeksforgeeks.org/send-ma...
    * Environment variables on Replit: https://docs.replit.com/programming-i...
    * https://docs.aws.amazon.com/lambda/la... 
    * Update Google sheets using Python: https://www.analyticsvidhya.com/blog/...
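
The "text and attachment" step can be sketched with the standard library's `EmailMessage` class. This is a minimal sketch under stated assumptions: `build_report_email` is a hypothetical helper name, the addresses and CSV bytes are placeholders, and the message is only built, not sent. The notebook's own script further down uses `MIMEMultipart` instead; both approaches produce a multipart message.

```python
from email.message import EmailMessage


def build_report_email(sender, recipients, csv_bytes, filename="trending.csv"):
    """Build (but do not send) an email with a text body and a CSV attachment."""
    msg = EmailMessage()
    msg["Subject"] = "YouTube trending report"
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg.set_content("The latest trending videos are attached as a CSV file.")
    # add_attachment converts the message to multipart/mixed automatically.
    msg.add_attachment(csv_bytes, maintype="application", subtype="csv", filename=filename)
    return msg


if __name__ == "__main__":
    # Placeholder addresses and data, for illustration only.
    demo = build_report_email(
        "sender@example.com",
        ["team@example.com"],
        b"title,url\nExample video,https://www.youtube.com/watch?v=EXAMPLE\n",
    )
    print(demo.get_content_type())
    print([part.get_filename() for part in demo.iter_attachments()])
```

A message built this way can be handed to `send_message()` on an `smtplib.SMTP_SSL` connection, which is how the script below sends its report.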

In [4]:
# Install Beautiful Soup, requests and Selenium.
# Selenium is a Python API that controls a web browser; it still needs a web driver
# (such as chromedriver) to connect to the browser.
# The %pip magic is required here: a bare "pip install" after comment lines is parsed as Python.
%pip install beautifulsoup4 requests selenium

In [6]:
pip install selenium

Collecting selenium
  Downloading selenium-4.8.2-py3-none-any.whl (6.9 MB)
Collecting trio~=0.17
  Downloading trio-0.22.0-py3-none-any.whl (384 kB)
Collecting trio-websocket~=0.9
  Downloading trio_websocket-0.9.2-py3-none-any.whl (16 kB)
Collecting sortedcontainers
  Downloading sortedcontainers-2.4.0-py2.py3-none-any.whl (29 kB)
Collecting exceptiongroup>=1.0.0rc9
  Downloading exceptiongroup-1.1.0-py3-none-any.whl (14 kB)
Collecting async-generator>=1.9
  Downloading async_generator-1.10-py3-none-any.whl (18 kB)
Collecting attrs>=19.2.0
  Downloading attrs-

In [5]:
pip install pandas

Collecting pandas
  Downloading pandas-1.5.3-cp38-cp38-macosx_10_9_x86_64.whl (11.9 MB)
Collecting numpy>=1.20.3
  Downloading numpy-1.24.2-cp38-cp38-macosx_10_9_x86_64.whl (19.8 MB)
Collecting pytz>=2020.1
  Downloading pytz-2022.7.1-py2.py3-none-any.whl (499 kB)
Installing collected packages: pytz, numpy, pandas
Successfully installed n

In [3]:
pip install smtplib

ERROR: Could not find a version that satisfies the requirement smtplib (from versions: none)
ERROR: No matching distribution found for smtplib
You should consider upgrading via the '/Users/mac/.pyenv/versions/3.8.15/bin/python -m pip install --upgrade pip' command.
Note: you may need to restart the kernel to use updated packages.

`smtplib` is part of the Python standard library, so it does not need to be installed; `import smtplib` works without this step.

In [2]:
import requests
from bs4 import BeautifulSoup

YOUTUBE_TRENDING_URL = 'https://www.youtube.com/feed/trending'

# requests.get downloads the page's HTML and JavaScript but does not execute the JavaScript,
# so content that is loaded dynamically (including the video list) is missing from the response
response = requests.get(YOUTUBE_TRENDING_URL)

# A status code of 200 means success; 404 means the page was not found
print('Status Code', response.status_code)

# Optionally save the HTML for inspection ('w' opens the file for writing)
# with open('trending.html', 'w') as f:
#     f.write(response.text)

doc = BeautifulSoup(response.text, 'html.parser')

print('Page title', doc.title.text)

# Find all the video divs
video_divs = doc.find_all('div', class_='ytd-video-renderer')

print(f'Found {len(video_divs)} videos')

Status Code 200
Page title มาแรง - YouTube
Found 0 videos
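
The zero result above follows from the fact that the parser only sees the markup the server sent, not what the page's scripts would add later in a browser. The following is a minimal sketch of that effect using the standard library's `html.parser` on a simplified, made-up page (the `STATIC_HTML` string, `TagCounter` and `count_tags` are all hypothetical illustrations, not YouTube's real markup).

```python
from html.parser import HTMLParser

# Simplified, made-up page: the server sends an empty container plus a script.
STATIC_HTML = """
<html>
  <head><title>Trending - YouTube</title></head>
  <body>
    <div id="contents"></div>
    <script>
      // In a browser this script would add <ytd-video-renderer> elements.
      // A plain HTML parser never runs it.
    </script>
  </body>
</html>
"""


class TagCounter(HTMLParser):
    """Count how many times a given tag opens in the parsed markup."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.count += 1


def count_tags(html, tag):
    """Return the number of opening tags named `tag` in `html`."""
    counter = TagCounter(tag)
    counter.feed(html)
    return counter.count


if __name__ == "__main__":
    print(count_tags(STATIC_HTML, "ytd-video-renderer"))
    print(count_tags(STATIC_HTML, "div"))
```

The script body is treated as plain text, so the tag mentioned inside it is never counted. Selenium avoids this problem because it drives a real browser that executes the scripts before the page is queried.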


In [3]:
# chromedriver is a shell command, not Python, so it needs the leading "!" in a notebook cell
!chromedriver --version

In [66]:
# Start fresh now that Selenium is installed.
# If the page has not finished loading, the scraper may miss some videos, so
# get_videos() below pauses with time.sleep(5) before looking for elements.
import csv
import json  # formats the scraped video data
import os
import smtplib  # sends the email
import time
from datetime import date
# Email modules: message container plus parts for the text and the CSV attachment
from email.message import EmailMessage
from email.mime.application import MIMEApplication
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

import pandas as pd
from selenium import webdriver
# Raised when an element cannot be found on the page
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

YOUTUBE_TRENDING_URL = 'https://www.youtube.com/feed/trending'

def get_driver():
    # Create the Chrome driver; you can think of the driver as the browser itself
    options = Options()
    options.add_argument("--headless=new")  # run without showing a browser window
    # Selenium 4 expects the chromedriver path to be passed through a Service object;
    # passing it as the first positional argument is deprecated
    service = Service(executable_path="./chromedriver")
    driver = webdriver.Chrome(service=service, options=options)
    return driver

def get_videos(driver):
    # Load the trending page, just as you would enter a URL in a browser
    driver.get(YOUTUBE_TRENDING_URL)

    # Give the page time to finish loading before looking for videos
    time.sleep(5)

    # Each trending video is rendered as a <ytd-video-renderer> element.
    # find_elements(By...) replaces the deprecated find_elements_by_* methods:
    # https://stackoverflow.com/questions/69875125/find-element-by-commands-are-deprecated-in-selenium
    video_div_tag = 'ytd-video-renderer'
    video_div = driver.find_elements(By.TAG_NAME, video_div_tag)
    return video_div

def parse_video(video):
    try:
        title_tag = video.find_element(By.ID, 'video-title')
        # title = title_tag.text
        title = title_tag.get_attribute('title')
        
        url = title_tag.get_attribute('href')

        thumbnail_tag = video.find_element(By.TAG_NAME, 'img')
        thumbnail_url = thumbnail_tag.get_attribute('src')

        channel_div = video.find_element(By.CLASS_NAME, 'ytd-channel-name')
        channel_name = channel_div.text

        description = video.find_element(By.ID, 'description-text').text

        return {
            'title': title,
            'url': url,
            'thumbnail_url': thumbnail_url,
            'channel_name': channel_name,
            'description_name': description
        }
    
    except NoSuchElementException:
        # Skip videos that are missing any of the expected elements
        return None

def send_email(body):
    SENDER_EMAIL = os.environ.get('GMAIL_USER')
    RECEIVER_EMAIL = 'rasun2300@gmail.com, rasun2600@gmail.com'
    TODAY_DATE = date.today()
    # TODAY_DATE = TODAY_DATE.strftime("%d-%m-%Y")
    GMAIL_PASS = os.environ.get('GMAIL_PASS')

    # Creating an instance of the EmailMessage class
    # msg = EmailMessage() # single part email
    msg = MIMEMultipart() # use this for multi part email such as a text part and a file attachment part.

    # Set the email subject
    msg['Subject'] = 'OMG Super Important Message'

    # Set the email content
    GMAIL_CONTENT = f"""

    Dear team,

    Hey, this is a test message sending over from python script. This email is sent on {TODAY_DATE}
    
    Best regards,
    {SENDER_EMAIL}
    """
    # msg.set_content(GMAIL_CONTENT)
    msg.attach(MIMEText(GMAIL_CONTENT, 'plain'))

    # Set the email sender and recipient
    msg['From'] = SENDER_EMAIL
    msg['To'] = RECEIVER_EMAIL

    # Read the CSV written by the main block as raw bytes
    # (the path is relative to the working directory, matching to_csv('trending.csv'))
    with open('trending.csv', 'rb') as file:
        csv_data = file.read()

    # Attach the CSV file to the message
    attachment = MIMEApplication(csv_data, _subtype='csv')
    attachment.add_header('Content-Disposition', 'attachment', filename='Trending Video Data.csv')
    msg.attach(attachment)


    try:
        print('Creating a secure connection to the SMTP server')
        # SMTP_SSL encrypts the connection from the start (port 465);
        # the with-block closes the connection even if sending fails
        with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server_ssl:
            print('Logging in to Gmail using credentials from environment variables')
            server_ssl.login(SENDER_EMAIL, GMAIL_PASS)
            print('Sending mail')
            # send_message() reads From, To and Subject from the message headers.
            # An earlier attempt with sendmail() and a hand-built message string
            # did not deliver the subject properly.
            server_ssl.send_message(msg)
            print('Done')
    except Exception as error:
        print('Something went wrong...', error)

if __name__ == "__main__":

    print('Creating Driver')
    driver = get_driver()

    print('Fetching trending video')
    videos = get_videos(driver)

    print(f'Found {len(videos)} videos')

    print('Parsing the top 10 videos')
    # Planned fields: title, url, thumbnail_url, channel, views, uploaded, description
    # (views and uploaded are not parsed yet)
    # parse_video returns None when an element is missing, so drop those entries
    parsed = [parse_video(video) for video in videos[:10]]
    videos_data = [video for video in parsed if video is not None]

    # Close the browser once the data has been extracted
    driver.quit()

    print(videos_data)

    videos_df = pd.DataFrame(videos_data)
    print('Saving the data to a CSV file')
    # utf-8-sig writes a byte order mark so spreadsheet apps detect the encoding of non-ASCII titles
    videos_df.to_csv('trending.csv', index=False, encoding='utf-8-sig')

    print("Sending the email with the result")
    body = json.dumps(videos_data, indent=2)
    send_email(body)


Creating Driver
Fetching trending video
Found 97 videos
Parsing the top 10 videos
[{'title': "ประเพณี 'ลงปลาหว่านแห' 2 ปี มี 1 ครั้ง !!!!", 'url': 'https://www.youtube.com/watch?v=61crRa7L9os', 'thumbnail_url': 'https://i.ytimg.com/vi/61crRa7L9os/hqdefault.jpg?sqp=-oaymwEcCPYBEIoBSFTyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CLAb2nxwd6LHK6d8rBIxfhp2Goch8A', 'channel_name': 'DJ Poom ', 'description_name': 'เปิดโลกกับอีก 1 ประเพณีของชาวอีสาน นั่นก็คือ ‘ลงปลาหว่านแห’ ซึ่งเป็นก...'}, {'title': 'ปอกเปลือก “ชูวิทย์” คิดล้มใครแน่ | คมชัดลึก | 6 มี.ค. 66 | FULL | NationTV22', 'url': 'https://www.youtube.com/watch?v=5iMxyboBWRM', 'thumbnail_url': 'https://i.ytimg.com/vi/5iMxyboBWRM/hqdefault.jpg?sqp=-oaymwEcCPYBEIoBSFTyq4qpAw4IARUAAIhCGAFwAcABBg==&rs=AOn4CL
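
The script above waits a fixed five seconds for the page to load. A fixed pause can be too short on a slow connection and wasteful on a fast one, so one alternative is to poll until enough elements have appeared. The following is a minimal sketch of that idea as a generic helper: `wait_for_items` and its parameters are hypothetical names, and this is not Selenium's own mechanism (Selenium ships `WebDriverWait` for this purpose). The `fetch` argument stands in for a call such as `lambda: driver.find_elements(By.TAG_NAME, 'ytd-video-renderer')`.

```python
import time


def wait_for_items(fetch, minimum=1, timeout=10.0, poll_interval=0.5):
    """Call fetch() until it returns at least `minimum` items or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        items = fetch()
        if len(items) >= minimum:
            return items
        if time.monotonic() >= deadline:
            # Give the caller whatever was found, possibly nothing.
            return items
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Fake page that "finishes loading" on the third poll.
    calls = {"n": 0}

    def fake_fetch():
        calls["n"] += 1
        return ["video"] * 10 if calls["n"] >= 3 else []

    found = wait_for_items(fake_fetch, minimum=10, timeout=5, poll_interval=0.01)
    print(len(found), "items after", calls["n"], "polls")
```

The helper returns as soon as the minimum is reached, so the typical wait is shorter than a fixed sleep while the timeout still bounds the worst case.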

In [9]:
from datetime import date
today = date.today()
TODAY_DATE = today.strftime("%d-%m-%Y")

print(today)
print(TODAY_DATE)

2023-03-07
07-03-2023
