# Web Scraping from Scratch with Selenium & AWS

## Objective
1. Scrape the top 10 trending videos on YouTube using Selenium
2. Set up a recurring job on AWS Lambda to scrape every 30 minutes
3. Send the result as a CSV attachment over email (or to a spreadsheet)

### Step 1 - Create a GitHub repository
* Create a repository at https://github.com/new
* Add a README, a .gitignore (Python) and a license
* (Optional) Clone the repository locally
* References:
    * Introduction to GitHub: https://lab.github.com/githubtraining... 
    * Git & GitHub tutorial: Git and GitHub fo...


### Step 2 - Launch the repository on Replit
* Connect Replit with your GitHub account
* Launch the repository as a Replit project
* Set up the language and run command
* Create and execute a Python script
* Attempt to scrape the page using requests & Beautiful Soup
* References:
    * Introduction to Replit: https://docs.replit.com/tutorials/01-... 
    * Replit + GitHub: https://docs.replit.com/tutorials/06-... 
    * YouTube trending feed: https://www.youtube.com/feed/trending 
    * Beautiful Soup tutorial: https://blog.jovian.ai/web-scraping-u...


### Step 3 - Extract information using Selenium
* Install Selenium and create a browser driver
* Load the page and extract information
* Create a CSV of results using Pandas
* References:
    * Selenium tutorial: https://www.browserstack.com/guide/py...
    * Pandas tutorial: https://jovian.ai/learn/data-analysis...
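
The steps above might look roughly like the following sketch. It assumes Chrome and a matching driver are available to Selenium, and that each trending entry is a `ytd-video-renderer` element containing a link with the id `video-title`. Those selectors are assumptions about YouTube's markup and may need adjusting. The function names (`get_driver`, `get_videos`, `save_csv`, `main`) and the output file `trending.csv` are illustrative, not part of any library.

```python
YOUTUBE_TRENDING_URL = 'https://www.youtube.com/feed/trending'


def get_driver():
    """Create a headless Chrome driver (assumes Chrome is installed)."""
    # Imported inside the function so this module still loads on a
    # machine where Selenium has not been installed yet.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--disable-dev-shm-usage')
    return webdriver.Chrome(options=options)


def get_videos(driver, limit=10):
    """Load the trending page and return up to `limit` rows as dicts."""
    from selenium.webdriver.common.by import By

    driver.get(YOUTUBE_TRENDING_URL)
    # Assumption: each trending entry is a <ytd-video-renderer> element.
    videos = driver.find_elements(By.TAG_NAME, 'ytd-video-renderer')

    rows = []
    for video in videos[:limit]:
        # Assumption: the title link inside each entry has id="video-title".
        title_tag = video.find_element(By.ID, 'video-title')
        rows.append({
            'title': title_tag.text,
            'url': title_tag.get_attribute('href'),
        })
    return rows


def save_csv(rows, path='trending.csv'):
    """Write the scraped rows to a CSV file using pandas."""
    import pandas as pd

    pd.DataFrame(rows).to_csv(path, index=False)


def main():
    driver = get_driver()
    try:
        rows = get_videos(driver)
    finally:
        # Always close the browser, even if scraping fails.
        driver.quit()
    save_csv(rows)
```

Calling `main()` should leave a `trending.csv` file next to the script. Keeping the browser setup, the extraction and the CSV export in separate functions makes it easier to reuse the extraction step later from a scheduled job.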


### Step 4 - Set up a recurring job on AWS Lambda
* Create an AWS Lambda Python function
* Deploy a sample script and observe the output
* Add layers for Selenium and Chromium
* Set up a recurring job using AWS CloudWatch
* References:
    * Python on AWS Lambda tutorial: https://stackify.com/aws-lambda-with-... 
    * Chromium & Selenium on AWS Lambda: https://dev.to/awscommunity-asean/cre...
    * Recurring AWS Lambda functions: https://docs.aws.amazon.com/lambda/la... 
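
A sample script for the first deployment could be as small as the sketch below. It assumes the function keeps the console's default handler setting (`lambda_function.lambda_handler`); the response body and its `message` text are made up for illustration.

```python
import json
from datetime import datetime, timezone


def lambda_handler(event, context):
    """Entry point that AWS Lambda invokes; `event` carries the trigger payload."""
    invoked_at = datetime.now(timezone.utc).isoformat()

    # Anything printed here ends up in the function's CloudWatch log stream.
    print('Scraper invoked at', invoked_at)

    return {
        'statusCode': 200,
        'body': json.dumps({
            'message': 'Scraper ran',  # illustrative placeholder text
            'invoked_at': invoked_at,
        }),
    }
```

For the recurring job, a schedule expression such as `rate(30 minutes)` on the CloudWatch Events (EventBridge) rule that triggers the function would match the 30-minute interval in the objective.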

### Step 5 - Send results over email using SMTP
* Create an email client using smtplib
* Set up SSL/TLS and authenticate with a password
* Send a sample email with just text
* Send an email with text and attachment
* References:
    * Sending Email with Python: https://stackabuse.com/how-to-send-em...
    * Send email using Python: https://www.geeksforgeeks.org/send-ma...
    * Environment variables on Replit: https://docs.replit.com/programming-i...
    * https://docs.aws.amazon.com/lambda/la... 
    * Update Google Sheets using Python: https://www.analyticsvidhya.com/blog/...
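
One way to put these steps together is sketched below, using only the standard library. The SMTP host and port (`smtp.gmail.com`, 587) and the environment variable name `SENDER_PASSWORD` are assumptions chosen for illustration, as are the helper names `build_message` and `send_email`. Substitute your own provider's settings.

```python
import os
import smtplib
import ssl
from email.message import EmailMessage


def build_message(sender, recipient, subject, body, csv_path=None):
    """Build a plain-text email, optionally with a CSV file attached."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    msg.set_content(body)

    if csv_path is not None:
        with open(csv_path, 'rb') as f:
            msg.add_attachment(
                f.read(),
                maintype='text',
                subtype='csv',
                filename=os.path.basename(csv_path),
            )
    return msg


def send_email(msg, host='smtp.gmail.com', port=587):
    """Send `msg` over SMTP, upgrading the connection with STARTTLS."""
    # Read the password from the environment rather than hard-coding it.
    # SENDER_PASSWORD is a hypothetical variable name.
    password = os.environ['SENDER_PASSWORD']
    context = ssl.create_default_context()

    with smtplib.SMTP(host, port) as server:
        server.starttls(context=context)
        server.login(msg['From'], password)
        server.send_message(msg)
```

A text-only test email is then `send_email(build_message(sender, recipient, 'Test', 'Hello'))`, and passing `csv_path='trending.csv'` adds the attachment. Building the message separately from sending it lets you inspect the result before any network call is made.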

In [1]:
print(5)

5


In [8]:
import requests
from bs4 import BeautifulSoup

YOUTUBE_TRENDING_URL = 'https://www.youtube.com/feed/trending'

response = requests.get(YOUTUBE_TRENDING_URL)

# A status code of 200 means the request succeeded; 404 means the page was not found
print('Status Code', response.status_code)

# 'w' opens the file in write mode
# with open('trending.html', 'w') as f:
#     f.write(response.text)

# Name the parser explicitly so Beautiful Soup does not have to guess one
doc = BeautifulSoup(response.text, 'html.parser')

Status Code 200


In [6]:
response.headers

{'Content-Type': 'text/html; charset=utf-8', 'X-Content-Type-Options': 'nosniff', 'Cache-Control': 'no-cache, no-store, max-age=0, must-revalidate', 'Pragma': 'no-cache', 'Expires': 'Mon, 01 Jan 1990 00:00:00 GMT', 'Date': 'Thu, 02 Mar 2023 06:07:19 GMT', 'X-Frame-Options': 'SAMEORIGIN', 'Strict-Transport-Security': 'max-age=31536000', 'Report-To': '{"group":"youtube_main","max_age":2592000,"endpoints":[{"url":"https://csp.withgoogle.com/csp/report-to/youtube_main"}]}', 'Permissions-Policy': 'ch-ua-arch=*, ch-ua-bitness=*, ch-ua-full-version=*, ch-ua-full-version-list=*, ch-ua-model=*, ch-ua-wow64=*, ch-ua-platform=*, ch-ua-platform-version=*', 'Cross-Origin-Opener-Policy-Report-Only': 'same-origin-allow-popups; report-to="youtube_main"', 'P3P': 'CP="This is not a P3P policy! See http://support.google.com/accounts/answer/151657?hl=th for more info."', 'Content-Encoding': 'gzip', 'Server': 'ESF', 'X-XSS-Protection': '0', 'Set-Cookie': 'GPS=1; Domain=.youtube.com; Expires=Thu, 02-Mar-202