# Lesson 5

### Web Scraping with Selenium

Documentation:

https://selenium-python.readthedocs.io/

In [1]:
! pip install selenium



***Attention!!!***

Selenium requires a driver to interface with our chosen browser. Chrome, for example, requires ChromeDriver, which needs to be installed before we start scraping. The Selenium web driver speaks directly to the browser using the browser’s own engine to control it.

**Download Chrome WebDriver**


Visit https://sites.google.com/chromium.org/driver/downloads?authuser=0


Select the compatible driver for your Chrome version
    To check the Chrome version you are using, click on the three vertical dots on the top right corner
    Then go to Help -> About Google Chrome

Some help!!!

http://www.borgo7.net/4-mosse-per-installare-selenium-su-windows-e-usarlo-con-python/

https://chromedriver.chromium.org/getting-started

Extract "chromedriver.exe" from the zip file and move it in an easy folder

Mine is:       C:\Users\mchde\Desktop\TextPy_Lex5


In [3]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys 

driver = webdriver.Firefox(executable_path = 'D:\\utente\\Download\\geckodriver-v0.29.1-win64\\geckodriver.exe')

### Scraping from YouTube


We want: the video ID, video title, and video description of a particular category from YouTube


The categories we’ll be scraping are:

- Travel
- Science
- Food
- History
- Manufacturing
- Art & Dance  


In [4]:
from selenium import webdriver 
import pandas as pd 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC


Open YouTube in your browser. Type in the category you want to search videos for and set the filter to “videos”. This will display only the videos related to your search. Copy the URL after doing this.   


science

https://www.youtube.com/results?search_query=science&sp=EgIQAQ%253D%253D

In [5]:
url = "https://www.youtube.com/results?search_query=science&sp=EgIQAQ%253D%253D"

In [7]:
# This will open a new browser window for that link

driver = webdriver.Firefox() 
driver.get(url)

WebDriverException: Message: 'geckodriver' executable needs to be in PATH. 


Fetch all the video links present on that particular page. We will create a “list” to store those links

Now, go to the browser window, right-click on the page, and select ‘inspect element’

Search for the anchor tag with id = ”video-title” and then right-click on it -> Copy -> XPath. The XPath should look something like : //*[@id=”video-title”]

In [None]:
//*[@id="video-title"]/yt-formatted-string

In [7]:
# this will fetch the “href” attribute of the anchor tag we searched for

user_data = driver.find_elements_by_xpath('//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))

print(len(links))

20


In [8]:
# create a dataframe with 4 columns – “link”, “title”, “description”, and “category”
# We will store the details of videos for different categories in these columns

df = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])


In [11]:
# scrape the video details from YouTube

wait = WebDriverWait(driver, 10)

v_category = "science"

for x in links:
            driver.get(x)
            v_id = x.strip('https://www.youtube.com/watch?v=')
            v_title = wait.until(EC.presence_of_element_located(
                           (By.CSS_SELECTOR,"h1.title yt-formatted-string"))).text
            v_description =  wait.until(EC.presence_of_element_located(
                                         (By.CSS_SELECTOR,"div#description yt-formatted-string"))).text
            df.loc[len(df)] = [v_id, v_title, v_description, v_category]



Let’s breakdown this code block to understand what we just did:

“wait” will ignore instances of NotFoundException that are encountered (thrown) by default in the ‘until’ condition. It will immediately propagate all others

Parameters:
driver: The WebDriver instance to pass to the expected conditions
timeOutInSeconds: The timeout in seconds when an expectation is called

v_category stores the video category name we searched for earlier

The “for” loop is applied on the list of links we created above

driver.get(x) traverses through all the links one-by-one and opens them in the browser to fetch the details

v_id stores the stripped video ID from the link

v_title stores the video title fetched by using the CSS path

Similarly, v_description stores the video description by using the CSS path


During each iteration, our code saves the extracted data inside the dataframe we created earlier


In [12]:
df

Unnamed: 0,link,title,description,category
0,zopTRZzh-Y,,,science
1,z-R3DShHbkA,EASY SCIENCE EXPERIMENTS TO DO AT HOME,,science
2,naNeSLofnW8,,,science
3,T8DSQSk,7:00 AM - All Competitive Exams | GS by Shipra...,,science
4,jIoIvPCJ8kY,,,science
5,j-NO-9uM-sM,Scientists Spent 14 Months in Antarctica. That...,,science
6,NCjf6og4qHk,29 SCIENCE TRICKS that look like real MAGIC,,science
7,d5HjXAiLWG8,Science Max |Rube Goldberg | FUN SCIENCE,,science
8,Q7yvvq-9ytE,Flat Earthers vs Scientists: Can We Trust Scie...,,science
9,U211GBuMgE,,,science


In [None]:
df_science = df

In [None]:
# repeat for all the categories

In [None]:
# the end