## Scraping YouTube Data using Python and Selenium to Classify Videos

Overview of Selenium:

Selenium is a popular tool for automating browsers. It is used mainly for testing web applications, but it is also very handy for web scraping, and many people who work in IT will already have come across it.

We can easily program a Python script to automate a web browser using Selenium. It gives us the freedom we need to efficiently extract the data and store it in our preferred format for future use.

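Selenium requires a driver to interface with our chosen browser. Chrome, for example, requires ChromeDriver, which must be available before we start scraping (Selenium 4.6 and later can download a matching driver automatically). Selenium WebDriver talks to the browser through this driver, using the browser's own automation support rather than scripts injected into the page. This keeps the automation responsive, although driving a real browser is still slower than fetching pages with plain HTTP requests.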

In [8]:
!pip install selenium



We'll be scraping the video ID, video title, and video description of the videos in a particular category from YouTube. The categories we'll be scraping are:

    Travel
    Science
    India
    Food
    History
    Manufacturing
    Art & Dance
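
The search pages for these categories all share one URL pattern, so the addresses can be generated rather than typed by hand. The sketch below is one way to do that with the standard library. `build_search_url` and the `CATEGORY_QUERIES` mapping are illustrative names that are not part of the notebook, and the query strings mirror the ones used in the sections that follow.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://www.youtube.com/results"

# Hypothetical mapping from a category label to the search text used for it.
CATEGORY_QUERIES = {
    "travel": "Travel",
    "science": "Science",
    "india": "India",
    "food": "food",
    "manufacturing": "manufacturing",
    "history": "history",
    "artndance": "art and dance",
}


def build_search_url(query):
    """Return the YouTube results-page URL for a search query."""
    # urlencode escapes the query text and turns spaces into '+'.
    return SEARCH_BASE + "?" + urlencode({"search_query": query})


search_urls = {label: build_search_url(q) for label, q in CATEGORY_QUERIES.items()}
for label, url in search_urls.items():
    print(label, url)
```

Each URL can then be passed to `driver.get(...)` in place of a hard-coded string.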


In [15]:
# First, let’s import some libraries:
from selenium import webdriver 
import pandas as pd 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

### 1. Scraping the Travel data from YouTube

In [103]:
driver = webdriver.Chrome() 
driver.get("https://www.youtube.com/results?search_query=Travel")

In [34]:
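# Collect the link of every video title on the results page.
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...).
user_data = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))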

print(len(links))

115
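
The collected `href` values may include `None` (an element without a link), links that are not watch pages, or repeats. A small filter like the one below, run before visiting the pages, is one way to guard against that. `clean_links` is a hypothetical helper, and the sample URLs use made-up video IDs.

```python
def clean_links(hrefs):
    """Keep unique watch-page URLs, preserving their original order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        # get_attribute('href') returns None when the element has no href.
        if not href or "/watch?v=" not in href:
            continue
        if href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned


sample = [
    "https://www.youtube.com/watch?v=AAAAAAAAAAA",
    None,
    "https://www.youtube.com/shorts/BBBBBBBBBBB",
    "https://www.youtube.com/watch?v=AAAAAAAAAAA",
    "https://www.youtube.com/watch?v=CCCCCCCCCCC",
]
print(clean_links(sample))
```

In the notebook this would be applied as `links = clean_links(links)`, which also changes the count that is printed.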


In [35]:
# We need a DataFrame with 4 columns: "link", "title", "description", and "category".
# The "link" column holds the video ID taken from each URL.
df_travel = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])


In [36]:
# Visit each video page and scrape its details.
# The CSS selectors depend on YouTube's page markup and may need updating.
wait = WebDriverWait(driver, 10)
v_category = "travel"
for x in links:
    driver.get(x)
    # str.strip() removes a set of characters, not a prefix, so it can also
    # eat characters of the ID itself; split on "v=" instead.
    v_id = x.split('v=')[-1].split('&')[0]
    v_title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    df_travel.loc[len(df_travel)] = [v_id, v_title, v_description, v_category]
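
Taking the video ID out of a watch URL is easy to get subtly wrong: `str.strip` removes any of the given characters from both ends rather than a fixed prefix, so it can also remove characters that belong to the ID. The sketch below compares it with parsing the query string. `extract_video_id` is an illustrative helper, and the video ID in the sample URL is made up.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the value of the 'v' query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


url = "https://www.youtube.com/watch?v=abc123XYZ_-"  # made-up video ID

# strip() treats its argument as a set of characters, so the leading
# "abc" of the ID is removed along with the URL prefix.
stripped = url.strip("https://www.youtube.com/watch?v=")

print(extract_video_id(url))  # abc123XYZ_-
print(stripped)               # 123XYZ_-
```

Parsing the query string also ignores any extra parameters that follow the ID, such as a timestamp.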

### 2. Scraping the Science data from YouTube

In [49]:
driver = webdriver.Chrome() 
driver.get("https://www.youtube.com/results?search_query=Science")

In [51]:
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...).
user_data = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))

print(len(links))

20


In [52]:
# A separate DataFrame for this category, with the same 4 columns.
# The "link" column holds the video ID taken from each URL.
df_science = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])

In [53]:
# Visit each video page and scrape its details.
wait = WebDriverWait(driver, 10)
v_category = "science"
for x in links:
    driver.get(x)
    # split on "v=" rather than strip(), which removes characters, not a prefix
    v_id = x.split('v=')[-1].split('&')[0]
    v_title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    df_science.loc[len(df_science)] = [v_id, v_title, v_description, v_category]

### 3. Scraping the India data from YouTube

In [54]:
driver = webdriver.Chrome() 
driver.get("https://www.youtube.com/results?search_query=India")

In [55]:
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...).
user_data = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))

print(len(links))

28


In [56]:
# A separate DataFrame for this category, with the same 4 columns.
# The "link" column holds the video ID taken from each URL.
df_india = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])

In [57]:
# Visit each video page and scrape its details.
wait = WebDriverWait(driver, 10)
v_category = "india"
for x in links:
    driver.get(x)
    # split on "v=" rather than strip(), which removes characters, not a prefix
    v_id = x.split('v=')[-1].split('&')[0]
    v_title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    df_india.loc[len(df_india)] = [v_id, v_title, v_description, v_category]

### 4. Scraping the Food data from YouTube

In [58]:
driver = webdriver.Chrome() 
driver.get("https://www.youtube.com/results?search_query=food")

In [62]:
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...).
user_data = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))

print(len(links))

37


In [63]:
# A separate DataFrame for this category, with the same 4 columns.
# The "link" column holds the video ID taken from each URL.
df_food = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])

In [64]:
# Visit each video page and scrape its details.
wait = WebDriverWait(driver, 10)
v_category = "food"
for x in links:
    driver.get(x)
    # split on "v=" rather than strip(), which removes characters, not a prefix
    v_id = x.split('v=')[-1].split('&')[0]
    v_title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    df_food.loc[len(df_food)] = [v_id, v_title, v_description, v_category]

### 5. Scraping the Manufacturing data from YouTube

In [86]:
driver = webdriver.Chrome() 
driver.get("https://www.youtube.com/results?search_query=manufacturing")

In [87]:
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...).
user_data = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))

print(len(links))

14


In [88]:
# A separate DataFrame for this category, with the same 4 columns.
# The "link" column holds the video ID taken from each URL.
df_manufacturing = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])

In [89]:
# Visit each video page and scrape its details.
wait = WebDriverWait(driver, 10)
v_category = "manufacturing"
for x in links:
    driver.get(x)
    # split on "v=" rather than strip(), which removes characters, not a prefix
    v_id = x.split('v=')[-1].split('&')[0]
    v_title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    df_manufacturing.loc[len(df_manufacturing)] = [v_id, v_title, v_description, v_category]

### 6. Scraping the History data from YouTube

In [90]:
driver = webdriver.Chrome() 
driver.get("https://www.youtube.com/results?search_query=history")

In [91]:
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...).
user_data = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))

print(len(links))

19


In [92]:
# A separate DataFrame for this category, with the same 4 columns.
# The "link" column holds the video ID taken from each URL.
df_history = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])

In [93]:
# Visit each video page and scrape its details.
wait = WebDriverWait(driver, 10)
v_category = "history"
for x in links:
    driver.get(x)
    # split on "v=" rather than strip(), which removes characters, not a prefix
    v_id = x.split('v=')[-1].split('&')[0]
    v_title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    df_history.loc[len(df_history)] = [v_id, v_title, v_description, v_category]

### 7. Scraping the Art and Dance data from YouTube

In [94]:
driver = webdriver.Chrome() 
driver.get("https://www.youtube.com/results?search_query=art+and+dance")

In [95]:
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...).
user_data = driver.find_elements(By.XPATH, '//*[@id="video-title"]')
links = []
for i in user_data:
    links.append(i.get_attribute('href'))

print(len(links))

21


In [96]:
# A separate DataFrame for this category, with the same 4 columns.
# The "link" column holds the video ID taken from each URL.
df_artndance = pd.DataFrame(columns = ['link', 'title', 'description', 'category'])

In [97]:
# Visit each video page and scrape its details.
wait = WebDriverWait(driver, 10)
v_category = "artndance"
for x in links:
    driver.get(x)
    # split on "v=" rather than strip(), which removes characters, not a prefix
    v_id = x.split('v=')[-1].split('&')[0]
    v_title = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "h1.title yt-formatted-string"))).text
    v_description = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div#description yt-formatted-string"))).text
    df_artndance.loc[len(df_artndance)] = [v_id, v_title, v_description, v_category]

In [102]:
# Preview the Art and Dance frame. This raises NameError unless df_artndance
# was assigned in the Art and Dance section above.
df_artndance

In [100]:
# Combine the per-category frames into one dataset. Every name in `frames` must
# already be assigned in its own section, otherwise this cell raises NameError.
# join_axes was removed in pandas 1.0 and the other arguments were defaults.
frames = [df_travel, df_science, df_india, df_food, df_manufacturing, df_history, df_artndance]
df_copy = pd.concat(frames, ignore_index=True)
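
`pd.concat` only works here if each per-category frame exists under its own name. The minimal sketch below shows the intended result with placeholder rows standing in for scraped data. The rows and the `rows_to_frame` helper are illustrative and do not come from the notebook.

```python
import pandas as pd

COLUMNS = ["link", "title", "description", "category"]


def rows_to_frame(rows):
    """Build a frame with the notebook's four columns from a list of tuples."""
    return pd.DataFrame(rows, columns=COLUMNS)


# Placeholder rows; real values would come from the scraping loops.
df_travel = rows_to_frame([
    ("id_t1", "Placeholder travel title", "Placeholder description", "travel"),
])
df_science = rows_to_frame([
    ("id_s1", "Placeholder science title", "Placeholder description", "science"),
    ("id_s2", "Another science title", "Placeholder description", "science"),
])

# ignore_index=True renumbers the rows 0..n-1 in the combined frame.
combined = pd.concat([df_travel, df_science], ignore_index=True)
print(combined["category"].value_counts())
```

Collecting rows in a list and building each frame once, as above, also avoids growing a DataFrame one row at a time with `df.loc[len(df)]`.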