# What is Web Scraping

Web scraping is the programmatic fetching and extraction of information from websites. It automates the retrieval of data such as text, images, tables, or other structured content from web pages, and it is useful for a wide range of applications, including data analysis, research, content aggregation, and price monitoring.

### Selenium

One popular tool for web scraping is Selenium, which is a web automation framework. Selenium allows you to interact with websites in a way that simulates human browsing behavior. It can be used to automate tasks like filling out forms, clicking buttons, and navigating through web pages, making it a powerful tool for web scraping.

Selenium can be integrated with multiple programming languages like Python, Java, C#, Ruby, and JavaScript.

However, it's important to note that web scraping, including the use of Selenium, should be conducted ethically and in compliance with the website's terms of service and legal regulations.
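
One common courtesy check, which is not part of the original notebook, is to consult a site's robots.txt before fetching a page. The sketch below uses the standard library's `urllib.robotparser`; the rules, the `example.com` URLs, and the `my-scraper` user agent are made up for illustration. robots.txt is advisory only, so this does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real file would be loaded with
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
EXAMPLE_RULES = """\
User-agent: *
Disallow: /checkout/
Allow: /
""".splitlines()


def is_allowed(rules, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(user_agent, url)


print(is_allowed(EXAMPLE_RULES, "my-scraper", "https://example.com/products/1"))
print(is_allowed(EXAMPLE_RULES, "my-scraper", "https://example.com/checkout/cart"))
```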

### Selenium Python Tutorial

Install The Necessary Libraries

In [1]:
!pip install selenium
!pip install webdriver-manager





Make sure your web browser is updated to the latest version.

In [24]:
#Importing the necessary libraries

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

### Google Chrome

In [25]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))

### Firefox

In [26]:
# from selenium import webdriver
# from selenium.webdriver.firefox.service import Service as FirefoxService
# from webdriver_manager.firefox import GeckoDriverManager

# driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))

Set the URL of the webpage you want to automate or scrape.

In [35]:
url = 'https://www.daraz.pk/'

Pass the URL to the driver

In [36]:
driver.get(url)

Getting the Daraz Search Bar 

In [37]:
search_bar = driver.find_element(By.XPATH, '/html/body/div[1]/div/div/div[1]/div/div/div[2]/div/div[2]/form/div/div[1]/input[1]')
search_bar.send_keys('Samsung' + Keys.RETURN)

Besides XPath, Selenium supports these locator strategies for finding elements:

i) By.ID<br>
ii) By.NAME<br>
iii) By.LINK_TEXT<br>
iv) By.PARTIAL_LINK_TEXT<br>
v) By.TAG_NAME<br>
vi) By.CLASS_NAME<br>
vii) By.CSS_SELECTOR

Example of Finding Elements by Tag Name

In [30]:
driver.find_element(By.TAG_NAME, 'div').text

'Become a Seller Daraz Affiliate Program Help & Support\nSave More on App\nSEARCH\nLogin\n|\nSign Up\nEN\n⌄\nCategories'

Finding Multiple Elements with same HTML Tag

In [8]:
driver.find_elements(By.TAG_NAME, 'div')

[<selenium.webdriver.remote.webelement.WebElement (session="4af62bfe8f4ff2bb343f336762a9dcbf", element="E470D15CCDEFFEFD1A18CAD32BCEC482_element_118")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4af62bfe8f4ff2bb343f336762a9dcbf", element="E470D15CCDEFFEFD1A18CAD32BCEC482_element_119")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4af62bfe8f4ff2bb343f336762a9dcbf", element="E470D15CCDEFFEFD1A18CAD32BCEC482_element_120")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4af62bfe8f4ff2bb343f336762a9dcbf", element="E470D15CCDEFFEFD1A18CAD32BCEC482_element_121")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4af62bfe8f4ff2bb343f336762a9dcbf", element="E470D15CCDEFFEFD1A18CAD32BCEC482_element_122")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4af62bfe8f4ff2bb343f336762a9dcbf", element="E470D15CCDEFFEFD1A18CAD32BCEC482_element_123")>,
 <selenium.webdriver.remote.webelement.WebElement (session="4af62bfe8f4ff2bb343f33

Extracting Text From Specific Element

In [9]:
driver.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div/div/div[1]/div[2]/div[1]/div/div/div[2]/div[2]/a').text

'Daraz Like New Tablets - Samsung Galaxy Tab A 8.0" 2GB 32GB, Black Wi-Fi Supported - FREE TABLET COVER'

### Relative Path

In [31]:
ad_grid = driver.find_element(By.XPATH, '/html/body/div[3]/div/div[3]/div/div/div[1]/div[2]')
ad_cards = ad_grid.find_elements(By.CLASS_NAME, 'gridItem--Yd0sa')

for i in ad_cards:
    print(i.text)

Daraz Like New Tablets - Samsung Galaxy Tab A 8.0" 2GB 32GB, Black Wi-Fi Supported - FREE TABLET COVER
Rs. 16,299
Rs. 24,999-35%
(19)
Free Shipping
Samsung Galaxy Tab A6 - 1.5GB RAM - 8GB ROM - Android 7
Rs. 11,475
Rs. 12,500-8%
(15)
Pakistan
Samsung Galaxy Tab E - 1.5GB RAM - 16GB ROM - Android 7 - FREE TABLET COVER
Rs. 13,899
Rs. 24,999-44%
(11)
Free Shipping
SAMSUNG GALAXY TAB A7 LITE T220 3GB,32GB,8.7INCH,ONLY WIFI
Rs. 39,999
Rs. 45,000-11%
(20)
Free Shipping
SAMSUNG TAB A7 LITE T220 4GB RAM/64GB STORAGE/WIFI /8.7 INCH(BRAND NEW)
Rs. 45,999
Rs. 49,999-8%
(16)
Free Shipping
SAMSUNG GALAXY TAB A7 LITE T225 3GB,32GB,8.7INCH,SIM PLUS WIFI PTA APPROVED
Rs. 49,999
Rs. 59,999-17%
(15)
Free Shipping
Samsung A04 3GB+32GB
Rs. 27,499
Rs. 37,500-27%
(24)
Pakistan
Daraz Like New Tablets - Samsung Galaxy Tab A 8.0" 2GB 32GB, Black Wi-Fi Supported - Tablets - FREE TABLET COVER - PUBG SUPPORTE - MODEL - T387
Rs. 16,499
Rs. 24,999-34%
(10)
Free Shipping
Samsung Galaxy Tab A6 - 1.5GB RAM - 8GB ROM -
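
Each card's `.text` comes back as one multi-line string, so it usually needs to be split into fields before analysis. The sketch below assumes the layout seen in the output above: title on the first line, then the current price, then the original price with its discount, then a parenthesised count. Treating that count as the number of ratings is a guess, and `parse_card` is a hypothetical helper name.

```python
import re

PRICE_RE = re.compile(r"Rs\.\s*([\d,]+)")


def parse_card(text):
    """Split one card's text into title, prices, and the parenthesised count."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    title = lines[0] if lines else ""
    # Lines starting with "Rs." hold prices; commas are thousands separators
    prices = [int(m.group(1).replace(",", "")) for line in lines[1:]
              if (m := PRICE_RE.match(line))]
    # A line such as "(24)" is assumed to be the ratings count
    ratings = next((int(line[1:-1]) for line in lines
                    if re.fullmatch(r"\(\d+\)", line)), None)
    return {
        "title": title,
        "price": prices[0] if prices else None,
        "original_price": prices[1] if len(prices) > 1 else None,
        "ratings": ratings,
    }


# One card copied from the output above
sample = ("Samsung A04 3GB+32GB\n"
          "Rs. 27,499\n"
          "Rs. 37,500-27%\n"
          "(24)\n"
          "Pakistan")
print(parse_card(sample))
```

With a live driver, `[parse_card(card.text) for card in ad_cards]` would give a list of dictionaries that can be passed to `pd.DataFrame`.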

Clicking on an Element on Web Page

In [32]:
ad_cards[0].click()

### Common Exceptions

#### NoSuchElementException
Raised when the driver cannot find the element specified in find_element().<br>
-Check that you have specified the right path<br>
-Check that the driver is on the page the path refers to

#### StaleElementReferenceException
Raised when a previously located element is no longer attached to the page, for example after the page reloads or the element is re-rendered.<br>
-Locate the element again before using it, retrying in a loop if needed
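
The "retry in a loop" advice can be sketched as a small generic helper. This is an illustration rather than a Selenium API: `retry` is a hypothetical name, and the attempt count and delay are arbitrary defaults.

```python
import time


def retry(action, exceptions, attempts=3, delay=0.5):
    """Call action() until it succeeds or the attempts run out.

    With Selenium, exceptions would typically be StaleElementReferenceException,
    and action would locate the element again before using it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            time.sleep(delay)
```

A call might look like `retry(lambda: driver.find_element(By.CLASS_NAME, 'item').text, StaleElementReferenceException)`, which locates the element again on every attempt instead of reusing a stale reference.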

### Code for Scraping Reviews of a Product on Daraz

In [12]:
url = 'https://www.daraz.pk/products/daraz-like-new-tablets-samsung-galaxy-tab-a-80-2gb-32gb-black-wi-fi-supported-free-tablet-cover-i411369268-s1965198370.html?spm=a2a0e.searchlist.list.1.3c784ce9COfUfG&search=1'

-The time.sleep() function can be used to introduce a brief pause, allowing the webpage to load successfully.<br>
-For automated scrolling within a webpage, you can utilize driver.execute_script("window.scrollTo(0, 500);").

In [33]:
import time
import pandas as pd

# driver.get(url)  # Uncomment if the driver is not already on the product page

# Collected reviewer names and review texts
temp = {'Name': [], 'Review text': []}

# Scroll down in steps so the review section has time to load.
# window.scrollTo takes (x, y), so x stays 0 for vertical scrolling.
for y, pause in [(500, 1), (800, 1), (1000, 2), (1300, 2), (1600, 1)]:
    driver.execute_script(f"window.scrollTo(0, {y});")
    time.sleep(pause)

for page in range(0, 3):
    try:
        reviews_container = driver.find_element(By.XPATH, '//*[@id="module_product_review"]/div/div/div[3]/div[1]')
    except NoSuchElementException:
        print('No Reviews')
        break
    review_items = reviews_container.find_elements(By.CLASS_NAME, 'item')

    for i, item in enumerate(review_items):
        # Look up the review text once and skip reviews that have no text
        text = item.find_element(By.CSS_SELECTOR, f"#module_product_review > div > div > div:nth-child(3) > div.mod-reviews > div:nth-child({i+1}) > div.item-content > div.content").text
        if text != '':
            # [3:] drops the three-character prefix in front of the reviewer name
            temp['Name'].append(item.find_element(By.CLASS_NAME, 'middle').find_element(By.TAG_NAME, 'span').text[3:])
            temp['Review text'].append(text)

    try:
        if page != 2:
            # Click the next-page button through JavaScript, then wait for the reviews to reload
            nxt_button = driver.find_element(By.XPATH, '//*[@id="module_product_review"]/div/div/div[3]/div[2]/div/button[2]')
            driver.execute_script("arguments[0].click();", nxt_button)
            time.sleep(5)
    except NoSuchElementException:
        print('No More Pages')
        break

reviews = pd.DataFrame(temp)

reviews

| | Name | Review text |
|---|---|---|
| 0 | Salah | Ordered two, received on time and well packed.... |
| 1 | Dr U. | Excellent Quality.Just like new. Needs eagle e... |
| 2 | R***. | Cover was missing and charger was not original... |
| 3 | Waqasraza R. | thanks saller |
| 4 | 3***2 | Marvelous quality,All tabs are 10/10 , charger... |
| 5 | Salah | Ordered two, received on time and well packed.... |
| 6 | Dr U. | Excellent Quality.Just like new. Needs eagle e... |
| 7 | R***. | Cover was missing and charger was not original... |
| 8 | Waqasraza R. | thanks saller |
| 9 | 3***2 | Marvelous quality,All tabs are 10/10 , charger... |
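
In the output above, rows 5 to 9 repeat rows 0 to 4. One possible cause is that the same page of reviews was read again before the next page had loaded. Whatever the reason, exact duplicates can be dropped with pandas. The sketch below rebuilds a small frame from two of the rows shown above, with the review text kept exactly as it is displayed, rather than using the live `reviews` DataFrame.

```python
import pandas as pd

# Two rows copied from the output above, repeated the way the scrape repeated them
scraped = pd.DataFrame({
    "Name": ["Salah", "Waqasraza R.", "Salah", "Waqasraza R."],
    "Review text": [
        "Ordered two, received on time and well packed....",
        "thanks saller",
        "Ordered two, received on time and well packed....",
        "thanks saller",
    ],
})

# Keep the first occurrence of each (Name, Review text) pair and renumber the rows
deduped = scraped.drop_duplicates(subset=["Name", "Review text"]).reset_index(drop=True)
print(len(scraped), "->", len(deduped))
```

Two different reviewers could in principle post identical text, which is why the name is included in `subset`.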


# Beautiful Soup / Requests

In [20]:
!pip install beautifulsoup4
!pip install requests





In [21]:
from bs4 import BeautifulSoup
import requests
url = "https://en.wikipedia.org/wiki/Muhammad_Ali_Jinnah"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
print(soup.title)

<title>Muhammad Ali Jinnah - Wikipedia</title>
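
Beyond `soup.title`, the same object supports lookups by tag name, attribute, and CSS selector. The sketch below runs on a small made-up HTML string so that it works without a network request. The markup and the name `demo_soup` are illustrative and are not the real Wikipedia page structure.

```python
from bs4 import BeautifulSoup

# Small hypothetical page so the example runs offline
html = """
<html><head><title>Example - Wikipedia</title></head>
<body>
  <h1 id="firstHeading">Example</h1>
  <div class="toc"><a href="#Early_years">Early years</a></div>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

demo_soup = BeautifulSoup(html, "html.parser")

print(demo_soup.title.string)                         # Text inside <title>
print(demo_soup.find("h1", id="firstHeading").text)   # First <h1> with that id
print([p.text for p in demo_soup.find_all("p")])      # Text of every <p>
print(demo_soup.select_one("div.toc a")["href"])      # CSS selector lookup
```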


In [24]:
soup.body

<body class="skin-vector skin-vector-search-vue mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-Muhammad_Ali_Jinnah rootpage-Muhammad_Ali_Jinnah skin-vector-2022 action-view"><a class="mw-jump-link" href="#bodyContent">Jump to content</a>
<div class="vector-header-container">
<header class="vector-header mw-header">
<div class="vector-header-start">
<nav aria-label="Site" class="vector-main-menu-landmark" role="navigation">
<div class="vector-dropdown vector-main-menu-dropdown vector-button-flush-left vector-button-flush-right" id="vector-main-menu-dropdown">
<input aria-haspopup="true" aria-label="Main menu" class="vector-dropdown-checkbox" data-event-name="ui.dropdown-vector-main-menu-dropdown" id="vector-main-menu-dropdown-checkbox" role="button" type="checkbox"/>
<label aria-hidden="true" class="vector-dropdown-label cdx-button cdx-button--fake-button cdx-button--fake-button--enabled cdx-button--weight-quiet cdx-button--icon-only" for="vector-main-menu-dropdown-che

In [26]:
print(soup.text)

Muhammad Ali Jinnah - Wikipedia

Jump to content

Main menu

Main menu
move to sidebar
hide

Navigation

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate

Contribute

HelpLearn to editCommunity portalRecent changesUpload file

Languages

Language links are at the top of the page across from the title.

Search

Search

Create accountLog in

Personal tools

Create account Log in

Pages for logged out editors learn more

ContributionsTalk

Contents
move to sidebar
hide

(Top)

1Early years

Toggle Early years subsection

1.1Family and childhood

1.2Education in England

2Legal and early political career

Toggle Legal and early political career subsection

2.1Barrister

2.2Trade unionist

2.3Rising leader

2.4Farewell to Congress

3Wilderness years; interlude 
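
`soup.text` joins every text node exactly as it appears in the HTML, including whitespace-only nodes, so the result often contains long runs of blank lines and words fused together. A minimal sketch of a cleaner extraction uses `get_text()` with a separator and `strip=True`. The HTML string is a made-up fragment so the example runs offline.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the kind of whitespace found in real pages
html = "<div>\n\n  <h2>Contents</h2>\n\n\n  <ul><li>Early years</li>\n<li>Barrister</li></ul>\n</div>"
demo = BeautifulSoup(html, "html.parser")

# strip=True trims each text node and drops the empty ones;
# separator puts a newline between the nodes that remain
clean = demo.get_text(separator="\n", strip=True)
print(clean)
```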

In [28]:
for link in soup.find_all('a'):
    print(link.get('href'))

#bodyContent
/wiki/Main_Page
/wiki/Wikipedia:Contents
/wiki/Portal:Current_events
/wiki/Special:Random
/wiki/Wikipedia:About
//en.wikipedia.org/wiki/Wikipedia:Contact_us
https://donate.wikimedia.org/wiki/Special:FundraiserRedirector?utm_source=donate&utm_medium=sidebar&utm_campaign=C13_en.wikipedia.org&uselang=en
/wiki/Help:Contents
/wiki/Help:Introduction
/wiki/Wikipedia:Community_portal
/wiki/Special:RecentChanges
/wiki/Wikipedia:File_upload_wizard
/wiki/Main_Page
/wiki/Special:Search
/w/index.php?title=Special:CreateAccount&returnto=Muhammad+Ali+Jinnah
/w/index.php?title=Special:UserLogin&returnto=Muhammad+Ali+Jinnah
/w/index.php?title=Special:CreateAccount&returnto=Muhammad+Ali+Jinnah
/w/index.php?title=Special:UserLogin&returnto=Muhammad+Ali+Jinnah
/wiki/Help:Introduction
/wiki/Special:MyContributions
/wiki/Special:MyTalk
#
#Early_years
#Family_and_childhood
#Education_in_England
#Legal_and_early_political_career
#Barrister
#Trade_unionist
#Rising_leader
#Farewell_to_Congress
#Wil

/wiki/Category:Alumni_of_the_Inns_of_Court_School_of_Law
/wiki/Category:Cathedral_and_John_Connon_School_alumni
/wiki/Category:Church_Mission_School_alumni
/wiki/Category:Converts_to_Sunni_Islam_from_Shia_Islam
/wiki/Category:Expatriates_from_British_India_in_the_United_Kingdom
/wiki/Category:Governors-General_of_Pakistan
/wiki/Category:Indian_National_Congress_politicians
/wiki/Category:Indian_newspaper_founders
/wiki/Category:Infectious_disease_deaths_in_Sindh
/wiki/Category:Jinnah_family
/wiki/Category:Lawyers_from_Karachi
/wiki/Category:Members_of_the_Central_Legislative_Assembly_of_India
/wiki/Category:Members_of_Lincoln%27s_Inn
/wiki/Category:Members_of_the_Fabian_Society
/wiki/Category:Members_of_the_Imperial_Legislative_Council_of_India
/wiki/Category:Members_of_the_Pakistan_Philosophical_Congress
/wiki/Category:National_symbols_of_Pakistan
/wiki/Category:Pakistani_barristers
/wiki/Category:Pakistani_former_Shia_Muslims
/wiki/Category:Pakistani_MNAs_1947%E2%80%931954
/wiki/Cate
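
The `href` values printed above are a mix of page fragments (`#bodyContent`), root-relative paths (`/wiki/Main_Page`), protocol-relative URLs (`//en.wikipedia.org/...`), and full URLs. Before following them, they can be resolved against the page URL with the standard library's `urljoin`. The sketch below is one way to do that; `absolutize` is a hypothetical helper name, and skipping bare fragments is a design choice made for this example.

```python
from urllib.parse import urljoin

PAGE_URL = "https://en.wikipedia.org/wiki/Muhammad_Ali_Jinnah"


def absolutize(hrefs, base=PAGE_URL):
    """Resolve hrefs against base, skipping missing values and bare fragments."""
    return [urljoin(base, href) for href in hrefs
            if href and not href.startswith("#")]


# A few values copied from the output above, plus None for an <a> without href
sample = [
    "#bodyContent",
    "/wiki/Main_Page",
    "//en.wikipedia.org/wiki/Wikipedia:Contact_us",
    None,
]
print(absolutize(sample))
```

With the parsed page, the input list would be `[link.get('href') for link in soup.find_all('a')]`.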