#  WEEK - 3 DATA SCIENCE 
#  Web Scraping with Different Tools and Techniques 

# 1. Scraping Data Using Requests and BeautifulSoup

# 1.1 Installing Necessary Libraries 

In [2]:
!pip install requests
!pip install beautifulsoup4




# 1.2 Importing Necessary Libraries

In [28]:
import requests
from bs4 import BeautifulSoup
import csv

# 1.3 Get Request to Website 

In [37]:
# Send a GET request to the website; fail fast on timeouts and HTTP errors

url = 'https://www.nationalgeographic.com/'
response = requests.get(url, timeout=10)
response.raise_for_status()

# 1.4 Parse the HTML Content

In [38]:
# Parse the HTML content of the page

soup = BeautifulSoup(response.text, 'html.parser')

# 1.5 Extracting Relevant Data (All Headings)

In [39]:
# Extract relevant data (e.g., all headings)

headings = soup.find_all('h2') 
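
`find_all('h2')` returns tag objects, not strings, so the text still has to be pulled out and tidied. As a rough sketch of that step, the cell below parses a small inline HTML snippet (made up for illustration, not taken from the site) and keeps only the non-empty heading text. The names `sample_html` and `titles` are assumptions introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for response.text
sample_html = """
<html><body>
  <h2>  Animals </h2>
  <h2><a href="/travel">Travel</a></h2>
  <h2></h2>
  <p>Not a heading</p>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) removes surrounding whitespace and flattens nested tags
titles = [h.get_text(strip=True) for h in soup.find_all("h2")]

# Drop headings that contain no text at all
titles = [t for t in titles if t]

print(titles)
```

On a real page the same two list comprehensions can be applied to the `headings` result directly.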

# 1.6 Storing Extracted Data into a CSV File

In [40]:
# Store the extracted data in a CSV file (UTF-8 so non-ASCII headings are kept)

with open('scraped_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Heading'])  # Column header
    for heading in headings:
        writer.writerow([heading.get_text(strip=True)])

print("Data has been saved to scraped_data.csv")

Data has been saved to scraped_data.csv
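
The loop above writes every `<h2>` it finds, so empty or repeated headings end up in the file as well. One possible way to tidy the rows first is sketched below. The helper `clean_headings` and the sample list `raw` are hypothetical, and the CSV is written to an in-memory buffer rather than to `scraped_data.csv` so the cell has no side effects.

```python
import csv
import io


def clean_headings(raw_headings):
    """Collapse whitespace, drop empty strings, remove duplicates (order kept)."""
    seen = set()
    cleaned = []
    for text in raw_headings:
        text = " ".join(text.split())  # collapse runs of spaces and newlines
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned


# Hypothetical sample standing in for [h.get_text() for h in headings]
raw = ["  Animals\n", "Travel", "", "Travel", "Science   &  Nature"]
rows = clean_headings(raw)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Heading"])              # column header
writer.writerows([row] for row in rows)   # one heading per row
csv_text = buffer.getvalue()

print(csv_text)
```

Swapping the buffer for `open('scraped_data.csv', 'w', newline='', encoding='utf-8')` would write the same rows to disk.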


# 2. Scraping Data Using Selenium for Dynamic Content

# 2.1  Installation of Required Libraries

In [15]:
pip install selenium



# 2.2 Importing Necessary Libraries

In [16]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import csv

# 2.3 Setting Up Selenium WebDriver

In [17]:
# Set up Selenium WebDriver

driver = webdriver.Chrome()

# 2.4 URL of the Website

In [20]:
url = 'https://www.tripadvisor.com/'
driver.get(url)

# 2.5 Scraping Data

In [21]:
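# Scrape data (e.g., all headings).
# The page builds part of its content with JavaScript after the initial load,
# so wait up to 10 seconds for at least one <h2> to appear.

driver.implicitly_wait(10)
headings = driver.find_elements(By.TAG_NAME, 'h2')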

# 2.6 Storing Extracted Data into a CSV file

In [22]:
# Store the extracted data in a CSV file (UTF-8 so non-ASCII headings are kept)

with open('scraped_data_selenium.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Heading'])  # Column header
    for heading in headings:
        writer.writerow([heading.text])

# 2.7 Closing the Browser 

In [23]:
#  Close the browser

driver.quit()

print("Data has been saved to scraped_data_selenium.csv")

Data has been saved to scraped_data_selenium.csv
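
If an exception is raised anywhere between `driver.get(url)` and `driver.quit()`, the browser window stays open. One way to guard against that is a `try`/`finally` wrapper, sketched below. The function name `scrape_headings` and the `FakeDriver` and `FakeElement` stand-ins are hypothetical; the stand-ins exist only so the cell runs without launching Chrome. With Selenium you would pass `webdriver.Chrome()` as the driver and `By.TAG_NAME` as the locator; the literal string `"tag name"` is the value that constant holds.

```python
def scrape_headings(driver, url, by="tag name", value="h2"):
    """Return the text of matching elements and always close the browser."""
    try:
        driver.get(url)
        return [element.text for element in driver.find_elements(by, value)]
    finally:
        # Runs whether or not the lines above raised an exception
        driver.quit()


class FakeElement:
    """Hypothetical stand-in for a WebElement; only exposes .text."""

    def __init__(self, text):
        self.text = text


class FakeDriver:
    """Hypothetical stand-in exposing the three WebDriver methods used above."""

    def __init__(self, fail=False):
        self.fail = fail
        self.closed = False

    def get(self, url):
        if self.fail:
            raise RuntimeError("page failed to load")

    def find_elements(self, by, value):
        return [FakeElement("Heading A"), FakeElement("Heading B")]

    def quit(self):
        self.closed = True


# Normal run: headings are returned and the driver is closed
ok_driver = FakeDriver()
texts = scrape_headings(ok_driver, "https://www.tripadvisor.com/")
print(texts, ok_driver.closed)

# Failing run: the error propagates, but the driver is still closed
bad_driver = FakeDriver(fail=True)
try:
    scrape_headings(bad_driver, "https://www.tripadvisor.com/")
except RuntimeError:
    pass
print(bad_driver.closed)
```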


# Scrapy Method

**I implemented the Scrapy method in VS Code and have shared that file separately.**

# Comparison of Methods

**After scraping with each of the three methods, here is how I would compare them:**

**1. Ease of Use:**

**requests + BeautifulSoup:** Simple to use, requires minimal setup, and is ideal for static content.

**Scrapy:** More complex to set up, since it expects a project with spiders rather than a single script, but it offers better scalability and control for large scraping tasks.

**Selenium:** Needs a real browser (here Chrome) in addition to the Python package, and scripts must account for page load timing, so it takes more care than requests + BeautifulSoup.

**2. Speed:**

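**requests + BeautifulSoup:** Fastest for a single page or a handful of pages, because it only downloads the HTML and does not load a browser.

**Scrapy:** Fast, especially for large-scale scraping tasks, because it issues requests asynchronously and can fetch many pages concurrently.

**Selenium:** Slowest, because every page is loaded and rendered in a real browser.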

**3. Handling Dynamic Content:**

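**requests + BeautifulSoup:** Does not execute JavaScript, so content generated in the browser after the page loads is missing from the parsed HTML.

**Scrapy:** Does not render JavaScript out of the box, but can be extended with middleware (e.g., scrapy-splash, which connects to the Splash rendering service) to handle it.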

**Selenium:** Best for scraping JavaScript-heavy websites (renders dynamic content).
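
To illustrate the first point in this list: a static parser only sees markup that is present in the downloaded HTML. The sketch below uses Python's built-in `html.parser` module, which is the parser that `BeautifulSoup(..., 'html.parser')` wraps, on a made-up page whose second heading would only exist after a browser ran the script. The class `H2Collector` and the sample `page` are assumptions introduced for this example.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collect the text of every <h2> found in the static markup."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False

    def handle_data(self, data):
        if self.in_h2:
            self.headings[-1] += data


# Hypothetical page: the second heading is only created when the script runs
page = """
<html><body>
  <h2>Static heading</h2>
  <div id="feed"></div>
  <script>
    document.getElementById("feed").innerHTML = "<h2>Loaded by JavaScript</h2>";
  </script>
</body></html>
"""

collector = H2Collector()
collector.feed(page)
collector.close()

# The script body is treated as plain text, so its <h2> is never seen as a tag
print(collector.headings)
```

A browser driven by Selenium would run the script first, so `driver.find_elements(By.TAG_NAME, 'h2')` on the rendered page would see both headings.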


#  Conclusion

**Each scraping technique has its strengths and weaknesses, depending on the complexity of the website you're trying to scrape.**

**For static websites, requests + BeautifulSoup is often the easiest and fastest solution.**

**For more complex tasks, Scrapy provides a powerful, scalable approach.**

**When dealing with JavaScript-heavy websites, Selenium shines despite its slower performance.**