# Web Scraping Analysis

## What is Web Scraping?
Web scraping is the process of automatically collecting data from websites. It is like copying and pasting from a webpage, except that a program gathers the data for you and delivers it in the format you need. The program visits the website, reads the HTML behind the page (which you can inspect yourself in the browser's developer tools, usually opened with Ctrl + Shift + I), picks out the information you want (such as text, images, or links), and saves it for later use.

## Why Web Scraping?
- **Efficient Data Gathering**: Web scrapers can collect and process large amounts of data quickly, unlike browsing websites manually.
- **Beyond Search Engines**: They can target specific data across multiple websites, including pages that search engines like Google do not index.
- **API Alternative**: Web scraping is useful when APIs don't exist, are limited, or don't provide the exact data needed.
- **Versatile Applications**: Scraped data can be used for market analysis, trend tracking, and creative projects, giving access to data that is published on the web but not offered as a download.
- **Business and Innovation**: Web scraping can guide business decisions, boost productivity, and open new creative avenues.

## Methods of Web Scraping

1. **Manual Copy-Pasting**

- Description: The simplest form of web scraping, where data is manually copied from a website and pasted into a file or spreadsheet.
- Use Case: Suitable for very small datasets or when automation isn't worth the effort.

2. **HTTP Requests (Using Libraries like requests)**

- Description: This method involves sending HTTP requests to a website's server to retrieve the raw HTML content of a webpage.
- Tools: Python's requests library.
- Use Case: Ideal for scraping static websites where the content is directly available in the HTML. This method only returns the full HTML of the page; the parsing methods described next are what locate and extract the specific information you need.

3. **Parsing HTML (Using Libraries like BeautifulSoup or lxml)**

- Description: After retrieving the HTML, this method involves parsing the HTML content to extract specific data points.
- Tools: Python's BeautifulSoup or lxml libraries.
- Use Case: Suitable for extracting structured data like tables, lists, or specific elements (e.g., text within certain tags).

4. **Browser Automation (Using Tools like Selenium or Playwright)**

- Description: This method involves automating a web browser to interact with dynamic websites that use JavaScript to load content.
- Tools: Selenium, Playwright.
- Use Case: Best for scraping websites that require interaction (e.g., clicking buttons, filling forms) or where data is loaded dynamically.

5. **Headless Browsers (Using Tools like Puppeteer)**

- Description: Similar to browser automation, but using a headless browser that runs without a graphical user interface, making it faster and more resource-efficient.
- Tools: Puppeteer.
- Use Case: Ideal for scraping dynamic websites while reducing overhead from a full browser UI.

6. **API Scraping**

- Description: If a website provides an API, you can send requests directly to the API to retrieve structured data in a format like JSON or XML.
- Tools: Python's requests library or specific API clients.
- Use Case: Preferable when an API is available, as it is often more reliable and faster than scraping HTML.

7. **Scraping with Headless CMS Crawlers**

- Description: Some tools are designed to scrape content from headless CMS platforms that serve content via APIs.
- Tools: Custom-built crawlers or specialized tools.
- Use Case: Useful for scraping content-rich websites built on modern content management systems.

8. **Web Scraping Frameworks (Using Tools like Scrapy)**

- Description: Frameworks like Scrapy provide a comprehensive environment for building and managing web scraping projects, including scheduling, data storage, and more.
- Tools: Scrapy.
- Use Case: Best for large-scale scraping projects where you need to manage multiple spiders, handle complex data extraction, and store data efficiently.

9. **Data Extraction Services**

- Description: Third-party services or tools that provide web scraping as a service, where you specify the data you need, and they handle the scraping.
- Tools: Import.io, Octoparse.
- Use Case: Suitable for users who want to avoid the technical aspects of web scraping and focus solely on the data.

10. **Custom Scrapers**

- Description: Writing custom scripts tailored to specific websites, using a combination of HTTP requests, HTML parsing, and automation as needed.
- Tools: Python, Node.js, etc.
- Use Case: Useful when dealing with unique websites that require specific handling or when pre-built tools don't offer the needed flexibility.
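
Method 6 (API scraping) is the only approach in the list above that returns structured data directly, so no HTML parsing is involved. The sketch below illustrates that idea offline: `payload` is a hypothetical JSON document standing in for an API response, and the field names (`results`, `title`, `views`) and the helper `extract_titles` are assumptions made for illustration, not part of any real API.

```python
import json

# Hypothetical payload, shaped like what a JSON API might return.
payload = '''
{
  "results": [
    {"title": "Web scraping", "url": "https://example.org/a", "views": 120},
    {"title": "Data mining", "url": "https://example.org/b", "views": 45}
  ]
}
'''

def extract_titles(raw_json, min_views=0):
    """Return the titles of all records with at least `min_views` views."""
    data = json.loads(raw_json)  # decode the JSON text into dicts and lists
    return [item["title"] for item in data["results"] if item["views"] >= min_views]

print(extract_titles(payload))
print(extract_titles(payload, min_views=100))
```

With a real API, the string passed to `extract_titles` would typically come from the body of an HTTP response rather than from an inline literal.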

# Beautiful Soup

In [1]:
%pip install requests beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


In [2]:
import requests
from bs4 import BeautifulSoup

In [3]:
# URL of the website you want to scrape
url = 'https://www.wikipedia.org/'

# Send an HTTP GET request to the URL (with a timeout so the call cannot hang indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print("Request was successful!")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Request was successful!
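
The check above treats only status code 200 as success, although every 2xx code indicates a successful request. The `requests` library also exposes `response.ok` and `response.raise_for_status()` for this purpose. The sketch below uses only the standard library to show how status codes fall into classes; the helper name `describe_status` is an assumption made for illustration.

```python
from http import HTTPStatus

def describe_status(code):
    """Classify an HTTP status code and attach its standard reason phrase."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Codes that are not registered in the standard library enum
        phrase = "Unknown"

    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"

for code in (200, 404, 503):
    print(describe_status(code))
```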


In [4]:
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Print the title of the webpage
print(soup.title.string)

Wikipedia


In [5]:
# Find all paragraph tags in the HTML
paragraphs = soup.find_all('p')

# Print the text of each paragraph
for i, paragraph in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {paragraph.get_text()}\n")

Paragraph 1: 
Save your favorite articles to read offline, sync your reading lists across devices and customize your reading experience with the official Wikipedia app.


Paragraph 2: 
This page is available under the Creative Commons Attribution-ShareAlike License
Terms of Use
Privacy Policy




In [6]:
# Save the paragraphs to a text file (UTF-8, so non-ASCII characters are written reliably on any platform)
with open('scraped_paragraphs.txt', 'w', encoding='utf-8') as file:
    for paragraph in paragraphs:
        file.write(paragraph.get_text() + '\n')

# LXML

In [7]:
%pip install requests lxml

Note: you may need to restart the kernel to use updated packages.


In [8]:
import requests
from lxml import html

In [9]:
# URL of the website you want to scrape
url = 'https://www.wikipedia.org/'

# Send an HTTP GET request to the URL (with a timeout so the call cannot hang indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print("Request was successful!")
else:
    print(f"Failed to retrieve the page. Status code: {response.status_code}")

Request was successful!


In [10]:
# Parse the HTML content using lxml
tree = html.fromstring(response.content)

# Print the title of the webpage
title = tree.xpath('//title/text()')
print("Page Title:", title[0] if title else "No title found")

Page Title: Wikipedia


In [11]:
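# Extract paragraph text using XPath.
# Note: '//p/text()' returns each direct text node of the <p> elements, not one string per paragraph.
# Text nested inside child tags (such as links) is skipped, which is why some entries below are only whitespace.
paragraphs = tree.xpath('//p/text()')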

# Print the text of each paragraph
for i, paragraph in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {paragraph}\n")

Paragraph 1: 
Save your favorite articles to read offline, sync your reading lists across devices and customize your reading experience with the official Wikipedia app.


Paragraph 2: 


Paragraph 3: 


Paragraph 4: 


Paragraph 5: 




In [12]:
# Save the paragraphs to a text file (UTF-8, so non-ASCII characters are written reliably on any platform)
with open('scraped_paragraphs_lxml.txt', 'w', encoding='utf-8') as file:
    for paragraph in paragraphs:
        file.write(paragraph + '\n')
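
The lxml run printed five entries, most of them empty, while Beautiful Soup printed two full paragraphs. The difference is between an element's direct text and the text of all its descendants. The sketch below shows that distinction with the standard library's `xml.etree.ElementTree`, used here as a stand-in because lxml follows the same `.text` / `itertext()` element API; the HTML fragment and the helper `direct_and_full_text` are made up for illustration. In lxml itself, elements parsed with `lxml.html` also offer `text_content()` for the full text.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: one plain paragraph, one paragraph with links inside it.
fragment = (
    "<div>"
    "<p>Plain text only.</p>"
    "<p>See the <a href='#'>Terms of Use</a> and the <a href='#'>Privacy Policy</a>.</p>"
    "</div>"
)

root = ET.fromstring(fragment)

def direct_and_full_text(paragraph):
    """Return (text before the first child element, text of all descendants)."""
    direct = paragraph.text or ""
    full = "".join(paragraph.itertext())
    return direct, full

results = [direct_and_full_text(p) for p in root.iter("p")]
for direct, full in results:
    print(repr(direct), "|", repr(full))
```

For the second paragraph the direct text stops at the first link, while the full text includes the link labels.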

# Scrapy

In [13]:
%pip install scrapy

Note: you may need to restart the kernel to use updated packages.


In [1]:
import scrapy
from scrapy.crawler import CrawlerProcess

In [2]:
# Define the Spider
class ExampleSpider(scrapy.Spider):
    name = 'example_spider'
    start_urls = ['https://www.wikipedia.org/']

    def parse(self, response):
        # Extract the title of the page
        title = response.xpath('//title/text()').get()
        print(f"Page Title: {title}")

        # Extract all paragraphs on the page
        paragraphs = response.xpath('//p/text()').getall()
        for i, paragraph in enumerate(paragraphs, 1):
            print(f"Paragraph {i}: {paragraph}\n")

        # Example: Return data to Scrapy's pipeline (can be processed later)
        yield {
            'title': title,
            'paragraphs': paragraphs
        }

In [3]:
# Create a CrawlerProcess with the settings
process = CrawlerProcess(settings={
    "FEEDS": {
        "output.json": {"format": "json"},  # Save output to a JSON file
    },
})

# Start the crawling process with the defined spider
process.crawl(ExampleSpider)
process.start()  # The script will block here until the spider finishes

2024-08-25 20:55:27 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2024-08-25 20:55:27 [scrapy.utils.log] INFO: Versions: lxml 4.9.2.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.11.4 | packaged by Anaconda, Inc. | (main, Jul  5 2023, 13:38:37) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.2.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 41.0.2, Platform Windows-10-10.0.19045-SP0
2024-08-25 20:55:27 [scrapy.crawler] INFO: Overridden settings:
{}




See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-08-25 20:55:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-08-25 20:55:27 [scrapy.extensions.telnet] INFO: Telnet Password: c16d9eecacac7d19
2024-08-25 20:55:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2024-08-25 20:55:27 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.down

Page Title: Wikipedia
Paragraph 1: 
Save your favorite articles to read offline, sync your reading lists across devices and customize your reading experience with the official Wikipedia app.


Paragraph 2: 


Paragraph 3: 


Paragraph 4: 


Paragraph 5: 




In [4]:
import json

# Load and display the scraped data
with open('output.json') as f:
    data = json.load(f)

data

[{'title': 'Wikipedia',
  'paragraphs': ['\nSave your favorite articles to read offline, sync your reading lists across devices and customize your reading experience with the official Wikipedia app.\n',
   '\n',
   '\n',
   '\n',
   '\n']}]
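
The exported item still contains the whitespace-only entries produced by `//p/text()`. A minimal post-processing sketch is shown below; `data` repeats the structure displayed above, and the helper name `strip_blank_paragraphs` is an assumption made for illustration. In a larger Scrapy project this kind of cleanup would more commonly live in an item pipeline.

```python
# Same structure as the output.json content displayed above.
data = [{
    "title": "Wikipedia",
    "paragraphs": [
        "\nSave your favorite articles to read offline, sync your reading lists "
        "across devices and customize your reading experience with the official "
        "Wikipedia app.\n",
        "\n", "\n", "\n", "\n",
    ],
}]

def strip_blank_paragraphs(items):
    """Return copies of the items with paragraphs trimmed and empty ones removed."""
    return [
        {**item, "paragraphs": [p.strip() for p in item["paragraphs"] if p.strip()]}
        for item in items
    ]

cleaned = strip_blank_paragraphs(data)
print(len(cleaned[0]["paragraphs"]))
print(cleaned[0]["paragraphs"][0])
```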

# Selenium

In [5]:
%pip install selenium

Note: you may need to restart the kernel to use updated packages.


In [13]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.service import Service
import time

In [26]:
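# Path to the ChromeDriver executable (chromedriver.exe), not to the Chrome browser itself.
# The run logged below pointed this at chrome.exe, so the Service started the browser instead of the driver
# and the call hung until it was interrupted.
# Placeholder path: replace it with the location of your own chromedriver.exe.
# Chrome for Testing download (browser build; the driver version must match the browser version):
# https://storage.googleapis.com/chrome-for-testing-public/128.0.6613.84/win32/chrome-win32.zip
chrome_driver_path = 'C:/path/to/chromedriver.exe'

# Create a Service object with the path to ChromeDriver
service = Service(executable_path=chrome_driver_path)

# Initialize the WebDriver using the Service object
# (Selenium 4.6+ can also locate a driver automatically if webdriver.Chrome() is called without a Service)
driver = webdriver.Chrome(service=service)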

# Open the webpage
url = "https://www.wikipedia.org/"
driver.get(url)

# Extract the title of the page
title = driver.title
print("Page Title:", title)

# Extract all paragraph texts
paragraphs = driver.find_elements(By.TAG_NAME, 'p')
for i, paragraph in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {paragraph.text}\n")

# Close the browser and end the WebDriver session
driver.quit()


2024-08-25 22:06:56 [selenium.webdriver.common.service] DEBUG: Started executable: `C:/Program Files/Google/Chrome/Application/chrome.exe` in a child process with pid: 11588


KeyboardInterrupt: 

# Playwright

In [27]:
%pip install playwright

Collecting playwright
  Obtaining dependency information for playwright from https://files.pythonhosted.org/packages/ba/27/b5f21695ee2ea32fdf826e531066e5633e1056171e217bac3daeefa46017/playwright-1.46.0-py3-none-win_amd64.whl.metadata
  Downloading playwright-1.46.0-py3-none-win_amd64.whl.metadata (3.5 kB)
Collecting greenlet==3.0.3 (from playwright)
  Obtaining dependency information for greenlet==3.0.3 from https://files.pythonhosted.org/packages/47/79/26d54d7d700ef65b689fc2665a40846d13e834da0486674a8d4f0f371a47/greenlet-3.0.3-cp311-cp311-win_amd64.whl.metadata
  Downloading greenlet-3.0.3-cp311-cp311-win_amd64.whl.metadata (3.9 kB)
Collecting pyee==11.1.0 (from playwright)
  Obtaining dependency information for pyee==11.1.0 from https://files.pythonhosted.org/packages/16/cc/5cea8a0a0d3deb90b5a0d39ad1a6a1ccaa40a9ea86d793eb8a49d32a6ed0/pyee-11.1.0-py3-none-any.whl.metadata
  Downloading pyee-11.1.0-py3-none-any.whl.metadata (2.8 kB)
Downloading playwright-1.46.0-py3-none-win_amd64.whl 

In [28]:
from playwright.sync_api import sync_playwright  # The sync API fails inside Jupyter, which already runs an asyncio event loop (see the error below)

# Define the scraping function
def scrape_website(url):
    with sync_playwright() as p:
        # Launch the browser
        browser = p.chromium.launch(headless=True)  # You can set headless=False to see the browser window
        # Create a new browser context
        context = browser.new_context()
        # Open a new page
        page = context.new_page()
        # Navigate to the desired URL
        page.goto(url)
        
        # Extract text from the page (for example, the title)
        title = page.title()
        print(f"Page title: {title}")
        
        # Take a screenshot and save it
        page.screenshot(path='screenshot.png')
        
        # Close the browser
        browser.close()

# Example usage
scrape_website('https://www.wikipedia.org/')


Error: It looks like you are using Playwright Sync API inside the asyncio loop.
Please use the Async API instead.

In [30]:
%pip install nest_asyncio

Note: you may need to restart the kernel to use updated packages.


In [32]:
import asyncio
from playwright.async_api import async_playwright

async def scrape_website(url):
    async with async_playwright() as p:
        # Launch the browser
        browser = await p.chromium.launch(headless=True)  # Set headless=False to see the browser window
        # Create a new browser context
        context = await browser.new_context()
        # Open a new page
        page = await context.new_page()
        # Navigate to the desired URL
        await page.goto(url)
        
        # Extract text from the page (for example, the title)
        title = await page.title()
        print(f"Page title: {title}")
        
        # Take a screenshot and save it
        await page.screenshot(path='screenshot.png')
        
        # Close the browser
        await browser.close()

# Run the async function (inside Jupyter a loop is already running, so `await scrape_website(...)` is the usual form there)
if __name__ == "__main__":
    asyncio.run(scrape_website('https://www.wikipedia.org/'))


2024-08-25 22:16:23 [asyncio] ERROR: Task exception was never retrieved
future: <Task finished name='Task-3' coro=<Connection.run() done, defined at c:\Users\Admin\anaconda3\Lib\site-packages\playwright\_impl\_connection.py:265> exception=NotImplementedError()>
Traceback (most recent call last):
  File "c:\Users\Admin\anaconda3\Lib\site-packages\IPython\core\interactiveshell.py", line 3508, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "C:\Users\Admin\AppData\Local\Temp\ipykernel_19348\3696586122.py", line 27, in <module>
    asyncio.run(scrape_website('https://www.wikipedia.org/'))
  File "c:\Users\Admin\anaconda3\Lib\site-packages\nest_asyncio.py", line 35, in run
    return loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\Admin\anaconda3\Lib\site-packages\nest_asyncio.py", line 90, in run_until_complete
    return f.result()
           ^^^^^^^^^^
  File "c:\Users\Admin\anaconda3\Lib\asyncio\futures.py", line 203, in r

NotImplementedError: 
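
Both Playwright errors above come from the event loop that the notebook kernel already runs. One workaround that is sometimes suggested is to run the synchronous scraping function in a worker thread, where no asyncio loop is running. The sketch below shows only that threading pattern: `run_in_worker_thread` is a hypothetical helper, and `fake_scrape` is a stand-in for a blocking function such as the sync-API `scrape_website` defined earlier. Whether this also avoids the Windows `NotImplementedError` depends on the event loop policy in use, so treat it as something to try, not a guaranteed fix; running the code as a standalone script outside the notebook is the other common route.

```python
from concurrent.futures import ThreadPoolExecutor

def run_in_worker_thread(func, *args, **kwargs):
    """Run a blocking callable in a separate thread and return its result."""
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(func, *args, **kwargs)
        # result() blocks until the callable finishes and re-raises its exceptions
        return future.result()

# Stand-in for a blocking scraper function (hypothetical, no browser involved).
def fake_scrape(url):
    return f"scraped {url}"

print(run_in_worker_thread(fake_scrape, "https://www.wikipedia.org/"))
```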