# Lab 5: Web Scraping & API Interaction

This laboratory covers the extraction of data from web pages (Scraping) and interaction with external web services via REST and SOAP APIs.

| Cell # | Purpose | Key Details |
| :--- | :--- | :--- |
| **1-3** | **Setup** | Imports libraries and defines the target URL and headers. |
| **4** | **HTML Parsing** | Initializes the BeautifulSoup object with `lxml`. |
| **5** | **Title Scraping** | Extracts the `<title>` tag from the HTML. |
| **6** | **Table Scraping** | Finds and prints metadata from a Wikipedia infobox. |
| **7-9** | **Quote Scraping** | Loops through 'quote' classes to extract text and authors. |
| **10** | **CSS Selectors** | Uses `.select()` for more advanced tag targeting. |
| **11-14** | **REST (GET)** | Fetches and parses JSON data from a remote endpoint. |
| **15-16** | **REST (POST)** | Demonstrates sending new data to a server. |
| **17-19** | **SOAP API** | Connects to a WSDL service and calls a remote function. |

## 1. Imports
We import `requests` to fetch raw HTML and `BeautifulSoup` to parse it.

In [42]:
import requests
from bs4 import BeautifulSoup


## 2. Defining Target URL
We use the English Wikipedia article on web scraping as our demonstration target.

In [43]:
TARGET_URL = "https://en.wikipedia.org/wiki/Web_scraping"

## 3. Configuring Headers
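Setting a browser-like `User-Agent` header makes it less likely that the server rejects the request as coming from a default script client. The header only takes effect when it is passed to the request via `headers=headers`.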

In [44]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
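
To see what the header changes, the following minimal sketch compares the `User-Agent` that `requests` sends by default with the custom one. It only prepares a request and never sends it, so it runs offline; the variable names `default_ua` and `prepared` are illustrative and not part of the lab.

```python
import requests

# Browser-like header, as defined in the cell above.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    )
}

# The value requests uses when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

# Build (but do not send) a request to inspect its outgoing headers.
prepared = requests.Request(
    "GET", "https://en.wikipedia.org/wiki/Web_scraping", headers=headers
).prepare()

print("Default:", default_ua)
print("Custom: ", prepared.headers["User-Agent"])
```

The default value identifies the client as `python-requests`, which is the kind of signature a server can filter on.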


## 4. Initializing BeautifulSoup
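We fetch the page with `requests`, sending the headers defined above, and convert the returned HTML into a searchable object using the fast `lxml` parser. The object is named `soap` throughout this notebook; it is the BeautifulSoup object and is unrelated to the SOAP API used in sections 17-19.

In [45]:
response = requests.get(TARGET_URL, headers=headers, timeout=10)
html_content = response.text  # raw HTML that BeautifulSoup will parse
parser = 'lxml'
soap = BeautifulSoup(html_content, parser)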

## 5. Extracting the Page Title
A simple check to ensure the connection was successful and the correct page was retrieved.

In [46]:
page_title = soap.find('title').text
print(f"\nPage Title: {page_title}")


Page Title: Web scraping - Wikipedia


## 6. Table Data Extraction
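This cell looks for a table with the class attribute `infobox vevent` and prints each row's header (`th`) and value (`td`). The Web scraping article has no such table, so the fallback message is printed.

In [47]:
table = soap.find('table', class_='infobox vevent')
if table:
    rows = table.find_all('tr')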
    for row in rows:
        header = row.find('th')
        data = row.find('td')
        if header and data:
            print(f"{header.text.strip()}: {data.text.strip()}")
else:
    print("No infobox table found on the page.")

No infobox table found on the page.
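
Because the target page has no infobox, the row loop never runs. As a minimal sketch of what it does on a page that has one, the example below applies the same logic to a small inline table. The markup and its values are hypothetical, and `html.parser` is used instead of `lxml` only to avoid the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox markup, used only for illustration.
sample_html = """
<table class="infobox vevent">
  <tr><th>Developer</th><td> Example Corp </td></tr>
  <tr><th>License</th><td>MIT</td></tr>
  <tr><td colspan="2">Row without a header cell</td></tr>
</table>
"""

sample = BeautifulSoup(sample_html, "html.parser")
table = sample.find("table", class_="infobox vevent")

info = {}
if table:
    for row in table.find_all("tr"):
        header = row.find("th")
        value = row.find("td")
        # Skip rows that lack either cell, as the notebook cell does.
        if header and value:
            info[header.get_text(strip=True)] = value.get_text(strip=True)

print(info)
```

Collecting the pairs in a dictionary rather than printing them directly keeps the data available for later steps.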


## 7. Finding Quote Elements
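We create an empty list for the results and search for all `div` elements with the `quote` class. This markup does not occur in the Wikipedia article parsed above, so `quote_divs` is empty here; the same code applies unchanged to a page that wraps each quote in a `div.quote` container.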

In [48]:
quotes_data = []


quote_divs = soap.find_all('div', class_='quote')

## 8. Scraping Details via Loops
For each quote, we extract the main text, the author, and any associated tags.

In [49]:
for quote_div in quote_divs:
    
    text_element = quote_div.find('span', class_='text')
    quote_text = text_element.text if text_element else "N/A"
    
    author_element = quote_div.find('small', class_='author')
    author_name = author_element.text if author_element else "N/A"
    
    # Guard against a missing tags container, as is done for text and author.
    tags_div = quote_div.find('div', class_='tags')
    tag_list = [tag_item.text for tag_item in tags_div.find_all('a', class_='tag')] if tags_div else []
        
    quotes_data.append({
        'quote': quote_text,
        'author': author_name,
        'tags': tag_list
    })


## 9. Printing Extracted List
A clean output to verify the scraping results of the first few entries.

In [50]:
print("\n--- Extracted Quotes (First 3) ---")
for item in quotes_data[:3]:
    print(f"Author: {item['author']}\nQuote: {item['quote']}\nTags: {', '.join(item['tags'])}\n")


--- Extracted Quotes (First 3) ---
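
The list printed above is empty because the parsed Wikipedia page contains no `div.quote` elements. As a minimal sketch of the loop on matching input, the example below runs the same extraction on inline markup. The HTML, the placeholder quotes, and the helper name `extract_quotes` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in the shape the loop expects.
sample_html = """
<div class="quote">
  <span class="text">First sample quote.</span>
  <small class="author">Author One</small>
  <div class="tags"><a class="tag">alpha</a><a class="tag">beta</a></div>
</div>
<div class="quote">
  <span class="text">Second sample quote.</span>
  <small class="author">Author Two</small>
</div>
"""


def extract_quotes(soup):
    """Return one dict per div.quote, tolerating missing child elements."""
    records = []
    for quote_div in soup.find_all("div", class_="quote"):
        text_el = quote_div.find("span", class_="text")
        author_el = quote_div.find("small", class_="author")
        tags_div = quote_div.find("div", class_="tags")
        records.append({
            "quote": text_el.text if text_el else "N/A",
            "author": author_el.text if author_el else "N/A",
            # An absent tags container yields an empty list instead of an error.
            "tags": [a.text for a in tags_div.find_all("a", class_="tag")] if tags_div else [],
        })
    return records


quotes = extract_quotes(BeautifulSoup(sample_html, "html.parser"))
for item in quotes:
    print(item)
```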


## 10. Using CSS Selectors
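This cell uses `.select()` to target nested elements directly with a CSS selector. `div.quote span.text` matches every `span` with class `text` inside a `div` with class `quote`; the Wikipedia page contains none, hence the count of 0.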

In [51]:
quote_texts = soap.select('div.quote span.text')

print(f"Found {len(quote_texts)} quotes using CSS selector.")

Found 0 quotes using CSS selector.
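
To show the selector returning matches, here is a minimal sketch on hypothetical inline markup. It also uses `.select_one()`, which returns the first match or `None`; the sample HTML and variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two quote containers and one unrelated div.
sample_html = """
<div class="quote"><span class="text">First</span><small class="author">A</small></div>
<div class="quote"><span class="text">Second</span><small class="author">B</small></div>
<div class="other"><span class="text">Not a quote</span></div>
"""
sample = BeautifulSoup(sample_html, "html.parser")

# Descendant selector: span.text anywhere inside div.quote.
texts = [el.get_text() for el in sample.select("div.quote span.text")]

# Child selector with select_one: first small.author directly under div.quote.
first_author = sample.select_one("div.quote > small.author")

print(texts)
print(first_author.get_text())
```

The span inside `div.other` is ignored because the selector requires a `div.quote` ancestor.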


## 11. REST API Introduction
We import `requests` again so that the REST section can run on its own; it sends the HTTP requests and decodes the JSON responses.

In [52]:
import requests

## 12. Defining API Endpoint
We use the JSONPlaceholder 'Todos' endpoint for testing GET requests.

In [53]:
api_url = "https://jsonplaceholder.typicode.com/todos"

## 13. Executing GET Request
Fetching data from the remote server.

In [54]:
response = requests.get(api_url)

## 14. Processing JSON Data
Checking for success (200) and printing the first 5 records as Python dictionaries.

In [55]:
if response.status_code == 200:
    data = response.json()
    print("\n--- API Response Data ---")
    for item in data[:5]:
        print(item)
else:
    print(f"API request failed with status code: {response.status_code}")


--- API Response Data ---
{'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}
{'userId': 1, 'id': 2, 'title': 'quis ut nam facilis et officia qui', 'completed': False}
{'userId': 1, 'id': 3, 'title': 'fugiat veniam minus', 'completed': False}
{'userId': 1, 'id': 4, 'title': 'et porro tempora', 'completed': True}
{'userId': 1, 'id': 5, 'title': 'laboriosam mollitia et enim quasi adipisci quia provident illum', 'completed': False}
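
Once decoded, the response is an ordinary list of dictionaries and can be filtered like any other Python data. The sketch below works offline on the five records printed above, written back as JSON text; the variable names are illustrative.

```python
import json

# The five records shown above, as the JSON text a server would send.
raw = """[
  {"userId": 1, "id": 1, "title": "delectus aut autem", "completed": false},
  {"userId": 1, "id": 2, "title": "quis ut nam facilis et officia qui", "completed": false},
  {"userId": 1, "id": 3, "title": "fugiat veniam minus", "completed": false},
  {"userId": 1, "id": 4, "title": "et porro tempora", "completed": true},
  {"userId": 1, "id": 5, "title": "laboriosam mollitia et enim quasi adipisci quia provident illum", "completed": false}
]"""

# json.loads maps JSON false/true to Python False/True.
todos = json.loads(raw)

completed = [t for t in todos if t["completed"]]
pending_titles = [t["title"] for t in todos if not t["completed"]]

print(f"{len(completed)} of {len(todos)} completed")
print(pending_titles[:2])
```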


## 15. Defining New POST Data
Setting up a task object to send to the server.

In [56]:
data = {
    "userId": 1,
    "title": "yoo",
    "completed": False
}

## 16. Executing POST Request
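Sending the data and checking the ID the server assigns. The payload has to be passed explicitly with `json=data`; the output recorded below came from a call that omitted it, which is why the server echoed only the new `id`.

In [57]:
post_response = requests.post(api_url, json=data)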
print(f"\nPOST Request Status Code: {post_response.status_code}")
print(f"POST Response Content: {post_response.text}")


POST Request Status Code: 201
POST Response Content: {
  "id": 201
}
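
Whether a payload actually leaves the client can be checked without contacting the server. The following minimal sketch prepares two POST requests, one without and one with `json=`, and inspects what each would send; the names `without_body` and `with_body` are illustrative.

```python
import json
import requests

api_url = "https://jsonplaceholder.typicode.com/todos"
data = {"userId": 1, "title": "yoo", "completed": False}

# Prepare both variants without sending them.
without_body = requests.Request("POST", api_url).prepare()
with_body = requests.Request("POST", api_url, json=data).prepare()

print(without_body.body)                    # no payload attached
print(with_body.headers["Content-Type"])    # set automatically by json=
print(json.loads(with_body.body))           # the serialized dictionary
```

Passing `json=` both serializes the dictionary and sets the `Content-Type` header, so neither step has to be done by hand.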


## 17. SOAP API Setup
Introduction to the Simple Object Access Protocol (SOAP) using the `zeep` client library.

In [58]:
from zeep import Client


## 18. Connecting to WSDL URL
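Initializing the client from the WSDL (Web Services Description Language) document of a remote calculator service; `zeep` reads it to learn which operations the service offers.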

In [59]:
wsdl_url = "http://www.dneonline.com/calculator.asmx?WSDL"
client = Client(wsdl=wsdl_url)

## 19. Remote Function Call
Executing the `Add` operation on the remote server and printing the result.

In [60]:
result = client.service.Add(intA=10, intB=20)
print(f"\nSOAP Add Operation Result: {result}")


SOAP Add Operation Result: 30
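
Behind the `Add` call, the client sends an XML envelope to the service. The sketch below builds an approximation of such a message with the standard library to show its shape. It assumes SOAP 1.1 and a service namespace of `http://tempuri.org/`; the helper `build_add_envelope` is hypothetical, and the exact XML that `zeep` produces may differ.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://tempuri.org/"  # assumed target namespace of the calculator service


def build_add_envelope(int_a, int_b):
    """Return an illustrative SOAP 1.1 request for Add(intA, intB)."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    add = ET.SubElement(body, f"{{{SERVICE_NS}}}Add")
    # Parameter names mirror the keyword arguments used in the notebook cell.
    ET.SubElement(add, f"{{{SERVICE_NS}}}intA").text = str(int_a)
    ET.SubElement(add, f"{{{SERVICE_NS}}}intB").text = str(int_b)
    return ET.tostring(envelope, encoding="unicode")


envelope_xml = build_add_envelope(10, 20)
print(envelope_xml)
```

Compared with the REST calls earlier, the operation name and its arguments travel inside the XML body rather than in the URL or a JSON payload.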
