# Data Hunting and Gathering (Part 1)

## INTRO

![Web Scraping](http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg)

Welcome to the first part of our journey into the world of web scraping. Web scraping, also known as web harvesting or web data extraction, is a technique used for extracting data from websites. This process involves fetching the web page and then extracting data from it.

### Why Learn Web Scraping?
Understanding how to scrape data from the web is a valuable skill for any data professional. In the digital era, data is the new gold, and web scraping is the mining equipment. Here's why it's essential:

- **Data Availability**: The internet is a vast source of data for all kinds of analyses, from market trends to academic research.
- **Automation**: Web scraping can automate the process of collecting data, saving time and effort.
- **Competitive Advantage**: In many fields, having timely and relevant data can be a game-changer.

### Real-World Applications
- **Market Research**: Analyzing competitors, understanding customer sentiments, and identifying market trends.
- **Price Comparison**: Aggregating pricing data from various websites for comparison shopping.
- **Social Media Analysis**: Gathering data from social networks for sentiment analysis or trend spotting.

### Ethical Considerations in Web Scraping

Web scraping, while a powerful technique for data extraction, comes with significant ethical and legal responsibilities. As budding data scientists and web scrapers, it's crucial to navigate this landscape with a deep understanding and respect for these considerations.

### Respecting Website Policies and Laws

- **Adhering to Terms of Service**: Every website has its own set of rules, usually outlined in its Terms of Service (ToS). It's important to read and understand these rules before scraping, as violating them can have legal implications.

- **Following Copyright Laws**: The data you scrape is often copyrighted. Ensure that your use of scraped data complies with copyright laws and respects intellectual property rights.

- **Privacy Concerns**: Be mindful of personal data. Scraping and using personal information without consent can breach privacy laws and ethical standards.

### Example: Understanding Google's `robots.txt`

Google's `robots.txt` file is an excellent example of how websites communicate their scraping policies. Accessible at [Google's robots.txt](https://www.google.com/robots.txt), this file provides directives to web crawlers about which pages they can or cannot scrape.

#### Implications of Google's `robots.txt`

- **Selective Access**: Google allows certain parts of its site to be crawled while restricting others. For instance, crawling the search results pages is generally disallowed.

- **Dynamic Nature**: The content of `robots.txt` files can change, reflecting the website's evolving stance on web scraping. Regular checks are necessary for compliance.

- **Respecting the Limits**: Even if a `robots.txt` file allows scraping of some pages, it does not automatically mean all scraping activities are legally or ethically acceptable. It's a guideline, not a blanket permission.
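
The checks described above can be automated with the standard library's `urllib.robotparser`. The sketch below parses a small, made-up `robots.txt` (the rules and the `example.com` URLs are assumptions for illustration, not Google's actual file) and asks whether a generic crawler may fetch two paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; a real file would come from the site itself.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /about
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Return a parser loaded with the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# "*" stands for any crawler that has no dedicated section in the file.
print(parser.can_fetch("*", "https://example.com/search?q=puppies"))  # False
print(parser.can_fetch("*", "https://example.com/about"))             # True
```

For a live site, `RobotFileParser.set_url(...)` followed by `read()` downloads the file instead of parsing a string. A `True` answer only reflects the file's directives; the Terms of Service and applicable laws still apply.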

### 1. Introduction to Data Hunting in the Digital Age

#### The Evolution of Data Sourcing

In this course, we focus on data as our foundational element. Traditionally, data has been sourced from structured formats like spreadsheets from scientific experiments or records in relational databases within organizations. But with the digital revolution, particularly the advent of the internet, our approach to data collection must evolve. The internet is a vast reservoir of unstructured data, presenting both challenges and opportunities for data retrieval and analysis.

#### Understanding the Landscape of Web Data

When seeking data from the internet, it's essential to first consider how the website in question provides access to its data. Many large-scale websites like Google, Facebook, and Twitter offer an **Application Programming Interface (API)**. APIs are designed to facilitate easy access to a website's data in a structured format, simplifying the process of data extraction.

##### The Role of APIs

- **APIs as a Primary Tool**: An API acts as a bridge between the data seeker and the website's database, allowing for streamlined data retrieval.
- **Limitations**: However, not all websites provide an API. Additionally, even when an API is available, it may not grant access to all the data a user might need.

##### The Need for Web Scraping

In cases where an API is absent or insufficient, we turn to **web scraping**. Web scraping involves extracting raw data directly from a website's frontend - essentially, the same information presented to users in their web browsers.

###### Diving into Scraping

- **Dealing with Unstructured Data**: Scraping requires us to interact with unstructured data, necessitating custom coding and data parsing techniques.
- **Legal and Ethical Considerations**: It's crucial to approach web scraping with an awareness of the legal and ethical implications, respecting website policies and user privacy.

## Starting Our Journey

Our first practical step in this journey will be to explore how to connect to the internet and retrieve a basic webpage. We'll begin by using Python's `urllib.request` module, a powerful tool for interacting with URLs and handling web requests.

Join us as we embark on this exciting journey to master the art of data hunting in the digital era, where we'll navigate the complexities of APIs, web scraping, and the ethical considerations that come with them.

In [None]:
# Import the 'urlopen' function from the 'urllib.request' module.
# This function is used for opening URLs, which is the first step in web scraping.
from urllib.request import urlopen

# Use the 'urlopen' function to open the URL 'http://www.google.com/'.
# The function returns a response object which can be used to read the content of the page.
# Here, 'source' is a variable that holds the response object from the URL.
source = urlopen("http://www.google.com/")

# Print the response object.
# This command does not print the content of the webpage.
# Instead, it prints a representation of the response object, 
# which includes information like the URL, HTTP response status, headers, etc.
print(source)

## Exploring the Content Retrieved by `urlopen`

This code snippet demonstrates the basic usage of the `urlopen` function for accessing a webpage. However, it is important to note that `print(source)` will not display the HTML content of the webpage but rather the HTTP response object's representation. To view the actual content of the page, you would need to read from the `source` object using methods like `source.read()`.

After opening a URL using the `urlopen` function from the `urllib.request` module, we typically want to access the actual content of the webpage. This is where `source.read()` comes into play.

### Understanding `source.read()`

When you call `urlopen`, it returns an HTTPResponse object. This object, which we've named `source` in our example, holds various data and metadata about the webpage. To extract the actual HTML content of the page, we use the `read` method on this object.

### What Does `source.read()` Do?

- **Retrieves Webpage Content**: `source.read()` reads the entire content of the webpage to which the URL points. This content is usually in HTML format, which is the standard language for creating webpages.

- **Binary Format**: `read()` returns a `bytes` object, not a `str`. To work with it as text in Python, decode it, for example with `.decode('utf-8')`.

- **One-time Operation**: It's important to note that you can read the content of the response only once. After `source.read()` is executed, the response object does not retain the content in a readable form. If you need to access the content again, you must reopen the URL.
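
The bytes-versus-text and read-once behaviour can be tried without a network connection. In this minimal sketch an `io.BytesIO` stream stands in for the response object returned by `urlopen` (an assumption made so the example runs offline; the page body is invented).

```python
import io

# Stand-in for the object returned by urlopen(); the HTML is a made-up example.
fake_response = io.BytesIO("<html><body>Café Python</body></html>".encode("utf-8"))

first_read = fake_response.read()   # all the bytes of the "page"
second_read = fake_response.read()  # the stream is already exhausted

text = first_read.decode("utf-8")   # bytes -> str

print(type(first_read))  # <class 'bytes'>
print(second_read)       # b''
print("Python" in text)  # True
```

Storing the result of the first `read()` in a variable, as done here, avoids having to reopen the URL.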

Here's a simple example to illustrate this:

In [None]:
# Read the page content (returned as bytes) and print it
something = source.read()
print(something)

## DEMO

Let's get hands-on with some initial exercises to warm up with web scraping!

### Exercises

1. **Python.org Content Check**: Does [https://www.python.org](https://www.python.org) contain the word `Python`?  
   _Hint: You can use the `in` keyword to check._

2. **Google.com Image Search**: Does [http://google.com](http://google.com) contain an image?  
   _Hint: Look for the `<img>` tag._

3. **First Characters of Python.org**: What are the first ten characters of [https://www.python.org](https://www.python.org)?

4. **Keyword Check in Pyladies.com**: Is there the word 'python' in [https://pyladies.com](https://pyladies.com)?

In [None]:
# EX1: Check if 'Python' is in the content of http://www.python.org/

# Import the urlopen function from the urllib.request module
# This function is used to open a URL and retrieve its contents
from urllib.request import urlopen

# Use the urlopen function to access the webpage at http://www.python.org/
# The function returns an HTTPResponse object which is stored in the variable 'source'
source = urlopen("http://www.python.org/")

# Read the content of the response object using the read() method
# The read() method retrieves the content of the webpage in binary format
# The binary content is then decoded to a string using the 'latin-1' encoding
# The decoded string is stored in the variable 'something'
something = source.read().decode('latin-1')

# Check if the word "Python" is in the decoded string
# This is done using the 'in' keyword, which checks for the presence of a substring in a string
# The result is a boolean value: True if "Python" is found, False otherwise
"Python" in something

# Note: The choice of 'latin-1' for decoding might not always be appropriate
# It's often better to use 'utf-8', which is a more common encoding for webpages
# For example: something = source.read().decode('utf-8')

## Definitions: Request, Crawling and Scraping

### Using `urlopen` vs. `Request` in Web Scraping

When performing web scraping tasks in Python, you have the option to use either the `urlopen` function from the `urllib.request` module or the `Request` object in combination with `urlopen`. Here, we'll explain why you might choose one approach over the other.

### Using `urlopen` Directly

**Advantages**:

- **Simplicity**: It's a straightforward way to access a webpage and retrieve its content without the need for additional objects or customization.
  
- **Default Behavior**: `urlopen` uses default settings for the HTTP request, which is suitable for many common use cases.

- **Convenience**: For simple web scraping tasks, it provides a concise and readable solution.

### Using `Request` with `urlopen`

**Advantages**:

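- **Customization**: You can set custom headers, attach a request body, and choose the HTTP method (e.g., POST, PUT). Related options live elsewhere in `urllib.request`: timeouts are passed to `urlopen` through its `timeout` argument, while redirect and cookie handling are configured with handlers via `build_opener`.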

- **Fine-Grained Control**: It offers greater flexibility for handling complex scenarios.

In summary, the choice between using `urlopen` directly and creating a `Request` object depends on the complexity of your web scraping task. For simple tasks like fetching webpage content, `urlopen` is often sufficient and more straightforward. However, if you need to customize headers, use non-GET HTTP methods, or handle advanced scenarios, creating a `Request` object allows for fine-grained control over your HTTP requests.
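
As a small illustration of that difference, the sketch below builds two `Request` objects and inspects them without sending anything. The URL, the `course-demo/0.1` user agent, and the form data are placeholders chosen for this example.

```python
from urllib.request import Request

url = "https://www.example.com/"  # placeholder URL; nothing is sent in this sketch

# A bare request: urlopen(url) builds something equivalent behind the scenes.
plain = Request(url)

# A customized request: own User-Agent header, a body, and an explicit method.
custom = Request(
    url,
    data=b"q=puppies",
    headers={"User-Agent": "course-demo/0.1"},
    method="POST",
)

print(plain.get_method())                # GET
print(custom.get_method())               # POST
# urllib stores header names capitalized as "User-agent".
print(custom.get_header("User-agent"))   # course-demo/0.1
```

Passing either object to `urlopen(...)` would perform the actual request.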


### Crawling and Scraping: Unveiling the Web's Secrets

Crawling and scraping are two fundamental techniques in the world of web data acquisition. They form the backbone of many data-driven applications and are crucial skills for data analysts and web developers.

### Crawling: Navigating the Web

Crawling, often referred to as web crawling or spidering, is the process of systematically navigating the World Wide Web to retrieve web pages. Think of it as a web robot or spider, tirelessly traversing the internet to discover and index web content. This technique is at the heart of search engines like Google and Bing.

### Why Do We Crawl?

Crawling serves several important purposes:

- **Indexing**: It allows search engines to index and catalog web pages, making them searchable by users.
  
- **Link Discovery**: Crawlers extract links from web pages, helping build a vast network of interconnected web resources. This link structure is crucial for understanding the web's architecture.
  
- **Data Retrieval**: Crawlers may scrape or extract data from web pages, but their primary goal is to discover and navigate to other web pages.

### Scraping: Harvesting Data

Scraping is the process of extracting specific data or information from a single web page. Unlike crawling, which focuses on navigating the web, scraping zooms in on a single webpage to harvest valuable data.

### Use Cases of Scraping

Scraping is used for a variety of purposes, such as:

- **Data Extraction**: It allows us to extract structured data like product prices, news headlines, or stock market information from websites.

- **Content Monitoring**: Scraping can be employed to track changes in content on specific web pages, such as monitoring price changes on e-commerce sites or tracking news updates.

- **Competitor Analysis**: Businesses often use scraping to gather data on competitors, such as pricing strategies or product listings.

- **Research and Analysis**: Data analysts and researchers use scraping to collect data for studies, reports, and data-driven insights.

### Crawling and Scraping Synergy

In practice, crawling and scraping often work together. Crawlers traverse the web to find new pages, and once they reach a page of interest, scraping techniques are applied to extract valuable data. This synergy is what powers search engines, news aggregators, and data-driven applications on the internet.
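
A toy version of this division of labour, using only the standard library's `html.parser`: collecting the links is the crawling step (where to go next), and pulling out the title is the scraping step (what to keep). The HTML snippet, the `example.com` base URL, and the class name `LinkAndTitleParser` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects absolute link targets (crawl) and the page title (scrape)."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()


SAMPLE_HTML = """
<html>
  <head><title>Home</title></head>
  <body>
    <a href="/about">About</a>
    <a href="https://example.org/x">Elsewhere</a>
  </body>
</html>
"""

page = LinkAndTitleParser("https://example.com/")
page.feed(SAMPLE_HTML)

print(page.title)  # scraped data
print(page.links)  # pages a crawler could visit next
```

A real crawler would also keep a set of visited URLs and respect `robots.txt` before following any of these links.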

### Conclusion

Understanding the concepts of crawling and scraping is essential for anyone looking to work with web data. Whether you want to build a search engine, gather market research, or simply automate data collection, these techniques are your gateway to unlocking the wealth of information available on the web.

## Requests vs Urllib

### urllib

In [None]:
import urllib.request

# Define the URL to scrape
url = 'https://www.pyladies.com'

# Set up the request with a custom user-agent header
req = urllib.request.Request(url, headers={'User-Agent': 'Magic Browser'})

# Open the URL and retrieve the HTML content
con = urllib.request.urlopen(req)
html = con.read().decode()

# Check if 'Python' is in the HTML content
print('Python' in html)


### Requests

In [None]:
# the main library you will need for web scraping is called Beautiful Soup
from bs4 import BeautifulSoup
# the second package we need is one we already know
import requests


url = "https://en.wikipedia.org/wiki/Marie_Curie"

response = requests.get(url)
response

![HTTPStatus](https://www.whatismyip.com/static/51e6afd43d8a39f7a6e03805c1328e11/https-codes.webp)
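
The status codes in the figure are also available in Python's standard library through `http.HTTPStatus`. The helper below, `describe_status`, is a hypothetical name introduced for this sketch; it turns a numeric code such as the `response.status_code` of a `requests` response into a readable description.

```python
from http import HTTPStatus

# The first digit of a status code identifies its family.
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return e.g. '404 Not Found (client error)' for a standard status code."""
    status = HTTPStatus(code)  # raises ValueError for codes it does not know
    return f"{code} {status.phrase} ({FAMILIES[code // 100]})"


for code in (200, 404, 503):
    print(describe_status(code))
```

With a live response, `describe_status(response.status_code)` gives a quick sanity check before parsing the content.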

In [None]:
## ANALYZE THE RESPONSE METHODS
#response.
response.content

## This is not very easy to analyze...

### Beautiful Soup

In [None]:
# turning the response into a beautiful soup object
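# naming the parser explicitly avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(response.content, 'html.parser')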
# prettify the soup to then copy it to a text editor and study its structure
print(soup.prettify())

## BREAK: HTML

### 1. "Making Your Own API": Web Scraping

### Understanding Web Scraping
Web scraping becomes essential when data is available on the web but isn't accessible through an API, or the existing API lacks certain functionalities or has restrictive terms of service. In such scenarios, **Web Scraping** is the technique that enables automated extraction of this data, replicating the access a human would have visually.

### Why Web Scraping?
- **Data Accessibility**: Sometimes, the only way to access certain data is directly from the web pages where it is displayed.
- **Flexibility**: Web scraping allows you to tailor data extraction to specific needs, bypassing limitations of existing APIs.

### Preparing for Web Scraping: Understanding Web Page Structure
Before delving into scraping, it's crucial to have a basic understanding of web page structure and how data is stored and presented. This session covers:

#### Basic HTML and CSS Static Pages
- **HTML (HyperText Markup Language)**: The standard markup language used to create web pages. Understanding HTML is key to identifying the data you want to scrape.
- **CSS (Cascading Style Sheets)**: Used for describing the presentation of a document written in HTML. Knowing CSS helps in pinpointing specific elements on a page.

#### Dynamic HTML
- **Basic JavaScript Example Using jQuery**: Websites often use JavaScript to load data dynamically. Understanding how this works is crucial for scraping data from such dynamic pages.

### Understanding the Foundation of Web Pages

The most fundamental web pages are constructed using HTML and CSS. These technologies serve two primary purposes: **HTML (Hypertext Markup Language)** structures and stores the content, making it the primary target for web scraping, while **CSS (Cascading Style Sheets)** formats and styles the content, highlighting visual elements like fonts, colors, borders, and layout.

#### HTML: The Structure of the Web
HTML is a markup language typically rendered by web browsers. It uses 'tags' to define elements on a web page. A typical tag format includes a tag name, attributes (if any), and the content between opening and closing tags.

#### Key Components of an HTML File

- **DOCTYPE Declaration**: 
  - Begins with `<!DOCTYPE html>`, indicating the use of HTML5.
  - Earlier HTML versions had different DOCTYPEs.

- **HTML Tag**: 
  - The `html` tag (and its closing `/html` tag) encloses the entire web page content.

- **Head and Body**: 
  - The `head` section often includes the `title` tag, defining the webpage's name, links to CSS stylesheets, and JavaScript files for dynamic behavior.
  - The `body` contains the visible webpage content.

- **Common HTML Elements**:
  - **Headings and Paragraphs**: Use `h#` (where # is a number) for headings and `p` for paragraphs.
  - **Hyperlinks**: Defined with the `href` attribute in `a` (anchor) tags.
  - **Images**: Embedded using `img` tags with the `src` attribute. Note: `img` is self-closing.

#### Exercise: Build a Basic HTML Web Page

Let's put your HTML knowledge into practice:

- Create a file named 'example.html' in your favorite text editor.
- Build a basic HTML web page containing elements like `title`, `h1`, `p`, `img`, and `a` tags. Remember that nearly all tags need to be closed with a `/tag`.

This exercise aims to familiarize you with the basic structure of HTML and how various elements come together to form a web page.

In [None]:
%%html

<!-- Start of the HTML head section -->
<head>
    <!-- Title of the webpage -->
    <title>
        Basic knowledge for web scraping.
    </title>	
</head>
<!-- Start of the HTML body section -->
<body>
    <!-- Header 1 indicating the subject of the content -->
    <h1>About HTML
    </h1>

    <!-- Image of a rubber ducky; this one is not clickable -->
    <p>

    </p>
</body>

In [None]:
%%html

<!-- Start of the HTML head section -->
<head>
    <!-- Title of the webpage -->
    <title>
        Basic knowledge for web scraping.
    </title>	
</head>
<!-- Start of the HTML body section -->
<body>
    <!-- Header 1 indicating the subject of the content -->
    <h1>About HTML
    </h1>
    <!-- Paragraph explaining what HTML is and providing a link for further information -->
    <p>HTML (HyperText Markup Language) is the basic language for providing content on the web. It is a tagged language. You can read more about it at the <a href="http://www.w3.org/community/webed/wiki/HTML">World Wide Web Consortium.</a></p>
    
    <!-- Paragraph indicating that one of the following images is clickable -->
    <p> One of the following rubberduckies is clickable
    </p>
    <!-- Image of a rubber ducky; this one is not clickable -->
    <p>
        <img src = "files/rubberduck.jpg"/>
    
        <!-- Clickable image (hyperlinked) of a rubber ducky -->
        <a href="http://www.pinterest.com/misscannabliss/rubber-duck-mania/"><img src = "files/rubberduck.jpg"/></a>
    </p>
</body>

## Back to Requests and Beautiful Soup

### Titles, Paragraphs and Tables

In [None]:
# now that we have the html code inside a soup object -> we can explore its attributes
# I can call the title tag of the webpage -> this brings the tag and the content
soup.title

In [None]:
# imagine you only wanted the content
soup.title.string

In [None]:
# imagine I want paragraphs (p tag)
soup.p
# this is no good, clearly there are many p tags which we want

paragraphs = soup.find_all('p')
paragraphs

for element in paragraphs:
  print(element.text)

In [None]:
# you can search both by the tag but also by other attributes, such as the class name
tables = soup.find_all('table', attrs= {'class' : 'infobox biography vcard'})

# this is very helpful to identify boxes that share the same CSS styling, for which an attribute is already defined

# keep the first table that matched
table = tables[0]
table

In [None]:

# inside the first level of my table, there are still many many tags
# you can find more tags within your table

# the table itself has many tags inside -> it is a soup object itself
for line in table.find_all('li'):
  print(line.text)


In [None]:
# do it yourself:
# find all the category names of the bio fields for Marie Curie



### Web Scraping Exercise: Extracting News Headlines from BBC Technology

#### Objective
Write a Python script to scrape headlines from BBC's Technology news section and categorize them based on keywords.

#### Task Details

1. **Website to Scrape**:
   - Target the BBC's 'Technology' section: [BBC Technology News](https://www.bbc.co.uk/news/technology).

2. **Scraping Requirement**:
   - Scrape the main headlines from the page, typically found in `h3` tags or a specific class.

3. **Categorization**:
   - Categorize the headlines based on predefined keywords like 'Apple', 'Microsoft', 'Google', etc.
   - Count the number of headlines that fall into each category.

4. **Output**:
   - Print each headline along with its respective category.
   - Summarize with the count of headlines in each category.

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the BBC technology news section
url = "https://www.bbc.co.uk/news/technology"

# Send a GET request and parse the HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Define categories and associated keywords
categories = {
    'Apple': ['Apple', 'iPhone', 'iPad'],
    'Microsoft': ['Microsoft', 'Windows', 'Bill Gates'],
    'Google': ['Google', 'Android', 'Alphabet']
    # Add more categories as needed
}

# Function to determine the category of a headline
def categorize_headline(headline):
    # Logic to determine the category based on keywords
    # Return the category name if a keyword is found, else return 'Other'
    pass

# Scrape and process the headlines
# Look for 'h3' tags or other relevant tags
# Use the categorize_headline function to categorize each headline
# Print each headline and its category

# Print the count of headlines in each category
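
One possible shape for the categorization step, tried on made-up headlines so it runs without contacting the BBC site. The function name `categorize_by_keyword`, the `sample_categories` dictionary, and the sample headlines are assumptions for illustration; the matching is a simple case-insensitive substring test.

```python
from collections import Counter

sample_categories = {
    "Apple": ["Apple", "iPhone", "iPad"],
    "Microsoft": ["Microsoft", "Windows", "Bill Gates"],
    "Google": ["Google", "Android", "Alphabet"],
}


def categorize_by_keyword(headline: str, categories: dict) -> str:
    """Return the first category with a keyword in the headline, else 'Other'."""
    lowered = headline.lower()
    for category, keywords in categories.items():
        if any(keyword.lower() in lowered for keyword in keywords):
            return category
    return "Other"


# Invented headlines; real ones would come from the scraped page.
sample_headlines = [
    "Apple unveils new iPad",
    "Windows update causes problems",
    "Android phones get new feature",
    "Chip shortage continues",
]

counts = Counter()
for headline in sample_headlines:
    category = categorize_by_keyword(headline, sample_categories)
    counts[category] += 1
    print(f"{category}: {headline}")

print(dict(counts))
```

Substring matching is deliberately naive: a headline about "pineapple" would land in the Apple category, so word-boundary matching may be worth adding for real data.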

## Future work: Alternatives like Selenium and XPath


### 1.3 Selecting Elements with XPath

XPath, or XML Path Language, is a versatile and robust tool for navigating and selecting elements within HTML documents. While Beautiful Soup and requests are commonly used libraries for web scraping, XPath offers a unique and powerful approach to extracting data from web pages.

### What is XPath?

XPath was originally designed for navigating XML documents, but it is equally applicable to HTML, which shares a structural similarity with XML. XPath allows you to specify the precise location of elements or data within an HTML document using a concise and expressive syntax.

### Key Differentiators:

Here are some key differentiators that set XPath apart from other web scraping approaches:

1. **Granular Selection**: XPath provides granular control over element selection. Unlike Beautiful Soup, which often requires multiple iterations and filtering, XPath allows you to pinpoint elements directly based on their attributes, tags, or positions within the document.

2. **Hierarchical Navigation**: XPath excels at navigating the hierarchical structure of HTML documents. It enables you to traverse the document tree, moving up, down, or across branches with ease.

3. **Precise Queries**: With XPath, you can create precise queries to extract specific data. For example, you can target elements with specific attributes, such as selecting all `<a>` elements with a particular class or locating elements within specific parent elements.

4. **Text Extraction**: XPath's `text()` function simplifies the extraction of text content from elements. This is particularly useful for scraping text data, such as headlines, paragraphs, or product descriptions.

### How to Use XPath:

To utilize XPath for web scraping, you typically follow these steps:

1. **Send an HTTP Request**: Use a library like requests to send an HTTP GET request to the webpage you want to scrape. This retrieves the HTML content of the page.

2. **Parse the HTML**: Once you have the HTML content, parse it using a library like lxml or lxml.html. This step constructs a structured representation of the webpage that you can navigate with XPath.

3. **Construct XPath Expressions**: Formulate XPath expressions that target the specific elements or data you wish to extract. XPath expressions can vary in complexity, allowing you to adapt to different webpage structures.

4. **Apply XPath Expressions**: Apply your XPath expressions to the parsed HTML document to select the desired elements or data. This process effectively filters the HTML content to capture only what you need.

5. **Retrieve and Process Data**: Retrieve the selected elements or data using the XPath queries and process them as needed for your scraping task.

In summary, XPath is a powerful tool for web scraping that offers precise and efficient element selection within HTML documents. While libraries like Beautiful Soup and requests are valuable, XPath provides an additional layer of control and flexibility, making it a valuable choice for advanced scraping projects.


### Understanding XPath Syntax

- **Absolute Path (`/`)**: 
  - Using a single slash indicates an absolute path from the root element.
  - Example: `xpath('/html/body/p')` selects all paragraph (`<p>`) elements directly under the `<body>` within the `<html>` root element.

- **Relative Path (`//`)**:
  - Double slashes indicate a relative path, meaning the selection can start anywhere in the document hierarchy.
  - Example: `xpath('//a/div')` finds all `<div>` elements that are direct children of `<a>` tags, regardless of where the `<a>` sits in the document (use `//a//div` for descendants at any depth).

- **Wildcards (`*`)**:
  - The asterisk acts as a wildcard, representing any element.
  - Example: `xpath('//a/div/*')` selects all elements that are children of `<div>` tags under `<a>` tags, anywhere in the document.
  - Another example: `xpath('/*/*/div')` finds `<div>` elements that sit two levels below the root element (children of the root's children).

- **Selecting Specific Elements (Using Brackets)**:
  - If a selection returns multiple elements, you can specify which one to select using brackets.
  - Example: `xpath('//a/div[1]')` selects the first `<div>` child of each `<a>`; `xpath('//a/div[last()]')` selects the last one.

### Working with Attributes

- **Selecting Attributes (`@`)**:
  - The `@` symbol is used to work with element attributes.
  - Example: `xpath('//@name')` selects all attributes named 'name' in the document.
  - To select `<div>` elements with a 'name' attribute: `xpath('//div[@name]')`.
  - To select `<div>` elements without any attributes: `xpath('//div[not(@*)]')`.
  - To find `<div>` elements with a specific 'name' attribute value: `xpath('//div[@name="chachiname"]')`.

### Utilizing Built-in Functions

- XPath comes with several built-in functions to aid in element selection.
  - `contains()`: Selects elements containing a specific substring. Example: `xpath('//*[contains(name(), "iv")]')` selects every element whose tag name contains `iv`, such as `div`.
  - `count()`: Used for conditional selection based on child count. Example: `xpath('//*[count(div)=2]')`.

### Combining Paths and Selecting Relatives

- **Combining Paths (`|`)**:
  - Use the pipe symbol to combine paths, functioning like an OR operator.
  - Example: `xpath('/div/p|/div/a')` selects elements matching either `div/p` or `div/a`.

- **Selecting Relatives**:
  - You can refer to various relational aspects like parent, ancestors, children, or descendants.
  - Example: `xpath('//div/div/parent::*')` selects the parent elements of `div/div` paths.

Understanding XPath is essential for effective web scraping, as it allows precise targeting and extraction of data based on the structure of a webpage.
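
A few of these expressions can be tried with the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and needs well-formed markup (full XPath on real-world HTML is usually done with `lxml`, which is not used here). The snippet below is a made-up document; note that ElementTree paths start from an element, hence the leading `.`.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration.
SAMPLE = """
<html>
  <body>
    <a href="/one"><div name="first">One</div><div>Two</div></a>
    <div name="chachiname">Three</div>
    <p>Four</p>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())

divs_under_a = root.findall(".//a/div")           # div children of any <a>
first_div = root.findall(".//a/div[1]")           # first div child of each <a>
last_div = root.findall(".//a/div[last()]")       # last div child of each <a>
named_divs = root.findall(".//div[@name]")        # divs that have a name attribute
chachi = root.findall(".//div[@name='chachiname']")  # specific attribute value

print([d.text for d in divs_under_a])  # ['One', 'Two']
print([d.text for d in first_div])     # ['One']
print([d.text for d in last_div])      # ['Two']
print([d.text for d in named_divs])    # ['One', 'Three']
print([d.text for d in chachi])        # ['Three']
```

Functions such as `contains()` and `count()` and the `|` operator are outside ElementTree's subset, so they are not shown in this sketch.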

## Selenium

### 2.0. Starting with Selenium 

Selenium is a powerful tool primarily used for automating web browsers. It's widely utilized in areas such as web scraping, automated testing, and automating web-based administration tasks.

### Introduction to Selenium Without Manual Driver Setup

Selenium controls each browser through a browser-specific driver, such as geckodriver for Firefox or chromedriver for Chrome. Traditionally you had to download that driver yourself and put it on your `PATH`. Since version 4.6, Selenium ships with Selenium Manager, which locates or downloads a matching driver automatically, so in most setups the manual step is no longer needed:

- **Chrome**: `webdriver.Chrome()` works without downloading chromedriver by hand; Selenium Manager resolves a driver that matches the installed browser.

- **Microsoft Edge**: The Chromium-based Edge browser is handled the same way through `webdriver.Edge()`.

The driver is still there behind the scenes, but automatic driver management streamlines browser automation and makes it easier to configure, especially for beginners and those looking to quickly set up automated browser interactions.

### 2.1 Basic Concepts of Selenium WebDriver

### Understanding WebDriver

WebDriver is a key component of the Selenium suite. It acts as an interface to interact with the web browser, allowing you to control it programmatically. WebDriver can perform operations like opening web pages, clicking buttons, entering text in forms, and extracting data from web pages.

#### Key Functions of WebDriver
- **Opening a Web Page**: WebDriver can navigate to a specific URL.
- **Locating Elements**: It can find elements on a web page based on their attributes (like ID, name, XPath).
- **Interacting with Elements**: WebDriver can simulate actions like clicking buttons, typing text, and submitting forms.

### Interacting with Web Elements

You can locate and interact with elements on a web page using various methods provided by WebDriver. The choice of method depends on the attributes of the HTML elements you're targeting.

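- **find_element(By.ID, ...)**: Locates an element by its unique ID.
- **find_element(By.NAME, ...)**: Finds an element by its name attribute.
- **find_element(By.XPATH, ...)**: Uses XPath queries to locate elements, providing a powerful way to navigate the DOM.

Older tutorials use helpers such as `find_element_by_id`, `find_element_by_name`, and `find_element_by_xpath`. These were removed in Selenium 4.3 in favor of `find_element` with a `By` locator, imported with `from selenium.webdriver.common.by import By`.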

### Selenium WebDriver Python Examples

#### Example 1: Opening a Web Page

This example demonstrates how to open a web page using Selenium WebDriver.

```python
from selenium import webdriver

# Initialize the Chrome WebDriver
driver = webdriver.Chrome()

# Open a web page
driver.get("https://www.python.org")

# Close the browser
driver.quit()
```

## APIs (Application Programming Interfaces)

### Definitions

#### URLs

**U**niform **R**esource **L**ocator

Contains the information about a resource we (the CLIENT) are requesting from a SERVER

http://www.google.com/search?q=puppies

http://127.0.0.1:306/invocations

- Protocol: http
- Top Level Domain: com
- Domain: google
- Subdomain: www
- IP: 127.0.0.1 (used in place of a domain name in the second URL)
- Port: 306
- Route/Folder/Path: search (first URL), invocations (second URL)
- Query Parameters: q=puppies
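
The breakdown above can be reproduced with Python's standard library. The sketch below uses `urllib.parse` on the two example URLs; the helper name `describe_url` is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs


def describe_url(url):
    """Split a URL into the parts listed above (hypothetical helper)."""
    parts = urlparse(url)
    return {
        "protocol": parts.scheme,
        "host": parts.hostname,           # subdomain + domain + TLD, or an IP
        "port": parts.port,               # None when the URL names no port
        "path": parts.path,
        "query": parse_qs(parts.query),   # each parameter maps to a list of values
    }


for url in ("http://www.google.com/search?q=puppies",
            "http://127.0.0.1:306/invocations"):
    print(url)
    for name, value in describe_url(url).items():
        print(f"  {name}: {value}")
```

Note that `urlparse` does not split the host into subdomain, domain, and top-level domain; it returns the whole host name as one string.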

#### HTTP
**H**yper **T**ext **T**ransfer **P**rotocol (**S**ecure)

HTTP(S) is a protocol that provides a structure for requests between a client and a server.
For example, a user's web browser (the client) uses HTTP to request information from the server that hosts a website.

#### Response
The response is what the server sends back. Its format depends on the resource you requested:
 * a JSON document
 * an image
 * a video
 * an HTML page
 * ...

#### Request
**Requests** are the questions we (clients) ask of a server to get some information (the **response**).        
Types of request (verbs):
 * GET: read info from a resource without changing it in any way. This is the standard request that, on most sites, returns the HTML+CSS of the page as a response.
 * POST: send data that creates/updates a resource, or triggers some process.
 * PUT: replace an existing resource with the data sent.
 * DELETE: remove a resource.
 * PATCH: partially update an existing resource.
 * ...
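
To see a request and its response side by side without depending on an external site, the sketch below starts a toy server on the local machine and sends it a GET and a POST. It uses only the standard library (`http.server` and `urllib.request`) so it runs offline; `requests.get` and `requests.post` play the same client role in the rest of this notebook. The `EchoHandler` class, the `/puppies` path, and the payload are all made up for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class EchoHandler(BaseHTTPRequestHandler):
    """Toy server (hypothetical): answers GET and POST with a small JSON body."""

    def _send_json(self, status, payload):
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # GET only reads: report which path was asked for
        self._send_json(200, {"verb": "GET", "path": self.path})

    def do_POST(self):
        # POST carries data in the request body: read it and echo it back
        length = int(self.headers.get("Content-Length", 0))
        received = json.loads(self.rfile.read(length) or b"{}")
        self._send_json(201, {"verb": "POST", "received": received})

    def log_message(self, *args):
        pass  # keep the notebook output clean


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = f"http://127.0.0.1:{server.server_port}"

# Ignore any system-wide proxy settings, since the server is local
opener = build_opener(ProxyHandler({}))

with opener.open(f"{base_url}/puppies") as resp:
    get_status = resp.status
    get_body = json.loads(resp.read())

post_request = Request(
    f"{base_url}/puppies",
    data=json.dumps({"name": "Rex"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with opener.open(post_request) as resp:
    post_status = resp.status
    post_body = json.loads(resp.read())

server.shutdown()
server.server_close()

print(get_status, get_body)
print(post_status, post_body)
```

The status codes follow common practice: 200 for a successful read and 201 when the server reports that something was created.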


In [None]:
import requests

# Ask the Transport for London API for its air quality data
TFL = requests.get('https://api.tfl.gov.uk/AirQuality')

TFL.headers
TFL.content


In [None]:
# The content is typically in JSON format. What does JSON look like?
TFL.json()

![json](https://www.convertsimple.com/wp-content/uploads/2022/05/json-example.png)

### Loading API Data into Pandas


In [None]:
import pandas as pd

weather_data = pd.DataFrame.from_dict(TFL.json())
weather_data.head()

In [None]:
# Not ideal: part of the response is still nested JSON
weather_data['currentForecast'][1]

# pandas has a function to un-nest JSON, but it makes some assumptions, so sometimes we have to unpack hierarchical structures ourselves.
# Beware: this usually involves a lot of for loops and apply functions.
pd.json_normalize(weather_data['currentForecast'])
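
When the nested structure has to be unpacked by hand, a small recursive function is often enough. The sketch below flattens nested dictionaries into dotted keys, the same naming style `pd.json_normalize` uses by default. The `flatten` helper and the sample records are made up for illustration and are not the actual TFL response.

```python
def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep` (hypothetical helper)."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the key path along
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Made-up nested records, for illustration only
sample = [
    {"forecastType": "Current", "band": "Low",
     "pollutants": {"no2": "Low", "o3": "Low"}},
    {"forecastType": "Future", "band": "Moderate",
     "pollutants": {"no2": "Low", "o3": "Moderate"}},
]

rows = [flatten(record) for record in sample]
for row in rows:
    print(row)
```

The resulting list of flat dictionaries can be passed straight to `pd.DataFrame(rows)`. Lists inside a record are left untouched by this sketch and would need their own handling.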

### Parameters

In [None]:
r = requests.get('https://v2.jokeapi.dev/joke/programming')
r.json()


# Sometimes we want to pass parameters to the endpoint, just like we pass arguments to functions in Python.
# We pass parameters via the URL as `?param1=value1&param2=value2...` at the end of the URL.
r = requests.get('https://v2.jokeapi.dev/joke/programming?contains=python&amount=3')
r.json()

# Equivalent, and less error-prone: let requests build the query string from a dictionary
params_dict = {"contains":"python","amount":"3"}
r = requests.get('https://v2.jokeapi.dev/joke/programming',params=params_dict)
r.json()
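
To make the link between the dictionary and the URL explicit, the sketch below builds the query string by hand with the standard library's `urlencode`. This is only an illustration of the encoding step; `requests` performs a similar encoding internally when it receives `params=`.

```python
from urllib.parse import urlencode

base_url = "https://v2.jokeapi.dev/joke/programming"
params_dict = {"contains": "python", "amount": "3"}

# Turn the dictionary into "key=value" pairs joined by "&"
query_string = urlencode(params_dict)
full_url = f"{base_url}?{query_string}"
print(full_url)

# Characters with a special meaning in URLs are escaped automatically
escaped = urlencode({"contains": "c++ & python"})
print(escaped)
```

The automatic escaping is the main reason to prefer a dictionary over gluing strings together: a value containing `&` or a space would otherwise break the URL.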

### Business Challenge API

#### Example: Fetching Weather Data Using OpenWeatherMap API in Python

This example demonstrates how to use the OpenWeatherMap API to fetch current weather data for a specific city using Python.

#### Prerequisites
- An API key from OpenWeatherMap.
- Python's `requests` library installed. (Install via `pip install requests` if needed.)

#### Steps to Follow
1. **Sign Up for OpenWeatherMap API**:
   - Register for an account at [OpenWeatherMap](https://openweathermap.org/api).
   - Obtain your free API key (note that there might be an activation delay).

2. **Python Script for Weather Data Retrieval**:
   - The script uses the `requests` library to make an API call.
   - Replace `'YOUR_API_KEY'` with your actual OpenWeatherMap API key.
   - Replace `'CITY_NAME'` with your desired city name.
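
Once the API call returns, the JSON still needs to be reduced to the values of interest. The sketch below works on a made-up sample payload so it runs without an API key. The field names (`name`, `main.temp`, `main.humidity`, `weather[0].description`) and the Kelvin default are assumptions based on the current-weather response format, so check them against a real response; `summarize_weather` is a hypothetical helper.

```python
# Made-up payload with illustrative values, not real weather data
sample_response = {
    "name": "London",
    "weather": [{"main": "Clouds", "description": "overcast clouds"}],
    "main": {"temp": 285.15, "humidity": 81},
}


def summarize_weather(payload):
    """Pick a few fields out of a current-weather payload (hypothetical helper)."""
    # Assumption: temperature is in Kelvin when no `units` parameter is sent
    temp_celsius = payload["main"]["temp"] - 273.15
    return {
        "city": payload["name"],
        "description": payload["weather"][0]["description"],
        "temp_celsius": round(temp_celsius, 1),
        "humidity_pct": payload["main"]["humidity"],
    }


print(summarize_weather(sample_response))
```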

### Business Challenge II - Optional

#### Challenge: Analyzing Instagram Hashtag Trends with Instaloader

## Objective
Leverage `Instaloader`, a Python library, to download posts associated with a specific hashtag on Instagram. Analyze the collected data to identify trends, popular content, and user engagement.

## Steps

### 1. Install Instaloader
- Ensure Python is installed on your system.
- Install `Instaloader` using pip: `pip install instaloader`


### 2. Data Collection
- Choose a hashtag relevant to a topic of interest (e.g., #nature, #travel, #food).
- Use `Instaloader` to download posts tagged with the chosen hashtag. Cap the number of posts you download to avoid overwhelming the service.

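```python
from itertools import islice
import instaloader

L = instaloader.Instaloader()
posts = instaloader.Hashtag.from_name(L.context, 'YOUR_HASHTAG').get_posts()

records = []
for post in islice(posts, 50):  # cap the download; 50 is an arbitrary example limit
    # Keep only the details needed for the analysis step
    records.append({"date": post.date_utc, "caption": post.caption,
                    "likes": post.likes, "comments": post.comments})
```
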
### 3. Data Analysis
Analyze the downloaded data for:
- Popular trends in the hashtag.
- Common themes or subjects in images or captions.
- Levels of user engagement (likes, comments).
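
As a starting point for the analysis, the sketch below counts hashtags in captions and computes simple engagement figures. It runs on made-up records that stand in for whatever you stored during data collection; the dictionary keys (`caption`, `likes`, `comments`) are an assumption about how you saved the post details.

```python
import re
from collections import Counter

# Made-up records, for illustration only
posts = [
    {"caption": "Sunrise hike #nature #mountains", "likes": 120, "comments": 8},
    {"caption": "Forest walk #nature #trees", "likes": 95, "comments": 3},
    {"caption": "Lake at dusk #nature #mountains #lake", "likes": 210, "comments": 15},
]

# Common themes: count every hashtag that appears in a caption
hashtag_counts = Counter(
    tag.lower()
    for post in posts
    for tag in re.findall(r"#(\w+)", post["caption"])
)

# Engagement: average likes and comments, plus the most engaging post
avg_likes = sum(post["likes"] for post in posts) / len(posts)
avg_comments = sum(post["comments"] for post in posts) / len(posts)
top_post = max(posts, key=lambda post: post["likes"] + post["comments"])

print(hashtag_counts.most_common(3))
print(f"Average likes: {avg_likes:.1f}, average comments: {avg_comments:.1f}")
print("Most engaging caption:", top_post["caption"])
```

Real captions can be `None` for posts without text, so a real data set would need a guard before calling `re.findall`.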

### 4. Reporting
- Compile your findings into a report.
- Include visual representations (graphs, word clouds) to illustrate key trends.

### Important Notes
- Respect Instagram's terms of service and ethical guidelines in data scraping.
- Be mindful of privacy and consent, especially with user-generated content.
- The scope of data collection should be limited for educational purposes.

### Expected Outcome
This challenge aims to provide practical experience with Instaloader, develop data analysis skills, and offer insights into social media trends and user behavior.

## SOLUTIONS

In [None]:
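import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Marie_Curie"

response = requests.get(url)
# Name the parser explicitly to avoid a warning and parser-dependent results
soup = BeautifulSoup(response.content, 'html.parser')

# The biography infobox is the table with this exact class string
tables = soup.find_all('table', attrs={'class': 'infobox biography vcard'})

# Print the row labels (th), then the row values (td)
for line in tables[0].find_all('th'):
    print(line.text)

for line in tables[0].find_all('td'):
    print(line.text)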

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the BBC technology news section
url = "https://www.bbc.co.uk/news/technology"

# Send a GET request and parse the HTML content
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Define categories and associated keywords
categories = {
    'Apple': ['Apple', 'iPhone', 'iPad'],
    'Microsoft': ['Microsoft', 'Windows', 'Bill Gates'],
    'Google': ['Google', 'Android', 'Alphabet']
    # Add more categories as needed
}

# Function to determine the category of a headline
def categorize_headline(headline):
    for category, keywords in categories.items():
        for keyword in keywords:
            if keyword in headline:
                return category
    return 'Other'

# Scrape and process the headlines
# Look for 'h3' tags or other relevant tags
headlines = soup.find_all('h3')
category_counts = {category: 0 for category in categories.keys()}
category_counts['Other'] = 0

for h in headlines:
    headline_text = h.get_text().strip()
    category = categorize_headline(headline_text)
    category_counts[category] += 1
    print(f"Headline: {headline_text}\nCategory: {category}\n")

# Print the count of headlines in each category
print("Headline Counts by Category:")
for category, count in category_counts.items():
    print(f"{category}: {count}")

In [None]:
import requests

def get_weather(api_key, city):
    # Current-weather endpoint; HTTPS keeps the API key out of plain-text traffic
    base_url = "https://api.openweathermap.org/data/2.5/weather"
    # Let requests build and URL-encode the query string
    params = {"appid": api_key, "q": city}
    response = requests.get(base_url, params=params)
    return response.json()

# Replace 'YOUR_API_KEY' with your actual API key and 'CITY_NAME' with your city
api_key = 'YOUR_API_KEY'
city_name = 'CITY_NAME'
weather_data = get_weather(api_key, city_name)

print(f"Weather in {city_name}:")
print(weather_data)