# Web Scraping Using Python

## Overview
Data scraping refers to the process of extracting data from websites or other sources using automated tools or scripts. The scraped data can then be analyzed and used for various purposes, including data wrangling. Data scraping can play a critical role in the data wrangling process by providing a means to collect large amounts of data quickly and efficiently. 

One example of when data scraping is useful is in sentiment analysis, where web scraping tools can be used to collect large amounts of social media data to analyze public opinion and sentiment about a particular topic, brand, or product. This information can help companies understand their customers' preferences and attitudes, identify areas for improvement, and make data-driven decisions. In this case, data scraping can save significant time and effort compared to manual data collection, while providing more comprehensive and accurate data for analysis.

In this module, we will cover the following topics:

I. REST APIs: Web-based application programming interfaces used to exchange data between client applications and web servers.
II. Web scraping: The process of extracting data from websites automatically using software tools.

## Learning Objectives
In this module, the learners will:

* Extract data from websites using scraping tools such as BeautifulSoup
* Parse raw data extracted from online sources and extract useful information
* Interact with REST APIs using a Python script

Let's get started!



# REST APIs

## What are REST APIs?
REST APIs (Representational State Transfer Application Programming Interfaces) are web-based application programming interfaces that use HTTP to exchange data between client applications and web servers.

## Why are they important?
They are used to provide a standardized way for developers to access and interact with web-based resources and services. For example, let's consider a weather forecasting application that provides real-time weather data for a particular location. The application can use a REST API to access the data from a web server that provides weather data for various locations. The REST API would allow the application to send requests to the server and receive responses in a standardized data format such as JSON or XML. This way, developers can build applications that consume and utilize data from remote sources without having to worry about the underlying protocols and communication mechanisms.

## Making HTTP requests
The requests library is a popular Python library for making HTTP requests. To make a GET request, use the 'requests.get()' function. It takes the URL of the resource as its first argument; query parameters can be passed as a dictionary through the 'params' keyword argument. Using a GET request, we can call a REST API endpoint and fetch the response.

In [3]:
import requests
endpoint = 'https://httpbin.org/get'
response = requests.get(endpoint)
print("Response status: ", response)

Response status:  <Response [200]>


We can see that the response code is 200, which means our GET request was successful. There are other types of HTTP requests that we can send, and each of them has its use case. 

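For example, we can send a POST request when we want to submit new data to the server, or a PUT request when we want to create or replace a resource at a specific URL. Note that you might occasionally receive a 5xx response code such as 500 (internal server error) or 503 (service unavailable), which indicates a problem on the server side; this is a free API and does not guarantee availability, unlike paid options.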
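
For GET requests, query parameters travel in the URL as a query string. The following is a minimal sketch, using only the standard library, of what that encoding looks like. The helper 'build_url()' and the parameter names 'city' and 'units' are made up for illustration and are not part of any real API.

```python
from urllib.parse import urlencode


def build_url(endpoint, params=None):
    """Return the endpoint with params encoded as a query string."""
    if not params:
        return endpoint
    return f"{endpoint}?{urlencode(params)}"


# Hypothetical parameters, used only for illustration
query = {"city": "Seattle", "units": "metric"}
url = build_url("https://httpbin.org/get", query)
print(url)
```

With the requests library you do not need to build the string yourself: calling 'requests.get("https://httpbin.org/get", params=query)' encodes the dictionary in the same way before sending the request.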

## Parsing HTTP response
Once we make an HTTP request, we receive an HTTP response object that contains a lot of information on how our request was handled. For a GET request, the most relevant part of the response object is usually the JSON or XML object that was sent since we are looking to fetch some information from the server. We can extract the response payload in Python by calling the 'json()' function on the response object. Here is an example of how we can access the JSON response:

In [4]:
import requests

# Send the HTTP GET request
endpoint = 'https://httpbin.org/get'
response = requests.get(endpoint)

# Print the response payload, parsed from JSON
print("Response payload: ", response.json())

Response payload:  {'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.2', 'X-Amzn-Trace-Id': 'Root=1-66edb185-30d370260a32e5286d80c206'}, 'origin': '8.37.96.42', 'url': 'https://httpbin.org/get'}


As you can see, there is a lot of information contained in the response payload. Most of these fields are metadata about the request, such as the headers the client sent, rather than data meant for end users. The 'json()' method already returns the payload as a regular Python dictionary, so you can access its fields directly. To save it for further use, you can serialize it back to text with 'json.dumps()'; 'json.loads()' does the reverse and parses JSON text into a dictionary.
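
As a small offline sketch of working with such a payload, the example below parses JSON text into a dictionary, reads two nested fields, and serializes the dictionary back to text. The text in 'raw_text' is a trimmed, hand-copied version of the output shown above, not a live response.

```python
import json

# A trimmed copy of the payload shown above, as raw JSON text
raw_text = (
    '{"args": {}, '
    '"headers": {"Host": "httpbin.org", "User-Agent": "python-requests/2.28.2"}, '
    '"url": "https://httpbin.org/get"}'
)

# json.loads() parses JSON text into a Python dictionary
payload = json.loads(raw_text)

# Nested fields are reached with ordinary dictionary indexing;
# .get() supplies a default when a key may be missing
host = payload["headers"]["Host"]
user_agent = payload["headers"].get("User-Agent", "unknown")
print(host, user_agent)

# json.dumps() turns the dictionary back into text, for example to save it
saved_text = json.dumps(payload, indent=2)
```

When the data comes from requests, 'response.json()' performs the parsing step for you, so only the indexing and saving steps are needed.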

## Authenticating HTTP requests
Authenticating HTTP requests is the process of verifying the identity of a client or user before allowing access to protected resources or data. This is an important security measure in web scraping applications where sensitive information may be involved.

One common way to authenticate HTTP requests is by using an API key. An API key is a unique identifier that is assigned to a client or user, and it is used to authenticate API requests. In web scraping, an API key is typically generated by the website or service that is being scraped, and it must be included in every HTTP request made by the scraper.

Here is an example code snippet that demonstrates how to authenticate an HTTP request using an API key in Python:

In [5]:
# import requests

# api_key = 'YOUR_API_KEY'
# url = 'https://api.example.com/data'
# headers = {'Authorization': 'Bearer ' + api_key}

# response = requests.get(url, headers=headers)
# print(response.json())

# NOTE

# The code provided above is meant to serve as an example of how 
# to authenticate HTTP requests using an API key in Python. 
# To successfully run this code, you will need to obtain an API key 
# from the website or service that you are attempting to scrape. 
# To use this code, you will need to replace the 'YOUR_API_KEY' 
# placeholder in the 'api_key' variable with your actual API key. 
# Additionally, you will need to replace the 'https://api.example.com/data' 
# URL in the 'url' variable with the relevant URL that 
# you are attempting to access. 
# The 'headers' variable uses the Bearer scheme shown here; 
# check the API's documentation, since some services expect 
# the key in a different header or as a query parameter.

In this example, we first define the 'api_key' variable to store the API key that we obtained from the website. Then we define the 'url' variable to store the URL of the API endpoint that we want to access. We also define a 'headers' variable that contains the Authorization header with the Bearer token prefix and the API key value. 

Next, we use the requests library to send a GET request to the API endpoint, passing in the 'url' and 'headers' parameters. Finally, we print the response body, which the 'json()' method parses from JSON into a Python object.

By including the API key in the Authorization header, we are authenticating our HTTP request and indicating to the server that we have permission to access the protected data or resources. This helps to prevent unauthorized access and ensure the security of the web scraping application.
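
One common practice is to keep the key out of the source code and read it from an environment variable instead. The sketch below assumes a Bearer-style API like the one described above; the helper 'build_auth_headers()' and the variable name 'EXAMPLE_API_KEY' are hypothetical and chosen only for illustration.

```python
import os


def build_auth_headers(api_key):
    """Return request headers carrying the key as a Bearer token."""
    if not api_key:
        raise ValueError("An API key is required")
    return {"Authorization": f"Bearer {api_key}"}


# EXAMPLE_API_KEY is a hypothetical environment variable name;
# the placeholder is used when the variable is not set
api_key = os.environ.get("EXAMPLE_API_KEY", "YOUR_API_KEY")
headers = build_auth_headers(api_key)

# Print only the header names so the key itself never reaches the logs
print(list(headers))
```

The resulting dictionary can be passed to 'requests.get()' through its 'headers' argument, exactly as in the commented example above.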


## Rate limiting requests
When working with APIs, it's important to respect any rate limits or throttling restrictions imposed by the API provider to avoid overloading their servers. A rate limit defines how many requests you are allowed to send in a certain frame of time, usually defined per second or per minute. By respecting the rate limit, we can avoid overwhelming the API provider's servers and ensure reliable and consistent access to their data. Here's an example of how to make a rate-limited request to an API endpoint using Python and the time module:

In [6]:
import requests
import time

# Define the API endpoint URL
url = 'https://httpbin.org/get'

# Set the number of requests per second allowed by the API
requests_per_second = 5
request_counter = 0
max_number_of_requests = 3

# Make a request to the API endpoint and print the result
def make_request():
    response = requests.get(url)
    if response.status_code == 200:
        data = response.json()
        print(data)
    else:
        print('Error:', response.status_code)

# Loop to make requests at the specified rate
while request_counter < max_number_of_requests:
    start_time = time.time()
    make_request()
    request_counter += 1
    elapsed_time = time.time() - start_time

    # Calculate the time to wait before making the next request
    time_to_wait = 1 / requests_per_second - elapsed_time
    if time_to_wait > 0:
        time.sleep(time_to_wait)

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.2', 'X-Amzn-Trace-Id': 'Root=1-66edb185-63cd487765098bea3a57d1e6'}, 'origin': '8.37.96.42', 'url': 'https://httpbin.org/get'}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.2', 'X-Amzn-Trace-Id': 'Root=1-66edb186-20878e9960ce95a10eef0397'}, 'origin': '8.37.96.42', 'url': 'https://httpbin.org/get'}
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.28.2', 'X-Amzn-Trace-Id': 'Root=1-66edb189-3746b7943c95e34b16df2713'}, 'origin': '8.37.96.42', 'url': 'https://httpbin.org/get'}


In this example, we define the API endpoint URL and set the number of requests per second allowed by the API. We then define a function 'make_request()' that sends a GET request to the API endpoint and processes the response. We use a while loop to make requests at the specified rate, measuring the elapsed time for each request and calculating the time to wait before making the next request. We use the time module to sleep for the appropriate amount of time before making the next request, to respect the rate limit set by the API. 

It is important to respect rate limits to avoid being blocked or blacklisted from the endpoint you are accessing.
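
The same waiting logic can be packaged into a small reusable helper. This is only a sketch under the assumption that the limit is expressed as requests per second; the class name 'RateLimiter' and its 'clock' and 'sleep' parameters are illustrative choices, not part of any library. The loop prints a message where a real script would send its request.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Sleep just long enough to stay under the configured rate."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now


limiter = RateLimiter(requests_per_second=5)
for request_number in range(3):
    limiter.wait()
    # A real script would call requests.get(url) here
    print("Sending request", request_number + 1)
```

Passing the clock and sleep functions as arguments makes the helper easy to check without real delays, and 'time.monotonic()' is used because it is not affected by system clock adjustments.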



# Web Scraping

## What is web scraping?
Web scraping is the process of extracting data from websites automatically using software tools. Web scraping is useful when there is a need to gather large amounts of data from different sources, such as news articles, product information, or user reviews.

## Why is it important?
Web scraping is used in research, journalism, and academia to gather data for analysis and reporting. For example, researchers may use web scraping to gather data on social media platforms to analyze trends and behaviors, or journalists may use web scraping to gather data on public records and government documents to uncover stories and insights.

## Fetching and parsing HTML pages
The first step in web scraping is to fetch the HTML page that we would like to scrape. We need to fetch the page so that we can inspect it for patterns, which is crucial to extracting the data that we need. You can also open the website in your browser and view the page source (Ctrl+U in most browsers) or open the developer tools (F12 in most browsers) and inspect the HTML from there. For this exercise, we'll be scraping Wikipedia. Here is how you can fetch an HTML page in Python:

In [7]:
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup 

url = "https://www.wikipedia.org/"

try:
    # Open the connection to the Wikipedia homepage and download the HTML page
    uClient = uReq(Request(url))

    # Parse the HTML page and store it in a BeautifulSoup data structure
    web_page = soup(uClient.read(), "html.parser")
    uClient.close()
    print(web_page)
except Exception:
    print(f"The page {url} returned an error, the site might not be available")

<!DOCTYPE html>

<html class="no-js" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia</title>
<meta content="Wikipedia is a free online encyclopedia, created and edited by volunteers around the world and hosted by the Wikimedia Foundation." name="description"/>
<script>
document.documentElement.className = document.documentElement.className.replace( /(^|\s)no-js(\s|$)/, "$1js-enabled$2" );
</script>
<meta content="initial-scale=1,user-scalable=yes" name="viewport"/>
<link href="/static/apple-touch/wikipedia.png" rel="apple-touch-icon"/>
<link href="/static/favicon/wikipedia.ico" rel="shortcut icon"/>
<link href="//creativecommons.org/licenses/by-sa/4.0/" rel="license"/>
<style>
.sprite{background-image:linear-gradient(transparent,transparent),url(portal/wikipedia.org/assets/img/sprite-de847d1a.svg);background-repeat:no-repeat;display:inline-block;vertical-align:middle}.svg-Commons-logo_sister{background-position:0 0;width:47px;height:47px}.svg-MediaWiki-logo_sister{background

This code is an example of how to use Python to download and parse HTML pages from a website using the urllib and BeautifulSoup libraries. If the connection is successful, the code passes the HTML returned through the 'uReq()' call to BeautifulSoup (imported here as 'soup'), which parses it and stores it in a BeautifulSoup data structure. Finally, the 'uClient' connection is closed and the parsed HTML page is printed. If you see an error, it is possible that Wikipedia is down at the moment, so wait a while and try again later.
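
Some sites reject requests that do not identify themselves, and network failures come in different kinds. The sketch below shows one way to attach a User-Agent header and to tell HTTP errors apart from connection errors using only urllib. The helper names and the User-Agent string are hypothetical, and 'fetch_html()' is defined but not called here, so the block runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Hypothetical User-Agent string; replace it with one that describes your script
USER_AGENT = "example-scraper/0.1 (contact: you@example.com)"


def build_request(url):
    """Create a request object that identifies the script to the server."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as error:
        # The server answered, but with an error status such as 404 or 503
        print(f"{url} returned HTTP status {error.code}")
    except URLError as error:
        # The server could not be reached at all
        print(f"Could not reach {url}: {error.reason}")
    return None


request = build_request("https://www.wikipedia.org/")
print(request.get_header("User-agent"))
```

The text returned by 'fetch_html()' can be handed to BeautifulSoup in the same way as 'uClient.read()' above. 'HTTPError' is caught first because it is a subclass of 'URLError'.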

## Searching for HTML tags
Once we have fetched our HTML page, the next step is to search for the data that we are interested in scraping. This process involves manually going through the HTML page and searching for patterns and HTML tags that we can leverage for extracting the data programmatically. In the Wikipedia example, if we wanted to extract the content of a certain page, we can look at the structure of Wikipedia and discover that the body text is enclosed in 'p' tags. We can use this information to fetch all 'p' tags in the page. Here is a code sample for this scenario:

In [8]:
from urllib.request import urlopen as uReq, Request
from bs4 import BeautifulSoup as soup 

search_term = "Microsoft_Azure"
url = f"https://en.wikipedia.org/wiki/{search_term}"

try:
    # Open the connection to the Wikipedia page and download the HTML page
    uClient = uReq(Request(url))

    # Parse the HTML page and store it in a BeautifulSoup data structure
    web_page = soup(uClient.read(), "html.parser")
    uClient.close()

    # find_all() returns a list-like result of matching tags,
    # so we can index it to get a specific paragraph
    content = web_page.find_all("p")
    print(content[3])
except Exception:
    print(f"The page {url} returned an error, the site might not be available")



First, a URL is constructed for the Wikipedia page corresponding to the search term using an f-string. The code then attempts to open a connection to the page and download its HTML content using the 'urlopen()' function from the urllib.request module. Next, the HTML content is parsed with BeautifulSoup (imported from the bs4 module as 'soup') and stored in the 'web_page' variable. After parsing, the code calls the 'find_all()' method to find all the HTML elements with the 'p' tag, which typically contain paragraphs of text. The result is a list of tags that we can iterate over or index to extract the information that we need.
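
The tags returned by 'find_all()' still contain markup such as links and citations. As a minimal sketch of turning them into plain text, the example below parses a short hand-written HTML string (an illustrative stand-in, not real Wikipedia markup) so that it runs without network access.

```python
from bs4 import BeautifulSoup

# Hand-written sample HTML, used here instead of a downloaded page
html = """
<html><body>
<h1>Example page</h1>
<p>Azure is a <a href="/wiki/Cloud_computing">cloud</a> platform.</p>
<p>It offers many <a href="/wiki/Web_service">services</a>.</p>
</body></html>
"""

page = BeautifulSoup(html, "html.parser")

# get_text() drops the nested tags and keeps only the readable text
paragraphs = [p.get_text().strip() for p in page.find_all("p")]

# Attributes of a tag are read like dictionary entries
links = [a["href"] for a in page.find_all("a")]

print(paragraphs)
print(links)
```

The same two list comprehensions can be applied to the 'web_page' object from the example above to collect the text and links of a downloaded page.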

