$\Large \text{Web Scraping for Data Science - Ralph Tambala}$

## What is Web Scraping?

Web scraping is the automated extraction of data from the web in a fast and efficient way. With a scraper, you can collect data from almost any publicly accessible website, however large the dataset, and store it on your own computer.

APIs, by contrast, give you direct, structured access to the data you want, when the website provides one.

## Why Web Scraping?

**Cost-effective**

Web scraping services provide an essential service at a competitive cost: data that would otherwise have to be collected from websites by hand can be gathered and analysed automatically.

**Data accuracy**

Simple errors in data extraction can lead to major issues, so the data must be correct. Data scraping is not only fast, it is accurate too.

**Easy to implement**

Once a scraper starts collecting data, it can cover not just a single page but the whole domain. With a one-time investment, you can gather a high volume of data.

## Challenges of Web Scraping

**Analysis of retrieved data**

Data needs to be cleaned before it can be analysed. This is often time-consuming work.

**Difficult to analyze**

For those with little programming experience, web scrapers can be confusing.

**Speed and protection policies**

Most web scraping services are slower than API calls. Many websites also do not allow scraping. In addition, if the code of the target website changes, a web scraper may stop capturing the data.

## HTML Basics

HTML (Hypertext Markup Language) is a standardized system for tagging text files to achieve font, colour, graphic, and hyperlink effects on World Wide Web pages.

I have provided a sample HTML file for a quick summary of what HTML is.
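
To make the idea of tagging concrete, here is a minimal sketch that lists the tags of a tiny page using only Python's standard library. The markup in `SAMPLE_HTML` and the `TagCollector` class are made up for illustration; they are not the sample file mentioned above.

```python
from html.parser import HTMLParser

# A tiny, made-up HTML page (not the sample file referred to in the text).
SAMPLE_HTML = """
<html>
  <head><title>Demo page</title></head>
  <body>
    <h1 class="heading">Hello</h1>
    <p>A paragraph with a <a href="https://example.com">link</a>.</p>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Record every opening tag and its attributes in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(SAMPLE_HTML)
for tag, attrs in collector.tags:
    print(tag, attrs)
```

Each printed line is one element: the tag name says what kind of content it is, and the attributes (such as `class` or `href`) carry the extra details a scraper usually searches on.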

## Steps Involved in Web Scraping

1. Install and import third-party libraries
2. Access the HTML content from the webpage
3. Parse the HTML content
4. Search and navigate the parse tree to extract the data
5. Clean the data
6. Save the data

### Step 1: Install the required libraries

We will install the following libraries:
- <code>requests</code>: allows us to send HTTP/1.1 requests using Python.
- <code>html5lib</code>: a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers. *(The Web Hypertext Application Technology Working Group (WHATWG) is a community of people interested in evolving HTML and related technologies. It was founded in 2004 by individuals from Apple Inc., the Mozilla Foundation and Opera Software, all leading web browser vendors.)*
- <code>bs4</code>: the import name of Beautiful Soup 4 (published on PyPI as <code>beautifulsoup4</code>). Beautiful Soup is a library for parsing HTML and XML documents, including ones with malformed markup such as unclosed tags. It creates a parse tree for parsed pages that can be used to extract data from HTML.

These libraries can be installed with pip as shown below, or downloaded and installed manually from their project pages.

    pip install requests
    pip install html5lib # other parsers, such as lxml, also work
    pip install beautifulsoup4 # imported as bs4
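
Before moving on, it can help to confirm that the installation worked. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, introduced here for illustration.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the names from import_names that cannot be found for import."""
    return [name for name in import_names if find_spec(name) is None]


# Import names used in this notebook. Note that Beautiful Soup is imported
# as bs4 even though the package on PyPI is called beautifulsoup4.
required = ["requests", "html5lib", "bs4"]
print("Missing:", missing_packages(required) or "none")
```

If the list is not empty, run the corresponding `pip install` command again in the same environment the notebook uses.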

### Step 1 (continued): Import the required libraries

In [None]:
import requests
import html5lib  # not used directly; Beautiful Soup loads it by name as the parser
from bs4 import BeautifulSoup

### Step 2: Access the HTML content from the webpage

1. Specify the URL of the webpage
2. Send an HTTP request to the specified URL and save the server's response
3. Check whether the response status is 200 (OK)
4. If it is, print the raw content of the webpage

In [None]:
# the URL of interest
url = "https://en.wikipedia.org/wiki/Template:COVID-19_pandemic_data/Malawi_medical_cases_chart"
# send a GET request and store the server's response
response = requests.get(url)
# check whether the webpage was returned successfully (200 means OK)
print(response.status_code)

**Status codes**
- 200 (*OK*): the request has succeeded.
- 400 (*Bad Request*): the server cannot or will not process the request because of something it perceives to be a client error.
- 404 (*Not Found*): the client was able to communicate with the server, but the server could not find what was requested.
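
The codes above can also be looked up programmatically. Here is a minimal sketch using the standard library's `http.HTTPStatus`; the helper names `describe_status` and `is_success` are hypothetical. With `requests`, the equivalent check is usually done with `response.ok` or `response.raise_for_status()`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code} (unrecognised status code)"
    return f"{status.value} {status.phrase}"


def is_success(code):
    """True for 2xx codes, the range that signals a successful request."""
    return 200 <= code < 300


for code in (200, 400, 404):
    print(describe_status(code), "-> success" if is_success(code) else "-> failure")
```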

Now, let's view the content.

In [None]:
# let's print the raw HTML content
print(response.text)

### Step 3: Parse the HTML content

We create a BeautifulSoup object to represent the parsed document as a whole. For most purposes, you can treat it as a Tag object, which means it supports most of the methods described in the "Navigating the tree" and "Searching the tree" sections of the Beautiful Soup documentation.

A BeautifulSoup object can be created by passing two arguments:
- <code>response.text</code>: the raw HTML content
- <code>'html5lib'</code>: the name of the parser to use, here the pure-Python html5lib library

<code>soup.prettify()</code> returns the parse tree as an indented, readable string.

In [None]:
soup = BeautifulSoup(response.text, 'html5lib')
print(soup.prettify())
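
The same two-argument call can be tried without any network access. This is a small sketch on made-up markup; it assumes Beautiful Soup is installed and uses Python's built-in `"html.parser"` rather than `html5lib`, so the parser choice here differs from the cell above.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for response.text.
raw_html = (
    "<html><body><h1>Cases</h1>"
    "<p>First paragraph</p><p>Second paragraph</p>"
    "</body></html>"
)

# First argument: the raw HTML. Second argument: the parser name.
# "html.parser" ships with Python; "html5lib" could be passed instead.
soup = BeautifulSoup(raw_html, "html.parser")

print(soup.h1.text)                          # text of the first <h1>
print([p.text for p in soup.find_all("p")])  # text of every <p>
print(soup.prettify())                       # indented view of the tree
```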

### Step 4: Searching and navigating through the parse tree

To extract the data of interest to us we will need to navigate through the nested structure.

In [None]:
# use find() to capture the first tr with class mw-collapsible
block = soup.find('tr', class_ = 'mw-collapsible')
block

In [None]:
# use find_all() to capture every tr with class mw-collapsible
table = soup.find_all('tr', class_ = 'mw-collapsible')
table
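
The difference between `find()` and `find_all()` is easiest to see on a small example. The markup below is hypothetical, loosely modelled on the rows targeted above, and the sketch again assumes Beautiful Soup with the built-in `"html.parser"`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with invented dates, not the real Wikipedia table.
html = """
<table>
  <tr class="mw-collapsible"><td class="bb-c" colspan="2">2020-04-02</td></tr>
  <tr class="mw-collapsible"><td class="bb-c" colspan="2">2020-04-03</td></tr>
  <tr class="other"><td>ignored</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first_row = soup.find("tr", class_="mw-collapsible")      # one Tag, or None
all_rows = soup.find_all("tr", class_="mw-collapsible")   # list-like ResultSet
no_match = soup.find("tr", class_="does-not-exist")       # nothing matches

print(first_row.td.text)
print(len(all_rows))
print(no_match)
```

Because `find()` returns `None` when nothing matches, calling `.text` on its result without a check raises an `AttributeError`; `find_all()` simply returns an empty result in that case.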

In [None]:

# print the date cell of each row
for item in table:
    key = item.find('td', {'class':'bb-c', 'colspan':'2'}).text
    print(key)

In [None]:

# print the date together with the first count (total cases) of each row
for item in table:
    key = item.find('td', {'class':'bb-c', 'colspan':'2'}).text
    total_cases = item.find('span', {'class':'mcc-rw'}).text
    print('Date: {}\t\tTotal Cases: {}'.format(key, total_cases))

In [None]:
dates = []
cases = []
deaths = []
for item in table:
    date = item.find('td', {'class':'bb-c', 'colspan':'2'}).text

    # each row holds one or two counts: total cases, then total deaths if present
    cases_and_deaths = item.find_all('span', {'class':'mcc-rw'})
    total_cases = cases_and_deaths[0].text
    if len(cases_and_deaths) == 2:
        total_deaths = cases_and_deaths[1].text
    else:
        # no deaths count in this row
        total_deaths = '0'

    dates.append(date)
    cases.append(total_cases)
    deaths.append(total_deaths)
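
The loop above assumes every row has a date cell and at least one count. As a hedged sketch of a more defensive variant, the hypothetical helper `parse_row` below skips rows that lack either. The markup and numbers are invented for illustration, and the built-in `"html.parser"` is assumed.

```python
from bs4 import BeautifulSoup

# Invented rows: one with cases only, one with cases and deaths, one malformed.
html = """
<table>
  <tr class="mw-collapsible">
    <td class="bb-c" colspan="2">2020-04-02</td>
    <td><span class="mcc-rw">3</span></td>
  </tr>
  <tr class="mw-collapsible">
    <td class="bb-c" colspan="2">2020-04-07</td>
    <td><span class="mcc-rw">8</span></td>
    <td><span class="mcc-rw">1</span></td>
  </tr>
  <tr class="mw-collapsible"><td>no date cell here</td></tr>
</table>
"""


def parse_row(row):
    """Return (date, cases, deaths) as strings, or None if the row lacks data."""
    date_cell = row.find("td", {"class": "bb-c", "colspan": "2"})
    counts = row.find_all("span", {"class": "mcc-rw"})
    if date_cell is None or not counts:
        return None
    cases = counts[0].text
    deaths = counts[1].text if len(counts) > 1 else "0"
    return date_cell.text.strip(), cases, deaths


soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("tr", class_="mw-collapsible")
records = [rec for rec in map(parse_row, rows) if rec is not None]
print(records)
```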

### Step 5: Data cleaning

In [None]:
cleaned_total_cases = []
for item in cases:
    cleaned_total_cases.append(int(str(item).replace(',','')))
    
cleaned_total_deaths = []
for item in deaths:
    cleaned_total_deaths.append(int(str(item).replace(',','')))
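
Since both loops apply the same conversion, it could also be written once as a small helper. This is only a sketch; the name `to_int` and the sample values are made up.

```python
def to_int(value):
    """Convert a scraped count such as '1,234' (or an int) to an int."""
    return int(str(value).replace(",", "").strip())


# Made-up raw values of the kinds the scraper can produce.
raw_counts = ["3", "1,234", " 56 ", 0]
cleaned = [to_int(v) for v in raw_counts]
print(cleaned)
```

With such a helper, each cleaned list becomes a one-line list comprehension over `cases` or `deaths`.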

In [None]:
import pandas as pd 

In [None]:
df = pd.DataFrame({
    'Date': dates,
    'Total Cases': cleaned_total_cases,
    'Total Deaths': cleaned_total_deaths,
})
df.head()

In [None]:
df.shape

In [None]:
# count the rows whose Date is the placeholder '⋮' rather than a real date
df[df['Date'] == '⋮'].count()

In [None]:
df[df['Date'] == '⋮']

In [None]:
missing_dates = df.index[df['Date'] == '⋮'].to_list()

In [None]:
df.drop(index=missing_dates, inplace=True)

In [None]:
df.shape

In [None]:
df['Date'] = pd.to_datetime(df['Date'])
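
An alternative to locating and dropping the '⋮' rows by index is to let the date conversion flag them. The sketch below uses a made-up three-row frame and pandas' `errors="coerce"` option, which turns values that cannot be parsed into `NaT`.

```python
import pandas as pd

# Made-up rows; '⋮' stands in for the placeholder rows in the scraped table.
demo = pd.DataFrame({
    "Date": ["2020-04-02", "⋮", "2020-04-04"],
    "Total Cases": [3, 0, 4],
})

# errors="coerce" yields NaT for unparseable values instead of raising.
demo["Date"] = pd.to_datetime(demo["Date"], errors="coerce")

# Drop the rows whose date could not be parsed and renumber the index.
demo = demo.dropna(subset=["Date"]).reset_index(drop=True)
print(demo)
```

The trade-off is that `errors="coerce"` silently discards anything unparseable, so the explicit approach above is safer when you want to inspect what is being removed.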

In [None]:
df

We will do some simple visualizations just to appreciate what we have achieved so far!

In [None]:
import pandas as pd
import altair as alt

In [None]:
base = alt.Chart(df[:120]).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=420,
    height=200
)

In [None]:
red = alt.value('#f54242')
base.encode(y='Total Cases').properties(title='Total Confirmed') | base.encode(color=red, y='Total Deaths').properties(title='Total deaths') 

Next, we add two columns that may be useful: new confirmed cases and new deaths for each reported date, derived from the cumulative totals.

In [None]:
df['New Cases'] = df['Total Cases'] - df['Total Cases'].shift(1).fillna(0).astype(int)

In [None]:
df['New Deaths'] = df['Total Deaths'] - df['Total Deaths'].shift(1).fillna(0).astype(int)
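
The same calculation can be expressed with pandas' `diff()`. This is a sketch on made-up cumulative totals, not the scraped data; it assumes the rows are in chronological order.

```python
import pandas as pd

# Made-up cumulative counts in chronological order.
totals = pd.DataFrame({"Total Cases": [3, 5, 9, 9]})

# diff() gives the change from the previous row. The first row has no
# predecessor, so its NaN is filled with the first cumulative total.
totals["New Cases"] = (
    totals["Total Cases"].diff().fillna(totals["Total Cases"]).astype(int)
)
print(totals)
```

This gives the same numbers as subtracting the shifted column, because filling the first shifted value with 0 also leaves the first total unchanged.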

In [None]:
df.head(10)

In [None]:
# reorder the columns; copy() avoids a SettingWithCopyWarning on later assignments
df = df[['Date', 'New Cases', 'New Deaths', 'Total Cases', 'Total Deaths']].copy()
df.head(3)

In [None]:
df['New Cases'] = pd.to_numeric(df['New Cases'])
df['New Deaths'] = pd.to_numeric(df['New Deaths'])

In [None]:
base = alt.Chart(df[:120]).mark_bar().encode(
    x='monthdate(Date):O',
).properties(
    width=420,
    height=200
)

base.encode(y='New Cases').properties(title='New confirmed') | base.encode(color=red, y='New Deaths').properties(title='New deaths')

### Step 6: Saving to file

Finally, we can save the DataFrame to a CSV file.

In [None]:
df.to_csv('malawi_covid_data.csv')
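
By default `to_csv` also writes the row index as an extra first column. The sketch below, on a made-up two-row frame and an in-memory buffer instead of a real file, shows how `index=False` leaves it out and how `parse_dates` restores the date type when reading the data back.

```python
import io

import pandas as pd

# Made-up frame standing in for the cleaned data.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2020-04-02", "2020-04-03"]),
    "Total Cases": [3, 4],
})

buffer = io.StringIO()              # in-memory stand-in for a file on disk
sample.to_csv(buffer, index=False)  # index=False omits the row-number column

buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["Date"])
print(restored.dtypes)
```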

The end.