## Downloading the web page for web scraping

In [1]:
import requests
# The requests library sends a GET request to a web server and downloads the HTML content of the given web page.

In [2]:
page = requests.get("https://www.worldometers.info/coronavirus")

In [3]:
page.status_code
# A status_code of 200 means the page was downloaded successfully; any status code starting with 2 generally indicates success.

200
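
Checking `status_code` by eye works in a notebook, but a script usually wants a failed download to stop the run. A minimal sketch, assuming two hypothetical helpers named `ensure_success` and `fetch_html` and an arbitrary 10-second timeout (none of these are part of the original notebook); `raise_for_status()` is the requests method that raises `HTTPError` for 4xx and 5xx responses:

```python
import requests


def ensure_success(response):
    """Return the body of a response, raising requests.HTTPError on 4xx/5xx."""
    response.raise_for_status()  # does nothing for successful responses
    return response.content


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its raw bytes."""
    # The timeout stops the call from hanging forever if the server never answers.
    response = requests.get(url, timeout=timeout)
    return ensure_success(response)
```

With such a helper, `fetch_html("https://www.worldometers.info/coronavirus")` would return the same bytes as `page.content`, or raise instead of quietly handing back an error page.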

In [4]:
page.content

b'\n<!DOCTYPE html>\n<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->\n<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->\n<!--[if !IE]><!-->\n<html lang="en">\n<!--<![endif]-->\n<head>\n<meta charset="utf-8">\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\n<meta name="viewport" content="width=device-width, initial-scale=1">\n<title>COVID Live - Coronavirus Statistics - Worldometer</title>\n<meta name="description" content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates">\n\n<link rel="shortcut icon" href="/favicon/favicon.ico" type="image/x-icon">\n<link rel="apple-touch-icon" sizes="57x57" href="/favicon/apple-icon-57x57.png">\n<link rel="apple-touch-icon" sizes="60x60" href="/favicon/appl

## HTML Parsing
Parsing means breaking text into its components according to the rules of a grammar. HTML parsing therefore means reading HTML code and extracting the relevant information from its tags. A program that parses content is called a parser. We will use the BeautifulSoup library.

Here, the ‘lxml’ parser is used because it tolerates broken HTML and is widely used.

In [5]:
# Import the BeautifulSoup class
from bs4 import BeautifulSoup

# Instantiate BeautifulSoup; soup holds the parsed HTML of the web page
soup = BeautifulSoup(page.content, 'lxml')
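
The `lxml` parser is a separate package, and BeautifulSoup raises `FeatureNotFound` when it is asked for a parser that is not installed. The sketch below, built around a hypothetical `make_soup` helper and a made-up HTML fragment, falls back to Python's built-in `html.parser` in that case and shows that an unclosed tag is still parsed into a usable tree:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when available, else with Python's built-in parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is an optional third-party package; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


# Deliberately broken HTML: the <b> tag is never closed.
broken = "<p>Total <b>cases</p>"
soup_demo = make_soup(broken)
print(soup_demo.find("b").get_text())
```

The two parsers can build slightly different trees from the same broken markup, so it is worth using the same parser wherever the results need to be reproducible.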

In [6]:
# Format the HTML with the prettify method; compare with the raw bytes printed by cell 4
print(soup.prettify())


<!DOCTYPE html>
<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]-->
<!--[if IE 9]> <html lang="en" class="ie9"> <![endif]-->
<!--[if !IE]><!-->
<html lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <title>
   COVID Live - Coronavirus Statistics - Worldometer
  </title>
  <meta content="Live statistics and coronavirus news tracking the number of confirmed cases, recovered patients, tests, and death toll due to the COVID-19 coronavirus from Wuhan, China. Coronavirus counter with new cases, deaths, and number of tests per 1 Million population. Historical data and info. Daily charts, graphs, news and updates" name="description"/>
  <link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/>
  <link href="/favicon/apple-icon-60x60.png" rel="app

## Extraction of Table
Inspecting the page shows that the table of interest opens with this tag:

`<table id="main_table_countries_today" class="table table-bordered table-hover main_table_countries dataTable no-footer" style="width: 100%; margin-top: 0px !important;">`

The ‘id’ attribute identified during this inspection will be used to filter the HTML document and retrieve the required table element.

In [7]:
table = soup.find('table', attrs={'id' : "main_table_countries_today"})
table

<table class="table table-bordered table-hover main_table_countries" id="main_table_countries_today" style="width:100%;margin-top: 0px !important;display:none;">
<thead>
<tr>
<th width="1%">#</th>
<th width="100">Country,<br/>Other</th>
<th width="20">Total<br/>Cases</th>
<th width="30">New<br/>Cases</th>
<th width="30">Total<br/>Deaths</th>
<th width="30">New<br/>Deaths</th>
<th width="30">Total<br/>Recovered</th>
<th width="30">New<br/>Recovered</th>
<th width="30">Active<br/>Cases</th>
<th width="30">Serious,<br/>Critical</th>
<th width="30">Tot Cases/<br/>1M pop</th>
<th width="30">Deaths/<br/>1M pop</th>
<th width="30">Total<br/>Tests</th>
<th width="30">Tests/<br/>
<nobr>1M pop</nobr>
</th>
<th width="30">Population</th>
<th style="display:none" width="30">Continent</th>
<th width="30">1 Case<br/>every X ppl</th><th width="30">1 Death<br/>every X ppl</th><th width="30">1 Test<br/>every X ppl</th>
<th width="30">New Cases/1M pop</th>
<th width="30">New Deaths/1M pop</th>
<th width
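
`soup.find` returns `None` when nothing matches, which only surfaces later as an `AttributeError` if the page layout changes. A small sketch of a guarded lookup, using a made-up HTML snippet and a hypothetical `find_table_by_id` helper rather than the live page:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real page, small enough to read at a glance.
sample_html = """
<table id="other_table"><tr><td>ignore me</td></tr></table>
<table id="main_table_countries_today"><tr><td>USA</td></tr></table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")


def find_table_by_id(soup, table_id):
    """Return the <table> with the given id, or raise if the page lacks it."""
    found = soup.find("table", id=table_id)  # same as attrs={"id": table_id}
    if found is None:
        raise ValueError(f"no <table> with id={table_id!r} in the document")
    return found


sample_table = find_table_by_id(sample_soup, "main_table_countries_today")
print(sample_table.td.get_text())
```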

## Getting text out of the extracted table
The tags td, tr and th represent table data cells, table rows and table header cells, respectively.
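
To make the roles of these tags concrete, here is a short sketch that walks a made-up two-column table (not data from the real page) row by row and collects the text of every header or data cell:

```python
from bs4 import BeautifulSoup

# Made-up two-column table, not data from the real page.
toy_html = """
<table>
  <tr><th>Country</th><th>Cases</th></tr>
  <tr><td>Atlantis</td><td>12</td></tr>
  <tr><td>Lilliput</td><td>7</td></tr>
</table>
"""
toy_table = BeautifulSoup(toy_html, "html.parser").find("table")

toy_rows = []
for tr in toy_table.find_all("tr"):          # one <tr> per table row
    cells = tr.find_all(["th", "td"])        # header cells or data cells
    toy_rows.append([cell.get_text(strip=True) for cell in cells])

print(toy_rows)
# [['Country', 'Cases'], ['Atlantis', '12'], ['Lilliput', '7']]
```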

In [43]:
rows = table.find_all("tr", attrs={"style": ""})
type(rows)

bs4.element.ResultSet
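
Assuming the `attrs={"style": ""}` filter is meant to keep only rows without an inline style (for example, to skip rows hidden with `display: none`), the same selection can be written as an explicit check. This is a sketch on made-up rows, and the stated intent is an assumption, not something the notebook confirms:

```python
from bs4 import BeautifulSoup

# Made-up rows: the second one is hidden with an inline style.
style_html = """
<table>
  <tr><td>visible row</td></tr>
  <tr style="display: none"><td>hidden row</td></tr>
  <tr><td>another visible row</td></tr>
</table>
"""
style_table = BeautifulSoup(style_html, "html.parser").find("table")

# Keep a row only if it has no style attribute or an empty one.
unstyled_rows = [tr for tr in style_table.find_all("tr") if not tr.get("style")]

print([tr.get_text(strip=True) for tr in unstyled_rows])
```

Spelling the condition out this way makes it obvious which rows are dropped and why.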

In [117]:
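data = []
for i, item in enumerate(rows):
    # Split the row text into one entry per line
    cells = item.text.strip().split("\n")
    if i == 0:
        # Header row: keep the first 22 entries
        data.append(cells[:22])
    else:
        # Data rows: keep the first 21 entries
        data.append(cells[:21])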
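
Splitting `item.text` on newlines depends on how the page's HTML happens to be laid out, and `.text` drops `<br/>` tags without leaving a space, which is probably why header entries such as 'TotalCases' and a separate 'Tests/' and '1M pop' appear. A possible alternative, shown here on a made-up header row, is to read each cell on its own and pass a separator to `get_text`:

```python
from bs4 import BeautifulSoup

# Made-up header row that mimics the <br/> line breaks of the real table.
header_html = (
    "<table><tr>"
    "<th>Total<br/>Cases</th>"
    "<th>Tests/<br/><nobr>1M pop</nobr></th>"
    "</tr></table>"
)
header_row = BeautifulSoup(header_html, "html.parser").find("tr")

# .text glues the pieces of a cell together with nothing in between.
glued = [th.text for th in header_row.find_all("th")]

# get_text(" ", strip=True) joins the pieces with a space and trims each one.
spaced = [th.get_text(" ", strip=True) for th in header_row.find_all("th")]

print(glued)
print(spaced)
```

Reading cell by cell also yields exactly one entry per column, so header and data rows would not need different slice lengths.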

In [118]:
data

[['#',
  'Country,Other',
  'TotalCases',
  'NewCases',
  'TotalDeaths',
  'NewDeaths',
  'TotalRecovered',
  'NewRecovered',
  'ActiveCases',
  'Serious,Critical',
  'Tot\xa0Cases/1M pop',
  'Deaths/1M pop',
  'TotalTests',
  'Tests/',
  '1M pop',
  '',
  'Population',
  'Continent',
  '1 Caseevery X ppl1 Deathevery X ppl1 Testevery X ppl',
  'New Cases/1M pop',
  'New Deaths/1M pop',
  'Active Cases/1M pop'],
 ['World',
  '298,409,911',
  '+259,043',
  '5,484,286',
  '+2,233',
  '256,929,159',
  '+133,960',
  '35,996,466',
  '91,893',
  '38,283',
  '703.6',
  '',
  '',
  '',
  'All'],
 ['1',
  'USA',
  '58,805,186',
  '',
  '853,612 ',
  '',
  '41,999,896',
  '',
  '15,951,678',
  '20,938',
  '176,098',
  '2,556',
  '826,586,018',
  '2,475,292',
  '333,934,783 ',
  'North America',
  '63910',
  '',
  '',
  '47,769'],
 ['2',
  'India',
  '35,109,286',
  '',
  '482,876 ',
  '',
  '34,342,255',
  '',
  '284,155',
  '8,944',
  '25,069',
  '345',
  '685,305,751',
  '489,331',
  '1,400,494

## Converting to a Dask dataframe
The next step is to convert the list into a Dask dataframe for data manipulation and cleaning. As pointed out earlier, the data grows daily, so it is advisable to use a Dask dataframe, which splits the data into partitions and can handle datasets that do not fit in memory.

In [120]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import dask.dataframe as dd

# Build a pandas DataFrame: the first row of data is the header, the rest are records
dt = pd.DataFrame(data[1:], columns=data[0][:20])
# Convert it to a Dask dataframe with a single partition
df = dd.from_pandas(dt, npartitions=1)
#pd.set_option("display.max_rows", None, "display.max_columns", None)


In [121]:
df.head(10)

Unnamed: 0,#,"Country,Other",TotalCases,NewCases,TotalDeaths,NewDeaths,TotalRecovered,NewRecovered,ActiveCases,"Serious,Critical",Tot Cases/1M pop,Deaths/1M pop,TotalTests,Tests/,1M pop,Unnamed: 16,Population,Continent,1 Caseevery X ppl1 Deathevery X ppl1 Testevery X ppl,New Cases/1M pop
0,World,298409911,259043,5484286.0,2233,256929159.0,133960,35996466.0,91893,38283,703.6,,,,All,,,,,
1,1,USA,58805186,,853612,,41999896,,15951678,20938,176098.0,2556.0,826586018.0,2475292.0,333934783,North America,63910.0,,,47769.0
2,2,India,35109286,,482876,,34342255,,284155,8944,25069.0,345.0,685305751.0,489331.0,1400494834,Asia,4029002.0,,,203.0
3,3,Brazil,22351104,,619559,,21567845,,163700,8318,104035.0,2884.0,63776166.0,296852.0,214841765,South America,103473.0,,,762.0
4,4,UK,13835334,,149284,,10567672,,3118378,911,202198.0,2182.0,414403831.0,6056346.0,68424733,Europe,54580.0,,,45574.0
5,5,France,10921757,,124809,,8335903,,2461045,3333,166765.0,1906.0,188795159.0,2882732.0,65491742,Europe,65250.0,,,37578.0
6,6,Russia,10601300,15316.0,313817,802.0,9623677,22949.0,663806,2300,72597.0,2149.0,242300000.0,1659259.0,146029016,Europe,144651.0,105.0,5.0,4546.0
7,7,Turkey,9718861,,83075,,9192167,,443619,1128,113393.0,969.0,120595094.0,1407020.0,85709562,Asia,910321.0,,,5176.0
8,8,Germany,7342216,,113902,,6626500,44700.0,601814,4636,87211.0,1353.0,89622218.0,1064540.0,84188702,Europe,117391.0,,,7148.0
9,9,Spain,6922466,,89837,,5124221,,1708408,2005,147972.0,1920.0,66213858.0,1415366.0,46782142,Europe,75211.0,,,36518.0
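
The preview above holds plain numbers, while the scraped cells are strings such as '58,805,186' and '+259,043'. This section does not show how that conversion was done, so the following is only one way it could look in pandas, with made-up rows and a hypothetical `to_number` helper:

```python
import pandas as pd

# Made-up rows in the same string format as the scraped cells.
raw = pd.DataFrame(
    {
        "Country": ["Atlantis", "Lilliput", "Narnia"],
        "TotalCases": ["58,805,186", "35,109,286 ", ""],
        "NewCases": ["+259,043", "", "+12"],
    }
)


def to_number(column):
    """Strip thousands separators, plus signs and spaces; blanks become NaN."""
    cleaned = (
        column.str.replace(",", "", regex=False)
        .str.replace("+", "", regex=False)
        .str.strip()
    )
    return pd.to_numeric(cleaned, errors="coerce")


for name in ["TotalCases", "NewCases"]:
    raw[name] = to_number(raw[name])

print(raw.dtypes)
```

Columns that contain blanks come out as floating-point values, because NaN is used to mark the missing entries.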


In [116]:
# Write one CSV file per partition; the * is replaced by the partition number
df.to_csv('../Extracted_data/data-*.csv')


['c:/Gateway/Data Analyst/Git/Extracted_data/data-0.csv']